Slowness in ADL_Unpacker due to number of files in outdir

Issues related to runtime execution of algorithms in ADL
houchin
Posts: 128
Joined: Mon Jan 10, 2011 6:20 am

Slowness in ADL_Unpacker due to number of files in outdir

Post by houchin »

Hi guys,

We are noticing an issue with the ADL_Unpacker related to the number of files in the output directory; we've noticed this on two specific occasions where we were unpacking a large number of VIIRS-OBC-IP files. The time it takes to unpack one file appears to be proportional to the number of files already in the output directory.

Right now, I'm trying to unpack about 14000 files. When I started the unpacking, with an empty output directory, it was taking about 1 second apiece. After about 6000 files, it was taking several seconds for each file.

Is there something in the ADL_Unpacker that is causing this slowdown? Best I can tell it is not creating a .jasc file, and I see no need for it to catalog the existing output directory.
Scott Houchin, Senior Engineering Specialist, The Aerospace Corporation
15049 Conference Center Dr CH3/310, Chantilly, VA 20151; 571-307-3914; scott.houchin@aero.org
kbisanz
Posts: 280
Joined: Wed Jan 05, 2011 7:02 pm
Location: Omaha NE

Re: Slowness in ADL_Unpacker due to number of files in outdi

Post by kbisanz »

Are you calling the unpacker once for each file (i.e. in a shell or script loop) or are you providing multiple h5 files each time the unpacker is invoked?
Kevin Bisanz
Raytheon Company
houchin
Posts: 128
Joined: Mon Jan 10, 2011 6:20 am

Re: Slowness in ADL_Unpacker due to number of files in outdi

Post by houchin »

I'm calling it once for each file. It would not be possible to put all 14K files on the command line.

I am working on a script to run multiple instances of the unpacker on multiple threads, and will try to give some subset of the files to each run of the unpacker, but I still won't be able to give it everything at once.
kbisanz
Posts: 280
Joined: Wed Jan 05, 2011 7:02 pm
Location: Omaha NE

Re: Slowness in ADL_Unpacker due to number of files in outdi

Post by kbisanz »

Each time you start the unpacker, it has to initialize its DMS client (because it wants to unpack the file into a DMS directory). So at first it has only a few files to read. However, as you unpack more and more files, the number of files it has to read into DMS inventory before it can unpack a file grows. I'm pretty confident that is where your slowdown is coming from. This initialization happens once each startup, so minimizing the number of times the unpacker is executed should be a goal.
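A rough cost model makes the effect concrete. It assumes, as a simplification that nobody in this thread measured, that each startup reads one inventory entry per file already in the output directory and that each input adds one output file. With N inputs, the total startup work for one file per invocation versus batches of B files per invocation is then roughly:

```latex
\[
T_{\text{one per run}} \;\propto\; \sum_{k=0}^{N-1} k = \frac{N(N-1)}{2},
\qquad
T_{\text{batches of } B} \;\propto\; \sum_{j=0}^{N/B-1} jB = \frac{N}{2}\left(\frac{N}{B}-1\right).
\]
```

For N = 14000 this gives about 97,993,000 entry reads when the unpacker is started once per file, against 91,000 with B = 1000. That is consistent with a per-file time that grows in proportion to the directory size.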

You probably already realize this, but I suspect a script to run multiple unpackers will only be successful if you use different DMS directories for each of your unpackers. If you use the same directory and have multiple unpackers writing to it, you risk having a partially written .asc file that another unpacker will try to read. The .asc files are probably small enough that this is unlikely to happen, but it seems possible with 14000 files. Even if partially written .asc files aren't a problem, you'll still have the issue of the number of files in a directory.

It would be interesting to try something with the xargs command. I believe xargs tries to fit as many arguments on the command line as possible, but is smart enough to not go over a limit. Something like this might work:

Code:

$> cd <directory with h5 files>
$> ls | grep h5$ | xargs $ADL_HOME/bin/ADL_Unpacker.exe

You are probably correct that for the ADL_Unpacker there is not a super critical reason to read the contents of the output directory. However it currently uses the same initialize routine as everything else which does need to catalog directory contents.

I am interested in which solution (if any) works out for you.
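A slightly more defensive variant of the xargs idea is sketched below. It is only a sketch: the `UNPACKER` and `BATCH` variables, the function name, and the batch size of 500 are assumptions for illustration, and how many files the unpacker handles comfortably in one invocation has not been confirmed in this thread. `find -print0` with `xargs -0` keeps unusual file names intact, and `-n` caps the number of files per invocation.

```shell
# Sketch: feed .h5 files to the unpacker in batches so that it starts up
# (and catalogs the output directory) far fewer times.
# UNPACKER and BATCH are assumptions; adjust them for your installation.
UNPACKER="${UNPACKER:-$ADL_HOME/bin/ADL_Unpacker.exe}"
BATCH="${BATCH:-500}"

unpack_in_batches() {
    local h5_dir="$1"
    # -print0 / -0 keep file names with spaces intact.
    # -n caps how many files go to each invocation.
    # -r skips the run entirely when no files are found.
    find "$h5_dir" -maxdepth 1 -type f -name '*.h5' -print0 |
        xargs -0 -r -n "$BATCH" "$UNPACKER"
}
```

Usage would be something like `unpack_in_batches /path/to/h5/files`. A smaller `BATCH` also limits how much work is lost if one invocation stops early.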
houchin
Posts: 128
Joined: Mon Jan 10, 2011 6:20 am

Re: Slowness in ADL_Unpacker due to number of files in outdi

Post by houchin »

kbisanz wrote: You probably already realize this, but I suspect a script to run multiple unpackers will only be successful if you use different DMS directories for each of your unpackers. If you use the same directory and have multiple unpackers writing to it, you risk having a partially written .asc file that another unpacker will try to read. The .asc files are probably small enough that this is unlikely to happen, but it seems possible with 14000 files. Even if partially written .asc files aren't a problem, you'll still have the issue of the number of files in a directory.

This I did not realize. I would consider this a bug. The unpacker isn't using any data in the output directory, so it really shouldn't care what is there.

I had forgotten that the ADL_Unpacker would take more than one file, so I was actually using xargs with "-L 1", as I am generating the list of files to be unpacked dynamically from gdata and piping it in. Certainly I could take that out. For the moment, I've just got another window open and I'm moving files out of the output directory into a temporary directory every so often.

My longer-term solution is to create a Perl script that uses multiple threads, with some precalculated number of files to be processed at once. What I can do is create temp output directories for each run, have the ADL_Unpacker write into that temp directory, and then at the end of each chunk sweep the files into the user-requested output directory.
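The chunk, temp directory, and sweep plan could look roughly like the sketch below, written in shell rather than Perl to match the earlier snippet. Everything named here is hypothetical: `run_unpacker` is a placeholder hook, because this thread does not show how the ADL_Unpacker's output directory is selected, and the chunk size and job count are arbitrary defaults. The work directory is created next to the final directory so the sweep is a rename on the same filesystem.

```shell
# Sketch: unpack a file list in chunks, each chunk into its own temporary
# output directory, then sweep the results into the final directory.

# HYPOTHETICAL hook: run one unpacker invocation that writes into $1 and
# takes the remaining arguments as input files. Replace the body with the
# real invocation and output-directory configuration for your setup.
run_unpacker() {
    local outdir="$1"
    shift
    echo "would unpack $# file(s) into $outdir" >&2
}

unpack_chunked() {
    local list="$1" final_out="$2" chunk_size="${3:-500}" max_jobs="${4:-4}"
    local work chunk running=0
    mkdir -p "$final_out" || return 1
    # Sibling of the final directory, so the sweep stays on one filesystem.
    work=$(mktemp -d "${final_out%/}.work.XXXXXX") || return 1
    # One input path per line; each chunk file holds chunk_size lines.
    split -l "$chunk_size" "$list" "$work/chunk."

    for chunk in "$work"/chunk.*; do
        [ -e "$chunk" ] || continue    # empty list: the glob matched nothing
        (
            tmp_out=$(mktemp -d "$work/out.XXXXXX") || exit 1
            set --
            while IFS= read -r f; do set -- "$@" "$f"; done < "$chunk"
            run_unpacker "$tmp_out" "$@" || echo "chunk failed: $chunk" >&2
            # Sweep whatever this chunk produced into the requested directory.
            find "$tmp_out" -mindepth 1 -maxdepth 1 -exec mv {} "$final_out"/ \;
        ) &
        running=$((running + 1))
        if [ "$running" -ge "$max_jobs" ]; then
            wait                       # simple throttle: finish this group first
            running=0
        fi
    done
    wait
    rm -rf "$work"
}
```

A call such as `unpack_chunked files.txt /path/to/outdir 500 4` would process 500 files per invocation, at most four invocations at a time. Because each invocation starts with an empty temporary directory, its startup cost should not depend on how many files are already in the final directory.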
kbisanz
Posts: 280
Joined: Wed Jan 05, 2011 7:02 pm
Location: Omaha NE

Re: Slowness in ADL_Unpacker due to number of files in outdi

Post by kbisanz »

Your comment got me thinking that the partial write issue seemed familiar. The partial writing of a .asc file used to be an issue. It was fixed in the Mx8.1 release of ADL.

We now write to a temporary file and move it into place after the write is finished. The ADL DMS ignores the temporary files.
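For readers unfamiliar with the pattern, here is a minimal shell sketch of write-then-rename. It is not ADL's actual code, and the `.tmp.` prefix and function name are made up for illustration; the thread does not say how ADL names its temporary files. The key point is that the temporary file lives in the destination directory, so the final `mv` is a rename and readers see either no file or the complete file.

```shell
# Sketch of the write-then-rename pattern (not ADL's actual code).
# A reader that ignores the temporary name never sees a half-written file.
write_atomically() {
    local dest="$1" tmp
    # Create the temporary file in the destination directory so the final
    # mv is a rename on the same filesystem rather than a copy.
    tmp=$(mktemp "$(dirname "$dest")/.tmp.XXXXXX") || return 1
    if cat > "$tmp"; then
        mv -f "$tmp" "$dest"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

For example, `printf 'metadata\n' | write_atomically /path/to/outdir/example.asc` writes the content from standard input and only then makes it visible under its final name.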
houchin
Posts: 128
Joined: Mon Jan 10, 2011 6:20 am

Re: Slowness in ADL_Unpacker due to number of files in outdi

Post by houchin »

Hi Kevin,

One thing I'm finding in this dataset is that some bad copies of the files got into the dataset. When ADL_Unpacker.exe gets to those files, it does recognize them as bad. Unfortunately, it then just stops processing. It would be better if it just reported the error and continued on, either by default or triggered by a command line flag. Is this something that could be added?

The problem with these huge data sets is that not only do I have to restart the script, I have to edit the input to the script (the list of filenames) to eliminate all of the files that were unpacked successfully, before restarting it.
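Until something like that exists in the unpacker itself, a restartable wrapper can avoid the hand-editing. The sketch below rests on one unconfirmed assumption: that the unpacker exits with a non-zero status when it rejects a file. The function name, log file names, and `UNPACKER` variable are hypothetical. It runs one file per invocation so a failure can be attributed to a specific file, which means it pays the startup cost each time and is best pointed at a small or fresh output directory.

```shell
# Sketch: a restartable wrapper that records progress, so a bad file does
# not force hand-editing of the input list.
# ASSUMPTION: the unpacker exits non-zero when it rejects a file. This
# thread does not confirm that, so check it against your build first.
UNPACKER="${UNPACKER:-$ADL_HOME/bin/ADL_Unpacker.exe}"

unpack_resumable() {
    local list="$1" done_log="$2" failed_log="$3" f
    touch "$done_log" "$failed_log"
    while IFS= read -r f; do
        # Skip anything a previous run already unpacked (exact line match).
        if grep -Fxq -- "$f" "$done_log"; then
            continue
        fi
        # </dev/null keeps the unpacker from consuming the file list.
        if "$UNPACKER" "$f" </dev/null; then
            printf '%s\n' "$f" >> "$done_log"
        else
            printf '%s\n' "$f" >> "$failed_log"
            echo "unpack failed, continuing: $f" >&2
        fi
    done < "$list"
}
```

Re-running `unpack_resumable files.txt done.log failed.log` after an interruption picks up where the previous run stopped. Files listed in `failed.log` are retried on each run in this sketch; filtering against that log as well would skip them.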