happened again. run ADL4.1+Mx7.1 problems

Issues related to runtime execution of algorithms in ADL
kbisanz
Posts: 280
Joined: Wed Jan 05, 2011 7:02 pm
Location: Omaha NE

Re: happened again. run ADL4.1+Mx7.1 problems

Post by kbisanz »

Please verify that you can create a core dump. You can use this program. My compiler is named g++44 (for g++ version 4.4). Yours may be named just "g++".

Code: Select all

~/test > cat causeCoreDump.cpp 
#include <iostream>

using namespace std;

int main()
{
   int array[100];
   int x = 10000;
   array[x] = 42;    // Over index the array.

   return 0;
}
~/test > g++44 -g -m64 causeCoreDump.cpp -o causeCoreDump.exe
~/test > ./causeCoreDump.exe 
Segmentation fault (core dumped)
~/test > gdb causeCoreDump.exe core.5815 
GNU gdb (GDB) 7.2
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /npd/kbisanz/test/causeCoreDump.exe...done.

warning: core file may not match specified executable file.
[New Thread 5815]
Reading symbols from /usr/lib64/libstdc++.so.6...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libstdc++.so.6
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Core was generated by `./causeCoreDump.exe'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000000000400673 in main () at causeCoreDump.cpp:9
9          array[x] = 42;    // Over index the array.
(gdb) where
#0  0x0000000000400673 in main () at causeCoreDump.cpp:9
(gdb) print x
$1 = 10000
(gdb)
Additionally, please upload the problem RDR to ftp://ftp.ssec.wisc.edu/pub/incoming/

If you get a core from the algorithm, you can also upload that.
Kevin Bisanz
Raytheon Company

Post by kbisanz »

While looking through your log file, I noticed a few things that are unrelated to your current issue but may give you a slight performance speedup.

There is this message:

Code: Select all

2013/08/05 19:29:29.798.214 (18462.140448062871424): DBG_LOW DmApiClient.cpp|515| tid-140448062871424 Found 3436 .asc files in path=/data/data020/pub/NPP_DATA/TILES/NovGroundCCR692Tiles/GridIP-VIIRS-Snow-Ice-Cover-Rolling-Tile_Nov_2012/withMetadata
If you create a .jasc in that directory using $ADL_HOME/script/createJascFiles.sh, restoring files may go slightly faster, because 1 .jasc would be read instead of 3436 .asc files. The 1 .jasc and the 3436 .asc files contain the same data overall, but reducing the number of file reads helps. Of course, this is only a good idea if the contents of that directory are not changing.

Additionally, there are a number of messages such as

Code: Select all

The UR 4d0fb478-e5712-0a4f180b-576cddc5 already exists in inventory but is identical to the existing copy. This UR was not added.
The above message is not a problem, but it indicates that the file 4d0fb478-e5712-0a4f180b-576cddc5.asc is present in multiple directories. It looks like there are about 5000 messages of that type, so each time the executable is run, about 5000 extra file reads take place. Removing the duplicate files would reduce the number of files read each time.
Kevin Bisanz
Raytheon Company
wzchen
Posts: 89
Joined: Wed Jul 18, 2012 3:01 pm

Post by wzchen »

Hi Kevin,

I did create a .jasc file, which is supposed to include all of the tile files. However, when I ran ADL, some granules still complained that they could not find the snow cover rolling tiles. After I added the snow cover rolling tiles, some of those granules went through, but not all of them. I am not sure why this happened, so I just keep the extra snow cover rolling tiles folder in the input directory.

BTW, I am having trouble finding the core files. I even searched my entire home directory, and they are not there either.

Code: Select all

[10:31 weizhong@rhw3022 temp]$ ulimit -a | grep core
core file size          (blocks, -c) unlimited
[10:32 weizhong@rhw3022 temp]$ ./causeCoreDump.exe
Segmentation fault (core dumped)
[10:32 weizhong@rhw3022 temp]$ ls core*
ls: cannot access core*: No such file or directory
Thanks,

Weizhong

Post by kbisanz »

That is very strange. If you do a "man core" you can find out some info on core dumps.

How much space is left in the current directory? You can do a "df -m ." to determine the free space (in megabytes) of the current directory. Core files for algorithms could be very big (up to 6-8 gig for VIIRS SDR), but the sample program I posted yesterday only had a core file size of around 290KB, which is tiny.

What distribution of Linux are you using? CentOS? Can you run "uname -a"? Are you using the virtual machine?

As a last resort, the man page says that the naming is controlled by the file /proc/sys/kernel/core_pattern. However, I doubt anyone changed that.
Kevin Bisanz
Raytheon Company

Post by wzchen »

I am using Red Hat RHEL 6.4.
My home directory has a limit of only 40G, but I am using only about 5.5G. The machine on which I was running ADL has 1.7T left.
[15:43 weizhong@rhw3022 int_chainrun_v4.0]$ df -m -h .
Filesystem        Type  Size  Used Avail Use% Mounted on
rhw3022:/data020  nfs4   20T   19T  1.7T  92% /data/data020
The core file settings look normal.
[15:51 weizhong@rhw3022 temp]$ cat /etc/redhat-release
Red Hat Enterprise Linux Workstation release 6.4 (Santiago)

[15:40 weizhong@rhw3022 temp]$ uname -a
Linux rhw3022.star1.nesdis.noaa.gov 2.6.32-358.11.1.el6.x86_64 #1 SMP Wed May 15 10:48:38 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

[15:40 weizhong@rhw3022 temp]$ cat /proc/sys/kernel/core_uses_pid
1
[15:40 weizhong@rhw3022 temp]$ cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e

[15:40 weizhong@rhw3022 temp]$ causeCoreDump.exe
Segmentation fault (core dumped)
[15:40 weizhong@rhw3022 temp]$ ls *core*
ls: cannot access *core*: No such file or directory

[15:41 weizhong@rhw3022 temp]$ ...
[15:41 weizhong@rhw3022 home001]$ du -hs weizhong
5.5G weizhong

[15:48 weizhong@rhw3022 temp]$ ulimit -a | grep core
core file size (blocks, -c) unlimited

Post by kbisanz »

This is still very strange.

I'm on a CentOS 6.4 system (should be same as RHEL 6.4) with apparently the same settings:
~ > cat /etc/redhat-release
CentOS release 6.4 (Final)
~ > cat /etc/centos-release
CentOS release 6.4 (Final)
~ > cat /proc/sys/kernel/core_uses_pid
1
~ > cat /proc/sys/kernel/core_pattern
|/usr/libexec/abrt-hook-ccpp %s %c %p %u %g %t e
~ > g++ -g -m64 causeCoreDump.cpp
~ > ls core*
ls: cannot access core*: No such file or directory
~ > g++ -g -m64 causeCoreDump.cpp
~ > ./a.out
Segmentation fault (core dumped)
~ > ls core*
core.28013
~ >
Did you set this computer up yourself? Is it only used for ADL or is/was it used for another purpose? If someone set it up for you, perhaps they changed some setting to change the location of core files? You might need to ask your system administrator.

If you have root access, you can force the "locate" database to be updated:

Code: Select all

~ > sudo updatedb
~ > locate core*
/home/adluser/core.28013
~ >
If you do not have root access, maybe the database is updated each night? If you are lucky it will update tonight and when you come in Wednesday you can do a "locate core*" and it will tell you where the core files are hiding. However, you don't seem to have been lucky so far. :(

After reading this: http://stackoverflow.com/questions/2065 ... -directory
Is your core file in either of these directories?
/var/cache/abrt/
/var/spool/abrt/

Have you tried doing this from the root directory: find / -name "*core*" ?
Kevin Bisanz
Raytheon Company

Post by kbisanz »

In addition to my previous post, have you looked in /etc/abrt/abrt.conf? If "/usr/libexec/abrt-hook-ccpp" is handling the cores, maybe there are clues in abrt.conf about where they are going?
Kevin Bisanz
Raytheon Company

Post by wzchen »

Our SA said he has already reset the core dump location to the current directory. I did see a core file after running your small program. However, I still can't find one under my ADL directory or in the location of the scripts.

To make sure the core file size is unlimited, I ran "ulimit -c unlimited" in the same terminal before running my Perl script. (I wrapped the TK chain runner in a Perl script.) However, it exited with the same problem. BTW, if the program dumps a core file, should I see a message like "Segmentation fault (core dumped)" on the screen?

Thanks.

Post by kbisanz »

I believe the core file should be in the directory from which the command line chain runner was executed. Are you doing any "cd" or "chdir" commands in your perl wrapper? See below for an example. I have replaced some lines with "...".

I modified $ADL_HOME/SDR/VIIRS/Controller/src/ProSdrViirsControllerMain.cpp such that these lines were at the top of main:

Code: Select all

   int array[100];
   int x = 10000;
   array[x] = 42;    // Over index the array.

Code: Select all

~/ADL4.2/ADL > ls core*
ls: cannot access core*: No such file or directory
~/ADL4.2/ADL > script/runAdlChainRunner.pl ProSdrViirsController.exe NPP001212025477
...
Creating ProSdrViirsController.exe TK file for granule ID NPP001212025477
/home/adluser/ADL4.2/ADL/log/ProSdrViirsController_NPP001212025477.xml
Running ProSdrViirsController.exe for granule ID NPP001212025477
/home/adluser/ADL4.2/ADL/bin/ProSdrViirsController.exe
Algorithm Run Failure - ProSdrViirsController, NPP001212025477 Failed During Execution!
Algorithm Run Failure - ProSdrViirsController, NPP001212025477 Failed to Produce an Output Product!
Algorithm Run Failure - ProSdrViirsController, NPP001212025477 Algorithm Stopped Early or Core Dumped!

-----------------Failure Details Start---------------------
Did Not Find an Output Product For: VIIRS-ANC-Dig-Bath-Data-Mod-Gran
...
Did Not Find an Output Product For: VIIRS-DNB-GRC

-----------------Failure Details End---------------------
Error: An Algorithm in the Chain has Failed to Complete Successfully!
Chain Run did not Completed Successfully!
~/ADL4.2/ADL > ls core*
core.24090
~/ADL4.2/ADL > ls -l core*
-rw-------. 1 adluser adluser 1609805824 Aug  7 14:34 core.24090
~/ADL4.2/ADL >
Now that you know you can write a core file using a sample program, I would try some sort of global find. Or talk to your system admin.
Kevin Bisanz
Raytheon Company

Post by wzchen »

I finally found the core file. BTW, what are the user name and password for the FTP site? I tried my username/password for this forum and also the password "anonymous". Neither worked.
Thanks,
[10:54 weizhong@rhw3022 d20130704_t1649219_e1650473_b08730]$ gdb /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/bin/ProSdrViirsController.exe core.25570

(gdb) where
#0 0x00007f8d42b82374 in ProSdrViirsGeo::calcModFromImg(viirsSdrGeoPtrs*, ProSdrCmnGeo*) ()
from /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/lib/libProSdrViirsGeo.so
#1 0x00007f8d42b56525 in ProSdrViirsGeo::geolocateGranule(viirsSdrGeoPtrs*) ()
from /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/lib/libProSdrViirsGeo.so
#2 0x00007f8d42b3b472 in ProSdrViirsGeo::geolocate() ()
from /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/lib/libProSdrViirsGeo.so
#3 0x00007f8d42b40ee0 in ProSdrViirsGeo::doProcessing() ()
from /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/lib/libProSdrViirsGeo.so
#4 0x00007f8d4829cbdc in ProCmnAlgorithm::doPstage() ()
from /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/lib/libProCmnIPO.so
#5 0x00007f8d482accfb in ProCmnAlgorithm::doIpoModel() ()
from /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/lib/libProCmnIPO.so
#6 0x00007f8d4828b8a4 in ProCmnAlgorithm::runAlgorithmImpl(InfTk_TaskData const&, bool) ()
from /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/lib/libProCmnIPO.so
#7 0x00007f8d4827e9c9 in ProCmnAlgorithmBase::runAlgorithm(InfTk_TaskData const&, bool) ()
from /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/lib/libProCmnIPO.so
#8 0x00007f8d482f326a in ProCmnControllerAlgorithm::runAlgorithmChain(InfTk_TaskData const&, bool) ()
from /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/lib/libProCmnIPO.so
#9 0x00007f8d48306252 in ProCmnControllerBase::runAlgorithmImpl(InfTk_TaskData const&, bool) ()
from /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/lib/libProCmnIPO.so
#10 0x00007f8d4827e9c9 in ProCmnAlgorithmBase::runAlgorithm(InfTk_TaskData const&, bool) ()
from /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/lib/libProCmnIPO.so
#11 0x00007f8d4b386dc4 in ProCmnAppl::run() ()
from /data/data020/weizhong/ADL4.1/CSPP/ADL4.1_Mx7.1/lib/libAdlScienceAppl.so
#12 0x0000000000411673 in main ()
(gdb)