climateprediction.net home page
Disappearing run. Diagnostics?

old_user1204

Joined: 25 Aug 04
Posts: 5
Credit: 103,128
RAC: 0
Message 1794 - Posted: 26 Aug 2004, 23:45:58 UTC

<P>I was running 2 runs on <B>boinc_4.05_powerpc-apple-darwin</B> on a PowerMac dual 1.25 GHz G4, 1.25 GB RAM MacOS 10.3.5. Runs started around 2004-08-25 18:54:30 for <I>00c3_300025420_0 using hadsm3 version 4.03</I> and <I>00c4_300025421_0 using hadsm3 version 4.03</I>.</P>
<P>Checking after coming home today (8/26), it appears that run 00c4 has disappeared. The thing is, I don't know what diagnostics to look at to see <B>why</B> it disappeared. The log file for that model shows...</P>
<PRE>
00c4_300025421 - PH 1 TS 007633 - 10/05/1811 00:30 - H:M:S=0011:53:10 AVG= 5.61 DLT= 2.69
00c4_300025421 - PH 1 TS 007634 - 10/05/1811 01:00 - H:M:S=0011:53:23 AVG= 5.61 DLT=13.27
00c4_300025421 - PH 1 TS 007635 - 10/05/1811 01:30 - H:M:S=0011:53:25 AVG= 5.61 DLT= 1.94
00c4_300025421 - PH 1 TS 007636 - 10/05/1811 02:00 - H:M:S=0011:53:27 AVG= 5.61 DLT= 1.95
</PRE>
<P>Then nothing else: no messages or errors, the run just stopped reporting. Kicking in viz on that run shows a blue planet. Checking my account on the website shows no status for the 00c4 run. So, 2 questions: 1) How do I determine whether this was just a "normal" failed model or something else (like a bug)? That is, how do I diagnose this? 2) How do I get boinc to report home to y'all about the 00c4 status, or will it just do that on its own in time and then download a new model?
</P>
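One quick way to see where each run stopped reporting is to pull the last timestep per run out of the log. This is only a minimal sketch, assuming the progress lines keep the layout quoted above (run name first, then `PH`, the phase number, `TS`, and the timestep). The function name `last_timesteps` is made up for illustration, and since the log's file name is not given here, the lines are read from standard input:

```shell
# Hypothetical helper: print the last timestep seen for each run.
# Reads progress lines in the format quoted above from standard input.
last_timesteps() {
  awk '
    # Field 1 is the run name, field 6 is the timestep counter after "TS".
    $3 == "PH" && $5 == "TS" { last[$1] = $6 }
    END { for (run in last) print run, last[run] }
  ' | sort
}
```

Taking two snapshots a few minutes apart and comparing them should show which run has stalled, since a healthy run's timestep keeps climbing.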
<P>Thanks for your time.<br>
BCNU,<br>
Vance</P>
old_user1
Joined: 5 Aug 04
Posts: 907
Credit: 299,864
RAC: 0
Message 1869 - Posted: 27 Aug 2004, 10:30:27 UTC

Hi, with ./viz you can pass a command-line argument to attach to a specific model, e.g.:

./viz 00c4_300025421

If it's blue, it could mean the model crashed. It's probably best to try a Ctrl+C, then run ./boinc* again and see if they both pop up.

Also, you could do a

ps aux|grep hads

and see if everything is running (for two CPUs there should be a hadsm3_ and a hadsm3um_ process twice, i.e. one pair per run).
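The same check can be scripted instead of counted by eye. The following is a rough sketch, not part of BOINC or the CPDN client: `summarize_hadsm3` is a made-up name, and it assumes `ps aux` prints the process state in column 8 and the command in column 11, which breaks if the start-time column contains a space on your system.

```shell
# Hypothetical helper: summarize hadsm3 processes from `ps aux` output on stdin.
# Assumes column 8 is the process state and column 11 is the command name.
summarize_hadsm3() {
  awk '
    $8 ~ /^Z/ && /hadsm3/ { zombies++;  next }  # defunct (zombie) entry
    $11 ~ /^hadsm3um_/    { models++;   next }  # the climate model itself
    $11 ~ /^hadsm3_/      { monitors++ }        # the monitor program
    END {
      printf "monitors=%d models=%d zombies=%d\n", monitors, models, zombies
    }
  '
}
```

Run it as `ps aux | summarize_hadsm3`. With two runs on two CPUs you would expect two monitors, two models and no zombies; anything else suggests a run has died. Matching on the command column also keeps the `grep` process itself out of the count.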


old_user849

Joined: 14 Aug 04
Posts: 37
Credit: 276,676
RAC: 0
Message 1870 - Posted: 27 Aug 2004, 10:41:41 UTC
Last modified: 27 Aug 2004, 10:50:27 UTC

Hi,

In Finder, open the folder you are using to run the project. You should see a folder called 'projects'; open it. Then open the folder 'climateprediction.net' and you will see a separate folder for each of the runs. Your lost run should be there. Open it and look for the file stderr_um.txt. If the work unit (WU) failed, there should be a message about it there.

If there is, you can post it here if it's not too big, or submit it to CPDN.
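If you prefer Terminal to Finder, the same look-through can be sketched in a few lines of shell. This is only an illustration: `check_run_stderr` is a made-up name, and it assumes each run folder sits directly under the project folder you pass in (for example the 'projects/climateprediction.net' folder described above).

```shell
# Hypothetical helper: report whether each run's stderr_um.txt has content.
# $1 is the project folder that contains one sub-folder per run.
check_run_stderr() {
  project_dir=$1
  for f in "$project_dir"/*/stderr_um.txt; do
    [ -e "$f" ] || continue          # skip if the glob matched nothing
    if [ -s "$f" ]; then             # -s is true for a non-empty file
      echo "non-empty: $f"
    else
      echo "empty: $f"
    fi
  done
}
```

A non-empty file is worth reading or posting. An empty one does not prove the run is healthy, so it is worth checking the process list as well.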

K.

@carl: you got in just ahead of me (LoL). In OS X 10.3.5 you can go to Utilities/Activity Monitor, where you can easily see everything that is running, including of course:

443 hadsm3um_4.02_po chuggybus 93.50 1 45.80 MB 91.16 MB
K.
old_user1204

Joined: 25 Aug 04
Posts: 5
Credit: 103,128
RAC: 0
Message 1884 - Posted: 27 Aug 2004, 14:15:25 UTC

<P>Thanks for your help! The <I>stderr_um.txt</I> and <I>stdout_um.txt</I> files for the <I>00c4</I> run were both zero length. The ps listing showed:</P>
<PRE>
G4 /Applications/BOINC-CPDN/projects/climateprediction.net/00c4_300025421 $ ps aux | grep had
strick 925 96.2 3.4 93344 45132 p1 RN Wed06PM 2220:52.21 hadsm3um_4.03_powerpc-apple-darwin 24090 912
strick 912 0.0 0.1 30216 1168 p1 SN Wed06PM 0:18.29 hadsm3_4.03_powerpc-apple-darwin 00c3_300025420
strick 913 0.0 0.1 30216 1164 p1 SN Wed06PM 0:11.18 hadsm3_4.03_powerpc-apple-darwin 00c4_300025421
strick 924 0.0 0.0 0 0 p1 ZN 31Dec69 0:00.00 (hadsm3um_4.03_po)
strick 2434 0.0 0.0 18172 340 std S+ 8:37AM 0:00.01 grep had
</PRE>
<P>So it appears that the <I>00c4</I> run did indeed die off, probably the zombie process pid 924 above. Which makes me wonder why pid 913, which was probably the parent process, didn't catch the child process exit status? If it <I>wait()</I>'ed appropriately the child should have been cleaned up. Odd.</P>
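To illustrate the point about waiting on the child, here is a minimal shell sketch. It is not how the hadsm3 monitor is implemented (that source is not shown in this thread); `reap_child` is a made-up name and a subshell stands in for the model process. The parent collects the child's exit status with `wait`, which is also what removes the zombie entry from the process table.

```shell
# Hypothetical stand-in for the monitor/model pair: the child exits with the
# given status and the parent collects it with wait, which also reaps the child.
reap_child() {
  ( exit "$1" ) &                 # child process; stands in for the model
  child=$!
  status=0
  wait "$child" || status=$?      # collect the child's exit status
  printf '%d (0x%x)\n' "$status" "$status"
}

reap_child 251   # prints: 251 (0xfb)
```

A parent that never calls `wait` leaves the dead child listed with a `Z` state, as with pid 924 above.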
<P>Anyway, a Ctrl-C shut everything down cleanly, and on restart the log showed that, yes indeed, the <I>00c4</I> model had crashed. Then it uploaded the results to y'all and downloaded a new run.</P>
<PRE>
Starting model ID 00c4_300025421 Phase 1
Waiting for model startup, this may take a minute...
Stack size=48.00 MB
00c4_300025421 - PH 1 TS 007633 - 00/00/0000 00:00 - H:M:S=0011:53:10 AVG= 5.61 DLT= 0.00
Model crashed...retrying...restart level 2
Preparing for restart...
Rewinding a model-year...
Error: Restart files for dataout/restart.year not found
Giving up, this result exceeded crash count for available restart files.
... entries about zipping up files...
2004-08-27 08:47:40 [climateprediction.net] Unrecoverable error for result 00c4_300025421_0 (process exited with code 251 (0xfb))
2004-08-27 08:47:40 [climateprediction.net] Unrecoverable error for result 00c4_300025421_0 (process exited with code 251 (0xfb))
2004-08-27 08:47:40 [climateprediction.net] Computation for result 00c4_300025421 finished
2004-08-27 08:47:40 [climateprediction.net] Started upload of 00c4_300025421_0_1.zip
...
</PRE>
<P>So that's a wrap. Thank you very much for helping me diagnose this. Things are on-track and crunching away again.</P>
BCNU,<BR>
Vance
old_user1
Joined: 5 Aug 04
Posts: 907
Credit: 299,864
RAC: 0
Message 1893 - Posted: 27 Aug 2004, 16:02:27 UTC - in response to Message 1884.  

I looked up the error output from the upload server:

LOOKUP TABLE
19328 64-bit words long
Non constant polar row found in dump : field 1
Dump must be reconfigured
Model run aborted
IN U_MODEL1_WIN
Starting hadsm3 model for ID# 24170...
Changing to slots directory /Applications/BOINC-CPDN/slots/1

Model abandoned: UM has aborted the model
Detaching shared memory, closing model...


So it's definitely an odd crash, probably caused by a parameter for this run that made the climate model go unstable. The "monitor" program (hadsm3_) usually detects when the model (hadsm3um_) has crashed, but somehow it wasn't detected the first time.




©2024 climateprediction.net