Hidden model crashes?

KeeperC

Joined: 5 Aug 04
Posts: 66
Credit: 2,146,056
RAC: 0
Message 849 - Posted: 12 Aug 2004, 18:25:00 UTC

I started a model and ran it in parallel with classic cpdn (v3.0.01) on my stock dell laptop.

I set the preferences to run under Boinc only between 10pm and 7am.

When both were running simultaneously, each got around 47% CPU.

Every day, looking at the log file, I could see start and stop times reported as expected. The model appeared to trickle twice.

One day, I noticed that although the boinc gui reported that the model had been restarted, it was taking no CPU. The hadsm3* processes were not present in task manager. I let it run overnight anyway and then boinc gui reported that it was suspended again in the morning for time of day.

I observed the same behaviour the next night - the gui reported normal resume and suspend at 10pm and 7am, even though no cpu usage.

Concluding that the model had crashed and the boinc gui hadn't noticed, I restarted the computer. This had the effect of aborting the first model and downloading a new one.

1. Surely boinc should report when the model crashes.

2. I wouldn't expect a restart to cause a download of a new model.

3. I've no idea what caused the crash. Classic model continues to run without any problems.
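
The symptom described above (the GUI reports a resume but no hadsm3* process exists) can be checked from a script rather than by eye in Task Manager. The following is only a sketch, not anything shipped with BOINC: it assumes the CSV output of the Windows `tasklist /FO CSV /NH` command and an image name that begins with `hadsm3`, as mentioned in this post. The function names are made up for illustration.

```python
import csv
import io
import subprocess


def find_model_processes(tasklist_csv, prefix="hadsm3"):
    """Return (image_name, pid) pairs whose image name starts with prefix.

    tasklist_csv is the text printed by `tasklist /FO CSV /NH`, where the
    first two columns are the image name and the PID.
    """
    matches = []
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) < 2:
            continue  # skip blank or informational lines
        image, pid = row[0], row[1]
        if image.lower().startswith(prefix.lower()):
            matches.append((image, int(pid)))
    return matches


def running_model_processes(prefix="hadsm3"):
    """Windows only: query the live process list and filter it."""
    result = subprocess.run(
        ["tasklist", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    )
    return find_model_processes(result.stdout, prefix)
```

If `running_model_processes()` returns an empty list during the hours when BOINC says it has resumed, that matches the situation described here: the client believes the model is running but no model process exists.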
ID: 849
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 856 - Posted: 12 Aug 2004, 19:46:56 UTC

The hadsm3* program is the interface between BOINC and the CPDN model program hadsum3*. Both programs should continue to run (but not actually do anything) when BOINC suspends processing. If hadsm3* is terminated (manually or abnormally) you could get an XML file corruption which can cause the model to be aborted.

The stdout.* and stderr.* files in your BOINC directory might give further clues.

ID: 856
KeeperC
Joined: 5 Aug 04
Posts: 66
Credit: 2,146,056
RAC: 0
Message 858 - Posted: 12 Aug 2004, 20:01:40 UTC - in response to Message 856.  

I can't find anything out of the ordinary in either stderr or stdout. I rebooted on 11th Aug which caused the model to be aborted. Prior to that, stdout just reports (incorrectly) the suspend and restart activity. Can't actually tell at all when the model crashed.


stderr.old:
2004-08-05 21:20:35 [SETI@home] Scheduler RPC to http://setiboincdata.ssl.berkeley.edu/sah_cgi/cgi failed
2004-08-05 21:20:35 [SETI@home] No schedulers responded
2004-08-05 21:20:35 [SETI@home] Deferring communication with project for 1 minutes and 0 seconds
2004-08-05 21:21:53 [SETI@home] No work from project
2004-08-05 21:21:53 [SETI@home] Deferring communication with project for 1 days, 0 hours, 0 minutes, and 0 seconds

stdout.txt:
2004-08-05 21:20:18 [---] Starting BOINC client version 4.03 for windows_intelx86
2004-08-05 21:20:18 [SETI@home] Project prefs: no separate prefs for home; using your defaults
2004-08-05 21:20:18 [---] State file has different major version (3.19); resetting projects
2004-08-05 21:20:18 [SETI@home] Resetting project
2004-08-05 21:20:18 [SETI@home] Host ID is 48433
2004-08-05 21:20:18 [---] General prefs: from SETI@home (last modified 2004-07-04 18:18:01)
2004-08-05 21:20:18 [---] General prefs: no separate prefs for home; using your defaults
2004-08-05 21:20:31 [---] CPU scheduler starvation imminent; requesting more work
2004-08-05 21:20:31 [SETI@home] Requesting 10840 seconds of work
2004-08-05 21:20:32 [SETI@home] Sending request to scheduler: http://setiboincdata.ssl.berkeley.edu/sah_cgi/cgi
2004-08-05 21:21:30 [http://climateprediction.net/] Project prefs: no separate prefs for home; using your defaults
2004-08-05 21:21:31 [---] CPU scheduler starvation imminent; requesting more work
2004-08-05 21:21:34 [---] CPU scheduler starvation imminent; requesting more work
2004-08-05 21:21:34 [http://climateprediction.net/] Requesting 5420 seconds of work
2004-08-05 21:21:34 [http://climateprediction.net/] Sending request to scheduler: http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
2004-08-05 21:21:37 [http://climateprediction.net/] Scheduler RPC to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded
2004-08-05 21:21:37 [climateprediction.net] Project prefs: no separate prefs for home; using your defaults
2004-08-05 21:21:37 [climateprediction.net] Started download of hadsm3_4.02_windows_intelx86.exe
2004-08-05 21:21:37 [climateprediction.net] Started download of hadsm3data_4.02_windows_intelx86.zip
2004-08-05 21:21:39 [---] CPU scheduler starvation imminent; requesting more work
2004-08-05 21:21:39 [SETI@home] Requesting 5420 seconds of work
2004-08-05 21:21:39 [SETI@home] Sending request to scheduler: http://setiboincdata.ssl.berkeley.edu/sah_cgi/cgi
2004-08-05 21:21:53 [SETI@home] Scheduler RPC to http://setiboincdata.ssl.berkeley.edu/sah_cgi/cgi succeeded
2004-08-05 21:21:53 [SETI@home] Message from server: To participate in this project, you must use major version 3 of the BOINC core client. Your core client is major version 4.
2004-08-05 21:21:53 [---] General prefs: from SETI@home (last modified 2004-07-04 18:18:01)
2004-08-05 21:21:53 [---] General prefs: using your defaults
2004-08-05 21:22:11 [climateprediction.net] Finished download of hadsm3_4.02_windows_intelx86.exe
2004-08-05 21:22:11 [climateprediction.net] Approximate throughput 30805.184019 bytes/sec
2004-08-05 21:22:15 [climateprediction.net] Started download of hadsm3um_4.02_windows_intelx86.zip
2004-08-05 21:23:43 [climateprediction.net] Finished download of hadsm3um_4.02_windows_intelx86.zip
2004-08-05 21:23:43 [climateprediction.net] Approximate throughput 21780.601688 bytes/sec
2004-08-05 21:23:43 [climateprediction.net] Started download of hadsm3se_4.02_windows_intelx86.zip
2004-08-05 21:24:12 [climateprediction.net] Finished download of hadsm3data_4.02_windows_intelx86.zip
2004-08-05 21:24:12 [climateprediction.net] Approximate throughput 29680.966218 bytes/sec
2004-08-05 21:24:12 [climateprediction.net] Started download of 0091_000025310.zip
2004-08-05 21:24:14 [climateprediction.net] Finished download of 0091_000025310.zip
2004-08-05 21:24:14 [climateprediction.net] Approximate throughput 5004.223612 bytes/sec
2004-08-05 21:24:17 [climateprediction.net] Finished download of hadsm3se_4.02_windows_intelx86.zip
2004-08-05 21:24:17 [climateprediction.net] Approximate throughput 25218.657865 bytes/sec
2004-08-05 21:24:17 [climateprediction.net] Starting computation for result 0091_000025310_0 using hadsm3 version 4.02
2004-08-05 21:40:06 [SETI@home] Resetting project
2004-08-05 21:40:06 [SETI@home] Detaching from project
2004-08-05 21:44:49 [climateprediction.net] Sending request to scheduler: http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
2004-08-05 21:44:52 [climateprediction.net] Scheduler RPC to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded
2004-08-05 21:44:52 [climateprediction.net] General preferences have been updated
2004-08-05 21:44:52 [---] General prefs: from climateprediction.net (last modified 2004-08-05 21:42:21)
2004-08-05 21:44:52 [---] General prefs: no separate prefs for home; using your defaults
2004-08-05 21:44:52 [---] Suspending computation and network activity - time of day
2004-08-05 22:00:00 [---] Resuming computation and network activity
2004-08-06 07:00:00 [---] Suspending computation and network activity - time of day
2004-08-06 22:00:00 [---] Resuming computation and network activity
2004-08-07 07:00:00 [---] Suspending computation and network activity - time of day
2004-08-07 22:00:00 [---] Resuming computation and network activity
2004-08-07 22:09:16 [climateprediction.net] Sending request to scheduler: http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
2004-08-07 22:09:19 [climateprediction.net] Scheduler RPC to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded
2004-08-07 22:09:19 [---] General prefs: from climateprediction.net (last modified 2004-08-05 21:42:21)
2004-08-07 22:09:19 [---] General prefs: using your defaults
2004-08-08 07:00:00 [---] Suspending computation and network activity - time of day
2004-08-08 22:00:00 [---] Resuming computation and network activity
2004-08-09 07:00:00 [---] Suspending computation and network activity - time of day
2004-08-09 22:00:00 [---] Resuming computation and network activity
2004-08-10 07:00:00 [---] Suspending computation and network activity - time of day
2004-08-10 22:00:00 [---] Resuming computation and network activity
2004-08-11 07:00:00 [---] Suspending computation and network activity - time of day
2004-08-11 22:00:00 [---] Resuming computation and network activity
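
Since the log itself never says the model stopped, one indirect signal is how long it has been since the project's scheduler was last contacted successfully. The sketch below assumes the line layout seen in the log above (`timestamp [source] message`) and treats every successful scheduler RPC for the project as a possible trickle. That is an assumption: the log does not label trickles as such. The function names are illustrative only.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
# 2004-08-07 22:09:19 [climateprediction.net] Scheduler RPC to ... succeeded
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")


def scheduler_contacts(lines, project="climateprediction.net"):
    """Timestamps of successful scheduler RPCs logged for one project."""
    stamps = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        when, source, message = m.groups()
        # The source is sometimes logged as a URL, so match by substring.
        if (project in source
                and message.startswith("Scheduler RPC")
                and message.endswith("succeeded")):
            stamps.append(datetime.strptime(when, "%Y-%m-%d %H:%M:%S"))
    return stamps


def days_since_last_contact(lines, now, project="climateprediction.net"):
    """Days between the last successful scheduler RPC and `now`, or None."""
    stamps = scheduler_contacts(lines, project)
    if not stamps:
        return None
    return (now - stamps[-1]) / timedelta(days=1)
```

For the log above, the last successful contact is 2004-08-07 22:09:19, so by the evening of 11 August roughly four days had passed with no further contact, even though a resume was logged every night.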

ID: 858
KeeperC
Joined: 5 Aug 04
Posts: 66
Credit: 2,146,056
RAC: 0
Message 859 - Posted: 12 Aug 2004, 20:16:11 UTC - in response to Message 858.  

BTW: This is the stdout file for the new model on restart. The reason for the new model download seems to be starvation: it has lost track of the previous model...


2004-08-11 23:22:12 [---] Starting BOINC client version 4.03 for windows_intelx86
2004-08-11 23:22:12 [---] No general preferences found - using BOINC defaults
2004-08-11 23:22:12 [---] Running CPU benchmarks
2004-08-11 23:22:18 [---] Suspending computation and network activity - running CPU benchmarks
2004-08-11 23:23:13 [---] Benchmark results:
2004-08-11 23:23:13 [---] Number of CPUs: 1
2004-08-11 23:23:13 [---] 1827 double precision MIPS (Whetstone) per CPU
2004-08-11 23:23:13 [---] 3773 integer MIPS (Dhrystone) per CPU
2004-08-11 23:23:13 [---] Finished CPU benchmarks
2004-08-11 23:23:14 [---] Resuming computation and network activity
2004-08-11 23:25:17 [http://climateprediction.net/] Project prefs: using your defaults
2004-08-11 23:25:18 [---] CPU scheduler starvation imminent; requesting more work
2004-08-11 23:25:22 [---] CPU scheduler starvation imminent; requesting more work
2004-08-11 23:25:22 [http://climateprediction.net/] Requesting 17280 seconds of work
2004-08-11 23:25:22 [http://climateprediction.net/] Sending request to scheduler: http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
2004-08-11 23:25:25 [http://climateprediction.net/] Scheduler RPC to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded
2004-08-11 23:25:25 [http://climateprediction.net/] General preferences have been updated
2004-08-11 23:25:25 [---] General prefs: from climateprediction.net (last modified 2004-08-05 21:42:21)
2004-08-11 23:25:25 [---] General prefs: no separate prefs for home; using your defaults
2004-08-11 23:25:25 [climateprediction.net] Project prefs: no separate prefs for home; using your defaults
2004-08-11 23:25:25 [climateprediction.net] Started download of hadsm3_4.02_windows_intelx86.exe
2004-08-11 23:25:25 [climateprediction.net] Started download of hadsm3data_4.02_windows_intelx86.zip
2004-08-11 23:25:57 [climateprediction.net] Finished download of hadsm3_4.02_windows_intelx86.exe
2004-08-11 23:25:57 [climateprediction.net] Approximate throughput 32978.527555 bytes/sec
2004-08-11 23:25:57 [climateprediction.net] Started download of hadsm3um_4.02_windows_intelx86.zip
2004-08-11 23:27:06 [climateprediction.net] Finished download of hadsm3um_4.02_windows_intelx86.zip
2004-08-11 23:27:06 [climateprediction.net] Approximate throughput 29382.080247 bytes/sec
2004-08-11 23:27:06 [climateprediction.net] Started download of hadsm3se_4.02_windows_intelx86.zip
2004-08-11 23:27:38 [climateprediction.net] Finished download of hadsm3se_4.02_windows_intelx86.zip
2004-08-11 23:27:38 [climateprediction.net] Approximate throughput 26471.214629 bytes/sec
2004-08-11 23:27:39 [climateprediction.net] Started download of 01xl_000027490.zip
2004-08-11 23:27:42 [climateprediction.net] Finished download of 01xl_000027490.zip
2004-08-11 23:27:42 [climateprediction.net] Approximate throughput 3204.227296 bytes/sec
2004-08-11 23:27:47 [climateprediction.net] Finished download of hadsm3data_4.02_windows_intelx86.zip
2004-08-11 23:27:47 [climateprediction.net] Approximate throughput 31635.423784 bytes/sec
2004-08-11 23:27:48 [climateprediction.net] Starting computation for result 01xl_000027490_0 using hadsm3 version 4.02
2004-08-12 07:00:00 [---] Suspending computation and network activity - time of day

ID: 859
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 860 - Posted: 12 Aug 2004, 20:18:11 UTC - in response to Message 858.  
Last modified: 12 Aug 2004, 20:21:03 UTC

> I can't find anything out of the ordinary in either stderr or stdout. I
> rebooted on 11th Aug which caused the model to be aborted. Prior to that,
> stdout just reports (incorrectly) the suspend and restart activity. Can't
> actually tell at all when the model crashed.
>
> stdout.txt:
> 2004-08-05 21:20:18 [---] Starting BOINC client version 4.03 for
> windows_intelx86
> 2004-08-05 21:20:18 [SETI@home] Project prefs: no separate prefs for home;
> using your defaults
> 2004-08-05 21:20:18 [---] State file has different major version (3.19);
> resetting projects
> 2004-08-05 21:20:18 [SETI@home] Resetting project
> 2004-08-05 21:20:18 [SETI@home] Host ID is 48433
> 2004-08-05 21:21:30 [http://climateprediction.net/] Project prefs: no separate
> prefs for home; using your defaults
> 2004-08-05 21:21:53 [SETI@home] Message from server: To participate in this
> project, you must use major version 3 of the BOINC core client. Your core
> client is major version 4.

I think this is part of your problem. BOINC is attached to the SETI and CPDN projects but they're not compatible. SETI won't run with BOINC major version 4 (your log shows client version 4.03) and CPDN requires it.

ID: 860
KeeperC
Joined: 5 Aug 04
Posts: 66
Credit: 2,146,056
RAC: 0
Message 863 - Posted: 12 Aug 2004, 20:41:27 UTC - in response to Message 860.  


I did start running while still registered to SETI. But I disconnected from that project within 20 minutes of starting Boinc 4.02 up and 6 days before this error occurred. Do you really think that was the problem? Why wait 5 days to crash? It certainly trickled at least once after I disconnected from SETI.

Also I just noticed that I now have my machine registered twice. It was obviously registered when I first joined the project on August 5th (id 158). But after the reboot, when the new model downloaded, it was apparently registered a second time (id 936). I only just found this out browsing my account.

I certainly don't want my machine re-registered on every model download. Think about my stats!
ID: 863
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 866 - Posted: 12 Aug 2004, 21:18:53 UTC - in response to Message 858.  

> 2004-08-05 21:24:17 [climateprediction.net] Starting computation for result
> 0091_000025310_0 using hadsm3 version 4.02
> 2004-08-05 21:40:06 [SETI@home] Resetting project
> 2004-08-05 21:40:06 [SETI@home] Detaching from project

Missed your detach from SETI here. Sorry.

> 2004-08-05 21:44:52 [---] Suspending computation and network activity - time
> of day
> 2004-08-05 22:00:00 [---] Resuming computation and network activity
> 2004-08-06 07:00:00 [---] Suspending computation and network activity - time
> of day
> 2004-08-06 22:00:00 [---] Resuming computation and network activity
> 2004-08-07 07:00:00 [---] Suspending computation and network activity - time
> of day
> 2004-08-07 22:00:00 [---] Resuming computation and network activity
> 2004-08-07 22:09:16 [climateprediction.net] Sending request to scheduler:
> http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
> 2004-08-07 22:09:19 [climateprediction.net] Scheduler RPC to
> http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded

That's got to be a trickle, after just over 2 days of running during your timed period.

> 2004-08-07 22:09:19 [---] General prefs: from climateprediction.net (last
> modified 2004-08-05 21:42:21)
> 2004-08-07 22:09:19 [---] General prefs: using your defaults
> 2004-08-08 07:00:00 [---] Suspending computation and network activity - time
> of day
> 2004-08-08 22:00:00 [---] Resuming computation and network activity
> 2004-08-09 07:00:00 [---] Suspending computation and network activity - time
> of day
> 2004-08-09 22:00:00 [---] Resuming computation and network activity

If things were running normally you should probably have had a second trickle somewhere around here. Just had a thought. The file 0091_000025310.xml should still be in your climateprediction.net directory, and its timestamp will reveal when things went wrong. Then look in the event log to see if there's anything to indicate what happened around that time.

The fact that your machine has re-registered is pointing towards a corruption of the client_state.xml file in the BOINC directory.

I wouldn't worry too much about your stats when the machine re-registered (although you've obviously not been credited for the timesteps done after the first trickle). The new model is still crunching for the same account (it'll just have one more computer).
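
For anyone wanting to read the file timestamps suggested above without clicking through Explorer, here is a small sketch. It assumes only that the model's `.xml` files sit in a single directory that you pass in; the function name and the choice to sort newest first are illustrative, not anything BOINC provides.

```python
import os
from datetime import datetime


def files_by_mtime(directory, suffix=".xml"):
    """Return (name, last_modified) pairs for matching files, newest first."""
    entries = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if name.lower().endswith(suffix) and os.path.isfile(path):
            # getmtime gives seconds since the epoch; convert to local time
            # so it can be compared with the local timestamps in the log.
            modified = datetime.fromtimestamp(os.path.getmtime(path))
            entries.append((name, modified))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)
```

Calling it on the climateprediction.net directory and looking at the entry for 0091_000025310.xml gives the last moment the model wrote to that file, which is the best available estimate of when it stopped.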

ID: 866
KeeperC
Joined: 5 Aug 04
Posts: 66
Credit: 2,146,056
RAC: 0
Message 869 - Posted: 12 Aug 2004, 21:46:34 UTC - in response to Message 866.  


> If things were running normally you should probably have had a second trickle
> somewhere around here. Just had a thought. The file 0091_000025310.xml
> should still be in your climateprediction.net directory, and its timestamp
> will reveal when things went wrong. Then look in the event log to see if
> there's anything to indicate what happened around that time.

Last modified 08.08.04 06:47

System event log extract:
Information 08/08/2004 14:29:30 Service Control Manager None 7036 N/A D800
Information 08/08/2004 14:29:30 Service Control Manager None 7035 SYSTEM D800
Warning 08/08/2004 10:32:54 Dhcp None 1003 N/A D800
Information 08/08/2004 10:26:17 RemoteAccess None 20159 N/A D800
Information 08/08/2004 10:26:08 RemoteAccess None 20158 N/A D800
Information 08/08/2004 10:25:48 Service Control Manager None 7035 SYSTEM D800
Information 08/08/2004 10:25:43 Service Control Manager None 7035 SYSTEM D800
Information 08/08/2004 10:25:38 Service Control Manager None 7035 SYSTEM D800
Information 08/08/2004 01:04:13 srservice None 108 N/A D800

Not sure what, if any, sign of a problem there is there.
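
One way to place the file timestamp above against the stdout log is to pair each "Resuming" line with the following "Suspending" line and see which compute window the timestamp falls into. This is a sketch under two assumptions: that "08.08.04 06:47" means 8 August 2004 at 06:47 on the same local clock as the log, and that the `[---]` client lines have the layout shown earlier in the thread. The function names are hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def compute_windows(lines):
    """Pair each 'Resuming' client line with the next 'Suspending' line."""
    windows, start = [], None
    for line in lines:
        stamp, _, rest = line.partition(" [---] ")
        if rest.startswith("Resuming computation"):
            start = datetime.strptime(stamp, FMT)
        elif rest.startswith("Suspending computation") and start is not None:
            windows.append((start, datetime.strptime(stamp, FMT)))
            start = None
    return windows


def window_containing(moment, windows):
    """Return the (start, end) window that contains moment, or None."""
    for start, end in windows:
        if start <= moment <= end:
            return (start, end)
    return None
```

Under those assumptions, 06:47 on 8 August lies inside the window that opened at 22:00 on 7 August and closed at 07:00 on 8 August, so the model was apparently still writing 13 minutes before that morning's suspend. The system event log entries listed above all fall outside any compute window except the 01:04:13 one.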
ID: 869
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 887 - Posted: 13 Aug 2004, 6:47:14 UTC

Sorry, I should have specified the application log :(

ID: 887