Oddities happen

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 6897 - Posted: 13 Dec 2004, 6:03:27 UTC
Last modified: 13 Dec 2004, 6:39:56 UTC

Hi, all,

My Bbox uploaded another pair of Models today. All seemed well except for a to-be-expected "CPDN Monitor got quit request...". The oddity occurred afterwards, when the restarted Model's elapsed time and average time (AVG) went to zip.

24zj_100120789 - PH 1 TS 006571 - 17/04/1811 21:30 - H:M:S=0005:23:19 AVG= 2.95 DLT= 2.00
26nv_100122983 - PH 1 TS 000052 - 02/12/1810 02:00 - H:M:S=0000:02:39 AVG= 3.07 DLT= 0.81
26nv_100122983 - PH 1 TS 000053 - 02/12/1810 02:30 - H:M:S=0000:02:40 AVG= 3.03 DLT= 0.89
26nv_100122983 - PH 1 TS 000054 - 02/12/1810 03:00 - H:M:S=0000:02:42 AVG= 3.01 DLT= 1.83
CPDN Monitor got quit request...
Detaching shared memory...
2004-12-12 21:30:10 [climateprediction.net] Result 24zj_100120789_3 exited with zero status but no 'finished' file
2004-12-12 21:30:10 [climateprediction.net] If this happens repeatedly you may need to reset the project.
Starting model in /home/jim/CPDNboinc/projects/climateapps2.oucs.ox.ac.uk_cpdnboinc...
Created shared memory region key = 26015
Env Used=LD_LIBRARY_PATH=/home/jim/CPDNboinc/projects/climateapps2.oucs.ox.ac.uk_cpdnboinc:/usr/local/lib:/usr/lib:/lib
2004-12-12 21:30:10 [climateprediction.net] Restarting result 24zj_100120789_3 using hadsm3 version 4.04
Starting model ID 24zj_100120789 Phase 1
Stack size=4096.00 MB
Waiting for model startup, this may take a minute...
26nv_100122983 - PH 1 TS 000055 - 02/12/1810 03:30 - H:M:S=0000:02:43 AVG= 2.97 DLT= 0.97
24zj_100120789 - PH 1 TS 006481 - 16/04/1811 00:30 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00
26nv_100122983 - PH 1 TS 000056 - 02/12/1810 04:00 - H:M:S=0000:02:55 AVG= 3.13 DLT=11.92
24zj_100120789 - PH 1 TS 006482 - 16/04/1811 01:00 - H:M:S=0000:00:12 AVG= 0.00 DLT=12.41
.
.
.
24zj_100120789 - PH 1 TS 006508 - 16/04/1811 14:00 - H:M:S=0000:01:25 AVG= 0.01 DLT= 0.99
26nv_100122983 - PH 1 TS 000085 - 02/12/1810 18:30 - H:M:S=0000:04:11 AVG= 2.96 DLT= 0.99


The Model that uploaded a few hours earlier got the "quit request" and resumed with zero time. It's not the first time anomaly I've had, but it is the first that happened in Phase 1 and reset the time to zero.
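For anyone who wants to spot this kind of reset in a saved console log, here is a minimal Python sketch. It assumes the status-line layout is exactly what appears in the excerpt above (model ID, PH, TS, model date, H:M:S, AVG, DLT); that layout is inferred from the quoted lines, not from any CPDN documentation, and the function names are made up for illustration.

```python
import re

# Pattern inferred from the status lines quoted in this thread (an assumption).
STATUS = re.compile(
    r"^(?P<model>\S+) - PH\s*(?P<phase>\d+) TS (?P<ts>\d+) - "
    r"(?P<date>\S+ \S+) - H:M:S=(?P<h>\d+):(?P<m>\d+):(?P<s>\d+) "
    r"AVG=\s*(?P<avg>[\d.]+) DLT=\s*(?P<dlt>[\d.]+)"
)

def parse_status(line):
    """Return a dict for a model status line, or None for any other line."""
    m = STATUS.match(line.strip())
    if m is None:
        return None
    elapsed = int(m["h"]) * 3600 + int(m["m"]) * 60 + int(m["s"])
    return {"model": m["model"], "ts": int(m["ts"]), "elapsed": elapsed}

def find_time_resets(lines):
    """List (model, elapsed_before, elapsed_after, ts_before, ts_after)
    for every point where a model's elapsed time goes backwards."""
    last = {}
    resets = []
    for line in lines:
        rec = parse_status(line)
        if rec is None:
            continue  # skip BOINC messages and other non-status lines
        prev = last.get(rec["model"])
        if prev is not None and rec["elapsed"] < prev["elapsed"]:
            resets.append((rec["model"], prev["elapsed"], rec["elapsed"],
                           prev["ts"], rec["ts"]))
        last[rec["model"]] = rec
    return resets

# Lines copied from the log excerpt above.
SAMPLE = [
    "24zj_100120789 - PH 1 TS 006571 - 17/04/1811 21:30 - H:M:S=0005:23:19 AVG= 2.95 DLT= 2.00",
    "26nv_100122983 - PH 1 TS 000054 - 02/12/1810 03:00 - H:M:S=0000:02:42 AVG= 3.01 DLT= 1.83",
    "CPDN Monitor got quit request...",
    "26nv_100122983 - PH 1 TS 000055 - 02/12/1810 03:30 - H:M:S=0000:02:43 AVG= 2.97 DLT= 0.97",
    "24zj_100120789 - PH 1 TS 006481 - 16/04/1811 00:30 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00",
]

print(find_time_resets(SAMPLE))
```

On these lines the sketch flags only 24zj_100120789, whose elapsed time drops from 19399 seconds to 0 while the timestep rolls back from 6571 to 6481.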

What next!?

(With this glitch in mind, and other stats difugalties, please excuse me for taking a dim view of all stats except total Models successfully completed.)

Happy Holidays to all.


Edit: SuSE Linux 9.0, P4 3.0 HT, 1 gig RAM
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
old_user156
Joined: 5 Aug 04
Posts: 186
Credit: 1,612,182
RAC: 0
Message 6955 - Posted: 15 Dec 2004, 3:54:32 UTC
Last modified: 15 Dec 2004, 3:56:54 UTC

I happen to agree with you, Jim, and wish we still had a count of properly completed models displayed in our stats. One can add them up in one's 'results' pages by looking for credit awards of 5444.21 (initially) or 6805.26 (latterly) cobblestones for any particular result. BOINC server errors, because of restoration from backup or for other reasons, can mean the 'Outcome' column is occasionally incorrect, much as Classic occasionally miscounted a 'full run': I did something like 120 completed full runs under 'Classic CPDN' but 'only' have 101 reported complete in my Classic stats.
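The adding-up could be sketched in Python along these lines. This is only an illustration: it assumes the credit figures have already been copied by hand from a 'results' page into a list, and the function name and sample list are hypothetical. The two full-run credit values are the ones quoted above.

```python
from decimal import Decimal

# Credit awarded for a completed model, per the figures quoted in this post.
FULL_RUN_CREDITS = {Decimal("5444.21"), Decimal("6805.26")}

def count_full_runs(credits):
    """Count entries whose granted credit equals a full-run award.

    `credits` is an iterable of strings or numbers copied from a results
    page. Decimal is used so that 5444.21 compares exactly.
    """
    return sum(1 for c in credits if Decimal(str(c)) in FULL_RUN_CREDITS)

# Hypothetical results list: two full runs, one partial, one with no credit.
example = ["5444.21", "6805.26", "1234.50", "0"]
print(count_full_runs(example))  # 2
```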

If models only crashed and reported back when their parameters were actually non-viable, instead of mostly (IMO) because of machine error, then _any_ run completed, at whatever stage of computation, would be, IMNSHO, effectively a 'full run'. But the line between 'stopped early due to machine error' and 'stopped early due to parameters' would be rather difficult to detect whilst we're running these models; post-analysis of all the uploaded data ought to be able to pinpoint it more reliably.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 7124 - Posted: 31 Dec 2004, 17:40:25 UTC
Last modified: 31 Dec 2004, 18:35:58 UTC

It's still happening, on my Dbox this time (P4, 3.0 GHz, 1 gig RAM, SuSE 9.0 [fully updated], tens of gig of free disk space, boinc 4.13):

2qvn_100149442 - PH 1 TS 000058 - 02/12/1810 05:00 - H:M:S=0000:02:40 AVG= 2.78 DLT= 0.93
2osr_100146719 - PH 1 TS 052769 - 20/12/1813 08:30 - H:M:S=0040:24:36 AVG= 2.76 DLT= 1.00
CPDN Monitor got quit request...
Detaching shared memory...
2004-12-31 09:46:39 [climateprediction.net] Result 2osr_100146719_2 exited with zero status but no 'finished' file
2004-12-31 09:46:39 [climateprediction.net] If this happens repeatedly you may need to reset the project.
Starting model in /home/jim/CPDNboinc/projects/climateapps2.oucs.ox.ac.uk_cpdnboinc...
Created shared memory region key = 26330
Env Used=LD_LIBRARY_PATH=/home/jim/CPDNboinc/projects/climateapps2.oucs.ox.ac.uk_cpdnboinc:/usr/local/lib:/usr/lib:/lib
2004-12-31 09:46:39 [climateprediction.net] Restarting result 2osr_100146719_2 using hadsm3 version 4.04
Starting model ID 2osr_100146719 Phase 1
Stack size=4096.00 MB
Waiting for model startup, this may take a minute...
2qvn_100149442 - PH 1 TS 000059 - 02/12/1810 05:30 - H:M:S=0000:02:41 AVG= 2.75 DLT= 1.00
2osr_100146719 - PH 1 TS 052705 - 19/12/1813 00:30 - H:M:S=0000:00:00 AVG= 0.00 DLT= 0.00
2qvn_100149442 - PH 1 TS 000060 - 02/12/1810 06:00 - H:M:S=0000:02:42 AVG= 2.72 DLT= 1.00

One wonders what there is about start-up of the second Model in a pair that triggers (if it does) this anomaly, and why the time resets to zero. (Doesn't do the Sec/TS [AVG=] value any good.) The good part is that the Model doesn't die.

May your 2005 be better than your 2004.


Edit: There is also the ever-popular:

No heartbeat from core client - exiting

In this case, the time DOESN'T reset, so one is "double-charged" for the early processing and Model re-start, also making a hash of the Sec/TS value.
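To put numbers on both effects, here is a rough Python sketch. It assumes AVG is simply elapsed seconds divided by the timestep count, which is consistent with the log lines quoted in this thread but is not confirmed by any CPDN documentation. The "double-charged" estimate further assumes the repeated timesteps cost the same as the earlier average and ignores start-up overhead; the function names are made up for illustration.

```python
def to_seconds(hms):
    """Convert an 'HHHH:MM:SS' string from the status line to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

def sec_per_ts(hms, timestep):
    """Assumed definition of AVG: elapsed seconds per timestep."""
    return to_seconds(hms) / timestep

def double_charged_avg(hms, ts_at_quit, ts_at_restart):
    """Estimate AVG once the model is back at ts_at_quit when the clock
    is NOT reset, so the repeated timesteps are counted twice.
    Assumes the redone timesteps run at the previous average rate."""
    elapsed = to_seconds(hms)
    redone = ts_at_quit - ts_at_restart
    extra = redone * (elapsed / ts_at_quit)
    return (elapsed + extra) / ts_at_quit

# Values taken from the log excerpts in this thread.
print(round(sec_per_ts("0005:23:19", 6571), 2))   # 2.95, before the restart
print(round(sec_per_ts("0000:01:25", 6508), 2))   # 0.01, after the reset to zero
print(round(sec_per_ts("0040:24:36", 52769), 2))  # 2.76, Dbox before the restart

# Hypothetical: the same Bbox restart, had the clock kept running instead.
print(double_charged_avg("0005:23:19", 6571, 6481))
```

Either way the AVG figure stops describing the machine's real speed: a reset drags it toward zero, while a restart without a reset pushes it up.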