Iceworlds & Slowdowns hadsm3/mh - Closed - Discussion

geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2173
Credit: 64,760,426
RAC: 3,180
Message 30790 - Posted: 3 Oct 2007, 17:51:10 UTC
Last modified: 27 Mar 2010, 1:25:45 UTC

/Placeholder/
Please use this thread if you have the symptoms of a problem described in this sticky on iceworlds and associated slow model progress in hadsm3/hadsm3h.
ID: 30790
old_user81336

Joined: 10 Jun 05
Posts: 10
Credit: 4,863
RAC: 0
Message 30792 - Posted: 4 Oct 2007, 4:29:42 UTC

I have had that happen to me with 3 models in a row, and not only with the SM ones but also with a CM model. Easing back on my overclock did seem to help. The jury is still out on that question: I haven't completed a model since lowering my overclock, but it is looking good.
ID: 30792
astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 30793 - Posted: 4 Oct 2007, 4:47:56 UTC


You didn't indicate how drastic your O/C is. How well does it play with 24 hours of Prime-95 (one copy per core)?

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 30793
old_user81336

Joined: 10 Jun 05
Posts: 10
Credit: 4,863
RAC: 0
Message 30794 - Posted: 4 Oct 2007, 5:10:56 UTC - in response to Message 30793.  


You didn't indicate how drastic your O/C is. How well does it play with 24 hours of Prime-95 (one copy per core)?


I had it clocked at something like 3.07 GHz, now it is at 3.0 GHz (stock 2.4 GHz). Hearing/reading about all the achieved overclocks on a Q6600, 3 GHz does not seem excessive to me.

I haven't run Prime95 for 24 hours, only for something like 6 hours (on all 4 cores) and just one pass for memtest86+.
ID: 30794
MikeMarsUK
Volunteer moderator

Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 30795 - Posted: 4 Oct 2007, 7:01:59 UTC


6 hours isn't really enough for Prime95; my Q6600 was passing at 8 hours but failing before 24 hours. I'm now awaiting an RMA from OCZ to return a dodgy memory stick...

Make sure that the Prime95 test is using a large proportion of the system memory, otherwise it will detect CPU errors but not memory errors.
I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 30795
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 30796 - Posted: 4 Oct 2007, 9:18:07 UTC


Hearing/reading about all the achieved overclocks on a Q6600, 3 GHz does not seem excessive to me.


It depends on the context in which you heard/read this.
These climate programs are based on huge Fortran programs running on supercomputers, and are rather touchy when running on mere desktop computers.
Extreme stability is far more important than 'extreme' overclocking.

ID: 30796
old_user81336

Joined: 10 Jun 05
Posts: 10
Credit: 4,863
RAC: 0
Message 30797 - Posted: 4 Oct 2007, 17:15:39 UTC

I tested my Q6600 at the maximum power consumption/heat production setting. I'd say the test with memtest86+ should have taken care of the memory testing.
For Climateprediction all my testing may have been too short, but in my opinion it should have been enough for normal BOINC usage. The core temperatures were acceptable, I think: about 60C at 100% load, and about 50C to 55C for a 'normal' 100% load.

I haven't had any problems with CPDN since reducing my overclock. Would you recommend more testing despite this? BTW, I'm not crunching 24/7.
ID: 30797
MikeMarsUK
Volunteer moderator

Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 30802 - Posted: 4 Oct 2007, 20:23:24 UTC


Sounds like you've solved the problem, but yes, it's worthwhile doing the full test.

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 30802
Ant B

Joined: 29 Mar 06
Posts: 8
Credit: 2,793,692
RAC: 0
Message 30806 - Posted: 4 Oct 2007, 21:10:52 UTC

I had one of these I think. I hope this link works
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6732317
At that time I was running two slab models, and this one just stopped. Not quite stopped, but it went from two trickles a day to not managing to get out of June 2075 or wherever it was. I was frustrated because it was at about 95% complete. Killing it and restoring from a backup made no difference - it got to the same place and just 'stopped' again. Still was using all the CPU, and did move between timesteps, but so slowly it never got to the next checkpoint even. It wasn't looping as far as I could see. PC is standard issue, Intel Core 2 6320, not overclocked, running on 1GB RAM at that stage. It's the first one I have aborted :(
ID: 30806
old_user81336

Joined: 10 Jun 05
Posts: 10
Credit: 4,863
RAC: 0
Message 30808 - Posted: 4 Oct 2007, 21:37:40 UTC - in response to Message 30802.  


Sounds like you've solved the problem, but yes, it's worthwhile doing the full test.


I'll certainly consider it very seriously. I suppose I can stand to lose 24 hours of crunching. ;)
ID: 30808
Jim Kleine

Joined: 21 Dec 05
Posts: 3
Credit: 1,168,435
RAC: 0
Message 30829 - Posted: 6 Oct 2007, 3:31:35 UTC
Last modified: 6 Oct 2007, 3:49:33 UTC

I seem to have an example of this problem: Result 6817821.

I run BOINC as a service, so unfortunately I can't provide timestep, temperature display or s/TS from the graphics because I can't display them (unless someone can tell me how). Looking at the trickle history, s/TS increased from approx 1.35 to 3.14 between 18 Sep (last "fast" trickle) and 01 Oct (last trickle for this result).

The processor is an Intel Core 2 6420/not overclocked/2GB RAM. Progress on this model has slowed from 6 trickles per day to 1 trickle per 8 days i.e. by a factor of approx 50. There are no known reliability issues with this host with any BOINC project (it runs CPDN/Einstein/Rosetta; one core dedicated to CPDN). This is a server, so it runs 24x7.
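
As a quick check of the slowdown factor quoted above, here is a minimal sketch in Python. It uses only the trickle rates and s/TS figures given in this post; the function name `slowdown_factor` is made up for illustration and is not part of BOINC or CPDN.

```python
def slowdown_factor(fast_per_day: float, slow_per_day: float) -> float:
    """Ratio of the trickle rate before the slowdown to the rate after it."""
    return fast_per_day / slow_per_day


# Figures quoted in the post: 6 trickles per day before, 1 trickle per 8 days after.
trickle_factor = slowdown_factor(6.0, 1.0 / 8.0)

# The s/TS shown in the trickle history rose from about 1.35 to about 3.14.
reported_s_per_ts_ratio = 3.14 / 1.35

print(f"Trickle rate dropped by a factor of {trickle_factor:.0f}")
print(f"Reported s/TS rose by a factor of {reported_s_per_ts_ratio:.1f}")
```

The trickle rate works out at a factor of 48, which matches "approx 50". The reported s/TS rises far less than that, which would fit with the figure being an average over a longer stretch of the run rather than the current speed; that reading is an assumption here, not something confirmed by the project.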

Current progress is 75.934%. I believe the slowdown started just prior to 75%. Does it serve any useful science purpose to allow this result to finish?
ID: 30829
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 30831 - Posted: 6 Oct 2007, 10:10:43 UTC


Hi Jim

There is/was an option in BOINC to "Interact with the desktop", which allowed for graphics.
I think this had to be selected at install, and it may also not be there now.
Various options/features/facilities have come and gone over the years, as BOINC has changed.

With such a large slowdown, I think that it would be fair enough to abort it.


Backups: Here
ID: 30831
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 30833 - Posted: 6 Oct 2007, 10:36:47 UTC

There's another member processing the same workunit, a bit less advanced than Jim. Interestingly, his model seems to have slowed down at exactly the same point - 4 trickles into phase 3.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6817821

I think I'll send this member a private message advising them to abort it as well. (If anyone thinks this isn't a good idea, please post or send me a PM.) Not sure what to do about a third cruncher still crunching phase 1 of the model. We could let this person continue for the time being and if they hit the same slowdown at the same point, send a PM then.

Abort it, Jim.
Cpdn news
ID: 30833
Jim Kleine

Joined: 21 Dec 05
Posts: 3
Credit: 1,168,435
RAC: 0
Message 30835 - Posted: 6 Oct 2007, 12:37:03 UTC

Before I do abort it, is it worthwhile:

a) Waiting for the next trickle (should be within 48 hours). Is there any useful (science) data in a trickle other than progress stats?; or

b) Saving any files from the model run that might help analyse/troubleshoot the reason for the slowdown? If yes, which ones?

If you need time to consult the developers etc, that's fine. I can leave the model running until you have a consensus.
ID: 30835
Iain Inglis

Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 30837 - Posted: 6 Oct 2007, 14:04:44 UTC

Jim,

Unfortunately, the science data for the slab models is in the Zip file upload at the end of a phase. It's different for the coupled models, which upload useful data every year, more at decades and even more every 40 years.

Given that the slab models are relatively short, I suspect that the project would debug the models by re-starting from the beginning - so your offer isn't likely to be taken up. There are quite a few of these odd slab models, so the project won't be short of candidates.

Iain
ID: 30837
Jim Kleine

Joined: 21 Dec 05
Posts: 3
Credit: 1,168,435
RAC: 0
Message 30848 - Posted: 6 Oct 2007, 23:01:15 UTC
Last modified: 6 Oct 2007, 23:05:16 UTC

OK ... aborted.

CPU Time: 647:32:42
Progress: 76.067%

I have saved a copy of the complete BOINC directory subtree and will keep it for 30 days in case anyone cares to revisit this.
ID: 30848
old_user197041

Joined: 27 Aug 06
Posts: 26
Credit: 162,685
RAC: 0
Message 30851 - Posted: 7 Oct 2007, 8:58:03 UTC - in response to Message 30831.  


...
There is/was an option in BOINC to "Interact with the desktop", which allowed for graphics.
I think this had to be selected at install, and it may also not be there now.
...


It has to be done manually. Instructions are here.
Kathryn :o)
The BOINC FAQ Service
The Unofficial BOINC Wiki
The Trac System
More BOINC information than you can shake a stick of RAM at.
ID: 30851
old_user91851

Joined: 8 Aug 05
Posts: 9
Credit: 46,744
RAC: 0
Message 31029 - Posted: 20 Oct 2007, 17:29:07 UTC

I've got an iceworld.

Result ID: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6725943
Current timestep: 149254 of 259248
s/Ts: 1.78
Colour: Blue iceworld
CPU: Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz [x86 Family 6 Model 15 Stepping 6]
Overclocking: No, bog standard.

So I guess I should abort the task? I've got another one running on the other core but it isn't as far advanced. It's OK so far.
ID: 31029
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 31034 - Posted: 20 Oct 2007, 19:32:27 UTC


I suspect that a lot of these "iceworlds" aren't.
It depends on the temperatures, rather than the colour of the globe display.

For instance, my current 80 year TCM is very hot, (here), but has started displaying a white globe, sometimes changing to a blue globe.
And the processing has stopped, even though the cpu time is still increasing.

I've found that I can coax it along one timestep (and half an hour) at a time, by opening and closing the globe display (and waiting), and by other 'fiddling'.
I think there may be a bug of some sort, and I'm going to try and see what makes it 'move', and also if I can get it going again by itself.

All of this after it slowed way down a couple of weeks ago, and now with 66 hours to go. :(

What you do with your model, though, is another matter.
There's millions of others to run if you want to abort.


Backups: Here
ID: 31034
mo.v
Volunteer moderator

Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31035 - Posted: 20 Oct 2007, 20:01:42 UTC

DKR, it looks as if your model has slowed right down, so much that it hasn't trickled for days, since 16 Oct. The 1.78 s/TS is a cumulative figure - it's probably much slower than that now. If this really is the case, I would abort it as it still has quite a few years left to crunch.
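
To illustrate why a cumulative s/TS can hide a recent slowdown, here is a rough sketch in Python. The timestep count and the 1.78 s/TS come from the report above; the earlier speed of 1.70 s/TS over the first 149,000 timesteps is purely an assumed figure for illustration, and `recent_s_per_ts` is a hypothetical helper, not anything from CPDN or BOINC.

```python
def recent_s_per_ts(total_ts, cumulative_s_per_ts, earlier_ts, earlier_s_per_ts):
    """Average seconds per timestep over the most recent stretch of the run.

    Assumes the displayed s/TS is total CPU seconds divided by total timesteps.
    """
    total_seconds = total_ts * cumulative_s_per_ts
    earlier_seconds = earlier_ts * earlier_s_per_ts
    return (total_seconds - earlier_seconds) / (total_ts - earlier_ts)


# From the report: timestep 149254, cumulative s/TS 1.78.
# ASSUMED for illustration: the first 149000 timesteps averaged 1.70 s/TS.
recent = recent_s_per_ts(149254, 1.78, 149000, 1.70)
print(f"Recent speed: {recent:.1f} s/TS")
```

Under those assumed numbers the last 254 timesteps would be taking several tens of seconds each, even though the displayed figure has only crept up to 1.78.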
Cpdn news
ID: 31035