climateprediction.net home page
Name BOINC mis-estimates runtime for hadam3pm2_k00w

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,723,322
RAC: 2,854
Message 51076 - Posted: 30 Dec 2014, 10:42:05 UTC - in response to Message 51072.  

Some of my hadam3pm2 wu's have had interruptions and I guess that's why they finish with an error of zip files being absent. Are these wu's still useful for something or could they just as well have been aborted?

That's a question only the scientists can really answer, since it depends on whether a hole in the data series matters to their analysis. The current generation of applications - and this one in particular - are not production quality in my opinion: an application should not be allowed out of beta testing if it requires the volunteer to do particular things or to never stop their machine. In mitigation, the application is Linux and Linux users might be more engaged than users of other platforms.

My protocol, for the flawed HADCM3N as well as these new models, is to watch the message boards to identify what problems there appear to be and to then opt in only to those model types that I can plausibly run to valid completion (ANZ, EU, PNW, AFR unconstrained; HADCM3N etc. uninterrupted), ignoring anything else.
ID: 51076
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51077 - Posted: 30 Dec 2014, 11:05:20 UTC

To test this "don't interrupt" idea, I have 3 different types running on one of my machines. See here.

This morning, after the 2nd zips for the MOSES models appeared, I suspended BOINC, exited, and then re-booted.
Checking the zip creation times, one type is just under 12 hours, the other a bit over 13 hours apart.
Both have now created their zip 3s, and the Africa model has all of its zips, and is waiting to upload and report.

So I don't see a problem with any of these 3 types, but I'm not uploading anything until midday tomorrow, by which time I should have the zip 4s.


ID: 51077
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 51078 - Posted: 30 Dec 2014, 11:24:44 UTC - in response to Message 51077.  

Thanks for doing this test.

To test this "don't interrupt" idea, I have 3 different types running on one of my machines. See here.

This morning, after the 2nd zips for the MOSES models appeared, I suspended BOINC, exited, and then re-booted.
Checking the zip creation times, one type is just under 12 hours, the other a bit over 13 hours apart.
Both have now created their zip 3s, and the Africa model has all of its zips, and is waiting to upload and report.

So I don't see a problem with any of these 3 types, but I'm not uploading anything until midday tomorrow, by which time I should have the zip 4s.




ID: 51078
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 51079 - Posted: 30 Dec 2014, 11:38:20 UTC - in response to Message 51076.  

I watch the message boards.
I choose the "linux only" because that's what I've got.
I've less time now than way back when I did beta stuff.
But I'll still take the problematic "linux only" because -- who else will - and I admit I've been slacking off about wingmen with no 32-bit -- I really should report them -- anyhow --

The big deal for me now is a huge choice of models and no complaints from anywhere about "out of work".

I gotta go recruiting :)

Some of my hadam3pm2 wu's have had interruptions and I guess that's why they finish with an error of zip files being absent. Are these wu's still useful for something or could they just as well have been aborted?

That's a question only the scientists can really answer, since it depends on whether a hole in the data series matters to their analysis. The current generation of applications - and this one in particular - are not production quality in my opinion: an application should not be allowed out of beta testing if it requires the volunteer to do particular things or to never stop their machine. In mitigation, the application is Linux and Linux users might be more engaged than users of other platforms.

My protocol, for the flawed HADCM3N as well as these new models, is to watch the message boards to identify what problems there appear to be and to then opt in only to those model types that I can plausibly run to valid completion (ANZ, EU, PNW, AFR unconstrained; HADCM3N etc. uninterrupted), ignoring anything else.


ID: 51079
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51207 - Posted: 13 Jan 2015, 22:11:56 UTC

As a belated follow up to my post about testing some models:

All 3 model types successfully recovered from being shut down and the computer re-booted. So all 3 types are robust.

Africa regional model

MOSES II

MOSES II + Triffid


ID: 51207
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2183
Credit: 64,822,615
RAC: 5,275
Message 51228 - Posted: 15 Jan 2015, 4:22:33 UTC - in response to Message 51207.  
Last modified: 15 Jan 2015, 4:24:04 UTC

As a belated follow up to my post about testing some models:

All 3 model types successfully recovered from being shut down and the computer re-booted. So all 3 types are robust.

Africa regional model

MOSES II

MOSES II + Triffid




I'm not sure, Les. Three of my UK Met Office HadAM3P (global only) with MOSES II land surface scheme v7.03 models were interrupted by a power hit. All three ran to completion, but finished with error results because each had one upload file that was never produced. In each case the missing upload file was the one that would have been created next after the power interruption.

Tasks for that computer
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1305759
ID: 51228
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51231 - Posted: 15 Jan 2015, 5:31:16 UTC - in response to Message 51228.  
Last modified: 15 Jan 2015, 5:33:18 UTC

Ah, I think I see where I went wrong.
I shut down the models gently - Suspend BOINC, wait, Exit BOINC.

But I think that any model type that's hit by a sudden power loss at a critical moment will be in the same boat; it's all of those open files. Lose power while it's checkpointing, and it's history.
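To illustrate why a power cut in the middle of a checkpoint is so damaging, here is a minimal sketch of the common "write to a temporary file, then rename" pattern. It is not taken from the CPDN applications, and nothing in this thread says how they actually checkpoint; the function name and the `.ckpt-` prefix are made up for the example.

```python
import os
import tempfile


def write_checkpoint_atomically(path: str, data: bytes) -> None:
    """Write data so that a crash leaves either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory (same filesystem),
    # which is what allows the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # swap the new file in over the old one
    except BaseException:
        # Remove the partial temporary file; the old checkpoint is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A program that instead rewrites its restart file in place can be left with a half-written file if power fails mid-write, which would fit the "lose power while it's checkpointing" scenario described above.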

edit
But this modelling is getting a bit too complex for me, with too many possibilities. :(
ID: 51231
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 51254 - Posted: 17 Jan 2015, 6:05:16 UTC

My experience was that any shutdown, whether clean, a hardware failure, or a normal reboot, used to (until late last year, anyhow) always cause these MOSES models to lose an upload and fail.
Now, however, I've done a clean shutdown and reboot for a kernel security update, and the clean shutdown didn't lose an upload.
I'm feeling much more confident about not ever interrupting these MOSES models, EU or GLOBAL.
Seems good to me. Maybe the latest edition has been fixed, but whatever.
I still try to let models run uninterrupted if I can.
ID: 51254
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2183
Credit: 64,822,615
RAC: 5,275
Message 51265 - Posted: 18 Jan 2015, 17:50:57 UTC

I think you're right, Les and Eirik. A clean shutdown should result in a successful completion with no missed upload files. I did that on one of my computers a few days ago, and both of the MOSES models running on that PC completed without error.

It must have just been one of those oddities where multiple models fail because of a power failure/hard shutdown. The thing that made me suspicious was that all three that failed were MOSES ones while the ANZ model running at the same time continued merrily along to success.

ID: 51265
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,558,383
RAC: 3,204
Message 51530 - Posted: 5 Mar 2015, 18:30:30 UTC

Hi fellows,
I have three of these hadam3p with MOSES II running under Linux:
first and second here.
They have been suspended numerous times, as I can't keep the machines on for such a long time. So far there are no errors, but I do not see any zips in their respective folders, which is starting to bother me. In the end all the CPU time may be in vain...

ID: 51530
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2183
Credit: 64,822,615
RAC: 5,275
Message 51539 - Posted: 6 Mar 2015, 13:18:42 UTC

Yes. I think, at the end, there will be an error where one (or more) of the upload files that were supposed to be sent up during the model run were missing. Let us know if that is the result.
ID: 51539
Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,692,765
RAC: 4,606
Message 51540 - Posted: 6 Mar 2015, 14:42:31 UTC
Last modified: 6 Mar 2015, 14:46:39 UTC

This issue also affects other model WUs.
Maybe database trouble?
For example:

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=17951788


Good luck cleaning up the mess.


bonsai911
ID: 51540
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,558,383
RAC: 3,204
Message 51548 - Posted: 6 Mar 2015, 21:19:22 UTC - in response to Message 51540.  

Yep, the same for my UK Met Office HadAM3P-HadRM3P Europe v7.23 WU.
EDIT: it seems it is Pending Validation.
ID: 51548
astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 51549 - Posted: 6 Mar 2015, 21:28:29 UTC - in response to Message 51548.  

Pending Validation is part of the BOINC template but is meaningless in CPDN. The climate scientists have their own method of validation, different from that used by other projects.
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 51549
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1058
Credit: 36,567,376
RAC: 15,960
Message 51552 - Posted: 6 Mar 2015, 23:44:59 UTC

See the reply I've just posted in the adjacent thread. v7.23 was the rogue application that didn't produce trickles six weeks ago (too late at night to check why they're talking about v7.26 there). But it isn't - directly - related to mis-estimated runtime for the task.
ID: 51552
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,558,383
RAC: 3,204
Message 51555 - Posted: 7 Mar 2015, 7:24:40 UTC - in response to Message 51552.  

Thanks.
Back to the original topic: the MOSES II models I have been running since late February all show a miscalculated % completed versus time elapsed/remaining, e.g. 1.789% - 84:40/516:37. I saw a formula that helps with the calculation, but even the newer hadam3pm2_pf8g from March show the same incorrect numbers.
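As a rough check on figures like these, the sketch below extrapolates linearly from the percentage done. It assumes that 84:40 and 516:37 are hours:minutes and that the model progresses at a constant rate; neither is confirmed in this thread, and this is not necessarily the formula mentioned above. The function names are made up for the example.

```python
def parse_hhmm(text: str) -> float:
    """Convert an 'HH:MM' string to hours as a float."""
    hours, minutes = text.split(":")
    return int(hours) + int(minutes) / 60


def project_remaining_hours(elapsed_hours: float, percent_done: float) -> float:
    """Remaining hours if progress continues at the rate seen so far."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    total_hours = elapsed_hours / (percent_done / 100)
    return total_hours - elapsed_hours


elapsed = parse_hhmm("84:40")           # time elapsed, from the post
boinc_remaining = parse_hhmm("516:37")  # BOINC's own estimate, from the post
projected = project_remaining_hours(elapsed, 1.789)
print(f"BOINC estimate: {boinc_remaining:.0f} h, linear projection: {projected:.0f} h")
```

Under those assumptions the linear projection comes to roughly 4,650 hours remaining, about nine times what BOINC displays.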
ID: 51555
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,558,383
RAC: 3,204
Message 51565 - Posted: 8 Mar 2015, 7:50:59 UTC - in response to Message 51539.  
Last modified: 8 Mar 2015, 7:54:26 UTC

Yes. I think, at the end, there will be an error where one (or more) of the upload files that were supposed to be sent up during the model run were missing. Let us know if that is the result.


Hi,
The last one has just errored out after trying to resume computation. I should note (maybe not related) that while CPDN was resuming, the SSD was under heavy load for some reason (for 3-4 minutes after system start, the HDD/SSD indicator was flashing constantly).

Here is the relevant excerpt from the log file:
Sun 08 Mar 2015 09:36:30 EET | | Resuming computation
Sun 08 Mar 2015 09:37:37 EET | climateprediction.net | task hadam3pm2_peu3_1991_10_009529524_1 resumed by user
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Computation for task hadam3pm2_peu3_1991_10_009529524_1 finished
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_1.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_2.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_3.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_4.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_5.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_6.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_7.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_8.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_9.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_10.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent

There is not a single zip in the folder of the model
Hope it helps
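For anyone who wants to tally these messages across a longer log, here is a small sketch that groups the "absent" lines by task. The regular expression is based only on the lines quoted above; other BOINC client versions may word the message differently, and the function name is made up for the example.

```python
import re
from collections import defaultdict

# Matches the "Output file ... for task ... absent" lines quoted above.
ABSENT_RE = re.compile(r"Output file (?P<file>\S+) for task (?P<task>\S+) absent")


def absent_outputs(log_lines):
    """Return {task name: [missing output file names]} from event-log lines."""
    missing = defaultdict(list)
    for line in log_lines:
        match = ABSENT_RE.search(line)
        if match:
            missing[match.group("task")].append(match.group("file"))
    return dict(missing)


sample = [
    "Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Computation for task hadam3pm2_peu3_1991_10_009529524_1 finished",
    "Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_1.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent",
    "Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_2.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent",
]
for task, files in absent_outputs(sample).items():
    print(f"{task}: {len(files)} output file(s) absent")
```

Feeding it the full excerpt above should report all ten zips as absent for the one task.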
ID: 51565
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4529
Credit: 18,635,873
RAC: 13,412
Message 51574 - Posted: 8 Mar 2015, 21:52:27 UTC

And talking of time to completion: I have three of these models, hadam3pm2 (hadam3p model with MOSES II land scheme). The estimated time to completion for all of them is over 900 hours, but if I take the one that has done 70 hours as an example, it is showing less than 1%, making the actual time to completion over 7,000 hours. (Assuming something hasn't been messed up in the config files somewhere.)
ID: 51574
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 25,558,383
RAC: 3,204
Message 51654 - Posted: 19 Mar 2015, 22:41:43 UTC - in response to Message 51539.  

Yes. I think, at the end, there will be an error where one (or more) of the upload files that were supposed to be sent up during the model run were missing. Let us know if that is the result.


Here we go: the second one errored out with all of its zips missing except the 6th. It counted all 10 zips, not 9 as posted by Eirik here. It is a Linux model - hadam3pm2_pf8g_1991.....
ID: 51654
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 51663 - Posted: 21 Mar 2015, 9:03:48 UTC

All global MOSES will fail and lose an upload anytime they are restarted.
Fact. Been so for months. Check my page. Project doesn't seem to care, so I don't either.
This might or might not bother you, it doesn't bother me - much.
The uploads that succeed might be worth something.

About the long-name haadammepeeprmpeth whatever they are called -- MOSES and TRIFFID -- eu

SAME-O

I know that "model failure" in BOINC does not mean "all data lost", far from it.

What worries me a bit, is other post my name






ID: 51663