Name BOINC mis-estimates runtime for hadam3pm2_k00w
Author | Message |
---|---|
Send message Joined: 16 Jan 10 Posts: 1084 Credit: 7,723,322 RAC: 2,854 |
Some of my hadam3pm2 wu's have had interruptions and I guess that's why they finish with an error of zip files being absent. Are these wu's still useful for something or could they just as well have been aborted? That's a question only the scientists can really answer, since it depends on whether a hole in the data series matters to their analysis. The current generation of applications - and this one in particular - are not production quality in my opinion: an application should not be allowed out of beta testing if it requires the volunteer to do particular things or to never stop their machine. In mitigation, the application is Linux and Linux users might be more engaged than users of other platforms. My protocol, for the flawed HADCM3N as well as these new models, is to watch the message boards to identify what problems there appear to be and to then opt in only to those model types that I can plausibly run to valid completion (ANZ, EU, PNW, AFR unconstrained; HADCM3N etc. uninterrupted), ignoring anything else. |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
To test this "don't interrupt" idea, I have 3 different types running on one of my machines. See here. This morning, after the 2nd zips for the MOSES models appeared, I suspended BOINC, exited, and then re-booted. Checking the zip creation times, one type is just under 12 hours, the other a bit over 13 hours apart. Both have now created their zip 3s, and the Africa model has all of its zips, and is waiting to upload and report. So I don't see a problem with any of these 3 types, but I'm not uploading anything until midday tomorrow, by which time I should have the zip 4s. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
Thanks for doing this test. To test this "don't interrupt" idea, I have 3 different types running on one of my machines. See here. |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
I watch the message boards. I choose the "linux only" because that's what I've got. I've less time now than way back when I did beta stuff. But I'll still take the problematic "linux only" because -- who else will - and I admit I've been slacking off about wingmen with no 32-bit -- I really should report them -- anyhow -- The big deal for me now is a huge choice of models and no complaints from anywhere about "out of work". I gotta go recruiting :) Some of my hadam3pm2 wu's have had interruptions and I guess that's why they finish with an error of zip files being absent. Are these wu's still useful for something or could they just as well have been aborted? |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
As a belated follow up to my post about testing some models: All 3 model types successfully recovered from being shut down and the computer re-booted. So all 3 types are robust. The three types: Africa regional model; MOSES II; MOSES II + Triffid. |
Send message Joined: 7 Aug 04 Posts: 2183 Credit: 64,822,615 RAC: 5,275 |
As a belated follow up to my post about testing some models: I'm not sure Les. Three of my UK Met Office HadAM3P (global only) with MOSES II landsurface scheme v7.03 models were interrupted due to a power hit. All three ran to completion, but had error results because each had one upload file that was not produced. The upload files not produced were the ones that would have been created next after the power interruption. Tasks for that computer http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1305759 |
Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Ah, I think I see where I went wrong. I shut down the models gently - Suspend BOINC, wait, Exit BOINC. But I think that any model type that's involved with a sudden power loss at a critical moment will be in the same boat; it's all of those open files. Lose power while it's checkpointing, and it's history. Edit: But this modelling is getting a bit too complex for me, with too many possibilities. :( |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
My experience was that any shutdown, whether clean, a hardware failure, or a normal reboot, used to (late last year anyhow) always cause these MOSES models to lose an upload and fail. Now, however, I've done a clean shutdown and reboot for a kernel security update, and the clean shutdown didn't lose an upload. I'm feeling much more confident about these MOSES models, EU or GLOBAL. Seems good to me. Maybe the latest edition has been fixed, but whatever. I still try to let models run uninterrupted if I can. |
Send message Joined: 7 Aug 04 Posts: 2183 Credit: 64,822,615 RAC: 5,275 |
I think you're right Les and Eirik. A clean shutdown should result in a successful completion with no missed upload files. I did that on one of my computers a few days ago, and both of the MOSES models running on that PC completed without error. It must have just been one of those oddities where multiple models fail because of a power failure/hard shutdown. The thing that made me suspicious was that all three that failed were MOSES ones while the ANZ model running at the same time continued merrily along to success. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,558,383 RAC: 3,204 |
Hi fellows, I have three of these hadam3p with MOSES II running under linux first and second here. They have been suspended numerous times as I can't keep the machines on for such a long time. So far no errors, but I do not see any zips in their respective folders, which is starting to bother me. In the end all the CPU time may go in vain... |
Send message Joined: 7 Aug 04 Posts: 2183 Credit: 64,822,615 RAC: 5,275 |
Yes. I think, at the end, there will be an error where one (or more) of the upload files that were supposed to be sent up during the model run were missing. Let us know if that is the result. |
Send message Joined: 9 Sep 04 Posts: 228 Credit: 30,692,765 RAC: 4,606 |
This issue also affects other model WUs. Maybe database trouble? For example: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=17951788 Good luck cleaning up the mess. bonsai911 |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,558,383 RAC: 3,204 |
Yep, the same for my UK Met Office HadAM3P-HadRM3P Europe v7.23 WU. EDIT: it seems it is Pending Validation |
Send message Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0 |
Pending Validation is part of the boinc template but is meaningless in CPDN. The climate scientists have their own method of validation, different from that used by other projects. "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. |
Send message Joined: 1 Jan 07 Posts: 1058 Credit: 36,567,376 RAC: 15,960 |
See the reply I've just posted in the adjacent thread. v7.23 was the rogue application that didn't produce trickles six weeks ago (too late at night to check why they're talking about v7.26 there). But it isn't - directly - related to mis-estimated runtime for the task. |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,558,383 RAC: 3,204 |
Thanks. Back to the original topic: the MOSES II models I have been running since late February all show a miscalculated % completed versus time elapsed/remaining, e.g. 1.789% - 84:40/516:37. I saw a formula that helps in the calculations, but even the newer hadam3pm2_pf8g from March shows the same incorrect numbers. |
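A rough way to sanity-check figures like the ones above is to project the total runtime from the elapsed time and the reported percentage. This is only a sketch: it assumes that 84:40 means hours:minutes and that progress is roughly linear over the whole run, which may not hold for these models. The function names are hypothetical and are not part of BOINC.

```python
def parse_hhmm(text: str) -> float:
    """Convert an 'HHH:MM' or 'HHH:MM:SS' string to hours (assumed format)."""
    parts = [int(p) for p in text.split(":")]
    while len(parts) < 3:
        parts.append(0)  # pad missing minutes/seconds with zero
    hours, minutes, seconds = parts
    return hours + minutes / 60 + seconds / 3600


def projected_total_hours(elapsed_hours: float, percent_done: float) -> float:
    """Linear projection: total = elapsed / fraction done."""
    if percent_done <= 0:
        raise ValueError("percent_done must be positive")
    return elapsed_hours * 100.0 / percent_done


# Figures quoted in the post: 1.789% done after 84:40 elapsed.
elapsed = parse_hhmm("84:40")
total = projected_total_hours(elapsed, 1.789)
remaining = total - elapsed
print(f"projected total: {total:.0f} h, remaining: {remaining:.0f} h")
```

Under these assumptions the projection comes out at roughly 4,700 hours in total, far more than the 516:37 that BOINC reports as remaining.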
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,558,383 RAC: 3,204 |
Yes. I think, at the end, there will be an error where one (or more) of the upload files that were supposed to be sent up during the model run were missing. Let us know if that is the result. Hi, the last one has just errored out after trying to resume computation. I should note (maybe not related) that while CPDN was resuming, the SSD was under heavy load for some reason (for 3-4 minutes after system start the HDD/SSD indicator was flashing constantly). Here is the relevant part of the log file:
Sun 08 Mar 2015 09:36:30 EET | | Resuming computation
Sun 08 Mar 2015 09:37:37 EET | climateprediction.net | task hadam3pm2_peu3_1991_10_009529524_1 resumed by user
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Computation for task hadam3pm2_peu3_1991_10_009529524_1 finished
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_1.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_2.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_3.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_4.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_5.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_6.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_7.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_8.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_9.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file hadam3pm2_peu3_1991_10_009529524_1_10.zip for task hadam3pm2_peu3_1991_10_009529524_1 absent
There is not a single zip in the folder of the model. Hope it helps. |
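For anyone wanting to check their own event log for the same symptom, a small script along these lines could list which output zips were reported absent for each task. It is a minimal sketch that assumes the message wording shown in the log excerpt above ("Output file ... for task ... absent"); the function name `absent_outputs` is hypothetical, and the pattern may need adjusting for other client versions.

```python
import re

# Matches the "Output file <name>.zip for task <task> absent" messages.
# \s+ is used between words so wrapped or flattened log text still matches.
ABSENT_RE = re.compile(r"Output file\s+(\S+\.zip)\s+for task\s+(\S+)\s+absent")


def absent_outputs(log_text: str) -> dict[str, list[str]]:
    """Return {task name: [absent zip names]} found in the log text."""
    missing: dict[str, list[str]] = {}
    for match in ABSENT_RE.finditer(log_text):
        zip_name, task = match.group(1), match.group(2)
        missing.setdefault(task, []).append(zip_name)
    return missing


sample = (
    "Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file "
    "hadam3pm2_peu3_1991_10_009529524_1_1.zip for task "
    "hadam3pm2_peu3_1991_10_009529524_1 absent\n"
    "Sun 08 Mar 2015 09:37:38 EET | climateprediction.net | Output file "
    "hadam3pm2_peu3_1991_10_009529524_1_2.zip for task "
    "hadam3pm2_peu3_1991_10_009529524_1 absent\n"
)

for task, zips in absent_outputs(sample).items():
    print(f"{task}: {len(zips)} absent output file(s)")
```

Running it over a saved copy of the full event log would show at a glance whether a task lost one upload or all of them.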
Send message Joined: 15 May 09 Posts: 4529 Credit: 18,635,873 RAC: 13,412 |
And talking of time to completion: I have three of these models, hadam3pm2 (hadam3p model with MOSES II land scheme). Estimated time to completion for all of them is over 900 hours, but if I take the one that has done 70 hours as an example, it is showing less than 1%, making the actual time to completion over 7,000 hours. (Assuming something hasn't been messed up in the config files somewhere.) |
Send message Joined: 18 Jul 13 Posts: 438 Credit: 25,558,383 RAC: 3,204 |
Yes. I think, at the end, there will be an error where one (or more) of the upload files that were supposed to be sent up during the model run were missing. Let us know if that is the result. Here we go: the second one errored out with every zip missing except the 6th. It counted all 10 zips, not 9 as posted by Eirik here. It is a linux model - hadam3pm2_pf8g_1991..... |
Send message Joined: 31 Aug 04 Posts: 391 Credit: 219,896,461 RAC: 649 |
All global MOSES will fail and lose an upload anytime they are restarted. Fact. Been so for months. Check my page. Project doesn't seem to care, so I don't either. This might or might not bother you, it doesn't bother me - much. The uploads that succeed might be worth something. About the long-name haadammepeeprmpeth whatever they are called -- MOSES and TRIFFID -- eu SAME-O I know that " model failure" in BOINC does not mean "all data lost" far from it. What worries me a bit, is other post my name |
©2024 cpdn.org