HadCM3n release

Author	Message
Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,888,554 RAC: 1,481,373	Message 53379 - Posted: 3 Feb 2016, 5:50:02 UTC I've got only one of these still running, re-issued for the 3rd time in late January and looking to finish in 2-3 days (maybe with the infamous "INVALID THETA", as happened to one of the wingmen very near the end of the model). The next-to-last of this batch on my machines finished a few hours ago. It was issued in December, but unluckily at about 90% the host died of PSU failure. Weeks later I was able to copy the BOINC folder from the surviving disk into a VirtualBox and it completed OK.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,497,933 RAC: 6,477	Message 53382 - Posted: 3 Feb 2016, 8:35:40 UTC - in response to Message 53379. I too have just one of these grinding along: 63% complete, 463 hours elapsed, 658 hours remaining! My second one failed when I had to reboot the machine for something, I forget exactly what now.
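As a rough cross-check on the "hours remaining" figure quoted in the post above, here is a minimal sketch that extrapolates linearly from elapsed time and percent complete. It assumes the model progresses at a constant rate, which may not hold in practice; the function name is hypothetical and is not part of BOINC.

```python
def linear_remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Estimate remaining run time, assuming progress is linear in elapsed time."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    # Projected total run time if the rate seen so far continues.
    total_hours = elapsed_hours / fraction_done
    return total_hours - elapsed_hours


# Figures from the post above: 63% complete after 463 hours elapsed.
remaining = linear_remaining_hours(463.0, 0.63)
print(f"{remaining:.0f} hours remaining by linear extrapolation")
```

On those figures the extrapolation gives roughly 272 hours, well under the 658 hours that BOINC reported.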

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 53395 - Posted: 4 Feb 2016, 2:01:57 UTC - in response to Message 53379. Last modified: 4 Feb 2016, 3:05:30 UTC Hi, Eirik, My i5-3350 (Desktop) has a third copy of http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10218540 running well beyond its two predecessors, which died at different percentages and for different reasons. So far, 24 Trickles and 60.7% @ T.S. 630,950 (0.74 s/TS [it seems to thrive in Win10]). I suspect it will fail at the end, failing to upload the #12 .zip file. However, it should send what is (based on earlier experience with this release) a full #13 .zip file (Restart Dump) -- at least I hope it will be complete. [EDIT: Corrected monthly Trickle count; I forgot that only 20 are shown unless one 'clicks' for the full list ...] "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest.
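The timestep figures in the post above allow a similar back-of-envelope estimate. This sketch assumes that percent complete is proportional to timesteps done and that the 0.74 s/TS rate stays constant for the rest of the run; both are assumptions, and the function name is hypothetical.

```python
def remaining_hours_from_timesteps(steps_done: int,
                                   fraction_done: float,
                                   seconds_per_step: float) -> float:
    """Estimate remaining hours from timesteps completed and seconds per timestep."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    # Total timesteps implied by the progress so far.
    total_steps = steps_done / fraction_done
    remaining_steps = total_steps - steps_done
    return remaining_steps * seconds_per_step / 3600.0


# Figures from the post above: 60.7% at timestep 630,950, 0.74 s per timestep.
hours = remaining_hours_from_timesteps(630_950, 0.607, 0.74)
print(f"about {hours:.0f} hours of computing left")
```

On those figures this comes to roughly 84 hours, or about three and a half days of continuous running.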

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 53531 - Posted: 27 Feb 2016, 1:32:59 UTC Last modified: 27 Feb 2016, 2:41:09 UTC That task finished OK, as did another after it. The bad part is that CPU time and Time Remaining don't work. The good news is that graphics work, so progress can be followed easily. Unfortunately, BOINC can't decrement time remaining to allow proper management for downloading new work according to your settings. Three new HadCM3n batches were released today: 350/351/352. Some details: 352: HadCM3N perturbed physics low sensitivity resubmissions for control experiment; 351: HadCM3N perturbed physics low sensitivity resubmissions for step experiment; 350: HadCM3N perturbed physics low sensitivity resubmissions for ramp experiment. A few are running on my machines. One from #352 crashed in less than six minutes with an "INVALID THETA" error. That isn't surprising for the array of tests covered by this set. I won't post to scientists unless more meet the same early fate. By the way, CPU time and Time Remaining work on some of these tasks. [Edit] So far, three of four tasks from Batch 352 crashed, in seconds, with "INVALID THETA" -- the other nears two hours as this is written. I'll advise staff.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,497,933 RAC: 6,477	Message 53536 - Posted: 27 Feb 2016, 17:28:47 UTC astroWX wrote: "So far, three of four tasks from Batch 352 crashed, in seconds, with 'INVALID THETA' -- the other nears two hours as this is written. I'll advise staff." Sarah has come back on this with: This is a batch that is pushing the limits of parameter space so we would expect higher than normal failures with this so hopefully this is nothing unexpected. Looking at the statistics now it doesn't look like all are failing but there are still a number in the queue! Though looking at the numbers there won't be any left in the queue shortly.

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,888,554 RAC: 1,481,373	Message 53537 - Posted: 27 Feb 2016, 21:16:58 UTC Last modified: 27 Feb 2016, 21:24:31 UTC My one Linux machine that hasn't switched to Wine just got 4 WUs of the 351 and 352 batches (those batches are already all issued). These have now been running 1-3 hours with no errors yet. And yes, the CPU time and the "time remaining" numbers that BOINC estimates are - well, within an order of magnitude :) -- "Remaining time (estimated) 700 hours" or so -- not so. More like a week or two. C'est la software.

Jim1348 Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 53547 - Posted: 1 Mar 2016, 23:56:56 UTC I haven't seen the HadCM3n for a while, so with my larger cache and working memory I was eager to give them a try. Unfortunately, two days into the run, I had to shut down the machine to change the UPS. It was just a normal software shutdown of Win7 64-bit, with the contents of the large (20 GB) write cache being written to the SSD. However, upon startup, all five of the tasks had errored out. http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10333673 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10332728 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10332668 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10335105 http://climateapps2.oerc.ox.ac.uk/cpdnboinc/workunit.php?wuid=10334857 That seems even more fragile than I remembered them to be. But they have errored out on other machines as well. Maybe the shutdown of my machine just accelerated the inevitable?

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 53548 - Posted: 2 Mar 2016, 4:07:33 UTC - in response to Message 53547. Jim, they're the ones that Astro and Dave mentioned just below, that are "pushing the limits of stability", so that may indeed have exacerbated the failures.

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,059,184 RAC: 596	Message 53549 - Posted: 2 Mar 2016, 5:04:27 UTC I haven't had the same problem with the hadcm3n's. Earlier today I rebooted my system (after first suspending the running models and exiting BOINC Manager) with no problem. Everything started right back up on reboot and unsuspend. So the problem is not general.

Lockleys Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0	Message 53550 - Posted: 2 Mar 2016, 7:25:01 UTC - in response to Message 53549. Last modified: 2 Mar 2016, 7:25:24 UTC Like JIM, I haven't had the same problems with hadcm3n. I am closing my system down twice a day and taking backups. (Got builders in and I am anticipating them crashing the power supply!) Perhaps it's in the method? I suspend all Tasks, wait 10 seconds, suspend the cpdn Project, wait 10 seconds, resume all tasks (but keep the project suspended), then exit BOINC Manager. Then, on opening BOINC Manager again, it all just carries on smoothly. Or perhaps I've just hit lucky with my allocated hadcm3n (351 type).
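For anyone who would rather script the drill described in the post above than click through BOINC Manager, here is a minimal sketch using the `boinccmd` command-line tool. It is an approximation rather than the poster's exact method: the post uses the Manager GUI, and `boinccmd --quit` tells the client itself to exit, which may differ from simply closing the Manager. The project URL and task names below are hypothetical placeholders; substitute the values that `boinccmd --get_project_status` and `boinccmd --get_tasks` report on your own machine.

```python
import subprocess
import time

# Hypothetical placeholders, not real CPDN values.
PROJECT_URL = "http://example.invalid/"
TASK_NAMES = ["hadcm3n_task_a", "hadcm3n_task_b"]


def shutdown_steps(project_url, task_names, pause=10):
    """Return (argv, seconds_to_wait_afterwards) pairs for the drill."""
    steps = []
    # 1. Suspend every task, then wait once after the last one.
    for name in task_names:
        steps.append((["boinccmd", "--task", project_url, name, "suspend"], 0))
    if steps:
        steps[-1] = (steps[-1][0], pause)
    # 2. Suspend the project, then wait again.
    steps.append((["boinccmd", "--project", project_url, "suspend"], pause))
    # 3. Resume the tasks; the project itself stays suspended.
    for name in task_names:
        steps.append((["boinccmd", "--task", project_url, name, "resume"], 0))
    # 4. Ask the BOINC client to exit.
    steps.append((["boinccmd", "--quit"], 0))
    return steps


def run(steps, dry_run=True):
    """Print the commands (dry run) or execute them with the pauses."""
    for argv, wait in steps:
        if dry_run:
            print(" ".join(argv), f"(then wait {wait}s)")
        else:
            subprocess.run(argv, check=True)
            time.sleep(wait)


# Dry run by default: only prints what would be executed.
run(shutdown_steps(PROJECT_URL, TASK_NAMES))
```

Calling `run(..., dry_run=False)` would execute the commands for real. Whether this sequence actually protects a model any better than a plain shutdown is not established in this thread.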

Jim1348 Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 53551 - Posted: 2 Mar 2016, 8:47:47 UTC Thanks for the input. A reboot should be possible, but somehow between the shutting down of BOINC and the writing of the cache to the SSD a few bits are being lost. I will try Lockleys' technique of shutting down BOINC first, to see if I can make it work more reliably. I was running another work unit at the time, a hadam3p_eu, which started just before the reboot and which did not error out, so it seems that the HadCM3n are the most vulnerable (no surprise there, but interesting). http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=19321778 That accounts for all six cores of my i7-4790, plus the two that are supporting GTX 960s on POEM. It is all I run on that machine; not even an AV, and it has plenty of memory, stable power, etc. However, the SSD, a Samsung 840 Pro, is not above reproach. Once in a while I have seen a few bad blocks, which may call for some remedial action or replacement, as the case may be. I don't want to give up the HadCM3n work, but I need to make it more reliable to justify the long times spent on it. But you have convinced me that it should be possible.

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,888,554 RAC: 1,481,373	Message 53552 - Posted: 2 Mar 2016, 8:55:56 UTC Like y'all say, no problems here with this batch being more vulnerable to shutdowns and reboots. I do the drill when I need to reboot for security upgrades and such - suspend all tasks, suspend network, for good measure suspend the project - sync sync sync - shutdown - reboot - all is well. As for hardware problems -- who knows? BUT - there was a recent stretch with one of my hosts where, dunno why, even normally safe shutdowns crashed some models on restart. Tried Windows update, restored saved files from backup, it kept crashing new WUs -- now, magically, problem gone. No clue. BUT also, the uploads are going so slow now, maybe the server just lost track? :)

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,497,933 RAC: 6,477	Message 53553 - Posted: 2 Mar 2016, 9:03:51 UTC My experience with these tasks (not this batch; I don't have any of them) is that on my Linux boxes, something like 3/4 fall over following a restart after a kernel update. If the shutdown and restart is for any other reason except power failure, over 9/10 survive. Clearly I need to get a UPS!

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,888,554 RAC: 1,481,373	Message 53554 - Posted: 2 Mar 2016, 9:27:17 UTC - in response to Message 53553. Dave Jackson wrote: "My experience with these tasks (not this batch - don't have any of them.) is that on my Linux boxes, something like 3/4 fall over following restart after kernel update. If shut down and restart is for any other reason except power failure over 9/10 survive. Clearly I need to get a UPS!" Strange -- I've had few problems with kernel updates and the necessary reboots, as long as I've got a clean suspend of tasks before reboot. That goes for Linux, Linux with Wine, and VirtualBox Windows 10. Maybe 1 in 10 tasks fails after a reboot following a clean shutdown, whatever the reason for the shutdown. But - I've had a few clusters of "every model craps out" after the cleanest shutdown. And -- I've had all models survive and complete OK after power failures. I've no clue. A UPS can help; my 2 out of 7 UPS-protected hosts do better, but not perfectly. Me, no clue why some models are more vulnerable -- or maybe accumulated errors show up after restart?

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,497,933 RAC: 6,477	Message 53555 - Posted: 2 Mar 2016, 9:40:57 UTC Eirik Redd wrote: "Strange -- I've had few problems with kernel updates and the necessary reboots, as long as I've got a clean suspend tasks before reboot." It may just be that my sample size isn't big enough for the results to be significant.

Jim1348 Joined: 15 Jan 06 Posts: 637 Credit: 26,751,529 RAC: 653	Message 53556 - Posted: 2 Mar 2016, 10:17:30 UTC I have just noticed the obvious: each of my five errors above was "INVALID THETA DETECTED" (I am the 1349694 machine, by the way). Whether that absolves my machine is not clear, and why it would take a reboot to bring that out is also not clear. Someone who knows more about the models than I do will have to puzzle that one out.

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,888,554 RAC: 1,481,373	Message 53557 - Posted: 2 Mar 2016, 11:24:32 UTC - in response to Message 53555. Eirik Redd wrote: "Strange -- I've had few problems with kernel updates and the necessary reboots, as long as I've got a clean suspend tasks before reboot." Dave Jackson wrote: "It may just be that my sample size isn't big enough for the results to be significant." :) You got the idea!

KWSN - Sir Frank of the Wood Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0	Message 53573 - Posted: 5 Mar 2016, 11:11:25 UTC i have 3 of these WUs (hadcm3n) running on my machine 1317408...clock running but NO cpu usage visible...progress still going up at 7% to 13%... don't remember seeing this behavior before...are these units doing anything ??? or just shoveling sand uphill ??? frank

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,059,184 RAC: 596	Message 53574 - Posted: 5 Mar 2016, 15:25:01 UTC - in response to Message 53573. Are they sending trickles and zip files?

KWSN - Sir Frank of the Wood Joined: 3 Nov 10 Posts: 39 Credit: 2,494,427 RAC: 0	Message 53575 - Posted: 5 Mar 2016, 15:32:39 UTC - in response to Message 53573. hello jim haven't seen any...should the trickles be at 8% or 12.5% or some other interval ??? frank