Long shutdowns and keeping HADSM model alive

Author	Message
glaesum Joined: 24 Feb 06 Posts: 47 Credit: 782,082 RAC: 0	Message 38148 - Posted: 20 Oct 2009, 18:32:58 UTC

The question is a little subtler than I could put in the title: I've just finished a slab model (shortly to report), the first for a while after that run of HADAM models, and I'll be having to shut down in a few weeks for well over a month.

Now I noticed with the recent slab model that only a few replicants were issued to begin with, then more were sent out gradually, and even now 2 WUs are unsent. This seems a good policy for reducing a bit of the duplication that comes with the high IR needed in CPDN's long tasks. Unusually, I was the first to finish this last model, as I'm near the end of the life cycle of this PC, and it's not so fast, relatively speaking.

Anyway, I'll be shutting down for a period when the next slab model would be only half or at best two-thirds complete. I have a vague recollection of seeing a post saying that after 6 weeks of 'not calling home' the models are deemed inactive; I suppose without progress more replicants will be issued to other computers, so my question is whether this is unnecessarily inefficient and whether I would do better to go NNW now and concentrate on other projects meanwhile. In the new year I'll let BOINC run CPDN models on both cores and catch up a bit, giving CPDN higher priority for a couple of months (and, with luck, a new multi-core PC soon after).

It's a pity there are no short HADAM models around at the moment, because I put an extra 1 GB of second-hand memory from a tech friend in the old tub during the summer and it ran them very well until the well ran dry.

/pg

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 38149 - Posted: 20 Oct 2009, 19:27:31 UTC

A few years ago, the original models (and the server software) were set up to flag models not heard from in about 6 weeks for possible re-issue. About 2 years ago (about the time when lots of model types became available), this changed. Now a batch of models is issued from one data set at the start, and that's it. Failed/unheard-from models just fail, but hopefully one or more of the others in that set will continue to be processed.

One problem you may have is that slab models don't like being interrupted at the point where they're just finishing one phase and haven't yet returned a trickle at the start of the next phase. Otherwise, if you want to stop for a while halfway through a model, that's OK.

Iain Inglis Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317	Message 38150 - Posted: 20 Oct 2009, 19:43:18 UTC Last modified: 20 Oct 2009, 19:43:50 UTC

I may be out of date, but my impression is that the CPDN scheduler doesn't attempt anything clever. The reason that the WU models are being issued slowly is that there are so many WUs in the new batch. The models for each WU used to be issued sequentially so that one WU 'filled up' quickly, but that changed (or rather I noticed it changing) with the last batch of HADSM3MH, which started to be issued "one from one WU, then one from the next WU (not necessarily adjacent)". This has the effect of reducing duplicates in the short term, until the scheduler has visited all WUs and they all fill up. So it makes no difference to duplication in the long run.

However, there is a severe downside to the new schedule. A failed result sterilises the WU (i.e. no more models will be issued from it). Because the WU parameters are not set correctly, the probability of a WU being sterilised by a model failure is now much higher than before: the time between results being issued from a single WU is longer, so the chance of one of the issued models failing in that time is higher.

I queried the new schedule when it first appeared, but didn't get an explanation. It's probably a BOINC upgrade or some such thing.

In your case, I wouldn't worry about the six-week re-issuing; I've never seen any evidence that it really happens. And if you abort the result you'll sterilise the work unit.

[Edit: Les types quicker than I do.]
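As a rough illustration of the two issue orders described in the post above, here is a toy sketch. It is not the CPDN or BOINC scheduler: the function names are hypothetical, and a plain round-robin stands in for the "not necessarily adjacent" order. It only shows why interleaving stretches the time over which a single WU's models are handed out.

```python
from typing import List, Tuple

# (workunit index, model index within that workunit); illustrative only
Issue = Tuple[int, int]


def sequential_order(num_wus: int, models_per_wu: int) -> List[Issue]:
    """Old behaviour as described: fill up one WU before moving to the next."""
    return [(wu, m) for wu in range(num_wus) for m in range(models_per_wu)]


def interleaved_order(num_wus: int, models_per_wu: int) -> List[Issue]:
    """Newer behaviour, simplified to round-robin: one model per WU per pass."""
    return [(wu, m) for m in range(models_per_wu) for wu in range(num_wus)]


def issue_gap(order: List[Issue], wu: int) -> int:
    """Issue slots between the first and last model sent out for one WU."""
    positions = [i for i, (w, _) in enumerate(order) if w == wu]
    return positions[-1] - positions[0]


# Example: 3 WUs with 2 models each.
seq = sequential_order(3, 2)
rr = interleaved_order(3, 2)
print(issue_gap(seq, 0), issue_gap(rr, 0))
```

Under these assumptions the gap for any one WU grows with the number of WUs in the batch, which is the window in which an early failure could stop the remaining models from being issued.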

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 38152 - Posted: 20 Oct 2009, 21:00:53 UTC

In any case, whether or not new tasks from the same workunit are reissued doesn't affect the number of different results produced for the scientists; every task from the same WU, even on the same platform, can be expected to produce slightly different results. So even replications from the same WU are all used. If more results are required from a particular WU, the programmers can reissue the same set of parameter values in a new WU.

Even if a task sends no trickles for more than six weeks, the server still accepts extra results afterwards. So it isn't a good idea to abandon/abort tasks that have already spent time crunching.

glaesum Joined: 24 Feb 06 Posts: 47 Credit: 782,082 RAC: 0	Message 38162 - Posted: 21 Oct 2009, 21:02:13 UTC

Thanks Iain, Les and Mo - each of you can be relied on to add some extra piece of information and advice. So, in summary, I take it that all means it's OK to go ahead! ;-)

I just had a blast at Einstein@H to get myself into the top quartile of crunchers - job done, so back to CPDN again!

glaesum Joined: 24 Feb 06 Posts: 47 Credit: 782,082 RAC: 0	Message 38326 - Posted: 19 Nov 2009, 20:02:54 UTC

Your advice worked out pretty well. I've got to 92% before my long shutdown, so two phases have reported. If I hadn't gone away for a weekend I might have actually finished the WU. Anyway, thanks for keeping me crunching confidently. Hope everyone has a happy break at the end of the year.

/pg

tullio Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0	Message 38339 - Posted: 21 Nov 2009, 5:07:11 UTC

I had a system crash just before making a monthly backup. I installed a new version of my Linux OS and I downloaded the BOINC directory, but about a month's work was lost. I have two CPDN models running, a HADCM3 and a HADAM3P. They have restarted OK and are sending up trickles, but while the HADAM3P has accepted them with a date after the crash date, the last date for the HADCM3 is still the last one before the crash. But both models show an increasing percentage of work done, so I leave them running.

Tullio

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,118,017 RAC: 2,445	Message 38344 - Posted: 21 Nov 2009, 18:44:55 UTC - in response to Message 38339.

Hi, Tullio. You might want to consider weekly or even bi-weekly backups. The crashes always seem to happen just before you were going to make the next one.

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 38345 - Posted: 21 Nov 2009, 18:50:54 UTC

A manual backup of just the BOINC Data directory after completely exiting from BOINC is quick and easy. It only takes a few minutes.
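For anyone who prefers to script that manual copy, a minimal sketch follows. It assumes BOINC has already been shut down completely (the script does not check this), and the function name and both paths are hypothetical placeholders; the real data directory location depends on the OS and how BOINC was installed.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir: Path, backup_root: Path) -> Path:
    """Copy data_dir into a new timestamped folder under backup_root.

    Assumption: the BOINC client is fully exited before this is called,
    so no model files change while they are being copied.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"BOINC data directory not found: {data_dir}")
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-data-{stamp}"
    # copytree creates the target folder and raises if it already exists,
    # so an earlier backup is never silently overwritten.
    shutil.copytree(data_dir, target)
    return target


# Example call (placeholder paths, adjust to your own system):
# backup_boinc_data(Path("/path/to/boinc/data"), Path("/media/usb/boinc-backups"))
```

Keeping each copy in its own dated folder means an older backup is still there if the newest one turns out to be incomplete.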

tullio Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0	Message 38346 - Posted: 21 Nov 2009, 19:11:21 UTC - in response to Message 38345.

I forgot to say that I am using Linux. I make a copy of the BOINC directory every week or so on a flash memory in a USB port. But since I reloaded my whole home directory after an OS reinstallation, and that file was about a month old, I went back about one month. I should also have downloaded the BOINC directory, about one week old, but I forgot to do this. All my projects have shorter deadlines than CPDN, and they restarted easily, losing very little, save for the 2 CPDN models. Now HADAM3P is showing new trickles, but HADCM3 is still showing an earlier date for its trickles. But it is running OK. A lesson learned.

Tullio

tullio Joined: 6 Aug 04 Posts: 264 Credit: 965,476 RAC: 0	Message 38356 - Posted: 22 Nov 2009, 12:25:38 UTC

HADCM3 has accepted new trickles. All OK, it seems.

Tullio