Crash recovery: my 1wk-old backup or more recent .zip uploads?

Author	Message
JimMcCarthy_StellarSolns Joined: 3 Sep 08 Posts: 23 Credit: 41,989,607 RAC: 2,734
Message 35187 - Posted: 6 Oct 2008, 23:07:40 UTC Last modified: 6 Oct 2008, 23:09:50 UTC

Computer ID 903841 is one I have at the University where I teach once per week. Apparently it experienced a compute error early this morning. I've captured the BOINC work directory in the state I found it this afternoon (post error). My last WinZIP backup was 1 week ago, so I've restored the work directory using that archive and am resuming work from that point (1 week ago).

But from the message log, I see that some .zip file(s) of intermediate results were uploaded to CPDN more recently -- a few days ago. Is it possible to resume work from the point in time when those intermediate results were uploaded (a few days ago), or do I need to restart from my last WinZIP backup (1 week ago)?

Thanks,
-- Jim

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Message 35189 - Posted: 6 Oct 2008, 23:41:34 UTC

Climate models are 'a work in progress' on people's computers. The only place that contains ALL of the data needed to restart a model is on those computers, in copies made before the failure of the complete BOINC folder (for BOINC version 5) or the complete BOINC data folder (for BOINC version 6).
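A minimal sketch of the kind of whole-folder backup described above, in place of a manual WinZIP run. The function name and the example paths are hypothetical, and the sketch assumes BOINC has been shut down first so that the files are not changing while they are copied. The backup location must be outside the data folder, otherwise the archive would try to include itself.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_data(data_dir: Path, backup_root: Path) -> Path:
    """Zip the complete BOINC data folder into a timestamped archive.

    Assumption: BOINC is not running while this executes, and
    backup_root is NOT inside data_dir.
    """
    if not data_dir.is_dir():
        raise FileNotFoundError(f"BOINC data folder not found: {data_dir}")
    backup_root.mkdir(parents=True, exist_ok=True)

    # One archive per run, e.g. boinc-backup-20081013-173000.zip
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive_base = backup_root / f"boinc-backup-{stamp}"

    # make_archive appends ".zip" and returns the path of the archive it wrote.
    archive = shutil.make_archive(str(archive_base), "zip", root_dir=str(data_dir))
    return Path(archive)


# Example call (hypothetical paths, adjust to the actual installation):
# backup_boinc_data(Path(r"C:\BOINC_data"), Path(r"D:\boinc_backups"))
```

Restoring is the reverse: with BOINC stopped, replace the contents of the data folder with the contents of the chosen archive, then restart BOINC.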

JimMcCarthy_StellarSolns Joined: 3 Sep 08 Posts: 23 Credit: 41,989,607 RAC: 2,734
Message 35190 - Posted: 7 Oct 2008, 0:45:29 UTC - in response to Message 35189.

Hi Les -- Thanks for confirming what I already thought was true. Hopefully the trickle uploads of intermediate results a few days from now, which will duplicate the upload of a few days ago (prior to the compute error), will be easy to identify as such.

Still TBD is whether the restarted model will similarly fail roughly a week from now, or whether some other event unrelated to the model computation (a glitch in hardware or in Windows, or some network-triggered event) may have caused the failure this time. The lesson may be that this machine isn't up to the task of running unattended (and without backups) for a week at a time, week after week, in which case I'd either have to swing by the University mid-week to make backups every few days, or disconnect this machine from the CPDN project. We'll see....

-- Jim

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0
Message 35191 - Posted: 7 Oct 2008, 2:02:01 UTC

The restored backup will send duplicate trickles, but you'll see no sign of them on the model's web page until the model reaches the point where it's doing new work.

That unattended computer has been crashing its tasks, which isn't the case with your other machines. Some of the crashed tasks have rather unusual error messages. Could someone who knows more than I do about crash diagnostics please have a look, in case they indicate something Jim could do to put things right, if that wouldn't be too time-consuming?

JimMcCarthy_StellarSolns Joined: 3 Sep 08 Posts: 23 Credit: 41,989,607 RAC: 2,734
Message 35383 - Posted: 28 Oct 2008, 1:33:49 UTC Last modified: 28 Oct 2008, 1:41:29 UTC

I returned after being away from the university for about 2 weeks to find that this computer had again stopped work on its task, although in this case I'm not sure a "crash" actually occurred. Under the BOINC manager "Project" tab, I had previously disabled "Get new tasks" so that any "crash" while the PC is unattended would not result in downloading new tasks (and having those immediately crash, and so on, as happened with this machine once previously).

Anyway, this time the BOINC manager (and client?) were still alive and working, but work on the task had apparently stopped with the following messages logged by the BOINC manager:

10/17/2008 12:07:15 AM|climateprediction.net|Sending scheduler request: To send trickle-up message. Requesting 0 seconds of work, reporting 0 completed tasks
10/17/2008 12:07:20 AM|climateprediction.net|Scheduler request succeeded: got 0 new tasks
10/17/2008 9:57:42 PM|climateprediction.net|Sending scheduler request: To send trickle-up message. Requesting 0 seconds of work, reporting 0 completed tasks
10/17/2008 9:57:47 PM|climateprediction.net|Scheduler request succeeded: got 0 new tasks
10/17/2008 9:59:23 PM|climateprediction.net|Started upload of hadcm3istd_002w_1920_160_16000770_2_2.zip
10/17/2008 10:02:40 PM|climateprediction.net|Finished upload of hadcm3istd_002w_1920_160_16000770_2_2.zip
10/18/2008 6:15:37 PM|climateprediction.net|Computation for task hadcm3istd_002w_1920_160_16000770_2 finished
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_3.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_4.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_5.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_6.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_7.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_8.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_9.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_10.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_11.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_12.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_13.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_14.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_15.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_16.zip for task hadcm3istd_002w_1920_160_16000770_2 absent
10/19/2008 6:15:38 PM|climateprediction.net|Sending scheduler request: To report completed tasks. Requesting 0 seconds of work, reporting 1 completed tasks
10/19/2008 6:15:43 PM|climateprediction.net|Scheduler request succeeded: got 0 new tasks
10/21/2008 3:46:03 PM||Running CPU benchmarks
10/21/2008 3:46:03 PM||Suspending computation - running CPU benchmarks
10/21/2008 3:46:35 PM||Benchmark results:
10/21/2008 3:46:35 PM|| Number of CPUs: 1
10/21/2008 3:46:35 PM|| 1437 floating point MIPS (Whetstone) per CPU
10/21/2008 3:46:35 PM|| 2284 integer MIPS (Dhrystone) per CPU
10/21/2008 3:46:36 PM||Resuming computation
10/26/2008 3:46:35 PM||Running CPU benchmarks
10/26/2008 3:46:35 PM||Suspending computation - running CPU benchmarks
10/26/2008 3:47:07 PM||Benchmark results:
10/26/2008 3:47:07 PM|| Number of CPUs: 1
10/26/2008 3:47:07 PM|| 1439 floating point MIPS (Whetstone) per CPU
10/26/2008 3:47:07 PM|| 2385 integer MIPS (Dhrystone) per CPU
10/26/2008 3:47:08 PM||Resuming computation

Upon my arrival, I found the PC still running, the BOINC manager window still open, and the BOINC client still running, but no active tasks -- indeed, no tasks listed, active or otherwise.

My last backup of the BOINC data directory was on 13-Oct-2008. I first backed up the system as I found it today (27-Oct), in case I learn it was behaving normally on 10/18 (which I doubt) and might(?) resume from there after download of "new tasks" (maybe just the next phase of the current task?) is re-enabled. I've then restored the 13-Oct backup and restarted the task from 4 or 5 days prior to the events in red above.

The "stderr" {+} button on the web page for the failed(?) task on this computer did not show any error information.

Insights and/or comments, anyone? Checking in on, and backing up, this system (along with a cold reboot for good measure?) more frequently certainly seems like a smart approach. But besides that, what else appears to be wrong here?

Thanks,
-- Jim
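When a log like the one in this post runs to dozens of lines, a short script can pull out the events that matter. The sketch below assumes only what the pasted messages show: each line is a timestamp, a project name (possibly empty) and a message, separated by `|`. The function names are hypothetical, and the parser also accepts copies of the log in which the separator appears escaped as `\|`.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class LogEntry(NamedTuple):
    timestamp: str
    project: str
    message: str


UPLOADED = re.compile(r"^Finished upload of (\S+)$")
ABSENT = re.compile(r"^Output file (\S+) for task \S+ absent$")


def parse_line(line: str) -> LogEntry:
    """Split one 'timestamp|project|message' line into its three fields."""
    # Some pasted copies escape the separator as "\|"; normalise it first.
    timestamp, project, message = line.replace("\\|", "|").split("|", 2)
    return LogEntry(timestamp.strip(), project.strip(), message.strip())


def summarise(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (files whose upload finished, files reported absent)."""
    uploaded: List[str] = []
    absent: List[str] = []
    for line in lines:
        if line.replace("\\|", "|").count("|") < 2:
            continue  # not a log line in the expected shape
        entry = parse_line(line)
        if (m := UPLOADED.match(entry.message)):
            uploaded.append(m.group(1))
        elif (m := ABSENT.match(entry.message)):
            absent.append(m.group(1))
    return uploaded, absent


# Three lines taken from the log in this post.
sample = [
    "10/17/2008 10:02:40 PM|climateprediction.net|Finished upload of hadcm3istd_002w_1920_160_16000770_2_2.zip",
    "10/18/2008 6:15:37 PM|climateprediction.net|Output file hadcm3istd_002w_1920_160_16000770_2_3.zip for task hadcm3istd_002w_1920_160_16000770_2 absent",
    "10/21/2008 3:46:03 PM||Running CPU benchmarks",
]
uploaded, absent = summarise(sample)
print("uploaded:", uploaded)
print("absent:  ", absent)
```

The last zip that finished uploading before the first "absent" message gives a rough idea of how far the model got before it stopped.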

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0
Message 35384 - Posted: 28 Oct 2008, 2:44:54 UTC

As they used to say on the Goon Show: "Pull up a bollard, and let me tell you a tale".

1) All messages sent between the server and the client computer, in either direction, are logged with a message number. When a model is restored from a backup, this also restores an older number in the sequence than the latest one received by the server, so the server issues a new computer ID, because, left to themselves, neither of the computers will repeat a message number.
At least, this is what used to happen. With the new BOINC version, the message (on the client computer) has changed. It now says: Generated new host CPID, and on the server, that model now shows: Client detached against: Outcome.

2) And, as always, the first "outcome" message to reach the server becomes permanent. Which is why you didn't get any error messages against the "stderr" [+] button when the model failed on the 18th.

3) When a new data set is sent to a client computer, part of this is a list of what files to expect during the creation of the model, and at its completion. This list will, for climate models, include "trickles" and the server URL to which to return them, and the zip files and the server URL for them. All of this info is stored in the file client_state.xml.
If a model fails, its percentage completed goes to 100%, and BOINC looks at its 'to-do' list (client_state.xml) and starts doing as instructed. Part of this is to return zips that haven't already been 'ticked off' its list. But these zips haven't been created yet, as the model failed before then. So BOINC can't find the zip files, and returns an error message for each one, to say so. Which is why you have the message lines that you have in red. All perfectly normal.

And why did the model fail between the 17th and the 18th? No idea, as there's no info to give a clue.
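The 'to-do' list behaviour in point 3 can be pictured with a toy model. This is not BOINC's code and it does not read client_state.xml; the function name is hypothetical, and the sketch assumes the task expects zips numbered 1 to 16 and that zip 1 was uploaded before the log excerpt begins (the log itself only shows zip 2 finishing its upload).

```python
from typing import Iterable, List


def missing_outputs(
    expected: Iterable[str],
    uploaded: Iterable[str],
    on_disk: Iterable[str],
) -> List[str]:
    """Expected files that were never 'ticked off' and do not exist on disk.

    Toy model only: once a task ends, every file in this list would get
    an 'Output file ... absent' message.
    """
    ticked_off = set(uploaded)
    present = set(on_disk)
    return [
        name
        for name in expected
        if name not in ticked_off and name not in present
    ]


task = "hadcm3istd_002w_1920_160_16000770_2"

# Assumption: the task's list names zips 1..16.
expected = [f"{task}_{n}.zip" for n in range(1, 17)]

# Assumption: zip 1 went up earlier; zip 2 is shown uploading in the log.
uploaded = [f"{task}_1.zip", f"{task}_2.zip"]

# The model stopped before creating any further zips, so none are on disk.
absent = missing_outputs(expected, uploaded, on_disk=[])

for name in absent:
    print(f"Output file {name} for task {task} absent")
```

Under those assumptions the sketch reports zips 3 to 16 as absent, which is the same set of files named in the red log lines.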