Error while computing???

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4341 Credit: 16,497,933 RAC: 6,477	Message 59243 - Posted: 26 Dec 2018, 13:41:27 UTC - in response to Message 59242. On my machine with this same work unit, I already have over 56 hours of CPU time, have uploaded a trickle, and still running with 283 hours predicted to go. In contrast, I have one 65% complete in just over 5 days with 2 days estimated to complete. However it is a retread so may not make it. ID: 59243 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1055 Credit: 16,516,801 RAC: 955	Message 59244 - Posted: 26 Dec 2018, 14:58:38 UTC - in response to Message 59243. On my machine with this same work unit, I already have over 56 hours of CPU time, have uploaded a trickle, and still running with 283 hours predicted to go. In contrast, I have one 65% complete in just over 5 days with 2 days estimated to complete. However it is a retread so may not make it. Well, mine is a double retread: two attempts by others have failed before it was issued to me. Currently about 24% complete, 58 hours CPU time done, 282 hours to go. It sure is not failing in 30 to 60 seconds of CPU time like the two failures before me. ID: 59244 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1055 Credit: 16,516,801 RAC: 955	Message 59245 - Posted: 26 Dec 2018, 15:10:59 UTC My last failure was over a year ago. In my opinion, barring a hardware error, the only reason one gets a segmentation violation in a Linux system is if there is an error in the program. And if my machine were getting segmentation violations in this one program, it would get them in other programs too. I have some programs that run 24/7 starting at boot up. Surely they would have problems too, and they don't.
Name: wah2_sas50_l09y_198612_13_617_011131907_1
Workunit: 11131907
Created: 28 Jul 2017, 16:02:10 UTC
Sent: 28 Jul 2017, 16:02:19 UTC
Report deadline: 10 Jul 2018, 21:22:19 UTC
Received: 29 Jul 2017, 20:11:45 UTC
Server state: Over
Outcome: Computation error
Client state: Compute error
Exit status: 0 (0x0)
Computer ID: 1256552
Run time: 13 hours 52 min 54 sec
CPU time: 12 hours 39 min 56 sec
Validate state: Invalid
Credit: 0.00
Device peak FLOPS: 1.28 GFLOPS
Application version: Weather At Home 2 (wah2) v8.25 i686-pc-linux-gnu
stderr out:
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
SIGSEGV: segmentation violation
[snip]
ID: 59245 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4341 Credit: 16,497,933 RAC: 6,477	Message 59246 - Posted: 26 Dec 2018, 15:19:33 UTC Well, mine is a double retread: two attempts by others have failed before it was issued to me. Mine is also a double retread. The failures came after two and seven days, the seven-day machine being considerably faster than my desktop. That machine has a higher failure rate than this desktop, but not a massive one, so this task may well be destined to fail. Judgement may get easier at some point next year, because I understand there is a planned rebuild of the hadcm3s model which should resolve the problem of trickles not showing, though as with all these things, I am not holding my breath! https://www.cpdn.org/cpdnboinc/workunit.php?wuid=11669984 ID: 59246 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 59247 - Posted: 27 Dec 2018, 0:11:59 UTC These are the times for 3 of my "shorts" way back in April 2017:
hadcm3s_81bx_201412_120_564_011004032_1
Run time 3 days 22 hours 3 min 50 sec
CPU time 3 days 21 hours 22 min 23 sec
hadcm3s_82nj_201412_120_564_011005746_1
Run time 3 days 22 hours 44 min 36 sec
CPU time 3 days 22 hours 2 min 58 sec
hadcm3s_82lx_201412_120_564_011005688_1
Run time 3 days 18 hours 46 min 8 sec
CPU time 3 days 18 hours 3 min 37 sec
ID: 59247 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2167 Credit: 64,471,353 RAC: 3,620	Message 59248 - Posted: 27 Dec 2018, 0:44:48 UTC - in response to Message 59242. What I find interesting about this work unit hadcm3s_st249_190012_120_771_011667216 is the large amount of Run Time required (149,672.31, 138,538.43 seconds) to get 30 to 60 seconds of CPU time. This is on two different machines with different CPUs, both running 64-bit Windows 10. What are they spending that time on without using a CPU? On my machine with this same work unit, I already have over 56 hours of CPU time, have uploaded a trickle, and still running with 283 hours predicted to go. It has something to do with error conditions. Obviously a lot more CPU time is used before failure on at least some of these hadcm3s models. All the ones that failed on one of my Linux PCs did so well after the first trickle, and the first trickle took 50,000+ seconds CPU time. It's when they fail in a certain way, the CPU time gets reset somehow. ID: 59248 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1055 Credit: 16,516,801 RAC: 955	Message 59249 - Posted: 27 Dec 2018, 4:25:01 UTC - in response to Message 59247. These are the times for 3 of my "shorts" way back in April 2017: hadcm3s_81bx_201412_120_564_011004032_1 Run time 3 days 22 hours 3 min 50 sec CPU time 3 days 21 hours 22 min 23 sec hadcm3s_82nj_201412_120_564_011005746_1 Run time 3 days 22 hours 44 min 36 sec CPU time 3 days 22 hours 2 min 58 sec hadcm3s_82lx_201412_120_564_011005688_1 Run time 3 days 18 hours 46 min 8 sec CPU time 3 days 18 hours 3 min 37 sec These seem to be normal, whether they succeeded or not. By normal, I mean that the Run time was only slightly more than the CPU time. ID: 59249 · Reply Quote
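The "normal" pattern described above, with Run time only slightly larger than CPU time, can be checked with a little arithmetic. The following is a minimal Python sketch using the three sets of figures quoted in this post; the helper names `duration_to_seconds` and `cpu_fraction` are made up for illustration and are not part of BOINC.

```python
import re

# Seconds per unit for durations written like "3 days 22 hours 3 min 50 sec".
_UNIT_SECONDS = {"day": 86400, "hour": 3600, "min": 60, "sec": 1}


def duration_to_seconds(text):
    """Convert a duration string as shown on the task page into seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+)\s*(day|hour|min|sec)", text):
        total += int(value) * _UNIT_SECONDS[unit]
    return total


def cpu_fraction(run_time, cpu_time):
    """Share of the wall-clock run time that was spent on the CPU."""
    return duration_to_seconds(cpu_time) / duration_to_seconds(run_time)


# Figures quoted in the post above.
tasks = {
    "hadcm3s_81bx_201412_120_564_011004032_1": (
        "3 days 22 hours 3 min 50 sec", "3 days 21 hours 22 min 23 sec"),
    "hadcm3s_82nj_201412_120_564_011005746_1": (
        "3 days 22 hours 44 min 36 sec", "3 days 22 hours 2 min 58 sec"),
    "hadcm3s_82lx_201412_120_564_011005688_1": (
        "3 days 18 hours 46 min 8 sec", "3 days 18 hours 3 min 37 sec"),
}

for name, (run_time, cpu_time) in tasks.items():
    print(f"{name}: {cpu_fraction(run_time, cpu_time):.1%} of run time on CPU")
```

For all three tasks the CPU time is above 99% of the run time, which is what "normal" means here. A task reporting days of run time and under a minute of CPU time would show a fraction close to zero.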

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1055 Credit: 16,516,801 RAC: 955	Message 59258 - Posted: 29 Dec 2018, 12:34:14 UTC - in response to Message 59235. And this IS research. Perhaps your computer is just slightly different in a way that will mean that it WON'T fail.

Perhaps, but it failed last night, with over 100 hours of CPU time (sorry I did not write it down). There must be several bugs, though, to lose the correct amount of CPU time -- not that this one matters much.

Name: hadcm3s_st249_190012_120_771_011667216_2
Workunit: 11667216
Run time: 5 days 4 hours 15 min 25 sec
CPU time: 39 sec
Application version: UK Met Office HadCM3 short v8.34 i686-pc-linux-gnu
stderr out:
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)
</message>
<stderr_txt>
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED. tmp/pipe_dummy
Sorry, too many model crashes! :-(
Calling boinc_finish...04:22:04 (23029): called boinc_finish(22)
In boinc_exit called with status 22
Calloing set_signal_exit_code with status 22
</stderr_txt>
]]>

Interesting that the other two got 3,111.26 credit and I got none.

Task | Computer | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Credit | Application
21453655 | 1256552 | 24 Dec 2018, 5:05:52 UTC | 29 Dec 2018, 12:21:57 UTC | Error while computing | 447,325.52 | 39.37 | --- | UK Met Office HadCM3 short v8.34 i686-pc-linux-gnu
21389954 | 1425854 | 26 Nov 2018, 9:52:13 UTC | 24 Dec 2018, 5:05:42 UTC | Error while computing | 149,672.31 | 30.61 | 3,111.26 | UK Met Office HadCM3 short v8.34 windows_intelx86
21366260 | 1468717 | 6 Nov 2018, 14:50:15 UTC | 26 Nov 2018, 9:52:08 UTC | Error while computing | 138,538.43 | 61.14 | 3,111.26 | UK Met Office HadCM3 short v8.34 windows_intelx86
ID: 59258 · Reply Quote
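The exit message in this post's stderr output reports the code three ways: 22 (0x16, -234). A short Python sketch shows how the three numbers relate. That the third value is the same code with every bit above the low byte set is an assumption made here from the arithmetic, not something stated in the thread.

```python
# Exit code reported in the stderr output: "process exited with code 22 (0x16, -234)"
exit_code = 22

# The same value written in hexadecimal.
as_hex = hex(exit_code)

# Assumption: the third number is the low byte combined with all higher bits
# set, which turns the code into a negative value (-256 + 22).
with_high_bits = (-1 << 8) | exit_code

print(exit_code, as_hex, with_high_bits)  # prints: 22 0x16 -234
```

So the message shows one exit status in three notations rather than three separate error codes.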

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 59263 - Posted: 29 Dec 2018, 20:26:59 UTC INVALID THETA DETECTED That's an "unacceptable physics" error, so it looks like that set of starting values finally pushed things too far. And now you've got me doing it :( My last running model says it's been running for 2d 22h 53m. But the Event log says it started 3d 1h 30m ago. It's a batch 780, and as all my batch 781s have failed with a mismatch in some of the data files (REPLANCA), I'm guessing this one will too in a couple of hours. So, no more until some of the project people come back from wherever and find my emails. ID: 59263 · Reply Quote

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2167 Credit: 64,471,353 RAC: 3,620	Message 59265 - Posted: 30 Dec 2018, 2:52:25 UTC - in response to Message 59258. Interesting that the other two got 3,111.26 credit and I got none. But now you did. You just looked at it before the last weekly credit run on the server. :) ID: 59265 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1055 Credit: 16,516,801 RAC: 955	Message 59268 - Posted: 30 Dec 2018, 13:12:36 UTC I just lost two more. After about 16 seconds.

Name: hadcm3s_e239_191012_120_782_011725147_0
Workunit: 11725147
Application version: UK Met Office HadCM3 short v8.34 i686-pc-linux-gnu
stderr out:
<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)
</message>
<stderr_txt>
buffout error in ASWAP! -- 17116840
Model crashed: TRANSOUT: I/O write error tmp/pipe_dummy
buffout error in ASWAP! -- 17116840
Model crashed: TRANSOUT: I/O write error tmp/pipe_dummy
buffout error in ASWAP! -- 17116840
Model crashed: TRANSOUT: I/O write error tmp/pipe_dummy
buffout error in ASWAP! -- 17116840
Model crashed: TRANSOUT: I/O write error tmp/pipe_dummy
buffout error in ASWAP! -- 17116840
Model crashed: TRANSOUT: I/O write error tmp/pipe_dummy
buffout error in ASWAP! -- 17116840
Model crashed: TRANSOUT: I/O write error tmp/pipe_dummy
Sorry, too many model crashes! :-(
Calling boinc_finish...07:03:33 (4372): called boinc_finish(22)
In boinc_exit called with status 22
Calloing set_signal_exit_code with status 22
</stderr_txt>
]]>

30-Dec-2018 07:03:01 [climateprediction.net] Starting task hadcm3s_ze54_190012_120_782_011725889_0 using hadcm3s version 834 in slot 0
30-Dec-2018 07:03:19 [climateprediction.net] Computation for task hadcm3s_ze54_190012_120_782_011725889_0 finished
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_1.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_2.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_3.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_4.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_5.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_6.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_7.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_8.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_9.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_10.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_restart.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
ID: 59268 · Reply Quote
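When many tasks fail this way, it can help to pull the "absent" entries out of the client log rather than read them by eye. Below is a minimal Python sketch, assuming the log line format shown in this post. The function name `absent_files` is hypothetical, and the sample holds only four of the log lines, with the task names written without the spaces that line wrapping introduced.

```python
import re

# Four sample lines in the format of the event log above
# (wrapped task names rejoined).
LOG = """\
30-Dec-2018 07:03:01 [climateprediction.net] Starting task hadcm3s_ze54_190012_120_782_011725889_0 using hadcm3s version 834 in slot 0
30-Dec-2018 07:03:19 [climateprediction.net] Computation for task hadcm3s_ze54_190012_120_782_011725889_0 finished
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_1.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
30-Dec-2018 07:03:19 [climateprediction.net] Output file hadcm3s_ze54_190012_120_782_011725889_0_r743848479_restart.zip for task hadcm3s_ze54_190012_120_782_011725889_0 absent
"""

# Matches only the "Output file ... for task ... absent" entries.
ABSENT = re.compile(r"Output file (\S+) for task (\S+) absent")


def absent_files(log_text):
    """Return {task name: [output files reported absent]} from log text."""
    missing = {}
    for line in log_text.splitlines():
        match = ABSENT.search(line)
        if match:
            filename, task = match.groups()
            missing.setdefault(task, []).append(filename)
    return missing


for task, files in absent_files(LOG).items():
    print(f"{task}: {len(files)} output file(s) absent")
```

A task that crashes within seconds would be expected to list every one of its output files as absent, since none of them had been written yet.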

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1055 Credit: 16,516,801 RAC: 955	Message 59269 - Posted: 30 Dec 2018, 13:54:07 UTC - in response to Message 59268. I just lost two more. After about 16 seconds. And two more ... ID: 59269 · Reply Quote

Iain Inglis Volunteer moderator Send message Joined: 16 Jan 10 Posts: 1081 Credit: 6,972,865 RAC: 3,926	Message 59270 - Posted: 30 Dec 2018, 16:53:52 UTC - in response to Message 59269. I just lost two more. After about 16 seconds. And two more ... As far as I can see they all fail with that error, which is new to me: buffout error in ASWAP! -- 17116840 Model crashed: TRANSOUT: I/O write error ID: 59270 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1055 Credit: 16,516,801 RAC: 955	Message 59274 - Posted: 30 Dec 2018, 19:24:01 UTC - in response to Message 59270. As far as I can see they all fail with that error, which is new to me: buffout error in ASWAP! -- 17116840 Model crashed: TRANSOUT: I/O write error New to me too. I looked at the boinc_client.log and included the part of it, above, about one of the failed work units for today. It seemed to say nothing other than that the files to be uploaded did not exist. Well, in 16 seconds, I would not imagine they had been created yet. I have no idea what ASWAP means. If the Linux kernel wanted to page out some idle pages, it is free to do so, and it should not bother the application. Perhaps this message is not about swapping at all. Is 17116840 the number of bytes it wants to write or read? What is TRANSOUT about? That does look as though a read or write had a problem. My machine seems to be reading and writing OK, though. ID: 59274 · Reply Quote

nairb Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 59379 - Posted: 10 Jan 2019, 17:36:05 UTC The w/u ran to 100% and then gave "computing error" with the message: Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH. The job had 7 trickles waiting to upload. When it reported the end of the job, the trickles were aborted (they disappeared, anyway). So I guess it's a loss all round. I don't seem to do too well with Climate w/u, with almost a 50% fail rate. ID: 59379 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 59385 - Posted: 10 Jan 2019, 22:22:47 UTC Last modified: 10 Jan 2019, 22:25:35 UTC REPLANCA is just what the rest of that line says. Example: You have 2 files, one contains people's names, the other how much you pay them. The 1st file has 10 items, the 2nd file has 8. So when you get to item 9 in the 1st list, there's no data in the 2nd list. But where in the 2nd list were the amounts missed out? And it's LOTS more complicated with these climate models. So not your fault.
****************************
As for the other problem, the files disappearing, that's just how BOINC works. It's really designed for other projects, so when it gets the signal that the task has failed, the next item on its To Do list is to send back the error messages. Oh, and we don't need these other files any more, so remove them from the list. :(
*********************
And please read my post here about the difference between Trickles and zips. Your trickles (on which credit is based) would have been returned, as per a few lines in the Event log. It's the zips, in the Transfers tab, that disappeared. ID: 59385 · Reply Quote
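The two-file analogy above can be sketched in a few lines of Python. This only illustrates the analogy and makes no claim about how the model compares ancillary file headers. The data and the function name `pair_records` are invented for the example.

```python
# Hypothetical data for the analogy: names in one file, pay in the other.
names = [f"person_{i}" for i in range(1, 11)]   # 10 entries
pay = [100, 120, 90, 110, 95, 130, 105, 115]    # only 8 entries


def pair_records(names, pay):
    """Pair each name with a pay amount, refusing to guess on a mismatch."""
    if len(names) != len(pay):
        raise ValueError(
            f"headers do not match: {len(names)} names but {len(pay)} pay entries"
        )
    return list(zip(names, pay))


try:
    pair_records(names, pay)
except ValueError as err:
    print("Mismatch detected:", err)
```

The check can tell that the two lists disagree, but not which two amounts went missing, so stopping with an error is the only safe outcome. That is the point of the analogy: the mismatch lies in the input data, not in anything the volunteer's machine did.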

nairb Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 59386 - Posted: 10 Jan 2019, 23:45:22 UTC Thanks for the info. So the zip files are the science bit. So if a w/u fails at some point and the zip files are still waiting to be uploaded, then the science is lost as well? Are partially completed w/u still of value to the project? It's frustrating seeing 5 days of processing going to waste. I will give it another go when the zip upload issues go away. ID: 59386 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 59387 - Posted: 11 Jan 2019, 2:42:49 UTC Yes, the science is lost. Those models are a bit like buying an apple with half of it rotten. Best to buy a good one. Or, in this case, dump the incomplete models and only use the fully completed AND RECEIVED models. And, if necessary, issuing another batch to cover the gaps in the results. But with the REPLANCA fails, the models will be incomplete anyway. ID: 59387 · Reply Quote

Harri Liljeroos Send message Joined: 9 Dec 05 Posts: 111 Credit: 12,038,780 RAC: 1,393	Message 59415 - Posted: 12 Jan 2019, 22:08:44 UTC I just got an error with wah2 global model after 8 days and 13 hours of calculation. The error was 196 exit_disk_limit_exceeded. At least 50 zip files that were waiting to be uploaded were aborted. The task is here: https://www.cpdn.org/cpdnboinc/result.php?resultid=21361420 ID: 59415 · Reply Quote

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 59417 - Posted: 12 Jan 2019, 23:34:50 UTC - in response to Message 59415. I just got an error with wah2 global model after 8 days and 13 hours of calculation. The error was 196 exit_disk_limit_exceeded. At least 50 zip files that were waiting to be uploaded were aborted. The task is here: https://www.cpdn.org/cpdnboinc/result.php?resultid=21361420 Which is what I said might happen here. ID: 59417 · Reply Quote