WUs constantly failing

old_user113466 Joined: 23 Nov 05 Posts: 18 Credit: 407,491 RAC: 0	Message 19989 - Posted: 6 Feb 2006, 1:49:20 UTC
I have yet to complete a sulphur model due to continual client errors. Is it me, or the model? Why do I get credit for a client error? If my host can't do it, then let's move on.
Sample msg:
<core_client_version>5.2.13</core_client_version>
<message>
<file_xfer_error> <file_name>sulphur_itus_100878500_1_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>sulphur_itus_100878500_1_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>sulphur_itus_100878500_1_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>sulphur_itus_100878500_1_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
<file_xfer_error> <file_name>sulphur_itus_100878500_1_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message> </file_xfer_error>
</message>
Thanks for any help
DP

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 19992 - Posted: 6 Feb 2006, 2:42:24 UTC
You get credits each time you trickle, as per the FAQ.

old_user113466 Joined: 23 Nov 05 Posts: 18 Credit: 407,491 RAC: 0	Message 19995 - Posted: 6 Feb 2006, 4:17:12 UTC
Thanks for getting back. Let me be more specific. Are my client errors wasting time on both sides, for me and for CPDN? Do they convey valuable info back to the scientific assumptions? Is an error useful for massaging future thinking, or am I just getting an atta-boy back for my CPU time? For example:
2/5/2006 9:29:22 PM|climateprediction.net|Computation for result sulphur_hfa8_100812960_0 finished
2/5/2006 9:29:22 PM|Predictor @ Home|Resuming result h0017B_1_138865_1 using mfoldB125 version 428
2/5/2006 9:29:23 PM|climateprediction.net|Unrecoverable error for result sulphur_hfa8_100812960_0 (<file_xfer_error> <file_name>sulphur_hfa8_100812960_0_1.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_hfa8_100812960_0_2.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_hfa8_100812960_0_3.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_hfa8_100812960_0_4.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error><file_xfer_error> <file_name>sulphur_hfa8_100812960_0_5.zip</file_name> <error_code>-161</error_code> <error_message></error_message></file_xfer_error>)
2/5/2006 9:42:16 PM||request_reschedule_cpus: process exited
DP
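For anyone wanting to summarise a long run of these entries before posting, here is a minimal sketch in Python. It assumes the log text has been copied out of the BOINC message window as plain text; `parse_xfer_errors` is a hypothetical helper written for this thread, not part of BOINC, and the sample is shortened to two of the entries quoted above.

```python
import re

# Two entries copied from the log quoted above (shortened from five).
SAMPLE = (
    "<file_xfer_error> <file_name>sulphur_hfa8_100812960_0_1.zip</file_name> "
    "<error_code>-161</error_code> <error_message></error_message></file_xfer_error>"
    "<file_xfer_error> <file_name>sulphur_hfa8_100812960_0_2.zip</file_name> "
    "<error_code>-161</error_code> <error_message></error_message></file_xfer_error>"
)

# Matches one <file_xfer_error> entry and captures the file name and error code.
XFER_RE = re.compile(
    r"<file_xfer_error>\s*<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>"
)


def parse_xfer_errors(text):
    """Return a list of (file_name, error_code) tuples found in text.

    Hypothetical helper for illustration only; it is not a BOINC API.
    """
    return [(m.group("name"), int(m.group("code"))) for m in XFER_RE.finditer(text)]


if __name__ == "__main__":
    for name, code in parse_xfer_errors(SAMPLE):
        print(f"{name}: error {code}")
```

This only lists which uploads failed and with which code; it says nothing about why they failed.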

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 19997 - Posted: 6 Feb 2006, 4:57:28 UTC
Trickles are to tell the server that the model is alive, and is up to 'x,y,z' of the processing. At the end of each phase, a large zip file of data gets sent back; the first is about 8 MB, the rest about 2 MB each.
It only becomes worthwhile if the end of the first phase is reached and the data sent back. After this, ALL end-of-phase zip files are needed for the run to be of further value. At the moment, there have been 2380 sulphur models completed, so it is possible.
The next part of the experiment will be different as regards the size of data on hard drives, when and how much data is returned, and the files left on the hard drive at the end of a model. But the run time will still be long.
The error messages are useful for debugging, to some extent. Mostly, it is long-time users such as myself who help out with this. As has been posted MANY times, all over the help boards, the 161 error message tells us nothing. It's what's in yabsd.out (in the dataout folder of the model's folder) that often provides a clue.
When the two experiments due for imminent release are out of the way, the two programmers will be able to devote some time to looking into the rash of sulphur failures.
As your computers are constantly failing here at present, perhaps you should set them for 'No new work' from here, and concentrate on other projects for a few weeks. Look back now and then to see if there is something new, perhaps in the front page News section.

Curtis Joined: 16 Dec 05 Posts: 27 Credit: 227,145 RAC: 6,532	Message 20001 - Posted: 6 Feb 2006, 6:53:15 UTC
Ya. I just got the same errors, but on a different model, I think: sulphur_ghkh_000769265_0
Result ID: 1474958

Curtis Joined: 16 Dec 05 Posts: 27 Credit: 227,145 RAC: 6,532	Message 20002 - Posted: 6 Feb 2006, 7:01:30 UTC
Do you get the yabsd.out file, or do we need to send it somewhere somehow?

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 20003 - Posted: 6 Feb 2006, 7:09:34 UTC
We don't have access to your computer, so you have to copy and paste the data here. The last dozen or so lines should be enough to see what is happening. Mostly, it will probably be: "Oh, right. Another one of those." But you never know, it may be different.
When you say (in your previous post) "a different model", were you referring to a different model name from DP's? If so, then you need to know that everyone gets a different data set and model name. There are no quorums here as used in SETI, etc.
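Pulling out the last dozen lines of yabsd.out can be done by hand in any text editor, or with a short script. The following is a minimal Python sketch; the relative path `dataout/yabsd.out` is an assumption and must be replaced with the dataout folder of your own model, and `last_lines` is a hypothetical helper name.

```python
from collections import deque
from pathlib import Path


def last_lines(path, count=12):
    """Return the last `count` lines of a text file as a list of strings."""
    # deque with maxlen keeps only the most recent `count` lines while reading,
    # so even a large log is never held in memory in full.
    with open(path, "r", errors="replace") as handle:
        return [line.rstrip("\n") for line in deque(handle, maxlen=count)]


if __name__ == "__main__":
    # Hypothetical location: substitute the dataout folder of your own model.
    log_path = Path("dataout") / "yabsd.out"
    if log_path.exists():
        print("\n".join(last_lines(log_path)))
    else:
        print(f"{log_path} not found; adjust the path for your model folder")
```

The printed lines can then be pasted into a post as they are.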

old_user113466 Joined: 23 Nov 05 Posts: 18 Credit: 407,491 RAC: 0	Message 20012 - Posted: 7 Feb 2006, 1:36:40 UTC - in response to Message 19997.
"As your computers are constantly failing here"
"Look back now and then to see if there is something new, perhaps in the front page News section."
Thanks, I'll be back.
DP

Curtis Joined: 16 Dec 05 Posts: 27 Credit: 227,145 RAC: 6,532	Message 20014 - Posted: 7 Feb 2006, 2:12:39 UTC
NOCNINDX
Namelist is
$NOCNINDX J_1 = 1 J_2 = 2 J_3 = 3 J_JMT = 73 J_JMTM1 = 72 J_JMTM2 = 71 J_JMTP1 = 74 JST = 1 JFIN = 73 J_FROM_LOC = 0 J_TO_LOC = 0 JMT_GLOBAL = 73 JMTM1_GLOBAL = 72 JMTM2_GLOBAL = 71 JMTP1_GLOBAL = 74 J_OFFSET = 0 O_MYPE = 0 O_EW_HALO = 0 O_NS_HALO = 0 J_PE_JSTM1 = -1 J_PE_JSTM2 = -1 J_PE_JFINP1 = -1 J_PE_JFINP2 = -1 O_NPROC = 1 IMOUT = 40 JMOUT = 40 J_PE_IND_MED = 40 NMEDLEV = 0 $END
SLAB TIMESTEP 2
im,sm,ngroup,new_im,new_sm 1 1 48 T F
FINAL TOTAL ENERGY = 0.45221E+27 J/
INITIAL TOTAL ENERGY = 0.45217E+27 J/
CHG IN TOTAL ENERGY OVER DAY = 0.37262E+23 J/
FLUXES INTO ATM OVER DAY = 0.88673E+23 J/
ERROR IN ENERGY BUDGET = 0.51410E+23 J/
TEMP CORRECTION OVER DAY = 0.28450E-01 K
TEMPERATURE CORRECTION RATE = 0.32929E-06 K/S
FLUX CORRECTION (ATM) = 0.33312E+01 W/M2
FINAL ATM MASS = 0.17980E+22 KG
INITIAL ATM MASS = 0.17980E+22 KG
CORRECTION FACTOR FOR PSTAR = 0.10000E+01
im,sm,ngroup,new_im,new_sm 3 1 1 T F
NOCNINDX
Namelist is
$NOCNINDX J_1 = 1 J_2 = 2 J_3 = 3 J_JMT = 73 J_JMTM1 = 72 J_JMTM2 = 71 J_JMTP1 = 74 JST = 1 JFIN = 73 J_FROM_LOC = 0 J_TO_LOC = 0 JMT_GLOBAL = 73 JMTM1_GLOBAL = 72 JMTM2_GLOBAL = 71 JMTP1_GLOBAL = 74 J_OFFSET = 0 O_MYPE = 0 O_EW_HALO = 0 O_NS_HALO = 0 J_PE_JSTM1 = -1 J_PE_JSTM2 = -1 J_PE_JFINP1 = -1 J_PE_JFINP2 = -1 O_NPROC = 1 IMOUT = 40 JMOUT = 40 J_PE_IND_MED = 40 NMEDLEV = 0 $END
SLAB TIMESTEP 3
3395537 words long
MODEL DUMP SUCCESSFULLY WRITTEN - 3434914 WORDS TO UNIT 22
Number of Words Written to Disk was 3436498
im,sm,ngroup,new_im,new_sm 1 1 48 T F
FINAL TOTAL ENERGY = 0.45222E+27 J/
INITIAL TOTAL ENERGY = 0.45221E+27 J/
CHG IN TOTAL ENERGY OVER DAY = 0.15717E+23 J/
FLUXES INTO ATM OVER DAY = 0.67759E+23 J/
ERROR IN ENERGY BUDGET = 0.52042E+23 J/
TEMP CORRECTION OVER DAY = 0.28800E-01 K
TEMPERATURE CORRECTION RATE = 0.33333E-06 K/S
FLUX CORRECTION (ATM) = 0.33722E+01 W/M2
FINAL ATM MASS = 0.17980E+22 KG
INITIAL ATM MASS = 0.17980E+22 KG
CORRECTION FACTOR FOR PSTAR = 0.10000E+01
im,sm,ngroup,new_im,new_sm 3 1 1 T F
NOCNINDX
Namelist is
$NOCNINDX J_1 = 1 J_2 = 2 J_3 = 3 J_JMT = 73 J_JMTM1 = 72 J_JMTM2 = 71 J_JMTP1 = 74 JST = 1 JFIN = 73 J_FROM_LOC = 0 J_TO_LOC = 0 JMT_GLOBAL = 73 JMTM1_GLOBAL = 72 JMTM2_GLOBAL = 71 JMTP1_GLOBAL = 74 J_OFFSET = 0 O_MYPE = 0 O_EW_HALO = 0 O_NS_HALO = 0 J_PE_JSTM1 = -1 J_PE_JSTM2 = -1 J_PE_JFINP1 = -1 J_PE_JFINP2 = -1 O_NPROC = 1 IMOUT = 40 JMOUT = 40 J_PE_IND_MED = 4*0 NMEDLEV = 0 $END
SLAB TIMESTEP 4
im,sm,ngroup,new_im,new_sm 1 1 48 T F
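The daily energy-budget figures in this dump can be read into numbers for a quick sanity check. The sketch below is Python and assumes the entries have been split one per line; `parse_budget` and `budget_residual` are hypothetical helpers. The relation it checks (reported error roughly equal to fluxes minus change in total energy) is only an inference from the figures posted here, not documented model behaviour.

```python
import re

# One day's budget lines, copied from the yabsd.out excerpt in this thread.
SAMPLE = """\
FINAL TOTAL ENERGY = 0.45221E+27 J/
INITIAL TOTAL ENERGY = 0.45217E+27 J/
CHG IN TOTAL ENERGY OVER DAY = 0.37262E+23 J/
FLUXES INTO ATM OVER DAY = 0.88673E+23 J/
ERROR IN ENERGY BUDGET = 0.51410E+23 J/
"""

# "LABEL = 0.12345E+67" with the label in capitals, one entry per line.
LINE_RE = re.compile(
    r"^\s*(?P<label>[A-Z][A-Z ()]*?)\s*=\s*(?P<value>[-+]?\d*\.\d+E[-+]\d+)",
    re.MULTILINE,
)


def parse_budget(text):
    """Map each label to its value as a float (units are ignored)."""
    return {m.group("label"): float(m.group("value")) for m in LINE_RE.finditer(text)}


def budget_residual(values):
    """Reported budget error minus (fluxes - change in total energy).

    Assumption: the three quantities are related this way; that is inferred
    from the posted numbers only.
    """
    expected = values["FLUXES INTO ATM OVER DAY"] - values["CHG IN TOTAL ENERGY OVER DAY"]
    return values["ERROR IN ENERGY BUDGET"] - expected


if __name__ == "__main__":
    budget = parse_budget(SAMPLE)
    relative = abs(budget_residual(budget)) / budget["ERROR IN ENERGY BUDGET"]
    print(f"relative residual: {relative:.1e}")
```

A small residual only shows that the printed figures agree with each other to rounding; it does not explain why the model stopped.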

KeeperC Joined: 5 Aug 04 Posts: 66 Credit: 2,146,056 RAC: 0	Message 20098 - Posted: 10 Feb 2006, 14:09:19 UTC Last modified: 10 Feb 2006, 14:11:21 UTC
This result (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1612048) and this one (http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1351239), both on the same machine, failed at exactly the same point. The machine cannot get past this point in sulphur, despite having many successful slab models to its credit. Two of my other machines have also failed on sulphur, though less repeatably.
I must say that I find this problem quite frustrating. I know the team is focused on the new experiments, but if this undiagnosed problem persists with the coupled model, it will begin to sap my (considerable) commitment to this project. :(
Edit: Sorry, can't remember how to put in links, but you have the URLs at least.

old_user31578 Joined: 28 Nov 04 Posts: 9 Credit: 687,368 RAC: 0	Message 20129 - Posted: 11 Feb 2006, 11:49:17 UTC
I have similar problems with the sulphur models; one example is: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=1753881

Arnaud Joined: 3 Sep 04 Posts: 268 Credit: 256,045 RAC: 0	Message 20130 - Posted: 11 Feb 2006, 13:55:43 UTC Last modified: 11 Feb 2006, 14:16:37 UTC
@ Egon and KeeperC
These generic error messages were not reported during the Coupled Model tests. Hopefully, you'll be able to run this new model without problems.
Arnaud

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2167 Credit: 64,487,091 RAC: 4,506	Message 20131 - Posted: 11 Feb 2006, 14:12:18 UTC
Egon,
The problem is with Linux sulphur 4.23. See this sticky in the "BOINC Questions and Problems" Linux forum if you haven't already.

old_user31578 Joined: 28 Nov 04 Posts: 9 Credit: 687,368 RAC: 0	Message 20132 - Posted: 11 Feb 2006, 17:51:08 UTC
Thanks for the info, geophi. I will crunch some other BOINC projects until the next experiment goes live.
/Egon