Reporting - Errors while computing -

Message boards : Number crunching : Reporting - Errors while computing -
KWSN - Sir Frank of the Wood

Send message
Joined: 3 Nov 10
Posts: 39
Credit: 2,494,427
RAC: 0
Message 45913 - Posted: 12 Apr 2013, 20:06:06 UTC - in response to Message 45905.  

hello les

i wondered what was going on - i checked to see how that work unit was processed by other folks, and i was the only person that had it...


so the download attempt was just a mistake ???

frank
ID: 45913 · Report as offensive     Reply Quote
Profile mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 45914 - Posted: 12 Apr 2013, 23:06:04 UTC

Yes, I'm afraid you could call it that. Luckily it does no harm.
Cpdn news
ID: 45914 · Report as offensive     Reply Quote
Profile Byron Leigh Hatch @ team Carl ...
Avatar

Send message
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 45932 - Posted: 15 Apr 2013, 17:27:05 UTC





Hi everyone,

Yesterday I upgraded to BOINC 7.0.60, so I just wanted to ask all you long-time seasoned pros whether everything looks OK.


Windows 7

CPU type GenuineIntel

Intel(R) Xeon(R) CPU E5507 @ 2.27GHz [Family 6 Model 26 Stepping 5]

Number of processors: 8 (8 physical CPUs, no Hyper-Threading)

I'm only running one project on this computer, Climate Prediction.net, 24/7.

The 8 models I'm now crunching

Reporting one "Error while computing" on the following:

name: hadcm3n_39aw_1980_40_008283532

application: UK Met Office Coupled Model Full Resolution Ocean

created: 14 Jan 2013 6:26:17 UTC

15/04/2013 9:00:29 AM | climateprediction.net | Started download of hadcm3n_39aw_1980_40_008283532.zip
15/04/2013 9:00:30 AM | climateprediction.net | Finished download of hadcm3n_39aw_1980_40_008283532.zip
15/04/2013 9:00:30 AM | climateprediction.net | Started download of ocean_39aw_1980_40_008283532_0.gz
15/04/2013 9:00:39 AM | climateprediction.net | Started download of atmos_39aw_1980_40_008283532_0.gz
15/04/2013 9:01:15 AM | climateprediction.net | Finished download of atmos_39aw_1980_40_008283532_0.gz
15/04/2013 9:01:36 AM | climateprediction.net | Finished download of ocean_39aw_1980_40_008283532_0.gz

15/04/2013 9:02:07 AM | climateprediction.net | Computation for task hadcm3n_39aw_1980_40_008283532_3 finished

15/04/2013 9:02:07 AM | climateprediction.net | Output file hadcm3n_39aw_1980_40_008283532_3_1.zip for task hadcm3n_39aw_1980_40_008283532_3 absent
15/04/2013 9:02:07 AM | climateprediction.net | Output file hadcm3n_39aw_1980_40_008283532_3_2.zip for task hadcm3n_39aw_1980_40_008283532_3 absent
15/04/2013 9:02:07 AM | climateprediction.net | Output file hadcm3n_39aw_1980_40_008283532_3_3.zip for task hadcm3n_39aw_1980_40_008283532_3 absent
15/04/2013 9:02:07 AM | climateprediction.net | Output file hadcm3n_39aw_1980_40_008283532_3_4.zip for task hadcm3n_39aw_1980_40_008283532_3 absent


<core_client_version>7.0.60</core_client_version>
<![CDATA[
<message>
The device does not recognize the command.
(0x16) - exit code 22 (0x16)
</message>
<stderr_txt>
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048
Sorry, too many model crashes! :-(
Called boinc_finish
</stderr_txt>
]]>

I hope this helps,
Byron


ID: 45932 · Report as offensive     Reply Quote
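The event log above shows how quickly the task gave up after its files arrived. As a minimal sketch, assuming the timestamps are day/month/year with a 12-hour clock (as they appear in the post) and that the fields are separated by `|`, the gap can be computed like this. The helper names `parse_line` and `seconds_between` are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

# Two event-log lines copied from the post above
# (format: timestamp | project | message).
LOG_LINES = [
    "15/04/2013 9:01:36 AM | climateprediction.net | Finished download of ocean_39aw_1980_40_008283532_0.gz",
    "15/04/2013 9:02:07 AM | climateprediction.net | Computation for task hadcm3n_39aw_1980_40_008283532_3 finished",
]


def parse_line(line):
    """Split one event-log line into (timestamp, project, message)."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    # Assumption: day/month/year order and AM/PM markers, as in the post.
    when = datetime.strptime(stamp, "%d/%m/%Y %I:%M:%S %p")
    return when, project, message


def seconds_between(first_line, second_line):
    """Elapsed seconds between two event-log lines."""
    return (parse_line(second_line)[0] - parse_line(first_line)[0]).total_seconds()


elapsed = seconds_between(LOG_LINES[0], LOG_LINES[1])
print(f"Task ended {elapsed:.0f} s after the last download finished")
```

For these two lines the gap is 31 seconds, which fits a model that never got started rather than one that failed part-way through a run.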
Profile mo.v
Volunteer moderator
Avatar

Send message
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 45939 - Posted: 15 Apr 2013, 23:17:00 UTC

There was something wrong with some of the models created at that time. If you look at the stderr of all the models in the workunit you'll find they all crashed with the INITTIME error: a defect in the model that prevented it from starting.
Cpdn news
ID: 45939 · Report as offensive     Reply Quote
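Checking the stderr of every model in a workunit by eye gets tedious. A rough sketch of how the "Model crashed" lines could be tallied is shown here, using the stderr text quoted earlier in the thread as sample input. The function name `summarise_crashes` and the assumption that every crash line has the shape `Model crashed: ROUTINE: reason` are illustrative only; other model errors may be worded differently.

```python
import re
from collections import Counter

# Sample stderr text, reproduced from the failed task quoted earlier.
STDERR_TXT = (
    "Model crashed: INITTIME: Atmosphere basis time mismatch tmp/pipe_dummy 2048\n" * 6
    + "Sorry, too many model crashes! :-(\n"
    + "Called boinc_finish\n"
)

# Assumption: crash lines look like "Model crashed: ROUTINE: reason".
CRASH_RE = re.compile(r"^Model crashed: (?P<routine>[A-Z0-9_]+): (?P<reason>.+)$")


def summarise_crashes(stderr_text):
    """Count crash lines, grouped by (routine, reason)."""
    counts = Counter()
    for line in stderr_text.splitlines():
        match = CRASH_RE.match(line.strip())
        if match:
            counts[(match["routine"], match["reason"])] += 1
    return counts


for (routine, reason), count in summarise_crashes(STDERR_TXT).items():
    print(f"{count} x {routine}: {reason}")
```

If every result in a workunit yields the same single (routine, reason) pair, that points at the model itself rather than at any one computer.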
Steve Wenner

Send message
Joined: 2 Mar 06
Posts: 27
Credit: 240,040
RAC: 0
Message 45941 - Posted: 16 Apr 2013, 1:13:12 UTC

Hi, I'm a committed participant, usually running 6 CPDN tasks simultaneously, but I rarely check my account. I just noticed that of the 16 tasks that ended in the last eight months, only one completed. All the others stopped with "Error While Computing" or "Error While Downloading". What gives? Should I bail out of CPDN? I hate to think that I am accomplishing nothing with my CPU cycles because of these errors. Anything I can do?
Thanks,
Steve
ID: 45941 · Report as offensive     Reply Quote
Profile geophi
Volunteer moderator

Send message
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 45942 - Posted: 16 Apr 2013, 2:11:09 UTC - in response to Message 45941.  

Steve,

Nearly all of your failures have been on the longer hadcm3n models. They failed at multiples of 25%, which is when the decadal uploads take place.

This is a common problem on some PCs, and it is unclear why some PCs have it frequently while others don't (or seldom do).

I would continue running your current ones, but in your climateprediction.net specific preferences of your account page, select other model types, and not hadcm3n.

Recently, availability of tasks has been inconsistent, so you might not get any new ones for a while.
ID: 45942 · Report as offensive     Reply Quote
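To check your own failed tasks against the "multiples of 25%" pattern, something like the following sketch could help. The progress values in the example are hypothetical, not taken from anyone's account, and the 0.5 percentage point tolerance is an arbitrary choice.

```python
def nearest_upload_point(progress_percent, step=25.0):
    """Return the closest multiple of `step` (in percent) to the given progress."""
    return round(progress_percent / step) * step


def failed_at_upload(progress_percent, step=25.0, tolerance=0.5):
    """True if a failure happened within `tolerance` of a non-zero multiple of `step`."""
    nearest = nearest_upload_point(progress_percent, step)
    return nearest > 0 and abs(progress_percent - nearest) <= tolerance


# Hypothetical progress values (percent) at which tasks failed.
for progress in (25.0, 50.2, 37.0, 0.3):
    print(progress, failed_at_upload(progress))
```

Failures that cluster at 25%, 50% or 75% would match the decadal-upload explanation; failures scattered elsewhere would suggest a different cause.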
JugNut

Send message
Joined: 6 Jun 11
Posts: 11
Credit: 356,113
RAC: 0
Message 45959 - Posted: 18 Apr 2013, 6:36:46 UTC - in response to Message 45942.  
Last modified: 18 Apr 2013, 7:31:15 UTC

Hi, I finally seem to have got past the download errors of the past weeks, but except for one WU that's still crunching, all others in the last 2 days have failed. This is one heck of a lot of time crunching for nothing.
Any idea what's causing all these errors?

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15728575
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15728505
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15728222
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15727928
and this one, which had to be aborted:
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15728098

Could someone tell me whether this final task is running correctly? I have my doubts.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15727960

Cheers, thanks for looking..
ID: 45959 · Report as offensive     Reply Quote
Profile Ray Murray
Avatar

Send message
Joined: 7 Aug 04
Posts: 50
Credit: 548,730
RAC: 0
Message 45961 - Posted: 18 Apr 2013, 8:09:15 UTC
Last modified: 18 Apr 2013, 8:53:43 UTC

Hi JugNut,
Your running one trickled just over an hour ago so seems to be OK. Fingers crossed.

I had the same error overnight. Out Of Memory (C++ Exception). Loads of memory here so shouldn't be able to run out.
Checking your wingmen shows the same error on a couple of those failed WUs. The others have yet to report, but as yours all have the same error, I suspect they will fail as well.

Someone will be along soon to tell us what's happening.

[Edit]
Just got a resend of another one of these so I'll try to pay attention to what happens to it. The other new resend I got didn't start for the wingman but has got past a few checkpoints so could be ok, unless it's one of the same batch.
ID: 45961 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45962 - Posted: 18 Apr 2013, 8:28:10 UTC - in response to Message 45961.  
Last modified: 18 Apr 2013, 8:41:00 UTC

Great!

Another problem. I'll let "them" know.

PS
Jugnut
It would be a good idea to upgrade to 7.0.60, which contains a fix for our download problems. Just in case.
ID: 45962 · Report as offensive     Reply Quote
JugNut

Send message
Joined: 6 Jun 11
Posts: 11
Credit: 356,113
RAC: 0
Message 45963 - Posted: 18 Apr 2013, 9:24:05 UTC - in response to Message 45962.  

Thanks for your time, guys.

@ Ray Murray: I also have 16GB on each of the machines reporting errors, so "out of memory" is probably not the problem for me either.

I'm glad at least 1 WU seems to be still alive & kicking.

Oh well, try, try again. Please keep us informed of any progress.

@ Les Bayliss: I had already just upgraded to 7.0.62. Thanks.

All the best JN
ID: 45963 · Report as offensive     Reply Quote
Profile Ray Murray
Avatar

Send message
Joined: 7 Aug 04
Posts: 50
Credit: 548,730
RAC: 0
Message 45964 - Posted: 18 Apr 2013, 9:35:32 UTC
Last modified: 18 Apr 2013, 10:00:17 UTC

After some more digging: my failed WU didn't tidy up after itself, so its 470MB folder is still there (not modified since yesterday, so probably nothing interesting in it), the stderr log is still in the slot, and there is also an XML file showing a last update at timestep 25921, exactly where it would make the first trickle. Significant?

From Boinc logs:
[task] Process for hadcm3n_4f6k_1980_40_008350244_0 exited, exit code 3765269347, task state 1 before the
[task] exit code -529697949 (0xe06d7363) and output files absent

PS
Another resend with the same error, which I'm just going to abort so I only have to concentrate on the one definitely dodgy and the one possibly OK.

Final edit before editing times out:
Is there another flag I could set, other than [task_debug], that would give more detailed info on the error?
ID: 45964 · Report as offensive     Reply Quote
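The two exit codes in the BOINC log lines above are the same 32-bit value shown unsigned and signed. A small sketch of the conversion follows; the helper name `to_signed32` is just for illustration. The value 0xE06D7363 is the exception code the Microsoft Visual C++ runtime uses when a C++ exception is thrown, and its low three bytes spell "msc" in ASCII, so it says that an exception went uncaught rather than which one.

```python
def to_signed32(value):
    """Interpret an integer as a signed 32-bit two's-complement value."""
    value &= 0xFFFFFFFF
    return value - 0x1_0000_0000 if value & 0x8000_0000 else value


unsigned_code = 3765269347   # "exit code 3765269347" from the log
signed_code = -529697949     # "exit code -529697949" from the log

# Both numbers are the same bit pattern.
assert to_signed32(unsigned_code) == signed_code

print(hex(unsigned_code))    # 0xe06d7363

# The low three bytes are ASCII characters.
tag = (unsigned_code & 0x00FFFFFF).to_bytes(3, "big").decode("ascii")
print(tag)                   # msc
```

So the log is not reporting two different failures, only one exit status printed two ways.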
Les Bayliss
Volunteer moderator

Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45966 - Posted: 18 Apr 2013, 9:47:27 UTC

I'm thinking "Out of memory C++" etc, is a compiler problem.
We'll see, if the project people haven't screamed loudly and run off into the distance.

ID: 45966 · Report as offensive     Reply Quote
Profile Ray Murray
Avatar

Send message
Joined: 7 Aug 04
Posts: 50
Credit: 548,730
RAC: 0
Message 45976 - Posted: 19 Apr 2013, 9:19:23 UTC

Both the resends from my message yesterday, that have been under observation, failed at the first trickle point (as expected) and again didn't tidy up. Those that have had a few of these will be building up a large amount of garbage with each failed wu leaving its c.470MB folder behind. Is there a server side cleanup option (maybe a forced reset?) or will people have to throw out this junk manually? Maybe a global message through Boinc?

Speaking of junk; could you delete my double post from yesterday, please Les, just to tidy up the thread.
ID: 45976 · Report as offensive     Reply Quote
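For anyone wanting to see how much has been left behind before deleting anything by hand, here is a read-only sketch that lists large sub-folders under a directory you point it at. It deletes nothing. Which directory to scan, and the assumption that each failed WU leaves one folder directly beneath it, depend on your own BOINC installation; the function names are made up for this example.

```python
import os


def folder_size_bytes(path):
    """Total size in bytes of all regular files below `path` (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def large_folders(root, min_bytes):
    """Return (folder name, size) for each direct sub-folder of `root`
    holding at least `min_bytes`, largest first."""
    found = []
    with os.scandir(root) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                size = folder_size_bytes(entry.path)
                if size >= min_bytes:
                    found.append((entry.name, size))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Calling `large_folders(project_dir, 400 * 1024**2)`, with `project_dir` set to the project's folder inside your BOINC data directory, would list anything around the 470MB mark. Check that no listed folder belongs to a model that is still running before removing it.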
Profile Iain Inglis
Volunteer moderator

Send message
Joined: 16 Jan 10
Posts: 1079
Credit: 6,904,878
RAC: 6,593
Message 45977 - Posted: 19 Apr 2013, 9:35:18 UTC - in response to Message 45976.  

Ray Murray wrote: "Is there a server side cleanup option (maybe a forced reset?) or will people have to throw out this junk manually?"
Not to my knowledge. Performing a project reset from within BOINC Manager will usually clear out the debris. If not, then a detach (or remove) will get rid of everything. After re-attaching, a host merge is usually necessary.

Both options should only be attempted when no models are live, since they will also be removed.
ID: 45977 · Report as offensive     Reply Quote
Profile Ray Murray
Avatar

Send message
Joined: 7 Aug 04
Posts: 50
Credit: 548,730
RAC: 0
Message 45981 - Posted: 19 Apr 2013, 17:43:16 UTC

Thanks Iain,
I thought that was the case. I've just deleted the leftover files from the dead WUs, as some PNWs have sneaked in while I wasn't looking. I was thinking more about those who don't visit the boards, or don't even realise they have had a problem until they find they have used up more disk space than they were expecting.
ID: 45981 · Report as offensive     Reply Quote
WB8ILI

Send message
Joined: 1 Sep 04
Posts: 161
Credit: 81,421,805
RAC: 1,225
Message 45985 - Posted: 19 Apr 2013, 21:45:37 UTC

Out of Memory Errors -

My experience is that on Windows XP computers the C++ out of memory error is trapped by Windows (with an error window) that ends up with the task getting a Computational Error when I click on OK.

On Linux the real memory gets sucked up, then the swap file, and then the whole computer just sort of hangs. "Sort of" means it takes 30 seconds to recognize a mouse click. I have aborted these tasks once I realized what was going on.

I don't think it matters how much memory you have. It will get all used up.

There appear to be tasks with this problem still being sent out. For all of the tasks that I have aborted (Linux) or that ended with a Computational Error (Windows), my wingmen have had similar problems.




ID: 45985 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 45986 - Posted: 19 Apr 2013, 22:02:40 UTC

We're still trying to figure this out.
Not all of the tasks are failing, and of those that have, not all have had this new error.

The "out of memory" error most likely refers to stack space, not the total memory in any computer.

ID: 45986 · Report as offensive     Reply Quote
Lockleys

Send message
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 45996 - Posted: 20 Apr 2013, 12:24:51 UTC

I have had two Out of Memory errors today and noticed that the address given for the errors is exactly the same for both, viz. Out Of Memory (C++ Exception) (0xe06d7363) at address 0x75EDC41F. They had been running in parallel but failed about an hour apart. I am using BOINC 7.0.62.
ID: 45996 · Report as offensive     Reply Quote
Profile Byron Leigh Hatch @ team Carl ...
Avatar

Send message
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 45997 - Posted: 20 Apr 2013, 16:35:17 UTC

ID: 45997 · Report as offensive     Reply Quote
Profile Byron Leigh Hatch @ team Carl ...
Avatar

Send message
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 46000 - Posted: 20 Apr 2013, 21:00:30 UTC - in response to Message 45986.  
Last modified: 20 Apr 2013, 21:17:02 UTC

Les Bayliss wrote:
We're still trying to figure this out.
Not all of the tasks are failing, and of those that have, not all have had this new error.
The "out of memory" error most likely refers to stack space, not the total memory in any computer.

Here is another one of those: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7560812F

hadcm3n_4kr6_1980_40_008353096_3

The model crashed approximately one hour ago.
ID: 46000 · Report as offensive     Reply Quote
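Since several people are now posting the exception code and address, here is a small sketch for pulling both out of the error text so reports can be compared. The two sample lines are the ones quoted in this thread; the regular expression assumes the wording "(0x...) at address 0x..." stays the same from report to report, which may not hold for other error types.

```python
import re
from collections import defaultdict

# Error lines as reported in this thread.
REPORTS = [
    "Out Of Memory (C++ Exception) (0xe06d7363) at address 0x75EDC41F",
    "Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7560812F",
]

# Assumption: the code and address always appear in this form.
REPORT_RE = re.compile(r"\((0x[0-9a-fA-F]+)\) at address (0x[0-9a-fA-F]+)")


def parse_report(text):
    """Return (exception_code, address) as integers, or None if not found."""
    match = REPORT_RE.search(text)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def addresses_by_code(reports):
    """Group the distinct fault addresses seen for each exception code."""
    grouped = defaultdict(set)
    for text in reports:
        parsed = parse_report(text)
        if parsed is not None:
            code, address = parsed
            grouped[code].add(address)
    return dict(grouped)


for code, addresses in addresses_by_code(REPORTS).items():
    print(hex(code), sorted(hex(a) for a in addresses))
```

With the two reports above, the exception code is shared but the addresses differ between hosts, so a matching address on one machine may be only a weak clue on its own.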

©2024 climateprediction.net