RAPIT tasks failing after few seconds

Dave Roberts

Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 49044 - Posted: 5 May 2014, 15:10:55 UTC

I've just had two consecutive RAPIT tasks fail after only a few seconds with 'Compute error' (16585904 & 16585822). I've been running RAPIT tasks for some time with no problems and have other models running OK at the moment.
Just wondering whether I have problems or whether it's the WUs. (No changes on my system.)
ID: 49044
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1079
Credit: 6,904,049
RAC: 6,657
Message 49045 - Posted: 5 May 2014, 17:01:10 UTC

The message (from an EU rather than RAPIT model):

execl(/Library/Application Support/BOINC Data/projects/climateprediction.net/hadam3p_eu_um_6.09_i686-apple-darwin, 159230) failed!


... is usually produced by the Mac permissions problem. There's no evidence of a BOINC Manager upgrade or restoration from backup that I can see, so if you can't get any models to start after a reboot then the next thing would be a project reset (or remove) to clear out all the downloaded files. Don't do that until the running models finish otherwise they'll be aborted.
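
As a rough way to look for the permissions problem before resorting to a project reset, the sketch below lists files in the project directory that look like model executables but have no execute bit set. The directory path is copied from the error message above; the function name `find_non_executable` and the `*apple-darwin` name pattern are assumptions made for illustration, and a missing execute bit is only one of the ways an `execl` call can fail.

```python
import stat
from pathlib import Path

# Path copied from the error message; adjust if the BOINC data lives elsewhere.
PROJECT_DIR = Path(
    "/Library/Application Support/BOINC Data/projects/climateprediction.net"
)


def find_non_executable(project_dir, pattern="*apple-darwin"):
    """Return regular files matching `pattern` that have no execute bit set.

    The pattern is a guess based on the executable names seen in the error
    messages (for example hadam3p_eu_um_6.09_i686-apple-darwin).
    """
    exec_bits = stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH
    missing = []
    for path in sorted(Path(project_dir).glob(pattern)):
        if path.is_file() and not (path.stat().st_mode & exec_bits):
            missing.append(path)
    return missing


if __name__ == "__main__":
    if PROJECT_DIR.is_dir():
        for path in find_non_executable(PROJECT_DIR):
            print("no execute permission:", path)
    else:
        print("project directory not found:", PROJECT_DIR)
```

An empty result does not rule out the permissions problem, since ownership or directory permissions could also be involved; it only excludes the simplest case.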
ID: 49045
Dave Roberts

Joined: 15 Jan 11
Posts: 175
Credit: 6,242,691
RAC: 699
Message 49046 - Posted: 5 May 2014, 19:46:44 UTC - in response to Message 49044.  
Last modified: 5 May 2014, 19:56:13 UTC

The ones I was referring to were further down the task information list (which wasn't in 'sent' date order?)

Mon 5 May 12:23:33 2014 climateprediction.net Starting task hadcm3n_890u_1980_40_008721113_0 using hadcm3n version 607
Mon 5 May 12:24:01 2014 climateprediction.net Computation for task hadcm3n_890u_1980_40_008721113_0 finished
Task id 16585904

Mon 5 May 13:46:13 2014 climateprediction.net Starting task hadcm3n_88yk_1980_40_008721031_0 using hadcm3n version 607
Mon 5 May 13:46:41 2014 climateprediction.net Computation for task hadcm3n_88yk_1980_40_008721031_0 finished
Task id 16585822

After my two current tasks finish I'll do a project reset anyway.

I've just looked at my task list a bit more thoroughly & realised that there have been more RAPIT failures than I'd realised. The last two happened as I was working and startled me.
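
For anyone wanting to spot such short-lived tasks without reading the whole event log, here is a minimal sketch that pairs the "Starting task" and "Computation for task ... finished" lines and reports the elapsed time. It assumes the log lines look exactly like the excerpts above; the function name `task_runtimes` and the regular expression are illustrative, and the `%a`/`%b` date parsing depends on an English locale.

```python
import re
from datetime import datetime

# The four log lines quoted in this post.
LOG = """\
Mon 5 May 12:23:33 2014 climateprediction.net Starting task hadcm3n_890u_1980_40_008721113_0 using hadcm3n version 607
Mon 5 May 12:24:01 2014 climateprediction.net Computation for task hadcm3n_890u_1980_40_008721113_0 finished
Mon 5 May 13:46:13 2014 climateprediction.net Starting task hadcm3n_88yk_1980_40_008721031_0 using hadcm3n version 607
Mon 5 May 13:46:41 2014 climateprediction.net Computation for task hadcm3n_88yk_1980_40_008721031_0 finished
"""

# Timestamp, project name, then either a start or a finish message.
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} \d{1,2} \w{3} \d{2}:\d{2}:\d{2} \d{4}) \S+ "
    r"(?:Starting task (?P<started>\S+)"
    r"|Computation for task (?P<finished>\S+) finished)"
)


def task_runtimes(log_text):
    """Map task name -> seconds between its start and finish lines."""
    starts, runtimes = {}, {}
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        when = datetime.strptime(match.group("stamp"), "%a %d %b %H:%M:%S %Y")
        if match.group("started"):
            starts[match.group("started")] = when
        elif match.group("finished") in starts:
            name = match.group("finished")
            runtimes[name] = (when - starts[name]).total_seconds()
    return runtimes


if __name__ == "__main__":
    for name, seconds in task_runtimes(LOG).items():
        print(f"{name}: {seconds:.0f} s")
```

On the two excerpts above this gives 28 seconds for each task, which matches the "few seconds" failures described.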
ID: 49046
Professor Desty Nova

Joined: 19 Sep 04
Posts: 92
Credit: 1,926,211
RAC: 254
Message 49059 - Posted: 7 May 2014, 6:40:17 UTC

Another bad batch...? :-(


Professor Desty Nova
Researching Karma the Hard Way
ID: 49059
Thyme Lawn
Volunteer moderator

Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 49062 - Posted: 7 May 2014, 10:21:09 UTC - in response to Message 49059.  

Another bad batch...? :-(

No. It's the same permissions problem that Iain suggested, this time on the HadCM3N worker process:

execl(/Library/Application Support/BOINC Data/projects/climateprediction.net/hadcm3n_um_6.07_i686-apple-darwin, 131270) failed!
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 49062
Professor Desty Nova

Joined: 19 Sep 04
Posts: 92
Credit: 1,926,211
RAC: 254
Message 49067 - Posted: 7 May 2014, 20:58:37 UTC
Last modified: 7 May 2014, 20:58:54 UTC

I was kind of asking/stating because I had just downloaded a RAPIT unit, and it crashed after a few seconds (on Windows).

Stderr:

<core_client_version>7.2.42</core_client_version>
<![CDATA[
<message>
The device does not recognize the command.
(0x16) - exit code 22 (0x16)
</message>
<stderr_txt>

Model crashed: INITTIME: Atmosphere basis time mismatch


Professor Desty Nova
Researching Karma the Hard Way
ID: 49067
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 49068 - Posted: 7 May 2014, 21:31:54 UTC - in response to Message 49067.  
Last modified: 7 May 2014, 21:38:42 UTC

There are several reasons for crashes, some of which are to do with the computer in question, and others which indicate a problem with the data set.
So it's necessary to see/quote the error to know which is which.

The INITTIME error means there's a problem with the data files.

PS
The 3 that I'm running DON'T have a problem, but they're all re-sends, so the other computers that tried to run them have problems of their own.
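
The distinction drawn in this post can be sketched as a small lookup that sorts a quoted error into "computer" or "data" causes. It covers only the two errors diagnosed in this thread; the table `KNOWN_ERRORS` and the function `classify_error` are hypothetical names, and any other error text is reported as unknown rather than guessed at.

```python
# Each entry: (substrings that must all appear, likely cause).
# Both diagnoses come from this thread; nothing else is assumed.
KNOWN_ERRORS = [
    (("INITTIME",), "data files (problem with the work unit)"),
    (("execl(", "failed!"), "host computer (Mac permissions problem)"),
]


def classify_error(stderr_text):
    """Return the likely cause for a quoted error message."""
    for markers, cause in KNOWN_ERRORS:
        if all(marker in stderr_text for marker in markers):
            return cause
    return "unknown (quote the full error on the forum)"


if __name__ == "__main__":
    samples = [
        "Model crashed: INITTIME: Atmosphere basis time mismatch",
        "execl(/Library/Application Support/BOINC Data/projects/"
        "climateprediction.net/hadcm3n_um_6.07_i686-apple-darwin, 131270) failed!",
        "some other message",
    ]
    for text in samples:
        print(classify_error(text), "<-", text[:40])
```

Substring matching is deliberately crude here: it is enough to separate the two cases seen above, but a real stderr file may contain several messages and would need to be read in full.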
ID: 49068
rbpeake

Joined: 27 Feb 08
Posts: 41
Credit: 1,402,356
RAC: 0
Message 49071 - Posted: 8 May 2014, 19:10:15 UTC

It appears that approximately 3,000 of these RAPIT units were taken off the server in the past few days. Are the ones I am crunching still valid?

Thanks!
Regards,
Bob P.
ID: 49071
Ananas
Volunteer moderator

Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 49072 - Posted: 8 May 2014, 19:32:21 UTC - in response to Message 49071.  

Have a look at this thread. I checked your active ones and didn't find that message, so yours are still needed.
ID: 49072
