Errors in new HADAM3P_ ANZ Tasks

ChrisD

Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53245 - Posted: 13 Jan 2016, 11:24:20 UTC

I have just started computing the new batch of HADAM3P_ANZ tasks, but both my machines have errors.
My i7-3770K has ditched 3 tasks, all within a few seconds, but is processing 4 tasks OK.
My Xeon 6790 has trashed 4 and is running two...

Plenty of HDD space and plenty of RAM.
The i7 has 16 GB and the Xeon has 24 GB, the maximum RAM possible.

It happens so fast that I cannot see what goes wrong :(

Is it only me??

ChrisD

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,376,846
RAC: 3,590
Message 53246 - Posted: 13 Jan 2016, 11:44:58 UTC - in response to Message 53245.  

Clicking on the plus next to stderr on at least one of the failed tasks shows a replanca error, which points to a problem with the task rather than with the computer it is running on.
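
For anyone triaging several failed tasks at once, the same check can be sketched in Python. This is only an illustration: the function name and the labels are made up, the marker strings are the ones quoted in this thread, and the script works on stderr text you have copied from the task page rather than calling any CPDN or BOINC interface.

```python
# Hypothetical triage helper: looks for failure markers in stderr text
# copied from a task's result page. The marker strings come from this
# thread; the labels are illustrative only.
SIGNATURES = [
    ("replanca", "REPLANCA error (reported here as a task problem)"),
    ("signal 11 received", "signal 11 received by the model"),
    ("no heartbeat", "lost heartbeat from the BOINC client"),
    ("model crash detected", "model crash (cause not stated)"),
]


def classify_stderr(text):
    """Return the labels of all known markers found in the text."""
    lowered = text.lower()
    return [label for marker, label in SIGNATURES if marker in lowered]


if __name__ == "__main__":
    sample = (
        "Signal 11 received, exiting...\n"
        "Model crash detected, will try to restart...\n"
    )
    for label in classify_stderr(sample):
        print(label)
```

A task whose stderr matches only the REPLANCA marker is, going by the above, unlikely to be fixed by anything done on the host.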

ChrisD
Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53247 - Posted: 13 Jan 2016, 12:08:36 UTC - in response to Message 53246.  
Last modified: 13 Jan 2016, 12:09:29 UTC

Just got 4 more tasks. One was trashed in 20 secs flat; another started OK, but I had to suspend processing for a while.

1/13/2016 12:58:48 PM | | Suspending computation - user request
1/13/2016 1:00:26 PM | | Resuming computation
1/13/2016 1:00:40 PM | climateprediction.net | Computation for task hadam3p_anz_h01n_201112_12_287_010252705_1 finished
1/13/2016 1:00:40 PM | climateprediction.net | Output file hadam3p_anz_h01n_201112_12_287_010252705_1_1.zip for task hadam3p_anz_h01n_201112_12_287_010252705_1 absent
etc.
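
As a rough aid for reading Event Log excerpts like the one above, here is a small Python sketch that splits each line into timestamp, project and message, and measures the time between two events. It assumes the US-style date and 12-hour clock shown in the excerpt (the log format may differ with other regional settings), and the function names are hypothetical.

```python
from datetime import datetime


def parse_event_line(line):
    """Split 'timestamp | project | message' into its three parts."""
    stamp, project, message = (part.strip() for part in line.split("|", 2))
    # Assumption: month/day/year with AM/PM, as in the excerpt above.
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message


def seconds_between(lines, start_text, end_text):
    """Seconds from the first line containing start_text to the next
    line containing end_text, or None if either is missing."""
    start = end = None
    for line in lines:
        when, _project, message = parse_event_line(line)
        if start is None and start_text in message:
            start = when
        elif start is not None and end_text in message:
            end = when
            break
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


if __name__ == "__main__":
    log = [
        "1/13/2016 12:58:48 PM | | Suspending computation - user request",
        "1/13/2016 1:00:26 PM | | Resuming computation",
        "1/13/2016 1:00:40 PM | climateprediction.net | Computation for task "
        "hadam3p_anz_h01n_201112_12_287_010252705_1 finished",
    ]
    print(seconds_between(log, "Resuming computation", "finished"))
```

For the three lines quoted above, the gap between resuming and the task finishing is 14 seconds.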

This task had been running for 16 min 33 secs...

Tasks running now: 5, of which 4 have passed one hour of computing time and one of the new ones has reached 5 minutes.
One task in the queue, destiny unknown...

I wish I could do some more debugging.
Any suggestions?

ChrisD (pulling his hair out)

ChrisD
Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53248 - Posted: 13 Jan 2016, 12:11:10 UTC - in response to Message 53246.  

Quote: "Clicking on the plus next to stderr on at least one of the failed tasks shows a replanca error, which points to a problem with the task rather than with the computer it is running on."


Thanks :)

That is comforting to know.

ChrisD

bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 24,487,746
RAC: 3,014
Message 53250 - Posted: 13 Jan 2016, 13:09:56 UTC - in response to Message 53246.  

Hi folks,
I can confirm that 3 of the current HADAM3P_ANZ tasks crashed with a replanca error (and I'm the second user on whose machine these WUs failed). Two others have been running for more than an hour with no errors.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53253 - Posted: 13 Jan 2016, 17:59:37 UTC - in response to Message 53245.  
Last modified: 13 Jan 2016, 18:55:17 UTC

I have had five HADAM3P_ ANZ tasks fail with "REPLANCA" errors on two machines running Win7 64-bit (BOINC 7.6.22), all in 11 or 12 seconds. And they failed on all other machines that have run them thus far (usually two others by now). But three other Australia/New Zealand tasks are running on one of my machines after 3 to 6 hours, so they will probably do OK. Interestingly, all the REPLANCA errors are on tasks dated 1996 or earlier, while the ones still running are dated 2002 or 2003, if that is any help.

And I have had one task (dated 2007) fail with just a "Model crash detected, will try to restart..." after 14 seconds, but it is still running on one other machine.
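
If it helps anyone check the pattern by model year, here is a rough Python sketch that groups task outcomes by the year in the task name. It rests on an assumption: that the fourth underscore-separated field is a YYYYMM start date, as it appears to be in hadam3p_anz_h01n_201112_12_287_010252705_1 from earlier in the thread. The function names are hypothetical and the task names in the example are made up.

```python
import re
from collections import defaultdict

# Assumption: the fourth underscore-separated field is a YYYYMM start date.
NAME_RE = re.compile(r"^hadam3p_anz_[0-9a-z]+_(\d{4})(\d{2})_")


def model_year(task_name):
    """Return the year embedded in the task name, or None if not found."""
    match = NAME_RE.match(task_name)
    return int(match.group(1)) if match else None


def outcomes_by_year(results):
    """Group (task_name, outcome) pairs into {year: [outcome, ...]}."""
    table = defaultdict(list)
    for name, outcome in results:
        table[model_year(name)].append(outcome)
    return dict(table)


if __name__ == "__main__":
    # Made-up task names, for illustration only.
    results = [
        ("hadam3p_anz_aaaa_199512_12_000_000000001_0", "replanca"),
        ("hadam3p_anz_bbbb_200212_12_000_000000002_0", "running"),
    ]
    for year, outcomes in sorted(outcomes_by_year(results).items()):
        print(year, outcomes)
```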

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 53254 - Posted: 13 Jan 2016, 19:43:32 UTC - in response to Message 53253.  

Quote: "And I have had one task (dated 2007) fail with just a "Model crash detected, will try to restart..." after 14 seconds, but it is still running on one other machine."


It's possible that it's running on another machine, but it may just have been downloaded and is waiting to run, or hasn't otherwise started for some other reason. We won't know for sure until either a trickle is returned, or it fails on that other host as well.

Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53256 - Posted: 13 Jan 2016, 20:41:31 UTC - in response to Message 53254.  

Quote: "We won't know for sure until either a trickle is returned, or it fails on that other host as well."

It seems to be just sitting in the cache; no trickles thus far. It is now the only one I really need to keep watch of (with apologies to Oxford English).
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=19188042

ChrisD
Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53257 - Posted: 14 Jan 2016, 8:43:47 UTC
Last modified: 14 Jan 2016, 8:44:33 UTC

Re the failing task mentioned in msg. 53247.

I suspended CPU computation using BOINC Manager, and when I re-enabled it the task died.

Here is the stderr. Can anybody tell me what happened here?


<core_client_version>7.6.9</core_client_version>
<![CDATA[
<stderr_txt>
CPDN Monitor - Quit request from BOINC...
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=2592, selfPID=2592, iMonCtr=2
Global Worker:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=1
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=0, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=6620, selfPID=6352, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
Called boinc_finish

</stderr_txt>

This is not the first task that has been trashed when CPU computation was suspended, so there is a problem somewhere. If that could be solved, we might save quite a few resends.
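
To compare several such stderr dumps side by side, the monitor lines can be tabulated with a short Python sketch. The regular expression is written against the exact wording of the lines above; what the fields (bRetVal, checkPID, selfPID, iMonCtr) actually mean is not documented in this thread, so the parser only extracts them. The function name is hypothetical.

```python
import re

# Matches lines such as:
# "Controller:: CPDN process is not running, exiting, bRetVal = 1,
#  checkPID=6620, selfPID=6352, iMonCtr=1"
MONITOR_RE = re.compile(
    r"^(?P<role>[\w ]+?):: CPDN process is not running, exiting, "
    r"bRetVal = (?P<bRetVal>\d+), checkPID=(?P<checkPID>\d+), "
    r"selfPID=(?P<selfPID>\d+), iMonCtr=(?P<iMonCtr>\d+)$"
)


def parse_monitor_lines(stderr_text):
    """Return one dict per monitor line found in the stderr text."""
    rows = []
    for line in stderr_text.splitlines():
        match = MONITOR_RE.match(line.strip())
        if match:
            row = {"role": match.group("role")}
            for key in ("bRetVal", "checkPID", "selfPID", "iMonCtr"):
                row[key] = int(match.group(key))
            rows.append(row)
    return rows


if __name__ == "__main__":
    text = (
        "CPDN Monitor - Quit request from BOINC...\n"
        "Regional Worker:: CPDN process is not running, exiting, "
        "bRetVal = 1, checkPID=2592, selfPID=2592, iMonCtr=2\n"
        "Controller:: CPDN process is not running, exiting, "
        "bRetVal = 1, checkPID=6620, selfPID=6352, iMonCtr=1\n"
    )
    for row in parse_monitor_lines(text):
        print(row)
```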

ChrisD

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,376,846
RAC: 3,590
Message 53259 - Posted: 14 Jan 2016, 12:10:47 UTC

The ANZ tasks have now been fixed and resent.

ChrisD
Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53261 - Posted: 14 Jan 2016, 20:22:30 UTC - in response to Message 53259.  

Great :)

Thanks.

ChrisD

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 53262 - Posted: 14 Jan 2016, 20:46:42 UTC - in response to Message 53259.  

That was staff's claim, Dave, but not all are healthy:
<stderr_txt>
09:22:07 (4864): start_timer_thread(): CreateThread() failed, errno 0
Signal 11 received, exiting...
Called boinc_finish
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=3004, selfPID=3004, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=3004, selfPID=76, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
Called boinc_finish

</stderr_txt>

"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,376,846
RAC: 3,590
Message 53266 - Posted: 15 Jan 2016, 9:31:35 UTC - in response to Message 53262.  
Last modified: 15 Jan 2016, 9:53:56 UTC

Quote: "That was staff's claim, Dave, but not all are healthy:"


Shucks. My suspicion is that there are two different problems around and only one of them has been fixed.

Edit: I see from my email that the problem was a combination of files and settings that had each worked individually before but were not compatible with each other. The same email says that none of the re-submitted tasks have failed three times yet...

ChrisD
Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53268 - Posted: 16 Jan 2016, 9:15:51 UTC
Last modified: 16 Jan 2016, 9:18:03 UTC

At half past four this morning, my internet went down.
When I checked up on my crunchers this morning at 8, I found that my I7 cruncher had trashed all 7 running CPDN Tasks plus the 7 tasks that I had waiting in the queue.

The BOINC Event Log was not long enough to tell me what had happened. All tasks seem to have died due to No Heartbeat...

My Xeon cruncher has survived and can tell me a little more. (Maybe because it has only 6 tasks running?)

When the internet went down, CPDN tasks could no longer upload their trickles. When this happens, they keep trying every 2 minutes.
Seven CPDN tasks that could not upload their trickles seem to have trashed BOINC, which lost control completely.

This is a devastating blow, and I will withdraw this machine from CPDN until I have found out what really happened.

Any help would be appreciated.
Thanks

ChrisD

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,376,846
RAC: 3,590
Message 53269 - Posted: 16 Jan 2016, 9:50:33 UTC - in response to Message 53268.  
Last modified: 16 Jan 2016, 9:55:58 UTC

By default, I have started turning off internet access in BOINC Manager while crunching, as it makes it easier for me to monitor what is happening (size of uploads etc.). I did this primarily because the information was useful to the beta site people, but that site is currently not running. I have not heard of repeated attempts to access the internet crashing tasks before. I am also new to running Windows tasks; I am running 4 using Wine for the first time at the moment.

Edit: I see that some of your failures are resends having failed on another machine already and one has failed on two others. I wouldn't be in too much of a hurry to write off your i7 till more information comes in from other crunchers.

Alan K
Joined: 22 Feb 06
Posts: 484
Credit: 29,579,234
RAC: 4,572
Message 53273 - Posted: 16 Jan 2016, 23:39:03 UTC - in response to Message 53268.  

You might find a longer event log in the BOINC folder under ProgramData. It's called stdoutdae and is a text file. Don't forget that the ProgramData folder is hidden by default.
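
If the file is long, a small Python sketch like the following can pull out the relevant lines. It assumes the default Windows data directory (ProgramData\BOINC) and the file name stdoutdae.txt; adjust the path if BOINC was installed with a custom data folder. The function names and the default keywords are hypothetical.

```python
import os
from pathlib import Path


def default_log_path():
    """Guess the BOINC log location on a default Windows install."""
    # Assumption: PROGRAMDATA is set, or falls back to C:\ProgramData.
    base = os.environ.get("PROGRAMDATA", r"C:\ProgramData")
    return Path(base) / "BOINC" / "stdoutdae.txt"


def find_lines(path, keywords=("heartbeat", "exited")):
    """Return (line_number, text) for lines containing any keyword."""
    wanted = [keyword.lower() for keyword in keywords]
    hits = []
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            lowered = line.lower()
            if any(keyword in lowered for keyword in wanted):
                hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    log_path = default_log_path()
    if log_path.exists():
        for number, text in find_lines(log_path):
            print(number, text)
    else:
        print("Log not found at", log_path)
```

The search is case-insensitive, so "No Heartbeat" and "no heartbeat" are both found.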

ChrisD
Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53276 - Posted: 17 Jan 2016, 11:51:28 UTC - in response to Message 53273.  
Last modified: 17 Jan 2016, 11:52:57 UTC

Quote: "You might find a longer event log in the BOINC folder under ProgramData. It's called stdoutdae and is a text file. Don't forget that the ProgramData folder is hidden by default."


Tnx :)

Found it, and it told me the same sad story.
After the internet went down, BOINC kept trying every minute to upload trickles to CPDN.
Finally it simply lost patience and started trashing each and every CPDN task in a rage. The log shows 'Reporting xx completed Tasks' repeatedly, together with result .zip files, none of which ever made it out from here.

Strangely enough, SETI Beta, which runs on my GPU, survived. A lot of results were waiting to be uploaded, but when the internet was back the result queue cleared.

It seems that BOINC and CPDN combined do not tolerate a bad internet connection. :(

ChrisD

All tasks were trashed with the excuse, no Heartbeat. Maybe someone should make this service a bit more robust??

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53280 - Posted: 17 Jan 2016, 12:23:09 UTC - in response to Message 53276.  

I think that the "No heartbeat" stuff is when your computer gets very busy, and BOINC "can't get a word in edgewise". So it gives up waiting, and Aborts things.

Perhaps an anti virus is grabbing every file before it's allowed to run so that it can be checked?
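
The general idea behind a "no heartbeat" abort can be sketched as a simple watchdog in Python. This is a generic illustration of the pattern, not BOINC's actual code: the class name is hypothetical and the 30-second timeout is an assumed value, not one taken from this thread.

```python
import time


class HeartbeatWatchdog:
    """Tracks when the last heartbeat arrived and reports expiry."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        # timeout_s is an assumed value for illustration only.
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        """Record that a heartbeat has just been received."""
        self._last_beat = self._clock()

    def expired(self):
        """True if no heartbeat has arrived within the timeout."""
        return self._clock() - self._last_beat > self.timeout_s


if __name__ == "__main__":
    # A fake clock lets the behaviour be shown without waiting.
    now = [0.0]
    dog = HeartbeatWatchdog(timeout_s=30.0, clock=lambda: now[0])
    now[0] = 10.0
    print(dog.expired())  # heartbeat still recent
    now[0] = 45.0
    print(dog.expired())  # too long since the last heartbeat
```

If the sender of the heartbeat is starved of CPU time or blocked, the receiver sees the same silence as if the sender had died, which is why a very busy machine can trigger this kind of abort.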

ChrisD
Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53288 - Posted: 20 Jan 2016, 9:24:40 UTC - in response to Message 53280.  
Last modified: 20 Jan 2016, 9:28:54 UTC

Sorry, but my Xeon cruncher just trashed a task 5 trickles down the list.

BUT: this machine has just a vanilla Windows 7 install. Nothing besides BOINC and BOINC-Tasks is installed.
Still, this heartbeat error trashes my CPDN tasks at random.

Am I really the only one fighting these errors? If I am and nothing is done about it, I am not sure I can afford to compute for the trashcan much longer.

Sorry.

ChrisD

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 53289 - Posted: 20 Jan 2016, 20:27:51 UTC - in response to Message 53288.  

About 12 hours ago, my Haswell, running Linux with a Windows XP clone running under Wine, completed 4 ANZ models with no drama.
Also, I don't run them on every core, so that the OS has some processors free if it wants to do something. In this case, I use only the real cores and not the hyper-threaded ones.

The lack of anyone else posting about problems with them could be due to them being "set and forget", or because no one else is having a problem. One would need access to the database to see.

But it does look as though you're the only one.


©2024 climateprediction.net