wah tasks failed

Profile Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 52536 - Posted: 11 Sep 2015, 13:04:16 UTC

I notice these tasks have all gone and the number of tasks in progress hasn't gone up enough to account for this. Have they been recalled?


Or perhaps not if a significant number are falling over.
ID: 52536 · Report as offensive     Reply Quote
Profile ritterm
Avatar

Send message
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 52538 - Posted: 11 Sep 2015, 16:17:04 UTC - in response to Message 52534.  

Richard Haselgrove wrote:
With the great variation in computer speeds, it's probably best to answer that in terms of progress made, rather than absolute time.

Ah, yes, that's a good point. I have three tasks that are as far along as 8% right now (and no trickles). I'll check after they get further along.
ID: 52538 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 925
Credit: 34,100,818
RAC: 11,270
Message 52539 - Posted: 11 Sep 2015, 17:05:28 UTC

And now I've got a 'Signal 11' crash of my own.

<result>
<name>wah2_eu2_c86m_1928_1_010155439_0</name>
<final_cpu_time>92076.300000</final_cpu_time>
<final_elapsed_time>95316.104979</final_elapsed_time>
<exit_status>0</exit_status>
<state>3</state>
<platform>windows_intelx86</platform>
<version_num>705</version_num>
<final_peak_working_set_size>327622656</final_peak_working_set_size>
<final_peak_swap_size>299331584</final_peak_swap_size>
<final_peak_disk_usage>11765</final_peak_disk_usage>
<stderr_out>
<![CDATA[
<stderr_txt>
Signal 11 received, exiting...

17:37:31 (12784): called boinc_finish(193)

Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=16120, iMonCtr=2

Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=15396, iMonCtr=2

Model crash detected, will try to restart...

Leaving CPDN_Main::Monitor...

17:37:43 (15396): called boinc_finish(0)

That one had been plodding along quietly, about 26.5 hours in and maybe 5% done.

Windows 7, nothing untoward shown in either the BOINC logs or the system Event Viewer. It does seem that 'Signal 11' is the default error message for these applications, whether it's a startup problem as others have reported, or a model crash well into the run.

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=18882396
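For anyone comparing several of these reports, a minimal sketch of pulling the `boinc_finish` exit codes out of a stderr dump like the one above. The helper names (`finish_calls`, `saw_signal_11`) are hypothetical, and the sample text is only an excerpt of the quoted log; the sketch assumes the line format stays exactly as shown.

```python
import re

# Excerpt of the stderr text quoted above (blank lines removed).
STDERR = """\
Signal 11 received, exiting...
17:37:31 (12784): called boinc_finish(193)
Model crash detected, will try to restart...
17:37:43 (15396): called boinc_finish(0)
"""

# Matches lines such as "17:37:31 (12784): called boinc_finish(193)".
FINISH_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) \((\d+)\): called boinc_finish\((-?\d+)\)", re.M
)


def finish_calls(stderr_text):
    """Return (time, pid, exit_code) for every boinc_finish line found."""
    return [(t, int(pid), int(code)) for t, pid, code in FINISH_RE.findall(stderr_text)]


def saw_signal_11(stderr_text):
    """True if the log contains the 'Signal 11 received' marker."""
    return "Signal 11 received" in stderr_text


calls = finish_calls(STDERR)
print(calls)
print(saw_signal_11(STDERR))
```

In this excerpt the first process exits with 193 and the controller (PID 15396) then exits with 0, which lines up with the `<exit_status>0</exit_status>` in the result above.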
ID: 52539 · Report as offensive     Reply Quote
Profile Alan K

Send message
Joined: 22 Feb 06
Posts: 484
Credit: 29,579,234
RAC: 4,572
Message 52540 - Posted: 11 Sep 2015, 18:03:26 UTC - in response to Message 52539.  

First zips on two of my models uploaded - timestep 11819 or thereabouts. Computing at 4.3s/ts on 3.5GHz i5, W7 64bit if it helps.
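As a rough check on what those figures imply, here is the arithmetic for the run time up to the first zip. It assumes the 4.3 s/ts rate held from the start of the run, which the post does not state; the variable names are illustrative only.

```python
# Figures as reported above; the timestep is "or thereabouts".
timesteps_to_first_zip = 11819
seconds_per_timestep = 4.3  # on a 3.5 GHz i5

seconds = timesteps_to_first_zip * seconds_per_timestep
hours = seconds / 3600
print(f"{seconds:.0f} s = {hours:.1f} h")
```

On a slower machine the same timestep count scales linearly with the seconds-per-timestep figure.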
ID: 52540 · Report as offensive     Reply Quote
Profile ritterm
Avatar

Send message
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 52541 - Posted: 11 Sep 2015, 18:21:42 UTC - in response to Message 52540.  
Last modified: 11 Sep 2015, 18:21:57 UTC

chavk (Alan) wrote:
First zips on two of my models uploaded - timestep 11819 or thereabouts...

Me, too. Sent somewhere between 8.1%-8.5% progress.
ID: 52541 · Report as offensive     Reply Quote
Profile JIM

Send message
Joined: 31 Dec 07
Posts: 1152
Credit: 22,053,321
RAC: 4,417
Message 52542 - Posted: 12 Sep 2015, 4:11:51 UTC

Task wah2_eu2_994i_1899_1_010151055_0 failed on my second fastest Win7 machine at 127,775.40 seconds CPU time. It appears to be a signal 11 error.

Stderr follows:

<core_client_version>7.4.42</core_client_version>
<![CDATA[
<stderr_txt>
Signal 11 received, exiting...
17:32:38 (3876): called boinc_finish(193)
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=5336, iMonCtr=2
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
17:33:06 (5336): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_2.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_3.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_4.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_5.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_6.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_7.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_8.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_9.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_10.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_11.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_12.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>

No trickles were sent.
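A minimal sketch for listing the files named in an `upload failure` message like the one above. The pasted fragment is not a complete XML document, so this uses a regular expression rather than an XML parser; `upload_errors` is a hypothetical helper and the sample holds only the first two entries.

```python
import re

# First two entries of the message quoted above.
MESSAGE = """\
upload failure: <file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_2.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_994i_1899_1_010151055_0_3.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
"""

# One match per <file_xfer_error> entry: file name, numeric code, reason text.
XFER_RE = re.compile(
    r"<file_xfer_error>\s*<file_name>(.*?)</file_name>\s*"
    r"<error_code>(-?\d+)\s*\((.*?)\)</error_code>",
    re.S,
)


def upload_errors(text):
    """Return (file_name, error_code, reason) for each reported transfer error."""
    return [(name, int(code), reason) for name, code, reason in XFER_RE.findall(text)]


errors = upload_errors(MESSAGE)
for name, code, reason in errors:
    print(name, code, reason)
```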

ID: 52542 · Report as offensive     Reply Quote
Profile Byron Leigh Hatch @ team Carl ...
Avatar

Send message
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 52544 - Posted: 12 Sep 2015, 18:15:23 UTC

Here is some information from my BOINC event log and my computer, in case it might help.

I am using BOINC 7.6.9 (x64), running as a single installation (not as a service).

OS Windows 10 Pro x64 - - - Intel Xeon CPU E5-2687W v3 @ 3.10GHz HT

No trickles were received on this wah2 work unit.

I can't remember what the progress was, but I think it was somewhere between 10% and 15%.

wah2_eu2_j59d_1995_1_010165525_0


9/12/2015 9:12:04 AM | climateprediction.net | Message from task: 0
9/12/2015 9:12:04 AM | climateprediction.net | Computation for task wah2_eu2_j59d_1995_1_010165525_0 finished
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_1.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_2.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_3.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_4.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_5.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_6.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_7.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_8.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_9.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_10.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_11.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_12.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_13.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
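For longer event logs, a minimal sketch of grouping the "absent" output files by task. It assumes the `date | project | message` layout shown above; `absent_outputs` is a hypothetical helper and the sample is a three-line excerpt.

```python
import re
from collections import defaultdict

# Three lines excerpted from the event log above.
LOG = """\
9/12/2015 9:12:04 AM | climateprediction.net | Computation for task wah2_eu2_j59d_1995_1_010165525_0 finished
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_1.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
9/12/2015 9:12:04 AM | climateprediction.net | Output file wah2_eu2_j59d_1995_1_010165525_0_2.zip for task wah2_eu2_j59d_1995_1_010165525_0 absent
"""

ABSENT_RE = re.compile(r"^Output file (\S+) for task (\S+) absent$")


def absent_outputs(log_text):
    """Map each task name to the list of output files reported as absent."""
    missing = defaultdict(list)
    for line in log_text.splitlines():
        # Split into timestamp, project, message; skip lines in another shape.
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue
        match = ABSENT_RE.match(parts[2])
        if match:
            missing[match.group(2)].append(match.group(1))
    return dict(missing)


missing = absent_outputs(LOG)
for task, files in missing.items():
    print(task, len(files), "absent")
```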


wah2_eu2_j59d_1995_1_010165525_0


<core_client_version>7.6.9</core_client_version>
<![CDATA[
<stderr_txt>
09:02:41 (1520): start_timer_thread(): CreateThread() failed, errno 0
09:02:42 (7148): start_timer_thread(): CreateThread() failed, errno 0
Signal 11 received, exiting...
09:11:55 (7148): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=1520, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=7148, selfPID=9996, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
09:12:02 (9996): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_1.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_2.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_3.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_4.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_5.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_6.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_7.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_8.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_9.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_10.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_11.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_12.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>wah2_eu2_j59d_1995_1_010165525_0_13.zip</file_name>
<error_code>-161 (not found)</error_code>
</file_xfer_error>

</message>
]]>

I hope this information helps somehow.
ID: 52544 · Report as offensive     Reply Quote
Profile astroWX
Volunteer moderator

Send message
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 52545 - Posted: 12 Sep 2015, 18:33:01 UTC

Trickles are, indeed, being posted. First four, so far -- Timesteps:
    11,819
    23,339
    34,859
    46,379
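The spacing between those four timesteps is constant, which a quick check shows. The projected next value is an assumption that the same spacing continues; it is not something reported in this thread.

```python
# Trickle timesteps as listed above.
trickle_timesteps = [11819, 23339, 34859, 46379]

# Difference between each consecutive pair.
intervals = [b - a for a, b in zip(trickle_timesteps, trickle_timesteps[1:])]
print(intervals)

# Assumption: the next trickle follows the same spacing.
next_expected = trickle_timesteps[-1] + intervals[-1]
print(next_expected)
```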


"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
ID: 52545 · Report as offensive     Reply Quote
Profile Byron Leigh Hatch @ team Carl ...
Avatar

Send message
Joined: 17 Aug 04
Posts: 289
Credit: 44,103,664
RAC: 0
Message 52546 - Posted: 12 Sep 2015, 18:52:46 UTC - in response to Message 52545.  

Yes, you are right, thanks for that. I still have thirty-nine (39) Weather At Home (wah2) v7.05 tasks crunching along at approx. 11% to 22% progress, and I am also receiving trickles on those 39 work units.
ID: 52546 · Report as offensive     Reply Quote
3rkko

Send message
Joined: 12 Feb 08
Posts: 66
Credit: 4,877,652
RAC: 0
Message 52548 - Posted: 12 Sep 2015, 19:23:24 UTC

i7-3770, Win10, BOINC 7.6.6 (not as a service), running 24/7
7 WAH2 WUs downloaded
5 crashed after a couple of minutes, Signal 11.
1 crashed after 17 hours, Signal 11, no trickles sent.
1 still going, 1 trickle sent at timestep 11819 / at 19 hours / less than 9% done.

i3-4330, Win10, BOINC 7.6.6 (as a service), running 24/7
3 WAH2 WUs downloaded
2 crashed after a couple of minutes, Signal 11.
1 still going, 1 trickle sent at timestep 11819 / at 19 hours / less than 9% done.
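Totalling the two machines, as a quick sketch. The counts are copied from the lists above, and the dictionary layout is just for illustration.

```python
# Counts as reported above: (downloaded, crashed, still running).
hosts = {
    "i7-3770": (7, 5 + 1, 1),  # 5 early crashes plus 1 crash after 17 hours
    "i3-4330": (3, 2, 1),
}

downloaded = sum(d for d, _, _ in hosts.values())
crashed = sum(c for _, c, _ in hosts.values())
running = sum(r for _, _, r in hosts.values())

crash_rate = crashed / downloaded
print(f"{crashed} of {downloaded} crashed ({crash_rate:.0%}), {running} still running")
```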
ID: 52548 · Report as offensive     Reply Quote
Profile ritterm
Avatar

Send message
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 52549 - Posted: 12 Sep 2015, 20:32:24 UTC

Grrr...These errors are getting annoying. I've gotten a couple of resends of tasks that crashed the first time. I think I'm going to stop polling for new work until we get some feedback.
ID: 52549 · Report as offensive     Reply Quote
Andrew Sanchez
Avatar

Send message
Joined: 28 May 14
Posts: 34
Credit: 705,936
RAC: 0
Message 52550 - Posted: 12 Sep 2015, 22:21:27 UTC

Workunit 10114084 is still running and up to 8.731% with no problems yet; trickle sent.
My other laptop has Workunits 10115351 and 10115880 running, both up to about 2.9%. No trickles sent from those ones yet.

Seems to have just been the first couple workunits that had the zip problem. But there's still a lot of data to be crunched yet.
ID: 52550 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 52551 - Posted: 13 Sep 2015, 1:35:42 UTC

There are a few ANZ tasks now, so perhaps switch to them for a while.

ID: 52551 · Report as offensive     Reply Quote
Andrew Sanchez
Avatar

Send message
Joined: 28 May 14
Posts: 34
Credit: 705,936
RAC: 0
Message 52553 - Posted: 13 Sep 2015, 2:54:22 UTC - in response to Message 52550.  

Workunit 10114084 is still running and up to 8.731% with no problems yet; trickle sent.
My other laptop has Workunits 10115351 and 10115880 running, both up to about 2.9%. No trickles sent from those ones yet.

Seems to have just been the first couple workunits that had the zip problem. But there's still a lot of data to be crunched yet.


EDIT: I spoke too soon... zip error on workunit 10115880. It must have been at about 3%. I still have 2 wah2 jobs on that laptop, and I have a feeling they will error out too. But I'll keep them running to see what happens.
ID: 52553 · Report as offensive     Reply Quote
Profile JIM

Send message
Joined: 31 Dec 07
Posts: 1152
Credit: 22,053,321
RAC: 4,417
Message 52554 - Posted: 13 Sep 2015, 4:52:27 UTC

I hate to say it, but I think the wah2 tasks are duds. I now have 3 failures out of 6 started. The only good news is that 3 are still running and one is up to 18%. Another is at 16%.

ID: 52554 · Report as offensive     Reply Quote
ed2353

Send message
Joined: 15 Feb 06
Posts: 137
Credit: 33,347,857
RAC: 0
Message 52555 - Posted: 13 Sep 2015, 10:06:10 UTC - in response to Message 52554.  

I still have 3 running on my Win 10 64bit computer using BOINC v7.2.33.
They have each sent 3 ZIPs successfully and are at 33% progress.

The time remaining estimate looks too short though, so they are taking longer than the original estimate.
ID: 52555 · Report as offensive     Reply Quote
metalius
Avatar

Send message
Joined: 28 Nov 06
Posts: 89
Credit: 11,328,674
RAC: 2,783
Message 52556 - Posted: 13 Sep 2015, 18:27:39 UTC

WAH 2: very slow, no graphics (for curious or inquisitive people) and enormous uploads... Got 7 of them and do not want more...
ID: 52556 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 52557 - Posted: 13 Sep 2015, 20:24:45 UTC - in response to Message 52556.  

Graphics are a thing of the past, and the big uploads are because the researchers want lots of data.
The big uploads are probably going to be a fixture too. One of the beta tests had uploads of over 100 MB each.

ID: 52557 · Report as offensive     Reply Quote
metalius
Avatar

Send message
Joined: 28 Nov 06
Posts: 89
Credit: 11,328,674
RAC: 2,783
Message 52558 - Posted: 13 Sep 2015, 22:16:16 UTC - in response to Message 52557.  

Graphics are a thing of the past...

Dear Les, you must remember... One type of CPDN task (or model) had the "cold world" bug, another type the negative pressure bug. We participants were able to find and report such abnormal or impossible situations in models with a single click of "Show Graphics". I do not know how useful our reports were for fixing those bugs, but I personally avoided wasting CPU time many times, because I was able to detect that a task was hopeless and abort it.
So I do not accept blind crunching here, in this project. Are you sure the project has gone the right way?
ID: 52558 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 52559 - Posted: 13 Sep 2015, 22:56:03 UTC - in response to Message 52558.  

I'm not part of the project, just another cruncher, with a few privileges.
The mods have raised this lack of graphics a few times, but unfortunately, it seems that this is how it is now.
Perhaps it will change again in the future, and perhaps not.


ID: 52559 · Report as offensive     Reply Quote