computation error at 100% complete

Questions and Answers : Unix/Linux : computation error at 100% complete

nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 62809 - Posted: 1 Nov 2020, 17:15:27 UTC

So I left the machine with 15 mins to go before completion on task hadam4h_b0tg_201211_5_877_012029238_0.
When I returned, it had failed with a computation error.
It says that the w/u returned 5 trickles.
There was a street-wide power failure a couple of weeks ago which shut the machine(s) down, but all w/u's resumed OK.

I have 3 more due to finish in the next day. It would be a shame if they all failed the same way.

I am beginning to suspect a shortage of disk space on this machine... is this what the stderr_txt is actually trying to say?
Ta
Nairb


<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
16:13:32 (1925): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam4h_b0tg_201211_5_877_012029238_0_r1378671937_2.zip</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam4h_b0tg_201211_5_877_012029238_0_r1378671937_5.zip</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>
ID: 62809
WB8ILI

Joined: 1 Sep 04
Posts: 161
Credit: 81,421,805
RAC: 1,225
Message 62810 - Posted: 1 Nov 2020, 18:20:39 UTC
Last modified: 1 Nov 2020, 18:24:25 UTC

I have had the same thing with the "file size too big" message. Maybe it is semi-normal.

I am 99.999% sure it is NOT a disk space issue.
ID: 62810
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 62811 - Posted: 1 Nov 2020, 19:23:23 UTC
Last modified: 1 Nov 2020, 19:50:03 UTC

I am 99.999% sure it is NOT a disk space issue.


I will let the project know. I am almost certain it is a setting in one of the files downloaded for the task that needs increasing. If I can work out which one it is, I can post a fix for tasks still running, for those who don't mind getting their hands dirty. If all five zips have gone, you will get all the credit, but it would be good to stop the tasks going out again only for the same error to occur at the end. Hopefully the fix can be applied before the 2072 batch for this series goes out.

Edit: With a text editor that saves the file as plain text without adding any end-of-file characters, I have opened client_state.xml and, for all of the batch 877 and 878 tasks on my machine, looked for <rsc_disk_bound> and doubled the value from 2000000000.000000 to 4000000000.000000.
My tasks which might be affected have only just started, so it will be about a week till I know whether it works or not. In the meantime I will let Andy know that this is a problem.
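
For anyone who would rather script that edit than do it by hand, a rough sketch along these lines should do the same thing in Python. Run it only with the BOINC client completely stopped, keep the backup it makes, and note that the client_state.xml path is an assumption for a typical Linux package install - yours may live elsewhere.

import shutil

STATE = "/var/lib/boinc-client/client_state.xml"   # assumed default Linux location
shutil.copy2(STATE, STATE + ".bak")                # keep a backup before editing

with open(STATE) as f:
    text = f.read()

# Double rsc_disk_bound from 2 GB to 4 GB wherever it appears at the old value.
# This touches every task carrying that value, not only batch 877/878, so
# narrow it down by hand if you only want those batches changed.
text = text.replace("<rsc_disk_bound>2000000000.000000</rsc_disk_bound>",
                    "<rsc_disk_bound>4000000000.000000</rsc_disk_bound>")

with open(STATE, "w") as f:
    f.write(text)
print("rsc_disk_bound values doubled - now restart the BOINC client")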
ID: 62811
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 62812 - Posted: 1 Nov 2020, 19:31:02 UTC

I lost 3 from the same batch a week ago, so it looks like a bad batch. :(
ID: 62812
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 62813 - Posted: 1 Nov 2020, 19:50:54 UTC

Well, this is concerning.

Three of my batch 878 tasks have trickled once/uploaded a zip file, and in the upload section,

max_nbytes = 150000000 (150 MB)

and the uploaded _1.zip files were over 200 MB.
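
If anyone wants to check whether their own project directory is already holding a zip over that limit, a quick look along these lines will list them (the data directory path is an assumption for a default Linux install - adjust it to wherever your BOINC data lives):

import glob, os

LIMIT = 150_000_000   # the max_nbytes value quoted above (150 MB)
PROJ = "/var/lib/boinc-client/projects/climateprediction.net"   # assumed path

for path in sorted(glob.glob(os.path.join(PROJ, "*.zip"))):
    size = os.path.getsize(path)
    if size > LIMIT:
        print(f"{os.path.basename(path)}: {size:,} bytes - over the 150 MB limit")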
ID: 62813
nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 62814 - Posted: 1 Nov 2020, 22:36:33 UTC
Last modified: 1 Nov 2020, 22:37:35 UTC

On a separate machine I have just had the same thing happen to
hadam4h_b0cw_200811_5_877_012028642_0

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
22:21:59 (31404): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam4h_b0cw_200811_5_877_012028642_0_r803748994_5.zip</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>
]]>

This machine had 11.6 GB of free disk space.
The final zip file was 193.11 MB.

I did not get time to check the <rsc_disk_bound> value
ID: 62814
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 62815 - Posted: 1 Nov 2020, 22:38:00 UTC - in response to Message 62813.  

Well, this is concerning.

Three of my batch 878 tasks have trickled once/uploaded a zip file, and in the upload section,

max_nbytes = 150000000 (150 MB)

and the uploaded _1.zip files were over 200 MB.


I will try increasing max_nbytes as well. But my 877's are all resends that have just started a few hours ago.
ID: 62815
nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 62816 - Posted: 1 Nov 2020, 23:22:21 UTC - in response to Message 62811.  


Edit: With a text editor that saves the file as plain text without adding any end-of-file characters, I have opened client_state.xml and, for all of the batch 877 and 878 tasks on my machine, looked for <rsc_disk_bound> and doubled the value from 2000000000.000000 to 4000000000.000000.
My tasks which might be affected have only just started, so it will be about a week till I know whether it works or not. In the meantime I will let Andy know that this is a problem.


OK, it's been done on the 2 machines. There are 3 877's to complete in the next 20 hrs. I do hope they are more successful.
ID: 62816
Profile Alan K

Joined: 22 Feb 06
Posts: 484
Credit: 29,579,234
RAC: 4,572
Message 62817 - Posted: 1 Nov 2020, 23:33:10 UTC - in response to Message 62812.  

I had 2 from batch 877 that have finished OK with no errors. Third one on the go, along with 2 from 878. They finish in about 12 days' time!
ID: 62817
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 62818 - Posted: 2 Nov 2020, 1:36:28 UTC - in response to Message 62817.  

That's good to hear. I'll pass that along too.

This batch must have some members that are set to produce just enough extra data to cause problems.
ID: 62818
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 62819 - Posted: 2 Nov 2020, 3:00:07 UTC - in response to Message 62818.  

I had 10 out of 10 batch 877 tasks finish successfully.
ID: 62819
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 62820 - Posted: 2 Nov 2020, 9:40:11 UTC - in response to Message 62816.  
Last modified: 2 Nov 2020, 10:06:49 UTC

Looks like it is max_nbytes rather than the other one. I am not sure whether the value needs to be increased for zips 1-5, out.zip and restart.zip, which would mean 7 changes for each task. Not sure I want to do that for 11 tasks - one mistake could potentially knock out a lot of work. Edit: finding and replacing <max_nbytes>150000000.000000</max_nbytes> with <max_nbytes>300000000.000000</max_nbytes> and hitting replace all should sort it.
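
If you would rather script that than trust an editor's replace-all, a sketch like this does the same substitution and reports how many entries it touched - again only with the client stopped, a backup kept, and the same assumption about where client_state.xml lives:

import shutil

STATE = "/var/lib/boinc-client/client_state.xml"   # assumed default location
OLD = "<max_nbytes>150000000.000000</max_nbytes>"
NEW = "<max_nbytes>300000000.000000</max_nbytes>"

shutil.copy2(STATE, STATE + ".bak")                # backup before touching anything

with open(STATE) as f:
    text = f.read()

count = text.count(OLD)                            # how many upload limits will change
with open(STATE, "w") as f:
    f.write(text.replace(OLD, NEW))

print(f"changed {count} max_nbytes entries - restart the client")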

The project are thinking about different options i.e. just changing things on tasks still to go out, aborting tasks in progress and resending rather than wasting crunching time etc.
ID: 62820
Richard Haselgrove

Joined: 1 Jan 07
Posts: 925
Credit: 34,100,818
RAC: 11,270
Message 62821 - Posted: 2 Nov 2020, 11:57:45 UTC

The individual file sizes, and the total disk usage, are different measures, and are treated differently.

From the BOINC code, talking about the individual files:

// Note: this is only checked when the application finishes.
// The total disk space is checked while the application is running.

So, the intermediate result files can upload without problems while the app is running, but any left over at the end may cause it to fail with ERR_FILE_TOO_BIG.
The intermediate files are certainly too big for the current run:

<file>
    <name>hadam4h_c0ap_206511_5_878_012030243_0_r1407367359_1.zip</name>
    <nbytes>202200658.000000</nbytes>
    <max_nbytes>150000000.000000</max_nbytes>
    <md5_cksum>9a24fd25f8124d69316920a39abb2f31</md5_cksum>
    <status>0</status>
    <uploaded/>
    <upload_url>http://upload3.cpdn.org/cgi-bin/file_upload_handler</upload_url>
</file>
but as you can see, that one went through OK.

Towards the end of the run, the app will create a fifth intermediate zip file. If you can, it would be wise to allow that one to finish uploading before the task completes.

Two more files are created at the very end - a restart.zip and an out.zip. I believe those files are significantly smaller, and should cause no problems at all.
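
Put loosely in code (this is a Python paraphrase of the behaviour described above, not the actual C++ in the BOINC client), the two limits are applied at different points in the task's life:

ERR_FILE_TOO_BIG = -131   # the error code seen in the reports above

def check_outputs_at_finish(output_files):
    # Per-file sizes are only compared with max_nbytes once the app has
    # called boinc_finish(); intermediate zips already uploaded are gone
    # by then and are not counted.
    for nbytes, max_nbytes in output_files:
        if nbytes > max_nbytes:
            return ERR_FILE_TOO_BIG    # the task ends as a computation error
    return 0

def disk_usage_ok_while_running(slot_usage_bytes, rsc_disk_bound):
    # Total disk usage, by contrast, is checked while the task is running.
    return slot_usage_bytes <= rsc_disk_bound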
ID: 62821
nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 62822 - Posted: 2 Nov 2020, 12:02:38 UTC - in response to Message 62820.  

Edit: finding and replacing <max_nbytes>150000000.000000</max_nbytes> with <max_nbytes>300000000.000000</max_nbytes> and hitting replace all should sort it.


It's been done... it does seem that only some w/u's are affected. Better to find out it's not a machine problem, though.

Ta
Nairb
ID: 62822
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 62825 - Posted: 2 Nov 2020, 14:23:59 UTC
Last modified: 2 Nov 2020, 15:20:36 UTC

All tasks waiting to go have been withdrawn.

Edit: just seen that those already out there will be left to run rather than an abort signal being sent out by the server. So those of us who are prepared to mess with the system (at our own risk!) will not have our effort wasted.
ID: 62825
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 62826 - Posted: 2 Nov 2020, 15:41:13 UTC - in response to Message 62825.  
Last modified: 2 Nov 2020, 15:58:03 UTC

Edit: just seen that those already out there will be left to run rather than an abort signal being sent out by the server. So those of us who are prepared to mess with the system (at our own risk!) will not have our effort wasted.

Good decision (for me at any rate).
I have updated all three of my machines before the first trickle. Only one was running 878, and the other two 879 (if it is affected). They should be good to go.
ID: 62826
nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 62829 - Posted: 2 Nov 2020, 19:27:02 UTC

Well, the fix had been applied to client_state.xml - both max_nbytes and rsc_disk_bound, all the lines that needed changing. Maybe I made an error, but the next w/u to finish just now also failed with...
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
19:06:44 (1926): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam4h_b0ft_200911_5_877_012028747_0_r1809679561_5.zip</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>
]]>

There is another 877 due to finish in a couple of hrs, followed by a bunch of 878's. I am beginning to think it is not worth letting any of these w/u's run. Maybe abort the entire lot and wait for a fixed batch. Little point in waiting 20(ish) days to find they fail as well...
ID: 62829
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 62830 - Posted: 2 Nov 2020, 20:07:49 UTC - in response to Message 62829.  

There is another 877 due to finish in a couple of hrs, followed by a bunch of 878's. I am beginning to think it is not worth letting any of these w/u's run. Maybe abort the entire lot and wait for a fixed batch. Little point in waiting 20(ish) days to find they fail as well...


Doubling the size is more than the increase being applied before they are re-released. I don't know enough to work out whether there is a difference between changing things in client_state.xml and the files that get sent from the server. Logically I can't see why there should be a difference, but not having programmed since the days of ALGOL...

Given that 14% have completed even though they show errors, it looks like something else is going on. It is another 8 days till mine finish, even on a Ryzen 7.

It may be a stupid question, but did you exit the client as well as suspending computation when you made the changes? If you just suspend, the changes get reversed by the running client - I have tried it in the past without stopping the client.
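
A quick way to confirm the edits actually stuck after the restart (rather than being written back over by a client that was still running) is to count the old and new values in the file - a small sketch, again assuming the default package install path:

STATE = "/var/lib/boinc-client/client_state.xml"   # assumed location

with open(STATE) as f:
    text = f.read()

print("old 150 MB max_nbytes entries:", text.count("<max_nbytes>150000000.000000</max_nbytes>"))
print("new 300 MB max_nbytes entries:", text.count("<max_nbytes>300000000.000000</max_nbytes>"))
print("old 2 GB rsc_disk_bound entries:", text.count("<rsc_disk_bound>2000000000.000000</rsc_disk_bound>"))
print("new 4 GB rsc_disk_bound entries:", text.count("<rsc_disk_bound>4000000000.000000</rsc_disk_bound>"))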
ID: 62830
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 62831 - Posted: 2 Nov 2020, 20:52:12 UTC - in response to Message 62829.  

I think the thing about this is: if the last zip can get uploaded BEFORE the program ends a few hours later, then the DATA is OK; it's just the end stuff that gets into difficulties, which doesn't matter so much.
ID: 62831
nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 62832 - Posted: 2 Nov 2020, 20:53:14 UTC - in response to Message 62830.  


It may be a stupid question but did you exit the client as well as suspending computation when you made the changes? If you just suspend, the changes get reversed by the running client - I have tried it in the past without stopping the client.


No, no, good question. I didn't think to quit the client. So I just checked, and the client_state.xml was as it was without the changes.

So I quit the client... made the changes again and restarted. 2 other w/u's errored straight away and died. They had only(!) been running a day or so. The w/u that is due to complete soon restarted OK. I rechecked client_state.xml and the changes were still there.

So I will have to redo the changes on 2 other machines again... every time I suspend a job or restart BOINC I seem to lose a w/u or 2.

I now just pull the power cord out.
I will report the outcome of the remaining 877 with 1 hr 10 mins left to run.

The 2 failed w/u's after the restart showed
hadam4h_c0ds_206511_5_878_012030354_0
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process got signal 65</message>
<stderr_txt>
Signal 2 received: Interrupt
Signal 2 received: Illegal instruction - invalid function image
Signal 2 received: Floating point exception
Signal 2 received: Segment violation
Signal 2 received: Software termination signal from kill
Signal 2 received: Abnormal termination triggered by abort call
Signal 2 received, exiting...
20:14:24 (1310): called boinc_finish(193)
Signal 2 received: Interrupt
Signal 2 received: Illegal instruction - invalid function image
Signal 2 received: Floating point exception
Signal 2 received: Segment violation
Signal 2 received: Software termination signal from kill
Signal 2 received: Abnormal termination triggered by abort call
Signal 2 received, exiting...
20:14:25 (983): called boinc_finish(193)

</stderr_txt>

along with loads of messages like
02-Nov-2020 20:32:07 [climateprediction.net] Output file hadam4h_c0ds_206511_5_878_012030354_0_r381159273_4.zip for task hadam4h_c0ds_206511_5_878_012030354_0 absent
ID: 62832