Posts by nairb


1) Message boards : Number crunching : Time for another moan about w/u not restarting (Message 63725)
Posted 19 Mar 2021 by nairb
Post:
I thought you would need to stop all processing in order to make a copy/backup of working w/u's, which could be fatal to a climate w/u anyway. It also helps if there is only 1 project running. But if there are multiple w/u from several projects running then it all becomes very messy, I believe. Can a climate w/u that's done a trickle file upload be restored to a point before the trickle-up? I don't know.

For most projects it's get a w/u and don't worry; restarts are not an issue. Climate seems to like its own dedicated machine and to be left alone - just feed it electricity.
2) Message boards : Number crunching : Time for another moan about w/u not restarting (Message 63720)
Posted 18 Mar 2021 by nairb
Post:
So, I found one machine had stopped responding over a VNC connection. I went and checked the machine and found its keyboard unresponsive as well, so there was no way to shut down BOINC cleanly. The mouse was still working, so I used it to shut the machine down.

Well, we all know what this might mean for those weak and feeble climate w/u. On startup 2 w/u had fainted and died. Some 11 days of processing lost between them. But a grain of sand in the history of climate processing.
I have a bucket of ARP w/u to do instead. They love a restart, in case there is a fault with the machine and it needs one.
I really do get frustrated at just how easily these climate w/u pass out and die for any reason.
3) Message boards : Number crunching : w/u completed but still showing as "Server state In progress" (Message 63118)
Posted 17 Dec 2020 by nairb
Post:
The last zip upload should have gone to the server OK, so all is not lost.
It's just that the completion will not be registered on your tasks page, so you have to remember that mentally.


Ok, so the w/u should be valid for the scientists but just shows "in progress" on the tasks page. Which is a better outcome.

It must be possible to run a script to update the tasks page. But if the task has finished ok I guess there is little incentive to do this.
4) Message boards : Number crunching : w/u completed but still showing as "Server state In progress" (Message 63108)
Posted 9 Dec 2020 by nairb
Post:
Thanks for the answer..... So it seems there is another way to lose a w/u at 100%. Without any warning a w/u can be wasted on the last "ready to report" communication. And all is lost................... along with any remaining sense of satisfaction which is all there really is from participating in these projects.
I will let the remaining 5 run to completion. Who knows - a couple of w/u might actually end up being useful.
5) Message boards : Number crunching : w/u completed but still showing as "Server state In progress" (Message 63089)
Posted 3 Dec 2020 by nairb
Post:
So I sat and watched a w/u complete...... sad, I know. It uploaded the 5th trickle ok. Then it came to the end of processing and produced a small file to upload, which uploaded ok. The w/u status changed to "Ready to report", and it then reported.

All very satisfying after 20 days. We don't all have supercomputers.....

But a day later the w/u is still showing as "In progress" on the "tasks for computer" page.

https://www.cpdn.org/result.php?resultid=21957285

The computer it came from only has 1 cpdn task left to do before its hard disk is upgraded.
Has the server just not caught up yet, or is it another fiendish way of losing a w/u at 100%?
Ta
Nairb
6) Message boards : Number crunching : uploads failing hadam4h (Message 63019)
Posted 25 Nov 2020 by nairb
Post:
EDIT... forget it. It's cleared by itself and started uploading again.... it's the 5th trickle and I was hoping it would finish uploading before the w/u ended.... But it seems ok.

Anybody else having an issue with hadam4h uploads? It's been stuck on the 5th trickle-up for a while now. The machine seems to be ok with other projects.
It just says error reported by upload server.......
Luckily the w/u has completed ok and is waiting to upload.
7) Message boards : Number crunching : New work Discussion (Message 63012)
Posted 24 Nov 2020 by nairb
Post:

I am pretty certain that means the number of users who have returned completed tasks in the past 24 hours.


And I must be among them coz for the first time in living memory I managed to return 2... yes, that's two successfully completed w/u in 1 day.

And just to settle the nerves I have a bunch of ARP w/u to do for a couple of days. I do like w/u's that seem impossible to kill. Unlike others that we could mention.....
8) Message boards : Number crunching : uploads failing hadam4h (Message 63004)
Posted 21 Nov 2020 by nairb
Post:
Oops. Yes, it's working now.
Ta
Nairb
9) Message boards : Number crunching : uploads failing hadam4h (Message 63002)
Posted 21 Nov 2020 by nairb
Post:
Two machines are reporting upload failures with hadam4h.
Sat 21 Nov 2020 03:59:10 GMT | climateprediction.net | [error] Error reported by file upload server: [hadam4h_b09s_200811_5_882_012035360_0_r1292062071_4.zip] locked by file_upload_handler PID=15930
Sat 21 Nov 2020 03:59:10 GMT | climateprediction.net | Backing off 02:54:39 on upload of hadam4h_b09s_200811_5_882_012035360_0_r1292062071_4.zip

I presume it's an upload server issue?

EDIT..... looks like the problem has gone away.

I would delete this message but there is no option to do so.
10) Questions and Answers : Unix/Linux : computation error at 100% complete (Message 62837)
Posted 2 Nov 2020 by nairb
Post:
So, I used the new method of suspending these w/u before stopping the client, then checking to make sure they have gone. It seems to have worked for 1 machine. All 4 w/u restarted, and so far are still running.
On the second machine the same procedure was applied and both w/u failed with a computation error when restarted.

Maybe it's a Fedora 30 thing...... That's some 50 days of processing lost/wasted today.
11) Questions and Answers : Unix/Linux : computation error at 100% complete (Message 62836)
Posted 2 Nov 2020 by nairb
Post:
Well, well, well, it's a success. And it was still uploading the last 192meg file when it reached 100%.

So I will try the new method of stopping the processing of w/u. I have worked with computers for endless years and always hated using the on/off switch to solve issues. But power cuts never seem to kill a climate w/u. Just luck, I guess.

Good idea to use top to check if the process really has cleared off.
Let's hope the re-issued w/u work better when they arrive.
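
The "check with top" step can also be scripted. The following is a minimal sketch, assuming the model and client show up in a `ps -eo pid,comm` listing under names starting with `hadam4h` or `boinc`; those prefixes and the sample listing are assumptions for illustration, not names taken from a real machine.

```python
def lingering(ps_output, prefixes=("boinc", "hadam4h")):
    """Return (pid, command) pairs whose command name starts with a prefix."""
    found = []
    for line in ps_output.splitlines()[1:]:  # skip the "PID COMMAND" header
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip().startswith(prefixes):
            found.append((int(parts[0]), parts[1].strip()))
    return found


# Hypothetical listing in the shape printed by `ps -eo pid,comm`.
# A live check could read the same text with:
#   subprocess.run(["ps", "-eo", "pid,comm"], capture_output=True, text=True).stdout
sample = """  PID COMMAND
    1 systemd
 1310 hadam4h_um_8.52
 2201 xterm
"""

print(lingering(sample))
```

An empty list would suggest nothing is left behind, so it should be safe to stop the client; anything else means a process is still hanging around.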
12) Questions and Answers : Unix/Linux : computation error at 100% complete (Message 62832)
Posted 2 Nov 2020 by nairb
Post:

It may be a stupid question but did you exit the client as well as suspending computation when you made the changes? If you just suspend, the changes get reversed by the running client - I have tried it in the past without stopping the client.


No, no, good question. I didn't think to quit the client. So I just checked, and client_state.xml was back as it was, without the changes.

So I quit the client................. made the changes again, and restarted. 2 other w/u errored straight away and died. They had only!! been running a day or so. The w/u that is due to complete soon restarted ok. I rechecked client_state.xml and the changes were still there.

So I will have to redo the changes on 2 other machines again................ every time I suspend a job or restart BOINC I seem to lose a w/u or 2.

I now just pull the power cord out.
I will report the outcome of the remaining 877 with 1hr 10mins left to run.

The 2 failed w/u after the restart showed
hadam4h_c0ds_206511_5_878_012030354_0
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process got signal 65</message>
<stderr_txt>
Signal 2 received: Interrupt
Signal 2 received: Illegal instruction - invalid function image
Signal 2 received: Floating point exception
Signal 2 received: Segment violation
Signal 2 received: Software termination signal from kill
Signal 2 received: Abnormal termination triggered by abort call
Signal 2 received, exiting...
20:14:24 (1310): called boinc_finish(193)
Signal 2 received: Interrupt
Signal 2 received: Illegal instruction - invalid function image
Signal 2 received: Floating point exception
Signal 2 received: Segment violation
Signal 2 received: Software termination signal from kill
Signal 2 received: Abnormal termination triggered by abort call
Signal 2 received, exiting...
20:14:25 (983): called boinc_finish(193)

</stderr_txt>

along with loads of messages like
02-Nov-2020 20:32:07 [climateprediction.net] Output file hadam4h_c0ds_206511_5_878_012030354_0_r381159273_4.zip for task hadam4h_c0ds_206511_5_878_012030354_0 absent
13) Questions and Answers : Unix/Linux : computation error at 100% complete (Message 62829)
Posted 2 Nov 2020 by nairb
Post:
Well, the fix had been applied to client_state.xml: both max_nbytes and rsc_disk_bound, on all the lines that needed changing. Maybe I made an error, but the next w/u to finish just now also failed with....
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
19:06:44 (1926): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam4h_b0ft_200911_5_877_012028747_0_r1809679561_5.zip</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>
]]>

There is another 877 due to finish in a couple of hrs, followed by a bunch of 878's. I am beginning to think it is not worth letting any of these w/u's run. Maybe abort the entire lot and wait for a fixed batch. Little point in waiting 20(ish) days to find they fail as well......
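
To see at a glance which zips a failed task complained about, the `<file_xfer_error>` entries can be pulled out of the message text. This is a rough sketch, assuming the message always has the shape pasted above; `failed_uploads` is a hypothetical helper, not part of BOINC.

```python
import re

# Matches one <file_xfer_error> entry and captures the file name and numeric code.
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*<file_name>([^<]+)</file_name>\s*"
    r"<error_code>(-?\d+)[^<]*</error_code>"
)


def failed_uploads(message):
    """Return (file_name, error_code) pairs found in a task's message text."""
    return [(name, int(code)) for name, code in XFER_ERROR.findall(message)]


# Message text copied from the post above.
message = """upload failure: <file_xfer_error>
<file_name>hadam4h_b0ft_200911_5_877_012028747_0_r1809679561_5.zip</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
"""

print(failed_uploads(message))
```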
14) Questions and Answers : Unix/Linux : computation error at 100% complete (Message 62822)
Posted 2 Nov 2020 by nairb
Post:
Edit, find and replace <max_nbytes>150000000.000000</max_nbytes> with <max_nbytes>300000000.000000</max_nbytes>, and hitting replace all should sort it.


It's been done..... It does seem that only some w/u are affected. Better to find out it's not a machine problem, tho.
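
For anyone who would rather script the replace-all than trust a text editor, here is a minimal sketch. It works on a string, so it can be tried on a copy first; the `<file>` fragment is a cut-down, hypothetical stand-in for the real client_state.xml, and `raise_upload_limit` is a made-up name.

```python
OLD = "<max_nbytes>150000000.000000</max_nbytes>"
NEW = "<max_nbytes>300000000.000000</max_nbytes>"


def raise_upload_limit(xml_text):
    """Return the edited text and the number of limits that were changed."""
    return xml_text.replace(OLD, NEW), xml_text.count(OLD)


# Hypothetical fragment; the real file has many more fields per entry.
sample = (
    "<file>\n"
    "    <name>hadam4h_b0cw_200811_5_877_012028642_0_r803748994_5.zip</name>\n"
    "    <max_nbytes>150000000.000000</max_nbytes>\n"
    "</file>\n"
)

edited, changed = raise_upload_limit(sample)
print(changed)
```

As noted elsewhere in this thread, the client has to be fully stopped before the file is edited, otherwise the running client writes the old values back. Keeping a backup copy of client_state.xml before saving seems wise.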

Ta
Nairb
15) Questions and Answers : Unix/Linux : computation error at 100% complete (Message 62816)
Posted 1 Nov 2020 by nairb
Post:

Edit: With a text editor that saves the file as plain text without adding any end-of-file characters, I have opened client_state.xml, looked for <rsc_disk_bound> for all of the batch 877 and 878 tasks on my machine, and doubled the value from 2000000000.000000 to 4000000000.000000.
My tasks which might be affected have only just started so it will be about a week till I know whether it works or not. In the meantime I will let Andy know that this is a problem.


Ok, it's been done on the 2 machines. There are 3 877's to complete in the next 20 hrs. I do hope they are more successful.
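
The doubling can be sketched in code too, so that only the batch 877 and 878 entries are touched. This assumes each task sits in a `<workunit>` block holding a `<name>` and a `<rsc_disk_bound>`, and that the batch number is the fifth underscore-separated field of the name (as in hadam4h_b0cw_200811_5_877_...). Both are assumptions read off the names quoted in this thread, and `double_disk_bound` is a hypothetical helper.

```python
import re

WORKUNIT = re.compile(r"<workunit>.*?</workunit>", re.DOTALL)
NAME = re.compile(r"<name>([^<]+)</name>")
BOUND = re.compile(r"(<rsc_disk_bound>)([0-9.]+)(</rsc_disk_bound>)")


def double_disk_bound(xml_text, batches=("877", "878")):
    """Double rsc_disk_bound in workunits whose name carries a listed batch."""

    def fix(block):
        text = block.group(0)
        name = NAME.search(text)
        fields = name.group(1).split("_") if name else []
        # Assumption: the batch number is the fifth underscore-separated field.
        if len(fields) < 5 or fields[4] not in batches:
            return text
        return BOUND.sub(
            lambda m: f"{m.group(1)}{float(m.group(2)) * 2:.6f}{m.group(3)}",
            text,
        )

    return WORKUNIT.sub(fix, xml_text)


# Hypothetical, cut-down fragment: one batch 877 task and one unrelated task.
sample = """<workunit>
    <name>hadam4h_b0cw_200811_5_877_012028642</name>
    <rsc_disk_bound>2000000000.000000</rsc_disk_bound>
</workunit>
<workunit>
    <name>other_project_wu_1</name>
    <rsc_disk_bound>2000000000.000000</rsc_disk_bound>
</workunit>
"""

edited = double_disk_bound(sample)
print(edited)
```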
16) Questions and Answers : Unix/Linux : computation error at 100% complete (Message 62814)
Posted 1 Nov 2020 by nairb
Post:
On a separate machine I have just had the same thing happen to
hadam4h_b0cw_200811_5_877_012028642_0

<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
22:21:59 (31404): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam4h_b0cw_200811_5_877_012028642_0_r803748994_5.zip</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message>
]]>

This machine had 11.6 GB free disk space.
The final zip file was 193.11 MB.

I did not get time to check the <rsc_disk_bound> value.
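
A quick bit of arithmetic on that zip size, as a sketch: it assumes the -131 error comes from the per-file <max_nbytes> limit, using the 150000000 and 300000000 values quoted elsewhere in this thread. Since "mb" could mean 10^6 or 2^20 bytes, both readings are checked.

```python
def size_range_bytes(size_mb):
    """Return the byte count for the decimal and the binary reading of 'MB'."""
    return size_mb * 1000**2, size_mb * 1024**2


low, high = size_range_bytes(193.11)

for limit in (150_000_000, 300_000_000):
    if low > limit:
        verdict = "file size too big"
    elif high <= limit:
        verdict = "fits"
    else:
        verdict = "depends on the unit"
    print(limit, verdict)
```

Either way the zip is over the old limit and comfortably under the doubled one, which would fit with free disk space not being the problem here.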
17) Questions and Answers : Unix/Linux : computation error at 100% complete (Message 62809)
Posted 1 Nov 2020 by nairb
Post:
So I left the machine with 15 mins to go before completion on task hadam4h_b0tg_201211_5_877_012029238_0.
When I returned it had failed with computation error.
It says that the w/u returned 5 trickles.
There was a street-wide power failure a couple of weeks ago which shut the machine(s) down, but all w/u's resumed ok.

I have 3 more due to finish in the next day. It would be a shame if all failed the same way.

I am beginning to suspect a shortage of disk space on this machine....... is this what the stderr_txt is actually trying to say??
Ta
Nairb


<core_client_version>7.16.6</core_client_version>
<![CDATA[
<stderr_txt>
16:13:32 (1925): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
<file_name>hadam4h_b0tg_201211_5_877_012029238_0_r1378671937_2.zip</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam4h_b0tg_201211_5_877_012029238_0_r1378671937_5.zip</file_name>
<error_code>-131 (file size too big)</error_code>
</file_xfer_error>
</message
18) Questions and Answers : Unix/Linux : fedora 30 64 bit (Message 62784)
Posted 22 Oct 2020 by nairb
Post:
For those who might be trying Fedora 30.... I have finally got round to doing a minimal Fedora 30 install. I usually go for the kde-plasma desktop gui. We all know how resource hungry some Fedora spins can be. But it's nice to have some of the gui tools. I checked my 80gig disk only to find it's getting a bit short of space with climate/einstein etc installed.

So it was time to get back to basics. A min install + admin tools. And all you get is a command line. I use tigervnc & run boinc thru an xterm. All very simple ..... like it used to be. It now all fits in the corner of a 60gig disk.
19) Message boards : Number crunching : Welcome back/checking if everything is working? (Message 62754)
Posted 5 Oct 2020 by nairb
Post:

Les has contacted the project, some cleaning up will be done but probably not before some more work appears which will be part of the new season MSc programme which should in the next few weeks have work for both Windows and Linux machines. (Not sure about Mac.)



I presume that, for Linux w/u, these will still require some 32-bit libs and won't be the full-blown 64-bit jobbies. I had better make sure the Fedora 30 hard disk is plugged in.

Assuming I am "lucky" to snare a w/u that is.
20) Message boards : Number crunching : Updated BOINC Clients 7.16.11 - Windows 64-bit and Mac OS X (64-bit Intel) (Message 62736)
Posted 23 Sep 2020 by nairb
Post:
One thing about these ARP tasks is that they can be suspended/stopped/re-started endless times without having a fit and dying with a computation error.

Very endearing.



©2021 climateprediction.net