New work discussion - 2

Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 66114 - Posted: 19 Sep 2022, 13:45:22 UTC - in response to Message 66112.  

The Windows task you got will be a resend with _1 or _2 at the end of the task name meaning it is on its second or third try after failing on one or two machines, or possibly being aborted.
Seems WAH has a problem with a machine reboot, was working fine until I had to reboot the machine after patch install and then it failed with computation error after it restarted. Unfortunately it disappeared too quick for me to see the detailed logs and keep the files for tests.

As a developer, that is a bit of a nuisance. It would be nice if I could tell the client not to delete the files in the event of a crash or failure, but to leave the slot files as they are (or make a backup). I had a look through the client options: it's possible to exit the client after a task has finished, but that would affect non-CPDN tasks, which isn't what I need. Other than running another process to periodically rsync the suspect slot directory to somewhere else, I can't see how to do it within BOINC.

Does anyone know how to do this? (Richard H maybe?)
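
A minimal sketch of the "periodically copy the slot directory" workaround mentioned above, written in Python rather than rsync. The function names and the example paths are hypothetical; the slot location depends on how BOINC is installed, and a copy taken while the model is writing may be inconsistent or raise an error, so treat it as a starting point only.

```python
import shutil
import time
from pathlib import Path


def snapshot_slot(slot_dir: Path, backup_root: Path) -> Path:
    """Copy slot_dir into a new timestamped folder under backup_root."""
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = backup_root / f"{slot_dir.name}-{stamp}"
    # copytree creates the target (and missing parents) and raises
    # FileExistsError if a snapshot with the same timestamp already exists.
    shutil.copytree(slot_dir, target)
    return target


def watch_slot(slot_dir: Path, backup_root: Path, interval_s: float = 300.0) -> None:
    """Take a snapshot every interval_s seconds until the slot directory vanishes."""
    while slot_dir.is_dir():
        snapshot_slot(slot_dir, backup_root)
        time.sleep(interval_s)


# Hypothetical usage; both paths are assumptions, not taken from this thread:
# watch_slot(Path("/var/lib/boinc/slots/0"), Path("/tmp/slot-backups"))
```

Each snapshot is a full copy, so old ones would need pruning by hand on a long task.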
---
CPDN Visiting Scientist
ID: 66114
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 66115 - Posted: 19 Sep 2022, 18:07:33 UTC
Last modified: 19 Sep 2022, 18:07:54 UTC

Seems WAH has a problem with a machine reboot

It isn't just the WAH tasks. I lost three with a reboot recently, but in my experience they are more likely to survive a reboot than the Linux ones, where the failure rate can be as high as one in four on reboots. With the Windows ones running under Wine, I find fewer than one in ten are lost to reboots.
ID: 66115
Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,290,785
RAC: 11,182
Message 66116 - Posted: 19 Sep 2022, 18:22:34 UTC - in response to Message 66114.  

Was it result 22235033?

That's a curious one. Exit status 0 (0x00000000) (zero normally signifies success), nothing at all recorded from stderr.

But it's a resend (replication _1). The _0 copy also failed, leaving rather more evidence behind.

Result 22229598, Exit status 15, stderr ends with

Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=22328, selfPID=22328, iMonCtr=1
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=22328, selfPID=21096, iMonCtr=1
after many, many interruptions.

You've thought of the obvious ideas. Beyond that, I can only suggest:

Start another task and let it get into its stride.
Stop BOINC prior to reboot, and examine the state of the files.
Disable automatic BOINC start at reboot/login.
Reboot, and examine the state of the files again before BOINC has a chance to run.
Pull the network cable, and allow BOINC to start.
Assuming it crashes as before, the slot folder will be cleared, but the upload files and report should be held until after BOINC has reported them to the server and got an ack response. Stderr will be embedded in client_state.xml, not kept as a separate file.
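
As a rough illustration of pulling the embedded stderr back out of a copy of client_state.xml, here is a small Python sketch. The tag names it looks for (`<result>`, `<name>`, `<stderr_out>`) are assumptions about the file layout and should be checked against your own file; the function name is hypothetical. It uses regular expressions rather than an XML parser so that a partly written state file does not stop it.

```python
import re
from typing import Dict

# Assumed layout: each <result> block holds a <name> and, once the task has
# finished or failed, a <stderr_out> element with the captured stderr.
RESULT_RE = re.compile(r"<result>(.*?)</result>", re.DOTALL)
NAME_RE = re.compile(r"<name>(.*?)</name>", re.DOTALL)
STDERR_RE = re.compile(r"<stderr_out>(.*?)</stderr_out>", re.DOTALL)


def stderr_by_result(state_text: str) -> Dict[str, str]:
    """Map result name -> embedded stderr text for results that have one."""
    found = {}
    for block in RESULT_RE.findall(state_text):
        name = NAME_RE.search(block)
        err = STDERR_RE.search(block)
        if name and err:
            found[name.group(1).strip()] = err.group(1).strip()
    return found


# Hypothetical usage on a copy of the state file:
# text = open("client_state_copy.xml", encoding="utf-8", errors="replace").read()
# for task, err in stderr_by_result(text).items():
#     print(task, err, sep="\n")
```

Working on a copy avoids touching the file while the client is running.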
ID: 66116
Glenn Carver
Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 66117 - Posted: 19 Sep 2022, 19:21:03 UTC - in response to Message 66116.  
Last modified: 19 Sep 2022, 19:43:36 UTC

Richard, yes it was result 22235033. But it failed as soon as it started so that's probably why no stderr? Thanks for the tips, I shall bear them in mind, though it seems rather poor to me that the volunteers should have to do this for a safe restart for CPDN tasks.

The zero exit status may be a red herring; it's possible the real error code is not propagated to the top-level software layer correctly. The only way to tell would be to run the code in a debugger. I found the same thing with HadSM4.

Dave, that's a very poor survival rate for the Linux tasks. Other projects seem to handle a cold restart just fine. I am surprised because operational models are pretty resilient to hardware & data failures, but it could be something in the wrapper code that's not tolerating restarts properly. I'll ask the CPDN team as I'm interested to find out.
ID: 66117
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1060
Credit: 16,541,651
RAC: 2,188
Message 66118 - Posted: 19 Sep 2022, 22:00:28 UTC - in response to Message 66117.  

Linux tasks are not that bad. I have not gotten any since July, but my failures seem mostly like this:

Task 22227751
Name 	hadsm4_a10i_201310_6_935_012148076_1
Workunit 	12148076
Created 	28 Jul 2022, 5:20:01 UTC
Sent 	28 Jul 2022, 6:08:55 UTC
Report deadline 	10 Jul 2023, 11:28:55 UTC
Received 	28 Jul 2022, 9:43:21 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	22 (0x00000016) Unknown error code
Computer ID 	1511241
Run time 	13 min 56 sec
CPU time 	13 min 20 sec
Validate state 	Invalid
Credit 	0.00
Device peak FLOPS 	6.58 GFLOPS
Application version 	UK Met Office HadSM4 at N144 resolution v8.02
i686-pc-linux-gnu
Peak working set size 	656.03 MB
Peak swap size 	787.57 MB
Peak disk usage 	0.02 MB
Stderr 	

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.    tmp/xnnuj.pipe_dummy

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.    tmp/xnnuj.pipe_dummy

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.    tmp/xnnuj.pipe_dummy

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.    tmp/xnnuj.pipe_dummy

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.    tmp/xnnuj.pipe_dummy

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.    tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(
04:59:14 (154169): called boinc_finish(22)

</stderr_txt>
]]>
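
For anyone puzzled by the three numbers in "process exited with code 22 (0x16, -234)": they appear to be the same exit code written three ways. The sketch below reproduces them in Python. The third form is an inference from the numbers themselves (all bits above the low byte set), not something taken from the BOINC source, and the function name is hypothetical.

```python
def exit_code_forms(code: int) -> tuple:
    """Return (decimal, hex string, sign-extended value) for an 8-bit exit code."""
    low = code & 0xFF            # exit codes occupy the low byte
    sign_extended = (-1 << 8) | low  # same byte with every higher bit set
    return low, hex(low), sign_extended


print(exit_code_forms(22))  # (22, '0x16', -234)
```

The "Exit status 22 (0x00000016)" line in the task summary is the same value again, just zero-padded to eight hex digits.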


I quit doing cold restarts a while ago but, IIRC, the offending program that caused those problems has long since been fixed.
At some point my machine crashed in such a way that I could not even do a shutdown. I powered it off and started it back up. I never found out what the trouble was, but when it came back up, BOINC and its children, probably including CPDN jobs, picked up where they left off with no problems.
ID: 66118
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 66119 - Posted: 20 Sep 2022, 6:06:31 UTC

Dave, that's a very poor survival for the linux tasks. Other projects seem to handle a cold restart just fine. I am surprised because operational models are pretty resilient to hardware & data failures but it could be something in the wrapper code that's not tolerating restarts properly. I'll ask the CPDN team as I'm interested to find out.

It may not be quite that bad. When work appears again, I will start keeping some real data on this rather than relying on my impressions.
ID: 66119
Bryn Mawr
Joined: 28 Jul 19
Posts: 147
Credit: 12,814,088
RAC: 261,385
Message 66120 - Posted: 20 Sep 2022, 8:50:01 UTC - in response to Message 66119.  

Dave, that's a very poor survival for the linux tasks. Other projects seem to handle a cold restart just fine. I am surprised because operational models are pretty resilient to hardware & data failures but it could be something in the wrapper code that's not tolerating restarts properly. I'll ask the CPDN team as I'm interested to find out.

May not be quite that bad. I will when work appears again, start keeping some real data on this rather than relying on my impressions.


I can only report my experiences. I do not take any precautions when rebooting (Ubuntu 20.04) and I have not had any CPDN fails in a couple of years.
ID: 66120
Glenn Carver
Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 66121 - Posted: 20 Sep 2022, 10:49:28 UTC - in response to Message 66118.  

Model crashed: ATM_DYN : NEGATIVE THETA DETECTED.
That's a model error rather than a technical BOINC-related problem like a restart. If I remember the Hadley Centre models properly, it may indicate that the model levels have touched (or crossed), probably because the vertical wind speed is too high or unstable. Usually that kind of thing happens in certain forecast conditions over high orography, where the model levels are naturally closer together.

For interest, OpenIFS has a different way of calculating where the winds are blowing. It tries to work out the trajectory of an air parcel between model timesteps. If we use too large a timestep or the winds get very strong, those trajectories near the surface can go underground and you'll see messages to that effect in the model logs. The model can correct for this, but if there are too many, it will stop.
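
To make the "trajectory goes underground" idea concrete, here is a toy one-dimensional Python sketch. It is not the OpenIFS scheme: it simply traces a parcel backwards over one timestep using a constant vertical wind (a first-order estimate), and every name and number in it is an illustrative assumption.

```python
def departure_height(arrival_height_m: float, w_ms: float, dt_s: float) -> float:
    """Height the parcel started from one timestep ago, for a constant
    vertical wind w_ms (positive upwards): a first-order backward trajectory."""
    return arrival_height_m - w_ms * dt_s


def goes_underground(arrival_height_m: float, w_ms: float, dt_s: float,
                     surface_m: float = 0.0) -> bool:
    """True if the backward trajectory starts below the surface."""
    return departure_height(arrival_height_m, w_ms, dt_s) < surface_m


# A point 10 m above the ground with a 0.5 m/s updraught:
print(goes_underground(10.0, 0.5, 10.0))  # short timestep: starts 5 m up
print(goes_underground(10.0, 0.5, 60.0))  # long timestep: starts 20 m below ground
```

The same wind gives a valid starting point with the short timestep and an underground one with the long timestep, which is the sensitivity to timestep and wind strength described above.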
---
CPDN Visiting Scientist
ID: 66121
AndreyOR
Joined: 12 Apr 21
Posts: 247
Credit: 12,020,139
RAC: 23,959
Message 66122 - Posted: 20 Sep 2022, 11:25:20 UTC

I'd like to put it more bluntly and say that CPDN tasks are definitely very sensitive to interruptions (and I believe it's relatively well documented in the forums). By far the worst of any project I'm aware of. Even a couple of LHC subprojects that must be run to completion without interruption will just restart from the beginning. CPDN's error rate is at least 10%, Bryn Mawr's (who posted above) is over 11%. Mine is over 22%. Many of those are due to restarts (especially if it happens more than once). I'd expect CPDN to have a higher error rate than other projects for valid reasons (e.g. "Negative Pressure Detected"). But for a project that has workunits that take days to weeks to complete, a 10%+ error rate is too high, I think, as that means that days' and weeks' worth of processing time is wasted because the tasks can't handle interruptions well. Glenn, it's encouraging to hear that you'd like to look into this and potentially fix it. I'm not sure which OS is worse, but the issue affects Windows, macOS, and Linux tasks.
ID: 66122
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 66123 - Posted: 20 Sep 2022, 12:20:19 UTC - in response to Message 66122.  
Last modified: 20 Sep 2022, 12:41:27 UTC

I'd like to put it more bluntly and say that CPDN tasks are definitely very sensitive to interruptions (and I believe it's relatively well documented in the forums). By far the worst of any project I'm aware of. Even a couple of LHC subprojects that must be run to completion without interruption, will just restart from the beginning. CPDN's error rate is at least 10%, Bryn Mawr's (who posted above) is over 11%. Mine is over 22%. Many of those are due to restarts (especially if happens more than once). I'd expect CPDN to have a higher error rate than other projects due to valid reasons (i.e. "Negative Pressure Detected"). But for a project that has workunits that take days to weeks to complete, 10%+ error rate is too high, I think, as that means that days' and weeks' worth of processing time is wasted because the tasks can't handle interruptions well. Glenn, it's encouraging to hear that you'd like to look into this and potentially fix it. I'm not sure which OS is worse but the issue affects Windows, macOS, and Linux tasks.


Edit: re my previous post about errors on Linux tasks, one in four or five is the error rate when the computer is being turned off every night, so on a ball park figure of seven days for a task, closer to one in thirty falling over per task/shutdown event. Still a lot higher than ideal though.
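
The arithmetic behind that ballpark can be checked with a short Python sketch. It assumes each shutdown independently kills the task with the same probability and that a task sees seven shutdowns; both are simplifications, and the function name is hypothetical.

```python
def per_event_failure(per_task_failure: float, events_per_task: int) -> float:
    """Solve 1 - (1 - p)**n = P for p, assuming n independent shutdown events."""
    return 1.0 - (1.0 - per_task_failure) ** (1.0 / events_per_task)


for label, rate in (("one in four", 1 / 4), ("one in five", 1 / 5)):
    p = per_event_failure(rate, 7)
    print(f"{label} per task -> about one in {1 / p:.0f} per shutdown")
```

Under those assumptions the per-shutdown rate comes out in the region of one in 25 to one in 32, the same ballpark as the one in thirty quoted above.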

I believe the zips that are uploaded at the same time as the trickle-ups for credit are generated still provide some usable data, even if the task is a hard fail and all attempts crash.

Edit 2: In contrast, the error rate for my tasks on the testing site, where the nature of testing might lead one to expect a higher error rate, is one in 20 over the last 60 tasks. (Reboots while running testing work are only for emergencies or when a workman requires the power to go off.)
ID: 66123
Glenn Carver
Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 66124 - Posted: 20 Sep 2022, 13:42:03 UTC - in response to Message 66122.  

In defence of CPDN, I think it's quite impressive that operational weather forecast models, with 5-10 million lines of code and designed to run on highly parallel high-performance computer systems, can be made to work on a range of Intel & AMD home/server hardware across multiple operating systems. One of the complications with this setup is BOINC, which imposes certain constraints: e.g. we have to make sure restarts work after both clean and sudden shutdowns, and that the model responds well to being suspended, swapped in and out of memory, etc. It took 2 years of work for OpenIFS to run in CPDN, and a lot of that was on the BOINC side and testing. I'd say 10-15% failures is acceptable given the wide range of computers it's running on.

As I'm only volunteering, I'm not promising to fix restart issues. I thought I'd ask, to understand whether there are any quick fixes. The more pressing issue should be to eradicate the need for 32-bit libraries if possible. I need to finish working on OpenIFS first, though.
ID: 66124
Bryn Mawr
Joined: 28 Jul 19
Posts: 147
Credit: 12,814,088
RAC: 261,385
Message 66125 - Posted: 20 Sep 2022, 16:47:23 UTC - in response to Message 66122.  

I'd like to put it more bluntly and say that CPDN tasks are definitely very sensitive to interruptions (and I believe it's relatively well documented in the forums). By far the worst of any project I'm aware of. Even a couple of LHC subprojects that must be run to completion without interruption, will just restart from the beginning. CPDN's error rate is at least 10%, Bryn Mawr's (who posted above) is over 11%. Mine is over 22%. Many of those are due to restarts (especially if happens more than once). I'd expect CPDN to have a higher error rate than other projects due to valid reasons (i.e. "Negative Pressure Detected"). But for a project that has workunits that take days to weeks to complete, 10%+ error rate is too high, I think, as that means that days' and weeks' worth of processing time is wasted because the tasks can't handle interruptions well. Glenn, it's encouraging to hear that you'd like to look into this and potentially fix it. I'm not sure which OS is worse but the issue affects Windows, macOS, and Linux tasks.


Whilst I have had errors, mostly negative theta, I have not had a task fail on restart in a long time. Then, I very rarely restart more than once during the running of a single task.
ID: 66125
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 66126 - Posted: 21 Sep 2022, 14:59:59 UTC

Another batch of Hadcm3s is on its way for the testing site. I believe at some point this will result in more main site work, but given that they won't run on recent releases of macOS, increasingly they will only be available to those who are willing and able to go down the virtualisation route. (It didn't work when I tried it, though others with the same CPU and OS have got it to work. I will try again next time I do a clean install.)

I looked at my Africa Rain Project tasks on WCG today. Not a single failed task, despite the machine being turned off every night at this time of year, when I don't have so much solar.
ID: 66126
SolarSyonyk
Joined: 7 Sep 16
Posts: 257
Credit: 31,958,609
RAC: 36,807
Message 66127 - Posted: 21 Sep 2022, 15:45:53 UTC - in response to Message 66126.  

Another batch of Hadcm3s is on its way for testing site. I believe at some point this will result in more main site work but given that they won't run on recent releases of MacOS increasingly they will only be available to those who are willing and able to go down the virtualisation route.


I'll have to get my VMs up and running again and waiting for the work! Need to spin up a few more of those, some hardware has rotated since the last batch.

Not a single failed task despite this time of year when I don't have so much solar, the machine being turned off every night.


Is there a reason you shut them down instead of sleep them? I do almost all of my compute in my solar powered, off-grid office, and I just put the machines to sleep at night - they don't pull enough power to matter, and it avoids task restarts as they're never being terminated and restarted - the machine just goes to sleep. There was one old Xeon box I couldn't do this with because it pulled 150W asleep, so I just pointed it at other projects - but the rest of my stuff is quite happy with sleep/resume cycles and CPDN works with that just fine.
ID: 66127
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1060
Credit: 16,541,651
RAC: 2,188
Message 66128 - Posted: 21 Sep 2022, 21:24:23 UTC - in response to Message 66126.  

Another batch of Hadcm3s is on its way for testing site.


Will they be Mac only, or Linux also? The last one I got worked OK.

Task 22191699
Name 	hadcm3s_1k9d_200012_168_926_012129726_2
Workunit 	12129726
Created 	29 Jan 2022, 20:46:55 UTC
Sent 	29 Jan 2022, 20:48:05 UTC
Report deadline 	12 Jan 2023, 2:08:05 UTC
Received 	1 Feb 2022, 13:43:03 UTC
Server state 	Over
Outcome 	Success
Client state 	Done
Exit status 	0 (0x00000000)
Computer ID 	1511241 [Linux]
Run time 	2 days 10 hours 49 min 14 sec
CPU time 	2 days 10 hours 24 min 3 sec
Validate state 	Valid
Credit 	4,354.56

ID: 66128
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 66129 - Posted: 22 Sep 2022, 4:45:25 UTC

Will they be MAC only, or Linux also? The last one I got worked OK.
Mac only. The error rate on even known reliable Linux machines has been so much higher than on Macs. And the tests have been going on for a couple of months with no hints as to when they will transfer over to the main site, so it could be months or it could be days.
ID: 66129
Glenn Carver
Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 66130 - Posted: 22 Sep 2022, 20:22:00 UTC - in response to Message 66129.  

Mac only. The error rate on even known reliable Linux machines has been so much higher than on Macs. And the tests have been going on for a couple of months with no hints as to when they will transfer over to main site so it could be months it could be days.
HadCM3 is Mac only? I didn't know that. Odd, because I've seen the code repository and the build script was (I thought) set up for Linux/Unix. There should be no reason why it can't be Linux as well - something else I'll ask Andy about.
ID: 66130
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 66131 - Posted: 23 Sep 2022, 5:07:18 UTC - in response to Message 66130.  

HadCM3 is mac only? I didn't know that. Odd, because I've seen the code repository and the build script was (I thought) set up for linux/unix. There should be no reason why it can't be linux as well - something else I'll ask Andy about.
Till fairly recently, batches of hadcm3s tasks were for Linux as well. The high error rate on Linux machines with the last few batches is why since then they have only been released for Macs. But you are right that the code allows for them to run on Linux machines.
ID: 66131
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1060
Credit: 16,541,651
RAC: 2,188
Message 66132 - Posted: 23 Sep 2022, 5:59:04 UTC - in response to Message 66131.  

Till fairly recently, batches of hadcm3s tasks were for Linux as well. The high error rate on Linux machines with the last few batches is why since then they have only been released for Macs. But you are right that the code allows for them to run on Linux machines.


I looked at a whole bunch of hadcm3s failures on my machine and they were mostly due to

Computer ID 	1511241
Run time 	36 sec
CPU time 	2 sec
Validate state 	Invalid

Application version 	UK Met Office HadCM3 short v8.36
i686-pc-linux-gnu

Stderr 	

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>
SIGSEGV: segmentation violation     <---<<<
Stack trace (10 frames):
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f4c140]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad]
/usr/lib/libc.so.6(__libc_start_main+0xf9)[0xf7cc01e9]

   [snip]


It is my contention that a segmentation violation in Linux can only be caused by hardware problems (RAM not working right, overclocking, etc.) or by software bugs, such as using a pointer with an incorrect address in it (typically a value that does not point to an address in the address space of the process, if the programming language is one that uses pointers) or going off either end of an array in programs that do not use pointers.

It is my understanding that programs such as UK Met Office HadCM3 short v8.36 i686-pc-linux-gnu are written in FORTRAN that does not use pointers, so my conjecture is that there is a bug in the source program. The trouble is that the source code is private and not fixable by the ClimatePrediction team, even if they were inclined to look at the enormous program. They would have to run debugging tools (e.g., sdb if it still exists) to find where this is happening and fix it.

From the stack trace above, it happened just as the program was starting up, so a whole lot of the source would probably not need looking at. But that assumes I understand the stack trace better than I am confident I do, since I have not done any Linux programming in over 10 years.
ID: 66132
AndreyOR
Joined: 12 Apr 21
Posts: 247
Credit: 12,020,139
RAC: 23,959
Message 66133 - Posted: 23 Sep 2022, 7:20:03 UTC - in response to Message 66132.  
Last modified: 23 Sep 2022, 7:28:17 UTC

A while back I tried to get LHC projects running on WSL2 (Ubuntu 20.04) and one of them, native (as opposed to VBox) Theory, was failing within a minute or so with SIGSEGV errors. After some time searching and trying things, I ran into a fix which I decided to try, and it worked: no more SIGSEGV errors, and Theory tasks started running to completion. The fix was to change a kernel parameter via the WSL2 config file to emulate vsyscall. I think it changes how system calls are made, and I believe these types of problems come up when running older Linux programs. I wonder if Linux HadCM3 is experiencing similar issues. My Linux HadCM3 failures had a different error, NAMELIST input, for example: https://www.cpdn.org/result.php?resultid=22182053. It'd be interesting to try running Linux HadCM3 with vsyscall emulated and see if it works.
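
For reference, a hedged Python sketch of what that config change might look like. It assumes the setting in question is `kernelCommandLine = vsyscall=emulate` in the `[wsl2]` section of the `.wslconfig` file in the Windows user profile; the post above does not spell this out, so check the WSL documentation before relying on it. The function name is hypothetical.

```python
import configparser
import io


def with_vsyscall_emulation(existing_text: str) -> str:
    """Return .wslconfig text with vsyscall=emulate on the kernel command line.

    Assumes the relevant key is [wsl2] kernelCommandLine (an assumption here).
    Comments in the existing file are not preserved by configparser.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the key's capitalisation as written
    parser.read_string(existing_text)
    if not parser.has_section("wsl2"):
        parser.add_section("wsl2")
    current = parser.get("wsl2", "kernelCommandLine", fallback="").split()
    if "vsyscall=emulate" not in current:
        current.append("vsyscall=emulate")
    parser.set("wsl2", "kernelCommandLine", " ".join(current))
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


print(with_vsyscall_emulation(""))
```

WSL only reads the file at startup, so it would need a full restart (for example `wsl --shutdown` from Windows) before the change takes effect.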

I'd add that running Theory in Ubuntu 20.04 in Hyper-V (the Windows-native type 1 hypervisor) was no problem. The errors are specific to WSL2. The WSL2 kernel is not the same as regular Linux. One of the differences is that WSL2 uses init.d, not systemd.
ID: 66133

©2024 climateprediction.net