OpenIFS Discussion

Message boards : Number crunching : OpenIFS Discussion

Previous · 1 . . . 26 · 27 · 28 · 29 · 30 · 31 · Next

Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 68462 - Posted: 25 Feb 2023, 9:27:51 UTC - in response to Message 68458.  

This task failed with a divide by zero error. Presumably this is one of those cases where the physics of the model get out of control?
Yes. It blew up in the model's convection code, clouds etc. Ran for a long time though, 49 days, before the instability occurred.

I would bet a beer that the rerun will be successful.
---
CPDN Visiting Scientist
ID: 68462
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,276,661
RAC: 11,053
Message 68463 - Posted: 25 Feb 2023, 9:35:46 UTC

Task 22316388 failed with "process exited with code 9 (0x9, -247)".

But there's no error in the portion of stderr.txt that we can see (from upload 97 to the end). I can only guess that there was a child process error earlier in the run: the restart succeeded, but the error flag wasn't cleared from the BOINC task status. The final task finish looks normal, with:

..The child process terminated with status: 0
...
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
07:35:35 (41942): called boinc_finish(0)
That's going to be a tough one to debug.
ID: 68463
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 68465 - Posted: 25 Feb 2023, 12:15:54 UTC - in response to Message 68463.  
Last modified: 25 Feb 2023, 12:16:59 UTC

Task 22316388 failed with "process exited with code 9 (0x9, -247)".

But there's no error in the portion of stderr.txt that we can see (from upload 97 to the end). I can only guess that there was a child process error earlier in the run: the restart succeeded, but the error flag wasn't cleared from the BOINC task status. The final task finish looks normal, with:

..The child process terminated with status: 0
...
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
07:35:35 (41942): called boinc_finish(0)
That's going to be a tough one to debug.
I am inclined to think this is a BOINC issue, not ours. The output shows the model & task completed normally, all log files look OK, boinc_finish() was called... and then code 9 (EBADF: bad file descriptor). I think it's to do with the final cleanup, but quite what I am not sure. Or it may be that BOINC expects us to be doing something we're not doing. Either way, it's file related and happens either inside boinc_finish() or just as the task code exits after boinc_finish(). I was going to look again at the way we clean up in the task to see if we missed anything.
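For reference, exit status 9 lines up with errno 9, and the name and message can be confirmed from Python's errno tables (a quick check, assuming a Linux host):

```python
import errno
import os

# errno 9 is EBADF; on Linux the message is "Bad file descriptor".
assert errno.EBADF == 9
print(os.strerror(errno.EBADF))
```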
---
CPDN Visiting Scientist
ID: 68465
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,276,661
RAC: 11,053
Message 68466 - Posted: 25 Feb 2023, 13:08:05 UTC - in response to Message 68465.  

I've pulled the overnight event log from the system journal, but there are no signs of any errors in there - seemed to be a normal finish after the final zip.
ID: 68466
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,538,338
RAC: 2,071
Message 68467 - Posted: 25 Feb 2023, 16:34:59 UTC - in response to Message 68450.  

My fastest and best-working cruncher lately got nothing but dead tasks, which all errored out, and now it has a daily quota of 1. :-(

OK: I now have three losers and two winners. No more in the hopper.

OpenIFS 43r3 1.21 x86_64-pc-linux-gnu
Number of tasks completed 	2
Max tasks per day 	5
Number of tasks today 	1
Consecutive valid tasks 	2
Average processing rate 	29.66 GFLOPS
Average turnaround time 	0.62 days

All OpenIFS 43r3 tasks for computer 1511241

22316084 	12214703 	24 Feb 2023, 22:24:03 UTC 	25 Feb 2023, 13:43:22 UTC 	Completed 	53,538.60 	52,734.92 	0.00 	OpenIFS 43r3 v1.21
x86_64-pc-linux-gnu
22314976 	12213630 	24 Feb 2023, 12:25:29 UTC 	25 Feb 2023, 3:03:19 UTC 	Completed 	52,615.79 	51,784.63 	0.00 	OpenIFS 43r3 v1.21
x86_64-pc-linux-gnu
22314676 	12213385 	22 Feb 2023, 6:23:59 UTC 	22 Feb 2023, 7:24:41 UTC 	Error while computing 	66.16 	1.15 	--- 	OpenIFS 43r3 v1.21
x86_64-pc-linux-gnu
22314647 	12213316 	22 Feb 2023, 3:24:44 UTC 	22 Feb 2023, 3:49:31 UTC 	Error while computing 	66.61 	1.28 	--- 	OpenIFS 43r3 v1.21
x86_64-pc-linux-gnu
22314608 	12213345 	22 Feb 2023, 0:25:23 UTC 	22 Feb 2023, 1:23:20 UTC 	Error while computing 	66.38 	1.15 	--- 	OpenIFS 43r3 v1.21
x86_64-pc-linux-gnu

ID: 68467
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 68468 - Posted: 25 Feb 2023, 17:00:06 UTC

OpenIFS standalone high resolution (high memory) tests

Thanks to those who messaged me they were interested in these. For family reasons I am not able to spend much time on CPDN voluntary work at the moment but I will get to this as soon as I can.
---
CPDN Visiting Scientist
ID: 68468
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4345
Credit: 16,533,637
RAC: 5,933
Message 68471 - Posted: 25 Feb 2023, 21:59:49 UTC
Last modified: 25 Feb 2023, 22:00:30 UTC

Another failure this time with,
CNT0 not found; string returned was: 'STEPO'

here
ID: 68471
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 68472 - Posted: 25 Feb 2023, 22:27:54 UTC - in response to Message 68471.  
Last modified: 25 Feb 2023, 22:28:50 UTC

Dave, look further back in the stderr output. You'll see the model fell over, in the radiation code this time. The 'STEPO' message is just a check the control code does on the model output to see where it's got to and whether it's still working (which it wasn't in this case). The extra printout at the bottom is just so we can figure out what went wrong. The actual model error will always be further back.

On the plus side I'm confident I've pinned down where the double free corruption is coming from. It's in the code that handles the trickles, which has now been rewritten.

Another failure this time with,
CNT0 not found; string returned was: 'STEPO'

here

---
CPDN Visiting Scientist
ID: 68472
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,538,338
RAC: 2,071
Message 68473 - Posted: 26 Feb 2023, 6:54:11 UTC - in response to Message 68323.  

You're also the only person that I've seen who uses RHEL. I wonder if Glenn has seen any correlations between failure rates and Linux distros?


I should mention that the main difference between RHEL distributions and Fedora distributions is that Fedora releases are quite a bit more recent in terms of additions and enhancements, whereas the RHEL distributions are meant for stability and tend to have no enhancements at all other than Thunderbird and Firefox. Even those two are "extended support" releases; e.g., my Firefox is 102.8.0esr (64-bit). So updates in Fedora are much more frequent than those for RHEL. RHEL tends to have a major release about every 18 months, and each release is supported for 10 years. I do not know what the support period for Fedora releases is.
ID: 68473
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4345
Credit: 16,533,637
RAC: 5,933
Message 68474 - Posted: 26 Feb 2023, 7:30:03 UTC - in response to Message 68472.  

Dave, look further back in the stderr output. You'll see the model fell over, in the radiation code this time.
Thanks, I did look but must have missed it.
ID: 68474
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 68477 - Posted: 26 Feb 2023, 10:06:34 UTC - in response to Message 68462.  

Dave, as I suspected, the resend of your failed task (Ryzen) worked fine. It landed on an Intel Xeon and completed. Another example of what is probably single-bit differences in computation making a difference in parts of the model that are very sensitive to small changes when comparing two numbers. It would be fun to send out an identical batch where I've compiled the code on a Ryzen with the Intel compiler instead of Intel+Intel and see what happens :)

https://www.cpdn.org/workunit.php?wuid=12215285

This task failed with a divide by zero error. Presumably this is one of those cases where the physics of the model get out of control?
Yes. It blew up in the model's convection code, clouds etc. Ran for a long time though, 49 days, before the instability occurred.

I would bet a beer that the rerun will be successful.

---
CPDN Visiting Scientist
ID: 68477
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4345
Credit: 16,533,637
RAC: 5,933
Message 68478 - Posted: 26 Feb 2023, 10:26:37 UTC - in response to Message 68477.  

Dave, as I suspected, the resend of your failed task (Ryzen) worked fine. It landed on an Intel Xeon and completed. Another example of what is probably single-bit differences in computation making a difference in parts of the model that are very sensitive to small changes when comparing two numbers. It would be fun to send out an identical batch where I've compiled the code on a Ryzen with the Intel compiler instead of Intel+Intel and see what happens :)


And this one completed for me on its final chance. One of the two others was a Ryzen, the other an Intel; I suspect lack of memory accounts for the failure rate. The Ryzen failed with

ABORT!    1 !! *** WAVE MODEL HAS ABORTED
ID: 68478
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 68479 - Posted: 26 Feb 2023, 10:33:42 UTC - in response to Message 68478.  

Dave, as I suspected, the resend of your failed task (Ryzen) worked fine. It landed on an Intel Xeon and completed. Another example of what is probably single-bit differences in computation making a difference in parts of the model that are very sensitive to small changes when comparing two numbers. It would be fun to send out an identical batch where I've compiled the code on a Ryzen with the Intel compiler instead of Intel+Intel and see what happens :)


And this one completed for me on its final chance. One of the two others was a Ryzen, the other an Intel; I suspect lack of memory accounts for the failure rate. The Ryzen failed with

ABORT!    1 !! *** WAVE MODEL HAS ABORTED

The wave model abort is an indication of an earlier failed task with memory corruption.
---
CPDN Visiting Scientist
ID: 68479
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4345
Credit: 16,533,637
RAC: 5,933
Message 68480 - Posted: 26 Feb 2023, 11:07:56 UTC - in response to Message 68479.  

My last one, which is currently 65% complete on my machine, failed after 16 hours of CPU time on another machine with only,

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>

</stderr_txt>
]]>
in the stderr, which seems a bit odd. I am assuming that the machine is running too many tasks for the amount of RAM.

culprit here.
ID: 68480
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 68481 - Posted: 26 Feb 2023, 12:45:09 UTC - in response to Message 68480.  

Yep, I agree. That machine (https://www.cpdn.org/show_host_detail.php?hostid=1531276) has barely completed any tasks successfully; it's an utter waste of compute time. I really wish people would not make themselves anonymous, as then I could message them and help sort out any issues. I'm surprised they haven't looked into it more given the failure rate, but maybe they are running other projects and are not interested.

My last one, which is currently 65% complete on my machine, failed after 16 hours of CPU time on another machine with only,

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)</message>
<stderr_txt>

</stderr_txt>
]]>
in the stderr, which seems a bit odd. I am assuming that the machine is running too many tasks for the amount of RAM.

culprit here.

---
CPDN Visiting Scientist
ID: 68481
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,538,338
RAC: 2,071
Message 68482 - Posted: 26 Feb 2023, 13:24:11 UTC - in response to Message 68481.  

Yep, I agree. That machine (https://www.cpdn.org/show_host_detail.php?hostid=1531276) has barely completed any tasks successfully; it's an utter waste of compute time. I really wish people would not make themselves anonymous, as then I could message them and help sort out any issues. I'm surprised they haven't looked into it more given the failure rate, but maybe they are running other projects and are not interested.


I looked at a few of the tasks on that machine and have some impressions. Not that there is anything wrong with the machine as such, but...
1.) It gets a fantastically large number of suspends and resumes.
2.) It is running the CentOS 7 distribution. CentOS is much like RHEL, so it is effectively RHEL 7. I happen to be running Red Hat Enterprise Linux release 8.7 (Ootpa), which is a whole generation (i.e., about 1 1/2 years) newer than 7. And RHEL 9 has been available for a while now.
3.) Similarly, I am running boinc-client 7.20.2, but that machine is running 7.16.something.
4.) It has only 32 GBytes of RAM. That is not necessarily bad, depending on what tasks are being run, but if there are too many, that might account for all those suspends and resumes.

None of these is necessarily a problem on its own, but taken together...?

I think #1 is very suspicious.
ID: 68482
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 68484 - Posted: 26 Feb 2023, 15:13:50 UTC - in response to Message 68482.  

Yep, I agree. That machine (https://www.cpdn.org/show_host_detail.php?hostid=1531276) has barely completed any tasks successfully; it's an utter waste of compute time. I really wish people would not make themselves anonymous, as then I could message them and help sort out any issues. I'm surprised they haven't looked into it more given the failure rate, but maybe they are running other projects and are not interested.
I looked at a few of the tasks on that machine and have some impressions. Not that there is anything wrong with the machine, but ...
1.) It gets a fantastically large number of suspends and resumes ....
I think #1 is very suspicious.
A large number of suspends/resumes is not a problem unless 'leave non-GPU tasks in memory' is unset, which is not the case here; otherwise we'd see the model constantly restarting. It just means the CPU usage percentage is set to much less than 100%.

But when I look at the task logs, the model is almost always being killed by signal 9 (kill -SIGKILL), in some cases when it's almost finished. I don't know if BOINC is sending the signal; maybe Richard would know. I would have thought BOINC would send a different signal (SIGQUIT?), which should just stop the task but allow it to restart later. Or maybe the operating system killed it because of memory oversubscription, as Dave mentioned (the OOM killer). There's not much to go on in the task output. It's nothing to do with the OS flavour in use.

It may not be the fault of the user of course. We know that the boinc client doesn't keep very good track of memory used by OpenIFS.
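What a "killed by signal 9" death looks like from the parent's side can be sketched with a short Python snippet (assuming a POSIX host with a `sleep` binary on the PATH); a client waiting on the task would see a similar wait status:

```python
import signal
import subprocess

# Start a long-running child, then kill it the way the OOM killer would.
child = subprocess.Popen(["sleep", "60"])
child.send_signal(signal.SIGKILL)
child.wait()

# A negative returncode means the child died from that signal number:
# -9 here, matching the "killed by signal 9" entries in the task logs.
print(child.returncode)  # -9
```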
---
CPDN Visiting Scientist
ID: 68484
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4345
Credit: 16,533,637
RAC: 5,933
Message 68485 - Posted: 26 Feb 2023, 16:49:51 UTC

For the record, the task completed on my box. Now waiting for more work!
ID: 68485
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,538,338
RAC: 2,071
Message 68487 - Posted: 26 Feb 2023, 19:13:07 UTC - in response to Message 68484.  

A large number of suspends/resumes is not a problem unless 'leave non-GPU tasks in memory' is unset, which is not the case here; otherwise we'd see the model constantly restarting. It just means the CPU usage percentage is set to much less than 100%.


I am prepared to accept that in this case. But I still do not understand the faith that 'leave non-GPU tasks in memory' guarantees that the task(s) so marked will be kept in memory. Here is the BOINC usage on my machine at the moment:
    
PID        PPID USER      PR  NI S    RES  %MEM  %CPU  P     TIME+ COMMAND                                                                   
 260110    2207 boinc     39  19 R   1.9g   1.5  98.9  3 173:48.59 ../../projects/einstein.phys.uwm.edu/einstein_O3MD1_1.03_x86_64-pc-linux+ 
 269459    2207 boinc     39  19 R 326720   0.2  98.9  2  28:02.31 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
 261928    2207 boinc     39  19 R 321496   0.2  99.0  4 143:13.60 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
 263896    2207 boinc     39  19 R 317356   0.2  99.4  9 113:54.09 ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-g+ 
   2207       1 boinc     30  10 S  41324   0.0   0.1  8  44761:25 /usr/bin/boinc   <---<<< boinc client                                                            
 269447    2207 boinc     39  19 R   7020   0.0  99.2  1  28:23.18 ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_1.46_x86_64-pc-linu+ 
 270386    2207 boinc     39  19 R   6996   0.0  99.2  7  14:32.60 ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_1.46_x86_64-pc-linu+ 
 270663    2207 boinc     39  19 R   5884   0.0  99.0 13   9:51.22 ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_1.46_x86_64-pc-linu+ 
 267477    2207 boinc     39  19 R   4984   0.0  99.1  8  60:56.19 ../../projects/universeathome.pl_universe/BHspin2_20_x86_64-pc-linux-gnu  
 268160    2207 boinc     39  19 R   4560   0.0  99.1  0  48:48.88 ../../projects/universeathome.pl_universe/BHspin2_20_x86_64-pc-linux-gnu  
 268791    2207 boinc     39  19 R   4480   0.0  99.3  6  39:17.70 ../../projects/universeathome.pl_universe/BHspin2_20_x86_64-pc-linux-gnu  


The column labelled PR (priority) shows that the task processes run at priority 39 and that the boinc client runs at priority 30. In Linux (and UNIX), the higher the "priority" number, the less likely a process is to run. Furthermore, a process not assigned a priority (i.e., most processes) runs at PR 20. So if someone starts a new process at PR 20, it will force an existing process with a higher PR number to give up its processor so that the PR 20 process can have it.
The column labelled S (status) is R for running and S for sleeping. There are other possibilities, but none apply here.
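The PR figures in the listing follow directly from the NI (nice) column: for normal, non-realtime processes, top reports PR = 20 + nice. A trivial sketch of that mapping (the labels are illustrative, chosen to match the listing above):

```python
# top's PR column for normal (non-realtime) processes is 20 + nice.
# The nice values below match the NI column in the listing above:
# the science apps run at nice 19 (PR 39), the client at nice 10 (PR 30).
for name, nice in [("default process", 0), ("boinc client", 10), ("science app", 19)]:
    print(f"{name:>15}: nice {nice:>2} -> PR {20 + nice}")
```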

Now if a new process needs more RAM than is currently available, what happens? I have never studied the Linux kernel code, and I have not looked at UNIX code since the 1970s (when it was written in assembler for the PDP-11), and even then I was no expert. But it would seem to me that the kernel would normally just swap out the process with the higher PR to make room for the new one. I do not believe one could lock a process into core in those days, though.
So what happens now? It seems to me the kernel has two choices: 1.) ignore 'leave non-GPU tasks in memory' and swap the process out, or 2.) refuse to start the new process.

Is there a third option? If not, which is done? It seems to me that neither of the two options I suggested is acceptable.
ID: 68487
SolarSyonyk

Joined: 7 Sep 16
Posts: 257
Credit: 31,930,360
RAC: 38,279
Message 68490 - Posted: 26 Feb 2023, 20:30:44 UTC - in response to Message 68477.  
Last modified: 26 Feb 2023, 20:31:43 UTC

It would be fun to send out an identical batch where I've compiled the code on a Ryzen with the intel compiler instead of Intel+Intel and see what happens :)


Is there a reason you can't? Send out a couple hundred otherwise identical WUs in a few batches and compare/contrast results?

I don't think there's a shortage of willing CPU cores right now.

Or even just have some people run the binaries manually and send you results somehow. I've got a range of AMD systems that are mostly bored!
ID: 68490


©2024 climateprediction.net