New work Discussion

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 64834 - Posted: 20 Nov 2021, 16:53:25 UTC

Does this help? (I have not tried it yet.)

https://www.geeksforgeeks.org/see-cache-statistics-linux/
I will have a look soon. What I would really like is something that shows cache usage by application, in much the same way that top shows CPU usage.

Thanks. I think that is a good first indication. It is not surprising that they use a lot of cache.


I had assumed the slowdown was due to writing to the swap file because of a lack of RAM.
ID: 64834 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 64835 - Posted: 21 Nov 2021, 1:46:01 UTC - in response to Message 64834.  

I had assumed the slowdown was due to writing to the swap file because of a lack of RAM.


With my 64 GBytes of RAM, I seldom use the swap file at all. With the OpenIFS stuff, if it ever gets into production, it may use more, of course.

That link I showed is not very useful, at least as far as I can tell. I found another way to get at this.

Right now I have my main machine running (the one with 64 GBytes RAM and 16384+512K bytes memory cache).
It is running four N216 CPDN tasks and four WCG tasks (two ARP1's and two CPN1's).
I found something I downloaded, the perf command, and ran it for a while.

It looks like one can learn a lot with it. The manual page helps and running perf --help should help too.

Running two N216, 5 WCG, and one rosetta
localhost:root[/home/jeandavid8]# perf stat -aB -e cache-references,cache-misses
^C
 Performance counter stats for 'system wide':

    12,019,648,573      cache-references                                            
     6,494,633,179      cache-misses             #   54.033 % of all cache refs    

      23.651410187 seconds time elapsed

Running no CPDN, but some WCG 
localhost:root[/home/jeandavid8]# perf stat -aB -e cache-references,cache-misses
^C
 Performance counter stats for 'system wide':

     3,727,867,972      cache-references                                            
     1,368,824,386      cache-misses              #   36.719 % of all cache refs    

      14.735167255 seconds time elapsed

Running only boinc client, no tasks:
localhost:root[/home/jeandavid8]# perf stat -aB -e cache-references,cache-misses
^C
 Performance counter stats for 'system wide':

       128,714,374      cache-references                                            
         5,195,723      cache-misses              #    4.037 % of all cache refs    

      25.357159122 seconds time elapsed
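
The percentage perf prints is simply cache-misses divided by cache-references, so the three runs above can be compared directly. A minimal Python sketch, assuming the counter lines are pasted in as text; the parse_counts and miss_ratio helpers are made up for this illustration and are not part of perf:

```python
import re

# Counter lines copied from the three perf runs in this post.
SAMPLES = {
    "two N216, 5 WCG, one rosetta": """
    12,019,648,573      cache-references
     6,494,633,179      cache-misses
""",
    "no CPDN, some WCG": """
     3,727,867,972      cache-references
     1,368,824,386      cache-misses
""",
    "boinc client only": """
       128,714,374      cache-references
         5,195,723      cache-misses
""",
}


def parse_counts(text):
    """Return {event_name: count} for lines shaped like '<count> <event>'."""
    counts = {}
    for line in text.splitlines():
        match = re.match(r"\s*([\d,]+)\s+([\w-]+)", line)
        if match:
            counts[match.group(2)] = int(match.group(1).replace(",", ""))
    return counts


def miss_ratio(text):
    """Cache misses as a percentage of cache references."""
    counts = parse_counts(text)
    return 100.0 * counts["cache-misses"] / counts["cache-references"]


for label, text in SAMPLES.items():
    print(f"{label}: {miss_ratio(text):.3f} % of all cache refs")
```

These are system-wide figures (the -a flag). perf stat also accepts -p with a process ID, which may get closer to a per-application view, although nothing in this thread tested how well that works for a single BOINC task.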

ID: 64835 · Report as offensive
klepel

Joined: 9 Oct 04
Posts: 76
Credit: 67,812,914
RAC: 5,809
Message 64896 - Posted: 5 Jan 2022, 16:29:40 UTC
Last modified: 5 Jan 2022, 16:50:29 UTC

Am I the only one with problems with the new short tasks (UK Met Office HadCM3 short v8.36)?

It seems all error with:
<core_client_version>7.16.6</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /home/roland/projects/climateprediction.net/hadcm3s_1dei_200012_168_926_012128606/jobs/climate.cpdc, line 396, position 20
Image              PC        Routine            Line        Source             
hadcm3s_um_8.36_i  0851D9E5  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  085429B6  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0832EC95  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  081FD206  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  081FED33  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0848CCB5  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0848BE04  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  08496BAD  Unknown               Unknown  Unknown
Suspended CPDN Monitor - Suspend request from BOINC...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=240, iMonCtr=1
Model crash detected, will try to restart...
Suspended CPDN Monitor - Suspend request from BOINC...
forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /home/roland/projects/climateprediction.net/hadcm3s_1dei_200012_168_926_012128606/jobs/climate.cpdc, line 396, position 20
Image              PC        Routine            Line        Source             
hadcm3s_um_8.36_i  0851D9E5  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  085429B6  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0832EC95  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  081FD206  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  081FED33  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0848CCB5  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0848BE04  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  08496BAD  Unknown               Unknown  Unknown
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=240, iMonCtr=1
Model crash detected, will try to restart...
Suspended CPDN Monitor - Suspend request from BOINC...
forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /home/roland/projects/climateprediction.net/hadcm3s_1dei_200012_168_926_012128606/jobs/climate.cpdc, line 396, position 20
Image              PC        Routine            Line        Source             
hadcm3s_um_8.36_i  0851D9E5  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  085429B6  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0832EC95  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  081FD206  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  081FED33  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0848CCB5  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0848BE04  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  08496BAD  Unknown               Unknown  Unknown
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=240, iMonCtr=1
Model crash detected, will try to restart...
Suspended CPDN Monitor - Suspend request from BOINC...
forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /home/roland/projects/climateprediction.net/hadcm3s_1dei_200012_168_926_012128606/jobs/climate.cpdc, line 396, position 20
Image              PC        Routine            Line        Source             
hadcm3s_um_8.36_i  0851D9E5  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  085429B6  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0832EC95  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  081FD206  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  081FED33  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0848CCB5  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0848BE04  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  08496BAD  Unknown               Unknown  Unknown
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=240, iMonCtr=1
Model crash detected, will try to restart...
Suspended CPDN Monitor - Suspend request from BOINC...
forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /home/roland/projects/climateprediction.net/hadcm3s_1dei_200012_168_926_012128606/jobs/climate.cpdc, line 396, position 20
Image              PC        Routine            Line        Source             
hadcm3s_um_8.36_i  0851D9E5  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  085429B6  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0832EC95  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  081FD206  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  081FED33  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0848CCB5  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0848BE04  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  08496BAD  Unknown               Unknown  Unknown
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=240, iMonCtr=1
Model crash detected, will try to restart...
Suspended CPDN Monitor - Suspend request from BOINC...
forrtl: severe (17): syntax error in NAMELIST input, unit 5, file /home/roland/projects/climateprediction.net/hadcm3s_1dei_200012_168_926_012128606/jobs/climate.cpdc, line 396, position 20
Image              PC        Routine            Line        Source             
hadcm3s_um_8.36_i  0851D9E5  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  085429B6  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0832EC95  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  081FD206  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  081FED33  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0848CCB5  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  0848BE04  Unknown               Unknown  Unknown
hadcm3s_um_8.36_i  08496BAD  Unknown               Unknown  Unknown
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=240, iMonCtr=1
Model crash detected, will try to restart...
Sorry, too many model crashes! :-(
17:23:42 (240): called boinc_finish(22)

</stderr_txt>
]]>


Sorry for not being precise: that one is from one of my WSL computers. On the Linux computers I get:
<core_client_version>7.16.5</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (10 frames):
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f1eb60]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0xf7bfcee5]

Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3927, iMonCtr=1
Model crash detected, will try to restart...
SIGSEGV: segmentation violation
Stack trace (10 frames):
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f65b60]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0xf7c43ee5]

Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3927, iMonCtr=1
Model crash detected, will try to restart...
SIGSEGV: segmentation violation
Stack trace (10 frames):
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f4db60]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0xf7c2bee5]

Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3927, iMonCtr=1
Model crash detected, will try to restart...
SIGSEGV: segmentation violation
Stack trace (10 frames):
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f63b60]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0xf7c41ee5]

Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3927, iMonCtr=1
Model crash detected, will try to restart...
SIGSEGV: segmentation violation
Stack trace (10 frames):
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f16b60]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0xf7bf4ee5]

Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3927, iMonCtr=1
Model crash detected, will try to restart...
SIGSEGV: segmentation violation
Stack trace (10 frames):
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7fc8b60]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04]
/home/kle1boinc/BOINC/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf5)[0xf7ca6ee5]

Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=3927, iMonCtr=1
Model crash detected, will try to restart...
Sorry, too many model crashes! :-(
03:08:41 (3927): called boinc_finish(22)

</stderr_txt>
]]>

I know, this
SIGSEGV: segmentation violation
is normally associated with RAM overclocking, but these computers do quite well with the long WUs (hadam4h).
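
For anyone comparing long stderr dumps like the two above, a short script can tally which crash signature repeats and how many restarts were attempted. This is only a sketch: the patterns are copied from the messages shown in this post, and the summarize_stderr helper is a hypothetical name, not anything supplied by BOINC or CPDN.

```python
import re
from collections import Counter

# Signature strings taken from the stderr excerpts in this post.
SIGNATURES = {
    "namelist syntax error": re.compile(
        r"forrtl: severe \(17\): syntax error in NAMELIST input"
    ),
    "segmentation violation": re.compile(r"SIGSEGV: segmentation violation"),
}
RESTART = re.compile(r"Model crash detected, will try to restart")
GAVE_UP = re.compile(r"Sorry, too many model crashes")


def summarize_stderr(text):
    """Count crash signatures, restart attempts and the final give-up line."""
    counts = Counter()
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
        if RESTART.search(line):
            counts["restart attempts"] += 1
        if GAVE_UP.search(line):
            counts["gave up"] += 1
    return counts


# Shortened excerpt in the same shape as the Linux stderr above.
EXAMPLE = """SIGSEGV: segmentation violation
Stack trace (10 frames):
Exiting...
Model crash detected, will try to restart...
SIGSEGV: segmentation violation
Stack trace (10 frames):
Exiting...
Model crash detected, will try to restart...
Sorry, too many model crashes! :-(
"""

for name, count in summarize_stderr(EXAMPLE).items():
    print(f"{name}: {count}")
```

Run against a whole stderr.txt, a tally like this makes it quick to see whether a task died on the namelist error or on the segmentation violation.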
ID: 64896 · Report as offensive
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 64897 - Posted: 5 Jan 2022, 16:44:59 UTC - in response to Message 64896.  

I just got one and it errored out with the same type of namelist errors.

E-mail sent to Andy and Sarah.
ID: 64897 · Report as offensive
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 64898 - Posted: 5 Jan 2022, 17:25:36 UTC
Last modified: 5 Jan 2022, 19:52:00 UTC

So far, mine all seem OK. One has been running long enough to produce 5 zip files.
ID: 64898 · Report as offensive
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 64899 - Posted: 5 Jan 2022, 17:42:56 UTC - in response to Message 64898.  

Yeah, I checked some out and have seen some trickling. I jumped the gun with the e-mail, and my errors on the two I downloaded were segmentation faults rather than namelist errors. The hadcm3s has historically had a relatively high failure rate with segmentation faults.
ID: 64899 · Report as offensive
Bryn Mawr

Joined: 28 Jul 19
Posts: 147
Credit: 12,814,088
RAC: 261,385
Message 64900 - Posted: 5 Jan 2022, 19:00:11 UTC
Last modified: 5 Jan 2022, 19:02:57 UTC

14 failures here today, all CM3, all after about 44 seconds elapsed / 4 seconds CPU, with the same SIGSEGV as reported.
ID: 64900 · Report as offensive
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1079
Credit: 6,906,534
RAC: 6,466
Message 64901 - Posted: 5 Jan 2022, 19:06:33 UTC

I've got one running OK on a Mac, so far.
ID: 64901 · Report as offensive
Landjunge
Joined: 17 Aug 07
Posts: 8
Credit: 35,353,184
RAC: 1,966,142
Message 64902 - Posted: 5 Jan 2022, 20:59:34 UTC
Last modified: 5 Jan 2022, 20:59:52 UTC

Hi, I looked in here because all CM3s are failing. It happens on all 4 Linux machines. The AM4s are working fine. I'm using Ubuntu 20.04 Server 64-bit with the 32-bit libraries as mentioned in the other thread.
ID: 64902 · Report as offensive
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 64903 - Posted: 5 Jan 2022, 22:58:30 UTC - in response to Message 64902.  

My laptop downloaded 4 of them and all died with segmentation faults. It then grabbed a hadam4h, which, of course, is running fine.
ID: 64903 · Report as offensive
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 64904 - Posted: 6 Jan 2022, 5:52:26 UTC - in response to Message 64902.  
Last modified: 6 Jan 2022, 5:54:47 UTC

Hi, I looked in here because all CM3s are failing. It happens on all 4 Linux machines. The AM4s are working fine. I'm using Ubuntu 20.04 Server 64-bit with the 32-bit libraries as mentioned in the other thread.

Six of these are running OK here so far; four have produced trickles and uploaded zips. Off to work soon; when I get back I will poke around to see if I can find any common factors in the machines with failures and/or those that are getting far enough to produce trickles.

Ubuntu 21.10 and BOINC 7.19.0 (the odd number after the 7. indicates a pre-release version, which I compiled from source).
ID: 64904 · Report as offensive
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 64905 - Posted: 6 Jan 2022, 11:50:45 UTC

Off to work soon; when I get back I will poke around to see if I can find any common factors in the machines with failures and/or those that are getting far enough to produce trickles.


So far, the error types are missing libraries, seg fault, process creation error (computers with this seem to be crashing everything, as they do with missing libraries) and bad CPU type error (all on machines running Darwin). As to spotting any pattern, nothing has emerged from the noise yet.
ID: 64905 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 64906 - Posted: 6 Jan 2022, 12:21:47 UTC - in response to Message 64905.  

Oh good! It is not just me. I think three out of three hadcm3s have just failed after a very few seconds each, while I have three hadam4h tasks that have each been running for at least a full day. My machine is not overclocked. Here is the beginning of the stderr file for one of the failed hadcm3s tasks:
Name 	hadcm3s_1gxf_200012_168_926_012129163_0
Workunit 	12129163
Created 	4 Jan 2022, 11:47:50 UTC
Sent 	6 Jan 2022, 6:13:06 UTC
Report deadline 	19 Dec 2022, 11:33:06 UTC
Received 	6 Jan 2022, 12:03:52 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	22 (0x00000016) Unknown error code
Computer ID 	1511241
Run time 	36 sec
CPU time 	2 sec
Validate state 	Invalid
Credit 	0.00
Device peak FLOPS 	6.57 GFLOPS
Application version 	UK Met Office HadCM3 short v8.36
i686-pc-linux-gnu
Peak working set size 	122.46 MB
Peak swap size 	181.44 MB
Peak disk usage 	4.50 MB
Stderr 	

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>
SIGSEGV: segmentation violation
Stack trace (10 frames):
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f5c140]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad]
/usr/lib/libc.so.6(__libc_start_main+0xf9)[0xf7cd01e9]

Exiting...
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=602350, iMonCtr=1
Model crash detected, will try to restart...
SIGSEGV: segmentation violation
Stack trace (10 frames):
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu(boinc_catch_signal+0x67)[0x84ff4f7]
linux-gate.so.1(__kernel_sigreturn+0x0)[0xf7f30140]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x84277ad]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x80e8e67]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8089442]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8479d6e]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8494feb]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x848be04]
/var/lib/boinc/projects/climateprediction.net/hadcm3s_um_8.36_i686-pc-linux-gnu[0x8496bad]
/usr/lib/libc.so.6(__libc_start_main+0xf9)[0xf7ca41e9]

...


hadcm3s_1lbw_200012_168_926_012129901_0 failed the same way.
ID: 64906 · Report as offensive
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 64907 - Posted: 6 Jan 2022, 13:13:15 UTC - in response to Message 64906.  

Oh good! It is not just me.


A long way from being just you. Sadly, I am not finding enough data to work out whether only some work units have the problem or whether it is something about the computers involved. So far I have seen that both Intel and AMD machines are implicated in the segmentation faults, but both also include machines, like my own, that have produced trickles. This is true across both Darwin and Linux computers, and certainly both Ubuntu and Debian have tasks sending trickles as well as failures of this type. Yours is the only Red Hat machine I have looked at, so there is a lack of data there. I don't know if there is anyone out there who could write a script to search for patterns, but my brain is failing to spot anything useful.
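
One possible shape for such a script, as a rough Python sketch: collect one record per task, then compare failure rates grouped by a single attribute at a time. Everything below is an assumption for illustration. The record fields, the failure_rates helper and above all the sample values are placeholders, not data gathered from the project.

```python
from collections import defaultdict

# HYPOTHETICAL placeholder records, NOT real task results.
# In practice each record would be filled in by hand or scraped from task pages.
RECORDS = [
    {"os": "Ubuntu", "cpu": "Intel", "outcome": "segfault"},
    {"os": "Ubuntu", "cpu": "AMD", "outcome": "trickling"},
    {"os": "Debian", "cpu": "Intel", "outcome": "trickling"},
    {"os": "Darwin", "cpu": "Intel", "outcome": "bad cpu type"},
    {"os": "Red Hat", "cpu": "AMD", "outcome": "segfault"},
]


def failure_rates(records, key, ok_outcome="trickling"):
    """Return {attribute value: fraction of tasks that did not reach ok_outcome}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for record in records:
        value = record[key]
        totals[value] += 1
        if record["outcome"] != ok_outcome:
            failures[value] += 1
    return {value: failures[value] / totals[value] for value in totals}


for key in ("os", "cpu"):
    print(key)
    for value, rate in sorted(failure_rates(RECORDS, key).items()):
        print(f"  {value}: {rate:.0%} failed")
```

If an attribute matters, its values should show clearly different failure rates; if every group fails at about the same rate, the cause is more likely in the work units than in the machines. With only a handful of machines per group the numbers will be noisy.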
ID: 64907 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 64908 - Posted: 6 Jan 2022, 13:21:34 UTC - in response to Message 64907.  

Yours is the only Red Hat machine I have looked at, so there is a lack of data there.


The CentOS Linux 8 machines should be the same as mine... Of course, there may not be (m)any of those either. Red Hat must be paid for, but IIRC, CentOS is free.
ID: 64908 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 64909 - Posted: 6 Jan 2022, 14:13:04 UTC - in response to Message 64907.  

I notice my machine successfully completed about a dozen hadcm3s work units in March and April.
ID: 64909 · Report as offensive
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 64911 - Posted: 6 Jan 2022, 16:50:28 UTC

So far seg faults make up about 25% of the failures, while with the hadam4h, they usually make up <5%.
ID: 64911 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 64912 - Posted: 6 Jan 2022, 20:32:10 UTC - in response to Message 64907.  

Could this excerpt from the BOINC Manager Event log be of any use? It is from one of my failed tasks. They all seem to look like this.
Thu 06 Jan 2022 06:59:36 AM EST | climateprediction.net | Starting task hadcm3s_1gxf_200012_168_926_012129163_0
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Computation for task hadcm3s_1gxf_200012_168_926_012129163_0 finished
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_1.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_2.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_3.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_4.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_5.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_6.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_7.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_8.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_9.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_10.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_11.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_12.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_13.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_14.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:13 AM EST | climateprediction.net | Output file hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_restart.zip for task hadcm3s_1gxf_200012_168_926_012129163_0 absent
Thu 06 Jan 2022 07:00:16 AM EST | climateprediction.net | Started upload of hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_out.zip
Thu 06 Jan 2022 07:00:18 AM EST | climateprediction.net | Finished upload of hadcm3s_1gxf_200012_168_926_012129163_0_r100040496_out.zip

ID: 64912 · Report as offensive
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 64913 - Posted: 6 Jan 2022, 21:18:58 UTC - in response to Message 64912.  

Jean-David,

That's just the listing of files that should have been produced if it had run to the end. These used to be written to stderr.txt, which is what is displayed on the task webpages when they finish, but now they are just listed in the message log at the end of failures, when one or more of the files that a successful run should have produced weren't.
ID: 64913 · Report as offensive
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 64916 - Posted: 7 Jan 2022, 5:31:35 UTC - in response to Message 64905.  

So far, the error types are missing libraries, seg fault, process creation error (computers with this seem to be crashing everything, as they do with missing libraries) and bad CPU type error (all on machines running Darwin). As to spotting any pattern, nothing has emerged from the noise yet.


You guys have my sympathy.

I looked at all 6 of the tasks I tried to run that failed to see if others had trouble too. IIRC, some had trouble. Some have not been retried.
Two had missing 32-bit libraries.
They had a variety of operating systems, even FreeBSD.

One failed with this; I have no idea how this happened.

Process creation (../../projects/climateprediction.net/hadcm3s_8.36_i686-pc-linux-gnu) failed: Error -1, errno=2
execv: No such file or directory
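
The errno=2 in that message is ENOENT, which is where the "No such file or directory" text comes from: the executable (or something needed to start it) could not be found at launch time. A small Python sketch of how to decode the number and reproduce the same kind of error; the errno_name and launch_errno helpers and the path are made up for this illustration:

```python
import errno
import os
import subprocess


def errno_name(code):
    """Map a numeric errno to its symbolic name, e.g. 2 -> 'ENOENT'."""
    return errno.errorcode.get(code, "unknown")


def launch_errno(path):
    """Try to start a program; return the errno if the launch itself fails."""
    try:
        subprocess.run([path], check=False)
    except OSError as exc:
        return exc.errno
    return 0


print(errno_name(2), "-", os.strerror(2))

# Hypothetical path that should not exist on any machine.
print(launch_errno("/no/such/dir/hypothetical_binary"))
```

On a 64-bit Linux system the same errno can also appear when the 32-bit program is present but its 32-bit loader is not installed, so this failure may be related to the missing-library cases rather than to a missing download. That is a possibility to check, not something the message itself confirms.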
ID: 64916 · Report as offensive

©2024 climateprediction.net