Message boards : Number crunching : New work Discussion

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,498,761
RAC: 5,627
Message 64974 - Posted: 16 Jan 2022, 9:09:46 UTC

And bar some inevitable resends, the hadcm3s tasks are all gone now, and at the current rate it won't be many days until the current batch of HadAM4 tasks is also gone.
ID: 64974
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 64975 - Posted: 17 Jan 2022, 15:01:34 UTC

I just now manually terminated all the batch 926 "shorts" waiting to run, because they've all been dying with SIGSEGV.
If cpdn runs out of work, I can run other projects, and maybe spend a few hours on hardware and software updates.

keep on crunching

e
ID: 64975
Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1081
Credit: 6,981,170
RAC: 3,836
Message 64976 - Posted: 17 Jan 2022, 15:24:55 UTC - in response to Message 64975.  

I just now manually terminated all the batch 926 "shorts" waiting to run, because they've all been dying with SIGSEGV.
If cpdn runs out of work, I can run other projects, and maybe spend a few hours on hardware and software updates.

keep on crunching

e


Thanks for doing that! My ancient Mac mini is slowly crunching through those models without failing so far, and has just picked up one of your rejects.
ID: 64976
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,498,761
RAC: 5,627
Message 64977 - Posted: 17 Jan 2022, 17:10:28 UTC - in response to Message 64975.  

I just now manually terminated all the batch 926 "shorts" waiting to run, because they've all been dying with SIGSEGV.
If cpdn runs out of work, I can run other projects, and maybe spend a few hours on hardware and software updates.

keep on crunching

e

I have had about 1/3 successes and 2/3 failures with SIGSEGV on this batch. It will be interesting to see if the batch which uses restarts from the successes of this batch has the lower failure rate the project are hoping for. I understand the theory that the initial conditions for the failing batches are thought to be a bit whacky but all too often I have seen a difference between theory and practice with some of these.
ID: 64977
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1056
Credit: 16,520,115
RAC: 1,176
Message 64978 - Posted: 17 Jan 2022, 20:12:19 UTC - in response to Message 64977.  

I have had about 1/3 successes and 2/3 failures with SIGSEGV on this batch. It will be interesting to see if the batch which uses restarts from the successes of this batch has the lower failure rate the project are hoping for. I understand the theory that the initial conditions for the failing batches are thought to be a bit whacky but all too often I have seen a difference between theory and practice with some of these.


I have had a 100% failure rate with this batch, but a pretty good success rate with similar ones from last spring; most were like this:
Name 	hadcm3s_a1dq_191012_120_900_012072461_0
Workunit 	12072461
Created 	31 Mar 2021, 15:02:24 UTC
Sent 	21 Apr 2021, 0:45:49 UTC
Report deadline 	3 Apr 2022, 6:05:49 UTC
Received 	23 Apr 2021, 17:47:34 UTC
Server state 	Over
Outcome 	Success


but one failed like this; I consider it a legitimate failure, perhaps due to unfortunate values for the initial conditions. The other user's copy failed due to missing libraries.

Task 22024864
Name 	hadcm3s_r157_190012_240_837_011897728_1
Workunit 	11897728
Created 	28 Feb 2021, 11:33:16 UTC
Sent 	9 Mar 2021, 12:13:36 UTC
Report deadline 	19 Feb 2022, 17:33:36 UTC
Received 	12 Mar 2021, 11:59:59 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	22 (0x00000016) Unknown error code
Computer ID 	1511241
Run time 	2 days 5 hours 3 min 21 sec
CPU time 	30 sec
Validate state 	Invalid
Credit 	3,421.44
Device peak FLOPS 	6.58 GFLOPS
Application version 	UK Met Office HadCM3 short v8.36
i686-pc-linux-gnu
Peak working set size 	175.64 MB
Peak swap size 	215.60 MB
Peak disk usage 	98.92 MB
Stderr 	

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>
MainError:	11:23:44 AM	No files match the supplied pattern.
MainError:	11:23:44 AM	No files match the supplied pattern.
MainError:	04:04:46 PM	No files match the supplied pattern.
MainError:	04:04:46 PM	No files match the supplied pattern.
MainError:	08:50:45 PM	No files match the supplied pattern.
MainError:	08:50:45 PM	No files match the supplied pattern.
MainError:	01:39:23 AM	No files match the supplied pattern.
MainError:	01:39:23 AM	No files match the supplied pattern.
MainError:	06:24:21 AM	No files match the supplied pattern.
MainError:	06:24:21 AM	No files match the supplied pattern.
MainError:	11:01:18 AM	No files match the supplied pattern.
MainError:	11:01:18 AM	No files match the supplied pattern.
MainError:	03:43:02 PM	No files match the supplied pattern.
MainError:	03:43:02 PM	No files match the supplied pattern.
MainError:	08:28:44 PM	No files match the supplied pattern.
MainError:	08:28:44 PM	No files match the supplied pattern.
MainError:	01:12:53 AM	No files match the supplied pattern.
MainError:	01:12:53 AM	No files match the supplied pattern.
MainError:	05:51:28 AM	No files match the supplied pattern.
MainError:	05:51:28 AM	No files match the supplied pattern.
MainError:	10:26:43 AM	No files match the supplied pattern.
MainError:	10:26:43 AM	No files match the supplied pattern.

Model crashed: ATM_DYN : INVALID THETA DETECTED.    tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED.    tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED.    tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED.    tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED.    tmp/pipe_dummy
Model crashed: ATM_DYN : INVALID THETA DETECTED.    tmp/pipe_dummy
Sorry, too many model crashes! :-(
06:39:07 (80419): called boinc_finish(22)

</stderr_txt>
]]>


It seems to me that if the model or the initial conditions were bad, but marginal enough that machines with slightly different processors and libraries behaved differently, one would expect to get floating point exceptions, NaNs, and so forth. BUT NOT SEGMENTATION VIOLATIONS.
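
For illustration, and emphatically not CPDN code: a minimal C sketch of the distinction being drawn here. Numerically bad input normally shows up as NaN or infinity together with a raised floating-point exception flag, and the process carries on running; no SIGSEGV is involved. (Assumes a build without optimisation, e.g. gcc -O0 fpe.c -lm.)

/* Illustration only: bad values give NaN/Inf and raise FP exception flags;
 * the process keeps running and, by default, no signal is delivered.      */
#include <stdio.h>
#include <math.h>
#include <fenv.h>

int main(void)
{
    volatile double minus_one = -1.0;   /* volatile so nothing is folded away */
    volatile double zero = 0.0;

    feclearexcept(FE_ALL_EXCEPT);

    double theta = sqrt(minus_one);     /* invalid operation -> NaN */
    double ratio = 1.0 / zero;          /* division by zero  -> inf */

    printf("theta = %f, ratio = %f\n", theta, ratio);
    if (fetestexcept(FE_INVALID))   printf("FE_INVALID raised\n");
    if (fetestexcept(FE_DIVBYZERO)) printf("FE_DIVBYZERO raised\n");

    return 0;                           /* still alive: no segmentation violation */
}

Whether such exceptions are trapped or merely recorded depends on how the model binary was built, but either way it is a different failure mode from a segmentation violation.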
ID: 64978
Nigel Garvey
Joined: 5 May 10
Posts: 69
Credit: 1,169,103
RAC: 2,258
Message 64979 - Posted: 17 Jan 2022, 20:32:50 UTC - in response to Message 64977.  

Dave Jackson wrote:
It will be interesting to see if the batch which uses restarts from the successes of this batch has the lower failure rate the project are hoping for. I understand the theory that the initial conditions for the failing batches are thought to be a bit whacky but all too often I have seen a difference between theory and practice with some of these.

My Mac looks set to complete the seventh of the seven tasks it's received from the current HadCM3 batch tomorrow. Five of the work units had previously failed on other computers and three of them on two others. The eight previous computers all have a history of failing HadCM3s. The work units in this batch only get three chances each, so if their third chances go to "bad" computers, presumably they'll be "weeded out" as bad units. This and the 11.5-month deadlines make me wonder what useful science is actually being done here. :\
NG
ID: 64979
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,482,949
RAC: 4,328
Message 64980 - Posted: 18 Jan 2022, 4:52:21 UTC

The one computer that I tried to run these on failed all 8 tasks with segmentation violations. Looking at those work units, a couple had all 3 tasks fail with segmentation violations, a few had all tasks fail with various errors, and a couple had one task progress and produce trickles, including on another Linux PC. Quite odd.
ID: 64980
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,498,761
RAC: 5,627
Message 64981 - Posted: 18 Jan 2022, 7:55:35 UTC - in response to Message 64980.  

As you say, George, "quite odd." I've just looked at the work units with most of my failures. Two went on to produce trickles but still failed eventually, one on a Mac and one on another Linux box. One, I have found, went on to complete on another Linux box. I have just closed down the links - I should really have checked whether the ones that went on to produce trickles or complete on Linux were AMD architecture like my own or Intel. Might get around to that later.
ID: 64981
AndreyOR
Joined: 12 Apr 21
Posts: 247
Credit: 11,831,258
RAC: 20,177
Message 64991 - Posted: 23 Jan 2022, 3:06:17 UTC

Just finished my last 2 HadCM3s; both failed. Altogether, 2 out of 22. Some were on WSL2 Ubuntu 20.04, some on Hyper-V Ubuntu 20.04, both setups on the same Ryzen 9 computer.
ID: 64991
Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 64992 - Posted: 23 Jan 2022, 8:24:54 UTC - in response to Message 64980.  

The one computer that I tried to run these on failed all 8 tasks with segmentation violations. Looking at those work units, a couple had all 3 tasks fail with segmentation violations, a few had all tasks fail with various errors, and a couple had one task progress and produce trickles, including on another Linux PC. Quite odd.

Odd indeed. I'm clueless: SIGSEGV?? And some work, at least a bit, on Macs?
I have no further useful input.
Hope someone somewhere can figure this problem out.
ID: 64992
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1056
Credit: 16,520,115
RAC: 1,176
Message 64993 - Posted: 23 Jan 2022, 13:26:47 UTC - in response to Message 64992.  

Odd indeed. I'm clueless: SIGSEGV?? And some work, at least a bit, on Macs?
I have no further useful input.
Hope someone somewhere can figure this problem out.


Wikipedia has this to say (long):

https://en.wikipedia.org/wiki/Segmentation_fault

Of course, most of the above applies mainly to C and C++ programs that allocate RAM dynamically and use pointers. Since CPDN programs are mostly FORTRAN, they do not use pointers, so the easiest way to get these faults is to let an array subscript run off the end of the array, as in the sketch below.
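
Purely by way of illustration (a minimal C sketch, not taken from any Met Office source), this is the runaway-subscript failure mode just described. Strictly speaking the out-of-bounds write is undefined behaviour, but compiled without optimisation it reliably marches into memory the process has never mapped, and the kernel answers with SIGSEGV:

/* Illustration only: an array subscript running far past the end of the
 * array, the FORTRAN-style bug described above.  Once the writes leave
 * the process's mapped memory the kernel delivers SIGSEGV.              */
#include <stdio.h>

int main(void)
{
    static double theta[100];            /* the "valid" part of the array    */
    long i;

    for (i = 0; i < 100000000L; i++) {   /* subscript allowed to run away    */
        theta[i] = 0.0;                  /* eventually hits an unmapped page */
    }

    printf("%f\n", theta[0]);            /* normally never reached */
    return 0;
}

Fortran compilers can catch this sort of thing with run-time bounds checking (gfortran's -fcheck=bounds, for example), but that is generally considered too slow for production climate runs.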
ID: 64993
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,498,761
RAC: 5,627
Message 64994 - Posted: 23 Jan 2022, 18:18:46 UTC

Hopper almost empty and so far nothing seen preparing to be poured in.
ID: 64994
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 64995 - Posted: 24 Jan 2022, 0:44:36 UTC

All gone.
ID: 64995
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,498,761
RAC: 5,627
Message 64996 - Posted: 24 Jan 2022, 13:55:42 UTC

Two OpenIFS tasks are currently running in testing, but there is no discussion to suggest they are getting near ready to launch on the main site.
ID: 64996
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1056
Credit: 16,520,115
RAC: 1,176
Message 65019 - Posted: 29 Jan 2022, 16:31:09 UTC - in response to Message 64970.  

My machine has now run out of ClimatePrediction work units. It has also run out of Rosetta and Universe work units, so I am running only WCG work units, and at most 5 of them. This has greatly improved my cache hit ratio. I infer that the N216 work units are the ones that gobble up the processor cache(s). This is not a surprise, but it is gratifying to be able to see why; a small illustration follows the process listing below.

# perf stat -aB -e cache-references,cache-misses

 Performance counter stats for 'system wide':

     7,549,164,603      cache-references                                            
     1,883,988,132      cache-misses              #   24.956 % of all cache refs    

      65.847219356 seconds time elapsed


# ps -fu boinc
UID          PID    PPID  C STIME TTY          TIME CMD
boinc      19484       1  0 Jan23 ?        00:08:17 /usr/bin/boinc      [this is the boinc client]
boinc     509317   19484 99 05:37 ?        05:40:44 ../../projects/www.worldcommunitygrid.org/wcgrid_arp1_wrf_7.32_x86_64-pc-linux-gnu
boinc     525303   19484 98 10:14 ?        01:05:44 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_64-pc-linux-gnu 
boinc     526551   19484 99 10:38 ?        00:42:01 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_64-pc-linux-gnu 
boinc     527966   19484 99 11:01 ?        00:19:20 ../../projects/www.worldcommunitygrid.org/wcgrid_opn1_autodock_7.21_x86_64-pc-linux-gnu 
boinc     528648   19484 99 11:13 ?        00:06:50 ../../projects/www.worldcommunitygrid.org/wcgrid_mcm1_map_7.61_x86_64-pc-linux-gnu -Sett
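
Purely as an illustration of that inference (nothing below comes from the N216 code; the 256 MB / 256 KB sizes and the 64-byte cache line are just assumptions about a typical desktop CPU), a small C program whose working set either dwarfs or fits inside the last-level cache shows the same effect under the perf command above:

/* Illustration only: the same number of memory touches, but one working set
 * is far larger than a typical L3 cache and the other fits easily inside it.
 * Compare:  perf stat -e cache-references,cache-misses ./a.out big
 *           perf stat -e cache-references,cache-misses ./a.out small       */
#include <stdlib.h>
#include <string.h>

#define BIG    (256L * 1024 * 1024)   /* 256 MB: bigger than most L3 caches  */
#define SMALL  (256L * 1024)          /* 256 KB: fits comfortably in cache   */
#define STRIDE (64L * 8191)           /* large, page-crossing jumps so the   */
                                      /* hardware prefetchers cannot hide it */
int main(int argc, char **argv)
{
    long size = (argc > 1 && strcmp(argv[1], "small") == 0) ? SMALL : BIG;
    volatile char *buf = calloc(size, 1);
    long touches = 100L * 1000 * 1000;
    long i, pos = 0;

    if (buf == NULL)
        return 1;

    for (i = 0; i < touches; i++) {
        buf[pos] += 1;                /* touch one byte per visited line   */
        pos = (pos + STRIDE) % size;  /* wander over the whole working set */
    }
    return 0;
}

The absolute numbers vary from machine to machine, but the "big" run should show a much higher cache-miss percentage than the "small" one, which is the pattern being attributed to the N216 tasks here.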

ID: 65019
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1056
Credit: 16,520,115
RAC: 1,176
Message 65020 - Posted: 30 Jan 2022, 7:24:46 UTC - in response to Message 64994.  
Last modified: 30 Jan 2022, 7:27:44 UTC

Hopper almost empty and so far nothing seen preparing to be poured in.


I got an N216 and a hadcm3s work unit (both re-runs) recently and they are both running fine. I am kind of amazed at the hadcm3s one, because the 16 or so of those I ran recently all crashed with a segmentation fault after about three seconds. This one has run for over 10 hours and delivered two trickles. The two previous attempts at this work unit errored out for reasons I could not understand (not missing libraries, not segmentation violations); they were on apple-Darwin machines.

Task 22191699
Name 	hadcm3s_1k9d_200012_168_926_012129726_2
Workunit 	12129726
Created 	29 Jan 2022, 20:46:55 UTC
Sent 	29 Jan 2022, 20:48:05 UTC
Report deadline 	12 Jan 2023, 2:08:05 UTC
Received 	---
Server state 	In progress
Outcome 	---
Client state 	New
Exit status 	0 (0x00000000)
Computer ID 	1511241

ID: 65020
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,498,761
RAC: 5,627
Message 65021 - Posted: 31 Jan 2022, 12:13:05 UTC

Just downloading 4 N144 tasks from testing. I also ran some more OpenIFS tasks last week. As usual, I have no idea if and when these will translate into main site work.
ID: 65021
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1056
Credit: 16,520,115
RAC: 1,176
Message 65022 - Posted: 31 Jan 2022, 13:56:35 UTC - in response to Message 65021.  

I also ran some more OpenIFS tasks last week. As usual, I have no idea if and when these will translate into main site work.


Do those openIFS tasks work, or do they crash?
How much RAM do they currently take?
Any idea how much processor cache they require?
ID: 65022
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4342
Credit: 16,498,761
RAC: 5,627
Message 65023 - Posted: 31 Jan 2022, 14:39:11 UTC - in response to Message 65022.  

I also ran some more OpenIFS tasks last week. As usual, I have no idea if and when these will translate into main site work.


Do those openIFS tasks work, or do they crash?
How much RAM do they currently take?
Any idea how much processor cache they require?

They don't crash. The last batch I checked were taking 12GB of RAM each, and uploads were about 550MB. I haven't tried to check on CPU cache, but it hasn't been raised as an issue by other testers, so I suspect they need less than the N216 tasks. Some batches have had final uploads of over 1GB, so I have had them uploading while I sleep if it's a day when I am doing any Zoom calls. Obviously not an issue for those with real broadband as opposed to bored band.
ID: 65023
Jim1348
Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 65024 - Posted: 31 Jan 2022, 16:12:04 UTC - in response to Message 65023.  

They don't crash. The last batch I checked were taking 12GB of RAM each, and uploads were about 550MB. I haven't tried to check on CPU cache, but it hasn't been raised as an issue by other testers, so I suspect they need less than the N216 tasks. Some batches have had final uploads of over 1GB, so I have had them uploading while I sleep if it's a day when I am doing any Zoom calls. Obviously not an issue for those with real broadband as opposed to bored band.

That is good information. The memory and bandwidth requirements are quite large, but a number of us could do a few at a time if that is what it takes.
Of course, that may not be enough to do them much good, but that is another question.
ID: 65024