climateprediction.net home page
New work Discussion

Alan K

Joined: 22 Feb 06
Posts: 485
Credit: 29,634,797
RAC: 3,427
Message 64944 - Posted: 9 Jan 2022, 23:33:26 UTC - in response to Message 64898.  

Two got - two failed.
ID: 64944
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,543,308
RAC: 2,242
Message 64945 - Posted: 10 Jan 2022, 6:13:22 UTC - in response to Message 64941.  

Thank you, that is helpful to my understanding as one who has only dabbled in programming and that a long time ago (Think Algol!) The fact that the code goes through different compilers depending on whether for Linux or Mac and at one time Windows, probably explains why there have been batches in the past which have produced these faults on only one platform out of the three these tasks went out on then.


For purely mathematical work, I loved the Illinois-Alcor Algol-60 compiler that ran on IBM 7090 machines. David Greis was one of the authors of that program.

Different operating systems, too. When I was working on optimizers, the designers of the regular C-compiler suite used one version of the UNIX kernel and we used a slightly different one. Actually the same source, but we turned on a special testing option that made the hardware give an interrupt just prior to the segmentation violation. Then we diddled the compilation system to leave the bottom page of RAM unused, and we set that page to no access. Thus any attempt to access that bottom page gave us an interrupt that we could analyze. One of the programs we compiled and tested was the UNIX kernel itself, which we would run and test. We optimized the kernel as well as everything else. And we found bugs in it. They just did not care and never fixed the bugs we pointed out to them.
ID: 64945
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1081
Credit: 7,016,003
RAC: 4,453
Message 64946 - Posted: 10 Jan 2022, 10:18:11 UTC - in response to Message 64945.  

...we found bugs in it. They just did not care and never fixed the bugs we pointed out to them.

So depressing and yet so common.
ID: 64946
Bryn Mawr

Joined: 28 Jul 19
Posts: 147
Credit: 12,814,088
RAC: 261,385
Message 64947 - Posted: 10 Jan 2022, 11:23:03 UTC - in response to Message 64943.  

The problem with batch 926 is that there are some bad data sets in among some good data sets.
They can either be all killed off, or they can all be left to run, which will quickly remove the bad ones.
The 2nd option is being used.


That’s fair enough, if they fail they fail very quickly.
ID: 64947
Nigel Garvey

Joined: 5 May 10
Posts: 69
Credit: 1,169,103
RAC: 2,258
Message 64951 - Posted: 10 Jan 2022, 19:52:50 UTC - in response to Message 64926.  
Last modified: 10 Jan 2022, 19:53:21 UTC

I wrote:
my iMac … picked up two more yesterday, both of which had previously failed on other machines: one on another Mac after just one trickle; the other on two Linux systems almost instantly.

Both of these later tasks have now been successfully completed and reported. Another, received this morning, which had previously failed immediately on both a Mac and a Linux system, is currently 13% done and has returned one trickle. So the failures aren't just due to bad work units.
WU 12129231
WU 12129961

WU 12127970
NG
ID: 64951
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 64952 - Posted: 10 Jan 2022, 20:35:10 UTC - in response to Message 64951.  

Both of these later tasks have now been successfully completed and reported. Another, received this morning, which had previously failed immediately on both a Mac and a Linux system, is currently 13% done and has returned one trickle. So the failures aren't just due to bad work units.


That has been my impression too. However, there does seem to be a large element of randomness about it.
ID: 64952
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,029,250
RAC: 24,057
Message 64953 - Posted: 10 Jan 2022, 21:18:46 UTC

Could someone please look at this failed task? https://www.cpdn.org/result.php?resultid=22185093 It failed for a different reason than most of the recent failures here. I would like to know what the log means and whether the run and CPU times make sense. It's a HadCM3 that ran for over 3 days before erroring out.
ID: 64953
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1081
Credit: 7,016,003
RAC: 4,453
Message 64954 - Posted: 10 Jan 2022, 21:25:28 UTC - in response to Message 64953.  

Could someone please look at this failed task? https://www.cpdn.org/result.php?resultid=22185093 It failed for a different reason than most of the recent failures here. I would like to know what the log means and whether the run and CPU times make sense. It's a HadCM3 that ran for over 3 days before erroring out.

Errors involving “invalid theta” indicate that the model’s physics has become unrealistic, so the model is stopped. For a computer that is not over-clocked such errors can be ignored.

The model is likely to fail in the same way on other PCs with the same architecture, but might succeed, for example, on a Mac with a different floating-point library. It’s all part of the ensemble method of modelling: the project knows that some models are on the edge.
ID: 64954
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,543,308
RAC: 2,242
Message 64955 - Posted: 11 Jan 2022, 4:01:30 UTC - in response to Message 64945.  

For purely mathematical work, I loved the Illinois-Alcor Algol-60 compiler that ran on IBM 7090 machines. David Greis was one of the authors of that program.


Sorry: I spelled his name wrong.

https://www.cs.cornell.edu/gries/
ID: 64955
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,543,308
RAC: 2,242
Message 64958 - Posted: 12 Jan 2022, 16:08:23 UTC - in response to Message 64947.  

That’s fair enough, if they fail they fail very quickly.


They sure do. I just got two more that failed after 36 seconds wall clock time, and just under 4 seconds cpu time.
Segmentation violations.

https://www.cpdn.org/result.php?resultid=22181507
https://www.cpdn.org/result.php?resultid=22182265
ID: 64958
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 64959 - Posted: 12 Jan 2022, 16:46:49 UTC

My last few to download have all failed. Two more downloading right now.
ID: 64959
Bryn Mawr

Joined: 28 Jul 19
Posts: 147
Credit: 12,814,088
RAC: 261,385
Message 64960 - Posted: 12 Jan 2022, 16:55:18 UTC - in response to Message 64958.  

That’s fair enough, if they fail they fail very quickly.


They sure do. I just got two more that failed after 36 seconds wall clock time, and just under 4 seconds cpu time.
Segmentation violations.

https://www.cpdn.org/result.php?resultid=22181507
https://www.cpdn.org/result.php?resultid=22182265


I’ve realised a problem with my logic - when they fail the hour delay cuts in before it can pull down the next task to try.

In my case there were 6 consecutive fails so cpdn lost out on over 6 hours processing (and about 50 hours lost overall) - happily that was taken up by another project but there are those who only run one project per machine.
ID: 64960
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,543,308
RAC: 2,242
Message 64961 - Posted: 12 Jan 2022, 19:47:34 UTC - in response to Message 64960.  

In my case there were 6 consecutive fails so cpdn lost out on over 6 hours processing (and about 50 hours lost overall) - happily that was taken up by another project but there are those who only run one project per machine.


I was away for a couple of hours, and three more failed. All with segmentation error.
One of mine that failed also failed on another machine running Darwin: it does not seem to like machines running Darwin. But my Linux machine is 1511241
ID: 64961
Alan K

Joined: 22 Feb 06
Posts: 485
Credit: 29,634,797
RAC: 3,427
Message 64963 - Posted: 12 Jan 2022, 23:14:48 UTC - in response to Message 64959.  

The 5 I got over the weekend and the 2 today have all failed. My machine is Ubuntu 20.04.
ID: 64963
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 64964 - Posted: 12 Jan 2022, 23:48:30 UTC

On the other side of things, I started a new one today, but it's hadam4h, batch 895.
Phew. :)

Hmmm. That batch is almost a year old; the last attempt on mine was abandoned after a year, without any trickles.
ID: 64964
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,543,308
RAC: 2,242
Message 64965 - Posted: 13 Jan 2022, 3:17:48 UTC - in response to Message 64964.  

On the other side of things, I started a new one today, but it's hadam4h, batch 895.
Phew. :)

Hmmm. That batch is almost a year old; the last attempt on mine was abandoned after a year, without any trickles.


I get old ones sometimes too. Some failed several times before I got them, and for different reasons.

hadam4h_e0zp_207111_5_887_012043123_3
hadam4h_h1g7_201011_5_889_012045366_3
hadam4h_b0v6_201211_5_882_012036130_1
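
The task names above appear to pack several fields into one underscore-separated string. Here is a minimal Python sketch for splitting them. The field meanings (model, run identifier, start date, run length, batch number, workunit number, trailing BOINC result index) are assumptions inferred from the names and batch numbers in this thread, not taken from CPDN documentation, and `TaskName` and `parse_task_name` are hypothetical names.

```python
from typing import NamedTuple


class TaskName(NamedTuple):
    model: str         # e.g. "hadam4h"
    run_id: str        # assumed: short per-run identifier, e.g. "e0zp"
    start: str         # assumed: model start date as YYYYMM
    length: int        # assumed: run length field
    batch: int         # assumed: batch number, e.g. 887
    workunit: int      # assumed: workunit number
    result_index: int  # assumed: BOINC result suffix (copy number of this workunit)


def parse_task_name(name: str) -> TaskName:
    """Split a task name into its seven underscore-separated fields."""
    parts = name.split("_")
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {name!r}")
    model, run_id, start, length, batch, workunit, result_index = parts
    return TaskName(model, run_id, start, int(length), int(batch),
                    int(workunit), int(result_index))


if __name__ == "__main__":
    for name in (
        "hadam4h_e0zp_207111_5_887_012043123_3",
        "hadam4h_h1g7_201011_5_889_012045366_3",
        "hadam4h_b0v6_201211_5_882_012036130_1",
    ):
        print(parse_task_name(name))
```

If the trailing index counts from 0, as BOINC result names usually do, a suffix of 3 would mean the workunit had already been sent out three times before this copy.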
ID: 64965
Alan K

Joined: 22 Feb 06
Posts: 485
Credit: 29,634,797
RAC: 3,427
Message 64966 - Posted: 13 Jan 2022, 23:23:03 UTC - in response to Message 64964.  
Last modified: 13 Jan 2022, 23:28:22 UTC

Same here. Suspended pending 926 tasks and started the 895 one just to check if it would run OK. Looking good at the moment. There seem to be a lot of computers that are "idle" for several months. Extra grist to the mill for a shortening of the allowed/estimated completion time before reissuing the task.
ID: 64966
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,543,308
RAC: 2,242
Message 64967 - Posted: 14 Jan 2022, 0:42:55 UTC - in response to Message 64966.  

Suspended pending 926 tasks and started the 895 one just to check if it would run OK.


I take them as they come. My N216 tasks mostly run OK, but all 13 or so 926 tasks have failed with segmentation faults after 2 to 4 seconds of processor time. These have all been in January 2022. The same model has worked OK in the past. I have one more downloaded, but there are 4 CPDN tasks in the queue before that one. I can run up to 4 CPDN tasks at a time on this machine. I have an 8-core machine that can actually run 16 tasks at a time, but it does not make sense for me to run more than 8 Boinc tasks at a time.
ID: 64967
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 64968 - Posted: 14 Jan 2022, 17:05:35 UTC

A total of 107 of #926 are now showing as completed. I will have to wait till some N216 tasks complete before I can try any more.
ID: 64968
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,543,308
RAC: 2,242
Message 64970 - Posted: 15 Jan 2022, 7:17:04 UTC

Right now my machine is idle except for 8 Boinc tasks. Of these, three are N216 CPDN tasks and five are WCG models.
Of those, two are OPN1, two are ARP1, and one is MCM1.

The essentials of my machine are:
CPU type 	GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16

Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.5 (Ootpa) [4.18.0-348.7.1.el8_5.x86_64|libc 2.28 (GNU libc)]
BOINC version 	7.16.11
Memory 	62.4 GB
Cache 	16896 KB  <---<<<

top - 02:13:32 up 16 days, 12:00,  1 user,  load average: 8.08, 8.25, 8.31
Tasks: 463 total,   9 running, 453 sleeping,   1 stopped,   0 zombie
%Cpu(s):  0.3 us,  2.8 sy, 47.0 ni, 49.7 id,  0.0 wa,  0.1 hi,  0.1 si,  0.0 st
MiB Mem :  63902.2 total,    867.4 free,   9746.1 used,  53288.6 buff/cache
MiB Swap:  15992.0 total,  15043.0 free,    949.0 used.  53304.0 avail Mem 


Note: with all the RAM I have, it does essentially no paging.
Now let us look at the cache hit ratio. Almost half of the cache references (about 47%) are hits on this processor.
# perf stat -aB -e cache-references,cache-misses

 Performance counter stats for 'system wide':

    37,810,375,576      cache-references                                            
    20,135,725,239      cache-misses              #   53.254 % of all cache refs    

      60.935538193 seconds time elapsed
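
The "almost half" figure follows directly from the two counters above: perf reports the miss percentage, and the hit ratio is its complement. A small Python sketch of that arithmetic, using the counts from this run (the function name `cache_hit_ratio` is just illustrative):

```python
def cache_hit_ratio(references: int, misses: int) -> float:
    """Return the fraction of cache references that were hits."""
    if references <= 0:
        raise ValueError("references must be positive")
    if not 0 <= misses <= references:
        raise ValueError("misses must be between 0 and references")
    return 1.0 - misses / references


if __name__ == "__main__":
    # Counts copied from the perf stat output above.
    refs = 37_810_375_576
    misses = 20_135_725_239
    print(f"miss ratio: {misses / refs:.3%}")
    print(f"hit ratio:  {cache_hit_ratio(refs, misses):.3%}")
```

The counters are system wide, so the ratio covers everything running on the machine during the measurement, not just the Boinc tasks.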


ID: 64970