climateprediction.net home page
New work Discussion

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 64917 - Posted: 7 Jan 2022, 10:53:51 UTC - in response to Message 64916.  

One failed with this; I have no idea how this happened.
Yes, looking through failed tasks I have seen a number with the process creation failure. I have never experienced it myself. I suspect it happens on machines where the user has messed around with them and that it is a permissions issue, but in the absence of a user with that error posting in the forums....
ID: 64917
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1109
Credit: 17,121,631
RAC: 5,430
Message 64918 - Posted: 7 Jan 2022, 13:21:14 UTC - in response to Message 64917.  

One failed with this; I have no idea how this happened.

Yes, looking through failed tasks I have seen a number with the process creation failure. I have never experienced it myself. I suspect it happens on machines where the user has messed around with them and that it is a permissions issue, but in the absence of a user with that error posting in the forums....


Yes: it seems to me the best way to get that error would be to remove the file after the task was downloaded (so the client would put it in the ready-to-start list) but before it was actually started.
I guess it would also suffice to change it to a non-executable permission.
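
To make that guess concrete, here is a minimal Python sketch of the two conditions I have in mind: the file is gone, or it is present but has lost its execute permission. This is only an illustration under my own assumptions. The function names and the `model_app` file name are hypothetical, and nothing here is taken from the BOINC client's actual code.

```python
import os
import tempfile


def can_launch(path):
    """Return (ok, reason) describing whether `path` looks launchable."""
    if not os.path.isfile(path):
        return False, "file is missing"
    if not os.access(path, os.X_OK):
        return False, "file is not executable"
    return True, "ok"


def demo():
    """Recreate both hypothesised failure modes on a throwaway file."""
    results = {}
    with tempfile.TemporaryDirectory() as workdir:
        app = os.path.join(workdir, "model_app")  # hypothetical file name
        with open(app, "w") as fh:
            fh.write("#!/bin/sh\nexit 0\n")

        os.chmod(app, 0o755)           # as downloaded: executable
        results["intact"] = can_launch(app)

        os.chmod(app, 0o644)           # execute bits stripped
        results["no_exec_bit"] = can_launch(app)

        os.remove(app)                 # file deleted before the task starts
        results["removed"] = can_launch(app)
    return results


if __name__ == "__main__":
    for case, outcome in demo().items():
        print(case, outcome)
```

Either of the last two cases would stop a program from being started at all, which is what a "process creation" failure suggests to me, though I cannot confirm that this is what happened on the affected machines.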
ID: 64918
AndreyOR
Joined: 12 Apr 21
Posts: 293
Credit: 14,184,080
RAC: 18,148
Message 64919 - Posted: 7 Jan 2022, 14:28:09 UTC

For me, HadCM3s are crashing on i7-4790 WSL2 Ubuntu 20.04 with NAMELIST errors but HadAM4s are running fine. So I tried HadCM3s on Hyper-V Ubuntu 20.04 on Ryzen 5900x and so far they're working fine.
ID: 64919
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 64920 - Posted: 7 Jan 2022, 14:58:29 UTC

My first one completed and uploaded successfully

Just had one with a segmentation violation, but the previous computer it failed on was missing libs. One that failed with a segmentation violation 40 seconds in on the first computer has made it to 18 minutes here, so I suspect it is OK. But it tells me that it is not just the computer at fault, as mine is running most of these tasks fine.
ID: 64920
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 64922 - Posted: 7 Jan 2022, 17:17:33 UTC

From Sarah in response to my observations.

yes I think that this is more likely that these are from perturbed physics runs so there are some parameter combinations/restarts that are not good (hence seg fault) but others are ok. We don't know which are the duff ones without running through this batch but would effectively filter out in any continuation batches (hopefully!)
ID: 64922
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1109
Credit: 17,121,631
RAC: 5,430
Message 64923 - Posted: 7 Jan 2022, 20:51:55 UTC - in response to Message 64922.  
Last modified: 7 Jan 2022, 20:52:30 UTC

I think that this is more likely that these are from perturbed physics runs so there are some parameter combinations/restarts that are not good (hence seg fault)


I think mine all quit in a very few seconds with a segmentation fault. I cannot imagine they got any significant computing done, surely not enough that they crashed from bad parameters -- though I could be wrong I suppose.

Computer ID 	1511241
Run time 	36 sec
CPU time 	2 sec

<core_client_version>7.16.11</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)</message>
<stderr_txt>
SIGSEGV: segmentation violation
 

ID: 64923
Nigel Garvey
Joined: 5 May 10
Posts: 69
Credit: 1,169,103
RAC: 2,258
Message 64926 - Posted: 8 Jan 2022, 11:06:13 UTC
Last modified: 8 Jan 2022, 11:08:28 UTC

FWIW, my iMac's just completed and reported two of the HadCM3s, which it received three days ago. It picked up two more yesterday, both of which had previously failed on other machines: one on another Mac after just one trickle; the other on two Linux systems almost instantly. As I write, both tasks have been running on my machine for 26 hours and have returned four and three trickles respectively.

I've no idea why my iMac should be more successful with these than other machines. One thought is that when I joined this project nearly twelve years ago, I followed a tip that was linked to in the help pages at the time for quadrupling a Mac's shared memory allocation. But I don't know if it's relevant.
NG
ID: 64926
Richard Haselgrove
Joined: 1 Jan 07
Posts: 1032
Credit: 36,242,218
RAC: 12,342
Message 64927 - Posted: 8 Jan 2022, 12:21:33 UTC

Two more data points, both from the same machine.

task 22185585
task 22185426

CM3 short tasks, failed with SIGSEGV and 'too many model crashes'. The first had failed twice before, on machines with missing libraries: the second also failed with 'too many model crashes', but without the accompanying SIGSEGV reports.

My machine is Linux Mint 20.2, Intel CPU, 16GB memory. I updated the OS with the latest patches (including a kernel update) before running these tasks: everything else started fine, and two AM4 tasks are now running just fine (as they usually do). Machine is available for any further analysis that may be wanted.
ID: 64927
AgentConDier
Joined: 28 Oct 17
Posts: 1
Credit: 1,390,220
RAC: 0
Message 64929 - Posted: 8 Jan 2022, 22:18:40 UTC

Recently moved to Linux after a PC upgrade and finished setting up Boinc earlier today.
So far, all 5 of the HadCM3 tasks I got have failed via SIGSEGV after 37s runtime / 2s cpu time in the same fashion as Jean-David Beyer's. Meanwhile, 4 HadAM4 WUs are running fine.

The tasks in question are:
22185746, 22184615, 22183973, 22184912, 22181490

I'm using an x64 install of Debian-based MX Linux on a Ryzen 5600X with 32G of RAM. Latest updates and 32bit libraries are installed.
ID: 64929
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 64932 - Posted: 9 Jan 2022, 7:19:28 UTC

Another three completed successfully here.
ID: 64932
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1109
Credit: 17,121,631
RAC: 5,430
Message 64934 - Posted: 9 Jan 2022, 8:43:12 UTC - in response to Message 64904.  

Ubuntu 21.10 and BOINC 7.19.0 (The odd number after 7. indicates a pre-release version I compiled from source.)


CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16

Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.5 (Ootpa) [4.18.0-348.7.1.el8_5.x86_64|libc 2.28 (GNU libc)]
BOINC version 	7.16.11
Memory 	62.4 GB
Cache 	16896 KB

7.16.11 is the latest version for my Linux distribution.

My machine is having no trouble with hadam4h work units and has completed
HadSM4 at N144 resolution v8.02 i686-pc-linux-gnu and
HadCM3 short v8.36 i686-pc-linux-gnu units in the past.
ID: 64934
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 64935 - Posted: 9 Jan 2022, 10:32:48 UTC

I am keeping an eye on this but suspect we won't find anything significant distinguishing the machines that work from those that don't with these; it may well just be, as Sarah has said, the physics of the ones that fail and nothing to do with the machines.
ID: 64935
Bryn Mawr
Joined: 28 Jul 19
Posts: 148
Credit: 12,830,559
RAC: 228
Message 64936 - Posted: 9 Jan 2022, 12:10:19 UTC
Last modified: 9 Jan 2022, 12:13:27 UTC

An oddity: after 17 consecutive fails with SIGSEGV errors, I now have a CM3 task running fine and producing trickles.

Task https://www.cpdn.org/result.php?resultid=22185636
ID: 64936
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1109
Credit: 17,121,631
RAC: 5,430
Message 64937 - Posted: 9 Jan 2022, 15:38:58 UTC - in response to Message 64935.  

I am keeping an eye on this but suspect we won't find anything significant distinguishing the machines that work from those that don't with these; it may well just be, as Sarah has said, the physics of the ones that fail and nothing to do with the machines.


1.) Just how much computing is actually accomplished in the first two seconds of a work-unit? Does it even do more than initialize things?

2.) No matter what, nothing justifies a segmentation violation, even if a computation does something that violates physical reality constraints or does something such as dividing by zero. An error exit, yes, but a segmentation violation, no. Even bad programs should not do this. The best way to get a segmentation violation is to de-reference a pointer to which no value has been assigned, or to use a subscript that is off the end of an array. These are both indications of a defective program.
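
To illustrate the distinction I am drawing between an error exit and a segmentation violation, here is a small Python sketch of how the two look from the parent process on a POSIX system. It is only a sketch under my own assumptions: the helper names are made up, the exit code 22 is borrowed from the log earlier in the thread purely as an example value, and the "crash" is simulated by the child sending itself SIGSEGV rather than by a real bad pointer.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Translate a subprocess return code into a human-readable outcome."""
    if returncode < 0:
        # On POSIX, a negative value means the child was killed by that signal.
        return "killed by " + signal.Signals(-returncode).name
    if returncode == 0:
        return "success"
    return "error exit with code %d" % returncode


def run_child(snippet):
    """Run a Python one-liner in a child process and return its return code."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return proc.returncode


# A program that notices a problem and exits deliberately.
CLEAN_FAILURE = "import sys; sys.exit(22)"
# A stand-in for a crash: the child delivers SIGSEGV to itself.
CRASH = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"

if __name__ == "__main__":
    print(describe_exit(run_child(CLEAN_FAILURE)))
    print(describe_exit(run_child(CRASH)))
```

The first child chose to stop; the second was stopped by the operating system. My point is that a well-behaved program facing bad input should end up in the first category.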
ID: 64937
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1109
Credit: 17,121,631
RAC: 5,430
Message 64938 - Posted: 9 Jan 2022, 15:45:05 UTC - in response to Message 64936.  

An oddity: after 17 consecutive fails with SIGSEGV errors, I now have a CM3 task running fine and producing trickles.


I think I have received six of the suspect work units. However many, all have failed. I have not received any more of them.
I have received N216 work units and they all work, or at least have one day of computing accomplished. One has 5 1/2 days accomplished so far. They take my machine about 8 days to run one, though some run in about 6 days.
ID: 64938
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 64939 - Posted: 9 Jan 2022, 16:35:59 UTC
Last modified: 9 Jan 2022, 16:39:36 UTC

Six fails today with segmentation violations and 5 successful completions.

The best way to get a segmentation violation is to de-reference a pointer to which no value has been assigned, or to use a subscript that is off the end of an array. These are both indications of a defective program.


As has been written elsewhere on these fora (I know it isn't the correct plural but it should be!), the actual programs are from the Met Office, and the CPDN license to run them doesn't allow taking them apart and rewriting bits. Rather, they are used to crunch data that is put into them. It may be that some initial values put into the program produce one of these situations? Not having access to the un-compiled code, and never having used Fortran, which it is written in, I clearly have no way of telling. Anyway, the next batch will be based on the restart files from successful tasks, which those at the project believe will produce a much higher percentage of tasks that work.

Edit: only 12 successful completions are showing for the batch so far, and I have had seven of those against my six failures; that is a high enough failure rate to merit further investigation. (There were no failures in the ones that went to the testing site, but there may just not have been enough of them to be statistically significant.)
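
As a rough back-of-envelope check on that last point, here is a short Python sketch of how likely it is to see no failures at all in a small test batch. It assumes each task fails independently at the rate I have seen on my own machine (six failures against seven successes), which may well not hold. The test batch sizes are hypothetical, since I don't know how many tasks went to the testing site.

```python
def prob_no_failures(failure_rate, n_tasks):
    """Chance that n independent tasks all succeed at a given failure rate."""
    return (1.0 - failure_rate) ** n_tasks


# Six failures against seven successes on one machine, as reported above.
observed_rate = 6 / (6 + 7)

if __name__ == "__main__":
    for n in (2, 5, 10):  # hypothetical test batch sizes
        print(n, round(prob_no_failures(observed_rate, n), 4))
```

Under those assumptions a clean run of two tasks would not be surprising at all, whereas a clean run of ten would be very unlikely, so the size of the test batch matters a good deal.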
ID: 64939
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1109
Credit: 17,121,631
RAC: 5,430
Message 64940 - Posted: 9 Jan 2022, 19:25:30 UTC - in response to Message 64939.  

As has been written elsewhere on these fora (I know it isn't the correct plural but it should be!), the actual programs are from the Met Office, and the CPDN license to run them doesn't allow taking them apart and rewriting bits. Rather, they are used to crunch data that is put into them. It may be that some initial values put into the program produce one of these situations? Not having access to the un-compiled code, and never having used Fortran, which it is written in, I clearly have no way of telling. Anyway, the next batch will be based on the restart files from successful tasks, which those at the project believe will produce a much higher percentage of tasks that work.


I did not mean to imply (if I did) that the problems I attribute to bad code of the applications were the fault of the ClimatePrediction project and should be fixed by them.

Back when I was working on an optimizer for the Bell Labs C-compilation system, it often turned out that optimized code gave radically different results, including segmentation faults, from unoptimized code. Naturally those who wrote the normal compilation suite blamed these differences on the optimizer program. In every case, we could show that bad pointers or array subscripts were the cause, and that the different results arose because the un-optimized code simply failed differently. For example, the optimizer overloaded data registers; i.e., if a register's content was not used afterwards, we could put a hot variable in it. Now, later the program used the contents of that register without putting anything in it. So the optimized version had one variable left in it, and the un-optimized version had another value in it. So of course the results were different. It should be sufficient for an optimizer to give the same results for correct programs. It should not be required that it give the same results for incorrect programs.
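
The register story can be mimicked with a toy model in Python. This is not how the Bell Labs optimizer worked internally. It is a made-up two-instruction "machine" of my own, with hypothetical register names and leftover values, meant only to show why a program that reads a register it never wrote gives different answers in the two builds while a correct program does not.

```python
def run(program, leftover_registers):
    """Interpret a tiny register program and return the value left in 'out'."""
    regs = dict(leftover_registers)  # whatever earlier code left behind
    for instr in program:
        op = instr[0]
        if op == "set":              # ("set", dest, constant)
            _, dest, value = instr
            regs[dest] = value
        elif op == "addi":           # ("addi", dest, src, constant)
            _, dest, src, const = instr
            regs[dest] = regs[src] + const
        else:
            raise ValueError("unknown instruction: %r" % (op,))
    return regs["out"]


# Defective program: reads r1 without ever writing it.
BUGGY = [("addi", "out", "r1", 1)]
# Correct program: initialises r1 before using it.
CORRECT = [("set", "r1", 5), ("addi", "out", "r1", 1)]

# Hypothetical leftovers: the plain build happens to leave 0 in r1,
# while the optimized build has parked a hot variable (99) there.
LEFTOVER_UNOPTIMIZED = {"r1": 0}
LEFTOVER_OPTIMIZED = {"r1": 99}

if __name__ == "__main__":
    for name, prog in (("buggy", BUGGY), ("correct", CORRECT)):
        print(name,
              run(prog, LEFTOVER_UNOPTIMIZED),
              run(prog, LEFTOVER_OPTIMIZED))
```

The correct program gives the same answer whatever was left in the register; the defective one depends entirely on the leftovers, which is all the optimizer changed.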
ID: 64940
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 64941 - Posted: 9 Jan 2022, 20:01:06 UTC - in response to Message 64940.  

Thank you, that is helpful to my understanding as one who has only dabbled in programming, and that a long time ago (think Algol!). The fact that the code goes through different compilers depending on whether it is for Linux or Mac (and, at one time, Windows) probably explains why there have been batches in the past which produced these faults on only one platform out of the three these tasks went out on then.
ID: 64941
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 64942 - Posted: 9 Jan 2022, 21:47:41 UTC

Another variable ruled out: one lot that failed all did so when I had 8 threads running. I tried reducing this to three when two new ones were starting just now, and they both failed.
ID: 64942
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 64943 - Posted: 9 Jan 2022, 22:23:09 UTC

The problem with batch 926 is that there are some bad data sets in among some good data sets.
They can either be all killed off, or they can all be left to run, which will quickly remove the bad ones.
The 2nd option is being used.
ID: 64943

©2024 cpdn.org