New work Discussion

Author	Message
Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4346 Credit: 16,541,921 RAC: 6,087	Message 64917 - Posted: 7 Jan 2022, 10:53:51 UTC - in response to Message 64916. "One failed with this; I have no idea how this happened." Yes, looking at failed tasks I have seen a number with the process creation failure. Never experienced it myself. I suspect this happens on machines where the user has messed around with them and it is a permissions issue, but in the absence of a user with that error posting in the forums.... ID: 64917 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1061 Credit: 16,546,621 RAC: 2,321	Message 64918 - Posted: 7 Jan 2022, 13:21:14 UTC - in response to Message 64917. "One failed with this; I have no idea how this happened. Yes, looking at failed tasks I have seen a number with the process creation failure. Never experienced it myself. I suspect this happens on machines where the user has messed around with them and it is a permissions issue, but in the absence of a user with that error posting in the forums...." Yes: it seems to me the best way to get that error would be to remove the file after the task was downloaded (so the client would put it in the ready-to-start list) but before it was actually started. I guess it would suffice to change it to a non-executable permission. ID: 64918 ·
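A minimal sketch of how one might check for the two conditions suggested above (file removed, or execute permission stripped). It makes no assumptions about BOINC's internals: the helper `check_executable` and the file name `model_app` are hypothetical, and the check looks only at the file's mode bits, not at how the client actually launches a task.

```python
import os
import stat
import tempfile


def check_executable(path):
    """Classify a path as 'missing', 'not executable', or 'ok'.

    Only the permission bits are inspected; this is an illustration,
    not what the BOINC client itself does.
    """
    if not os.path.isfile(path):
        return "missing"
    mode = os.stat(path).st_mode
    if not mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        return "not executable"
    return "ok"


# Demonstrate on a throwaway file standing in for a downloaded application.
with tempfile.TemporaryDirectory() as tmp:
    app = os.path.join(tmp, "model_app")  # hypothetical file name
    with open(app, "w") as handle:
        handle.write("#!/bin/sh\nexit 0\n")

    os.chmod(app, 0o644)            # execute bits stripped
    before = check_executable(app)

    os.chmod(app, 0o755)            # execute bits restored
    after = check_executable(app)

    os.remove(app)                  # file deleted after "download"
    gone = check_executable(app)

print(before, after, gone)
```

Running the same kind of check against the real application files in a project directory would show whether either condition applies on a host that reports the process creation failure.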

AndreyOR Send message Joined: 12 Apr 21 Posts: 247 Credit: 12,035,877 RAC: 23,095	Message 64919 - Posted: 7 Jan 2022, 14:28:09 UTC For me, HadCM3s are crashing on i7-4790 WSL2 Ubuntu 20.04 with NAMELIST errors but HadAM4s are running fine. So I tried HadCM3s on Hyper-V Ubuntu 20.04 on Ryzen 5900x and so far they're working fine. ID: 64919 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4346 Credit: 16,541,921 RAC: 6,087	Message 64920 - Posted: 7 Jan 2022, 14:58:29 UTC My first one completed and uploaded successfully. Just had one with segmentation violation, but the previous computer it failed on was missing libs. One that failed with segmentation violation 40 seconds in on the first computer has made it to 18 minutes, so I suspect it is OK. But it tells me that it is not just the computer at fault, as mine is running most of these tasks fine. ID: 64920 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4346 Credit: 16,541,921 RAC: 6,087	Message 64922 - Posted: 7 Jan 2022, 17:17:33 UTC From Sarah, in response to my observations: "Yes, I think it is more likely that these are from perturbed physics runs, so there are some parameter combinations/restarts that are not good (hence seg fault) but others are OK. We don't know which are the duff ones without running through this batch, but would effectively filter them out in any continuation batches (hopefully!)" ID: 64922 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1061 Credit: 16,546,621 RAC: 2,321	Message 64923 - Posted: 7 Jan 2022, 20:51:55 UTC - in response to Message 64922. Last modified: 7 Jan 2022, 20:52:30 UTC "I think it is more likely that these are from perturbed physics runs, so there are some parameter combinations/restarts that are not good (hence seg fault)" I think mine all quit in a very few seconds with a segmentation fault. I cannot imagine they got any significant computing done, surely not enough that they crashed from bad parameters -- though I could be wrong, I suppose. Computer ID 1511241 Run time 36 sec CPU time 2 sec <core_client_version>7.16.11</core_client_version> <![CDATA[ <message> process exited with code 22 (0x16, -234)</message> <stderr_txt> SIGSEGV: segmentation violation ID: 64923 ·
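A small arithmetic sketch of the three numbers in the `process exited with code 22 (0x16, -234)` line. It only shows that they are consistent with one exit status printed as decimal, as hexadecimal, and with all bits above the low byte set; whether the client formats the message exactly this way is an assumption, not something the log itself confirms.

```python
exit_status = 22

# Same value in hexadecimal.
as_hex = hex(exit_status)

# Setting every bit above the low byte gives the negative form:
# (-1 << 8) is -256, and OR-ing in 22 yields -256 + 22.
as_negative = (-1 << 8) | exit_status

print(exit_status, as_hex, as_negative)
```

Read this way, the message carries a single piece of information, the exit status 22, rather than three separate error codes.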

Nigel Garvey Send message Joined: 5 May 10 Posts: 69 Credit: 1,169,103 RAC: 2,258	Message 64926 - Posted: 8 Jan 2022, 11:06:13 UTC Last modified: 8 Jan 2022, 11:08:28 UTC FWIW, my iMac's just completed and reported two of the HadCM3s, which it received three days ago. It picked up two more yesterday, both of which had previously failed on other machines: one on another Mac after just one trickle; the other on two Linux systems almost instantly. As I write, both tasks have been running on my machine for 26 hours and have returned four and three trickles respectively. I've no idea why my iMac should be more successful with these than other machines. One thought is that when I joined this project nearly twelve years ago, I followed a tip that was linked to in the help pages at the time for quadrupling a Mac's shared memory allocation. But I don't know if it's relevant. NG ID: 64926 ·

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,307,352 RAC: 11,277	Message 64927 - Posted: 8 Jan 2022, 12:21:33 UTC Two more data points, both from the same machine. task 22185585 task 22185426 CM3 short tasks, failed with SIGSEGV and 'too many model crashes'. The first had failed twice before, on machines with missing libraries: the second also failed with 'too many model crashes', but without the accompanying SIGSEGV reports. My machine is Linux Mint 20.2, Intel CPU, 16GB memory. I updated the OS with the latest patches (including a kernel update) before running these tasks: everything else started fine, and two AM4 tasks are now running just fine (as they usually do). Machine is available for any further analysis that may be wanted. ID: 64927 ·

AgentConDier Send message Joined: 28 Oct 17 Posts: 1 Credit: 1,390,220 RAC: 0	Message 64929 - Posted: 8 Jan 2022, 22:18:40 UTC Recently moved to Linux after a PC upgrade and finished setting up BOINC earlier today. So far, all 5 of the HadCM3 tasks I got have failed via SIGSEGV after 37s runtime / 2s CPU time in the same fashion as Jean-David Beyer's. Meanwhile, 4 HadAM4 WUs are running fine. The tasks in question are: 22185746, 22184615, 22183973, 22184912, 22181490. I'm using an x64 install of Debian-based MX Linux on a Ryzen 5600X with 32 GB of RAM. Latest updates and 32-bit libraries are installed. ID: 64929 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4346 Credit: 16,541,921 RAC: 6,087	Message 64932 - Posted: 9 Jan 2022, 7:19:28 UTC Another three completed successfully here. ID: 64932 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1061 Credit: 16,546,621 RAC: 2,321	Message 64934 - Posted: 9 Jan 2022, 8:43:12 UTC - in response to Message 64904. "Ubuntu 21.10 and BOINC 7.19.0 (The odd number after 7. indicates a pre-release version I compiled from source.)" CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]; Number of processors: 16; Operating System: Linux Red Hat Enterprise Linux 8.5 (Ootpa) [4.18.0-348.7.1.el8_5.x86_64|libc 2.28 (GNU libc)]; BOINC version: 7.16.11; Memory: 62.4 GB; Cache: 16896 KB. 7.16.11 is the latest version for my Linux distribution. My machine is having no trouble with hadam4h work units and has completed HadSM4 at N144 resolution v8.02 i686-pc-linux-gnu and HadCM3 short v8.36 i686-pc-linux-gnu units in the past. ID: 64934 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4346 Credit: 16,541,921 RAC: 6,087	Message 64935 - Posted: 9 Jan 2022, 10:32:48 UTC I am keeping an eye on this but suspect we won't find anything significant about the machines that work and those that don't with these, and it may well just be, as Sarah has said, the physics of the ones that fail and nothing to do with the machines. ID: 64935 ·

Bryn Mawr Send message Joined: 28 Jul 19 Posts: 147 Credit: 12,814,088 RAC: 261,385	Message 64936 - Posted: 9 Jan 2022, 12:10:19 UTC Last modified: 9 Jan 2022, 12:13:27 UTC An oddity: after 17 consecutive fails with SIGSEGV errors I now have a CM3 task running fine and producing trickles. Task https://www.cpdn.org/result.php?resultid=22185636 ID: 64936 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1061 Credit: 16,546,621 RAC: 2,321	Message 64937 - Posted: 9 Jan 2022, 15:38:58 UTC - in response to Message 64935. "I am keeping an eye on this but suspect we won't find anything significant about the machines that work and those that don't with these, and it may well just be, as Sarah has said, the physics of the ones that fail and nothing to do with the machines." 1.) Just how much computing is actually accomplished in the first two seconds of a work unit? Does it even do more than initialize things? 2.) No matter what, nothing justifies a segmentation violation, even if a computation does something that violates physical reality constraints or does something such as dividing by zero. An error exit, yes, but a segmentation violation, no. Even bad programs should not do this. The best way to get a segmentation violation is to de-reference a pointer to which no value has been assigned, or to use a subscript into an array that is off the end of the array. These are both indications of a defective program. ID: 64937 ·
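A minimal sketch of the "error exit, yes, but a segmentation violation, no" point: a subscript is checked before use, and a bad one is turned into a controlled failure with a non-zero exit code. This is only an illustration in Python, not the model's Fortran code; the names `checked_get`, `run` and `levels` are hypothetical. Python would not crash on a bad index anyway, but it silently accepts negative indices, so the explicit range check still matters.

```python
def checked_get(values, index):
    """Return values[index], refusing any index outside 0..len-1."""
    if not 0 <= index < len(values):
        raise IndexError(f"index {index} outside 0..{len(values) - 1}")
    return values[index]


def run(values, index):
    """Return (exit_code, message): 0 on success, 1 on a rejected index."""
    try:
        return 0, str(checked_get(values, index))
    except IndexError as err:
        return 1, f"error: {err}"


levels = [10.0, 20.0, 30.0]   # stand-in for a model array

ok = run(levels, 1)           # valid subscript
off_end = run(levels, 3)      # one past the end of the array
negative = run(levels, -1)    # would silently wrap without the check

print(ok, off_end, negative)
```

In a compiled program without such a check, the last two cases would read whatever happens to lie outside the array, which is the kind of defect described above.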

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1061 Credit: 16,546,621 RAC: 2,321	Message 64938 - Posted: 9 Jan 2022, 15:45:05 UTC - in response to Message 64936. "An oddity: after 17 consecutive fails with SIGSEGV errors I now have a CM3 task running fine and producing trickles." I think I have received six of the suspect work units. However many, all have failed. I have not received any more of them. I have received N216 work units and they all work, or have at least one day of computing accomplished. One has 5 1/2 days accomplished so far. They take my machine about 8 days to run one, though some run in about 6 days. ID: 64938 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4346 Credit: 16,541,921 RAC: 6,087	Message 64939 - Posted: 9 Jan 2022, 16:35:59 UTC Last modified: 9 Jan 2022, 16:39:36 UTC Six fails today with segmentation violations and 5 successful completions. "The best way to get a segmentation violation is to de-reference a pointer to which no value has been assigned, or to use a subscript into an array that is off the end of the array. These are both indications of a defective program." As has been written elsewhere on these fora (I know it isn't the correct plural but it should be!), the actual programs are from the Met Office and the CPDN license to run them doesn't allow taking them apart and rewriting bits. Rather, they are used to crunch data that is put into them. It may be that some initial values put into the program produce one of these situations? Not having access to the un-compiled code, and never having used Fortran, which it is written in, I clearly have no way of telling. Anyway, the next batch will be based on the restart files from successful tasks, which those at the project believe will produce a much higher percentage of tasks that work. Edit: only 12 successful completions showing for the batch so far, when I have had seven against my six failures, is a high enough failure rate to merit further investigation. (There were no failures in the ones that went to the testing site, but there may just not have been enough of them to be statistically significant.) ID: 64939 ·
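A quick worked figure for the failure rate mentioned in the edit above, reading "seven against my six failures" as seven successes and six failures on one host. That reading is an assumption, and the sample is far too small to say anything about the batch as a whole.

```python
successes = 7   # assumed: successful completions on this host
failures = 6    # assumed: segmentation-violation failures on this host

total = successes + failures
failure_rate = failures / total

print(f"{failures} of {total} tasks failed: {failure_rate:.1%}")
```

On those numbers a little under half of the tasks on this host failed.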

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1061 Credit: 16,546,621 RAC: 2,321	Message 64940 - Posted: 9 Jan 2022, 19:25:30 UTC - in response to Message 64939. "As has been written elsewhere on these fora (I know it isn't the correct plural but it should be!), the actual programs are from the Met Office and the CPDN license to run them doesn't allow taking them apart and rewriting bits. Rather, they are used to crunch data that is put into them. It may be that some initial values put into the program produce one of these situations? Not having access to the un-compiled code, and never having used Fortran, which it is written in, I clearly have no way of telling. Anyway, the next batch will be based on the restart files from successful tasks, which those at the project believe will produce a much higher percentage of tasks that work." I did not mean to imply (if I did) that the problems I attribute to bad code in the applications were the fault of the ClimatePrediction project and should be fixed by them. Back when I was working on an optimizer for the Bell Labs C-compilation system, it often turned out that optimized code gave radically different results, including segmentation faults, from unoptimized code. Naturally, those who wrote the normal compilation suite blamed these differences on the optimizer program. In every case, we could show that bad pointers or array subscripts were the cause, and that the results differed because the un-optimized code failed differently. For example, the optimizer overloaded data registers; i.e., if a register's content was not used afterwards, we could put a hot variable in it. Now, later the program used the contents of that register without putting anything in it. So the optimized version had one value left in it, and the un-optimized version had another value in it. So of course the results were different.
It should be sufficient for an optimizer to give the same results for correct programs. It should not be required that it give the same results for incorrect programs. ID: 64940 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4346 Credit: 16,541,921 RAC: 6,087	Message 64941 - Posted: 9 Jan 2022, 20:01:06 UTC - in response to Message 64940. Thank you, that is helpful to my understanding as one who has only dabbled in programming, and that a long time ago (think Algol!). The fact that the code goes through different compilers depending on whether it is for Linux or Mac (and at one time Windows) probably explains why there have been batches in the past which produced these faults on only one platform out of the three these tasks went out on then. ID: 64941 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4346 Credit: 16,541,921 RAC: 6,087	Message 64942 - Posted: 9 Jan 2022, 21:47:41 UTC Another variable ruled out - one lot that failed were all when I had 8 threads running. I tried reducing this down to three when two new ones were starting just now and they both failed. ID: 64942 ·

Les Bayliss Volunteer moderator Send message Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 64943 - Posted: 9 Jan 2022, 22:23:09 UTC The problem with batch 926 is that there are some bad data sets in among some good data sets. They can either all be killed off, or they can all be left to run, which will quickly remove the bad ones. The second option is being used. ID: 64943 ·