New work discussion

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 68971 - Posted: 25 Jun 2023, 10:02:12 UTC - in response to Message 68970. Last modified: 25 Jun 2023, 10:15:56 UTC Why would this file behave differently on different machines? Is it a file which has not been downloaded when it should have been, and once a computer gets it, it stays there and everything works? Yes I was going to ask why we still had 1 year deadlines. I often find my computers leaving them and going off to do something else. I had to crank up the resource share of CPDN to stop them doing so. ID: 68971 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 68973 - Posted: 25 Jun 2023, 10:19:59 UTC - in response to Message 68971. Why would this file behave differently on different machines? Why should anything else in the tasks do that? Sarah identified a potential problem with a file which makes me think there is a good chance it is the culprit. There are so many variables between computers that pinning down the common link between either those that work or those that don't is never going to be as straightforward as it is with the missing 32-bit library files for the older Linux tasks. I have pretty much eliminated CPU type and OS versions from my list. None of those I looked at were short on RAM which can be an issue when running a lot of tasks. That over a hundred batches of this type of task have run in the past without this problem suggests one of the batch specific files. Once the file in question is identified, it would be nice to know why it affects some and not others but I am not sure we will ever know. ID: 68973 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 809 Credit: 13,604,352 RAC: 5,068	Message 68974 - Posted: 25 Jun 2023, 10:56:38 UTC - in response to Message 68973. As the models are failing right at the start it's almost certainly a problem with the input files. Though normally I would expect to see a floating point exception error because of bad input values rather than a segmentation violation (which means a bad memory reference). However, some bad data, say a negative pressure reference might put a -ve value in a memory reference and cause a segv. Without seeing the process traceback and the model log file it's v difficult to know. If the CPDN server decides to give me some more tasks I'll disable networking to keep the files so I can look at them. However, all my tasks' workunits all failed so I suspect it's a bad input problem. I'll join the CPDN technical meeting tomorrow to find out more. --- CPDN Visiting Scientist ID: 68974 ·
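
To illustrate the point about bad input values turning into bad memory references, here is a minimal sketch in Python. It is not the model's code: the function names, the pressure-to-level mapping and its `top_pa` and `step_pa` parameters are all made up for illustration. It only shows how a negative pressure in an input file could end up as a negative array index, and how a simple pre-check of the input would flag it.

```python
import math


def find_bad_pressures(pressures_pa):
    """Return (position, value) pairs for entries that are not finite and positive.

    Hypothetical sanity check on input data; not part of any CPDN code.
    """
    bad = []
    for i, p in enumerate(pressures_pa):
        if not math.isfinite(p) or p <= 0.0:
            bad.append((i, p))
    return bad


def pressure_to_level(p_pa, top_pa=100.0, step_pa=5000.0):
    """Map a pressure to an integer table index (assumed, illustrative mapping)."""
    return int((p_pa - top_pa) // step_pa)


if __name__ == "__main__":
    values = [101325.0, -101325.0, float("nan")]
    # A sensible pressure gives a sensible index ...
    print(pressure_to_level(values[0]))
    # ... while a negative pressure gives a negative index, which compiled
    # code without bounds checking would use as a memory offset.
    print(pressure_to_level(values[1]))
    # Checking the input first reports the bad entries instead.
    print(find_bad_pressures(values))
```

In a compiled model without bounds checking, the negative index would be used directly as an offset from the start of the array, which is where a segv rather than a floating point exception could come from.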

Bill F Send message Joined: 17 Jan 09 Posts: 120 Credit: 1,460,421 RAC: 3,375	Message 68975 - Posted: 25 Jun 2023, 11:55:44 UTC Well my slowest machine after 4 back to back failures is now 33 minutes into a task and running it by all appearances. It has not trickled yet but it is early. My other system that has a task in progress has done 3 trickles and is still happy. Bill F ID: 68975 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1063 Credit: 16,546,621 RAC: 2,321	Message 68976 - Posted: 25 Jun 2023, 13:20:40 UTC - in response to Message 68974. As the models are failing right at the start it's almost certainly a problem with the input files. Though normally I would expect to see a floating point exception error because of bad input values rather than a segmentation violation (which means a bad memory reference). However, some bad data, say a negative pressure reference might put a -ve value in a memory reference and cause a segv. Should this not be proven impossible, or checked by the program, before a bad memory reference is even generated or used? I.e., when all is said and done, no matter what bad data is presented to a program, it should never get a segmentation violation. The only thing that should cause a segmentation violation in a correct program would be a hardware error. ID: 68976 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 68977 - Posted: 25 Jun 2023, 14:13:17 UTC The only thing that should cause a segmentation violation in a correct program would be a hardware error. It is possible the error is down to how Windows handles the data rather than the Met Office programs. ID: 68977 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 809 Credit: 13,604,352 RAC: 5,068	Message 68979 - Posted: 25 Jun 2023, 16:01:37 UTC - in response to Message 68977. Last modified: 25 Jun 2023, 16:04:32 UTC The only thing that should cause a segmentation violation in a correct program would be a hardware error. It is possible the error is down to how windows handles the data rather than the met office programs. That's not correct. Even if the code is correct, if it's fed bad data that causes an array reference to go out of bounds of the program memory space you will get a segv. --- CPDN Visiting Scientist ID: 68979 ·

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1063 Credit: 16,546,621 RAC: 2,321	Message 68980 - Posted: 25 Jun 2023, 16:53:13 UTC - in response to Message 68979. Even if the code is correct, if it's fed bad data that causes an array reference to go out of bounds of the program memory space you will get a segv. The program knows the dimensions of the array, so it should be able to determine if the array reference is in bounds or not. ID: 68980 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 809 Credit: 13,604,352 RAC: 5,068	Message 68982 - Posted: 25 Jun 2023, 17:11:07 UTC - in response to Message 68980. Last modified: 25 Jun 2023, 17:12:41 UTC Even if the code is correct, if it's fed bad data that causes an array reference to go out of bounds of the program memory space you will get a segv. The program knows the dimensions of the array, so it should be able to determine if the array reference is in bounds or not. That's true but not all codes add this extra protection all the time. The code may have the correct computation but the data can still cause the code to fail. Although compilers can add in automatic array bound checking this is never turned on in production codes as it's a performance hit. ID: 68982 ·
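
The trade-off described here can be sketched as follows. This is only an illustration in Python, with a hypothetical `lookup` helper and `check_bounds` flag standing in for a compiler's optional array bounds checking; it is not how the model or any particular compiler implements it.

```python
def lookup(table, index, check_bounds=True):
    """Return table[index], optionally verifying the index first.

    check_bounds plays the role of a compiler's bounds-checking option:
    safer when on, but an extra comparison on every single access.
    """
    if check_bounds and not (0 <= index < len(table)):
        raise IndexError(f"index {index} outside 0..{len(table) - 1}")
    return table[index]


if __name__ == "__main__":
    levels = [10, 20, 30]
    print(lookup(levels, 1))  # a valid index works either way
    try:
        lookup(levels, -1)  # checked: the bad index is reported
    except IndexError as err:
        print("caught:", err)
    # Unchecked: Python quietly wraps a negative index and returns the
    # wrong element. Compiled code would instead read whatever memory
    # sits at that offset, or segfault if it is outside the process.
    print(lookup(levels, -1, check_bounds=False))
```

With the check on, the bad index is caught at the point of use with a clear message. With it off, the failure depends on where the bad reference happens to land, which fits with the same task behaving differently on different machines.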

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 257 Credit: 32,015,786 RAC: 33,522	Message 68985 - Posted: 25 Jun 2023, 20:52:07 UTC - in response to Message 68977. Last modified: 25 Jun 2023, 20:52:23 UTC The only thing that should cause a segmentation violation in a correct program would be a hardware error. It is possible the error is down to how windows handles the data rather than the met office programs. If you get different behavior, particularly around SIGSEGV, with the same code on different platforms, it's usually related to how memory is allocated and being "a little bit off" the end of an array in one direction or another. I don't do cross platform stuff anymore, but Windows and Linux absolutely handle memory allocation differently enough that the same memory access error (what should be an invalid access) will segfault on one platform, but not the other. They're both "wrong," but "how wrong you have to be to segfault" is different between the platforms. But it almost certainly means the code isn't bounds checking stuff somewhere, and probably could use some Valgrind-based love to catch those. ID: 68985 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 68986 - Posted: 25 Jun 2023, 21:56:16 UTC - in response to Message 68974. As the models are failing right at the start it's almost certainly a problem with the input files. Though normally I would expect to see a floating point exception error because of bad input values rather than a segmentation violation (which means a bad memory reference). However, some bad data, say a negative pressure reference might put a -ve value in a memory reference and cause a segv. Without seeing the process traceback and the model log file it's v difficult to know. If the CPDN server decides to give me some more tasks I'll disable networking to keep the files so I can look at them. However, all my tasks' workunits all failed so I suspect it's a bad input problem. I'll join the CPDN technical meeting tomorrow to find out more. Presumably these files you wish to keep are now on the server from all of us who failed, so somebody can check, do you not have access to them? ID: 68986 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 68987 - Posted: 25 Jun 2023, 22:00:13 UTC - in response to Message 68975. Well my slowest machine after 4 back to back failures is now 33 minutes into a task and running it by all appearances. It has not trickled yet but it is early. My other system that has a task in progress has done 3 trickles and is still happy. Bill F It's difficult with the 1 task a day limit when your computer has been a naughty boy and been forced to strip in front of the headmaster, but I've managed to get three "dodgy" machines to get one task running. So for some reason they can sometimes get a good task. I have had a couple fail after several hours, although most are several minutes. They're actually running faster than my fast machine which filled all 24 threads with tasks. It shows full CPU usage on my monitoring software, but the temperature is a lot lower, and Boinc says they're only getting 2/3 of a CPU core each. I'm guessing these things have big data sets and are overloading the CPU cache? ID: 68987 ·

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2169 Credit: 64,554,250 RAC: 5,969	Message 68988 - Posted: 26 Jun 2023, 7:30:06 UTC - in response to Message 68974. As the models are failing right at the start it's almost certainly a problem with the input files. Though normally I would expect to see a floating point exception error because of bad input values rather than a segmentation violation (which means a bad memory reference). However, some bad data, say a negative pressure reference might put a -ve value in a memory reference and cause a segv. Without seeing the process traceback and the model log file it's v difficult to know. If the CPDN server decides to give me some more tasks I'll disable networking to keep the files so I can look at them. However, all my tasks' workunits all failed so I suspect it's a bad input problem. I'll join the CPDN technical meeting tomorrow to find out more. I only have 3 running on my Ryzen, but they are almost through 7 model months now. Of the work units associated with these tasks, two of the work units had two SEGV failure tasks each, very early in their runs, prior to my download. The third task running on my Ryzen had a similar early SEGV task failure prior to my downloading the 2nd task from that work unit. So, if it's an input file problem, that can't be the reason for the SEGV failures in the work units my three tasks came from. The work units are: https://www.cpdn.org/workunit.php?wuid=12217926 https://www.cpdn.org/workunit.php?wuid=12216852 https://www.cpdn.org/workunit.php?wuid=12217357 Like Dave, my Ryzen is running a version of Ubuntu, with Windows BOINC running under Wine. ID: 68988 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 809 Credit: 13,604,352 RAC: 5,068	Message 68989 - Posted: 26 Jun 2023, 10:12:58 UTC Update: The current Wah batch will be suspended due to the v high number of fails with the same error. We'll be running some tests with the model to understand what's happened before the batch is resubmitted. --- CPDN Visiting Scientist ID: 68989 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 68990 - Posted: 26 Jun 2023, 10:19:12 UTC - in response to Message 68989. Makes sense. Should we let working tasks run to completion or abort? I have seven that have all made it to at least the fourth or fifth model month. ID: 68990 ·

kotenok2000 Send message Joined: 22 Feb 11 Posts: 31 Credit: 226,546 RAC: 4,080	Message 68991 - Posted: 26 Jun 2023, 11:45:48 UTC Last modified: 26 Jun 2023, 11:49:07 UTC I get "climateprediction.net | [http] [ID#21943] Info: Failed to connect to upload7.cpdn.org port 80 after 4356 ms: Couldn't connect to server" when uploading preliminary results. I have 22 stuck uploads for wah2. ID: 68991 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 68992 - Posted: 26 Jun 2023, 14:41:13 UTC - in response to Message 68991. I have 22 stuck uploads for wah2 Andy, Sarah and the researcher in Korea are all aware of this. I currently have over 30 zips waiting to go. Andy is in meetings all day today but tomorrow should be able to give things a nudge be that a tweak from Oxford or an email to the owners of the server in Korea. ID: 68992 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 809 Credit: 13,604,352 RAC: 5,068	Message 68993 - Posted: 26 Jun 2023, 15:56:47 UTC - in response to Message 68990. Makes sense. Should we let working tasks run to completion or abort? I have seven that have all made it to at least 4th or fifth model month. I don't have a good answer to that. I don't know (yet) what CPDN will decide to do about this batch. If it was me, I'd let them run. ID: 68993 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 68994 - Posted: 26 Jun 2023, 16:21:44 UTC - in response to Message 68993. If it was me, I'd let them run. That was what I was going to do in the absence of being told otherwise. ID: 68994 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 809 Credit: 13,604,352 RAC: 5,068	Message 68995 - Posted: 26 Jun 2023, 19:46:12 UTC - in response to Message 68994. If it was me, I'd let them run. That was what I was going to do in the absence of being told otherwise. Dave - have sent you a private message regarding the model logs. Could you please check. Thx. ID: 68995 ·
