OpenIFS Discussion

Alan K Send message Joined: 22 Feb 06 Posts: 485 Credit: 29,638,939 RAC: 3,372	Message 68251 - Posted: 10 Feb 2023, 23:27:00 UTC - in response to Message 68249. Last modified: 10 Feb 2023, 23:35:00 UTC "Initially the BOINC estimated run time is off likely due to the new app version that BOINC has no data for yet." For the six that I have from batch 990 the estimated run time is 2days 23hrs compared to 16hrs (ish) for the previous batches. Edit: Actually running at 5.04% per hour. First one 73% complete after 14 hrs, remaining estimated at 19hrs so adjusting as it goes. ID: 68251 · Reply Quote
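The arithmetic in the post above can be checked with a short sketch. It assumes progress is linear in time, which is only an approximation, and the helper name `remaining_hours` is made up for illustration; it is not part of BOINC and says nothing about how BOINC itself computes its estimate.

```python
def remaining_hours(percent_done: float, percent_per_hour: float) -> float:
    """Hours left if progress continues at a constant rate (an assumption)."""
    if percent_per_hour <= 0:
        raise ValueError("rate must be positive")
    return (100.0 - percent_done) / percent_per_hour


# Figures quoted in the post: 5.04% per hour, 73% complete after 14 hours.
rate = 5.04
done = 73.0
elapsed = 14.0

left = remaining_hours(done, rate)
total_at_rate = 100.0 / rate      # length of a full run at the quoted rate
average_rate = done / elapsed     # rate implied by 73% in 14 hours

print(f"remaining at {rate}%/h: {left:.1f} h")
print(f"full run at {rate}%/h: {total_at_rate:.1f} h")
print(f"average rate so far: {average_rate:.2f} %/h")
```

On these numbers the task would need roughly five to six more hours, well under the 19 hours BOINC was still showing, which fits the observation that the estimate is adjusting as the task runs.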

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1063 Credit: 16,546,621 RAC: 2,321	Message 68252 - Posted: 11 Feb 2023, 3:31:10 UTC - in response to Message 68251. "Initially the BOINC estimated run time is off likely due to the new app version that BOINC has no data for yet." Unfortunately, I have no estimate of how long they were to take.
Task 22250483 First one done on my Linux machine...
Name: oifs_43r3_bl_a051_2016092300_15_949_12166575_0
Workunit: 12166575
Created: 14 Dec 2022, 14:15:27 UTC
Sent: 14 Dec 2022, 14:24:00 UTC
Report deadline: 13 Jan 2023, 14:24:00 UTC
Received: 15 Dec 2022, 12:25:25 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x00000000)
Computer ID: 1511241
Run time: 6 hours 46 min 55 sec
CPU time: 6 hours 41 min 2 sec
Validate state: Valid
Credit: 1,232.00
Application version: OpenIFS 43r3 Baroclinic Lifecycle v1.07 x86_64-pc-linux-gnu
Task 22250807 Most recent one done on my Linux machine.
Name: oifs_43r3_bl_a04c_2016092300_15_949_12166550_2
Workunit: 12166550
Created: 19 Dec 2022, 2:21:53 UTC
Sent: 19 Dec 2022, 2:23:58 UTC
Report deadline: 18 Jan 2023, 2:23:58 UTC
Received: 19 Dec 2022, 9:23:21 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x00000000)
Computer ID: 1511241
Run time: 6 hours 12 min 40 sec
CPU time: 6 hours 7 min 11 sec
Validate state: Valid
Credit: 1,232.00
ID: 68252 · Reply Quote

AndreyOR Send message Joined: 12 Apr 21 Posts: 247 Credit: 12,040,847 RAC: 20,958	Message 68253 - Posted: 11 Feb 2023, 6:59:15 UTC - in response to Message 68252. "Unfortunately, I have no estimate of how long they were to take." Those 2 tasks are from a BL test batch (949) from a couple of months ago using the old app version (1.07). I'm not sure that I'd use them for any significant info or comparison as they were just part of the initial test runs in preparation for the OIFS release. Production runs are likely to be different and will use the latest app version (1.11 or newer). ID: 68253 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 68254 - Posted: 11 Feb 2023, 11:43:49 UTC Got this on the last of my tasks from 990:
[EC_DRHOOK:swarm:1:1:4860:4860] [20230211:101058:1676110258:14770.286] [signal_drhook@/home/glenn/github/jamie_oifs43r3.git/src/ifsaux/support/drhook.c:1734] DrHook backtrace done for signal#8, nsigs = 1
[EC_DRHOOK:swarm:1:1:4860:4860] [20230211:101058:1676110258:14770.286] [signal_drhook@/home/glenn/github/jamie_oifs43r3.git/src/ifsaux/support/drhook.c:1785] Calling previous signal handler at 0x1ce8cf0 for signal#8, nsigs = 1
forrtl: error (65): floating invalid
ID: 68254 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1063 Credit: 16,546,621 RAC: 2,321	Message 68255 - Posted: 11 Feb 2023, 12:36:54 UTC - in response to Message 68254. If you want to know what the signals mean in Linux, consider the following table where they are defined. Especially #8. Floating point exception. You might wish to keep it around for reference. And here is an explanation on how it can occur. https://itslinuxfoss.com/floating-point-exception-core-dumped/
#define SIGHUP 1
#define SIGINT 2
#define SIGQUIT 3
#define SIGILL 4
#define SIGTRAP 5
#define SIGABRT 6
#define SIGIOT 6
#define SIGBUS 7
#define SIGFPE 8
#define SIGKILL 9
#define SIGUSR1 10
#define SIGSEGV 11
#define SIGUSR2 12
#define SIGPIPE 13
#define SIGALRM 14
#define SIGTERM 15
#define SIGSTKFLT 16
#define SIGCHLD 17
#define SIGCONT 18
#define SIGSTOP 19
#define SIGTSTP 20
#define SIGTTIN 21
#define SIGTTOU 22
#define SIGURG 23
#define SIGXCPU 24
#define SIGXFSZ 25
#define SIGVTALRM 26
#define SIGPROF 27
#define SIGWINCH 28
#define SIGIO 29
#define SIGPOLL SIGIO
/* #define SIGLOST 29 */
#define SIGPWR 30
#define SIGSYS 31
#define SIGUNUSED 31
ID: 68255 · Reply Quote
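For anyone who would rather look a number up than keep the table around, here is a minimal sketch using Python's standard `signal` module. The helper name `signal_name` is invented for this example. Signal numbers are platform-specific, so the result is only guaranteed to match the table above on a comparable Linux system.

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name of a signal number on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # Raised when the number is not a known signal on this platform.
        return f"unknown signal {number}"


# signal#8 is the one reported in the DrHook log lines.
print(signal_name(8))
print(signal_name(11))
```

On Linux, `kill -l` in a shell prints the same mapping.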

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 68256 - Posted: 11 Feb 2023, 13:20:19 UTC If you want to know what the signals mean in Linux, consider the following table where they are defined. Especially #8. Floating point exception. You might wish to keep it around for reference. And here is an explanation on how it can occur. https://itslinuxfoss.com/floating-point-exception-core-dumped/ Thanks, looking at the link and also in a couple of other places, this one is I suspect down to the physics of the model producing a value that the program doesn't like. It will be interesting to see what happens on subsequent attempts. ID: 68256 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1063 Credit: 16,546,621 RAC: 2,321	Message 68257 - Posted: 11 Feb 2023, 14:21:14 UTC - in response to Message 68256. Thanks, looking at the link and also in a couple of other places, this one is I suspect down to the physics of the model producing a value that the program doesn't like. It will be interesting to see what happens on subsequent attempts. I agree. But do not overlook the possibility of bad addresses, bad subscripts in arrays, or using dynamically allocated memory that has been freed, yet still used by defective programs. These can all give very strange, difficult-to-reproduce, errors. ID: 68257 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 68258 - Posted: 11 Feb 2023, 14:23:57 UTC But do not overlook the possibility of bad addresses, bad subscripts in arrays, or using dynamically allocated memory that has been freed, yet still used by defective programs. These can all give very strange, difficult-to-reproduce, errors. Of course. Though very little running on this computer. Only programs open apart from BOINC were Firefox with only a couple of tabs open, Thunderbird and Libre Office. ID: 68258 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1063 Credit: 16,546,621 RAC: 2,321	Message 68259 - Posted: 11 Feb 2023, 15:27:01 UTC - in response to Message 68258. Last modified: 11 Feb 2023, 15:40:50 UTC Got this on the last of my tasks from 990 Of course. Though very little running on this computer. Only programs open apart from BOINC were Firefox with only a couple of tabs open, Thunderbird and Libre Office. If you could somehow send me that work unit, I could run it on my machine. IIRC my machine has never failed to complete any of these Oifs work units. Neither the _ps nor the _bl ones. Almost 300 tasks altogether. P.s.: I just got Name oifs_43r3_ps_0561_2021050100_123_990_12206681_2 Workunit 12206681 that has failed for the two previous attempts. Each has failed for very different reasons. I betcha it works on my machine. ID: 68259 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 68260 - Posted: 11 Feb 2023, 15:49:05 UTC If you could somehow send me that work unit, I could run it on my machine. IIRC my machine has never failed to complete any of these Oifs work units. Neither the _ps nor the _bl ones. Almost 300 tasks altogether. The Intel machine running the second attempt has a very good record (about 1% failure rate.) I don't know if the failure on my machine is something AMD ones are more prone to or not. I know the facility to send a work unit to only a specific machine exists as it has been used on the testing site at times but I am not aware of it ever being used on the Main site. ID: 68260 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1063 Credit: 16,546,621 RAC: 2,321	Message 68261 - Posted: 11 Feb 2023, 15:57:09 UTC - in response to Message 68260. "The Intel machine running the second attempt has a very good record (about 1% failure rate.) I don't know if the failure on my machine is something AMD ones are more prone to or not." Well, my machine is Intel; at present I am allowing 12 cores to run Boinc tasks.
CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors: 16
Operating System: Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.7 (Ootpa) [4.18.0-425.10.1.el8_7.x86_64|libc 2.28]
BOINC version: 7.20.2
Memory: 62.4 GB
Cache: 16896 KB
Swap space: 15.62 GB
Total disk space: 488.04 GB
Free Disk Space: 479.24 GB
ID: 68261 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 809 Credit: 13,604,352 RAC: 5,068	Message 68264 - Posted: 11 Feb 2023, 17:22:26 UTC - in response to Message 68254. Last modified: 11 Feb 2023, 17:24:47 UTC
"Got this on the last of my tasks from 990
[EC_DRHOOK:swarm:1:1:4860:4860] [20230211:101058:1676110258:14770.286] [signal_drhook@/home/glenn/github/jamie_oifs43r3.git/src/ifsaux/support/drhook.c:1734] DrHook backtrace done for signal#8, nsigs = 1
[EC_DRHOOK:swarm:1:1:4860:4860] [20230211:101058:1676110258:14770.286] [signal_drhook@/home/glenn/github/jamie_oifs43r3.git/src/ifsaux/support/drhook.c:1785] Calling previous signal handler at 0x1ce8cf0 for signal#8, nsigs = 1
forrtl: error (65): floating invalid"
I've seen that one. If you're interested, look further back in the traceback and you'll see:
>OMP-RADINTG-RADLSW (1210) RADIATION_SCHEME radiation_interface:radiation radiation_cloud_optics:cloud_optics
The model has failed in the radiation code. Floating invalid is usually a divide-by-zero. There were a few WUs that failed each try because the butterfly wings were perhaps too big :)
Interesting though, there were a few other cases where the model failed like this on AMD hardware, the resend went to an Intel CPU and worked fine. Which is why they've been tried again. It's got nothing to do with bad memory etc. It's just normal differences in floating point arithmetic seen on different hardware (and OSes from different math libs).
ID: 68264 · Reply Quote
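As a rough illustration of the point about floating point arithmetic, the sketch below shows two general IEEE 754 behaviours in Python. It does not reproduce the OpenIFS failure or anything in the radiation code; it only shows that the order of operations can change a result in the last bit, and that some operations have no valid numerical answer. The variable names are arbitrary.

```python
import math

# Floating-point addition is not associative: the same three numbers summed
# in a different order can differ in the last bit. Different compilers, math
# libraries and CPUs may order or fuse operations differently.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print("same result:", a == b)
print("difference:", abs(a - b))

# An "invalid" operation in IEEE 754 terms: infinity minus infinity has no
# meaningful value, so the result is NaN (not a number).
x = math.inf - math.inf
print("inf - inf is NaN:", math.isnan(x))

# Python raises an exception on float division by zero instead of returning
# inf or NaN; compiled code with trapping enabled would get a signal instead.
try:
    1.0 / 0.0
except ZeroDivisionError as exc:
    print("division by zero trapped:", exc)
```

A difference in the last bit is harmless in most code, but in a long model run it may be enough to push a later calculation onto a different path.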

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1063 Credit: 16,546,621 RAC: 2,321	Message 68265 - Posted: 11 Feb 2023, 17:52:18 UTC - in response to Message 68264. "It's got nothing to do with bad memory etc. It's just normal differences in floating point arithmetic seen on different hardware (and OSes from different math libs)." I agree that this is the most likely explanation. I did not mean that it was a failure of the memory of the machine doing the task. I meant that the current (and probably all future) Oifs models do a lot of memory allocation and freeing during their execution, and some failures seem to complain about freeing the same memory more than once; indicating, most likely, a programming error. And that being a possible thing, it is most vexing to find.
In a former life, I was involved in writing (part of) the optimizer for the C compiler in UNIX. And people accused the optimizer of being defective because it gave different results than when code was not optimized. It turns out that the optimizer was not at fault. We guaranteed that our optimizer would give the same result for correctly-written code, but were silent about what would happen for incorrect code. We even compiled and ran the UNIX kernel and all the libraries with the optimizer turned on. It turns out that there was a lot of code out there that used pointers that were not initialized, so G.O.K. what values they had. Most were zero, and it was easy to trap those since we never stored anything in the bottom page of RAM, so all traps to there were uninitialized pointers. We found so many of those that they would not even read my MRs after a while. We had a secretary file my MRs with her name on them for a while, but then they caught on. By the time I left, they had never fixed those problems.
ID: 68265 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 809 Credit: 13,604,352 RAC: 5,068	Message 68266 - Posted: 11 Feb 2023, 19:46:12 UTC - in response to Message 68265. Last modified: 11 Feb 2023, 19:48:08 UTC
"It's got nothing to do with bad memory etc. It's just normal differences in floating point arithmetic seen on different hardware (and OSes from different math libs). I agree that this is the most likely explanation. I did not mean that it was a failure of the memory of the machine doing the task. I meant that the current (and probably all future) Oifs models do a lot of memory allocation and freeing during their execution, and some failures seem to complain about freeing the same memory more than once; indicating, most likely, a programming error. And that being a possible thing, it is most vexing to find. In a former life, I was involved in writing (part of) the optimizer for the C compiler in UNIX. And people accused the optimizer of being defective because it gave different results than when code was not optimized. It turns out that the optimizer was not at fault. We guaranteed that our optimizer would give the same result for correctly-written code, but were silent about what would happen for incorrect code. We even compiled and ran the UNIX kernel and all the libraries with the optimizer turned on. It turns out that there was a lot of code out there that used pointers that were not initialized, so G.O.K. what values they had. Most were zero, and it was easy to trap those since we never stored anything in the bottom page of RAM, so all traps to there were uninitialized pointers. We found so many of those that they would not even read my MRs after a while. We had a secretary file my MRs with her name on them for a while, but then they caught on. By the time I left, they had never fixed those problems."
Don't get me started on code optimizers - especially when dealing with vector instructions. I have a couple of stories there...
Anyway, the OpenIFS code does do a lot of heap allocate/free (it's mostly Fortran code) but the memory problems that have been reported here are not from the model but from the C++ wrapper code that monitors it and talks to BOINC, just in case I've confused things. It's newer code and not as tried and tested as the model. I agree completely about being careful with code & optimizers. I once saw a model go from radiative heating in the model stratosphere to radiative cooling just by moving the code to a new machine & compiler (I forget what that was now). That wasn't a good thing, and it took time to understand. Before we put out these batches which have slight model perturbations, the idea of how much perturbation occurs from different computers was discussed. The machine perturbations are relatively small compared to the model changes being made, so the "hardware-only" model outcomes will still be part of the perturbation space explored by the scientist's perturbations. ID: 68266 · Reply Quote
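To give a feel for why a tiny machine-level difference can still change where a run ends up, here is a toy sketch using the logistic map, a textbook chaotic system. It has nothing to do with the equations in OpenIFS; the function name, the parameter r = 3.9 and the size of the nudge are all choices made for this illustration only.

```python
def logistic_trajectory(x0: float, r: float = 3.9, steps: int = 200) -> list:
    """Iterate x -> r * x * (1 - x), a simple chaotic map for this value of r."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


# Two runs whose starting values differ by about one part in 10**12,
# standing in for a rounding-level difference between two machines.
base = logistic_trajectory(0.2)
nudged = logistic_trajectory(0.2 + 1e-12)

gaps = [abs(p - q) for p, q in zip(base, nudged)]
print(f"gap after 1 step: {gaps[1]:.2e}")
print(f"largest gap over the run: {max(gaps):.2f}")
```

The two runs start out indistinguishable and end up unrelated, which is the toy version of one machine completing a task that another fails. Whether such differences matter scientifically is a separate question; the post above says they are small compared to the deliberate model perturbations.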

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1063 Credit: 16,546,621 RAC: 2,321	Message 68270 - Posted: 12 Feb 2023, 4:06:15 UTC - in response to Message 68259. "I just got Name oifs_43r3_ps_0561_2021050100_123_990_12206681_2 Workunit 12206681 that has failed for the two previous attempts. Each has failed for very different reasons. I betcha it works on my machine." I win. My attempt worked just fine:
Task 22306953
Name: oifs_43r3_ps_0561_2021050100_123_990_12206681_2
Workunit: 12206681
Created: 11 Feb 2023, 12:22:34 UTC
Sent: 11 Feb 2023, 12:25:27 UTC
Report deadline: 12 Apr 2023, 12:25:27 UTC
Received: 12 Feb 2023, 3:56:11 UTC
Server state: Over
Outcome: Success
Client state: Done
Exit status: 0 (0x00000000)
Computer ID: 1511241
Run time: 15 hours 20 min 39 sec
CPU time: 15 hours 1 min 51 sec
Validate state: Valid
Credit: 0.00
Device peak FLOPS: 6.06 GFLOPS
Application version: OpenIFS 43r3 Perturbed Surface v1.09 x86_64-pc-linux-gnu
OpenIFS 43r3 Perturbed Surface 1.09 x86_64-pc-linux-gnu
Number of tasks completed: 2
Max tasks per day: 6
Number of tasks today: 0
Consecutive valid tasks: 2
Average processing rate: 28.85 GFLOPS
Average turnaround time: 0.64 days
ID: 68270 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 68271 - Posted: 12 Feb 2023, 7:18:26 UTC And the one that failed on my Ryzen has completed on its second attempt. ID: 68271 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 68274 - Posted: 12 Feb 2023, 18:20:44 UTC - in response to Message 68271. Now running one that has failed once on an Intel machine and once on AMD. The AMD is a double corruption and the Intel is free(): invalid next size (fast) ID: 68274 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 809 Credit: 13,604,352 RAC: 5,068	Message 68275 - Posted: 12 Feb 2023, 19:03:19 UTC - in response to Message 68274. Now running one that has failed once on an intel machine and once on AMD. The AMD is a double corruption and the Intel is free(): invalid next size (fast) Don't take this the wrong way, but I sincerely hope that fails as well. Then we may have found a repeatable failure - which has eluded me so far. As for the other AMD:fail, Intel:Ok, I am wondering whether to turn down the optimization level on the Intel compiler I use for the model. Thx for reporting. Links to the WUs pages are useful too. ID: 68275 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4347 Credit: 16,541,921 RAC: 6,087	Message 68276 - Posted: 12 Feb 2023, 19:38:24 UTC - in response to Message 68275. Last modified: 12 Feb 2023, 19:47:29 UTC Thx for reporting. Links to the WUs pages are useful too. Work unit I am only running the one task at the moment and set to a maximum of 2 which will minimise the chances of other tasks interfering. Edited to provide the correct work unit. Edit2: Intel failed after uploading zip 95. The AMD managed another 10 zips so possibly not a smoking gun. ID: 68276 · Reply Quote

biodoc Send message Joined: 2 Oct 19 Posts: 21 Credit: 46,384,136 RAC: 14,445	Message 68277 - Posted: 12 Feb 2023, 20:16:00 UTC - in response to Message 68275. As for the other AMD:fail, Intel:Ok, I am wondering whether to turn down the optimization level on the Intel compiler I use for the model. I'm interested to know why you chose the Intel compiler over GCC. Would GCC offer better compatibility with the hardware and OS heterogeneity on a DC project? ID: 68277 · Reply Quote