OpenIFS Discussion

Message boards : Number crunching : OpenIFS Discussion

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 68491 - Posted: 26 Feb 2023, 21:40:04 UTC - in response to Message 68490.  

It would be fun to send out an identical batch where I've compiled the code on a Ryzen with the intel compiler instead of Intel+Intel and see what happens :)


Is there a reason you can't? Send out a couple hundred otherwise identical WUs in a few batches and compare/contrast results?

I don't think there's a shortage of willing CPU cores right now.

Or even just have some people run the binaries manually and send you results somehow. I've got a range of AMD systems that are mostly bored!


What about compiling with a Ryzen-specific compiler? I didn't even know they existed till I did a search!
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1063
Credit: 16,546,621
RAC: 2,321
Message 68492 - Posted: 26 Feb 2023, 21:45:53 UTC - in response to Message 68490.  

It would be fun to send out an identical batch where I've compiled the code on a Ryzen with the intel compiler instead of Intel+Intel and see what happens :)

Is there a reason you can't? Send out a couple hundred otherwise identical WUs in a few batches and compare/contrast results?


What is the reason for this experiment? Do you think there is an error in the GNU compilers, gcc and g++, and that the Intel compiler is free from that error?
What if both compilers gave identical results? Or worse, what if they gave inconsistent, non-identical results? How do you propose to analyze the results of this experiment to resolve such possibilities?

Will the stuff you compile on a Ryzen run on my Intel machine running Linux? I sure would not wish to debug it if it did not work.
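(Whether a binary built on a Ryzen runs on an Intel machine mostly comes down to which instruction-set extensions the compiler was told to target, not to the vendor of the build host: both implement the same x86-64 extensions. A rough way to check what a Linux host supports is to look at the CPU flags; this is only a sketch, and `has_cpu_flag` is a made-up helper name, not anything from the project.)

```shell
# Sketch (assumes Linux): check a cpuinfo-style "flags" listing for a given
# instruction-set extension. A binary built with e.g. -march=native on a
# modern box may require avx2/fma; it will only run on hosts whose flags
# include them. Feed this /proc/cpuinfo on a real machine.
has_cpu_flag() {
    # $1 = flag name (e.g. avx2), stdin = cpuinfo text
    grep -E '^flags' | grep -qw -- "$1"
}

# Example:
# has_cpu_flag avx2 < /proc/cpuinfo && echo "AVX2 supported"
```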
alanb1951

Joined: 31 Aug 04
Posts: 32
Credit: 9,526,696
RAC: 109,831
Message 68494 - Posted: 27 Feb 2023, 0:35:06 UTC
Last modified: 27 Feb 2023, 0:41:18 UTC

I don't know whether the below is of any diagnostic use, but I'll report it in case...

I just noticed that one of my tasks (for work unit 12215433) had stalled over 24 hours ago, and it appeared that model.exe had finished but for some reason the wrapper (which was still present but quiescent) hadn't dealt with it as the stderr.txt file ended with the following:
  15:37:36 STEP 2952 H=2952:00 +CPU= 20.358

That 15:37:36 was on the 25th, and when I checked my boinc log for around that time I saw the usual flurry of checkpoint messages that seem to accompany the construction and submission of a trickle, but the next scheduler request was for new work, not a trickle, and there was no sign of the files being uploaded. As well as checking the boinc log, I checked the system logs to see if there was anything odd around that time -- there wasn't anything obvious.

Rather than just aborting it, I decided to suspend and resume it to see what would happen; I wasn't optimistic that it would recover successfully (as something had obviously broken initially), but it did seem to restart and, of course, it shut down more or less immediately (nothing more to do!) This time, it managed to upload the files and flesh out the end of stderr.txt; unfortunately it then reported "double free or corruption (!prev)", so no luck... ((!prev) instead of the seemingly more usual (out) --- an effect of not really having anything to do?)

I see that a retry has gone out promptly, and I suspect it'll run to completion without problems -- ah, well...

Cheers - Al.
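(For anyone wanting to check the same symptom: the client's own log should show whether a trickle upload actually started around the time in question. A rough filter sketch follows; the log path is the Debian/Ubuntu packaged-client default, and the "Started upload"/"Finished upload"/"trickle" substrings are what I'd expect the client to log, so treat both as assumptions and adjust for your install.)

```shell
# Sketch: pull upload/trickle-related lines out of a BOINC client log for a
# given date. Other installs keep stdoutdae.txt in their own data directory.
filter_upload_events() {
    # $1 = date string to match (e.g. "25-Feb-2023"), stdin = client log
    grep -F -- "$1" | grep -E 'trickle|Started upload|Finished upload'
}

# Example:
# filter_upload_events "25-Feb-2023" < /var/lib/boinc-client/stdoutdae.txt
```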
Glenn Carver

Joined: 29 Oct 17
Posts: 810
Credit: 13,614,292
RAC: 5,636
Message 68504 - Posted: 27 Feb 2023, 13:55:15 UTC - in response to Message 68494.  

Yes, thanks. This behaviour has been noted and reported by others. It seems to be something going wrong when the task reports to the client that it's finished; for some unknown reason, it appears to get stuck in the client. Shutting down and restarting the client has been successful at getting the task to complete. Other projects, not just CPDN, have observed this behaviour according to other forum posts I've read. It doesn't appear to happen very often. I was going to look at the code to make sure we tidy everything up in terms of closing files etc., to see if that might cure it.

I don't know whether the below is of any diagnostic use, but I'll report it in case...

I just noticed that one of my tasks (for work unit 12215433) had stalled over 24 hours ago, and it appeared that model.exe had finished but for some reason the wrapper (which was still present but quiescent) hadn't dealt with it as the stderr.txt file ended with the following:
  15:37:36 STEP 2952 H=2952:00 +CPU= 20.358

That 15:37:36 was on the 25th, and when I checked my boinc log for around that time I saw the usual flurry of checkpoint messages that seem to accompany the construction and submission of a trickle, but the next scheduler request was for new work, not a trickle, and there was no sign of the files being uploaded. As well as checking the boinc log, I checked the system logs to see if there was anything odd around that time -- there wasn't anything obvious.

Rather than just aborting it, I decided to suspend and resume it to see what would happen; I wasn't optimistic that it would recover successfully (as something had obviously broken initially), but it did seem to restart and, of course, it shut down more or less immediately (nothing more to do!) This time, it managed to upload the files and flesh out the end of stderr.txt; unfortunately it then reported "double free or corruption (!prev)", so no luck... ((!prev) instead of the seemingly more usual (out) --- an effect of not really having anything to do?)

I see that a retry has gone out promptly, and I suspect it'll run to completion without problems -- ah, well...

Cheers - Al.
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 68518 - Posted: 1 Mar 2023, 11:01:21 UTC

Just got this as a resend that appears to have finished but no stderr on the original task.
Glenn Carver

Joined: 29 Oct 17
Posts: 810
Credit: 13,614,292
RAC: 5,636
Message 68519 - Posted: 1 Mar 2023, 12:35:02 UTC - in response to Message 68518.  

Just got this as a resend that appears to have finished but no stderr on the original task.
Haha. That was one of mine. I switched the downloaded oifs_*x86_64-pc-linux-gnu control executable to my development version so I could test it 'live' for this batch. But I made a mistake for this task. Glad it's in safe hands! :D

Pretty confident the 'double corruption' problem was in the trickle code, which has now been rewritten. Will need a big test batch to confirm though.
---
CPDN Visiting Scientist
JagDoc

Joined: 21 Dec 22
Posts: 5
Credit: 6,235,426
RAC: 8,550
Message 68530 - Posted: 1 Mar 2023, 18:09:09 UTC
Last modified: 1 Mar 2023, 18:12:15 UTC

One of my hosts has some WUs with a different error.
https://www.cpdn.org/show_host_detail.php?hostid=1538124

It runs 2 x IFS tasks and 2 x ODLK1 tasks.

Htop shows 3 x IFS tasks running, 2 of them in one slot; how can that be?
   PID USER      PRI  NI  VIRT   RES   SHR S CPU%▽MEM%   TIME+  Command
  30914 boinc      39  19 4230M 3782M 33456 R 100. 11.9 12h47:38 /var/lib/boinc-client/slots/1/oifs_43r3_model.exe
  31490 boinc      39  19 2782M 2585M 33456 R 100.  8.1  8h39:56 /var/lib/boinc-client/slots/1/oifs_43r3_model.exe
  30791 boinc      39  19 4269M 3807M 33456 R 99.7 12.0 14h02:11 /var/lib/boinc-client/slots/0/oifs_43r3_model.exe


This is what top shows:
top - 19:10:15 up 6 days,  7:58,  2 users,  load average: 3.02, 3.14, 3.69
Tasks: 217 total,   4 running, 213 sleeping,   0 stopped,   0 zombie
%CPU(s):  0.0 us,  3.0 sy, 72.1 ni, 24.8 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  31848.8 total,  13202.1 free,  11289.3 used,   7357.4 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.  20045.2 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  30791 boinc     39  19 3862844   3.0g  33456 R 100.0   9.6 859:50.07 oifs_43r3_model
  31490 boinc     39  19 4331988   3.6g  33456 R 100.0  11.7 537:34.95 oifs_43r3_model
  30914 boinc     39  19 4331992   3.6g  33456 R 100.0  11.7 785:16.60 oifs_43r3_model
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 68531 - Posted: 1 Mar 2023, 18:53:05 UTC
Last modified: 1 Mar 2023, 18:55:34 UTC

Glenn can tell you more about this. It is a problem with tasks suspending and restarting, I think. This may be one of the issues that has been sorted ready for the next batch. Glenn is currently on Zoom at the BOINC workshop.
Glenn Carver

Joined: 29 Oct 17
Posts: 810
Credit: 13,614,292
RAC: 5,636
Message 68533 - Posted: 1 Mar 2023, 21:39:05 UTC - in response to Message 68530.  

Htop shows 3 x IFS tasks running, 2 of them in one slot; how can that be?
   PID USER      PRI  NI  VIRT   RES   SHR S CPU%▽MEM%   TIME+  Command
  30914 boinc      39  19 4230M 3782M 33456 R 100. 11.9 12h47:38 /var/lib/boinc-client/slots/1/oifs_43r3_model.exe
  31490 boinc      39  19 2782M 2585M 33456 R 100.  8.1  8h39:56 /var/lib/boinc-client/slots/1/oifs_43r3_model.exe
  30791 boinc      39  19 4269M 3807M 33456 R 99.7 12.0 14h02:11 /var/lib/boinc-client/slots/0/oifs_43r3_model.exe
This happens because of the 'memory corruption' problem oft reported here. Aside from the boinc client, there are two processes involved in the OpenIFS tasks. One is the model itself (oifs_43r3_model.exe), the other is a controlling process (oifs_43r3_1.21_x86_64-linux-gnu-pc) which monitors the model and reports back to the client. It's this second process that has the memory fault which kills it. Normally, when this process dies it *should* also kill the model, but for some reason, on odd occasions, it leaves the model running. Eventually the boinc client spots a rogue process is still running and kills it. Unfortunately by then, two models in the same slot will have corrupted some of the files and the task will eventually fail.

If you see this happen, the best thing to do is to shut down the boinc client, then restart it. That will clear out any rogue processes.

I think we have solved this problem with the latest code, which will go out to production once tested. Hope that's understandable.
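(A quick way to spot the symptom described above, i.e. two model processes sharing one slot directory, is to scan the process list. This is only a sketch: `find_dup_slots` is a made-up helper name, and the slot path pattern assumes the standard Linux client layout.)

```shell
# Sketch: list BOINC slot directories with more than one OpenIFS model
# process running in them -- the "two models in one slot" symptom.
# Feed it the output of:  ps -eo args
find_dup_slots() {
    grep -o '/slots/[0-9]*/oifs_43r3_model\.exe' |
        sort | uniq -c | awk '$1 > 1 { print $2 }'
}

# Example:
# ps -eo args | find_dup_slots
# Any path printed means that slot has duplicate models; shutting down and
# restarting the boinc client is the fix recommended above.
```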
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 68535 - Posted: 1 Mar 2023, 22:03:56 UTC - in response to Message 68519.  

Just got this as a resend that appears to have finished but no stderr on the original task.
Haha. That was one of mine. I switched the downloaded oifs_*x86_64-pc-linux-gnu control executable to my development version so I could test it 'live' for this batch. But I made a mistake for this task. Glad it's in safe hands! :D

Pretty confident the 'double corruption' problem was in the trickle code, which has now been rewritten. Will need a big test batch to confirm though.
And completed successfully.
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 68563 - Posted: 6 Mar 2023, 8:42:11 UTC
Last modified: 6 Mar 2023, 10:42:51 UTC

#993 is looking good.
Success: 1803 (90%)
Fails: 418 (21%)
Hard Fail: 5 (0%)
Running: 192 (10%)
Especially when you consider that the 21% of fails includes the model failures due to the physics and the ones that fail because users run too many tasks for the amount of RAM they have. Assuming Glenn is right about having sorted out the 'double corruption' errors (makes it sound like politics), the next lot should be better still.
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 68615 - Posted: 21 Mar 2023, 19:35:41 UTC - in response to Message 68614.  

Are there any more batches of work from OpenIFS to be released, or is that the lot for now?
Nothing new in testing, and nothing has appeared on the moderators' email list. (New work doesn't usually get mentioned there unless someone wants us to post something about it anyway.) My WAH2 Windows Hadley models in testing are less than half way through their 50 days, so I don't expect main-site work from them to arrive for a while yet. I check for signs of new work on both the testing and main sites about three times a week and post when something looks hopeful. Glenn is more likely to know about new OIFS work than I am, but he is a volunteer programmer, and if he is busy with other things he may not know the current state of play. Sorry I am not able to say more at the moment.
Drago75

Joined: 8 Jan 22
Posts: 9
Credit: 1,498,898
RAC: 4,053
Message 68617 - Posted: 22 Mar 2023, 12:27:08 UTC - in response to Message 68616.  
Last modified: 22 Mar 2023, 12:27:32 UTC

This has probably been raised on a number of occasions, but it still puzzles me, so I would like to ask it again. The project has 45,600 active work units which don't ever seem to finish. Over the past few months I noticed that the majority of work is being completed within 10-14 days. Wouldn't it be a good idea to reduce their expiry date to less than 4 weeks? Maybe even to 14 days? Those WUs run for approx. 18-24 hours and they don't seem to like being paused. The only real way to run them is either continuously or by interrupting them by sending the PC to standby. Either way, once started they should finish within days. When I look at the WAH units, they still allow a year to be completed. If a calculation run takes that long, it isn't any faster than the real weather outside. So if the project's aim is to predict the weather of the future, don't the scientists need the data as quickly as possible? There seem to be a lot of crunchers here who would be willing to process more data but don't get enough work.
Glenn Carver

Joined: 29 Oct 17
Posts: 810
Credit: 13,614,292
RAC: 5,636
Message 68618 - Posted: 22 Mar 2023, 12:32:15 UTC - in response to Message 68614.  

Are there any more batches of work from OpenIFS to be released, or is that the lot for now?
That's it for now. There are no OpenIFS batches planned for the near future, the scientists need time to look at the data collected from the previous ones and then there might be some more. There may also be some testing batches in due course but don't hold your breath.
---
CPDN Visiting Scientist
Glenn Carver

Joined: 29 Oct 17
Posts: 810
Credit: 13,614,292
RAC: 5,636
Message 68619 - Posted: 22 Mar 2023, 12:38:43 UTC - in response to Message 68617.  

This has probably been raised on a number of occasions, but it still puzzles me, so I would like to ask it again. The project has 45,600 active work units which don't ever seem to finish. Over the past few months I noticed that the majority of work is being completed within 10-14 days. Wouldn't it be a good idea to reduce their expiry date to less than 4 weeks? Maybe even to 14 days? Those WUs run for approx. 18-24 hours and they don't seem to like being paused. The only real way to run them is either continuously or by interrupting them by sending the PC to standby. Either way, once started they should finish within days. When I look at the WAH units, they still allow a year to be completed. If a calculation run takes that long, it isn't any faster than the real weather outside. So if the project's aim is to predict the weather of the future, don't the scientists need the data as quickly as possible? There seem to be a lot of crunchers here who would be willing to process more data but don't get enough work.
It's a leftover from the early days of CPDN when model runs used to take 6 months or so (I think it was). It's just one thing they haven't got around to changing for the Hadley models. I changed it for OpenIFS, though we got caught out when the server went down and tasks started timing out. But yes, if you get one of the old reruns from a workunit that's been around for over ~4 months, I'd abort it. The scientist will have got the data by then and moved on, I suspect.

I'll bring it up again when I next talk to them. As has been said many times, they are a very small team and little things like this have to make way for bigger issues.
---
CPDN Visiting Scientist
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 68620 - Posted: 22 Mar 2023, 13:08:05 UTC - in response to Message 68619.  

I'll bring it up again when I next talk to them. As has been said many times, they are a very small team and little things like this have to make way for bigger issues.
Thanks Glenn, though it's worth noting that even on my Ryzen 7 3700X the four testing tasks I am running under WINE are going to take a few hours over 50 days to complete. (WAH2 25 km grid SE Asia, so covering a much greater area than the ANZ regional models.)
SolarSyonyk

Joined: 7 Sep 16
Posts: 257
Credit: 32,024,897
RAC: 33,224
Message 68621 - Posted: 22 Mar 2023, 15:37:18 UTC - in response to Message 68617.  

Wouldn't it be a good idea to reduce their expiry date to less than 4 weeks? Maybe even to 14 days? Those WUs run for approx. 18-24 hours and they don't seem to like being paused. The only real way to run them is either continuously or by interrupting them by sending the PC to standby.


Standby works quite nicely. I have a mild preference that they not be shorter, just because I do all my crunching on solar, off grid, and we get weeks without a lot of sun, but if it's useful to the project for them to be shorter, fine. I would ask that they not be shortened without good reason, though. And, as noted, the whole "server outage" really fouled up a lot of stuff, partly due to the lower time to return.
zombie67 [MM]
Joined: 2 Oct 06
Posts: 52
Credit: 26,209,214
RAC: 3,355
Message 68626 - Posted: 23 Mar 2023, 13:58:32 UTC - in response to Message 68617.  

The project has 45,600 active work units which don't ever seem to finish.


Don't believe the numbers on the server status page. If you add up all the tasks in progress for the individual projects at the bottom of the page (25,646), it is not even close to the total number in the upper right (45,600). I don't believe either of those numbers. If I had to guess, there are no more than a few thousand tasks actually in progress.
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 68627 - Posted: 23 Mar 2023, 15:03:53 UTC - in response to Message 68626.  
Last modified: 23 Mar 2023, 16:14:04 UTC

Don't believe the numbers on the server status page. If you add up all the tasks in progress for the individual projects at the bottom of the page (25,646), it is not even close to the total number in the upper right (45,600). I don't believe either of those numbers. If I had to guess, there are no more than a few thousand tasks actually in progress.
Sometimes I think the only correct number is "Tasks ready to send = 0". Though I doubt the numbers for the OIFS tasks are very far out, if at all.

Edit: The trickles on tasks on the testing site are now showing correctly, so there is progress. I expect Andy will let us know, or post himself, when the upgrade of the server software here is going to happen. Nothing till after it has happened on testing; but there were no tasks waiting to go out at the time, and it wouldn't have been more than a couple of hours at most.
Glenn Carver

Joined: 29 Oct 17
Posts: 810
Credit: 13,614,292
RAC: 5,636
Message 69766 - Posted: 11 Oct 2023, 13:51:31 UTC

OpenIFS Perturbed Surface batches

Volunteers might be interested in this article that appeared in the ECMWF Newsletter, based on the OpenIFS Perturbed Surface batches earlier this year.

https://www.ecmwf.int/en/newsletter/175/news/openifshome-using-land-surface-uncertainties-and-large-ensembles-seasonal

This appears in the list of CPDN publications but the batch information is missing due to a minor technical hitch which will be fixed. The batches in question were: 944,945,946,947,990.
---
CPDN Visiting Scientist

©2024 climateprediction.net