OpenIFS Discussion

Message boards : Number crunching : OpenIFS Discussion

Previous · 1 . . . 18 · 19 · 20 · 21 · 22 · 23 · 24 . . . 31 · Next

Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,303,210
RAC: 11,178
Message 68141 - Posted: 31 Jan 2023, 9:37:58 UTC - in response to Message 68140.  

Any idea as to how close things are ...
Look at the server status page

We're down to 4182, but unfortunately, with the administrative trickle display not functioning, we can't see what timestep any of them have reached. They might be plodding along slowly, they might have finished and just be waiting for the upload server to come back, or they might have been abandoned at the starting gate. The project could see that data, but I suspect they're as much in the dark as we are.
ID: 68141
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 68145 - Posted: 31 Jan 2023, 9:51:33 UTC - in response to Message 68140.  

Forthcoming batches

Just out of a meeting this morning 30/1/23. There will be some 6500 workunits coming for the OpenIFS Baroclinic Lifecycle app (oifs_43r3_bl) for an experiment run by the University of Helsinki, hopefully in 2 weeks' time. ....

Aren't there still at least 12000 new tasks to be processed from the current run by the end of February? I believe that was the number when sending out of new work was turned off a week or so ago. Any idea as to how close things are for it to be turned back on?

Pretty certain it did get turned back on. By now some 45K tasks have gone out, unless I have miscalculated, and fewer than a tenth of them are still to come back in. I have five completed tasks waiting to upload, or to finish uploading, once the server comes back online.
ID: 68145
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,035,048
RAC: 23,108
Message 68146 - Posted: 31 Jan 2023, 10:01:49 UTC - in response to Message 68145.  

Pretty certain it did get turned back on. By now some 45K tasks have gone out, unless I have miscalculated, and fewer than a tenth of them are still to come back in. I have five completed tasks waiting to upload, or to finish uploading, once the server comes back online.

Wow, I must have missed it. I thought there were several thousand tasks Unsent when things got turned off, and I don't remember seeing them reappear. 45K unique tasks, not including reruns?
ID: 68146
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 68147 - Posted: 31 Jan 2023, 10:40:26 UTC - in response to Message 68146.  

Pretty certain it did get turned back on. By now some 45K tasks have gone out, unless I have miscalculated, and fewer than a tenth of them are still to come back in. I have five completed tasks waiting to upload, or to finish uploading, once the server comes back online.

Wow, I must have missed it. I thought there were several thousand tasks Unsent when things got turned off, and I don't remember seeing them reappear. 45K unique tasks, not including reruns?
Yes, they did get turned back on. Andy restarted the batches last week and they got sucked up pretty quickly. It's just resends going out now, but I believe the scientist needs to rerun some non-returns for 2021.

I was just looking at the batch stats page. All the 2021 batches (3,125 in total) have returned 90% or better so far; I estimate <5% were lost to inappropriate perturbations causing model crashes. All the other hindcast years (1,000 WUs each, 2020 back to 1981) have returned better than 80%, with ~10% still in progress. So approx. 35,000 successfully completed model runs (it would be great to see a map of where all the machines that ran those were; I'll see if I can put one together).

The total number of runs is higher, but that's harder to work out as the failure rate varied a fair bit. The earlier batches did better than the later ones because of the server upload problems. Hand-waving, let's say 60% of tasks were sent out again for one reason or another.
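Putting rough numbers on that hand-waving (the 60% figure is only a guess, so this is an order-of-magnitude estimate at best):

```shell
#!/bin/sh
# Rough estimate: ~35,000 successful runs, and assume ~60% of tasks were
# re-sent at some point, so total tasks sent is roughly successful * 1.6.
successful=35000
total=$((successful * 160 / 100))
echo "approx. total tasks sent: $total"
```

That would put the total somewhere in the region of 56,000 tasks sent, as an order of magnitude only.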
ID: 68147
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 68149 - Posted: 31 Jan 2023, 12:07:56 UTC

p.s. CPDN are double-checking that all the batches were re-enabled.
ID: 68149
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68163 - Posted: 31 Jan 2023, 20:24:25 UTC - in response to Message 68135.  
Last modified: 31 Jan 2023, 20:25:27 UTC

Glenn Carver wrote:
It's not possible to 'filter-out' the triple-errors (if I understand what you mean).
I mean filtering after the failures occurred, not before. Such as grep'ing through the stderr.txt of the three results of a failed workunit. It might be possible to pick up on a few keywords which indicate either reproducible model errors or non-reproducible 'operational' failures (such as suspend/resume-related issues, if they are still relevant after the upcoming application updates, OOM, out of disk space, an empty stderr.txt, etc.).
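A minimal sketch of what I mean, assuming keywords like 'Out of memory' or 'No space left on device' actually appear in the stderr.txt of failed tasks (the keyword list is a guess, not taken from real CPDN logs):

```shell
#!/bin/sh
# Hypothetical classifier: decide whether a failed task's stderr.txt looks
# like a non-reproducible 'operational' failure or a candidate model error.
# The keyword list below is an assumption for illustration only.
classify_stderr() {
    f=$1
    if [ ! -s "$f" ]; then
        # an empty stderr.txt is itself a known 'operational' symptom
        echo "operational (empty stderr)"
    elif grep -qiE 'out of memory|no space left on device|signal' "$f"; then
        echo "operational"
    else
        echo "possible model error"
    fi
}
```

Run it over the three results of a failed workunit; only if all three come back as "possible model error" would the workunit be worth a closer look.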
ID: 68163
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 68164 - Posted: 31 Jan 2023, 20:54:16 UTC - in response to Message 68163.  

Glenn Carver wrote:
It's not possible to 'filter-out' the triple-errors (if I understand what you mean).
I mean filtering after the failures occurred, not before. Such as grep'ing through the stderr.txt of the three results of a failed workunit. It might be possible to pick up on a few keywords which indicate either reproducible model errors or non-reproducible 'operational' failures (such as suspend/resume-related issues, if they are still relevant after the upcoming application updates, OOM, out of disk space, an empty stderr.txt, etc.).
Ok. Yes, we do parse the task fails; there is a Python tool that scans the return database for known issues and produces a nice batch analysis (if it weren't so difficult to attach an image to a forum post I'd put a copy here). But as I said in the earlier post, there is no guarantee the repeat task will suffer the same fate on a different machine, even if it looks like it might be reproducible after one completed task in a workunit. Only after 3 task fails in a workunit do we conclude it's an inappropriate perturbation issue - by identical fails I mean the model fails at the same (or very nearly the same) timestep in each case.
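As a hypothetical sketch of that triple-fail rule (the ±1 timestep tolerance here is illustrative, not our actual threshold):

```shell
#!/bin/sh
# Treat a workunit as an 'inappropriate perturbation' candidate only if all
# three failed tasks died at (nearly) the same model timestep.
# The +/-1 step tolerance is an assumption for illustration.
same_timestep() {
    t1=$1; t2=$2; t3=$3
    d12=$((t1 - t2)); d13=$((t1 - t3))
    # ${var#-} strips a leading minus sign, giving the absolute value
    [ "${d12#-}" -le 1 ] && [ "${d13#-}" -le 1 ]
}

same_timestep 118 118 119 && echo "perturbation issue" || echo "unrelated fails"
```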
ID: 68164
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,035,048
RAC: 23,108
Message 68183 - Posted: 2 Feb 2023, 10:18:24 UTC

Just discovered by chance that I still have 3 OIFS processes running for a task that errored out on 1/17: https://www.cpdn.org/result.php?resultid=22287313. That's probably the reason for that task's failure, but I also wonder if it's the reason I had an increase in failure rate recently on this PC. Shutting down the BOINC client still didn't end them. I couldn't kill them via htop, although I'm not sure I was doing it right. Shutting down WSL2, which I was already planning to do for other reasons, got rid of them.
ID: 68183
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 68189 - Posted: 2 Feb 2023, 14:07:35 UTC - in response to Message 68183.  
Last modified: 2 Feb 2023, 14:25:50 UTC

Just discovered by chance that I still have 3 OIFS processes running for a task that errored out on 1/17: https://www.cpdn.org/result.php?resultid=22287313. That's probably the reason for that task's failure, but I also wonder if it's the reason I had an increase in failure rate recently on this PC. Shutting down the BOINC client still didn't end them. I couldn't kill them via htop, although I'm not sure I was doing it right. Shutting down WSL2, which I was already planning to do for other reasons, got rid of them.
Yes, I know about these. You don't need to restart WSL2; just 'sudo kill -9 <process id>' will do it (but check the process id *very* carefully!). If that *doesn't* do it, that's also interesting, because usually an unkillable stuck process is waiting for a device.
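To show the escalation on something harmless, here's a sketch using a throwaway 'sleep' process as a stand-in for a stuck model process (in the real case, find the PID with htop or pgrep first and check it carefully):

```shell
#!/bin/sh
# Demo of TERM-then-KILL escalation on a disposable 'sleep' process.
sleep 300 &
pid=$!

kill -15 "$pid" 2>/dev/null       # polite SIGTERM first; a hung process can ignore it
sleep 1
if kill -0 "$pid" 2>/dev/null; then   # anything still there (even a zombie)?
    kill -9 "$pid" 2>/dev/null        # SIGKILL cannot be caught or ignored
fi
wait "$pid" 2>/dev/null || true       # reap the child
kill -0 "$pid" 2>/dev/null && echo "still running" || echo "gone"
```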

I'm not sure if that's why you might be getting more errors. It shouldn't be, because each of those processes only manages its own files. However, I can't be sure. Sometimes a reboot is a good thing; it completely clears the memory.

I have a possible explanation for this which has been fixed in the latest code, though I can't be certain I've eliminated it until we do a bigger test.

p.s. The process needs to be running, not 'suspended', for the kill to work.
ID: 68189
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,035,048
RAC: 23,108
Message 68192 - Posted: 3 Feb 2023, 7:35:34 UTC - in response to Message 68189.  

Yes, I know about these. You don't need to restart WSL2; just 'sudo kill -9 <process id>' will do it (but check the process id *very* carefully!). If that *doesn't* do it, that's also interesting, because usually an unkillable stuck process is waiting for a device.

I don't think I've ever had to kill a process, so I don't know the different ways to do it. I saw that the htop utility has a Kill option, so I tried it by selecting both the 9 SIGKILL and 15 SIGTERM signals, and neither worked. Not sure how different that is from the command line you mention. I'm also not sure how to tell if a process is running or suspended. These weren't showing up in BOINC, so they weren't suspended in that way; I noticed them by looking in htop for another reason.
ID: 68192
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 68193 - Posted: 3 Feb 2023, 10:15:36 UTC - in response to Message 68192.  

Yes, I know about these. You don't need to restart WSL2; just 'sudo kill -9 <process id>' will do it (but check the process id *very* carefully!). If that *doesn't* do it, that's also interesting, because usually an unkillable stuck process is waiting for a device.

I don't think I've ever had to kill a process, so I don't know the different ways to do it. I saw that the htop utility has a Kill option, so I tried it by selecting both the 9 SIGKILL and 15 SIGTERM signals, and neither worked. Not sure how different that is from the command line you mention. I'm also not sure how to tell if a process is running or suspended. These weren't showing up in BOINC, so they weren't suspended in that way; I noticed them by looking in htop for another reason.
Ok. htop's 'kill' does the same thing as 'kill' in the terminal; there is no difference. A 'kill -9' is 'kill with extreme prejudice': it's a signal the process cannot ignore.

Richard and I exchanged some messages on this (he's seen it all before!). Looking at the BOINC client code, there is a note in there that LHC have also seen this issue. So (a) the issue has obviously been around a long time, and (b) LHC also run large-memory jobs, so that might provide a clue. The only thing I can see from a quick look at the code is that the client writes a 'boinc_finish' file (I have no idea why it feels the need to do that), and there's a timer involved. I wonder if a previous memory corruption has disrupted either of those two.

The good news is that of all the task failures from these batches, this 'process still running after 5 mins' problem accounts for only 2.5%. That's also the bad news, as it'll be hard to track down: it's not reproducible and doesn't happen very often.

To make any progress a traceback would be ideal. I had hoped the kill would generate one in the stderr, but that didn't work. One way would be to attach the 'gdb' debugger to the process and generate a call tree. If anyone knows their way around gdb, let me know and I'll send details on how to proceed, though we'll need some new batches first.

Thanks for highlighting this.
ID: 68193
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 68233 - Posted: 9 Feb 2023, 14:57:24 UTC

Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app.
ID: 68233
Yeti

Joined: 5 Aug 04
Posts: 171
Credit: 10,300,484
RAC: 27,066
Message 68234 - Posted: 9 Feb 2023, 15:07:21 UTC - in response to Message 68233.  

Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app.


Can you please fill in the missing details:

OpenIFS_PS: 4.5 GB RAM, 7.5 GB hard disk
OpenIFS_BL: ??? GB RAM, ??? GB hard disk


Supporting BOINC, a great concept!
ID: 68234
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 68235 - Posted: 9 Feb 2023, 15:21:22 UTC - in response to Message 68233.  
Last modified: 9 Feb 2023, 15:44:46 UTC

Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app.
Interestingly, I have 5 tasks from batch 990. I see the closed batches have gone from the batch statistics page. I also have one from 952 running; as that batch is now closed, should I abort it? (If an answer doesn't come within the next couple of hours, the question will be academic.)

Edit: All perturbed surface.

Edit2: I see I have a message from the Moderators email list. 990 is the batch of reruns.
ID: 68235
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,035,048
RAC: 23,108
Message 68241 - Posted: 10 Feb 2023, 8:33:03 UTC - in response to Message 68234.  

Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app.


Can you please fill in the missing details:

OpenIFS_PS: 4.5 GB RAM, 7.5 GB hard disk
OpenIFS_BL: ??? GB RAM, ??? GB hard disk

From an earlier post by Glenn: "These runs will be shorter, runtimes ~half of the PS OpenIFS app (YMMV). [...] Expect less total I/O and smaller upload sizes as the runs are shorter. Memory requirement will be the same as model resolution is unchanged."
ID: 68241
Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 68243 - Posted: 10 Feb 2023, 12:50:35 UTC

Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app.

35 of the 143 missing-forecast ones have now succeeded. I have three more, which should all complete later today.
ID: 68243
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1061
Credit: 16,544,964
RAC: 2,285
Message 68245 - Posted: 10 Feb 2023, 13:37:51 UTC - in response to Message 68241.  

From an earlier post by Glenn: "These runs will be shorter, runtimes ~half of the PS OpenIFS app (YMMV). [...] Expect less total I/O and smaller upload sizes as the runs are shorter.


I do not notice this. The first (most recent) of these two tasks is a 990. The second, slightly older, is a 988.

Task 22306912 (WU 12206763) · sent 9 Feb 2023, 15:24:41 UTC · returned 10 Feb 2023, 6:24:03 UTC · Completed · run time 53,921.10 s · CPU time 53,055.71 s · credit 0.00 · OpenIFS 43r3 Perturbed Surface v1.09 x86_64-pc-linux-gnu
Task 22306615 (WU 12204746) · sent 7 Feb 2023, 22:23:59 UTC · returned 8 Feb 2023, 14:24:05 UTC · Completed · run time 55,839.56 s · CPU time 55,008.81 s · credit 0.00 · OpenIFS 43r3 Perturbed Surface v1.05 x86_64-pc-linux-gnu

OpenIFS 43r3 Perturbed Surface 1.05 x86_64-pc-linux-gnu
Number of tasks completed 	223
Max tasks per day 	227
Number of tasks today 	0
Consecutive valid tasks 	223
Average processing rate 	28.23 GFLOPS
Average turnaround time 	3.32 days
OpenIFS 43r3 Perturbed Surface 1.09 x86_64-pc-linux-gnu
Number of tasks completed 	1
Max tasks per day 	5
Number of tasks today 	0
Consecutive valid tasks 	1
Average processing rate 	29.20 GFLOPS
Average turnaround time 	0.62 days


So yes, YMMV: it is shorter, but not by much. My machine predicted that the v1.05 tasks would take a little over 15 hours, and that is what they took. It predicted that the v1.09 task would take a few hours more than two days, but it took about the same as the v1.05 tasks. The poor turnaround time for the v1.05 tasks was due to the upload problem while I was running them.
ID: 68245
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 68246 - Posted: 10 Feb 2023, 13:43:51 UTC

I do not notice this. The first (most recent) of these two tasks is a 990. The second, slightly older, is a 988.
#990 is the batch of 143 reruns from the previous batches.
ID: 68246
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 12,035,048
RAC: 23,108
Message 68249 - Posted: 10 Feb 2023, 22:30:49 UTC - in response to Message 68246.  

I do not notice this. The first (most recent) of these two tasks is a 990. The second, slightly older, is a 988.
#990 is the batch of 143 reruns from the previous batches.

Yes, it's the same PS run, finishing up some missing models, not the announced BL run. The difference is that a new app version, 1.09 vs. 1.05, is being used for them. From such a small sample size I'm not sure one can say that 1.09 is faster than 1.05. Glenn said that the upcoming BL run will have shorter run times, which I think is due to the design of that BL run, not the newer app version. Initially the BOINC run-time estimate is off, likely because BOINC has no data yet for the new app version.
ID: 68249
Glenn Carver

Joined: 29 Oct 17
Posts: 807
Credit: 13,593,584
RAC: 7,495
Message 68250 - Posted: 10 Feb 2023, 23:18:04 UTC - in response to Message 68249.  

The new versions are because I've fixed various issues. Still getting a few memory corruption fails though which I'm still working on.
ID: 68250


©2024 climateprediction.net