Tasks failing on Ubuntu 22

Message boards : Number crunching : Tasks failing on Ubuntu 22

gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67402 - Posted: 6 Jan 2023, 22:54:57 UTC - in response to Message 67401.  

Try
sudo ls -l ?/ | grep srf

Same result.
- - - - - - - - - -
Greetings, Jens
ID: 67402
Profile PDW

Joined: 29 Nov 17
Posts: 55
Credit: 6,504,255
RAC: 1,261
Message 67403 - Posted: 6 Jan 2023, 22:59:11 UTC - in response to Message 67402.  

Do you have one or more tasks running at the moment?
ID: 67403
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67404 - Posted: 6 Jan 2023, 22:59:49 UTC
Last modified: 6 Jan 2023, 23:02:53 UTC

Ok, I just ran
ls -l
through all the slots until I found CPDN.
There's one srf* file in the slot:
-rw-r--r-- 1 boinc boinc 804992476 Jan 6 23:51 srf00370000.0001
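For reference, a single command can list srf files across all slots at once. This is a sketch only: /var/lib/boinc-client is an assumption (the default data directory for Ubuntu's boinc-client package); adjust the path for your installation.

```shell
# List all srf restart files in every slot directory, including
# slots 10 and above. The data directory path is an assumption.
sudo find /var/lib/boinc-client/slots -maxdepth 2 -name 'srf*' -exec ls -l {} +
```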
- - - - - - - - - -
Greetings, Jens
ID: 67404
Profile PDW

Joined: 29 Nov 17
Posts: 55
Credit: 6,504,255
RAC: 1,261
Message 67405 - Posted: 6 Jan 2023, 23:03:16 UTC - in response to Message 67404.  

Ok, just ls through all slots until I found CPDN.
There's one srf* file in the slot.

That's good; you only want one per slot (the latest one), as the older ones just take up space.

Was it in a slot with a number >9 ?
ID: 67405
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67406 - Posted: 6 Jan 2023, 23:03:41 UTC - in response to Message 67403.  

Just one.
- - - - - - - - - -
Greetings, Jens
ID: 67406
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67407 - Posted: 6 Jan 2023, 23:04:26 UTC - in response to Message 67405.  

Was it in a slot with a number >9 ?

Slot 14, so yes.
- - - - - - - - - -
Greetings, Jens
ID: 67407
Profile PDW

Joined: 29 Nov 17
Posts: 55
Credit: 6,504,255
RAC: 1,261
Message 67408 - Posted: 6 Jan 2023, 23:07:07 UTC - in response to Message 67407.  

Slot 14, so yes.
Okay :)
Glenn has his answer.
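For anyone following along: the earlier suggestion `sudo ls -l ?/ | grep srf` uses the single-character glob `?`, which only matches slot directories named 0 to 9, so a file sitting in slot 14 is never listed. A minimal sketch of the difference (directory names made up for illustration):

```shell
# '?' matches exactly one character in a name; '*' matches any number.
mkdir -p slots/3 slots/14
touch slots/14/srf00370000.0001
(cd slots && ls -d ?/)   # matches only 3/
(cd slots && ls -d */)   # matches 3/ and 14/
```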
ID: 67408
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67409 - Posted: 6 Jan 2023, 23:11:20 UTC - in response to Message 67408.  

Slot 14, so yes.
Okay :)
Glenn has his answer.

I see.
Thx.
- - - - - - - - - -
Greetings, Jens
ID: 67409
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 24,526,721
RAC: 1,957
Message 67420 - Posted: 8 Jan 2023, 10:01:53 UTC

Hi there,
I also have one WU that crashed with
<message>
Disk usage limit exceeded</message>
<stderr_txt>


I was not working on the computer. I allow only two OIFS tasks to run (4 HT), the machine has 16 GB RAM, and swap was around 2.6 GB of the 8 GB allocated. The CPU is allowed to work at 100%, and 'leave task in memory' is checked.

For some reason I could not find the srf files via the terminal, so I looked with the Files browser instead.
In slot 1 I found four srf files; three were 805 MB.
Slots 0 and 2 have no srf files.

So I wonder what could have gone wrong with this one?
ID: 67420
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 67443 - Posted: 9 Jan 2023, 1:40:57 UTC - in response to Message 67420.  
Last modified: 9 Jan 2023, 1:42:57 UTC

Hi Bernard,

I can explain what's going on, and a workaround. Just to recap: the srf files are the model restart files. OpenIFS has to dump its internal state in full in order to be able to restart in a bit-reproducible way if the task is stopped, which is why these files are quite big. If the task (i.e. the model) runs uninterrupted, there will only ever be one srf file. However, there may be several if the model did a restart, because, as a backup, it always keeps the restart file it has just used. New srf files are continually created and deleted as the model runs, until such time as it needs to restart again.

A model restart happens either because (a) the client was shut down (e.g. power-off), or (b) the model was kicked out of memory whilst suspended. The latter is the reason for asking people to enable the 'leave non-GPU tasks in memory' option, as that reduces the risk of the model getting kicked out. That should mean restarts happen less often, but it's not guaranteed.

One of the changes we made before these latest batches was to fix an error in the task disk usage limit, which was wrong by a factor of 10. For the very first batch, we saw lots of 'Disk full' errors because the restart files were building up to such an extent that they were filling users' disks (obviously not desirable!). Now we're getting 'task disk limit exceeded' errors for the same reason. There is headroom in the task disk usage limit for a couple of restart files, but, as we can see, not enough.

- I will change the model so it no longer keeps its old, used restart (srf) files. That will mean only one srf file should normally exist in a slot. I can't make the change for the current batches, but it will be in future ones. That will prevent the srf files accumulating, at the risk of more frequent restart failures (at present the model can fall back to a known good restart if the one it tries is corrupt).
- You can safely delete the older srf files. By older I mean older by file date (and with the lower number in the filename). You could write a script to run periodically (obviously test it first); some hints here: https://stackoverflow.com/questions/25785/delete-all-but-the-most-recent-x-files-in-bash

I don't know if you are shutting down your machine and that's why the model is restarting, but if possible put the machine into 'suspend/sleep' (which is what I do at night) rather than shutting it down. You say you have 'leave non-GPU tasks in memory' on, which will help, but it does not stop OpenIFS being kicked out if other processes demand more memory, as all BOINC tasks run at the lowest system priority.

If you have OpenIFS running in a slot directory and there's no srf file in there, that's fine; it just means the model has not been running long and has not yet reached the first restart write time (it only writes restart files every model day).

I guess you couldn't see the files in the terminal because of permission issues?

Apologies that this is causing failures. To quickly summarise: this has nothing to do with the size of the disk, the amount of memory used, or the swap size. It's solely because the task has a limit on the amount of disk it can use, and if the model has to restart because the client shut it down, those restart files build up and break the limit.

Hope that answers all your questions.

ID: 67443
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 11,982,864
RAC: 23,561
Message 67445 - Posted: 9 Jan 2023, 2:41:58 UTC - in response to Message 67420.  

bernard_ivo,
May I suggest you reduce the number of tasks you run concurrently to one. I have a very similar older PC to one of yours, an i7-4790 with 16 GB RAM. Initially, running two at a time seemed reasonable, but I was getting errors despite also not using the PC, having plenty of disk space, having the 'leave non-GPU tasks in memory' setting on, running the PC 24/7, and not allowing task swapping. Since I reduced concurrent OIFS tasks to one a few weeks ago, I've had no errors and all tasks complete successfully.

Another suggestion (secondary to the one above) is to make sure you don't task swap, as it has no benefit and can clog up your RAM. By task swapping I mean letting BOINC work on a group of tasks for a short while, then switch to another group, and perhaps yet another, before coming back to the original one. To prevent that, set the "Switch between tasks every __ minutes" setting to a very high number, such as 10080 minutes (one week). Task swapping is inefficient and can tie up a lot of memory, especially if "Leave non-GPU tasks in memory while suspended" is on. Regardless of how long they take, let one group of tasks finish before starting another.

This is something Glenn has mentioned before, but it's not been getting much attention on the forums: run fewer concurrent tasks than you think you should be able to. OIFS tasks are prone to crashing when overall system RAM is pushed too hard. I'd suggest leaving about 10 GB of RAM for overhead when running OIFS tasks.
ID: 67445
bernard_ivo

Joined: 18 Jul 13
Posts: 438
Credit: 24,526,721
RAC: 1,957
Message 67450 - Posted: 9 Jan 2023, 9:39:13 UTC

Thanks Glenn and Andrey,

I usually allow my machines to work 24/7, but I realise this laptop was suspended once or twice on Jan 7th as I was moving around, hence the accumulation of srf files. As for the switch between tasks, I currently have it set to 60 minutes, but I've never seen it behave like that, though sometimes I have to pause some tasks to make BOINC continue with the ones I want. Moreover, CPDN has my highest resource share at 75%, with WCG at 12.5% and WUProp at 12.5% (the latter being non-CPU-intensive). However, I will increase the time as suggested to reduce the switches.

As for the number of concurrent OIFS tasks, both machines work fine(ish) with two at the same time (if I use the machines, then swap kicks in), with three WUs on the i7-4790 at the extreme. With four, the system crashed and I could not start it for several days to switch off BOINC. I can no longer crunch as I've hit the upload limit, but I may reduce to one WU as suggested. Let's first clear the upload queue.

So yeah, these OIFS tasks are pushing the limits of my current machines, and more demanding ones are to come. :) I'm looking forward to the upgraded batches, which should reduce some failures.
ID: 67450
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67566 - Posted: 11 Jan 2023, 20:05:25 UTC

Is it safe to assume the tasks are ok when they upload and aren't flagged as erroneous?
If so, leaving them in memory should have been the only thing needed to get the work done rather than trashed.
- - - - - - - - - -
Greetings, Jens
ID: 67566
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 67573 - Posted: 11 Jan 2023, 21:00:33 UTC - in response to Message 67566.  

Is it safe to assume the tasks are ok when they upload and aren't flagged as erroneous?
If so, leaving them in memory should have been the only thing needed to get the work done rather than trashed.
If the 'Outcome' on the task result webpage is shown as 'Success', then yes.
ID: 67573
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67578 - Posted: 11 Jan 2023, 21:19:39 UTC - in response to Message 67573.  

Is it safe to assume the tasks are ok when they upload and aren't flagged as erroneous?
If so, leaving them in memory should have been the only thing needed to get the work done rather than trashed.

If the 'Outcome' on the task result webpage is shown as 'Success', then yes.

Ah, there.
I see.
Thanks!
- - - - - - - - - -
Greetings, Jens
ID: 67578


©2024 climateprediction.net