Message boards : Number crunching : Tasks failing on Ubuntu 22
Joined: 4 Dec 15 Posts: 52 Credit: 2,562,405 RAC: 1,841
> Try
Same result.
- - - - - - - - - -
Greetings, Jens
Joined: 29 Nov 17 Posts: 83 Credit: 17,184,625 RAC: 13,161
Do you have 1 or more tasks running at the moment?
Joined: 4 Dec 15 Posts: 52 Credit: 2,562,405 RAC: 1,841
Ok, just ran ls -l through all the slots until I found CPDN. There's one srf* file in the slot:
-rw-r--r-- 1 boinc boinc 804992476 Jan 6 23:51 srf00370000.0001
- - - - - - - - - -
Greetings, Jens
Joined: 29 Nov 17 Posts: 83 Credit: 17,184,625 RAC: 13,161
> Ok, just ran ls -l through all the slots until I found CPDN.
That's good. You only want 1 per slot, the latest one; older ones take up space. Was it in a slot with a number >9?
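For anyone else hunting for these, something like the one-liner below lists any restart files across all slots. It's a sketch that assumes the default data directory of the Ubuntu BOINC package; adjust the path if your install differs, and you may need sudo since the files belong to the boinc user:

```bash
# List any OpenIFS restart (srf*) files in every BOINC slot directory.
# /var/lib/boinc-client is the default data dir for the Ubuntu package.
sudo ls -l /var/lib/boinc-client/slots/*/srf* 2>/dev/null
```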
Joined: 4 Dec 15 Posts: 52 Credit: 2,562,405 RAC: 1,841
Just one.
- - - - - - - - - -
Greetings, Jens
Joined: 4 Dec 15 Posts: 52 Credit: 2,562,405 RAC: 1,841
> Was it in a slot with a number >9?
Slot 14, so yes.
- - - - - - - - - -
Greetings, Jens
Joined: 29 Nov 17 Posts: 83 Credit: 17,184,625 RAC: 13,161
> Slot 14, so yes.
Okay :) Glenn has his answer.
Joined: 4 Dec 15 Posts: 52 Credit: 2,562,405 RAC: 1,841
> Slot 14, so yes.
> Okay :) Glenn has his answer.
I see. Thx.
- - - - - - - - - -
Greetings, Jens
Joined: 18 Jul 13 Posts: 438 Credit: 25,919,008 RAC: 6,904
Hi there,
I also have one WU that crashed with:
<message>Disk usage limit exceeded</message>
<stderr_txt>
I was not working on the computer at the time. I allow only two OIFS tasks to run (4 HT), 16 GB RAM; swap was around 2.6 GB used of the 8 GB allocated. CPU is allowed to work at 100%, and 'leave task in memory' is checked. For some reason I could not find the srf files via the terminal, so I looked with the Files browser instead. In slot 1 I found 4 srfs; 3 were 805 MB. I also have slots 0 and 2 with no srf files in them. So I wonder what could've gone wrong with this one?
Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160
Hi Bernard, I can explain what's going on and a workaround.

Just to recap: the srf files are the model restart files. OpenIFS has to dump its internal state in full in order to be able to restart in a bit-reproducible way if the task is stopped. That's why these files are quite big. If the task (i.e. the model) runs uninterrupted, there will only ever be one srf file. However, there may be several if the model did a restart, because, as a backup, it always keeps the restart file it has just used. New srf files are continually created and deleted as the model runs, until such time as it needs to restart again.

A model restart happens either because: (a) the client was shut down (e.g. power-off), or (b) the model was kicked out of memory whilst suspended. The latter is the reason for asking people to enable the 'leave non-GPU tasks in memory' option, as that reduces the risk of the model getting kicked out. That should mean restarts happen less often, but it's not guaranteed.

One of the changes we made before these latest batches was to fix an error in the task disk usage limit, which was wrong by a factor of 10. For the very first batch, we saw lots of 'Disk full' errors because the restart files were building up to such an extent that they were filling the user's disk (obviously not desirable!). Now we're getting 'task disk limit exceeded' for the same reason. There is headroom in the task disk usage limit for a couple of restart files but, as we can see, not enough.

- I will change the model so it no longer keeps its old, used restart srf files. That will mean only 1 srf file should normally exist in a slot. I can't make the change for the current batches, but it will be in place for future ones. That will prevent the srf files accumulating, at the risk of more frequent restart fails (the model can fall back to a known good restart if the one it tries is corrupt).
- You can safely delete the older srf files. By older I mean older by file date (and lowest number in the filename). You could write a script to run periodically (obviously test it first); some hints here: https://stackoverflow.com/questions/25785/delete-all-but-the-most-recent-x-files-in-bash (see the sketch below).

I don't know if you are shutting down your machine and that's why the model is restarting, but if possible put the machine in 'suspend/sleep' (which is what I do at night) rather than shutting it down. You say you have 'leave non-GPU tasks in memory' on, which will help, but it does not stop OpenIFS being kicked out if other processes demand more, as all BOINC tasks run at the lowest system priority.

If you have OpenIFS running in a slot directory and there's no srf file in there, that's fine; it just means the model has not been running long and has not reached the first restart write time yet (it only writes restart files every model day). I guess you couldn't see the files in the terminal because of permission issues?

Apologies this is causing fails. Just to quickly summarise: this has nothing to do with the size of the disk, the amount of memory used, or the swap size. It's solely because the task has a limit on the amount of disk it can use, and if the model has to restart because the client shut it down, those restart files build up and break the limit. Hope that answers all your questions.
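A minimal sketch of such a cleanup script, along the lines of the Stack Overflow link above. It assumes the default /var/lib/boinc-client data directory and only echoes by default; test it before letting it delete anything:

```bash
#!/bin/bash
# For each BOINC slot directory, keep only the newest srf* restart file
# and report the older ones that could be deleted.
# Path assumes the default Ubuntu BOINC package install.
for slot in /var/lib/boinc-client/slots/*/; do
    # Sort srf files newest-first, skip the newest, handle the rest.
    ls -t "$slot"srf* 2>/dev/null | tail -n +2 | while read -r f; do
        echo "would delete: $f"    # change to: rm -f "$f"  once you trust it
    done
done
```

Run it as root (or as the boinc user) so it can see inside the slot directories, for example from a cron entry once you're happy with what it reports.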
Joined: 12 Apr 21 Posts: 319 Credit: 15,031,602 RAC: 4,207
bernard_ivo,

May I suggest you reduce the number of tasks you run concurrently to 1. I have a very similar older PC to one of yours, an i7-4790 with 16 GB RAM. Initially, running 2 at a time seemed reasonable, but I was getting errors despite also not using the PC, having plenty of disk space, having the 'leave non-GPU tasks in memory' setting on, running the PC 24/7, and not allowing task swapping. Once I reduced concurrent OIFS tasks to 1 a few weeks ago, I've had no errors and all tasks complete successfully.

Another suggestion (secondary to the one above) is to make sure you don't task swap, as it has no benefits and can clog up your RAM. What I mean is letting BOINC work on a group of tasks for a short while, then switch to another group, and then perhaps yet another before coming back to the original one. To prevent that, set the "Switch between tasks every __ minutes" setting to a very high number like 10080 minutes (1 week). Task swapping is inefficient and can clog up a lot of memory, especially if "Leave non-GPU tasks in memory while suspended" is on. Regardless of how long they take, let one group of tasks finish before starting another.

This is something Glenn has mentioned before, but it hasn't been getting much attention on the forums: run fewer concurrent tasks than you think you should be able to. OIFS tasks are prone to crashes when overall system RAM is pushed too hard. I'd suggest leaving about 10 GB for overhead when running OIFS tasks.
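If you'd rather enforce a hard cap than juggle preferences, BOINC's app_config.xml can do it per project. A sketch, assuming the standard project directory name for CPDN (verify it on your own client); drop the file into the client's projects/climateprediction.net directory and use "Options > Read config files" to apply it:

```xml
<!-- app_config.xml: cap how many CPDN tasks run at once.
     project_max_concurrent applies to every app in this project. -->
<app_config>
    <project_max_concurrent>1</project_max_concurrent>
</app_config>
```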
Joined: 18 Jul 13 Posts: 438 Credit: 25,919,008 RAC: 6,904
Thanks Glenn and Andrey,

I usually allow my machines to work 24/7, and I realise this laptop was suspended once or twice on Jan 7th as I was moving around, hence the accumulation of srf files.

As for the switch between tasks, I currently have it at 60 minutes, but I've never seen it behave like that, though sometimes I have to pause some tasks to make BOINC continue with the ones I want. Moreover, CPDN is my highest priority at 75%, with WCG at 12.5% and WUProp at 12.5%, the latter being non-CPU-intensive. However, I will increase the time as suggested to reduce the switches.

As for the number of concurrent OIFS tasks, both machines work fine(ish) with 2 tasks at the same time (if I use the machines, swap kicks in), with 3 WUs on the i7-4790 at the extreme. With 4, the system crashed and I could not get it started for several days to switch off BOINC. I can no longer crunch as I've hit the upload limit, but I may reduce to 1 WU as suggested. Let's first clear the upload queue.

So yeah, these OIFS tasks are pushing the limits of my current machines, and more demanding ones are to come :) Looking forward to the upgraded batches, which should reduce some failures.
Joined: 4 Dec 15 Posts: 52 Credit: 2,562,405 RAC: 1,841
Is it fine to assume the tasks are OK when uploaded and not flagged erroneous? If so, leaving them in memory should have been the only thing needed to get the work done rather than trashing it.
- - - - - - - - - -
Greetings, Jens
Joined: 29 Oct 17 Posts: 1072 Credit: 17,020,946 RAC: 5,160
> Is it fine to assume the tasks are OK when uploaded and not flagged erroneous?
If the 'Outcome' on the task result webpage is shown as 'Success', then yes.
Joined: 4 Dec 15 Posts: 52 Credit: 2,562,405 RAC: 1,841
> Is it fine to assume the tasks are OK when uploaded and not flagged erroneous?
> If the 'Outcome' on the task result webpage is shown as 'Success', then yes.
Ah, there. I see. Thanks!
- - - - - - - - - -
Greetings, Jens