Tasks failing on Ubuntu 22

gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67333 - Posted: 4 Jan 2023, 20:21:30 UTC

Hi.
I had two tasks running on my shiny new Ubuntu 22 system.
Knowing about the problems with 64-bits and the need to install additional stuff I first installed Ubuntu 18, put the additional stuff for CPDN (and other projects) on it, then upgraded to Ubuntu 20, then to Ubuntu 22.
The tasks seem to run for their full length, but they return as faulty.
I'd like to know if there's anything in the machine's stderr out that might identify problems.
Is anybody able to elaborate on that?
I run one CPDN task at a time while the machine does other Boinc work on all its threads. I do this on all my machines all the time, and my other Ryzen has returned successful work to the server.
- - - - - - - - - -
Greetings, Jens
ID: 67333
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,185,480
RAC: 6,754
Message 67334 - Posted: 4 Jan 2023, 20:26:19 UTC - in response to Message 67333.  

<message> Disk usage limit exceeded </message>

would be a good place to start.
ID: 67334
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67337 - Posted: 4 Jan 2023, 21:24:55 UTC

Ouch!
Thanks for getting me back to earth.
So I presume you think it's a reasonable thing to start reading western texts from above? ;-)
I looked at the end because the task did so many things, and it was just confusing to me.
And, looking at the beginning and the numbers above the stderr out, I don't really understand how it couldn't have enough disk space.
Boinc was allowed to use 100GB, up to half of the disk (whatever is reached first), and there's 800GB of unused disk space on a 1TB device.
I have now set this to 200GB, but I still fail to understand this.
- - - - - - - - - -
Greetings, Jens
ID: 67337
wujj123456

Joined: 14 Sep 08
Posts: 87
Credit: 32,920,463
RAC: 15,032
Message 67338 - Posted: 4 Jan 2023, 22:09:46 UTC - in response to Message 67337.  

What you are exceeding is the per-task disk usage limit. The result you linked had peak disk usage above the 7GB limit configured for the task. This thread is likely relevant if that computer frequently suspends and resumes tasks for whatever reason.
ID: 67338
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67339 - Posted: 4 Jan 2023, 22:34:49 UTC

I already applied the setting to keep tasks in memory while suspended.
How does the usage limit for tasks work?
Can I adjust it?
- - - - - - - - - -
Greetings, Jens
ID: 67339
Glenn Carver

Joined: 29 Oct 17
Posts: 804
Credit: 13,568,734
RAC: 7,025
Message 67340 - Posted: 4 Jan 2023, 22:40:19 UTC - in response to Message 67333.  

Hi Jens,
Do you have 'leave non-GPU tasks in memory while suspended' unchecked (blank) under Disk & memory in the boincmgr computing preferences? I suspect it isn't enabled. Please enable it.

I can tell from your log that the model is constantly restarting (lots of STEP ... lines, then a whole bunch of startup messages, repeatedly). This happens because the boinc client suspends the task and, because you don't have the option above on, removes it from memory, probably because another OS process wants the memory, which effectively kills the process (boinc runs all tasks at nice level 19, the lowest priority, so getting kicked out of memory is quite likely). The model then has to restart from its checkpoint files. That would be fine, except that every time the model restarts it keeps its old checkpoint files around as a backup. The more the model restarts, the more of these checkpoint files accumulate, until you hit the task's disk limit. I am going to change this model behaviour but I can't do it for these batches. For now, unless it causes you a problem, please enable 'leave non-GPU in memory'.

That will solve it. We've seen this happen a lot, unfortunately. If you can't enable this option for any reason, let me know. Another way to fix it would be to change 'usage limits' to 'Use at most 100% of CPU'. That will keep the model running all the time, but it will of course affect all boinc tasks that are running, so you might not want to solve it that way.

Regarding 32bit libraries, the OpenIFS tasks are all 64bit, and do not need any further OS libraries installed.

Best, Glenn
ID: 67340
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67344 - Posted: 5 Jan 2023, 0:42:06 UTC - in response to Message 67340.  

I already applied the setting to keep tasks in memory while suspended.

[quote]For now, unless it causes you a problem, please enable 'leave non-GPU in memory'.
That will solve it.

Regarding 32bit libraries, the OpenIFS tasks are all 64bit, and do not need any further OS libraries installed.[/quote]
I already had this setting enabled after commenting in the corresponding thread on 2023-01-01, so for the second task it should already have been applied.

The machine didn't even use a quarter of its memory, although it is allowed to use up to three quarters.
It might have taken another 32GB to do whatever if needed.
I may have kicked the second task out of memory once by reducing the number of threads I gave to Boinc, but that's not even near to restarting all the time.
Strange.

Nice thing the 32bit stuff isn't needed for this application; one potential issue less.
I think it was for HadAM/HadSM. My second Ubuntu system didn't even get one task straight before I installed what I found in the forums.

Well, for the moment I don't think I should fiddle with the system any further, but I'll keep an eye on this. I have given more resources to CPDN so that other tasks are less likely to push it out of the way because Boinc thinks they're more important.

Thanks a lot for everyone's suggestions and ideas so far!
- - - - - - - - - -
Greetings, Jens
ID: 67344
wujj123456

Joined: 14 Sep 08
Posts: 87
Credit: 32,920,463
RAC: 15,032
Message 67345 - Posted: 5 Jan 2023, 1:46:05 UTC - in response to Message 67344.  

Read the log in more detail and I think you might be able to figure it out even now. There are multiple lines of this message in your log, but not in any of my WUs that were never paused.
Quit request received from BOINC client, ending the child process

The timestamp right before that line should let you jump to the right point in the boinc logs to understand why boinc decided to pause the task. For example, I would run this for the first pause, around 15:20:
journalctl -u boinc-client --since "2023-01-02 15:20:03"


This would not explain why "leave non-GPU tasks in memory while suspended" is not effective, though. The assumption is that if it's checked, pausing shouldn't cause any problems. Should the task even receive the quit request in the first place?
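
As a rough way to see how often a task was told to quit, the quit messages can be counted in a saved copy of the task's stderr output. This is only a minimal sketch: it assumes the stderr text has been saved to a local file, and the function name and file name are made up for illustration.

```shell
# Count how many times a quit request was logged in a saved stderr file.
# Usage (hypothetical file name): count_quit_requests stderr.txt
count_quit_requests() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # the count is 0, so "|| true" keeps the function usable under "set -e".
    grep -c 'Quit request received from BOINC client' "$1" || true
}
```

A count close to the number of restart blocks in the log would suggest the restarts come from quit requests rather than from something else.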
ID: 67345
Conan

Joined: 6 Jul 06
Posts: 141
Credit: 3,511,752
RAC: 144,072
Message 67347 - Posted: 5 Jan 2023, 5:14:04 UTC

If you changed the option to "leave tasks in memory" but did not have BOINC re-read the config file, the change may not take effect until the file is read.
Restarting BOINC would also read the file.

Conan
ID: 67347
AndreyOR

Joined: 12 Apr 21
Posts: 247
Credit: 11,837,884
RAC: 20,354
Message 67355 - Posted: 5 Jan 2023, 9:33:57 UTC

If the options are changed via the BOINC manager, the config files get re-read automatically once you hit Save. If the changes are made directly in the config files themselves, then yes, one must run a command to re-read the relevant file(s).

gemini8,
For future reference. To run the Hadley 32-bit models (HadAM, etc.), you don't need to start from Ubuntu 18, upgrade to 20, and then 22. I'm not sure it'll even work. Just do a clean install of the version you want and then install the needed 32-bit libraries. The very simple instructions for that for different Linuxes and versions are listed at the top of the Unix/Linux section of the forum. This way, not only will it be less time consuming but you'll have a cleaner system.

One potential reason for your issue could be that there's a lot of task swapping going on, i.e. BOINC works on a group of tasks for a short while and then switches to another group and then perhaps yet another before coming back to the original one. A suggestion would be to set the "Switch between tasks every __ minutes" setting to a very high number like 10080 minutes (1 week). Task swapping is inefficient and can clog up a lot of memory, especially if "Leave non-GPU tasks in memory while suspended" is on. Regardless of how long they take, let one group of tasks finish before starting another.
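
The two settings mentioned in this post can also be set outside the Manager. The following is a minimal sketch, assuming the standard `global_prefs_override.xml` mechanism with its `leave_apps_in_memory` and `cpu_scheduling_period_minutes` elements. The function name is made up, and the target path is passed as an argument because the BOINC data directory differs between installs.

```shell
# Write a preferences override that keeps suspended tasks in memory and
# sets the task switch interval to one week (10080 minutes).
# Note: this overwrites any existing file at the given path.
# Usage: write_prefs_override /path/to/global_prefs_override.xml
write_prefs_override() {
    cat > "$1" <<'EOF'
<global_preferences>
    <leave_apps_in_memory>1</leave_apps_in_memory>
    <cpu_scheduling_period_minutes>10080</cpu_scheduling_period_minutes>
</global_preferences>
EOF
}
# Afterwards, ask the running client to re-read the file:
#   boinccmd --read_global_prefs_override
```

Editing the file by hand is the case where the re-read command is needed; changes saved through the Manager are picked up on their own, as described above.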
ID: 67355
Glenn Carver

Joined: 29 Oct 17
Posts: 804
Credit: 13,568,734
RAC: 7,025
Message 67363 - Posted: 5 Jan 2023, 14:39:08 UTC - in response to Message 67345.  

wujj123456 wrote:
Read the log in more detail and I think you might be able to figure it out even now. There are multiple lines of this message in your log, but not in any of my WUs that were never paused.
Quit request received from BOINC client, ending the child process
It's quite hard for the task to know why the client told it to quit. The most likely cause is lack of 'leave non-gpu in memory' and the client needs the task to suspend (because it's not allowed to use 100% cpu). Even if that option is enabled, I believe the task can still be kicked out of memory if the OS decides it needs it for something else. Were you doing anything at the time that was particularly memory hungry?

All the boinc tasks run with the lowest system priority (19) and will be the first to go if some other process needs the RAM. I might be wrong but I don't think the client can tell the OS to keep the processes in memory at the expense of all else.

wujj123456 wrote:
The timestamp right before that line should let you jump to the right point in the boinc logs to understand why boinc decided to pause the task. For example, I would run this for the first pause, around 15:20:
journalctl -u boinc-client --since "2023-01-02 15:20:03"
This would not explain why "leave non-GPU tasks in memory while suspended" is not effective, though. The assumption is that if it's checked, pausing shouldn't cause any problems. Should the task even receive the quit request in the first place?

I'd suggest double checking the option is still on after closing & opening boincmgr, just in case something weird is going on. Maybe enabling that option doesn't affect currently running tasks and only ones started after the option change?
ID: 67363
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67374 - Posted: 5 Jan 2023, 19:19:07 UTC - in response to Message 67363.  
Last modified: 5 Jan 2023, 19:24:30 UTC

Glenn Carver wrote:
All the boinc tasks run with the lowest system priority (19) and will be the first to go if some other process needs the RAM. I might be wrong but I don't think the client can tell the OS to keep the processes in memory at the expense of all else.
Correct. The boinc client cannot do this. But the application itself could do it (via mlock() or mmap(), which require the caller to hold certain privileges), but this should be reserved to applications with realtime requirements, not to mere bulk processing applications, and it certainly shouldn't be done (and may not succeed) with large memory regions.

Anyway. If the kernel's process scheduler preempts a CPDN task, then that's just like a suspend-to-RAM and possibly page-out-RAM-to-a-swap-device. In contrast, if the boinc client requested a CPDN task to suspend-to-disk, then the task would have to write its checkpoint data.

Glenn Carver wrote:
I'd suggest double checking the option is still on after closing & opening boincmgr, just in case something weird is going on. Maybe enabling that option doesn't affect currently running tasks and only ones started after the option change?
At least all boinc client versions which I have been using so far applied this option change (in whichever direction, on or off) immediately to all currently running tasks.
ID: 67374
Glenn Carver

Joined: 29 Oct 17
Posts: 804
Credit: 13,568,734
RAC: 7,025
Message 67375 - Posted: 5 Jan 2023, 21:03:17 UTC - in response to Message 67374.  

xii5ku wrote:
Anyway. If the kernel's process scheduler preempts a CPDN task, then that's just like a suspend-to-RAM and possibly page-out-RAM-to-a-swap-device. In contrast, if the boinc client requested a CPDN task to suspend-to-disk, then the task would have to write its checkpoint data.
Swapping out won't cause the model to terminate though (unless there's not enough swap). The boinc client has to send either a quit or abort request to the controlling wrapper, which then sends a SIGKILL to the model process. If the client sends a suspend request, the model process gets a SIGSTOP; it does not interpret signals like this so won't write a checkpoint restart. Instead the model has its own internal mechanism for periodically writing checkpoint restart files.
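
The difference between the two signals can be tried out on a stand-in process. This is only an illustrative sketch for Linux (it reads `/proc`), with `sleep` standing in for the model process and both function names made up. It shows that a process sent SIGSTOP simply sits in the stopped state "T" and runs no code of its own, so it has no chance to write anything, and that SIGKILL then ends it outright.

```shell
# Report the kernel's one-letter state for a process (Linux /proc only).
# "T" means stopped by a signal such as SIGSTOP.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Freeze a stand-in process with SIGSTOP, print its state, then end it.
demo_sigstop() {
    sleep 60 >/dev/null &
    pid=$!
    kill -STOP "$pid"
    # Poll briefly until the stop has taken effect.
    i=0
    while [ "$(proc_state "$pid")" != "T" ] && [ "$i" -lt 50 ]; do
        sleep 0.1
        i=$((i + 1))
    done
    proc_state "$pid"
    kill -CONT "$pid"
    kill -KILL "$pid"    # cannot be caught, so no cleanup code runs
    wait "$pid" 2>/dev/null || true
}
```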
ID: 67375
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67379 - Posted: 5 Jan 2023, 22:02:28 UTC
Last modified: 5 Jan 2023, 22:03:09 UTC

Thanks for your idea!

Just to make it clear again:
This machine hasn't seen memory usage exceeding 30% of the 64GB while swap is at 0%.
RAM isn't an issue here.
Leave non-GPU tasks in memory has been ticked between the first and second task.

Also, I had a look at the log file, and the only thing I find is:
02-Jan-2023 17:10:34 [climateprediction.net] Aborting task oifs_43r3_ps_0094_2007050100_123_976_12192738_0: exceeded disk limit: 7590.70MB > 7168.00MB

with nothing unusual preceding it.

ATM I'm waiting for the third task to finish in hope it will be ok.
- - - - - - - - - -
Greetings, Jens
ID: 67379
rjs5

Joined: 16 Jun 05
Posts: 16
Credit: 18,625,550
RAC: 11,526
Message 67381 - Posted: 5 Jan 2023, 22:14:08 UTC - in response to Message 67379.  

gemini8 wrote:
Thanks for your idea!

Just to make it clear again:
This machine hasn't seen memory usage exceeding 30% of the 64GB while swap is at 0%.
RAM isn't an issue here.
Leave non-GPU tasks in memory has been ticked between the first and second task.

Also, I had a look at the log file, and the only thing I find is:
02-Jan-2023 17:10:34 [climateprediction.net] Aborting task oifs_43r3_ps_0094_2007050100_123_976_12192738_0: exceeded disk limit: 7590.70MB > 7168.00MB

with nothing unusual preceding it.

ATM I'm waiting for the third task to finish in hope it will be ok.


Have you looked at the BOINC Manager DISK pie chart screen? It shows Disk Usage pie charts and you can visually see how much space you have left on the disk partition.
You can increase the DISK available with the OPTIONS : COMPUTING PREFERENCES popup. Look at the DISK AND MEMORY options.

I think I remember some problems with leaving any of the 3 disk options blank. Try setting those 3 values and check the DISK USAGE pie charts.
ID: 67381
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67382 - Posted: 5 Jan 2023, 22:24:31 UTC

Storage and memory usage
Use no more than
	200 GB
Leave at least free
	5 GB
Use no more than
	50 % of total disk space

The first option was at 100 GB when I did those two tasks.
Of my 1TB SSD about 750 GB are free.
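
For scale, the three settings can be combined into one global allowance. This is a simplified reading of the settings, not BOINC's actual accounting: it ignores the space BOINC already occupies, takes 1TB as 1000 GB, and uses a made-up function name.

```shell
# Smallest of the three global disk settings, in whole GB.
# Arguments: max_used_gb  max_used_pct  total_disk_gb  free_disk_gb  min_free_gb
global_disk_allowance_gb() {
    by_pct=$(( $3 * $2 / 100 ))   # "Use no more than N % of total disk space"
    by_free=$(( $4 - $5 ))        # "Leave at least N GB free"
    allowance=$1                  # "Use no more than N GB"
    [ "$by_pct" -lt "$allowance" ] && allowance=$by_pct
    [ "$by_free" -lt "$allowance" ] && allowance=$by_free
    echo "$allowance"
}
```

With the values above, `global_disk_allowance_gb 200 50 1000 750 5` prints 200, and the earlier 100 GB setting gives 100. Either figure is far above the 7168.00MB named in the log line.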
- - - - - - - - - -
Greetings, Jens
ID: 67382
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67387 - Posted: 6 Jan 2023, 9:47:21 UTC - in response to Message 67381.  
Last modified: 6 Jan 2023, 9:50:21 UTC

@rjs5, there are two types of disk limits:
– What you refer to is the global limit for everything which happens in the boinc client, all projects and all tasks summed up.
– Independent of that, there is an individual limit for each task. This one is the one which caused the failure according to @gemini8's log line.

The per-task limit is controlled by the project admin, not by the user. (Unless the user performs certain modifications in the client's state file, which is not intended in boinc client's design.)
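
The per-task limit can be looked up without modifying anything by reading the state file. This is a minimal read-only sketch, assuming the limit is stored in bytes in an `rsc_disk_bound` element of `client_state.xml`. The function name is made up and the file path is passed as an argument.

```shell
# Print every per-task disk bound found in a client state file, in MB.
# Read-only: the state file is not changed.
# Usage: task_disk_bounds_mb /path/to/client_state.xml
task_disk_bounds_mb() {
    sed -n 's:.*<rsc_disk_bound>\([0-9]*\).*:\1:p' "$1" |
        awk '{ printf "%.2f\n", $1 / 1048576 }'
}
```

A bound of 7516192768 bytes would print as 7168.00, which matches the figure in the log line.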
ID: 67387
Glenn Carver

Joined: 29 Oct 17
Posts: 804
Credit: 13,568,734
RAC: 7,025
Message 67390 - Posted: 6 Jan 2023, 10:29:18 UTC - in response to Message 67379.  
Last modified: 6 Jan 2023, 10:30:47 UTC

Hi Jens,
Understand your points below. Could you please check something for me? Go into the /var/lib/boinc/slots directory (or wherever your boinc is installed), and run this command (it must be run from 'slots').

ls -l ?/srf*
and let me know what output you get.

The srf files are the biggest of the model restart files. For example, I have:

 $ cd /var/lib/boinc/slots/; ls -l 0/srf*
-rw-r--r-- 1 boinc boinc 804992476 Jan  6 00:52 0/srf00260000.0001
-rw-r--r-- 1 boinc boinc 804992476 Jan  6 10:13 0/srf00330000.0001
There should be 1 file for every restart the model has done. The number after 'srf' is the model step count when the file was written.

If the task is running at 100% cpu as set in Computing preferences in boincmgr, you should only have 1 srf file. In my example above, the model has restarted once as I shut down my PC at night. If you have a lot of these files per slot directory, then the model is restarting often and we need to understand why.
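
To put numbers on this, the srf files in a slot can be counted and compared with how many would fit under the per-task limit. This is a rough sketch with made-up function names. It counts only srf files and ignores everything else in the slot, so it gives an upper bound rather than a prediction.

```shell
# Count restart (srf) files in one slot directory.
# Usage: count_srf /var/lib/boinc/slots/0
count_srf() {
    find "$1" -maxdepth 1 -type f -name 'srf*' | wc -l | tr -d ' '
}

# How many files of a given size in bytes fit under a limit given in MB
# (1 MB taken as 1048576 bytes).
# Usage: files_within_limit 7168 804992476
files_within_limit() {
    echo $(( $1 * 1048576 / $2 ))
}
```

With the figures quoted in this thread (a 7168.00MB limit and 804992476-byte srf files), `files_within_limit 7168 804992476` prints 9.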

To save space (if needed) you can safely delete the older srf files, but always leave the most recent file otherwise the model will not be able to restart at all. In the example above, I could safely delete srf00260000 as that's the lowest number and the oldest date but I must leave the srf00330000 file.
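
The rule above (keep the most recent file, the rest are candidates) can be sketched as a listing that deletes nothing. It assumes the step count in the file name is zero-padded to a fixed width, as in the example, so that sorting by name puts the newest file last. The function name is made up.

```shell
# List srf files in a slot directory other than the highest-numbered one.
# Nothing is deleted; the output is only a list of candidates.
# Usage: old_srf_files /var/lib/boinc/slots/0
old_srf_files() {
    # sort orders by name; sed '$d' drops the last (highest-numbered) line.
    find "$1" -maxdepth 1 -type f -name 'srf*' | sort | sed '$d'
}
```

For the two files in the example this would list only srf00260000.0001, and with a single srf file it prints nothing.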

Cheers, Glenn

gemini8 wrote:
Just to make it clear again:
This machine hasn't seen memory usage exceeding 30% of the 64GB while swap is at 0%.
RAM isn't an issue here.
Leave non-GPU tasks in memory has been ticked between the first and second task.

Also, I had a look at the log file, and the only thing I find is:
02-Jan-2023 17:10:34 [climateprediction.net] Aborting task oifs_43r3_ps_0094_2007050100_123_976_12192738_0: exceeded disk limit: 7590.70MB > 7168.00MB

with nothing unusual preceding it.

ATM I'm waiting for the third task to finish in hope it will be ok.
ID: 67390
gemini8

Joined: 4 Dec 15
Posts: 52
Credit: 2,182,959
RAC: 836
Message 67399 - Posted: 6 Jan 2023, 21:03:09 UTC - in response to Message 67390.  
Last modified: 6 Jan 2023, 21:43:06 UTC

Glenn Carver wrote:
ls -l ?/srf*
and let me know what output you get.

$ sudo ls -l ?/srf*
ls: Zugriff auf '?/srf*' nicht möglich: Datei oder Verzeichnis nicht gefunden

which means file or directory not found.

*edit*
Replacing the ? with any slot number that ls shows me gets me nothing either.
*end edit*
- - - - - - - - - -
Greetings, Jens
ID: 67399
PDW

Joined: 29 Nov 17
Posts: 55
Credit: 6,494,294
RAC: 834
Message 67401 - Posted: 6 Jan 2023, 22:17:40 UTC - in response to Message 67399.  
Last modified: 6 Jan 2023, 22:19:33 UTC

Try
sudo ls -l ?/ | grep srf
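
One possible reason the `sudo ls -l ?/srf*` attempt found nothing, offered as an assumption rather than a diagnosis: the `?/srf*` pattern is expanded by the calling user's shell before `sudo` runs, so if that user cannot read the slot directories the pattern reaches `ls` unexpanded. In addition, `?` matches only single-character slot names, so slots 10 and above would be skipped. A sketch using `find`, which does its own matching (the function name is made up):

```shell
# List srf files one level below a slots directory, whatever the slot number.
# Usage: list_srf /var/lib/boinc/slots
list_srf() {
    find "$1" -mindepth 2 -maxdepth 2 -type f -name 'srf*' | sort
}
# With sudo, run the find command directly, for example:
#   sudo find /var/lib/boinc/slots -mindepth 2 -maxdepth 2 -type f -name 'srf*'
```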
ID: 67401