climateprediction.net home page
Almost all tasks fail in Linux Mint

Fardringle

Joined: 10 Mar 06
Posts: 6
Credit: 2,887,278
RAC: 5,932
Message 62902 - Posted: 8 Nov 2020, 15:53:18 UTC

I don't know how to tell if this is a problem with my computer, or with the tasks, or with BOINC, or something else, so I'm hoping one of you will know so that I don't waste more time on it and/or cause problems with the project results.
This computer is running Linux Mint 20 in a Hyper-V VM (host is Windows 10). It is allowed to use all 24 cores of the Ryzen 9 3900X CPU if it wants to, is allowed to have up to 24GB of the host's 32GB of RAM, and has 128GB of disk space (about 60GB actually used, most of it in the BOINC folders). There aren't any other projects running on this computer at this time except for WUProp. I actually had a few other VMs that I set up when I saw that some CPDN tasks were showing up, in hope that the other VMs would grab a few as well, but they all failed almost immediately so I shut them down and am only running this one for now.

Most of the tasks that I got on November 1 failed after running for only a few minutes so I figured it was just a problem with the specific batch. But several others have failed after running for multiple days. Three more of the first set of tasks are still running and appear to be making progress, although their elapsed time doesn't seem to be matching the actual time they have been running (showing about 5 days elapsed time after running for 7+ days). They also don't seem to be doing any trickles, although again I'm not sure exactly how to tell if that's the case.

This computer just got another batch of 8 new tasks today so I'd like to try to figure out what is going wrong before these fail as well...

https://www.cpdn.org/results.php?hostid=1510139
ID: 62902
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 62903 - Posted: 8 Nov 2020, 16:23:42 UTC - in response to Message 62902.  
Last modified: 8 Nov 2020, 16:36:57 UTC

These models are big. They take up about 1.4 GB of memory per task. They are also L3 cache hogs, optimally liking 3-4 MB per task. You don't need that much L3 per task, but having a lot less really slows things down. You may be trying to run too many at a time. Try to keep it to 6 or 8 and see if those will run okay and perhaps expand to more if that works. Never run on SMT/HyperThreads. Restrict the number of models to at most the number of physical (not logical) cores available.

Edit: Also, even those tasks that ran a while before crashing didn't make it to the first trickle. These trickle once per model month, and even the fastest PCs running very few models trickle in less than 1.5 days of CPU time.
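
The sizing advice above can be turned into a rough calculation. The sketch below is only an illustration, not anything the project publishes: it takes the 1.4 GB per task and 3-4 MB of L3 per task quoted in this post, and assumes 12 physical cores and 64 MB of L3 for the Ryzen 9 3900X (figures from AMD's published specifications, not from this thread). The function name `max_concurrent_tasks` is hypothetical. Note that the L3 figure is a soft preference rather than a hard requirement.

```python
import math


def max_concurrent_tasks(physical_cores, ram_gb, l3_mb,
                         ram_per_task_gb=1.4, l3_per_task_mb=4.0):
    """Rough upper bound on simultaneous models (hypothetical helper).

    Takes the smallest of three limits: physical cores (no SMT threads),
    RAM divided by the per-task footprint, and L3 divided by the
    per-task preference. Defaults come from the figures in the post.
    """
    by_ram = math.floor(ram_gb / ram_per_task_gb)
    by_l3 = math.floor(l3_mb / l3_per_task_mb)
    return max(1, min(physical_cores, by_ram, by_l3))


# Assumed hardware: 12 physical cores, 64 MB L3 (not stated in the thread).
print(max_concurrent_tasks(physical_cores=12, ram_gb=10, l3_mb=64))  # 7
print(max_concurrent_tasks(physical_cores=12, ram_gb=24, l3_mb=64))  # 12
```

With only 10 GB visible to the VM, RAM is the binding limit; with the full 24 GB, the physical core count is. Both results are upper bounds, so starting lower (6 to 8, as suggested) and working up is still the safer approach.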
ID: 62903
Fardringle

Joined: 10 Mar 06
Posts: 6
Credit: 2,887,278
RAC: 5,932
Message 62904 - Posted: 8 Nov 2020, 16:40:06 UTC - in response to Message 62903.  

geophi wrote: "These models are big. They take up about 1.4 GB of memory per task. They are also L3 cache hogs, optimally liking 3-4 MB per task. You don't need that much L3 per task, but having a lot less really slows things down. You may be trying to run too many at a time. Try to suspend all but 6 or 8 and see if those will run okay."
Thank you for the suggestion. I've never had more than about 6 running at any one time, but I'll try suspending a few to see if it makes a difference.

geophi wrote: "Edit...Also, even those tasks that ran awhile before crashing, didn't make it to the first trickle. These trickle once per model month and even the fastest PCs running very few models trickle in less than 1.5 days of CPU time."
I did notice the lack of trickles but wasn't sure if that was a difference in the Linux app or not, as the few tasks I have running on a Windows machine have been sending in trickles quite frequently. Is it possible that the Climate Prediction app just doesn't run well in a virtual machine? Or maybe not in Linux Mint? I haven't had any trouble running other BOINC projects in this Linux VM...
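
As an alternative to suspending tasks by hand, the BOINC client can read a per-project `app_config.xml` whose `project_max_concurrent` element caps how many tasks of that project run at once. The sketch below only builds the file contents as a string; the helper name `build_app_config` is hypothetical, the limit of 6 is taken from the suggestion in this thread, and where the file has to be saved (the project's folder inside the BOINC data directory) depends on the installation, so that step is left out.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_tasks):
    """Return app_config.xml contents capping concurrent project tasks.

    Hypothetical helper: it only produces the XML text and does not
    write to the BOINC data directory.
    """
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    root = ET.Element("app_config")
    limit = ET.SubElement(root, "project_max_concurrent")
    limit.text = str(max_tasks)
    return ET.tostring(root, encoding="unicode")


print(build_app_config(6))
```

The client has to re-read its config files (or be restarted) before a new limit takes effect.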
ID: 62904
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 62905 - Posted: 8 Nov 2020, 18:34:53 UTC - in response to Message 62904.  

Fardringle wrote: "I did notice the lack of trickles but wasn't sure if that was a difference in the Linux app or not, as the few tasks I have running on a Windows machine have been sending in trickles quite frequently. Is it possible that the Climate Prediction app just doesn't run well in a virtual machine? Or maybe not in Linux Mint? I haven't had any trouble running other BOINC projects in this Linux VM..."


The tasks trickle once per model month. But cpdn has several different types of models with varying levels of complexity. The Windows models at this time are not as complex as the high resolution hadam4 N216 models that are running on Linux. That is mostly the difference.
ID: 62905
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,378,503
RAC: 3,632
Message 62906 - Posted: 8 Nov 2020, 18:46:27 UTC - in response to Message 62905.  

For comparison, on my Ryzen 7, if I run tasks on more than 6 of its 16 threads, they take over 3 days to get to their first trickle. If I cut the number of running tasks down to 5 or fewer, they do so in just under 2 days.

However, to date I have had only one N216 task crash on this machine, and it had previously crashed on four others, though 2 of those crashes were due to missing libraries, so model problems can be discounted for them. At some point I will test the time between checkpoints while running different numbers of N216 tasks to determine what gives the greatest throughput of work.
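
The trade-off described here can be put into numbers. This is a back-of-the-envelope sketch, not a measurement: it treats "just under 2 days" as 2.0 days for 5 tasks, and the function names are hypothetical. The second helper gives the time to first trickle below which a larger task count would still deliver more trickles per day than that baseline.

```python
def trickles_per_day(tasks, days_to_trickle):
    """Whole-machine throughput: trickles completed per wall-clock day."""
    return tasks / days_to_trickle


def break_even_days(tasks, baseline_tasks, baseline_days):
    """Time per trickle at which `tasks` models match the baseline throughput."""
    return tasks * baseline_days / baseline_tasks


# Baseline from the post: 5 tasks, about 2.0 days to the first trickle.
print(trickles_per_day(5, 2.0))    # 2.5
# 7 tasks would need to trickle in under this many days to do better.
print(break_even_days(7, 5, 2.0))
```

By this estimate 7 tasks would have to reach a trickle in under roughly 2.8 days to beat the 5-task baseline, so at more than 3 days per trickle, running 7 gives less total work than running 5.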
ID: 62906
Profile geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 62907 - Posted: 8 Nov 2020, 18:50:16 UTC

@Fardringle

The latest Linux incarnation of your 3900X in the listing of your computers has < 10 GB of RAM indicated: https://www.cpdn.org/show_host_detail.php?hostid=1510193

One should be able to get by on 10 GB of memory with fewer than 6 tasks running at the same time.

It also looks like a couple of your crashed tasks on that specific VM instance lasted long enough to trickle and will get credits for one trickle each when the credit script is run on Wed/Thur.
ID: 62907
Fardringle

Joined: 10 Mar 06
Posts: 6
Credit: 2,887,278
RAC: 5,932
Message 62908 - Posted: 8 Nov 2020, 19:30:20 UTC - in response to Message 62907.  
Last modified: 8 Nov 2020, 20:13:54 UTC

The RAM is set to dynamic so it increases or decreases depending on the actual RAM usage in the VM. It's set to a minimum of 10GB, but can go up to 24GB if needed. That does tend to make the BOINC client stats look a little odd, though.

Also, the client that you linked to is a second virtual machine that I turned off completely because all of the CPDN tasks running on it had failed with errors. The only one running right now is named 3900X-Linux-VM1 and as of this moment it is using 12GB RAM with 5 climateprediction tasks actively running.
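
Because the memory is dynamic, the figure on the host details page is only a snapshot of what the guest reported at some point. One way to check what the VM sees at a given moment is to read `/proc/meminfo` inside the Linux guest. The sketch below parses that file's `Name: value kB` lines; the helper names are hypothetical and the sample text is made up for illustration, not taken from this machine.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def kib_to_gib(kib):
    """Convert kibibytes (the unit /proc/meminfo labels as kB) to GiB."""
    return kib / (1024 * 1024)


# Illustrative sample only; on a real system, read the file instead:
#     with open("/proc/meminfo") as f: text = f.read()
sample = "MemTotal:       12582912 kB\nMemAvailable:    4194304 kB\n"
info = parse_meminfo(sample)
print(kib_to_gib(info["MemTotal"]), kib_to_gib(info["MemAvailable"]))
```

Checking this while the models are running shows whether the guest has actually been given more memory than its configured minimum.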
ID: 62908
Fardringle

Joined: 10 Mar 06
Posts: 6
Credit: 2,887,278
RAC: 5,932
Message 62946 - Posted: 13 Nov 2020, 20:04:56 UTC - in response to Message 62907.  
Last modified: 13 Nov 2020, 20:05:30 UTC

Running 5 tasks simultaneously is using almost 20GB of RAM in Linux at this point, but they do seem to be running well now and were awarded credits for trickles on Thursday, so I appreciate the help!
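
For anyone sizing a similar VM, these numbers imply a noticeably larger footprint than the 1.4 GB per task quoted earlier. The sketch below does the arithmetic; the function names are hypothetical, and because the 20 GB figure covers the whole VM (operating system and other processes included), the per-task result is an overestimate of what each model alone uses.

```python
import math


def observed_ram_per_task(used_gb, running_tasks):
    """Average RAM per running task, with all other VM usage lumped in."""
    return used_gb / running_tasks


def tasks_that_fit(ram_limit_gb, per_task_gb):
    """How many tasks fit under a RAM limit at a given per-task footprint."""
    return math.floor(ram_limit_gb / per_task_gb)


per_task = observed_ram_per_task(used_gb=20, running_tasks=5)
print(per_task)                      # 4.0
print(tasks_that_fit(24, per_task))  # 6
```

At roughly 4 GB per task, the VM's 24 GB ceiling leaves room for about 6 tasks, which is in line with the 6 to 8 suggested earlier in the thread.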
ID: 62946
