Slow progress rate for HadAM4 at N216

Message boards : Number crunching : Slow progress rate for HadAM4 at N216
Hal Bregg

Joined: 20 Nov 18
Posts: 16
Credit: 689,293
RAC: 506
Message 61394 - Posted: 26 Oct 2019, 11:44:22 UTC
Last modified: 26 Oct 2019, 11:52:55 UTC

Hello,

I decided to run the project on 32-bit Linux installed in a VM. I dedicated only one core of an i3-2100 and 4 GB of RAM to the VM. I got one HadAM4 at N216, but the progress rate is really slow: after nearly 6 hours of running the project, I had completed only 0.84%. Should I expect such a long running time for this task, or is it just my host?
ID: 61394
Jim1348

Joined: 15 Jan 06
Posts: 577
Credit: 25,125,121
RAC: 18,812
Message 61395 - Posted: 26 Oct 2019, 13:01:13 UTC - in response to Message 61394.  

That would be over 25 days per work unit, which is a bit slow for that CPU; I would expect it to be at least twice as fast.
Are you running other projects? Normally VBox does not exact much of a penalty, but maybe it does not work well with N216.
The caching requirements for N216 are a bit strange.
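
A quick sanity check of that projection, from the rate in the opening post (0.84% done after about 6 hours):

```python
# Project the total runtime from the progress rate reported above:
# 0.84% completed after roughly 6 hours of continuous crunching.
hours_elapsed = 6.0
fraction_done = 0.0084

total_hours = hours_elapsed / fraction_done
print(f"projected total: {total_hours:.0f} h (~{total_hours / 24:.1f} days)")
# → projected total: 714 h (~29.8 days)
```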
ID: 61395
Jean-David Beyer

Joined: 5 Aug 04
Posts: 449
Credit: 6,779,629
RAC: 12,934
Message 61396 - Posted: 26 Oct 2019, 13:12:05 UTC - in response to Message 61394.  

I got one HadAM4 at N216 but the progress rate is really slow. After nearly 6hrs of running the project, I completed only 0.84%. Should I expect such long-running time for this task or is it just my host?


My machine is a 64-bit 1.8 GHz 4-core Intel Xeon. Pretty fast when I got it long ago, but about 1/2 the speed of current machines.

I am running an hadam4h N216 process in one core, an hadam4 N144 process in a second core, and two hadcm3s processes in the other two cores.

The hadam4h process has 204 hours on it and is 38% done.
The hadam4 process has 293 hours on it and is 74% done.
One hadcm3s process has 226 hours on it and is 54% done.
The other hadcm3s process has 223 hours on it and is 53% done.
These figures are as reported by my (old 7.2.33) version of the boinc client.
ID: 61396
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7436
Credit: 23,446,854
RAC: 0
Message 61397 - Posted: 26 Oct 2019, 13:19:00 UTC - in response to Message 61394.  

Hal

Did you read the message from 3 weeks ago, from the project co-ordinator about these?

HadAM4 at N216 resolution
ID: 61397
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 3084
Credit: 7,580,281
RAC: 7,555
Message 61398 - Posted: 26 Oct 2019, 13:58:23 UTC

Also worth noting that if your computer gets switched off before the first checkpoint, the task will start again from scratch. Looking at the statistics for these tasks, machines without the necessary 32-bit libraries are probably still a bigger problem, but I would not be surprised if some tasks go for a very long time before being returned, if ever. Though I guess there would be a lot more of them were it not for the tasks crashing due to lack of those libraries.
ID: 61398
Jean-David Beyer

Joined: 5 Aug 04
Posts: 449
Credit: 6,779,629
RAC: 12,934
Message 61399 - Posted: 26 Oct 2019, 17:24:54 UTC - in response to Message 61398.  

I guess there would be a lot more of them were it not for the tasks crashing due to lack of those libraries.


Are there a lot of these? Would not all work units, not just hadam4* work units, crash because of this?

Is there any way for the ClimatePrediction boinc server to detect that the libraries are absent (perhaps by analysis of failures), and to refrain from sending 32-bit work units to machines lacking 32-bit libraries?
ID: 61399
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 3084
Credit: 7,580,281
RAC: 7,555
Message 61400 - Posted: 26 Oct 2019, 17:30:22 UTC

Are there a lot of these? Would not all work units, not just hadam4* work units, crash because of this?

My theory is that the people who haven't installed the 32-bit libraries are the same ones who might not notice their tasks starting again from scratch each time they turn their computer on.
The openifs tasks won't crash, because they are 64-bit.

Is there any way for the ClimatePrediction boinc server to detect that the libraries are absent (perhaps by analysis of failures), and to refrain from sending 32-bit work units to machines lacking 32-bit libraries?


The project has been asked about this, and about the alternative of sending the libraries out with the tasks, but I guess there isn't an easy way to do it, because they haven't. There are other projects where the libraries are needed too, and none of them seem to have resolved this issue either.
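
For Debian-based systems, the usual manual fix is to enable the i386 architecture and install the 32-bit runtime. A sketch of a check and the typical commands (the loader path and package names below are the common ones, but the exact set CPDN needs may differ by distribution):

```shell
#!/bin/sh
# Sketch: check for the 32-bit dynamic loader that CPDN's 32-bit
# binaries need; if it is missing, print the usual Debian/Ubuntu fix.
# (Loader path and package names are typical, not guaranteed.)
if [ -e /lib/ld-linux.so.2 ]; then
    echo "32-bit loader present"
else
    echo "32-bit loader missing; on Debian/Ubuntu try:"
    echo "  sudo dpkg --add-architecture i386"
    echo "  sudo apt-get update"
    echo "  sudo apt-get install libstdc++6:i386 zlib1g:i386"
fi
```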
ID: 61400
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2014
Credit: 53,532,025
RAC: 17,093
Message 61401 - Posted: 26 Oct 2019, 18:25:23 UTC - in response to Message 61394.  

Hello,

I decided to run the project on 32-bit Linux installed in a VM. I dedicated only one core of an i3-2100 and 4 GB of RAM to the VM. I got one HadAM4 at N216, but the progress rate is really slow: after nearly 6 hours of running the project, I had completed only 0.84%. Should I expect such a long running time for this task, or is it just my host?

What I would do is make sure the BOINC preferences have "Leave non-GPU tasks in memory while suspended" checked, "Suspend when computer is in use" unchecked, and "Use at most xx% of CPU time" set to 100%. This should minimize interruptions. Even with those options, the time between checkpoints will be long (3-4 hrs?). With only 3 MB of L3 cache on that CPU, that's marginal for decent performance on this model. If that cache is regularly being shared with other processes on the PC, it might slow things down even more.
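
Those switches can also be set locally with BOINC's override file, which takes precedence over website preferences. A sketch of `global_prefs_override.xml` in the BOINC data directory (element names as used by standard BOINC clients; check your client version's documentation):

```xml
<!-- global_prefs_override.xml: local overrides for BOINC preferences.
     Place in the BOINC data directory, then re-read it via
     Options -> Read local prefs file, or restart the client. -->
<global_preferences>
   <leave_apps_in_memory>1</leave_apps_in_memory>  <!-- keep suspended tasks in RAM -->
   <run_if_user_active>1</run_if_user_active>      <!-- don't suspend when computer in use -->
   <cpu_usage_limit>100</cpu_usage_limit>          <!-- use at most 100% of CPU time -->
</global_preferences>
```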
ID: 61401
Hal Bregg

Joined: 20 Nov 18
Posts: 16
Credit: 689,293
RAC: 506
Message 61402 - Posted: 26 Oct 2019, 20:34:00 UTC - in response to Message 61395.  
Last modified: 26 Oct 2019, 20:35:49 UTC

That would be over 25 days per work unit, which is a bit slow for that CPU; I would expect it to be at least twice as fast.
Are you running other projects? Normally VBox does not exact much of a penalty, but maybe it does not work well with N216.
The caching requirements for N216 are a bit strange.


Exactly. Nearly a month of continuous crunching, which I am not capable of doing at the moment; at the current speed it would take me even longer than that.

Anyway, I installed a 32-bit Debian-based CLI version of Linux, and I simply save the current state of the VM before shutting down VirtualBox, which should not affect the progress of the currently running task. I also kept running 2 LHC@home tasks at the same time, which use VirtualBox. However, I suspended those for about 2 hours to see if things would get better, but didn't notice any improvement in the crunching progress of the CPDN task.

What also struck me is that I crunched a few wah2 WUs on a Windows host with a 2.16 GHz Intel Celeron on board, and I usually needed around 7 days to complete those.

And one last thing: I am familiar with the announcement Les Bayliss mentioned in his post, but I did not see any clear indication of how long the task would run. The running times posted by Jean-David Beyer suggest that I might end up crunching for more than a month at my current speed.
ID: 61402
Jim1348

Joined: 15 Jan 06
Posts: 577
Credit: 25,125,121
RAC: 18,812
Message 61403 - Posted: 26 Oct 2019, 21:05:22 UTC - in response to Message 61402.  
Last modified: 26 Oct 2019, 21:05:34 UTC

I also kept running 2 LHC@home tasks at the same time, which use VirtualBox. However, I suspended those for about 2 hours to see if things would get better, but didn't notice any improvement in the crunching progress of the CPDN task.

Hummm. The LHC VBox tasks will take a lot of memory, at least CMS and ATLAS. When you suspend them, if you have "leave applications in memory" enabled, they will hang around in memory. So I wouldn't run them at all.
I don't even try to run the native LHC tasks on my machines with a lot more memory and cache. Just exit LHC entirely and run CPDN for a while.

Good luck with the switch out of VBox. I have never attempted such a thing.
ID: 61403
Hal Bregg

Joined: 20 Nov 18
Posts: 16
Credit: 689,293
RAC: 506
Message 61404 - Posted: 26 Oct 2019, 21:13:55 UTC - in response to Message 61403.  

I also kept running 2 LHC@home tasks at the same time, which use VirtualBox. However, I suspended those for about 2 hours to see if things would get better, but didn't notice any improvement in the crunching progress of the CPDN task.

Hummm. The LHC VBox tasks will take a lot of memory, at least CMS and ATLAS. When you suspend them, if you have "leave applications in memory" enabled, they will hang around in memory. So I wouldn't run them at all.
I don't even try to run the native LHC tasks on my machines with a lot more memory and cache. Just exit LHC entirely and run CPDN for a while.

Good luck with the switch out of VBox. I have never attempted such a thing.


I treat this as an experiment, nothing more. I tried once to run CPDN on 64-bit Linux, but either it wasn't working at all or tasks were crashing unexpectedly after some time. I might try again in the future.
ID: 61404
Jean-David Beyer

Joined: 5 Aug 04
Posts: 449
Credit: 6,779,629
RAC: 12,934
Message 61405 - Posted: 26 Oct 2019, 21:36:36 UTC - in response to Message 61402.  

Looking at running times posted by Jean-David Beyer indicates that I might end up crunching for more than a month at current speed.


Bear in mind my processor is an old
GenuineIntel Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7]
that was pretty fast when I bought it, but runs at about half the speed of current machines. It does have a relatively large on-chip cache of 10240 KB, and 16 GB of RAM (8 modules of 2 GB DMS Certified Memory DDR3-1333 (PC3-10600) 256x72 CL9 1.5v 240-pin ECC Registered DIMM).

My hadam4h is taking 52.3026 sec/TS.
My hadcm3s #1 is taking 22.7600 sec/TS.
My hadcm3s #2 is taking 22.7597 sec/TS.
My hadam4 is taking 25.8856 sec/TS.

The N216 model seems to be running twice as fast as the other two, but I am not sure I believe that. I figure two or three weeks apiece, but I have not been running these larger tasks for very long. I had not even been running any CPDN work units for a long time, because I run Linux on this machine. I do not mind how long these work units take; in the past, I have run work units that had three phases to them and took several months apiece.
ID: 61405
bernard_ivo

Joined: 18 Jul 13
Posts: 395
Credit: 17,859,821
RAC: 15,906
Message 61406 - Posted: 27 Oct 2019, 7:33:44 UTC

I run 4 HadAM4h on my i7-4790 with 16GB RAM.

They are all above 75%, have run for more than 9 days, and have an estimated 3 days remaining.

The HDD write is around 145 GB for 20 hours.
ID: 61406
Jim1348

Joined: 15 Jan 06
Posts: 577
Credit: 25,125,121
RAC: 18,812
Message 61407 - Posted: 27 Oct 2019, 12:48:49 UTC - in response to Message 61406.  
Last modified: 27 Oct 2019, 12:49:27 UTC

I run 4 HadAM4h on my i7-4790 with 16GB RAM.

They are all above 75%, have run for more than 9 days, and have an estimated 3 days remaining.

I am seeing the same thing on my i7-4790. With four cores running HadAM4h, I have 12+ days total (50% completed).
It seems that four cores work best on all my Intel machines (also i7-8700 and i7-9700), regardless of the total number of real or virtual cores.

Ryzen is another matter, and I am still chasing that one down.
ID: 61407
bernard_ivo

Joined: 18 Jul 13
Posts: 395
Credit: 17,859,821
RAC: 15,906
Message 61409 - Posted: 27 Oct 2019, 18:57:58 UTC - in response to Message 61406.  
Last modified: 27 Oct 2019, 18:58:18 UTC

The HDD write is around 145 GB for 20 hours.

I need to correct this to 14.5 GB (2x7400 MB) for 20 h, which is much better for the HDD.
Checkpoints are at around 2.5 h; with no UPS, that is too long for my taste.
ID: 61409
Jean-David Beyer

Joined: 5 Aug 04
Posts: 449
Credit: 6,779,629
RAC: 12,934
Message 61410 - Posted: 27 Oct 2019, 19:27:05 UTC - in response to Message 61406.  

I run 4 HadAM4h on my i7-4790 with 16GB RAM.

They are all above 75%, have run for more than 9 days, and have an estimated 3 days remaining.


My processor is a 4-core 64-bit 1.8 GHz Xeon with 10240 KB of cache and 16 GB of RAM.
I run one hadam4h, currently getting 98.8% of a CPU: 234 hours run, 153 hours to go.
I run two hadcm3h, currently getting 98.1% of a CPU each: 254 hours run, about 343 hours to go.
I run one hadam4, currently getting 97.6% of a CPU: 323 hours run, 230 hours to go.

They all get a little more CPU time when I am not running BOINC Manager, the Firefox web browser, and a couple of little processes.
ID: 61410
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 3084
Credit: 7,580,281
RAC: 7,555
Message 61416 - Posted: 28 Oct 2019, 12:56:15 UTC - in response to Message 61409.  

The HDD write is around 145 GB for 20 hours.

I need to correct this to 14.5 GB (2x7400 MB) for 20 h, which is much better for the HDD.
Checkpoints are at around 2.5 h; with no UPS, that is too long for my taste.


I have been told the checkpoint interval for these will be reduced in the future, but I don't know by how much.
ID: 61416
Michael Goetz

Joined: 2 Feb 05
Posts: 11
Credit: 473,176
RAC: 1
Message 61419 - Posted: 28 Oct 2019, 19:24:46 UTC - in response to Message 61399.  

Jean-David Beyer wrote:
I guess there would be a lot more of them were it not for the tasks crashing due to lack of those libraries.


Are there a lot of these? Would not all work units, not just hadam4* work units, crash because of this?

Is there any way for the ClimatePrediction boinc server to detect that the libraries are absent (perhaps by analysis of failures), and to refrain from sending 32-bit work units to machines lacking 32-bit libraries?


You probably don't want to do that:

1) Project-wise, this isn't a big problem. These tasks error out almost immediately, get sent back to the server, and are quickly turned around to go out to other hosts. This doesn't affect the project's overall throughput significantly, nor does it significantly impact the ability of good hosts to get work.

2) This is a problem users can, and do, fix. You don't want to block the host permanently. You don't even want to block it temporarily, because the inability to get tasks makes it impossible for the user to fix the problem. If you lock out such a host, you're actually contributing to the problem by making it harder for users to correct it!
Want to find one of the largest known primes? Try PrimeGrid. Or help cure disease at WCG.

ID: 61419
Doneske

Joined: 27 Apr 13
Posts: 4
Credit: 7,352,637
RAC: 0
Message 61424 - Posted: 28 Oct 2019, 21:31:50 UTC - in response to Message 61419.  

I had two machines that didn't have the 32-bit libraries installed, and they didn't get any work despite my hitting the update button multiple times during the day. Once I installed the indicated libraries and rebooted, I got work immediately after hitting the update button. I'm currently running 25 N216 units on one of the machines, and they seem to be well behaved so far. Some are at 238 hours with 99 hours to go; most should end under 400 hours.
ID: 61424
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 3084
Credit: 7,580,281
RAC: 7,555
Message 61427 - Posted: 29 Oct 2019, 7:40:24 UTC - in response to Message 61419.  

1) Project-wise, this isn't a big problem. These tasks error out almost immediately, get sent back to the server, and are quickly turned around to go out to other hosts. This doesn't affect the project's overall throughput significantly, nor does it significantly impact the ability of good hosts to get work.

Not sure I agree. Batch 843 already has 11% of its tasks down as hard failures, i.e. all three attempts to complete the work unit have failed. If nothing else, this means that 10% or more extra work units need to be generated and sent out in order to get sufficient results back. I don't know whether it would be worth giving the tasks dependent on these libraries four or five attempts before being designated hard failures, compared to the normal three. It would mean that tasks which failed for other reasons, possibly after many days of computing, would tie up more computers.
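
As a rough check of that overhead (simplified arithmetic on the 11% figure; real scheduling, with three attempts per work unit, is more complicated):

```python
# If 11% of work units end as hard failures, roughly how many extra
# units must be generated per successful result? Simplified model:
# each generated unit independently succeeds with probability 0.89.
hard_failure_rate = 0.11
units_per_success = 1.0 / (1.0 - hard_failure_rate)
print(f"~{(units_per_success - 1.0) * 100:.1f}% extra work units needed")
# → ~12.4% extra work units needed
```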


2) This is a problem users can, and do, fix. You don't want to block the host permanently. You don't even want to block it temporarily, because the inability to get tasks makes it impossible for the user to fix the problem. If you lock out such a host, you're actually contributing to the problem by making it harder for users to correct it!


In the past, machines have been blocked and messages sent to the users; once they confirmed they had installed the missing libraries, the block was reset so they could get work again.
ID: 61427

©2021 climateprediction.net