climateprediction.net
Questions and Answers : Unix/Linux : Benchmarks and other problems

Jean-David Beyer
Joined: 5 Aug 04
Posts: 382
Credit: 3,690,501
RAC: 0
Message 60379 - Posted: 20 Jun 2019, 12:19:49 UTC - in response to Message 60376.  

It will be nice to see some work done by the Linux boxes that are currently trashing everything they get because of missing 32-bit libs. :)


Are they sending work units to Linux boxes that are trashing everything?

I have a Linux box that has the 32-bit compatibility libraries, but I have received almost no work units in about a year, except a couple of retreads. Then, the day before yesterday, I got the current four work units that are now crunching away. One has generated two trickles and the other three have generated one trickle each.

It seems to me that if they were sending out 32-bit work units to boxen missing the necessary libraries, I would have been getting some of them too. But I have not.

Wed 19 Jun 2019 03:27:26 PM EDT | Finished upload of hadam4_a027_200610_12_825_011882434_0_r411654165_1.zip
Wed 19 Jun 2019 04:11:36 PM EDT | Sending scheduler request: To send trickle-up message.
ID: 60379
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2790
Credit: 3,659,580
RAC: 11,331
Message 60380 - Posted: 20 Jun 2019, 12:33:36 UTC

It seems to me that if they were sending out 32-bit work units to boxen missing the necessary libraries, I would have been getting some of them too. But I have not.


Batch 825's 550 (HADAM4) tasks for Linux have the statistics shown below. Of the hard fails, i.e. tasks that failed on all three attempts, each one has either two or three failures because of missing 32-bit libs.


Success: 0 (0%)
Fails: 208 (38%)
Hard Fail: 37 (7%)
Running: 513 (93%)
Unsent: 0 (0%)
ID: 60380
Jean-David Beyer
Joined: 5 Aug 04
Posts: 382
Credit: 3,690,501
RAC: 0
Message 60381 - Posted: 20 Jun 2019, 14:38:25 UTC - in response to Message 60380.  

I failed one of these on work unit 21490395, not because of missing libraries, but because my machine crashed, and it was the version that could not tolerate machine restarts.

UK Met Office HadAM4 at N144 resolution v8.08
i686-pc-linux-gnu
stderr out

<core_client_version>7.2.33</core_client_version>
<![CDATA[
<message>
process exited with code 22 (0x16, -234)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy

Model crashed: READDUMP: BAD BUFFIN OF DATA tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(
13:45:02 (3083): called boinc_finish(22)

</stderr_txt>
]]>

This problem seems to have been fixed in the UK Met Office HadAM4 at N144 resolution v8.09 version of the software. I have four of those running, and they are uploading and trickling OK.

They seem to be running well over twice as fast as the expected completion time suggested: they were expected to take about 1,050 hours to complete, but one is at 18.455% complete after running 72.5 hours.
ID: 60381
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2790
Credit: 3,659,580
RAC: 11,331
Message 60382 - Posted: 20 Jun 2019, 14:53:19 UTC

[quote]I failed one of these on work unit 21490395.[/quote]

And for that one, which failed three times, one of the three failures was due to a lack of 32-bit libs.
ID: 60382
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 1942
Credit: 41,861,826
RAC: 19,854
Message 60383 - Posted: 20 Jun 2019, 15:25:05 UTC - in response to Message 60381.  

They seem to be running well over twice as fast as the expected completion time suggested: they were expected to take about 1,050 hours to complete, but one is at 18.455% complete after running 72.5 hours.

The initial estimate of time to completion is partially based on the BOINC floating-point benchmark. For some reason, the 7.2.33 version that comes with Red Hat 6-type installations produces unrealistically low benchmarks. Same thing for my Phenom II 945 on CentOS 6: 7.2.33 gives an FP benchmark of about 1600, whereas later versions of BOINC on Ubuntu give about 3000 for the same CPU.
ID: 60383
Jean-David Beyer
Joined: 5 Aug 04
Posts: 382
Credit: 3,690,501
RAC: 0
Message 60384 - Posted: 20 Jun 2019, 16:21:20 UTC - in response to Message 60383.  

The initial estimate of time to completion is partially based on the BOINC floating-point benchmark. For some reason the 7.2.33 version that comes with Red Hat 6-type installations produces unrealistically low benchmarks.


If that benchmark is still the widely used "Whetstone" benchmark, it is possibly the worst benchmark there could be for measuring floating-point performance.

It is based on statistics obtained by running tests on "typical" programs written in Algol 60 in an interpreter (not a compiler). The interpreter was used because it made it easy to insert the necessary timing instrumentation into the code automatically.

There are about a dozen loops in the program, each executed 10,000 times, if I remember correctly. Each one does either some simple calculations or calls subroutines. The one whose subroutine does floating-point operations is in the benchmark to evaluate the cost of function calls, not loop overhead and not floating-point operations; it just happens to do some floating-point operations.

I was involved in writing an optimizer for the C compiler at Bell Labs in the early 1980s. One of the optimizations we used was to expand called functions inline when it made sense, which defeated the measurement of the call and return operations that was the original purpose of that module. Then my loop-invariant code motion optimization moved all those operations out of the loop, since they did not change from one iteration to another. Then a live-dead analysis eliminated all the remaining floating-point operations, because the results were not used. These optimizations produced an enormous speed-up in our execution of that benchmark. Since they were common by 1990, they are probably in almost all compilers by now. So whatever that benchmark may have measured in 1965, it does not measure it today.
ID: 60384
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7261
Credit: 23,235,750
RAC: 5,211
Message 60387 - Posted: 20 Jun 2019, 22:21:08 UTC
Last modified: 20 Jun 2019, 22:31:00 UTC

I've moved all of the preceding posts out of the thread intended to discuss the new OpenIFS models.
ID: 60387
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7261
Credit: 23,235,750
RAC: 5,211
Message 60388 - Posted: 20 Jun 2019, 22:47:04 UTC

Are they sending work units to Linux boxes that are trashing everything?

Projects don't send work; it is requested by the computers connected to them.


*******************

It seems to me that if they were sending out 32-bit work units to boxen missing the necessary libraries, I would have been getting some of them too. But I have not.

Your computer got 4 of the latest version of that model on 17 Jun 2019, 12:21:33 UTC. This was 3 days before you posted.

******************

One of the reasons for your low benchmarks is that your computer is horribly slow compared to the latest computers.

And if you ever expect to get anywhere with the new type of model when it's finally released (will we EVER get to that point?), then people should be looking at processor speeds in the 3 GHz area, with a lot more memory than you have.

All of which will be posted about when the time comes.
ID: 60388
Jean-David Beyer
Joined: 5 Aug 04
Posts: 382
Credit: 3,690,501
RAC: 0
Message 60389 - Posted: 21 Jun 2019, 0:24:25 UTC - in response to Message 60387.  

A very good idea.
ID: 60389
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2790
Credit: 3,659,580
RAC: 11,331
Message 60390 - Posted: 21 Jun 2019, 7:03:42 UTC

I've moved all of the preceding posts out of the thread intended to discuss the new OpenIFS models.


Thanks Les, I was thinking about moving some of it myself.
ID: 60390
Jean-David Beyer
Joined: 5 Aug 04
Posts: 382
Credit: 3,690,501
RAC: 0
Message 60394 - Posted: 21 Jun 2019, 17:10:08 UTC - in response to Message 60388.  

And if you ever expect to get anywhere with the new type of model when it's finally released (will we EVER get to that point?), then people should be looking at processor speeds in the 3 GHz area, with a lot more memory than you have.


When I got the machine, 1.8 GHz was not all that slow, but 3 GHz is less than double the speed of mine. It is a 64-bit Intel Xeon.

Is 16 GB of RAM all that small? (I just doubled it from the 8 GB that came with the machine.) To add more RAM with this motherboard, I would need to add a second processor. I could easily put a second processor in that motherboard, but it would be the same speed, though I could then handle eight processes at once instead of only four, and I could double the RAM at the same time. I do not think any one process could get over 16 GB of RAM in my setup. At some point (when the money tree blooms), getting a new machine is probably the way to go.

Time Sent (UTC): 20 Jun 2019 08:17:32
Host ID: 1256552
Result ID: 21718833
Result Name: hadam4_a0c5_201310_12_825_011882792_0
Phase: 1
Timestep: 8,741
CPU Time (sec): 224,334
Average (sec/TS): 25.6646
ID: 60394
WB8ILI
Joined: 1 Sep 04
Posts: 140
Credit: 55,519,048
RAC: 14,801
Message 60395 - Posted: 21 Jun 2019, 19:26:00 UTC

Jean-David -

I think you are doing OK with your setup.

I am currently running the N144 tasks on four machines:

1) AMD Phenom II X4 945 3.6 GHz 23.5 sec/TS
2) AMD Phenom II X4 945 3.6 GHz 24.5 sec/TS
3) AMD FX 8370 Eight Core 4.0 GHz 18.5-20.5 sec/TS
4) AMD FX 8370 Eight Core 4.0 GHz 18.5-20.5 sec/TS

I don't believe you have a "tortoise" for a machine, since you are reporting 25 sec/TS.

As far as memory goes, these tasks seem to be using about 650 MB each. Depending on what else you are doing on your machine, you may or may not have enough memory.

Look at your memory usage (% of total). If it is over 85% when running these tasks, I would strongly consider adding more memory, especially if you are using the computer for anything else.

There are many other factors that determine throughput (e.g. memory speed).
ID: 60395
Alex Plantema
Joined: 3 Sep 04
Posts: 122
Credit: 26,128,640
RAC: 0
Message 60396 - Posted: 21 Jun 2019, 21:06:41 UTC

Xeons are faster than other processors with the same clock frequency, and 16 GB should be more than enough. The project is supposed to run on home computers.
ID: 60396


©2020 climateprediction.net