Posts by Jean-David Beyer

21) Message boards : Number crunching : no new work units (Message 61531)
Posted 14 Nov 2019 by Jean-David Beyer
Post:
the new models seem to need around 4 Megs of L3 cache, or they slow waaaaaaay down.


Hurray! My slow (1.8GHz) 4-core 64-bit processor has 10240 KBytes (10 Megs) of L3 cache.
And now I have 16GBytes of DDR3 registered RAM in eight 2 GByte modules.
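
For scale, the host details for this machine list the cache as 10240 KB. The sketch below only does the unit conversion and compares the result with the roughly 4 MB mentioned in the quoted post. It assumes binary units (1 MB = 1024 KB), and the variable names are illustrative. It says nothing about how the cache is divided when several tasks run at once.

```python
# Convert the cache size reported by BOINC (in KB) to MB and compare it
# with the ~4 MB the quoted post says the new models need.
# Assumption: binary units, 1 MB = 1024 KB.
CACHE_KB = 10240   # "Cache 10240 KB" from the host details
NEEDED_MB = 4      # figure from the quoted post, not measured here

cache_mb = CACHE_KB / 1024
print(f"L3 cache: {cache_mb:.0f} MB")
print(f"Ratio to the quoted {NEEDED_MB} MB: {cache_mb / NEEDED_MB:.1f}x")
```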
22) Message boards : Number crunching : no new work units (Message 61525)
Posted 11 Nov 2019 by Jean-David Beyer
Post:
Right now you only have two kinds of work units available: one that takes an average of 147 hours and one that takes an average of 337 hours. That's a huge difference, especially if your PC is older and doesn't run in the 3 or 4 GHz range.


On my slow machine, 1256552, the N144 work units are taking about 16 days each, and the N216 work units are taking about 22 days. Machine running 24/7, and not doing much else except web browsing and e-mail. I am currently running four N216 work units.

CPU type 	GenuineIntel Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7]
Number of processors 	4
Operating System 	Linux 2.6.32-754.23.1.el6.x86_64
BOINC version 	        7.2.33
Memory 	       15.5 GB
Cache 	      10240 KB
Swap space     3.91 GB
Total disk space 	117.21 GB
Free Disk Space 	 97.36 GB
Measured floating point speed 	1.27 billion ops/sec
Measured integer speed 	        3.53 billion ops/sec
Average upload rate 	    1394.09 KB/sec
Average download rate 	   10519.69 KB/sec
Average turnaround time 	18.94 days
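
To put the day counts next to the quoted averages, here is a rough conversion. It assumes, and the thread does not confirm, that the 147-hour average refers to the N144 work units and the 337-hour average to the N216 ones. The dictionary layout is just for illustration.

```python
# Compare this host's run times with the project-wide averages quoted above.
# Assumption (not stated in the thread): the 147-hour average belongs to the
# N144 work units and the 337-hour average to the N216 work units.
HOURS_PER_DAY = 24

runs = {
    # name: (days on this host, quoted average hours)
    "N144": (16, 147),
    "N216": (22, 337),
}

for name, (days_here, avg_hours) in runs.items():
    hours_here = days_here * HOURS_PER_DAY
    print(f"{name}: {hours_here} h here vs {avg_hours} h average "
          f"({hours_here / avg_hours:.2f}x)")
```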
23) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61512)
Posted 10 Nov 2019 by Jean-David Beyer
Post:
This is new for me. Having been Windows-only for many years, I now have a VM running Ubuntu 18.04 (one core out of 4 on a 3.3GHz i5, allocated 3 GB RAM) and have got one of these to go with an N144. Will see how it goes.


3GBytes of RAM for an N144 task? That seems like a lot.
I am running four N216 tasks on Red Hat Enterprise Linux Server release 6.10 (Santiago) ...
CPU type 	        GenuineIntel
                        Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7]
Number of processors 	4
Operating System    	Linux 2.6.32-754.23.1.el6.x86_64
Memory 	                15.5 GB

and they are taking 1.35 GB of virtual memory (each) and the working set size is 1.33 GBytes. I do not have any N144 tasks at the moment.
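
As a quick check of what four of these tasks add up to, the arithmetic can be sketched as below. The figures are the ones in this post. Treating BOINC's "15.5 GB" and the per-task "1.35 GB" as the same kind of GB is an assumption.

```python
# Memory used by four concurrent N216 tasks on this host.
# Figures come from the post; treating "15.5 GB" and "1.35 GB" as the same
# kind of GB is an assumption.
TASKS = 4
VIRT_GB_PER_TASK = 1.35   # virtual memory per task
TOTAL_RAM_GB = 15.5       # as reported by BOINC

total_gb = TASKS * VIRT_GB_PER_TASK
share = total_gb / TOTAL_RAM_GB
print(f"{TASKS} tasks use about {total_gb:.1f} GB, {share:.0%} of RAM")
```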
24) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61510)
Posted 9 Nov 2019 by Jean-David Beyer
Post:
Someone over at WCG seemed to think 5MB cache was what a MIP1 job would like. The user offered no justification for that number but 4MB probably isn't enough for near-optimum performance.


I wonder what that is really about. I ran three MIP1 jobs today (one at a time) with three CPDN N216 jobs on my other three cores.
The MIP1 job used about 2% of my RAM whereas the N216 jobs take 8.5% each. I have 16 GBytes RAM. 64-bit processor.

Were they talking about disk cache? That would not make much sense.
Or processor cache? My processor has Cache 10240 KB.
25) Message boards : Number crunching : What happens near the end of a task? (Message 61508)
Posted 9 Nov 2019 by Jean-David Beyer
Post:
There are a couple of other files that sometimes get created after the last zip: restart.zip and stdaeout.zip (I might not have the name of the last one exactly right). The first of the two is of the same order of magnitude as the monthly zips, while the other is KB in size rather than MB.


There was one more file:

Sat 09 Nov 2019 09:18:28 AM EST | climateprediction.net | Computation for task hadam4h_a0pg_200811_4_842_011905372_0 finished
Sat 09 Nov 2019 09:18:28 AM EST | climateprediction.net | Resuming task hadam4h_a1i3_201211_4_842_011906403_0 using hadam4h version 852 in slot 1
Sat 09 Nov 2019 09:18:31 AM EST | climateprediction.net | Started upload of hadam4h_a0pg_200811_4_842_011905372_0_r1707431134_out.zip
Sat 09 Nov 2019 09:18:36 AM EST | climateprediction.net | Finished upload of hadam4h_a0pg_200811_4_842_011905372_0_r1707431134_out.zip
Sat 09 Nov 2019 09:18:38 AM EST | climateprediction.net | Sending scheduler request: To report completed tasks.


Notice how long the _4.zip file took to upload?
Sat 09 Nov 2019 08:03:57 AM EST | climateprediction.net | Started upload of hadam4h_a0pg_200811_4_842_011905372_0_r1707431134_4.zip
Sat 09 Nov 2019 08:14:18 AM EST | climateprediction.net | Finished upload of hadam4h_a0pg_200811_4_842_011905372_0_r1707431134_4.zip


A bit over 10 minutes. Since my Internet connection is about 75 Megabits/second, I infer that either the Internet backbone speeds are clogged or the CPDN server is none too speedy. I could go about 40x faster than that.

Date                   Download     Upload       IP Address       Test Server
11/9/2019 9:32:03 AM   82.44 Mbps   36.14 Mbps   100.35.165.125   New York City, NY
10/4/2019 7:45:44 AM   83.98 Mbps   87.33 Mbps   100.35.165.125   New York City, NY
10/1/2019 5:17:42 PM   83.84 Mbps   66.32 Mbps   100.35.165.125   Seattle, WA
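
The "bit over 10 minutes" can be read straight off the two log timestamps. This is a minimal sketch, assuming the BOINC event log format shown above and an English (C) locale for day and month names. `parse_boinc_time` is a helper made up for this example, not part of BOINC. The last lines only restate the post's own "about 40x" estimate as a rate. The size of the zip file is not given in the thread, so no actual throughput is computed.

```python
from datetime import datetime

def parse_boinc_time(stamp: str) -> datetime:
    """Parse 'Sat 09 Nov 2019 08:03:57 AM EST', ignoring the zone name."""
    # Drop the trailing zone abbreviation; strptime's %Z handles it unreliably.
    without_zone = stamp.rsplit(" ", 1)[0]
    return datetime.strptime(without_zone, "%a %d %b %Y %I:%M:%S %p")

started = parse_boinc_time("Sat 09 Nov 2019 08:03:57 AM EST")
finished = parse_boinc_time("Sat 09 Nov 2019 08:14:18 AM EST")
elapsed = (finished - started).total_seconds()
print(f"Upload took {elapsed:.0f} s ({elapsed / 60:.1f} min)")

# The post estimates the link could go about 40x faster than this upload.
# Taking that estimate and the ~75 Mbit/s figure at face value:
LINK_MBPS = 75
implied_mbps = LINK_MBPS / 40
print(f"Implied upload rate: about {implied_mbps:.1f} Mbit/s")
```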
26) Message boards : Number crunching : What happens near the end of a task? (Message 61506)
Posted 9 Nov 2019 by Jean-David Beyer
Post:
One of my work units just uploaded its last .zip file, yet the BOINC client says I have almost three hours to go. And the task is still running. What could it be doing now? After running all four trickles, is it not done?

Name hadam4h_a0pg_200811_4_842_011905372_0
Workunit 11905372

Sat 09 Nov 2019 08:03:11 AM EST | climateprediction.net | Started upload of hadam4h_a0pg_200811_4_842_011905372_0_r1707431134_restart.zip
Sat 09 Nov 2019 08:03:13 AM EST | climateprediction.net | Finished upload of hadam4h_a0pg_200811_4_842_011905372_0_r1707431134_restart.zip
Sat 09 Nov 2019 08:03:24 AM EST | climateprediction.net | Sending scheduler request: To send trickle-up message.
Sat 09 Nov 2019 08:03:24 AM EST | climateprediction.net | Not requesting tasks: "no new tasks" requested via Manager
Sat 09 Nov 2019 08:03:25 AM EST | climateprediction.net | Scheduler request completed
Sat 09 Nov 2019 08:03:57 AM EST | climateprediction.net | Started upload of hadam4h_a0pg_200811_4_842_011905372_0_r1707431134_4.zip
Sat 09 Nov 2019 08:14:18 AM EST | climateprediction.net | Finished upload of hadam4h_a0pg_200811_4_842_011905372_0_r1707431134_4.zip
27) Message boards : climateprediction.net Science : Misconfigured Machine? (Message 61498)
Posted 8 Nov 2019 by Jean-David Beyer
Post:
All crashes.
https://www.cpdn.org/results.php?hostid=1434989


Missing 32-bit compatibility library.
28) Message boards : Number crunching : Why do tasks crash on some machines but not others? (Message 61494)
Posted 8 Nov 2019 by Jean-David Beyer
Post:
They conducted a study and produced a publication back in 2007 with information on the types of differences one might see between CPU types for given simulations. Association of parameter, software and hardware variation with large scale behavior across 57,000 climate models. It doesn't exactly answer your question directly, but it talks about the differences in output and what it means to the ensembles.


Thank-you. That is a really interesting paper.
29) Message boards : Number crunching : Why do tasks crash on some machines but not others? (Message 61491)
Posted 7 Nov 2019 by Jean-David Beyer
Post:
Some of the negative pressure/negative theta errors are completely repeatable, and all tasks in a work unit that get that far will crash at virtually the same model progress point. These are due to the scientists testing the limits of the parameters used in the model, which sometimes leads to unrealistic atmospheres.


To the extent that this is true, my task should have crashed too. But it did not crash; it went all the way to a successful completion.

For others, it might be due to hardware that is not quite up to the task when working hard and/or overheats, or has bad memory that only fails under certain conditions.


Mine does not overheat even though it is currently running four hadam4h N216 processes; its fan is not even running fast.

On the work unit I posted, two computers failed before I got mine, and mine was the one that completed. I realize the following is bad statistics, but I could conclude that my computer is better than 2/3 of those working on this. My guess is that it is not as bad as that, because I did not look at all the work units that completed on the very first try, or on the first two.
30) Message boards : Number crunching : Why do tasks crash on some machines but not others? (Message 61484)
Posted 7 Nov 2019 by Jean-David Beyer
Post:
I was looking at some tasks that I have run that failed for others, typically two others.

One of these was Workunit 11901525

Now some of them crash after a large fraction of a second, or a few seconds. I am ignoring these.

But some run a long time, such as Task 21754793

This one ran for a CPU time of 2 days 1 hour 32 min 7 sec.

And the failure was
<stderr_txt>
CPDN Monitor - Quit request from BOINC...

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy

Model crashed: P_TH_ADJ : NEGATIVE PRESSURE VALUE CREATED. tmp/xnnuj.pipe_dummy
Sorry, too many model crashes! :-(


Now if the program had bugs, or if the initial data were bad, would it not have crashed for me too?
But since I completed it correctly, it seems to me that the program probably had no bugs, and that the initial data were good too. So why did the Model crash for the others?
31) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61449)
Posted 4 Nov 2019 by Jean-David Beyer
Post:
I now have four UK Met Office HadAM4 at N216 resolution v8.52 i686-pc-linux-gnu work units running on my four cores.

21785482 21761296 21784271 21760249

One has completed three trickles, but the other three have only been running a short while and have not produced a trickle.

Those three trickles were running at:
52.3026 seconds/TS for the 25% complete one,
52.2617 seconds/TS for the 50% complete one, and
52.3267 seconds/TS for the 75% complete one.

Each is getting over 97% of the CPU time of the processor it runs on. The processor is not a fast one by today's standards, but it seems to have an unusually large cache for what it is.
CPU type 	GenuineIntel
Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7]
Operating System 	Linux 2.6.32-754.23.1.el6.x86_64
BOINC version 	7.2.33
Memory 	                15.5 GB
Cache 	               10240 KB
Swap space 	        3.91 GB
Total disk space 	117.21 GB
Free Disk Space 	 98.49 GB
Measured floating point speed 	1.27 billion ops/sec
Measured integer speed 	3.53 billion ops/sec
Average upload rate 	        3009.79 KB/sec
Average download rate 	        9768.13 KB/sec
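
How steady the three seconds/TS figures are can be summarized with a mean and a spread. This is only a sketch over the three values reported in this post, and the dictionary keys are labels chosen for the example.

```python
from statistics import mean

# Seconds per time step (sec/TS) reported with each of the three trickles.
sec_per_ts = {"25%": 52.3026, "50%": 52.2617, "75%": 52.3267}

avg = mean(sec_per_ts.values())
spread = max(sec_per_ts.values()) - min(sec_per_ts.values())
print(f"mean {avg:.4f} s/TS, spread {spread:.4f} s ({spread / avg:.2%})")
```

A spread this small relative to the mean suggests the task ran at a nearly constant rate between trickles, at least on this host.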
32) Questions and Answers : Unix/Linux : What is the name of a checkpoint file (especially hadam4h)? (Message 61442)
Posted 1 Nov 2019 by Jean-David Beyer
Post:
$ locate atmos_restart.day
/home/boinc/projects/climateprediction.net/hadam4h_a0pg_200811_4_842_011905372/dataout/atmos_restart.day
/home/boinc/projects/climateprediction.net/hadam4h_a18g_201111_4_842_011906056/dataout/atmos_restart.day
/home/boinc/projects/climateprediction.net/hadcm3s_qg42_190012_240_837_011899288/dataout/atmos_restart.day
/home/boinc/projects/climateprediction.net/hadcm3s_qy57_190012_240_837_011900203/dataout/atmos_restart.day

On my machine, /home/boinc is a separate file system all its own.
$ df
Filesystem      1K-blocks      Used  Available Use% Mounted on
/dev/sdb5        48249720  15625144   30166976  35% /
/dev/sdb2          499656    115328     358116  25% /boot
/dev/sdb6        48249720   9864708   35927412  22% /home
/dev/sdd3       122908728   9562664  107095976   9% /home/boinc <---<<<
/dev/sdb8         3966144     19108    3742236   1% /tmp
/dev/sdb7        15995848   4110856   11065792  28% /var
...
33) Questions and Answers : Unix/Linux : What is the name of a checkpoint file (especially hadam4h)? (Message 61441)
Posted 31 Oct 2019 by Jean-David Beyer
Post:
OK. Thank-you.

boinc[~/projects/climateprediction.net/hadam4h_a18g_201111_4_842_011906056/dataout]$ ls -l
total 475276
[snip]
-rw-r--r--. 1 boinc boinc 423837696 Oct 31 15:51 atmos_restart.day
-rw-r--r--. 1 boinc boinc     14173 Oct 31 15:51 shmem_restart.day
-rw-r--r--. 1 boinc boinc         0 Oct 30 16:27 xnnuj.err
-rw-r--r--. 1 boinc boinc    279754 Oct 31 17:53 xnnuj.out
-rw-r--r--. 1 boinc boinc     14100 Oct 31 15:51 xnnuj.phist
-rw-r--r--. 1 boinc boinc     14100 Oct 31 15:51 xnnuj.thist
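
The same size and timestamp information can be collected for every task at once. This is a minimal sketch, assuming the checkpoint files are the ones whose names end in `_restart.day` (as the listing above suggests) and that the projects directory is the one shown in this thread. `list_restart_files` is a name invented for the example.

```python
from datetime import datetime
from pathlib import Path

def list_restart_files(projects_dir):
    """Return (path, size_bytes, mtime) for every *_restart.day file found."""
    root = Path(projects_dir)
    found = []
    if not root.is_dir():
        return found  # nothing to scan on hosts without this directory
    for path in sorted(root.rglob("*_restart.day")):
        info = path.stat()
        found.append((path, info.st_size, datetime.fromtimestamp(info.st_mtime)))
    return found

if __name__ == "__main__":
    # Path taken from the locate output in this thread; adjust for your host.
    for path, size, mtime in list_restart_files(
            "/home/boinc/projects/climateprediction.net"):
        print(f"{size:>12d}  {mtime:%b %d %H:%M}  {path}")
```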
34) Questions and Answers : Unix/Linux : What is the name of a checkpoint file (especially hadam4h)? (Message 61439)
Posted 31 Oct 2019 by Jean-David Beyer
Post:
In the "boincdata"/projects/climateprediction.net directory, a file is written with the filename trickle_up_"rest of the filename".


Trouble is, I do not wish to see the trickle_up files; I want to see the checkpoint files.
35) Questions and Answers : Unix/Linux : What is the name of a checkpoint file (especially hadam4h)? (Message 61435)
Posted 31 Oct 2019 by Jean-David Beyer
Post:
I would like to look at checkpoint files. Actually, I do not wish to look at the content, but just the size, time and date written, etc.: what ls -l could tell me.

If I knew the name of it, the locate command could probably find it for me.
36) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61433)
Posted 31 Oct 2019 by Jean-David Beyer
Post:
What the long time between checkpoints does mean for these tasks is that computers that get switched off several times a day will never finish them, because if a task has not reached its first checkpoint it will restart from the beginning.


I just started
Name 	hadam4h_a18g_201111_4_842_011906056_2
Workunit 	11906056
Created 	30 Oct 2019, 12:56:26 UTC
Sent 	30 Oct 2019, 16:46:39 UTC
CPU time at last checkpoint     12:45:28
CPU time                        12:51:34
Elapsed time                    14:21:29


So they certainly checkpoint more often than they trickle.
Remember, when looking at these times, that my machine has a 1.8 GHz 4-core 64-bit Xeon processor that runs at about half the speed of current processors.
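
The three times listed above can be turned into seconds to see how far the task was past its last checkpoint when this snapshot was taken. This is a sketch over this one snapshot only, and `to_seconds` is a helper written for the example. It shows how recent the last checkpoint was, not the checkpoint interval itself.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds

cpu_at_checkpoint = to_seconds("12:45:28")
cpu_now = to_seconds("12:51:34")
elapsed = to_seconds("14:21:29")

since_checkpoint = cpu_now - cpu_at_checkpoint
print(f"CPU time since last checkpoint: {since_checkpoint} s")
print(f"CPU time / elapsed time: {cpu_now / elapsed:.1%}")
```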
37) Message boards : climateprediction.net Science : WCG Climate project (Message 61432)
Posted 31 Oct 2019 by Jean-David Beyer
Post:
Same here. As the project appears to be new, there are probably some kinks that need to be worked out (hopefully soon!)


No doubt, but right after the above failure, it downloaded another:
Wed 30 Oct 2019 12:46:38 PM EDT | climateprediction.net | Sending scheduler request: To fetch work.
Wed 30 Oct 2019 12:46:38 PM EDT | climateprediction.net | Requesting new tasks for CPU
Wed 30 Oct 2019 12:46:40 PM EDT | climateprediction.net | Scheduler request completed: got 1 new tasks
Wed 30 Oct 2019 12:46:42 PM EDT | climateprediction.net | Started download of hadam4h_a18g_201111_4_842_011906056.zip
Wed 30 Oct 2019 12:46:42 PM EDT | climateprediction.net | Started download of a18g_842_atmos.gz
Wed 30 Oct 2019 12:46:43 PM EDT | climateprediction.net | Finished download of hadam4h_a18g_201111_4_842_011906056.zip
Wed 30 Oct 2019 12:46:43 PM EDT | climateprediction.net | Started download of ic_N216_2003_11_000047.nc.gz
Wed 30 Oct 2019 12:46:52 PM EDT | climateprediction.net | Finished download of ic_N216_2003_11_000047.nc.gz
Wed 30 Oct 2019 12:46:52 PM EDT | climateprediction.net | Started download of ALLclim_ancil_7mon_OSTIA_sst_N216_2011-10-01_2012-04-30.gz
Wed 30 Oct 2019 12:46:56 PM EDT | climateprediction.net | Finished download of ALLclim_ancil_7mon_OSTIA_sst_N216_2011-10-01_2012-04-30.gz
Wed 30 Oct 2019 12:46:56 PM EDT | climateprediction.net | Started download of ALLclim_ancil_7mon_OSTIA_ice_v2_N216_2011-10-01_2012-04-30.gz
Wed 30 Oct 2019 12:46:57 PM EDT | climateprediction.net | Finished download of ALLclim_ancil_7mon_OSTIA_ice_v2_N216_2011-10-01_2012-04-30.gz
Wed 30 Oct 2019 12:46:57 PM EDT | climateprediction.net | Started download of so2dms_rcp45_N216_2009_2020.gz
Wed 30 Oct 2019 12:47:23 PM EDT | climateprediction.net | Finished download of so2dms_rcp45_N216_2009_2020.gz
Wed 30 Oct 2019 12:47:23 PM EDT | climateprediction.net | Started download of ozone_rcp45_N216L38_2009_2020v2.gz
Wed 30 Oct 2019 12:47:25 PM EDT | climateprediction.net | Finished download of ozone_rcp45_N216L38_2009_2020v2.gz
Wed 30 Oct 2019 12:47:30 PM EDT | climateprediction.net | Finished download of a18g_842_atmos.gz

so now WCG will have to wait a while.
38) Message boards : climateprediction.net Science : WCG Climate project (Message 61429)
Posted 30 Oct 2019 by Jean-David Beyer
Post:
New weather prediction project from IBM.

https://www.worldcommunitygrid.org/about_us/viewNewsArticle.do?articleId=608


I signed up for it, but so far:
Wed 30 Oct 2019 12:46:31 PM EDT | World Community Grid | Requesting new tasks for CPU
Wed 30 Oct 2019 12:46:33 PM EDT | World Community Grid | Scheduler request completed: got 0 new tasks
Wed 30 Oct 2019 12:46:33 PM EDT | World Community Grid | No tasks sent
Wed 30 Oct 2019 12:46:33 PM EDT | World Community Grid | No tasks are available for Africa Rainfall Project
Wed 30 Oct 2019 12:46:33 PM EDT | World Community Grid | No tasks are available for Smash Childhood Cancer
Wed 30 Oct 2019 12:46:33 PM EDT | World Community Grid | No tasks are available for the applications you have selected.
39) Message boards : Number crunching : Slow progress rate for HadAM4 at N216 (Message 61410)
Posted 27 Oct 2019 by Jean-David Beyer
Post:
I run 4 HadAM4h on my i7-4790 with 16GB RAM.

They all are above 75%, have run >9 days with estimated 3 days remaining.


My processor is 4-core 64-bit 1.8 GHz Xeon with 10240 KBytes cache, and 16 GBytes of RAM.
I run one hadam4h currently getting 98.8% of a CPU.        153 hours to go. 234 hours run.
I run two hadcm3s currently getting 98.1% each of a CPU. About 343 hours to go. 254 hours run.
I run one hadam4 currently getting 97.6% of a CPU.         230 hours to go. 323 hours run.

They all get a little bit more CPU time when I am not running Boinc Manager, Firefox web browser, and a coupla little processes.
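
From the hours run and hours to go, the fraction done and the projected total follow directly. This sketch takes the "hours to go" figures at face value, although they are the client's estimates and may drift. The dictionary keys are labels for the example only.

```python
# (hours run, hours to go) for each model type, as listed in the post.
# "hours to go" is the client's estimate, so the totals are projections.
tasks = {
    "hadam4h": (234, 153),
    "hadcm3 (each of two)": (254, 343),
    "hadam4": (323, 230),
}

for name, (run_h, left_h) in tasks.items():
    total_h = run_h + left_h
    print(f"{name}: {run_h / total_h:.0%} done, "
          f"about {total_h / 24:.1f} days in total")
```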
40) Message boards : Number crunching : Slow progress rate for HadAM4 at N216 (Message 61405)
Posted 26 Oct 2019 by Jean-David Beyer
Post:
Looking at running times posted by Jean-David Beyer indicates that I might end up crunching for more than a month at current speed.


Bear in mind my processor is an old
GenuineIntel Intel(R) Xeon(R) CPU E5-2603 0 @ 1.80GHz [Family 6 Model 45 Stepping 7]
processor that was pretty fast when I bought it, but runs at about 1/2 the speed of current machines. It does have a relatively large on-chip cache of 10240 KBytes, and 16 GBytes of RAM -- 8 modules of 2GB DMS Certified Memory DDR3-1333 (PC3-10600) 256x72 CL9 1.5v 240 Pin ECC Registered DIMM

My hadam4h is taking 52.3026 sec/TS
My hadcm3s is taking 22.7600 #1
My hadcm3s is taking 22.7597 #2
My hadam4 is taking 25.8856

The N216 model seems to be running twice as fast as the other two, but I am not sure I believe that. I figure on two or three weeks apiece, but I have not been running these larger tasks for very long. I have not even been running any CPDN work units in a long time because I run Linux on this machine. I do not mind how long these work units take. In the past, I have run work units that had three phases to them and took several months apiece.


