climateprediction.net home page
Posts by Eirik Redd

1) Message boards : Number crunching : New work Discussion (Message 62593)
Posted 20 Jun 2020 by Eirik Redd
Post:
You can also use the BOINC command line interface (boinccmd) to manage the client. See:


Yes, I was managing BOINC from the command line until I got a workaround. However, I like to play and often build my own BOINC client and manager from source using the testing versions, and I would quite like to get back to being able to do that on the laptop.


Yes, yes, boinccmd has been a great help to me over the years when the GUI fails, as it does from time to time.
But I don't think many users even imagine how useful a CLI interface can be.
My daughter asked me a few months ago, while taking a coding class, "Dad, how do I get out of less?" "q".
How few CPDN volunteers even know what a differential equation is?

How to get more crunchers?
Then there will be more work units.
2) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61862)
Posted 31 Dec 2019 by Eirik Redd
Post:
Looking at the failures for these and the N144 tasks, most of the failures that are not down to missing 32bit libs are from machines running large numbers of these tasks concurrently. (Often 16, 24 or even more in some cases.) Some of these machines are failing nearly everything they get. If this applies to you, please try under computing preferences, reducing the number of cores in use. A lot of computer time is being wasted by this issue.

Me, I say: never load an "Ncore/2N-thread" CPU with more than Ncore models. AND, for the many-core 6-16 core CPUs, allow fewer than the number of cores to run these N216 models.
Preliminary stats from my fastest boxes show that the "biggie many-core boxes" I'm running are most productive at about half the core count. (I've not got any Threadripper or i9-10.)
BUT the Ryzen 9 39xx is twice as fast on these L2- and L3-cache hogs as any other "not-quite-bleeding-edge" box I'm running.

You can look at my public stats.
Near zero fails on the N216 batches.
Random fails on the N144 batches.

I've no clue why.
3) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61861)
Posted 31 Dec 2019 by Eirik Redd
Post:
Well, now that you put me up to it, I checked the last twelve successful completions on all machines.
Nine of them show the segfaults, and three don't. Some of each are from each machine, both Intel and Ryzen.

I won't try to do a statistical analysis.


For me, these "N216" models have never failed. They take almost two weeks, but unlike the "144" things -- no fails here.
Give me more of the "N216".
4) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61513)
Posted 10 Nov 2019 by Eirik Redd
Post:
The project are looking at measures to deal with the large number of computers missing the 32bit libraries or otherwise crashing everything on these and other Linux tasks.

Each task will now get a maximum of 5 attempts rather than the three we have seen in the past. This may mean a small increase in the number of tasks being received that have already failed on other computers. They are also looking at blocking computers from getting tasks till they sort things out. If the latter is effective, I guess the former may only be a temporary measure.

One particularly egregious host was found that had crashed over a thousand of these!


I just saw hadam4h tasks from batches 842 and 843 with _3 and _4 tails download on one of my machines. Glad to take them.

<edit>
Exec summary:
These hadam4h N216 models can run in about a week on even old machines.
How many cores to load before throughput drops off fast -- that's the question.
<edit>


Unfortunately I've a 2-to-3-week backlog of these wu's, mostly on account of early underestimated completion times and optimism about how much help the extra cores on recent Intel and AMD might give.
It seems there is a "sweet spot" for CPUs of various ages and designs, one that depends on the generation, microarchitecture, and load. I can't generalize yet, but here are 3 examples.

AMD Phenom II 6-core: completes 1 model in 5.7 days with no other load. Running 3 models at once on the 6 "cores" takes 19.6 days to finish, or 6.5 days per model.

Intel i7-7700 running 4 models on all 4 cores (no hyperthreading): 8.8 days for 4 models, 2.2 days each. 2 hadam4h on this box take 6.2 days, 3.1 days each. Running 3 at once takes (estimated) 7.1 days, or 2.4 days each. In this case it pays well to run 4 models (compare Sandy Bridge through Haswell).

Intel i7-8700K running 6 models, loading all 6 "cores": 12 days to finish 6 models, 2 days per model. Barely better than the 6.2 days to complete 3 models with only 3 at once, and surely better throughput than running only 2 models on the 6-core box, where the estimate is 5.3 days for 2 to complete -- 2.6 days per model.
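The per-model arithmetic in these three examples boils down to wall-clock days for the batch divided by models in the batch. A minimal sketch (the i7-8700K figures are copied from the post; the helper name is mine):

```python
# Effective throughput when running n models concurrently:
# wall_days is how long the whole batch takes to finish.
def days_per_model(wall_days, n_models):
    """Wall-clock days of machine time per completed model."""
    return wall_days / n_models

# i7-8700K figures from the post: n_models -> wall days for the batch
configs = {2: 5.3, 3: 6.2, 6: 12.0}
rates = {n: days_per_model(d, n) for n, d in configs.items()}

# Lowest days-per-model wins: 6 at once edges out 3 at once.
best = min(rates, key=rates.get)
print(best, round(rates[best], 2))  # → 6 2.0
```

On these numbers, 6-at-once gives 2.0 days/model versus about 2.07 for 3-at-once and 2.65 for 2-at-once, which matches the "barely better" reading above.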

Doh.

Hope to report more on the speed-vs-core-count question on the newer Ryzen 2700X and 3900X. First observation: using all 8 or 12 cores gets no more throughput than using 4 or 6 -- right around a week per model. (Don't know how all that 64 MB of L3 on the 3900X is shared; many other questions.)
5) Message boards : Number crunching : Credits (Message 60713)
Posted 25 Jul 2019 by Eirik Redd
Post:
Looks like that worked, at least for me here.
Updated credits consistent with last few weeks diminishing work.
Thanks
6) Message boards : Number crunching : Upload failures (Message 60539)
Posted 2 Jul 2019 by Eirik Redd
Post:
Figuring "how long to clear uploads"
Right now one of my 3 fast boxes (Ryzen 2 2700X) is the only one I'm letting upload at the moment. It has about 40 92-MB safr50 zips queued for upload and about 160 76-MB sam50 uploads queued. It has been running all through this recent incident, but was disconnected from the internet for part of that.
It uploaded about 80 files of various sizes in the last 3 hours. So at least 12 hours to clear its upload queue.
Two more fast boxes will take another 30 hours. The old slow ones, not much worry.
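That back-of-envelope estimate can be sketched in a few lines. The queue contents and the 80-files-in-3-hours burst are from the post; the sustained drain rates are my assumptions (the burst rate is really an upper bound, so halving it for congestion is one way to land near the "at least 12 hours" figure):

```python
# Rough upload-queue clearance estimate for one box.
queued_mb = 40 * 92 + 160 * 76   # ~40 safr50 (92 MB) + ~160 sam50 (76 MB) zips
files_queued = 40 + 160

def hours_to_clear(files, files_per_hour):
    """Naive clearance time at a constant drain rate."""
    return files / files_per_hour

# Optimistic floor using the recent burst rate (80 files / 3 h):
floor = hours_to_clear(files_queued, 80 / 3.0)        # 7.5 h
# Assume the server only sustains half that under load:
pessimistic = hours_to_clear(files_queued, 40 / 3.0)  # 15 h
print(queued_mb, round(floor, 1), round(pessimistic, 1))
```

Either way it's hours-to-days per box, not weeks, which is why the catch-up guess below is nearer 4 days than 15.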

So I get a significantly sooner time to catch up than Speedy's 15.7 days. Nearer 4 days at a guess. But we'll all see how it goes.

Remember one of Murphy's mottoes: "Constants aren't, variables won't."
7) Message boards : Number crunching : About the new many core computers (Message 59928)
Posted 6 Apr 2019 by Eirik Redd
Post:
I'm thinking that nowadays the number of specialized FPUs per core is unknown to us users. Makes sense to me to experiment a bit.

I think the OS, and also web browsers and email, don't use floating point much, if at all. So yeah, it's the FPUs per core, and maybe the memory bandwidth, that may be the bottlenecks.


I know from testing that models on hyperthreaded computers run better if hyperthreading is left on, but not used.
E.g., on my Haswell, I only run 4 models.

This leaves the rest for the OS.

But how well do the 16, 24, 32 core computers fare?

It's possible that bottlenecks in getting to the FPU for so many at once may be a hindrance.

Any thoughts?
8) Message boards : Number crunching : About the new many core computers (Message 59927)
Posted 6 Apr 2019 by Eirik Redd
Post:
I agree with what's been posted so far (except I don't tweak clocks; know nothing about that).
A few years ago I did serious tests on the 4-core/8-thread Intels (Ivy and Sandy) and found that, on Linux, with SMT enabled and no non-CPDN use of the machines, running 5 models gained 5% throughput -- meaning the slowdown from running 5 rather than 4 was barely made up for by the +25% number of models running. More than 5 models per 8 threads lost big, and lost more the more models were added per SMT, up to a big loss trying the maximum of 8 threads.
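The trade-off in that test is just models-running times per-model speed. A sketch of the arithmetic (the 5% gain is from the post; the implied per-model slowdown is derived from it, not measured separately):

```python
# SMT oversubscription: throughput = concurrent models * relative speed.
def throughput(n_models, rel_speed):
    """Aggregate model-completions per unit time, in arbitrary units."""
    return n_models * rel_speed

baseline = throughput(4, 1.0)        # 4 models, one per physical core

# A +5% throughput gain at 5 models implies each model ran at
# 4 * 1.05 / 5 = 84% of its uncontended speed:
implied_speed = baseline * 1.05 / 5
assert abs(throughput(5, implied_speed) / baseline - 1.05) < 1e-9
print(round(implied_speed, 2))  # → 0.84
```

So each extra model past the core count has to cost less than its share of slowdown (here, under ~20% per model) to pay off, which is why 6, 7, or 8 models lost big.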

Now, running Wine under Linux: my more-than-4-core/8-thread experience is recent -- within the last year. I've done no formal tests on what I now have, and haven't done any tests with more models than cores.

Willing to try some tests if 6/12 Intel and 8/16 AMD info will help others. I can't buy an AMD Threadripper 16-core or a similar Xeon.

Both are fast: the Ryzen 7 2700+ is much faster than AMD's previous CPUs,
and the Intel i7-8700K CPU @ 3.70GHz is surprisingly much faster than the Intel i7-7700 CPU @ 3.60GHz, even allowing for six cores versus four.

How to test? Please advise
9) Message boards : Number crunching : Credits (Message 59727)
Posted 7 Mar 2019 by Eirik Redd
Post:
Credits posted yesterday.
10) Message boards : Number crunching : New work Discussion (Message 59726)
Posted 7 Mar 2019 by Eirik Redd
Post:
[Nairb wrote]... My question is, are these models restartable. In other words if I get 25 days into a model and there is a power cut...

The model saves intermediate files as it runs - "checkpoint" files - and these files should allow the model to continue after a PC restart. Sometimes the models won't restart from the checkpoint file and will fail, but usually the models are fine.

Right. No worries.
11) Message boards : Number crunching : New work Discussion (Message 59725)
Posted 7 Mar 2019 by Eirik Redd
Post:
Just managed to get 4 new tasks. They are the Wah2_safr50-... It says they have a runtime of approx. 30 days. My question is: are these models restartable? In other words, if I get 25 days into a model and there is a power cut... and I lose the 4 models, it's a waste of effort.

ta
Nairb


I've been running these models for 10 years.
Lately when one of my 10 Windows and/or Linux computers fails or the power drops,
no worries.
Restart works OK. Work not wasted. There have been models that some people here say failed after restart. Maybe so. But for me -- not an issue.
Keep crunching.
12) Message boards : Number crunching : New work Discussion (Message 59460)
Posted 18 Jan 2019 by Eirik Redd
Post:
Perfect time for the project to clear the problems log and work on the wish list ;)

Also a good time for crunchers to clean heatsinks, review dusty old software and bloatware, and catch up on reboots. :)
And search for other projects to work on until --
13) Message boards : Number crunching : New work Discussion (Message 59457)
Posted 18 Jan 2019 by Eirik Redd
Post:
And 49 weeks in the future :)
14) Message boards : Number crunching : transient HTTP error (Message 59362)
Posted 9 Jan 2019 by Eirik Redd
Post:
The future is a strange place. You'll just have to wait and see.


Too true. Thanks for short sweet true words. And thanks for your long time support of this project.
We like Les!
15) Message boards : Number crunching : Batch 777 safr50 (Message 59213)
Posted 21 Dec 2018 by Eirik Redd
Post:
I noticed yesterday that several of this batch had failed on a new machine I've been monitoring more closely.
Looking further over recent workunit failures of this batch on my machines, it seems like this:

At about the halfway point they fail with signal 11. This happens on Intel and AMD, on Windows 10 in both virtual and real machines, and with Wine on Ubuntu bionic and Debian stretch. Probably less than a third of the wu's fail like this -- the sample size is too small to make a better estimate.

Anybody else notice this?
Thinking this is something the batch creators will figure out.
<edit>
The majority of these rather short workunits seem to complete and upload OK.
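To put a number on how fuzzy a "less than a third" estimate is from a handful of workunits, here's a Wilson score interval sketch. The counts (4 failures out of 12) are purely illustrative -- the post doesn't give exact numbers:

```python
import math

def wilson_interval(failures, n, z=1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# With, say, 4 of 12 wu's failing, the true failure rate could
# plausibly be anywhere from roughly 14% to 61%:
lo, hi = wilson_interval(4, 12)
print(round(lo, 2), round(hi, 2))
```

Which is exactly why "sample size too small to make a better estimate" is the right call here.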
16) Message boards : Number crunching : 72 days for wah2_sam25? (Message 59133)
Posted 6 Dec 2018 by Eirik Redd
Post:
Yeah -- those long-running sam models look like they'll take 6-12 weeks on a typical machine running 24/7 per core, but with fewer problems than some other long-runners.
Eh? I've got a few; I'll let them run until they die (don't think that's likely).

and I have two sam25s at 10% on the 10th day. Expected to finish in 100 days ;) I've reduced the load to 2 full cores, so let's see if I get some gain on the sam25s.


No real difference between 2 or 4 cores on my i5-2520M; sec/TS is almost the same. And I still keep one global from 766, now at 61% after 14 days. I will just leave it until it crashes, as expected.
17) Message boards : Number crunching : Batch 774 (safr50) (Message 59124)
Posted 1 Dec 2018 by Eirik Redd
Post:
I had 5 overnight and this morning. All failed within 90secs. When I get problem sets like this I pause anything already running and start the rogue ones. Gets through the failures quicker.

Right -- push the likely misconfigured units out the door and clear the queue for the next good batch.
18) Message boards : Number crunching : Upload server is out of disk space (Message 59089)
Posted 25 Nov 2018 by Eirik Redd
Post:
Isn't anyone on the server side capable of some basic maths to determine the disc space required for all the uploads? This is about the fourth time in ten days.


Server side capable? What a joke! The server-side cloud is obviously run by undergrads with zero funding, no worries, and zero maths.
19) Message boards : Number crunching : Upload server is out of disk space (Message 59034)
Posted 20 Nov 2018 by Eirik Redd
Post:
Lost another e0n7 batch 769 zips 60-145 absent
e0os batch 766 zips 90-145 absent
e0nv batch 766 zips 133-145 absent
e1n4 batch 769 zips 124-145 absent
no restart zip files for any of these.

It puzzles me. These are all running on Windows 10 (1803) in VirtualBox under Ubuntu 18.04 on an Intel i7-3770.
I had suspended network activity during the latest upload problems.
I'll check if any of my other machines have reported similar problems. I've suspended computation on those that had their uploads still queued.
<edit>
Don't see any other "exceeded disk limit" errors on other machines. But the one that got the "exceeded disk limit" errors has by far the most uploads queued - over 100 50MB zip files.
I'll sleep on it; it doesn't seem to be directly related to the main upload problem.
20) Message boards : Number crunching : Upload server is out of disk space (Message 59031)
Posted 20 Nov 2018 by Eirik Redd
Post:
Oops, getting "No space left on server" error again now on a global batch 766 task.

Also, now getting
11/19/2018 20:32:10 | climateprediction.net | Aborting task wah2_global_e1n4_208812_145_769_011664000_1: exceeded disk limit: 1920.80MB > 1907.35MB
11/19/2018 21:32:11 | climateprediction.net | Aborting task wah2_global_e0nv_200412_145_766_011655231_1: exceeded disk limit: 1921.15MB > 1907.35MB
11/19/2018 21:44:12 | climateprediction.net | Aborting task wah2_global_e0os_200412_145_766_011655264_0: exceeded disk limit: 1920.43MB > 1907.35MB


This looks like some local disk-usage limit coded into the task. Maybe the local limit is exceeded because uploads aren't happening?
I'll suspend computing on the machine that's getting these errors, for now.
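The 1907.35 MB in those log lines looks like a 2,000,000,000-byte per-task disk bound (BOINC's per-workunit rsc_disk_bound, if I read it right -- that exact value is my assumption) expressed in binary megabytes:

```python
# Why the client logs "exceeded disk limit: ... > 1907.35MB":
# a 2e9-byte bound converted to MiB (1 MiB = 1024 * 1024 bytes).
limit_bytes = 2_000_000_000          # assumed rsc_disk_bound value
limit_mib = limit_bytes / (1024 * 1024)
print(round(limit_mib, 2))  # → 1907.35
```

If that's right, the tasks aren't misbehaving per se -- their on-disk footprint (including zips stuck waiting for upload) just crept past a fixed 2 GB ceiling.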



©2020 climateprediction.net