climateprediction.net home page
Posts by Eirik Redd

21) Message boards : Number crunching : Little work, yet the most "important" thing in the world? (Message 63949)
Posted 7 May 2021 by Eirik Redd
Post:
It's mostly because they use Unix on mainframe computers.

Or on what they now call "supercomputers"
22) Message boards : Number crunching : Little work, yet the most "important" thing in the world? (Message 63820)
Posted 9 Apr 2021 by Eirik Redd
Post:
There's something like 400 CPU-years of work queued.


Yes, and all of it is set to run only on OSes that are run by only a tiny minority of the public. Windows is run on about 87% of all the home computers in the world.


I personally don't have a problem with that; a Linux OS can be burnt onto a DVD and be up and running in under 15 minutes. But that's beside my point. My point is I can't find WHERE it says that X type of task is only for 32-bit Linux OSes, while Y type of task is for 64-bit Linux OSes and Z type of task over there is for Windows 7 or 10 PCs. A bit more info instead of guessing would help, especially since the default timeout of 3600 seconds makes it useless to try to figure out what kind of task will work on my PC.

How it is now -- Windows 10 has a free subsystem in the Windows Store that makes it easy-peasy to run Ubuntu and a few other Linux distributions under Windows 10. I have tried that myself. It works reliably, but I've not compared efficiency.
And for a long time Linux has had "Wine" (not part of Linux itself, but an independent FOSS project), which has worked for me.

Actually, on Win 10 or Linux it takes a small bit of work --

What I say is, with an hour reading the documentation,
anybody can run "Windows under Linux" or "Linux under Windows".
Both "just work" -- it takes an hour or two -- and you can run most apps either way.

If anybody wants it easier than that -- Duh

e
23) Message boards : Number crunching : The trickles are there, but got no credit (Message 63620)
Posted 8 Mar 2021 by Eirik Redd
Post:
I sent an e-mail on this issue to the person who usually troubleshoots such problems.


Any response?


Response to moderator's email list.


Thanks, yes I am aware of this. It should be running again shortly.


This may well mean that the script will not run again till the early hours of Thursday morning.


And might fail again at that time.
No worries, "Credit is the least of --"

Whatever -- keep on
24) Message boards : Number crunching : New work Discussion (Message 62593)
Posted 20 Jun 2020 by Eirik Redd
Post:
You can also use the BOINC command line interface to manage the client. See:


Yes I was managing BOINC from the command line until I got a work around. However I like to play and often build my own BOINC client and manager from source code using the testing versions and would quite like to get back into being able to do that on the laptop.


Yes, yes, boinccmd has been a great help to me over the years when the GUI fails, as it does from time to time.
But I don't think many users even imagine how useful a CLI can be.
My daughter asked me a few months ago, while taking a coding class -- "Dad, how do I get out of less?" "q".
How few CPDN volunteers even know what a differential equation is?
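For anyone who has not tried boinccmd, here is a minimal sketch of driving it from a script. It assumes boinccmd is installed and on the PATH and that the BOINC client is running on the same machine. The helper names and the PROJECT_URL placeholder are made up for illustration; only the --get_tasks and --project options are real boinccmd options.

```python
import shutil
import subprocess

# Placeholder only: substitute the project URL shown by your own client.
PROJECT_URL = "PROJECT_URL"


def boinccmd_args(*options):
    """Build the argument list for one boinccmd call (hypothetical helper)."""
    return ["boinccmd", *options]


def run_boinccmd(*options):
    """Run boinccmd and return its text output, or None if it is not installed."""
    if shutil.which("boinccmd") is None:
        return None
    result = subprocess.run(
        boinccmd_args(*options),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


if __name__ == "__main__":
    # List the tasks the local client currently holds.
    print(run_boinccmd("--get_tasks"))
    # Ask the client to contact the project scheduler.
    print(run_boinccmd("--project", PROJECT_URL, "update"))
```

Nothing here is specific to CPDN; it only wraps the same commands you would type in a terminal when the GUI manager is unavailable.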

How to get more crunchers?
Then there will be more work units.
25) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61862)
Posted 31 Dec 2019 by Eirik Redd
Post:
Looking at the failures for these and the N144 tasks, most of the failures that are not down to missing 32bit libs are from machines running large numbers of these tasks concurrently. (Often 16, 24 or even more in some cases.) Some of these machines are failing nearly everything they get. If this applies to you, please try under computing preferences, reducing the number of cores in use. A lot of computer time is being wasted by this issue.

Me, I say never load an "N-core/2N-thread" CPU beyond N running tasks. AND for the many-core (6-16 core) CPUs, allow fewer than the number of cores to run these N216 models.
Preliminary stats from my fastest boxes show that the big many-core boxes I'm running are most productive at about half the core count. (I don't have any Threadripper or i9-10.)
BUT the Ryzen 9 39xx is twice as fast on these L2 and L3 hogs as any other not-quite-bleeding-edge box I'm running.

You can look at my public stats.
Near zero fails on the N216 batches.
Random fails on the N144 batches.

I've no clue why
26) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61861)
Posted 31 Dec 2019 by Eirik Redd
Post:
Well, now that you put me up to it, I checked the last twelve successful completions on all machines.
Nine of them show the segfaults, and three don't. Some of each are from each machine, both Intel and Ryzen.

I won't try to do a statistical analysis.


For me, these "N216" have never failed. They take almost two weeks but unlike the "144" things -- no fails here
Give me more of the "N216"
27) Message boards : Number crunching : UK Met Office HadAM4 at N216 resolution (Message 61513)
Posted 10 Nov 2019 by Eirik Redd
Post:
The project are looking at measures to deal with the large number of computers missing the 32bit libraries or otherwise crashing everything on these and other Linux tasks.

Each task will now get a maximum of 5 attempts rather than the three we have seen in the past. This may mean a small increase in the number of tasks that fail on all computers being received. They are also looking at blocking computers from getting tasks till they sort things out. If the latter is effective I guess the former may only be a temporary measure.

One particularly egregious host was found that had crashed over a thousand of these!


I just saw hadam4h tasks from batches 842 and 843, with _3 and _4 tails, download to one of my machines. Glad to take them.

<edit>
Exec summary
These hadam4h N216 models can run in about a week on even old machines.
How many cores to load before throughput drops off fast -- that's the question.
<edit>


Unfortunately I've a 2-to-3-week backlog of these WUs, mostly on account of early underestimated completion times and optimism about how much help the extra cores on recent Intel and AMD CPUs might give.
It seems there is a "sweet spot" for various CPUs of various ages and designs, a minimum that depends on the generation, microarchitecture, and load. I can't generalize yet, but here are 3 examples.

AMD Phenom II 6-core: completes 1 model in 5.7 days with no other load. Running 3 models at once on the 6 "cores", all 3 take 19.6 days to finish, or 6.5 days per model.

Intel i7-7700: running 4 models on all 4 cores (no threading) takes 8.8 days for 4 models, 2.2 days each. 2 hadam4h on this box take 6.2 days, 3.1 days each. Running 3 at once takes (estimated) 7.1 days, or 2.4 CPU-days each. In this case it pays well to run 4 models (compare Sandy Bridge through Haswell).

Intel i7-8700K: running 6 models, so loading all 6 "cores", takes 12 days to finish 6 models, 2 days per model. Barely better than the 6.2 days to complete 3 models with only 3 at once, and certainly better throughput than running only 2 models on the 6-core box, where the estimate is 5.3 days for 2 to complete, or 2.6 days per model.
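The per-model figures in these three examples are just batch time divided by the number of models run at once. A small sketch of that arithmetic, using only the numbers quoted above (the function and variable names are made up for illustration, and two of the batch times are the estimates already flagged in the text):

```python
# Figures copied from the three examples: (CPU, models run at once, days for the batch).
REPORTED_RUNS = [
    ("AMD Phenom II 6-core", 1, 5.7),
    ("AMD Phenom II 6-core", 3, 19.6),
    ("Intel i7-7700", 2, 6.2),
    ("Intel i7-7700", 3, 7.1),    # estimated
    ("Intel i7-7700", 4, 8.8),
    ("Intel i7-8700K", 2, 5.3),   # estimated
    ("Intel i7-8700K", 3, 6.2),
    ("Intel i7-8700K", 6, 12.0),
]


def days_per_model(batch_days, models_at_once):
    """Effective days of wall-clock time spent per completed model."""
    return batch_days / models_at_once


def best_run_per_cpu(runs):
    """For each CPU, pick the concurrency with the lowest days per model."""
    best = {}
    for cpu, models, batch_days in runs:
        per_model = days_per_model(batch_days, models)
        if cpu not in best or per_model < best[cpu][1]:
            best[cpu] = (models, per_model)
    return best


if __name__ == "__main__":
    for cpu, models, batch_days in REPORTED_RUNS:
        per_model = days_per_model(batch_days, models)
        print(f"{cpu}: {models} at once -> {per_model:.2f} days per model")
    for cpu, (models, per_model) in best_run_per_cpu(REPORTED_RUNS).items():
        print(f"best for {cpu}: {models} at once ({per_model:.2f} days per model)")
```

With so few data points this only restates the examples; it is not a general rule for other CPUs.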

Doh.

Hope to report more on the speed vs core-count thing on the newer Ryzen 2700X and 3900X. First observation is that using all 8 or 12 cores gets no more throughput than using 4 or 6: right around a week per model. (I don't know how all that 64 MB of L3 on the 3900X is shared, among many other questions.)
28) Message boards : Number crunching : Credits (Message 60713)
Posted 25 Jul 2019 by Eirik Redd
Post:
Looks like that worked, at least for me here.
Updated credits are consistent with the last few weeks' diminishing work.
Thanks
29) Message boards : Number crunching : Upload failures (Message 60539)
Posted 2 Jul 2019 by Eirik Redd
Post:
Figuring "how long to clear uploads"
Right now one of my 3 fast boxes (Ryzen 2 2700X) is the only one I'm letting upload. It has about 40 safr50 uploads of 92 MB each queued and about 160 sam50 uploads of 76 MB each queued. It has been running all through this recent incident, but was disconnected from the internet for part of that.
It uploaded about 80 files of various sizes in the last 3 hours. So at least 12 hours to clear its upload queue.
Two more fast boxes will take another 30 hours. The old slow ones are not much of a worry.

So I estimate a significantly shorter catch-up time than Speedy's 15.7 days. Nearer 4 days at a guess. But we'll all see how it goes.
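A rough sketch of this kind of back-of-envelope estimate, using only the counts and sizes quoted for the one box (the function names are made up for illustration). It divides by file count, so it ignores file sizes and any slowdown on the server side, and it comes out lower than the "at least 12 hours" given above; since the 80 finished uploads were of various sizes, a count-based rate is only a rough guide.

```python
def queue_size_mb(batches):
    """Total size of the upload queue, given (file_count, size_mb) pairs."""
    return sum(count * size_mb for count, size_mb in batches)


def hours_to_clear(files_queued, files_done, hours_observed):
    """Hours to drain the queue if the observed files-per-hour rate holds."""
    files_per_hour = files_done / hours_observed
    return files_queued / files_per_hour


if __name__ == "__main__":
    # About 40 files of 92 MB and about 160 files of 76 MB are queued.
    queue = [(40, 92), (160, 76)]
    total_files = sum(count for count, _ in queue)
    print(f"queued: {total_files} files, {queue_size_mb(queue)} MB")
    # About 80 files went up in the last 3 hours.
    print(f"count-based estimate: {hours_to_clear(total_files, 80, 3):.1f} hours")
```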

Remember one of Murphy's mottoes: "Constants aren't, variables won't."
30) Message boards : Number crunching : About the new many core computers (Message 59928)
Posted 6 Apr 2019 by Eirik Redd
Post:
I'm thinking that nowadays the number of specialized FPUs per core is unknown to us users. Makes sense to me to experiment a bit.

I think the OS, and also web browsers and email, don't use floating-point much, if at all. So yeah, it's the FPUs per core, and maybe the memory bandwidth, that may be the bottlenecks.


I know from testing, that models on hyperthreaded computers run better if the hyperthreading is left on, but not used.
e.g. on my Haswell, only run 4 models.

This leaves the rest for the OS.

But how well do the 16, 24, 32 core computers fare?

It's possible that bottlenecks in getting to the FPU for so many at once may be a hindrance.

Any thoughts?
31) Message boards : Number crunching : About the new many core computers (Message 59927)
Posted 6 Apr 2019 by Eirik Redd
Post:
I agree with what's been posted so far (except I don't tweak clocks, and know nothing about that).
A few years ago I did serious tests on the 4-core/8-thread Intels (Ivy Bridge and Sandy Bridge) and found that, on Linux, with SMT enabled and no non-CPDN use of the machines, running 5 models gained 5% throughput -- meaning the slowdown from running 5 rather than 4 was barely made up for by the +25% number of models running. More than 5 models per 8 threads lost big, and the loss grew with every extra model, up to a big loss at the maximum of 8 threads.
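To put a number on "barely made up for": throughput is models running divided by time per model, so a 5% throughput gain from 5 models instead of 4 implies how much slower each individual model ran. A minimal sketch of that arithmetic (the function name is made up for illustration; the inputs are the figures from the test described above):

```python
def per_model_slowdown(models_before, models_after, throughput_gain):
    """Ratio time_after / time_before for a single model.

    throughput = models / time_per_model, so
    (models_after / time_after) = (1 + gain) * (models_before / time_before)
    which rearranges to the expression below.
    """
    return (models_after / models_before) / (1.0 + throughput_gain)


if __name__ == "__main__":
    # 5 models instead of 4, with a 5% gain in overall throughput.
    ratio = per_model_slowdown(4, 5, 0.05)
    print(f"each model takes about {100 * (ratio - 1):.0f}% longer")
```

So each model ran roughly a fifth slower, and the extra model only just outweighed that.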

Now, running Wine under Linux: my experience with more than 4 cores/8 threads is recent, within the last year. I've done no formal tests on what I now have, and haven't done any tests with more models than cores.

Willing to try some tests if 6/12 Intel and 8/16 AMD info will help others. I can't buy an AMD Threadripper 16-core or a similar Xeon.

Both are fast: the Ryzen 7 2700+ is much faster than AMD's previous CPUs,
and the Intel i7-8700K CPU @ 3.70GHz is surprisingly much faster than the Intel i7-7700 CPU @ 3.60GHz, even allowing for six cores versus four.

How to test? Please advise
32) Message boards : Number crunching : Credits (Message 59727)
Posted 7 Mar 2019 by Eirik Redd
Post:
Credits posted yesterday.
33) Message boards : Number crunching : New work Discussion (Message 59726)
Posted 7 Mar 2019 by Eirik Redd
Post:
[Nairb wrote]... My question is, are these models restartable. In other words if I get 25 days into a model and there is a power cut...

The model saves intermediate files as it runs - "checkpoint" files - and these files should allow the model to continue after a PC restart. Sometimes the models won't restart from the checkpoint file and will fail, but usually the models are fine.

Right. No worries.
34) Message boards : Number crunching : New work Discussion (Message 59725)
Posted 7 Mar 2019 by Eirik Redd
Post:
Just managed to get 4 new tasks. They are the Wah2_safr50-... It says they have a runtime of approx 30 days. My question is, are these models restartable? In other words, if I get 25 days into a model and there is a power cut... and I lose the 4 models, it's a waste of effort.

ta
Nairb


I've been running these models for 10 years.
Lately when one of my 10 Windows and/or Linux computers fails or the power drops,
no worries.
Restart works OK. Work not wasted. There have been models that some people here say failed after restart. Maybe so. But for me -- not an issue.
Keep crunching.
35) Message boards : Number crunching : New work Discussion (Message 59460)
Posted 18 Jan 2019 by Eirik Redd
Post:
Perfect time for the project to clear the problems log and work on the wish list ;)

Also good time for crunchers to clean heatsinks, and review dusty old software and bloatware, and catch up on reboots. :) smiley :)
And search for other projects to work on until --
36) Message boards : Number crunching : New work Discussion (Message 59457)
Posted 18 Jan 2019 by Eirik Redd
Post:
And 49 weeks in the future :)
37) Message boards : Number crunching : transient HTTP error (Message 59362)
Posted 9 Jan 2019 by Eirik Redd
Post:
The future is a strange place. You'll just have to wait and see.


Too true. Thanks for short sweet true words. And thanks for your long time support of this project.
We like Les!
38) Message boards : Number crunching : Batch 777 safr50 (Message 59213)
Posted 21 Dec 2018 by Eirik Redd
Post:
I noticed yesterday that several of this batch had failed on a new machine I've been monitoring more closely.
Looking further over recent workunit failures of this batch on my machines, it seems that at about the halfway point they fail with signal 11. This happens on Intel and AMD, on Windows 10 both in virtual and real machines, and with Wine on Ubuntu bionic and Debian stretch. Probably less than a third of the WUs fail like this -- the sample size is too small to make a better estimate.

Anybody else notice this?
Thinking this is something the batch creators will figure out.
<edit>
The majority of these rather short workunits seem to complete and upload OK
39) Message boards : Number crunching : 72 days for wah2_sam25? (Message 59133)
Posted 6 Dec 2018 by Eirik Redd
Post:
Yeah -- those long-running sam models look like taking 6-12 weeks per core on a typical machine running 24/7, but with fewer problems than some other long-runners.
Eh? I've got a few, and will let them run until they die (which I don't think is likely).

and I have two sam25s at 10% on 10th day. Expected to finish in 100 days ;) I've reduced the load to 2 full cores so let's see if I get some gain on the sam25s


No real difference between 2 or 4 cores on my i5-2520M; sec/TS is almost the same. And I still keep one global from 766, now at 61% after 14 days. I will just leave it until it crashes, as expected.
40) Message boards : Number crunching : Batch 774 (safr50) (Message 59124)
Posted 1 Dec 2018 by Eirik Redd
Post:
I had 5 overnight and this morning. All failed within 90secs. When I get problem sets like this I pause anything already running and start the rogue ones. Gets through the failures quicker.

Right - push the likely misconfigured units out the door, clear the queue for the next good batch.

