Posts by Greg van Paassen

21) Questions and Answers : Wish list : Longer Deadlines - That's What I'd Like To See (Message 47057)
Posted 16 Sep 2013 by Greg van Paassen
Post:
science requirements vs. processing realities. [...] Not so sudden, really. It's been like this for quite a while.

Processing realities, indeed. And limitations of BOINC when used for long duration tasks.

The problem that CPDN has is with task re-issue. The BOINC system won't re-issue a task until either the client reports failure or the deadline has passed. The second case is unfortunately quite common with CPDN's work, even with generous deadlines: a volunteer computer starts on a task, and then ... gets redeployed to do something else. As far as CPDN knows, the task is still "in progress".

If CPDN has set the deadline years into the future, it has to wait years just to learn that it needs to re-issue the task ... and then wait some more. But even scientists have to work to a schedule. ;-)

Until reasonably recently, the way that CPDN compensated for this time-out problem was to issue tasks to 5 or 7 computers at a time. The odds were that one of them would process the task to completion.

But of course the odds were also that more than one, or more than two, of them would process the task to completion, too. There was duplication of effort by crunchers, and extra data transmission, processing and storage costs for the project.
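
To put rough numbers on that trade-off, here is a minimal sketch. It assumes every computer finishes its copy independently and with the same probability p; both the independence and the value p = 0.4 are made up for illustration and are not CPDN figures.

```python
# Rough model of issuing one task to n computers at once.
# ASSUMPTION: each computer completes independently with the same
# probability p. The value of p used below is purely illustrative.

def p_at_least_one(n: int, p: float) -> float:
    """Chance that at least one of n copies runs to completion."""
    return 1.0 - (1.0 - p) ** n


def expected_completions(n: int, p: float) -> float:
    """Average number of copies that run to completion."""
    return n * p


if __name__ == "__main__":
    p = 0.4  # hypothetical per-computer completion rate
    for n in (1, 5, 7):
        print(n, round(p_at_least_one(n, p), 3), round(expected_completions(n, p), 1))
```

Under those assumptions, five copies give a better than 90% chance of getting at least one result back, but on average two of the five finish, so roughly one completed run per task is duplicated effort.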

The new method, of short deadlines and issuing work to only one computer (reissuing if timed out or failed), fixes these problems (and may have been intended to). But now we have this "high priority" problem...

It's all an illustration of Eric Sevareid's line, "the chief cause of problems is solutions".
22) Message boards : Number crunching : still don't get credits since last breakdown (Message 47035)
Posted 13 Sep 2013 by Greg van Paassen
Post:
[...] we should remember that it is the work that counts, not a bunch of useless credits.

Is it, though? We don't know what the work does, because the scientists never engage with the community. What are those 50,000 HadCM3Ns in the latest batch actually for? Does anyone know?

No funds for hardware or systems admin, no communication about purposes or timetables, no proper validation of models before releasing them: whatever it was before, the project is starting to look more and more like a public relations exercise, being given all the priority and attention that scientists usually give public relations. That is, none.

Credits are a substitute for a purpose: a substitute that is acceptable to a lot of us because we're trained up from birth to be good consumers, to want to acquire lots of 'stuff', no matter what the 'stuff' is. But with neither meaning nor credits, why should anyone bother?
23) Message boards : Number crunching : failed upload: can't resolve hostname (Message 47004)
Posted 11 Sep 2013 by Greg van Paassen
Post:
Hi bernadinho,

I see your computer is running other CPDN models as well as that one. How did the uploads for those other models go?

If the other models have uploaded their ..._1.zip files OK, I would just 'abort' the one with the HTTP error.

If not, do they all specify the same host, apid-wattch.badc.rl.ac.uk?
24) Questions and Answers : Windows : Optimise PC build for CPDN (Message 46961)
Posted 5 Sep 2013 by Greg van Paassen
Post:
Thanks for the write-up, Martin.

I don't recall a list of error codes on this discussion board, but the BOINC FAQ service, http://boincfaq.mundayweb.com/index.php, has a section devoted to them (section 6).
25) Message boards : climateprediction.net Science : Energy Efficiency: Combi Boiler (Message 46922)
Posted 1 Sep 2013 by Greg van Paassen
Post:
...Air heating dominates water heating by a large margin
...
PS The roof was insulated courtesy of a council grant just before the weekly spot measurements began in 2008.

Do you have downlights, Iain?

Recessed-into-the-ceiling downlights work hard against ceiling insulation, unless they are the sealed type (which are getting quite affordable, btw). Ventilated downlight fittings act as very effective chimneys, drawing hot air into the roof space. In nearly all houses, the roof space is neither well sealed nor insulated.

(Insulation must be kept 100 - 150 mm away from ventilated downlights, too. Although the resulting gaps seem a small fraction of the total roof area, they have a surprisingly large effect.)

Nearly all houses built before this century leak air like tents, downlights aside. Four to five changes of air per hour was the standard.

An infrared imaging heat loss survey or blower door test could help you decide what to invest in next, if you wish to go further.
26) Message boards : Number crunching : Waiting to run (scheduler wait) (Message 46875)
Posted 26 Aug 2013 by Greg van Paassen
Post:
I have something similar, which I also haven't seen before. Four tasks are running normally, and one looks as though it is running, but it is stuck at timestep 670536 and using no CPU time. The graphics display looks normal, not blue as it usually does when a model goes bad. Restarting the computer does not change things.

hadcm3n_o2h3_1940

Update: see also this thread in the Unix section. Same problem.
27) Message boards : Number crunching : One task completed with 0 credit (Message 46874)
Posted 26 Aug 2013 by Greg van Paassen
Post:
No bug with the task, boulmontjj.

The credit-calculation script ordinarily runs once a day. Since the server started having problems about a month ago, it has not run. (See the News thread for details of the server problems.)

When this script is run again, you will get the credits for the completed task.
28) Message boards : Number crunching : Trickle-up message (Message 46811)
Posted 19 Aug 2013 by Greg van Paassen
Post:
Mike,

What happens when you clear a field?

I don't know what BOINC does, but the normal programming convention when the account owner clears a preference that was previously set is to go back to using the server-wide default (whatever the BOINC project or CPDN decided that was). Internally the system stores a value for that item in the account preferences (-1, say) that means "use the system's default value", and displays that as either the system default value, or blank, or something like '---'.
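
That convention could be sketched like this. Every name and value here is hypothetical; it only illustrates the general idea and is not how BOINC actually stores preferences.

```python
# Sketch of the "sentinel means use the default" convention.
# All names and values are hypothetical; this is NOT BOINC's code.

USE_DEFAULT = -1                           # stored when the owner clears the field
SERVER_DEFAULTS = {"max_cpus_pct": 100.0}  # made-up server-wide default


def effective_value(account_prefs: dict, key: str) -> float:
    """Return the value to apply, falling back to the server default."""
    stored = account_prefs.get(key, USE_DEFAULT)
    return SERVER_DEFAULTS[key] if stored == USE_DEFAULT else stored


def display_value(account_prefs: dict, key: str) -> str:
    """Return what an account web page might show for the field."""
    stored = account_prefs.get(key, USE_DEFAULT)
    return "---" if stored == USE_DEFAULT else str(stored)
```

A cleared field and a never-set field behave the same way here: both resolve to the server default and both display as '---'.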

In BOINC's case there is the complication of multiple projects and a single person having multiple places to set preferences. I expect that clients would revert to the value that was most recently set on any of the attached projects. A little unpredictable for people, especially if the 'most recent' value isn't communicated to the other projects for display on account web pages.

* * *

An alternative to what Les suggests, if you have few computers, is to set the preferences in the BOINC Manager's "Computing Preferences" setting, which is in the "Tools" menu in advanced view.

I like this method because I don't have to think too hard, but it'd be a pain if you had more than a dozen or so clients.
29) Message boards : Number crunching : Not complete - running but not using CPU (Message 46699)
Posted 24 Jul 2013 by Greg van Paassen
Post:
I've had that too, with one model.

A system reboot fixed it, allowed it to carry on and finish successfully. (I needed to apply some updates anyway...)
30) Questions and Answers : Windows : Intel Visual Fortran run-time error (Message 46651)
Posted 19 Jul 2013 by Greg van Paassen
Post:
I understand how HT works, but I am just asking for clarification about Climate WU processing efficiency. It is my experience that this type of processing is enhanced by using as many cores as possible. Whether they are HT or not, overall wall-clock time is reduced for the job. Somewhere along the lines of that 20%. This is significant in my opinion, but I also recognize that successful WU completion is the goal.

Just to be clear, WUs are single-threaded.

On my machine (Core i7, Sandy Bridge, 4 cores, 8 threads), with 4 models running concurrently, each takes about 1.0 seconds per time step (s/ts). With 8 running, each takes about 1.5 s/ts. Doing the arithmetic, doubling the number of WUs running concurrently increases total throughput by one third. It also increases the wall-clock time required to complete any one WU by half.

So with hyperthreading, machines get more done in a year, but each individual WU takes longer to finish.
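
For anyone who wants to check the arithmetic, here it is written out using the figures from my machine. Only the function name is invented.

```python
# Figures from the post: seconds per time step (s/ts) for each model,
# keyed by the number of models running concurrently.
runs = {4: 1.0, 8: 1.5}


def throughput(models: int, s_per_ts: float) -> float:
    """Total time steps per second summed over all running models."""
    return models / s_per_ts


base = throughput(4, runs[4])   # 4 models at 1.0 s/ts
ht = throughput(8, runs[8])     # 8 models at 1.5 s/ts

throughput_gain = ht / base - 1.0           # extra work done per second
slowdown_per_wu = runs[8] / runs[4] - 1.0   # extra time for any single WU

if __name__ == "__main__":
    print(f"throughput gain: {throughput_gain:.1%}")
    print(f"per-WU slowdown: {slowdown_per_wu:.1%}")
```

The gain works out to one third and the per-WU slowdown to one half, as stated.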

HadCM3Ns seem to be sensitive to disk I/O congestion--"impatient". Running fewer models reduces the probability of a "disk traffic jam" causing a model to crash because a disk read or write didn't complete quickly enough. (I think this is what Iain meant about model completion rates.) The degree of impatience seems to vary between different batches of HadCM3Ns.

(For an idea of the numbers: on my machine, at 1.5 s/ts, each model averages about 0.85 MB/s continual disk writing, with spikes up to 7 MB/s during checkpoints (every 72 time steps). During the decadal zip-file uploads, disk activity goes as high as the disk system will support (over 65 MB/s reads and 35 MB/s writes at the same time) for a few seconds.)
31) Questions and Answers : Getting started : Cannot attach to climateprediction (Message 46576)
Posted 2 Jul 2013 by Greg van Paassen
Post:
Bill,

I don't have access to a Mac to experiment, but on other platforms there are (at least) three commands: `boincmgr', which may correspond to the Mac's `boincmanager' (it starts the GUI); `boinccmd', a command-line utility for interacting with BOINC instances from shell scripts; and `boinc', the main program.

If the Mac has a `boinc' executable, that's the one you want. Check with "apropos boinc".

You may have to change to the BOINC app directory, and then run it as "./boinc --attach_project ...".

Also, if a BOINC instance is running, you'll have to shut it down before trying this, according to Ron's instructions.
32) Message boards : Number crunching : Server Status page down (Message 46559)
Posted 1 Jul 2013 by Greg van Paassen
Post:
Trickles are uploading fine - it's the display page that seems to be the problem
Yes, this is what I meant.
33) Message boards : Number crunching : Server Status page down (Message 46551)
Posted 1 Jul 2013 by Greg van Paassen
Post:
Just to be completely explicit about things, and cross all the 'i's and dot all the 't's...no, wait, that's dot all the 'i's and cross all the 't's:-

The specialist who is coming on Tuesday will know that he or she has to fix the problem with trickles not being recorded, as well as the problem with the server status page not showing ... right?

It's just that I haven't seen any mention of the no-trickle problem anywhere on the board yet. So I thought I'd better mention it. Just in case the project team was unaware of it.
34) Message boards : climateprediction.net Science : Misconfiguration e-mail (Message 46451)
Posted 18 Jun 2013 by Greg van Paassen
Post:
My favourite is EDGeSUser. 5307 and 2469 models crashed; none even started successfully. Host IDs 1218239 and 1218323.
35) Message boards : climateprediction.net Science : Misconfiguration e-mail (Message 46450)
Posted 18 Jun 2013 by Greg van Paassen
Post:
Hi Belfry,

There was some discussion about this with Ba two weeks ago on this thread. Ba is aware of the problem and agrees with you, I think.
36) Questions and Answers : Windows : Optimise PC build for CPDN (Message 46413)
Posted 13 Jun 2013 by Greg van Paassen
Post:
I agree with Eirik, assuming he means UPS - uninterruptible power supply. That would probably have a bigger effect on the number of failures than ECC RAM versus non-ECC.

About placement - I think it's fine to put the BOINC programs on the system disk. The CPDN programs live in the data folder, though, IIRC. If you have lots of spare disks you could consider putting the paging file on its own disk - although nowadays, with RAM relatively cheap, paging is less common than it used to be.
37) Questions and Answers : Windows : Optimise PC build for CPDN (Message 46404)
Posted 12 Jun 2013 by Greg van Paassen
Post:
Tough call.

The delay arises from buffering rather than ECC itself. For sequential accesses (as I'd expect with both CPDN models and visual image processing), buffering only delays the first bytes in the read or write request. After that, throughput is the same for the rest of that memory access.

If your (proposed) motherboard can take non-ECC memory, it may require unbuffered memory. Unbuffered ECC memory does exist but I'm not sure how easy it is to buy.

My gut feel, from reading Wikipedia's ECC page and other pages about RAM, is that ECC would be worth a price premium of 20% or so to me.

38) Message boards : Number crunching : Reporting - Errors while computing - (Message 46364)
Posted 4 Jun 2013 by Greg van Paassen
Post:
Hi Ba,

In addition to what Mo said, you have several crashes on 1179592 that look as though they are disk-related. HadCM3Ns are "disk-write-heavy" and seem to be sensitive to sluggish disk response, much more so than the regional models (HadAM3). HadCM3Ns seem to dislike having their code and static data swapped out, and dislike the disks taking too long when they're creating zip files at the 25%, 50%, 75%, and 100% marks.

Probably I'm teaching my grandmother to suck eggs here, but you might want to check the sysctls vm.swappiness, vm.dirty_background_ratio, and vm.dirty_ratio.

Avoid swapping if possible (low swappiness, say 20), and avoid big "surges" in disk activity.

With 64 GB of memory, letting pending disk writes accumulate to 5% of memory (IIRC, that's the default value for vm.dirty_background_ratio) before writing them out would produce noticeable delays when the writing does take place. HadCM3Ns won't like that. Try vm.dirty_background_ratio=1 and vm.dirty_ratio=3 (both are percent of memory), and see if that reduces the number of crashes at the 25% mark, 3110.40 credits. Reducing vm.vfs_cache_pressure may also help, since CPDN models are continually writing to the same files.
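
A back-of-envelope sketch of why the ratio matters on a 64 GB machine follows. Two things are assumptions: the 100 MB/s sustained disk write rate is purely hypothetical, and the kernel actually applies these ratios to available memory rather than total RAM, so the figures are upper-bound estimates.

```python
# How much unwritten ("dirty") data can pile up before background
# writeback starts, for a given vm.dirty_background_ratio.
# SIMPLIFICATION: the ratio is applied to total RAM here; the kernel
# uses available memory, so real thresholds are somewhat lower.

GB = 1000 ** 3
MB = 1000 ** 2


def dirty_threshold_bytes(ram_bytes: int, ratio_percent: float) -> float:
    """Bytes of dirty data allowed to accumulate at the given ratio."""
    return ram_bytes * ratio_percent / 100.0


def flush_seconds(nbytes: float, disk_mb_per_s: float) -> float:
    """Time to write nbytes at a sustained rate (hypothetical disk speed)."""
    return nbytes / (disk_mb_per_s * MB)


ram = 64 * GB
at_5 = dirty_threshold_bytes(ram, 5)   # the value recalled in the post
at_1 = dirty_threshold_bytes(ram, 1)   # the suggested value

if __name__ == "__main__":
    for label, nbytes in (("5%", at_5), ("1%", at_1)):
        print(label, nbytes / GB, "GB,", flush_seconds(nbytes, 100.0), "s at 100 MB/s")
```

On those assumptions, 5% lets about 3.2 GB build up (around half a minute of solid writing at 100 MB/s), against 0.64 GB at 1%.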

Alternatively (or as well), try the 'deadline' scheduler, if you're using CFQ.
39) Questions and Answers : Windows : Optimise PC build for CPDN (Message 46282)
Posted 23 May 2013 by Greg van Paassen
Post:
I'd be interested in the results, Martin, especially if you go Haswell - I believe they're being released in September?

An 8% improvement in FP performance (over Ivy Bridge) means finishing a HadCM3N a day sooner. The improved memory controller in Haswell should further speed up climate models, which do a lot of memory writes. You get better performance per watt -- always a consideration with NZ's electricity prices. And you may be able to put a Broadwell processor in the same board later, if they turn out any good. (Not that I'm trying to sell you on anything...) :)

Disks: as you say, CPDN isn't particularly demanding in terms of disk performance (if you have enough memory for OS disk buffers), but durability is a different story. I measured some HadCM3Ns at 700 MB of disk writes over 10 minutes, each. I'd been considering buying a consumer-grade SSD before that. Not any more. (According to Anandtech, consumer SSDs are designed for ten years of life assuming a mere 10 GB of writes per day.)
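
Scaling that measurement up to a day shows the gap. This sketch assumes the measured 10-minute rate holds around the clock, which may overstate things if the models I measured were unusually busy.

```python
# Daily write volume per model, extrapolated from the measurement above.
# ASSUMPTION: the measured 10-minute rate is sustained all day.

MB_PER_INTERVAL = 700     # measured disk writes per model (from the post)
INTERVAL_MINUTES = 10
DESIGN_GB_PER_DAY = 10    # consumer SSD design figure quoted in the post


def gb_per_day(models: int = 1) -> float:
    """Estimated GB written per day by the given number of models."""
    intervals_per_day = 24 * 60 / INTERVAL_MINUTES
    return models * MB_PER_INTERVAL * intervals_per_day / 1000


ratio = gb_per_day(1) / DESIGN_GB_PER_DAY

if __name__ == "__main__":
    print(gb_per_day(1), "GB/day for one model,", round(ratio, 1), "x the design figure")
```

That is roughly 100 GB per day for a single model, about ten times the design assumption, before counting any other models running on the same disk.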

But, definitely put the Photoshop program files and light table (working) storage on SSD(s). Others have used phrases like "night and day" after making that change.
40) Message boards : Number crunching : Workunit error - check skipped (Message 46277)
Posted 23 May 2013 by Greg van Paassen
Post:
Technically, what it means to BOINC is that your task is the only good one in that work unit (all the earlier tasks failed), so its result cannot be compared to the others.

(BOINC was designed on the assumption that two different computers processing a task would produce results that are identical, bit for bit. If this were true a simple comparison of results would be a useful check for correct data transmission. But climate models break that assumption.)

Since CPDN never runs the BOINC cross-checking ("validation") code, tasks keep whatever "validate state" BOINC's end-of-task code assigns: for failed tasks, "Invalid"; for most good results, "Initial"; and for the last result and only good one of a bad bunch, "Workunit error - check skipped".
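
The three outcomes can be summarised in a small sketch. This is only my reading of the behaviour described above, with hypothetical function and argument names; it is not BOINC's actual end-of-task code.

```python
# Which "validate state" a task ends up showing, as described above.
# Hypothetical names; NOT taken from the BOINC source.

def validate_state(succeeded: bool, others_succeeded: list, is_last_result: bool) -> str:
    """Return the label a task keeps when no validator ever runs."""
    if not succeeded:
        return "Invalid"
    # A good result that arrives last, after every other task in the
    # work unit has failed, has nothing to be compared against.
    if is_last_result and others_succeeded and not any(others_succeeded):
        return "Workunit error - check skipped"
    return "Initial"
```

For example, a good result arriving after two failed tasks in the same work unit gets "Workunit error - check skipped", while a good first result stays at "Initial".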



©2024 climateprediction.net