Strange counter on workunit

NewtonianRefractor
Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 34890 - Posted: 4 Sep 2008, 5:12:39 UTC
Last modified: 4 Sep 2008, 5:14:12 UTC

Can somebody please explain to me why the CPU time counter on the workunit suddenly jumped down to a much lower number?
http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=7969311

It went from 4,477,594 seconds to 249,438 seconds, but the timestamp kept incrementing.

(This is not my computer by the way)
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 34891 - Posted: 4 Sep 2008, 5:50:14 UTC

Not sure if it's related, but the owner of that computer is still running BOINC version 4.19.
All current climate models require version 5 of BOINC to work properly.

Please have them upgrade immediately or stop trying to crunch climate models.
Otherwise they're just wasting the models.

Look at the number of models for this computer and this one.

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 34892 - Posted: 4 Sep 2008, 9:19:39 UTC

That model has at least completed, which is more than most on that machine. There have been a number of reports of CPU time anomalies over the years - all on Linux machines, I think. No-one has got to the bottom of it as far as I know. The problem is benign: the models complete successfully, but with saw-tooth CPU time values and sec/TS.
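For anyone who wants to spot such saw-tooth resets in a list of per-trickle CPU times, here is a minimal Python sketch. It assumes the cumulative CPU times have been copied by hand from the result page into a list; the function name and every sample value except 4,477,594 and 249,438 (the two figures quoted at the start of this thread) are hypothetical.

```python
def find_cpu_time_resets(cpu_seconds):
    """Return (index, previous, current) for each entry where cumulative CPU time went down."""
    resets = []
    for i in range(1, len(cpu_seconds)):
        if cpu_seconds[i] < cpu_seconds[i - 1]:
            resets.append((i, cpu_seconds[i - 1], cpu_seconds[i]))
    return resets


# Hypothetical series: only 4,477,594 and 249,438 come from the thread.
reported = [4_300_000, 4_477_594, 249_438, 260_000]
print(find_cpu_time_resets(reported))
```

A cumulative counter should never decrease, so any hit marks a reset. The sketch only locates resets; it says nothing about what causes them.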

As Les says, upgrading to at least BOINC version 5 is a must.
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 34893 - Posted: 4 Sep 2008, 11:12:09 UTC
Last modified: 4 Sep 2008, 11:13:41 UTC

If the owner of these computers doesn't know a lot about BOINC, it would probably be better for them to upgrade to version 5.10.45 than to BOINC version 6. (On the BOINC download page click on 'All versions'.)

I think something may have gone wrong with the processing of that HADAM about 5 trickles before the end. The sec/timestep suddenly went down to a much lower number, as if some or all of the data wasn't being processed. This didn't happen on the other computers running the same workunit.

A BOINC upgrade will probably fix all this. No current CPDN models are compatible with BOINC version 4.

We've posted about new versions of BOINC and the models in the News thread which is at the top of Number Crunching.
NewtonianRefractor
Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 34896 - Posted: 5 Sep 2008, 0:20:30 UTC

Well, at least the time series on all three computers look the same:

[Graphs for the three computers were embedded here; images not preserved.]

But there are slight variations. How much variation is normal?
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 34897 - Posted: 5 Sep 2008, 0:26:43 UTC

[NewtonianRefractor wrote:] How much variation is normal?


Only the researchers will know that. The data displayed is only a tiny part of the results, and is only 'eye candy' for the crunchers.
The bulk of the info is analysed by statistics programs to check its validity.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 34898 - Posted: 5 Sep 2008, 0:32:49 UTC

One thing I've never understood about these HADAM graphs is why they only seem to show results for 11 months although the models run for a whole year.

I'd already looked at the model's graphs to see whether there was any abnormality near the end, but none of the models in the WU (or any other HADAMs as far as I know) have the lines extending to the end of March.


DaveG27
Joined: 8 Nov 06
Posts: 18
Credit: 2,425,895
RAC: 0
Message 34900 - Posted: 5 Sep 2008, 8:19:32 UTC
Last modified: 5 Sep 2008, 8:20:57 UTC

The graph for 7969311 is slightly different from the other two, which are identical.
Tip: if you load only the graphs into three different tabs and flick between the tabs, you quickly see the differences. It works in Firefox; I assume it works in other browsers too.

Dave
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 34902 - Posted: 5 Sep 2008, 9:54:03 UTC
Last modified: 5 Sep 2008, 10:28:45 UTC

Hi Dave

The researchers expect each model in a workunit to produce slightly different results. The longer the model, the more likely it is that differences will develop. It's because of the Lorenz 'butterfly' effect. The slightest difference in computation affects everything that happens after that.
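To illustrate the butterfly effect in general terms, here is a minimal Python sketch using the classic three-variable Lorenz system, not a climate model. The parameter values, the forward-Euler step, the starting point and the size of the perturbation are all illustrative assumptions and have nothing to do with how HADAM or HADSM actually work.

```python
def lorenz_step(state, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz system by one forward-Euler step."""
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return (x + dt * dx, y + dt * dy, z + dt * dz)


def separation_history(perturbation=1e-12, steps=5000):
    """Run two copies whose starting x differs by `perturbation`; record their distance."""
    a = (0.0, 1.0, 1.05)
    b = (0.0 + perturbation, 1.0, 1.05)
    history = []
    for _ in range(steps):
        a = lorenz_step(a)
        b = lorenz_step(b)
        # Largest difference in any of the three coordinates.
        history.append(max(abs(p - q) for p, q in zip(a, b)))
    return history


history = separation_history()
print(f"separation after 10 steps: {history[9]:.3e}")
print(f"largest separation in run: {max(history):.3e}")
```

The two runs start almost indistinguishable, yet the gap between them grows until the trajectories are unrelated, which is the behaviour described above for models in the same workunit.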

We know, for example, that AMDs and Intels handle the calculations in different ways.

So the research team treats the results from each model in a WU as unique and uses all the results unless they're eliminated by quality control.

So I wasn't looking for minor differences. I wondered whether the graphs might show missing parts, or the graph line dropping down off the scale, or flatlining. For example, when HADSMs turn into 'iceworlds', in the few cases where they get far enough to produce a graph for the phase, this is the sort of thing we see.

This HADAM discussed in a different thread turned into a looper and has now aborted on two computers. The graphs there show the sort of wacky (non-)results that mean something went seriously wrong.
Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 34906 - Posted: 5 Sep 2008, 13:57:24 UTC

As Dave says, 7969313 and 7969315 are the same. This is because they were both run on Windows/Intel. The third result, 7969311, is different because it was run on Linux/Intel.

A computer is a 'machine' and, other things being equal, will produce exactly the same results every time. I don't buy this butterfly stuff!

There are a number of reasons why results from the same work unit differ:

1. They are run by different 'machines', which are not expected to produce the same results - i.e. the theoretical algorithms are not in practice identical, because of differences in low-level calculations between operating system plus processor combinations.

2. Where the 'machines' are apparently the same but the results differ, the likely cause is:

2a) An inadequate definition of what a 'machine' is: operating system plus processor manufacturer may not be a complete enumeration of all machine components that vary. A machine will not then agree with other machines, but it would agree with itself.

2b) The state is corrupted during the run. That corruption amounts to a difference in the algorithms the machines apply: they are no longer the same. There are lots of reasons for this: a faulty PC (failing or overclocked hardware), a programming error related to checkpointing (incomplete state saved, or changed state reloaded), corruption of the installation, and wacky things like cosmic rays or power glitches (causing single-event upsets, rewinds etc.).
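Point 1 can be illustrated with a minimal Python sketch. It shows a well-known property of floating-point arithmetic: the same sum evaluated in a different order can give a different answer, while repeating the same order always gives the same answer. The function name and values are made up for illustration; how a given compiler, operating system or processor actually orders the models' calculations is not something this sketch claims to show.

```python
def sum_in_order(values):
    """Add the values one by one, in exactly the order given."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1, 0.2, 0.3]
forward = sum_in_order(values)             # (0.1 + 0.2) + 0.3
backward = sum_in_order(reversed(values))  # (0.3 + 0.2) + 0.1
repeat = sum_in_order(values)              # same order as `forward`

print(forward == repeat)    # same 'machine': identical every time
print(forward == backward)  # different order of operations: not identical
print(abs(forward - backward))
```

The difference is tiny, which is the kind of seed a chaotic model can then amplify over a long run.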

To produce the 'reference' run for your machine, you could:

- put the PC at the bottom of a mine

- add ECC RAM

- not overclock aggressively

- keep the installation folders protected

- run the models as quickly as possible

- run a diagnostics/stress test, even on a healthy-seeming machine.

The project has published an analysis of the variations, and they're content that the variations behave as if they were butterfly flaps.
NewtonianRefractor
Joined: 22 May 08
Posts: 49
Credit: 2,335,997
RAC: 0
Message 35083 - Posted: 22 Sep 2008, 18:37:33 UTC

So my laptop finally returned the result. It is different from the other three, but still fairly similar. It's an AMD Turion X2 machine.
Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 35084 - Posted: 22 Sep 2008, 22:15:56 UTC - in response to Message 35083.  

[NewtonianRefractor wrote:] So my laptop finally returned the result. It is different from the other three, but still fairly similar. It's an AMD Turion X2 machine.

Well done:
