hadcm slight deceleration of secs/TS

Message boards : Number crunching : hadcm slight deceleration of secs/TS
glaesum
Joined: 24 Feb 06
Posts: 47
Credit: 782,082
RAC: 0
Message 32736 - Posted: 26 Feb 2008, 16:51:27 UTC

my 80yr coupled model is running fine, under a week away from finishing. I'm just slightly curious at noticing on the full trickles results page that the seconds per timestep (s/TS) is gradually slowing down throughout the model. It started at about 3.4s/TS, then settled around 3.79s/TS for much of the long haul, and during the last month it has slipped further to 3.96s/TS.

do models typically slow up slightly, or could my pc have built up sludge processes using up crunching resources? I'm not very tecchie about 'looking under the hood'... lol

for the power of my pc it also seems a bit slower than others, who typically report speeds under 3s/TS.

I've opened the gate and let a new model download (a 160yr one this time) and run in the other core; it's kicked out the other projects and started crunching - probably some long term debt on cpdn. This is going at a similar speed to the old 80yr model - currently 3.98s/TS.
ID: 32736
old_user428438
Joined: 1 Feb 07
Posts: 26
Credit: 885,216
RAC: 0
Message 32737 - Posted: 26 Feb 2008, 17:18:42 UTC - in response to Message 32736.  

my 80yr coupled model is running fine, under a week away from finishing. I'm just slightly curious at noticing on the full trickles results page that the seconds per timestep (s/TS) is gradually slowing down throughout the model.

>snip<


Typically I find that the s/TS decreases consistently throughout a model run. I am surprised when I see a value that is greater than its predecessor.

F.
ID: 32737
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32739 - Posted: 26 Feb 2008, 17:48:43 UTC

If you view the graphics a lot or decide to enable the model screensaver, that can slow a model down. Or if you exit from BOINC when the timestep countdown is close to the next checkpoint, that causes some repeat crunching. It's best to exit shortly after a checkpoint, when the timestep number is high.

Some models encounter a processing problem that causes a (usually short) rewind; again, the repeat processing will slow the reported speed, which is a cumulative average.

And some just slow down slightly. But if the difference in speed is sudden and noticeable and the model stops trickling, you need to keep an eye on the model dates to check it's progressing and hasn't got stuck in a loop. Not many models do this now.
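The effect described above, where repeat crunching drags on a cumulative average, can be illustrated with a minimal sketch (all figures invented for illustration):

```python
# Hedged sketch: how repeat crunching after a rewind inflates the
# cumulative s/TS average, even though the true per-timestep speed
# is unchanged. All numbers are made up.

def cumulative_s_per_ts(cpu_seconds_spent, timesteps_credited):
    # The reported figure is total CPU time divided by the timesteps
    # the model has officially reached (recomputed steps earn no credit).
    return cpu_seconds_spent / timesteps_credited

TS_RATE = 3.4          # true speed, seconds per timestep
STEPS = 100_000        # timesteps credited so far
REWOUND = 5_000        # timesteps recomputed after a short rewind

clean = cumulative_s_per_ts(TS_RATE * STEPS, STEPS)
with_rewind = cumulative_s_per_ts(TS_RATE * (STEPS + REWOUND), STEPS)

print(f"no rewind:   {clean:.2f} s/TS")        # 3.40
print(f"with rewind: {with_rewind:.2f} s/TS")  # 3.57
```

A single 5% rewind permanently raises the displayed average, so the figure can only recover slowly even after the model returns to full speed.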
Cpdn news
ID: 32739
Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,407,645
RAC: 5,424
Message 32740 - Posted: 26 Feb 2008, 18:04:04 UTC - in response to Message 32736.  

....
I've opened the gate and let a new model download (a 160yr one this time) and run in the other core; it's kicked out the other projects and started crunching - probably some long term debt on cpdn. This is going a similar speed to the old 80yr model - currently 3.98s/TS.

It can also depend on what the other project is doing on the other core. Some projects 'play nice' and don't get in the way: other projects compete for shared resources (particularly memory access) and slow themselves down as well as slowing down CPDN. Of your list, Einstein is one of the better ones to have running in parallel - I don't have experience of Malaria, Rosetta or LHC in this context.

Also, it probably depends whether your "Intel(R) Pentium(R) 4 CPU 3.40GHz" is a true dual core, or just a hyperthreaded single core. Again, others will have more experience of HT CPUs than I do.
ID: 32740
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32742 - Posted: 26 Feb 2008, 18:34:09 UTC


The model's files will also get fragmented over time, which will slow things down slightly. There is a post on disk maintenance which may be worth reading (you can find it in the 'crashes and other problems' readme, 'hardware' section; the link is in my signature).

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 32742
glaesum
Joined: 24 Feb 06
Posts: 47
Credit: 782,082
RAC: 0
Message 32774 - Posted: 28 Feb 2008, 14:54:09 UTC

I've been pondering the various answers - thanks to all.

it's difficult to elicit a cause that would gradually and systematically produce the slowing-down symptoms.

I don't do the obvious no-nos: I don't use the screensaver and only occasionally peek at the graphics, perhaps to check near the end of a year or decade or for the savepoint. I don't shut down much (during the heating season), and anyway the loss of time crunched since the last checkpoint would be random and not linked to the %age of task completed. There are no signs of rewinding or looping - not that I've seen.

on the matter of its partner processing in the other core, apart from the overlap of two CM models this week, I don't normally do two cpdn tasks. otherwise it has been a fairly consistent mix of rosetta, malaria, einstein (and spasmodic lhc, somewhat more this month) - I haven't changed the resource share much since I set them up. The cpu is indeed a virtual not a true dual core, as Richard guessed.

point taken on disk maintenance: I've made a tentative start on a bit of a clearout; first out were the temp internet files, and I found one dead aborted hadcm folder. The old disused BBC-CE folder was also given the heave-ho - it was archived anyway! Next on the list is re-organising a load of media files on to a USB drive...
I'll do the full defrag when the 80yr model finishes next week, when I'll shut down all processes and let it get on with it.

I did look at cpu usage and I guess it would help one %age point or two if I shut a couple of dozen Firefox tabs down! The DVD media player takes 20% or a little more when running - that's the only thing I've been able to see that might be resource-hungry.

since the original post, of course, no fresh trickles are showing on the web database for you to see, during the current cpdn server problems - the models are now saying 3.99s/TS and 4.02s/TS respectively, even slower than two days ago.

has anyone attempted to put together a very rough guidance table of expected processing speeds of the most popular cpus of various vintages? This P4 is now abt 2.5yrs old.
ID: 32774
Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 32778 - Posted: 28 Feb 2008, 15:42:52 UTC
Last modified: 28 Feb 2008, 15:57:07 UTC

The trickle times for your model are shown in the following graphic.

Cumulative is what's on the Web site, 'spot' is the difference in CPU time per trickle, and 'trickle time' is based on when the trickles were registered.
There are a number of different parts to this run:
1. At the beginning it is 'normal' for some of the new runs to slow down. The 'spot times' are regular at this point.
2. The spot times increase and become irregular at about 23 January (or possibly earlier).
From (2) I would guess that something changed on the machine at about 23 January, since I've not seen irregularity in a coupled model before. The fact that the processor is hyperthreaded makes me wonder whether BOINC is just miscalculating the CPU time because of the total load on the computer.

You could run the machine for a few CPDN trickles without anything else running; if the spot times settle down again, then it's a PC curiosity; if the spot times still vary, then the model is getting into some funny state ...
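The 'spot' series in the graphic can be reconstructed from the cumulative CPU times shown on the trickle page; a minimal sketch with invented numbers:

```python
# Hedged sketch: deriving 'spot' times from the cumulative CPU times
# listed on the trickle results page. The cumulative values below are
# invented for illustration; real ones come from the Web site.

cumulative_cpu = [120_000, 243_000, 368_000, 496_000]  # seconds at each trickle

# Spot time per trickle = difference between successive cumulative values.
spot = [b - a for a, b in zip(cumulative_cpu, cumulative_cpu[1:])]

print(spot)  # [123000, 125000, 128000] - rising spot times reveal the slowdown
```

Plotting the spot series rather than the cumulative average makes a gradual deceleration, or a sudden change of state, much easier to date.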
ID: 32778
Alex Plantema
Joined: 3 Sep 04
Posts: 126
Credit: 26,366,593
RAC: 254
Message 32781 - Posted: 28 Feb 2008, 17:40:10 UTC - in response to Message 32736.  

my 80yr coupled model is running fine, under a week away from finishing. I'm just slightly curious at noticing on the full trickles results page that the seconds per timestep (s/TS) is gradually slowing down throughout the model. It started at abt 3.4s/TS then settled around 3.79s/TS for much of the long haul and during the last month it has slipped further to 3.96s/TS.

My latest workunits behaved the same. Previous workunits showed constant or even decreasing times per timestep. I think it is caused by the nature of the workunit, not by other processes on your computer, fragmentation or other lack of maintenance.

ID: 32781
glaesum
Joined: 24 Feb 06
Posts: 47
Credit: 782,082
RAC: 0
Message 32782 - Posted: 28 Feb 2008, 17:58:22 UTC
Last modified: 28 Feb 2008, 18:06:24 UTC

interesting and very pretty graphic!! the big gap was obviously xmas when I was shut down.

we might learn a bit more when the model finally finishes (and we see how fast the new model runs on its own); the old one is set to complete in 109 hours, but I've just set a timer against the clock as it's clearly running down much slower than that prediction.

I'll turn network activity back on when all the server problems are back to normal. It's a pity we can't look at the last couple of trickles - I'd also like to see my new model successfully get its first trickle showing and check it's happy.
ID: 32782
old_user141342
Joined: 20 Dec 05
Posts: 9
Credit: 127,973
RAC: 0
Message 32805 - Posted: 1 Mar 2008, 17:21:47 UTC - in response to Message 32740.  

It can also depend on what the other project is doing on the other core. Some projects 'play nice' and don't get in the way: other projects compete for shared resources (particularly memory access) and slow themselves down as well as slowing down CPDN. Of your list, Einstein is one of the better ones to have running in parallel - I don't have experience of Malaria, Rosetta or LHC in this context.


I've made this observation, too. I've always thought it had to do with the CPU cache, which is a little on the low side on my AMD 64 x2, but I really don't know much about computers on the hardware level. One thing I can tell for sure, however: SETI is rather bad for running on the second core, but I can highly recommend Sudoku. Seems to speed up your CPDN model.
ID: 32805
glaesum
Joined: 24 Feb 06
Posts: 47
Credit: 782,082
RAC: 0
Message 32955 - Posted: 14 Mar 2008, 14:19:07 UTC
Last modified: 14 Mar 2008, 14:24:35 UTC

hi again Iain and everyone,

I let things run for a while after the recent server problems settled down and the old model had finished. The model continued to slow down right to the end. More worrying is that the new model started up matching the same slow speed as the old one (~4s/TS) and then decelerated quite sharply while the old one was paused on suspend. Even now it's crunching on its own, it is still slowing, although not as markedly. It has now done 10yrs, dropping down to 4.64s/TS, with another trickle due this afternoon which should show about 4.66s/TS. Each year is now taking over 36hrs and it was only a little over a day at the beginning of the year - once the weather warms up I'm unlikely to crunch 24/7, so it could take all year at this rate.

I looked at my old BBC CCE account too and, while only the last few dozen trickles are available to view (on a model that failed with negative pressure c.2050), the machine was performing at 1.98s/TS at that time last May. Admittedly that was before I discovered how to set both cpu virtual cores running, which will account for quite a bit of this change. The only other comparison to look at is the slab model run last October.

Despite all this, it's still pleasing to have completed my first coupled model on the main cpdn project.

/Pete
ID: 32955
Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 32959 - Posted: 14 Mar 2008, 17:43:19 UTC

My 3 GHz hyperthreaded processor ran at about 5.6 sec/timestep. So yours might run at about 5.6 * 3 / 3.4 = 4.9 s/ts. If your PC wasn't running two hyperthreaded models, a single model might run at 4.7 * 1.2 / 2 = 2.8 s/ts.

The current 5.44 model can be 30% slower than the 5.15 models at the BBC. So, your previous 2 sec/timestep becomes 2.6 s/ts.

So, the numbers are about right - but always slightly worse ...

PS I've stopped running models hyperthreaded. I would rather finish the models quickly, even though the machine does 20% less work.
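The back-of-the-envelope arithmetic above can be sketched as follows; the ~1.2x hyperthreading throughput factor is an assumption implied by the PS (20% less total work without HT), not a measured value:

```python
# Hedged sketch of the scaling arithmetic in the post above.
# Assumptions: s/TS scales inversely with clock speed, and running two
# hyperthreaded models yields ~1.2x the throughput of one model alone.

REFERENCE_S_PER_TS = 5.6   # observed on a 3.0 GHz HT P4 (from the post)
REFERENCE_GHZ = 3.0
TARGET_GHZ = 3.4
HT_THROUGHPUT = 1.2        # assumed combined speed-up from hyperthreading

# Scale by clock speed: expected s/TS with two HT models running.
two_models = REFERENCE_S_PER_TS * REFERENCE_GHZ / TARGET_GHZ

# Starting from the machine's actual recent speed with two models,
# estimate a single model with the second logical core idle.
observed = 4.7
one_model = observed * HT_THROUGHPUT / 2

print(f"expected with two HT models: {two_models:.1f} s/TS")  # ~4.9
print(f"single model, HT idle:       {one_model:.1f} s/TS")   # ~2.8
```

The estimates bracket the observed 4.6-4.7 s/TS figures in the thread, which is why the numbers were judged "about right - but always slightly worse".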
ID: 32959
glaesum
Joined: 24 Feb 06
Posts: 47
Credit: 782,082
RAC: 0
Message 32968 - Posted: 14 Mar 2008, 23:36:17 UTC - in response to Message 32959.  
Last modified: 14 Mar 2008, 23:36:39 UTC

>snip<

The current 5.44 model can be 30% slower than the 5.15 models at the BBC. So, your previous 2 sec/timestep becomes 2.6 s/ts.

So, the numbers are about right - but always slightly worse ...

thanks, interesting point about v.5.44 against the older v.5.15, which I didn't know.
the calculations make sense; however, we still have the two puzzles of why the models are decelerating fairly systematically, and why the second model (160yrs rather than the previous 80yr one) started off at the speed the first one finished at.
ID: 32968
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 32969 - Posted: 15 Mar 2008, 0:10:43 UTC


There are 2 values on the page for each computer in your Account:

Average CPU efficiency
Result duration correction factor


These are used (and updated) by BOINC to give it an indication of how long it ACTUALLY took to complete a model. (Starting with a value supplied by each DC project for their work units.)

Because BOINC's estimating maths is set for the shorter projects, it has problems getting it right for the very long WUs on this project.
It usually needs several completed WUs from a project to get an accurate estimate, which is OK when a WU is 10 minutes to a day in length.
But here, a WU (model) takes months, so BOINC takes a long time to adjust its estimate.

This is what you're seeing - a gradual change in the estimate of 'time-to-completion' of a WU (model).
And none of this is helped (on ANY project) when they have WUs of varying length. And there are several others like this.

Of course, there IS another possible reason for a slowdown: dust accumulating on the processor heat sink, insulating it more and more, and causing thermal throttling to have more and more effect on the processor's "up" time.
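The slow convergence described above can be illustrated with a toy model of a duration-correction-factor style update. This is a hedged sketch, not BOINC's actual code; all numbers, the 0.1 step size, and the function names are invented for illustration:

```python
# Hedged sketch: a correction factor nudged toward the observed ratio
# of actual to estimated runtime, once per *completed* work unit.
# With months-long models, completions (and hence updates) are rare.

def estimate_runtime(rsc_fpops_est, host_flops, dcf):
    # Predicted wall time = project's work estimate / benchmarked speed,
    # scaled by the correction factor.
    return rsc_fpops_est / host_flops * dcf

def update_dcf(dcf, actual_runtime, raw_estimate, step=0.1):
    # Move a fraction of the way toward the observed actual/estimate ratio.
    return dcf + step * (actual_runtime / raw_estimate - dcf)

host_flops = 2e9       # made-up benchmark speed
work = 4e15            # made-up project-supplied fpops estimate
true_runtime = 4.0e6   # seconds a model really takes on this host (made up)

raw = work / host_flops  # uncorrected estimate: 2e6 s, half the true time
dcf = 1.0
for _ in range(5):       # five completed models - potentially years of crunching
    dcf = update_dcf(dcf, true_runtime, raw)

print(f"estimate after 5 models: {estimate_runtime(work, host_flops, dcf):.0f} s")
print(f"true runtime:            {true_runtime:.0f} s")
```

Even after five completed models the factor has closed only part of the gap, which is the "takes a long time to adjust" behaviour described in the post.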


Backups: Here
ID: 32969
Conan
Joined: 6 Jul 06
Posts: 141
Credit: 3,511,752
RAC: 144,072
Message 33036 - Posted: 20 Mar 2008, 0:49:05 UTC - in response to Message 32968.  
Last modified: 20 Mar 2008, 0:50:55 UTC

>snip<

The current 5.44 model can be 30% slower than the 5.15 models at the BBC. So, your previous 2 sec/timestep becomes 2.6 s/ts.

So, the numbers are about right - but always slightly worse ...

thanks, interesting point about v.5.44 against the older v.5.15 which I didn't know.
the calculations make sense, however we still have the two puzzles of why the models are decelerating fairly systematically and why the second model (160yrs rather than the previous 80yr one) started off at the speed the first one finished at.


One other thing that could also be happening, if you are running other projects at the same time, is that the BOINC Manager/Client fails to release a WU from one project before starting the next.
What I mean here is that, for instance, you have 4 cores running 4 work units. When it's time for one WU to cede and give another WU a go, it does not let go, but the next WU starts anyway.
I have had this happen on a number of occasions (although it is still rare), giving me 5 work units running when I have 4 cores.
Three work units run at 100%, and the 4th and 5th run at 50% each, cycling back and forth every few seconds to each WU.
Suspending does not stop this, and restarting the BOINC manager is the only fix I have found whenever I have struck this.

Running at 50% for a few hours or more in a day will affect the overall time average.

Just something else to consider.
ID: 33036
glaesum
Joined: 24 Feb 06
Posts: 47
Credit: 782,082
RAC: 0
Message 33048 - Posted: 21 Mar 2008, 17:46:12 UTC
Last modified: 21 Mar 2008, 17:48:34 UTC

Conan - perhaps something for others to watch out for. I do everything to prevent the hadcm task switching out, as the lost time cannot be recovered on such long work units; I leave the second core to share the remaining time between other projects.

meanwhile... ...I've cleared out 5-6GB of media files, redundant or backed-up elsewhere, and done a couple of defrags (it says 15% free space is preferred) and rebooted.
the model continued to slow down to 4.726s/TS but - short of coincidence - this last clean-up has turned things around; the latest trickle reported at 4.69 and the graphic screen is currently quoting 4.68.
because we are talking overall averages and I'm on year 1936, this is an improvement from ~123k-126ksecs per trickle-year to ~109ksecs per year. that is still only 4.2s/TS in the long run, but hopefully at least I've stopped the rot. I probably should reboot the system more often as it can get a bit sludgier after a week or so.

perhaps cleaning out the dust is the next project for the holiday weekend!! aitchooooo... :)
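The averages in the post above can be cross-checked with a quick sketch; the timesteps-per-model-year figure below is inferred from the thread's own numbers (~123k secs/year at ~4.7 s/TS), not taken from any model documentation:

```python
# Hedged sketch: converting seconds-per-trickle-year back to s/TS.
# A trickle covers one model-year; the timestep count per year is
# backed out of the figures quoted in the thread.

secs_per_year_before = 123_000
s_per_ts_before = 4.7
ts_per_year = secs_per_year_before / s_per_ts_before  # ~26,000 timesteps

secs_per_year_after = 109_000
s_per_ts_after = secs_per_year_after / ts_per_year

print(f"~{ts_per_year:.0f} timesteps per model-year")
print(f"after clean-up: {s_per_ts_after:.1f} s/TS")  # ~4.2, matching the post
```

The same conversion also lets you translate any target s/TS into hours per model-year when estimating completion dates.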
ID: 33048
glaesum
Joined: 24 Feb 06
Posts: 47
Credit: 782,082
RAC: 0
Message 33583 - Posted: 26 Apr 2008, 0:19:17 UTC
Last modified: 26 Apr 2008, 0:20:22 UTC

a small news update since I last reported:

within a few days of the slight improvement that I got from the file clean-up and defrag, things turned rapidly for the worse and the timestep interval climbed back to 4.75s/TS within about 4 days, so I finally got down to cleaning the dust out. it was pretty dreadful in there and took ages to suck out - the fan on the video processor card was probably the worst; also, without having a jet blower, the fine fins on the main cpu are not quite fully clear, but anyway everything immediately improved.

at a rough estimate, it's now crunching about 15% faster and, if you check the trickle list, I'm managing about 3 years every four days (apart from one shut-down day), and this has now been pretty consistent for a few weeks, so the deceleration problem looks pretty well fixed. this rate is close to 4s/TS and the cumulative trickles are gradually approaching this - it did go a bit faster with the previous model (more like 28hrs/year than 32hrs/year) but at least things are now more respectable.

the moral is definitely to clean out the dust periodically!

/pg
ID: 33583
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 33589 - Posted: 26 Apr 2008, 9:37:37 UTC


It sounds as if your processor was being 'thermally throttled' (automatically slowed down when overheating is detected by the BIOS), well done for solving the issue :-)

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 33589
glaesum
Joined: 24 Feb 06
Posts: 47
Credit: 782,082
RAC: 0
Message 33590 - Posted: 26 Apr 2008, 19:40:45 UTC - in response to Message 33589.  
Last modified: 26 Apr 2008, 19:41:26 UTC


It sounds as if your processor was being 'thermally throttled' (automatically slowed down when overheating is detected by the BIOS), well done for solving the issue :-)

hi mike - yes, I agree, the most probable explanation; that's why I thought it worth mentioning, as those of us in the northern hemisphere go into 'summer mode', putting more heat strain on our rigs. Actually, I'll stop treating my pcs as heaters now and shut them down more often. I'm also going to try and cadge the tail end of a can of pressurised air off a tecchie friend to give the heat sink another blast.
ID: 33590
mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 33605 - Posted: 28 Apr 2008, 17:49:33 UTC

Cylinder vacuum cleaners are very useful for removing dust, particularly the more modern ones whose power can be varied, i.e. reduced. Of course you have to be careful never to touch any of the parts with the nozzle. I also recommend hair dryers for blowing dust from inaccessible corners, and fine, soft (dry) paint brushes for the fans and fins. I mean brushes for painting pictures, not for painting walls!
Cpdn news
ID: 33605


©2024 climateprediction.net