climateprediction.net home page
Optimise PC build for CPDN


MartinNZ

Send message
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 46964 - Posted: 5 Sep 2013, 10:53:44 UTC - in response to Message 46961.  

Problems with Models Crashing Computer ID 1290283

So far all my tasks have crashed and I'm suspending calculations for the moment.

2 early ones model errors as noted in another thread
2 caused by me as below.
2 at Time Step 259,200 (Tasks 15966049, 15998767)
3 at Time Step 518,400 (Tasks 15940327, 15942872, 15965966)

I assume these timesteps are the 25 & 50% marks for the hadcm3n models.
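
As a quick arithmetic check on that assumption: if step 259,200 really is the 25% mark, the full run would be 1,036,800 steps, and step 518,400 would then land exactly on 50%. A minimal sketch follows; the function name is illustrative only, and the 25% reading is the assumption being tested, not a documented property of hadcm3n.

```python
from fractions import Fraction


def implied_total_steps(step: int, fraction_done: Fraction) -> Fraction:
    """Total timesteps implied if `step` sits at `fraction_done` of the run."""
    return step / fraction_done


# Assumption under test: the first crash point is the 25% mark.
total = implied_total_steps(259_200, Fraction(1, 4))

# Where the second crash point falls if that assumption holds.
second_crash_fraction = Fraction(518_400) / total

print(total, second_crash_fraction)
```

The two crash points are consistent with each other under this reading, which is all the check shows; it does not confirm where the model actually writes its zip files.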

Remembering other threads about this I assume this is a problem with the hard drives not getting the data out quickly enough as highlighted in Greg's post here. It seems a possibility that the PC is generating too much data for the older drive BOINC sits on.

Am I correct in this, and if so, what's the best approach? I'm quite happy to put a faster drive in, and that includes an SSD if necessary. Yes, I know the SSD life could be short, but an Intel 520 should be good for 2 years? By that time something else will be along anyway. A 120GB SSD is around $240, a 1TB Seagate HDD around $110. I would need the 1TB drive to get the higher data speeds.

Or are there any ways in which Windows can be manipulated to speed things up? I do have a largish number of drives; I don't know if that makes any difference.

Any thoughts anyone?

BTW, thanks for the comments all and the link Greg. I'm sure there was something more related to CPDN somewhere, but perhaps that was the old BB.

Martin
ID: 46964 · Report as offensive     Reply Quote
Belfry

Send message
Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 0
Message 46965 - Posted: 5 Sep 2013, 13:01:23 UTC - in response to Message 46964.  

Two minutes waiting for shutdown tasks to write sounds pretty out-of-whack. Could it have something to do with that ISRT for SSDs?
ID: 46965 · Report as offensive     Reply Quote
Profile MikeMarsUK
Volunteer moderator
Avatar

Send message
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46966 - Posted: 5 Sep 2013, 13:19:31 UTC


I would recommend against an SSD for CPDN. I calculated my 64mb Intel 520 would only survive around 6 weeks in theory (each model generates something like a terabyte of writes over its life), so I moved the Boinc data directory onto a physical disk. A bigger disk, or single-level-cell flash would last longer.
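
The shape of that kind of estimate can be sketched as below. This is not the calculation behind the six-week figure; every number in the example call is a hypothetical placeholder (no datasheet endurance rating is quoted in this thread), and only the "about a terabyte per model" figure echoes the post.

```python
def ssd_lifetime_days(endurance_tb: float,
                      models_running: int,
                      tb_written_per_model: float,
                      days_per_model: float) -> float:
    """Days until the rated write endurance is used up at a steady write rate."""
    # Average write rate across all concurrently running models.
    tb_per_day = models_running * tb_written_per_model / days_per_model
    return endurance_tb / tb_per_day


# Placeholder inputs: 40 TB rated endurance, 8 models at once,
# roughly 1 TB written per model, 80 days of runtime per model.
estimate = ssd_lifetime_days(endurance_tb=40.0,
                             models_running=8,
                             tb_written_per_model=1.0,
                             days_per_model=80.0)
print(f"{estimate:.0f} days")
```

A smaller drive, more models, or faster-finishing models all shorten the result, which is the point being made about bigger disks lasting longer.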

I see both 'signal 11' and 'code 193' in the status of those jobs. Does the time of the crashes correspond to anything particular?

As a starting point:

* Change your settings to 'Leave tasks in memory when suspended' = Y, 'suspend if CPU usage is above %' to 0%, 'Use at most ... % of CPU' to 100.00. This will prevent the model being swapped out of memory.

* Make sure that the Boinc data directories are excluded from any antivirus scans.
If the crash always happens at the moment that the zip files are generated (25%, 50%, 75%), then I would look at the antivirus software on that PC first, as it may be interfering. Obviously you shouldn't turn off antivirus, but you can exclude the appropriate directories (which may include c:/temp/ if the files are generated there).
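
For anyone who prefers a local override file to the preferences dialog, the three settings in the first bullet can be sketched as below. The tag names are my understanding of BOINC's global_prefs_override.xml format and should be checked against your client version; the helper name `build_override` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_override() -> str:
    """Build the body of a global_prefs_override.xml with the three settings."""
    root = ET.Element("global_preferences")
    settings = {
        "leave_apps_in_memory": "1",  # 'Leave tasks in memory when suspended' = Y
        "suspend_cpu_usage": "0",     # never suspend on non-BOINC CPU usage
        "cpu_usage_limit": "100",     # 'Use at most ... % of CPU' = 100
    }
    for tag, value in settings.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


override_xml = build_override()
print(override_xml)
```

The output would be saved in the BOINC data directory, after which the client has to be asked to re-read its local preferences. Setting the same values through BOINC Manager has the same effect.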


Here is a good post regarding error numbers:
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7592&nowrap=true#46161


If neither of these helps, try running a 'stress test' for 24 or 48 hours on the PC. I use prime95 (one copy per thread, to test the CPU), and memtest86+ (to test the memory).

I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 46966 · Report as offensive     Reply Quote
Profile MikeMarsUK
Volunteer moderator
Avatar

Send message
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46967 - Posted: 5 Sep 2013, 13:28:47 UTC - in response to Message 46965.  

Two minutes waiting for shutdown tasks to write sounds pretty out-of-whack. ...


Well, he has 16 threads so there are an awful lot of models to shut down.





ID: 46967 · Report as offensive     Reply Quote
Belfry

Send message
Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 0
Message 46968 - Posted: 5 Sep 2013, 14:04:38 UTC

If the ISRT cache is large enough, Windows may decide to copy the BOINC data directory there (explaining the long shutdown time). Then you could perform the trick of hotplugging the BOINC HDD without crashing. Of course with this kind of caching you're still subjecting your SSD to a lot of writes.
ID: 46968 · Report as offensive     Reply Quote
MartinNZ

Send message
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 46969 - Posted: 5 Sep 2013, 15:05:24 UTC - in response to Message 46967.  

Thanks Guys,

From Mike's post.
> Does the time of the crashes correspond to anything particular?
Not as far as I'm aware. In fact the last one happened when I was sitting here looking at the CPDN results web pages & Std Err to figure out what was going wrong. The task in BOINC Manager was sitting at 49.xx% and the next time I looked back it was at 100% and crashed. A few browser pages open, email etc, but no power surges that I noticed. These were the entries in the Event Log at the time (Task 15965966):
    5/09/2013 7:44:03 p.m. | climateprediction.net | Sending scheduler request: To send trickle-up message.
    5/09/2013 7:44:03 p.m. | climateprediction.net | Not requesting tasks: don't need
    5/09/2013 7:44:08 p.m. | climateprediction.net | Computation for task hadcm3n_o4fj_1980_40_008408049_1 finished
    5/09/2013 7:44:08 p.m. | climateprediction.net | Output file hadcm3n_o4fj_1980_40_008408049_1_2.zip for task hadcm3n_o4fj_1980_40_008408049_1 absent
    5/09/2013 7:44:09 p.m. | climateprediction.net | Scheduler request completed
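
With several tasks running, it can be easier to scan a saved copy of the Event Log for the "absent" output-file message than to read it by eye. A minimal sketch, assuming the three-field `time | project | message` layout shown above; the function name is illustrative.

```python
import re

# Matches the message text of the "output file absent" entry shown above.
ABSENT = re.compile(r"Output file (?P<file>\S+) for task (?P<task>\S+) absent")


def missing_outputs(log_text: str) -> list[tuple[str, str]]:
    """Return (task, missing file) pairs found in Event Log text."""
    found = []
    for line in log_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3:
            continue  # not a 'time | project | message' line
        match = ABSENT.search(fields[2])
        if match:
            found.append((match["task"], match["file"]))
    return found


sample = """\
5/09/2013 7:44:08 p.m. | climateprediction.net | Computation for task hadcm3n_o4fj_1980_40_008408049_1 finished
5/09/2013 7:44:08 p.m. | climateprediction.net | Output file hadcm3n_o4fj_1980_40_008408049_1_2.zip for task hadcm3n_o4fj_1980_40_008408049_1 absent
"""
print(missing_outputs(sample))
```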


One odd thing I have noticed though is that for all tasks the % finished does not match the hours e.g. might say 50%, run 133hrs, to go 326hrs. I don't remember this as being normal on the other PC.
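
To put a number on the mismatch: a straight-line extrapolation from 50% done after 133 hours would give about 133 hours remaining, not 326. The sketch below is only that simple linear estimate, not a description of how BOINC computes its own figure.

```python
def linear_remaining_hours(elapsed_hours: float, fraction_done: float) -> float:
    """Remaining time if progress continues at the average rate so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours * (1 - fraction_done) / fraction_done


remaining = linear_remaining_hours(elapsed_hours=133, fraction_done=0.50)
print(remaining)
```

The gap between the two figures suggests the displayed "to go" time draws on something other than percent complete alone.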

Mike's starting points were already in place, except for excluding c:/temp/ or similar from the AV. Not sure that is a good directory to exempt from AV scanning, but if needs must. How do we know for certain this is where BOINC builds the zipped files? I notice a file in c:/windows/temp/ called DMI3FCD.tmp of 0KB and timed at 17:37 (NZ time), which was the time task 15998767 crashed. Could be it?

PC had a 40 hour burn before it left the shop, and a 30 hour one after I had installed all the extra gear and software, but could run it again in the weekend if it was going to be beneficial.

As a precaution the drive that houses BOINC was fully reformatted before use as it was an old (5yr) OS HDD. Hopefully that isolated any bad sectors if they were there.

Belfry mentions the IRST cache. I know nothing about this, but digging around it seems to only work when configured as RAID. All my drives are AHCI. The software is loaded, but when opened, the Accelerate feature is not present. I can only surmise this is not active.

It does give me a couple of things to try, I suppose. Might still get a faster HDD though. I take your point, Mike, about the SSD; it's been mentioned before.

Anyway it's very late, must head to bed. Thanks again.


ID: 46969 · Report as offensive     Reply Quote
Belfry

Send message
Joined: 19 Apr 08
Posts: 179
Credit: 4,306,992
RAC: 0
Message 46970 - Posted: 5 Sep 2013, 15:32:05 UTC - in response to Message 46969.  
Last modified: 5 Sep 2013, 15:40:41 UTC

...,but digging around it seems to only work when configured as RAID.

No, all it requires is an SSD and an HDD. It sounds like it will be active by default on Intel 68 and up even without the configuration program installed (the driver is part of Win7/8).

Ed: actually you're right. For it to work a single HDD must be put in RAID mode--pretty weird.
ID: 46970 · Report as offensive     Reply Quote
Profile MikeMarsUK
Volunteer moderator
Avatar

Send message
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46971 - Posted: 5 Sep 2013, 17:50:06 UTC - in response to Message 46970.  
Last modified: 5 Sep 2013, 17:53:19 UTC


... It sounds like it will be active by default on Intel 68 and up even without the configuration program installed (the driver is part of Win7/8).
...


It won't use ISRT by default unless it is set up in both the Bios & Windows... it took me several days and much swearing before I could get ISRT working with my PC! (I set it up on my main disk, and have a secondary disk for Boinc which is not cached).


... How do we know for certain this is where BOINC builds the zipped files? I notice a file in c:/windows/temp/ called DMI3FCD.tmp of 0KB and timed at 17:37 (NZ time), which was the time task 15998767 crashed. Could be it? ...


Well, it is probably associated with the model, but whether it was actually the starting point for a .Zip or not I cannot tell. That's a typical name for a temporary file when requested by something using the Windows API. But try excluding it, and see if that helps.


... PC had a 40 hour burn before it left the shop, and a 30 hour one after I had installed all the extra gear and software, but could run it again in the weekend if it was going to be beneficial. ...


Well, if it's already had a stress-test done, there is little point in doing it again.


... the drive that houses BOINC ...


That sort of suggests that you have other drives available. Just as an experiment, it may be worth moving the Boinc Data folder over to a different drive (as long as it isn't an SSD), to see if you can see a difference.



Speaking of which, in the (very) old days it used to be possible to set up a RAM disk at system startup, and store it at shutdown. While that wouldn't be a good idea for the models (since you would risk losing progress if the system shuts down unexpectedly), it may also be worth experimenting with.



The overall impression I am getting from this is that the problem is not stability (otherwise the crashes would be along the lines of NEGATIVE THETA etc), but something to do directly or indirectly with disk access (which is why I mentioned antivirus software earlier).
ID: 46971 · Report as offensive     Reply Quote
Profile MikeMarsUK
Volunteer moderator
Avatar

Send message
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46972 - Posted: 5 Sep 2013, 18:02:09 UTC
Last modified: 5 Sep 2013, 18:03:11 UTC

One more thing:

This is the model running on your machine:
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15998767


This is the same model running on someone else's machine.
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15816061


Note that they both crashed at the same point. It might simply be coincidence (the risky 50% point), or perhaps the model itself was doomed to die then anyway.
ID: 46972 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Send message
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 46973 - Posted: 5 Sep 2013, 20:32:52 UTC

When I run several models on one machine, I stagger them a bit by suspending some for different intervals. Hopefully, this means that they're after the same resources at different times.
It certainly means that they're at the 25% points at different times.

But with 16 at once, maybe the best advice is: Good luck.

ID: 46973 · Report as offensive     Reply Quote
Alex Plantema

Send message
Joined: 3 Sep 04
Posts: 126
Credit: 26,363,193
RAC: 0
Message 46974 - Posted: 5 Sep 2013, 21:04:45 UTC - in response to Message 46969.  

MartinNZ wrote:
Mike's starting points were already in place, except for excluding c:/temp/ or similar from the AV. Not sure that is a good directory to exempt from AV scanning, but if needs must. How do we know for certain this is where BOINC builds the zipped files? I notice a file in c:/windows/temp/ called DMI3FCD.tmp of 0KB and timed at 17:37 (NZ time), which was the time task 15998767 crashed. Could be it?
There's no C:\Temp folder on my computer, and C:\Windows\Temp is only accessible with administrative permissions, which Boinc doesn't have on my computer. So Boinc doesn't use these folders.
Climateprediction isn't a real-time program, so a slow drive cannot be the cause of a crashing task.
My advice is not to run any other projects together with Climateprediction on the same computer simultaneously, to keep it from being interrupted periodically. And exclude the Boinc data folder from scanning for viruses, as mentioned by others. Starting from Boinc version 6, viruses in this folder cannot access programs or non-Boinc data on your computer.
ID: 46974 · Report as offensive     Reply Quote
MartinNZ

Send message
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 46975 - Posted: 6 Sep 2013, 5:50:58 UTC - in response to Message 46973.  

Phew, thanks for all the helpful feedback.

Les mentioned
When I run several models on one machine, I stagger them a bit by suspending some for different intervals.

I wonder if this is a key issue? Early on when the program was released to get more tasks, 4 arrived at once - big job when it comes to reporting. But then surely this happens with other PCs?

Things I've done.

  • Installed new faster hard drive dedicated to BOINC only (Seagate ST2000DM - at least it can be used for something else if it doesn't work). Mike, it was easier to do this than direct BOINC to a different drive. Interestingly, no difference in shutdown times, but noticeable reduction in HD peak Queue length during normal running - factor of 10+.
  • Put the drive on a 6Gb/s port to at least get the data to the HD cache quicker than before.
  • Suspended a couple of tasks in order to get at least 5% difference in completion between tasks.
  • Added windows/temp to the directories not scanned by AV.
  • Stopped adding new tasks till we see how this goes.
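
The staggering step can be checked with a few lines of code. The sketch below flags pairs of tasks that sit too close together; comparing positions within each 25% window (rather than raw percentages) is my own assumption, building on the point that the heavy zip work happens at the 25/50/75% marks. Task names and percentages are made up.

```python
from itertools import combinations


def clashing_pairs(progress_pct: dict[str, float],
                   min_gap: float = 5.0,
                   period: float = 25.0) -> list[tuple[str, str, float]]:
    """Pairs of tasks that would reach a checkpoint less than min_gap % apart."""
    clashes = []
    for (name_a, pct_a), (name_b, pct_b) in combinations(sorted(progress_pct.items()), 2):
        # Distance between the two tasks within a repeating `period`-sized window.
        diff = abs(pct_a - pct_b) % period
        gap = min(diff, period - diff)
        if gap < min_gap:
            clashes.append((name_a, name_b, gap))
    return clashes


# Hypothetical progress readings from BOINC Manager.
tasks = {"task_a": 12.0, "task_b": 38.0, "task_c": 20.0}
print(clashing_pairs(tasks))
```

Here task_a and task_b look well separated at 12% and 38%, yet they would hit their next checkpoints only 1% apart. Note that with a 25% window there is room for at most five tasks at a 5% spacing, so eight tasks cannot all be kept that far apart.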



There seems to be some confusion over the number of tasks I'm running - currently set at 8 tasks. When Mike first mentioned 16 threads, I took that as meaning CPU threads as hyperthreading is on. Can't ever see me running 16 Tasks - wasn't the aim anyway. What I take from all this is that CPDN and modern CPUs are probably pushing BOINC into areas that are really borderline in reliability. I guess someone has to be the guinea pig ;-)

To come back to Alex, I only run CPDN in BOINC, no interest in sharing time with other projects, worthy though they may be. However, I will always be sharing the PC with other work, and that is the whole ethos of the BOINC software. This is my main work PC. Can't comment on your other points as I just don't know, perhaps someone with better PC knowledge will.


ID: 46975 · Report as offensive     Reply Quote
Profile MikeMarsUK
Volunteer moderator
Avatar

Send message
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46976 - Posted: 6 Sep 2013, 6:16:04 UTC - in response to Message 46975.  
Last modified: 6 Sep 2013, 6:19:59 UTC

... currently set at 8 tasks. When Mike first mentioned 16 threads, I took that as meaning CPU threads as hyperthreading is on. ...


Yes, I was looking at the 'processor' count on your computer page (= actually the number of CPU threads). 8 models is easier on the machine than 16 :-) I actually run 6 models on 4 cores / 8 threads, any more than that and a) it makes my machine struggle, and b) throughput did not increase anyway. The best individual processing speed comes from running one model per core.

Let us know if you have any failures after the above changes. (Fingers crossed...)
ID: 46976 · Report as offensive     Reply Quote
MartinNZ

Send message
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 46991 - Posted: 9 Sep 2013, 13:45:59 UTC - in response to Message 46976.  
Last modified: 9 Sep 2013, 13:48:16 UTC

Oh dear... And it seemed to be going so well.

Another crash 16002819 but this time not me? (INVALID THETA DETECTED - error type added in edit) I take it from Les's post 45480 in another thread that this is a model error? Nothing abnormal happening at the time.

If this is indeed a model error, do you reckon it's OK to get some more tasks? Things seem to have settled down in the last few days, with quite a few models getting past 25/50/75% points.
ID: 46991 · Report as offensive     Reply Quote
Profile MikeMarsUK
Volunteer moderator
Avatar

Send message
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46992 - Posted: 9 Sep 2013, 17:22:29 UTC
Last modified: 9 Sep 2013, 17:25:40 UTC

Yes, the INVALID THETA errors are different. They can be caused either by the model's initial parameters producing an implausible climate or by floating point errors creeping in. They also appear at the 25%/50%/75% boundaries because that is when the model validation takes place.

You've only had the one of these, and your machine has passed long stability checks, so in your case I think the model itself is to blame. If you were getting lots of INVALID THETAs, while other people running the same models were not, then there would be a cause for concern, but that isn't the case.

Even if you saw a number of THETAs turning up, they may be related to a particular batch of models, so one of the things to look at would be if they were generated at similar times or different times.

I would suggest ramping up the number of models & seeing what happens.
ID: 46992 · Report as offensive     Reply Quote
MartinNZ

Send message
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 47066 - Posted: 16 Sep 2013, 20:46:27 UTC - in response to Message 46992.  

Final(?) mods underway.
    1. UPS to be installed, probably Eaton. Four more lost tasks because of power failure. (FYI, Highest winds for 70 years, had no power for 2 days, no phone line for 5 days, parts of my pump shed found 500m away. Local ski field recorded 240 km/hr, reliable weather station just up the road recorded 152 km/hr with 132 km/hr average over 6 hours! Unfortunately I was out when the storm hit or would have shut the system down earlier.)
    2. CPU Water Cooler - Corsair H100i. CPU temps are reaching 73C (Max for CPU is 85C) and we are only in very early spring. Standard air cooling will struggle in summer and if more than 8 tasks run. The inclusion of these while rare in desktops is not abnormal in workstations as all the major suppliers (e.g. HP, Dell) offer water cooling as a standard option.


Compared with my old system, this one seems much more susceptible to power failure. Having a quick check of the old results I could find no occurrences of multiple task failures at one reporting time. With this one, the first power failure gave 2 failed tasks, this one 4. Of course this does not mean the old PC did not have task failure on power cuts, but it did not seem to have multiple failures (it was running 6 tasks.)

Therefore from what I've seen with my PC, for very fast systems it seems you need to be prepared to spend the dosh on extras like the UPS. It can't be any old UPS either as the UPS software needs to be able to auto shutdown BOINC and then run for say 5 mins before shutting down the system.
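
The sequence described here (stop BOINC, wait about five minutes, then power off) can be sketched as a script that UPS software might call when mains power is lost. This is a hypothetical hook, not any vendor's interface: it assumes `boinccmd` is on the PATH and that the UPS software can run an external command on a power-fail event.

```python
import subprocess
import time


def on_battery_shutdown(run=subprocess.run, sleep=time.sleep, grace_seconds=300):
    """Stop BOINC cleanly, wait for the models to finish writing, then power off."""
    # Ask the running BOINC client to exit so the models can checkpoint and close.
    run(["boinccmd", "--quit"], check=True)
    # Grace period for disk writes to complete (five minutes by default).
    sleep(grace_seconds)
    # Standard Windows command for an immediate shutdown.
    run(["shutdown", "/s", "/t", "0"], check=True)
```

The `run` and `sleep` parameters exist only so the sequence can be exercised without actually shutting anything down. The UPS must have enough battery runtime to cover the grace period plus the Windows shutdown itself.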

It's been a bit of a journey and let's hope it gets a bit smoother from now on. Hmmmm.


ID: 47066 · Report as offensive     Reply Quote
Profile Greg van Paassen

Send message
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 47067 - Posted: 16 Sep 2013, 21:55:02 UTC - in response to Message 47066.  

Hi Martin,

I don't think you said what brand and model of power supply is in the machine? I found that the power supply makes a difference. Also, sizing. The rule of thumb is to aim for full-on processing, including graphics card, to be no more than 2/3 of the rated power of the supply. (And no less than 1/2, for optimum efficiency.) I'd estimate a 550W-650W class supply for your machine.

FYI on my Sandy Bridge Core i7, i7z (a Linux CPU reporting tool) reports temperatures of 83 - 85 degrees Celsius with 8 models running, and the machine's been stable for the last couple of years (... touch wood). (I do need to vacuum out the CPU heatsink fins six-monthly.) Ivy Bridge CPUs may be more touchy, of course.

If you still have problems even after your UPS and water(!) cooling, the last resort (before a different motherboard) is to underclock a few percent and see if that helps.

I feel for you. This must be frustrating.
ID: 47067 · Report as offensive     Reply Quote
MartinNZ

Send message
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 47071 - Posted: 17 Sep 2013, 3:20:58 UTC - in response to Message 47067.  

Hi Greg, I gave up being frustrated years ago - the hair is grey enough as it is.

99.9% of power supplies are oversized, including mine. I have a Corsair HX650 (supposedly a badged Seasonic G-650, 650W, 80 Plus Gold). My metered wattages, assuming they are correct, are in my post 46956 below. When running with 8 tasks I pull around 155W. Peak efficiency is at 50%, but according to the 80 Plus test result, the efficiency for 115VAC is 88.6% @ 20% load, 91% @ 50% and 88.9% @ 100%. From memory HardwareSecrets had similar numbers in a test report. So I could have got a smaller power supply, but that does not allow for peak loads. Couldn't find a load calculator that included the Xeon E5-2670, but the Thermaltake one came the closest for the majority of my components and it calculated 537W for the power supply. These estimates are always over, but what else can you do.
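
To put the figures side by side: 155W on a 650W unit is a little under a quarter of rated output, and Greg's rule of thumb (full load between 1/2 and 2/3 of the rating) would point to a much smaller unit if 155W were the full-on figure. A rough sketch, treating the metered 155W as the load on the supply; if it was measured at the wall, the DC load is somewhat lower. Function names are illustrative.

```python
def load_fraction(load_watts: float, rated_watts: float) -> float:
    """Share of the supply's rated output that is actually in use."""
    return load_watts / rated_watts


def rule_of_thumb_rating(load_watts: float) -> tuple[float, float]:
    """Rated-power range that keeps the load between 2/3 and 1/2 of the rating."""
    return (load_watts * 3 / 2, load_watts * 2)


percent_loaded = round(load_fraction(155, 650) * 100, 1)
suggested_range = rule_of_thumb_rating(155)
print(percent_loaded, suggested_range)
```

The rule is meant for full-on processing including the graphics card, so the 8-task figure understates the load it should be applied to; the calculation only shows how much headroom the 650W unit has in normal running.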

As for temperatures, I'm erring on the side of extreme caution. Temperatures always do my head in as they are never straightforward and I'm no expert. Intel give a Tcase 85C max for my processor. However NONE of the monitoring software reports this correctly for my motherboard. What they show as Tcase/CPU temp stays constant when core temps have increased 40C. The other key and related temp is Tjunction max (Core temp), but Intel do not give this that I can find. Core Temp (Windows) reports this as 102C, which seems about correct from what I've read. The CPU will throttle about 5-10C before this, and the recommendations I've read say stay around 20C below TjMax for stability and long life. I can easily see my Tjunction getting to 85-90C in the middle of summer, so I decided to get in early and add more cooling. I want this rig to last 4-5 years.
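
The margins mentioned above work out as follows. A minimal sketch using only the figures in this post (a reported TjMax of 102C, a 20C margin, throttling 5-10C below TjMax); the function name is illustrative.

```python
def thermal_targets(tj_max_c: float,
                    margin_c: float = 20,
                    throttle_band_c: tuple[float, float] = (5, 10)) -> dict[str, float]:
    """Core-temperature thresholds derived from TjMax."""
    near, far = throttle_band_c
    return {
        "target_max_c": tj_max_c - margin_c,  # recommended ceiling for long life
        "throttle_from_c": tj_max_c - far,    # earliest point throttling may begin
        "throttle_to_c": tj_max_c - near,     # latest point throttling may begin
    }


targets = thermal_targets(102)
print(targets)
```

That gives a ceiling of 82C, so an expected 85-90C in midsummer would be above the recommended margin while still short of the 92-97C throttling range.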

My old i7 ran in the low 60sC, but I had a massive air cooler on it. Your temp seems high, but if it's running OK, fine I suppose. I also clean mine out every 6-9 months, but I take it into the garage and use a compressor - with due care of course, and no fan spinning. Vacuum cleaners do run a static risk.
ID: 47071 · Report as offensive     Reply Quote
Profile Greg van Paassen

Send message
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 47074 - Posted: 17 Sep 2013, 6:57:00 UTC - in response to Message 47071.  

the recommendations I've read say stay around 20C below TjMax for stability and long life.

This fits with what I read when I was building audio amps. Audio amps we like to keep running for decades. But what's long life for a computer? Hmmm... maybe I should get a better cooler, too.

(Audio is all different now--that was class B bipolar transistors; now hi-fi amps are mostly class D, and the things barely get warm at all, while providing much better sound. And with high-frequency switching power supplies, like those in computers, they barely weigh anything either. Hurray for modern power MOSFETs.)

Your power supply sounds good, so that's blown that theory. Your earlier post sounded like you live a bit out of town, so the voltage at your house may vary quite a bit. The UPS should help considerably with that.

ID: 47074 · Report as offensive     Reply Quote
Profile MikeMarsUK
Volunteer moderator
Avatar

Send message
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 47077 - Posted: 17 Sep 2013, 7:08:58 UTC
Last modified: 17 Sep 2013, 7:10:29 UTC

I do have a UPS also - an APC SmartUPS-2200 which can run my PC for 20-30 minutes. Second-hand, and very cheap from ebay because it was so heavy (had to collect it). But if you get an ebay one you will need to replace the batteries.

At the time I was getting 10 powercuts / month. I'm not currently running it because the power supply has been much improved and I no longer get powercuts.
ID: 47077 · Report as offensive     Reply Quote

©2024 climateprediction.net