Errors in new HADAM3P_ ANZ Tasks

ChrisD

Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53291 - Posted: 21 Jan 2016, 11:43:56 UTC - in response to Message 53289.  
Last modified: 21 Jan 2016, 11:44:48 UTC

My i7 is a 3770K with HT enabled, and BOINC is limited to 7 processors, just to make sure one CPU is always free to do something else.

The Xeon has HT disabled and 6 CPUs active. Since both machines fail at random, the spare CPU does not seem to help.

Are there any log settings that might help me pinpoint the problem?

The errors are random; the Xeon has one Africa Region task left, and it has passed TS 95000 without problems.

??

ChrisD
ID: 53291
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53292 - Posted: 21 Jan 2016, 13:13:58 UTC - in response to Message 53288.  

Am I really the only one fighting these errors?

Maybe it is unrelated, but you are running several tasks at once on an HDD, and I found a few years ago that write contention on the drive (an SSD in my case) was causing errors. I now use a ramdisk to hold the BOINC Data folder, so that all writes go to main memory. Alternatively, a large write-cache will work. I can provide details if necessary. The errors I see now are usually just the REPLANCA errors due to bad work units. But the more tasks you run at once, the more likely you are to have problems with the disk drive.
ID: 53292
ChrisD

Joined: 8 Aug 04
Posts: 69
Credit: 1,561,341
RAC: 0
Message 53293 - Posted: 21 Jan 2016, 14:35:38 UTC - in response to Message 53292.  
Last modified: 21 Jan 2016, 14:39:20 UTC

... Alternatively, a large write-cache will work.

That may be worth a try :)

I can provide details if necessary.

That would be great, so please..

ChrisD

Re: using a RAM drive. I would have to get myself a No-Break (UPS) first. One glitch and all is lost.

I actually use a RAM disk on my i7 to store Firefox's cache. Otherwise my Wear Level Count declines rapidly because of the ridiculous way Firefox stores browsed pages: myriads of small files of 1K or maybe 2K, trashing any SSD in record time.
But if I lose power, I can safely discard any file on the RAM drive.
ID: 53293
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 53294 - Posted: 21 Jan 2016, 18:22:21 UTC - in response to Message 53293.  
Last modified: 21 Jan 2016, 18:27:46 UTC

... Alternatively, a large write-cache will work.

That may be worth a try :)

The write cache I use is PrimoCache (http://www.romexsoftware.com/en-us/primo-cache/index.html). It is not free, but well worth it; I bought two licenses, one for each of my dedicated Haswell machines. I devote six cores to BOINC, so up to six CPDN tasks may be running at once. You need only a write-cache, since a read cache is irrelevant for reducing the wear on SSDs or preventing errors, and just takes up memory space. The HADAM3P_ ANZ tasks use somewhat less disk space than the WAH2s, so I think a cache size of maybe 10 GB on your i7 and 16 GB on your Xeon would be plenty. The cache latency can just be set to "infinite". When the cache fills up, it will write the excess to the HDD, but that will be rare, and those will be easy serialized writes in large blocks, not the small random writes that are hard for the HDD (or SSD) to handle.
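
To illustrate the general idea only: below is a toy Python sketch of a write-back cache that holds small scattered writes in memory and later writes them out in one ordered pass. It is not how PrimoCache or any real cache driver is implemented; the class name `WriteBackCache`, its methods, and the `demo` function are all made up for this example.

```python
import os
import tempfile


class WriteBackCache:
    """Toy write-back cache (hypothetical, for illustration only)."""

    def __init__(self, path, capacity_bytes):
        self.path = path
        self.capacity_bytes = capacity_bytes
        self.pending = {}       # offset -> bytes not yet written to disk
        self.pending_bytes = 0
        self.flush_count = 0    # number of times the disk was actually touched

    def write(self, offset, data):
        # A rewrite of a block still in the cache replaces it in memory.
        old = self.pending.get(offset)
        if old is not None:
            self.pending_bytes -= len(old)
        self.pending[offset] = data
        self.pending_bytes += len(data)
        # Only when the cache is over capacity do we go to the disk.
        if self.pending_bytes > self.capacity_bytes:
            self.flush()

    def read(self, offset, size):
        # Data still in the cache is served from memory.
        data = self.pending.get(offset)
        if data is not None and len(data) >= size:
            return data[:size]
        if not os.path.exists(self.path):
            return b""
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)

    def flush(self):
        if not self.pending:
            return
        mode = "r+b" if os.path.exists(self.path) else "w+b"
        with open(self.path, mode) as f:
            # Sorting by offset turns scattered writes into one ordered pass.
            for offset in sorted(self.pending):
                f.seek(offset)
                f.write(self.pending[offset])
        self.pending.clear()
        self.pending_bytes = 0
        self.flush_count += 1


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.bin")
        cache = WriteBackCache(path, capacity_bytes=64 * 1024)
        # 32 small 1 KiB writes issued in scattered order.
        for i in range(32):
            offset = ((i * 7) % 32) * 1024
            cache.write(offset, bytes([i]) * 1024)
        flushes_before = cache.flush_count
        cache.flush()
        return flushes_before, cache.flush_count, os.path.getsize(path)
```

In this sketch the 32 small writes fit inside the 64 KiB cache, so nothing touches the disk until the single `flush()` call. That is the same reason a cache sized generously for the running tasks should rarely have to spill to the drive.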

The writes will then go into the cache, and in fact the reads will also come out of the cache for any data still in it, which should be most, if not all, of it. I actually have 32 GB of memory on my Haswell machines and set aside 24 GB for the write-cache, but that is because I am doing WAH2 on those machines; it is a bit of overkill though. It should be mentioned that the Samsung SSDs come with their own caching software (Rapid Mode cache in the Samsung Magician utility), but it is only 1 GB in size. That might be enough to eliminate the errors, but you will still be writing a lot to the SSD. The Crucial SSDs also have their own Momentum Cache utility, which goes up to 4 GB in size. That might take care of the problem by itself.

I hear you on the need for a backup power supply (UPS) with a ramdisk; you definitely need it for BOINC work. It wouldn't be a bad idea for the write-cache too, though I think the results of a power interruption are not as difficult to correct there. I have not had a power interruption to the PCs for a while, since I use a UPS for everything, and I prefer not to find out the effects. There are a variety of projects that benefit from a cache or ramdisk though; I have seen a reduction in errors on ATLAS and WCG/CEP2, since they all have very high write rates. I think they really weren't intended to be run on so many cores at once. Good luck, and ask about anything else.
ID: 53294