climateprediction.net home page
Upgrade Problem HADAM3P failure

old_user611471

Joined: 26 Jan 10
Posts: 5
Credit: 168,664
RAC: 0
Message 41780 - Posted: 11 Mar 2011, 18:19:14 UTC

Hi
I have recently upgraded my computer to Windows 7 with an overclocked Intel i5 processor; at the same time, the climateprediction.net servers/database went through an overhaul. Since my upgrade and the server update (I am still getting errors from the server about space), I haven't managed to complete a task: all have ended with either "Error while downloading" or "Error while computing". The SETI project is working with no errors.
Any ideas?

JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,053,321
RAC: 4,417
Message 41782 - Posted: 11 Mar 2011, 19:06:18 UTC - in response to Message 41780.  

How much is your computer overclocked? These climate models tend to become unstable when the processor is run above its rated clock speed. Other projects don't seem to be as sensitive to overclocking.

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 41783 - Posted: 11 Mar 2011, 20:08:31 UTC

I'm with Jim. What clock speed is your PC running at?

Those CPUs overclock like crazy. The default big overclock is very high, and may work fine with one or two tasks running. But if you're trying to run 4 cores at a time at 100% CPU utilization at 4.2 GHz, you'd better have really, really good cooling.

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 41785 - Posted: 12 Mar 2011, 7:30:36 UTC

Hi Oracle

Your EU models have all crashed with download errors caused by MD5 checksum errors. This error is a bug in the EU tasks and was mentioned in the News thread (top of Number Crunching). As far as I can see from your computer's web page, the new batch of EU models created and released yesterday still contains this bug. You could edit the climateprediction preferences in your account and say you don't want EU models for the time being, until the checksum error bug is fixed.
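To illustrate what an MD5 checksum error on a download means, here is a minimal Python sketch of the general idea: hash the downloaded file and compare the result with the checksum the server advertised. This is not BOINC's own code; the function names `md5_of_file` and `download_is_intact` are made up for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large model files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path, expected_md5):
    """Hypothetical helper: True if the file matches the advertised checksum."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

If the advertised checksum itself is wrong, as the bug described above suggests, a check like this fails even when the file arrived undamaged, so retrying the download would not help.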

But your SAF and PNW models, most of which download correctly, should run and complete. Yours are crashing very early on with -161 errors which one wouldn't normally get with these stably-compiled model types.

Cpdn news

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 41786 - Posted: 12 Mar 2011, 8:15:38 UTC

Overclocked!


Gotcha!



Backups: Here

Lockleys
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 41787 - Posted: 12 Mar 2011, 8:19:11 UTC - in response to Message 41785.  

As far as I can see ... the new batch of EU models created and released yesterday still contain this bug.


I received a new _eu model last night which arrived without the MD5 checksum error and has now started to be processed. FYI.

/Rgds

old_user611471
Joined: 26 Jan 10
Posts: 5
Credit: 168,664
RAC: 0
Message 41788 - Posted: 12 Mar 2011, 9:59:40 UTC

Thanks for all of the comments/help. I have just done a small stability test for 10 hours and everything reports OK, with the temperature stable at 59C, so the system seems to be running OK. I will try to do a longer test during the week.

I have reduced the CPU usage to 60% and the number of CPUs to 3. I'll see how that goes. The CPUs are hardly working now!!! Ho Hum

Once again thanks

old_user611471
Joined: 26 Jan 10
Posts: 5
Credit: 168,664
RAC: 0
Message 41831 - Posted: 19 Mar 2011, 8:42:10 UTC

Well, a week into my new settings and I have just had another failure! Three tasks were between 60% and 78% and all were OK last night; I switched on this morning to find one of the tasks showing 100%, Computation Error, with no error message that I can find.

So, 100% failure still.

If the others fail as well, I think I may just give up on the project.

astroWX
Volunteer moderator
Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 41837 - Posted: 19 Mar 2011, 17:19:43 UTC - in response to Message 41831.  

So, 100% failure still.

How do you shut the machine down? If you don't first suspend CPDN tasks and exit BOINC before shutdown, try that. It might also help to shut down the connected client from the BOINC Advanced tab.

Is the machine still overclocked? Ten hours of as many copies of Prime-95 Torture Test as a machine has cores really isn't enough to prove the point. --> Years ago, in preparation for the original Spinup runs, one of my P-4 boxes failed on a small O/C, with hyper-threading, after about 24 hours. That machine was unstable in O/C mode as far as I was concerned and it was not used for Spinup, even with H/T off. (CPDN models were designed to run on Cray supercomputers and can be touchy on all but the most stable PCs.)
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.

old_user611471
Joined: 26 Jan 10
Posts: 5
Credit: 168,664
RAC: 0
Message 41838 - Posted: 20 Mar 2011, 10:58:36 UTC

I have just looked at the task stderr for my failed tasks and all of them have the same pattern; only the point of failure seems to change, anywhere from 10% to 70% of the way through. Having never had any problem until now, I am not sure what this means.

CPDN Monitor - Quit request from BOINC...
Global Worker:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=5864, iMonCtr=1
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=6044, selfPID=6044, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=6044, selfPID=1140, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
20:15:10 (1140): called boinc_finish

</stderr_txt>
<message>
<file_xfer_error>
<file_name>hadam3p_saf_1drd_1984_1_006945665_0_1.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_saf_1drd_1984_1_006945665_0_2.zip</file_name>
<error_code>-161</error_code>
...

What causes the CPDN process not to run, or is it the fact that the files are missing?
My old PC was a dual core running XP and I have just moved to a quad core running Windows 7, so everything about the system is different. I don't rule out anything being at the root of the problem!

Any help appreciated.

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1079
Credit: 6,905,706
RAC: 6,529
Message 41840 - Posted: 20 Mar 2011, 13:01:38 UTC - in response to Message 41838.  

... or is it the fact the files are missing.

No. The missing files message is produced after the model has crashed. BOINC has a list of output files that should be produced and if the model hasn't got to the end then some of those files won't be created - BOINC is just listing the files that haven't been created. It's not a very helpful message.
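As a rough illustration of reading that message, the Python sketch below pulls the file names and error codes out of the `<file_xfer_error>` entries quoted earlier in this thread. The helper name `missing_outputs` is invented for this example, and the sample text is copied from the post above, including its truncation.

```python
import re

# Each entry pairs an expected output file with the error code BOINC recorded.
FILE_XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>"
)

# Fragment copied from the stderr listing quoted earlier in this thread.
SAMPLE = """<message>
<file_xfer_error>
<file_name>hadam3p_saf_1drd_1984_1_006945665_0_1.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_saf_1drd_1984_1_006945665_0_2.zip</file_name>
<error_code>-161</error_code>
...
"""


def missing_outputs(task_output):
    """Hypothetical helper: list (file name, error code) pairs BOINC reported."""
    return [
        (match.group("name"), int(match.group("code")))
        for match in FILE_XFER_ERROR.finditer(task_output)
    ]


for name, code in missing_outputs(SAMPLE):
    print(f"{name}: error {code}")
```

A regular expression is used instead of an XML parser because the pasted fragment is cut off and not well-formed. The list only shows which outputs were never produced; the cause of the crash has to be found further up in the stderr text.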

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1079
Credit: 6,905,706
RAC: 6,529
Message 41841 - Posted: 20 Mar 2011, 13:15:32 UTC - in response to Message 41838.  

What causes the CPDN process not to run ...

The tasks running on the i5 machine seem to be suspending a lot. For example, click the '+' button next to Stderr in this task.

The regional models contain a number of synchronised processes and perhaps the frequent suspends are causing the problems. I doubt whether the suspends are caused by repeatedly selecting the 'snooze' option on the BOINC system tray icon. More likely the setting for 'While processor usage is less than X%' is still at the default of 25%, which means that every time the processor usage exceeds 25% the models will be suspended. Set that value to 0% (BOINC Manager | Advanced | Preferences | 'processor usage'). In the 'disk and memory usage' tab of that dialog box, select 'leave applications in memory while suspended'.
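The suspend rule described above can be sketched in a few lines of Python. This is only a model of the behaviour as described in this post, not BOINC's actual code: the function `should_suspend`, the assumption that 0 switches the check off, and the sample readings are all illustrative.

```python
def should_suspend(cpu_usage_percent, threshold_percent):
    """Model of the 'While processor usage is less than X%' preference.

    Assumption: a threshold of 0 disables the check, so tasks are never
    suspended because of processor usage.
    """
    if threshold_percent <= 0:
        return False
    return cpu_usage_percent > threshold_percent


# Hypothetical processor usage readings, in percent.
readings = [10, 30, 26, 5]

suspends_at_default = sum(should_suspend(r, 25) for r in readings)
suspends_at_zero = sum(should_suspend(r, 0) for r in readings)

print(suspends_at_default)  # 2 readings exceed the 25% default
print(suspends_at_zero)     # 0, the check is switched off
```

With the default threshold, even brief spikes above 25% interrupt the models, which would fit the frequent suspends seen in the task's Stderr.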

The final bit of advice you have already applied: multiple HADAM3P regional models slow each other down - so it's a good idea to run a mix of model types (or projects) or run with some unused core(s).

Actually, that's not the final piece of advice: make sure that the BOINC application and data folders are excluded from virus checking and snooze BOINC if you're about to do something very processor intensive (like a game). Previous posters have mentioned shutting down manually (or suspend/shutdown) - that's very good advice too.

old_user611471
Joined: 26 Jan 10
Posts: 5
Credit: 168,664
RAC: 0
Message 41842 - Posted: 20 Mar 2011, 13:49:21 UTC

Thanks for that.

I have just had my first completion and another is at 94%, so hopefully that will succeed as well.

I have now got
Processor usage 0%
75% of processors
75% CPU time
and leave in memory

I was concerned at the number of suspends as nothing other than BOINC was running with 3 HADAM3P tasks.

Hopefully with these settings I won't have so many problems!