Upgrade Problem HADAM3P failure

Author	Message
old_user611471 Joined: 26 Jan 10 Posts: 5 Credit: 168,664 RAC: 0	Message 41780 - Posted: 11 Mar 2011, 18:19:14 UTC Hi, I have recently upgraded my computer to Windows 7 with an overclocked Intel i5 processor. At the same time the climateprediction.net servers/database went through an overhaul. Since my upgrade and the server update (I am still getting errors from the server about space) I haven't managed to complete a task: all have ended with either "Error while downloading" or "Error while computing". The SETI project is working with no errors. Any ideas… ID: 41780

JIM Joined: 31 Dec 07 Posts: 1152 Credit: 22,118,845 RAC: 2,411	Message 41782 - Posted: 11 Mar 2011, 19:06:18 UTC - in response to Message 41780. How much is your computer overclocked? These climate models tend to become unstable when the processor is run above its rated clock speed. Other projects don't seem to be as sensitive to overclocking. ID: 41782

geophi Volunteer moderator Joined: 7 Aug 04 Posts: 2169 Credit: 64,555,907 RAC: 5,858	Message 41783 - Posted: 11 Mar 2011, 20:08:31 UTC I'm with Jim. What clock speed is your PC running at? Those CPUs overclock like crazy. The default big overclock is very high, and may work fine with one or two tasks running. But if you're trying to run 4 cores at a time at 100% CPU utilization at 4.2 GHz, you'd better have really, really good cooling. ID: 41783

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 41785 - Posted: 12 Mar 2011, 7:30:36 UTC Hi Oracle Your EU models have all crashed with download errors caused by MD5 checksum errors. This error is a bug in the EU tasks and was mentioned in the News thread (top of Number Crunching). As far as I can see from your computer's web page the new batch of EU models created and released yesterday still contain this bug. You could edit the climateprediction preferences in your account and say you don't want EU models for the time being until the checksum error bug is fixed. But your SAF and PNW models, most of which download correctly, should run and complete. Yours are crashing very early on with -161 errors which one wouldn't normally get with these stably-compiled model types. Cpdn news ID: 41785

Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0	Message 41786 - Posted: 12 Mar 2011, 8:15:38 UTC Overclocked! Gotcha! Backups: Here ID: 41786

Lockleys Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0	Message 41787 - Posted: 12 Mar 2011, 8:19:11 UTC - in response to Message 41785. "As far as I can see ... the new batch of EU models created and released yesterday still contain this bug." I received a new _eu model last night which arrived without the MD5 checksum error and has now started to be processed. FYI. /Rgds ID: 41787

old_user611471 Joined: 26 Jan 10 Posts: 5 Credit: 168,664 RAC: 0	Message 41788 - Posted: 12 Mar 2011, 9:59:40 UTC Thanks for all of the comments/help. I have just done a small stability test for 10 hours and everything reports OK, with the temperature stable at 59C, so the system seems to be running OK. I will try to do a longer test during the week. I have lowered the CPU usage to 60% and the number of CPUs to 3. I'll see how that goes. The CPUs are hardly working now!!! Ho hum. Once again, thanks. ID: 41788

old_user611471 Joined: 26 Jan 10 Posts: 5 Credit: 168,664 RAC: 0	Message 41831 - Posted: 19 Mar 2011, 8:42:10 UTC Well, a week into my new settings and I have just had another failure! Three tasks were at between 60% and 78%. All were OK last night; I switched on this morning to find one of the tasks at 100% Computation Error, with no error message that I can find. So, 100% failure still. If the others fail as well I think I may just give up on the project. ID: 41831

astroWX Volunteer moderator Joined: 5 Aug 04 Posts: 1496 Credit: 95,522,203 RAC: 0	Message 41837 - Posted: 19 Mar 2011, 17:19:43 UTC - in response to Message 41831. "So, 100% failure still." How do you shut the machine down? If you don't first suspend CPDN tasks and exit BOINC before shutdown, try that. It might also help to use "Shut down connected client" on the BOINC Advanced tab. Is the machine still overclocked? Ten hours of as many copies of the Prime95 Torture Test as a machine has cores really isn't enough to prove the point. Years ago, in preparation for the original Spinup runs, one of my P-4 boxes failed on a small O/C, with hyper-threading, after about 24 hours. That machine was unstable in O/C mode as far as I was concerned and it was not used for Spinup, even with H/T off. (CPDN models were designed to run on Cray supercomputers and can be touchy on all but the most stable PCs.) "We have met the enemy and he is us." -- Pogo Greetings from coastal Washington state, the scenic US Pacific Northwest. ID: 41837

old_user611471 Joined: 26 Jan 10 Posts: 5 Credit: 168,664 RAC: 0	Message 41838 - Posted: 20 Mar 2011, 10:58:36 UTC I have just looked at the task stderr for my failed tasks and all of them show the same pattern; only the point of failure seems to change, anywhere from 10% to 70% of the way through. Having never had any problem until now, I'm not sure what this means:
CPDN Monitor - Quit request from BOINC...
Global Worker:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=5864, iMonCtr=1
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=6044, selfPID=6044, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=6044, selfPID=1140, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
20:15:10 (1140): called boinc_finish
</stderr_txt>
<message>
<file_xfer_error>
<file_name>hadam3p_saf_1drd_1984_1_006945665_0_1.zip</file_name>
<error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
<file_name>hadam3p_saf_1drd_1984_1_006945665_0_2.zip</file_name>
<error_code>-161</error_code>
...
What causes the CPDN process not to run, or is it the fact the files are missing? My old PC was a dual core running XP and I have just moved to a quad core running Windows 7, so everything about the system is different. I don't rule out anything being at the root of the problem! Any help appreciated. ID: 41838

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1081 Credit: 7,079,827 RAC: 6,030	Message 41840 - Posted: 20 Mar 2011, 13:01:38 UTC - in response to Message 41838. "... or is it the fact the files are missing." No. The missing files message is produced after the model has crashed. BOINC has a list of output files that should be produced and if the model hasn't got to the end then some of those files won't be created - BOINC is just listing the files that haven't been created. It's not a very helpful message. ID: 41840
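
To see which output files BOINC is reporting as not created, the error block quoted in Message 41838 can be scanned for file names and error codes. What follows is only a minimal Python sketch, assuming the message text has been copied into a string. The function name `missing_outputs` and the constant names are made up for illustration, and a regular expression is used because the quoted fragment is truncated and would not parse as XML.

```python
import re

# Excerpt of the <message> block quoted in Message 41838 (truncated there too).
MESSAGE = """
<file_xfer_error>
  <file_name>hadam3p_saf_1drd_1984_1_006945665_0_1.zip</file_name>
  <error_code>-161</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>hadam3p_saf_1drd_1984_1_006945665_0_2.zip</file_name>
  <error_code>-161</error_code>
"""

# Match each file name followed (after optional whitespace) by its error code.
PATTERN = re.compile(
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>"
)


def missing_outputs(message):
    """Return (file_name, error_code) pairs found in a BOINC error message."""
    return [(m.group("name"), int(m.group("code")))
            for m in PATTERN.finditer(message)]


for name, code in missing_outputs(MESSAGE):
    print(f"{name}: error {code}")
```

As explained above, this list is a symptom of the crash rather than its cause, so it is mainly useful for confirming that every failed task shows the same pattern.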

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1081 Credit: 7,079,827 RAC: 6,030	Message 41841 - Posted: 20 Mar 2011, 13:15:32 UTC - in response to Message 41838. "What causes the CPDN process not to run ..." The tasks running on the i5 machine seem to be suspending a lot. For example, click the '+' button next to Stderr in this task. The regional models contain a number of synchronised processes and perhaps the frequent suspends are causing the problems. I doubt whether the suspends are caused by repeatedly selecting the 'snooze' option on the BOINC system tray icon. More likely the setting for 'While processor usage is less than X%' is still at the default - 25% - which means that every time the processor usage exceeds 25% the models will be suspended. Set that value to 0% (BOINC Manager | Advanced | Preferences | 'processor usage'). In the 'disk and memory usage' tab of that dialog box, select 'leave applications in memory while suspended'. The final bit of advice you have already applied: multiple HADAM3P regional models slow each other down - so it's a good idea to run a mix of model types (or projects) or run with some unused core(s). Actually, that's not the final piece of advice: make sure that the BOINC application and data folders are excluded from virus checking and snooze BOINC if you're about to do something very processor intensive (like a game). Previous posters have mentioned shutting down manually (or suspend/shutdown) - that's very good advice too. ID: 41841
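
For anyone who prefers a file to the dialog box, the two preference changes above can also be written as a local override. This is only a sketch under stated assumptions: the file name `global_prefs_override.xml` and the element names `suspend_cpu_usage` and `leave_apps_in_memory` reflect my understanding of the BOINC client's local preferences format and are not taken from this thread, so check them against the BOINC documentation for your client version. The helper `build_override` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_override(suspend_cpu_usage=0, leave_apps_in_memory=True):
    """Build the XML text for a local BOINC preferences override.

    Element names are assumptions about the BOINC format, not confirmed here.
    """
    root = ET.Element("global_preferences")
    # 0 means: never suspend because of other programs' processor usage.
    ET.SubElement(root, "suspend_cpu_usage").text = str(suspend_cpu_usage)
    # 1 means: keep suspended applications in memory.
    ET.SubElement(root, "leave_apps_in_memory").text = (
        "1" if leave_apps_in_memory else "0"
    )
    return ET.tostring(root, encoding="unicode")


# Print the fragment; saving it into the BOINC data folder is left to the reader.
print(build_override())
```

The sketch only prints the XML instead of writing it, since the location of the BOINC data folder differs between installations. The dialog-box route described in the post remains the simpler option.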

old_user611471 Joined: 26 Jan 10 Posts: 5 Credit: 168,664 RAC: 0	Message 41842 - Posted: 20 Mar 2011, 13:49:21 UTC Thanks for that. I have just had my first completion and another is on 94%, so hopefully that will succeed as well. I have now got: processor usage 0%, 75% of processors, 75% CPU time, and leave in memory. I was concerned at the number of suspends, as nothing other than BOINC was running, with 3 HADAM3P tasks. Hopefully with these settings I won't have so many problems! ID: 41842