climateprediction.net home page
Ocean model crashed.

Message boards : Number crunching : Ocean model crashed.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 42760 - Posted: 7 Aug 2011, 4:33:29 UTC

This model crashed, reporting a client error: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=13132040 I suspect this has been covered before, but I don't remember seeing it for this particular batch of models. My other task from the same batch is still running. One of the two other tasks from the work unit crashed at time 0; this one was around the 50% mark. Despite the "client error" message, I suspect my computer is not at fault.

Dave
ID: 42760
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 42783 - Posted: 17 Aug 2011, 20:27:05 UTC - in response to Message 42760.  

And the other full-resolution ocean model I was running has now crashed, albeit another 20-odd percent further through. I notice that none of the other tasks in the same work units completed either, but I had hoped the one that kept going beyond 70% would complete. The result page for it is http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=13122714

Dave
ID: 42783
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2170
Credit: 64,555,907
RAC: 5,858
Message 42784 - Posted: 17 Aug 2011, 22:48:19 UTC

Dave,

I don't know if this is right, but the integer benchmark score is HUGE for that processor. Is it significantly overclocked?
ID: 42784
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 42785 - Posted: 18 Aug 2011, 4:30:00 UTC - in response to Message 42784.  

Thanks for that comment, which intrigues me. No, it is not significantly overclocked. The multiplier is standard. I haven't tried, but I don't think it is unlocked.
[dave@localhost ~]$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Pentium(R) Dual-Core CPU E5400 @ 2.70GHz
stepping : 10
cpu MHz : 2699.621
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5399.24
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Pentium(R) Dual-Core CPU E5400 @ 2.70GHz
stepping : 10
cpu MHz : 2699.621
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm xsave lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5399.60
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:
I can't think why the integer benchmark score should be HUGE when I haven't been playing around with overclocking.

Dave

ID: 42785
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42786 - Posted: 18 Aug 2011, 10:30:40 UTC

Get a copy of cpu-z and run that.
It'll tell you the current speed of the processor.


ID: 42786
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 42787 - Posted: 18 Aug 2011, 15:13:51 UTC - in response to Message 42786.  

Thanks Les. cpu-z doesn't seem to have a Linux version, so I have installed PerlMon, which claims to report the actual CPU frequency. It tells me 2699.814MHz. As I have never increased the clock frequency on this box by more than 10Hz, that makes me even more curious as to why the integer benchmark should be "HUGE" for the processor I have. I am disinclined to believe that anyone who shares the house with me is fiddling, as neither has been converted to the power of the penguin.
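
For anyone wanting to repeat this check without installing a monitoring tool, here is a minimal sketch that reads the same fields shown in the /proc/cpuinfo dump earlier in the thread. The function names and the 5% tolerance are illustrative assumptions, not anything BOINC or CPDN uses, and the sample text is copied from the post above.

```python
import re

def cpu_mhz_report(cpuinfo_text):
    """Return (nominal_mhz, measured_mhz_list) parsed from /proc/cpuinfo text."""
    # One "cpu MHz" line appears per logical processor.
    measured = [float(v) for v in
                re.findall(r"^cpu MHz\s*:\s*([\d.]+)", cpuinfo_text, re.M)]
    # The nominal clock is taken from the "@ 2.70GHz" part of the model name.
    nominal = None
    match = re.search(r"^model name\s*:.*@\s*([\d.]+)\s*GHz", cpuinfo_text, re.M)
    if match:
        nominal = float(match.group(1)) * 1000.0
    return nominal, measured

def looks_overclocked(nominal, measured, tolerance=0.05):
    """True if any core reports more than `tolerance` above the nominal clock."""
    if nominal is None:
        return False
    return any(mhz > nominal * (1.0 + tolerance) for mhz in measured)

# Sample lines copied from the thread; on a real box, pass in
# open("/proc/cpuinfo").read() instead.
SAMPLE = (
    "model name : Pentium(R) Dual-Core CPU E5400 @ 2.70GHz\n"
    "cpu MHz : 2699.621\n"
    "model name : Pentium(R) Dual-Core CPU E5400 @ 2.70GHz\n"
    "cpu MHz : 2699.621\n"
)
nominal, measured = cpu_mhz_report(SAMPLE)
print("nominal MHz:", nominal, "measured MHz:", measured)
print("looks overclocked:", looks_overclocked(nominal, measured))
```

Note that the "cpu MHz" value can drop below nominal when frequency scaling is active, so this only helps to rule overclocking out, not to explain a high benchmark score.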


Dave
ID: 42787
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2170
Credit: 64,555,907
RAC: 5,858
Message 42789 - Posted: 19 Aug 2011, 19:21:18 UTC

Perhaps it's an overestimation from a new version of BOINC. I don't have 6.12.xx running on any of my Linux boxes, so I am not sure whether the new version inflates the score.
ID: 42789
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 42791 - Posted: 20 Aug 2011, 9:13:47 UTC - in response to Message 42789.  

Could be. I never paid attention to the benchmark scores before upgrading to the latest BOINC, so I can't comment. It still leaves the question of why the crashes happened. For one of the two tasks in question, none of the other tasks in the work unit completed. For the other, one completed and the other crashed. If there are likely to be any more of these models, should I remove them from my preferences or not? I normally back up about once a week but haven't yet tried restoring a crashed model. What happens if restoring is successful? Are the results accepted after the task is already showing "Error while computing" in the status column?

Sorry if many of these questions are ones that have been answered a number of times already.

Dave
ID: 42791
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,130,442
RAC: 2,102
Message 42792 - Posted: 20 Aug 2011, 14:09:58 UTC - in response to Message 42791.  

Yes, the results are accepted and the data is used by the scientists just like any other completed WU. The one thing that can be off-putting is that when a restored WU finishes and sends its results, a line will appear in the messages saying it was previously reported as an error. Just ignore this message, as it only applies to other projects and has no meaning in CPDN.

ID: 42792
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 42793 - Posted: 20 Aug 2011, 18:27:06 UTC - in response to Message 42792.  

Thanks Jim,
I will try to back up twice a week and, in the event of another crash, will try a restore to see what happens.
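
A regular backup like this can be scripted. The following is only a sketch under stated assumptions: the function name and archive naming are made up for illustration, the BOINC data directory location differs between installations, and it assumes the BOINC client has been stopped first so that the model files are not changing while they are copied (general advice, not something stated in this thread).

```python
import tarfile
import time
from pathlib import Path

def backup_boinc_dir(data_dir, backup_root):
    """Write a timestamped .tar.gz of data_dir into backup_root; return its path.

    Assumes the BOINC client is NOT running while this is called.
    """
    data_dir = Path(data_dir)
    backup_root = Path(backup_root)
    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = backup_root / f"boinc-backup-{stamp}.tar.gz"
    with tarfile.open(str(archive), "w:gz") as tar:
        # Store paths relative to the directory name so a restore
        # recreates a single top-level folder.
        tar.add(str(data_dir), arcname=data_dir.name)
    return archive
```

Restoring is then a matter of stopping BOINC again, moving the current data directory aside, and extracting the archive in its place. Keeping more than one archive is safer, in case the newest backup already contains the state that leads to the crash.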

Dave
ID: 42793
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,130,442
RAC: 2,102
Message 42798 - Posted: 23 Aug 2011, 13:42:14 UTC

As you can see task hadcm3n_yfi4_1900_40_007352950_0 crashed at the very end with the 4.zip file missing.

8/23/2011 4:29:03 AM climateprediction.net Generated new computer cross-project ID: ead5a08edbfa16e91d2b66991616c4ee
8/23/2011 4:29:04 AM climateprediction.net Computation for task hadcm3n_yfi4_1900_40_007352950_0 finished
8/23/2011 4:29:04 AM climateprediction.net Output file hadcm3n_yfi4_1900_40_007352950_0_4.zip for task hadcm3n_yfi4_1900_40_007352950_0 absent
8/23/2011 4:29:05 AM climateprediction.net Restarting task hadam3p_pnw_3204_1980_1_007395222_0 using hadam3p_pnw version 609
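
If the BOINC messages are saved to a text file, lines like the "absent" one above can be picked out automatically. This is a minimal sketch; the regular expression is based only on the wording of the log line quoted in this post, so other BOINC versions may phrase it differently.

```python
import re

# Matches lines such as:
#   "... Output file <file> for task <task> absent"
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent")

def find_absent_outputs(log_text):
    """Return a list of (output_file, task_name) pairs reported as absent."""
    return ABSENT_RE.findall(log_text)

# The log line quoted in the post above.
LOG = (
    "8/23/2011 4:29:04 AM climateprediction.net Output file "
    "hadcm3n_yfi4_1900_40_007352950_0_4.zip for task "
    "hadcm3n_yfi4_1900_40_007352950_0 absent\n"
)
for output_file, task in find_absent_outputs(LOG):
    print(f"task {task} is missing {output_file}")
```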


Could this have anything to do with the recent change in the server used? I do have a backup that I made only 6 hours before the model finished, so if a solution can be found I could restore and run it to the end again.

ID: 42798
JIM
Joined: 31 Dec 07
Posts: 1152
Credit: 22,130,442
RAC: 2,102
Message 42809 - Posted: 24 Aug 2011, 19:34:37 UTC - in response to Message 42798.  

I restored task hadcm3n_yfi4_1900_40_007352950_0 from the backup made 6 hours before the end and ran it again. Same result: it crashes approx. 2 hours from the end. The stderr output is shown below. The line that looks significant to me is:

23:50:55 (4200): Can't acquire lock file (32) - waiting 35s
It shows up twice.

Unless someone can come up with a correctable reason for the failure, I will delete the backup and move on. I hate to just write off 900 hours of crunching.

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
- exit code 193 (0xc1)
</message>
<stderr_txt>
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
02:11:08 (7860): Can't acquire lockfile (32) - waiting 35s
02:11:15 (3840): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
23:50:55 (4200): Can't acquire lockfile (32) - waiting 35s
23:51:10 (7860): No heartbeat from core client for 30 sec - exiting
CPDN Monitor - No 'heartbeat' from BOINC...
CPDN Monitor - Quit request from BOINC...
CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Signal 11 received, exiting...
Called boinc_finish

</stderr_txt>
]]>
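
A long stderr dump like this is easier to read once the repeated lines are counted. Here is a rough sketch that tallies the message types seen above; the category names and patterns are my own choice, taken only from the text of this particular dump.

```python
import re
import signal
from collections import Counter

# Patterns taken from the wording of the stderr lines in this post.
PATTERNS = {
    "quit_request": re.compile(r"Quit request from BOINC"),
    "suspend_request": re.compile(r"Suspend request from BOINC"),
    "lockfile_wait": re.compile(r"Can't acquire lockfile"),
    "no_heartbeat": re.compile(r"No heartbeat from core client"),
    "signal_received": re.compile(r"Signal \d+ received"),
}

def summarise_stderr(text):
    """Count how many lines fall into each message category."""
    counts = Counter()
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts

def signal_name(number):
    """Best-effort conventional name for a signal number, else 'unknown'."""
    try:
        return signal.Signals(number).name
    except ValueError:
        return "unknown"

# A short excerpt of the dump above.
EXCERPT = (
    "CPDN Monitor - Quit request from BOINC...\n"
    "Suspended CPDN Monitor - Suspend request from BOINC...\n"
    "02:11:08 (7860): Can't acquire lockfile (32) - waiting 35s\n"
    "02:11:15 (3840): No heartbeat from core client for 30 sec - exiting\n"
    "Signal 11 received, exiting...\n"
)
print(dict(summarise_stderr(EXCERPT)))
print("signal 11 is conventionally", signal_name(11))
```

Signal 11 is conventionally a segmentation fault, which would suggest the final failure is a memory access error in the model rather than the lockfile wait itself, although the two could still be related.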
ID: 42809
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 42810 - Posted: 24 Aug 2011, 22:32:45 UTC - in response to Message 42809.  

The lock file problem could be several things, but it is most likely caused by the antivirus program scanning each file that becomes active, to check it for viruses before allowing you to use it.
Some AV programs are more aggressive than others in the way they work.

That is why it has long been recommended to exclude the entire BOINC data section from AV scanning, for both manually started scans AND scheduled scans.


ID: 42810
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 42811 - Posted: 25 Aug 2011, 4:17:36 UTC - in response to Message 42810.  

Still no closer to working out why my units crashed. I assume there may be a clue in the errors below, but I am afraid they mean nothing to me. I do know, however, that it isn't an antivirus program scanning anything in my case.

SIGABRT: abort called
Stack trace (17 frames):
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu(boinc_catch_signal+0x6f)[0x80b80df]
[0xffffe400]
[0xffffe430]
/lib/libc.so.6(gsignal+0x51)[0xf7540ce1]
/lib/libc.so.6(abort+0x182)[0xf7542632]
/lib/libc.so.6(+0x65e4d)[0xf757ce4d]
/lib/libc.so.6(+0x6bba1)[0xf7582ba1]
/usr/lib/libstdc++.so.6(_ZdlPv+0x21)[0xf7763321]
/usr/lib/libstdc++.so.6(_ZdaPv+0x1d)[0xf776337d]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8053e8e]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8057bc4]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x804f232]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8050491]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x805112c]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x805137a]
/lib/libc.so.6(__libc_start_main+0xe6)[0xf752db96]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu(__gxx_personality_v0+0x169)[0x804cb51]
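
For readers who want to make sense of a trace like this, here is a small sketch that splits each frame into module, symbol, offset and address. The regular expression is written for the frame layout shown above; the table of demangled names covers only the two C++ symbols that appear in this trace.

```python
import re

# Frame layouts seen in the trace: "module(symbol+0xoff)[0xaddr]",
# "module(+0xoff)[0xaddr]", "module[0xaddr]" and a bare "[0xaddr]".
FRAME_RE = re.compile(
    r"^(?P<module>[^\[(]*)"
    r"(?:\((?P<symbol>[^+)]*)(?:\+(?P<offset>0x[0-9a-f]+))?\))?"
    r"\[(?P<addr>0x[0-9a-f]+)\]$"
)

# Itanium C++ ABI mangled names that occur in this trace.
DEMANGLED = {
    "_ZdlPv": "operator delete(void*)",
    "_ZdaPv": "operator delete[](void*)",
}

def parse_frames(trace_text):
    """Return one dict per recognisable stack frame line."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if not match:
            continue  # skip headers such as "SIGABRT: abort called"
        frame = match.groupdict()
        symbol = frame["symbol"] or ""
        frame["readable"] = DEMANGLED.get(symbol, symbol)
        frames.append(frame)
    return frames

TRACE = """SIGABRT: abort called
[0xffffe400]
/lib/libc.so.6(abort+0x182)[0xf7542632]
/lib/libc.so.6(+0x65e4d)[0xf757ce4d]
/usr/lib/libstdc++.so.6(_ZdaPv+0x1d)[0xf776337d]
../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu[0x8053e8e]
"""
for frame in parse_frames(TRACE):
    print(frame["module"] or "(no module)", frame["readable"], frame["addr"])
```

Read from the bottom up, the model calls operator delete[], which ends in abort() inside libc. That pattern would be consistent with the C library detecting a problem while freeing memory, such as a double free or a corrupted heap, but since the libc frames carry no symbol names this is an interpretation rather than a diagnosis.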


Dave
ID: 42811
Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 42812 - Posted: 25 Aug 2011, 5:02:35 UTC - in response to Message 42811.  
Last modified: 25 Aug 2011, 5:04:11 UTC

Dave, I see there's a large number of "Suspended CPDN monitor" messages -- your computer is throttling back the tasks quite frequently. In the past, people have noticed stability problems when CPDN tasks are starved of resources.

Have you tried this:-

In Boinc Manager - Advanced menu - Preferences:

On the "processor usage" tab, set 'Use at most ... % of CPU' to 100.00, or at a minimum, 80.00.

(If you're worried about heat, it'd be best to set "On multiprocessor systems, use at most 50.00 % of the processors", i.e. process one task at a time.)

On the "disk and memory usage" tab, ensure "Leave applications in memory when suspended" is selected.

Also on this tab, for two HadCM3Ns in 2GB, it'd be best to set the Memory Usage figure "Use at most ... % when computer is in use" to at least 80.00 % to be on the safe side. Likewise for "Use at most ... when computer is idle".
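
The same settings can also be kept in a local override file instead of being set through the manager. The sketch below generates such a file. The tag names and the file name global_prefs_override.xml are given as I understand them from the BOINC client documentation, so treat them as assumptions and check them against your client version; the function name and defaults are illustrative only.

```python
import xml.etree.ElementTree as ET

def build_prefs_override(cpu_usage_limit=100.0, max_ncpus_pct=100.0,
                         leave_apps_in_memory=True,
                         ram_busy_pct=80.0, ram_idle_pct=80.0):
    """Return XML text for a BOINC local preferences override (assumed tag names)."""
    root = ET.Element("global_preferences")
    # "Use at most ... % of CPU time"
    ET.SubElement(root, "cpu_usage_limit").text = f"{cpu_usage_limit:.2f}"
    # "On multiprocessor systems, use at most ... % of the processors"
    ET.SubElement(root, "max_ncpus_pct").text = f"{max_ncpus_pct:.2f}"
    # "Leave applications in memory when suspended"
    ET.SubElement(root, "leave_apps_in_memory").text = "1" if leave_apps_in_memory else "0"
    # "Use at most ... % when computer is in use / is idle"
    ET.SubElement(root, "ram_max_used_busy_pct").text = f"{ram_busy_pct:.2f}"
    ET.SubElement(root, "ram_max_used_idle_pct").text = f"{ram_idle_pct:.2f}"
    return ET.tostring(root, encoding="unicode")

# Example: one task at a time on a dual-core box, as suggested for heat worries.
print(build_prefs_override(max_ncpus_pct=50.0))
```

The output would be saved as global_prefs_override.xml in the BOINC data directory, after which the client needs to re-read its preferences or be restarted.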

If you've done all these and still have the problem, then it might be worth:-

* having a good vacuum-out of the CPU's heat sink, and unseating and re-seating the RAM modules

* running mprime or memtest86+ for 48 hours to check your computer's RAM

* upgrading its power supply to a newer, name brand model.

That last item helped me with a stability problem. Newer PSUs seem to reject power supply noise better than ones from a few years ago.
ID: 42812
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 42813 - Posted: 25 Aug 2011, 15:54:28 UTC - in response to Message 42812.  

Thanks Greg,
The only change I made to the settings was to increase the % memory usage when the computer is in use. There shouldn't be much dust in the system, as it is fairly new. I will probably increase the memory to 4GB soon, which may make a difference. With the regional models I don't seem to have any problems.

Dave
ID: 42813
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 42814 - Posted: 25 Aug 2011, 17:37:03 UTC - in response to Message 42813.  

Just done a temperature check: with two regional models running it is 45C, dropping to 41C if I disable one of the tasks. Voltages seem to be stable too, with less than 0.1V change in either the 12V or 5V line when I stop a task.

Dave
ID: 42814
Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 42815 - Posted: 25 Aug 2011, 21:36:42 UTC - in response to Message 42814.  

OK, Dave, that's good. Increasing the memory % might have done the trick.

If not, the only thing left is to run mprime and memtest86+ for 24 - 48 hours each. You can't use the PC for anything else while memtest86+ is running, though.
ID: 42815
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4353
Credit: 16,598,247
RAC: 6,156
Message 42832 - Posted: 29 Aug 2011, 16:06:36 UTC - in response to Message 42815.  

Memory now up to 4GB. It is still going to be a long wait to see if the HADCM3 finishes or not.
ID: 42832
Greg van Paassen
Joined: 17 Nov 07
Posts: 142
Credit: 4,271,370
RAC: 0
Message 42833 - Posted: 29 Aug 2011, 23:25:37 UTC - in response to Message 42832.  

Yes, I estimate about 4 weeks with the PC running 24/7, less what has been done so far.

Anyway, good luck!
ID: 42833