Multiple failures

old_user61264

Joined: 6 Mar 05
Posts: 4
Credit: 7,782,147
RAC: 0
Message 32178 - Posted: 15 Jan 2008, 14:20:36 UTC

All,

I have two machines currently crunching climate prediction clients. They both run Ubuntu 7.10. One has a dual-processor AMD chip on an Asus motherboard and has repeatedly (20 times) ended a run with 'Client Error'. The failure always occurs at a different point in the run, with between 60k and 2,000k CPU seconds committed.

The second machine has an Intel dual core (4 GB memory, running 64-bit Ubuntu). It succeeds about half the time and ends with a client error the other half.

Am I wasting my time/energy trying to do Climate Prediction? From the figures I have the impression I am contributing very little to the effort in spite of months of CPU time.

Thanks,

Nic out

ID: 32178

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2167
Credit: 64,469,696
RAC: 3,603
Message 32179 - Posted: 15 Jan 2008, 14:36:04 UTC

On the AMD PC, it appears that when one model fails, the other one on the dual-core PC also fails within a few minutes. It's almost as if they error out on an unclean shutdown of boinc, or when some other intensive process runs. If it were pure PC instability, they would fail at various times, rather than at nearly the same time for both runs of a pair.

How do you start boinc on that PC? Does that PC run other intensive programs at various times during the day?
ID: 32179

old_user61264
Joined: 6 Mar 05
Posts: 4
Credit: 7,782,147
RAC: 0
Message 32181 - Posted: 15 Jan 2008, 17:29:41 UTC - in response to Message 32179.  

Thanks for the response.

Actually, the AMD PC is at home and pretty much runs CPDN anytime I don't reboot it into WinXP to play WoW. The Intel PC is my workstation at the lab and is regularly running heavy jobs.

I start them both with:
cd ~/bin/boinc
nohup ./run_client > test.log &

and let them run until I need the CPU for something else.

Nic out






ID: 32181

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32184 - Posted: 15 Jan 2008, 20:27:33 UTC


Well, on the Intel at work, most jobs seem to complete successfully. Of the ones which don't:

16th Nov: 2 jobs killed by signal 11
26th Aug: 2 jobs killed by signal 11
21st Aug: 1 job with signal 11 and exit status 139, the other with just exit status 139
8th May: 2 jobs killed by signal 11

Signal 11 is a segmentation fault, I think somewhat like the access violation in Windows.

What was happening on that box at the time the crashes occurred?


Do you have 'leave in memory' turned on or off? (The recommended setting is to have things stay in memory.)

It may be worth reading through the project READMEs to see if there is anything useful (link in my signature). I should point out that a crash isn't a disaster: the climate results are uploaded to the CPDN server as the model progresses (in the trickle-ups). But since it is always more satisfying to complete the model yourself, some people take backups at intervals, and other people shut down Boinc before running anything major on the PC. It is also a good idea to shut down boinc prior to shutting down the PC.
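For anyone wanting to try the interval backups mentioned above, a minimal sketch might look like the following. It assumes the client has been stopped first so the files are not changing, and that everything lives under a single directory. The function name backup_dir, the archive naming, and the example paths are illustrative assumptions, not anything the project prescribes.

```shell
# Archive a directory into a timestamped tarball and print the archive path.
# Assumption: the client is not running while the backup is taken.
backup_dir() {
    src=$1      # directory to back up (for example, the BOINC directory)
    dest=$2     # directory where the archives are kept
    stamp=$(date +%Y%m%d-%H%M%S)
    archive="$dest/boinc-backup-$stamp.tar.gz"
    mkdir -p "$dest" || return 1
    # -C changes into the parent so the archive holds a relative path.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    echo "$archive"
}

# Example call (both paths are assumptions):
#   backup_dir "$HOME/bin/boinc" "$HOME/boinc-backups"
```

Restoring would then be a matter of stopping the client and unpacking the chosen archive over the same location.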
I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 32184

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32185 - Posted: 15 Jan 2008, 20:32:13 UTC
Last modified: 15 Jan 2008, 20:39:28 UTC

The AMD errors look similar (signal 11, and error code 139, which I think is the same thing), but much more frequent. What was happening on the PC at the moment those crashes took place? Is there anything in the Boinc messages log? (or stderr/stdout?).
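On whether exit status 139 and signal 11 are the same thing: shells conventionally report a process killed by signal N as exit status 128 + N, so 139 - 128 = 11. A small sketch for checking this (describe_status is just an illustrative helper name, not part of BOINC):

```shell
# Decode a shell exit status. Values above 128 conventionally mean
# "terminated by signal (status - 128)".
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l with a number prints the name of that signal.
        echo "signal $sig ($(kill -l "$sig"))"
    else
        echo "exit code $status"
    fi
}

describe_status 139
```

On Linux this should report signal 11 with the name SEGV, which matches the segmentation fault seen in the task results.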

Is there any software in common between your home and work PCs? Something you run more frequently at home?

How do you stop Boinc running when you need it for something else? If you use 'kill -9' or similar I'd recommend using 'boinc_cmd --quit' instead.
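To illustrate why 'kill -9' is the risky option: SIGKILL cannot be caught, so a program gets no chance to tidy up, whereas a polite request (SIGTERM, or a quit command such as the one above) lets it run its shutdown code first. The sketch below uses a hypothetical stand-in job, not BOINC itself, and the checkpoint file is an assumption made purely for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for a long-running job that saves state on SIGTERM.
workdir=$(mktemp -d)

(
    # The handler runs on SIGTERM; no handler can be installed for SIGKILL.
    trap 'echo saved > "$workdir/checkpoint"; exit 0' TERM
    while :; do sleep 1; done
) &
pid=$!

sleep 1              # give the job time to install its handler
kill -TERM "$pid"    # polite request: the handler gets to run
wait "$pid"          # block until the job has really exited

result=$(cat "$workdir/checkpoint")
rm -rf "$workdir"
echo "checkpoint contents: $result"
```

Swapping the kill -TERM for kill -9 in this sketch would end the job without the handler running, so no checkpoint file would be written.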


ID: 32185

old_user61264
Joined: 6 Mar 05
Posts: 4
Credit: 7,782,147
RAC: 0
Message 32197 - Posted: 16 Jan 2008, 14:20:48 UTC - in response to Message 32185.  

Hey Mark,

Actually, I'm not sure what is happening when the 'client error' occurs. I really only check my CPDN numbers once a month or so. So by the time I notice, the failure is usually days to weeks in the past, as CPDN has efficiently sent me a new work unit.

The two machines run pretty much the same software (Ubuntu dual-booted with WinXP, plus my professional software). The one real difference is that the home computer is rebooted almost every evening to play WoW with the kids, while the lab computer goes weeks between reboots.

I'll give the explicit boinc quit command a try to see if it helps. Otherwise there is a lot of new research I can do in your 'README' collection. I'll poke around.

Thanks for your help.

Nic out






ID: 32197

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 32198 - Posted: 16 Jan 2008, 16:13:01 UTC

In the README about crashes and problems I'd recommend a look at item #5 by MikeMars. It doesn't deal specifically with the type of error your models have suffered on both computers, but it does comprehensively list all the normal precautions.
ID: 32198

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32201 - Posted: 16 Jan 2008, 17:13:40 UTC


It could be the reboots which are causing the problem, and the boinc_cmd --quit should resolve that. I usually manually close boinc whenever I quit, both on Linux and Win32.

ID: 32201

old_user200013
Joined: 19 Sep 06
Posts: 1
Credit: 317,635
RAC: 0
Message 32213 - Posted: 17 Jan 2008, 11:44:04 UTC

Hello,

I also have the same problem with CPDN: all 15 of my work units finished with "client error". The one that ran longest reached a total of 477,855.64 seconds.

I use Windows Vista 32-bit, and I usually don't close the BOINC client manually.

I also run Einstein@Home and malariacontrol, and in those two cases almost all the work units finished OK.

Can you please help?

Thanks,
Carlos Almeida
ID: 32213

Iain Inglis
Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 32214 - Posted: 17 Jan 2008, 12:44:46 UTC

Ziggy,

It is especially important to close BOINC manually in Vista, as Microsoft have reduced Vista's closedown time to such an extent that BOINC can't cope. The bigger the BOINC task, the harder it is to close it down quickly, which is why climate models get hit hardest.

Just exit BOINC (or suspend and exit) and you shouldn't get any Vista closedown crashes.

Iain

PS With Vista it's also best to install outside 'C:\Program Files' to stop Vista's User Account Control messing with BOINC.
ID: 32214

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32215 - Posted: 17 Jan 2008, 12:57:02 UTC


It's worthwhile reading the README posts (link in my signature); there are a couple of additional suggestions regarding Vista there, as well as the ones above.
ID: 32215

old_user61264
Joined: 6 Mar 05
Posts: 4
Credit: 7,782,147
RAC: 0
Message 32303 - Posted: 23 Jan 2008, 14:00:24 UTC - in response to Message 32197.  

I think I've found something that helps. I installed boinc through the Ubuntu servers (something like apt-get install boinc-client). That put an entry in /etc/init.d that explicitly starts boinc when I boot Ubuntu. More importantly, it explicitly shuts down boinc when I shut down the machine. This gives boinc/cpdn the time and commands to shut down gracefully during the process.

I'm about halfway through the next model series, and hopefully I'll get a success this time.

Nic out

ID: 32303

MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 32310 - Posted: 23 Jan 2008, 18:13:14 UTC


Good luck on the new model. If it's handy, could you post the shutdown script that Ubuntu uses to stop Boinc (I'm guessing that it'll be calling boinc_cmd --quit)?
ID: 32310

DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 32317 - Posted: 24 Jan 2008, 1:06:52 UTC - in response to Message 32310.  




#!/bin/bash
...
killproc $BOINCEXE

This is what it does. I've been running with those init scripts for a long time now with no problems. killproc waits for the boinc daemon to exit before continuing.
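As a rough picture of what "waits for the daemon to exit before continuing" means, here is a sketch of the general pattern: ask the process to stop, then poll until it is gone or a timeout passes. This is not the actual killproc implementation; stop_and_wait and its 30-second default are assumptions for illustration, and it assumes the process belongs to the user running it.

```shell
# Ask a process to stop with SIGTERM, then poll once a second until it has
# exited or the timeout (in seconds, default 30) runs out.
# Returns 0 if the process is gone, 1 if it is still running at the end.
stop_and_wait() {
    pid=$1
    timeout=${2:-30}
    kill -TERM "$pid" 2>/dev/null
    while [ "$timeout" -gt 0 ]; do
        # kill -0 sends no signal; it only checks whether the pid can be signalled.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1
}

# Example call (the pid variable is hypothetical):
#   stop_and_wait "$daemon_pid" 60 || echo "still running after 60 seconds"
```

The point of the pattern is that the rest of the shutdown does not proceed until the program has had its chance to finish writing its files.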
ID: 32317

©2024 climateprediction.net