Multiple failures

old_user61264 · Joined: 6 Mar 05 · Posts: 4 · Credit: 7,782,147 · RAC: 0
Message 32178 - Posted: 15 Jan 2008, 14:20:36 UTC

All,

I have two machines currently crunching climate prediction clients. They both run Ubuntu 7.10. One is a dual-processor AMD chip on an Asus motherboard, and it repeatedly (20x) ends the run with 'Client Error'. It is always at different places in the run, with between 60k and 2,000k CPU seconds committed. The second machine has an Intel dual core (4 GB memory, runs 64-bit Ubuntu), and has success about half the time and 'Client Error' about half the time.

Am I wasting my time/energy trying to do Climate Prediction? From the figures I have the impression I am contributing very little to the effort in spite of months of CPU time.

Thanks,
Nic out

geophi · Volunteer moderator · Joined: 7 Aug 04 · Posts: 2172 · Credit: 64,723,937 · RAC: 2,741
Message 32179 - Posted: 15 Jan 2008, 14:36:04 UTC

On the AMD PC, it appears that when one model fails, the other one on the dual core PC also fails within a few minutes. It's almost like they error out on an unclean shutdown of boinc, or when some other intensive process runs. If it were pure PC instability, they would be failing at various times, instead of at nearly the same time for both runs of a pair.

How do you start boinc on that PC? Does that PC run other intensive programs at various times during the day?

old_user61264 · Joined: 6 Mar 05 · Posts: 4 · Credit: 7,782,147 · RAC: 0
Message 32181 - Posted: 15 Jan 2008, 17:29:41 UTC - in response to Message 32179.

Thanks for the response.

Actually, the AMD PC is at home and pretty much runs CPDN anytime I don't reboot it into WinXP to play WoW. The Intel PC is my workstation at the lab and is regularly running heavy jobs.

I start them both with:

    cd ~/bin/boinc
    nohup ./run_client > test.log &

and let them run until I need the CPU for something else.

Nic out

MikeMarsUK · Volunteer moderator · Joined: 13 Jan 06 · Posts: 1498 · Credit: 15,613,038 · RAC: 0
Message 32184 - Posted: 15 Jan 2008, 20:27:33 UTC

Well, on the Intel at work, most jobs seem to complete successfully. Of the ones which don't:

16th Nov: 2 jobs killed by signal 11
26th Aug: 2 jobs killed by signal 11
21st Aug: 1 signal 11 and exit status 139, the other just exit status 139
8th May: 2 jobs killed by signal 11

Signal 11 is a segmentation fault, I think somewhat like the access violation in Windows. What was happening on that box at the time the crashes occurred?

Do you have 'leave in memory' turned on or off? (The recommended setting is to have things stay in memory.)

It may be worth reading through the project READMEs to see if there is anything useful (link in my signature).

I should point out that a crash isn't a disaster - the climate data is uploaded to the CPDN server as the model progresses (in the trickle-ups). But since it is always more satisfying to complete the model yourself, some people take backups at intervals, and other people shut down Boinc before running anything major on the PC. It is also a good idea to shut down boinc prior to shutting down the PC.

I'm a volunteer and my views are my own. News and Announcements and FAQ
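A note on the numbers in the list above: when a shell reports that a process was killed by a signal, the exit status is conventionally 128 plus the signal number, so exit status 139 and signal 11 are most likely the same event (139 - 128 = 11). A minimal sketch of that arithmetic in shell follows; `decode_status` is a hypothetical helper name used only for illustration and is not part of BOINC.

```shell
# decode_status STATUS
# Prints the signal name if STATUS encodes "killed by signal N" (128 + N),
# otherwise reports it as an ordinary exit code.
# decode_status is a hypothetical helper, not part of BOINC.
decode_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # kill -l N prints the name of signal N without the SIG prefix
        kill -l "$((status - 128))"
    else
        echo "exit code $status (not a signal)"
    fi
}

decode_status 139    # prints SEGV on Linux
```

This only decodes the status; it says nothing about why the segmentation fault happened.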

MikeMarsUK · Volunteer moderator · Joined: 13 Jan 06 · Posts: 1498 · Credit: 15,613,038 · RAC: 0
Message 32185 - Posted: 15 Jan 2008, 20:32:13 UTC (last modified: 15 Jan 2008, 20:39:28 UTC)

The AMD errors look similar (signal 11, and error code 139, which I think is the same thing), but much more frequent.

What was happening on the PC at the moment those crashes took place? Is there anything in the Boinc messages log (or stderr/stdout)?

Is there any software in common between your home and work PCs? Something you run more frequently at home?

How do you stop Boinc running when you need it for something else? If you use 'kill -9' or similar I'd recommend using 'boinc_cmd --quit' instead.
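The difference between 'kill -9' and a polite stop request can be seen in a small experiment. This is only a sketch under stated assumptions: the worker below is a hypothetical stand-in for a long-running model, not BOINC itself, and `stop_worker` is an invented name. SIGTERM can be caught, so the worker gets a chance to save its state; SIGKILL (signal 9) cannot be caught, so nothing is saved.

```shell
# stop_worker SIGNAL MARKER_FILE
# Starts a hypothetical stand-in for a long-running model, stops it with the
# given signal, and reports whether it managed to save its state first.
stop_worker() {
    sig=$1
    marker=$2
    rm -f "$marker"

    (
        # On TERM the worker records its state and exits cleanly.
        # KILL cannot be trapped, so this handler never runs for it.
        trap 'echo "state saved" > "$marker"; exit 0' TERM
        while :; do sleep 1; done
    ) &
    pid=$!

    sleep 2                              # give the worker time to install its trap
    kill -s "$sig" "$pid"
    wait "$pid" 2>/dev/null || true      # reap the worker; its status is not needed

    if [ -f "$marker" ]; then
        echo "clean"
    else
        echo "state lost"
    fi
}

tmp=$(mktemp)
stop_worker TERM "$tmp"    # prints: clean
stop_worker KILL "$tmp"    # prints: state lost
rm -f "$tmp"
```

Whether the real client behaves the same way on every signal is not shown here; the sketch only illustrates why an abrupt kill leaves no room for cleanup.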

old_user61264 · Joined: 6 Mar 05 · Posts: 4 · Credit: 7,782,147 · RAC: 0
Message 32197 - Posted: 16 Jan 2008, 14:20:48 UTC - in response to Message 32185.

Hey Mark,

Actually, I'm not sure what is happening when the 'client error' occurs. I really only check my CPDN numbers once a month or so, so by the time I notice, the error is usually days to weeks old, as CPDN has efficiently sent me a new work unit.

The two machines run pretty much the same software (Ubuntu dual boot with WinXP, my professional software). The one real difference is that the home computer is rebooted almost every evening to play WoW with the kids, while the lab computer goes weeks between reboots.

I'll give the explicit boinc quit command a try to see if it helps. Otherwise there is a lot of new research I can do in your 'README' collection. I'll poke around.

Thanks for your help.

Nic out

mo.v · Volunteer moderator · Joined: 29 Sep 04 · Posts: 2363 · Credit: 14,611,758 · RAC: 0
Message 32198 - Posted: 16 Jan 2008, 16:13:01 UTC

In the README about crashes and problems I'd recommend a look at item #5 by MikeMars. It doesn't deal specifically with the type of error your models have suffered on both computers, but it does comprehensively list all the normal precautions.

Cpdn news

MikeMarsUK · Volunteer moderator · Joined: 13 Jan 06 · Posts: 1498 · Credit: 15,613,038 · RAC: 0
Message 32201 - Posted: 16 Jan 2008, 17:13:40 UTC

It could be the reboots which are causing the problem, and the boinc_cmd --quit should resolve that. I usually manually close boinc whenever I quit, both on Linux and Win32.

old_user200013 · Joined: 19 Sep 06 · Posts: 1 · Credit: 317,635 · RAC: 0
Message 32213 - Posted: 17 Jan 2008, 11:44:04 UTC

Hello,

I also have the same problem with CPDN: all 15 of my work units finished with "client error". The longest-running one lasted a total of 477,855.64 seconds. I use Windows Vista 32-bit, and I usually don't close the BOINC client manually.

I also run Einstein@Home and malariacontrol, and in those two cases almost all the work units finished OK.

Can you please help?

Thanks,
Carlos Almeida

Iain Inglis · Joined: 9 Jan 07 · Posts: 467 · Credit: 14,549,176 · RAC: 317
Message 32214 - Posted: 17 Jan 2008, 12:44:46 UTC

Ziggy,

It is especially important to close BOINC manually in Vista, as Microsoft have reduced Vista's closedown time to such an extent that BOINC can't cope - and the bigger the BOINC task, the harder it is to close it down quickly, which is why climate models get hit hardest. Just exit BOINC (or suspend and exit) before shutting down and you shouldn't get any Vista closedown crashes.

Iain

PS With Vista it's also best to install outside 'C:\Program Files' to stop Vista's User Account Control messing with BOINC.

MikeMarsUK · Volunteer moderator · Joined: 13 Jan 06 · Posts: 1498 · Credit: 15,613,038 · RAC: 0
Message 32215 - Posted: 17 Jan 2008, 12:57:02 UTC

It's worthwhile reading the README posts (link in my signature); there are a couple of additional suggestions regarding Vista there as well as the above ones.

I'm a volunteer and my views are my own. News and Announcements and FAQ

old_user61264 · Joined: 6 Mar 05 · Posts: 4 · Credit: 7,782,147 · RAC: 0
Message 32303 - Posted: 23 Jan 2008, 14:00:24 UTC - in response to Message 32197.

I think I've found something that helps.

I installed boinc through the Ubuntu servers (something like apt-get install boinc-client). That put an entry in /etc/init.d that explicitly starts up boinc when I boot Ubuntu. More importantly, it explicitly shuts down boinc when I shut down the machine. This gives boinc/cpdn the time and commands to shut down gracefully during the process.

I'm about halfway through the next model series, and hopeful I'll get a success this time.

Nic out

MikeMarsUK · Volunteer moderator · Joined: 13 Jan 06 · Posts: 1498 · Credit: 15,613,038 · RAC: 0
Message 32310 - Posted: 23 Jan 2008, 18:13:14 UTC

Good luck on the new model. If it's handy, could you post the shutdown script that Ubuntu uses to stop Boinc (I'm guessing that it'll be calling boinc_cmd --quit)?

DJStarfox · Joined: 27 Jan 07 · Posts: 300 · Credit: 3,288,263 · RAC: 26,370
Message 32317 - Posted: 24 Jan 2008, 1:06:52 UTC - in response to Message 32310.

    #!/bin/bash
    ...
    killproc $BOINCEXE

This is what it does. I've been running with those init scripts for a long time now with no problems. killproc waits for the boinc daemon to exit before continuing.
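The "wait for the daemon to exit" behaviour described above can be approximated by hand on a machine that has no such init script. The following is a minimal sketch, not the Ubuntu script's actual code: `wait_for_exit` is a hypothetical helper, and the process name `boinc_client` and the 60-second timeout in the usage comment are assumptions that may not match a given installation.

```shell
# wait_for_exit PID TIMEOUT_SECONDS
# Returns 0 once no process with that PID can be signalled, or 1 if it is
# still there after roughly TIMEOUT_SECONDS.
# Note: kill -0 also fails when we lack permission to signal the process,
# so run this as the same user as the client (or as root).
wait_for_exit() {
    pid=$1
    remaining=$2
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Hypothetical use before a reboot (process name and timeout are assumptions):
#   boinc_cmd --quit
#   wait_for_exit "$(pidof boinc_client)" 60 || echo "BOINC still running" >&2
```

The point of the polling loop is the same as in the init script: the machine should not go down until the client has actually finished exiting.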