Multiple failures
Author | Message |
---|---|
Joined: 6 Mar 05 Posts: 4 Credit: 7,782,147 RAC: 0 |
All, I have two machines currently crunching Climate Prediction work units. Both run Ubuntu 7.10. One is a dual-processor AMD machine on an Asus motherboard, and it has repeatedly (20 times) ended the run with 'Client Error'. The failure always comes at a different place in the run, with between 60k and 2,000k CPU seconds committed. The second machine has an Intel dual core (4 GB memory, 64-bit Ubuntu) and succeeds about half the time, with 'Client Error' the other half. Am I wasting my time and energy trying to do Climate Prediction? From the figures, I have the impression I am contributing very little to the effort in spite of months of CPU time. Thanks, Nic out |
Joined: 7 Aug 04 Posts: 2172 Credit: 64,723,937 RAC: 2,741 |
On the AMD PC, it appears that when one model fails, the other model on that dual-core PC also fails within a few minutes. It's almost as if they error out on an unclean shutdown of BOINC, or when some other intensive process runs. If it were pure PC instability, they would fail at unrelated times instead of at nearly the same time for both runs of a pair. How do you start BOINC on that PC? Does that PC run other intensive programs at various times during the day? |
Joined: 6 Mar 05 Posts: 4 Credit: 7,782,147 RAC: 0 |
Thanks for the response. Actually, the AMD PC is at home and pretty much runs CPDN any time I don't reboot it into WinXP to play WoW. The Intel PC is my workstation at the lab and is regularly running heavy jobs. I start them both with: cd ~/bin/boinc, then nohup ./run_client > test.log & and let them run until I need the CPU for something else. Nic out |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Well, on the Intel at work, most jobs seem to complete successfully. Of the ones which don't: 16th Nov: two jobs killed by signal 11; 26th Aug: two jobs killed by signal 11; 21st Aug: one signal 11 with exit status 139, the other just exit status 139; 8th May: two jobs killed by signal 11. Signal 11 is a segmentation fault, which I think is somewhat like the access violation in Windows. What was happening on that box at the time the crashes occurred? Do you have 'leave in memory' turned on or off? (The recommended setting is to have things stay in memory.) It may be worth reading through the project READMEs to see if there is anything useful (link in my signature). I should point out that a crash isn't a disaster: the climate data is uploaded to the CPDN server as the model progresses (in the trickle-ups). But since it is always more satisfying to complete the model yourself, some people take backups at intervals, and other people shut down BOINC before running anything major on the PC. It is also a good idea to shut down BOINC before shutting down the PC. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
The AMD errors look similar (signal 11, and error code 139, which I think is the same thing), but much more frequent. What was happening on the PC at the moment those crashes took place? Is there anything in the BOINC messages log (or stderr/stdout)? Is there any software in common between your home and work PCs? Something you run more frequently at home? How do you stop BOINC running when you need the machine for something else? If you use 'kill -9' or similar, I'd recommend using 'boinc_cmd --quit' instead. |
Joined: 6 Mar 05 Posts: 4 Credit: 7,782,147 RAC: 0 |
Hey Mark, Actually, I'm not sure what is happening when the 'client error' occurs. I really only check my CPDN numbers once a month or so, so the failure is usually days or weeks in the past, since CPDN has efficiently sent me a new work unit in the meantime. The two machines run pretty much the same software (Ubuntu dual boot with WinXP, my professional software). The one real difference is that the home computer is rebooted almost every evening to play WoW with the kids, while the lab computer goes weeks between reboots. I'll give the explicit BOINC quit command a try to see if it helps. Otherwise there is a lot of new research I can do in your 'README' collection. I'll poke around. Thanks for your help. Nic out |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
In the README about crashes and problems, I'd recommend a look at item #5 by MikeMars. It doesn't deal specifically with the type of error your models have suffered on both computers, but it does comprehensively list all the normal precautions. Cpdn news |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
It could be the reboots which are causing the problem, and boinc_cmd --quit should resolve that. I usually close BOINC manually before I shut down, both on Linux and Win32. |
Joined: 19 Sep 06 Posts: 1 Credit: 317,635 RAC: 0 |
Hello, I also have the same problem with CPDN: all 15 of my work units finished with "client error"; the longest-running one reached a total of 477,855.64 seconds. I use Windows Vista 32-bit, and I usually don't close the BOINC client manually. I also run Einstein@Home and malariacontrol, and in these two cases almost all the work units finished OK. Can you please help? Thanks, Carlos Almeida |
Joined: 9 Jan 07 Posts: 467 Credit: 14,549,176 RAC: 317 |
Ziggy, It is especially important to close BOINC manually in Vista, as Microsoft have reduced Vista's closedown time to such an extent that BOINC can't cope, and the bigger the BOINC task, the harder it is to close it down quickly, which is why climate models get hit hardest. Just exit BOINC (or suspend and exit) and you shouldn't get any Vista closedown crashes. Iain PS With Vista it's also best to install outside 'C:\Program Files' to stop Vista's User Account Control messing with BOINC. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
It's worthwhile reading the README posts (link in my signature); there are a couple of additional suggestions regarding Vista there as well as the above ones. |
Joined: 6 Mar 05 Posts: 4 Credit: 7,782,147 RAC: 0 |
I think I've found something that helps. I installed BOINC from the Ubuntu repositories (something like apt-get install boinc-client). That put an entry in /etc/init.d that explicitly starts BOINC when I boot Ubuntu. More importantly, it explicitly shuts BOINC down when I shut down the machine. This gives BOINC/CPDN the time and the commands to shut down gracefully during the process. I'm about halfway through the next model series, and hopeful I'll get a success this time. Nic out |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Good luck on the new model. If it's handy, could you post the shutdown script that Ubuntu uses to stop BOINC? (I'm guessing that it'll be calling boinc_cmd --quit.) |
Joined: 27 Jan 07 Posts: 300 Credit: 3,288,263 RAC: 26,370 |
#!/bin/bash ... killproc $BOINCEXE This is what it does. I've been running with those init scripts for a long time now with no problems. killproc waits for the BOINC daemon to exit before continuing. |
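
Two points from the thread can be sketched in shell: exit status 139 is what bash and dash report for a process killed by signal 11 (128 + 11), and a clean stop means asking the process to terminate and then waiting for it to exit, which is the behaviour described for killproc above. This is a minimal sketch, not the Ubuntu init script. The helper names describe_status and stop_and_wait are hypothetical, and the sketch assumes the target process shuts down cleanly when it receives SIGTERM.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical and not part of BOINC or Ubuntu.

# Describe a shell exit status. bash and dash report a process killed by
# signal N as status 128 + N, so 139 means signal 11 (SIGSEGV on Linux).
describe_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    elif [ "$status" -eq 0 ]; then
        echo "exited normally"
    else
        echo "exited with error code $status"
    fi
}

# Send SIGTERM to a PID, then poll until the process is gone.
#   $1 = PID, $2 = timeout in seconds (default 60)
# Returns 0 once the process has exited, 1 on timeout,
# 2 if the signal could not be sent (no such process, or no permission).
# Assumption: the target treats SIGTERM as a request for a clean shutdown.
stop_and_wait() {
    pid=$1
    remaining=${2:-60}
    kill -TERM "$pid" 2>/dev/null || return 2
    while kill -0 "$pid" 2>/dev/null; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# Example use (the PID is illustrative):
#   describe_status 139
#   stop_and_wait 12345 120 || echo "still running, not forcing it"
```

The timeout branch deliberately does not fall back to SIGKILL, since a forced kill is the kind of unclean shutdown the thread suspects of causing the errors. A long-running model may need a generous timeout to write its checkpoint.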