Computation Errors

Author	Message
GlobalWarring Joined: 15 Feb 09 Posts: 2 Credit: 4,875,146 RAC: 2,534	Message 48458 - Posted: 19 Mar 2014, 17:46:25 UTC Hi All, I have recently switched two machines from SETI@home to ClimatePrediction. My main machine (Computer number 1319270) is using 5 cores to crunch HADAM3P and all results are being returned with a 'Computation Error'. Am I using too many cores? I used to use 5 cores plus the GPU for SETI with no issues. Help please. Thanks in advance.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,499,590 RAC: 5,672	Message 48459 - Posted: 19 Mar 2014, 18:55:43 UTC - in response to Message 48458. If you look at the individual tasks by going to your account > Computers on this account > Tasks, then click on the + sign just under stderr, you will see "Invalid global lon coordinates: 359.780000" as one of the lines. I think this indicates that it is a problem with that particular model. A number of the latest batch seem to be affected, but not all. Other problems being discussed in this thread
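For anyone checking several failed tasks, here is a minimal sketch of the same check done on stderr text that has been copied out of the task page. Only the "Invalid global lon coordinates" string comes from this thread; the function name, the signature list and the sample text are made up for illustration.

```python
# Signature strings to look for. Only the first entry is taken from this
# thread; extend the list with whatever messages you actually see.
KNOWN_SIGNATURES = ["Invalid global lon coordinates"]


def find_signature_lines(stderr_text, signatures=KNOWN_SIGNATURES):
    """Return the stderr lines that contain any of the signature strings."""
    hits = []
    for line in stderr_text.splitlines():
        if any(sig in line for sig in signatures):
            hits.append(line.strip())
    return hits


# Illustrative sample: the middle line is the one quoted above, the
# surrounding lines are placeholders, not real CPDN output.
sample = (
    "placeholder line before\n"
    "Invalid global lon coordinates: 359.780000\n"
    "placeholder line after\n"
)

for hit in find_signature_lines(sample):
    print(hit)
```

If the list comes back empty for a failed task, the failure is probably something other than this batch problem.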

GlobalWarring Joined: 15 Feb 09 Posts: 2 Credit: 4,875,146 RAC: 2,534	Message 48460 - Posted: 19 Mar 2014, 19:16:24 UTC - in response to Message 48459. Thanks Dave, encouraging that it's not a machine issue. Regards David

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,499,590 RAC: 5,672	Message 48492 - Posted: 23 Mar 2014, 8:59:03 UTC Interestingly, I just had a model fail following a shutdown of BOINC after suspending computation prior to a system restart after some updates. I thought I had waited long enough for all files to be written to disk. I think this is the second time recently (past two or three months) this has happened with eu models. I can only remember having had it happen with the full resolution ocean models in the past. I shall pay attention to see if things have gotten more sensitive recently.

Lockleys Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0	Message 48493 - Posted: 23 Mar 2014, 12:15:42 UTC - in response to Message 48492. Dave Jackson wrote: Interestingly I just had a model fail following a shutdown of boinc after suspending computation prior to a system restart after some updates. I thought I had waited long enough for all files to be written to disk. I think this is the second time recently (past two or three months) this has happened with eu models. I can only remember having had it happen with the full resolution ocean models in the past. That has happened to me in the recent past too. Happily, when I shut down BOINC in such circumstances, I still back up the data files. (Old habits die hard.) And I found that I was able to run satisfactorily by restoring the backup. Might just be luck.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,499,590 RAC: 5,672	Message 48692 - Posted: 3 Apr 2014, 8:49:13 UTC Lockleys wrote: Might just be luck. Tried restarting from backup and got the same result with another of the recent eu models. I think I will suspend unstarted models and only do the reboots when no tasks are running. On the regional models, I haven't had any of them fail due to using hibernate, which I use every evening. I am wondering if there is something specific about a kernel update that causes them to fall over?

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 48693 - Posted: 3 Apr 2014, 9:32:18 UTC Last modified: 3 Apr 2014, 9:32:33 UTC We should only ever reboot when no CPDN models are running. In general, rebooting with BOINC and other projects' tasks running is not a problem but CPDN models are an exception to this. Before shutting down, rebooting or restarting it's a good idea to suspend all tasks in the BOINC Manager Activity tab then exit completely from BOINC. To do this, in the BOINC Manager File menu select Exit, or else right-click on the BOINC icon then select Exit. Pressing the X button of BOINC Manager is not enough. The X button doesn't suspend the models. The models do not enjoy being closed down by an OS which may be too quick and catch them at a critical moment in their calculations. If we don't exit from running models and BOINC when rebooting, sooner or later we are likely to crash the occasional model. As far as I know this has always been the case and is not a recent phenomenon. Exiting from model computation + BOINC only takes a moment and helps protect our considerable investment in computer time and electricity.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,499,590 RAC: 5,672	Message 48695 - Posted: 3 Apr 2014, 9:47:13 UTC - in response to Message 48693. Perhaps I wasn't clear. With both of these models I suspended the models individually as well as suspending activity from the activity menu. I then gave BOINC several minutes before I closed it down via file exit and again waited several minutes before rebooting. That is why I was wondering if there was something specific about having done a kernel update that encourages models to crash.

Iain Inglis Volunteer moderator Joined: 16 Jan 10 Posts: 1081 Credit: 6,981,170 RAC: 3,836	Message 48697 - Posted: 3 Apr 2014, 10:31:47 UTC - in response to Message 48695. Dave: The only odd thing I could see about the failed models that weren't server-side errors such as REPLANCA was that they had a lot of suspend entries in the stderr log. It might be worth setting the suspend threshold to zero (to turn it off). It's easy to forget to do that on a new install, since the default BOINC setting tries to make BOINC a good sharer of computing resources, which doesn't necessarily work too well for big models like CPDN. As you say it's usually the HADCM3N models that have the problem, but unless you know that computer will have a problem then stopping the suspends is better.

Dave Jackson Volunteer moderator Joined: 15 May 09 Posts: 4342 Credit: 16,499,590 RAC: 5,672	Message 48699 - Posted: 3 Apr 2014, 11:39:59 UTC - in response to Message 48697. Thanks Iain, done that. I will still wait until no models are running before suspending, leaving BOINC and restarting when updates require it because of a kernel upgrade, as that seems to be the only time there is a problem with the eu tasks. My intuition may be completely off the wall but it will be interesting to see if anyone else shares my experience.

Eirik Redd Joined: 31 Aug 04 Posts: 391 Credit: 219,888,554 RAC: 1,481,373	Message 48706 - Posted: 4 Apr 2014, 9:29:16 UTC - in response to Message 48699. Here is what works for me on Linux (Ubuntu and Debian). I've done a few kernel upgrades over the last few months with no problems, with eu, anz and cm3n running. Fire up boincmgr. Suspend all running tasks. Wait a few seconds. Go to the command line and run sync twice. Back to boincmgr: suspend network, suspend project. Back to the command line and run sync twice. Back to boincmgr: File, Exit, and choose to stop tasks in the exit dialog. Run sync twice. Run sudo service boinc-client stop. Run sync twice. If boinc-client stop takes more than one second, or displays anything, there's a problem, but no way to fix it. Do a backup. Install the new kernel. Boot. Works for me.
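The sequence above can be sketched as a script. Note the substitution: this uses boinccmd (the command-line tool shipped with the BOINC client) in place of the boincmgr clicks, which is an assumption and not what the post describes. The function names are hypothetical, and boinccmd may need to be run from the BOINC data directory or given the RPC password on your install. It defaults to a dry run that only prints the commands.

```python
import subprocess


def shutdown_plan(use_sudo=True):
    """Build the ordered command list for a cautious BOINC shutdown."""
    stop = ["service", "boinc-client", "stop"]
    if use_sudo:
        stop = ["sudo"] + stop
    return [
        ["boinccmd", "--set_run_mode", "never"],      # suspend computation
        ["sync"],                                     # flush buffers to disk
        ["boinccmd", "--set_network_mode", "never"],  # suspend network activity
        ["sync"],
        stop,                                         # stop the client service
        ["sync"],
    ]


def run_plan(plan, dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the sequence at the first failing command
            subprocess.run(cmd, check=True)


# Dry run: shows what would be executed without touching BOINC.
run_plan(shutdown_plan(), dry_run=True)
```

Taking the backup and installing the kernel are left as manual steps, since they depend on the distribution and on where the BOINC data directory lives.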

ritterm Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0	Message 48723 - Posted: 7 Apr 2014, 0:06:48 UTC Any idea what might have happened to these two: Task 16395586 and Task 16395660? I see "no heartbeat", "quit request" and "CPDN process is not running" messages in the stderr output. As best I can tell, these models had been running until higher priority WUs with another project needed time. I can't be sure, but it might be that they errored out after they resumed running. Adding insult to injury, Task 16395660 had no recorded trickles after about 16 hours of work... :-(

mo.v Volunteer moderator Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0	Message 48724 - Posted: 7 Apr 2014, 1:15:55 UTC Unfortunately you can't at the moment compare how that second task fares on another computer because it was sent to a computer that crashes everything it receives. Both of those models were on the same computer and reported at the same time. Did they crash at the same moment? The reason why one model received credit but not the other would generally be because one got stuck in a loop. That used to be a problem with certain model types but AFAIK in these regional models looping is rare or unknown. I don't mean impossible.

Ananas Volunteer moderator Joined: 31 Oct 04 Posts: 336 Credit: 3,316,482 RAC: 0	Message 48725 - Posted: 7 Apr 2014, 7:18:26 UTC - in response to Message 48723. Last modified: 7 Apr 2014, 7:37:43 UTC The only thing I can see is that the models crashed when 16395660 must have been just about to trickle / compose the first upload. (Trickle times within the same model type are usually quite constant: the other one returned its first upload after 60,837, and this one crashed after 59,819.36, close enough to assume there's a connection between trickle and crash.) Do you allow enough HDD space in your global settings? And have you excluded all CPDN stuff from virus scans (scanners can scan inside ZIP files, which might disturb the ZIP process)?
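To put a number on "close enough", here is a small sketch of the comparison. It assumes both figures are elapsed times in the same unit (the post does not state the unit), and the function name is made up for illustration.

```python
def trickle_gap(first_upload_time, crash_time):
    """Return the absolute gap and the gap as a fraction of the upload time."""
    gap = first_upload_time - crash_time
    return gap, gap / first_upload_time


# Figures quoted above: first upload of the other task, crash of this one.
gap, fraction = trickle_gap(60837.0, 59819.36)
print(f"gap = {gap:.2f}, i.e. {fraction:.1%} of the reference time")
```

The crash came about 1,018 units before the point where the other task returned its first upload, a difference of under 2%, which fits the idea that the model failed while preparing its first trickle.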

ritterm Joined: 29 May 08 Posts: 128 Credit: 6,289,876 RAC: 0	Message 48727 - Posted: 7 Apr 2014, 15:27:17 UTC - in response to Message 48725. Last modified: 7 Apr 2014, 15:27:33 UTC Ananas wrote: Do you allow enough HDD space in your global settings? And excluded all CPDN stuff from virus scans (scanners can scan inside ZIP files, might disturb the ZIP process)? Thanks for the feedback, Ananas. I allow BOINC to use 100GB on my HDD and have excluded all BOINC-related directories and files from my AV software.