Computation Errors

Message boards : Number crunching : Computation Errors

GlobalWarring
Joined: 15 Feb 09
Posts: 2
Credit: 2,077,912
RAC: 2,927
Message 48458 - Posted: 19 Mar 2014, 17:46:25 UTC

Hi All,

I have recently switched two machines from SETI@home to ClimatePrediction.

My main machine (computer number 1319270) is using 5 cores to crunch HADAM3P, and all results are being returned with a 'Computation Error'.

Am I using too many cores? I used to run 5 cores plus the GPU for SETI with no issues.

Help please.

Thanks in advance
ID: 48458

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2464
Credit: 3,124,201
RAC: 383
Message 48459 - Posted: 19 Mar 2014, 18:55:43 UTC - in response to Message 48458.  

If you look at the individual tasks by going to your account > Computers on this account > Tasks and then click the + sign just under stderr, you will see "Invalid global lon coordinates: 359.780000" as one of the lines. I think this indicates a problem with that particular model. A number of the latest batch seem to be affected, but not all. Other problems are being discussed in this thread.
ID: 48459

GlobalWarring
Joined: 15 Feb 09
Posts: 2
Credit: 2,077,912
RAC: 2,927
Message 48460 - Posted: 19 Mar 2014, 19:16:24 UTC - in response to Message 48459.  

Thanks Dave, encouraging that it's not a machine issue.

Regards

David
ID: 48460

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2464
Credit: 3,124,201
RAC: 383
Message 48492 - Posted: 23 Mar 2014, 8:59:03 UTC

Interestingly, I just had a model fail following a shutdown of BOINC. I had suspended computation before a system restart that was needed after some updates, and I thought I had waited long enough for all files to be written to disk. I think this is the second time recently (in the past two or three months) that this has happened with eu models. Before that, I can only remember it happening with the full-resolution ocean models.

I shall pay attention to see if things have gotten more sensitive recently.
ID: 48492

Lockleys
Joined: 13 Jan 07
Posts: 187
Credit: 9,677,946
RAC: 71
Message 48493 - Posted: 23 Mar 2014, 12:15:42 UTC - in response to Message 48492.  

Dave Jackson wrote:
Interestingly I just had a model fail following a shutdown of BOINC after suspending computation prior to a system restart after some updates. I thought I had waited long enough for all files to be written to disk. I think this is the second time recently (past two or three months) this has happened with eu models. I can only remember having had it happen with the full resolution ocean models in the past.

That has happened to me in the recent past too. Happily, when I shut down BOINC in such circumstances, I still back up the data files (old habits die hard), and I found that I was able to run satisfactorily by restoring the backup. Might just be luck.
ID: 48493

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2464
Credit: 3,124,201
RAC: 383
Message 48692 - Posted: 3 Apr 2014, 8:49:13 UTC

Lockleys wrote:
Might just be luck.

Tried restarting from backup and got the same result with another of the recent eu models. I think I will suspend unstarted models and only do the reboots when no tasks are running. On the regional models, I haven't had any of them fail due to using hibernate which I use every evening.

I am wondering if there is something specific about a kernel update that causes them to fall over?
ID: 48692

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 13,363,707
RAC: 28
Message 48693 - Posted: 3 Apr 2014, 9:32:18 UTC
Last modified: 3 Apr 2014, 9:32:33 UTC

We should only ever reboot when no CPDN models are running. In general, rebooting with BOINC and other projects' tasks running is not a problem but CPDN models are an exception to this. Before shutting down, rebooting or restarting it's a good idea to suspend all tasks in the BOINC Manager Activity tab then exit completely from BOINC. To do this, in the BOINC Manager File menu select Exit, or else right-click on the BOINC icon then select Exit.

Pressing the X button of BOINC Manager is not enough. The X button doesn't suspend the models.

The models do not enjoy being closed down by an OS which may be too quick and catch them at a critical moment in their calculations.

If we don't exit from running models and BOINC when rebooting, sooner or later we are likely to crash the occasional model. As far as I know this has always been the case and is not a recent phenomenon.

Exiting from model computation + BOINC only takes a moment and helps protect our considerable investment in computer time and electricity.
Cpdn news
ID: 48693

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2464
Credit: 3,124,201
RAC: 383
Message 48695 - Posted: 3 Apr 2014, 9:47:13 UTC - in response to Message 48693.  

Perhaps I wasn't clear.

With both of these models I suspended the models individually as well as suspending activity from the activity menu. I then gave BOINC several minutes before I closed it down via file exit and again waited several minutes before rebooting. That is why I was wondering if there was something specific about having done a kernel update that encourages models to crash.
ID: 48695

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 979
Credit: 3,103,609
RAC: 95
Message 48697 - Posted: 3 Apr 2014, 10:31:47 UTC - in response to Message 48695.  

Dave:

The only odd thing I could see about the failed models that weren't server-side errors such as REPLANCA was that they had a lot of suspend entries in the stderr log. It might be worth setting the suspend threshold to zero (to turn it off). It's easy to forget to do that on a new install, since the default BOINC setting tries to make BOINC a good sharer of computing resources, which doesn't necessarily work too well for big models like CPDN. As you say it's usually the HADCM3N models that have the problem, but unless you know that computer will have a problem then stopping the suspends is better.
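If the threshold meant here is BOINC's "suspend when non-BOINC CPU usage is above X%" preference, it can also be set locally through global_prefs_override.xml. The sketch below is only an illustration under that assumption: it builds or updates the override XML with suspend_cpu_usage set to 0, which BOINC treats as no restriction. The function name is made up, and the result still has to be saved into the BOINC data directory by hand.

```python
# Sketch: produce global_prefs_override.xml content with the CPU-usage
# suspend threshold turned off. Assumes the preference tag is
# <suspend_cpu_usage> inside <global_preferences>; the function name
# is hypothetical.
import xml.etree.ElementTree as ET


def override_with_no_cpu_suspend(existing_xml=None):
    """Return override XML text with suspend_cpu_usage set to 0.

    If existing_xml is given, its other settings are preserved.
    """
    if existing_xml:
        root = ET.fromstring(existing_xml)
    else:
        root = ET.Element("global_preferences")
    node = root.find("suspend_cpu_usage")
    if node is None:
        node = ET.SubElement(root, "suspend_cpu_usage")
    node.text = "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(override_with_no_cpu_suspend())
```

After saving the output as global_prefs_override.xml, the client should pick it up on restart or after running boinccmd --read_global_prefs_override.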
ID: 48697

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2464
Credit: 3,124,201
RAC: 383
Message 48699 - Posted: 3 Apr 2014, 11:39:59 UTC - in response to Message 48697.  

Thanks Iain,

Done that. I will still wait until no models are running before suspending, exiting BOINC and restarting when updates require it because of a kernel upgrade, as that seems to be the only time there is a problem with the eu tasks.

My intuition may be completely off the wall, but it will be interesting to see if anyone else shares my experience.
ID: 48699

Eirik Redd
Joined: 31 Aug 04
Posts: 364
Credit: 114,089,091
RAC: 1,509
Message 48706 - Posted: 4 Apr 2014, 9:29:16 UTC - in response to Message 48699.  

Here is what works for me on Linux (Ubuntu and Debian). I've done a few kernel upgrades over the last few months with no problems, with eu, anz and cm3n models running.

Fire up boincmgr. Suspend all running tasks. Wait a few seconds. Go to the command line and run sync twice. Back in boincmgr, suspend network and suspend the project. Back at the command line, run sync twice again. Back in boincmgr, choose File, Exit, and choose to stop tasks in the exit dialog. Run sync twice. Run sudo service boinc-client stop. Run sync twice. If boinc-client stop takes more than one second, or displays anything, there's a problem, but no way to fix it. Do a backup. Install the new kernel. Boot.

Works for me.
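
For anyone who would rather script the sequence above than click through boincmgr, here is a minimal Python sketch. It is not the exact procedure described: it swaps the per-task and per-project suspends for the global run mode and network mode, and it assumes boinccmd is on the PATH and allowed to talk to the client, and that the client runs as the boinc-client service. The function names shutdown_plan and run_plan are made up for this example.

```python
# Sketch of the pre-reboot shutdown sequence as an ordered command list.
# Assumptions: boinccmd is on PATH and authorised, and the client runs
# as the "boinc-client" service (Debian/Ubuntu style).
import subprocess


def shutdown_plan():
    """Return the commands to run, in order, before a kernel upgrade."""
    steps = [
        ["boinccmd", "--set_run_mode", "never"],      # suspend computation
        ["boinccmd", "--set_network_mode", "never"],  # suspend network
        ["boinccmd", "--quit"],                       # ask the client to exit
        ["sudo", "service", "boinc-client", "stop"],  # make sure the service is down
    ]
    plan = []
    for step in steps:
        plan.append(step)
        plan.extend([["sync"], ["sync"]])  # flush to disk twice after each step
    return plan


def run_plan(plan, runner=subprocess.run, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in plan:
        print(" ".join(cmd))
        if not dry_run:
            runner(cmd, check=True)


if __name__ == "__main__":
    run_plan(shutdown_plan())  # dry run: prints the commands only
```

By default this only prints the commands. A run mode set to never stays that way across the reboot, so once the new kernel is up it has to be put back with boinccmd --set_run_mode auto, and likewise for the network mode.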

ID: 48706

ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 48723 - Posted: 7 Apr 2014, 0:06:48 UTC

Any idea what might have happened to these two:

Task 16395586
Task 16395660

I see "no heartbeat", "quit request" and "CPDN process is not running" messages in the stderr output. As best I can tell, these models had been running until higher-priority WUs from another project needed time. I can't be sure, but they may have errored out after they resumed running.

Adding insult to injury, Task 16395660 had no recorded trickles after about 16 hours of work... :-(
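
The messages quoted in this thread can also be searched for mechanically once the stderr text of a task has been copied out of the task page. The sketch below counts how many lines mention each one. The signature strings are the ones quoted in the thread; the function name and the sample text are made up for illustration, and real stderr lines may be worded differently.

```python
# Sketch: count known failure signatures in a task's stderr text.
# The signature strings are the ones quoted in this thread; the sample
# text and the function name are hypothetical.

SIGNATURES = (
    "Invalid global lon coordinates",
    "no heartbeat",
    "quit request",
    "CPDN process is not running",
)


def count_signatures(stderr_text):
    """Return a dict mapping each signature to the number of lines
    that contain it (case-insensitive substring match)."""
    counts = {sig: 0 for sig in SIGNATURES}
    for line in stderr_text.splitlines():
        lowered = line.lower()
        for sig in SIGNATURES:
            if sig.lower() in lowered:
                counts[sig] += 1
    return counts


if __name__ == "__main__":
    sample = "\n".join([
        "Invalid global lon coordinates: 359.780000",
        "some unrelated line",
        "No heartbeat",
    ])
    for sig, n in count_signatures(sample).items():
        print(f"{sig}: {n}")
```

A count of zero for every signature does not mean the task was healthy, only that none of these particular messages appeared.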
ID: 48723

mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 13,363,707
RAC: 28
Message 48724 - Posted: 7 Apr 2014, 1:15:55 UTC

Unfortunately you can't at the moment compare how that second task fares on another computer because it was sent to a computer that crashes everything it receives.

Both of those models were on the same computer and reported at the same time. Did they crash at the same moment?

The reason why one computer received credit but not the other would generally be because one got stuck in a loop. That used to be a problem with certain model types but AFAIK in these regional models looping is rare or unknown. I don't mean impossible.
Cpdn news
ID: 48724

Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 48725 - Posted: 7 Apr 2014, 7:18:26 UTC - in response to Message 48723.  
Last modified: 7 Apr 2014, 7:37:43 UTC

The only thing I can see is that the models crashed when 16395660 must have been just about to trickle / compose the first upload. Trickle times within the same model type are usually quite constant: the other one returned its first upload after 60,837, and this one crashed after 59,819.36, which is close enough to assume there's a connection between the trickle and the crash.

Do you allow enough HDD space in your global settings? And have you excluded all CPDN stuff from virus scans? Scanners can scan inside ZIP files, which might disturb the ZIP process.
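
To put a number on "close enough", here is the arithmetic on the two figures quoted above. The units are whatever the task pages report (presumably seconds of CPU time, which is an assumption), and the variable names are made up.

```python
# Sketch: how far apart are the crash time and the other task's
# first-upload time? Both figures are the ones quoted in the post.
first_upload_other = 60_837.0  # other task: time of its first upload
crash_time = 59_819.36         # task 16395660: time of the crash

gap = first_upload_other - crash_time
relative_gap = gap / first_upload_other

print(f"gap = {gap:.2f}, relative = {relative_gap:.2%}")
# prints: gap = 1017.64, relative = 1.67%
```

So the crash came within about 2% of the point where the other task produced its first upload, which fits the idea that it happened around the first trickle.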
ID: 48725

ritterm
Joined: 29 May 08
Posts: 128
Credit: 6,289,876
RAC: 0
Message 48727 - Posted: 7 Apr 2014, 15:27:17 UTC - in response to Message 48725.  
Last modified: 7 Apr 2014, 15:27:33 UTC

Ananas wrote:
Do you allow enough HDD space in your global settings? And excluded all CPDN stuff from virus scans (scanners can scan inside ZIP files, might disturb the ZIP process)?

Thanks for the feedback, Ananas. I allow BOINC to use 100GB on my HDD and have excluded all BOINC-related directories and files from my AV software.
ID: 48727

©2019 climateprediction.net