climateprediction.net home page
Error while computing???

Message boards : Number crunching : Error while computing???
Thund3rb1rd

Joined: 18 Jun 05
Posts: 24
Credit: 2,500,676
RAC: 0
Message 58430 - Posted: 19 Jul 2018, 10:11:30 UTC

My tasks keep dying. They get just so far - in some cases VERY far - then die off. This was happening even before the Situation.

Anyone have any ideas why?
ID: 58430
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58431 - Posted: 19 Jul 2018, 11:44:28 UTC - in response to Message 58430.  

The only obvious thing from your Tasks list is that you appear to be using the default setting for "Suspend when non-BOINC CPU usage is above".
This causes BOINC to keep stopping and starting the models, which they don't like.

They're from the UK Met Office, where they run on supercomputers, and are not coded to survive constant stopping and starting.

So setting this option to 100% will "turn it off", and allow the tasks to run continuously.
If you find this makes your computer sluggish, then reduce the number of tasks that run at the same time: change "Use at most 100% of the CPUs" to 50%.

See what these 2 changes do for that computer.

After that it may be down to something that you're using that computer for.
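
For anyone who prefers to set these two values in a file rather than through the BOINC Manager, here is a minimal sketch in Python that writes out a preferences override. It assumes the client reads a `global_prefs_override.xml` file from its data directory and that the tag names `suspend_cpu_usage` and `max_ncpus_pct` correspond to the two options above; check both against the BOINC documentation for your client version before relying on it. The function name and the default values are illustrative only.

```python
import xml.etree.ElementTree as ET


def build_override_xml(suspend_cpu_usage=100.0, max_ncpus_pct=50.0):
    """Return the text of a preferences override with the two values discussed above.

    Assumption: tag names follow BOINC's override file format; verify them
    against the documentation for your client version.
    """
    root = ET.Element("global_preferences")
    # "Suspend when non-BOINC CPU usage is above": 100 as suggested in the post
    ET.SubElement(root, "suspend_cpu_usage").text = f"{suspend_cpu_usage:.1f}"
    # "Use at most N% of the CPUs": 50 as suggested in the post
    ET.SubElement(root, "max_ncpus_pct").text = f"{max_ncpus_pct:.1f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML; save it as global_prefs_override.xml in the BOINC data
    # directory (location varies by platform) and have the client re-read it.
    print(build_override_xml())
```

The values mirror the advice in this post; if the machine does not feel sluggish, a higher CPU percentage may work just as well.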
ID: 58431
Thund3rb1rd

Joined: 18 Jun 05
Posts: 24
Credit: 2,500,676
RAC: 0
Message 58435 - Posted: 19 Jul 2018, 19:16:53 UTC - in response to Message 58431.  

First, thank you for the help. I appreciate your time.

I wasn't using the suspend option at all - at least, the box wasn't checked. I've activated it and set it to 100% per your comment. I wonder if not having a setting at all may have been the problem.

I was already using only 75% of the CPUs, but have set that to 50%.

Okay. We'll see what we see, I guess.

Again, thank you for your time.
ID: 58435
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58437 - Posted: 19 Jul 2018, 22:08:39 UTC - in response to Message 58435.  

75% may be OK too; it all depends on how sluggish the computer "feels".

I've got a quad core which is hyper-threaded, so I limit it to 50% and just use, hopefully, the "real" cores, leaving the others for housekeeping, etc.
ID: 58437
Thund3rb1rd

Joined: 18 Jun 05
Posts: 24
Credit: 2,500,676
RAC: 0
Message 58447 - Posted: 21 Jul 2018, 22:43:00 UTC

A post in a different thread reminded me of something - Since October/November 2017, virtually every task I've attempted has died with a computing error.

Before that, I had gone for at least a year with virtually every task running to completion - not all, of course, but surely more than 95%. I got the occasional time-out, and the occasional Error while Computing, but by and large, I had no real problems.

After that period, out of 33 tasks attempted, I've had only 3 run to completion, with 2 hung up in purgatory.

In looking over my task list, it's been a mixed bag as far as which of my three boxes had problems, but one thing is certain - the problems started just about a year ago regardless of which machine was running the task.

I spent too many years as a programmer to make changes willy-nilly with no evidence to back up the changes. Up until I examined the situation, I was of the opinion that my machine was messed up somehow. Now, I'm not so sure, particularly if others started having problems about that time.
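
As a rough check of the figures above, here is a small Python sketch of the before-and-after completion rates. It assumes the 2 tasks "in purgatory" are part of the 33 attempted and count as neither completed nor failed; that reading is an assumption, not something the task list confirms.

```python
def rate(count, attempted):
    """Fraction of attempted tasks that fall in a given category."""
    if attempted <= 0:
        raise ValueError("attempted must be positive")
    return count / attempted


attempted = 33   # tasks attempted since the problems began
completed = 3    # ran to completion
stuck = 2        # still unresolved ("in purgatory"); assumed to be among the 33
failed = attempted - completed - stuck

print(f"completed: {rate(completed, attempted):.1%}")  # 9.1%
print(f"failed:    {rate(failed, attempted):.1%}")     # 84.8%
```

Against a previous completion rate of more than 95%, a drop to roughly 9% is far too large to be ordinary variation.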
ID: 58447
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58448 - Posted: 21 Jul 2018, 23:38:56 UTC - in response to Message 58447.  

That would make it the wah2 models.
Perhaps they're more susceptible to the stop/start business.
ID: 58448
JIM

Joined: 31 Dec 07
Posts: 1152
Credit: 22,053,321
RAC: 4,417
Message 58449 - Posted: 22 Jul 2018, 2:51:06 UTC

There is something that I have been wondering about for some time. We know that CPDN does not like being started and stopped often. In fact, the WUs tend to crash. We tweak the settings to prevent repeated stops and starts due to CPU usage.

But if you run more than one project on the same machine and one of them is CPDN, they are being started and stopped frequently. The default setting is to switch between projects every 120 minutes. Wouldn’t this tend to increase the failure rate? Is it safe to run 2 or more projects alongside CPDN?
ID: 58449
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58450 - Posted: 22 Jul 2018, 5:50:04 UTC - in response to Message 58449.  

Yes, I think that would happen with multiple projects.
But perhaps the longer time between switches lessens the chance that the program "gets caught at a bad time".

I don't know where the sensitivity is, but one place may be where the calcs are paused so that the program can exchange data across the "cell" boundaries.

The cells for the "area of interest" are the smallest (their size is the number at the end of the abbreviated name, such as the current nam50), then come the cells for the rest of the globe, then the ocean cells. (These latter change quite slowly.)

I know that lots of people do run a mix of projects with cpdn, but I've never been interested in how successful this is.
ID: 58450
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 58451 - Posted: 22 Jul 2018, 5:53:13 UTC - in response to Message 58430.  

My tasks keep dying. They get just so far - in some cases VERY far - then die off. This was happening even before the Situation.


I have received essentially no work units in what seems to be a year. I run Linux, so that accounts for this.

But if I remember correctly, I had no trouble running up to four work units at a time on a 4-core processor or, before that, on a 2 hyper-threaded Xeon processor machine. And I had no such problem. I believe the secret was to have the "Leave non-GPU tasks in memory while suspended" option checked.
ID: 58451
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58454 - Posted: 22 Jul 2018, 19:38:32 UTC

It's been so long since I set my options that I'd forgotten about "leave in memory", which is also important.

On another note, Jean, have a read of this post:
How to install Wine in Linux Mint and Ubuntu

There's a couple of tricky spots, but it gets the latest version, which is currently 3.0.something.
ID: 58454
Dave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 58457 - Posted: 23 Jul 2018, 6:17:17 UTC - in response to Message 58454.  

Like Les, I haven't paid particular attention to any other projects apart from CPDN. However, I do run WCG when no work is available for CPDN and haven't noticed any increase in the error rate if I leave the WCG tasks running when I do get Climate Prediction work. I have on occasion, with the same thoughts in mind, increased the switching time, but not for a number of years.
ID: 58457
Thund3rb1rd

Joined: 18 Jun 05
Posts: 24
Credit: 2,500,676
RAC: 0
Message 58462 - Posted: 23 Jul 2018, 17:44:51 UTC

Several new posts to this thread have postulated various causes for CPDN to error out. I've checked each of my machines and have found nothing that stands out as a smoking gun.

I've been running CPDN since 2005 together with as many as 10 additional BOINC projects on various versions of Wintel machines, and up until last fall had no problems with any of them, apart from the occasional hiccups which go with any BOINC project. For the most part, they have all played nicely together for more than a decade. As I remarked below, I spent too many decades as a programmer to make system changes without investigating them thoroughly beforehand.

Right now, I'm running eight BOINC projects - including CPDN - and the only project I'm currently having issues with is CPDN, and I didn't start having those problems until last fall.

I do not make configuration changes to BOINC without good reason. However it wants to install itself is fine with me. In all the years I've been running BOINC projects, the only global change I've ever made to all of my machines is to disable GPU use. This was the case before last fall, and is true today.

I've carefully investigated each of the clues and suggestions found in previous threads and come up with zilch. I'm not denying the fault MAY lie with my machine, but if that's the case, it's with all three machines and that isn't reasonable.

The theory that the problem is caused by constant switching back and forth from CPDN to other projects doesn't hold up in my experience with multiple projects.
ID: 58462
Alan K

Joined: 22 Feb 06
Posts: 484
Credit: 29,579,234
RAC: 4,572
Message 58487 - Posted: 29 Jul 2018, 22:03:17 UTC

Had 3 batch 738 models fail with this error:

<core_client_version>7.12.1</core_client_version>
<![CDATA[
<message>
The device does not recognize the command.
(0x16) - exit code 22 (0x16)</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...

Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048

Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048

Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048
Suspended CPDN Monitor - Suspend request from BOINC...

Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048
Suspended CPDN Monitor - Suspend request from BOINC...

Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048
Suspended CPDN Monitor - Suspend request from BOINC...

Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048
Suspended CPDN Monitor - Suspend request from BOINC...
Sorry, too many model crashes! :-(
00:33:07 (6124): called boinc_finish(22)

</stderr_txt>
]]>
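
To compare several failed tasks quickly, a short Python sketch like the following could tally the suspend requests and model crashes in a pasted stderr log. The function name and the sample text are illustrative; the line patterns are taken only from the output above, so other model types may word their messages differently.

```python
import re

# Patterns taken from the stderr output quoted above.
CRASH_RE = re.compile(r"^Model crashed:\s*(?P<reason>.+)$")
SUSPEND_MARKER = "Suspend request from BOINC"


def summarize_stderr(text):
    """Count suspend requests and model crashes in a task's stderr text."""
    suspends = 0
    reasons = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = CRASH_RE.match(line)
        if match:
            reasons.append(match.group("reason"))
        elif SUSPEND_MARKER in line:
            suspends += 1
    return {
        "suspends": suspends,
        "crashes": len(reasons),
        "distinct_reasons": sorted(set(reasons)),
    }


# Shortened excerpt of the log above, used here only as sample input.
SAMPLE = """\
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...

Model crashed: INANCILA:integer header error tmp/pipe_dummy 2048
Suspended CPDN Monitor - Suspend request from BOINC...
Sorry, too many model crashes! :-(
"""

if __name__ == "__main__":
    print(summarize_stderr(SAMPLE))
```

If every crash in a batch shares the same reason string, as it does here, that points at the batch rather than at any one machine.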
ID: 58487
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 58488 - Posted: 30 Jul 2018, 7:27:21 UTC - in response to Message 58487.  

Had 3 batch 738 models fail with this error:


738 is one of the test batches that would, if it were running, have gone to the testing site. At least the one I looked at also failed on its two other attempts.

I am afraid that this sort of thing is likely until the testing site is back up and running.
ID: 58488
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 58489 - Posted: 30 Jul 2018, 8:24:51 UTC - in response to Message 58488.  

I have just had it confirmed that there was an issue with this batch, and if any are still out there they can be aborted. Further resends of any that haven't already failed three times will be stopped.
ID: 58489
Jim1348

Joined: 15 Jan 06
Posts: 637
Credit: 26,751,529
RAC: 653
Message 58490 - Posted: 30 Jul 2018, 14:28:57 UTC - in response to Message 58462.  

The theory that the problem is caused by constant switching back and forth from CPDN to other projects doesn't hold up in my experience with multiple projects.

But your experience thus far seems to be that you do have problems with multiple projects. Have you tried running CPDN by itself?
ID: 58490
flashawk

Joined: 29 Jun 12
Posts: 31
Credit: 1,438,478
RAC: 0
Message 58493 - Posted: 1 Aug 2018, 5:04:37 UTC - in response to Message 58490.  

All the new WU's are failing - wah2_sam25. These are the new ones that were just released, computation error within 3 minutes of starting.
ID: 58493
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 58494 - Posted: 1 Aug 2018, 5:30:09 UTC - in response to Message 58493.  

All the new WU's are failing - wah2_sam25. These are the new ones that were just released, computation error within 3 minutes of starting.


Checked and certainly a lot are failing. None of those I have been able to find so far have been out long enough to fail on a second machine but I found enough to justify informing the project. I imagine this will be seen in about 2-2.5 hours time.
ID: 58494
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,377,675
RAC: 3,657
Message 58495 - Posted: 1 Aug 2018, 6:53:04 UTC - in response to Message 58494.  

At least one has failed twice with the segfault error, though not quite conclusive as one of the two had above average failure rates and the other was a brand new machine with only 6 tasks listed.

From Sihan at the project,

Thanks. I don't see what the problem is right away, will do some investigation.
ID: 58495
flashawk

Joined: 29 Jun 12
Posts: 31
Credit: 1,438,478
RAC: 0
Message 58501 - Posted: 1 Aug 2018, 11:59:51 UTC - in response to Message 58495.  

I'm still running 8 WU's from the previous batch without any issues. All of the new ones failed within 3.5 minutes of being started. I'm downloading 11 more right now.
ID: 58501

©2024 climateprediction.net