climateprediction.net home page
New work Discussion

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58498 - Posted: 1 Aug 2018, 9:56:27 UTC

:)

Just about to say that I have 7 that have just gone past 1 hour.
2 of these have failed before, both on AMD computers.
These are series s8, the other 5 are series sa.

So, luck of the draw.
ID: 58498 · Report as offensive
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,376,018
RAC: 3,616
Message 58499 - Posted: 1 Aug 2018, 10:51:25 UTC - in response to Message 58498.  
Last modified: 1 Aug 2018, 10:52:17 UTC

A little digging found that, out of a little over 30 failures, just over half were on AMD machines. If the sample size is big enough to be significant, that would imply to me a higher failure rate for AMD.

Edit: Got bored after finding that many, now getting back to work.
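
A rough way to check whether a count like that is significant is an exact binomial tail probability. The sketch below is only illustrative: the figures 32 and 17 are hypothetical stand-ins for "a little over 30" and "just over half", and the AMD share of all hosts (0.35) is an assumption, since the thread does not give it. The comparison only means something once the real share is known.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical numbers, not taken from the project's database.
failures = 32              # stand-in for "a little over 30 failures"
amd_failures = 17          # stand-in for "just over half were AMD"
amd_share_of_hosts = 0.35  # assumed fraction of hosts that are AMD

# Probability of seeing at least this many AMD failures if AMD hosts
# failed at the same rate as everyone else.
p_value = binomial_upper_tail(amd_failures, failures, amd_share_of_hosts)
print(f"P(at least {amd_failures} of {failures} on AMD by chance) = {p_value:.4f}")
```

If AMD machines made up about half of the hosts running these tasks, "just over half" of the failures would be roughly what chance alone predicts.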
ID: 58499 · Report as offensive
Lockleys
Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 58500 - Posted: 1 Aug 2018, 11:31:09 UTC

I have a Batch 742 model that's just coming up to 3 hours of processing. Early days, I know, but well past the initial fail zone.
ID: 58500 · Report as offensive
Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,229,255
RAC: 3,258
Message 58503 - Posted: 1 Aug 2018, 13:23:57 UTC

I have 12 workunits. All processing fine without any problems.

Keep cooling down your hardware (35° Celsius in Germany) and have a nice day,

Bonsai911
ID: 58503 · Report as offensive
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,376,018
RAC: 3,616
Message 58505 - Posted: 1 Aug 2018, 13:46:35 UTC - in response to Message 58503.  

35° Celsius in Germany


I am only running three out of four cores on my laptop as it has been so warm here in the UK. I wouldn't be at all surprised if high temperatures, in Europe at least, are contributing to the current high failure rate, though I haven't seen suggestions that it is widespread apart from the latest batch.
ID: 58505 · Report as offensive
Sardis73
Joined: 1 Apr 12
Posts: 3
Credit: 13,763,432
RAC: 4,782
Message 58507 - Posted: 1 Aug 2018, 13:50:44 UTC - in response to Message 58499.  

I have about 17 or 18 failures in a row. All are 742. They fail in less than 3 minutes.
ID: 58507 · Report as offensive
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58509 - Posted: 1 Aug 2018, 15:05:00 UTC - in response to Message 58507.  

Sardis73

All of yours that I looked at had: Suspended CPDN Monitor - Suspend request from BOINC..., which usually indicates that you're using the default setting for the option Suspend when non-BOINC CPU usage is above.

It may be that these models are more sensitive than most to being interrupted at a crucial moment that's about 3 minutes into the running.

Try setting it to 100% to turn it off.
ID: 58509 · Report as offensive
Alan K
Joined: 22 Feb 06
Posts: 484
Credit: 29,579,234
RAC: 4,572
Message 58512 - Posted: 1 Aug 2018, 15:56:39 UTC - in response to Message 58497.  
Last modified: 1 Aug 2018, 16:03:38 UTC

8 failed 5 mins after starting, all with segment violations.
ID: 58512 · Report as offensive
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,376,018
RAC: 3,616
Message 58514 - Posted: 1 Aug 2018, 16:45:36 UTC - in response to Message 58512.  

Another small batch, 743, along the lines of 740, has been released. Just 120 work units. Not on the front page yet.
ID: 58514 · Report as offensive
rjs5
Joined: 16 Jun 05
Posts: 16
Credit: 18,625,550
RAC: 11,526
Message 58515 - Posted: 2 Aug 2018, 6:32:44 UTC - in response to Message 58509.  

Sardis73

All of yours that I looked at had: Suspended CPDN Monitor - Suspend request from BOINC..., which usually indicates that you're using the default setting for the option Suspend when non-BOINC CPU usage is above.

It may be that these models are more sensitive than most to being interrupted at a crucial moment that's about 3 minutes into the running.

Try setting it to 100% to turn it off.


I had 2 of the sam25 generate compute errors (Signal 11 received: Segment violation).

It is easy to ignore a "sensitive" model when it generates an error and aborts. How do you detect when the "sensitive" model just gets the wrong answer and does NOT abort?

Seems like something important to isolate.
ID: 58515 · Report as offensive
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58516 - Posted: 2 Aug 2018, 7:09:01 UTC - in response to Message 58515.  
Last modified: 2 Aug 2018, 7:39:21 UTC

Only the researchers can do that when they run various programs against the data received.

With climate modeling, answers aren't just right or wrong. There's a wide range of possible answers. Which is what makes this project such a tricky little beast.

****************

On reflection

Perhaps I used words that have a different "obvious meaning" to others. So let's try again.

Most of the failures are at about 3 minutes. (I've seen a few that were further along.)

For the "average" computer speed, with the user doing something as well, and with BOINC set to start and stop the program frequently, perhaps the program is at a critical point at about 3 minutes, (saving data, swapping data across cell boundaries, etc), and just then BOINC says STOP. And then when the program is allowed to restart, data "or something else" is missing/corrupted/whatever, and the program goes to the next step in the current if/then/else decision statement and aborts.
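
The kind of failure described above (a stop request arriving while data is half written, then a restart reading damaged data) is what atomic checkpoint writes are meant to prevent in general. The sketch below shows that general technique only. It is not a claim about how the CPDN models save their state, and the function names and the JSON format are made up for the example.

```python
import json
import os
import tempfile
from typing import Optional


def save_checkpoint(state: dict, path: str) -> None:
    """Write state to a temporary file, then swap it into place atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        # os.replace is atomic on the same filesystem: a reader sees either
        # the old checkpoint or the new one, never a half-written file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> Optional[dict]:
    """Return the saved state, or None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return None
```

With this pattern, an interruption in the middle of `save_checkpoint` leaves the previous checkpoint untouched, so a restart can resume from the last complete state.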

All of mine are running OK, so I don't need to worry/think about all of those that are failing. That's the researcher's job.
ID: 58516 · Report as offensive
Alan K
Joined: 22 Feb 06
Posts: 484
Credit: 29,579,234
RAC: 4,572
Message 58517 - Posted: 2 Aug 2018, 8:47:59 UTC - in response to Message 58512.  
Last modified: 2 Aug 2018, 8:57:40 UTC

Another 7 from batch 742 failed this morning with a segmentation error. Unfortunately this exceeds my daily task quota, so the computer concerned won't be allowed to get more tasks until tomorrow. Also had one fail on my other computer, again after about 4 mins.
ID: 58517 · Report as offensive
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,376,018
RAC: 3,616
Message 58518 - Posted: 2 Aug 2018, 9:52:43 UTC - in response to Message 58517.  

And 744 has been released now. Produced from sam25t 10 year restarts. Too early to say whether these will have the same high failure rate or not.
ID: 58518 · Report as offensive
ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 33,347,857
RAC: 0
Message 58519 - Posted: 2 Aug 2018, 11:41:12 UTC

Just to add some perspective: my Win 10 Intel i7 computer is running 4 from the 742 batch. To check on the reported problems, having let them run for an hour, I deliberately suspended them and then resumed them. They did not fail.
One has reached 1 day and 3 hours, the other 3 are at 9 hours, all with no problems.
ID: 58519 · Report as offensive
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,376,018
RAC: 3,616
Message 58520 - Posted: 2 Aug 2018, 12:01:53 UTC

Perhaps not surprisingly it does seem that batch 744 is also suffering from the higher than normal failure rate.

I have looked at a few computers that have failed all of the tasks from 742 and 744 they have received, and there is no obvious common factor in either these or those that are successful.

Unless someone with more patience than I have, and the skills to extract the data automagically, looks at a larger dataset than I have done manually, it is not going to be easy to work this one out.
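
Once the task records are in hand, the tallying itself is simple. The sketch below assumes each record has already been reduced to a (CPU vendor, outcome) pair. Both that record shape and the sample data are hypothetical, and how to pull the records from the project site is not shown, since nothing here specifies an export format.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def failure_rate_by_vendor(tasks: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Return the fraction of tasks that ended in "error", per CPU vendor."""
    totals: Counter = Counter()
    failed: Counter = Counter()
    for vendor, outcome in tasks:
        totals[vendor] += 1
        if outcome == "error":
            failed[vendor] += 1
    return {vendor: failed[vendor] / totals[vendor] for vendor in totals}


# Hypothetical sample records, for illustration only.
sample = [
    ("AMD", "error"),
    ("AMD", "ok"),
    ("Intel", "ok"),
    ("Intel", "error"),
    ("Intel", "ok"),
    ("Intel", "ok"),
]
for vendor, rate in sorted(failure_rate_by_vendor(sample).items()):
    print(f"{vendor}: {rate:.0%} failed")
```

Comparing rates per vendor, rather than raw failure counts, avoids being misled by one vendor simply having more machines on the project.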
ID: 58520 · Report as offensive
rjs5
Joined: 16 Jun 05
Posts: 16
Credit: 18,625,550
RAC: 11,526
Message 58521 - Posted: 2 Aug 2018, 15:08:26 UTC - in response to Message 58516.  

Only the researchers can do that when they run various programs against the data received.

With climate modeling, answers aren't just right or wrong. There's a wide range of possible answers. Which is what makes this project such a tricky little beast.

****************

On reflection

Perhaps I used words that have a different "obvious meaning" to others. So let's try again.

Most of the failures are at about 3 minutes. (I've seen a few that were further along.)

For the "average" computer speed, with the user doing something as well, and with BOINC set to start and stop the program frequently, perhaps the program is at a critical point at about 3 minutes, (saving data, swapping data across cell boundaries, etc), and just then BOINC says STOP. And then when the program is allowed to restart, data "or something else" is missing/corrupted/whatever, and the program goes to the next step in the current if/then/else decision statement and aborts.

All of mine are running OK, so I don't need to worry/think about all of those that are failing. That's the researcher's job.


Thanks for the reply. It just seemed like a program bug where the results COULD possibly be from crunching some garbage.

If it is a rogue pointer that lands in random program code/data and so does not trigger the "SEGVIO" abort, then it seems like the computed results will not be what the researcher wants.
ID: 58521 · Report as offensive
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,376,018
RAC: 3,616
Message 58522 - Posted: 2 Aug 2018, 15:18:49 UTC - in response to Message 58521.  

I don't pretend to understand it, but they use some complex statistical package to analyse the results and decide if any need to be discarded. The program running the tasks will eliminate some itself, e.g. if an impossible climate is produced. At one time a fairly common example of this was -ve theta, indicating that there was a negative atmospheric pressure.
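
As a loose illustration of that kind of built-in check, a model can reject a state in which a quantity that must be positive has gone negative or non-finite. The field names and the rule below are assumptions made for the example, not the model's actual tests.

```python
import math
from typing import Dict


def has_impossible_values(fields: Dict[str, float]) -> bool:
    """Return True if any must-be-positive field is non-finite or <= 0."""
    return any(not math.isfinite(v) or v <= 0.0 for v in fields.values())


# Hypothetical field names and values, for illustration only.
print(has_impossible_values({"theta": 288.0, "pressure": 101325.0}))
print(has_impossible_values({"theta": -12.5, "pressure": 101325.0}))
```

A check like this only catches results that are outright impossible. Results that are wrong but still plausible need the statistical screening on the researchers' side.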
ID: 58522 · Report as offensive
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 58523 - Posted: 2 Aug 2018, 19:44:08 UTC - in response to Message 58521.  

The climate models used here are from the UK Met Office, where they run on supercomputers, so it's unlikely that there are still any bugs after all of this time.

But there's a large number of ancillary files, and some smaller programs to make them work with the main program.

It seems that I may have both "guessed right, and guessed wrong" in what's happening.
It is something that happens at about 3 minutes, but it's to do with switching from the global model to the regional model.

All of which is for the researchers, who put the various bits together ready for people to download, to solve. We just need to crunch on and provide the data for them, both good and bad.

Hopefully, some/lots of the crashes also produced the "out files", and these got uploaded. These contain info about where the programs were up to at the time they ended.

And batch 742 has been paused while thinking is in progress.
ID: 58523 · Report as offensive
ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 33,347,857
RAC: 0
Message 58524 - Posted: 2 Aug 2018, 22:08:18 UTC

And batch 742 has been paused while thinking is in progress.

So should I continue crunching my 742s and those in my queue?
ID: 58524 · Report as offensive
Thund3rb1rd
Joined: 18 Jun 05
Posts: 24
Credit: 2,500,676
RAC: 0
Message 58525 - Posted: 2 Aug 2018, 23:01:02 UTC

The climate models used here are from the UK Met Office, where they run on supercomputers, so it's unlikely that there are still any bugs after all of this time.


At the risk of being thought disagreeable, I respectfully disagree. This situation regarding numerous failing tasks is purely the result of inadequate, nay, POOR CPDN software design, aka bugs... perhaps an entire nest of them!

The entire purpose of BOINC is to enable multiple projects to be run on individual PCs, not supercomputers. Dinking around with the global settings inherent in BOINC to PERHAPS stabilize one project (i.e., CPDN) at the risk of destabilizing other BOINC-related projects (i.e., SETI, LHC, Cosmology, Milky Way, etc, etc, etc) is NOT a solution and is in fact foolhardy.

The tasks may or may not contain garbage data - if they do, then it is up to the programmers to determine what that bad data may contain and adjust the operating code to compensate, OR to adjust the code creating the tasks to edit the data more courageously.

In any event, comparing the operating system and processing software that may be running on whatever mainframe CPDN uses to the myriad operating systems being used by BOINC volunteers in a vain hope to stabilize CPDN is just simply useless. To reiterate a comment I made recently on this subject in another thread, NO ONE really understands what the problem is, let alone what a solution may be.

To mis-quote the Bard,

The fault, dear Brutus, is not in our PC's, but in CPDN, for we are underlings.

ID: 58525 · Report as offensive

©2024 climateprediction.net