New work

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2514
Credit: 3,130,281
RAC: 198
Message 58497 - Posted: 1 Aug 2018, 9:49:18 UTC - in response to Message 58496.  

17,550 released for batch 742.
And lots are failing soon after starting.

Is anyone getting these past the first few minutes? If so, it would be useful to know that they are not all failing with a segfault.
ID: 58497

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7006
Credit: 20,926,388
RAC: 5,087
Message 58498 - Posted: 1 Aug 2018, 9:56:27 UTC

:)

Just about to say that I have 7 that have just gone past 1 hour.
2 of these have failed before, both on AMD computers.
These are series s8, the other 5 are series sa.

So, luck of the draw.
ID: 58498

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2514
Credit: 3,130,281
RAC: 198
Message 58499 - Posted: 1 Aug 2018, 10:51:25 UTC - in response to Message 58498.  
Last modified: 1 Aug 2018, 10:52:17 UTC

A little digging found that, out of a little over 30 failures, just over half were on AMD machines. If the sample size is big enough to be significant, that would imply to me a higher failure rate for AMD.

Edit: Got bored after finding that many, now getting back to work.
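
Whether "just over half" means anything depends on what share of the hosts running the batch are AMD in the first place, which nobody here has stated. A rough sketch of how one might check, using a one-sided binomial tail: the counts below are placeholders standing in for "a little over 30" and "just over half", and `assumed_amd_share` is a purely hypothetical figure, not a project statistic.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Placeholder counts standing in for "a little over 30" and "just over half".
total_failures = 32
amd_failures = 17

# Hypothetical share of AMD hosts among all hosts that received the batch.
assumed_amd_share = 0.25

# Chance of seeing at least this many AMD failures if AMD hosts failed
# at the same rate as everyone else.
p_value = binomial_tail(amd_failures, total_failures, assumed_amd_share)
print(f"P(at least {amd_failures} AMD failures by chance) = {p_value:.4f}")
```

If the real AMD share were close to one half, the same counts would not be surprising at all, so the base rate is the number that matters.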
ID: 58499

Lockleys
Joined: 13 Jan 07
Posts: 194
Credit: 9,702,343
RAC: 1,128
Message 58500 - Posted: 1 Aug 2018, 11:31:09 UTC

I have a Batch 742 model that's just coming up to 3 hours of processing. Early days, I know, but well past the initial fail zone.
ID: 58500

Bonsai911
Joined: 9 Sep 04
Posts: 210
Credit: 28,317,278
RAC: 210
Message 58503 - Posted: 1 Aug 2018, 13:23:57 UTC

I have 12 workunits. All processing fine without any problems.

Keep cooling down your hardware (35° celsius in Germany) and have a nice day,

Bonsai911
ID: 58503

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2514
Credit: 3,130,281
RAC: 198
Message 58505 - Posted: 1 Aug 2018, 13:46:35 UTC - in response to Message 58503.  

35° celsius in Germany


I am only running three out of four cores on my laptop as it has been so warm here in the UK. I wouldn't be at all surprised if high temperatures, in Europe at least, are contributing to the current high failure rate, though I haven't seen suggestions that it is widespread apart from the latest batch.
ID: 58505

Sardis73
Joined: 1 Apr 12
Posts: 2
Credit: 9,912,483
RAC: 6,055
Message 58507 - Posted: 1 Aug 2018, 13:50:44 UTC - in response to Message 58499.  

I have about 17 or 18 failures in a row. All are 742. They fail in less than 3 minutes.
ID: 58507

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7006
Credit: 20,926,388
RAC: 5,087
Message 58509 - Posted: 1 Aug 2018, 15:05:00 UTC - in response to Message 58507.  

Sardis73

All of yours that I looked at had: Suspended CPDN Monitor - Suspend request from BOINC..., which usually indicates that you're using the default setting for the option Suspend when non-BOINC CPU usage is above.

It may be that these models are more sensitive than most to being interrupted at a crucial moment that's about 3 minutes into the running.

Try setting it to 100% to turn it off.
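
For anyone wanting to check their own task output the same way, a minimal sketch follows. It assumes you have copied a task's stderr text into a string; the two marker strings are the messages reported in this thread, the sample text is a hypothetical excerpt built only from them, and `classify_task` is an illustrative name, not part of BOINC or CPDN.

```python
# Marker strings taken from the messages reported in this thread.
SUSPEND_MARKER = "Suspend request from BOINC"
SEGFAULT_MARKER = "Segment violation"


def classify_task(stderr_text: str) -> dict:
    """Count suspend requests and note whether a segment violation was logged."""
    lines = stderr_text.splitlines()
    return {
        "suspend_requests": sum(SUSPEND_MARKER in line for line in lines),
        "segfault": any(SEGFAULT_MARKER in line for line in lines),
    }


# Hypothetical stderr excerpt; real task output contains much more than this.
sample = "\n".join(
    [
        "Suspended CPDN Monitor - Suspend request from BOINC",
        "Signal 11 received: Segment violation",
    ]
)
result = classify_task(sample)
print(result)
```

A task showing suspend requests shortly before the segment violation would fit the interruption idea; one that fails with no suspends at all would not.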
ID: 58509

Alan K
Joined: 22 Feb 06
Posts: 295
Credit: 14,768,121
RAC: 313
Message 58512 - Posted: 1 Aug 2018, 15:56:39 UTC - in response to Message 58497.  
Last modified: 1 Aug 2018, 16:03:38 UTC

8 failed 5 minutes after starting, all with a segment violation.
ID: 58512

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2514
Credit: 3,130,281
RAC: 198
Message 58514 - Posted: 1 Aug 2018, 16:45:36 UTC - in response to Message 58512.  

Another small batch, 743, along the lines of 740, has been released. Just 120 work units. Not on the front page yet.
ID: 58514

rjs5
Joined: 16 Jun 05
Posts: 12
Credit: 8,397,641
RAC: 0
Message 58515 - Posted: 2 Aug 2018, 6:32:44 UTC - in response to Message 58509.  

Sardis73

All of yours that I looked at had: Suspended CPDN Monitor - Suspend request from BOINC..., which usually indicates that you're using the default setting for the option Suspend when non-BOINC CPU usage is above.

It may be that these models are more sensitive than most to being interrupted at a crucial moment that's about 3 minutes into the running.

Try setting it to 100% to turn it off.


I had 2 of the sam25 generate compute errors (Signal 11 received: Segment violation).

It is easy to ignore a "sensitive" model when it generates an error and aborts. How do you detect when the "sensitive" model just gets the wrong answer and does NOT abort?

Seems like something important to isolate.
ID: 58515

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7006
Credit: 20,926,388
RAC: 5,087
Message 58516 - Posted: 2 Aug 2018, 7:09:01 UTC - in response to Message 58515.  
Last modified: 2 Aug 2018, 7:39:21 UTC

Only the researchers can do that when they run various programs against the data received.

With climate modeling, answers aren't just right or wrong. There's a wide range of possible answers. Which is what makes this project such a tricky little beast.

****************

On reflection

Perhaps I used words that have a different "obvious meaning" to others. So let's try again.

Most of the failures are at about 3 minutes. (I've seen a few that were further along.)

For the "average" computer speed, with the user doing something as well, and with BOINC set to start and stop the program frequently, perhaps the program is at a critical point at about 3 minutes (saving data, swapping data across cell boundaries, etc.), and just then BOINC says STOP. Then, when the program is allowed to restart, data "or something else" is missing/corrupted/whatever, and the program goes to the next step in the current if/then/else decision statement and aborts.

All of mine are running OK, so I don't need to worry/think about all of those that are failing. That's the researcher's job.
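
To make the "interrupted while saving data" idea concrete, here is a small sketch of the general technique programs use to survive being stopped mid-write: write to a temporary file, then rename it over the old one. This is only an illustration of the failure mode guessed at above, not how the climate model actually saves its state; `save_checkpoint`, `load_checkpoint` and the file name are hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write state so that path always holds either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        # Rename over the old file; a stop before this line leaves it untouched.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> dict:
    """Read back the most recently completed checkpoint."""
    with open(path) as f:
        return json.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "model_state.json")  # hypothetical file name
        save_checkpoint({"timestep": 1}, ckpt)
        save_checkpoint({"timestep": 2}, ckpt)
        print(load_checkpoint(ckpt))
```

A program that instead overwrites its state file in place can be left with a half-written file if it is stopped at the wrong moment, which is the kind of "missing/corrupted/whatever" data described above.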
ID: 58516

Alan K
Joined: 22 Feb 06
Posts: 295
Credit: 14,768,121
RAC: 313
Message 58517 - Posted: 2 Aug 2018, 8:47:59 UTC - in response to Message 58512.  
Last modified: 2 Aug 2018, 8:57:40 UTC

Another 7 from batch 742 failed this morning with a segmentation error. Unfortunately this exceeds my daily task quota, so the computer concerned won't be allowed to get more tasks until tomorrow. Also had one fail on my other computer, again after about 4 minutes.
ID: 58517

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2514
Credit: 3,130,281
RAC: 198
Message 58518 - Posted: 2 Aug 2018, 9:52:43 UTC - in response to Message 58517.  

And 744 has been released now. Produced from sam25t 10 year restarts. Too early to say whether these will have the same high failure rate or not.
ID: 58518

ed2353
Joined: 15 Feb 06
Posts: 126
Credit: 31,451,222
RAC: 1,172
Message 58519 - Posted: 2 Aug 2018, 11:41:12 UTC

Just to add some perspective. My Win 10 i-7 Intel computer is running 4 from the 742 batch. To check on the reported problems, having let them run for an hour, I deliberately suspended them and then resumed them. They did not fail.
One has reached 1 day and 3 hours, the other 3 are at 9 hours, all with no problems.
ID: 58519

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2514
Credit: 3,130,281
RAC: 198
Message 58520 - Posted: 2 Aug 2018, 12:01:53 UTC

Perhaps not surprisingly it does seem that batch 744 is also suffering from the higher than normal failure rate.

I have looked at a few computers that have failed all of the tasks from 742 and 744 they have received and there is no obvious common factor in either these or those that are successful.

Unless someone with more patience than I have, and the skills to extract the data automagically, looks at a larger dataset than I have done manually, it is not going to be easy to work this one out.
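
As a starting point for anyone who does want to automate it, the tallying part might look something like the sketch below. It assumes the task results have already been collected into records somehow; the `TaskResult` fields and the sample rows are made up for illustration and are not real project data.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class TaskResult(NamedTuple):
    """One task outcome; the field names are hypothetical."""
    host_id: int
    cpu_vendor: str
    batch: int
    failed: bool


def failure_rates(results: Iterable[TaskResult]) -> dict:
    """Return the fraction of failed tasks for each CPU vendor."""
    totals, failures = Counter(), Counter()
    for r in results:
        totals[r.cpu_vendor] += 1
        failures[r.cpu_vendor] += r.failed  # True counts as 1
    return {vendor: failures[vendor] / totals[vendor] for vendor in totals}


# Made-up records purely to exercise the function; not real project data.
sample = [
    TaskResult(1, "AMD", 742, True),
    TaskResult(1, "AMD", 742, False),
    TaskResult(2, "Intel", 742, False),
    TaskResult(3, "Intel", 744, True),
    TaskResult(3, "Intel", 744, False),
    TaskResult(3, "Intel", 744, False),
]
rates = failure_rates(sample)
print(rates)
```

The same grouping could be done by operating system, BOINC version or batch instead of CPU vendor by changing the key used in the two counters.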
ID: 58520

rjs5
Joined: 16 Jun 05
Posts: 12
Credit: 8,397,641
RAC: 0
Message 58521 - Posted: 2 Aug 2018, 15:08:26 UTC - in response to Message 58516.  

Only the researchers can do that when they run various programs against the data received.

With climate modeling, answers aren't just right or wrong. There's a wide range of possible answers. Which is what makes this project such a tricky little beast.

****************

On reflection

Perhaps I used words that have a different "obvious meaning" to others. So let's try again.

Most of the failures are at about 3 minutes. (I've seen a few that were further along.)

For the "average" computer speed, with the user doing something as well, and with BOINC set to start and stop the program frequently, perhaps the program is at a critical point at about 3 minutes (saving data, swapping data across cell boundaries, etc.), and just then BOINC says STOP. Then, when the program is allowed to restart, data "or something else" is missing/corrupted/whatever, and the program goes to the next step in the current if/then/else decision statement and aborts.

All of mine are running OK, so I don't need to worry/think about all of those that are failing. That's the researcher's job.


Thanks for the reply. It just seemed like a program bug where the results COULD possibly be from crunching some garbage.

If it is a rogue pointer that points into random program code or data and does not trigger a segment violation, rather than one that aborts with the "SEGVIO", then it seems like the computed results will not be what the researcher wants.
ID: 58521

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 2514
Credit: 3,130,281
RAC: 198
Message 58522 - Posted: 2 Aug 2018, 15:18:49 UTC - in response to Message 58521.  

I don't pretend to understand it, but they use some complex statistical package to analyse the results and decide if any need to be discarded. The program running the tasks will eliminate some, e.g. if an impossible climate is produced. At one time a fairly common example of this was negative theta, indicating that there was a negative atmospheric pressure.
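
The kind of "impossible climate" check described above can be pictured with a tiny sketch like this one. The real checks inside the model are not shown anywhere in this thread; `has_impossible_values` and the sample numbers are hypothetical, and the only rule assumed is that theta should never be negative.

```python
from typing import Sequence


def has_impossible_values(theta: Sequence[float]) -> bool:
    """Return True if any theta value is negative or not a number."""
    # `not (value >= 0.0)` is also True for NaN, which a plain `value < 0.0` would miss.
    return any(not (value >= 0.0) for value in theta)


# Hypothetical values, chosen only to show both outcomes.
plausible_run = [280.0, 300.5, 315.2]
broken_run = [280.0, -5.0, 315.2]

print(has_impossible_values(plausible_run))
print(has_impossible_values(broken_run))
```

A check like this only catches results that are physically impossible; a run that is wrong but still plausible would pass, which is why the statistical analysis afterwards matters.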
ID: 58522

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7006
Credit: 20,926,388
RAC: 5,087
Message 58523 - Posted: 2 Aug 2018, 19:44:08 UTC - in response to Message 58521.  

The climate models used here are from the UK Met Office, where they run on supercomputers, so it's unlikely that there are still any bugs after all of this time.

But there's a large number of ancillary files, and some smaller programs to make them work with the main program.

It seems that I may have both "guessed right, and guessed wrong" in what's happening.
It is something that happens at about 3 minutes, but it's to do with switching from the global model to the regional model.

All of which is up to the researchers that are putting various bits together, ready for people to download, to solve. We just need to crunch on and provide the data for them, both good and bad.

Hopefully, some/lots of the crashes also produce the "out files", and they got uploaded. These contain info about where the programs were up to at the time they ended.

And batch 742 has been paused while thinking is in progress.
ID: 58523

ed2353
Joined: 15 Feb 06
Posts: 126
Credit: 31,451,222
RAC: 1,172
Message 58524 - Posted: 2 Aug 2018, 22:08:18 UTC

And batch 742 has been paused while thinking is in progress.

So should I continue crunching my 742s and those in my queue?
ID: 58524

©2019 climateprediction.net