HadCM3 short errors
Author | Message |
---|---|
Joined: 13 Jun 08 Posts: 6 Credit: 1,372,493 RAC: 0 |
Downloads have failed on every one of them for a couple of days now. |
Joined: 6 Jul 06 Posts: 98 Credit: 1,664,649 RAC: 0 |
Your failures are all for work units that were made back in Sept '14, plus a few from Oct '14. No one has been able to run these work units; they are faulty. The successful ones you have run come from Jan '15. We seem to have to wait until they all cycle through the system to get rid of them. I have been getting similar errors. Conan |
Richard Haselgrove Joined: 1 Jan 07 Posts: 387 Credit: 10,590,320 RAC: 16,269 |
Andy says that the Sep/Oct 2014 batches are no longer needed by the researchers and have been removed from the server this morning - they should not trouble us any further. There is a later batch still in progress, which can be identified by task names starting hadcm3s_7 - the researchers do still need this new batch, and they should be allowed to run. |
Joined: 13 Jun 08 Posts: 6 Credit: 1,372,493 RAC: 0 |
Got it. Thank you for the update! |
Joined: 22 Feb 06 Posts: 369 Credit: 17,427,588 RAC: 6,502 |
Have just had some computation errors on hadcm3s-4 models with a year date of 2007. Checking on my account, these were marked as no-resubmission, so they have been aborted. |
Jim1348 Joined: 15 Jan 06 Posts: 554 Credit: 23,540,409 RAC: 10,002 |
I have seen that too, and unfortunately they run for 8 to 9 hours before they fail. |
Bob Browett Joined: 31 Aug 04 Posts: 11 Credit: 2,558,802 RAC: 0 |
Ah! That explains why all my units are currently going "phut" after wasting my time for 10 hours... Do they know how many more there are? I will go and crunch for World Community Grid for a couple of weeks, and then come back to CPDN. Regards Bob |
Les Bayliss Volunteer moderator Joined: 5 Sep 04 Posts: 7355 Credit: 23,425,081 RAC: 1 |
There are currently no more "short" models, as can be seen on the Server Status page, 5th from the bottom in the blue menu to the left. (Except, of course, the usual few re-tries that have failed on other computers.) |
Professor Desty Nova Joined: 19 Sep 04 Posts: 92 Credit: 1,833,588 RAC: 88 |
It seems the server decided to send sure-to-fail "No Resubmission" models from November 2014 (or someone pushed the wrong button :-P). I just received two: one failed (error: Out Of Memory), and the other I aborted because it had already failed for three other people (also error: Out Of Memory). Professor Desty Nova Researching Karma the Hard Way |
MartinNZ Joined: 22 Mar 06 Posts: 144 Credit: 24,695,428 RAC: 0 |
Yup, just noticed a bad run, all "no resubmission" from 19 Nov workunits. Looks like things will be quiet with no more models for the moment, but hey that's life. Might be a good time to blow the dust out of the PC :-) |
alanb1951 Joined: 31 Aug 04 Posts: 19 Credit: 7,895,701 RAC: 4,024 |
There seems to be another batch of "No resubmission" jobs (originally from 22nd December 2014) - I'd had several of these fail (memory allocation error) before I realized... Since the first of these turned up, I've not seen a single hadcm3s job that isn't from that bad batch, though I presume not all 35,000+ jobs available according to the server status page are bad jobs. So I'm left wondering whether to babysit BOINC/CPDN to watch for bad jobs or to [temporarily] stop taking hadcm3s jobs at all... Ah, well, as a Linux user at least I can do MOSES+Triffid jobs instead... |
Jim1348 Joined: 15 Jan 06 Posts: 554 Credit: 23,540,409 RAC: 10,002 |
"Ah, well, as a Linux user at least I can do MOSES+Triffid jobs instead..." It could be worse. Both of my Win7 64-bit machines that were doing the shorts gave me BSODs in the last 24 hours, the first time they have ever done that. I could recover one (with a CHKDSK for errors), but had to reload the OS on the other. The only thing I can see is that I was picking up a lot of HadCM3 short errors at that time. I didn't know that they could crash machines, but you learn something every day. |
ed2353 Joined: 15 Feb 06 Posts: 137 Credit: 33,337,958 RAC: 11 |
All the 1980s seem to be "No Resubmission". It would have been helpful if notice had been given that there was a "rogue" batch, so that we could abort them. |
Joined: 31 Dec 07 Posts: 1134 Credit: 20,798,831 RAC: 4,794 |
|
Jim1348 Joined: 15 Jan 06 Posts: 554 Credit: 23,540,409 RAC: 10,002 |
|
Lockleys Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
I seem to be getting a real spate of hadcm3s models which are marked "No Resubmission". It has got to the point where, when one of these starts coming to me from the server, I look to see if it is so marked and abort it before it has fully arrived. Saves bandwidth and time. |
Joined: 10 Dec 04 Posts: 15 Credit: 4,870,098 RAC: 0 |
It seems to me that the error rate for short models is significantly greater on my AMD machines than it is on the Intel ones. "Nothing will benefit human health and increase chances for survival of life on Earth as much as the evolution to a vegetarian diet." - Einstein |
Joined: 22 Feb 06 Posts: 369 Credit: 17,427,588 RAC: 6,502 |
These are all a batch of 1980 models which will fail due to an error. They should have been stopped from release but have slipped through. I expect Les will confirm that they should be aborted - I have had about 12 in the last few days and have aborted them all. There was a post in another thread about the same problem some months ago. |
Lockleys Joined: 13 Jan 07 Posts: 195 Credit: 10,581,566 RAC: 0 |
Alan, While most of my "No Resubmission" tasks are the 1980s batch, a few are not, so it is necessary to check them all. |
Joined: 22 Feb 06 Posts: 369 Credit: 17,427,588 RAC: 6,502 |
Just been checking some of my other tasks, with dates of 1994 and 2004, which are giving compute errors on other machines: either no heartbeat or invalid theta errors. I guess these are going to fail at some point. These are not "no resubmission" tasks. |
©2021 climateprediction.net