HadCM3 short - errors galore

Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50499 - Posted: 11 Oct 2014, 19:15:56 UTC - in response to Message 50495.  

Alex

As stated in the post before yours, the problems started with v7.0.38, so you need to go back to a version earlier than this.

ID: 50499 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50500 - Posted: 11 Oct 2014, 19:18:10 UTC - in response to Message 50496.  

Albert

Does your computer on which they fail have BOINC installed as a service?

ID: 50500 · Report as offensive     Reply Quote
MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 50502 - Posted: 12 Oct 2014, 0:42:33 UTC - in response to Message 50491.  

Lots of useful feedback - thanks.

Richard

I went back to v7.2.36 as suggested and things seem to be running fine as a service. One task has been going for an hour now and I'm bringing more on line - thought I'd stagger their start times. Task run time is around 20 hours so we'll see what happens in the morning.

In addition, I can stop the service and restart without knackering the task. Important for the way I use my PC.

Ron

Hadn't thought of that one, but the above solved the problem. Another alternative I found was to create a shortcut to the screensaver settings; that way, instead of logging out, you just hit the shortcut and tick the box that requires a logon on resume. If anyone is interested, see SevenForums here.
ID: 50502 · Report as offensive     Reply Quote
keputnam

Joined: 31 Aug 04
Posts: 26
Credit: 3,806,118
RAC: 1,473
Message 50503 - Posted: 12 Oct 2014, 1:07:28 UTC
Last modified: 12 Oct 2014, 1:10:01 UTC

Running Boinc 7.2.42 in user mode

Currently running two short models

Just got home to find multiple Visual Fortran errors

forrtl severe (38) error during write D:\Boinc ...\tmp\pipe\dummy


Image hadcm3s_um_7.24_w
Routine unknown
Line unknown
Source unknown

I OKed the errors, and immediately got two more, etc.

Aborting both WUs at this point

ID: 50503 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50504 - Posted: 12 Oct 2014, 3:53:47 UTC

And keep in mind that there are reasons for the later versions.
The downgrade may not be possible if multiple projects are being run. In that case it may be necessary to stop running cpdn for a few months, and concentrate on one of your other projects.

ID: 50504 · Report as offensive     Reply Quote
Albert H.

Joined: 18 Feb 06
Posts: 72
Credit: 54,705,189
RAC: 14,986
Message 50507 - Posted: 12 Oct 2014, 8:35:26 UTC - in response to Message 50500.  

Les
Excuse my ignorance: how can I see whether BOINC is installed as a service or not?
ID: 50507 · Report as offensive     Reply Quote
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50508 - Posted: 12 Oct 2014, 9:20:43 UTC - in response to Message 50507.  

Look in the Event log. Should be 4th from the start of the list.

See this post for a highlighted example, with the source of that post just under it.
If it's NOT a service install, then there'll be a user name, such as: 12/10/2014 8:18:11 PM||Running under account Leslie
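For anyone who would rather not eyeball the log, the check above can be sketched in a few lines of Python. This is only a sketch: the pattern is taken from the single example line quoted in this post, and the assumption that other BOINC versions word the line the same way is untested. The function name `find_boinc_account` is made up for illustration.

```python
import re
from typing import Optional

# Pattern copied from the example line in the post; the wording on other
# BOINC versions is an assumption.
ACCOUNT_RE = re.compile(r"Running under account (.+)")

def find_boinc_account(event_log_text: str) -> Optional[str]:
    """Return the account name from the first matching line, else None."""
    for line in event_log_text.splitlines():
        match = ACCOUNT_RE.search(line)
        if match:
            return match.group(1).strip()
    return None

# Example line quoted in the post above.
sample = "12/10/2014 8:18:11 PM||Running under account Leslie"
print(find_boinc_account(sample))
```

Paste the first few lines of the Event Log into a string and pass it in. A returned name means BOINC is running under that user account; `None` only means the line was not found in the text supplied.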


ID: 50508 · Report as offensive     Reply Quote
MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 50509 - Posted: 12 Oct 2014, 9:21:29 UTC - in response to Message 50502.  

I've been through quite a few short tasks since coming back online. Quite a few have failed, but all that I have checked are Error code 22 and Invalid Theta, so I assume that's OK; e.g. see the errors at the top of this page. Interestingly, if you then view the previous page it shows the 5 hadcm3s short tasks that are still running (or were when I wrote this!). The ones still running were sent on the 11th; all those sent on the 12th have Invalid Theta. Coincidence or model differences?

(Note for those who do not realise: the page references will only stay valid for a few days, and if you are looking at some time in the future, you will be looking at the wrong page and task list. For individual tasks see Example of task with Error 22 and Example of task running (10 hours, 9 to go) at time of post.)

I've put the beast on no new tasks overnight and hopefully there will be some feedback in the morning.
ID: 50509 · Report as offensive     Reply Quote
Richard Haselgrove

Joined: 1 Jan 07
Posts: 943
Credit: 34,178,853
RAC: 6,329
Message 50510 - Posted: 12 Oct 2014, 9:40:11 UTC - in response to Message 50502.  

[MartinNZ wrote:] I went back to v7.2.36 as suggested and things seem to be running fine as a service. One task has been going for an hour now and I'm bringing more on line - thought I'd stagger their start times. Task run time is around 20 hours so we'll see what happens in the morning.

That's really good news, and finally confirms the diagnosis of what causes this particular problem. The staff will now know exactly which component needs to be updated before they release the next revision of this application.
ID: 50510 · Report as offensive     Reply Quote
Albert H.

Joined: 18 Feb 06
Posts: 72
Credit: 54,705,189
RAC: 14,986
Message 50513 - Posted: 12 Oct 2014, 15:24:32 UTC - in response to Message 50508.  

Les,
thanks,
On both of my computers BOINC is NOT installed as a service.

No. 1289686 version 7.2.42 W x86_64 Processor 4 Genuine Intel Q9550

No. 1316390 version 7.2.42 W x86_64 Processor 8 Genuine Intel i7-3612
This one is having problems with the shorts.
ID: 50513 · Report as offensive     Reply Quote
MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 50541 - Posted: 17 Oct 2014, 3:53:40 UTC - in response to Message 50509.  

The batch sent 16/17 Oct have all failed - that's a total of 46 on my PC. Error Code 22 & invalid Theta. All the work units have multiple failures.

The run time was so short before failure that it was only because I heard the cooling fans speed up that I noticed them running.

As I assume this is a model failure, I'm leaving the PC to accept new tasks.
ID: 50541 · Report as offensive     Reply Quote
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 50542 - Posted: 17 Oct 2014, 5:49:46 UTC

Right now my boxes are all on linux, because of these problems on Windows. Failures are few.
This is not a "Linux is wonderful, Windows sux" post.

What I'm thinking is, the cpdn beta site could use more volunteers and more of the severely limited staff time.

I know (or think I know) that new models are compiled with some version of ICC and/or Intel Fortran (a compiler that does lots of Intel-specific optimizing).

I do know that the number of options for optimizing with either the Intel or gcc compilers is, like, 30 -- so that makes 2^30 possible combinations.

Some of these options advertise for specific hardware, some for "sloppy math" -- and from my limited experience on the beta site a few years ago, any combi could work, or not.


So, while I have asked for more testing from time to time, totally exhaustive testing isn't practical.
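To put a rough number on that: with about 30 independent on/off options there are 2^30 combinations. The sketch below works it out and, purely as an illustrative assumption, charges each combination one 20-hour test run (the short-task run time mentioned earlier in the thread). Both inputs are ballpark figures, not project data.

```python
# Rough arithmetic only; both inputs are ballpark assumptions.
N_OPTIONS = 30        # "like 30" on/off compiler options (from the post)
HOURS_PER_TEST = 20   # assumed: one short-task run per combination

combinations = 2 ** N_OPTIONS
total_hours = combinations * HOURS_PER_TEST
machine_years = total_hours / (24 * 365)

print(f"{combinations:,} combinations")
print(f"about {machine_years:,.0f} machine-years at {HOURS_PER_TEST} h each")
```

Even if the real option count or test time is off by a large factor, the total stays far beyond what a small staff and a beta site could cover.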

The details of making a new model, or a particular batch, work -- well, it's not rocket surgery, but it sure ain't simple.


ID: 50542 · Report as offensive     Reply Quote
ed2353

Joined: 15 Feb 06
Posts: 137
Credit: 33,452,399
RAC: 5,451
Message 50543 - Posted: 17 Oct 2014, 7:47:07 UTC - in response to Message 50541.  

So far, I have 24 of these from 16/17th October.
I've just run them all for 5-10 minutes to check for quick failures.
None have failed in that time.

I run v 7.2.33 (64bit) on Windows 8.1 (64bit).

I conclude that this is not a batch failure for Windows boxes.
ID: 50543 · Report as offensive     Reply Quote
ed2353

Joined: 15 Feb 06
Posts: 137
Credit: 33,452,399
RAC: 5,451
Message 50544 - Posted: 17 Oct 2014, 7:53:10 UTC - in response to Message 50543.  

On the other hand, those sent out on the 8th October were all failures on my box.
ID: 50544 · Report as offensive     Reply Quote
Nigel Garvey

Joined: 5 May 10
Posts: 69
Credit: 1,169,103
RAC: 2,258
Message 50545 - Posted: 17 Oct 2014, 9:41:06 UTC - in response to Message 50467.  

You also had:
process exited with code 9
Might be a Mac issue, or it could be related to you using a test version of BOINC.
The 2 trickle_up files got returned, so it may just be a post-processing error.


Thanks for the interesting info.

But most of my failures are the "code 9", so...


Similarly here.

My Mac's had three "shorts" so far. The first two ran to completion, but their final uploads took surprisingly little time and their outcomes are shown on the site as "Client error", the Stderr reports being "process exited with code 9 (0x9, -247)". No errors were reported in my BOINC Manager (7.2.42).

The third's currently at 48% completion and has just successfully completed its first upload.

All three tasks have also been attempted by Windows machines, which have errored after only a few seconds. The second task was later successfully completed by a Linux machine, which got the same amount of credit for it as my Mac did.

NG
ID: 50545 · Report as offensive     Reply Quote
SuperSluether

Joined: 6 Jul 14
Posts: 11
Credit: 367,660
RAC: 0
Message 50546 - Posted: 17 Oct 2014, 15:48:39 UTC - in response to Message 50545.  

I just got another round of "short" tasks. Very odd behavior I'm seeing here. They all went to about 6.369%, then every few minutes switched between that and 100%, finally giving the Computation Error after 15 minutes.

I looked in the Event Log, and it says that the output file was absent for these tasks. Not sure if that helps figure out what's wrong.
ID: 50546 · Report as offensive     Reply Quote
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4342
Credit: 16,497,933
RAC: 6,477
Message 50547 - Posted: 17 Oct 2014, 16:01:16 UTC

SuperSluether, if you go to the model page, you will see the invalid theta message if you click on the plus sign next to stderr. This indicates that an unstable or impossible climate has been generated by the model's physics.

I still don't understand why the physics should be so much more stable under Linux, though.
ID: 50547 · Report as offensive     Reply Quote
Iain Inglis
Volunteer moderator

Send message
Joined: 16 Jan 10
Posts: 1081
Credit: 6,980,320
RAC: 3,893
Message 50548 - Posted: 17 Oct 2014, 18:31:36 UTC - in response to Message 50545.  

[Nigel Garvey wrote:] ... The first two ran to completion, but their final uploads took surprisingly little time and their outcomes are shown on the site as "Client error", the Stderr reports being "process exited with code 9 (0x9, -247)". ...
... this is a Mac problem, which affects all 7-series application releases on my machine. I've decided not to run model types on my Mac that end with "error 9", which only leaves 6-series EU, ANZ and HADCM3N at the moment. So it's currently finishing off an ANZ and will then have a rest. (Though prepare for the return of the HADCM3N: I just hope it's not a rebuild ...)
ID: 50548 · Report as offensive     Reply Quote
MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 50551 - Posted: 18 Oct 2014, 4:47:39 UTC - in response to Message 50542.  

[Eirik Redd wrote:] Right now my boxes are all on linux, because of these problems on Windows. Failures are few.


Ah Eirik, the old Win/Linux/Mac discussion.

Looking at the list of top computers, I chose at random three Linux and three Windows boxes from the first page. Looking at the HadCM3 short tasks, to make it easy for me I only counted tasks sent during September and only counted errors while computing. These are the results (computer ID - % of tasks with errors, give or take a bit):

Win: 1270273 - 24%, 1278109 - 17%, 1295173 - 27%
Linux: 124 - 30%, 805269 - 27%, 1292107 - 30%

So, not really much in it, and actually on that tiny sample, Windows would be the better system. Would be nice to see someone do a bigger sample :-)
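For anyone wanting to redo this tally on a bigger sample, here is a minimal sketch that averages the six per-host percentages quoted above. It is an unweighted mean of percentages (the task counts behind each figure are not given in the post), so treat it as a rough comparison only.

```python
# Per-host error percentages copied from the post above.
error_pct = {
    "Win":   {1270273: 24, 1278109: 17, 1295173: 27},
    "Linux": {124: 30, 805269: 27, 1292107: 30},
}

def mean_pct(hosts: dict) -> float:
    """Unweighted mean of the per-host error percentages."""
    return sum(hosts.values()) / len(hosts)

averages = {os_name: mean_pct(hosts) for os_name, hosts in error_pct.items()}
for os_name, avg in averages.items():
    print(f"{os_name}: {avg:.1f}% average over {len(error_pct[os_name])} hosts")
```

Adding more hosts is just a matter of extending the two dictionaries. With real task counts per host, a weighted average would be the better figure.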

For your own boxes, Eirik, I looked at your top two PCs, and on the same basis there is an error rate of 20% and 25% for those two PCs.

Why I have such a high failure rate I don't know. For the few I ran in Sept, my error rate was 44%, and all of those were Invalid Theta. Luck of the draw? We just won't go into my October tasks! Grrrrr.

ID: 50551 · Report as offensive     Reply Quote
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 50552 - Posted: 18 Oct 2014, 8:11:11 UTC - in response to Message 50551.  

Looks to me like the "downgrade BOINC if you use it as a service on Windows" advice is good, and as Martin posted, the Windows-Linux difference goes away when that problem is resolved.

Thanks all who figured that out.

That leaves the 20-25% failure rate. I just spent an hour looking through my own failed wu's and a few other hosts --

It seems the failure modes run in clusters. Many seem to be faults in the configuration of the wu's: "REPLANCA - ... mismatch", or "download error" where the needed files don't exist, or the FORTRAN errors.

And, yeah Martin, thanks again,

I really think it may be just luck of the draw. Martin looked at my two top machines -- anyone can -- but somehow over the last 2-3 weeks one of those machines has got mostly pnw's and the other has got mostly shorts. The failure rate of the shorts is similar on both, but how did two such similar machines get such a different distribution of pnw vs short?

Is there a statistician out there who can say "random selection variation"?
Or "WLM variance explains (x%) of the model failure rate with this month's models at the cpdn project"?
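One way a statistician might approach the question is a two-proportion z-test. The sketch below uses only the Python standard library. The task counts in it are hypothetical placeholders (the thread gives percentages, not counts), so the printed p-value means nothing until real counts from the task pages are substituted. The normal approximation is also shaky for small samples.

```python
import math

def two_proportion_p_value(fail_a: int, n_a: int, fail_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both hosts share the same failure rate."""
    p_a, p_b = fail_a / n_a, fail_b / n_b
    pooled = (fail_a + fail_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All tasks failed or all succeeded on both hosts: no difference to test.
        return 1.0
    z = (p_a - p_b) / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# HYPOTHETICAL counts, for illustration only: 10 of 40 vs 11 of 25 failed.
p_value = two_proportion_p_value(10, 40, 11, 25)
print(f"p = {p_value:.3f}")
```

A large p-value would be consistent with "luck of the draw"; a small one would suggest that something other than chance separates the two hosts.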

Keep on crunching, and thanks.


ID: 50552 · Report as offensive     Reply Quote