climateprediction.net home page
Iceworlds & Slowdowns hadsm3/mh - Closed - Discussion

old_user442168

Joined: 12 Apr 07
Posts: 26
Credit: 12,681
RAC: 0
Message 31751 - Posted: 16 Dec 2007, 17:08:26 UTC - in response to Message 31746.  

The HADCM is not in the list anymore, only the HADSM, and I haven't got any backups. So it doesn't matter whether I click abort or reset.
Or is it possible to continue the HADCM without having a backup?
ID: 31751 · Report as offensive
Profile mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31752 - Posted: 16 Dec 2007, 18:01:43 UTC

The HadCM crashed on 24 October, so if you haven't got a backup you can't continue it. We can only continue models we can see in Boinc Manager or that we can restore from a backup. So that computer will need a new model. In your project preferences you can select either a HADCM or a HADSM for it. The computer has enough RAM for both types of model.

Look at the project README about backups; get to it through my signature. The first method by Les is very easy and quick to do regularly. Then, if another model crashes you can restore the backup and continue the same model.

About your second computer, the Athlon 1600. It has XP service pack 1, so if you can upgrade to service pack 2 that would be a good idea. This computer only has 256 MB RAM, so make sure it only runs HADSM models, which is what it has at the moment, because it hasn't got enough RAM for HADCMs. If the computer works well, you could consider buying an extra 256 MB RAM card and adding it to the spare slot on the motherboard. This isn't very expensive and would probably improve the computer's general performance.

The second computer has 4 HADSM models that don't seem to be running. (? I'm not sure about that.) Don't allow it to get any more models because it can only run one model at a time. In the Projects tab of Boinc Manager, select CPDN and click the No New Tasks button once only. The day the computer needs a new model you can click the button again.

Hope that helps.
Cpdn news
ID: 31752 · Report as offensive
old_user442168

Joined: 12 Apr 07
Posts: 26
Credit: 12,681
RAC: 0
Message 31753 - Posted: 16 Dec 2007, 18:31:14 UTC - in response to Message 31752.  

I have installed bbc-backup.exe and it works perfectly.
The reason why the 1600+ has 4 models in the list is that I had some trouble getting BOINC to run. It only has one model running now.
I will reset the bad model and start a HADCM.
Now I know that I have to look after my models.
ID: 31753 · Report as offensive
Profile mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31754 - Posted: 16 Dec 2007, 19:13:42 UTC

Congratulations on getting the backup program.

Other BOINC projects have relatively or very short tasks, so if one of them crashes, usually because of some temporary computer problem, it doesn't matter very much: the computer hasn't wasted a lot of time. But these climate models are gigantic in comparison, so if you lose a well-advanced model it's very disappointing. On the other hand, even partial model results are used for the research statistics and the computer never loses its credits. But it's more fun to complete models. Some members say it's like caring for a baby...
Cpdn news
ID: 31754 · Report as offensive
old_user442168

Joined: 12 Apr 07
Posts: 26
Credit: 12,681
RAC: 0
Message 31755 - Posted: 16 Dec 2007, 19:25:26 UTC - in response to Message 31754.  

Good to know that my models aren't completely useless (hopefully).
ID: 31755 · Report as offensive
RichardRodd

Joined: 16 Mar 06
Posts: 28
Credit: 3,219,100
RAC: 0
Message 31767 - Posted: 18 Dec 2007, 10:30:17 UTC - in response to Message 30790.  

I have TWO Blue Planets as follows...!!

BLUE PLANET 1
Model/ResultID webpage is at http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6913635

Current timestep from globe graphic = 235777 of 259248.

The s/TS value is 1.88.

Whether the temperature display of the globe graphic is blue. YES.

CPU is Intel Core 2 Duo T5600 1.83 GHz.

Whether you are overclocking. No

BLUE PLANET 2
Model/ResultID webpage is at http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6983460

Current timestep from globe graphic = 137717 of 259248.

The s/TS value is 22.71.

Whether the temperature display of the globe graphic is blue. YES.

=====
Interesting to note the same number of timesteps, but Planet 1 is a 'big' model suggesting approx 1900 total hours to run, while the other 'small' one comes in at 400 hours.

Looking forward to hearing from you...!
ID: 31767 · Report as offensive
Profile MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 31770 - Posted: 18 Dec 2007, 11:48:26 UTC
Last modified: 18 Dec 2007, 11:52:02 UTC


One of them appears to be running OK, and the other is going slow and getting slower. I'd abort the 6913635 (hadsm3fub_0040_005908091_1) model, but keep the other running. I think you have the s/TS times swapped over.

They're both HadSM3 models which should be taking around 400 hours to run.

The slow one is still on phase 1, whereas the quicker one is on phase 3 (note that the timesteps reset to 1 at the start of each phase).
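
The runtime figures in this exchange can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a HadSM3 run has three phases of 259248 timesteps each (the thread mentions phases 1 and 3 and says the counter resets at each phase, but does not state the total), and the function name is made up for this example.

```python
def hours_for(timesteps, seconds_per_timestep):
    """Convert a timestep count and an s/TS value into hours."""
    return timesteps * seconds_per_timestep / 3600


TIMESTEPS_PER_PHASE = 259248  # total shown on the globe graphic
ASSUMED_PHASES = 3            # assumption, not stated in the thread

# Healthy model: 1.88 s/TS over the whole (assumed) three-phase run.
healthy = hours_for(TIMESTEPS_PER_PHASE * ASSUMED_PHASES, 1.88)

# Slow model: 22.71 s/TS, one phase only.
slow_phase = hours_for(TIMESTEPS_PER_PHASE, 22.71)

print(f"healthy model, full run: about {healthy:.0f} hours")
print(f"slow model, one phase:   about {slow_phase:.0f} hours")
```

Under those assumptions the healthy model comes out close to the 400 hours quoted above, while the slow one would need well over 1,600 hours for a single phase.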


I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 31770 · Report as offensive
RichardRodd

Joined: 16 Mar 06
Posts: 28
Credit: 3,219,100
RAC: 0
Message 31771 - Posted: 18 Dec 2007, 12:29:57 UTC - in response to Message 31770.  

Yes, you're quite right - and I tried to transfer the data sooo carefully...!

So the 'big' one (which incidentally always said it was going to take 1900 hours) was the one on 137717 with a s/TS of 22.71...

And the 'little' one (400 hours) was on 235777 with a s/TS of 1.88.

But both definitely have the same total number of TS = 259248... which I guess is what you'd expect if they're actually the same size pkg, despite differing estimated runtimes.

I might just hold fire a short while and see if anyone else has a different insight / recommendation...! Failing that, I'll kill 8091_1.

And many thanks for your help.
ID: 31771 · Report as offensive
old_user21706

Joined: 28 Sep 04
Posts: 15
Credit: 167,093
RAC: 0
Message 31848 - Posted: 23 Dec 2007, 22:15:54 UTC

Unrecoverable error

My model had progressed until 58.1% when I got this message:

Unrecoverable error for result hadcm3_pbb_amvy_05770297_0(<file_xfer_error> <file_name>had..

And BOINC started a new calculation. Since I already had some trouble getting this far with the calculation, I'd like to be sure that this error is indeed unrecoverable before I abort it. I use 2 backups (staggered), and have already retried from the latest backup once, in the following way:
1. Switch off BOINC Manager
2. Delete all files new since the last backup (with some margin)
3. Copy the necessary files from the backup (= only for the last day or so)
4. Restart.
To do this I use the TotalCommander "Overwrite older files" option.

Same result. Should I give this up as a bad job, or perhaps overwrite the whole thing from backup?

P.S. I checked these for more on "unrecoverable error" but didn't find anything relevant to my problem (as far as I can tell).
ID: 31848 · Report as offensive
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 31850 - Posted: 23 Dec 2007, 23:07:15 UTC


The usual advice is to back up the ENTIRE BOINC folder, and to restore the ENTIRE BOINC folder from the backup when needed.

Anything else, using partial restores, may lead to unknown results, and is at the discretion of users, with no guarantees.
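
For anyone who prefers to script this, a whole-folder backup might look like the sketch below. It is only an illustration, not the method from the project README: the function name and the timestamped folder naming are invented here, the paths are whatever your own installation uses, and BOINC should be shut down completely before the copy is made.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_boinc_folder(boinc_dir, backup_root):
    """Copy the ENTIRE BOINC folder into a new timestamped folder.

    Hypothetical helper: both arguments are paths you supply yourself.
    Each call creates a fresh copy, so older backups are never overwritten.
    """
    src = Path(boinc_dir)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"BOINC-{stamp}"
    # copytree copies every file and subfolder; it fails if dest already exists.
    shutil.copytree(src, dest)
    return dest
```

Restoring is the same copy in the other direction, again with BOINC stopped, and again with the whole folder rather than a selection of files.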

I notice that you don't seem to be returning trickles.
Have you got BOINC set to Network suspended? If not, you may have firewall problems.

You may find it useful to merge the many virtual appearances of your physical computers; it will make it easier to follow the progress of models.


Backups: Here
ID: 31850 · Report as offensive
Profile mo.v
Volunteer moderator
Joined: 29 Sep 04
Posts: 2363
Credit: 14,611,758
RAC: 0
Message 31852 - Posted: 23 Dec 2007, 23:26:30 UTC

ID: 31852 · Report as offensive
Profile MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 31856 - Posted: 24 Dec 2007, 0:41:52 UTC - in response to Message 31848.  
Last modified: 24 Dec 2007, 1:29:16 UTC

...
P.S. I checked these for more on "unrecoverable error" but didn't find anything relevant to my problem (as far as I can tell).


Generally speaking, once the error has been reported to the server, you'll see both an exit status and an error text entry.

Near the end of the 'error text', look for the following things:


  • Exit status 0 = Recoverable (system shutdown / killed in task manager?)
  • Exit status 0 / Error text 'MASS-WEIGHTED QT SUMMED OVER A LEVEL WAS NEGATIVE.' = ??? No idea
  • Exit status 0 / Error text 'Post Processing failed' = ??? Could be permissions problem. Might not actually be a problem at all, in some cases it's a misleading message.
  • Exit status -1 = Recoverable (system shutdown / graphics problem ?)
  • Exit status 2 = Update manager from V4 to V5
  • Exit status -6 = (Mac) Recoverable, increase shared memory segments or update Boinc manager to a recent version
  • Exit status 21 / Error text 'Post Processing failed' = ??? Could be permissions problem
  • Exit status 22 / Error text 'NEGATIVE PRESSURE VALUE DETECTED' = Unrecoverable
  • Exit status 22 / Error text 'NEGATIVE THETA DETECTED' = Unrecoverable
  • Exit status 22 / Error text 'Missing data in ocean UV fields' = Usually Unrecoverable
  • Exit status 22 / Error text 'Model crash detected' = Not sure but probably recoverable
  • Exit status 22 / Error text 'Access denied' = Not sure but possibly caused by two copies of the same model running, or file permission problems. Should be recoverable.
  • Exit status 22 / Error text 'READ_FLH: I/O error' = Probably recoverable, possibly due to file permissions or exclusive file lock
  • Exit status 22 / Other things in error text = Sometimes recoverable
  • Exit status 99 = Should not be recovered (killer trickle)
  • Exit status 128 = Recoverable (Update DirectX to 9.0c or later)
  • Exit status -161 = Sometimes recoverable (look at error text)
  • Exit status -197 = Recoverable (Aborted by user)
  • Exit status -1073741502/0xC0000142 = Recoverable (Couldn't load DLL file, often during shutdown or when PC is out of resources)
  • Exit status -1073741510/0xC000013A = Recoverable (Vista shutdown)
  • Exit status -1073741819/0xC0000005 = Recoverable (Windows stop / fortran exception)
  • Exit status -1073807364/0x40010004 = Recoverable (graphics error)


(note that my memory is unreliable so some exit statuses might not have the indicated sign)

Once the error has been reported to the server, the message will always stay against that model, but it can be ignored. The servers will still accept the uploads from the restored model despite the error status. Right at the end of the run, Boinc will show a 'result refused' message. This can be ignored.

If the error has NOT been reported to the server, you may be able to find the above text in a log file in one of the 'slot' directories.



  • In many cases, no useful information will appear so you have no way of telling if it is recoverable or not
  • By 'Unrecoverable' I mean less than 10% of restores will work
  • By 'Sometimes recoverable' I mean since we can't tell what caused the crash it may or may not be recoverable
  • By 'Recoverable' I mean that the majority of restores should work
  • By 'Should not be recovered' I mean that even if a restore may work, the project wants to cancel that model, so don't restore it!
  • By 'Probably recoverable', I mean I'm just guessing, but it sounds like a recoverable error...
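
The list above is really a lookup table keyed on exit status plus, for some statuses, a phrase in the error text. A minimal sketch of that lookup follows. It transcribes only part of the list, the verdicts are the informal ones given above (with the caveat about possibly wrong signs), and the names `RECOVERABILITY` and `classify` are invented for this example.

```python
# Partial transcription of the list above. A key is (exit_status, phrase);
# phrase None is the fallback for that exit status.
RECOVERABILITY = {
    (0, "MASS-WEIGHTED QT SUMMED OVER A LEVEL WAS NEGATIVE"): "Unknown",
    (0, "Post Processing failed"): "Unknown, possibly a permissions problem",
    (0, None): "Recoverable",
    (-1, None): "Recoverable",
    (22, "NEGATIVE PRESSURE VALUE DETECTED"): "Unrecoverable",
    (22, "NEGATIVE THETA DETECTED"): "Unrecoverable",
    (22, "Missing data in ocean UV fields"): "Usually unrecoverable",
    (22, "Model crash detected"): "Probably recoverable",
    (22, "Access denied"): "Probably recoverable",
    (22, "READ_FLH: I/O error"): "Probably recoverable",
    (22, None): "Sometimes recoverable",
    (99, None): "Should not be recovered",
    (-161, None): "Sometimes recoverable",
    (-197, None): "Recoverable",
}


def classify(exit_status, error_text=""):
    """Return the informal verdict for an exit status and error text."""
    # Entries that name a phrase take priority over the fallback.
    for (status, phrase), verdict in RECOVERABILITY.items():
        if status == exit_status and phrase is not None and phrase in error_text:
            return verdict
    return RECOVERABILITY.get((exit_status, None), "Unknown")


print(classify(22, "... NEGATIVE THETA DETECTED ..."))
print(classify(22, "something else entirely"))
print(classify(99))
```

Checking the phrase-specific entries first matters for status 22, where the same exit status covers everything from unrecoverable to probably recoverable.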



I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 31856 · Report as offensive
old_user21706

Joined: 28 Sep 04
Posts: 15
Credit: 167,093
RAC: 0
Message 31876 - Posted: 24 Dec 2007, 22:13:05 UTC - in response to Message 31850.  


The usual advice is to back up the ENTIRE BOINC folder, and to restore the ENTIRE BOINC folder from the backup when needed.

Anything else, using partial restores, may lead to unknown results, and is at the discretion of users, with no guarantees.

I notice that you don't seem to be returning trickles.
Have you got BOINC set to Network suspended? If not, you may have firewall problems.

You may find it useful to merge the many virtual appearances of your physical computers; it will make it easier to follow the progress of models.


Les,
1. I will try recovering the works then. However, it seems strange to me that re-copying all these old, unchanged files (many GB) could make any difference.
2. I'll check this out.
3. I wouldn't mind doing that, but don't know how. The reference you provide ("here") is very general (thousands of posts). Can you be a little more specific?

P.S. Thanks to the other people who answered my post, too.
ID: 31876 · Report as offensive
Profile MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 31877 - Posted: 24 Dec 2007, 22:41:46 UTC


I'm not sure what 3) is referring to, but if it's instructions on backing up in Les's signature, the 'here' is a link (bold blue text). Similarly the links in my signature are worth reading.
I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 31877 · Report as offensive
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 31878 - Posted: 24 Dec 2007, 22:42:38 UTC
Last modified: 24 Dec 2007, 22:46:03 UTC

Hi Dick

1) Gigs?!!!
You should only have a bit over one gig when running 2 models, and only then just before they create a zip to upload.
If you still have all of your old, failed models, you can delete the folders.
(Write down the names of running models so you don't make a Whoops! by deleting them)

3) The 2 "Here's" are part of my sig, and are just links to notes for those that need them.

If the intent of 3) is about merging, just look at the bottom of the page for each "real" computer, where it says merge this computer, and then follow the notes. You'll have to do this for each "real" computer.

I'll post this without my sig so that you can see the difference.

edit
Spent too much time thinking about the wording again, but now you have 2 opinions.

ID: 31878 · Report as offensive
old_user21706

Joined: 28 Sep 04
Posts: 15
Credit: 167,093
RAC: 0
Message 31881 - Posted: 25 Dec 2007, 9:31:39 UTC - in response to Message 31878.  
Last modified: 25 Dec 2007, 10:02:05 UTC

Les,

OK, that clears up a few things.

1) Amount of data: I retired just about everything (older than 2007) under "projects"; that should probably make a difference.

2) Merging. With some help from the BOINC-wiki I found out what you meant by "the page" and did a clean up.

3) How this conversation started: here's the unrecoverable error, no. 161, which was reliably reproduced even after a full recopy & restart. Hope this means something to you:

25/12/2007 10:13:55|climateprediction.net|Reason: Unrecoverable error for result hadcm3pbb_amvy_05770297_0 (<file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_10.zip</file_name> <error_code>-161</error_code></file_xfer_error><file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_11.zip</file_name> <error_code>-161</error_code></file_xfer_error><file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_12.zip</file_name> <error_code>-161</error_code></file_xfer_error><file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_13.zip</file_name> <error_code>-161</error_code></file_xfer_error><file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_14.zip</file_name> <error_code>-161</error_code></file_xfer_error><file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_15.zip</file_name> <error_code>-161</error_code></file_xfer_error><file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_16.zip</file_name> <error_code>-161</error_code></file_xfer_error>)

If this model can still be salvaged please let me know.
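
A message like the one above is easier to read once the file names and error codes are pulled out of it. The following is a rough sketch using a regular expression on the pasted text; it assumes the `<file_name>` / `<error_code>` layout shown in that message, and `parse_xfer_errors` is a name made up for this example, not a BOINC function.

```python
import re

# Matches one <file_name>...</file_name> <error_code>...</error_code> pair.
PATTERN = re.compile(
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>"
)


def parse_xfer_errors(message):
    """Return a list of (file_name, error_code) pairs found in the message."""
    return [(m.group("name"), int(m.group("code")))
            for m in PATTERN.finditer(message)]


# First two entries of the message quoted above.
sample = (
    "<file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_10.zip</file_name> "
    "<error_code>-161</error_code></file_xfer_error>"
    "<file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_11.zip</file_name> "
    "<error_code>-161</error_code></file_xfer_error>"
)

errors = parse_xfer_errors(sample)
for name, code in errors:
    print(name, code)
```

Run over the full message, this would list the zip files numbered 10 to 16, each with error code -161.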
ID: 31881 · Report as offensive
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 31882 - Posted: 25 Dec 2007, 10:18:34 UTC
Last modified: 25 Dec 2007, 10:22:01 UTC

1) Yes.

2) Go to your Account page.
(You can do this by:
A) Clicking on your name to the left of here,
B) By clicking on Your account in the blue menu on the left side of this page,
C) From your manager by going to the Projects tab, clicking the project's name to select it, and then looking in the left hand column near the bottom.)

When there, find the line which says: Computers on this account (near the bottom), and click the word View to the right of this.

Pick one of the computers listed, and click Computer ID
Right at the bottom it says: merge this computer
Click on this phrase.

The page you finally end up on should have a list of computers that the server thinks matches the computer with the Computer ID that you selected earlier.
It also has 2 lines:
Select all
Unselect all

Use the first to select the lot. (I think you can also do this individually, as well as de-select individually.)

Then click the button at the bottom of the page.
THIS ACTION IS IRREVERSIBLE, so make sure that the selected computers really ARE "all the same computer".

3) The error code 161 just means that the file that BOINC wants to upload doesn't exist. The REAL error code gets masked in earlier versions.
Later versions of the program have more extensive error reporting.

As to salvaging the model, the usual applies:
Only if you have a backup of the ENTIRE BOINC folder made before the problem occurred.

Edit
OK. I see that you've updated your post since I started mine, so just ignore anything that's not relevant now.


Backups: Here
ID: 31882 · Report as offensive
old_user21706

Joined: 28 Sep 04
Posts: 15
Credit: 167,093
RAC: 0
Message 31884 - Posted: 25 Dec 2007, 22:39:31 UTC - in response to Message 31882.  
Last modified: 25 Dec 2007, 22:41:50 UTC

Hi Les,

OK, I'll try running this model once more from my oldest backup (by the way, I do keep full backups, I just keep the amount of copying limited by using the "Overwrite older only" option in TotalCommander).

Is it possible to use a newer version of BOINC to find out what the error (-161) really is? If so, how do I combine the old model with new BOINC?
ID: 31884 · Report as offensive
Profile MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 31885 - Posted: 25 Dec 2007, 23:07:47 UTC
Last modified: 25 Dec 2007, 23:09:03 UTC

Just install the new version of Boinc over the top of the existing installation (when the model has been restored and is still running). It'll pick up whatever models were within the original Boinc installation.

I usually just let the crashed model report to the servers. You get the usual error messages on the website, and these stick forever, but restored models will continue to upload data to the servers OK. Alternatively you could try to find whatever log file the error messages went into.
I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 31885 · Report as offensive
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 31886 - Posted: 26 Dec 2007, 0:01:30 UTC
Last modified: 26 Dec 2007, 0:07:06 UTC

I'll reword / rephrase 3) to try and remove ambiguity:

3) The error code 161 just means that the file that BOINC wants to upload doesn't exist. The REAL error code gets masked in earlier versions of the climate program, which all start with HAD. (Which stands for the HADley Centre, the UK Met Office's computer centre.)
Later versions of the climate program have more extensive error reporting.

PS
Changing BOINC versions won't get you more/better climate model error reporting.
It WILL give you different (better?) messages related to its own work, which is just uploads/downloads/work-fetch/work-run-coordinating.

ID: 31886 · Report as offensive

©2024 cpdn.org