Iceworlds & Slowdowns hadsm3/mh - Closed - Discussion
Author | Message |
---|---|
Joined: 12 Apr 07 Posts: 26 Credit: 12,681 RAC: 0 |
The HADCM is not in the list anymore, only the HADSM, and I haven't got any backups. So it doesn't matter whether I click abort or reset. Or is it possible to continue the HADCM without having a backup? |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
The HadCM crashed on 24 October, so if you haven't got a backup you can't continue it. We can only continue models we can see in BOINC Manager or that we can restore from a backup. So that computer will need a new model. In your project preferences you can select either a HADCM or a HADSM for it. The computer has enough RAM for both types of model. Look at the project README about backups; get to it through my signature. The first method by Les is very easy and quick to do regularly. Then, if another model crashes, you can restore the backup and continue the same model. About your second computer, the Athlon 1600: it has XP service pack 1, so if you can upgrade to service pack 2 that would be a good idea. This computer only has 256 MB RAM, so make sure it only runs HADSM models, which is what it has at the moment, because it hasn't got enough RAM for HADCMs. If the computer works well, you could consider buying an extra 256 MB RAM module and adding it to the spare slot on the motherboard. This isn't very expensive and would probably improve the computer's general performance. The second computer has 4 HADSM models that don't seem to be running. (? I'm not sure about that.) Don't allow it to get any more models because it can only run one model at a time. In the Projects tab of BOINC Manager, select CPDN and click the No New Tasks button once only. The day the computer needs a new model you can click the button again. Hope that helps. Cpdn news |
Joined: 12 Apr 07 Posts: 26 Credit: 12,681 RAC: 0 |
I have installed bbc-backup.exe and it works perfectly. The reason the 1600+ has 4 models in the list is that I had some trouble getting BOINC to run. It only has one model running now. I will reset the bad model and start a HADCM. Now I know that I have to look after my models. |
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Congratulations on getting the backup program. Other BOINC projects have relatively short or very short tasks, so if one of them crashes, usually because of some temporary computer problem, it doesn't matter very much - the computer hasn't wasted a lot of time. But these climate models are gigantic in comparison, so if you lose a well-advanced model it's very disappointing. On the other hand, even partial model results are used for the research statistics and the computer never loses its credits. But it's more fun to complete models. Some members say it's like caring for a baby... Cpdn news |
Joined: 12 Apr 07 Posts: 26 Credit: 12,681 RAC: 0 |
Good to know that my models aren't completely useless (hopefully). |
Joined: 16 Mar 06 Posts: 28 Credit: 3,219,100 RAC: 0 |
I have TWO Blue Planets as follows...!! BLUE PLANET 1: Model/ResultID webpage is at http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6913635 Current timestep from globe graphic = 235777 of 259248. The s/TS value is 1.88. Whether the temperature display of the globe graphic is blue: YES. CPU is Intel Core 2 Duo T5600 1.83 GHz. Whether you are overclocking: No. BLUE PLANET 2: Model/ResultID webpage is at http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6983460 Current timestep from globe graphic = 137717 of 259248. The s/TS value is 22.71. Whether the temperature display of the globe graphic is blue: YES. ===== Interesting to note the same number of timesteps, but Planet 1 is a 'big' model suggesting approx 1900 total hours to run, while the other 'small' one comes in at 400 hours. Looking forward to hearing from you...! |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
One of them appears to be running OK, and the other is going slow and getting slower. I'd abort the 6913635 (hadsm3fub_0040_005908091_1) model, but keep the other running. I think you have the s/TS times swapped over. They're both HadSM3 models, which should be taking around 400 hours to run. The slow one is still on phase 1, whereas the quicker one is on phase 3 (note that the timesteps reset to 1 at the start of each phase). I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 16 Mar 06 Posts: 28 Credit: 3,219,100 RAC: 0 |
Yes, you're quite right - and I tried to transfer the data sooo carefully...! So the 'big' one (which incidentally always said it was going to take 1900 hours) was the one on 137717 with an s/TS of 22.71... And the 'little' one (400 hours) was on 235777 with an s/TS of 1.88. But both definitely have the same total number of TS = 259248... which I guess is what you'd expect if they're actually the same size pkg, despite differing estimated runtimes. I might just hold fire a short while and see if anyone else has a different insight / recommendation...! Failing that, I'll kill 8091_1. And many thanks for your help. |
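As a rough cross-check of the figures in the posts above, the runtime of one phase is approximately timesteps × s/TS. The sketch below is only illustrative: the function names are made up, it assumes three phases per HadSM3 model (the thread mentions phases 1 and 3 but never states the total), and it assumes s/TS stays constant, which is not true for a model that keeps slowing down.

```python
TIMESTEPS_PER_PHASE = 259248  # total shown on the globe graphic in the posts above
ASSUMED_PHASES = 3            # assumption: the thread never states the phase count


def hours_per_phase(seconds_per_timestep: float,
                    timesteps: int = TIMESTEPS_PER_PHASE) -> float:
    """Estimated wall-clock hours for one phase at a constant s/TS."""
    return seconds_per_timestep * timesteps / 3600.0


def hours_per_model(seconds_per_timestep: float,
                    phases: int = ASSUMED_PHASES) -> float:
    """Estimated hours for the whole model, given the assumed phase count."""
    return hours_per_phase(seconds_per_timestep) * phases


for s_per_ts in (1.88, 22.71):
    print(f"{s_per_ts:5.2f} s/TS -> {hours_per_phase(s_per_ts):7.1f} h per phase, "
          f"{hours_per_model(s_per_ts):7.1f} h per model")
```

At 1.88 s/TS this works out to roughly 135 hours per phase, or about 400 hours for three phases, which is in line with the figure quoted for a healthy HadSM3. At 22.71 s/TS a single phase alone would take over 1,600 hours.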
Joined: 28 Sep 04 Posts: 15 Credit: 167,093 RAC: 0 |
Unrecoverable error. My model had progressed to 58.1% when I got this message: Unrecoverable error for result hadcm3_pbb_amvy_05770297_0(<file_xfer_error> <file_name>had.. And BOINC started a new calculation. Since I already had some trouble getting this far with the calculation, I'd like to be sure that this error is indeed unrecoverable before I abort it. I use 2 backups (staggered), and already retried from the latest backup once, in the following way: 1. Switch off BOINC Manager. 2. Delete all files new since the last backup (with some margin). 3. Copy the necessary files from the backup (= only for the last day or so). 4. Restart. To do this I use the TotalCommander "Overwrite older files" option. Same result. Should I give this up as a bad job, or perhaps overwrite the whole thing from backup? P.S. I checked these for more on "unrecoverable error" but didn't find anything relevant to my problem (as far as I can tell). |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
The usual advice is to back up the ENTIRE BOINC folder, and to restore the ENTIRE BOINC folder from the backup when needed. Anything else, using partial restores, may lead to unknown results, and is at the discretion of users, with no guarantees. I notice that you don't seem to be returning trickles. Have you got BOINC set to Network suspended? If not, you may have firewall problems. You may find it useful to merge the many virtual appearances of your physical computers; it will make it easier to follow the progress of models. Backups: Here |
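For anyone who prefers a script to a file manager, the "entire folder" advice above can be sketched roughly as follows. This is not Les's method from the linked thread, just a minimal illustration: the function names and the backup folder naming are invented here, the paths are whatever your own installation uses, and BOINC must be shut down completely before either function is run.

```python
import shutil
import time
from pathlib import Path


def backup_boinc_folder(boinc_dir: Path, backup_root: Path) -> Path:
    """Copy the WHOLE BOINC folder into a new timestamped folder."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = backup_root / f"boinc-backup-{stamp}"  # hypothetical naming scheme
    shutil.copytree(boinc_dir, target)  # fails if the target already exists
    return target


def restore_boinc_folder(backup_dir: Path, boinc_dir: Path) -> None:
    """Replace the WHOLE BOINC folder with a backup (no partial restore)."""
    if boinc_dir.exists():
        shutil.rmtree(boinc_dir)  # drop everything written since the backup
    shutil.copytree(backup_dir, boinc_dir)
```

Deleting the current folder before copying is deliberate: files created after the backup was taken do not survive, which is the difference from an "overwrite older files" style of restore.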
Joined: 29 Sep 04 Posts: 2363 Credit: 14,611,758 RAC: 0 |
Here's Les's easy backup and restore method: http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=4890 Cpdn news |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
... Generally speaking, once the error has been reported to the server, you'll see both an exit status and an error text entry. If you look near the end of the 'error text', look for the following things:
I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 28 Sep 04 Posts: 15 Credit: 167,093 RAC: 0 |
Les, 1. I will try recovering the works then. However, it seems strange to me that re-copying all these old, unchanged files (many GB) could make any difference. 2. I'll check this out. 3. I wouldn't mind doing that, but don't know how. The reference you provide ("here") is very general (thousands of posts). Can you be a little more specific? P.S. Thanks to the other people who answered my post, too. |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
I'm not sure what 3) is referring to, but if it's instructions on backing up in Les's signature, the 'here' is a link (bold blue text). Similarly, the links in my signature are worth reading. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
Hi Dick. 1) Gigs?!!! You should only have a bit over one gig when running 2 models, and only then just before they create a zip to upload. If you still have all of your old, failed models, you can delete the folders. (Write down the names of running models so you don't make a Whoops! by deleting them.) 3) The 2 "Here's" are part of my sig, and are just links to notes for those that need them. If the intent of 3) is about merging, just look at the bottom of the page for each "real" computer, where it says merge this computer, and then follow the notes. You'll have to do this for each "real" computer. I'll post this without my sig so that you can see the difference. Edit: Spent too much time thinking about the wording again, but now you have 2 opinions. |
Joined: 28 Sep 04 Posts: 15 Credit: 167,093 RAC: 0 |
Les, OK, that clears up a few things. 1) Amount of data: I retired just about everything (older than 2007) under "projects"; that should probably make a difference. 2) Merging: With some help from the BOINC wiki I found out what you meant by "the page" and did a clean-up. 3) How this conversation started: here's the unrecoverable error, no. 161, which was reliably reproduced even after a full recopy & restart. Hope this means something to you: 25/12/2007 10:13:55|climateprediction.net|Reason: Unrecoverable error for result hadcm3pbb_amvy_05770297_0 (<file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_10.zip</file_name> <error_code>-161</error_code></file_xfer_error><file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_11.zip</file_name> <error_code>-161</error_code></file_xfer_error><file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_12.zip</file_name> <error_code>-161</error_code></file_xfer_error><file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_13.zip</file_name> <error_code>-161</error_code></file_xfer_error><file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_14.zip</file_name> <error_code>-161</error_code></file_xfer_error><file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_15.zip</file_name> <error_code>-161</error_code></file_xfer_error><file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_16.zip</file_name> <error_code>-161</error_code></file_xfer_error>) If this model can still be salvaged please let me know. |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
1) Yes. 2) Go to your Account page. (You can do this by: A) Clicking on your name to the left of here, B) Clicking on Your account in the blue menu on the left side of this page, C) From your manager, going to the Projects tab, clicking the project's name to select it, and then looking in the left hand column near the bottom.) When there, find the line which says: Computers on this account (near the bottom), and click the word View to the right of this. Pick one of the computers listed, and click its Computer ID. Right at the bottom it says: merge this computer. Click on this phrase. The page you finally end up on should have a list of computers that the server thinks match the computer with the Computer ID that you selected earlier. It also has 2 lines: Select all and Unselect all. Use the first to select the lot. (I think you can also select and de-select individually.) Then click the button at the bottom of the page. THIS ACTION IS IRREVERSIBLE!, so make sure that the selected computers really ARE "all the same computer". 3) The error code 161 just means that the file that BOINC wants to upload doesn't exist. The REAL error code gets masked in earlier versions. Later versions of the program have more extensive error reporting. As to salvaging the model, the usual applies: Only if you have a backup of the ENTIRE BOINC folder made before the problem occurred. Edit: OK. I see that you've updated your post since I started mine, so just ignore anything that's not relevant now. Backups: Here |
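To see at a glance which upload files a message like the one quoted earlier is complaining about, the `<file_xfer_error>` entries can be pulled out with a small regular expression. This is only a sketch: the pattern is fitted to the single log line posted in this thread, not to any documented BOINC message format, and the function name is made up.

```python
import re

# Pattern fitted to the log line quoted in this thread (an assumption,
# not a documented BOINC format).
XFER_ERROR = re.compile(
    r"<file_xfer_error>\s*<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>\s*</file_xfer_error>"
)


def failed_uploads(message: str) -> list[tuple[str, int]]:
    """Return (file name, error code) pairs found in a BOINC message line."""
    return [(m["name"], int(m["code"])) for m in XFER_ERROR.finditer(message)]


sample = (
    "Reason: Unrecoverable error for result hadcm3pbb_amvy_05770297_0 ("
    "<file_xfer_error> <file_name>hadcm3pbb_amvy_05770297_0_10.zip</file_name> "
    "<error_code>-161</error_code></file_xfer_error><file_xfer_error> "
    "<file_name>hadcm3pbb_amvy_05770297_0_11.zip</file_name> "
    "<error_code>-161</error_code></file_xfer_error>)"
)

for name, code in failed_uploads(sample):
    print(f"{name}: error {code}")
```

Each pair names a zip that BOINC expected to upload. Per the explanation above, -161 only says the file was not there, so the underlying model failure has to be found elsewhere.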
Joined: 28 Sep 04 Posts: 15 Credit: 167,093 RAC: 0 |
Hi Les, OK, I'll try running this model once more from my oldest backup (by the way, I do keep full backups, I just keep the amount of copying limited by using the "Overwrite older only" option in TotalCommander). Is it possible to use a newer version of BOINC to find out what the error (-161) really is? If so, how do I combine the old model with new BOINC? |
Joined: 13 Jan 06 Posts: 1498 Credit: 15,613,038 RAC: 0 |
Just install the new version of BOINC over the top of the existing installation (when the model has been restored and is still running). It'll pick up whatever models were within the original BOINC installation. I usually just let the crashed model report to the servers. You get the usual error messages on the website, and these stick forever, but restored models will continue to upload data to the servers OK. Alternatively you could try to find whatever log file the error messages went into. I'm a volunteer and my views are my own. News and Announcements and FAQ |
Joined: 5 Sep 04 Posts: 7629 Credit: 24,240,330 RAC: 0 |
I'll reword / rephrase 3) to try and remove ambiguity: 3) The error code 161 just means that the file that BOINC wants to upload doesn't exist. The REAL error code gets masked in earlier versions of the climate program, which all start with HAD. (Which stands for the HADley Centre, the UK Met Office's computer centre.) Later versions of the climate program have more extensive error reporting. PS: Changing BOINC versions won't get you more/better climate model error reporting. It WILL give you different (better?) messages related to its own work, which is just uploads/downloads/work-fetch/work-run-coordinating. |
©2024 cpdn.org