UK Met Office HADAM3P (global only) with MOSES II landsurface scheme v7.03

MartinNZ

Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 48905 - Posted: 26 Apr 2014, 5:49:13 UTC - in response to Message 48903.  

Oops, I didn't mean to start an OS war; I'm not one to side with one against another, as they all have merits. 'Whatever floats your boat', as a mate of mine would say.

I also wasn't worried about more work coming (or not coming) the Windows way; I was just curious whether there was something that made running on the Linux/Darwin platforms make more sense.

All's now clear, thanks Les.

Mart
ID: 48905
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 48906 - Posted: 26 Apr 2014, 8:59:41 UTC - in response to Message 48895.  
Last modified: 26 Apr 2014, 9:27:34 UTC

"I notice Les's comment that it's unlikely a Windows version of this model will be developed, and must admit I thought BOINC was BOINC and didn't realise that different model versions had to be developed for each OS.

"Given that this model is the largest release by far for some time (I think), I was wondering about the rationale behind the decision. To get an idea of the split I looked at the top 200 hosts: we have 27 Linux, 22 Darwin and 151 Windows boxes of varying shades.

"Assuming this split is fairly representative, it seems odd to limit the processing to about 25% of available units?"


Given the hurry to get this model out and crunching, it makes sense that the easier conversion, the one most similar to the supercomputer version, is out there first.
Or, to put it another way: we minority Linux and OS-whatever users are the beta-2 testers. That's how it seems to me.

Whatever compiler the developers are using now, it's probably easier to get a Linux/Darwin version tested and out there for us to crunch.

What this means for me: I'm mostly on Linux, and seeing the enormous MOSES backlog on the server-status page, I've suspended all other projects.

As for OS wars: bugger all that.

Edit: the backlog of 60,000 MOSES models is shrinking very, very slowly.
ID: 48906
tullio

Joined: 6 Aug 04
Posts: 264
Credit: 965,476
RAC: 0
Message 48909 - Posted: 26 Apr 2014, 14:07:17 UTC
Last modified: 26 Apr 2014, 14:09:25 UTC

Most Test4Theory@home users are Windows users, and they run CERN Linux jobs inside Virtual Machines without ever suspecting it. All you have to do is download VirtualBox and its Extension Pack and connect your BOINC client to Test4Theory@home; the rest is automatic. I run T4T on an HP laptop with SuSE 12.3 and BOINC 6.10.58, and on a SUN workstation with SuSE 13.1 and BOINC 7.2.41. The latter also hosts an Ubuntu 12.04 Virtual Machine with BOINC 7.2.42. All work well.
Tullio
ID: 48909
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 48967 - Posted: 29 Apr 2014, 16:12:16 UTC - in response to Message 48853.  

[ . . . ]
"A problem in beta tests with 7.02 was stop/restart: monthly zips were never built when stopping/restarting. Try that."

I did several suspend/resume cycles on some of these models (while swapping some disks and data around, and doing backups).
Every suspend/resume led to the next zip upload being skipped, but the zip files after the skipped one continued to upload.
I'll have to change my backup policy here; guaranteed loss of an upload is not what backups are for.
ID: 48967
rebirther

Joined: 26 Aug 04
Posts: 17
Credit: 367,996
RAC: 0
Message 48992 - Posted: 1 May 2014, 6:06:40 UTC
Last modified: 1 May 2014, 6:07:04 UTC

The model ran longer than expected. The maximum number of timesteps should be 308,228, but the model ran past 310,000 timesteps and then ended with a computation error. What's the problem?

The progress bar also needs to be fixed.
ID: 48992
Eirik Redd

Send message
Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 48993 - Posted: 1 May 2014, 10:20:03 UTC - in response to Message 48992.  

From what I can figure, the MOSES models fail whenever they have been stopped and restarted.
That applies whether it's me stopping models to shut down for a backup, or typical users who stop their machines even once between the start of the model and the (very unlikely) final upload.

Have there been any successful uploads of a completed MOSES model? I don't think so.

My experience is that any MOSES model that ever gets stopped and restarted will eventually crash and never send its huge final upload (which has never happened here).

I have a few (2 or 3) out of … or so MOSES models running that have never been restarted, and those seem OK. But if these MOSES models ever need a restart:
SPLOTTO. Still, in a few days an uninterrupted MOSES or two might complete, if the local power authorities and working CPUs allow.


These MOSES models cannot recover from any stop/restart. Any stop, any restart, and the model will fail, guaranteed.

I've seen some models that only notice an earlier upload failure at the end of the job: a few intermediate uploads fail without the model failing, and then, at the end, there are missing files.

Sorry for not being clearer.
Sorry that these MOSES things need a clean run with no interruptions whatsoever.
That might happen on dedicated supercomputers. I do my best, but two weeks uninterrupted is something that might happen at a supercomputer center.
If you want to finish one, you've got to commit to running it for the whole run, roughly 200 hours, with no interruptions.

(They always fail after any stop/restart. I've got the logs, just ask.)

(signed) beta-2 tester eek.

PS

If you Linux and Darwin users can, please arrange an uninterrupted run of a MOSES model (a week or two) to see whether the model can complete at all.
ID: 48993
MyLittleBoinc

Joined: 31 Mar 13
Posts: 44
Credit: 6,950,896
RAC: 0
Message 48996 - Posted: 1 May 2014, 11:01:58 UTC

"These MOSES cannot recover from any stop-restart - any stop, any restart - "

That's good to know. I have now shut off all updates on my two little Linux computers. They run on battery-backed 12 V DC, so they are hopefully safe from power glitches. I was going to rearrange my computer nook, but that will now have to wait three or four weeks.
ID: 48996
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,376,846
RAC: 3,590
Message 48999 - Posted: 1 May 2014, 11:51:53 UTC

Eirik, does that include hibernating the computer, or has that not been tried? If it does, I will exclude the models from my box, as there seems little point in running them just to guarantee failure.
ID: 48999
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 49000 - Posted: 1 May 2014, 12:02:42 UTC - in response to Message 48999.  

"Eirik, does that include hibernating the computer, or has that not been tried? If it does, I will exclude the models from my box, as there seems little point in running them just to guarantee failure."


I haven't tried the hibernate thing; my machines are all desktops and servers, so I don't know.

I can say for sure that any MOSES that I've suspended or restarted for any reason has eventually failed.


Like so:

20-Apr-2014 10:26:44 [climateprediction.net] Started download of hadam3pm2_e96q_1991_10_008714949.zip
20-Apr-2014 10:26:48 [climateprediction.net] Finished download of hadam3pm2_e96q_1991_10_008714949.zip
20-Apr-2014 15:30:43 [climateprediction.net] task hadam3pm2_e96q_1991_10_008714949_2 suspended by user
21-Apr-2014 03:19:09 [climateprediction.net] task hadam3pm2_e96q_1991_10_008714949_2 resumed by user
21-Apr-2014 03:19:10 [climateprediction.net] Starting task hadam3pm2_e96q_1991_10_008714949_2
22-Apr-2014 01:12:20 [climateprediction.net] Started upload of hadam3pm2_e96q_1991_10_008714949_2_1.zip
22-Apr-2014 01:29:45 [climateprediction.net] Finished upload of hadam3pm2_e96q_1991_10_008714949_2_1.zip
22-Apr-2014 22:50:21 [climateprediction.net] Started upload of hadam3pm2_e96q_1991_10_008714949_2_2.zip
22-Apr-2014 23:07:27 [climateprediction.net] Finished upload of hadam3pm2_e96q_1991_10_008714949_2_2.zip
23-Apr-2014 20:36:19 [climateprediction.net] Started upload of hadam3pm2_e96q_1991_10_008714949_2_3.zip
23-Apr-2014 20:53:30 [climateprediction.net] Finished upload of hadam3pm2_e96q_1991_10_008714949_2_3.zip
24-Apr-2014 18:22:16 [climateprediction.net] Started upload of hadam3pm2_e96q_1991_10_008714949_2_4.zip
24-Apr-2014 18:42:55 [climateprediction.net] Finished upload of hadam3pm2_e96q_1991_10_008714949_2_4.zip
25-Apr-2014 04:39:19 [climateprediction.net] task hadam3pm2_e96q_1991_10_008714949_2 suspended by user
25-Apr-2014 04:59:24 [climateprediction.net] task hadam3pm2_e96q_1991_10_008714949_2 resumed by user
26-Apr-2014 14:33:29 [climateprediction.net] Started upload of hadam3pm2_e96q_1991_10_008714949_2_6.zip
26-Apr-2014 14:50:39 [climateprediction.net] Finished upload of hadam3pm2_e96q_1991_10_008714949_2_6.zip
27-Apr-2014 13:18:32 [climateprediction.net] Started upload of hadam3pm2_e96q_1991_10_008714949_2_7.zip
27-Apr-2014 13:36:15 [climateprediction.net] Finished upload of hadam3pm2_e96q_1991_10_008714949_2_7.zip
28-Apr-2014 12:02:24 [climateprediction.net] Started upload of hadam3pm2_e96q_1991_10_008714949_2_8.zip
28-Apr-2014 12:25:05 [climateprediction.net] Finished upload of hadam3pm2_e96q_1991_10_008714949_2_8.zip
29-Apr-2014 10:51:35 [climateprediction.net] Started upload of hadam3pm2_e96q_1991_10_008714949_2_9.zip
29-Apr-2014 11:10:14 [climateprediction.net] Finished upload of hadam3pm2_e96q_1991_10_008714949_2_9.zip
29-Apr-2014 13:01:02 [climateprediction.net] Computation for task hadam3pm2_e96q_1991_10_008714949_2 finished
29-Apr-2014 13:01:02 [climateprediction.net] Output file hadam3pm2_e96q_1991_10_008714949_2_5.zip for task hadam3pm2_e96q_1991_10_008714949_2 absent
29-Apr-2014 13:01:02 [climateprediction.net] Output file hadam3pm2_e96q_1991_10_008714949_2_10.zip for task
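
The log shows the pattern Eirik describes: uploads _1 to _4 and _6 to _9 complete, but _5, the one that would have fallen between the 25-Apr suspend/resume and _6, is never uploaded, and the task ends with that file absent. As a minimal sketch, assuming the `Finished upload of <task>_<n>.zip` wording seen in this log, a saved event log could be scanned for such gaps; the helper name `missing_uploads` is hypothetical. It only reports indices below the highest upload seen, so a missing final upload is not flagged.

```python
from __future__ import annotations

import re

# Matches event-log lines such as
# "... Finished upload of hadam3pm2_e96q_1991_10_008714949_2_6.zip".
# The greedy (\S+) backtracks to the LAST underscore, so group 1 is the
# task name and group 2 is the upload index.
FINISHED_UPLOAD = re.compile(r"Finished upload of (\S+)_(\d+)\.zip")


def missing_uploads(log_text: str) -> dict[str, list[int]]:
    """For each task, list upload indices absent between 1 and the highest index seen."""
    seen: dict[str, set[int]] = {}
    for match in FINISHED_UPLOAD.finditer(log_text):
        task, index = match.group(1), int(match.group(2))
        seen.setdefault(task, set()).add(index)
    return {
        task: [i for i in range(1, max(indices) + 1) if i not in indices]
        for task, indices in seen.items()
    }


# Lines built from the upload indices in the log above (1-4 and 6-9),
# with timestamps omitted.
SAMPLE_LOG = "\n".join(
    f"[climateprediction.net] Finished upload of hadam3pm2_e96q_1991_10_008714949_2_{n}.zip"
    for n in (1, 2, 3, 4, 6, 7, 8, 9)
)

if __name__ == "__main__":
    print(missing_uploads(SAMPLE_LOG))
    # prints {'hadam3pm2_e96q_1991_10_008714949_2': [5]}
```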


ID: 49000
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 49001 - Posted: 1 May 2014, 12:11:02 UTC

Well, I'm never going to download another MOSES unless:
I've got at least 5GB available per model, and
I expect never to have to interrupt the model run for any reason. It looks like any interruption will eventually waste the whole model.
I think I can responsibly take a few more MOSES models, but I have to commit to at least two weeks of guaranteed no stop-start. At all, ever.

Need to order a battery for the UPS.



ID: 49001
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1079
Credit: 6,902,393
RAC: 6,787
Message 49002 - Posted: 1 May 2014, 12:15:00 UTC - in response to Message 49001.  

"Well, I'm never going to download another MOSES unless:
I've got at least 5GB available per model, and
I expect never to have to interrupt the model run for any reason. It looks like any interruption will eventually waste the whole model.
I think I can responsibly take a few more MOSES models, but I have to commit to at least two weeks of guaranteed no stop-start. At all, ever.

"Need to order a battery for the UPS."



It might be a kindness, Eirik, if you were to do that. The beta site has vanished, so I can't check, but my recollection is that the MOSES II model I ran on a Mac finished with an error despite running uninterrupted. It would be nice to know whether any MOSES II model on any platform has completed successfully.
ID: 49002
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 49003 - Posted: 1 May 2014, 12:22:25 UTC - in response to Message 49002.  

Suspending the model doesn't result in a crash at the end; however, suspending the model when "leave tasks in memory when suspended" is unchecked will. Anything that removes it from memory will result in a missing yearly upload and an error status at the end because of a missing upload file.
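
To check this setting outside the Manager: as far as I know, when preferences are set locally in BOINC Manager, the "leave ... in memory" checkbox (its exact label varies between BOINC versions) is saved as the `<leave_apps_in_memory>` tag in `global_prefs_override.xml` in the BOINC data directory. The minimal sketch below reads that tag; the function name is hypothetical, and if the tag is absent BOINC falls back to the web preferences, which this sketch does not read. Going by the observation above, the setting only covers suspends inside a running client: stopping BOINC or rebooting still removes the task from memory.

```python
from typing import Optional
import xml.etree.ElementTree as ET


def leave_apps_in_memory(prefs_xml: str) -> Optional[bool]:
    """Return the local leave_apps_in_memory override, or None if it isn't set."""
    root = ET.fromstring(prefs_xml)
    node = root.find("leave_apps_in_memory")
    if node is None or node.text is None:
        # Not overridden locally: BOINC would use the web preferences,
        # which this sketch does not consult.
        return None
    # Assumption: the value is stored as a number (0 or 1); treat non-zero as "on".
    return float(node.text) != 0.0


EXAMPLE_OVERRIDE = """<global_preferences>
   <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>"""

if __name__ == "__main__":
    # In practice, read global_prefs_override.xml from the BOINC data directory.
    print(leave_apps_in_memory(EXAMPLE_OVERRIDE))  # prints True
```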
ID: 49003
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 49004 - Posted: 1 May 2014, 12:23:43 UTC - in response to Message 49001.  

Oh, and PS

I don't know what the deal is with the MOSES models.
I couldn't support the beta these last few years, sorry.
Yes, we Linux and Mac users are doing a "beta-2" on these fragile and not-very-well-tested MOSES models; they're not ready for prime time.

So, to all you Windows lovers:
did you want to contribute testing to this difficult release?
I hope it gets better when it's released again.

Love you all for contributing time.

Keep on crunching,

and pray for better help for the MOSES project.

ID: 49004
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 49005 - Posted: 1 May 2014, 12:28:54 UTC - in response to Message 49003.  

"Suspending the model doesn't result in a crash at the end; however, suspending the model when 'leave tasks in memory when suspended' is unchecked will. Anything that removes it from memory will result in a missing yearly upload and an error status at the end because of a missing upload file."


Naah, I've never unchecked "leave tasks in memory when suspended"; it's been checked for the last 7 years.

Let me get this right: supposedly, if I check "leave tasks in memory when suspended", there will be no problem?

That's what I've been doing for the last few years, and no, I've still got the problem where any suspend leads to a lost upload and eventual model failure.
ID: 49005
Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1079
Credit: 6,902,393
RAC: 6,787
Message 49006 - Posted: 1 May 2014, 12:29:08 UTC

Mmmm. Bank holiday weekend coming up: I feel some PHP coming on ...
ID: 49006
Eirik Redd

Joined: 31 Aug 04
Posts: 391
Credit: 219,888,554
RAC: 1,481,373
Message 49007 - Posted: 1 May 2014, 12:33:28 UTC - in response to Message 49006.  

"Mmmm. Bank holiday weekend coming up: I feel some PHP coming on ..."


Oi, Oi.

Time will tell. Me, I trust y'all.
Take care everybody, and keep on crunching.


ID: 49007
geophi
Volunteer moderator

Joined: 7 Aug 04
Posts: 2167
Credit: 64,403,322
RAC: 5,085
Message 49008 - Posted: 1 May 2014, 12:37:49 UTC - in response to Message 49005.  

"'Suspending the model doesn't result in a crash at the end; however, suspending the model when "leave tasks in memory when suspended" is unchecked will. Anything that removes it from memory will result in a missing yearly upload and an error status at the end because of a missing upload file.'

"Naah, I've never unchecked 'leave tasks in memory when suspended'; it's been checked for the last 7 years.

"Let me get this right: supposedly, if I check 'leave tasks in memory when suspended', there will be no problem?

"That's what I've been doing for the last few years, and no, I've still got the problem where any suspend leads to a lost upload and eventual model failure."


I guess I'm just speaking from my own experience, then. This model of mine completed successfully despite a suspend due to benchmarking:

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=16486346

However, this task on the same PC did not, after I purposely unchecked "leave this task in memory when suspended" and then ran a benchmark. Of course, stopping and restarting BOINC will do it too.

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=16548922

My supposition is that anything that removes the task from memory will cause the missing upload file.
ID: 49008
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,376,846
RAC: 3,590
Message 49009 - Posted: 1 May 2014, 13:30:11 UTC

I will give them a try, then, once my current ANZ models have finished.
ID: 49009
DadX

Joined: 30 Aug 06
Posts: 27
Credit: 1,480,698
RAC: 694
Message 49013 - Posted: 1 May 2014, 16:27:54 UTC

How about running them in a VM (VirtualBox) and saving the machine state when a reboot is required? I can set this up this weekend if nobody has tried it yet.
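
A minimal sketch of this idea, assuming the BOINC client runs inside a VirtualBox guest with the hypothetical name `cpdn-linux`: `VBoxManage controlvm <vm> savestate` freezes the whole guest to disk before a host reboot, and `VBoxManage startvm <vm> --type headless` resumes it from that saved state afterwards. Whether a MOSES task survives this is only what rebirther reports for VMware in the next post; it is untested here. By default the script only prints the commands (`dry_run=True`) rather than running them.

```python
import subprocess
from typing import List

# Hypothetical name of the VirtualBox guest that runs the BOINC client.
VM_NAME = "cpdn-linux"


def save_state_cmd(vm_name: str) -> List[str]:
    # Writes the guest's memory and CPU state to disk and stops the VM,
    # so the model process inside is frozen rather than restarted.
    return ["VBoxManage", "controlvm", vm_name, "savestate"]


def resume_cmd(vm_name: str) -> List[str]:
    # Starting a VM that has a saved state restores it where it stopped.
    return ["VBoxManage", "startvm", vm_name, "--type", "headless"]


def run(cmd: List[str], dry_run: bool = True) -> None:
    if dry_run:
        print(" ".join(cmd))  # show what would be executed
    else:
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    run(save_state_cmd(VM_NAME))  # before rebooting the host
    run(resume_cmd(VM_NAME))      # after the host is back up
```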
ID: 49013
rebirther

Joined: 26 Aug 04
Posts: 17
Credit: 367,996
RAC: 0
Message 49017 - Posted: 2 May 2014, 13:10:44 UTC - in response to Message 49013.  

"How about running them in a VM (VirtualBox) and saving the machine state when a reboot is required? I can set this up this weekend if nobody has tried it yet."


Mine was running in VMware and I saved the state; no problem here.
ID: 49017

©2024 climateprediction.net