climateprediction.net home page
UK Met Office HadCM3 Short

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,577,520
RAC: 2,959
Message 49890 - Posted: 31 Aug 2014, 20:50:50 UTC

This kind of WU should not be stopped!
The WU works fine and will finish as long as it runs without any interruption.
But if you stop it to do backups and then resume, it results in WU errors...

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 49891 - Posted: 1 Sep 2014, 1:05:17 UTC - in response to Message 49890.  

Yes, unfortunately. :(
But I think that's only with Windows, with a service install of BOINC. :)

It's nice to see some short, speedy little models, even if the zips ARE big.



MartinNZ
Joined: 22 Mar 06
Posts: 144
Credit: 24,695,428
RAC: 0
Message 49892 - Posted: 1 Sep 2014, 3:17:20 UTC - in response to Message 49891.  

Well, as I fit the description from Les, I've disconnected from HadCM3 Short, as the BOINC service is stopped every night for backups and then restarted. That said, my success/error rate has been around 50/50 in the shorts, and I caused a good number of those errors myself. For example, I managed to trigger a Windows update when 8 of them were running, and that crashed them all. What a surprise! I always stop BOINC before doing things like that, but I guess we're all allowed our bad days :-( I'm not sure that a huge number were triggered by stopping and restarting the service.

The other approach would be not to stop BOINC for backups. Any thoughts? The BOINC data is on a separate HD which is not backed up at all. The BOINC programs are backed up, as an image is taken of that drive.

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,577,520
RAC: 2,959
Message 49895 - Posted: 1 Sep 2014, 9:24:33 UTC

Don't get me wrong:

1. it's NOT installed as a service
2. I ALWAYS stop BOINC and then check Task Manager for remaining CPU activity and a still-resident (not stopped) boincmgr.exe
3. it's EASY to do several backups a day with an SSD installation, and I can't afford to skip them because of the demands of other projects
4. finally, up to now, E V E R Y WU has failed because of interruption

Have a nice day

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,577,520
RAC: 2,959
Message 49898 - Posted: 1 Sep 2014, 10:33:37 UTC

4. finally, up to now, E V E R Y WU has failed because of interruption

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 49991 - Posted: 4 Sep 2014, 6:37:46 UTC - in response to Message 49898.  

Well, buddy -- if you were running the Linux version -- smiley smiley.

Saving, backing up, and restarting would work OK (this particular issue only). The hadcm3s models continue OK after a restart on Linux.

The weird bit is: download about 50 MB, run a day or two, upload 63 MB twice.

And then it leaves 800 MB sitting in the WU's folder. I have to clear it out myself, and have done so for many dozens of WUs.
The download, upload and remainder sizes seem mathematically strange.

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,577,520
RAC: 2,959
Message 49999 - Posted: 4 Sep 2014, 13:47:32 UTC

Well Eirik,

that's strange, because I don't have that remaining folder over here.

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 50015 - Posted: 4 Sep 2014, 18:02:40 UTC

BOINC 7.2.42 (64bit) on Ubuntu Trusty (64bit). BOINC folder is in BOINC user's home directory with good permissions. When a hadcm3s_ fails, the subfolder BOINC/projects/climateprediction.net/hadcm3s_<task-id> gets removed along with the other task-specific files.
When a hadcm3s succeeds --- that's when the 814 megabyte folder gets left behind. Seen several dozen examples last few weeks.
No idea why.

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,577,520
RAC: 2,959
Message 50017 - Posted: 4 Sep 2014, 18:30:36 UTC

In case of success I don't have that folder either.

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 50019 - Posted: 4 Sep 2014, 19:38:03 UTC

here's what it looks like

cpdn@thistle:~$ du -s BOINC/projects/climateprediction.net/* | sort -g -r | head -n3
814848	BOINC/projects/climateprediction.net/hadcm3s_1bb6_1990_2_008918940
720684	BOINC/projects/climateprediction.net/hadam3p_anz_rudx_2012_1_008965960
673152	BOINC/projects/climateprediction.net/hadam3p_anz_rue2_2012_1_008965965


1bb6 completed OK.

Time to browse client_state.xml and the log files.
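
For anyone without a shell handy, the same check can be sketched in Python. This is only a rough equivalent of the du pipeline above: it sums apparent file sizes in bytes rather than the 1K disk blocks that du reports, so the numbers will differ slightly. The function names, the hadcm3s_ prefix filter and the ~/BOINC path are assumptions based on the layout shown in this post, not anything shipped with BOINC or CPDN.

```python
import os

def folder_size_bytes(path):
    """Sum the sizes of all regular files below path (apparent size, not disk blocks)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total

def largest_subfolders(project_dir, prefix="", top=3):
    """Return (size_bytes, name) pairs for the biggest subfolders, largest first."""
    sizes = []
    for entry in os.scandir(project_dir):
        if entry.is_dir(follow_symlinks=False) and entry.name.startswith(prefix):
            sizes.append((folder_size_bytes(entry.path), entry.name))
    return sorted(sizes, reverse=True)[:top]

if __name__ == "__main__":
    # Hypothetical location, matching the directory layout in the du output above.
    project = os.path.expanduser("~/BOINC/projects/climateprediction.net")
    if os.path.isdir(project):
        for size, name in largest_subfolders(project, prefix="hadcm3s_"):
            print(f"{size / 1e6:10.1f} MB  {name}")
```

Passing prefix="hadcm3s_" restricts the listing to the short models, which makes leftover folders from completed tasks easier to spot among the hadam3p ones.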

[AF>Le_Pommier] Jerome_C2005
Joined: 21 Oct 10
Posts: 53
Credit: 2,101,753
RAC: 3,985
Message 50064 - Posted: 8 Sep 2014, 12:03:56 UTC

I got 2 of those WUs on my iMac, and both failed after more than one day of calculation:

<core_client_version>7.5.0</core_client_version>
<![CDATA[
<message>
process exited with code 9 (0x9, -247)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...

</stderr_txt>
]]>


But I did not suspend BOINC or anything. The only thing I can think of is that I changed (long ago) the setting that tells BOINC to switch between applications every hour: I set it to one day instead (1440 minutes). So is this "killing" the application?

But then these would be failing for almost everybody, since this setting defaults to 60 minutes in a BOINC installation and most people probably don't change it...?
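
When comparing several failed tasks, it can help to pull the two interesting facts out of the stderr text automatically: the exit code and how many suspend requests were logged. The following is a minimal sketch, not a CPDN or BOINC tool. The function name summarise_stderr is made up for illustration, and SAMPLE is simply the output quoted above.

```python
import re

# The stderr excerpt from this post, copied verbatim.
SAMPLE = """<core_client_version>7.5.0</core_client_version>
<![CDATA[
<message>
process exited with code 9 (0x9, -247)
</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...

</stderr_txt>
]]>
"""

def summarise_stderr(text):
    """Return (exit_code or None, number of suspend requests logged)."""
    match = re.search(r"process exited with code (\d+)", text)
    exit_code = int(match.group(1)) if match else None
    suspends = text.count("Suspend request from BOINC")
    return exit_code, suspends

print(summarise_stderr(SAMPLE))  # (9, 2)
```

Two logged suspend requests on a machine where nothing was suspended by hand would be consistent with BOINC itself pausing the task, but the sketch only counts the lines; it cannot say what triggered them.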

Iain Inglis
Volunteer moderator
Joined: 16 Jan 10
Posts: 1084
Credit: 7,663,596
RAC: 4,112
Message 50069 - Posted: 8 Sep 2014, 12:43:09 UTC - in response to Message 50064.  

"... But then these would be failing for almost everybody, since this setting defaults to 60 minutes in a BOINC installation and most people probably don't change it...?"

The variability between machines is very large for this model. Some users (e.g. astroWX) have completed many of these models and others (including me) have not succeeded in starting a single one.

Some of my crashed models have reported "INVALID THETA DETECTED", which is normally interpreted as an unphysical model. That so many should crash in that way, so early, and others crash with different errors suggests to me some model configuration error or BOINC compatibility problem - so I have excluded HADCM3S from my project preferences. I have not yet seen any explanation for why a particular model should fail to start on a completely reliable machine (as you can see I don't believe the "filtering parameter space for viable points" explanation) ...

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 50071 - Posted: 8 Sep 2014, 13:33:36 UTC

I was briefly wondering why this has never been an issue for me, until I realised that it doesn't affect those of us who only run one project at a time.

Niall
Joined: 18 Dec 13
Posts: 62
Credit: 1,078,935
RAC: 0
Message 50077 - Posted: 8 Sep 2014, 22:40:40 UTC - in response to Message 50071.  

"I was briefly wondering why this has never been an issue for me, until I realised that it doesn't affect those of us who only run one project at a time."


I'm only running CPDN while I have CPDN work, without interruptions. I crashed 8 hadcm3s units (and successfully ran none) before giving up.

Two hadam3p_anz units, a hadam3p_pnw unit and a hadcm3n unit are all currently running normally.

I don't think that's where the problem lies.

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 50078 - Posted: 8 Sep 2014, 22:47:17 UTC

The fail/succeed difference is Windows/Linux. Mostly, anyway.

During beta testing, I tried all sorts of things to crash them, including setting the prefs for "don't keep in memory", and shutting down both BOINC and the computer.
They just kept on running.


Helmer Bryd
Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 50085 - Posted: 9 Sep 2014, 20:26:33 UTC - in response to Message 49890.  

"This kind of WU should not be stopped!
The WU works fine and will finish as long as it runs without any interruption.
But if you stop it to do backups and then resume, it results in WU errors..."

Oh, it should be stopped; it doesn't work properly.

20000 new ones on the server

yikes!

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 50093 - Posted: 10 Sep 2014, 7:40:34 UTC

"Oh, it should be stopped; it doesn't work properly."

Or just not made available to Windows users. As Les has said, they seem bulletproof on *nix. I have had them survive two power failures here.

ed2353
Joined: 15 Feb 06
Posts: 137
Credit: 34,739,344
RAC: 13,815
Message 50094 - Posted: 10 Sep 2014, 8:35:42 UTC - in response to Message 50093.  

Just to show a different perspective, here are my results.

My computer runs Windows 8.1 and has a ratio of 44 successfully completed to 3 failures (though I was given full credit for 2 of the failures).

My son's computer runs Windows 7 and has a ratio of 52 successes to 4 failures.

I run CPDN exclusively and continuously on both computers.
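
To make these figures comparable with other reports in this thread, here is a small sketch that turns success/failure counts into a success rate. The helper name success_rate is hypothetical. The counts are the ones quoted in this thread; Niall's 0 out of 8 comes from his earlier post, which does not state an operating system.

```python
def success_rate(successes, failures):
    """Fraction of finished tasks that completed successfully."""
    total = successes + failures
    if total == 0:
        raise ValueError("no finished tasks to compare")
    return successes / total

# Counts as reported by posters in this thread.
reports = {
    "ed2353 (Windows 8.1)": (44, 3),
    "ed2353's son (Windows 7)": (52, 4),
    "Niall": (0, 8),
}

for who, (ok, bad) in reports.items():
    print(f"{who}: {success_rate(ok, bad):.1%} of {ok + bad} tasks")
# ed2353 (Windows 8.1): 93.6% of 47 tasks
# ed2353's son (Windows 7): 92.9% of 56 tasks
# Niall: 0.0% of 8 tasks
```

The spread between roughly 93% and 0% on machines that all run CPDN without interruption is the variability the moderators describe.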

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4472
Credit: 18,448,326
RAC: 22,385
Message 50096 - Posted: 10 Sep 2014, 9:53:46 UTC - in response to Message 50094.  

That takes us back to the question of why they fail on some boxes but not others.

Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,577,520
RAC: 2,959
Message 50406 - Posted: 8 Oct 2014, 13:57:07 UTC

Seven new WUs. E V E R Y WU crashed, even without any interruption.