Posts by MartinNZ

41) Message boards : Number crunching : hadRM3P Europe 7.26 No Trickles No Credit (Message 51554)
Posted 7 Mar 2015 by MartinNZ
Post:
With my 2 month WUs, I've 9 completed with trickles and 15 without.
42) Message boards : Number crunching : hadRM3P Europe 7.26 No Trickles No Credit (Message 51547)
Posted 6 Mar 2015 by MartinNZ
Post:
Looking at ed2353's completed tasks, some have credits and trickles, and some do not. All show in the stderr output that the calcs seem to finish normally. My short EUs are the same, about 50/50 trickle/no trickle.

Are both valid results?
43) Message boards : Number crunching : ANOTHER UPLOAD PROBLEM (Message 51445)
Posted 21 Feb 2015 by MartinNZ
Post:
Yup, all clear. Keep up the good work.
44) Message boards : Number crunching : ANOTHER UPLOAD PROBLEM (Message 51441)
Posted 20 Feb 2015 by MartinNZ
Post:
I've 6 hadam3p_afr trickles that have been trying to upload all day. Log says Internet access OK - project servers may be temporarily down.

I'm sure someone is aware of it, but if they aren't there's still a few hours left before the weekend - which is when these things normally go astray ;-)
45) Message boards : Number crunching : HadCM3 short - errors galore (Message 51188)
Posted 13 Jan 2015 by MartinNZ
Post:
Hi Les, if my memory serves me correctly, a failure rate of up to 30% was mentioned at some time as what to expect.

So now to foot in mouth time.

In my case, in HadCM3 short models issued since 10 Oct, I have had zero complete and around 180 errors for whatever reason (this excludes all those we aborted). Before this these models were completing OKish. Coincidentally, this is about the time I rolled BOINC back to 7.0.36 as stated in an earlier post below.

Out of interest I went to a page of my results that were all short models with errors in November. All 20 of mine failed with Invalid Theta, with a very short CPU time. When checking the workunits, 17 of those units went to completion on other boxes (mainly Windows), the other 3 did not complete at all. So the model seems to be OK, guess I should have done more digging earlier!!!!

I can only deduce that there was some change in the model/BOINC structure that makes my PC pretty crap with these models, possibly the running as a service issue.

I also looked at a couple of the PCs that completed a few of my failures, and they had run loads of HadCM3 short tasks and actually very few failures.
46) Message boards : Number crunching : HadCM3 short - errors galore (Message 51184)
Posted 12 Jan 2015 by MartinNZ
Post:
OK, all 4 HadCM3 short tasks have now failed. 2 with out of memory error, 2 with Invalid Theta, see tasks sent 4 & 9 Jan here

Both of the Invalid Theta tasks managed to get past the first timestep and failed with Output File Absent, e.g. Output file hadcm3s_75qk_1980_2_009366458_0_2.zip for task hadcm3s_75qk_1980_2_009366458_0 absent

I am really struggling to believe that the HadCM3 short tasks are worth running, as the error rate is way too high. I can understand that models can go out of bounds with Invalid Theta, but given the huge number of these that happen, then surely the model is lacking. As for system errors, I agree with Iain - these should never have gotten out of Beta testing (same for modelling actually). Volunteers, the vast majority of whom probably never read these forums, should not have to continually interact with their PC to keep CPDN models running.

There are a huge number of volunteers out there, all creating additional CO2, and costs, by running these models in the hope that there is a slight chance it will make a difference. I suggest that if the researchers had to run these models on their own supercomputer, the problems would soon be sorted out.

BTW, the other models (ANZ, Africa, Europe, Full Res Ocean) seem to chug along just fine.
47) Message boards : Number crunching : HadCM3 short - errors galore (Message 51180)
Posted 11 Jan 2015 by MartinNZ
Post:
I've just finished what I assume was one of the new batch of short tasks here (hadcm3s 1980, created 22 Dec).

It crashed with the not uncommon error:
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception) (0xe06d7363) at address 0x7731C42D

I've 3 more running, but I've now turned them off for my PC. I should know within a couple of days if these running are OK, but Iain's comment here does come to mind.

I've just noticed that another of the 22 Dec batch snuck through, and looking at the workunit here, all 5 tasks failed: 1 similar to the above, 1 with a Linux library issue, and 3 with Invalid Theta. Given the frequency of the Invalid Thetas that have been occurring in these models, you really have to wonder if it is a programming issue rather than a model instability.
48) Message boards : Number crunching : HadCM3 short - errors galore (Message 50838)
Posted 16 Nov 2014 by MartinNZ
Post:
Was that me?

Yes Richard, it was indeed you - I'm obviously having an off day or two.

Thanks for the detailed reply and link, I'm quite happy to stick with the older version. If I move to Win 10 in a few years that may be different, but by then hopefully CPDN will have caught up. Although I see what you mean about "foreseeable". I guess I won't be holding my breath.

When it is safe for Win service users to update BOINC, perhaps you'd be so good as to post a message in the Windows section of the BB.
49) Message boards : Number crunching : HadCM3 short - errors galore (Message 50836)
Posted 16 Nov 2014 by MartinNZ
Post:
Anyone running 7.4.27 as a service yet? I think Bonsai911 doesn't. Can't see much in the changelog that would improve things on the service side, but the update details are always pretty sketchy. Peter Haselgrove seems to have a handle on the service side of things in this post below - any news?

BTW Peter, I reported back that I installed 7.2.36; that should of course have been 7.0.36, but hopefully you figured that out. I've obviously been getting too much sun :-)

I've just gone through a batch of HadCM3 Shorts, and they all failed with Invalid Theta. [Edit: Interestingly on the workunits I checked, my PC was the only one getting Invalid Theta, the others were the old errors reported below.] I assume these Invalid Thetas are a model error, but if moving to the new BOINC version would improve things then I will, especially as I notice there are a huge number of 'short' tasks available for download. Not much use running them if it is the PC causing the issue. In the meantime I'll stick with 7.0.36 running as a service.

We seem to be getting a few splinter threads on the "short" issue - shame really. Perhaps they could be brought into the main one or closed to new comments?
50) Message boards : Number crunching : hadcm3n Full Res Ocean out of memory error (Message 50706)
Posted 2 Nov 2014 by MartinNZ
Post:
OK, last off-topic post here from me. I see there are mechanisms, but really can't see the point. To me it's up to the researchers to allocate the tasks and when the tasks run out, they run out - time for some housekeeping.

Given that on this PC I run 10-12 tasks, and at a reasonable pace, it is actually not all that often that I run out anyway. When I do, it's normally for less than a week - hey, I can live with that. So I'll just run with the default cache settings.
51) Message boards : Number crunching : Update on HadCM3 'Short' WU crashes with shutdown in Windows (Message 50705)
Posted 2 Nov 2014 by MartinNZ
Post:
Interesting Erik, but not quite what Pete was on about. However to continue your train of thought, I noticed it is an AMD box so went through the top 300 PCs, but no, only found one other AMD giving similar results here. (The CPU run time is a dead giveaway as to the type of error; numbers less than 100 sec normally mean Invalid Theta.) Several others crashing, several hadn't run the model.

Then thought I should have a look at some others and found a Windows laptop (with more suspends than I've had hot breakfasts - ever!!) and yet was chugging all the 'r' models through getting Invalid Theta.

Gave up after that, and I guess the researchers will by now have figured out what is going on.
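The rule of thumb above (a CPU run time under 100 sec normally means Invalid Theta) can be written as a small check. This is only a sketch of that one heuristic: the function name and the 100 sec cut-off as a hard constant are my own assumptions, and a short run time is a hint rather than proof, so the stderr output still needs checking.

```python
# Hypothetical helper: the threshold comes from the rule of thumb in the post,
# not from any CPDN or BOINC documentation.
INVALID_THETA_MAX_CPU_SECS = 100.0


def looks_like_invalid_theta(cpu_seconds: float) -> bool:
    """Return True if a failed task's CPU time is short enough to suggest Invalid Theta."""
    return 0.0 <= cpu_seconds < INVALID_THETA_MAX_CPU_SECS


# Example with made-up CPU times (seconds) for a handful of failed tasks.
failed_task_cpu_times = [12.4, 87.0, 23000.0, 150.0]
suspects = [t for t in failed_task_cpu_times if looks_like_invalid_theta(t)]
print(suspects)
```

Anything that falls outside the cut-off would need its stderr read the usual way.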
52) Message boards : Number crunching : hadcm3n Full Res Ocean out of memory error (Message 50676)
Posted 29 Oct 2014 by MartinNZ
Post:
Thanks Les, for me as I have a full quota of 's' series running it's just a matter of knocking the 'r' series off once a day, but great that it is being sorted out at the server end. For those with lesser bandwidth I'm sure it will be welcome.


Speaking of cache (and a wee bit off topic), some PCs seem to have a cache of tasks that is a fair bit greater than the max work allowed (10 days?), or am I missing something? I've noticed this in the past when PCs with old Pentium processors suddenly seem to appear on the leader boards, where the 'In Progress' tasks seemed to be months' worth of work. Not really fair on the research process if this is the case. Not in the Pentium camp, and a PC that seems to be making a useful contribution, but has just under 100 tasks 'In Progress', is this i5 PC. If you look at some of the completed 'shorts' they are taking a month between sending and completing, which implies that the task cache calculation is off a bit. Just seems a bit odd.
53) Message boards : Number crunching : hadcm3n Full Res Ocean out of memory error (Message 50651)
Posted 28 Oct 2014 by MartinNZ
Post:
Just a thought Les, if we keep aborting the 'r' series, won't the re-runs keep being sent out until someone finally ends up running them? Given that the tasks kill themselves after a few minutes' run, would we be better off running them so they quickly reach their max number of re-runs, or does an Abort count as a re-run?
54) Message boards : Number crunching : hadcm3n Full Res Ocean out of memory error (Message 50642)
Posted 27 Oct 2014 by MartinNZ
Post:
Thanks Les.

My 's' series are running well, now up to 60% complete on the first ones, but as they have 300+ hours run time, it will be a while before I see if they complete OK.

Given the high failure rate on the 'r' series, it seems a lack of testing on that one - guess these things happen & must be frustrating for the researchers who have more than enough to do anyway.
55) Message boards : Number crunching : hadcm3n Full Res Ocean out of memory error (Message 50594)
Posted 23 Oct 2014 by MartinNZ
Post:
Just looked at my tasks in progress at 10+ hrs and compared those to my failures. All the failures happened at around 23000 CPU secs, and this just happens to be around the time of the first upload trickle at timestep 25,920 on my rig. None of the failures had any trickles. Interesting?
56) Message boards : Number crunching : hadcm3n Full Res Ocean out of memory error (Message 50593)
Posted 23 Oct 2014 by MartinNZ
Post:
I would definitely NOT recommend using ANY so-called registry cleaner, and Les, you may want to consider removing the link if you can edit that post. I'll come back to why in a moment, as I want to address the error issues first.

Yes it could be the allocated memory that is an issue, but it is not just my PC it is all the others that I have looked at - haven't seen one successful one yet (although some of mine are now up to 50 hrs running - fingers crossed). Mac and Linux boxes are also failing. Your link to the new experiments suggests early failure with Invalid Theta. None of the tasks I looked at have Invalid Theta. All seem to run for 15k - 35k secs CPU time, Win PCs fail with the memory error stated in my earlier post, Macs and Linux boxes with different variants of 'process exited with code 193' Mac sample, Linux sample (Hi Eric).

An interesting BOINC related post here from 2008 on the 'exit code -529697949' error and it suggests memory leakage by a wrapper - starting to get a bit technical for me although I understand the purpose of a wrapper. I was thinking of doing a memory test when I get back from the long weekend, but I note this was done in the link and no errors found.

It would be nice to know if these are expected errors in addition to the Invalid Theta. I guess the researchers might come back at some point and let us know.

Registry cleaners. Personally I don't think they are a good idea and most independent writers advise against them. Microsoft link, Microsoft MVP comment link. Firstly I would NEVER EVER run anything that tinkered with the registry that wasn't well documented and proven. Secondly when I went to the linked page, I immediately had a download file box asking if I wanted to save or run the file - AND I hadn't clicked anything on the page! I didn't realise that was possible - definitely get me out of here time.
57) Message boards : Number crunching : hadcm3n Full Res Ocean out of memory error (Message 50584)
Posted 22 Oct 2014 by MartinNZ
Post:
All these tasks are failing with:
- exit code -529697949 (0xe06d7363)
</message>
<stderr_txt>


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Out Of Memory (C++ Exception)(0xe06d7363) at address 0x7688C42D

Engaging BOINC Windows Runtime Debugger...

When I check the workunits, all other PCs tasks are failing with the same error.

For a sample of mine see task 17232786

Just in case anyone wonders, I am not out of memory (32GB). Well not on the PC anyway ;-)
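For anyone wondering how the two numbers in the error line relate: -529697949 and 0xe06d7363 are the same 32-bit value, read as signed and as unsigned. The snippet below is just a sketch to show the conversion (the helper names are mine). As far as I know, 0xE06D7363 is the code the Microsoft C++ runtime uses for any C++ exception, and its low three bytes spell "msc" in ASCII, so the number itself says nothing specific about memory.

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as unsigned."""
    return code & 0xFFFFFFFF


def to_signed32(code: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    code &= 0xFFFFFFFF
    return code - 0x1_0000_0000 if code & 0x8000_0000 else code


exit_code = -529697949  # as reported in the task output above
as_hex = hex(to_unsigned32(exit_code))
print(as_hex)

# The low three bytes decode to ASCII text.
tag = bytes.fromhex(as_hex[-6:]).decode("ascii")
print(tag)
```

So the "Out Of Memory" text is the useful part of the message; the exit code only says that a C++ exception went unhandled.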
58) Message boards : Number crunching : HadCM3 short - errors galore (Message 50551)
Posted 18 Oct 2014 by MartinNZ
Post:
Right now my boxes are all on linux, because of these problems on Windows. Failures are few.


Ah Eric, the old win/linux/mac discussion.

Looking at the list of top computers, I chose at random three Linux and three Windows based boxes from the first page. Looking at the HadCM3 short tasks, to make it easy for me I only counted tasks sent during Sept and only counted errors while computing. These are the results (computer ID - % tasks with errors (give or take a bit)):

Win: 1270273 - 24%, 1278109 - 17%, 1295173 - 27%
Linux: 124 - 30%, 805269 - 27%, 1292107 - 30%

So, not really much in it, and actually on that tiny sample, windows would be the better system. Would be nice to see someone do a bigger sample :-)
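To put a number on "not really much in it", here is a quick sketch that averages the six figures above. The percentages are the ones listed in this post; the dictionary layout and function name are just for illustration, and with three boxes per OS this is nowhere near a meaningful sample.

```python
from statistics import mean

# Error rates (%) for HadCM3 short tasks sent in Sept, keyed by computer ID,
# copied from the figures listed in the post.
error_rates = {
    "Win": {1270273: 24, 1278109: 17, 1295173: 27},
    "Linux": {124: 30, 805269: 27, 1292107: 30},
}


def mean_error_rate(rates_by_host: dict) -> float:
    """Unweighted mean of per-computer error percentages."""
    return float(mean(rates_by_host.values()))


summary = {os_name: mean_error_rate(hosts) for os_name, hosts in error_rates.items()}
for os_name, rate in summary.items():
    print(f"{os_name}: {rate:.1f}% average error rate over {len(error_rates[os_name])} boxes")
```

Note this is an unweighted average of percentages, so a box that ran 5 tasks counts the same as one that ran 500; a proper comparison would need the task counts as well.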

For your own boxes Eric, I looked at your top two PCs and on the same basis there is an error rate of 20% and 25% for those two PCs.

Now why I have such a high failure rate I don't know. For the few I ran in Sept, my error rate was 44% and all of those were Invalid Theta. Luck of the draw? We just won't go into my October tasks! Grrrrr.
59) Message boards : Number crunching : HadCM3 short - errors galore (Message 50541)
Posted 17 Oct 2014 by MartinNZ
Post:
The batch sent 16/17 Oct have all failed - that's a total of 46 on my PC. Error Code 22 & invalid Theta. All the work units have multiple failures.

The run time was so short before failure that it was only because I heard the cooling fans speed up that I noticed them running.

As I assume this is a model failure, I'm leaving the PC to accept new tasks.
60) Message boards : Number crunching : HadCM3 short - errors galore (Message 50509)
Posted 12 Oct 2014 by MartinNZ
Post:
Been through quite a few short tasks since coming back on line. Quite a few have failed, but all that I have checked are Error code 22 & Invalid Theta, so I assume that's OK, e.g. see the errors at the top of this page. Interestingly, if you then view the previous page it shows the 5 hadcm3s short tasks that are still running (or were when I wrote this!). The ones still running were sent on the 11th, all those sent on the 12th have Invalid Theta. Coincidence or model differences? (Note for those that do not realise, the page references will only stay valid for a few days and if you are looking at some time in the future, you will be looking at the wrong page and task list. For individual tasks see Example of task with Error 22 and Example of task running (10 hours, 9 to go) at time of post.)

I've put the beast on no new tasks overnight and hopefully there will be some feedback in the morning.


©2024 climateprediction.net