Posts by MartinNZ

21) Message boards : Number crunching : Error while computing (Message 54568)
Posted 25 Jul 2016 by MartinNZ
Post:
> I've had a couple of pop-ups about Windows having a problem.

Funny, I had some but ignored them as there was nothing in the event log at the time. Never seen pop-ups from CPDN/BOINC before.

Over the last few days had two hadam3p_afr50 tasks go down with the same stderr message -

"The extended attributes are inconsistent. (0xff) - exit code 255 (0xff)"
e.g. task 19819095
Any thoughts? Sounds like a model-related issue to me.

Also had a wah2 go down with "The system cannot find the drive specified. (0xf) - exit code 15 (0xf)", which of course could be PC related. Task 19795175
22) Message boards : Number crunching : 2 recent wah2 crashes - all tasks (Message 54473)
Posted 10 Jul 2016 by MartinNZ
Post:

> I think the butler did it.

But was it with the lead pipe? ;-)

Well, I did several stress tests over the weekend. Everything passed with flying colours.

Over the years, Prime95 has come out as the favourite on the CPDN boards, so I started with that. Hmmm, looking at the memory use, it was only using about 6.7 GB. As I have 32 GB of RAM, testing less than 25% of it doesn't seem like a good idea. So I looked about, decided on Aida64 and ran its stress test for 25 hours. Not only did it seem to use all available memory, but from web chatter it seems to be better suited to modern processors. E.g. see the wiki article here and note the comment from the Asus engineer about the Intel X79 system, which just happens to be mine. OK, it might not push the maths the same as Prime95, but the PC is used for loads of other work as well, all of which works the system in ways Prime95 doesn't. Just for good measure I ran heaps of other programs like Photoshop, Excel etc while Aida64 was testing. Nothing seemed to throw it, although things did get a little slow at times - no surprises on that one!

Just to be sure I also did 4 hours with OCCT and another 4 hours with Prime95.

So the upshot is that I'll keep the PC plodding on cranking out the results. I still think something odd is happening though.
23) Message boards : Number crunching : 2 recent wah2 crashes - all tasks (Message 54448)
Posted 7 Jul 2016 by MartinNZ
Post:
Thanks everyone. As Dave noticed, plenty of memory. This is a Xeon workstation with 32 GB ECC RAM, and CPDN has its own 2 TB data hard drive. Art, I noticed your thread and decided this was a separate issue. I run 10 hyperthreaded tasks (out of a possible 16), as I decided long ago that was the most effective combination - it has never caused an issue in the past. Sometimes I suspend BOINC when I'm doing some particularly intensive work on the PC, but again there is no link with the crash times.

I'll run a soak test in the weekend, but will be surprised if it throws anything up - running CPDN does a pretty good job as a soak test anyway.

It could be a software update of course, but I always suspend and exit BOINC before doing any work. I have noticed a couple of big Norton updates coming through recently that require the system to be rebooted, but cannot recall if the times were coincident. Norton notifies that a restart is required, but what I don't know is how much installation work it has done in the background before the notification.
24) Message boards : Number crunching : 2 recent wah2 crashes - all tasks (Message 54437)
Posted 6 Jul 2016 by MartinNZ
Post:
Hi Les,
Until these two crashes I could say the same. If I exclude the latest failures, so far this year the PC has crunched 182 tasks successfully and only 6 have had errors, which I reckon is pretty reasonable. Hence the concern.
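
For reference, the counts quoted above work out to an error rate of a little over 3%. A quick check, using only the two figures from this post:

```python
# Counts quoted in the post: tasks so far this year, excluding the latest crashes.
successes = 182
errors = 6

total = successes + errors
error_rate = errors / total

print(f"{errors} errors out of {total} tasks = {error_rate:.1%}")
# prints "6 errors out of 188 tasks = 3.2%"
```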
25) Message boards : Number crunching : 2 recent wah2 crashes - all tasks (Message 54432)
Posted 6 Jul 2016 by MartinNZ
Post:
I've been having a great run over the past months, but I've noticed 2 recent crashes where all 10 running tasks have bombed out: 4 Jul 21:15 & 26 Jun 11:57. These are mostly wah2 models, see the errors here (Comp ID 1290283). It's one thing for a model to crash, but to take out all the others is a bit of a feat.

My first thought was computer error, but now I'm not so sure. Looking through the work units, only one of these has gone on to completion on any other PC so far. (I've also found a large number of PCs with very high failure rates, which I'll report in the appropriate place.)

Some of the PCs are showing similar symptoms, in that there are multiple tasks in error at the same time stamp; check 1323410 and 1211978. This seems to be a bit too regular to be a coincidence. I assume the number of errors at each time is dependent on the number of tasks being run.

Again, scanning through the work units, there seem to be a large number of wah2 failures, with only one or two PCs having (mixed) success. Example here

Although I now don't think it's relevant, I've run chkdsk on the BOINC hard drive (no errors reported), but no memory test as yet. I can't really run a soak test until Sat morning.

So the questions are:
1. Is it likely that these are PC-based errors? I suggest not.
2. If yes, I can either run all the current tasks until they complete or explode, or stop processing until after the soak test. Suggestions?
3. Should we be excluding wah2 from the models to run?

All this is very reminiscent of similar issues I had some moons ago. Spent ages trying to fault find when it turned out to be a model issue.
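
The "multiple tasks in error at the same time stamp" pattern described above can be checked by counting errors per timestamp. This is only a rough sketch: the task names and counts below are invented for illustration (only the two crash times come from the post), and the threshold of three is an assumption, not a project rule.

```python
from collections import Counter

# Hypothetical (task name, error time) pairs, invented for illustration only.
# Real values would have to be copied from the host's task list.
errored_tasks = [
    ("task_a", "26 Jun 11:57"),
    ("task_b", "26 Jun 11:57"),
    ("task_c", "26 Jun 11:57"),
    ("task_d", "30 Jun 03:40"),
    ("task_e", "4 Jul 21:15"),
    ("task_f", "4 Jul 21:15"),
    ("task_g", "4 Jul 21:15"),
    ("task_h", "4 Jul 21:15"),
]

# Assumed threshold: three or more errors in the same minute count as a cluster.
CLUSTER_THRESHOLD = 3

# Count how many tasks errored at each timestamp.
errors_per_minute = Counter(when for _, when in errored_tasks)

# Keep only the timestamps where several tasks failed together.
clusters = {when: n for when, n in errors_per_minute.items() if n >= CLUSTER_THRESHOLD}

for when, n in clusters.items():
    print(f"{when}: {n} tasks errored together")
```

A lone failure (like the made-up 30 Jun entry) is filtered out, which separates ordinary single-model crashes from the everything-at-once events.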
26) Message boards : Number crunching : ANOTHER UPLOAD PROBLEM (Message 54420)
Posted 5 Jul 2016 by MartinNZ
Post:
> is anyone else seeing this?

Confirmed. Possibly a side effect of the software update? I've had 'network activity suspended' for the last few days as I thought there may be a few issues. Suspended again until we hear things are better - as we always do :-)
27) Message boards : Number crunching : ANOTHER UPLOAD PROBLEM (Message 54163)
Posted 22 May 2016 by MartinNZ
Post:
Thanks Les,

I had noticed that the servers were OK on the status page & this seems to be the case 95% of the time with upload errors.

From your reply I take it we keep dropping a line here with issues. We know the IT staff will get it back up and running asap.
28) Message boards : Number crunching : ANOTHER UPLOAD PROBLEM (Message 54159)
Posted 21 May 2016 by MartinNZ
Post:
Oh look, it's the weekend again :-) Who needs a calendar?

21/05/2016 14:21:28 | climateprediction.net | [error] Error reported by file upload server: can't open file /storage/incoming/uploader/hadam3p_pnw_j7lf_201412_12_405_010579265_0_2.zip: No such file or directory


Out of interest, are these issues automatically flagged on the servers, or do the IT guys rely on feedback?
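
For anyone wanting to pull the details out of a log line like the one above, here is a minimal sketch. It assumes the layout "timestamp | project | message" with a day-first date, exactly as it appears in this post; other BOINC installs may format the date differently.

```python
import re
from datetime import datetime

# Event log line copied from the post above.
line = (
    "21/05/2016 14:21:28 | climateprediction.net | [error] Error reported by "
    "file upload server: can't open file /storage/incoming/uploader/"
    "hadam3p_pnw_j7lf_201412_12_405_010579265_0_2.zip: No such file or directory"
)

# Assumed layout: "<timestamp> | <project> | <message>".
timestamp_text, project, message = (part.strip() for part in line.split(" | ", 2))

# Assumed day-first date format, as shown in the post.
timestamp = datetime.strptime(timestamp_text, "%d/%m/%Y %H:%M:%S")

# Pull out the zip file name, if the message mentions one.
match = re.search(r"([^/\s]+\.zip)", message)
zip_name = match.group(1) if match else None

# weekday() returns 5 for Saturday and 6 for Sunday.
is_weekend = timestamp.weekday() >= 5

print(project, zip_name, "weekend" if is_weekend else "weekday")
```

The date in this line does fall on a Saturday, which fits the "weekend again" remark.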
29) Message boards : Number crunching : ANOTHER UPLOAD PROBLEM (Message 54068)
Posted 10 May 2016 by MartinNZ
Post:
Yup got a couple of doz afr50 zips stuck with transient HTTP error. Had network activity disabled over the weekend because of the issues, but it has been running today to get rid of the eu25 zips.

Soon be morning time and all will be sorted :-)
30) Message boards : Number crunching : HadCM3n release (Message 53742)
Posted 22 Mar 2016 by MartinNZ
Post:
Hi Dave, quite happy to alter clientstate.xml, but not sure I understand your answer.

I've nothing waiting to be transferred, trickles seem to be going up OK (you may note a week with nothing when I was on holiday), there doesn't seem to be much in the project folders.

So, does this mean I leave it running or abort? The news article from Sarah posted by Les on 15 March implies that all those batch numbers get aborted.

BTW, that should have been 220hrs in my original post, not 320hrs. Currently 166hrs to go.
31) Message boards : Number crunching : HadCM3n release (Message 53733)
Posted 22 Mar 2016 by MartinNZ
Post:
I see the notice about aborting batches 350-3 in the news. I have one that seems to have 352 in the name, so assume it is to be culled. As it has already run 320hrs, will someone please confirm this for me before I quit the task.
See my tasks here
32) Questions and Answers : Windows : WINDOWS 10 (Message 53368)
Posted 1 Feb 2016 by MartinNZ
Post:
Well after a long absence I'm back - with Win 7 on this workstation.

Did an in situ upgrade to Win 10 and it seemed to run fine. Had stopped doing CPDN during the upgrade and trial period. The trial period got extended after my first blue screen. Problem was it would run fine for weeks and then bang, crash - blue screen with loads of other issues. After the 3rd time I reinstalled Win7 from a backup image. Two others (both laptops) are running fine on Win 10 upgrades & I like it as a system (although one of them, an HP, doesn't have a driver for the fingerprint reader - the only hardware issue I had).

Had loads of issues with software, especially AV, but that was mostly fixed by updates except for the AV. Was running Kaspersky and that wasn't happy so as it was the end of the subscription tried BitDefender as that seems to be topping all the charts. However that caused problems with the network and Outlook. Ended up going back to Norton and that has been fine in my mixed environment.

Perhaps in another year I'll do another install, but this time it will be a clean install. My local computer shop (who built and maintain the workstation) reckons that most of the Win 10 problems that they have are from the upgrades. Very few problems from machines with a clean install.
33) Questions and Answers : Windows : Problems running on new computer (Message 53367)
Posted 1 Feb 2016 by MartinNZ
Post:
I had a look at the Workunits for that PC. On the first page ALL had 3 failed tasks & no completions, so I'd suggest this is a model issue. Saw Darwin, Vista, Win 7, 8.1 & 10 as an OS. But my feeling without counting was that the percentage of Win10 could be higher than the % of machines visible on the results page.
34) Message boards : Number crunching : CPDN Win10 Compatible ? (Message 52268)
Posted 19 Jul 2015 by MartinNZ
Post:
Hmm, I see the OP on Win 10 compatibility has been hijacked somewhat. To come back to the OP, I notice in a recent Grauniad article that the Home version of Win 10 is likely to install updates automatically, which is not the case with the Pro & Enterprise versions.

Recommendation has always been to stop Boinc/CPDN before applying updates, so I don't know how this may affect the running of tasks in the future. I normally close everything down before applying updates and must admit the update procedure seems to be pretty flawless these days, but it still does need the occasional reboot as part of the process. To me at least, it also seems that the Boinc/CPDN combination is a bit more robust these days and seems to survive things like power loss better than a few years ago.

Does anyone let Win updates happen automatically while Boinc/CPDN is running? I guess only time will tell if the Win 10 Home version update will become an issue.
35) Message boards : Number crunching : HadCM3 short errors (Message 52044)
Posted 10 Jun 2015 by MartinNZ
Post:
That is of course true Richard, and it poses a couple of interesting points.

1. The number of PCs running as a service is pretty low as I understand it, and I understood from Les that particular error was triggered by a combination of 'Windows + BOINC 7 [>7.0.36] + a service install'.
2. I looked through a few of my completed short tasks to see if they had wingmen with errors. Out of 17 workunits, the number of task failures per BOINC version was as follows ("L" = Linux, the rest Windows).

6.10.58 - 1, 1L
7.4.23 - 0, 2L
7.4.27 - 2
7.4.28 - 1
7.4.36 - 4
7.4.42 - 22

I then checked a few of those failed PCs and there was a mix of those that failed all shorts and those that failed a lot (excluding the 1980 runs). I didn't notice any where short model failures were low. But I only checked a couple of PCs.

3. When I was failing tasks last year using the v7.2.42 of BOINC (yes that is 7.2.42, not 7.4.42), many tasks had Invalid Theta errors that are generally explained as model errors, not computer errors. When I went back from the v7.2.42 to v7.0.36, I went from 100% failure to 100% success. Interesting.

Could it be that something else in the later versions of BOINC is triggering errors in the PCs not running as a service?

It would be interesting to see one of those other PCs go back to an earlier version of BOINC to see what happens. But then of course we would need more short tasks :-(

Anyway, I leave that question to better minds than mine to figure out.
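
For what it's worth, the tally in point 2 above can be totalled up as follows. This is just a sketch using the counts as posted; they are raw failure counts, not rates, since there is no figure for how many hosts run each BOINC version.

```python
# Failure counts transcribed from the list in point 2 above.
# Each entry: BOINC version -> (Windows failures, Linux failures).
failures = {
    "6.10.58": (1, 1),
    "7.4.23": (0, 2),
    "7.4.27": (2, 0),
    "7.4.28": (1, 0),
    "7.4.36": (4, 0),
    "7.4.42": (22, 0),
}

windows_total = sum(win for win, _ in failures.values())
linux_total = sum(lin for _, lin in failures.values())
grand_total = windows_total + linux_total

# Share of all observed failures attributable to each version.
for version, (win, lin) in failures.items():
    share = (win + lin) / grand_total
    print(f"{version}: {win + lin} failures ({share:.1%} of the total)")

print(f"Windows: {windows_total}, Linux: {linux_total}, total: {grand_total}")
```

Two thirds of the failures seen were on 7.4.42, but without knowing how many machines run that version it is not possible to say whether it fails more often than the others.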
36) Message boards : Number crunching : HadCM3 short errors (Message 52028)
Posted 7 Jun 2015 by MartinNZ
Post:
Oh dear, not the OS debate again. Hopefully the project managers can do a proper analysis of whether the models are failing because of the OS, as to me it does not seem to be simply an OS fail/complete.

Last year ALL my short models were failing (over 180), and Les's comment at the time was "Luck of the draw, I think." He could well be correct, because over the last month or so I've had well over 100 complete (1980 models excluded), and the only failures were 1980 No Resubmissions run in error.

I'm running Win7 with BOINC as a service and BOINC Service gets stopped and started automatically every night while the PC does a backup. Then there are all the other times it gets shut down when doing Win updates, stopping to get intensive graphics work done etc. Once I accidentally hit the power button and when rebooted the tasks resumed just fine.

So the model (all models actually) now seems pretty stable on my Win PC. Had a quick look at some of my short tasks, and most were new with just my PC having run them through. Of those where I succeeded and others failed, of the failures 7 were Linux, 20 were Win. About even stevens I would have thought. I didn't look at the reason for failure - leave that for someone else.
37) Message boards : Number crunching : ANOTHER UPLOAD PROBLEM (Message 51826)
Posted 13 Apr 2015 by MartinNZ
Post:
Yup, I've quite a few ANZ zips not uploading. Same log as Lockleys showing problem with project servers.
38) Message boards : Number crunching : Linux/Mac/Windows segmentation (Message 51801)
Posted 9 Apr 2015 by MartinNZ
Post:
Yeh, had noticed before on Tullus's graphs that the linux only tasks took ages to run down, but figured the researchers were happy enough.

It was an interesting exercise and brings another dimension to the age old OS debate.

Still, the break in work meant that I was able to give the insides of the PC a good and thorough sudsy jet wash, although I did have to get the sand blaster out to finally get the water cooler radiators clean. ;-)
39) Message boards : Number crunching : HadCM3 short errors (Message 51789)
Posted 7 Apr 2015 by MartinNZ
Post:
Yup, just noticed a bad run, all "no resubmission" from 19 Nov workunits.

Looks like things will be quiet with no more models for the moment, but hey that's life. Might be a good time to blow the dust out of the PC :-)
40) Message boards : Number crunching : Errors & Boinc service installation (Message 51737)
Posted 1 Apr 2015 by MartinNZ
Post:
A week ago I turned on the HadCM3 short downloads again for a bit. 9 downloaded and ran perfectly. Not running them now as there is plenty of work around and I thought I'd leave them for the linux machines. In fact I'm now having very few errors on this PC, and the PNW models are also completing without error so far. This PC has always been running as a service installation and I had surmised that the earlier short model problems with the service installations had been solved. I'm still running Boinc v7.0.36.

BTW, one of the threads on this is here HadCM3 short - errors galore



©2024 climateprediction.net