Posts by nairb

1) Message boards : Number crunching : Lost another one! (Message 69124)
Posted 6 Jul 2023 by nairb
Post:
I find that a power cut is good for killing w/u as well. Other projects' w/u all recover ok.
2) Message boards : Number crunching : New work discussion - 2 (Message 68863)
Posted 7 Jun 2023 by nairb
Post:
I must confess that I must have confused those 64bit applications with the other apps that need those 32bit libs.
I did think for a moment that the 64 bit ones had been statically linked.
It would be far better if those 32 bit jobbies could be re-compiled for 64 bit. It's only on the fedora 32 machine that I have those extra libs installed. Whilst I have vbox installed on fedora 36, I failed to get those 32 bit libs installed there. So I had better remember which climate w/u's are to be worked on and not download them to the wrong machine.

Anyway, I'm sure the world will keep turning.
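
One way to avoid sending a 32 bit app to the wrong machine is to look at the binary itself. The following is a minimal Python sketch, not anything supplied by the project: it reads the ELF identification bytes, where the fifth byte is 1 for a 32 bit file and 2 for a 64 bit one. The function names are made up for illustration, and it only reports the word size; it does not say whether a binary is statically linked or which libs it needs.

```python
ELF_MAGIC = b"\x7fELF"

def elf_class_from_header(header):
    """Return '32-bit' or '64-bit' from the first bytes of a file, or None if not ELF."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    # Byte 4 is EI_CLASS in the ELF identification: 1 = 32 bit, 2 = 64 bit.
    return {1: "32-bit", 2: "64-bit"}.get(header[4])

def elf_class(path):
    """Read the first five bytes of the file at `path` and classify it (hypothetical helper)."""
    with open(path, "rb") as f:
        return elf_class_from_header(f.read(5))
```

Calling elf_class() on an application file in the BOINC project directory would then show which kind of binary it is before any work is fetched for that machine.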
3) Message boards : Number crunching : New work discussion - 2 (Message 68849)
Posted 5 Jun 2023 by nairb
Post:
I meant w/u's not needing those 32 bit libs. We had some of those earlier. They worked well on fedora 36 without installing those 32 bit libs.
4) Message boards : Number crunching : New work discussion - 2 (Message 68847)
Posted 5 Jun 2023 by nairb
Post:
Are all the future w/u's going to be statically linked? Hope so.
5) Message boards : Number crunching : w/u failed at the 89th zip file (Message 68514)
Posted 28 Feb 2023 by nairb
Post:
At least this w/u ran to the end before aborting with

"Process still present 5 min after writing finish file; aborting"

No other error messages.

Only 50% success so far with the latest bunch.
6) Message boards : Number crunching : w/u failed at the 89th zip file (Message 68453)
Posted 25 Feb 2023 by nairb
Post:
Old friend is back.....
"double free or corruption (out)"

I have been missing these.
7) Message boards : Number crunching : w/u failed at the 89th zip file (Message 68114)
Posted 30 Jan 2023 by nairb
Post:
With 1 cpdn task running by itself it failed after zip no 83 with
13:30:43 STEP 2039 H=2039:00 +CPU= 18.156
double free or corruption (out)

I will be glad to see this problem solved........ the machine will have been running with endless free memory and several idle threads.
8) Message boards : Number crunching : w/u failed at the 89th zip file (Message 68027)
Posted 25 Jan 2023 by nairb
Post:

Memory issues like this are not easy to track down in the code. So far it looks like there is a small memory leak in the boinc libraries responsible for zipping up the results files rather than the CPDN code, but it's early days so I can't be sure where the error is coming from.


Yup. I agree it can be difficult to track down. I worked for years on call centre kit with over 1000 concurrent users. We did test for 8000 concurrent jobs on the machines. With multiple layers of software it was a challenge to find the culprit with a memory leak/corruption problem.

I always tried to get the application programming teams to "try" and give informative error messages........... not always seen as the most important issue. But a useful error message can save endless hours later!!!

The machine I use for cpdn seems able to run any combination of projects without issues. 8 of anything seems ok, and they all seem to recover from a power cut..... Unlike some cpdn w/u.
With 4 cpdn w/u running at once it uses very little swap space and usually shows about 5~6 gig of memory free. I know peak usage will vary.
Anyway, I hope the bug is found, since it will save a lot of frustration for everyone
9) Message boards : Number crunching : The uploads are stuck (Message 68026)
Posted 25 Jan 2023 by nairb
Post:
And WCG can't feed tasks either right now (they assign them, but most of them don't download),


Just got a bucket full of WCG with some ARP w/u as well. They do download after a while with a bit of prodding.
10) Message boards : Number crunching : The uploads are stuck (Message 67938)
Posted 21 Jan 2023 by nairb
Post:
Just had 3 uploads fail at 100% with same message
"No space left on server"
11) Message boards : Number crunching : w/u failed at the 89th zip file (Message 67925)
Posted 20 Jan 2023 by nairb
Post:
2) Do you have "Leave non-GPU tasks in memory while suspended" enabled in Computing preferences? It's highly recommended, especially if the tasks often get interrupted for any reason like task swapping, BOINC/PC restarts.


Yes, it's ticked. It's a dedicated machine and I try to ensure that once a climate task starts it's not suspended by other work and runs to completion.

When I checked the machine today, the machine seemed to be running almost idle with 2 tasks using almost no cpu time.
It needed a hard reboot. I should have done a memory check but it looks to have come back to life, but has dumped 2 of the working w/u's with computation errors.

When it's cleared the running jobs I will run a memory checker just to be sure.

I do tend to load the thing with 4 climate jobs and 4 WCG jobs at once....... it's done ok so far. But maybe I have been lucky.
12) Message boards : Number crunching : w/u failed at the 89th zip file (Message 67915)
Posted 19 Jan 2023 by nairb
Post:
This w/u https://www.cpdn.org/result.php?resultid=22269116 failed with a most informative error

"double free or corruption (out)"

Anybody had one of these? Just curious what it might mean??
Ta
Nairb
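
For what it's worth, "double free or corruption" is a message printed by glibc's memory allocator when it finds its heap bookkeeping in an inconsistent state, typically because a block was freed twice or memory next to it was overwritten. It does not say which part of the program is responsible. As a rough sketch only, a few lines of Python can flag that message, and the other failure messages quoted in these posts, in a copy of a task's stderr text. The function name and the notes are illustrative assumptions, not part of BOINC or CPDN.

```python
# Failure messages quoted in these posts, each with a short note.
# The notes are plain-language readings, not official diagnoses.
KNOWN_SIGNATURES = {
    "double free or corruption": "heap inconsistency reported by the C library allocator",
    "Process still present 5 min after writing finish file": "task had not exited after writing its finish file",
    "No space left on server": "upload refused because the server reported no space",
}

def classify_stderr(text):
    """Return (signature, note) pairs for every known message found in `text`."""
    return [(sig, note) for sig, note in KNOWN_SIGNATURES.items() if sig in text]

if __name__ == "__main__":
    sample = "13:30:43 STEP 2039 H=2039:00 +CPU= 18.156\ndouble free or corruption (out)\n"
    for sig, note in classify_stderr(sample):
        print(f"{sig}: {note}")
```

This only matches text; it cannot tell where the corruption came from, which is the hard part.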
13) Message boards : Number crunching : The uploads are stuck (Message 67710)
Posted 14 Jan 2023 by nairb
Post:
For some reason I seem to be blessed with endless uploads. It worked all thru the night and now keeps up with the output of 4 running w/u.
It does seem that once a connection is made it keeps uploading the zips until they are all gone....... lucky me..
14) Message boards : Number crunching : The uploads are stuck (Message 67662)
Posted 13 Jan 2023 by nairb
Post:

The current limit is 50 concurrent connections.

Well, on an optimistic note, I finally snared a connection and at a rattling 30+kBits/s have managed to upload a whole w/u. And still the connection is holding. I might make it up to 65 kbits/s later in the evening.

With luck all the rest of the zips will go overnight.
15) Message boards : Number crunching : The uploads are stuck (Message 67592)
Posted 12 Jan 2023 by nairb
Post:
Good job it's not the weekend....... At least I got 2 complete w/u uploaded.
16) Message boards : Number crunching : The uploads are stuck (Message 67542)
Posted 11 Jan 2023 by nairb
Post:
Well, here in slow land, one zip file at a time is being uploaded....... Huuurrray.
It's happening as I write this.
17) Message boards : Number crunching : The uploads are stuck (Message 67358)
Posted 5 Jan 2023 by nairb
Post:
Well I managed to get 2 w/u worth of zip files uploaded but it seems to have gone "pop" again. Nothing uploading. Good job those zip files are not 100+meg in size or we might never catch up.
18) Message boards : Number crunching : The uploads are stuck (Message 67342)
Posted 4 Jan 2023 by nairb
Post:
Still going here. Took 5.5 hrs to upload all the zips for the first w/u. Could finish the lot by sometime tomorrow evening(late) if all keeps going.
19) Message boards : Number crunching : The uploads are stuck (Message 67315)
Posted 4 Jan 2023 by nairb
Post:
It's looking hopeful....... one just uploaded. 400+ to go.
20) Questions and Answers : Unix/Linux : Fedora 36 (Message 66740)
Posted 3 Dec 2022 by nairb
Post:
These are 64bit models which they need to be to address the amount of RAM they use. I have had two that failed right at the end and 4 successes so far.


The final one failed at the end and there was nothing in the stderr log. But at least the w/u are short. And static linking removes the problem of missing libs.


©2024 climateprediction.net