Posts by ChrisD

1) Message boards : Number crunching : Error while computing (Message 53348)
Posted 27 Jan 2016 by ChrisD
Post:
Just bought a fine new Samsung SSD 850 Pro 256 GB.
This drive has been assigned solely to BOINC and its projects, that is, CPDN for CPU and SETI Beta for GPU.
Just uploaded 7 _2 zips, so the new CPDN tasks have behaved nicely so far. :)

Fingers Crossed.

ChrisD
2) Questions and Answers : Windows : Errors in new HADAM3P_ ANZ Tasks (Message 53293)
Posted 21 Jan 2016 by ChrisD
Post:
... Alternatively, a large write-cache will work.

That may be worth a try :)

I can provide details if necessary.

That would be great, so please..

ChrisD

Re using a RAM drive: then I must get myself a UPS (no-break) first. One glitch and all is lost.

I actually use a RAM disk on my I7 to store Firefox's cache. Otherwise my Wear Level Count declines rapidly, due to the ridiculous way Firefox stores Your browsed pages: myriads of small 1K, maybe 2K, files, trashing any SSD in record time.
But if I lose power, I can safely discard any file on the RAM drive.
3) Questions and Answers : Windows : Errors in new HADAM3P_ ANZ Tasks (Message 53291)
Posted 21 Jan 2016 by ChrisD
Post:
My I7 is a 3770K with HT enabled, and BOINC is limited to 7 processors, just to make sure one CPU is always there to do something else.

The Xeon has HT disabled and 6 CPUs active, and as both machines fail at random, the spare CPU does not seem to help any.

Are there any log settings that might help me pinpoint the problem?

Errors are random; the Xeon has one Africa Region task left and has passed TS 95000 without problems.

??

ChrisD
4) Questions and Answers : Windows : Errors in new HADAM3P_ ANZ Tasks (Message 53288)
Posted 20 Jan 2016 by ChrisD
Post:
Sorry, but my Xeon cruncher just trashed a task 5 trickles down the list.

BUT: this machine has just a vanilla Windows 7 install. Nothing besides BOINC and BOINC-Tasks is installed.
Still, this heartbeat error trashes my CPDN tasks at random.

Am I really the only one fighting these errors? If I am, and nothing is done about it, I am not sure I can afford to compute for the trashcan much longer.

Sorry.

ChrisD
5) Questions and Answers : Windows : Errors in new HADAM3P_ ANZ Tasks (Message 53276)
Posted 17 Jan 2016 by ChrisD
Post:
You might find a longer event file in the BOINC folder in program data. It's called stdoutdae and is a text file. Don't forget the ProgramData folder is hidden by default.


Tnx :)

Found it, and it told me the same sad story.
After the internet went down, BOINC kept trying every minute to upload trickles to CPDN.
Finally it simply lost patience and started trashing each and every CPDN task in a rage. The log shows 'Reporting xx completed Tasks' repeatedly, together with result .zip files, none of which ever made it out from here.

Strangely enough, SETI Beta, which runs on my GPU, survived. A lot of results were waiting to be uploaded, but when the internet was back the result queue cleared.

Seems that BOINC/CPDN combined does not tolerate a bad internet. :(

ChrisD

All tasks were trashed with the excuse, no Heartbeat. Maybe someone should make this service a bit more robust??
6) Questions and Answers : Unix/Linux : *** Running 32bit CPDN from 64bit Linux - Discussion *** (Message 53270)
Posted 16 Jan 2016 by ChrisD
Post:
sudo apt-get install libxmu:i386 gets the graphics to work, at least it does on my beta tasks, none of the other work I have has graphics available.


E:unable to locate package

seems this lib is obsolete now.

Can You help me with another package name?

Tnx :)


ChrisD

7) Questions and Answers : Windows : Errors in new HADAM3P_ ANZ Tasks (Message 53268)
Posted 16 Jan 2016 by ChrisD
Post:
At half past four this morning, my internet went down.
When I checked up on my crunchers this morning at 8, I found that my I7 cruncher had trashed all 7 running CPDN Tasks plus the 7 tasks that I had waiting in the queue.

BOINC Event Log was not long enough to tell me what had happened. All tasks seem to have died due to No Heartbeat.....

My Xeon Cruncher has survived and can tell me a little more. (Maybe because it has only 6 tasks running?)

When the internet went down, CPDN tasks could no longer upload their trickles. When this happens, they keep trying every 2 minutes.
Seven CPDN tasks that cannot upload their trickles seem to have trashed BOINC, which then lost control completely.

This is a devastating blow, and I will withdraw this machine from CPDN until I have found out what really happened.

Any help would be appreciated.
Thanks

ChrisD
8) Questions and Answers : Windows : Errors in new HADAM3P_ ANZ Tasks (Message 53261)
Posted 14 Jan 2016 by ChrisD
Post:
Great :)

Thanks.

ChrisD
9) Questions and Answers : Windows : Errors in new HADAM3P_ ANZ Tasks (Message 53257)
Posted 14 Jan 2016 by ChrisD
Post:
Re the failing Task mentioned in msg. 53247.

I suspended CPU using BOINC Manager, and when I re-enabled CPU the task died.

Here is the stderr:

Can anybody tell me what happened here?


<core_client_version>7.6.9</core_client_version>
<![CDATA[
<stderr_txt>
CPDN Monitor - Quit request from BOINC...
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=2592, selfPID=2592, iMonCtr=2
Global Worker:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=1
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=0, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=6620, selfPID=6352, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
Called boinc_finish

</stderr_txt>

This is not the first task that has been trashed when CPU has been suspended, so there is a problem somewhere. If that could be solved, we might save quite a few resends.
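
To compare stderr dumps like the one above across several failed tasks, the monitor lines can be pulled into a small table. This is only a rough sketch in Python: the pattern is guessed from the four lines quoted above, `tabulate_monitor_lines` is a made-up name, and I do not know what bRetVal, checkPID, selfPID or iMonCtr actually mean inside the CPDN monitor, so the script just lists them.

```python
import re

# Pattern guessed from the stderr lines quoted in this post (an assumption,
# not a documented CPDN format).
MONITOR_RE = re.compile(
    r"^(?P<role>[\w ]+?):: CPDN process is not running, exiting, "
    r"bRetVal = (?P<ret>\d+), checkPID=(?P<check_pid>\d+), "
    r"selfPID=(?P<self_pid>\d+), iMonCtr=(?P<ctr>\d+)$"
)

def tabulate_monitor_lines(stderr_text):
    """Return (role, bRetVal, checkPID, selfPID, iMonCtr) for each matching line."""
    rows = []
    for line in stderr_text.splitlines():
        m = MONITOR_RE.match(line.strip())
        if m:
            rows.append((
                m.group("role"),
                int(m.group("ret")),
                int(m.group("check_pid")),
                int(m.group("self_pid")),
                int(m.group("ctr")),
            ))
    return rows

SAMPLE_STDERR = """\
CPDN Monitor - Quit request from BOINC...
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=2592, selfPID=2592, iMonCtr=2
Global Worker:: CPDN process is not running, exiting, bRetVal = 0, checkPID=0, selfPID=0, iMonCtr=1
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=0, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=6620, selfPID=6352, iMonCtr=1
Model crash detected, will try to restart...
"""

if __name__ == "__main__":
    for row in tabulate_monitor_lines(SAMPLE_STDERR):
        print(row)
```

In this dump the Controller line is the only one where checkPID and selfPID differ (6620 vs 6352). Whether that matters is for someone who knows the monitor code to say.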

ChrisD
10) Questions and Answers : Windows : Errors in new HADAM3P_ ANZ Tasks (Message 53248)
Posted 13 Jan 2016 by ChrisD
Post:
Clicking on the plus next to stderr on at least one of the failed tasks shows a replanca error which is a task problem rather than the computer it is running on.


Thanks :)

That is comforting to know.

ChrisD
11) Questions and Answers : Windows : Errors in new HADAM3P_ ANZ Tasks (Message 53247)
Posted 13 Jan 2016 by ChrisD
Post:
Just got 4 more tasks. One trashed in 20 secs flat; another started OK, but I had to suspend processing for a while.

1/13/2016 12:58:48 PM | | Suspending computation - user request
1/13/2016 1:00:26 PM | | Resuming computation
1/13/2016 1:00:40 PM | climateprediction.net | Computation for task hadam3p_anz_h01n_201112_12_287_010252705_1 finished
1/13/2016 1:00:40 PM | climateprediction.net | Output file hadam3p_anz_h01n_201112_12_287_010252705_1_1.zip for task hadam3p_anz_h01n_201112_12_287_010252705_1 absent
etc.
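
For anyone who wants to scan a longer log for the same pattern, here is a minimal sketch in Python. It assumes the `date time | project | message` layout of the lines quoted above, an English AM/PM locale, and message wording exactly as shown. The names `parse_line` and `absent_after_resume` are made up for this example.

```python
from datetime import datetime

def parse_line(line):
    """Split one event-log line into (timestamp, project, message), or None."""
    parts = [p.strip() for p in line.split("|", 2)]
    if len(parts) != 3:
        return None
    try:
        # Assumed timestamp layout: M/D/YYYY H:MM:SS AM/PM
        stamp = datetime.strptime(parts[0], "%m/%d/%Y %I:%M:%S %p")
    except ValueError:
        return None
    return stamp, parts[1], parts[2]

def absent_after_resume(lines):
    """List (file name, seconds since last resume) for each 'absent' output file."""
    last_resume = None
    hits = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip anything that is not a log line, e.g. "etc."
        stamp, _project, message = parsed
        if message.startswith("Resuming computation"):
            last_resume = stamp
        elif message.startswith("Output file") and message.endswith("absent"):
            delay = None
            if last_resume is not None:
                delay = (stamp - last_resume).total_seconds()
            hits.append((message.split()[2], delay))
    return hits

SAMPLE_LOG = [
    "1/13/2016 12:58:48 PM | | Suspending computation - user request",
    "1/13/2016 1:00:26 PM | | Resuming computation",
    "1/13/2016 1:00:40 PM | climateprediction.net | Computation for task hadam3p_anz_h01n_201112_12_287_010252705_1 finished",
    "1/13/2016 1:00:40 PM | climateprediction.net | Output file hadam3p_anz_h01n_201112_12_287_010252705_1_1.zip for task hadam3p_anz_h01n_201112_12_287_010252705_1 absent",
    "etc.",
]

if __name__ == "__main__":
    for name, delay in absent_after_resume(SAMPLE_LOG):
        print(name, delay)
```

On the four lines above it reports the missing .zip 14 seconds after the resume, which is the kind of number that would be useful to compare between failed tasks.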

This process had been processing for 16 min 33 secs...

Tasks running now: 5, of which 4 have passed one hour of computing time and one of the new ones has reached 5 minutes.
One task in queue, destiny unknown...

I wish I could do some more debugging.
Any suggestions?

ChrisD (pulling his hair out)
12) Questions and Answers : Windows : Errors in new HADAM3P_ ANZ Tasks (Message 53245)
Posted 13 Jan 2016 by ChrisD
Post:
Just started computing the new batch of HADAM3P_ANZ tasks, but both my machines have errors.
My I7-3770K has ditched 3 tasks, all in just a few secs, but is processing 4 tasks OK.
My Xeon 6790 has trashed 4 and is running two...

Plenty of HDD space and plenty of RAM.
The I7 has 16 GB and the Xeon has 24 GB, the max RAM possible.

It happens so fast that I can not see what goes wrong :(

Is it only me??

ChrisD
13) Message boards : Number crunching : Error while computing (Message 53172)
Posted 24 Dec 2015 by ChrisD
Post:
Thanks for the replies :)

Through SETI Beta I ended up reading about a BOINC bug that causes BOINC to choke when it cannot find the DNS server, and thus to trash the tasks that were trying to upload files.

If BOINC can not fix this, maybe CPDN could be made a bit more 'BOINC-safe'?

Suggestion: a small mod to the exception handler, if 'file not found error', wait for 2 secs and retry, say twice before really giving up. (I know, back in DOS days, several retries had been done before reporting an Error, but this is a long way from DOS, so maybe this will help anyway.)
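
A minimal sketch of that retry idea, in Python purely for illustration. I have not seen the CPDN source and do not assume it looks anything like this; `open_with_retry` and its parameters are made-up names.

```python
import time

def open_with_retry(path, retries=2, delay_s=2.0, opener=open, sleep=time.sleep):
    """Open `path`; on FileNotFoundError wait `delay_s` seconds and try again.

    After `retries` failed retries the original error is re-raised, so a file
    that is really missing is still reported as an error.
    `opener` and `sleep` are injectable only to make the function easy to test.
    """
    attempt = 0
    while True:
        try:
            return opener(path, "rb")
        except FileNotFoundError:
            if attempt >= retries:
                raise  # really give up
            attempt += 1
            sleep(delay_s)

if __name__ == "__main__":
    try:
        # Hypothetical file name; waits 2 s twice before giving up.
        open_with_retry("no_such_result.zip")
    except FileNotFoundError as err:
        print("gave up:", err)
```

With the defaults this makes three attempts in total and adds at most 4 seconds before the error is reported.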

How about letting the CPDN task run for a few seconds after having created the final .zips, (a couple of dummy time loops will do) thus preventing BOINC from reporting the 'CPDN Task not running', prematurely?

Well, it's a comfort that the trickles, at least, get through.
(Is there a way to make the BOINC Event Log tell what task has requested a trickle-up?)

ChrisD

p.s.

Sorry Les, this was just for fun.

"The system cannot find the drive specified.

What drive?"

I was just trying to say that such an error message is no good.
At least it should say which drive, and maybe even the name of the file the program is trying to access.
14) Message boards : Number crunching : Error while computing (Message 53166)
Posted 23 Dec 2015 by ChrisD
Post:
Just checked my reported tasks, and once again I find a task that has crashed.

Workunit# 10238153

"Error while computing" is the report, but all 46,379 time steps seem to have been reported.

???

Error msg: The system cannot find the drive specified.
(0xf) - exit code 15 (0xf)

What drive? All 7 CPDN tasks are running, and they have all checkpointed OK within the last 5 minutes, so they can certainly find 'the' drive.

To me it seems that BOINC erases files too fast, that is, before CPDN has finished its checks.

Another task trashed, thanks to BOINC and CPDN not syncing correctly.

ChrisD

EDIT:

Checked the log. All checkpoints are reported in the log; the only difference from an OK task is that the last two .zips are reported missing.

Earlier I have noticed BOINC deleting files even if all tasks were suspended.
15) Message boards : Number crunching : transient HTTP error (Message 53096)
Posted 15 Dec 2015 by ChrisD
Post:
You can always abandon the task, but personally I think this will not be a very good idea. You have already spent the power doing the calculations, so I think You should get the results back to the scientists.

I am, myself, on a GSM Internet with a fixed quota, so I think I understand Your dilemma.

I have the following suggestions for You:

If You are temporarily running out of quota, suspend network traffic until Your quota is renewed.

In BOINC Manager/Options/Computing Pref./Network, fill out the "Limit Usage" fields. Here You can grant BOINC the quota You can afford to use.

If Your Internet connection is slow, Limit the Upload Rate. If I understand this BOINC thing correctly, this will make BOINC more patient, granting You more time to transfer the files before BOINC gives up.

Maybe we can persuade the crew to tell us, for each task, the anticipated data x-fer size. This way You can actively deselect the tasks You can not afford to transfer.
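
If such sizes were published, picking tasks against a quota would be simple arithmetic. A small sketch in Python, with all task names and sizes invented for the example (CPDN does not publish these numbers as far as I know), and `affordable_tasks` a made-up helper:

```python
def affordable_tasks(quota_mb, task_sizes_mb):
    """Pick tasks smallest-first until the remaining quota is used up.

    Returns (chosen task names, quota left in MB). Taking the smallest
    transfers first fits the largest number of tasks into the quota.
    """
    chosen = []
    remaining = quota_mb
    for name, size in sorted(task_sizes_mb.items(), key=lambda kv: kv[1]):
        if size > remaining:
            break  # everything after this is at least as large
        chosen.append(name)
        remaining -= size
    return chosen, remaining

if __name__ == "__main__":
    # Hypothetical numbers: 500 MB of quota left, three candidate tasks.
    sizes = {"task_a": 300, "task_b": 150, "task_c": 120}
    print(affordable_tasks(500, sizes))
```

Here task_c and task_b fit (270 MB together), leaving 230 MB, which is not enough for task_a.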

Whatever You choose, I wish You the best of luck and happy crunching. :)

ChrisD
16) Message boards : Number crunching : transient HTTP error (Message 53094)
Posted 14 Dec 2015 by ChrisD
Post:
Don't worry.
I had the same error a little after 5pm today. My GSM based internet had a short dropout.

The file was successfully uploaded after a 10 minute project backoff.

ChrisD
17) Message boards : Number crunching : Model crash?? (Message 53088)
Posted 13 Dec 2015 by ChrisD
Post:
Thanks for clarifying :)

ChrisD

18) Message boards : Number crunching : Model crash?? (Message 53084)
Posted 13 Dec 2015 by ChrisD
Post:
Model crashed: REPLANCA: PP HEADERS ON ANCILLARY FILE DO NOT MATCH
<file_name>hadam3p_pnw_xirp_200912_12_010214469_1_12.zip</file_name>
<error_code>-161 (not found)</error_code>

2 models crashed at the very last Time Step with similar Error Messages....

Can anybody see what has happened?

ChrisD
19) Message boards : Number crunching : CPDN process is not running?? (Message 53048)
Posted 7 Dec 2015 by ChrisD
Post:
Thanks Dave, for the suggestion. :)

Right now, with 9 CPDN tasks running (4 WAH2 and 5 PNW), memory use is 6 GB, with another 6 GB as cache.

Unless I have a failing memory controller, Windows should be able to keep every task fed with sufficient working space.

Windows Task Manager is not very detailed, so maybe I should look for a better tool to find out how much my memory load really is.

ChrisD
20) Message boards : Number crunching : CPDN process is not running?? (Message 53044)
Posted 7 Dec 2015 by ChrisD
Post:
Just got 5 WAH2 tasks, all exited within 25 seconds with this error message:

<stderr_txt>
Signal 11 received, exiting...
10:54:27 (4520): called boinc_finish(193)
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=5188, selfPID=5188, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=5188, selfPID=1904, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
10:54:31 (1904): called boinc_finish(0)

Can somebody help me by telling me what is wrong here?

ChrisD


©2021 climateprediction.net