Posts by Richard Haselgrove

1) Questions and Answers : Unix/Linux : BOINC crashes when running CPDN (Message 56870)
Posted 18 Sep 2017 by Richard Haselgrove
Post:
Haven't narrowed it down closely enough - it may be a combination of a crashed model for one of the familiar reasons, with a full dump/trace in stderr_txt, plus all those error -161 failed uploads. We don't think it's down to the new v7.8.2 alone, because people are reverting to previous versions and BOINC still won't start.

There is a proposed fix in the BOINC code-base, but unfortunately many fixes weren't deployed in what was announced as a new public release. There's talk of another version soon, this time with all the outstanding fixes, this one included - so fingers crossed. No timetable yet.

Unfortunately, task 20729988 - that started this discussion - probably won't ever be reported: we had to delete it to get the BOINC client to start running again. Might be worth keeping an eye on reissues, or other batch 658 tasks, to see if a pattern emerges.
2) Questions and Answers : Unix/Linux : BOINC crashes when running CPDN (Message 56863)
Posted 18 Sep 2017 by Richard Haselgrove
Post:
I'm working in that BOINC thread 11853 that Jim mentions, and a couple of people have sent me their files for investigation - both Mac users, as it happens.

Both users have a failed CPDN task in their logs - WAH2 PNW. The task record is showing a crash dump, and 51 upload files - a total of about 14 KB for the <result> section in client_state.xml.

There is a growing suspicion that the BOINC client's buffers can't handle that many upload files, and these tasks may be causing the problems. Has anyone successfully run one of these tasks, and - more to the point - has anyone completed one and uploaded the results successfully?
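
For anyone wanting to check their own host, a rough sketch of counting the upload file references per task might look like this. It assumes each <result> in client_state.xml carries a <name> and one <file_ref> per output file, and that the file parses as well-formed XML; the sample data and task name below are made up.

```python
import xml.etree.ElementTree as ET


def count_file_refs(state_xml: str) -> dict[str, int]:
    """Map each result name to the number of <file_ref> entries it holds."""
    root = ET.fromstring(state_xml)
    counts = {}
    for result in root.iter("result"):
        name = result.findtext("name", default="(unnamed)").strip()
        counts[name] = len(result.findall("file_ref"))
    return counts


# Tiny made-up sample; a real client_state.xml is far larger.
SAMPLE = """<client_state>
  <result>
    <name>example_task_0</name>
    <file_ref><file_name>example_task_0_1.zip</file_name></file_ref>
    <file_ref><file_name>example_task_0_2.zip</file_name></file_ref>
  </result>
</client_state>"""

print(count_file_refs(SAMPLE))
```

To run it against a real host, read client_state.xml from the BOINC data directory (with the client stopped) and pass its text to count_file_refs; any task showing around 50 entries would be a candidate.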
3) Message boards : Number crunching : For the betterment of BOINC (Message 56553)
Posted 25 Jul 2017 by Richard Haselgrove
Post:
Ubuntu (and I think one or two other distros) has an active volunteer package maintainer for BOINC.
4) Message boards : Number crunching : For the betterment of BOINC (Message 56550)
Posted 25 Jul 2017 by Richard Haselgrove
Post:
Version numbers are derived from tags in the central BOINC git repository: they should be consistent across platforms.

The current (public, recommended) Windows version is also v7.6.33: Mac has an update to v7.6.34 to cope with an incompatibility in the latest release of the OS. So comparisons should be valid.
5) Message boards : Number crunching : Computation error (Message 53135)
Posted 17 Dec 2015 by Richard Haselgrove
Post:
That's why I had to go all the way back to 10-Dec-2015, and it shows in stdoutdae.txt format - the Manager display scrolls older entries off the top if there's no restart.
6) Message boards : Number crunching : Computation error (Message 53133)
Posted 17 Dec 2015 by Richard Haselgrove
Post:
The BOINC Manager Event Log gives the data location at startup:

10-Dec-2015 22:46:36 [---] Starting BOINC client version 7.6.20 for windows_intelx86
10-Dec-2015 22:46:36 [---] log flags: file_xfer, sched_ops, task, cpu_sched, sched_op_debug
10-Dec-2015 22:46:36 [---] Libraries: libcurl/7.45.0 OpenSSL/1.0.2d zlib/1.2.8
10-Dec-2015 22:46:36 [---] Data directory: C:\BOINCdata
10-Dec-2015 22:46:36 [---] Running under account Richard Haselgrove

(note that my location is non-standard, precisely to avoid this kind of confusion)
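
If the log has been saved to a file (stdoutdae.txt, for instance), the same line can be picked out with a few lines of script. This is only a sketch: it assumes the startup line has the "Data directory:" format shown above, and the function name is my own.

```python
import re

# Matches the startup line shown in the Event Log excerpt above.
DATA_DIR = re.compile(r"\[---\] Data directory: (.+)$")


def find_data_dir(log_text):
    """Return the first data directory reported in the log, or None."""
    for line in log_text.splitlines():
        match = DATA_DIR.search(line)
        if match:
            return match.group(1).strip()
    return None


LOG = r"""10-Dec-2015 22:46:36 [---] Libraries: libcurl/7.45.0 OpenSSL/1.0.2d zlib/1.2.8
10-Dec-2015 22:46:36 [---] Data directory: C:\BOINCdata
10-Dec-2015 22:46:36 [---] Running under account Richard Haselgrove"""

print(find_data_dir(LOG))
```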

But once you find the root location, the contents are standard:

7) Message boards : Number crunching : Computation error (Message 53123)
Posted 16 Dec 2015 by Richard Haselgrove
Post:
I had that problem with S@H and resolved it.

Can you locate the S@H folder in BOINC's data directory (projects sub-directory)? CPDN's directory should be right next to it. And it would have been created as soon as you attached to the project, ready to receive the downloaded files.

Edit - the folder will be called 'climateprediction.net', rather than the abbreviated CPDN we write here to save typing.
8) Message boards : Number crunching : Computation error (Message 53117)
Posted 16 Dec 2015 by Richard Haselgrove
Post:
I'm sorry, like i said, there's no CPDN folders on my laptop..The download didn't make it that far.

BOINC wouldn't have attempted to start the app until all downloads are complete - and that's a significant number of both programs and data files before the first task runs.

It has been known for anti-virus applications to prevent new and unknown programs from running.
9) Message boards : Number crunching : transient HTTP error (Message 53102)
Posted 15 Dec 2015 by Richard Haselgrove
Post:
Set the <http_debug> logging option, and you will get better information about exactly why the upload is failing. It might be something, like an overfull storage disk, that the staff can do something about, but they need your detailed input first.
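
The flag goes in the <log_flags> section of cc_config.xml in the BOINC data directory. As a sketch, the snippet below only builds and prints the XML; the helper name is my own, and if you already have a cc_config.xml you should merge the flag in by hand rather than overwrite the file.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Build a minimal cc_config document with the given log flags set to 1."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for flag in flags:
        ET.SubElement(log_flags, flag).text = "1"
    ET.indent(root)  # pretty-print; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


config_text = build_cc_config(["http_debug"])
print(config_text)
```

After saving the file, restart the client or ask it to re-read its config files, then retry the upload and look at the extra lines in the Event Log.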
10) Message boards : Number crunching : CPDN SITE STILL UNRESPONIVE (Message 52761)
Posted 29 Oct 2015 by Richard Haselgrove
Post:
BOINC treats uploads on the project level rather than the upload URL level, so pending WAH2 PNW uploads might never be attempted if you have uploads for other regions.

We got BOINC to change this, starting (I think) with the first BOINC v7.x releases. If you are using a new-ish client, every upload should be tried at least once as soon as it's ready, rather than waiting behind uploads from other regions that might be stuck as Ian describes.
11) Message boards : Number crunching : WAH2 CREDITS SET TO LOW (Message 52630)
Posted 24 Sep 2015 by Richard Haselgrove
Post:
my computer 1364207

Byron, I'd be wary about taking too much timing data from that machine - it looks to be under considerable stress, and is throwing a lot of errors.

Error tasks for computer 1364207

Even with 256 GB of RAM to support the 40 running models - 6 GB each should be plenty - can the hard disk system cope with all 40 models checkpointing and preparing upload files in quick succession? That could be a big bottleneck, since all the disk accesses have to be to the same drive (or presumably RAID array) over the same interface. CPDN models can be sensitive to delays around those upload generation moments.
12) Message boards : Number crunching : WAH2 CREDITS SET TO LOW (Message 52629)
Posted 24 Sep 2015 by Richard Haselgrove
Post:
It's been said that BOINC takes about 10 tasks from a given project to work out times. On cpdn, it needs to be "several" of EVERY different model type, to work out how long that type of model is going to take.

The remark about projects needing 10 completed tasks (actually 11 - "more than 10") before calculating realistic initial runtime estimates applies to projects running the 'CreditNew' version of the server code. Doesn't matter whether they actually give credit that way - it's the runtime estimation which is all done on the server in those cases.

But we don't have that server code at this project. Any adjustment to the initial estimates is done locally by the old DCF mechanism - and that can only keep track of one value at a time. Unless the task sizes estimated by the project are accurately in proportion to their eventual total running time, different model types will pull DCF in different directions, and it'll never be able to settle to a common value which is right for all models.
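
A toy illustration of that tug-of-war, under loud assumptions: this is NOT the client's real DCF update rule, just simple smoothing toward each task's actual/estimated ratio, and the two model types and their runtimes are invented.

```python
def update_dcf(dcf, estimated, actual, rate=0.5):
    """Move the single correction factor part-way toward this task's ratio."""
    return dcf + rate * (actual / estimated - dcf)


# Hypothetical model types: (project estimate in hours, real runtime in hours).
MODEL_TYPES = {"type_a": (100.0, 50.0), "type_b": (100.0, 200.0)}

dcf = 1.0
history = []
for i in range(8):
    # Alternate between the two model types as tasks complete.
    name = "type_a" if i % 2 == 0 else "type_b"
    estimated, actual = MODEL_TYPES[name]
    dcf = update_dcf(dcf, estimated, actual)
    history.append((name, round(dcf, 3)))

for name, value in history:
    print(f"after {name}: DCF = {value}")
```

With these made-up numbers the ideal factor would be 0.5 for one type and 2.0 for the other; the single shared value just bounces between two intermediate figures and is right for neither.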
13) Message boards : Number crunching : wah tasks failed (Message 52583)
Posted 16 Sep 2015 by Richard Haselgrove
Post:
And time estimate errors in one application will go on affecting all applications for the project, until CPDN can finally complete the migration to a new version of the BOINC server software which can decouple the runtime estimate smoothing of the different application versions.

But the current Runtime Estimation code is so crude that I'd hesitate to advocate its adoption here.
14) Questions and Answers : Windows : Fortran error dialog boxes (Message 52577)
Posted 16 Sep 2015 by Richard Haselgrove
Post:
It would help the developers if Brian, and anybody else who observes this problem, could either capture a screenshot of the error message before dismissing the dialog box, or transcribe the full contents of the dialog.

From earlier reports describing these errors, it usually appears to be a data error, so being able to identify the task name, and the faulty file name/details, would help considerably.
15) Message boards : Number crunching : wah tasks failed (Message 52539)
Posted 11 Sep 2015 by Richard Haselgrove
Post:
And now I've got a 'Signal 11' crash of my own.

<result>
<name>wah2_eu2_c86m_1928_1_010155439_0</name>
<final_cpu_time>92076.300000</final_cpu_time>
<final_elapsed_time>95316.104979</final_elapsed_time>
<exit_status>0</exit_status>
<state>3</state>
<platform>windows_intelx86</platform>
<version_num>705</version_num>
<final_peak_working_set_size>327622656</final_peak_working_set_size>
<final_peak_swap_size>299331584</final_peak_swap_size>
<final_peak_disk_usage>11765</final_peak_disk_usage>
<stderr_out>
<![CDATA[
<stderr_txt>
Signal 11 received, exiting...

17:37:31 (12784): called boinc_finish(193)

Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=16120, iMonCtr=2

Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=0, selfPID=15396, iMonCtr=2

Model crash detected, will try to restart...

Leaving CPDN_Main::Monitor...

17:37:43 (15396): called boinc_finish(0)

That one had been plodding along quietly, about 26.5 hours in and maybe 5% done.

Windows 7, nothing untoward shown in either the BOINC logs or the system Event Viewer. It does seem that 'Signal 11' is the default error message for these applications, whether it's a startup problem as others have reported, or a model crash well into the run.

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=18882396
16) Message boards : Number crunching : wah tasks failed (Message 52534)
Posted 11 Sep 2015 by Richard Haselgrove
Post:
How often should trickles be uploaded with the WAHs? I have 9 tasks across three hosts that have been running for 15-20 hours and I see no trickles logged. And, I see no upload attempts in the BOINC message logs.

With the great variation in computer speeds, it's probably best to answer that in terms of progress made, rather than absolute time.

With 12 'simulation months' to be completed by each model, the trickle+upload pair should be generated around 8.3%, 16.7%, 25% ... progress. My leader is still only at 4.064% (after 21 hours), so it would be some time before I could fill in the third decimal place for the actual moment when it happens.
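
The arithmetic can be sketched as follows, assuming one trickle per simulation month and (a big assumption) that progress stays linear in elapsed time; the function names are my own.

```python
SIMULATION_MONTHS = 12


def trickle_points(months=SIMULATION_MONTHS):
    """Progress percentages at which each month's trickle should appear."""
    return [100.0 * m / months for m in range(1, months + 1)]


def hours_until(target_pct, current_pct, elapsed_hours):
    """Elapsed hours at which target_pct is reached, assuming linear progress."""
    return elapsed_hours * target_pct / current_pct


for month, pct in enumerate(trickle_points(), start=1):
    print(f"month {month:2d}: {pct:6.2f}%")

# Using the figures quoted above: 4.064% done after 21 hours.
first = trickle_points()[0]
print(f"first trickle expected at about {hours_until(first, 4.064, 21.0):.1f} hours elapsed")
```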
17) Message boards : Number crunching : wah tasks failed (Message 52522)
Posted 10 Sep 2015 by Richard Haselgrove
Post:
Just got another failure on my older laptop. That is 2 failed tasks on that laptop (computer name "Andy") and 1 fail so far on the new laptop ("Beats").
Seems to be a problem with the zip file each time.
Watching a movie on Beats right now so task is suspended at 1.189% for now.

Downloaded 3 more tasks on Andy and 2 of them are running. We'll see how they go...

We can't see your computer names, so we'll have to guess which is which. But I see that all computers on your account have been upgraded to Windows 10 - this might possibly be significant. The project staff are going to check in the morning whether there is a significant correlation across the database between running Windows 10 and these new task failures.
18) Message boards : Number crunching : New experiment launched: weather@home 2015: Western US Drought (Message 52407)
Posted 10 Aug 2015 by Richard Haselgrove
Post:
As 3rkko says, there is a very specific and identified incompatibility between BOINC clients later than 7.0.36 running in service mode, and the 7.22 / 7.24 CPDN applications for Windows deployed during 2014: yes, that includes HadAM3P-HadRM3P Africa. You can solve that any which way you like: not running service mode, not running a client later than 7.0.36, or not running the affected climate models.

This bug is exclusive to BOINC - any Folding issues are separate.
19) Message boards : Number crunching : I need to clear aborted models for others to crunch (Message 52352)
Posted 27 Jul 2015 by Richard Haselgrove
Post:
Bernard

It doesn't matter if a few models get missed here and there.
This project relies on an overkill of models to build the ensembles, and if there ARE any critical ones missed, the researchers can always submit more. And may have already done so, not just in your case, but for all of those computers that are failing to return valid data.

And if you do a Remove, your computer will get a new computer ID, and not know about the previous models.

And then you can merge the computer records, and see them again.
20) Message boards : Number crunching : Cross-project ID's question (Message 52308)
Posted 22 Jul 2015 by Richard Haselgrove
Post:
Thanks for the reply Richard. I looked at the computers on my account and realised that this morning, before I restored the backup, I had updated the graphics drivers from 340.76 to 346.59. The restored client_state.xml was expecting 340.76 and when it saw 346.59 it must have thought this is a new machine and set up a new hostid automatically.

Now that I have a reason, I am not too bothered if Boinc thinks its on a new machine given that I'll probably be upgrading to new Ubuntu versions in future.

Before I read your post my machine had already contacted the server twice...and its done a day's crunching so I think it probably safe to leave things as they are for the moment...but I'll keep an eye on things.

Thanks for your help.

No, please read what I said. It's not your changed graphics driver that matters, it's the <rpc_seqno>, or "Remote Procedure Call - sequence number". When you restored from backup, you must have re-imported an old sequence number. The BOINC server software sees that as an attempt at cheating - trying to boost Recent Average Credit by using two computers in parallel - and responds by assigning a new Host ID.

Looking at your computers on the website, I see

Computer ID ... Last contact
------------------------------------
1370014 ... ... ... 22 Jul 2015 15:30:14 UTC
1362952 ... ... ... 21 Jul 2015 19:56:20 UTC

- implying that the computer now running is using the wrong ID number. That will likely invalidate your running tasks if allowed to continue. Now that you've got this far, it might be easiest to merge the two host records so the tasks are assigned to the correct host before they are completed.
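
To make the sequence-number logic concrete, here is a toy model of the check described above. It is not the real BOINC server code: the class, the ID numbering and the exact comparison are assumptions for illustration only.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class HostRecord:
    host_id: int
    rpc_seqno: int


class ToyScheduler:
    """Hypothetical stand-in for the server's host bookkeeping."""

    def __init__(self, first_new_id=1000):
        self.hosts = {}
        self._ids = count(first_new_id)

    def contact(self, host_id, rpc_seqno):
        """Return the host ID the client should use after this request."""
        record = self.hosts.get(host_id)
        if record is None or rpc_seqno < record.rpc_seqno:
            # Unknown host, or a sequence number going backwards
            # (e.g. a restored backup): create a fresh host record.
            new = HostRecord(next(self._ids), rpc_seqno)
            self.hosts[new.host_id] = new
            return new.host_id
        record.rpc_seqno = rpc_seqno
        return host_id


server = ToyScheduler()
original = server.contact(host_id=0, rpc_seqno=0)      # first contact: new ID
same = server.contact(host_id=original, rpc_seqno=5)   # normal progress: same ID
restored = server.contact(host_id=original, rpc_seqno=3)  # old backup: new ID
print(original, same, restored)
```

The third call is the restored-backup case: the number has gone backwards, so the toy server hands out a second ID, which is what the two host records above suggest happened here.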

