Posts by Richard Haselgrove

1) Message boards : Number crunching : Uploading files fails (Message 62552)
Posted 31 May 2020 by Richard Haselgrove
Post:
Ta. Simple timeout, not a certificate problem. As you were - false alarm.
2) Message boards : Number crunching : Uploading files fails (Message 62550)
Posted 31 May 2020 by Richard Haselgrove
Post:
Can you confirm that it is from the three recent batches, then go to the event log and say what the message is? That gives us a bit more information to pass on to Andy and, via him, to the servers in Oz if needed.
If you see the phrase "transient HTTP error" in your log, please enable the additional detail "http_debug" in BOINC Diagnostic Log Flags (Ctrl+Shift+F). Just retry one upload to get the details, and turn it off again - otherwise it rather fills up your log.

There is a BOINC-wide problem at the moment connecting with some https servers. We need to check whether this is causing the spate of upload problems starting yesterday.
3) Message boards : Number crunching : Credits (Message 62520)
Posted 27 May 2020 by Richard Haselgrove
Post:
An export - or dump of updated files for harvesting - has happened, but the crediting of completed tasks hasn't. Which is the wrong order, really, but it's certainly progress.
4) Message boards : Number crunching : Work available and being requested but none downloaded (Message 62361)
Posted 29 Apr 2020 by Richard Haselgrove
Post:
Wed 29 Apr 2020 08:37:42 BST | climateprediction.net | [work_fetch] request: CPU (25064.93 sec, 3.00 inst)
Wed 29 Apr 2020 08:37:42 BST | climateprediction.net | Sending scheduler request: Requested by user.
Wed 29 Apr 2020 08:37:42 BST | climateprediction.net | Requesting new tasks for CPU
Wed 29 Apr 2020 08:37:48 BST | climateprediction.net | Scheduler request completed: got 0 new tasks
Wed 29 Apr 2020 08:37:48 BST | climateprediction.net | No tasks sent
Wed 29 Apr 2020 08:37:48 BST | climateprediction.net | Project requested delay of 3636 seconds
Well, he's asking for a perfectly reasonable amount of work, but the server isn't sending it.

At most projects, I'd say 'check your project preferences - make sure you're allowing the type of work currently available'. But this project has those options turned off.
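As a rough sketch, the size of that request can be read straight out of the [work_fetch] line in the log above. The regular expression and the function name parse_work_request are my own assumptions for this one message shape, not anything taken from the BOINC source.

```python
import re

# Log line copied from the post above.
LINE = ("Wed 29 Apr 2020 08:37:42 BST | climateprediction.net | "
        "[work_fetch] request: CPU (25064.93 sec, 3.00 inst)")

# Hypothetical pattern for this one message shape (an assumption).
REQUEST_RE = re.compile(
    r"\[work_fetch\] request: (?P<resource>\w+) "
    r"\((?P<seconds>[\d.]+) sec, (?P<instances>[\d.]+) inst\)"
)


def parse_work_request(line):
    """Return (resource, seconds, instances), or None if the line does not match."""
    match = REQUEST_RE.search(line)
    if match is None:
        return None
    return (
        match.group("resource"),
        float(match.group("seconds")),
        float(match.group("instances")),
    )


resource, seconds, instances = parse_work_request(LINE)
hours_total = seconds / 3600  # convert the requested seconds to hours
print(f"{resource}: {hours_total:.2f} hours requested, {instances:.0f} inst")
```

That comes to just under seven hours of CPU work, which is a modest request by the standards of this project's models.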
5) Message boards : Number crunching : Download errors on UK Met Office HadAM4 at N216 resolution v8.52 tasks (Message 62357)
Posted 29 Apr 2020 by Richard Haselgrove
Post:
A second (near identical) machine has downloaded a new task which is currently ready to run. I'll re-enable the machine with yesterday's problem.

Has anyone heard what the problem was? I couldn't make any sense of the mixed messages.

Edit - problem machine has downloaded new work and is running again.
6) Message boards : Number crunching : Download errors on UK Met Office HadAM4 at N216 resolution v8.52 tasks (Message 62349)
Posted 28 Apr 2020 by Richard Haselgrove
Post:
Thanks Dave. Other tasks are running to completion, so it's not the libs here. A third has just failed - I'd better set NNT overnight.
7) Message boards : Number crunching : Download errors on UK Met Office HadAM4 at N216 resolution v8.52 tasks (Message 62346)
Posted 28 Apr 2020 by Richard Haselgrove
Post:
Two successive tasks have failed to download cleanly. Task 21931151 reports:

<message>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>a10l_867_atmos.gz</file_name>
  <error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>ic_N216_2002_12_000004.nc.gz</file_name>
  <error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>HAPPI_1.5K_sst_N216_2095-10-01_2096-04-30.gz</file_name>
  <error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
</message>
but the local client reports:

Tue 28 Apr 2020 11:34:04 BST | climateprediction.net | [unparsed_xml] SCHEDULER_REPLY::parse(): unrecognized ?xml
Tue 28 Apr 2020 11:34:04 BST | climateprediction.net | [unparsed_xml] SCHEDULER_REPLY::parse(): unrecognized upload_template
Tue 28 Apr 2020 11:34:06 BST | climateprediction.net | Started download of a10l_867_atmos.gz
Tue 28 Apr 2020 11:34:06 BST | climateprediction.net | Started download of ic_N216_2002_12_000004.nc.gz
Tue 28 Apr 2020 11:34:06 BST | climateprediction.net | Started download of HAPPI_1.5K_sst_N216_2095-10-01_2096-04-30.gz
Tue 28 Apr 2020 11:34:08 BST | climateprediction.net | Temporarily failed download of a10l_867_atmos.gz: connect() failed
Tue 28 Apr 2020 11:34:08 BST | climateprediction.net | Temporarily failed download of ic_N216_2002_12_000004.nc.gz: connect() failed
Tue 28 Apr 2020 11:34:08 BST | climateprediction.net | Temporarily failed download of HAPPI_1.5K_sst_N216_2095-10-01_2096-04-30.gz: connect() failed
These two sets of messages don't seem to tie up.

Machine is just completing task 21922323, successfully downloaded 22 April. Anyone know of any changes since then?

Edit - second failure is task 21922854 - still in reporting delay, but similar symptoms locally.
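To put the two reports side by side, here is a minimal sketch that pulls the file names and failure reasons out of each. The text is copied from the post above. The helper names server_errors and client_errors, and the regular expression for the client log, are assumptions of mine for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Server-side task report, copied from the post above.
SERVER_MESSAGE = """<message>
WU download error: couldn't get input files:
<file_xfer_error>
  <file_name>a10l_867_atmos.gz</file_name>
  <error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>ic_N216_2002_12_000004.nc.gz</file_name>
  <error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>HAPPI_1.5K_sst_N216_2095-10-01_2096-04-30.gz</file_name>
  <error_code>-119 (md5 checksum failed for file)</error_code>
</file_xfer_error>
</message>"""

# Client-side event log lines, copied from the post above.
CLIENT_LOG = """\
Tue 28 Apr 2020 11:34:08 BST | climateprediction.net | Temporarily failed download of a10l_867_atmos.gz: connect() failed
Tue 28 Apr 2020 11:34:08 BST | climateprediction.net | Temporarily failed download of ic_N216_2002_12_000004.nc.gz: connect() failed
Tue 28 Apr 2020 11:34:08 BST | climateprediction.net | Temporarily failed download of HAPPI_1.5K_sst_N216_2095-10-01_2096-04-30.gz: connect() failed
"""

# Hypothetical pattern for the client's "Temporarily failed" lines.
FAILED_RE = re.compile(r"Temporarily failed download of (?P<name>\S+): (?P<reason>.+)$")


def server_errors(message_xml):
    """Map file name -> error text from the <file_xfer_error> blocks."""
    root = ET.fromstring(message_xml)
    return {
        err.findtext("file_name").strip(): err.findtext("error_code").strip()
        for err in root.iter("file_xfer_error")
    }


def client_errors(log_text):
    """Map file name -> failure reason from the client event log."""
    found = {}
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match:
            found[match.group("name")] = match.group("reason").strip()
    return found


server = server_errors(SERVER_MESSAGE)
client = client_errors(CLIENT_LOG)
for name in sorted(set(server) | set(client)):
    print(f"{name}: server says {server.get(name)!r}, client says {client.get(name)!r}")
```

The same three files turn up in both reports, but the server records an md5 checksum failure where the client logged a failed connect().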
8) Questions and Answers : Unix/Linux : BOINC crashes when running CPDN (Message 56870)
Posted 18 Sep 2017 by Richard Haselgrove
Post:
Haven't narrowed it down closely enough - it may be a combination of a crashed model for one of the familiar reasons, with a full dump/trace in stderr_txt, plus all those error -161 failed uploads. We don't think it's down to the new v7.8.2 alone, because people are reverting to previous versions and BOINC still won't start.

There is a proposed fix in the BOINC code-base, but unfortunately many fixes weren't deployed in what was announced as a new public release. There's talk of another version soon, this time with all the fixes, this one included - so fingers crossed. No timetable yet.

Unfortunately, task 20729988 - that started this discussion - probably won't ever be reported: we had to delete it to get the BOINC client to start running again. Might be worth keeping an eye on reissues, or other batch 658 tasks, to see if a pattern emerges.
9) Questions and Answers : Unix/Linux : BOINC crashes when running CPDN (Message 56863)
Posted 18 Sep 2017 by Richard Haselgrove
Post:
I'm working in that BOINC thread 11853 that Jim mentions, and a couple of people have sent me their files for investigation - both Mac users, as it happens.

Both users have a failed CPDN task in their logs - WAH2 PNW. The task record is showing a crash dump, and 51 upload files - a total of about 14 KB for the <result> section in client_state.xml.

There is a growing suspicion that the BOINC client's buffers can't handle that many upload files, and these tasks may be causing the problems. Has anyone successfully run one of these tasks, and - more to the point - has anyone completed one and uploaded the results successfully?
10) Message boards : Number crunching : For the betterment of BOINC (Message 56553)
Posted 25 Jul 2017 by Richard Haselgrove
Post:
Ubuntu (and I think one or two other distros) has an active volunteer package maintainer for BOINC.
11) Message boards : Number crunching : For the betterment of BOINC (Message 56550)
Posted 25 Jul 2017 by Richard Haselgrove
Post:
Version numbers are derived from tags in the central BOINC git repository: they should be consistent across platforms.

The current (public, recommended) Windows version is also v7.6.33: Mac has an update to v7.6.34 to cope with an incompatibility in the latest release of the OS. So comparisons should be valid.
12) Message boards : Number crunching : Computation error (Message 53135)
Posted 17 Dec 2015 by Richard Haselgrove
Post:
That's why I had to go all the way back to 10-Dec-2015, and it shows in stdoutdae.txt format - the Manager display scrolls older entries off the top if there's no restart.
13) Message boards : Number crunching : Computation error (Message 53133)
Posted 17 Dec 2015 by Richard Haselgrove
Post:
The BOINC Manager Event Log gives the data location at startup:

10-Dec-2015 22:46:36 [---] Starting BOINC client version 7.6.20 for windows_intelx86
10-Dec-2015 22:46:36 [---] log flags: file_xfer, sched_ops, task, cpu_sched, sched_op_debug
10-Dec-2015 22:46:36 [---] Libraries: libcurl/7.45.0 OpenSSL/1.0.2d zlib/1.2.8
10-Dec-2015 22:46:36 [---] Data directory: C:\BOINCdata
10-Dec-2015 22:46:36 [---] Running under account Richard Haselgrove

(note that my location is non-standard, precisely to avoid this kind of confusion)
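If the log has been saved to a file, the same line can be picked out with a short script. This is only a sketch: the function name find_data_directory is hypothetical, the log text is the startup block quoted above, and the 'projects' sub-directory and 'climateprediction.net' folder name are the ones mentioned in this thread.

```python
import re
from pathlib import PureWindowsPath

# Startup lines copied from the Event Log excerpt above.
STARTUP_LOG = r"""10-Dec-2015 22:46:36 [---] Starting BOINC client version 7.6.20 for windows_intelx86
10-Dec-2015 22:46:36 [---] Libraries: libcurl/7.45.0 OpenSSL/1.0.2d zlib/1.2.8
10-Dec-2015 22:46:36 [---] Data directory: C:\BOINCdata
10-Dec-2015 22:46:36 [---] Running under account Richard Haselgrove
"""


def find_data_directory(log_text):
    """Return the path from the first 'Data directory:' line, or None."""
    for line in log_text.splitlines():
        match = re.search(r"Data directory: (.+)$", line)
        if match:
            return match.group(1).strip()
    return None


data_dir = find_data_directory(STARTUP_LOG)
# Project folders sit in the 'projects' sub-directory of the data directory.
project_dir = PureWindowsPath(data_dir) / "projects" / "climateprediction.net"
print(project_dir)
```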

But once you find the root location, the contents are standard:

14) Message boards : Number crunching : Computation error (Message 53123)
Posted 16 Dec 2015 by Richard Haselgrove
Post:
I had that problem with S@H and resolved it.

Can you locate the S@H folder in BOINC's data directory (projects sub-directory)? CPDN's directory should be right next to it. And it would have been created as soon as you attached to the project, ready to receive the downloaded files.

Edit - the folder will be called 'climateprediction.net', rather than the abbreviated CPDN we write here to save typing.
15) Message boards : Number crunching : Computation error (Message 53117)
Posted 16 Dec 2015 by Richard Haselgrove
Post:
I'm sorry, like I said, there are no CPDN folders on my laptop. The download didn't make it that far.

BOINC wouldn't have attempted to start the app until all downloads are complete - and that's a significant number of both programs and data files before the first task runs.

It has been known for anti-virus applications to prevent new and unknown programs from running.
16) Message boards : Number crunching : transient HTTP error (Message 53102)
Posted 15 Dec 2015 by Richard Haselgrove
Post:
Set the <http_debug> logging option, and you will get better information about exactly why the upload is failing. It might be something, like an overfull storage disk, that the staff can do something about, but they need your detailed input first.
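For anyone who would rather edit the file than use the dialog, the flag lives in the log_flags section of cc_config.xml in the BOINC data directory. The sketch below only builds and prints that fragment. The helper name build_cc_config is hypothetical, and you would still need to merge the output into any existing cc_config.xml by hand and have the client re-read its config files.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Build a cc_config document containing only a <log_flags> section.

    `flags` maps a log flag name (e.g. "http_debug") to True or False.
    """
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name, enabled in flags.items():
        ET.SubElement(log_flags, name).text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


config_xml = build_cc_config({"http_debug": True})
print(config_xml)
```

Set the flag back to 0 once you have captured one failed upload, since the extra detail fills the log quickly.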
17) Message boards : Number crunching : CPDN SITE STILL UNRESPONIVE (Message 52761)
Posted 29 Oct 2015 by Richard Haselgrove
Post:
BOINC treats uploads on the project level rather than the upload URL level, so pending WAH2 PNW uploads might never be attempted if you have uploads for other regions.

We got BOINC to change this, starting (I think) with the first BOINC v7.x releases. If you are using a new-ish client, every upload should be tried at least once as soon as it's ready, rather than waiting behind uploads from other regions that might be stuck as Ian describes.
18) Message boards : Number crunching : WAH2 CREDITS SET TO LOW (Message 52630)
Posted 24 Sep 2015 by Richard Haselgrove
Post:
my computer 1364207

Byron, I'd be wary about taking too much timing data from that machine - it looks to be under considerable stress, and is throwing a lot of errors.

Error tasks for computer 1364207

Even with 256 GB of RAM to support the 40 running models - 6 GB each should be plenty - can the hard disk system cope with all 40 models checkpointing and preparing upload files in quick succession? That could be a big bottleneck, since all the disk accesses have to be to the same drive (or presumably RAID array) over the same interface. CPDN models can be sensitive to delays around those upload generation moments.
19) Message boards : Number crunching : WAH2 CREDITS SET TO LOW (Message 52629)
Posted 24 Sep 2015 by Richard Haselgrove
Post:
It's been said that BOINC takes about 10 tasks from a given project to work out times. On cpdn, it needs to be "several" of EVERY different model type, to work out how long that type of model is going to take.

The remark about projects needing 10 completed tasks (actually 11 - "more than 10") before calculating realistic initial runtime estimates applies to projects running the 'CreditNew' version of the server code. Doesn't matter whether they actually give credit that way - it's the runtime estimation which is all done on the server in those cases.

But we don't have that server code at this project. Any adjustment to the initial estimates is done locally by the old DCF mechanism - and that can only keep track of one value at a time. Unless the task sizes estimated by the project are accurately in proportion to their eventual total running time, different model types will pull DCF in different directions, and it'll never be able to settle to a common value which is right for all models.
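A toy simulation shows why one shared value can't settle. The update rule here is a simplified stand-in of my own (jump up at once when a task runs long, drift down slowly when it runs short), not the client's actual code, and the two model types and their runtimes are invented purely for illustration.

```python
def update_dcf(dcf, estimated, actual):
    """One step of a simplified duration-correction rule (an assumption)."""
    ratio = actual / estimated
    if ratio > dcf:
        return ratio                   # task ran long: jump straight up
    return 0.9 * dcf + 0.1 * ratio     # task ran short: drift down slowly


# Hypothetical model types: (project estimate in hours, real runtime in hours).
MODEL_TYPES = {
    "A": (100.0, 50.0),    # estimate too long, ideal correction 0.5
    "B": (100.0, 200.0),   # estimate too short, ideal correction 2.0
}


def simulate(sequence, dcf=1.0):
    """Feed completed tasks through the single shared DCF, recording each value."""
    history = []
    for model in sequence:
        estimated, actual = MODEL_TYPES[model]
        dcf = update_dcf(dcf, estimated, actual)
        history.append(dcf)
    return history


history = simulate(["A", "B"] * 10)
print([round(value, 2) for value in history[-4:]])
```

With these made-up numbers the single value ends up bouncing between roughly 1.85 and 2.0, so type B is estimated about right while type A, whose ideal correction is 0.5, is overestimated nearly fourfold.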
20) Message boards : Number crunching : wah tasks failed (Message 52583)
Posted 16 Sep 2015 by Richard Haselgrove
Post:
And time estimate errors in one application will go on affecting all applications for the project, until CPDN can finally complete the migration to a new version of the BOINC server software which can decouple the runtime estimate smoothing of the different application versions.

But the current Runtime Estimation code is so crude that I'd hesitate to advocate its adoption here.


©2020 climateprediction.net