Posts by Richard Haselgrove

1) Message boards : Number crunching : no credit awarded? (Message 68567)
Posted 10 days ago by Richard Haselgrove
Post:
Yes, I've written to Andy (who was busy with the BOINC workshop yesterday), and requested a specific chunk of data which will help us localise where the problems start. Once I receive that, I can work out whether we need to search forwards or back to the source of the trouble.

It'll take several steps, and I won't keep up a running commentary, but I'll let you know when we make any significant change that may be observable in your own accounts.
2) Message boards : Number crunching : no credit awarded? (Message 68559)
Posted 16 days ago by Richard Haselgrove
Post:
One click on the KVM button later...

Here's a version of that final <msg_from_host>, recorded from an IFS_bl task a couple of weeks ago.

<msg_from_host>
      <result_name>oifs_43r3_bl_a27b_2016092300_15_991_12209642_0</result_name>
      <time>1676364290</time>
<variety>orig</variety>
<wu>oifs_43r3_bl_a27b_2016092300_15_991_12209642</wu>
<result>oifs_43r3_bl_a27b_2016092300_15_991_12209642_0_r863024831</result>
<ph></ph>
<ts>864000</ts>
<cp>17458</cp>
<vr></vr>
</msg_from_host>
The pp fields are no longer used, and a couple of others are blank, but I doubt that matters.

But please compare carefully the tag <result>.

In the old hadcm3 tasks, that's identical to the <result_name> tag added by BOINC. But in IFS, it's been extended by _r863024831 - used in the upload file names.

IF (and that's a very big if) CPDN were relying on <result> to match a trickle to its ResultID, that would be a point of failure. It's the first smoking bearing in a very big machine.
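
To make the possible mismatch concrete, here is a minimal sketch of how a lookup keyed on <result> would miss for IFS tasks. It assumes the suffix always has the form _r followed by digits, as in the example above, and that the server holds a table keyed by the BOINC result name. The dictionary, the ResultID value and the function names are hypothetical; this is not CPDN's actual code.

```python
import re

# Hypothetical stand-in for the server's result table: name -> ResultID.
# The ID value is invented for illustration.
RESULTS_BY_NAME = {
    "oifs_43r3_bl_a27b_2016092300_15_991_12209642_0": 1001,
}

# Assumption: the IFS suffix is always "_r" followed by digits.
UPLOAD_SUFFIX = re.compile(r"_r\d+$")


def lookup_strict(result_tag):
    """Look up the <result> tag exactly as received."""
    return RESULTS_BY_NAME.get(result_tag)


def lookup_tolerant(result_tag):
    """Strip a trailing _r<digits> suffix before looking up."""
    return RESULTS_BY_NAME.get(UPLOAD_SUFFIX.sub("", result_tag))


ifs_tag = "oifs_43r3_bl_a27b_2016092300_15_991_12209642_0_r863024831"
print(lookup_strict(ifs_tag))    # None: the extended name is not in the table
print(lookup_tolerant(ifs_tag))  # 1001
```

An old-style hadcm3 name has no such suffix, so the tolerant lookup leaves it unchanged.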
3) Message boards : Number crunching : no credit awarded? (Message 68558)
Posted 16 days ago by Richard Haselgrove
Post:
OK, back to credit issues, and specifically the breakdown of credit awards for Hadley tasks in late 2022. I find one of AndreyOR's computers very helpful in isolating the start of this event:

Filtered list of HadSM4 at N144 resolution tasks for computer 1526028, page 5

It is clear that tasks reported on Tuesday 29 Nov 2022, up to 20:29:05 UTC, have been granted credit.
Tasks reported on Wednesday 30 Nov 2022, from 10:56:28 UTC, have not.

Because it happened mid-week, it's unlikely to be a strict "credit script" event: it would most likely have become visible at a weekend, if that was the case. And looking at individual sample tasks, trickles disappeared from the task display in the same time interval. So I think it's more likely to be a problem introduced into the trickle transfer stage of the process.

Switching to trickles I've captured on my own machines at various times. These are from September 2014, and different task types, but they illustrate the flow.

A trickle starts life as an XML file in the project's directory:

<variety>year</variety>
<wu>hadcm3s_1aby_2001_2_008988784</wu>
<result>hadcm3s_1aby_2001_2_008988784_1</result>
<ph>1</ph>
<ts>51840</ts>
<cp>187137</cp>
<vr>7.24</vr>
<ppname>
trickle_hadcm3s_1aby_2001_2_008988784_1_2003.zip</ppname>
<pplen>
110326</pplen>
<ppdataz>
0MT $0! "  " DJ=O4$_U^CWDV4  00!&  , <' H%&9CUV,S]5,A)6>?)#,P$S7
R\%,P@3.X@S-X0S7Q\5;E%F;A]E,P S,?!'9NXV831D8 P+     ( PNX<.(C1&8
...
[snip]
...
</ppdataz>
This gets copied by BOINC into a "sched_request" message to the project server. I'll ignore the ppdata to save space.

<msg_from_host>
      <result_name>hadcm3s_1aby_2001_2_008988784_1</result_name>
      <time>1410789211</time>
<variety>year</variety>
<wu>hadcm3s_1aby_2001_2_008988784</wu>
<result>hadcm3s_1aby_2001_2_008988784_1</result>
<ph>1</ph>
<ts>51840</ts>
<cp>187137</cp>
<vr>7.24</vr>
...
[pp fields snipped]
...
</msg_from_host>
Note that at this stage, we only know the result by name: it has to be matched up by the server with the full result record in the database, which is keyed by ResultID number. I'm suspicious that this may be where our problems start.
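
The matching step can be sketched as follows. This is an illustration only: it assumes <time> is a Unix timestamp in seconds (which is consistent with the September 2014 capture), and the lookup table and its ResultID value are invented, since the real database schema isn't shown here.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# The message above, with the pp fields left out.
MSG = """<msg_from_host>
      <result_name>hadcm3s_1aby_2001_2_008988784_1</result_name>
      <time>1410789211</time>
<variety>year</variety>
<wu>hadcm3s_1aby_2001_2_008988784</wu>
<result>hadcm3s_1aby_2001_2_008988784_1</result>
<ph>1</ph>
<ts>51840</ts>
<cp>187137</cp>
<vr>7.24</vr>
</msg_from_host>"""

# Hypothetical result table keyed by name; the ResultID is invented.
RESULT_IDS = {"hadcm3s_1aby_2001_2_008988784_1": 4242}


def parse_trickle(xml_text):
    """Return the message fields as a dict, plus the send time in UTC."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: (child.text or "").strip() for child in root}
    # Assumption: <time> holds Unix seconds.
    fields["sent_utc"] = datetime.fromtimestamp(int(fields["time"]), tz=timezone.utc)
    return fields


trickle = parse_trickle(MSG)
# The server only has the name to go on; None here would mean "no match".
result_id = RESULT_IDS.get(trickle["result_name"])
print(trickle["sent_utc"].isoformat(), trickle["ts"], result_id)
```

If the name lookup returns nothing, the trickle has no ResultID to attach to, which is the failure being suspected here.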

At this stage, I have to switch to a Linux machine for the next part of the story. Be right back ...
4) Message boards : Number crunching : no credit awarded? (Message 68557)
Posted 16 days ago by Richard Haselgrove
Post:
I was possibly misled by a page I pulled up during an earlier conversation with Glenn: http://news.bbc.co.uk/1/hi/sci/tech/3100024.stm

A page dated September 2003 says:

A massive worldwide online effort to predict how the global climate will change this century is being launched in the UK.

Computer users anywhere on Earth can join by downloading a climate model from a website.

The organisers say it will be the world's largest climate prediction experiment.

They hope it will result in a much more robust picture of the probable future climate.

The experiment is being launched on 12 September at the Science Museum in London and at the British Association science festival in Salford.

It is the fruit of collaboration between the universities of Oxford and Reading, the Met Office, the Open University, the Rutherford Appleton Laboratory, and a software company, Tessella Support Services.
I assumed that was the start of the Beeb's editorial backing of the project as part of its educational support services, but I may have conflated two separate events.
5) Message boards : Number crunching : no credit awarded? (Message 68550)
Posted 16 days ago by Richard Haselgrove
Post:
That's a bizarre way of doing it.
I'm probably referring back to around late 2009 (that's when I last looked in detail at credit), or even earlier.

To me, it smells like a quick'n'dirty kludge, thrown together in the early days of the project (and of BOINC), to bridge the gap between two parts of an incomplete system. Never expected, or intended, to be still running 20 years later with today's vastly quicker flow of results from modern tech.

Did you refer to the history of David Anderson's involvement with BOINC, that he alluded to at the start of his talk to the workshop? The section on CPDN is illuminating, though I don't trust David's recall of history - my name appears later on in the blog, and the roles he ascribes to me are broadly accurate, but that's an amendment after Jord appealed. I still don't recognise myself.

But here's the CPDN section, for what it's worth:

When we released the BOINC-based version of SETI@home to the public, there was a lot of backlash. People don't like change in general, and they didn't like the complexity of BOINC. We lost a big fraction of our volunteer base; it went from 600K to 300K or something like that.

I was very eager to get Climateprediction.net (CPDN) working. It had very long jobs: 6 months on some computers. We added "trickle" mechanisms to let the jobs upload intermediate results, and grant partial credit. I went to Oxford and spent a month working with Carl Christensen and Tolu Aina.
Myles Allen, Climateprediction.net, and Oxford
Myles is a visionary climate scientist at Oxford University. He proposed using volunteer computing for climate research in a Nature article in 2000. I read this and immediately contacted him. They had done something remarkable: taking a state-of-the-art climate model - a giant FORTRAN program that had only been run on supercomputers - and getting it to run on Windows PCs. They initially hired a local company to develop the job-distribution software, but switched to BOINC as soon as it was available.

I was very eager to make CPDN a success. In 2005 I spent a month in Oxford, staying in Myles' house (he was away for the summer) and working with Carl and Tolu.

In my view, CPDN hasn't lived up to its potential. Carl didn't feel appreciated at CPDN, and he left in 2008. Tolu left a year or two later. That left CPDN without a lot of technical resources. Oxford appointed a "director of volunteer computing", but nothing came of it.
Note that the role of the BBC in promoting the early, pre-BOINC, stage of CPDN's life has escaped David's notice.
6) Message boards : Number crunching : no credit awarded? (Message 68546)
Posted 17 days ago by Richard Haselgrove
Post:
There have been times in the past when credit hasn't shown despite zips being on the website but most of them have been when there are problems with the credit script having fallen over or not been restarted after an event of some kind. There have also been times when the credits have appeared despite zips not showing on the task pages, presumably because the problem occurs after the processes to display them and the ones to go into the credit script separate.
I think that's just a simple matter of timing. The original system had two scripts: one to copy the trickles to a place where they could be seen on the website and used in credit calculations, and the other to work out the actual credit and RAC. They both took several hours to run, and the first had to finish before the second one started, otherwise some hosts got missed (that was another problem).

One script ran on an interval basis: "every 24 hours (then) since the project had last been restarted". The other ran as a cron job: "at hh:mm o'clock every day". If emergency maintenance meant that the project had to be restarted at an unusual time of day, those timings could clash, and credit was erratic until the staff could get round to an orderly, planned, restart - with a check that every component was active, and running in the right sequence. Until the next time ...
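
A small simulation shows how the two timing schemes can drift into each other. Every number in it is an assumption made up for illustration (a 02:00 cron slot, a three-hour copy run), as is the guess that the copy script was the interval-based one; the real schedule isn't recorded here.

```python
from datetime import datetime, timedelta

# All times and durations are illustrative assumptions, not CPDN's real schedule.
CRON_TIME = (2, 0)                    # credit script: "at 02:00 every day"
COPY_INTERVAL = timedelta(hours=24)   # copy script: every 24 hours since restart
COPY_DURATION = timedelta(hours=3)    # how long one copy run takes


def copy_runs(restart, count):
    """Start times of the interval script, counted from the last restart."""
    return [restart + COPY_INTERVAL * n for n in range(1, count + 1)]


def clashes(restart, days=3):
    """Return the cron runs that start while a copy run is still in progress."""
    bad = []
    for start in copy_runs(restart, days):
        end = start + COPY_DURATION
        # Check the cron slot on the day the copy starts and on the day after.
        for offset in (0, 1):
            day = (start + timedelta(days=offset)).date()
            cron = datetime(day.year, day.month, day.day, *CRON_TIME)
            if start <= cron < end:
                bad.append(cron)
    return bad


planned = datetime(2023, 1, 1, 20, 0)    # copy ends at 23:00, well before 02:00
emergency = datetime(2023, 1, 1, 0, 30)  # copy runs 00:30 to 03:30, across 02:00
print(len(clashes(planned)), len(clashes(emergency)))  # 0 3
```

With the planned restart the credit run always finds a finished copy. After the emergency restart it starts mid-copy every day, and keeps doing so until the next restart resets the interval.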

I don't know what the current mechanism is supposed to be: just that it doesn't appear to be going to plan. If my offer to take a look is taken up, I suppose the first question is: "can you supply me with a schematic flow-chart of the expected credit system as it stands now?". If they don't have one to hand, then drawing one up would be a useful first step.
7) Message boards : Number crunching : no credit awarded? (Message 68538)
Posted 18 days ago by Richard Haselgrove
Post:
Another idea is perhaps asking around at the BOINC workshop for ideas as to where to look for a problem like this?
Sadly, I don't think that will help. There's not much cross-over between projects at these events: the 'trickle' mechanism is pretty much unique to CPDN.

Up till now. Both Glenn and I picked up a clear similarity with an emerging 'BlackHoles@Home' project, which aims to study Einsteinian physics through simulations of black hole development. Massive datasets, multi-month simulation runtimes - sound familiar? But they need to work on the difference between 'checkpoints' (stored locally by the client), and 'trickles' (reporting progress to the server). Intermediate uploads are a third contender in that space.
8) Message boards : Number crunching : no credit awarded? (Message 68528)
Posted 18 days ago by Richard Haselgrove
Post:
I think it would also imply that it's experiencing issues with the trickles but not the completion message.
Right. And often the final trickle data is prepared for, and included in, the same file as the completion report.

It seems to me that the problem must be occurring at the project end, when the message is received and broken down into its constituent parts for filing and reporting. I'm trying to assemble evidence for a search in that part of the system.
9) Message boards : Number crunching : w/u failed at the 89th zip file (Message 68524)
Posted 18 days ago by Richard Haselgrove
Post:
In this case, it's much easier than that.

The older tasks, like WU 12213345 (issued on 22 Feb) are resends from the failed batch 992, which was withdrawn because of a missing data file in the package.

The newer tasks, like WU 12213630 (issued on 24 Feb) are from the corrected replacement batch issued on that day.
10) Message boards : News : New study going out to volunteer's machines (Message 68512)
Posted 20 days ago by Richard Haselgrove
Post:
There's a difference between 'notices' and 'notifications'. I got annoyed by the constant reminders from the system tray when notices were first introduced, and quickly found the "never" reminder option. The notices themselves can sit there forever unread, as far as I'm concerned - like the one shown in your screenshot.
11) Questions and Answers : Windows : Future CPDN on Windows? (Message 68500)
Posted 21 days ago by Richard Haselgrove
Post:
There is a - hopefully temporary - problem at the moment which is preventing RAC being calculated, even on platforms where work has been available in recent weeks and months. It's affecting all of us.
12) Message boards : Number crunching : no credit awarded? (Message 68488)
Posted 21 days ago by Richard Haselgrove
Post:
The 'two scripts' is a reference to the dev & production sites running different versions. The 'old' version is on the production site and the 'new' one is active on the dev site. They are not both active together. CPDN want to roll out the 'new' one to production but it will completely alter how credit is computed, so want to prepare something to go out to users first.

That's as much as I know. Richard, I suspect you know more about the differences between the 'new' and 'old' boinc credit scripts than I do. I'm sure I've seen you talk about it in other posts.
Yes, those were the references I was alluding to (one script on each server, but different).

But the question - in reference to bullschuck's question - becomes "How old is old?". His machines (1526736, 1519502) clearly show a problem. For tasks completed in July, trickles were displayed on the result pages, and credit was awarded - including partial credit according to the trickle reached, for tasks which didn't complete. But tasks completed in December or later aren't showing their trickles, and aren't getting any credit, either.

But IFS tasks are getting credit on the production site, for completed tasks at least - even though they aren't showing their trickles. And tasks on the dev site are showing their trickles for both IFS and Hadley tasks. So we seem to have at least three scripts in play: should we call them old, middle-aged, and young?

I did do some work for Milo Thurston back in the day, when we had a RAC problem on one particular application. But any knowledge I gained on that occasion is positively geriatric by comparison. That's why I'm suggesting that the time has come (subject to other constraints, which come first) for a thorough re-examination of the current situation. I'm happy to lend a hand in that process, if it would help.
13) Message boards : Number crunching : no credit awarded? (Message 68483)
Posted 21 days ago by Richard Haselgrove
Post:
Credit is based on the trickle up files that generally go at the same time as the zips are uploaded.
That's the way it always used to be, but something seems to have slipped in the last three months or so.

There was a credit run late last night or very early this morning (UTC, 25/26 Feb), just as I was finishing up the last of my batch 993 tasks. One task reported at 23:50 has been awarded full credit, the next reported at 04:27 still shows zero. In the 'trickle' days, that one would have received credit for the trickles received before, say, midnight.

Another strange thing: my event log has an entry for

26-Feb-2023 00:51:38 [climateprediction.net] [sched_op] handle_scheduler_reply(): got ack for task oifs_43r3_01i7_2019110100_123_993_12215389_0
That's task 22316800, which the server says is still in progress. The event log timing (also UTC) suggests that it was reported right in the middle of the period when I'm suggesting the credit script was running. Could that have interfered with the status update?
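
For anyone wanting to check their own logs the same way, here is a rough sketch that pulls the time and task name out of an ack line and tests it against a window. The window is only the gap between the two report times mentioned above (23:50 and 04:27 UTC), not a known run time for the credit script, and the regular expression is written against this one log line rather than any documented format.

```python
import re
from datetime import datetime, timezone

LINE = ("26-Feb-2023 00:51:38 [climateprediction.net] [sched_op] "
        "handle_scheduler_reply(): got ack for task "
        "oifs_43r3_01i7_2019110100_123_993_12215389_0")

# Pattern inferred from the single line above; other lines may differ.
ACK = re.compile(
    r"^(?P<when>\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<project>[^\]]+)\] \[sched_op\] .*got ack for task (?P<task>\S+)$"
)


def parse_ack(line):
    """Return (UTC time, task name) for an ack line, or None if it doesn't match."""
    m = ACK.match(line)
    if m is None:
        return None
    # %b expects English month abbreviations, so this relies on a C/English locale.
    when = datetime.strptime(m["when"], "%d-%b-%Y %H:%M:%S")
    return when.replace(tzinfo=timezone.utc), m["task"]


# Assumed window: between the last credited and first uncredited report.
WINDOW_START = datetime(2023, 2, 25, 23, 50, tzinfo=timezone.utc)
WINDOW_END = datetime(2023, 2, 26, 4, 27, tzinfo=timezone.utc)

when, task = parse_ack(LINE)
print(task, WINDOW_START <= when <= WINDOW_END)
```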

There have been suggestions on the message boards that we currently have two different credit scripts running on different servers, an old one and a new one. But it seems to be more complicated than that. I quite understand that the project team have had their hands full with the testing and launch of the new apps, and delivering the results to the commissioning scientists in spite of problems with the upload servers. But there will come a time when - I hope - they will be able to take a step back and review the health of the project as a whole.
14) Message boards : Number crunching : OpenIFS Discussion (Message 68466)
Posted 22 days ago by Richard Haselgrove
Post:
I've pulled the overnight event log from the system journal, but there are no signs of any errors in there - seemed to be a normal finish after the final zip.
15) Message boards : Number crunching : OpenIFS Discussion (Message 68463)
Posted 23 days ago by Richard Haselgrove
Post:
Task 22316388 failed with "process exited with code 9 (0x9, -247)".

But there's no error in the portion of stderr.txt that we can see (from upload 97 to the end). I can only guess that there was a child process error earlier in the run: the restart succeeded, but the error flag wasn't cleared from the BOINC task status. The final task finish looks normal, with:

..The child process terminated with status: 0
...
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
07:35:35 (41942): called boinc_finish(0)
That's going to be a tough one to debug.
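
As an aside on the three numbers in that exit message: they look like the same value shown three ways, with the last being the code minus 256 (9 - 256 = -247). The snippet below only reproduces that arithmetic; the function is hypothetical and is not taken from the BOINC source.

```python
def describe_exit(code):
    """Format an exit code as decimal, hex, and the code minus 256."""
    return f"{code} (0x{code:x}, {code - 256})"


print(describe_exit(9))  # 9 (0x9, -247)
```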
16) Message boards : Number crunching : OpenIFS Discussion (Message 68444)
Posted 23 days ago by Richard Haselgrove
Post:
The server status page has finally updated, and both 'Unsent' and 'In progress' have gone up substantially. Looks like the workunit generator is running at about twice the speed of our demand load, which is fine.
17) Message boards : Number crunching : OpenIFS Discussion (Message 68441)
Posted 23 days ago by Richard Haselgrove
Post:
Yikes - there are 123 upload files in all, and the first one was over 15 MB. Your band is going to get very bored, Dave!
18) Message boards : Number crunching : OpenIFS Discussion (Message 68439)
Posted 23 days ago by Richard Haselgrove
Post:
There were 188 unsent on the server status page at 12:45, and I got one of them at 13:01. It's running, but still in the early stages - I'll watch how it runs for a while, before switching into full multi-fetch mode.
19) Message boards : Number crunching : OpenIFS Discussion (Message 68422)
Posted 25 days ago by Richard Haselgrove
Post:
The consistency of the errors makes it much more likely that this is a task data error.

I reported my single-case failure as quickly as possible, to warn the project and other users - I don't have any further way to analyse the data, save to say that it failed on a machine with 32 GB of RAM, much earlier (67.31 seconds elapsed time, 2.35 seconds CPU) than I would expect memory to fill up.

I'm trying to get another for fuller examination, but at one request an hour, they're proving elusive. Apart from the single email from Andy Bowery on Monday evening, confirming that distribution had been paused, I haven't seen anything from the team.
20) Message boards : Number crunching : OpenIFS Discussion (Message 68404)
Posted 27 days ago by Richard Haselgrove
Post:
The way to have a section of an xml file be skipped is by surrounding it with <!-- -->.
Be careful with that - try with a simple comment, and check for error messages when it's read in.

The boinc client doesn't use a fully-featured XML parser - it uses its own simplified code, only implementing the features it needs.
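
A toy example of why the check is worth doing. The scanner below is deliberately naive and is not BOINC's parser; the tag names are made up. It only shows how a simplified reader with no notion of comments can pick up a value that a full XML parser would skip.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical config fragment with one setting commented out.
CONFIG = """<options>
  <max_tasks>2</max_tasks>
  <!-- <max_tasks>8</max_tasks> -->
</options>"""


def naive_values(text, tag):
    """Toy tag scanner: matches tags anywhere, including inside comments."""
    return re.findall(rf"<{tag}>(.*?)</{tag}>", text)


def real_values(text, tag):
    """A full XML parser discards comments before the values are read."""
    return [el.text for el in ET.fromstring(text).iter(tag)]


print(naive_values(CONFIG, "max_tasks"))  # ['2', '8']
print(real_values(CONFIG, "max_tasks"))   # ['2']
```

Whether the client behaves like either of these for a given file is exactly what the error messages will tell you.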


