Posts by Richard Haselgrove

1) Message boards : Number crunching : New work discussion - 2 (Message 66137)
Posted 11 days ago by Richard Haselgrove
Post:
It would be a good idea, but I don't think that the standard BOINC server package reports the right sort of operating system information back to the server. 32-bit libraries, and vsyscall emulation, are both a bit niche.

My only suggestion would be that someone could write a small probe app that asserted that both features were in place, and reported back yay or nay - either to the user (probably not much use), or to the project. If that could be sent in place of a full task download, say at the start of each batch, it would save a great deal of time and bandwidth, by automatically inhibiting work send to misconfigured devices. There would need to be a route for "I've installed them - please retest".
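
A rough idea of what such a probe might check, as a Python sketch. It assumes a Linux host, that vsyscall emulation shows up as a `[vsyscall]` mapping in `/proc/self/maps`, and that a 32-bit dynamic loader at `/lib/ld-linux.so.2` is a fair proxy for the 32-bit libraries. The function names and the yay/nay report format are made up for illustration; nothing like this ships with the BOINC server package.

```python
import os

# Assumed location of the 32-bit dynamic loader on common Linux distributions.
LOADER_32BIT = "/lib/ld-linux.so.2"


def has_vsyscall(maps_text):
    """Return True if a process memory map lists a [vsyscall] region."""
    return any(line.rstrip().endswith("[vsyscall]") for line in maps_text.splitlines())


def has_32bit_loader(exists=os.path.exists):
    """Return True if the 32-bit loader is present (proxy for 32-bit libraries)."""
    return bool(exists(LOADER_32BIT))


def probe_report(maps_text, exists=os.path.exists):
    """Build the hypothetical yay/nay report a probe task could send back."""
    checks = {
        "vsyscall": has_vsyscall(maps_text),
        "lib32": has_32bit_loader(exists),
    }
    checks["ok"] = all(checks.values())
    return checks


if __name__ == "__main__":
    try:
        with open("/proc/self/maps") as f:
            maps = f.read()
    except OSError:
        maps = ""  # not Linux, or /proc unavailable
    print(probe_report(maps))
```

A project could withhold real work from any host whose report has "ok" set to False, and resend the probe when the user asks for a retest.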
2) Message boards : Number crunching : New work discussion - 2 (Message 66134)
Posted 11 days ago by Richard Haselgrove
Post:
There's quite a good discussion of this issue in https://github.com/BOINC/boinc/issues/2120
3) Message boards : Number crunching : New work discussion - 2 (Message 66116)
Posted 15 days ago by Richard Haselgrove
Post:
Was it result 22235033?

That's a curious one. Exit status 0 (0x00000000) (zero normally signifies success), nothing at all recorded from stderr.

But it's a resend (replication _1). The _0 copy also failed, leaving rather more evidence behind.

Result 22229598, Exit status 15, stderr ends with

Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=22328, selfPID=22328, iMonCtr=1
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=22328, selfPID=21096, iMonCtr=1
after many, many interruptions.

You've thought of the obvious ideas. Beyond that, I can only suggest:

Start another task and let it get into its stride.
Stop BOINC prior to reboot, and examine the state of the files.
Disable automatic BOINC start at reboot/login.
Reboot, and examine the state of the files again before BOINC has a chance to run.
Pull the network cable, and allow BOINC to start.
Assuming it crashes as before, the slot folder will be cleared, but the upload files and report should be held until after BOINC has reported them to the server and got an ack response. Stderr will be embedded in client_state.xml, not kept as a separate file.
4) Message boards : Number crunching : New work discussion - 2 (Message 66082)
Posted 25 days ago by Richard Haselgrove
Post:
According to ye ancient scrolls of yore, a BOINC credit is also known as a cobblestone, defined as:

By definition, 200 cobblestones are awarded for one day of work on a computer that can meet either of two benchmarks:

    * 1,000 double-precision MFLOPS based on the Whetstone benchmark
    * 1,000 VAX MIPS based on the Dhrystone benchmark

That's all. Nothing else. Pure CPU grunt. No brownie points for complexity, cleverness, memory usage, disk usage, efficiency of execution, artistic merit, ..., ...
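
As arithmetic, that definition works out as below. This is only a Python sketch of the original benchmark-based definition quoted above, using the Whetstone figure; the function name is invented for illustration, and it isn't the credit code any project runs today.

```python
SECONDS_PER_DAY = 86_400
CREDIT_PER_DAY = 200.0       # cobblestones for one day on the reference machine
REFERENCE_MFLOPS = 1_000.0   # Whetstone benchmark of the reference machine


def cobblestones(whetstone_mflops, cpu_seconds):
    """Credit under the original definition: scales with benchmark speed and time."""
    days = cpu_seconds / SECONDS_PER_DAY
    return CREDIT_PER_DAY * (whetstone_mflops / REFERENCE_MFLOPS) * days


# One full day on the reference machine earns exactly 200 cobblestones.
print(cobblestones(1_000, 86_400))    # 200.0
# A machine benchmarking at 4,000 MFLOPS, running for 6 hours, earns the same.
print(cobblestones(4_000, 6 * 3600))  # 200.0
```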

In reality, the purity of that definition was abandoned over 10 years ago. Projects can, and do, award whatever credit they wish. Some of us old crusties hanker after the old days, when credits really meant something scientifically measurable, but those days have gone.
5) Message boards : Number crunching : New work discussion - 2 (Message 66068)
Posted 27 days ago by Richard Haselgrove
Post:
A couple of unfinished points, continuing from the previous thread - specifically in reply to Glenn Carver's message 66054.

1) I don't think MilkyWay has a separate app version for each core count. Something like that would normally be handled by the plan_class mechanism, and the MilkyWay applications page only shows two application versions for the N-Body simulation - one for Windows, and the other for Linux. Both have the same simple [mt] plan_class.

2) I've set up a basic machine to run MilkyWay nbody tasks, and tracked the messages passing between the machine, the server, and the running science app. I think I've got a possible explanation of how they've done it.

a) the machine is a small 4-core Intel, no hyperthreading, running Windows 10. I've set it, via local preferences, to use 80% of the available CPUs. That calculation is done in integer maths, so the machine has three cores available.
b) the request file from the machine to the server contains these lines:
<working_global_preferences>
<global_preferences>
   <max_ncpus_pct>80.000000</max_ncpus_pct>
</global_preferences>
</working_global_preferences>
<host_info>
    <p_ncpus>4</p_ncpus>
</host_info>
- so the local settings are reported to the server: "use 80% of 4 CPUs".
c) The reply from the server, when new work is allocated, contains these lines:
<app_version>
    <app_name>milkyway_nbody</app_name>
    <avg_ncpus>3.000000</avg_ncpus>
</app_version>
d) The allocated tasks are shown in BOINC Manager, and marked (3 CPUs)
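
Steps a) to d) can be checked with a few lines of Python. The rounding rule (truncate, with a floor of one CPU) is an assumption that matches the "80% of 4" figure in step a); the function names are made up, and the XML is just the fragment of the server reply quoted in step c).

```python
import xml.etree.ElementTree as ET


def usable_cpus(p_ncpus, max_ncpus_pct):
    """CPUs the client will use, assuming integer truncation and a minimum of 1."""
    return max(1, int(p_ncpus * max_ncpus_pct / 100))


# The lines quoted from the server reply in step c).
SERVER_REPLY = """<app_version>
    <app_name>milkyway_nbody</app_name>
    <avg_ncpus>3.000000</avg_ncpus>
</app_version>"""


def reply_ncpus(xml_text):
    """Return (app name, avg_ncpus) from an <app_version> fragment."""
    root = ET.fromstring(xml_text)
    return root.findtext("app_name"), float(root.findtext("avg_ncpus"))


print(usable_cpus(4, 80.0))       # 3
print(reply_ncpus(SERVER_REPLY))  # ('milkyway_nbody', 3.0)
```

The two numbers agree: the client has three usable cores, and the server has allocated a three-core app version.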
3) When the BOINC client starts a new task, it populates an empty slot directory with the required files, and also creates its own file called "init_data.xml". That contains the lines:
<ncpus>3.000000</ncpus>
<host_info>
    <p_ncpus>4</p_ncpus>
</host_info>

"init_data.xml" can be read by the BOINC API library linked into a project app at compile time. I think that's how the app must be getting its threading instructions: "Although the processor has 4 cores (p_ncpus), only use 3 of them (ncpus)." I can't find any other way it can be passed, and I've eyeballed every single occurrence of the digit '3' in these files. And still it reports "Using OpenMP 3 max threads on a system with 4 processors".
6) Message boards : Number crunching : New work Discussion (Message 66051)
Posted 29 days ago by Richard Haselgrove
Post:
Trying to summarise the various controls and options. You also need to consider the placement - the context - of each individual control in BOINC's various files and applications.

First, at the top level, there's

<ncpus>N</ncpus>
Act as if there were N CPUs; e.g. to simulate 2 CPUs on a machine that has only 1. Zero means use the actual number of CPUs. Don't use this to limit CPU usage; use computing preferences instead.
That goes in cc_config.xml, and makes the BOINC client behave, in every respect, like a computer with N cores. That's what it will tell the server, and the server will respond appropriately. Useful for exploring how well the client is behaving once the first test multi-threaded app is deployed, but not much else.

Next, comes preferences. Again, this applies to the whole machine.

Usage limits
Use at most N % of the CPUs: Keeps some CPUs free for other applications. Example: 75% means use 6 cores on an 8-core CPU.
Use at most N % CPU time: Suspend/resume computing every few seconds to reduce CPU temperature and energy usage. Example: 75% means compute for 3 seconds, wait for 1 second, and repeat.
That's mostly useful for keeping temperatures under control, especially when crunching on a laptop. That goes in global_prefs.xml (which is managed through project websites), or global_prefs_override.xml (which you can set manually on an individual machine).

And finally, there are a couple that apply to a single project, or even a single task-type within a project. That's what we're mainly discussing here. Both versions live in a single file called app_config.xml, and the full formal specification looks like this:

<app_config>
   [<app>
      <name>Application_Name</name>
      <max_concurrent>1</max_concurrent>
      [<report_results_immediately/>]
      [<fraction_done_exact/>]
      <gpu_versions>
          <gpu_usage>.5</gpu_usage>
          <cpu_usage>.4</cpu_usage>
      </gpu_versions>
    </app>]
   ...
   [<app_version>
       <app_name>Application_Name</app_name>
       [<plan_class>mt</plan_class>]
       [<avg_ncpus>x</avg_ncpus>]
       [<ngpus>x</ngpus>]
       [<cmdline>--nthreads 7</cmdline>]
   </app_version>]
   ...
   [<project_max_concurrent>N</project_max_concurrent>]
   [<report_results_immediately/>]
</app_config>
The first thing to note is that THESE SETTINGS CONTROL BOINC ONLY. They DON'T control the behaviour of the project's own science applications (with one exception): the idea is that they describe how the science app is going to behave after it's launched, so that BOINC can leave enough space free for it to run efficiently.

The fractional values for <cpu_usage>.4</cpu_usage> and <gpu_usage>.5</gpu_usage> (app section), and <avg_ncpus>x</avg_ncpus> and <ngpus>x</ngpus> (app_version section) do much the same thing: they allow BOINC to launch more copies of the app, to run concurrently, until the device is "full". They're really aimed at GPUs, and assume that GPUs won't use the CPU as intensively as a standard native CPU app would use it.
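
A back-of-envelope version of that "until the device is full" rule, in Python. It's a simplification for illustration only: the real client scheduler weighs many other things, and the function name is invented.

```python
import math


def max_concurrent(capacity, usage_per_task):
    """How many tasks fit if each reserves usage_per_task of the given capacity."""
    if usage_per_task <= 0:
        raise ValueError("usage_per_task must be positive")
    # Small tolerance so that e.g. 1 / 0.1 counts as 10 despite float rounding.
    return math.floor(capacity / usage_per_task + 1e-9)


# <gpu_usage>.5</gpu_usage> on one GPU: two tasks share the card.
print(max_concurrent(1, 0.5))   # 2
# Each of those reserves <cpu_usage>.4</cpu_usage>, so together 0.8 of a core.
print(2 * 0.4)                  # 0.8
# <avg_ncpus>3</avg_ncpus> on a 6-core machine at 100%: two tasks at once.
print(max_concurrent(6, 3))     # 2
```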

The odd one out is <cmdline>--nthreads 7</cmdline>. That one is designed to be passed to a multi-threaded science app: <avg_ncpus> can be used to ensure the required resources are available and not in use: --nthreads is supposed to keep the science app to its allotted space. I've talked about this value earlier in this thread: it transpired that the MilkyWay project had found a way of combining both these controls into a single project setting. Good for them, and if CPDN can follow their lead, it makes life easier for all of us.
7) Message boards : Number crunching : New work Discussion (Message 66011)
Posted 28 Aug 2022 by Richard Haselgrove
Post:
I'm seeing huge numbers of tiny files for GPU Covid. I've done thousands of those tasks now, and they still send loads of files. So either they are different, or the server isn't acknowledging I already have that file. With most projects, if I come back after a while, or every so often, there's a big dataset that gets downloaded just once, then the task files are smaller.
Only one project: Einstein@home. It's called "locality scheduling", and their server was specially enhanced by their boss, Bruce Allen. Everybody else does it their own way.

Technical detail: that only applies to the Gravity Wave search, using data from the LIGO detectors. Those are large, exquisitely detailed, datasets, recorded by the 4 km laser interferometers. Thousands of individual workunits are created to scan them every which way possible. You don't do that with nano-scale protein molecules. They are, indeed, all different.
8) Message boards : Number crunching : New work Discussion (Message 65994)
Posted 25 Aug 2022 by Richard Haselgrove
Post:
I can get through to their forums with no problems - just posted there again. My usual problem is dropped connections on short files - they can't handle the concurrency. But today it was a 102 MB shared file, that came at less than 350 KBps (on a 70 MB fibre line) - imagine that here.

Some projects have networking wizards, others don't.
9) Message boards : Number crunching : New work Discussion (Message 65991)
Posted 25 Aug 2022 by Richard Haselgrove
Post:
That's for another forum. I've posted about it at WCG.
10) Message boards : Number crunching : Hardware requirements for upcoming models (Message 65973)
Posted 24 Aug 2022 by Richard Haselgrove
Post:
And the station is a long way from the town centre. I remember that schlep!
11) Message boards : Number crunching : Hardware requirements for upcoming models (Message 65971)
Posted 24 Aug 2022 by Richard Haselgrove
Post:
Peter. "City" is a specific legal status - ask Southend

On 18 October 2021, it was announced that Southend would be granted city status, as a memorial to ...

Southend was granted city status by letters patent dated 26 January 2022.
12) Message boards : Number crunching : Hardware requirements for upcoming models (Message 65964)
Posted 23 Aug 2022 by Richard Haselgrove
Post:
Wow, I'm surprised in Cambridge (a large city) BT haven't given everyone fibre. Perhaps it's because Virgin have snuck in. A friend in Hull had big problems with some little local telephone company taking over. I have principles too, but I'd sell my sister to get fast internet.
I was born in Cambridge, and I spent four years of my adult life there. It's a small place: I doubt it's even legally a city. More like a large town - and an important one.

And it has a large rural hinterland. Dave is free to choose how much he wants to disclose of his distance from the centre: my four years were spent "within three miles of Great St. Mary's". And there wasn't any broadband.
13) Message boards : Number crunching : Hardware requirements for upcoming models (Message 65955)
Posted 23 Aug 2022 by Richard Haselgrove
Post:
Where are you in the UK that you can't get semi-fibre? They've been rolling out full fibre to the home for years, I thought semi-fibre was long since finished.
Learn to do your own research: https://labs.thinkbroadband.com/local/

Dave lives in England. 97.7 % is not 100%
14) Message boards : Number crunching : Hardware requirements for upcoming models (Message 65942)
Posted 22 Aug 2022 by Richard Haselgrove
Post:
And now I've got the machine idle, time for the security updates!
15) Message boards : Number crunching : Hardware requirements for upcoming models (Message 65941)
Posted 22 Aug 2022 by Richard Haselgrove
Post:
OK, I'm running the tests now. Computer is 945095: it has 6 cores, but I normally run it at 85% to keep things loose. Milkyway has picked it up as a 5-core, and all the comms have come through for 5: ncpus is set to five, and there's no sign of nthreads.

First task was 424163425: Using OpenMP 5 max threads on a system with 6 processors

Then I set an app_config to avg_ncpus 3 (BOINC Manager still said 5 - that doesn't change until you fetch new work).
Task ID 424163428: Using OpenMP 3 max threads on a system with 6 processors

Third test: I released the overall CPU count to 100%, and allowed two tasks to run at once.
Tasks 424163431, 424163427, both together Using OpenMP 3 max threads on a system with 6 processors

Fourth test - back to avg_ncpus 5 in app config, still at 100% CPU.
Task 424163430: Using OpenMP 5 max threads on a system with 6 processors

Finally, avg_ncpus 6 and 100% CPU.
Tasks 424163455, 424163432: Using OpenMP 6 max threads on a system with 6 processors

So it seems you're right: In the specific case of MilkyWay nbody, thread usage is entirely controlled by avg_ncpus, and doesn't need --nthreads. I'm impressed - I don't know how they've done that. If Glenn can find out how they've pulled off a neat trick, he might want to use it: but for the rest of us, RTFM is safest.
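
For anyone repeating these tests, the thread count can be pulled out of the stderr text mechanically. A small Python sketch, assuming the line format is exactly as quoted above; the function name is made up.

```python
import re

# Matches the line MilkyWay nbody writes to stderr, as quoted in this post.
OPENMP_LINE = re.compile(
    r"Using OpenMP (\d+) max threads on a system with (\d+) processors"
)


def openmp_threads(stderr_text):
    """Return (threads, processors) from stderr, or None if the line is absent."""
    match = OPENMP_LINE.search(stderr_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(openmp_threads("Using OpenMP 3 max threads on a system with 6 processors"))
# (3, 6)
```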
16) Message boards : Number crunching : Hardware requirements for upcoming models (Message 65936)
Posted 22 Aug 2022 by Richard Haselgrove
Post:
@Glenn Carver:

In the specific case of planning IFS for CPDN, it might be easiest to have the BOINC requirements in mind at an early stage in the process. For example,

https://boinc.berkeley.edu/trac/wiki/AppMultiThread
https://boinc.berkeley.edu/trac/wiki/AppPlan (where the --nthreads [space] N format is again mentioned).
17) Message boards : Number crunching : Hardware requirements for upcoming models (Message 65935)
Posted 22 Aug 2022 by Richard Haselgrove
Post:
* both --nthreads x and --nthreads=x work.
Thanks. I'd like to nail that one exactly, if we can. I copied my original reference directly from the BOINC User Manual, where they document it without the equals sign - I can change that if it's wrong. I'll try and look it up in the OpenMP documentation if I can.

I'm surprised by the reports that avg_ncpus also controls --nthreads - I can't see any point in the pathway where that could be implemented. But it's possible that MilkyWay have found a way of implementing it, which might be for their project only. I have a couple of six-core Linux machines, so I can test with one of those.
18) Message boards : Number crunching : Hardware requirements for upcoming models (Message 65924)
Posted 21 Aug 2022 by Richard Haselgrove
Post:
The acid test is to look at the stderr output from one of your tasks. For example, task 421516802 is from your 16-core computer:

<stderr_txt>
<search_application> milkyway_nbody 1.76 Linux x86_64 double  OpenMP, Crlibm </search_application>
Using OpenMP 4 max threads on a system with 16 processors
'OpenMP' is the tool BOINC is designed around and - as you say - in your case the thread count is limited to 4. My understanding is that under OpenMP, that will have been set by an '--nthreads=4' directive on the command line. If you don't have it in your app_config.xml, it's possible that they have - at long last - configured their server in the way I suggested CPDN may have to do.

I participated in some of the very early testing of nbody at Milkyway, and I found it very frustrating dealing with the management team of the day. I gather things have improved, but I'm not currently active there. For some background, read my message 58339 of 19 May 2013. I don't run many high core-count computers myself, and I mainly run Windows computers. If I could, I'd run something like the Windows utility 'Process Explorer' to see exactly how many threads the calling application spawned, under various conditions. Something like that should be possible under Linux too.
19) Message boards : Number crunching : Uploads slow and overtake the system (Message 65922)
Posted 21 Aug 2022 by Richard Haselgrove
Post:
I think that BOINC Manager displays an averaged speed, derived from "the position reached in the file, divided by the elapsed time since the start of the transfer" (or something like that). On these long-distance intercontinental links, it's not unusual for individual packets to be lost from the data stream. The internet TCP/IP protocol is designed to be tolerant of these known weaknesses: each packet has to be acknowledged by the receiver, and if no ack is received, the sender repeats the packet until it gets through.

This means that on big transfers like CPDN's, there can be temporary 'holes' in the received file, later patched by these resent packets. I think BOINC's indicated speed is based on the 'solid' part of the file, up to the first 'hole'. On the other hand, I think that Windows Task Manager will be indicating the instantaneous speed, including the resent packets. The difference may simply be down to the different measurement techniques.
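
To put numbers on that guess, here is a Python sketch of the two measurements. Both formulas are assumptions drawn from the description above, not from the BOINC or Windows source, and the figures are invented.

```python
def averaged_speed(contiguous_bytes, elapsed_seconds):
    """Speed as 'position reached in the file' divided by time since the start."""
    return contiguous_bytes / elapsed_seconds


def instantaneous_speed(bytes_on_wire, interval_seconds):
    """Speed over a short sampling window, counting resent packets too."""
    return bytes_on_wire / interval_seconds


# Invented example: 100 s into an upload, 20 MB is 'solid' up to the first hole,
# while the last second carried 500 KB including retransmissions.
print(averaged_speed(20_000_000, 100))   # 200000.0 bytes/s
print(instantaneous_speed(500_000, 1))   # 500000.0 bytes/s
```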
20) Message boards : Number crunching : Hardware requirements for upcoming models (Message 65917)
Posted 21 Aug 2022 by Richard Haselgrove
Post:
@ AndreyOR,

The app_config.xml file in that example is insufficient. It should read:

<app_config>
   <app_version>
      <app_name>milkyway_nbody</app_name>
      <plan_class>mt</plan_class>
      <avg_ncpus>4</avg_ncpus>
      <cmdline>--nthreads 4</cmdline>
   </app_version>
</app_config>
For efficient use of the CPU, you need both commands.

<avg_ncpus> controls BOINC, and tells it how to schedule other work around the multi-threaded task.

--nthreads on the command line is passed to the multi-threaded app itself, and is intended to limit its own core usage.
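
Since the two numbers have to agree, one way to avoid typos is to generate the file. A Python sketch using only the standard library; the function name is made up, and "--nthreads N" follows the space-separated form given above.

```python
import xml.etree.ElementTree as ET


def make_app_config(app_name, threads, plan_class="mt"):
    """Build app_config.xml text with avg_ncpus and --nthreads set to one value."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # avg_ncpus tells BOINC how many cores to reserve for the task.
    ET.SubElement(version, "avg_ncpus").text = str(threads)
    # cmdline is passed through to the science app itself.
    ET.SubElement(version, "cmdline").text = f"--nthreads {threads}"
    ET.indent(root, space="   ")  # requires Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


print(make_app_config("milkyway_nbody", 4))
```

The printed text can be saved as app_config.xml in the project's folder under the BOINC data directory.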

@Glenn Carver, @Jean-David Beyer,

If a multi-threaded application is deployed via a BOINC server, the automatic (default) behaviour is for the server to configure each task to use every one of the cores reported by the requesting client - 16, in Jean-David's example. The assumption is that the application will understand the --nthreads directive, and configure itself accordingly. If the proposed IFS application uses a different MT calling convention, it will require a bespoke modification of the CPDN server code.



©2022 climateprediction.net