climateprediction.net home page
Posts by Bryn Mawr

21) Message boards : Number crunching : no credit awarded? (Message 67752)
Posted 15 Jan 2023 by Bryn Mawr
Post:
Over the last few weeks, a lot of my tasks have uploaded. And if I look at the CPDN web site for my tasks, almost all of them have received credit. Yet if I look at my projects list, my average work done has steadily decreased to about 50.54. If I look at the statistics graph for the last month, it has dropped from 1600 to about 50.

I am confused that the CPDN web site seems to know I have received lots of credit but wherever the projects list and the statistics graph get their data from, they have not been getting the credits for a month or so.

Notice: my signature box (below) seems to know I am getting credits (though I do not know if it is up-to-date).


I’d second this.

Over the past month I’ve accumulated about 275,000 credits (about 240,000 over the past fortnight) and Boinc is showing a RAC of 12.05 overall and zero for the host that’s doing the work.

BoincStats is showing no better.
22) Message boards : Number crunching : The uploads are stuck (Message 67569)
Posted 11 Jan 2023 by Bryn Mawr
Post:
I’ve now cleared my backlog and files are still trickling up slowly - only trouble is that I’m generating new files just slightly more often than the link is clearing them :-)

Definitely progress as all of the outstanding WUs have gone and I can see my task list again.
23) Message boards : Number crunching : Best Swap file size for CPDN? (Message 67137)
Posted 30 Dec 2022 by Bryn Mawr
Post:
Server: 128c/256t

Currently running 512GB memory with a default OS Swap file of 2.1GB (which is 75% - 100% in usage).

What amount should I set it to for the best results with CPDN?


Zero.

Adjust the number of tasks you run so that you don’t swap out - more efficient and less likely to crash the tasks.
24) Message boards : Number crunching : OpenIFS Discussion (Message 67018)
Posted 23 Dec 2022 by Bryn Mawr
Post:
Hi
One of mine just ended but not accepted:
https://www.cpdn.org/result.php?resultid=22272660



<core_client_version>7.17.0</core_client_version>
<![CDATA[
<message>
Process still present 5 min after writing finish file; aborting</message>
<stderr_txt>


Looks silly


I’m glad I’m not alone - I just came in to report the same error.
25) Message boards : Number crunching : OpenIFS Discussion (Message 67010)
Posted 22 Dec 2022 by Bryn Mawr
Post:
Is there any way that this could be made a user selectable option to set the default value before it is downloaded? I would want this on every WU I process and I can imagine so would all the other volunteers who process 24/7.
We've had this discussion about adjusting the checkpointing already in this (or another) thread - if I wasn't supposed to be wrapping Christmas presents I'd find it.

This is never going to be a user selectable option because it requires an understanding of how the model works and if you get it wrong, it could both seriously thrash your filesystem and delay your task. The model is capable of generating very large amounts of output, which have to be tuned carefully to run on home PCs. We might tweak it after these batches if it proves to be causing problems, which is why feedback is always welcome.


So don’t make it infinitely variable, just give the users the choice between 2 or 3 “safe” values?
26) Message boards : Number crunching : OpenIFS Discussion (Message 67004)
Posted 22 Dec 2022 by Bryn Mawr
Post:
Adjusting write I/O from OpenIFS tasks

Further to Conan's point about the amount of write I/O. It can be adjusted but only once a task has started running. The adjustment made will reduce the checkpoint frequency, meaning if the model does have to restart from a shutdown, it will have to repeat more steps. This change does NOT affect the model's scientific output as that's controlled differently.

ONLY make this change if you leave the model running near-continuously with minimal possibility of a restart. Do NOT do it if you often shut down the PC or boinc client, otherwise it will hamper the progress of the task. If in doubt, just leave it.

To make the change:
1/ shut down the boinc client & make sure all processes with 'oifs' in their name have gone.
2/ change to the slot directory.
3/ make a backup copy of the fort.4 file (just in case): cp fort.4 fort.4.old
4/ edit the text file fort.4, locate the line:
NFRRES=-24,
and change it to:
NFRRES=-72,
Preserve the minus sign and the comma. This will reduce the checkpoint frequency from 1 day (24 model hrs) to 3 days (72 model hrs). But, it will mean the model might have to repeat as many as 3 model days if it has to restart.
5/ restart the boinc client.

The changes can only be made once the model has started in a slot directory, not before.
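The steps above can be condensed into a few commands. This is a sketch run against a throwaway copy of fort.4, not a live slot directory: the /tmp/oifs-demo path and the one-line stand-in namelist are invented for illustration, and on a real host you would first stop the boinc client (step 1) and cd into the task's slot directory (step 2), whose path varies per host.

```shell
# Scratch-copy demonstration of steps 3-4. On a real host, stop boinc and
# cd into the slot directory instead of this demo directory.
mkdir -p /tmp/oifs-demo && cd /tmp/oifs-demo
printf ' NFRRES=-24,\n' > fort.4              # stand-in for the real namelist
cp fort.4 fort.4.old                          # step 3: keep a backup
# step 4: checkpoint every 72 model hours instead of every 24
sed -i 's/NFRRES=-24,/NFRRES=-72,/' fort.4
grep NFRRES fort.4                            # prints " NFRRES=-72,"
```

The sed pattern includes the minus sign and trailing comma, so both are preserved as required.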


Is there any way that this could be made a user selectable option to set the default value before it is downloaded? I would want this on every WU I process and I can imagine so would all the other volunteers who process 24/7.
27) Message boards : Number crunching : OpenIFS Discussion (Message 66573)
Posted 25 Nov 2022 by Bryn Mawr
Post:
Add a project max concurrent to that and job's a good'un


Do you mean like this?

$ cat app_config.xml 
<app_config>
    <project_max_concurrent>5</project_max_concurrent>
    <app>
        <name>OpenIFSname1</name>
        <max_concurrent>2</max_concurrent>
    </app>
    <app>
        <name>OpenIFSname2</name>
        <max_concurrent>2</max_concurrent>
    </app>
    <app>
        <name>hadam3_8.09</name>
        <max_concurrent>3</max_concurrent>
    </app>
    <app>
        <name>hadam3_8.52</name>
        <max_concurrent>3</max_concurrent>
    </app>
    <app>
        <name>hadcm3s_8.36</name>
        <max_concurrent>3</max_concurrent>
    </app>
    <app>
        <name>hadsm3_8.02</name>
        <max_concurrent>3</max_concurrent>
    </app>
</app_config>


If so, why have the itemized list of the traditional work units in there at all?


Yes, the project max controls the overall count and the itemised list controls the individual apps and you have as much control as you want.
28) Message boards : Number crunching : OpenIFS Discussion (Message 66564)
Posted 24 Nov 2022 by Bryn Mawr
Post:
You would need to add a separate <app>...</app> section for each IFS variant, once we know the exact application names in use. You could then use <max_concurrent> to limit each IFS type, but I don't see a way to limit the total number of IFS tasks across all types, once multiple versions are in play at the same time.

Depends how likely it is that batches of multiple variants will be out there at once I guess.


I have an upper limit of 12 (in the winter, and 8 in the summer) of total BOINC jobs. (No air conditioning.)
I am currently allowing an upper limit of 5 CPDN jobs. So now I have a prototype app_config.xml file that allows a max of 2 for each of the OpenIFS types and a max of 3 for each of the "traditional" ones. If there is only one variant of each distributed at a time, I should be OK. And the max limit of BOINC jobs will prevent disaster. I hope.

It looks, in part, like this:
$ cat app_config.xml 
<app_config>
    <app>
        <name>OpenIFSname1</name>
        <max_concurrent>2</max_concurrent>
    </app>
    <app>
        <name>OpenIFSname2</name>
        <max_concurrent>2</max_concurrent>
    </app>
    <app>
        <name>hadam3_8.09</name>  <---<<<
        <max_concurrent>3</max_concurrent>
    </app>

I guessed at the names of the traditional tasks. One is marked with <---<<< . Is that the correct way the traditional ones are named?


This prototype is not in effect yet.


Add a project max concurrent to that and job's a good'un
29) Message boards : Number crunching : New work discussion - 2 (Message 66281)
Posted 2 Nov 2022 by Bryn Mawr
Post:
I will not restrict my 24 core box to running 4 cores with the other 20 waiting for memory - I’ll block the OpenIFS jobs if they won’t play happily.


24 cores, or 24 threads? There's a difference, and especially for CPDN tasks, "more threads" is not always better.

I've got a pair of 3900X boxes (12C/24T), and I've written some scripts that track "instructions retired per second." I rarely see a difference between 12 and 18 tasks running for most BOINC workloads (and if I do, the 18 task box is usually accomplishing less actual work per second), and the CPDN tasks typically seem to peak around 8 threads, though I don't recall seeing much of a difference dropping to 6. It's not just the cores that matter - it's the cache. I've absolutely seen "more threads mean lower aggregate system throughput," and CPDN is particularly bad for that.

I would expect 4C on a 12C/24T processor to be below optimum, but... depending on the tasks, not by much. Though we'll have to see once the actual OpenIFS tasks show up.

There are some WCG tasks that use very little cache and I get linear speedups with number of threads assigned, but CPDN is definitely not that.


24T, I run 3900 rather than 3900X as they only pull 65w.

I always run a mix, no more than 4 CPDN, no more than 6 Rosetta and the rest WCG, TN-Grid and SIDock and I find that sort of mix is fairly happy.

I agree, running fully loaded runs up against the peak package power of the CPU, running fewer threads pulls the same power by running faster clock speeds but I’m more the big kid than the deep analytical thinker :-)
30) Message boards : Number crunching : New work discussion - 2 (Message 66272)
Posted 29 Oct 2022 by Bryn Mawr
Post:
I will not restrict my 24 core box to running 4 cores with the other 20 waiting for memory - I’ll block the OpenIFS jobs if they won’t play happily.
I might be wrong but I think in this situation you would not get the OpenIFS tasks anyway, because the server would see there's not enough free memory available. Remember it's boinc making the decisions, not the model.


I’ll give them their chance and I certainly won’t shoot the messenger, these new tasks sound perfect for those with the kit to run them but, same as the Rosetta Python tasks, if my set-up is not up to the job of running them and trying restricts my ability to run other work then I’ll block them and run what work I can.
31) Message boards : Number crunching : New work discussion - 2 (Message 66265)
Posted 29 Oct 2022 by Bryn Mawr
Post:
"BOINC starts multiple OpenIFS tasks because there are free CPU slots, even though the total memory for the tasks exceeds what's available. "

Can this be overcome by limiting the number of cores available to BOINC before downloading any of the IFS models? Although I have a four-core CPU, the box only has 24 GB of RAM.


I will not restrict my 24 core box to running 4 cores with the other 20 waiting for memory - I’ll block the OpenIFS jobs if they won’t play happily.
32) Message boards : Number crunching : New work discussion - 2 (Message 66125)
Posted 20 Sep 2022 by Bryn Mawr
Post:
I'd like to put it more bluntly and say that CPDN tasks are definitely very sensitive to interruptions (and I believe it's relatively well documented in the forums). By far the worst of any project I'm aware of. Even a couple of LHC subprojects that must be run to completion without interruption will just restart from the beginning.

CPDN's error rate is at least 10%; Bryn Mawr's (who posted above) is over 11%. Mine is over 22%. Many of those are due to restarts (especially if it happens more than once). I'd expect CPDN to have a higher error rate than other projects for valid reasons (i.e. "Negative Pressure Detected"). But for a project whose workunits take days to weeks to complete, a 10%+ error rate is too high, I think, as it means that days' and weeks' worth of processing time is wasted because the tasks can't handle interruptions well.

Glenn, it's encouraging to hear that you'd like to look into this and potentially fix it. I'm not sure which OS is worse but the issue affects Windows, macOS, and Linux tasks.


Whilst I have had errors, mostly negative theta, I have not had a task fail on restart in a long time. Then again, I very rarely restart more than once during the running of a single task.
33) Message boards : Number crunching : New work discussion - 2 (Message 66120)
Posted 20 Sep 2022 by Bryn Mawr
Post:
Dave, that's a very poor survival for the linux tasks. Other projects seem to handle a cold restart just fine. I am surprised because operational models are pretty resilient to hardware & data failures but it could be something in the wrapper code that's not tolerating restarts properly. I'll ask the CPDN team as I'm interested to find out.

May not be quite that bad. I will when work appears again, start keeping some real data on this rather than relying on my impressions.


I can only report my experiences. I do not take any precautions when rebooting (Ubuntu 20.04) and I have not had any CPDN fails in a couple of years.
34) Message boards : Number crunching : New work Discussion (Message 66048)
Posted 5 Sep 2022 by Bryn Mawr
Post:
Milkyway for example has a 'project preferences' page under the user account which allows you to limit the number of cores in workunits sent to you. CPDN doesn't support this at present because until now they have not done any multicore work.


True. For all projects, I believe the <project_max_concurrent>4</project_max_concurrent> will work. It certainly works for all of mine. You can pick a different number for each project. For MilkyWay, the app_version stuff, especially <avg_ncpus>4</avg_ncpus>, limits the number of processors per work unit. I would be very surprised if current ClimatePrediction tasks look at this. If you put something like this in for CPDN, I do not know whether it would cause errors, but it would almost certainly be ignored.
[/var/lib/boinc/projects/milkyway.cs.rpi.edu_milkyway]$ cat app_config.xml 
<app_config>
    <project_max_concurrent>4</project_max_concurrent>
    <app_version>
        <app_name>milkyway_nbody</app_name>
        <plan_class>mt</plan_class>
        <avg_ncpus>4</avg_ncpus>
    </app_version>
</app_config>

[/var/lib/boinc/projects/climateprediction.net]$ cat app_config.xml 
<app_config>
    <project_max_concurrent>4</project_max_concurrent>
</app_config>




That would quite happily limit MilkyWay to running 4 WUs, each using 4 cores, and CPDN to running 4 WUs at any time, but it would not limit the number of cores that each CPDN WU used.

It would be interesting to see how much work would need to be done on the CPDN server to implement average CPUs and total CPUs - it might just be a matter of filling a data field within each WU with the number of CPUs it is set up to grab; the rest of the checking might be part of the standard Boinc server software.
35) Questions and Answers : Unix/Linux : *** Running 32bit CPDN from 64bit Linux - Discussion *** (Message 65577)
Posted 17 Jun 2022 by Bryn Mawr
Post:
Its a Linux host.

Any way to limit the number of jobs it does at once? I run other projects on the VM as well and don't want to lower the thread count.


max_concurrent in app_config.xml but be aware that this can lead to runaway downloads if you’re unlucky.
36) Message boards : Number crunching : New work Discussion (Message 65562)
Posted 14 Jun 2022 by Bryn Mawr
Post:
Sorry, I've been away from this project for a while, and hadn't kept up to date with recent changes. It would normally be on https://www.cpdn.org/prefs.php?subset=project, but I see it's been taken away.


The setting that caused me grief of a similar nature was no_alt_platform in the cc_config.xml file. With this set on the system worked fine with all other projects but would not download any CPDN WUs.
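For reference, that flag lives in the <options> section of cc_config.xml in the BOINC data directory. A minimal sketch (0 is the default and allows the client to fall back to alternate platforms, e.g. running 32-bit CPDN apps on a 64-bit host; 1 disables that fallback, which is what blocked the downloads):

```xml
<cc_config>
    <options>
        <no_alt_platform>0</no_alt_platform>
    </options>
</cc_config>
```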
37) Message boards : Cafe CPDN : World Community Grid mostly down for 2 months while transitioning (Message 65494)
Posted 4 Jun 2022 by Bryn Mawr
Post:
It worked fine before, why are they messing about?
Because they need back end systems to create WUs in the first place and validate and post process the WUs on return, all of which is project related and not part of Boinc.
But they already had this and will be using the same scientific programs as before, they're not going to change all that. And why on earth didn't they get this one up and running before they stopped using the other one?! Imagine if Google shut down for 3 months while they moved house.


Evidently they have been changing all that, probably to make it easier to launch new projects in the future.
38) Message boards : Cafe CPDN : World Community Grid mostly down for 2 months while transitioning (Message 65491)
Posted 3 Jun 2022 by Bryn Mawr
Post:
If you look at the updates they’ve provided, the development they’re doing is WebSphere / Message Broker. Whilst MB has been around for a long time, WS is quite new, and the combination is very much current technology for real-time transaction processing; it is complex and difficult to get right, especially if you get into scenarios like dual-centre working for security fallback.

This I know having worked on just such a system and seen the problems first hand.
It worked fine before, why are they messing about?


Because they need back end systems to create WUs in the first place and validate and post process the WUs on return, all of which is project related and not part of Boinc.
39) Message boards : Cafe CPDN : World Community Grid mostly down for 2 months while transitioning (Message 65483)
Posted 3 Jun 2022 by Bryn Mawr
Post:

What's concerning to me is that BOINC is not using fancy, bleeding edge, failure prone technology. It's an absolutely ancient technology stack - straight up LAMP, as far as I can tell from the server install guides (Linux, Apache, MySQL, PHP). It's not the sort of thing that should be hard to port, and while WCG is more complex than some others, unless they've gone absolutely nuts or have zero "legacy Linux sysadmins," it shouldn't take more than a couple weeks to move. I assume the bulk of the time was moving data around, but... even then, just drive a server around and hook up 10G server to server. I don't get how this is nearly so long a transition as it is, and I'm not at all optimistic that they're "Nearly, almost, just a tippy tappy bit more... almost... any day now, soon... *crickets*" about the move.


If you look at the updates they’ve provided, the development they’re doing is WebSphere / Message Broker. Whilst MB has been around for a long time, WS is quite new, and the combination is very much current technology for real-time transaction processing; it is complex and difficult to get right, especially if you get into scenarios like dual-centre working for security fallback.

This I know having worked on just such a system and seen the problems first hand.
40) Message boards : Number crunching : Windows Work Units (Message 65444)
Posted 15 May 2022 by Bryn Mawr
Post:
Is this project generating any work for windows machines? I've had a Windows 10 machine attached for several months and nada . . . I used to get work up until a year or so -- stopped getting work so removed the project from my four machines and put them to work on other things. Then reconnected one machine awhile back and still nada . . .


No Windows work units for a long time. Currently Mac with a side portion of Linux unless you spin up a vm to run the work in.



©2024 climateprediction.net