New work discussion - 2

Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66462 - Posted: 16 Nov 2022, 9:57:59 UTC - in response to Message 66460.  

Go back and read what Glenn and I have been saying. It's the CPDN app (alone) that decides when its data is in a consistent state for a checkpoint. BOINC cannot 'ask it to checkpoint': the most it can ask for is a delay to the checkpoints, so the apps can get on with the science.
Let's say I've limited CPDN severely to only checkpoint every 3 hours to save wear on my drive, using the Boinc setting. But say Boinc wishes to switch to another project after 2 hours. Surely it ought to let CPDN know it can write a checkpoint now? And what if I shut down the machine? Obviously I'd prefer a checkpoint to be written before shutdown.

I guess this is why I had to set everything to leave suspended tasks in memory, to get around this screwup.

No. Ultimately, the operating system is the boss. It will ask BOINC to shut down, and in turn BOINC will ask CPDN to shut down. The OS will wait for a polite interval, and if they don't respond, it will kill them.
But with a non-Boinc program, if that program signals it doesn't want to shut down (e.g. a word processor has unsaved work), Windows will warn me it's "preventing shutdown", and I can choose to cancel and go save it, or wait longer. Shouldn't the same happen with Boinc? A lack of communication is a terrible thing.
ID: 66462
Glenn Carver
Joined: 29 Oct 17
Posts: 798
Credit: 13,526,425
RAC: 4,698
Message 66463 - Posted: 16 Nov 2022, 10:42:21 UTC - in response to Message 66450.  

Glenn, do you happen to have an app_config with any restrictions? I remember having a puzzling problem of not being able to get tasks for another project some time ago but unfortunately I don't remember the details. I want to say that the problem ended up being with app_config but don't really remember for sure.
I don't use app_config files. They just complicate the setup and I like to keep things as simple as possible. One thing that bugs me about boinc is the number of knobs you can twiddle which are not well documented and often don't do what you think they will.

Perhaps try some of the other Event Log flags and see if any hints come from any of them.
I found this useful forum post here which described how the scheduler decides. But my client is empty with no other projects. I've got as far as I can go debugging at my end. I need to see the server logs. I'm visiting the CPDN team next week, I'll find out then.

It might be that I adjust my job cache from the default to 'store at least'=0.5, 'store up to..'=0.01 (*~1hr) because I like to keep a small job cache which reports jobs as soon as they are done. I remember I fiddled with this on WSL/Ubuntu and suddenly got a task. I tried increasing it this morning but of course all the tasks have now gone :) But I don't see why the server should be paying attention to the size of my job cache, which in this case is empty anyway, as it's the number of jobs not the duration of each job. I need to see the server logs otherwise I'm just wasting time twiddling boinc knobs.

It does seem a bit odd that some of your PCs don't show anything in app version info or don't match tasks completed. I wonder if resetting the project may help in any way?
Some of my hosts are old incarnations of the same hardware. Unfortunately there's no facility to delete them (another issue I have).

I did try resetting the project but that didn't help, other than re-downloading all the project files again. I also tried detaching & reattaching but that didn't do anything. As Richard pointed out, the server searches its database for a match when a 'new' client attaches and reuses what it finds, so I went back to a long wait.
ID: 66463
Glenn Carver
Joined: 29 Oct 17
Posts: 798
Credit: 13,526,425
RAC: 4,698
Message 66466 - Posted: 16 Nov 2022, 11:16:58 UTC

Input for CPDN team

I'm visiting the CPDN team next week for a couple of days to progress the OpenIFS work. If there are any issues people on the forum would like me to mention let me know. My initial wishlist is below.

Please! No comments about lack of work or why is X app not in Y OS because these are not generally under CPDN control. Just project specifics.

  • Reduce task deadline times to something more realistic (currently 1?yr).
  • Badges: both CPDN general and CPDN project specific (they are keen to do this - just a question of resource).
  • Project preferences webpage:
      - add choice of apps to run
      - add max number cores per app
      - add max tasks per client
  • ... add your thoughts here...

Thank you.

ID: 66466
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,479,688
RAC: 5,922
Message 66467 - Posted: 16 Nov 2022, 11:44:46 UTC - in response to Message 66463.  

It might be that I adjust my job cache from the default to 'store at least'=0.5, 'store up to..'=0.01 (*~1hr) because I like to keep a small job cache which reports jobs as soon as they are done. I remember I fiddled with this on WSL/Ubuntu and suddenly got a task. I tried increasing it this morning but of course all the tasks have now gone :) But I don't see why the server should be paying attention to the size of my job cache, which in this case is empty anyway, as it's the number of jobs not the duration of each job. I need to see the server logs otherwise I'm just wasting time twiddling boinc knobs.


Next batch is out now.

Some of my hosts are old incarnations of the same hardware. Unfortunately there's no facility to delete them (another issue I have).
If you go to your account page and then your computers, there is an option to merge hosts by name, so hosts with the same OS, name and hardware can be merged. Different OSes don't merge, however, so my Windows client running under WINE won't merge with my Linux client running natively under Ubuntu.
ID: 66467
Richard Haselgrove
Joined: 1 Jan 07
Posts: 941
Credit: 34,120,827
RAC: 2,006
Message 66471 - Posted: 16 Nov 2022, 13:35:48 UTC - in response to Message 66456.  

15/11/2022 21:33:47 | climateprediction.net | [sched_op] estimated total CPU task duration: 87303 seconds
@Richard
Is the "estimated total CPU task duration" dependent on benchmark results?
The actual duration estimate is calculated locally, but from two values supplied by the server. They are the 'size' and the 'speed' of the task in question.

The size is stored in the technical workunit definition of the job this task was derived from, and the speed from the application that will be processing it. Using figures from my host 1498009,

<rsc_fpops_est> 1215407627777776.750000
<flops> 7628619026.028317
(how on earth did we get to three-quarters of a floating point operation?)

That particular one - size / speed - works out to 159,322 seconds: it's a different model from the one I showed yesterday, _3_ instead of _2_
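For anyone who wants to check the arithmetic, here is a minimal Python sketch of the size / speed division using the two figures quoted above. The function and variable names are made up for illustration; this is not the BOINC client's own code, just the same division done by hand.

```python
# Figures quoted above for host 1498009.
RSC_FPOPS_EST = 1215407627777776.75   # task "size": estimated floating point operations
FLOPS = 7628619026.028317             # app "speed": floating point operations per second


def estimated_duration_seconds(fpops_est: float, flops: float) -> float:
    """Return the runtime estimate as size divided by speed (hypothetical helper)."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    return fpops_est / flops


seconds = estimated_duration_seconds(RSC_FPOPS_EST, FLOPS)
print(f"{seconds:,.0f} seconds (about {seconds / 3600:.1f} hours)")
```

Swapping in the values from another task's workunit definition and app version should reproduce that task's estimate in the same way.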

The size figure is provided by the person who put the jobs together. It's the one thing which is absolutely under the direct control of the project team.

The speed figure has changed over the years. Up to 2010, it was exactly the benchmark figure supplied by the host, and there was a fiddle factor (DCF) calculated and maintained by the client to correct it for local running conditions.

After 2010 (and the addition of support for GPU processing to BOINC), DCF wasn't sophisticated enough to handle the differing speeds encountered. It was removed, and replaced by a different fiddle factor (APR) calculated and maintained on the server. You can see it in the Application details for the machine I linked: you'll see that there's a separate APR ('Average processing rate') for each application version. This task is due to run on version 8.02 for N144 tasks - I can't explain why the figures differ (*).

The old system started with a DCF of 1 when you attached the computer to the project in the first place, and started tracking it immediately: it was nudged up or down each time a task finished.

The new system should (I think) start with the benchmark for the host, but the big flaw is that it doesn't start tracking it until 11 tasks have been successfully completed with that particular app. That's a big problem for projects with long-running tasks like CPDN: imagine what would happen if we were still running months-long coupled models on single core computers!
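To make the flaw concrete, here is a toy Python sketch of the behaviour as described in this post. It is not taken from the BOINC server source: the function name, the exact switching rule and the "half speed" host are all assumptions for illustration, and only the threshold of 11 and the two quoted figures come from the post.

```python
MIN_COMPLETED_TASKS = 11  # threshold as described above; an assumption, not checked against server code


def speed_for_estimate(benchmark_flops, apr_flops, completed_tasks,
                       threshold=MIN_COMPLETED_TASKS):
    """Pick the speed used for runtime estimates (hypothetical rule).

    Use the host benchmark until `threshold` tasks have completed for this
    app version, then switch to the measured average processing rate (APR).
    """
    if apr_flops is not None and completed_tasks >= threshold:
        return apr_flops
    return benchmark_flops


SIZE_FPOPS = 1215407627777776.75      # size figure quoted earlier in this post
BENCHMARK_FLOPS = 7628619026.028317   # flops figure quoted earlier, treated here as the benchmark
MEASURED_FLOPS = BENCHMARK_FLOPS / 2  # made-up value: pretend the host really runs at half speed

for completed in (0, 10, 11):
    speed = speed_for_estimate(BENCHMARK_FLOPS, MEASURED_FLOPS, completed)
    hours = SIZE_FPOPS / speed / 3600
    print(f"{completed:2d} tasks completed: estimate {hours:.1f} hours")
```

Under these assumptions the estimate stays at the benchmark-based figure for the first ten tasks, however far off it is, and only corrects itself once the eleventh has finished.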

(*) my flops figure is actually quite close to the benchmark, so possibly CPDN have disabled that particular part of the server code, the same as they don't use the default CreditNew system. Glenn can ask the project people when he visits Oxford next week.
ID: 66471
SolarSyonyk
Joined: 7 Sep 16
Posts: 251
Credit: 31,538,452
RAC: 30,191
Message 66472 - Posted: 16 Nov 2022, 17:21:42 UTC - in response to Message 66466.  

Reduce task deadline times to something more realistic (currently 1?yr).


I don't mind if they're reduced to a few months, but much shorter starts to impact my ability to do work - and one of the reasons I'm aggressive about CPDN tasks is because they have long deadlines, so as long as the compute gets done in some reasonable period of time, I'm not being a nuisance to the project... I hope. My compute is purely off-grid solar, so only runs during the day, on days I have enough energy from the sun. In the summer, this can be 16+h/day of compute, in the winter it's a lot less, and I can go a week without enough energy to run the compute rigs during inversions.

I don't have a good sense what's the most useful for researchers - I know they shoot a lot of tasks out and expect "most of them back in some reasonable period," but I would guess that a task isn't of much value to them after three years of timeouts. So shortening it to something they can deal with would be useful, but I'd still prefer it on the long end of the window if possible, to deal with intermittent power situations (office compute is done on use-it-or-lose-it solar, though I do have a rig in the house helping heat in the winter as well, on a grid-tie system).

If faster result returns are helpful but longer term results are still useful, perhaps some sort of bonus for a rapid return of tasks? I know a lot of people, myself included, tend to "queue up as many tasks as possible when they're available" to have work for a long term.

Badges: both CPDN general and CPDN project specific (they are keen to do this - just a question of resource)


Don't care in the slightest. Couldn't tell you if I have any or what they do. Won't care if they show up.

Project preferences webpage:
- add choice of apps to run
- add max number cores per app
- add max tasks per client


I don't care which apps run, long as they can work on my system. I would prefer a way to limit CPDN to X tasks on the system at once, without having to fiddle around with things manually - as I've mentioned before, I typically find about 8 tasks per system on a 12C/24T Ryzen 3900X to be optimal in terms of total instructions retired per second - the cache thrashing with more tasks actually slows the system down. But I don't know a good way to make this more generic as a broad setting, so it may be something to just let people manage on their own.

I've also found that running "all the same task" on the systems helps throughput - this is less of an issue with CPDN, but having N216 and N144 mixed seems to hurt throughput compared to just running all N144 or N216 tasks. I've found this on other projects too, and my running theory is that having all the same stuff running means less cache pressure for the code sections loaded.
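Until something like that exists in the project preferences, the manual route is an app_config.xml file in the project's folder inside the BOINC data directory. Below is a small Python sketch that generates one capping the project at 8 concurrent tasks. It assumes the BOINC client's `project_max_concurrent` element; the helper name is made up, and where the file has to go depends on your install, so the script only prints the text rather than writing it anywhere.

```python
import xml.etree.ElementTree as ET


def build_app_config(max_tasks: int) -> str:
    """Return app_config.xml text limiting a project to `max_tasks` running tasks.

    Hypothetical helper: it only builds the XML string and does not touch
    the BOINC data directory.
    """
    if max_tasks < 1:
        raise ValueError("max_tasks must be at least 1")
    root = ET.Element("app_config")
    limit = ET.SubElement(root, "project_max_concurrent")
    limit.text = str(max_tasks)
    return ET.tostring(root, encoding="unicode")


# 8 is the figure found to work best on the 12C/24T Ryzen 3900X mentioned above.
config_text = build_app_config(8)
print(config_text)
```

The client needs to be told to re-read its config files (or be restarted) before a new app_config.xml takes effect.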

... add your thoughts here...


No more 32-bit Intel MacOS tasks! Those VMs are a pain to keep running!
ID: 66472
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66473 - Posted: 16 Nov 2022, 17:24:49 UTC - in response to Message 66472.  

I don't mind if they're reduced to a few months, but much shorter starts to impact my ability to do work - and one of the reasons I'm aggressive about CPDN tasks is because they have long deadlines, so as long as the compute gets done in some reasonable period of time, I'm not being a nuisance to the project... I hope.
An admin in here somewhere told me they actually like them in about a month. I just want the deadline to be what they want it to mean. Then if I'm not going to be able to do one in the time they like, I can abort it, or move them to faster machines.
ID: 66473
Glenn Carver
Joined: 29 Oct 17
Posts: 798
Credit: 13,526,425
RAC: 4,698
Message 66474 - Posted: 16 Nov 2022, 17:29:19 UTC - in response to Message 66467.  

Next batch is out now.
@Dave, Indeed, but still not getting any despite ~1hrly requests from my idle client and jobs showing in the queue.
Wed 16 Nov 2022 16:49:41 GMT | climateprediction.net | [work_fetch] REC 0.000 prio -0.000 can request work
Wed 16 Nov 2022 16:49:41 GMT | climateprediction.net | [sched_op] Starting scheduler request
Wed 16 Nov 2022 16:49:41 GMT | climateprediction.net | [work_fetch] request: CPU (352512.00 sec, 4.00 inst)
Wed 16 Nov 2022 16:49:41 GMT | climateprediction.net | Sending scheduler request: To fetch work.
Wed 16 Nov 2022 16:49:41 GMT | climateprediction.net | Requesting new tasks for CPU
Wed 16 Nov 2022 16:49:41 GMT | climateprediction.net | [sched_op] CPU work request: 352512.00 seconds; 4.00 devices
Wed 16 Nov 2022 16:49:43 GMT | climateprediction.net | Scheduler request completed: got 0 new tasks
I'll discuss with CPDN, something is clearly not right. I'm not spending any more time on it here.
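For anyone comparing their own Event Log against mine, here is a rough Python sketch that pulls the requested seconds, device count and number of tasks received out of lines like the ones above. The regular expressions are written against the two lines copied into the script and may need adjusting for other client versions or log flags; the function name is made up.

```python
import re

# Two of the event log lines quoted above, copied verbatim.
LOG_TEXT = """\
Wed 16 Nov 2022 16:49:41 GMT | climateprediction.net | [sched_op] CPU work request: 352512.00 seconds; 4.00 devices
Wed 16 Nov 2022 16:49:43 GMT | climateprediction.net | Scheduler request completed: got 0 new tasks
"""

REQUEST_RE = re.compile(r"CPU work request: ([\d.]+) seconds; ([\d.]+) devices")
REPLY_RE = re.compile(r"Scheduler request completed: got (\d+) new tasks")


def summarise_work_fetch(log_text):
    """Extract the last CPU work request and scheduler reply from a log excerpt."""
    summary = {"requested_seconds": None, "devices": None, "new_tasks": None}
    for line in log_text.splitlines():
        # Lines look like "timestamp | project | message"; only the message matters here.
        fields = line.split(" | ", 2)
        if len(fields) != 3:
            continue
        message = fields[2]
        request = REQUEST_RE.search(message)
        if request:
            summary["requested_seconds"] = float(request.group(1))
            summary["devices"] = float(request.group(2))
        reply = REPLY_RE.search(message)
        if reply:
            summary["new_tasks"] = int(reply.group(1))
    return summary


summary = summarise_work_fetch(LOG_TEXT)
print(summary)
```

On these lines the request works out to 352512 / 4 = 88128 seconds per device, roughly a day of work each, and the server still returned nothing.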

(*) my flops figure is actually quite close to the benchmark, so possibly CPDN have disabled that particular part of the server code, the same as they don't use the default CreditNew system. Glenn can ask the project people when he visits Oxford next week.
@Richard. I did ask if credit is related to flops calculations for OpenIFS and apparently it isn't. It's set to a number consistent with the credit for the Hadley Centre models.
ID: 66474
Glenn Carver
Joined: 29 Oct 17
Posts: 798
Credit: 13,526,425
RAC: 4,698
Message 66475 - Posted: 16 Nov 2022, 17:38:28 UTC - in response to Message 66473.  

I don't mind if they're reduced to a few months, but much shorter starts to impact my ability to do work - and one of the reasons I'm aggressive about CPDN tasks is because they have long deadlines, so as long as the compute gets done in some reasonable period of time, I'm not being a nuisance to the project... I hope.
An admin in here somewhere told me they actually like them in about a month. I just want the deadline to be what they want it to mean. Then if I'm not going to be able to do one in the time they like, I can abort it, or move them to faster machines.
Agree. Some of the upcoming OpenIFS batches have contract deadlines to meet, so I have asked for task deadlines to be significantly shortened for these workunits and for the server to limit the no. of tasks given to a client (to restrict the practice mentioned by @SolarSyonyk of creating big caches of jobs). This isn't to penalize users, it's so failed tasks can be quickly identified and rerun.

Otherwise, I agree with a month or two as the deadline (I'm also on solar&battery). But it does depend on the contract the scientist has with CPDN. Money to keep going is not easy for them to find and they need to promote the facility.
ID: 66475
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66476 - Posted: 16 Nov 2022, 17:44:16 UTC - in response to Message 66475.  

Agree. Some of the upcoming OpenIFS batches have contract deadlines to meet, so I have asked for task deadlines to be significantly shortened for these workunits and for the server to limit the no. of tasks given to a client (to restrict the practice mentioned by @SolarSyonyk of creating big caches of jobs). This isn't to penalize users, it's so failed tasks can be quickly identified and rerun.
In times of more users than tasks, perhaps duplicates of urgent work could be sent out in case one crashes or doesn't get done in time? OK, a waste of resources, but it could mean getting the science done quicker.
ID: 66476
SolarSyonyk
Joined: 7 Sep 16
Posts: 251
Credit: 31,538,452
RAC: 30,191
Message 66477 - Posted: 16 Nov 2022, 18:04:39 UTC - in response to Message 66473.  

An admin in here somewhere told me they actually like them in about a month. I just want the deadline to be what they want it to mean. Then if I'm not going to be able to do one in the time they like, I can abort it, or move them to faster machines.


Ok, yeah. Then the deadline should be set shorter. I'd run my rigs somewhat differently if there were a month deadline - less in the queue, and probably "fewer rigs, running closer to 24/7." Less total throughput, but quicker returns. Though at some point here, I plan to double my battery bank and that should resolve most of it as I'll be able to run loads 24/7 here most of the time...

In times of more users than tasks, perhaps duplicates of urgent work could be sent out in case one crashes or doesn't get done in time? OK, a waste of resources, but it could mean getting the science done quicker.


More users than tasks? *looks around* That'd be Windows tasks...

My understanding of the CPDN tasks is that there are no "urgent tasks," just "We'd like some percentage of them back before we go forward with analysis." They just sweep the parameters across the range of interest for whatever parameters are being studied, and as long as you get a good percentage back, you have the data you need - if you have 100 tasks with 0.001 differences between parameters, you can still get the shape of the curve without getting every task back.
ID: 66477
Mr. P Hucker
Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 66478 - Posted: 16 Nov 2022, 18:08:05 UTC - in response to Message 66477.  

More users than tasks? *looks around* That'd be Windows tasks...
Linux has been empty quite a lot here too. I did try that OS twice, but after banging my head on the wall several times, I went back to the easier Windows. I just can't stand Linux (too geeky) or Mac (too childish). Windows seems better suited to the average person like me.

My understanding of the CPDN tasks is that there are no "urgent tasks," just "We'd like some percentage of them back before we go forward with analysis." They just sweep the parameters across the range of interest for whatever parameters are being studied, and as long as you get a good percentage back, you have the data you need - if you have 100 tasks with 0.001 differences between parameters, you can still get the shape of the curve without getting every task back.
Ah, similar to Rosetta, and why they didn't need checking.
ID: 66478
Glenn Carver
Joined: 29 Oct 17
Posts: 798
Credit: 13,526,425
RAC: 4,698
Message 66479 - Posted: 16 Nov 2022, 18:38:53 UTC - in response to Message 66477.  
Last modified: 16 Nov 2022, 18:39:49 UTC

My understanding of the CPDN tasks is that there are no "urgent tasks," just "We'd like some percentage of them back before we go forward with analysis." They just sweep the parameters across the range of interest for whatever parameters are being studied, and as long as you get a good percentage back, you have the data you need - if you have 100 tasks with 0.001 differences between parameters, you can still get the shape of the curve without getting every task back.
That was probably the case back in the early days when CPDN was running very large ensembles of climate-length forecasts - I wasn't involved back then. That's not the case for the current projects, which are looking at extreme weather, changes in weather patterns due to changing climate etc. Such forecasts are much shorter: weeks/months. In these cases, the experiments are designed to require a 100% return. Hence the need to make sure we rerun failures quicker.

The project I'm most closely associated with wanted to put out 125,000 OpenIFS tasks, but a quick calculation showed it would be nearly Christmas before we had 80% back. The scientist doing the work has a contract that finishes in Feb 2023 and needs analysis done by end of the year in order to write reports/papers/etc. So we've had to compromise and will probably only put out ~60,000. But we'll watch the return rate and if we think more can be done in time, they will get sent out.

Hope that gives a picture of how things look from the scientist's side.
ID: 66479
Richard Haselgrove
Joined: 1 Jan 07
Posts: 941
Credit: 34,120,827
RAC: 2,006
Message 66480 - Posted: 16 Nov 2022, 19:03:49 UTC - in response to Message 66474.  

@Richard. I did ask if credit is related to flops calculations for OpenIFS and apparently it isn't. It's set to a number consistent with the credit for the Hadley Centre models.
Absolutely. That's always been the case ever since I've known the project (since 2007, which was the first time I got a workstation-class machine that could handle the work). CPDN also is the only project I know that uses trickles (they may even have asked for that facility specially), and awards intermediate credit at each trickle. That leaves the user with some concrete record of achievement, even if the task subsequently fails.
ID: 66480
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1053
Credit: 16,508,518
RAC: 286
Message 66481 - Posted: 16 Nov 2022, 19:10:09 UTC - in response to Message 66479.  

My understanding of the CPDN tasks is that there are no "urgent tasks," just "We'd like some percentage of them back before we go forward with analysis." They just sweep the parameters across the range of interest for whatever parameters are being studied, and as long as you get a good percentage back, you have the data you need - if you have 100 tasks with 0.001 differences between parameters, you can still get the shape of the curve without getting every task back.

That was probably the case back in the early days when CPDN was running very large ensembles of climate-length forecasts - I wasn't involved back then. That's not the case for the current projects, which are looking at extreme weather, changes in weather patterns due to changing climate etc. Such forecasts are much shorter: weeks/months. In these cases, the experiments are designed to require a 100% return. Hence the need to make sure we rerun failures quicker.


Way back when I first started, running Linux with a two-processor, 4 thread machine running at 3.06GHz, we had CPDN tasks that took several months to run. Luckily we did not have to worry about rebooting in those days. The CPDN work units would resume from where they left off (perhaps, since the most recent checkpoint?).

So it made sense to have long deadlines then. Today, with somewhat faster machines, and much shorter work units, they could consider cutting deadlines in half, or even to 25% of what they are now.
ID: 66481
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,479,688
RAC: 5,922
Message 66482 - Posted: 16 Nov 2022, 19:36:37 UTC

@Glenn,

I remember maybe nine months ago or a bit more @George had similar problems on a new install of Ubuntu. I am not sure he ever worked out what the problem was. At some point it started working normally. George didn't have access to the logs at the project, so your checking things out there might be useful, but I think it is worth your knowing that this is not a problem unique to your machine.
ID: 66482
Glenn Carver
Joined: 29 Oct 17
Posts: 798
Credit: 13,526,425
RAC: 4,698
Message 66483 - Posted: 16 Nov 2022, 20:46:58 UTC - in response to Message 66482.  

I remember maybe nine months ago or a bit more @George had similar problems on a new install of Ubuntu. I am not sure he ever worked out what the problem was. At some point it started working normally. George didn't have access to the logs at the project, so your checking things out there might be useful, but I think it is worth your knowing that this is not a problem unique to your machine.
That's a good point, I hadn't considered a delay for newly attached clients (which these are). However, Andy's just emailed to say he's not aware of any 'new client' delay in the server. So it's still a case of looking at the server logs next week. I will report back. It'll probably be something stupid I've done on the client, and then I'll grumble that it let me do it :)
ID: 66483
Glenn Carver
Joined: 29 Oct 17
Posts: 798
Credit: 13,526,425
RAC: 4,698
Message 66506 - Posted: 17 Nov 2022, 17:16:50 UTC

A small batch of OpenIFS tasks is about to go onto the test site next week. If all is OK, expect 60,000 workunits (Linux only) to appear soon after. I'm sure moderators will announce nearer the time. I understand this project will take priority, so no Hadley model tasks for a while.
ID: 66506
SolarSyonyk
Joined: 7 Sep 16
Posts: 251
Credit: 31,538,452
RAC: 30,191
Message 66508 - Posted: 17 Nov 2022, 20:54:08 UTC - in response to Message 66506.  

60k Linux tasks?

*quivers in anticipation*

That'll keep me warm all winter and then some!
ID: 66508
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1053
Credit: 16,508,518
RAC: 286
Message 66509 - Posted: 17 Nov 2022, 22:03:54 UTC - in response to Message 66508.  

60k Linux tasks?
*quivers in anticipation*
That'll keep me warm all winter and then some!


I am currently running 5 hadsm4_um_8.02_i686-pc CPDN work units on my 64-bit Linux machine and have two more work units queued up to run. These current ones take about 30 hours each, and if I have them, I can run up to five at a time.

So I should be able to take quite a few of the new tasks if they are offered to my machine. I have about 390 GBytes of disk space for the dedicated partition for Boinc work.
ID: 66509

©2024 climateprediction.net