climateprediction.net home page
OpenIFS tasks : make sure boinc client option 'Leave non-GPU tasks in memory' is selected!

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,497,933
RAC: 6,477
Message 67204 - Posted: 2 Jan 2023, 9:24:20 UTC

People playing around with app_config tend to know what they're doing, so this shouldn't impact casual crunchers who just install Boinc and add some projects because they sound interesting.
I have bolded "tend" as I am sure encouraging people to play around with those files will lead to the odd one screwing things up. (Though to be fair, even when I started with CPDN and knew a lot less about Linux than I do now, I managed without doing that.)
ID: 67204
xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67252 - Posted: 3 Jan 2023, 14:42:06 UTC - in response to Message 67186.  

Richard Haselgrove wrote:
So the most productive single change for CPDN might be to flip the default setting for 'leave applications in memory' to ON, and run a script to change all current settings in the database similarly. It won't be a simple query, because these things are stored in XML blobs, but it could be done.
I may have misunderstood what has been suggested here, but: Project admins should never manipulate the users' settings (computing prefs, community prefs, or project prefs).
ID: 67252
Glenn Carver
Joined: 29 Oct 17
Posts: 802
Credit: 13,560,429
RAC: 6,808
Message 67265 - Posted: 3 Jan 2023, 22:30:31 UTC - in response to Message 67197.  

That's exactly the problem, by changing one project's settings it might break the settings that CPDN needs for tasks to be successful. It seems to assume what's right for one project is right for all, which is a reasonable starting point but not generally the case.
This is not about the projects.
This is about the use of one's machines and things like availability of RAM.
It's about both as I said in previous message.

Work might get done faster if you let things in memory, and certainly it helps if apps don't handle checkpoints well or don't have any.
But the projects should work on their apps to run stable even if not kept in memory, because they want to get their work done.
It's because of the time-stepping and complex nature of the CPDN weather models that they work that way. For every checkpoint, the model needs to dump its working arrays in 64-bit precision so it can do a bit-reproducible restart. That's a lot of I/O and a lot of data, but if you don't want to keep it in memory, it'll have to restart from a checkpoint more often than we currently allow for. That means much more I/O, filespace, wearing out SSDs etc. I could indeed allow the model to checkpoint more often to cope with being in and out of memory frequently, but you'd pay the price on your drives instead of RAM, and get much slower throughput because of the added I/O.

We (CPDN) have done a lot of work on OpenIFS to make it work reasonably in a computing environment it was not designed for, including finding a balance in terms of computing resources for the volunteer. OpenIFS is very stable; it will restart fine if it has to. But you don't want it to do this if you want decent throughput.

In any case, I'm sure if the OS wanted the RAM for a bigger, non-niced process, the task can still be kicked out of RAM, even with that 'keep in memory' flag on.
ID: 67265
Harri Liljeroos
Joined: 9 Dec 05
Posts: 111
Credit: 12,038,780
RAC: 1,393
Message 67281 - Posted: 4 Jan 2023, 9:24:32 UTC

And the user can always choose to use local preferences, which will override the web site preferences.
ID: 67281
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 24,510,982
RAC: 1,493
Message 67284 - Posted: 4 Jan 2023, 9:44:11 UTC - in response to Message 67265.  

Work might get done faster if you let things in memory, and certainly it helps if apps don't handle checkpoints well or don't have any.
But the projects should work on their apps to run stable even if not kept in memory, because they want to get their work done.
It's because of the time-stepping and complex nature of the CPDN weather models that they work that way. For every checkpoint, the model needs to dump its working arrays in 64-bit precision so it can do a bit-reproducible restart. That's a lot of I/O and a lot of data, but if you don't want to keep it in memory, it'll have to restart from a checkpoint more often than we currently allow for. That means much more I/O, filespace, wearing out SSDs etc. I could indeed allow the model to checkpoint more often to cope with being in and out of memory frequently, but you'd pay the price on your drives instead of RAM, and get much slower throughput because of the added I/O.

How often does the OpenIFS model need to checkpoint? Looking at my event log it seems every second? Is that normal?

Wed 04 Jan 2023 11:37:46 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:47 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:48 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:49 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:50 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:51 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:52 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:53 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:54 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
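
To put a number on "every second", the gaps between successive [checkpoint] lines can be measured per task. This is only a minimal Python sketch and not part of BOINC: the function names are made up for illustration, and it assumes the timestamp layout shown in the log above (clients in other locales format it differently).

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches event-log lines of the form:
#   <timestamp> | <project> | [checkpoint] result <name> checkpointed
LINE_RE = re.compile(
    r"^(?P<stamp>.+?) \| (?P<project>.+?) \| "
    r"\[checkpoint\] result (?P<result>\S+) checkpointed$"
)


def parse_stamp(stamp):
    """Parse 'Wed 04 Jan 2023 11:37:46 AM EET' (assumed layout).

    The day name and the timezone abbreviation are dropped because
    strptime does not handle abbreviations such as EET reliably.
    """
    parts = stamp.split()
    return datetime.strptime(" ".join(parts[1:6]), "%d %b %Y %I:%M:%S %p")


def checkpoint_intervals(lines):
    """Return {result_name: [seconds between consecutive messages]}."""
    times = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            times[match.group("result")].append(parse_stamp(match.group("stamp")))
    return {
        name: [(later - earlier).total_seconds()
               for earlier, later in zip(stamps, stamps[1:])]
        for name, stamps in times.items()
    }


# First three lines of the log excerpt above.
SAMPLE_LOG = [
    "Wed 04 Jan 2023 11:37:46 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed",
    "Wed 04 Jan 2023 11:37:47 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed",
    "Wed 04 Jan 2023 11:37:48 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed",
]
```

Calling checkpoint_intervals(SAMPLE_LOG) gives one list of gaps per task name, so a task that really checkpoints rarely would show large values rather than a run of one-second gaps.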
ID: 67284
alanb1951
Joined: 31 Aug 04
Posts: 32
Credit: 9,526,696
RAC: 109,831
Message 67292 - Posted: 4 Jan 2023, 11:10:19 UTC - in response to Message 67284.  

I asked about the checkpoint "log spam" in the OpenIFS discussion thread on 23rd December but, given when I posted it, that message is now two pages back :-)

For what it's worth, BOINC Manager on my machine doesn't seem to think the application checkpoints using the client checkpoint mechanism at all whilst it's running if you look at the task properties (there was never a last checkpoint time); that's consistent with my understanding of some of what Glenn has said about the matter, but it doesn't explain this -- is it something odd in the client libraries or something in the CPDN wrapper or main program?

It would be nice if it didn't do this, and it would be interesting to know why it does do it!

Cheers - Al.

P.S. Please tell me it's not using something in the BOINC checkpoint mechanism as a 1 second timer :-) ...
ID: 67292
Glenn Carver
Joined: 29 Oct 17
Posts: 802
Credit: 13,560,429
RAC: 6,808
Message 67294 - Posted: 4 Jan 2023, 11:24:45 UTC - in response to Message 67284.  


How often does the OpenIFS model need to checkpoint? Looking at my event log it seems every second? Is that normal?

Wed 04 Jan 2023 11:37:46 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
Wed 04 Jan 2023 11:37:54 AM EET | climateprediction.net | [checkpoint] result oifs_43r3_ps_0247_2007050100_123_976_12192891_0 checkpointed
I have no idea about those messages but I notice they are referring to the results files not the model progress. OpenIFS will ignore any request to checkpoint from the BOINC client. It knows best how to manage its checkpointing and generation of restart files. It's a relatively expensive I/O operation, not something we want to happen too often.
ID: 67294
Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,160,630
RAC: 5,164
Message 67297 - Posted: 4 Jan 2023, 11:57:05 UTC - in response to Message 67294.  

I notice they are referring to the results files not the model progress.
That's a misinterpretation, I'm afraid. In BOINC-speak, 'result' is a synonym for (and early form of) 'task' - programmer-speak, rather than user-speak. The concept of 'checkpointing' a file doesn't really make sense.
ID: 67297
Glenn Carver
Joined: 29 Oct 17
Posts: 802
Credit: 13,560,429
RAC: 6,808
Message 67298 - Posted: 4 Jan 2023, 12:18:08 UTC - in response to Message 67297.  

I notice they are referring to the results files not the model progress.
That's a misinterpretation, I'm afraid. In BOINC-speak, 'result' is a synonym for (and early form of) 'task' - programmer-speak, rather than user-speak. The concept of 'checkpointing' a file doesn't really make sense.
Thx for clearing that up.

But as Alan said in the original post, it also doesn't make sense that the task is being checkpointed roughly every minute. What does BOINC mean by 'checkpoint' in this context? Does it mean the client sent a 'you need to checkpoint' message to the task, regardless of whether the task did it or not?
ID: 67298
Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,160,630
RAC: 5,164
Message 67300 - Posted: 4 Jan 2023, 12:42:39 UTC - in response to Message 67298.  
Last modified: 4 Jan 2023, 12:50:42 UTC

I was trying to work that out. In BOINC's case, 'checkpointed' should mean "BOINC has successfully written the files needed for a restart at ... [time]".

The message is written by https://github.com/BOINC/boinc/blob/master/client/app_control.cpp#L1551, and seems to be controlled by

old_time = atp->checkpoint_cpu_time;  // the saved time of the last checkpoint
if (old_time != atp->checkpoint_cpu_time) {  // if they are different ...
so you shouldn't see two messages with the same time.

Edit - OK, so 1 second apart is indeed 'different'. But I can never unravel David Anderson's spaghetti code much beyond that.
ID: 67300
Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,160,630
RAC: 5,164
Message 67301 - Posted: 4 Jan 2023, 12:59:23 UTC

OK, mine's doing it too:
04/01/2023 12:54:19 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0608_2007050100_123_976_12193252_0 checkpointed
04/01/2023 12:54:19 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0447_1987050100_123_956_12173091_1 checkpointed
04/01/2023 12:54:19 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0806_1998050100_123_967_12184450_1 checkpointed
04/01/2023 12:54:19 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0656_2008050100_123_977_12194300_0 checkpointed
04/01/2023 12:54:19 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0860_1987050100_123_956_12173504_1 checkpointed
04/01/2023 12:54:20 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0608_2007050100_123_976_12193252_0 checkpointed
04/01/2023 12:54:20 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0447_1987050100_123_956_12173091_1 checkpointed
04/01/2023 12:54:20 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0806_1998050100_123_967_12184450_1 checkpointed
04/01/2023 12:54:20 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0656_2008050100_123_977_12194300_0 checkpointed
04/01/2023 12:54:20 | climateprediction.net | [checkpoint] result oifs_43r3_ps_0860_1987050100_123_956_12173504_1 checkpointed
Possibly, the BOINC client is sending the science app a message "you can checkpoint now", and the app is replying "It's OK, I've done one now".

I think the wisest thing is to turn off that log flag (it'll spam the system journal in no time), and stop worrying about it.
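
For reference, the [checkpoint] messages are governed by the checkpoint_debug entry under <log_flags> in the client's cc_config.xml. A minimal Python sketch that switches it off in a config held as text might look like this; the helper name is hypothetical, and editing the file by hand achieves exactly the same thing.

```python
import xml.etree.ElementTree as ET


def disable_checkpoint_debug(cc_config_text):
    """Return cc_config.xml text with <checkpoint_debug> set to 0.

    Missing <log_flags> or <checkpoint_debug> elements are created,
    and every other setting is left untouched.
    """
    root = ET.fromstring(cc_config_text)
    log_flags = root.find("log_flags")
    if log_flags is None:
        log_flags = ET.SubElement(root, "log_flags")
    flag = log_flags.find("checkpoint_debug")
    if flag is None:
        flag = ET.SubElement(log_flags, "checkpoint_debug")
    flag.text = "0"
    return ET.tostring(root, encoding="unicode")
```

The client has to re-read the file afterwards, for example with boinccmd --read_cc_config.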
ID: 67301
alanb1951
Joined: 31 Aug 04
Posts: 32
Credit: 9,526,696
RAC: 109,831
Message 67305 - Posted: 4 Jan 2023, 13:30:46 UTC - in response to Message 67300.  

Richard,

Thanks for having a look to see what's going on...

The message is written by https://github.com/BOINC/boinc/blob/master/client/app_control.cpp#L1551, and seems to be controlled by

old_time = atp->checkpoint_cpu_time;  // the saved time of the last checkpoint
if (old_time != atp->checkpoint_cpu_time) {  // if they are different ...
so you shouldn't see two messages with the same time.

Edit - OK, so 1 second apart is indeed 'different'. But I can never unravel David Anderson's spaghetti code much beyond that.

But it isn't trying to checkpoint so unless something is writing a non-zero value to the task's checkpoint_cpu_time it should just do nothing (always zero) - or have I misread/misunderstood that section of code (quite likely; my opinion of David's code is much the same as yours!)?

And from a later message...

I think the wisest thing is to turn off that log flag (it'll spam the system journal in no time), and stop worrying about it.

That's what I did, but as soon as WCG's GPU application returns I either forgo some performance analysis I'm doing that needs to know where it checkpoints (to make some sort of sense of a GPU activity trace) or I forgo CPDN (as I currently have no machines with the capacity to consider setting up a second BOINC client with different log behaviour...)

Now, my tiny contribution wouldn't be missed (no sarcasm intended), but if someone could find out how to stop that spam...

Cheers - Al.
ID: 67305
PDW
Joined: 29 Nov 17
Posts: 55
Credit: 6,490,152
RAC: 542
Message 67306 - Posted: 4 Jan 2023, 14:00:49 UTC - in response to Message 67300.  
Last modified: 4 Jan 2023, 14:01:38 UTC

I was trying to work that out. In BOINC's case, 'checkpointed' should mean "BOINC has successfully written the files needed for a restart at ... [time]".

The message is written by https://github.com/BOINC/boinc/blob/master/client/app_control.cpp#L1551, and seems to be controlled by

old_time = atp->checkpoint_cpu_time;  // the saved time of the last checkpoint
if (old_time != atp->checkpoint_cpu_time) {  // if they are different ...
so you shouldn't see two messages with the same time.

Edit - OK, so 1 second apart is indeed 'different'. But I can never unravel David Anderson's spaghetti code much beyond that.

I would hazard a guess that BOINC respects the initial "Request tasks to checkpoint at most every X seconds" set in the client but after that time has elapsed and because the checkpoint time is still showing zero it will then repeat the request every second as that code gets called every second (all being well).

My next task isn't due to start until about 2am so have set the checkpoint time period to be 10 hours. If it isn't checkpointing when I wake up I'll know that bit has worked and reduce the figure to induce checkpointing.

I thought the task/project could also specify a time period for checkpointing?
I know I have seen it being set inside the LHC vboxwrapper code.
If the project isn't using them internally, can it not set a massively high period for its tasks to follow?
ID: 67306
Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,160,630
RAC: 5,164
Message 67308 - Posted: 4 Jan 2023, 15:10:46 UTC - in response to Message 67306.  

BOINC doesn't control checkpoints, except in the sense of setting a minimum interval between checkpoints - default 60 seconds. And that's a global setting - it doesn't make any difference whether it's a 3-minute GPU task for WCG, or a 14-hour CPDN task.

Looking at a CPDN task 'properties' in BOINC Manager, it always seems to say "CPU time since checkpoint ---". I think that means that CPDN - in this case, the CPDN wrapper app - is constantly writing "I've just checkpointed now" into the inter-process communications file: that might well be triggering the event log message. [I'll go downstairs and do some excavations in the filing system in a moment]

If that turns out to be true, CPDN have chosen to do it wrongly. I'd suggest the possible options are:

1) Lie, and say it's never checkpointed - that would inhibit task switching, but would upset users who might like to know when would be a good time to shut down for the night.
2) Tell the truth, so the user knows what's going on, even if it doesn't explain why it isn't behaving the way he or she asked it to.
3) Try to fool the system, by making it report that "I have just checkpointed 10+ seconds into the future". Thus trying to invoke:

// Normally this is called every second.
// If delta_t is > 10, we'll assume that a period of hibernation
// or suspension happened, and treat it as zero.
// If negative, must be clock reset. Ignore.
//
if (delta_t > 10 || delta_t < 0) {
    delta_t = 0;
}
(that possibility needs thorough checking)
ID: 67308
Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,160,630
RAC: 5,164
Message 67309 - Posted: 4 Jan 2023, 15:21:23 UTC - in response to Message 67308.  
Last modified: 4 Jan 2023, 15:46:40 UTC

OK, this is the excavation for CPDN:

<active_task>
    <project_master_url>https://climateprediction.net/</project_master_url>
    <result_name>oifs_43r3_ps_0525_2009050100_123_978_12195169_0</result_name>
    <checkpoint_cpu_time>25677.370000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>25712.590836</checkpoint_elapsed_time>
    <fraction_done>0.509490</fraction_done>
    <peak_working_set_size>4621619200</peak_working_set_size>
    <peak_swap_size>5215875072</peak_swap_size>
    <peak_disk_usage>2189510262</peak_disk_usage>
</active_task>
I'll have to switch to another machine for a comparison. Back in a mo.
Well, I didn't expect that.

<active_task>
    <project_master_url>http://numberfields.asu.edu/NumberFields/</project_master_url>
    <result_name>wu_sf7_DS-16x10_Grp400291of5000000_0</result_name>
    <checkpoint_cpu_time>2595.560000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>2623.082207</checkpoint_elapsed_time>
    <fraction_done>0.583066</fraction_done>
    <peak_working_set_size>10067968</peak_working_set_size>
    <peak_swap_size>306024448</peak_swap_size>
    <peak_disk_usage>12271</peak_disk_usage>
</active_task>
Near enough the same. Yet NumberFields says:


More excavation needed. Try these:

<active_task>
    <project_master_url>http://numberfields.asu.edu/NumberFields/</project_master_url>
    <result_name>wu_sf7_DS-16x10_Grp400291of5000000_0</result_name>
    <active_task_state>1</active_task_state>
    <app_version_num>400</app_version_num>
    <slot>0</slot>
    <checkpoint_cpu_time>2399.857000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>2426.131861</checkpoint_elapsed_time>
    <checkpoint_fraction_done>0.571426</checkpoint_fraction_done>
    <checkpoint_fraction_done_elapsed_time>2426.131861</checkpoint_fraction_done_elapsed_time>
    <current_cpu_time>2416.206000</current_cpu_time>
    <once_ran_edf>0</once_ran_edf>
    <swap_size>306024448.000000</swap_size>
    <working_set_size>10067968.000000</working_set_size>
    <working_set_size_smoothed>10067968.000000</working_set_size_smoothed>
    <page_fault_rate>0.000000</page_fault_rate>
    <bytes_sent>0.000000</bytes_sent>
    <bytes_received>0.000000</bytes_received>
</active_task>

<active_task>
    <project_master_url>https://climateprediction.net/</project_master_url>
    <result_name>oifs_43r3_ps_0525_2009050100_123_978_12195169_0</result_name>
    <active_task_state>1</active_task_state>
    <app_version_num>105</app_version_num>
    <slot>2</slot>
    <checkpoint_cpu_time>27035.320000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>27074.658515</checkpoint_elapsed_time>
    <checkpoint_fraction_done>0.535951</checkpoint_fraction_done>
    <checkpoint_fraction_done_elapsed_time>27074.658515</checkpoint_fraction_done_elapsed_time>
    <current_cpu_time>27035.320000</current_cpu_time>
    <once_ran_edf>0</once_ran_edf>
    <swap_size>4426641408.000000</swap_size>
    <working_set_size>3926339584.000000</working_set_size>
    <working_set_size_smoothed>3556255585.502848</working_set_size_smoothed>
    <page_fault_rate>0.000000</page_fault_rate>
    <bytes_sent>0.000000</bytes_sent>
    <bytes_received>0.000000</bytes_received>
</active_task>
Now we're getting somewhere. I see
<checkpoint_cpu_time>27035.320000</checkpoint_cpu_time>
<current_cpu_time>27035.320000</current_cpu_time>
Identical to 6 decimal places. I bet that's what's doing it. Those last two code comparisons come from the <active_task_set> in BOINC's client_state.xml
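
The same comparison can be run over every <active_task> entry at once. The following is a rough Python sketch rather than a BOINC tool: the function name is made up, and the sample text holds only the relevant fields from the two entries above, wrapped in <active_task_set>. Equal values on a single read are not proof of anything, since a task that has genuinely just checkpointed would look the same; the suspicious case is a task where they match on every read.

```python
import xml.etree.ElementTree as ET


def tasks_with_suspect_checkpoint(xml_text):
    """List result names whose checkpoint_cpu_time equals current_cpu_time."""
    root = ET.fromstring(xml_text)
    suspects = []
    for task in root.iter("active_task"):
        checkpoint = task.findtext("checkpoint_cpu_time")
        current = task.findtext("current_cpu_time")
        if checkpoint is None or current is None:
            continue  # entry lacks one of the two fields, nothing to compare
        if float(checkpoint) == float(current):
            suspects.append(task.findtext("result_name"))
    return suspects


# Trimmed copy of the two client_state.xml entries quoted above.
SAMPLE_STATE = """
<active_task_set>
<active_task>
    <result_name>wu_sf7_DS-16x10_Grp400291of5000000_0</result_name>
    <checkpoint_cpu_time>2399.857000</checkpoint_cpu_time>
    <current_cpu_time>2416.206000</current_cpu_time>
</active_task>
<active_task>
    <result_name>oifs_43r3_ps_0525_2009050100_123_978_12195169_0</result_name>
    <checkpoint_cpu_time>27035.320000</checkpoint_cpu_time>
    <current_cpu_time>27035.320000</current_cpu_time>
</active_task>
</active_task_set>
"""
```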
ID: 67309
PDW
Joined: 29 Nov 17
Posts: 55
Credit: 6,490,152
RAC: 542
Message 67311 - Posted: 4 Jan 2023, 15:33:05 UTC - in response to Message 67308.  
Last modified: 4 Jan 2023, 15:51:42 UTC

BOINC doesn't control checkpoints, except in the sense of setting a minimum interval between checkpoints - default 60 seconds. And that's a global setting - it doesn't make any difference whether it's a 3-minute GPU task for WCG, or a 14-hour CPDN task.

But it does observe what it is supposed to do with them.

boinc_time_to_checkpoint returns true only when sufficient time has passed since the last checkpoint. This minimum interval is the maximum of:

A user preference (e.g. laptop users might want to checkpoint infrequently).
An optional application-supplied minimum, specified by calling

boinc_set_min_checkpoint_period(int nsecs);

So if the wrapper/application calls boinc_set_min_checkpoint_period() with a number greater than the longest time it would expect to take, then the BOINC code shouldn't try to request a checkpoint [Edit: because it is set to Deny].
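
The rule quoted above can be written out as a small model. This is a Python sketch of the described behaviour only, not the BOINC API (which is C/C++); the function and parameter names here are invented for illustration.

```python
def effective_min_interval(user_pref_secs, app_min_secs=0):
    """Minimum gap between checkpoints: the larger of the two settings."""
    return max(user_pref_secs, app_min_secs)


def time_to_checkpoint(now, last_checkpoint, user_pref_secs, app_min_secs=0):
    """Model of the 'allow' answer: True once the minimum gap has elapsed.

    All arguments are in seconds; app_min_secs stands in for the value an
    application would pass to boinc_set_min_checkpoint_period().
    """
    gap = effective_min_interval(user_pref_secs, app_min_secs)
    return now - last_checkpoint >= gap
```

With a user preference of 60 seconds and an application minimum of one day, time_to_checkpoint(3600, 0, 60, 86400) is False, whereas with the preference alone time_to_checkpoint(61, 0, 60) is True.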

PS. Apart from the ignorant, who leaves the default as 60 seconds !
ID: 67311
Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,160,630
RAC: 5,164
Message 67312 - Posted: 4 Jan 2023, 15:39:06 UTC - in response to Message 67311.  

... the BOINC code shouldn't try to request a checkpoint.
BOINC can't request a checkpoint. The only options are 'allow' or 'deny'.
ID: 67312
Glenn Carver
Joined: 29 Oct 17
Posts: 802
Credit: 13,560,429
RAC: 6,808
Message 67326 - Posted: 4 Jan 2023, 18:42:57 UTC
Last modified: 4 Jan 2023, 18:48:59 UTC

I confess I am a little lost now in the discussion (which is also wildly off the topic of the thread but no matter..).

That code blob points to the function 'get_msgs' and I think that's all the client is doing. It's the terminology that's confusing us here, I think. To me a 'checkpoint' is when the model writes its internal state to disk so it can restart after a client stop. But! To a boinc client, 'checkpoint' means what the fn 'get_msgs' does. The code at the end of the loop does:
        atp->get_trickle_up_msg();
        atp->get_graphics_msg();
Ignoring all the previous guff about checking task state/cpu-time etc, that's essentially all a checkpoint is to the client, get any trickle_up message (same as 'trickles' from CPDN?), and any messages from the graphics. Nothing to do with OpenIFS's checkpointing mechanism at all.

Richard, you lost me at this bit:
Richard wrote:
Now we're getting somewhere. I see
<checkpoint_cpu_time>27035.320000</checkpoint_cpu_time>
<current_cpu_time>27035.320000</current_cpu_time>

Identical to 6 decimal places. I bet that's what's doing it. Those last two code comparisons come from the <active_task_set> in BOINC's client_state.xml

'I bet that's what's doing it' - what is 'it'? The client/wrapper doing what? There's a comment in that get_msgs function that it's usually called every sec, so there's the time difference you see in the logs.

I have checked the OpenIFS wrapper code that talks to the client. There are no explicit calls to any boinc functions with 'checkpoint' in their name. As far as I can see, the wrapper can't be sending anything to the client about having completed a checkpoint. And checkpoint probably means different things anyway to model & boinc client.

It would be straightforward to send a checkpoint message though. We know how often the model will checkpoint and the wrapper monitors the model's step count, so we can use that (strictly speaking we should check for the presence of the files too but for now...).

Am I anywhere close to understanding this? (then again, I leave the boinc interface stuff to Andy who I should probably go talk to..)
ID: 67326
Richard Haselgrove
Joined: 1 Jan 07
Posts: 942
Credit: 34,160,630
RAC: 5,164
Message 67330 - Posted: 4 Jan 2023, 19:55:31 UTC - in response to Message 67326.  

Richard, you lost me at this bit:
Richard wrote:
Now we're getting somewhere. I see
<checkpoint_cpu_time>27035.320000</checkpoint_cpu_time>
<current_cpu_time>27035.320000</current_cpu_time>

Identical to 6 decimal places. I bet that's what's doing it. Those last two code comparisons come from the <active_task_set> in BOINC's client_state.xml

'I bet that's what's doing it' - what is 'it'? The client/wrapper doing what? There's a comment in that get_msgs function that it's usually called every sec, so there's the time difference you see in the logs.

I have checked the OpenIFS wrapper code that talks to the client. There are no explicit calls to any boinc functions with 'checkpoint' in their name. As far as I can see, the wrapper can't be sending anything to the client about having completed a checkpoint. And checkpoint probably means different things anyway to model & boinc client.

It would be straightforward to send a checkpoint message though. We know how often the model will checkpoint and the wrapper monitors the model's step count, so we can use that (strictly speaking we should check for the presence of the files too but for now...).

Am I anywhere close to understanding this? (then again, I leave the boinc interface stuff to Andy who I should probably go talk to..)
The original problem was that the Event Log was reporting that the model was checkpointing every second. It should not be saying that. It should be saying that the model has checkpointed if and only if a real life checkpoint - all those restart files - has been completed in the last second.

My interpretation is now that the model (or, probably more accurately, the wrapper) reports the current state of play to the BOINC client every second - timings, progress made, changes in status, anything like that. The client will analyse that, store what needs storing, and process all those changes in state. The client then has a snapshot of the overall status, and can respond when the Manager asks - again every second - for a summary fit for display to the user.

The bits of XML I posted are a small fraction of all that. For the current bug-hunt, I think the critical data points are "checkpoint_cpu_time" and "current_cpu_time" - both are the number of seconds of CPU work done since the task started. If they are identical, I'm suggesting that the client will notice that fact, and interpret it as "a checkpoint has happened in the last second", and report that fact in the Event Log, and pass it to the Manager for display to the user.

The model/wrapper should only report a change in checkpoint_cpu_time once per restart file dump.
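
A toy model of that interpretation, in Python, shows why the two behaviours produce such different logs. It is a sketch of the hypothesis in this post, not the BOINC client code: the function name and the two example sequences are invented, and the only thing taken from the client is the "log when the stored checkpoint time changes" comparison quoted earlier in the thread.

```python
def count_checkpoint_messages(reports):
    """Count log messages for a series of once-per-second status reports.

    Each item in `reports` is the checkpoint_cpu_time the task reports on
    one poll. A message is counted whenever it differs from the value the
    client stored on the previous poll.
    """
    saved = 0.0  # checkpoint_cpu_time held by the client
    messages = 0
    for reported in reports:
        old_time = saved
        saved = reported
        if old_time != saved:
            messages += 1
    return messages


# Suspected behaviour: the reported checkpoint time tracks the current
# CPU time, so it changes on every one of ten polls.
SUSPECT = [float(t) for t in range(1, 11)]

# Intended behaviour: the checkpoint time only moves when a restart dump
# completes (every 5 seconds in this toy example).
INTENDED = [float(5 * (t // 5)) for t in range(1, 11)]
```

Over the same ten polls, count_checkpoint_messages(SUSPECT) reports a message for every poll, while count_checkpoint_messages(INTENDED) reports one per restart dump.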
ID: 67330
PDW
Joined: 29 Nov 17
Posts: 55
Credit: 6,490,152
RAC: 542
Message 67352 - Posted: 5 Jan 2023, 8:49:54 UTC - in response to Message 67306.  

If it isn't checkpointing when I wake up I'll know that bit has worked and reduce the figure to induce checkpointing.
It was checkpointing.
So yes, BOINC needs fixing.
ID: 67352

©2024 climateprediction.net