climateprediction.net (CPDN) home page
Thread 'finish file present too long'

Thread 'finish file present too long'

Message boards : Number crunching : finish file present too long
Message board moderation

To post messages, you must log in.

1 · 2 · Next

AuthorMessage
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71046 - Posted: 25 Jun 2024, 17:58:34 UTC
Last modified: 25 Jun 2024, 17:59:09 UTC

Two tasks claim to have completed successfully for the 1019 NZ25 batch but actually failed with this error:
<message>
finish file present too long</message>
... as shown by this task:
https://www.cpdn.org/cpdnboinc/result.php?resultid=22454885

Does someone who knows boinc better than me understand what triggers this? The task has clearly failed, I suspect because the model has fallen over. But then something appears to go wrong on the boinc side and triggers this error message which I've not seen before.

Both tasks fail on the same machine, which is probably a clue. Anyone familiar with this message and what exactly is causing it?

Why they got logged as successful is also an issue.
---
CPDN Visiting Scientist
ID: 71046 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 1061
Credit: 36,748,059
RAC: 5,647
Message 71047 - Posted: 25 Jun 2024, 18:14:31 UTC - in response to Message 71046.  

The best I can offer at the moment is https://github.com/BOINC/boinc/issues/3017 - which doesn't help much. Just says 'you are not alone'.
ID: 71047 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71049 - Posted: 25 Jun 2024, 22:00:51 UTC - in response to Message 71047.  

Ok, that is useful. Thanks Richard. At least I now understand what the error message means. The user in question is also running quite an old version of the client, ~7.11, since that error message has been edited in the latest version. But that means the client didn't have the bugfix you pointed to where the wait time was extended from 10secs to 5 mins.

So maybe the user Aborted the task, the controller process kills the model, finish file is created, but because the machine was swapping(?) the processes were slow to die and the 10 sec wait on the finish file triggered the client error. I still don't understand why this caused the task to be reported a success since it triggered the debugger but never mind. Since the code has since changed and it's not anything I can do on the CPDN side, I will leave it here.

Thanks.
---
CPDN Visiting Scientist
ID: 71049 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71094 - Posted: 23 Jul 2024, 13:25:31 UTC

This error has popped up again on the latest EAS25 batch. I'm posting this in case it's useful later for anyone searching for more information on this bug. The task in question is: https://www.cpdn.org/cpdnboinc/result.php?resultid=22462930. If the stderr has long gone, the key part of the task stderr log is:

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
finish file present too long</message>
<stderr_txt>
spended CPDN Monitor - Suspend request from BOINC......(lots of suspend requests) ...
Suspended CPDN Monitor - Suspend request from BOINC...
Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 19636, selfPID = 24304, iMonCtr = 1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
monitor:finished called ... tidying up.
monitor:finished: Uploading out files...
Queuing intermediate upload for CPDN/BOINC: cpdnout_out.zip
Detaching shared memory... Done.
monitor:finished: Closed output file : stdout_<>.txt
modelResultFiles : Removing : wah2_eas25_a3cd_200612_24_1020_012305460 in C:\ProgramData\BOINC/projects/climateprediction.net
monitor:finished: handing over to boinc_finish(RetVal=0)
21:29:20 (24304): called boinc_finish(0)
CPDN Monitor - Abort request from BOINC...
monitor:finished called ... tidying up.
monitor:finished: Uploading out files...

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0048C569 write attempt to address 0x0271712C

Engaging BOINC Windows Runtime Debugger...
The host is a 2950X with 32Gb RAM. It's running quite an old client, as noted before in previous posts.

What's odd about this output is the repeated extra lines from the CPDN monitor process AFTER seeming to handover to the boinc client. I think what happens here is the monitor process calls boinc_finish (time stamp 21:29). But something goes wrong in the boinc client with the finish_file and then the client appears to issue an Abort request to the monitor process. This then goes into the 'tidying up' phase again, tries to Upload output files again, which doesn't work as it's already done this. Am uncertain how the mechanism works for the client to re-enter the monitor process this way but it looks like this is the cause of the error.

The other issue here is why this gets flagged as a successful completion when the debugger has been initiated.
---
CPDN Visiting Scientist
ID: 71094 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 1061
Credit: 36,748,059
RAC: 5,647
Message 71095 - Posted: 23 Jul 2024, 14:45:38 UTC - in response to Message 71094.  

You might also consider that in conjunction with the number of "Suspend request from BOINC". BOINC doesn't really like those, and has code to handle 'too many' requests:

https://github.com/BOINC/boinc/blob/master/client/app_control.cpp#L364

That has a hard limit of 100, where your cited task has 882 in the portion we can see - more have scrolled off the top of the file. This task probably got away with it because of the comment in the code:

// handle a task that exited prematurely (i.e. no finish file)
//
    // keep count of premature exits;
    // if this happens 100 times w/o a checkpoint, abort job
But if a premature exit is defined as one without a finish file, I sniff the possibility of a link between the two.
ID: 71095 · Report as offensive     Reply Quote
Jean-David Beyer

Send message
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 71096 - Posted: 23 Jul 2024, 15:01:04 UTC - in response to Message 71046.  

I do not have any answers, but here is some data.
I have one 1019 task running on my Linux machine. It has uploaded three trickles so far.

Task 22463632
Name 	wah2_nz25_2296_209805_25_1019_012300899_1
Workunit 	12300899
Created 	22 Jul 2024, 18:59:57 UTC
Sent 	22 Jul 2024, 19:00:03 UTC
Report deadline 	30 Oct 2024, 19:00:03 UTC
Computer ID 	1511241


Since it is _1, I looked at what the previous guy got.,

Task 22456746
Name 	wah2_nz25_2296_209805_25_1019_012300899_0
Workunit 	12300899
Created 	24 Jun 2024, 13:30:32 UTC
Sent 	24 Jun 2024, 18:35:02 UTC
Report deadline 	2 Oct 2024, 18:35:02 UTC
Received 	22 Jul 2024, 18:59:56 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	0 (0x00000000)
Computer ID 	1542760
Run time 	2 days 1 hours 39 min 1 sec
CPU time 	23 hours 11 min 5 sec
Validate state 	Invalid
Credit 	4,163.15
Device peak FLOPS 	5.60 GFLOPS
Application version 	Weather At Home 2 (wah2) v8.24
windows_intelx86
Peak working set size 	264.26 MB
Peak swap size 	225.38 MB
Peak disk usage 	0.01 MB
Stderr 	

<core_client_version>8.0.2</core_client_version>
<![CDATA[
<stderr_txt>
forrtl: Handlen er ikke gyldig.

Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=48448, selfPID=48448, iMonCtr=2
Controller:: CPDN process is not running, exiting, bRetVal = 1, checkPID=48448, selfPID=48224, iMonCtr=1
Model crash detected, will try to restart...
Leaving CPDN_ain::Monitor...
20:51:49 (48224): called boinc_finish(0)

</stderr_txt>
<message>
upload failure: <file_xfer_error>
  <file_name>wah2_nz25_2296_209805_25_1019_012300899_0_r1222875874_6.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>
</file_xfer_error>
<file_xfer_error>
  <file_name>wah2_nz25_2296_209805_25_1019_012300899_0_r1222875874_7.zip</file_name>
  <error_code>-240 (stat() failed)</error_code>


It goes on like that.
ID: 71096 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71097 - Posted: 23 Jul 2024, 15:56:30 UTC - in response to Message 71095.  

I've looked at the code ref, thanks. I'm not sure that's triggered for multiple Suspend requests though? It specifically mentions calls to 'exit(0)' which wouldn't be the case if the task was suspended in memory. That would just be a SIGSTOP signal sent to the process.

This is what happened when the finish_file went to the wrong directory. The task called 'exit(0)', the client can't see the finish_file so the client then restarts it. That code is in the function you pointed to.

CPDN tasks don't use checkpointing as boinc understands it. The model handles its own checkpointing; the client has no control. So it's not that.
---
CPDN Visiting Scientist
ID: 71097 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 1061
Credit: 36,748,059
RAC: 5,647
Message 71099 - Posted: 23 Jul 2024, 16:51:11 UTC - in response to Message 71097.  
Last modified: 23 Jul 2024, 17:03:53 UTC

The more I look at that task, the weirder it seems. We have:

Sent		22 Jul 2024, 16:14:01 UTC
Received	23 Jul 2024, 10:47:26 UTC
so it was on the user's machine for about 18.5 hours. In that time, it ran for:

Run time	1 hours 45 min 38 sec
CPU time	1 hours 44 min 28 sec
Not a high proportion, but we don't know when the first attempt started.

We have at least 882 restarts, in somewhere between 1.75 and 18.5 hours. That makes limits of between 8.4 and 0.8 attempts per minute. The only way I can think of pausing an app at that sort of frequency is to set BOTH a thermal limit of less than 100% CPU usage AND remove tasks from memory when suspended. That shouldn't be possible, should it?

Don't worry too much about the checkpoint interval. All BOINC projects are like that - apps checkpoint when they're able to, and BOINC can't order "checkpoint NOW". The most BOINC can do is to say "don't do it too often", and apps should respect that. The other thing to check is whether CPDN apps report back to the BOINC client when they're checkpointed, which should reset that counter - I'll try and find that in the code. [* see edit]

All my main machines have picked up tasks from this run, so I don't want to mess with them right now. But I do have one last testbed, currently idle, so I might see if I can make that thrash a generic task (not CPDN!) in and out of memory at that speed.

[* edit] Looks like messages are handled at line 1535 of the same file, with checkpoints at line 1572 onwards. We can check if they're reported here with a debug log message.
ID: 71099 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71100 - Posted: 23 Jul 2024, 19:31:29 UTC - in response to Message 71099.  

We have at least 882 restarts, in somewhere between 1.75 and 18.5 hours. That makes limits of between 8.4 and 0.8 attempts per minute. The only way I can think of pausing an app at that sort of frequency is to set BOTH a thermal limit of less than 100% CPU usage AND remove tasks from memory when suspended. That shouldn't be possible, should it?
The task is staying in memory, otherwise we'd see the normal model startup messages about checking namelists etc after the suspend message. Might be suspension when non-BOINC usage is above xxx%. If it's a server the person might be using spare cycles for boinc work? The machine looks busy because of the delay on removing the finish_file.

The bit I don't understand is why this task was flagged as success.
ID: 71100 · Report as offensive     Reply Quote
ProfileDave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 71101 - Posted: 24 Jul 2024, 9:05:53 UTC

The bit I don't understand is why this task was flagged as success.
I see unsurprisingly no credit has been granted for the task as no trickle ups have occurred. I do see the computer in question which has a higher floating point speed than my Ryzen9 (It is a 12 core threadripper) has quite a high failure rate, all the ones (13/36 including in progress tasks from current batch) I looked at had nothing in the stderr on the task pages. They had all produced at least one trickle.
ID: 71101 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71105 - Posted: 24 Jul 2024, 16:29:36 UTC - in response to Message 71100.  
Last modified: 24 Jul 2024, 16:29:49 UTC

Richard, here's an example of the output from a task that's repeatedly being Suspended by the client but not being held in memory whilst suspended. You get a group of 'startup' messages from the model each time.
https://www.cpdn.org/cpdnboinc/result.php?resultid=22461413

There must be some semi-intelligent way of detecting this in the monitoring code, though I am unsure what the best approach would be to deal with it. One of my bug-bears with boinc is that it's not possible to send information from the task back up to the client to present to the user. Or send a task that checks what settings the client has where the scheduler can check if 'keep in memory whilst suspended' is on.
ID: 71105 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 1061
Credit: 36,748,059
RAC: 5,647
Message 71107 - Posted: 24 Jul 2024, 18:27:16 UTC - in response to Message 71105.  

Yes, I see that - and it confirms your comment from yesterday that removing tasks from memory would leave a much bigger footprint in stderr. Scrub that conjecture.

I can confirm that the BOINC client is aware of whether, and when, CPDN tasks checkpoint. It shows up in the task 'properties' in BOINC Manager, and can be displayed in the Event Log. So the line about only allowing exit 100 requests per checkpoint applies.

But there are two things we don't know about the "Suspended CPDN Monitor - Suspend request from BOINC..." line at this stage: what the time interval between requests is, and exactly what is being requested by BOINC, and why.

That line isn't a regular feature of BOINC - it's coming from the bespoke CPDN wrapper. It might be helpful to add a time stamp to each occurrence, at least temporarily. If the wrapper receives any sort of reason code, that could be added too.

I've woken up my remaining Celeron test machine, and it's running a single instance of a batch 1022 task - nothing else. I'm regarding that as a sacrificial test case, which we can poke and prod with conjectures as and when we think of them.
ID: 71107 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71111 - Posted: 25 Jul 2024, 8:40:04 UTC - in response to Message 71107.  

I had the same thought. I have added a timestamp to the messages coming from the W@H monitor code in the next version. Adding a reason code is a good idea, I'll do that. Hadn't realised it was possible. Thanks.

Although the monitor code 'logs' a checkpoint with the boinc client, that's not necessarily when the model itself does it. That is under the control of the model itself.
---
CPDN Visiting Scientist
ID: 71111 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 1061
Credit: 36,748,059
RAC: 5,647
Message 71112 - Posted: 25 Jul 2024, 9:22:48 UTC - in response to Message 71111.  

I'm not sure if BOINC provides a reason code in this case - I haven't dug that deeply into the code. But if it does, it would be nice to see it. [But there must be something to flag the difference between 'pause and hold', 'quit and terminate', 'quit and abort', and 'terminate with extreme prejudice']

It doesn't matter a jot if CPDN ignores the interval request set by BOINC - it has to be done at a moment appropriate for the science. But it does matter that the fact that it happens is reported back to BOINC (mainly via the timestamp for 'last checkpoint' changing). And it does, so we're good.
ID: 71112 · Report as offensive     Reply Quote
Ingleside

Send message
Joined: 5 Aug 04
Posts: 127
Credit: 24,578,627
RAC: 8,256
Message 71115 - Posted: 25 Jul 2024, 12:36:47 UTC - in response to Message 71112.  

Fooling-around a little with various preferences, the BOINC log at least does show the difference between suspending due to "computer in use", "CPU is busy", "no recent user activity" and "Exclusive app XXX is running".
Where's also the line with "Number of usable CPUs has changed from X to Y." that can happen if uses different CPU % between active and inactive.

If any of this information would also be available to applications I've no idea.
ID: 71115 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71117 - Posted: 25 Jul 2024, 13:01:20 UTC - in response to Message 71115.  

Ok thanks. When I've finished with these new batches, I will look in the boinc code and see what can be added to the output.
ID: 71117 · Report as offensive     Reply Quote
ProfileDave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 71118 - Posted: 25 Jul 2024, 17:02:45 UTC - in response to Message 71117.  

ID: 71118 · Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 1 Jan 07
Posts: 1061
Credit: 36,748,059
RAC: 5,647
Message 71119 - Posted: 25 Jul 2024, 17:59:19 UTC

I think we have our QED. My little low-power Celeron box has just produced its first trickle:

24/07/2024 16:07:13 | climateprediction.net | [cpu_sched] Starting task wah2_eas25_n1c6_201312_24_1022_012313909_0 using wah2_ri version 832 in slot 0
25/07/2024 17:35:54 | climateprediction.net | [checkpoint] result wah2_eas25_n1c6_201312_24_1022_012313909_0 checkpointed
25/07/2024 18:14:30 | climateprediction.net | Sending scheduler request: To send trickle-up message.
25/07/2024 18:14:47 | climateprediction.net | Started upload of wah2_eas25_n1c6_201312_24_1022_012313909_0_r238175385_1.zip
25/07/2024 18:16:36 | climateprediction.net | Finished upload of wah2_eas25_n1c6_201312_24_1022_012313909_0_r238175385_1.zip (99139925 bytes)
25/07/2024 18:27:20 | climateprediction.net | [checkpoint] result wah2_eas25_n1c6_201312_24_1022_012313909_0 checkpointed
That gives an idea of speed: over a day before the first trickle, and over 50 minutes between checkpoints.

Then I set 50% CPU usage (BOINC's thermal throttling control), and stderr.txt now starts like this:
modelGetExecutables: check control files, strTemp0 & 1 : 
D:\BOINCdata/projects/climateprediction.net/wah2_eas25_n1c6_201312_24_1022_012313909/jobs/xadae.namelists
D:\BOINCdata/projects/climateprediction.net/wah2_eas25_n1c6_201312_24_1022_012313909/jobs/xacxf.namelists
modelGetExecutables: unzipping control files : strInput & strTmp 
wah2_eas25_n1c6_201312_24_1022_012313909.zip
wah2_eas25_n1c6_201312_24_1022_012313909/jobs
gstrDump[0] = generic_phase1_spinup_eas25_global_aabaka_f
gstrDump[1] = generic_phase1_spinup_eas25_regional_aabaka_f
global model: command string: "D:\BOINCdata/projects/climateprediction.net/wah2am3m2_um_8.32_windows_intelx86.exe" wah2_eas25_n1c6_201312_24_1022_012313909 generic_phase1_spinup_eas25_global_aabaka_f ic19610227_14_N96 NATclim_ancil_168months_CMIP6-ACCESS-CM2_SST_2009-01-01_2022-12-30_v2404b NATclim_ancil_168months_CMIP6-ACCESS-CM2_SIC_2009-01-01_2022-12-30_v2404b so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
regional model: command string: "D:\BOINCdata/projects/climateprediction.net/wah2rm3m2t_um_8.32_windows_intelx86.exe" wah2_eas25_n1c6_201312_24_1022_012313909
 cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. 
 cpdn_check_running: got RM PID of zero; ignoring this call and waiting for PID via shMem. 
executeModelProcess: MonID=4072, GCM_PID=8144, RCM_PID=6952
Queuing intermediate upload for CPDN/BOINC: cpdnout1.zip
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
...
and so on. I counted 376 requests in 12 minutes before I got it under control again. It suggests that is one cause of the exceeding long lists in stderr, which simply get in the way and waste space - to the exclusion of more useful reports. If we can find a reason code to match, I'd suggest suppressing this particular form of report. And to advise CPDN users not to use throttling if at all possible - reduce the number of cores in use instead.

That doesn't solve the 'finish file' problem, of course, but it may help us see more clearly what else is going on.
ID: 71119 · Report as offensive     Reply Quote
Glenn Carver

Send message
Joined: 29 Oct 17
Posts: 1051
Credit: 16,656,265
RAC: 10,640
Message 71120 - Posted: 25 Jul 2024, 20:38:04 UTC - in response to Message 71119.  
Last modified: 25 Jul 2024, 20:40:48 UTC

I had a look at what the code for OpenIFS does. We can find 'finish_file' errors here too:
https://www.cpdn.org/cpdnboinc/result.php?resultid=22438218

again, it's an older boinc client 7.14.

I think while I'm still hunting bugs I would like more information from the models not less. OIFS puts out a 'resuming' message which would be useful (for me anyway). Plus timestamps on the messages. I take your point it can make for a lot of noise, but that's the problem when debugging these things remotely.

I have often toyed with the idea of having a 'developer control' XML file that can be placed in the project/task dir that allows me (or anyone) to control things such as debug message levels.
---
CPDN Visiting Scientist
ID: 71120 · Report as offensive     Reply Quote
ProfileDave Jackson
Volunteer moderator

Send message
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 71121 - Posted: 26 Jul 2024, 9:52:05 UTC
Last modified: 26 Jul 2024, 9:56:47 UTC

And another All seem to be with boinc7.14.2 This one on the same machine as one of the two from #1020. Would there be any mileage in looking at the release notes of boinc in two or three subsequent versions to see if anything was fixed that might have caused the problem? Just a thought about another angle of attack.
ID: 71121 · Report as offensive     Reply Quote
1 · 2 · Next

Message boards : Number crunching : finish file present too long

©2024 cpdn.org