Posts by Glenn Carver

1) Message boards : Number crunching : New Work Announcements 2024 (Message 71132)
Posted 9 hours ago by Glenn Carver
Post:
I'd rather it was a thread specific to the problem/issue, not all the recent batches. That's how forums normally work. Sorry Dave.
2) Message boards : Number crunching : finish file present too long (Message 71129)
Posted 12 hours ago by Glenn Carver
Post:
Interesting point. I have to use an older version of boinc because they abandoned 32bit builds a while ago, which I need as WaH is still 32bit. I've checked and I compile & link against boinc v7.20.2. This is the latest version that still includes the Visual Studio 32bit project files. It does have the code fix from the PR you mentioned. So the WaH side of things is ok?
3) Message boards : Number crunching : New Work Announcements 2024 (Message 71124)
Posted 14 hours ago by Glenn Carver
Post:
All Weather@Home batches mentioned in the previous post have now been submitted.

Please do report any issues or problems, as they can be useful when debugging the code remotely. Please post to a new thread and not this one; it makes searching a lot easier. Thanks and enjoy!
4) Message boards : Number crunching : finish file present too long (Message 71123)
Posted 14 hours ago by Glenn Carver
Post:
Dave, it's because the timeout on the finish_file was too short in those earlier boinc versions. I think Richard mentioned in an earlier post it's since been raised to 10 minutes, which seems to solve it for busy systems. And it's the busy systems that are constantly suspending/resuming that seem to have the problem. We were wondering whether to impose some kind of limit on the boinc version in the task XML but personally I am reluctant to make the system any more complicated than it is already.
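
To make the timeout idea concrete, here is a minimal sketch of the kind of check involved. It is not the boinc client's actual code: the function name, its arguments and the constant name are made up for illustration, and the 10-minute value is simply the figure mentioned above.

```python
# Hypothetical constant name; 10 minutes is the value quoted in this thread.
FINISH_FILE_TIMEOUT_S = 10 * 60


def finish_file_present_too_long(finish_file_seen_at, now, process_running,
                                 timeout_s=FINISH_FILE_TIMEOUT_S):
    """Return True if the task would be flagged with the finish-file error.

    finish_file_seen_at: time in seconds when the finish file was first seen
    now: current time in seconds
    process_running: whether the task process still exists
    """
    if not process_running:
        # The process exited after writing the finish file: normal completion.
        return False
    # The process is still alive; flag it only once the limit is exceeded.
    return (now - finish_file_seen_at) > timeout_s
```

With a short limit, a busy machine that takes a while to let the process exit trips the check; with a longer limit the same machine finishes cleanly.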

Richard, I had a look at the debug output. I know what's going on. For this scenario, it appears the client is asking the monitor code to repeat tidy-up work it has already done, and it then fails somewhere. I've created an issue to look into it but it's not the highest priority.
5) Message boards : Number crunching : finish file present too long (Message 71120)
Posted 1 day ago by Glenn Carver
Post:
I had a look at what the code for OpenIFS does. We can find 'finish_file' errors here too:
https://www.cpdn.org/cpdnboinc/result.php?resultid=22438218

Again, it's an older boinc client, 7.14.

I think while I'm still hunting bugs I would like more information from the models not less. OIFS puts out a 'resuming' message which would be useful (for me anyway). Plus timestamps on the messages. I take your point it can make for a lot of noise, but that's the problem when debugging these things remotely.

I have often toyed with the idea of having a 'developer control' XML file that can be placed in the project/task dir that allows me (or anyone) to control things such as debug message levels.
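
As a rough sketch of what reading such a file could look like (purely hypothetical: no such file exists in CPDN or boinc today, and the element names `developer_control`, `debug_level` and `timestamps` are invented for the example):

```python
import xml.etree.ElementTree as ET

# Hypothetical file contents; these element names are assumptions.
EXAMPLE = """<developer_control>
    <debug_level>2</debug_level>
    <timestamps>1</timestamps>
</developer_control>"""

# Values used when the file is missing, broken or incomplete.
DEFAULTS = {"debug_level": 0, "timestamps": 0}


def read_developer_control(xml_text):
    """Return the settings from the control file, falling back to defaults."""
    settings = dict(DEFAULTS)
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # A broken control file should never stop the task from running.
        return settings
    for key in DEFAULTS:
        node = root.find(key)
        if node is None or node.text is None:
            continue
        try:
            settings[key] = int(node.text.strip())
        except ValueError:
            pass  # Keep the default for non-numeric values.
    return settings
```

The point of the defaults is that a task behaves exactly as now unless someone deliberately drops the file into the project or task directory.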
6) Message boards : Number crunching : finish file present too long (Message 71117)
Posted 1 day ago by Glenn Carver
Post:
Ok thanks. When I've finished with these new batches, I will look in the boinc code and see what can be added to the output.
7) Message boards : Number crunching : Tasks available, but I am not getting them. (Message 71114)
Posted 1 day ago by Glenn Carver
Post:
CPDN have found the problem. The linux app was deprecated but accidentally got re-enabled when the new wah2-ri v8.32 was installed. It's been deprecated again and should stop any more linux tasks going out. But let the tasks complete normally - there's no need to abort them.
8) Message boards : Number crunching : finish file present too long (Message 71111)
Posted 1 day ago by Glenn Carver
Post:
I had the same thought. I have added a timestamp to the messages coming from the W@H monitor code in the next version. Adding a reason code is a good idea, I'll do that. Hadn't realised it was possible. Thanks.
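
For illustration only, a timestamped message with an optional reason could be formatted along these lines. This is not the W@H monitor code; the function name and the `[reason=...]` suffix are assumptions, and the `HH:MM:SS (pid):` prefix just mimics the style of the boinc_finish line seen in task stderr.

```python
import os
import time


def monitor_message(text, reason=None, when=None, pid=None):
    """Format a log line as 'HH:MM:SS (pid): text', with an optional reason."""
    when = time.localtime() if when is None else when
    pid = os.getpid() if pid is None else pid
    line = f"{time.strftime('%H:%M:%S', when)} ({pid}): {text}"
    if reason is not None:
        # Hypothetical way of attaching a reason code to the message.
        line += f" [reason={reason}]"
    return line
```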

Although the monitor code 'logs' a checkpoint with the boinc client, that's not necessarily when the model itself does it. That is under the control of the model itself.
9) Message boards : Number crunching : Tasks available, but I am not getting them. (Message 71110)
Posted 1 day ago by Glenn Carver
Post:
I'm not sure how you got that task. The linux app should be disabled and not used for Weather@Home. I will check with CPDN. All Weather@Home batches should be Windows only.

Let it finish. It'll be ok.

After fixing a few things in the code, the Linux version of W@H works fine. But before using it for batches, CPDN need to assess the differences in results to the Windows version.
10) Message boards : Number crunching : #1020,1,2,3... (Message 71106)
Posted 2 days ago by Glenn Carver
Post:
#1023 5040 tasks GHG WAH2 East Asia 25km
And another one!
Dave, I posted a list of the forthcoming batches already. See: https://www.cpdn.org/forum_thread.php?id=9232&postid=71086
11) Message boards : Number crunching : finish file present too long (Message 71105)
Posted 2 days ago by Glenn Carver
Post:
Richard, here's an example of the output from a task that's repeatedly being Suspended by the client but not being held in memory whilst suspended. You get a group of 'startup' messages from the model each time.
https://www.cpdn.org/cpdnboinc/result.php?resultid=22461413

There must be some semi-intelligent way of detecting this in the monitoring code, though I am unsure what the best approach would be to deal with it. One of my bug-bears with boinc is that it's not possible to send information from the task back up to the client to present to the user. Or send a task that checks what settings the client has where the scheduler can check if 'keep in memory whilst suspended' is on.
12) Message boards : Cafe CPDN : How to detect hallucinations in AI & advances in AI weather forecasting (Message 71102)
Posted 2 days ago by Glenn Carver
Post:
How do you detect when an AI model has come up with something that sounds plausible but isn't?

https://www.ox.ac.uk/news/2024-06-20-major-research-hallucinating-generative-models-advances-reliability-artificial

AI & ML are becoming more prominent in weather forecasting. Google also published this article recently:
https://www.nature.com/articles/d41586-024-02391-9
13) Message boards : Number crunching : finish file present too long (Message 71100)
Posted 3 days ago by Glenn Carver
Post:
We have at least 882 restarts, in somewhere between 1.75 and 18.5 hours. That makes limits of between 8.4 and 0.8 attempts per minute. The only way I can think of pausing an app at that sort of frequency is to set BOTH a thermal limit of less than 100% CPU usage AND remove tasks from memory when suspended. That shouldn't be possible, should it?
The task is staying in memory, otherwise we'd see the normal model startup messages about checking namelists etc after the suspend message. Might be suspension when non-BOINC usage is above xxx%. If it's a server the person might be using spare cycles for boinc work? The machine looks busy because of the delay on removing the finish_file.
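
The rates quoted at the top of this post follow directly from the restart count and the two time bounds; a quick check of the arithmetic (the function name is just for the example):

```python
def restarts_per_minute(restarts, hours):
    """Average number of restarts per minute over the given number of hours."""
    return restarts / (hours * 60)


for hours in (1.75, 18.5):
    rate = restarts_per_minute(882, hours)
    print(f"{hours} h: {rate:.1f} restarts per minute")
```

That is 882 / 105 minutes at the short end and 882 / 1110 minutes at the long end, which rounds to the 8.4 and 0.8 per minute given above.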

The bit I don't understand is why this task was flagged as success.
14) Message boards : Number crunching : finish file present too long (Message 71097)
Posted 3 days ago by Glenn Carver
Post:
I've looked at the code ref, thanks. I'm not sure that's triggered for multiple Suspend requests though? It specifically mentions calls to 'exit(0)' which wouldn't be the case if the task was suspended in memory. That would just be a SIGSTOP signal sent to the process.

This is what happened when the finish_file went to the wrong directory. The task called 'exit(0)', the client can't see the finish_file so the client then restarts it. That code is in the function you pointed to.

CPDN tasks don't use checkpointing as boinc understands it. The model handles its own checkpointing; the client has no control. So it's not that.
15) Message boards : Number crunching : finish file present too long (Message 71094)
Posted 3 days ago by Glenn Carver
Post:
This error has popped up again on the latest EAS25 batch. I'm posting this in case it's useful later for anyone searching for more information on this bug. The task in question is: https://www.cpdn.org/cpdnboinc/result.php?resultid=22462930. If the stderr has long gone, the key part of the task stderr log is:

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
finish file present too long</message>
<stderr_txt>
spended CPDN Monitor - Suspend request from BOINC......(lots of suspend requests) ...
Suspended CPDN Monitor - Suspend request from BOINC...
Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 19636, selfPID = 24304, iMonCtr = 1
Model crash detected, will try to restart...
Leaving CPDN_Main::Monitor...
monitor:finished called ... tidying up.
monitor:finished: Uploading out files...
Queuing intermediate upload for CPDN/BOINC: cpdnout_out.zip
Detaching shared memory... Done.
monitor:finished: Closed output file : stdout_<>.txt
modelResultFiles : Removing : wah2_eas25_a3cd_200612_24_1020_012305460 in C:\ProgramData\BOINC/projects/climateprediction.net
monitor:finished: handing over to boinc_finish(RetVal=0)
21:29:20 (24304): called boinc_finish(0)
CPDN Monitor - Abort request from BOINC...
monitor:finished called ... tidying up.
monitor:finished: Uploading out files...

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0048C569 write attempt to address 0x0271712C

Engaging BOINC Windows Runtime Debugger...
The host is a 2950X with 32GB RAM. It's running quite an old client, as noted before in previous posts.

What's odd about this output is the repeated extra lines from the CPDN monitor process AFTER it seems to hand over to the boinc client. I think what happens here is that the monitor process calls boinc_finish (time stamp 21:29), but something goes wrong in the boinc client with the finish_file and the client then appears to issue an Abort request to the monitor process. The monitor goes into the 'tidying up' phase again and tries to upload the output files again, which doesn't work as it has already done this. I am uncertain how the mechanism works for the client to re-enter the monitor process this way, but it looks like this is the cause of the error.

The other issue here is why this gets flagged as a successful completion when the debugger has been initiated.
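
For anyone wanting to check other task logs for the same pattern, a small sketch along these lines would pull out whatever the monitor logged after the boinc_finish call and count the suspend requests. The function names are made up for the example, and the matching strings are taken from the stderr text quoted above, so they may not match other app versions.

```python
def lines_after_boinc_finish(stderr_text):
    """Return the non-empty stderr lines logged after the first boinc_finish call."""
    lines = stderr_text.splitlines()
    for i, line in enumerate(lines):
        if "called boinc_finish(" in line:
            return [later for later in lines[i + 1:] if later.strip()]
    return []


def count_suspend_requests(stderr_text):
    """Count the monitor's 'Suspend request from BOINC' messages."""
    return sum("Suspend request from BOINC" in line
               for line in stderr_text.splitlines())


# A few lines copied from the stderr quoted in this post.
SAMPLE = """Suspended CPDN Monitor - Suspend request from BOINC...
monitor:finished: handing over to boinc_finish(RetVal=0)
21:29:20 (24304): called boinc_finish(0)
CPDN Monitor - Abort request from BOINC...
monitor:finished called ... tidying up.
"""

print(count_suspend_requests(SAMPLE))
print(lines_after_boinc_finish(SAMPLE))
```

A task that finished normally should give an empty list from the second function; anything returned there is monitor activity after the hand-over.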
16) Message boards : Cafe CPDN : Off-Grid Solar/Renewable Energy Discussion (Message 71093)
Posted 3 days ago by Glenn Carver
Post:
With all my kit on running boinc tasks it uses 1.2kW, which can drain the batteries pretty quick if I run overnight. So I'm just a daytime volunteer. :)
17) Message boards : Number crunching : WINE or VM? (Message 71092)
Posted 3 days ago by Glenn Carver
Post:
Difference might be due to the process priority. On Windows the default is 'Low' and I get a speed boost changing it to 'Normal'. There's been a previous thread about this. I don't know how Wine would interpret process priority that the boinc client might try to set on linux.

I've tested a Win10 VM against Wine and Wine (not surprisingly) is faster. Less in the way of the bare metal.
18) Message boards : Number crunching : New Work Announcements 2024 (Message 71086)
Posted 4 days ago by Glenn Carver
Post:
My mistake. The CPDN internal files have those prefixes. Apologies.

Batch 1020 is the CLIM batch.

The others will come out as:

1021 : ALL : 6048 workunits
1022 : NAT : 5040 workunits
1023 : GHG : 5040 workunits
1024 : AER : 5040 workunits
1025 : P15 : 5544 workunits
1026 : P20 : 5544 workunits
1027 : P30 : 5544 workunits
19) Message boards : Number crunching : New Work Announcements 2024 (Message 71083)
Posted 4 days ago by Glenn Carver
Post:
First new W@H EAS25 batch going out today, 5000 workunits. A total of 42,800 Windows workunits will go out this week.
20) Message boards : Number crunching : Leftover files in projects/climateprediction (Message 71082)
Posted 5 days ago by Glenn Carver
Post:
The latest weather@home app does not put the task directory into the boinc slot directory. I have that in the development version but it's not released yet.

OpenIFS tasks do, however, run in the slot directory.

As has already been said, using the Reset Project option in the boinc client is the easiest way to clear out old files, for any project.



©2024 climateprediction.net