Posts by Glenn Carver

1) Message boards : Number crunching : stuck task (1006) (Message 70926)
Posted 6 days ago by Glenn Carver
Post:
Both files are 14KB, and each contains 14239 NUL characters.
Ok, thanks for letting me know. Not sure how that could have happened. I fixed a memory overwrite bug that was causing the model to crash at the start of the new year, but it wasn't anywhere near the code at the start of the model run. I can fix the hang but there might be something else that's causing a file full of nothing.

I've created an issue for this and I should get time to fix it before the next version goes out. Thanks for the report.
2) Message boards : Number crunching : stuck task (1006) (Message 70922)
Posted 6 days ago by Glenn Carver
Post:
I've looked into this and can see what's happening in the code. The file the model is trying to read is either missing or has zero contents. That should cause the model process to die, but I've identified an error in the code which means that doesn't happen. So thanks for reporting the issue; it's a big help.

That error is unrecoverable, so feel free to Abort the task.
3) Message boards : Number crunching : stuck task (1006) (Message 70921)
Posted 7 days ago by Glenn Carver
Post:
*********************************************************************************
Model aborted with error code - 1 Routine and message:-
READHIST: End of file in READ from history file for namelist NLIHISTO
*********************************************************************************
Right, thanks. That's the problem. The global model has tried to read its history file and failed. However, it should have exited at that point but didn't, for some reason I'm not sure of yet. Because the global model process didn't exit, the others didn't either (they each check that the others are still running).

I'd need to look in the code to remind me which file it's trying to read. It'll be one of: dataout/xadae.phist or dataout/xadae.thist. Those two text files should be identical; each contains 173 lines, and the top of the file should look similar to:

 &NLIHISTO
 MODEL_DATA_TIME =        1840,          12,           1, 3*0,
 RUN_MEANCTL_RESTART     =           0,
 RUN_INDIC_OP    =           0,
 RUN_RESUBMIT_TARGET     =           0,           3,          19, 3*0,
 FT_LASTFIELD    = 40*0,         119,         152,         114, 78*0,         442, 8*0
 /
What does your file look like? The error message suggests the file is empty or truncated.
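If it's quicker, here's a rough sketch in Python of a check you could run from the task's working directory (based on the above: a healthy file should have 173 lines, no NUL bytes, and start with &NLIHISTO):

    from pathlib import Path

    # Quick sanity check of the two history files (run from the
    # task's working directory). A good file: 173 lines, 0 NULs,
    # first line " &NLIHISTO".
    for name in ("dataout/xadae.phist", "dataout/xadae.thist"):
        p = Path(name)
        if not p.exists() or p.stat().st_size == 0:
            print(f"{name}: missing or empty")
            continue
        data = p.read_bytes()
        lines = data.decode("ascii", errors="replace").splitlines()
        print(f"{name}: {len(data)} bytes, {len(lines)} lines, "
              f"{data.count(0)} NUL bytes")
        print("  first line:", lines[0] if lines else "<none>")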
4) Message boards : Number crunching : stuck task (1006) (Message 70919)
Posted 7 days ago by Glenn Carver
Post:
Thanks for the update, Glenn. Checked Task Manager and all 3 process IDs were in a Running state, but not consuming any CPU. I tried stopping the client and restarting, which reset the remaining time to 1d17h+, but I can see that it's still not using any CPU. Again, not a high priority.

So after the client restart, all 3 processes are there but apparently not consuming CPU? Is that right?

Let's check a few things. You mentioned the stdout_mon.txt file in the task's working directory. Can you please go back to that file and see what the last line is. I'm interested to see if there are any lines indicating the model is running timesteps. If it is, you'll see lines like this:
wah2_eas25_h000_208912_24_d643_000011038 - PH 1 TS 0025346 A - 07/02/2090 00:30 - H:M:S=0010:12:20 AVG= 1.45 DLT= 0.00

The global model output log is the text file dataout/xadae.out (relative to the stdout_mon.txt file). The regional model log is dataout/xacxf.out. These files might be quite long, or they'll be empty. If long, please just post the last 10 lines from each file.

Also, could you let me know the sizes of the following 3 files, also in 'dataout': atmos_restart.day, region_restart.day, shmem_restart.day. I'd like to compare them with known working file sizes. (A sketch for gathering all of this follows at the end of this post.)

What might be happening is that all the processes start but the global & regional models haven't started running timesteps. They hand-shake via shared memory; the shmem_restart.day file is a dump of the shared memory block. If that file is damaged, it could be that both models are waiting for the other to finish.

If any of these restart files are damaged, there's nothing you can do to recover them, as we don't keep more than 1 set of restarts to save on filespace. I think the task is probably a lost cause unfortunately, but it would be useful to understand what's happening.
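If it helps, here's a rough sketch in Python (run from the task's working directory) that pulls together everything asked for above:

    from pathlib import Path
    from collections import deque

    # Last 10 lines of the global (xadae) and regional (xacxf) model logs.
    for log in ("dataout/xadae.out", "dataout/xacxf.out"):
        p = Path(log)
        if p.exists() and p.stat().st_size > 0:
            with p.open(errors="replace") as f:
                tail = deque(f, maxlen=10)
            print(f"--- last lines of {log} ---")
            print("".join(tail), end="")
        else:
            print(f"{log}: missing or empty")

    # Sizes of the three restart files, for comparison with known-good runs.
    for rst in ("atmos_restart.day", "region_restart.day", "shmem_restart.day"):
        p = Path("dataout") / rst
        print(f"{p}: {p.stat().st_size} bytes" if p.exists() else f"{p}: missing")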
5) Message boards : Number crunching : stuck task (1006) (Message 70917)
Posted 8 days ago by Glenn Carver
Post:
stdout_mon.txt:
.....
executeModelProcess: MonID=17500, GCM_PID=18344, RCM_PID=14596
Those 3 numbers are the process IDs (PIDs) for the task. Have a look in Task Manager or Resource Monitor and see if those 3 processes are actually running. I suspect only 17500 is present and the other two are not. That would indicate the model has died but the BOINC side of things has not realised.

In which case, I suggest shutting down the BOINC client (not the machine) and restarting it. Hopefully that will clear it.
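If Task Manager is awkward, a quick way to check from Python (this needs the third-party psutil package; substitute the PIDs from your own stdout_mon.txt):

    # Check whether the monitor, global (GCM) and regional (RCM)
    # processes from stdout_mon.txt are still alive.
    import psutil

    pids = {"MonID": 17500, "GCM_PID": 18344, "RCM_PID": 14596}
    for label, pid in pids.items():
        state = "running" if psutil.pid_exists(pid) else "not running"
        print(label, pid, state)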

Cheers.
6) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70914)
Posted 10 days ago by Glenn Carver
Post:
Weather@Home running on Intel crashes at start of new year.

Followers of this thread will recall there were problems with batches 1008-1012 where the regional model would crash when it started the new calendar year. This only happened on Intel CPUs and not on AMD ones. I've been spending time understanding what's going on. The behaviour of the model was "correct" on Intel: it should have crashed.

The problem is caused by a bug in the model code which causes a memory overwrite. Not a lot, but enough to do some damage. It turns out this bug has been in the code from the time CPDN originally obtained it from the UK MetO (who have since moved this code on, I hasten to add). The impact of the bug was data dependent and also compiler-optimization dependent. The problem was in a part of the model code that recomputed the solar flux variability at the start of a new year. A scalar variable was being passed to a subroutine when it should have been an array of values. As the solar variability is small year on year and Weather@Home runs are relatively short, analysis shows this only has a minimal effect on model results. Certainly less than the variability introduced by the ensemble of forcings.

Investigating the crashes also identified another problem. There was a slight discrepancy in the land/sea masks used by the new sea-surface temperature input data and the model itself for some of the EAS25 batches. This led to some extra bogus sea-ice points appearing off the western edge of some coasts. This has now been corrected and verified with tests.

The code changes will require a new app version. This is being prepared though I am also making some more improvements to the exception handling and a few other aspects. It will be a couple of weeks before a new app appears. We will then rerun one of the earlier batches to do an analysis of the differences.
7) Message boards : climateprediction.net Science : World heading for 2.5C warming (Message 70911)
Posted 12 days ago by Glenn Carver
Post:
https://www.theguardian.com/environment/article/2024/may/08/world-scientists-climate-failure-survey-global-temperature
8) Message boards : Number crunching : Weather At Home 2 (wah2) (region independent) v8.29 - very short deadline & mismatch for total time calculation (Message 70909)
Posted 14 days ago by Glenn Carver
Post:
It's a grace period. It was originally implemented to deal with upload servers that develop problems preventing workunits from being uploaded.

The server will not keep adding to the grace period. If the task is not completed by the end of the deadline + grace time, the task will be reassigned to another host and the current host gets credit for work done.
9) Message boards : Number crunching : Recycled Work Units (Message 70908)
Posted 14 days ago by Glenn Carver
Post:
That's correct. That's normal operation. Each workunit has a maximum of three attempts at a successful run before being declared a failure.

Note that unlike some other projects, CPDN do not need more than 1 successful task to treat the workunit as succeeded.
10) Message boards : Cafe CPDN : Thanks for support (Message 70903)
Posted 18 days ago by Glenn Carver
Post:
I can second the advice of do what the physio says (from personal experience). Maybe time to treat yourself to that new blisteringly fast PC you've always wanted?!

Have just treated myself to 63GB of RAM. About to call it a night and give up on trying to find out election results.
I think the painkillers might still be affecting you Dave, 63?? :D
11) Message boards : Cafe CPDN : Thanks for support (Message 70900)
Posted 18 days ago by Glenn Carver
Post:
I can second the advice of do what the physio says (from personal experience). Maybe time to treat yourself to that new blisteringly fast PC you've always wanted?!
12) Message boards : Number crunching : New Work Announcements 2024 (Message 70896)
Posted 20 days ago by Glenn Carver
Post:
There will be a new batch going out in the next few days for the New Zealand 25km configuration (nz25). There will be a subsequent batch afterwards, but not soon due to storage constraints.
Oh, just got 3 resends from batch 1005 with app 8.24
Shall I keep them?
Yep. Nothing wrong with batch 1005. If there's a suspected problem with a batch, we disable resends while we investigate.
13) Message boards : Number crunching : Is there an official list of CPDN task names? (Message 70892)
Posted 20 days ago by Glenn Carver
Post:
I have an app_config file in my Linux box for CPDN tasks. It looks like this:
<snipped>....</snipped>

The first four entries are active and they work as expected.
The rest are commented out, but tasknames like those have been mentioned on these boards from time to time. I would like to delete the hopeless ones and correct the others, uncommenting them as they become available.

Is there an official list somewhere?

Hi, good question. Unfortunately, the list of apps shown on the 'Applications' page under 'Computing' on this website only shows the long names, not the short ones you need for the app_config.xml file. You can find the short names (<app><name>...</name></app>) in the client_state.xml file in the main BOINC data directory. However, this requires you first get a task!

You can remove the entries in your file with the long names; they will not work (I see you have them commented out).

You can also remove <name>oifs_43r3_ps</name>. The code for this app has now been integrated into the default OpenIFS app.

Here is the list of currently defined OpenIFS apps at CPDN on the main site (there are more in development):

OpenIFS 43r3_t159:      <name>oifs_43r3</name>
OpenIFS 43r3-bl_t159:   <name>oifs_43r3_bl</name>
OpenIFS 43r3_t255:      <name>oifs_43r3_l255</name>
OpenIFS 43r3_tco159:    <name>oifs_43r3_c159</name>
OpenIFS 43r3_tco95:     <name>oifs_43r3_c95</name>
OpenIFS 43r3_tl63:      <name>oifs_43r3_l63</name>

The '43r3' is the OpenIFS model version number. The '-bl' is the baroclinic lifecycle variant. The 't', 'tl' & 'tco' indicate the type of horizontal grid configuration the model is using.

I presume you only wanted the OpenIFS apps, as that's all that was in your file.

HTH.
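PS: if it's useful, a minimal app_config.xml entry using the short names above might look like this (the max_concurrent limits here are just an illustration; set your own):

    <app_config>
       <app>
          <name>oifs_43r3</name>
          <max_concurrent>2</max_concurrent>
       </app>
       <app>
          <name>oifs_43r3_bl</name>
          <max_concurrent>1</max_concurrent>
       </app>
    </app_config>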
14) Message boards : Number crunching : New Work Announcements 2024 (Message 70886)
Posted 21 days ago by Glenn Carver
Post:
Dave, best wishes for the new hip!
15) Message boards : Number crunching : New Work Announcements 2024 (Message 70884)
Posted 21 days ago by Glenn Carver
Post:
Do we know yet how much of the East Asia data is affected by the anomalous handling of land vs sea?
Batches 1006 & 1007 are fine. 1008-1012 are not (though we knew this anyway from earlier posts). The latter ones are still being looked at as far as I know.
16) Message boards : Number crunching : New Work Announcements 2024 (Message 70880)
Posted 21 days ago by Glenn Carver
Post:
There will be a new batch going out in the next few days for the New Zealand 25km configuration (nz25). There will be a subsequent batch afterwards, but not soon due to storage constraints.

For the East Asia 25km configuration, we have identified the main issue with the failing runs. There was a slight discrepancy between what the model treated as land points and what the new sea-surface temperature fields treated as land. This has been corrected and is currently under test. I can't give a date for the release of the batches yet.
17) Message boards : Number crunching : processors, memory, performance and heat. (Message 70878)
Posted 22 days ago by Glenn Carver
Post:
A practical reason to have swap these days is if you intend to use 'hibernate', which dumps memory to swap space on the drive (i.e. configure swap to be 1.5x installed RAM). Otherwise, modern OSes work fine without swap as long as you have sufficient memory for what you intend to do with the machine.
18) Message boards : climateprediction.net Science : Heavy rain in UAE & Oman an increasing threat (Message 70877)
Posted 22 days ago by Glenn Carver
Post:
https://www.worldweatherattribution.org/heavy-precipitation-hitting-vulnerable-communities-in-the-uae-and-oman-becoming-an-increasing-threat-as-the-climate-warms/

Interesting article about rain in a changing climate in these hyper-arid regions of the world.
19) Message boards : Cafe CPDN : WCG African Rainfall Project (ARP) restart update Apr 25, 2024 (Message 70872)
Posted 24 days ago by Glenn Carver
Post:
CPDN have been looking at incorporating the WRF model. It's similar to how WAH works. WCG implemented WRF in a peculiar way, by splitting timesteps, if I understand correctly. That keeps tasks short but at the expense of moving a lot more data around.
20) Message boards : Number crunching : Weather At Home 2 (wah2) (region independent) v8.29 - very short deadline & mismatch for total time calculation (Message 70866)
Posted 26 days ago by Glenn Carver
Post:
The original task deadlines used to be a year, which had not been changed since the days of the very long climate models. Those are no longer used; weather@home is a much shorter forecast.
Unfortunately we saw a lot of task hogging with those long deadlines, so they have recently been shortened. I believe it's set to 70 days per task, but with a 20(?) day grace period. What that means is a client has 70 days to start the task, otherwise it gets bounced to someone else. Once started, the deadline is then 70+20 days from the date the host computer received the task.
I'm not near a computer so I can't check those numbers, but they should be mostly correct.
As noted earlier, don't overload the CPU: stick to one CPDN task per core, not per thread. There is only one set of floating-point units per core.
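To put numbers on the arithmetic (with the caveat that the 20-day grace figure is from memory, as noted), a rough sketch with a hypothetical received date:

    from datetime import date, timedelta

    received = date(2024, 5, 1)          # hypothetical date the host got the task
    deadline_days, grace_days = 70, 20   # grace figure uncertain, see above
    report_by = received + timedelta(days=deadline_days + grace_days)
    print(f"Task must be completed by {report_by} (deadline + grace).")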

