Posts by Glenn Carver

21) Message boards : Number crunching : New Work Announcements 2024 (Message 70821)
Posted 18 days ago by Glenn Carver
Post:
Batch 1015 is being released now. This is the next batch in the East Asia 25km configuration (eas25).

We do not anticipate significant problems with this batch. As always, please open a new thread to report specific issues.
22) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70820)
Posted 19 days ago by Glenn Carver
Post:
Batch 1012 will fail on Intel machines. Batches 1013 & 1014 should continue running.
23) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70815)
Posted 21 days ago by Glenn Carver
Post:
It's probably where the model was restarted and ran over a trickle-up timestep again.
24) Message boards : Number crunching : OpenIFS Discussion (Message 70809)
Posted 22 days ago by Glenn Carver
Post:
There are some new OpenIFS BL app batches coming once code development & testing is complete (some time yet).
25) Message boards : Number crunching : New Work Announcements 2024 (Message 70805)
Posted 22 days ago by Glenn Carver
Post:
There will be another small batch out today or tomorrow to test some updated files. We expect this to work. Then there will be a 5000-workunit batch to follow soon after the test. This will be the aerosol (AER) experiment tested with batch 1013.

We are checking the upload server's disk space availability before sending out any further batches. There are plans for a further 5 Weather@Home batches for the East Asia 25km experiments to come.
26) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70803)
Posted 22 days ago by Glenn Carver
Post:
They are all EAS25 configuration (East Asia 25km resolution). Releasing the batch with version 8.24 of the wah2 app was a mistake; it was stopped and re-released as 8.29. We're testing the input files to see which ones are causing the problems and what we're going to do about it.
27) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70765)
Posted 25 days ago by Glenn Carver
Post:
There will be some more small batch tests going out this week.
Progress has been made on the problem with batch 1008.
28) Message boards : Number crunching : Completing a WU? Impossible. How's the situation today? (Message 70763)
Posted 26 days ago by Glenn Carver
Post:
It won't be next week.
29) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70758)
Posted 27 days ago by Glenn Carver
Post:
Hi Richard, the model only loads the external library when it needs to convert the model's raw output ready for sending. That doesn't happen at model start, but at fixed points in the forecast, so the model will start fine and only load the library after some time. Hence a possible explanation for why they all fail on 1/Jan.

The boinc_finish error code is whatever value was passed to it. It could come from the return/errno value of the LoadLibrary() call, or it might come from a Fortran operation. I'm still looking for the exact point of failure in the code.

(https://stackoverflow.com/questions/38579909/loadlibrary-fails-with-error-code-193)
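For reference, error 193 in that link is the Windows ERROR_BAD_EXE_FORMAT code. A minimal sketch for decoding the codes mentioned in this thread; the helper name is my own, not part of the CPDN code:

```python
# Hypothetical helper: map the Windows error codes discussed in this
# thread to their meanings. 193 is ERROR_BAD_EXE_FORMAT, the error
# LoadLibrary() reports when a DLL/EXE is the wrong architecture or corrupt.
WINDOWS_ERROR_NAMES = {
    126: "ERROR_MOD_NOT_FOUND: the DLL or one of its dependencies is missing",
    193: "ERROR_BAD_EXE_FORMAT: not a valid application "
         "(architecture mismatch or corrupt file)",
}

def explain_boinc_finish(code: int) -> str:
    """Best-guess interpretation of a boinc_finish(code) value seen in stderr."""
    return WINDOWS_ERROR_NAMES.get(code, f"unrecognised code {code}")
```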
30) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70756)
Posted 27 days ago by Glenn Carver
Post:
I don't want anyone spending any time looking at their failed tasks. Appreciate the response for the small test.

There is one clue in the log output (which might be a red herring): the regional model calls boinc_finish with an error code of 193. In Windows that means a bad executable, so I'm looking at the library the model loads dynamically during the run to handle converting the model output. It's possible it has been corrupted in some way. If that's not it, then it's back to the model code.
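One quick way to check whether a DLL is corrupt or built for the wrong architecture, as suspected above, is to read the machine field of its PE header. A sketch under that assumption, not part of the CPDN tooling:

```python
import struct

# Machine types from the PE/COFF spec for the two common Windows targets.
PE_MACHINE = {0x014C: "x86 (32-bit)", 0x8664: "x86-64"}

def pe_machine(data: bytes) -> str:
    """Return the target architecture encoded in a PE file's raw bytes."""
    if data[:2] != b"MZ":
        return "not a PE file"
    # Offset 0x3C of the DOS header holds the offset of the 'PE\0\0' signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        return "corrupt or not a PE file"
    # The 2-byte machine field immediately follows the signature.
    (machine,) = struct.unpack_from("<H", data, e_lfanew + 4)
    return PE_MACHINE.get(machine, f"unknown machine 0x{machine:04x}")
```

Running this over the DLL the model loads would show immediately whether a 32/64-bit mismatch explains the 193.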
31) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70743)
Posted 28 days ago by Glenn Carver
Post:
There will be a small batch of about 100 workunits going out soon to test whether the issue we're seeing with this batch is related to some of the input files.
32) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70740)
Posted 28 days ago by Glenn Carver
Post:
> I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving.
> Should I just abort the two that are yet to start? I have five others that I can save files from that have all produced either 4 or 5 zips. Or would looking at what happens at the point where they fail on Intel machines be more useful?

Hard to answer that, as I'm not the project scientist and it's really their call together with CPDN. Personally, as a developer I have all the kit I need to debug on Intel & AMD, so don't spend time saving files. As a volunteer, if it were me, I'd abort the tasks yet to start and keep running the tasks currently going until told otherwise. They might be useful for comparison later. Sorry Dave, that's the best answer I can give at the moment.
33) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70736)
Posted 28 days ago by Glenn Carver
Post:
I believe the Intel runs are behaving correctly and failing. It's the AMD runs not behaving.

Yes, this batch will be stopped from producing resends until we understand why testing did not show this problem.
34) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70724)
Posted 3 Apr 2024 by Glenn Carver
Post:
I doubt it. Wrong flags would have been picked up at compile time. It's exactly the same executable used successfully for the 1006 and 1007 batches, but this input data is causing a problem.
Optimisation is enabled up to O2 and code dispatch up to SSE 4.2. I'm not an expert on AMD, but I believe it also supports SSE 4.2.
I note the models all seem to fail on 1/Jan, which suggests a problem with the input data in some way, maybe related to precision. Optimisation of the Fortran 77 code by a modern compiler could be playing a role too. I've got a fun few days ahead :)
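The precision point above can be illustrated without any Fortran: re-associating a floating-point sum, which aggressive optimisation may do to legacy Fortran 77 code, can change the answer. A toy Python illustration, not the model's actual arithmetic:

```python
# Summing the same three numbers in a different order gives different
# double-precision results: the 1.0 is absorbed when added to 1e16 first.
vals = [1.0, 1e16, -1e16]
left_to_right = (vals[0] + vals[1]) + vals[2]  # 1.0 lost in the first add
reassociated = vals[0] + (vals[1] + vals[2])   # cancellation happens first
```

The two expressions differ by exactly 1.0, which is the kind of small divergence that can snowball in a long model integration.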
35) Message boards : Number crunching : New Work Announcements 2024 (Message 70719)
Posted 3 Apr 2024 by Glenn Carver
Post:
As if on cue... There are two more batches planned for the Weather@Home EAS25 region ASAP. However, these have been put on hold temporarily while we investigate the error behaviour of the current 1008 batch.
36) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70715)
Posted 3 Apr 2024 by Glenn Carver
Post:
> Dave, you might recall your dev test did fail and that was on AMD.
> And that one completed for Richard on an Intel machine.

Yup. But all my Intel-based workunits are failing for 1008, and the only ones working at the minute are on AMD (scratching of head). I don't think it's a particular input file, as they are different between the failed tasks. So for the time being the focus is on understanding what the code is doing.
37) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70713)
Posted 3 Apr 2024 by Glenn Carver
Post:
Dave, you might recall your dev test did fail and that was on AMD. Without running some analysis on the database I can't give you a good answer. Repeating a failed task standalone reproduces the failure, so I've got something to debug now.

> The only pattern I've noticed (if it is a pattern) is that my failures were on a Win10 VM running on an Intel chip, whereas the same VM running on an AMD has got 3 tasks past 1/Jan.

> Any idea of the percentage of Intel vs AMD chips? I have been trawling, and every single failure I have looked at has been Intel, but the overwhelming majority of tasks have not returned a zip yet, so there is no evidence they are running correctly. Mine which have returned zips are all Win10 in a VM as opposed to WINE, which might mask failures. (All on an AMD Ryzen 7 3700X.)
38) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70709)
Posted 3 Apr 2024 by Glenn Carver
Post:
Thanks Richard, but it won't be of any use. There is not enough information in the returned files to determine the cause. Workunits use different input files to get the forecast spread. It might be related to a problem in one of the files some of the workunits use. First step is to reproduce it locally and we'll go from there.
39) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70705)
Posted 3 Apr 2024 by Glenn Carver
Post:
They will all have the same output. There does appear to be some difference in success rate between Intel & AMD, but for now I'm running a failed task standalone to see what's going on in more detail.
40) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70703)
Posted 3 Apr 2024 by Glenn Carver
Post:
I've found this myself and done some preliminary investigation. In the 3 task failures I've had, all were due to the regional model crashing as it tried to run 1/Jan. The forecasts all start from 1/Dec. I'm looking into it.

The only pattern I've noticed (if it is a pattern) is that my failures were on a Win10 VM running on an Intel chip, whereas the same VM running on an AMD has got 3 tasks past 1/Jan.

I'll be running a failed workunit standalone to debug what's going on. The other two batches have been held pending investigation of possible issues with this one.

p.s. to determine which model has failed, look in the stderr for these lines:
executeModelProcess: MonID=8904, GCM_PID=10012, RCM_PID=252
23:57:52 (252): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 252, selfPID = 10012, iMonCtr = 2
Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 10012, selfPID = 8904, iMonCtr = 1

'Global Worker' is the global model, and it says it's checking process id 252. From the executeModelProcess line above it, this process id belongs to the regional model (RCM_PID). If the regional model dies, then the global model dies as well; hence the 'CPDN process is not running, exiting'. The monitor controller process then reports that the global model has died, and it then exits too.
To find out where the model was, navigate to the task folder inside 'projects/climateprediction.net' in your BOINC data folder and you'll find a stdout_mon.txt file with the timesteps listed.
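The check described above can also be scripted. A sketch of my own (not a CPDN tool) that applies the same logic to the stderr excerpt quoted earlier:

```python
import re

# Work out which model process (global or regional) failed first,
# given the stderr lines quoted in the post above. PIDs come from the
# executeModelProcess line; the boinc_finish line names the PID that exited.
STDERR = """\
executeModelProcess: MonID=8904, GCM_PID=10012, RCM_PID=252
23:57:52 (252): called boinc_finish(193)
Global Worker:: CPDN process is not running, exiting, bRetVal = T, checkPID = 252, selfPID = 10012, iMonCtr = 2
Controller:: CPDN process is not running, exiting, bRetVal = T, checkPID = 10012, selfPID = 8904, iMonCtr = 1
"""

def failed_model(stderr: str) -> str:
    """Return 'regional', 'global', or 'unknown' for a task's stderr text."""
    pids = re.search(r"GCM_PID=(\d+), RCM_PID=(\d+)", stderr)
    finish = re.search(r"\((\d+)\): called boinc_finish\((\d+)\)", stderr)
    if not pids or not finish:
        return "unknown"
    gcm, rcm = pids.group(1), pids.group(2)
    dead = finish.group(1)  # the PID printed before 'called boinc_finish'
    return "regional" if dead == rcm else "global" if dead == gcm else "unknown"
```

For the excerpt above this reports the regional model, matching the manual diagnosis: PID 252 is the RCM_PID and it is the process that called boinc_finish(193).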



©2024 climateprediction.net