Posts by Glenn Carver

1) Message boards : Number crunching : Batch 1015 Discussion/problems (Message 70852)
Posted 2 days ago by Glenn Carver
Post:
The batch file examples are down to misunderstanding how wildcarding works. We don't use batch files in the Windows apps. Also, if we had an error like that it would fail probably every time.

I guess most people use the default install of boinc, so it ends up on the C: drive. If that drive gets busy with other activity, a boinc process's drive access might time out, particularly because the boinc processes run at a lower priority. I'm just guessing, but maybe running in a VM might make this error more likely? (Assuming it's not a hardware issue, of course.)

The other Windows-related error we see (about 15-20% of task fails) is "Invalid control block address". When I looked this up it seemed to be related to Windows Update doing something. I didn't read too far once I knew it wasn't a problem I needed to fix :D. But it's not obvious to me why Windows Update should cause an issue for a running task. Maybe someone who knows Windows better than me has an idea. I'd be interested to know if it's potentially recoverable.
2) Message boards : Number crunching : Batch 1015 Discussion/problems (Message 70849)
Posted 3 days ago by Glenn Carver
Post:
The error message "The system cannot find the drive specified." is a Windows issue. It has a number of possible causes, such as a disk timeout or a failing drive. It might be worth doing a SMART check on the drive concerned if it keeps happening.
This error accounts for about 10% of CPDN WAH task fails.
3) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70844)
Posted 3 days ago by Glenn Carver
Post:
Please let it run. We're not 100% certain of the results. They are a useful comparison to the failed Intel runs.

The reason I ask this is because up to now it's appeared that only AMD PCs can run the 1008 tasks (though not produce correct results).

Hello,
I have 4 1008 tasks at about 60% on a Ryzen 3600x. Is it worth continuing them if they don't output correct results or should I abort them?
Thanks!
4) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70839)
Posted 4 days ago by Glenn Carver
Post:
Thanks. I can see the problem.
I have 9 tasks ... I see 9 wah2_8.29 tasks, 9 wah2am3m2_8.29 tasks, and 9 of the wah2am3m2t_8.29 tasks. One or two of them are showing no CPU use, though. Not sure how to link PIDs to tasks, though.
There are several ways to link a PID to a task; I like Resource Monitor. Start it up. On the CPU tab, scroll to find the CPDN task process you are interested in and click the little checkbox to the left of it. Then open up the 'Associated Handles' section (little up/down arrow on the title bar below), and it will show you all the files and folders associated with the process.

stdout_um.txt:
Starting HadAM3P model for ID# wah2_eas25_n15e_201212_24_1008_012272746...
... <snip>...
RUNID=xadae
Changing to slots dir C:\ProgramData\BOINC\slots\1
Closing model...
Detaching shared memory segment...

I don't see anything obviously wrong in them... I'm tempted to suspend and resume that task, see if it comes back up properly.
The last lines of that output from the global model 'stdout_um.txt' show the problem: 'Closing model'. That means the model has stopped but for some reason boinc hasn't recognised this and the process hasn't exited. That's why the model has hung up. The global model isn't running so the other two processes are just sitting waiting.
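As a rough illustration of that check (a minimal sketch, not part of the CPDN apps): the hypothetical helper `model_has_closed` looks for the 'Closing model' marker in the last few non-empty lines of the log text. The function name and the number of trailing lines examined are assumptions made for the example.

```python
def model_has_closed(log_text: str, tail: int = 5) -> bool:
    """Return True if 'Closing model' appears in the last `tail` non-empty lines."""
    lines = [line.strip() for line in log_text.splitlines() if line.strip()]
    return any("Closing model" in line for line in lines[-tail:])


# Sample text taken from the stdout_um.txt excerpt quoted above (snipped part left out).
sample = r"""Starting HadAM3P model for ID# wah2_eas25_n15e_201212_24_1008_012272746...
RUNID=xadae
Changing to slots dir C:\ProgramData\BOINC\slots\1
Closing model...
Detaching shared memory segment...
"""

print(model_has_closed(sample))  # prints True
```

If this returns True while the wah2 processes are still listed in Resource Monitor, that matches the hung state described here.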

Rather than suspend/resume, I would shut down the client to kill the processes. Make sure they really have gone (check Resource Monitor) and then start up the client again. It's possible the tasks will then error but that's what you need anyway.

HTH

p.s. I've just checked the machine this was running on. I noticed it's only got 8GB RAM. How many CPDN tasks are running simultaneously and how much of that 8GB is BOINC allowed to use? I'm thinking you might have hit a memory limit, causing this odd behaviour.
5) Message boards : Number crunching : Batch 1015 Discussion/problems (Message 70838)
Posted 5 days ago by Glenn Carver
Post:
I had a look. There's no 'stderr' output (the task log) on the task webpage, so I can't see why the model failed. Though the fact there's no stderr output on the task page is itself a clue.

I notice the PC only has 8GB RAM. How much RAM do you allocate for BOINC? And how many CPDN tasks do you have running at a time? I suspect a problem with available memory.
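To make the arithmetic behind those two questions concrete, here is a back-of-envelope sketch. The function name is hypothetical, and the 50% BOINC allowance and 1.5 GB per task are made-up figures for illustration, not measured values for these tasks.

```python
import math


def max_concurrent_tasks(total_ram_gb: float, boinc_fraction: float, per_task_gb: float) -> int:
    """How many tasks fit in the share of RAM that BOINC is allowed to use."""
    if per_task_gb <= 0:
        raise ValueError("per_task_gb must be positive")
    return math.floor(total_ram_gb * boinc_fraction / per_task_gb)


# Assumed numbers: 8 GB machine, BOINC allowed 50% of RAM, 1.5 GB peak per task.
print(max_concurrent_tasks(8, 0.5, 1.5))  # prints 2
```

Running more tasks than this estimate would leave them competing for memory, which is the kind of problem suspected here.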

Also, memory can get fragmented on Windows (similar to disk fragmentation). It's not impossible the task died because it couldn't allocate a memory segment big enough. The best way to clear memory fragmentation is to reboot the machine.

That's all I can help with on this.
6) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70834)
Posted 5 days ago by Glenn Carver
Post:
Rather than abort, could you please do me a favour? Open up Resource Monitor, click the 'CPU' tab and scroll down to find the 'wah2' list of processes. You should have 3 processes per task. For your task, n15e, please let me know how many you see. I think you'll only see one process: wah2_8.29_windows_intelx86.exe and not the wah2am_* and wah2rm_* processes. Can you confirm?

Also, if you know your way around the BOINC folder layout, would be great if you could locate the task directory and check a file for me. I'd like to see the last few lines of a file called 'stdout_mon.txt'. It can be found in the task directory, which will be under your BOINC install 'data' directory:
e.g. c:\Program Files\BOINC\data\projects\climateprediction.net\wah2_eas25_n15e_201212_24_1008_012272746_0\stdout_mon.txt
Note the 'data' directory under BOINC is usually hidden, so you'll need to show hidden folders in File Explorer.
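If opening the file in an editor is awkward, the last few lines can also be pulled out with a short script. This is only a sketch; the helper name `last_lines` is made up for the example, and the path has to be replaced with wherever the task directory actually is on your machine.

```python
from collections import deque
from pathlib import Path


def last_lines(path, n: int = 10) -> list[str]:
    """Return the last n lines of a text file, reading it line by line."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as fh:
        # deque with maxlen keeps only the most recent n lines as the file is read
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


# Example call (placeholder path, substitute your own task directory):
# for line in last_lines(r"<task directory>\stdout_mon.txt", 5):
#     print(line)
```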

The reason I ask this is because up to now it's appeared that only AMD PCs can run the 1008 tasks (though not produce correct results). It sounds like yours has crashed and I'd like to see how far it got.

After this, rather than Abort I suggest doing 'End Task' in Resource Monitor (right click on the correct process name). I *think* this will avoid your host being marked down for aborting tasks.

Many thanks.
7) Message boards : Number crunching : Batch 1015 Discussion/problems (Message 70831)
Posted 6 days ago by Glenn Carver
Post:
Yes, definitely. Batch 1007 is a valid batch. Don't abort it!

1006 & 1007 might be hitting the deadline for volunteers who have not yet started tasks. That might be why resends are coming.
8) Message boards : Number crunching : Batch 1015 Discussion/problems (Message 70827)
Posted 7 days ago by Glenn Carver
Post:
It seems like that task directory & files that should go into the slots directory still goes into the projects/climateprediction.net directory. When I ran out of work a couple of days ago I cleaned out all of the older ones but when I got new work today, new ones appeared.
This is changed in the next release. In order to keep consistent results for running projects we keep the version the same for all batches per project.
9) Message boards : Number crunching : OpenIFS Discussion (Message 70826)
Posted 7 days ago by Glenn Carver
Post:
Andy has sent some tasks out for testing. these are peaking at a bit over 9GB/task which is the highest I have seen yet I think.
I checked with Andy about this as the configuration he's testing should only be ~3.5GB. He said he's enabled boinc_diagnostics, which is a BOINC API set of functions for tracing various problems in the code. He's debugging the 'double free corruption' problem that we previously saw.

The production version of OpenIFS in the T159 configuration doesn't use anywhere near 9GB.
10) Message boards : Number crunching : New Work Announcements 2024 (Message 70821)
Posted 7 days ago by Glenn Carver
Post:
Batch 1015 is being released now. This is the next batch in the East Asia 25km configuration (eas25).

We do not anticipate significant problems with this batch. As always, please open a new thread to report specific issues.
11) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70820)
Posted 9 days ago by Glenn Carver
Post:
Batch 1012 will fail on Intel machines. Batches 1013 & 1014 should continue running.
12) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70815)
Posted 11 days ago by Glenn Carver
Post:
It's probably where the model was restarted and ran over a trickle up timestep again.
13) Message boards : Number crunching : OpenIFS Discussion (Message 70809)
Posted 11 days ago by Glenn Carver
Post:
There are some new OpenIFS BL app batches coming once code development & testing is complete (some time yet).
14) Message boards : Number crunching : New Work Announcements 2024 (Message 70805)
Posted 11 days ago by Glenn Carver
Post:
There will be another small batch out today or tomorrow to test some updated files. We expect this to work. Then there will be a 5000 workunit batch to follow soon after the test. This will be the aerosol (AER) experiment tested with batch 1013.

We are checking the upload server disk space availability before sending out any further batches. There are plans for a further 5 Weather@Home batches for the East Asia 25km experiments to come.
15) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70803)
Posted 12 days ago by Glenn Carver
Post:
They are all EAS25 configuration (East Asia 25km resolution). The batch using version 8.24 of wah2 was a mistake. It was stopped and rereleased as 8.29. We're testing the input files to see which ones are causing the problems and what we're going to do about it.
16) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70765)
Posted 14 days ago by Glenn Carver
Post:
There will be some more small batch tests going out this week.
Progress has been made on the problem with batch 1008.
17) Message boards : Number crunching : Completing a WU? Impossible. How's the situation today? (Message 70763)
Posted 16 days ago by Glenn Carver
Post:
It won't be next week.
18) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70758)
Posted 16 days ago by Glenn Carver
Post:
Hi Richard, the model only loads the external library when it needs to convert the model raw output ready for sending. That doesn't happen at model start, but at fixed points in the forecast. So the model will start fine and load the library after some time. Hence a possible explanation for why they all fail on 1/Jan.

The boinc_finish error code is whatever value was passed to it. It could come from the return/errno value of the LoadLibrary() call, or it might come from a Fortran operation. I'm still looking for the exact point of failure in the code.

(https://stackoverflow.com/questions/38579909/loadlibrary-fails-with-error-code-193)
19) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70756)
Posted 16 days ago by Glenn Carver
Post:
I don't want anyone spending any time looking at their failed tasks. Appreciate the response for the small test.

There is one clue in the log output (which might be a red herring). The regional model calls boinc_finish with an error code of 193. In Windows that means a bad executable, so I'm looking at the library the model loads dynamically during the run to handle converting the model output. It's possible it's been corrupted in some way. If that's not it, then it's back to the model code.
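For reference, a small sketch of how that code maps to a Windows system error name. It assumes the value passed to boinc_finish really is a Windows system error code, which is not confirmed. The table is a hand-copied subset of winerror.h, not a complete list, and `describe_exit_code` is a hypothetical helper.

```python
# Hand-copied subset of Windows system error codes (winerror.h); not exhaustive.
WINDOWS_ERRORS = {
    15: ("ERROR_INVALID_DRIVE", "The system cannot find the drive specified."),
    193: ("ERROR_BAD_EXE_FORMAT", "%1 is not a valid Win32 application."),
}


def describe_exit_code(code: int) -> str:
    """Return a readable description for a code, or mark it as unknown."""
    name, text = WINDOWS_ERRORS.get(code, ("UNKNOWN", "not in this table"))
    return f"{code}: {name} ({text})"


print(describe_exit_code(193))
```

ERROR_BAD_EXE_FORMAT is what Windows reports when asked to load a file that is not a valid executable or library for the calling process, which is why the dynamically loaded library is the first suspect.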
20) Message boards : Number crunching : Batch 1008, and test batches 1009 to 1014 for Windows - issues (Message 70743)
Posted 18 days ago by Glenn Carver
Post:
There will be a small batch of about 100 workunits going out soon to test whether the issue we're seeing with this batch is related to some of the input files.



©2024 climateprediction.net