New work discussion - 2

Helmer Bryd

Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 69819 - Posted: 13 Oct 2023, 13:42:35 UTC

don't bother, just ignore
ID: 69819 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69838 - Posted: 13 Oct 2023, 18:13:40 UTC - in response to Message 69814.  

If it was something daft it would have been fixed ages ago. It's more subtle than that. You see no reason because you don't understand the problem or the way the models work.
Why would it matter what's happening in the model? Presumably every so often a checkpoint is written, perhaps at the same time a trickle up is done, every 4%? In the event of anything going wrong, on restart the checkpoint file would be loaded and things would continue from that point, losing some work done after the checkpoint. The only possible way I can see this going wrong is if one is in the process of being written when the crash occurs. But in this case, the preceding checkpoint should not be deleted until the new one is written. Just like when you make a backup of your computer, you never delete all your old backups before making a new one. If something happens during the backup process, you've lost everything.
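A minimal sketch of that "never delete the old one until the new one is written" idea, in Python rather than whatever the model itself uses. The file name and the JSON format are assumptions for illustration only; this is not how CPDN's checkpoints actually work:

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write state to path without ever leaving a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new checkpoint to a temporary file in the same directory,
    # so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        # Replace the old checkpoint only once the new one is complete.
        os.replace(tmp_path, path)
    except BaseException:
        # The previous checkpoint is still intact; discard the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path):
    """Load the last complete checkpoint, or None if there is none yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return None
```

If the machine dies while the temporary file is being written, the previous checkpoint is untouched and a restart would load that one.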

these models are not designed to run on systems that can be shutdown instantly
But that's 99% of computers. The model in question is a Windows model, and we all know Windows reboots for updates without warning [1], assuming that because the user isn't there it's OK to do so. For whatever reason (possibly a bad BOINC design), Windows does not allow CPDN to shut down completely. The same happens with LHC, although that has the added complication of running in a virtual machine. I would have expected Windows to wait for BOINC to say "finished closing down", and in turn BOINC to wait for CPDN to say "finished closing down".

[1] No matter how many different ways I use to stop it doing so, it keeps thwarting me.
ID: 69838 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69840 - Posted: 13 Oct 2023, 19:39:09 UTC

these models are not designed to run on systems that can be shutdown instantly
A lot of the tasks I get have already failed twice, so clearly this is happening on most systems. Since each task is only allowed three attempts, there must be a lot of tasks which never get done. Perhaps you've set it this way because of tasks which are actually faulty, but there are far more users than work in this project, so perhaps trying each task more than three times would be useful?
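As a rough back-of-envelope sketch of why more attempts might help: if each attempt failed independently with the same probability (a big assumption, and one that does not hold for a task that is genuinely faulty), the chance of a task using up all its attempts drops quickly as attempts are added. The 0.5 failure rate below is made up purely to show the shape, not a measured figure:

```python
def chance_all_attempts_fail(p_fail, attempts=3):
    """Probability that every attempt fails, assuming independent failures."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    return p_fail ** attempts


# Hypothetical per-attempt failure rate, chosen only for illustration.
for attempts in (3, 4, 5):
    print(attempts, chance_all_attempts_fail(0.5, attempts))
```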
ID: 69840 · Report as offensive
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4475
Credit: 18,448,326
RAC: 22,385
Message 69841 - Posted: 13 Oct 2023, 21:13:24 UTC - in response to Message 69840.  

The tasks which fail on all three attempts are almost certainly suffering from an issue that generates a value outside of that allowed by the program so will never complete. I believe work is carrying on to try and get to the bottom of the problem but as Glen says, it is proving elusive.
ID: 69841 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69844 - Posted: 13 Oct 2023, 21:34:52 UTC - in response to Message 69841.  

The tasks which fail on all three attempts are almost certainly suffering from an issue that generates a value outside of that allowed by the program so will never complete. I believe work is carrying on to try and get to the bottom of the problem but as Glen says, it is proving elusive.
A lot of my resends have a failure right at the start, then the next guy failed it part way through. At what point does the problem you mention occur?

Two examples:
https://www.cpdn.org/workunit.php?wuid=12227946
https://www.cpdn.org/workunit.php?wuid=12228012

If those two fail on my computer due to a restart, that will be 3 failures (possibly on a decent task), but presumably since the first two failed at different points, they failed for different reasons.
ID: 69844 · Report as offensive
Helmer Bryd

Joined: 16 Aug 04
Posts: 156
Credit: 9,035,872
RAC: 2,928
Message 69845 - Posted: 13 Oct 2023, 21:44:16 UTC

don't feed the trolls
ID: 69845 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69846 - Posted: 13 Oct 2023, 21:53:39 UTC - in response to Message 69845.  

don't feed the trolls
Grow up, I'm making suggestions and asking questions, your posts however are an utter waste of space.
ID: 69846 · Report as offensive
Profile Alan K

Joined: 22 Feb 06
Posts: 488
Credit: 30,559,581
RAC: 6,164
Message 69847 - Posted: 13 Oct 2023, 22:12:26 UTC - in response to Message 69838.

"[1] No matter how many different ways I use to stop it doing so, it keeps thwarting me."

The answer lies in settings for group permissions in the registry.
ID: 69847 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69848 - Posted: 13 Oct 2023, 22:14:56 UTC - in response to Message 69847.  
Last modified: 13 Oct 2023, 22:37:05 UTC

"[1] No matter how many different ways I use to stop it doing so, it keeps thwarting me."

The answer lies in settings for group permissions in the registry.
That's one of the things I've changed. But Windows randomly overrides it, like they're treating us as criminals for daring to not take their updates. Something sinister is going on and I don't understand why they're legally allowed to do so.

I'll reset all 10 computers again, but I doubt it'll last. It may also be the setting is ignored for security updates (which is a lot of them!)

To be clear we're talking about the same thing, I do this:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU]
"AUOptions"=dword:00000003

Which allegedly downloads automatically as normal, but prompts the user before installation.
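For anyone wanting to check what a given .reg file actually sets, here is a small sketch that pulls the AUOptions value out of the text and looks it up. The table of meanings is my understanding of how the values are commonly documented, so treat it as an assumption and check Microsoft's own documentation:

```python
import re

# Meanings as commonly documented for the AUOptions policy value;
# this table is an assumption, not something confirmed in this thread.
AU_OPTIONS = {
    2: "notify before download",
    3: "download automatically, notify before install",
    4: "download automatically, install on a schedule",
    5: "let local administrators choose the setting",
}


def au_option_from_reg(reg_text):
    """Return the AUOptions value found in .reg file text, or None."""
    match = re.search(
        r'^"AUOptions"=dword:([0-9a-fA-F]{8})\s*$', reg_text, re.MULTILINE
    )
    return int(match.group(1), 16) if match else None


REG_TEXT = r'''Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU]
"AUOptions"=dword:00000003
'''

value = au_option_from_reg(REG_TEXT)
print(value, AU_OPTIONS.get(value, "unknown"))
```

This only reads the text of the file; it says nothing about whether Windows goes on honouring the setting afterwards.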

Details here: https://www.ubackup.com/windows-10/disable-windows-10-update-registry-8523.html

I know it's for Windows 10, but I assume the same setting applies for 11.
ID: 69848 · Report as offensive
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4475
Credit: 18,448,326
RAC: 22,385
Message 69849 - Posted: 14 Oct 2023, 7:00:02 UTC - in response to Message 69844.  

A lot of my resends have a failure right at the start, then the next guy failed it part way through. At what point does the problem you mention occur?
Quite a lot are having three fails at the start. Some others are getting three fails at the same point. I get that some are failing at different places during computation but the BOINC server code isn't sophisticated enough to pick up the differences.

My personal view is that to get enough data back, an increase in the number of tasks going out would be more productive than more resends on these tasks. This particular region covers a larger area and, because of the Himalayas, a more complex one. The scientists believe that is what is behind the higher failure rate for this lot after the startup fails where the task switches from the global to the regional model.

I don't know enough about the programming involved to say more than that so will leave it there.
ID: 69849 · Report as offensive
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4475
Credit: 18,448,326
RAC: 22,385
Message 69850 - Posted: 14 Oct 2023, 7:03:15 UTC - in response to Message 69848.  

As I have said before, block them with your router, then re-enable the M$ domain when you want to install the updates.

I will say that this is an area where Linux policy wins hands down. You always get the choice to restart now/restart later.
ID: 69850 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69851 - Posted: 14 Oct 2023, 7:22:58 UTC - in response to Message 69849.  

My personal view is that to get enough data back, an increase in the number of tasks going out would be more productive than more resends on these tasks. This particular region covers a larger area and, because of the Himalayas, a more complex one. The scientists believe that is what is behind the higher failure rate for this lot after the startup fails where the task switches from the global to the regional model.
I assumed they were sending out the whole lot, or are too busy to create more.
ID: 69851 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69852 - Posted: 14 Oct 2023, 7:24:39 UTC - in response to Message 69850.  

As I have said before, block them with your router, then re-enable the M$ domain when you want to install the updates.
I don't wish to do it manually. I'd never bother.

I will say that this is an area where Linux policy wins hands down. You always get the choice to restart now/restart later.
Very few things make me like Linux. That is one; the other is having OK and Cancel the correct way round. Despite using Windows 99.9% of the time, I very often find myself clicking the wrong button, as I assume affirmative is on the right, like most things in life, e.g. the car accelerator.
ID: 69852 · Report as offensive
Ingleside

Joined: 5 Aug 04
Posts: 122
Credit: 23,582,856
RAC: 17,515
Message 69853 - Posted: 14 Oct 2023, 11:28:17 UTC - in response to Message 69838.  

For whatever reason (possibly a bad BOINC design), Windows does not allow CPDN to shut down completely.
The problem here is, even with exiting BOINC beforehand some of the models still crapped-out on re-start.
ID: 69853 · Report as offensive
Glenn Carver

Joined: 29 Oct 17
Posts: 984
Credit: 15,900,189
RAC: 16,285
Message 69856 - Posted: 14 Oct 2023, 14:47:53 UTC - in response to Message 69853.  

The problem here is, even with exiting BOINC beforehand some of the models still crapped-out on re-start.
It makes absolutely no difference to the chance of a fail if you suspend/exit BOINC/etc/etc before shutting down. I know this because I've been looking for a pattern in the way the model fails.

WaH is two models; a global one which runs first for 24 hrs and creates the boundary & initial conditions for a regional model (the 25km grid) which then takes those files and runs itself for 24hrs; then it cycles around. It doesn't matter if the task is suspended/shutdown during the global model part or the regional model part, when it restarts it will always redo the global model 24hrs again. The error *always* comes when the regional model starts up again from the rerun global 24hrs. We have some ideas what's causing it but I've not yet been able to reproduce it standalone. Unfortunately the model doesn't produce any traceback diagnostics so it's tedious finding out exactly which part of the code is causing the problem, but I'll get there.
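As a toy illustration of that ordering only (this is not the model's real code, and the function names are made up for the sketch), the cycle and the restart behaviour look roughly like this:

```python
def phases(cycles):
    """Yield the (cycle, model) order of a run: global, then regional, repeated."""
    for cycle in range(cycles):
        yield cycle, "global"    # 24 simulated hours of the global model
        yield cycle, "regional"  # 24 simulated hours of the regional model


def resume_from(interrupted):
    """Given the (cycle, model) that was running, return where the restart begins."""
    cycle, _model = interrupted
    # Whichever model was interrupted, the global model for that
    # cycle is rerun first.
    return cycle, "global"


print(list(phases(2)))
print(resume_from((1, "regional")))
```

The failure being chased is at the hand-over that follows `resume_from`, when the regional model starts up again from the rerun global output.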
ID: 69856 · Report as offensive
rob

Joined: 5 Jun 09
Posts: 93
Credit: 3,590,897
RAC: 4,028
Message 69858 - Posted: 14 Oct 2023, 15:14:34 UTC - in response to Message 69856.  

A question - Are the "24 hours" you refer to in your post 24 hours as measured by the clock on my wall, or the time the simulation represents?
ID: 69858 · Report as offensive
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4475
Credit: 18,448,326
RAC: 22,385
Message 69859 - Posted: 14 Oct 2023, 15:42:24 UTC - in response to Message 69858.  

A question - Are the "24 hours" you refer to in your post 24 hours as measured by the clock on my wall, or the time the simulation represents?
Time the simulation represents.
ID: 69859 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69861 - Posted: 14 Oct 2023, 16:54:49 UTC - in response to Message 69856.  

WaH is two models; a global one which runs first for 24 hrs and creates the boundary & initial conditions for a regional model (the 25km grid) which then takes those files and runs itself for 24hrs; then it cycles around. It doesn't matter if the task is suspended/shutdown during the global model part or the regional model part, when it restarts it will always redo the global model 24hrs again. The error *always* comes when the regional model starts up again from the rerun global 24hrs. We have some ideas what's causing it but I've not yet been able to reproduce it standalone. Unfortunately the model doesn't produce any traceback diagnostics so it's tedious finding out exactly which part of the code is causing the problem, but I'll get there.
What is the reason behind redoing the global part? Why can the original files not be used from the first time it did it?
ID: 69861 · Report as offensive
Yeti

Joined: 5 Aug 04
Posts: 177
Credit: 17,079,349
RAC: 54,381
Message 69869 - Posted: 15 Oct 2023, 17:27:37 UTC - in response to Message 69798.  

But I want updates, just no reboots until I say so.


That isn't too complicated: set up a local WSUS server and direct your clients to use it.

This works really well for me.

The WSUS server fetches the new patches from the Microsoft Update servers.

WSUS only releases the patches to the clients once I approve them in WSUS.

So I can deliver the patches to my clients when I want. Normally I hold them back for several days after Patch Day until I hear whether or not there are any major problems.
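A sketch of the client side, assuming the usual policy values (WUServer, WUStatusServer and UseWUServer) are what points a client at WSUS; that is from memory rather than from this thread, so verify it against Microsoft's documentation. The host name is hypothetical, and 8530 is only WSUS's usual default HTTP port:

```python
def wsus_client_reg(server_url):
    """Build .reg text that points a Windows Update client at a WSUS server."""
    base = r"HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate"
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{base}]",
        f'"WUServer"="{server_url}"',        # where updates are fetched from
        f'"WUStatusServer"="{server_url}"',  # where the client reports status
        "",
        f"[{base}\\AU]",
        '"UseWUServer"=dword:00000001',      # use the server named above
        "",
    ]
    return "\n".join(lines)


# Hypothetical server address; replace with your own WSUS host.
print(wsus_client_reg("http://wsus.example.lan:8530"))
```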
Supporting BOINC, a great concept !
ID: 69869 · Report as offensive
Mr. P Hucker

Joined: 9 Oct 20
Posts: 690
Credit: 4,391,754
RAC: 6,918
Message 69882 - Posted: 15 Oct 2023, 23:45:28 UTC - in response to Message 69869.  

But I want updates, just no reboots until I say so.
That isn't too complicated: set up a local WSUS server and direct your clients to use it.

This works really well for me.

The WSUS server fetches the new patches from the Microsoft Update servers.

WSUS only releases the patches to the clients once I approve them in WSUS.

So I can deliver the patches to my clients when I want. Normally I hold them back for several days after Patch Day until I hear whether or not there are any major problems.
Far too much hassle. I shouldn't have to go through all this. I never want to do updates manually. I want them to do it themselves, but wait until I say go! I'm also not prepared to mess around setting up servers. This reminds me of the mess LHC is in: they send out the same data to each individual task running, which can be one per core, and don't cache it, then expect us to run Squid, some horrid Linux thing ported badly to Windows so it keeps failing, to cache locally. Forget it.

I've set the registry entry mentioned earlier and will see if it works. Some machines do wait; I guess some forgot the setting for whatever reason.

And Microsoft should be in legal trouble for rebooting someone's property without their permission. We have billions of insane laws, but nothing useful.
ID: 69882 · Report as offensive