New work discussion

Author	Message
Helmer Bryd Send message Joined: 16 Aug 04 Posts: 148 Credit: 8,321,582 RAC: 15,520	Message 69819 - Posted: 13 Oct 2023, 13:42:35 UTC don't bother, just ignore ID: 69819 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69838 - Posted: 13 Oct 2023, 18:13:40 UTC - in response to Message 69814.
"If it was something daft it would have been fixed ages ago. It's more subtle than that. You see no reason because you don't understand the problem or the way the models work."
Why would it matter what's happening in the model? Presumably every so often a checkpoint is written, perhaps at the same time a trickle up is done, every 4%? If anything goes wrong, on restart the checkpoint file would be loaded and things would continue from that point, losing only the work done after the checkpoint. The only way I can see this going wrong is if a checkpoint is in the process of being written when the crash occurs. But in that case, the preceding checkpoint should not be deleted until the new one is written. It's just like backing up your computer: you never delete all your old backups before making a new one, because if something happens during the backup process, you've lost everything.
"these models are not designed to run on systems that can be shutdown instantly"
But that's 99% of computers. The model in question is a Windows model, and we all know Windows reboots for updates without warning [1], assuming that because the user isn't there it's OK to do so. For whatever reason (possibly a bad BOINC design), Windows does not allow CPDN to shut down completely. The same happens with LHC, although that has the added complication of running in a virtual machine. I would have expected Windows to wait for BOINC to say "finished closing down", and in turn BOINC to wait for CPDN to say "finished closing down".
[1] No matter how many different ways I use to stop it doing so, it keeps thwarting me. ID: 69838 ·
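
The "keep the old checkpoint until the new one is safely written" idea in the post above can be sketched as follows. This is only an illustration of the general technique (write to a temporary file, flush it to disk, then swap it into place), not how the CPDN models actually checkpoint; the function names and the use of JSON as the file format are assumptions made for the example.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Replace the checkpoint at `path` without ever leaving it half-written."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the new checkpoint next to the old one, under a temporary name.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before swapping
        # Only now does the new file take the old one's place. The rename is
        # atomic on POSIX; on Windows it overwrites the target in one call.
        os.replace(tmp_path, path)
    except BaseException:
        # Something failed part-way: discard the partial file, keep the old one.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path, default=None):
    """Load the last complete checkpoint, or return `default` if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

With this pattern, a crash during the write leaves at worst a stray temporary file; the previous checkpoint is untouched until the final rename.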

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69840 - Posted: 13 Oct 2023, 19:39:09 UTC
"these models are not designed to run on systems that can be shutdown instantly"
A lot of the tasks I get have already failed twice, so clearly most systems are doing so. Since it's set to allow only three attempts, there must be a lot of tasks which never get done. Perhaps you've set it this way because of tasks which are actually faulty, but there are way more users than work in this project, so perhaps trying each task more than three times would be useful? ID: 69840 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4353 Credit: 16,598,247 RAC: 6,156	Message 69841 - Posted: 13 Oct 2023, 21:13:24 UTC - in response to Message 69840. The tasks which fail on all three attempts are almost certainly suffering from an issue that generates a value outside of that allowed by the program so will never complete. I believe work is carrying on to try and get to the bottom of the problem but as Glenn says, it is proving elusive. ID: 69841 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69844 - Posted: 13 Oct 2023, 21:34:52 UTC - in response to Message 69841.
"The tasks which fail on all three attempts are almost certainly suffering from an issue that generates a value outside of that allowed by the program so will never complete. I believe work is carrying on to try and get to the bottom of the problem but as Glenn says, it is proving elusive."
A lot of my resends have a failure right at the start, then the next guy failed it part way through. At what point does the problem you mention occur? Two examples:
https://www.cpdn.org/workunit.php?wuid=12227946
https://www.cpdn.org/workunit.php?wuid=12228012
If those two fail on my computer due to a restart, that will be 3 failures (possibly on a decent task), but presumably since the first two failed at different points, they failed for different reasons. ID: 69844 ·

Helmer Bryd Send message Joined: 16 Aug 04 Posts: 148 Credit: 8,321,582 RAC: 15,520	Message 69845 - Posted: 13 Oct 2023, 21:44:16 UTC don't feed the trolls ID: 69845 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69846 - Posted: 13 Oct 2023, 21:53:39 UTC - in response to Message 69845.
"don't feed the trolls"
Grow up, I'm making suggestions and asking questions, your posts however are an utter waste of space. ID: 69846 ·

Alan K Send message Joined: 22 Feb 06 Posts: 487 Credit: 29,721,858 RAC: 6,732	Message 69847 - Posted: 13 Oct 2023, 22:12:26 UTC - in response to Message 69838.
"[1] No matter how many different ways I use to stop it doing so, it keeps thwarting me."
The answer lies in settings for group permissions in the registry. ID: 69847 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69848 - Posted: 13 Oct 2023, 22:14:56 UTC - in response to Message 69847. Last modified: 13 Oct 2023, 22:37:05 UTC
"[1] No matter how many different ways I use to stop it doing so, it keeps thwarting me."
"The answer lies in settings for group permissions in the registry."
That's one of the things I've changed. But Windows randomly overrides it, like they're treating us as criminals for daring to not take their updates. Something sinister is going on and I don't understand why they're legally allowed to do so. I'll reset all 10 computers again, but I doubt it'll last. It may also be that the setting is ignored for security updates (which is a lot of them!).
To be sure we're talking about the same thing, this is what I do:
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU]
"AUOptions"=dword:00000003

Which allegedly downloads automatically as normal, but prompts the user before installation. Details here: https://www.ubackup.com/windows-10/disable-windows-10-update-registry-8523.html
I know it's for Windows 10, but I assume the same setting applies for 11. ID: 69848 ·
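
Since the post above suspects the setting gets overridden, one way to check whether it has survived on a given machine is to read it back. The following is a minimal sketch using Python's standard `winreg` module; the helper names and the short descriptions of the values are this example's own, only values 2 to 4 are described, and whether Windows honours the policy is a separate matter the script cannot tell you.

```python
import sys

# Registry path from the post above (under HKEY_LOCAL_MACHINE).
AU_KEY = r"SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU"

# Short descriptions written for this example, not official wording.
AU_OPTIONS = {
    2: "notify before download",
    3: "download automatically, notify before install",
    4: "download automatically, install on schedule",
}


def read_au_options():
    """Return the AUOptions policy value, or None if unset or not on Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # only available on Windows

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, AU_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "AUOptions")
    except FileNotFoundError:
        # Either the key or the value is missing.
        return None
    return value


def describe(value):
    """Turn an AUOptions value into a human-readable description."""
    if value is None:
        return "AUOptions is not set"
    return AU_OPTIONS.get(value, f"unrecognised value {value}")


if __name__ == "__main__":
    print(describe(read_au_options()))
```

Run on each machine after a few update cycles, this would show whether the value is still 3 or has been removed or changed.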

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4353 Credit: 16,598,247 RAC: 6,156	Message 69849 - Posted: 14 Oct 2023, 7:00:02 UTC - in response to Message 69844.
"A lot of my resends have a failure right at the start, then the next guy failed it part way through. At what point does the problem you mention occur?"
Quite a lot are having three fails at the start. Some others are getting three fails at the same point. I get that some are failing at different places during computation but the BOINC server code isn't sophisticated enough to pick up the differences. My personal view is that to get enough data back, an increase in the number of tasks going out would be more productive than more resends on these tasks. This particular region covers a larger area and, because of the Himalayas, a more complex one, which is what the scientists believe is behind the higher failure rate for this lot after the startup fails where the task switches from the global to the regional model. I don't know enough about the programming involved to say more than that so will leave it there. ID: 69849 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4353 Credit: 16,598,247 RAC: 6,156	Message 69850 - Posted: 14 Oct 2023, 7:03:15 UTC - in response to Message 69848. As I have said before, block them with your router, then re-enable the M$ domain when you want to install the updates. I will say that this is an area where Linux policy wins hands down. You always get the choice to restart now/restart later. ID: 69850 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69851 - Posted: 14 Oct 2023, 7:22:58 UTC - in response to Message 69849.
"My personal view is that to get enough data back, an increase in the number of tasks going out would be more productive than more resends on these tasks. This particular region covers a larger area and, because of the Himalayas, a more complex one, which is what the scientists believe is behind the higher failure rate for this lot after the startup fails where the task switches from the global to the regional model."
I assumed they were sending out the whole lot, or were too busy to create more. ID: 69851 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69852 - Posted: 14 Oct 2023, 7:24:39 UTC - in response to Message 69850.
"As I have said before, block them with your router, then re-enable the M$ domain when you want to install the updates."
I don't wish to do it manually. I'd never bother.
"I will say that this is an area where Linux policy wins hands down. You always get the choice to restart now/restart later."
Very few things make me like Linux. That is one; the other is having OK and Cancel the correct way round. Despite using Windows 99.9% of the time, I very often find myself clicking the wrong button, as I assume affirmative is on the right, like most things in life, e.g. the car accelerator. ID: 69852 ·

Ingleside Send message Joined: 5 Aug 04 Posts: 108 Credit: 19,725,087 RAC: 32,256	Message 69853 - Posted: 14 Oct 2023, 11:28:17 UTC - in response to Message 69838.
"For whatever reason (possibly a bad Boinc design), Windows does not allow CPDN to shut down completely."
The problem here is, even with exiting BOINC beforehand some of the models still crapped-out on re-start. ID: 69853 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 820 Credit: 13,711,293 RAC: 7,064	Message 69856 - Posted: 14 Oct 2023, 14:47:53 UTC - in response to Message 69853.
"The problem here is, even with exiting BOINC beforehand some of the models still crapped-out on re-start."
It makes absolutely no difference to the chance of a fail if you suspend/exit BOINC/etc/etc before shutting down. I know this because I've been looking for a pattern in the way the model fails.
WaH is two models; a global one which runs first for 24 hrs and creates the boundary & initial conditions for a regional model (the 25km grid) which then takes those files and runs itself for 24 hrs; then it cycles around. It doesn't matter if the task is suspended/shutdown during the global model part or the regional model part, when it restarts it will always redo the global model 24 hrs again. The error always comes when the regional model starts up again from the rerun global 24 hrs.
We have some ideas what's causing it but I've not yet been able to reproduce it standalone. Unfortunately the model doesn't produce any traceback diagnostics so it's tedious finding out exactly which part of the code is causing the problem, but I'll get there. ID: 69856 ·
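
The restart behaviour described in the post above can be illustrated with a toy sketch. It only models the order in which the two phases run, assuming one "day" is 24 hours of simulated time and that a task is stopped at most once; the function name and data layout are invented for the example and have nothing to do with the actual WaH code.

```python
def completed_phases(total_days, interrupt=None):
    """List the (model, day) phases that run to completion, in order.

    `interrupt` is an optional (model, day) pair naming the phase during
    which the task is stopped once. That phase is abandoned and, on
    restart, the day begins again with the global model.
    """
    phases = []
    day = 1
    while day <= total_days:
        for model in ("global", "regional"):
            if interrupt == (model, day):
                interrupt = None  # the stop happens only once
                break             # restart: redo this day from the global model
            phases.append((model, day))
        else:
            day += 1              # both models finished, move to the next day
    return phases


# Stopped during the regional part of day 2: the global model for day 2
# is run a second time before the regional model starts again.
print(completed_phases(2, interrupt=("regional", 2)))
```

Whichever phase the stop falls in, the next thing to run is the global model for that day, which matches the description that the regional model always starts up again from a rerun global 24 hrs.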

rob Send message Joined: 5 Jun 09 Posts: 80 Credit: 3,046,017 RAC: 3,192	Message 69858 - Posted: 14 Oct 2023, 15:14:34 UTC - in response to Message 69856. A question - Are the "24 hours" you refer to in your post 24 hours as measured by the clock on my wall, or the time the simulation represents? ID: 69858 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4353 Credit: 16,598,247 RAC: 6,156	Message 69859 - Posted: 14 Oct 2023, 15:42:24 UTC - in response to Message 69858.
"A question - Are the "24 hours" you refer to in your post 24 hours as measured by the clock on my wall, or the time the simulation represents?"
Time the simulation represents. ID: 69859 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69861 - Posted: 14 Oct 2023, 16:54:49 UTC - in response to Message 69856.
"WaH is two models; a global one which runs first for 24 hrs and creates the boundary & initial conditions for a regional model (the 25km grid) which then takes those files and runs itself for 24 hrs; then it cycles around. It doesn't matter if the task is suspended/shutdown during the global model part or the regional model part, when it restarts it will always redo the global model 24 hrs again. The error always comes when the regional model starts up again from the rerun global 24 hrs. We have some ideas what's causing it but I've not yet been able to reproduce it standalone. Unfortunately the model doesn't produce any traceback diagnostics so it's tedious finding out exactly which part of the code is causing the problem, but I'll get there."
What is the reason behind redoing the global part? Why can the original files not be used from the first time it did it? ID: 69861 ·

Yeti Send message Joined: 5 Aug 04 Posts: 171 Credit: 10,364,481 RAC: 21,716	Message 69869 - Posted: 15 Oct 2023, 17:27:37 UTC - in response to Message 69798.
"But I want updates, just no reboots until I say so."
That isn't too complicated: set up a local WSUS server and direct your clients to use it. This works really well for me. The WSUS server fetches the new patches from the Microsoft Update servers. The patches are only released to the clients when I approve them in WSUS, so I can deliver the patches to my clients when I want. Normally I hold them back for several days after Patch Day until I hear (or don't hear) whether there are any bigger problems.
Supporting BOINC, a great concept ! ID: 69869 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 69882 - Posted: 15 Oct 2023, 23:45:28 UTC - in response to Message 69869.
"But I want updates, just no reboots until I say so."
"That isn't too complicated: set up a local WSUS server and direct your clients to use it. This works really well for me. The WSUS server fetches the new patches from the Microsoft Update servers. The patches are only released to the clients when I approve them in WSUS, so I can deliver the patches to my clients when I want. Normally I hold them back for several days after Patch Day until I hear (or don't hear) whether there are any bigger problems."
Far too much hassle. I shouldn't have to go through all this. I never want to do updates manually. I want them to do it themselves, but wait until I say go!! I'm also not prepared to mess around setting up servers. This reminds me of the mess LHC is in: they send out the same data to each individual running task, which can be one per core, and don't cache it, then expect us to run Squid (some horrid Linux thing ported badly to Windows, so it keeps failing) to cache locally. Forget it.
I've set the registry entry mentioned earlier and will see if it works. Some machines do wait; I guess some forgot it for whatever reason. And Microsoft should be in legal trouble for rebooting someone's property without their permission. We have billions of insane laws, but nothing useful. ID: 69882 ·
