New work discussion

Glenn Carver Send message Joined: 29 Oct 17 Posts: 806 Credit: 13,593,584 RAC: 7,495	Message 66438 - Posted: 15 Nov 2022, 17:54:29 UTC - in response to Message 66435. Last modified: 15 Nov 2022, 17:59:28 UTC Thoughts on work fetch... Richard, as always, your expert input is greatly appreciated. I have wondered if there's a timeout happening between the feeder & server that might explain it. I could try reducing the number of available CPUs - but these large Ryzen things seem to have no trouble filling up :). A couple more questions if I may: If a previous task fails (HadSM4 on a broken WSL/Ubuntu), does the server put a black mark against the host for this? Related: presumably the server only sees requests from the same IP, my public one, and not the internal subnet IPs. So does it treat different hosts using the same external IP the same? Or does it only consider hostnames? I ask because I'm wondering if my BOINC hosts are considered entirely separate by the server, or whether what happens on one affects how the server treats the others (if that makes sense). Cheers. p.s. good luck on resolving the issue. ID: 66438 ·

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,276,661 RAC: 11,053	Message 66439 - Posted: 15 Nov 2022, 18:43:15 UTC - in response to Message 66438. Last modified: 15 Nov 2022, 18:45:34 UTC If a previous task fails (HadSM4 on a broken WSL/Ubuntu), does the server put a black mark against the host for this? Yes, but a very mild one - hardly even a slap on the wrist. Have a look at Application details for host 1498009 (my Mint 21 machine). For each host and app, the server starts you off with a general "Max tasks per day". At most projects, I think the default is 30, but this project has probably set it lower. If you return good, valid work, it's raised by one. If you return errors or failures, first it drops to the original starting point. If you go on returning errors, it's dropped by one per bad task. Eventually, it gets down to one per day, so you can only run one test task per day, until you fix the problem and start returning good work again. Then the limit starts climbing back up, increasing by one for each successful task. Again, no attention is given to the reason for the failure. Hardware failure, power supply problem, software misconfiguration, or project data leading to negative theta - they're all the same. Related: presumably the server only sees requests from the same IP, my public one, and not the internal subnet IPs. So does it treat different hosts using the same external IP the same? Or does it only consider hostnames? I ask because I'm wondering if my BOINC hosts are considered entirely separate by the server, or whether what happens on one affects how the server treats the others (if that makes sense). Each host is treated as a separate, distinct identity. The primary key value is the HostID, allocated by the server when the host is first attached to each project. 
If the host is newly attached, the server will try to find out whether a machine with the same hostname, owner (UserID or email hash), and hardware configuration exists in the database already and can be matched for reuse. For this reason, it's important that you don't clone a BOINC installation from another machine when setting up new hardware - the server software doesn't like doppelgängers. Problems with one machine shouldn't - and in practice don't - affect the others. ID: 66439 ·
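The "Max tasks per day" rule described in the post above can be written down as a small toy model. This is only a sketch of the behaviour as Richard describes it, not the BOINC server code: the class name `HostAppQuota` and its methods are made up for illustration, and the starting value of 30 is the default he believes most projects use (CPDN's actual value may be lower).

```python
class HostAppQuota:
    """Toy model of the per-host, per-app 'Max tasks per day' rule.

    Hypothetical names; this mirrors the forum description only.
    """

    def __init__(self, starting_quota=30):
        # Assumed default; the real project setting is not stated in the thread.
        self.starting_quota = starting_quota
        self.quota = starting_quota

    def report_success(self):
        # Each good, valid result raises the daily limit by one.
        self.quota += 1
        return self.quota

    def report_failure(self):
        if self.quota > self.starting_quota:
            # First failure after a good run: fall back to the starting point.
            self.quota = self.starting_quota
        else:
            # Further failures: one less per bad task, never below one per day.
            self.quota = max(1, self.quota - 1)
        return self.quota
```

Note that the reason for a failure plays no part: hardware faults, misconfiguration and bad project data all go through the same `report_failure` path, which matches the point made in the post.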

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 257 Credit: 31,928,703 RAC: 38,417	Message 66441 - Posted: 15 Nov 2022, 19:20:32 UTC - in response to Message 66413. In the evening I always pause all work, then I wait 30 seconds before I shut em down to make sure all data is written to the ssd correctly.The next morning I get some aborts. The short answer, as far as I've found in quite a few years of running CPDN tasks, is simply "Stop shutting your machines down." CPDN tasks do not reliably handle shutdown and restarts of the tasks, though some of the tasks, some of the time, will continue running properly. However, I've also found that "suspend tasks and shut down" seems to lead to more errors than simply "shut down and let the tasks be killed." I haven't done a deep study on it, but it seems like trying to do the "right thing" with suspending tasks leads to worse results with CPDN than a simple "host has randomly power cycled." Use system suspend instead. I run quite a few nodes in my office, which is purely off grid/solar. They suspend every evening, and I power them on in the mornings, and this works perfectly - no task restarts or anything, and fairly few failures outside the normal "environmental conditions have gone nonsensical" ones. ID: 66441 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4345 Credit: 16,532,809 RAC: 5,899	Message 66442 - Posted: 15 Nov 2022, 19:32:42 UTC - in response to Message 66441. Last modified: 15 Nov 2022, 19:40:12 UTC Use system suspend instead. I run quite a few nodes in my office, which is purely off grid/solar. They suspend every evening, and I power them on in the mornings, and this works perfectly - no task restarts or anything, and fairly few failures outside the normal "environmental conditions have gone nonsensical" ones. When I use sleep, and it works, I never have problems. The trouble is that, for some reason I haven't been able to determine, when I resume from sleep, about half the time I get a blank screen and the keyboard doesn't respond (caps lock doesn't make the light on the keyboard go on and off). Other than that it works - when I have known a task is close to completion and I have left the computer long enough for it to finish before rebooting, it has completed fine. Edit: I have a couple of new ideas to try from searching the web. Will report back when I have tried them out enough to be statistically significant. ID: 66442 ·

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 257 Credit: 31,928,703 RAC: 38,417	Message 66444 - Posted: 15 Nov 2022, 19:59:37 UTC I know modern AMD systems have trouble with suspend/resume if you have hyperthreading disabled in the firmware configuration utility. But I have some Intel and AMD boxes that all suspend/resume more than reliably enough to use for these purposes. A failed resume is roughly the same as a random system reboot, which isn't automatically fatal to tasks, but is certainly best avoided. ID: 66444 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 806 Credit: 13,593,584 RAC: 7,495	Message 66445 - Posted: 15 Nov 2022, 21:03:55 UTC - in response to Message 66441. CPDN tasks do not reliably handle shutdown and restarts of the tasks, though some of the tasks, some of the time, will continue running properly. For what it's worth, the CPDN team are aware of this. For OpenIFS we've tested this a lot and tried to deal with restart problems. The model will restart happily as long as its checkpoint files are not corrupt. Problems seem to be on the BOINC side. I think you'll find OpenIFS tasks more resistant to restarting, but if not, do let me know as I'd like to see what's happening and fix it. ID: 66445 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 806 Credit: 13,593,584 RAC: 7,495	Message 66446 - Posted: 15 Nov 2022, 21:16:30 UTC - in response to Message 66439. Each host is treated as a separate, distinct, identity. The primary key value is the HostID, allocated by the server when the host is first attached to each project. If the host is newly attached, the server will try to find if a machine with the same hostname, owner (UserID or email hash), and hardware configuration exists in the database already and can be matched for reuse. I have two clients running; one for just cpdn & dev.cpdn, the other for other projects. I split because of the infrequent tasks from CPDN. I had a look at the client_state.xml in both and they do have different hostids. I did wonder if that might cause a problem but I don't think so. I'll have to buy Andy a pint and ask him for the server logs. Thanks Richard. ID: 66446 ·

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4345 Credit: 16,532,809 RAC: 5,899	Message 66447 - Posted: 15 Nov 2022, 21:18:31 UTC - in response to Message 66445. I think you'll find OpenIFS tasks more resistant to restarting, but if not, do let me know as I'd like to see what's happening and fix it. That is my memory from some of the early testing but after initial problems were ironed out. The very first ones I tried shutting down and restarting with would appear to progress very rapidly after a restart and finish in a matter of thirty seconds or thereabouts and then produce an error at the end. I don't recall seeing that behaviour on some of the later ones. I don't remember any explanation of what was happening on those either. But I will keep an eye on what happens with them as I am sure I will end up with a reboot or so at some point with them. ID: 66447 ·

Glenn Carver Send message Joined: 29 Oct 17 Posts: 806 Credit: 13,593,584 RAC: 7,495	Message 66448 - Posted: 15 Nov 2022, 21:30:32 UTC - in response to Message 66447. I think you'll find OpenIFS tasks more resistant to restarting, but if not, do let me know as I'd like to see what's happening and fix it. That is my memory from some of the early testing but after initial problems were ironed out. The very first ones I tried shutting down and restarting with would appear to progress very rapidly after a restart and finish in a matter of thirty seconds or thereabouts and then produce an error at the end. I don't recall seeing that behaviour on some of the later ones. I don't remember any explanation of what was happening on those either. But I will keep an eye on what happens with them as I am sure I will end up with a reboot or so at some point with them. Yes, I remember that problem. The 'wrapper' code was too eager to zip up the final model output files and would do it before the files had finished flushing to disk (one problem with separate processes). That was fixed. But, if problems happen again, let us know. ID: 66448 ·
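The race Glenn describes (a wrapper zipping output files while a separate model process is still writing them) can be illustrated with a short sketch. How the real CPDN wrapper was fixed is not stated in the thread; the approach below, waiting for the child process to exit before archiving, is just one common way to avoid that kind of race, and the function name `run_model_then_zip` is hypothetical.

```python
import os
import subprocess
import zipfile


def run_model_then_zip(model_cmd, output_paths, archive_path):
    """Run the model as a child process and archive its outputs afterwards.

    Hypothetical helper, not the CPDN wrapper code.
    """
    # subprocess.run blocks until the child has exited, so the child has
    # closed its output files before we start reading them.
    subprocess.run(model_cmd, check=True)

    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in output_paths:
            # Store each file under its base name inside the archive.
            archive.write(path, arcname=os.path.basename(path))
    return archive_path
```

The point of the design is ordering: nothing is zipped until the writer is known to have finished, rather than relying on a fixed delay.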

AndreyOR Send message Joined: 12 Apr 21 Posts: 247 Credit: 11,987,005 RAC: 23,574	Message 66449 - Posted: 15 Nov 2022, 21:45:18 UTC - in response to Message 66441. The short answer, as far as I've found in quite a few years of running CPDN tasks, is simply "Stop shutting your machines down." CPDN tasks do not reliably handle shutdown and restarts of the tasks, though some of the tasks, some of the time, will continue running properly. I completely agree. Best practice for Hadley models is to let them run to completion without BOINC/system restart interruptions. You seem to get 1 freebie interruption where survival rate is 100%, after that there will almost certainly be crashes of some tasks. It's good to hear that Glenn said that OpenIFS will be more resistant. Other more complicated to set up and resource demanding projects like LHC native Theory and ATLAS don't check point and just restart calculations from scratch if interrupted but they don't error out. ID: 66449 ·

AndreyOR Send message Joined: 12 Apr 21 Posts: 247 Credit: 11,987,005 RAC: 23,574	Message 66450 - Posted: 15 Nov 2022, 22:06:11 UTC - in response to Message 66446. Glenn, do you happen to have an app_config with any restrictions? I remember having a puzzling problem of not being able to get tasks for another project some time ago but unfortunately I don't remember the details. I want to say that the problem ended up being with app_config but don't really remember for sure. Perhaps try some of the other Event Log flags and see if any hints come from any of them. It does seem a bit odd that some of your PCs don't show anything in app version info or don't match tasks completed. I wonder if resetting the project may help in any way? ID: 66450 ·

Drago75 Send message Joined: 8 Jan 22 Posts: 9 Credit: 1,498,898 RAC: 4,053	Message 66451 - Posted: 15 Nov 2022, 22:14:22 UTC - in response to Message 66449. Thanks guys for the input about the restart problem. I shut my hosts down because I like to run them on solar power if possible. For now I will try to make sure that each task runs at least 2 minutes after the last checkpoint to make sure it is written to the SSD properly. ID: 66451 ·

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,276,661 RAC: 11,053	Message 66452 - Posted: 15 Nov 2022, 22:26:52 UTC - in response to Message 66450. Perhaps try some of the other Event Log flags ... I always enable <cpu_sched> and <sched_op_debug> as a matter of routine. Compared to some of the others, they're lightweight - adding very little extra to the log - but they clarify what's going on during a work fetch very nicely.
15/11/2022 21:33:46 | climateprediction.net | Sending scheduler request: To fetch work.
15/11/2022 21:33:46 | climateprediction.net | Requesting new tasks for CPU
15/11/2022 21:33:46 | climateprediction.net | [sched_op] CPU work request: 4215.86 seconds; 0.00 devices
15/11/2022 21:33:46 | climateprediction.net | [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 devices
15/11/2022 21:33:47 | climateprediction.net | Scheduler request completed: got 1 new tasks
15/11/2022 21:33:47 | climateprediction.net | [sched_op] Server version 715
15/11/2022 21:33:47 | climateprediction.net | Project requested delay of 3636 seconds
15/11/2022 21:33:47 | climateprediction.net | [sched_op] estimated total CPU task duration: 87303 seconds
15/11/2022 21:33:47 | climateprediction.net | [sched_op] estimated total NVIDIA GPU task duration: 0 seconds
15/11/2022 21:33:47 | climateprediction.net | [sched_op] Deferring communication for 01:00:36
ID: 66452 ·
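For anyone wanting to pull numbers out of Event Log lines like the ones quoted above, here is a minimal parsing sketch. It assumes only the line layout visible in that excerpt (timestamp, project, message, separated by pipes, optionally backslash-escaped); the function names are made up for illustration and this is not part of BOINC.

```python
import re
from datetime import timedelta

# timestamp | project | message, tolerating "\|" as well as "|"
LINE_RE = re.compile(
    r"^(?P<stamp>\S+ \S+)\s*\\?\|\s*(?P<project>.*?)\s*\\?\|\s*(?P<message>.*)$"
)
DELAY_RE = re.compile(r"Project requested delay of (\d+) seconds")


def parse_line(line):
    """Split one Event Log line into (timestamp, project, message)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group("stamp"), match.group("project"), match.group("message")


def requested_delay(lines):
    """Return the project-requested delay as a timedelta, or None if absent."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        found = DELAY_RE.search(parsed[2])
        if found:
            return timedelta(seconds=int(found.group(1)))
    return None
```

As a cross-check on the excerpt itself: 3636 seconds is 3600 + 36, i.e. one hour and 36 seconds, which agrees with the final "Deferring communication for 01:00:36" line.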

AndreyOR Send message Joined: 12 Apr 21 Posts: 247 Credit: 11,987,005 RAC: 23,574	Message 66453 - Posted: 15 Nov 2022, 22:37:43 UTC - in response to Message 66451. I don't think the 2 minute wait will do anything. It may help to read some posts by SolarSyonyk on the subject who also has solar setups. I believe the thing to do is to put the machines to sleep instead of shutting them down. It'll preserve the state of the machine in a very low power mode without interrupting any work just pausing everything (the entire PC, not just BOINC). ID: 66453 ·

SolarSyonyk Send message Joined: 7 Sep 16 Posts: 257 Credit: 31,928,703 RAC: 38,417	Message 66454 - Posted: 16 Nov 2022, 0:28:17 UTC - in response to Message 66451. Last modified: 16 Nov 2022, 0:33:32 UTC Thanks guys for the input about the restart problem. I shut my hosts down because I like to run them on solar power if possible. For now I will try to make sure that each task runs at least 2 minutes after the last check point to make sure it is written to the SSD properly. It's not going to work. You're welcome to waste time trying, and dump a bunch of good compute out on the floor for someone else to have to make up, but that's all you're accomplishing here. Check your task failures, you'll get all sorts of resume nonsense repeated, and then the task will fail. The current tasks do not handle task suspend/resume well, at all. Period. If you're going to shut your computers down fully every night, go put your effort towards another project, because you're literally just wasting your compute cycles. I know the tasks should behave like you wish them to, but years of running them in your exact situation (perhaps even stronger, I'm purely off grid out here with a fairly small battery bank) says that they don't. You get one crash/resume if you're lucky, sometimes more, sometimes less, but "suspending them and then trying to have them resume" seems even less likely to work than just up and terminating them without warning. If you don't want your computers using energy at night, use system suspend.
sudo /usr/bin/systemctl suspend
That ought to do it on Linux, hit the power button to bring them back up, and they should resume where they left off. Asleep, they should use... a watt? Something like that. The difference between "asleep" and "that power strip to all the compute boxes literally turned off" in my office hides within the noise of measurements. I'm sure there's some difference, but I can't find it at a glance, so it's not enough to matter. 
It's less than the idle draw of the amp in my bookshelf speakers when they're powered on. If you can't do that, then you can suspend the computation and let the computers idle, though they'll still use some power overnight. "Pause working" suspend then resume, without restarting the actual task (which a computer restart will do) seems to work well enough, though I try to avoid that as well. Usually that's a result of having too many tasks and dropping some back for total system throughput (running 8 tasks on a 12C/24T Ryzen seems to be the sweet spot on my rigs for total system throughput numbers), and I've not seen crashes of stuff left in RAM and suspended, but the "Suspend task, shut system down, expect it to come back properly" thing just doesn't work reliably, and I've tried every combination I can think of. Or just keep wasting your compute cycles trying to make tasks do something they can't and won't do. Either way. ID: 66454 ·
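If you want to script the evening suspend (from a timer, say), the `systemctl suspend` command from the post above can be wrapped in a few lines of Python. This is just a sketch: the function names and the dry-run default are my own assumptions, and it presumes a Linux box with systemd and `systemctl` at `/usr/bin/systemctl` as in the post.

```python
import subprocess

SYSTEMCTL = "/usr/bin/systemctl"  # path as given in the post; may differ on your distro


def suspend_command(use_sudo=True):
    """Build the argument list for a system suspend."""
    cmd = [SYSTEMCTL, "suspend"]
    return ["sudo"] + cmd if use_sudo else cmd


def suspend_system(dry_run=True, use_sudo=True):
    """Suspend the whole machine, or just return the command when dry_run is set."""
    cmd = suspend_command(use_sudo)
    if not dry_run:
        # Raises CalledProcessError if systemctl reports a failure.
        subprocess.run(cmd, check=True)
    return cmd
```

Call `suspend_system(dry_run=False)` to actually put the machine to sleep; with the default `dry_run=True` it only returns the command, which is handy for checking what would be run.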

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1060 Credit: 16,538,338 RAC: 2,071	Message 66455 - Posted: 16 Nov 2022, 0:56:54 UTC - in response to Message 66454. "Suspend task, shut system down, expect it to come back properly" thing just doesn't work reliably, and I've tried every combination I can think of. Or just keep wasting your compute cycles trying to make tasks do something they can't and won't do. Either way. Am I just lucky or what? Long ago, there was a bunch of CPDN tasks that crashed when one rebooted a system. But they fixed that, probably about two years or a little more ago. I have not had any trouble like that since. Is it my distribution of Linux? It uses systemd instead of the old Berkeley style of software to start and stop the system. That makes the order of shutdown and startup of processes logically correct. I suspect I could just do a controlled shutdown of the system and have it work OK, but I hate to risk it. On the other hand, I reboot the system only every week or so, when an update comes along that I do not think should wait. I do not bother when they only change the timezone tables or something like that. ID: 66455 ·

geophi Volunteer moderator Send message Joined: 7 Aug 04 Posts: 2168 Credit: 64,536,855 RAC: 6,595	Message 66456 - Posted: 16 Nov 2022, 1:14:07 UTC - in response to Message 66452. Perhaps try some of the other Event Log flags ... I always enable <cpu_sched> and <sched_op_debug> as a matter of routine. Compared to some of the others, they're lightweight - adding very little extra to the log - but they clarify what's going on during a work fetch very nicely.
15/11/2022 21:33:46 | climateprediction.net | Sending scheduler request: To fetch work.
15/11/2022 21:33:46 | climateprediction.net | Requesting new tasks for CPU
15/11/2022 21:33:46 | climateprediction.net | [sched_op] CPU work request: 4215.86 seconds; 0.00 devices
15/11/2022 21:33:46 | climateprediction.net | [sched_op] NVIDIA GPU work request: 0.00 seconds; 0.00 devices
15/11/2022 21:33:47 | climateprediction.net | Scheduler request completed: got 1 new tasks
15/11/2022 21:33:47 | climateprediction.net | [sched_op] Server version 715
15/11/2022 21:33:47 | climateprediction.net | Project requested delay of 3636 seconds
15/11/2022 21:33:47 | climateprediction.net | [sched_op] estimated total CPU task duration: 87303 seconds
15/11/2022 21:33:47 | climateprediction.net | [sched_op] estimated total NVIDIA GPU task duration: 0 seconds
15/11/2022 21:33:47 | climateprediction.net | [sched_op] Deferring communication for 01:00:36
@Richard Is the "estimated total CPU task duration" dependent on benchmark results? I'm just wondering if that could be a problem since I see quite a few people with PCs where boinc has never run a benchmark. Could that in any way complicate whether tasks are sent to a host? I see Glenn has a couple of his Linux boxes that have the default 1 billion ops/sec for the benchmark. ID: 66456 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 66457 - Posted: 16 Nov 2022, 4:31:17 UTC - in response to Message 66418. Last modified: 16 Nov 2022, 4:51:52 UTC (as I prefer to use free solar elec during the day & not pay for boinc at night - to answer P.Hucker). Ah that old chestnut. Not a lot of solar in Scotland, so it doesn't apply to me. Are you on one of those crazy government subsidies where you can use it for free but still get paid for generating it? If not, surely you can sell it to the grid, so it shouldn't make too much difference when you use the computers. Also, if I was to do what you do, I'd just suspend GPU work, since they use the majority of the power. Or get big batteries! Don't get the expensive Tesla Li Ion stuff, get lead acid deep cycle leisure caravanning/boat batteries. Lovely and cheap - $182 US for 24V 110Ah. Also, they can do about 3000 charge cycles, and Li Ion is only 1000. ID: 66457 ·

Mr. P Hucker Send message Joined: 9 Oct 20 Posts: 690 Credit: 4,391,754 RAC: 6,918	Message 66458 - Posted: 16 Nov 2022, 4:34:17 UTC - in response to Message 66421. BOINC also takes account of the length of time that has elapsed from the last checkpoint when deciding to pause one project's application, and give a turn at the trough for a different one. If a task has never checkpointed, BOINC will try to avoid pausing it unless absolutely necessary. I assume if it had to, it would ask it to checkpoint first? CPDN has a particular problem with checkpoints. The amount of data that has to be recorded to catch the complete internal state of the model so far is much greater than for most other projects. In some cases - slower drives or interfaces, heavily contended devices, or cached 'lazy write' drives - it can take a significant amount of time before the stored data is complete and usable. I think the majority of problems in the past will have been caused by one or more of these delays causing the image on disk to be incomplete and unreadable on restart. Is there a mechanism to delay shutting the machine down until the data is written? ID: 66458 ·

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,276,661 RAC: 11,053	Message 66460 - Posted: 16 Nov 2022, 8:29:33 UTC - in response to Message 66458. I assume if it had to, it would ask it to checkpoint first? Go back and read what Glenn and I have been saying. It's the CPDN app (alone) that decides when its data is in a consistent state for a checkpoint. BOINC cannot 'ask it to checkpoint': the most it can ask for is a delay to the checkpoints, so the apps can get on with the science. Is there a mechanism to delay shutting the machine down until the data is written? No. Ultimately, the operating system is the boss. It will ask BOINC to shut down, and in turn BOINC will ask CPDN to shut down. The OS will wait for a polite interval, and if they don't respond, it will kill them. ID: 66460 ·
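The "ask politely, wait, then kill" sequence in the post above is a common shutdown pattern, and a small sketch may make it concrete. This is not how any particular operating system or BOINC itself implements it; the function name `stop_politely` and the grace period are assumptions for illustration, using Python's standard `subprocess` module.

```python
import subprocess


def stop_politely(proc, grace_seconds=10.0):
    """Ask a child process to stop, wait a polite interval, then kill it.

    Returns "exited" if the process stopped within the grace period,
    otherwise "killed". Hypothetical helper for illustration only.
    """
    # Polite request: on POSIX this sends SIGTERM, which a process can
    # handle (for example to finish writing its data) or ignore.
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
        return "exited"
    except subprocess.TimeoutExpired:
        # Grace period over: a forced kill cannot be caught or ignored.
        proc.kill()
        proc.wait()
        return "killed"
```

The consequence for checkpoints is the one Richard points out: if writing the data takes longer than the grace period, the process is killed mid-write and nothing in this sequence waits for it.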

New work discussion - 2