Thread 'Shutting down for re-boot.'

Questions and Answers : Unix/Linux : Shutting down for re-boot.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 55901 - Posted: 15 Mar 2017, 15:30:48 UTC

I was hoping that this had ceased to be an issue, and for tasks running under WINE that seems to be the case. With native Linux tasks, however, even if I suspend computation, wait a few minutes, and then use File > Exit, I seem to lose a task about one time in three.

I will revert to waiting until no tasks are present before updating the kernel. I'd be interested to know whether others are still experiencing this. Is it more of a problem during kernel updates, or does it happen on any shutdown and restart? I have not lost any tasks when I hibernate.
ID: 55901

Desti
Joined: 6 Aug 04
Posts: 124
Credit: 9,195,838
RAC: 0
Message 56292 - Posted: 24 May 2017, 1:24:20 UTC

How do you suspend computation? Per task, per project, or globally? Suspending each task seems to work better for me.
Linux Users Everywhere @ BOINC
ID: 56292

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 56294 - Posted: 24 May 2017, 5:39:33 UTC - in response to Message 56292.  
Last modified: 24 May 2017, 5:42:14 UTC

I usually suspend per task and then globally, resuming in reverse order. It seems to be a particular issue when the kernel is updated. Restarting at other times seems to drop the failure rate to about one in ten, but the data are a bit sparse because I don't restart that often except when the kernel needs updating.
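The ordering described above (suspend each task, then suspend globally; resume in reverse) could be scripted around `boinccmd`. This is only a sketch: the project URL and task names are placeholders, and the functions merely build the command lines so the order can be checked before anything touches a live client.

```python
from typing import List

# Placeholder values for illustration only; substitute the project URL and
# task names that `boinccmd --get_tasks` reports on your own machine.
PROJECT_URL = "https://example.org/cpdn/"
TASKS = ["task_a", "task_b"]


def suspend_commands(project_url: str, task_names: List[str]) -> List[List[str]]:
    """Suspend each task individually, then suspend all computation."""
    cmds = [["boinccmd", "--task", project_url, name, "suspend"]
            for name in task_names]
    cmds.append(["boinccmd", "--set_run_mode", "never"])
    return cmds


def resume_commands(project_url: str, task_names: List[str]) -> List[List[str]]:
    """Reverse of suspend_commands: global resume first, then each task."""
    cmds = [["boinccmd", "--set_run_mode", "auto"]]
    cmds += [["boinccmd", "--task", project_url, name, "resume"]
             for name in reversed(task_names)]
    return cmds


if __name__ == "__main__":
    # Print the commands instead of running them (dry run).
    for cmd in suspend_commands(PROJECT_URL, TASKS) + resume_commands(PROJECT_URL, TASKS):
        print(" ".join(cmd))
```

Whether this ordering actually lowers the failure rate is exactly what the thread leaves open; the sketch only makes the procedure repeatable.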
ID: 56294

DJStarfox
Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 57337 - Posted: 7 Nov 2017, 14:24:18 UTC

Dave, I've had better luck with the following global compute preference checked:
"Leave non-GPU tasks in memory while suspended" = yes

If that is disabled on your account, try enabling it and see if that improves things.
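For anyone who prefers to set this locally rather than on the website, here is a sketch that generates the matching local override. It assumes the client reads a `global_prefs_override.xml` file from its data directory and that the setting is stored as `leave_apps_in_memory`; the location of the data directory varies by installation and is not shown.

```python
import xml.etree.ElementTree as ET


def leave_in_memory_override(enabled: bool = True) -> str:
    """Build a minimal preferences override containing only this one setting."""
    root = ET.Element("global_preferences")
    ET.SubElement(root, "leave_apps_in_memory").text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the fragment; saving it over an existing override file would
    # discard any other local overrides already stored there.
    print(leave_in_memory_override())
```

After saving the file, `boinccmd --read_global_prefs_override` should make a running client pick it up.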
ID: 57337

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 57338 - Posted: 7 Nov 2017, 16:25:13 UTC - in response to Message 57337.  

Thanks for adding that. I have in fact had "Leave non-GPU tasks in memory while suspended" enabled on my boxes for many years. The measures I have outlined are in addition to that.

I'm not sure whether something has changed in the tasks or in more recent versions of BOINC, but of late, even when I have had restarts due to power failure (an electrician turning the mains off without warning), I haven't lost tasks to it. Something has improved, but I don't know what. (At a guess, the change happened over the last nine months to a year.)
ID: 57338

Richard Giles
Joined: 12 Dec 14
Posts: 5
Credit: 14,162,005
RAC: 5,698
Message 63932 - Posted: 3 May 2021, 13:32:13 UTC

Hi.

I've recently started running CPDN tasks on my Linux box again. Am I correct in remembering that HadAM4/N216 crash if I reboot the system? It would be nice to know.

Thanks.
Richard
ID: 63932

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4541
Credit: 19,039,635
RAC: 18,944
Message 63933 - Posted: 3 May 2021, 15:49:16 UTC - in response to Message 63932.  

Last reboot, I lost one out of eight, but that was an Ubuntu version upgrade, which means a kernel upgrade, and in my experience that increases the chances of problems. I would say there is still a risk, but assessing the level of risk would need more systematic research than my impressions, which come without actually making a note of it every time.
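To show how little a single observation like "one out of eight" pins down, here is a sketch that puts a 95% Wilson score interval around a loss rate. It treats each task as an independent trial, which is an assumption: tasks sharing one reboot may well fail together.

```python
import math
from typing import Tuple


def wilson_interval(lost: int, at_risk: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the fraction of tasks lost (z=1.96 is ~95%)."""
    if at_risk <= 0 or not 0 <= lost <= at_risk:
        raise ValueError("need 0 <= lost <= at_risk and at_risk > 0")
    p = lost / at_risk
    denom = 1 + z * z / at_risk
    centre = (p + z * z / (2 * at_risk)) / denom
    half = z * math.sqrt(p * (1 - p) / at_risk + z * z / (4 * at_risk ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    low, high = wilson_interval(lost=1, at_risk=8)
    print(f"observed 1/8 = 12.5%, plausible range {low:.1%} to {high:.1%}")
```

With only eight tasks the interval is very wide, which is why logging every reboot, as suggested above, would be needed to compare kernel-update reboots with ordinary ones.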
ID: 63933

wateroakley
Joined: 6 Aug 04
Posts: 195
Credit: 28,510,803
RAC: 10,061
Message 63934 - Posted: 4 May 2021, 16:21:49 UTC - in response to Message 63932.  
Last modified: 4 May 2021, 16:22:27 UTC

I'm running Ubuntu 20.04 under Oracle VM VirtualBox on a Windoze10 machine. Before shutting down the Ubuntu VM or Windoze, I always suspend the CPDN/BOINC activity.
When Windoze decided on its last unexpected update and reboot, I lost one HadAM4h (RIP) of the four CPDN tasks running on the Ubuntu VM.
ID: 63934

Eirik Redd
Joined: 31 Aug 04
Posts: 391
Credit: 219,896,461
RAC: 649
Message 63951 - Posted: 7 May 2021, 6:04:11 UTC - in response to Message 57338.  

My experience is similar. I do check "Leave non-GPU tasks in memory while suspended", and I always suspend tasks before a reboot when the reboot is optional (for a kernel upgrade or the like).
Yup, something has improved, but I don't know what.
e
ID: 63951

SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 64042 - Posted: 10 Jun 2021, 23:28:49 UTC

I've sure not figured out how to do it.

Even if I suspend tasks and stop BOINC before shutting down and rebooting, I still lose many over time. I've given up on running the Linux CPDN tasks on anything that reboots regularly. A one-off glitch is OK, but since I'm doing most of the work in a solar powered office, I just suspend the machines overnight and resume them in the morning - that doesn't bother anything.

If I do need to reboot for updates or such, I drain the tasks out first and let them all finish before I deliberately reboot.
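A sketch of how the "drain first" check might be automated: stop new work with `boinccmd --project <url> nomorework`, then look at `boinccmd --get_tasks` until nothing unfinished remains for the project. The parser below assumes the output consists of numbered records with `key: value` lines such as `name`, `project URL` and `ready to report`; that layout and the URL are assumptions to verify against your own client.

```python
import re
from typing import Dict, List, Optional

# Assumed record separator in `boinccmd --get_tasks` output, e.g. "1) -----------".
TASK_HEADER = re.compile(r"^\s*\d+\)\s*-+\s*$")


def parse_tasks(get_tasks_output: str) -> List[Dict[str, str]]:
    """Split the text into one dict of key/value fields per task record."""
    tasks: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    for line in get_tasks_output.splitlines():
        if TASK_HEADER.match(line):
            current = {}
            tasks.append(current)
            continue
        key, sep, value = line.strip().partition(": ")
        if current is not None and sep:
            current[key] = value
    return tasks


def unfinished_tasks(get_tasks_output: str, project_url: str) -> List[str]:
    """Names of this project's tasks that are not yet ready to report."""
    return [t.get("name", "?") for t in parse_tasks(get_tasks_output)
            if t.get("project URL") == project_url
            and t.get("ready to report") != "yes"]
```

A reboot would then be considered safe only when `unfinished_tasks(...)` returns an empty list for the CPDN project URL.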
ID: 64042

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 64043 - Posted: 11 Jun 2021, 4:06:38 UTC

Weather is a "chaotic system" over a short period, and climate is an expansion of that short period into months, years, or millennia.

Most of the models that this project runs were developed by and for the UK Met Office, where they run on supercomputers NON STOP.

Because of the "chaotic system" part, they are very sensitive to being interrupted, and attempting to run these desktop versions on anything other than a plain, simple, vanilla system can lead to trouble.

Anyone constantly having crashes is using a computer that just isn't stable enough for this work, no matter how wonderful it is at doing everything else.
And doing lots of other things at the same time may be part of the problem.
ID: 64043

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64044 - Posted: 11 Jun 2021, 7:37:58 UTC - in response to Message 64043.  

I am unconvinced. However chaotic the system being simulated by these models may be, the computer they run on is deterministic, so a run should always be repeatable. If it is not, either the hardware the model runs on is faulty (i.e., non-deterministic) or the software has bugs in it (e.g., it reads parts of memory that have never been assigned values).

In either of these two cases, it is not the model that is non-deterministic.
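The distinction can be illustrated with a toy chaotic system, the logistic map. This is purely illustrative and says nothing about how the CPDN models compute or checkpoint: it only shows that a chaotic calculation is exactly repeatable from an identical state, while a tiny difference in state grows quickly.

```python
from typing import List


def logistic_trajectory(x0: float, steps: int, r: float = 3.9) -> List[float]:
    """Iterate x -> r * x * (1 - x); r = 3.9 lies in the chaotic regime."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs


if __name__ == "__main__":
    a = logistic_trajectory(0.2, 200)
    b = logistic_trajectory(0.2, 200)           # same start: deterministic
    c = logistic_trajectory(0.2 + 1e-12, 200)   # start perturbed by 1e-12
    restarted = logistic_trajectory(a[50], 150)  # "restart" from a saved state

    print("identical runs match:", a == b)
    print("exact restart matches:", restarted == a[50:])
    print("largest gap after perturbation:",
          max(abs(x - y) for x, y in zip(a, c)))
```

In this toy, a restart from the exact saved state reproduces the rest of the run, so any divergence after an interruption would have to come from state that was not restored exactly.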

On my main machine it takes about 8 days to run an N216 model, and I reboot about once a week. So, since I run 3 or 4 models at a time, I always reboot while some are nominally running. My drill is to set the projects to no new tasks, suspend the running tasks, and then update the system, sometimes including the kernel, and reboot. As far as I remember, I have never lost a task since I got my current system (I believe last September).
ID: 64044

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 64045 - Posted: 11 Jun 2021, 10:31:08 UTC - in response to Message 64044.  

Jean-David

If you don't lose tasks, then your set-up is stable, so this doesn't apply to you.

It's people who DO keep crashing tasks that have a problem, and only they can sort out why.
And a lot of it is Windows, and the way it takes over, and re-boots when IT wants to.

Admittedly, I haven't run Windows for years, so I'm only going by what I've read.

Still, in the long run, the work does eventually get completed by someone, and that's what matters.
ID: 64045

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64046 - Posted: 11 Jun 2021, 14:12:14 UTC - in response to Message 64045.  

Les Bayliss wrote: "Admittedly, I haven't run Windows for years, so I'm only going by what I've read."

Me neither. There was a bad batch some years ago that suffered from this but, IIRC, they fixed that problem in about a week. I got this machine about last September, and I have had no problems with it since I got some SELinux problems sorted out. I also got a little Dell PC running Windows 10 so that I could do my taxes on it and download new maps for my Garmin GPS. Since it is sitting there with nothing else to do, I installed BOINC on it, and it runs Climateprediction, WCG, Rosetta, and Universe.

Computer: 1512658
Domain name: DESKTOP-K1UQGC4
Created: 19 Dec 2020, 22:21:58 UTC
CPU type: GenuineIntel 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz [Family 6 Model 140 Stepping 1]
Number of processors: 8
Operating System: Microsoft Windows 10 Core x64 Edition, (10.00.19043.00)

It is currently running five WeatherAtHome tasks.

All tasks for computer 1512658

State: All (6) · In progress (5) · Validation pending (0) · Validation inconclusive (0) · Valid (1) · Invalid (0) · Error (0)
Application: All (6) · Weather At Home 2 (wah2) (6) · all other applications (0)
ID: 64046

WB8ILI
Joined: 1 Sep 04
Posts: 161
Credit: 81,522,141
RAC: 1,164
Message 64047 - Posted: 11 Jun 2021, 15:18:25 UTC

I have had very good luck NOT crashing models when re-booting by first shutting down the boinc-client.

I am using Ubuntu, so I open a Terminal and enter "sudo service boinc-client stop", then Restart from the Desktop drop-down menu.

I have NOT been able to find any documentation that explains the difference between:

a) sudo service boinc-client stop, then Restart from the Desktop drop-down menu
b) Click on Restart Now from the Software Updater window (after an update)
c) reboot from the Terminal

Does either b) or c) perform an orderly termination of the boinc-client?

When a computer crashes (locks up or freezes, or has a power interruption, etc.), I expect model crashes and feel fortunate if there aren't any.
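Option a) can be made explicit with a small wrapper that stops the service, confirms it is no longer active, and only then reboots. This is a sketch under assumptions: a systemd-based system where `systemctl is-active --quiet` returns non-zero once the unit has stopped, and a service named `boinc-client` as in the command above. The command runner is passed in so the sequence can be tried without rebooting anything.

```python
import subprocess
import time
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def real_runner(cmd: Sequence[str]) -> int:
    """Run a command for real and return its exit status."""
    return subprocess.run(list(cmd)).returncode


def stop_client_then_reboot(run: Runner,
                            poll_seconds: float = 2.0,
                            max_polls: int = 30,
                            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Stop boinc-client, wait until it is inactive, then reboot.

    Returns False (without rebooting) if the service never reports inactive.
    """
    run(["sudo", "service", "boinc-client", "stop"])
    for _ in range(max_polls):
        # Exit status 0 means the unit is still active.
        if run(["systemctl", "is-active", "--quiet", "boinc-client"]) != 0:
            run(["sudo", "reboot"])
            return True
        sleep(poll_seconds)
    return False


if __name__ == "__main__":
    # Dry run: print each command and pretend the service is already stopped.
    def dry_run(cmd: Sequence[str]) -> int:
        print(" ".join(cmd))
        return 3 if cmd[0] == "systemctl" else 0

    stop_client_then_reboot(dry_run, sleep=lambda _s: None)
```

Passing `real_runner` instead of `dry_run` would execute the commands. The sketch does not answer whether b) or c) shut the client down in an orderly way.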
ID: 64047

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1120
Credit: 17,202,915
RAC: 2,154
Message 64048 - Posted: 11 Jun 2021, 18:14:00 UTC - in response to Message 64047.  

I am not saying there is anything wrong with what you do, and I do not have all the answers. On my RHEL8 system, starting and stopping background tasks (actually all tasks) is done, in the correct order, by systemd. I do not even need to know how that works. I know that if I boot the system, the boinc-client is started automatically, after anything it requires has already been started. And when I reboot the system, even implicitly, systemd takes them down, also in the correct order.

What I normally do is this: about a day before, I set no new tasks for all my projects. On the day I am going to do the system updates and reboot, I suspend all tasks that have not been started. I then have lunch or something. Then I suspend all the non-Climateprediction tasks, and then the Climateprediction tasks. I then log out as me and log in as root. I run the software program that checks for updates (I already know there will be some). If it finds them, it downloads them, installs them, and reboots.
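The drill above can be written down as an ordering rule. This sketch only builds `boinccmd` command lines; the `Task` record, the URLs, and the task names are hypothetical stand-ins for whatever the client reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    name: str
    project_url: str
    started: bool  # True once the task has begun running


def no_new_tasks_commands(project_urls: List[str]) -> List[List[str]]:
    """Day before: ask every attached project for no more work."""
    return [["boinccmd", "--project", url, "nomorework"] for url in project_urls]


def suspend_commands(tasks: List[Task], cpdn_url: str) -> List[List[str]]:
    """Update day: unstarted tasks first, then other projects, CPDN last."""
    not_started = [t for t in tasks if not t.started]
    other_running = [t for t in tasks if t.started and t.project_url != cpdn_url]
    cpdn_running = [t for t in tasks if t.started and t.project_url == cpdn_url]
    ordered = not_started + other_running + cpdn_running
    return [["boinccmd", "--task", t.project_url, t.name, "suspend"]
            for t in ordered]


if __name__ == "__main__":
    cpdn = "https://example.org/cpdn/"    # placeholder URL
    other = "https://example.org/other/"  # placeholder URL
    tasks = [Task("cpdn_1", cpdn, True), Task("other_1", other, True),
             Task("cpdn_2", cpdn, False)]
    for cmd in no_new_tasks_commands([cpdn, other]) + suspend_commands(tasks, cpdn):
        print(" ".join(cmd))
```

The pause between the steps (the "lunch or something") is left to the operator; the sketch does not model it.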
ID: 64048

geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2187
Credit: 64,822,615
RAC: 5,275
Message 64056 - Posted: 14 Jun 2021, 21:50:24 UTC - in response to Message 64042.  

From looking at the tasks on your PCs, you have an exceptional record of completing tasks successfully. There was only one PC that had a significant number of failures and those were the hadcm3s models which are more finicky to begin with. There were some errors with the recent hadam4 (not hadam4h) batch but everyone had those as there was a batch configuration problem.
ID: 64056

SolarSyonyk
Joined: 7 Sep 16
Posts: 262
Credit: 34,915,412
RAC: 16,463
Message 64092 - Posted: 30 Jun 2021, 16:10:45 UTC - in response to Message 64056.  


Good to know - though I'm not sure what that would look like if I were actually rebooting systems regularly instead of sleeping them...

Production is down now with the heat, but will be back up this winter. I've been trying to figure out how to get the cheap cloud preemptible instances to hibernate cleanly so they can run CPDN units without having to actually stop/start them, but it's been tricky to make it work reliably.
ID: 64092

©2024 cpdn.org