climateprediction.net home page
Batch 996 Weather@Home2 East Asia25

Glenn Carver

Joined: 29 Oct 17
Posts: 809
Credit: 13,609,322
RAC: 5,221
Message 69966 - Posted: 20 Oct 2023, 11:59:57 UTC - in response to Message 69964.  

What I am not sure of, Glenn, is whether, even if the science data is the same between both runs, tasks that complete under WINE but not under Windows are still invalid, or whether there is even a way of checking that?
What do you mean by 'invalid'?
Not invalid as in rejected by the software but invalid as in useless for the science.
If the model blows up from a numerical problem that would obviously count as 'invalid' (as opposed to any restart problem). If it completes then it's not possible to tell from a single run. It would need many runs to look at the statistics of the model results and see if there is a bias in the results compared to runs on bare metal.

The WINE issue was more straightforward. We expected the model to crash with a memory error when it was run but under WINE it *always* ran, so there's clearly something special about that environment for memory errors. It doesn't necessarily mean working tasks would be 'wrong'.
ID: 69966
Glenn Carver
Joined: 29 Oct 17
Posts: 809
Credit: 13,609,322
RAC: 5,221
Message 69967 - Posted: 20 Oct 2023, 12:04:59 UTC - in response to Message 69965.  

It's restarting the model from a shutdown that risks the model failing like this.
None of my "two minute crashes" have been the result of a re-start after a shutdown.
Do you have 'Leave non-GPU tasks in memory when suspended' enabled under 'General' (or 'Memory/Disk') in boincmgr?

If not, then whenever the client suspends the tasks (non-BOINC CPU usage too high, for instance), the tasks might be kicked out of memory and would then have to restart from the restart files on disk.
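For anyone who prefers to set this outside the Manager, here is a minimal Python sketch of switching the same preference on in a preferences override file. It assumes the setting is stored as a `<leave_apps_in_memory>` element under a `<global_preferences>` root, as in BOINC's `global_prefs_override.xml`; please check that against your own client before relying on it. The function name is made up for illustration and the sketch only works on an XML string, so it never touches a real BOINC data directory.

```python
import xml.etree.ElementTree as ET


def set_leave_apps_in_memory(xml_text: str, enabled: bool = True) -> str:
    """Return the override XML with <leave_apps_in_memory> set to 1 or 0.

    Assumption: the root element is <global_preferences> and the
    preference lives in a direct child called <leave_apps_in_memory>.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "global_preferences":
        raise ValueError("expected a <global_preferences> root element")
    node = root.find("leave_apps_in_memory")
    if node is None:
        # The preference is not present yet, so add it.
        node = ET.SubElement(root, "leave_apps_in_memory")
    node.text = "1" if enabled else "0"
    return ET.tostring(root, encoding="unicode")


# Hypothetical starting content for the override file.
example = "<global_preferences><max_ncpus_pct>100</max_ncpus_pct></global_preferences>"
updated = set_leave_apps_in_memory(example)
print(updated)
```

The client only picks up an edited override file once it re-reads its preferences or is restarted, so the tick box in the Manager is still the simpler route for most people.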

Some of the tasks will fail despite this because they are being deliberately perturbed; some of the forecasts will be too physically unrealistic for the model to handle.
ID: 69967
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 69968 - Posted: 20 Oct 2023, 12:48:31 UTC - in response to Message 69966.  
Last modified: 20 Oct 2023, 14:27:15 UTC

The WINE issue was more straightforward. We expected the model to crash with a memory error when it was run but under WINE it *always* ran, so there's clearly something special about that environment for memory errors. It doesn't necessarily mean working tasks would be 'wrong'.


That is what I suspected the answer would be. I don't even know if there are enough hosts using WINE to do that statistical work even if we could distinguish between hosts using WINE and those using Windows. In a bit under an hour I should be able to compare the first zips between tasks running under WINE and Windows in a VM. I will let you know what I find and PM if I get stuck with anything or come up with anything I think significant.

Edit: both _1.zip files are identical. I actually ran diff on the individual .nc files contained in each zip first, rather than on the two zip files themselves, which would have saved me a couple of minutes of faffing around.
ID: 69968
rob
Joined: 5 Jun 09
Posts: 78
Credit: 3,036,906
RAC: 4,114
Message 69969 - Posted: 20 Oct 2023, 15:04:12 UTC - in response to Message 69967.  
Last modified: 20 Oct 2023, 15:07:52 UTC

Do you have 'Leave non-GPU tasks in memory when suspended' enabled under 'General' (or 'Memory/Disk') in boincmgr?


I rarely suspend BOINC, but until learning that this current batch does not shut down and then restart properly, I was shutting down BOINC in the minutes before shutting the computer off. I do, however, have "Leave non-GPU tasks in memory when suspended" set ON.

Edit to add - My sole computer is running Windows 10 and has 32GB of memory and 8 "real" cores, plus another 8 with hyper-threading turned on (virtualisation is also turned on, but no virtual machine is running just now).
ID: 69969
AndreyOR
Joined: 12 Apr 21
Posts: 247
Credit: 12,041,675
RAC: 20,255
Message 69971 - Posted: 22 Oct 2023, 5:46:44 UTC - in response to Message 69954.  

I am sure a lot of the hard fails are simply due to this and not because of an inherent problem with the model perturbations. Not sure whether CPDN will decide to rerun them or not yet.

Yeah, 12 of my 32 failures so far have been due to BOINC restarts. Some were due to an unintentional PC shutdown and others due to BOINC seemingly crashing (I came to check on things and found BOINC wasn't running). It sucks, as most of those had run for ~12 days and had no more than a day to go. Hopefully the remaining 21 will finish successfully, as I have only had 5 finish successfully so far.
ID: 69971
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 69972 - Posted: 22 Oct 2023, 6:40:42 UTC - in response to Message 69971.  

and others due to BOINC seemingly crashing
BOINC crashing when left unattended is a new one for me. The only times I can remember BOINC itself crashing as opposed to tasks is the recent bug that would sometimes make it crash switching between advanced and simple view then back again. I think that has been fixed now but not 100% certain.
ID: 69972
AndreyOR
Joined: 12 Apr 21
Posts: 247
Credit: 12,041,675
RAC: 20,255
Message 69973 - Posted: 22 Oct 2023, 21:18:34 UTC - in response to Message 69972.  

and others due to BOINC seemingly crashing
BOINC crashing when left unattended is a new one for me. The only times I can remember BOINC itself crashing as opposed to tasks is the recent bug that would sometimes make it crash switching between advanced and simple view then back again. I think that has been fixed now but not 100% certain.

This happens to me once in a while, I'll come check on it and find BOINC isn't running. I can't remember when the first time was, this year or last. I think this was the first time with the latest version (7.24.1). It was also the first time with CPDN running which was costly due to loss of tasks and a lot of processing time. I haven't tried to investigate it in any way yet.
ID: 69973
Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,325,575
RAC: 11,402
Message 69974 - Posted: 23 Oct 2023, 8:07:44 UTC - in response to Message 69972.  

... BOINC itself crashing as opposed to tasks is the recent bug that would sometimes make it crash switching between advanced and simple view then back again ...
Both Dave and I participated in a conversation with the developers on this one: #4784. The code is fixed and tested, but the release process has stalled, and it may not be in public use yet.

But for clarity: it was only the Manager which crashed. The client, and hence the science applications, kept running. I've also not had any problems with the client crashing by itself. My main problems have been:

  • Mains power supply outages - even a momentary flicker can cause a host reboot
  • Total system freeze - only a hard reboot regains control

I think it's been suggested that the second can happen if the host memory becomes too full.

ID: 69974
Glenn Carver
Joined: 29 Oct 17
Posts: 809
Credit: 13,609,322
RAC: 5,221
Message 69975 - Posted: 23 Oct 2023, 8:53:53 UTC - in response to Message 69973.  
Last modified: 23 Oct 2023, 8:54:21 UTC

BOINC crashing when left unattended is a new one for me. The only times I can remember BOINC itself crashing as opposed to tasks is the recent bug that would sometimes make it crash switching between advanced and simple view then back again. I think that has been fixed now but not 100% certain.
This happens to me once in a while, I'll come check on it and find BOINC isn't running. I can't remember when the first time was, this year or last. I think this was the first time with the latest version (7.24.1). It was also the first time with CPDN running which was costly due to loss of tasks and a lot of processing time. I haven't tried to investigate it in any way yet.
If it happens again it would be good to check the system logs to see why it failed. I'm assuming this is the Windows client only and not Linux (I've never seen it myself).
ID: 69975
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4347
Credit: 16,541,921
RAC: 6,087
Message 69976 - Posted: 23 Oct 2023, 9:42:17 UTC

Both Dave and I participated in a conversation with the developers on this one: #4784. The code is fixed and tested, but the release process has stalled, and it may not be in public use yet.

Just tried switching back and forth between simple and advanced views 10 times on 7.24.1 without inducing a crash, but the crash didn't happen every time before either. I am not sure how many times I need to test it to be sure.
ID: 69976
Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,325,575
RAC: 11,402
Message 69977 - Posted: 23 Oct 2023, 10:00:01 UTC - in response to Message 69976.  

I am not sure how many times I need to test it to be sure?
When I started that ticket, it was near immediate - one or two cycles at most. I forget how many times I tried it when reporting a successful fix (lower down - after a second problem with the size of text boxes), but at least two - that was probably enough.
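As a rough way of putting numbers on "how many times", here is a small Python sketch. It rests on assumptions that are not established in this thread: that each view switch is an independent trial and that the unfixed bug crashed the Manager with a fixed probability per switch. Under those assumptions the chance of seeing no crash in n switches is (1 - p)^n, and the sketch finds the smallest n that pushes that chance below a chosen threshold. The function names are illustrative only.

```python
import math


def prob_no_crash(p_crash: float, trials: int) -> float:
    """Chance of seeing zero crashes in `trials` independent attempts."""
    return (1.0 - p_crash) ** trials


def trials_needed(p_crash: float, alpha: float) -> int:
    """Smallest n for which (1 - p_crash) ** n is at most alpha."""
    if not (0.0 < p_crash < 1.0 and 0.0 < alpha < 1.0):
        raise ValueError("p_crash and alpha must both lie strictly between 0 and 1")
    return math.ceil(math.log(alpha) / math.log(1.0 - p_crash))


# If the old bug crashed on roughly half of all switches (an assumed figure),
# ten clean switches would be very unlikely to happen by luck.
print(prob_no_crash(0.5, 10))
print(trials_needed(0.5, 0.01))
print(trials_needed(0.1, 0.05))
```

The rarer the original crash, the more clean switches are needed before a fix can be trusted, which is why a bug that used to trigger within one or two cycles needs only a handful of tests.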
ID: 69977
AndreyOR
Joined: 12 Apr 21
Posts: 247
Credit: 12,041,675
RAC: 20,255
Message 69978 - Posted: 23 Oct 2023, 21:12:38 UTC - in response to Message 69975.  

BOINC crashing when left unattended is a new one for me. The only times I can remember BOINC itself crashing as opposed to tasks is the recent bug that would sometimes make it crash switching between advanced and simple view then back again. I think that has been fixed now but not 100% certain.
This happens to me once in a while, I'll come check on it and find BOINC isn't running. I can't remember when the first time was, this year or last. I think this was the first time with the latest version (7.24.1). It was also the first time with CPDN running which was costly due to loss of tasks and a lot of processing time. I haven't tried to investigate it in any way yet.
If it happens again it would be good to check the system logs to see why it failed. I'm assuming this is the Windows client only and not Linux (I've never seen it myself).

Yes, it's a Windows 10 PC (I use WSL2 for any Linux BOINC work, which I haven't run for months now). I'd look but don't really know where to look or what to look for. I did look at Reliability History and Event Viewer after posting that first post but couldn't find anything, though I'm also not exactly sure what to look for.

It's definitely not related to switching between views as I don't switch and always use the Advanced view. It's also not just the Manager crashing as that'd be easy to tell (BOINC start up, CPU temperature changes). It also isn't due to a system reboot due to some critical system component crash or power failure as that'd also be easy to tell. I have RyzenMaster & MSI Afterburner that start up first to turn on undervolt settings before other things like BOINC start and I have to manually apply the settings & close those programs before anything else proceeds so I can tell when there was a system restart.
ID: 69978
Sardis73
Joined: 1 Apr 12
Posts: 3
Credit: 13,843,038
RAC: 4,136
Message 69979 - Posted: 24 Oct 2023, 0:35:47 UTC - in response to Message 69978.  

Is this similar to what's being discussed?

Here is what was displayed in a pop-up dialog on October 19, 2023:
BOINC Manager - Connection Error
Invalid client RPC password. Try reinstalling BOINC.

The BOINC Manager was blank - no project, task or other data displayed.
I used the preferred shutdown method for BOINC and restarted my computer.
The climate project and tasks were displayed after the restart. The two trickles waiting to be sent displayed errors.
BOINC resumed computing the tasks.

About 30 minutes later, it occurred again and all of the tasks for the project crashed.

https://www.cpdn.org/result.php?resultid=22347881 15 (0x0000000F) Unknown error code

https://www.cpdn.org/result.php?resultid=22347188
https://www.cpdn.org/result.php?resultid=22336571

Suspended CPDN Monitor - Suspend request from BOINC...
10:45:57 (12972): BOINC client no longer exists - exiting
10:45:57 (12972): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
ID: 69979
AndreyOR
Joined: 12 Apr 21
Posts: 247
Credit: 12,041,675
RAC: 20,255
Message 69980 - Posted: 24 Oct 2023, 6:01:54 UTC - in response to Message 69979.  

Sardis73,

Looks different from mine. For me, both the Client and the Manager crash; in your case it seems to be just the Client. I've seen the Invalid Client RPC password error before, but it usually happens right away when you start BOINC. The fact that yours happens some time after everything has started and been running for a while is new and a bit puzzling.
ID: 69980
Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,292,252
RAC: 2,561
Message 69982 - Posted: 24 Oct 2023, 10:00:20 UTC - in response to Message 69967.  

Hello again... Most probably it's the Windows memory management...

Greetings to all believers in science!

I have a rock-solid computer that is more than 11 years old and still works at its best.
The same errors appear here on this computer. Every fresh start of BOINC that restarts several CPDN models makes all of those models die.
But the good news: they carry on, because I always restart from a backup. So none of the models is faulty; they keep running and will also finish successfully.
So far there have been no other errors, but BOINC Manager version 7.24.1 isn't very reliable.

Cheers, and have a peaceful day,

Bonsai911
ID: 69982
Glenn Carver
Joined: 29 Oct 17
Posts: 809
Credit: 13,609,322
RAC: 5,221
Message 69983 - Posted: 24 Oct 2023, 11:00:46 UTC - in response to Message 69979.  

About 30 minutes later, it occurred again and all of the tasks for the project crashed.

https://www.cpdn.org/result.php?resultid=22347881 15 (0x0000000F) Unknown error code
This task shows the error 'system cannot find the drive specified'. I think that's why the client & tasks died. I've seen that multiple times on the Windows tasks and there was some discussion about it earlier (maybe in this thread?). I forget the outcome of the discussion, but I wonder whether it indicates a failing drive, or maybe one that's getting hammered by other process(es).

Perhaps look at the drive's SMART diagnostics to check what's going on with it.

<![CDATA[
<message>
The system cannot find the drive specified.
 (0xf) - exit code 15 (0xf)</message>
<stderr_txt>
10:46:05 (12540): BOINC client no longer exists - exiting
10:46:05 (12540): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...

---
CPDN Visiting Scientist
ID: 69983
Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,325,575
RAC: 11,402
Message 69984 - Posted: 24 Oct 2023, 12:23:12 UTC - in response to Message 69983.  

The phrase

BOINC client no longer exists
is, I think, written by the CPDN wrapper, via a call into the BOINC API library.

The only time it's used in BOINC code is https://github.com/BOINC/boinc/blob/master/api/boinc_api.cpp#L508, where its usage is determined by a test on client_pid

I take that to mean that the client is no longer running in memory - it wouldn't tell us anything about the binary file being stored on disk.

So I would take it that "The system cannot find the drive specified" is detected first by the client, and causes that to crash: the subsequent CPDN exit would be a result of that, not caused by it.
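To illustrate the general idea of testing whether a process with a known PID is still alive, here is a minimal Python sketch. It is not a transcription of the BOINC API code linked above, which I have not reproduced here; it only shows the common POSIX technique of sending signal 0, which performs the existence and permission checks without delivering a signal. The function name is invented, and the sketch deliberately refuses to run on Windows, where `os.kill` behaves differently.

```python
import os


def pid_exists(pid: int) -> bool:
    """Report whether a process with this PID currently exists (POSIX only)."""
    if os.name != "posix":
        raise NotImplementedError("this sketch relies on POSIX signal semantics")
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    try:
        os.kill(pid, 0)  # signal 0: check only, nothing is delivered
    except ProcessLookupError:
        return False  # no process has this PID
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


# The interpreter running this check certainly exists.
print(pid_exists(os.getpid()))
```

A check of this kind says nothing about why the process went away, which fits the reading above: the message records that the client has gone, not what caused it to go.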
ID: 69984
Bonsai911
Joined: 9 Sep 04
Posts: 228
Credit: 30,292,252
RAC: 2,561
Message 69985 - Posted: 24 Oct 2023, 12:58:57 UTC - in response to Message 69983.  

[quote]

Perhaps look at the drive's SMART diagnostics to check what's going on with it.

<![CDATA[
<message>
The system cannot find the drive specified.
 (0xf) - exit code 15 (0xf)</message>
<stderr_txt>
10:46:05 (12540): BOINC client no longer exists - exiting
10:46:05 (12540): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...



The same (see the output above) occurred here on my computer, but I don't think SMART diagnostics will help at all,
because it happened on my newest (in use for less than one year) and so-far error-free solid-state drive.
I'm also monitoring my drives with three real-time programs, and none has reported an error on any drive so far.
ID: 69985
Glenn Carver
Joined: 29 Oct 17
Posts: 809
Credit: 13,609,322
RAC: 5,221
Message 69986 - Posted: 24 Oct 2023, 16:31:59 UTC - in response to Message 69985.  
Last modified: 24 Oct 2023, 16:32:56 UTC

The same (see the output above) occurred here on my computer, but I don't think SMART diagnostics will help at all,
because it happened on my newest (in use for less than one year) and so-far error-free solid-state drive.
I'm also monitoring my drives with three real-time programs, and none has reported an error on any drive so far.
The error "The system cannot find the drive specified" is coming from Windows and points to a problem accessing the drive for whatever reason. It's probably intermittent.

A Google search on the error message shows plenty of hits with various suggestions for why it's happening and for remedies (device drivers, virus checkers, etc.), e.g. https://www.thewindowsclub.com/the-system-cannot-find-the-drive-specified-fixed
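For reference, the number in the task output can be matched against the Windows system error codes, where 15 (0xF) is ERROR_INVALID_DRIVE, the code behind "The system cannot find the drive specified". The Python sketch below pulls the exit code out of a message in the format shown in this thread and looks it up in a small hand-written table. The table covers only a few common codes, the function name is made up, and treating the task's exit code as a Windows system error code is an assumption based on the wording of the message.

```python
import re

# A handful of Windows system error codes; this is not a complete list.
WINDOWS_ERRORS = {
    2: "ERROR_FILE_NOT_FOUND",
    3: "ERROR_PATH_NOT_FOUND",
    5: "ERROR_ACCESS_DENIED",
    15: "ERROR_INVALID_DRIVE",
    21: "ERROR_NOT_READY",
}


def parse_exit_code(message: str):
    """Extract the decimal exit code from text like 'exit code 15 (0xf)'."""
    match = re.search(r"exit code (\d+) \(0x([0-9a-fA-F]+)\)", message)
    if match is None:
        return None
    code = int(match.group(1))
    if code != int(match.group(2), 16):
        raise ValueError("decimal and hexadecimal exit codes disagree")
    return code


sample = "The system cannot find the drive specified.\n (0xf) - exit code 15 (0xf)"
code = parse_exit_code(sample)
print(code, WINDOWS_ERRORS.get(code, "unknown"))
```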
ID: 69986
Ivorget
Joined: 23 Feb 05
Posts: 7
Credit: 1,423,261
RAC: 213
Message 69987 - Posted: 25 Oct 2023, 5:41:21 UTC - in response to Message 69902.  

Speaking of one year deadlines, I've just been handed the HadSM4 task below from last November that has finally timed out from the original BOINCer. Does anyone know whether it is still of use or would it just be a waste of electricity?

https://www.cpdn.org/workunit.php?wuid=12154819
ID: 69987

©2024 climateprediction.net