climateprediction.net home page
Batch 996 Weather@Home2 East Asia25

Glenn Carver
Joined: 29 Oct 17
Posts: 814
Credit: 13,660,679
RAC: 8,611
Message 69710 - Posted: 9 Oct 2023, 10:49:11 UTC - in response to Message 69708.  

Definitely do not abort the transfers. I raised this with Andy at the meeting this morning, and he will check. The server is definitely up and running.

I've personally had uploads which take 12 retries before eventually getting through. Might be congestion at the Korean site.

Should I just abort the transfer or keep my fingers crossed that it will go at some point?
I would keep them at least till Glenn reports back from the meeting tomorrow morning.
ID: 69710
zombie67 [MM]
Joined: 2 Oct 06
Posts: 52
Credit: 26,209,214
RAC: 3,355
Message 69711 - Posted: 9 Oct 2023, 11:49:55 UTC
Last modified: 9 Oct 2023, 12:27:06 UTC

FWIW, I have 7 zips that cannot upload. "transient HTTP error"
ID: 69711
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 69712 - Posted: 9 Oct 2023, 15:46:34 UTC - in response to Message 69711.  

FWIW, I have 7 zips that cannot upload. "transient HTTP error"


I have three tasks running and have had no trouble uploading zip files. Each has uploaded seven .zip files.

Here is one of them:

Task 22340449
Name 	wah2_eas25_a3fh_200712_24_996_012227993_0
Workunit 	12227993
Created 	5 Oct 2023, 16:02:19 UTC
Sent 	5 Oct 2023, 16:38:36 UTC
Report deadline 	16 Oct 2024, 21:58:36 UTC
Received 	---
Server state 	In progress
Outcome 	---
Client state 	New
Exit status 	0 (0x00000000)
Computer ID 	1512658
Run time 	
CPU time 	
Validate state 	Initial
Credit 	5,819.81
Device peak FLOPS 	4.23 GFLOPS
Application version 	Weather At Home 2 (wah2) v8.24 windows_intelx86
Stderr 	

--

Latest Trickles Received
Time Sent (UTC)       Host ID  Result ID  Result Name                                Timestep  CPU Time (sec)  Average (sec/TS)
09 Oct 2023 06:04:24  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  80,939    306,949         3.7923
08 Oct 2023 17:21:55  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  69,419    261,246         3.7633
08 Oct 2023 05:04:43  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  57,899    217,100         3.7496
07 Oct 2023 16:52:41  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  46,379    173,219         3.7349
07 Oct 2023 04:53:35  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  34,859    130,196         3.7349
06 Oct 2023 16:55:50  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  23,339    87,178          3.7353
06 Oct 2023 05:00:33  1512658  22340449   wah2_eas25_a3fh_200712_24_996_012227993_0  11,819    44,353          3.7527
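
For anyone wanting to check their own trickles: the Average (sec/TS) column appears to be just the cumulative CPU time divided by the timestep reached. The sketch below assumes that reading and uses the figures from the trickle list for this task. The function names are made up for illustration and are not part of any CPDN or BOINC tool. It also works out the rate between consecutive trickles, which shows a recent slowdown sooner than the cumulative average does.

```python
# Trickle rows for this task, oldest first: (timestep, cumulative CPU seconds).
trickles = [
    (11_819, 44_353),
    (23_339, 87_178),
    (34_859, 130_196),
    (46_379, 173_219),
    (57_899, 217_100),
    (69_419, 261_246),
    (80_939, 306_949),
]


def average_sec_per_ts(timestep, cpu_seconds):
    """Cumulative average: total CPU seconds divided by timesteps completed."""
    return cpu_seconds / timestep


def interval_sec_per_ts(rows):
    """Seconds per timestep between each pair of consecutive trickles."""
    return [
        (cpu1 - cpu0) / (ts1 - ts0)
        for (ts0, cpu0), (ts1, cpu1) in zip(rows, rows[1:])
    ]


for ts, cpu in trickles:
    print(f"{ts:>7,}  {cpu:>8,}  {average_sec_per_ts(ts, cpu):.4f}")

print([round(rate, 4) for rate in interval_sec_per_ts(trickles)])
```

The first loop should match the Average column to four decimal places if the assumption about how it is calculated holds.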

ID: 69712
Glenn Carver
Joined: 29 Oct 17
Posts: 814
Credit: 13,660,679
RAC: 8,611
Message 69713 - Posted: 9 Oct 2023, 16:14:41 UTC - in response to Message 69711.  

FWIW, I have 7 zips that cannot upload. "transient HTTP error"
Andy's just informed me that he's restarted the httpd server on the Korean machine. It was running & not out of space, but there were rather a lot of uploads and most likely stale connections. Hope that's got stuck uploads moving again.

If it misbehaves again, pls post it here.
ID: 69713
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4349
Credit: 16,551,032
RAC: 4,328
Message 69714 - Posted: 9 Oct 2023, 16:15:49 UTC
Last modified: 9 Oct 2023, 16:16:16 UTC

No issues uploading zips so far here, but this one failed after uploading nine zips with:

<![CDATA[
<message>
Invalid drive.
 (0xf) - exit code 15 (0xf)</message>
<stderr_txt>
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
Suspended CPDN Monitor - Suspend request from BOINC...
Suspended CPDN Monitor - Quit request from BOINC...
No Process Handle
Global Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=920, selfPID=920, iMonCtr=1
No Process Handle
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=920, selfPID=936, iMonCtr=1

</stderr_txt>
This is running the client under WINE. I suspect tasks under Windows in a VM will fail too, based on the failures in previous attempts. Exit code 15, as I read it, means the process has been requested to exit gracefully!
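
A note on why the number 15 can be read two ways. As a Unix signal number, 15 is SIGTERM, a polite request to terminate. As a Windows system error code, 15 is ERROR_INVALID_DRIVE, which lines up with the "Invalid drive" text in the log above. The sketch below is only an illustration of that ambiguity. The small lookup table is an assumption copied from the values in the Windows SDK header winerror.h, and `describe_exit_code` is a hypothetical helper. It does not show how BOINC itself builds the message.

```python
import signal

# Assumption: a hand-copied subset of Windows system error codes (winerror.h).
WINDOWS_SYSTEM_ERRORS = {
    2: "ERROR_FILE_NOT_FOUND",
    3: "ERROR_PATH_NOT_FOUND",
    15: "ERROR_INVALID_DRIVE",
}


def describe_exit_code(code):
    """Return both possible readings of a small integer exit code."""
    try:
        # signal.Signals maps a signal number to its symbolic name.
        posix = signal.Signals(code).name
    except ValueError:
        posix = "no signal with this number"
    windows = WINDOWS_SYSTEM_ERRORS.get(code, "not in this table")
    return {"posix_signal": posix, "windows_error": windows}


print(describe_exit_code(15))
```

Which reading applies depends on what produced the code, so the same number from a Windows application and from a Linux process can mean quite different things.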
ID: 69714
Glenn Carver
Joined: 29 Oct 17
Posts: 814
Credit: 13,660,679
RAC: 8,611
Message 69717 - Posted: 9 Oct 2023, 18:12:06 UTC - in response to Message 69714.  

Funnily enough, I was looking over the hard fail workunits last couple of days and I've seen multiple tasks failing with that kind of error 'invalid device/drive, device not found'. I am starting to wonder if it's task related rather than just host specific. But I'd need to trawl through the logs of all the fails after the batch to see how prevalent it is to be sure. I've never seen that kind of error with previous batches.
ID: 69717
ChelseaOilman
Joined: 24 Dec 19
Posts: 28
Credit: 28,019,522
RAC: 196,126
Message 69719 - Posted: 9 Oct 2023, 20:07:01 UTC

Add me to the list of people having issues uploading zip files. Sometimes retry works, most of the time it doesn't.
ID: 69719
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4349
Credit: 16,551,032
RAC: 4,328
Message 69720 - Posted: 9 Oct 2023, 20:43:21 UTC - in response to Message 69717.  

I've never seen that kind of error with previous batches.


A new one on me which is why I posted.
ID: 69720
rob
Joined: 5 Jun 09
Posts: 79
Credit: 3,041,876
RAC: 3,671
Message 69721 - Posted: 9 Oct 2023, 21:26:50 UTC

This might be a red herring, but....
All three .exe files associated with the current wah2 tasks are 32-bit, so my thought is that the current batch of eas25 tasks (batch 996) cover a large (geographic?) area, and someone suggested that one of the problems may be that there is an overflow in an array, and this causes the task to crash in the first few minutes of execution. Could this be solved by compiling the application in 64-bit mode - or is my herring really red?
Likewise the apparently random task crashes mid-run that a few have seen might be another array exceeding its (32-bit) bounds?
ID: 69721
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2169
Credit: 64,555,907
RAC: 5,858
Message 69722 - Posted: 9 Oct 2023, 22:55:14 UTC - in response to Message 69717.  

Funnily enough, I was looking over the hard fail workunits last couple of days and I've seen multiple tasks failing with that kind of error 'invalid device/drive, device not found'. I am starting to wonder if it's task related rather than just host specific. But I'd need to trawl through the logs of all the fails after the batch to see how prevalent it is to be sure. I've never seen that kind of error with previous batches.

These have happened occasionally in the past for me with, I believe, the hadam4/h model series. The best I can figure is it happens when lots of disk writes are occurring with multiple models, like when all the models are essentially in sync with each other and saving files, or finishing the model at the same time. I haven't had one for a long time though. When I'm running one or two models at a time, I've never seen it.
ID: 69722
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2169
Credit: 64,555,907
RAC: 5,858
Message 69723 - Posted: 9 Oct 2023, 23:12:04 UTC - in response to Message 69722.  

Looking back on the message board for

drive not specified

with Search limits set to no limit, Iain Inglis had some ideas about that error message. It goes back farther than the hadam4 models, and may have been on Windows tasks instead. My memory isn't performing too well today.
ID: 69723
ChelseaOilman
Joined: 24 Dec 19
Posts: 28
Credit: 28,019,522
RAC: 196,126
Message 69724 - Posted: 9 Oct 2023, 23:23:06 UTC


ID: 69724
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4349
Credit: 16,551,032
RAC: 4,328
Message 69727 - Posted: 10 Oct 2023, 5:32:59 UTC - in response to Message 69721.  
Last modified: 10 Oct 2023, 5:44:00 UTC

This might be a red herring, but....
All three .exe files associated with the current wah2 tasks are 32-bit
Not sure. I do know that going 64-bit has been suggested before, and it would spare those of us who have taken the pledge from having to install 32-bit libraries. Having looked at the old thread, the suggestion that BOINC is translating a FORTRAN error into a Windows error description makes sense. Perhaps something to look at on the BOINC fora, or to raise as an issue on GitHub?
ID: 69727
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4349
Credit: 16,551,032
RAC: 4,328
Message 69730 - Posted: 10 Oct 2023, 6:37:11 UTC
Last modified: 10 Oct 2023, 7:06:29 UTC

The researcher in Korea reports files seem to be uploading normally. However, after over 70 zips going through without issue, I have got one that is being stubborn at the moment. This could be a bandwidth issue in which case all should clear eventually but before that happens some might run into problems with BOINC limits or disk space. I will have to keep an eye on my VM and pause processing if this becomes an issue.

Edit: after over an hour of refusing to budge, another click on the retry now button and it has cleared.
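
For readers wondering why a stuck upload can sit for an hour and then go through on a manual retry: clients that retry failed transfers usually wait longer after each failure, with some randomness so that many hosts do not all hit the server at once. The sketch below shows that general pattern (exponential backoff with jitter). It is not BOINC's actual transfer code. `TransientError`, `upload_with_retry`, and the 60 second base and 1 hour cap are all assumptions chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a 'transient HTTP error'."""


def upload_with_retry(send, max_attempts=12, base=60.0, cap=3600.0,
                      sleep=time.sleep, rng=random.random):
    """Call send() until it succeeds; return the number of attempts used.

    After each failure the maximum wait doubles (up to cap), and the
    actual wait is a random fraction of that maximum.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            send()
            return attempt
        except TransientError:
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            ceiling = min(cap, base * 2 ** (attempt - 1))
            sleep(rng() * ceiling)


# Demo with a fake sleep and a fixed "random" value, so nothing really waits.
waits = []
outcomes = iter([True, True, True, False])  # True means this attempt fails


def flaky_send():
    if next(outcomes):
        raise TransientError("transient HTTP error")


attempts = upload_with_retry(flaky_send, sleep=waits.append, rng=lambda: 1.0)
print(attempts, waits)
```

With a scheme like this, a run of failures pushes the next automatic attempt further and further out, which is one reason clicking retry by hand can clear a transfer sooner once the server has recovered.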
ID: 69730
Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,348,769
RAC: 10,526
Message 69731 - Posted: 10 Oct 2023, 7:39:34 UTC

I'm continuing to monitor the upload duration of the .zip files, all of which are about the same size.

I was slightly surprised to see a restart file being uploaded at the same time as .zip_12 - that one's about 10% bigger. From memory, only one restart file is specified per task (I'll check later), so there won't be one on task completion. But there will be a surge of data for the upload server as users reach the mid-point of the run (assuming the restart files go to the same server - I'll check that too).
ID: 69731
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4349
Credit: 16,551,032
RAC: 4,328
Message 69732 - Posted: 10 Oct 2023, 7:45:39 UTC - in response to Message 69731.  

I was slightly surprised to see a restart file being uploaded at the same time as .zip_12


Pretty sure I first noticed that on testing for the previous batch. Also on previous main site batch.
ID: 69732
Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,348,769
RAC: 10,526
Message 69733 - Posted: 10 Oct 2023, 10:41:23 UTC - in response to Message 69732.  

OK, surprise resolved. There is only one restart.zip, but there's also an out.zip, which I assume will be sent at the very end - that makes sense.
ID: 69733
Glenn Carver
Joined: 29 Oct 17
Posts: 814
Credit: 13,660,679
RAC: 8,611
Message 69734 - Posted: 10 Oct 2023, 11:31:09 UTC - in response to Message 69721.  

It's a red herring. All 3 executables are supposed to be 32-bit. The problem with fails after restarts isn't anything to do with 32-bit array sizes (and compiling for 64-bit is not easy, as the apps rely on 32-bit addressing for some shared memory ops).

I'm not going into details but the problem is related to the communication between the global & regional models - we have a pretty good idea what's causing it. It's not an easy fix though as the model doesn't have much control over the computing environment it's running in.

This might be a red herring, but....
All three .exe files associated with the current wah2 tasks are 32-bit, so my thought is that the current batch of eas25 tasks (batch 996) cover a large (geographic?) area, and someone suggested that one of the problems may be that there is an overflow in an array, and this causes the task to crash in the first few minutes of execution. Could this be solved by compiling the application in 64-bit mode - or is my herring really red?
Likewise the apparently random task crashes mid-run that a few have seen might be another array exceeding its (32-bit) bounds?

---
CPDN Visiting Scientist
ID: 69734
Glenn Carver
Joined: 29 Oct 17
Posts: 814
Credit: 13,660,679
RAC: 8,611
Message 69735 - Posted: 10 Oct 2023, 11:44:41 UTC - in response to Message 69723.  

I did try a Google search but it didn't return anything useful. I've emailed the CPDN folk to see if they recognise it.

To me, that suggests the hardware is starting to fail. Might be time to check the drive health? But good to know this is not a new issue, thanks for that.

Looking back on the message board for
drive not specified
with Search limits set to no limit, Iain Inglis had some ideas about that error message. It goes back farther than the hadam4 models, and may have been on Windows tasks instead. My memory isn't performing too well today.

---
CPDN Visiting Scientist
ID: 69735
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2169
Credit: 64,555,907
RAC: 5,858
Message 69736 - Posted: 10 Oct 2023, 12:49:43 UTC - in response to Message 69735.  

@Glenn

I should have said

using the Advanced search link at the top of the forum,

that is how you would get to the search I was talking about.
ID: 69736

©2024 climateprediction.net