Posts by rob

21) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69925)
Posted 17 Oct 2023 by rob
Post:
I had one like that a short time ago. After digging through the log file it was fairly obvious that the BOINC client does get "somewhat confused" periodically and counts packets sent (but not acknowledged) as having arrived safely, so they are added to the total transmitted. In my case a subsequent re-try reset the figure to zero, then to a more accurate value.
22) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69915)
Posted 17 Oct 2023 by rob
Post:
Thanks Glenn.
Sadly, either this change is going to take some time to actually improve the situation or it wasn't the complete solution. I've still got three zip files failing at every retry:

wah2_eas25_a0uz_199012_24_996_012224663_2_r735015961_1.zip - 79.36% after 14:26 transfer time
wah2_eas25_a4ml_20142_24_996_012229545_2_r1812486379_8.zip - 47.67% after 15:20 transfer time
wah2_eas25_a4ml_20142_24_996_012229545_2_r1812486379_3.zip - 1.40% after 35:02 transfer time

Both tasks are still running. The first one, wah2_eas25_a0uz_199012_24_996, is due to finish in about 4.5 days, and wah2_eas25_a4ml_20142_24_996 in just under 2 days.
Both my other tasks are uploading their zips in a timely manner, but even these can take a couple of retries (or, should that be a couple of retries?).
{edit to add}
The situation has gone backwards - all new uploads are descending rapidly into the re-try cycle, so the situation is certainly no better than it was before.
23) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69898)
Posted 16 Oct 2023 by rob
Post:
While increasing the number of connections should help by reducing the number of "first time stalls", there appears to be an issue that's leading to tasks with high re-try counts. The symptom is that once a task reaches a certain number of re-tries it becomes increasingly probable that it will fail on its next attempt. So, thinking out loud here, is the time-out before declaring a failure too short for the "find this zip" time?
24) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69866)
Posted 15 Oct 2023 by rob
Post:
A question (which has probably been asked a few times already)
Does it matter if a low number zip file is stuck in the can't upload cycle, but higher number zips for the same task have escaped the cycle and have been safely uploaded?
25) Message boards : Number crunching : New work discussion - 2 (Message 69858)
Posted 14 Oct 2023 by rob
Post:
A question - Are the "24 hours" you refer to in your post 24 hours as measured by the clock on my wall, or the time the simulation represents?
26) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69820)
Posted 13 Oct 2023 by rob
Post:
This was after an automatic retry rather than a manually initiated one - delay was around 7 minutes, so a reasonable time for the release to take place.

[edit to add] The actual time between previous failure and this report was 14 minutes, with a ~14 minute back-off initiated.
27) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69817)
Posted 13 Oct 2023 by rob
Post:
This is a new one on me:

13/10/2023 14:21:20 | climateprediction.net | Started upload of wah2_eas25_a4ml_201412_24_996_012229545_2_r1812486379_6.zip
13/10/2023 14:21:21 |  | [http_xfer] [ID#178] HTTP: wrote 191 bytes
13/10/2023 14:21:22 | climateprediction.net | [error] Error reported by file upload server: [wah2_eas25_a4ml_201412_24_996_012229545_2_r1812486379_6.zip] locked by file_upload_handler PID=4155647


Not sure what to make of it, but to me it suggests that the upload server is not releasing lock files in a timely manner.
28) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69801)
Posted 13 Oct 2023 by rob
Post:
Well....
I'm running two or three tasks, and for the last day or so have been living with at least one upload stuck in the loop.
29) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69787)
Posted 12 Oct 2023 by rob
Post:
Ah, now a new message:
12/10/2023 17:44:57 | climateprediction.net | Started upload of wah2_eas25_a21e_199812_24_996_012226190_0_r1771898375_21.zip
12/10/2023 17:45:04 |  | [http_xfer] [ID#125] HTTP: wrote 99 bytes
12/10/2023 17:45:05 |  | [http_xfer] [ID#126] HTTP: wrote 192 bytes
12/10/2023 17:45:05 | climateprediction.net | [error] Error reported by file upload server: [wah2_eas25_a21e_199812_24_996_012226190_0_r1771898375_21.zip] locked by file_upload_handler PID=4075002
12/10/2023 17:45:05 | climateprediction.net | Temporarily failed upload of wah2_eas25_a21e_199812_24_996_012226190_0_r1771898375_21.zip: transient upload error
12/10/2023 17:45:05 | climateprediction.net | Backing off 04:07:02 on upload of wah2_eas25_a21e_199812_24_996_012226190_0_r1771898375_21.zip


Not seen this bit before:
12/10/2023 17:45:05 | climateprediction.net | [error] Error reported by file upload server: [wah2_eas25_a21e_199812_24_996_012226190_0_r1771898375_21.zip] locked by file_upload_handler PID=4075002
12/10/2023 17:45:05 | climateprediction.net | Temporarily failed upload of wah2_eas25_a21e_199812_24_996_012226190_0_r1771898375_21.zip: transient upload error
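
For anyone wanting to count how often this "locked by file_upload_handler" error turns up, here is a minimal Python sketch that scans event log lines for it. It assumes the log has been saved as text in the same "date time | project | message" layout as the excerpt above; the function name find_lock_errors is just an illustrative choice, not anything provided by BOINC.

```python
import re
from datetime import datetime

# Line layout taken from the event log excerpt above:
#   DD/MM/YYYY HH:MM:SS | project | message
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}) \| ([^|]*) \| (.*)$")

# The server-side lock message, as it appears in the excerpt.
LOCK_RE = re.compile(
    r"\[(?P<file>[^\]]+\.zip)\] locked by file_upload_handler PID=(?P<pid>\d+)"
)


def find_lock_errors(lines):
    """Return (timestamp, zip name, server PID) for each 'locked by' error."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not an event log line in the expected layout
        lock = LOCK_RE.search(m.group(3))
        if lock:
            when = datetime.strptime(m.group(1), "%d/%m/%Y %H:%M:%S")
            hits.append((when, lock.group("file"), int(lock.group("pid"))))
    return hits
```

Feeding it the six lines quoted above should give a single hit, for the _21.zip file with PID 4075002. Seeing the same PID against the same file over several retries would point towards a lock that is not being released.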

30) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69786)
Posted 12 Oct 2023 by rob
Post:
Until about two hours ago all my transfers were going through on their first try. Then I had three files of ~94MB to transfer and everything stopped. I was able to move one out of the queue, but the other two are now in an eternal re-try cycle. The event log shows most attempts are for a number of ~16 KB blocks:
12/10/2023 15:31:43 | climateprediction.net | Temporarily failed upload of wah2_eas25_a2kg_200112_24_996_012226876_1_r917371817_8.zip: transient HTTP error
12/10/2023 15:31:43 | climateprediction.net | Backing off 00:02:31 on upload of wah2_eas25_a2kg_200112_24_996_012226876_1_r917371817_8.zip
12/10/2023 15:31:43 |  | [http_xfer] [ID#0] HTTP: wrote 16384 bytes
12/10/2023 15:31:43 |  | [http_xfer] [ID#0] HTTP: wrote 16384 bytes
12/10/2023 15:31:43 |  | [http_xfer] [ID#0] HTTP: wrote 16384 bytes
12/10/2023 15:31:43 |  | [http_xfer] [ID#0] HTTP: wrote 16384 bytes
12/10/2023 15:31:43 |  | [http_xfer] [ID#0] HTTP: wrote 16384 bytes
12/10/2023 15:31:43 |  | [http_xfer] [ID#0] HTTP: wrote 10694 bytes
12/10/2023 15:31:44 |  | Internet access OK - project servers may be temporarily down.


Great "fun", but not much help. I would guess that this is a server-side problem since it affects a fair number of users from various parts of the world.
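
To see how much data an attempt actually pushed out before failing, the "wrote N bytes" lines can be totalled per transfer ID. The following is a rough Python sketch, assuming the http_xfer debug lines look like the ones quoted above; the function name is a made-up one for illustration.

```python
import re

# Matches the http_xfer debug lines as they appear in the excerpt above.
WROTE_RE = re.compile(r"\[http_xfer\] \[ID#(\d+)\] HTTP: wrote (\d+) bytes")


def bytes_written_per_transfer(lines):
    """Sum the 'wrote N bytes' figures for each transfer ID."""
    totals = {}
    for line in lines:
        m = WROTE_RE.search(line)
        if m:
            xfer_id = int(m.group(1))
            totals[xfer_id] = totals.get(xfer_id, 0) + int(m.group(2))
    return totals
```

For the excerpt above this comes to 5 x 16384 + 10694 = 92614 bytes for ID#0, which is only about 0.1% of a ~94MB file (taking that as roughly 94,000,000 bytes). Bear in mind these are bytes the client wrote, not bytes the server acknowledged.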
31) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69778)
Posted 12 Oct 2023 by rob
Post:
While I was out last night another signal 11 arrived and departed.
https://www.cpdn.org/result.php?resultid=22346439

At the same time three other tasks continued the long plod towards completion.

Haul of failure since 5th October
SIGNAL 11 = 6 (runtime ~ 2 minutes)
"restart" failure/ signal 11 = 6 (runtime >3 minutes)
(One of these https://www.cpdn.org/result.php?resultid=22337980 was not associated with a shutdown/restart cycle, but failed ~20 minutes after first start.)

Only 3 tasks of the 15 received have any chance of reaching completion, I'll keep the PC (and BOINC) running until they have finished.
32) Questions and Answers : Windows : Computation error when BOINC halts (Message 69753)
Posted 10 Oct 2023 by rob
Post:
A couple of things.
First, the initial estimates of task duration are often very pessimistic, but in time they get a bit better. Do a quick calculation yourself of the time left to run; once the progress has got beyond about 10% your "guess" will be a lot more accurate than BOINC's.

Second, as these are recent tasks they will belong to the 996 batch. There's a thread running about a couple of issues with these tasks, but no solutions have arrived yet (apart from not shutting down to avoid the "fails on restart" type error, which is a real pain for those that suffer Windows forced reboots, or power cuts, or shut-down at night, or suspend to do something else...).
https://www.cpdn.org/cpdnboinc/forum_thread.php?id=9222
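
The quick calculation of time left to run can be sketched in a few lines of Python. This assumes the task progresses at a roughly constant rate, which is only an approximation; the function name and the figures in the usage note are made up for illustration.

```python
def estimate_remaining_hours(elapsed_hours, fraction_done):
    """Linear estimate: remaining = elapsed * (1 - done) / done."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be greater than 0 and at most 1")
    return elapsed_hours * (1 - fraction_done) / fraction_done
```

For example, a task that has run for 12 hours and shows 25% progress would have about estimate_remaining_hours(12, 0.25) = 36 hours left on this basis.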
33) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69750)
Posted 10 Oct 2023 by rob
Post:
Thanks - I'll put that idea in the red-herring bin
34) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69721)
Posted 9 Oct 2023 by rob
Post:
This might be a red herring, but....
All three .exe files associated with the current wah2 tasks are 32-bit. My thought is that the current batch of eas25 tasks (batch 996) covers a large (geographic?) area, and someone suggested that one of the problems may be that there is an overflow in an array, which causes the task to crash in the first few minutes of execution. Could this be solved by compiling the application in 64-bit mode - or is my herring really red?
Likewise the apparently random task crashes mid-run that a few have seen might be another array exceeding its (32-bit) bounds?
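
For anyone wanting to check the 32-bit observation on their own machine, the bitness of a Windows .exe can be read from its PE header. This is a minimal Python sketch, not a full PE parser: it only looks at the Machine field of the COFF header and only names the two common x86 values. The function name pe_bitness is an illustrative choice.

```python
import struct

# Machine field values from the PE/COFF format for the two common x86 targets.
MACHINE_NAMES = {0x014C: "32-bit (x86)", 0x8664: "64-bit (x86-64)"}


def pe_bitness(data: bytes) -> str:
    """Report the target machine of a Windows executable from its raw bytes."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE executable")
    # The 4-byte offset of the PE header is stored at 0x3C in the DOS header.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The 2-byte Machine field follows the 4-byte PE signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, f"unknown machine type 0x{machine:04x}")
```

It would be used by reading one of the wah2 .exe files from the project folder in binary mode and passing the bytes in, e.g. pe_bitness(open(path, "rb").read()), where path is wherever your BOINC data directory keeps the project files.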
35) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69696)
Posted 8 Oct 2023 by rob
Post:
Left the PC on overnight.
The task that was running last night is still running this morning.
However, a new task arrived and promptly crashed (less than 2 minutes of running) with a segment violation error:
https://www.cpdn.org/result.php?resultid=22344887
So it looks as if that problem, while reduced, is still around.
36) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69693)
Posted 7 Oct 2023 by rob
Post:
Thanks Glenn & Dave.
All but one of the failures were after shutting down for the night. It's somewhat reassuring that it's not my computer that's got an issue, but it's a bit disappointing that the stop/restart issue hasn't been fully cured yet.
I'll see what happens if I leave it on overnight and report back tomorrow.
37) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69685)
Posted 7 Oct 2023 by rob
Post:
After some rejoicing the other day at getting 8 from this batch I'm now somewhat less happy - all but one have failed :-(
Can someone who understands please have a look at the failed tasks (should be easy as I only have one active cruncher) - I suspect there's something amiss with the way it is set up.
Thanks in advance.
38) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69673)
Posted 6 Oct 2023 by rob
Post:
Good idea - and I see someone has already done it, thanks to whoever that was :-)
39) Message boards : Number crunching : Batch 996 Weather@Home2 East Asia25 (Message 69666)
Posted 5 Oct 2023 by rob
Post:
I just picked up 8 from this batch, and was just about to celebrate all 8 behaving properly.....
However wah2_eas25_a2qn_200212_24_996_012227099 failed with a computation error after 2:39
All the others have passed this time and are plodding on, hopefully for the next few days and to completion.
40) Questions and Answers : Windows : Computation error when BOINC halts (Message 69357)
Posted 18 Jul 2023 by rob
Post:
Suspending and halting (stopping) are not the same - The safer option is to halt the processing, which forces the "resume" file to be written instantly to disk; suspend on the other hand may not even produce a resume file (worst case), or will defer its creation for some time.

As for Windows automatic updates - they are an absolute pain, and should be blocked - others have suggested ways of doing this.


©2024 climateprediction.net