climateprediction.net home page
Posts by Thyme Lawn

1) Message boards : Number crunching : NZ25 file upload server problems? (Message 65887)
Posted 19 Aug 2022 by Thyme Lawn
Post:
Somehow managed to sneak another two zips onto the server. Five queued at the moment.

I had 18 files queued up, with all but one uploaded in a 2-hour window starting at 1424 UTC (average upload time slightly under 4 minutes). The last file was backed off and successfully uploaded on a manual retry 15 minutes after the 17th upload was completed, but it took significantly longer than the others (10 minutes).
2) Message boards : Number crunching : NZ25 file upload server problems? (Message 65885)
Posted 19 Aug 2022 by Thyme Lawn
Post:
2) Error reported by file upload server: EOF on socket read : asked for 262144, got 133564

I've been puzzling over that one. Most of the concern has been over the intermediate .zip data files, which are typically tens of megabytes or even a hundred megabytes in size. So why is the upload server quibbling over a mere hundred kilobytes in a file containing just a quarter of a megabyte?

Nothing to puzzle over, Richard. The transfer process transparently splits large uploads into smaller packets. The transfer uses 4 different communication layers, with the relevant ones being 2 to 4 (layer 1 is the physical wire):

    2. The maximum frame size for the Ethernet layer is 1518 bytes but, in practice, the packet size is typically between 1476 and 1500 bytes.
    3. The maximum packet size for the IPv4 TCP layer is 64 KB.
    4. The maximum block size for the BOINC data transfer layer is 256 KB, which is where the 262144 comes from (256 x 1024 bytes).

The socket read error is saying that the upload server received just over half of an expected 256 KB file segment before the upload failed (130 KB plus 444 bytes were in the received packets).
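
As a quick check of that arithmetic, here is a minimal Python sketch. The block size and the two byte counts are taken from the error message; the function name describe_short_read is made up for illustration and is not part of BOINC.

```python
# Size of one BOINC upload block in bytes: 256 KB.
BLOCK_SIZE = 256 * 1024


def describe_short_read(asked: int, got: int) -> dict:
    """Break a short socket read down into the figures quoted in the post."""
    whole_kb, extra_bytes = divmod(got, 1024)
    return {
        "fraction_received": got / asked,  # share of the block that arrived
        "whole_kb": whole_kb,              # complete kilobytes received
        "extra_bytes": extra_bytes,        # bytes left over after whole KB
        "missing_bytes": asked - got,      # bytes that never arrived
    }


if __name__ == "__main__":
    # Values from "EOF on socket read : asked for 262144, got 133564"
    info = describe_short_read(262144, 133564)
    print(BLOCK_SIZE)
    print(info)
```

Running it with the numbers from the error message gives 130 whole KB plus 444 bytes, which is about 51% of the requested block.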

3) Message boards : Number crunching : NZ25 file upload server problems? (Message 65860)
Posted 18 Aug 2022 by Thyme Lawn
Post:
I've noticed the phrase "We are completely uploaded and fine" before, but strangely, it can't be found anywhere in the BOINC codebase. The only place it's found is in event logs quoted in https://github.com/BOINC/boinc/issues/4572, an issue about 'Uploads Stopping for Projects with Large Files' from November last year. He could have been talking about us, but it was another project.

I'm wondering if 'We are completely uploaded and fine' is a message being passed on from curl, BOINC's communications toolbox, which would make it much harder to track down.

I've always assumed that the real meaning was that BOINC had passed everything into a buffer being handled by somebody else (curl?), but didn't necessarily imply that the whole file had been acknowledged by the end user on the other side of the world. In which case, it's a badly-written message.

That message is generated by curl once a request has been completely sent, but it's generated for every individual HTTP request. A CPDN file transfer goes through the following sequence:

    1. An initial negotiation to determine how much of the file the server has already received (file_xfer_debug outputs the line "[fxd] starting upload, upload_offset -1").
    2. There's an "Info: We are completely uploaded and fine" http_debug message when that request has been sent.
    3. The server normally responds with the number of bytes it has already received (typically logged as "[fxd] starting upload, upload_offset 0"). In the messages below, the first attempt resulted in a gateway timeout.
    4. The upload then starts with a request indicating the number of bytes to be sent (the line "18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Sent header to server: Content-Length: 90454919" in the messages below).
    5. This also generates an "Info: We are completely uploaded and fine" http_debug message, even when the upload has failed (as was the case in the messages below).


18-Aug-2022 20:14:27 [climateprediction.net] [fxd] starting upload, upload_offset -1
18-Aug-2022 20:14:27 [climateprediction.net] Started upload of wah2_nz25_a1t1_200105_25_936_012152103_0_r1523319327_20.zip (86.28 MB)
18-Aug-2022 20:14:27 [climateprediction.net] [file_xfer] URL: http://upload4.cpdn.org/cgi-bin/file_upload_handler
18-Aug-2022 20:14:28 [climateprediction.net] [http] [ID#26765] Info:  Connected to upload4.cpdn.org (131.217.169.79) port 80 (#2295)
18-Aug-2022 20:14:28 [climateprediction.net] [http] [ID#26765] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
18-Aug-2022 20:14:28 [climateprediction.net] [http] [ID#26765] Sent header to server: Content-Length: 312
18-Aug-2022 20:14:28 [climateprediction.net] [http] [ID#26765] Info:  We are completely uploaded and fine

18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: HTTP/1.1 504 Gateway Timeout
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <html><head>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <title>504 Gateway Timeout</title>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: </head><body>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <h1>Gateway Timeout</h1>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <p>The gateway did not receive a timely response
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: from the upstream server or application.</p>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <hr>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: <address>Apache/2.4.7 (Ubuntu) Server at upload4.cpdn.org Port 80</address>
18-Aug-2022 20:24:28 [climateprediction.net] [http] [ID#26765] Received header from server: </body></html>
18-Aug-2022 20:24:28 [---] [http_xfer] [ID#26765] HTTP: wrote 328 bytes
18-Aug-2022 20:24:29 [climateprediction.net] [file_xfer] http op done; retval -184 (transient HTTP error)
18-Aug-2022 20:24:29 [climateprediction.net] [file_xfer] file transfer status -184 (transient HTTP error)
18-Aug-2022 20:24:29 [climateprediction.net] Temporarily failed upload of wah2_nz25_a1t1_200105_25_936_012152103_0_r1523319327_20.zip: transient HTTP error
18-Aug-2022 20:24:29 [climateprediction.net] Backing off 00:07:43 on upload of wah2_nz25_a1t1_200105_25_936_012152103_0_r1523319327_20.zip

18-Aug-2022 20:24:29 [climateprediction.net] [fxd] starting upload, upload_offset -1
18-Aug-2022 20:24:29 [climateprediction.net] Started upload of wah2_nz25_a07q_198705_25_936_012150040_1_r1004064792_24.zip (86.26 MB)
18-Aug-2022 20:24:29 [climateprediction.net] [file_xfer] URL: http://upload4.cpdn.org/cgi-bin/file_upload_handler
18-Aug-2022 20:24:30 [climateprediction.net] [http] [ID#26786] Info:  Connected to upload4.cpdn.org (131.217.169.79) port 80 (#2295)
18-Aug-2022 20:24:30 [climateprediction.net] [http] [ID#26786] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
18-Aug-2022 20:24:30 [climateprediction.net] [http] [ID#26786] Sent header to server: Content-Length: 312
18-Aug-2022 20:24:30 [climateprediction.net] [http] [ID#26786] Info:  We are completely uploaded and fine
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:  Connection died, retrying a fresh connect
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:  the ioctl callback returned 0
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:  Closing connection 2295
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:  Issue another request to this URL: 'http://upload4.cpdn.org/cgi-bin/file_upload_handler'
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:    Trying 131.217.169.79...
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:  Connected to upload4.cpdn.org (131.217.169.79) port 80 (#2302)
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Sent header to server: Content-Length: 312
18-Aug-2022 20:24:32 [climateprediction.net] [http] [ID#26786] Info:  We are completely uploaded and fine

18-Aug-2022 20:26:27 [climateprediction.net] [http] [ID#26786] Received header from server: HTTP/1.1 200 OK
18-Aug-2022 20:26:27 [climateprediction.net] [file_xfer] http op done; retval 0 (Success)
18-Aug-2022 20:26:27 [climateprediction.net] [file_xfer] parsing upload response: <data_server_reply>
    <status>0</status>
    <file_size>0</file_size>
</data_server_reply>

18-Aug-2022 20:26:27 [climateprediction.net] [file_xfer] parsing status: 0
18-Aug-2022 20:26:27 [climateprediction.net] [fxd] starting upload, upload_offset 0
18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Sent header to server: POST /cgi-bin/file_upload_handler HTTP/1.1
18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Sent header to server: Content-Length: 90454919
18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Sent header to server: Expect: 100-continue
18-Aug-2022 20:26:28 [climateprediction.net] [http] [ID#26786] Received header from server: HTTP/1.1 100 Continue

18-Aug-2022 20:35:58 [climateprediction.net] [http] [ID#26786] Info:  We are completely uploaded and fine

18-Aug-2022 20:38:26 [climateprediction.net] [http] [ID#26786] Received header from server: HTTP/1.1 200 OK
18-Aug-2022 20:38:26 [climateprediction.net] [http] [ID#26786] Received header from server: Content-Length: 123
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] http op done; retval 0 (Success)
18-Aug-2022 20:38:27 [climateprediction.net] Error reported by file upload server: EOF on socket read : asked for 262144, got 133564
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] parsing upload response: <data_server_reply>
    <status>1</status>
    <message>EOF on socket read : asked for 262144, got 133564
</message>
</data_server_reply>
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] parsing status: -127
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] file transfer status -127 (transient upload error)
18-Aug-2022 20:38:27 [climateprediction.net] Temporarily failed upload of wah2_nz25_a07q_198705_25_936_012150040_1_r1004064792_24.zip: transient upload error
18-Aug-2022 20:38:27 [climateprediction.net] [file_xfer] project-wide xfer delay for 645.092929 sec
18-Aug-2022 20:38:27 [climateprediction.net] Backing off 00:02:09 on upload of wah2_nz25_a07q_198705_25_936_012150040_1_r1004064792_24.zip
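
The server replies quoted in the log can be picked apart with a few lines of Python. This is only a sketch: it treats the reply as well-formed XML and uses helper names (parse_reply, short_read) invented for this example. BOINC's own client uses its own parser, so none of this is the real implementation.

```python
import re
import xml.etree.ElementTree as ET

# Failure reply copied from the event log above.
FAILED_REPLY = """<data_server_reply>
    <status>1</status>
    <message>EOF on socket read : asked for 262144, got 133564
</message>
</data_server_reply>"""

# Reply to the initial size negotiation, also copied from the log.
SIZE_REPLY = """<data_server_reply>
    <status>0</status>
    <file_size>0</file_size>
</data_server_reply>"""

SHORT_READ = re.compile(r"asked for (\d+), got (\d+)")


def parse_reply(text):
    """Return (status, message) from a <data_server_reply> document."""
    root = ET.fromstring(text)
    status = int(root.findtext("status"))
    message = (root.findtext("message") or "").strip()
    return status, message


def short_read(message):
    """Return (asked, got) if the message reports a short read, else None."""
    match = SHORT_READ.search(message)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    for reply in (SIZE_REPLY, FAILED_REPLY):
        status, message = parse_reply(reply)
        print(status, repr(message), short_read(message))
```

The point it illustrates is the one made above: the HTTP request itself came back "200 OK", and the failure only shows up in the status and message inside the reply body.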
4) Message boards : Number crunching : NZ25 file upload server problems? (Message 65814)
Posted 12 Aug 2022 by Thyme Lawn
Post:
My task's third upload has had 7 failed attempts, the first starting at 0409 UTC, the last 20 minutes ago.
5) Message boards : Number crunching : News and Announcements 3 (Message 65194)
Posted 23 Feb 2022 by Thyme Lawn
Post:
Notification from the project team:

Hi All,

Just to let you know that the CPDN project is currently offline. Overnight there has been a power outage on the site where two key servers of the project are based. Engineering IT Support are aware of the issue and will be working to rectify it.

Best wishes,

Andy
6) Message boards : climateprediction.net Science : Myles Allen on The Life Scientific (Message 62133)
Posted 18 Feb 2020 by Thyme Lawn
Post:
CPDN founder Myles Allen was on Jim Al-Khalili's always illuminating BBC Radio 4 programme "The Life Scientific" this morning. Definitely worth a listen.
7) Message boards : Number crunching : Credits (Message 61376)
Posted 24 Oct 2019 by Thyme Lawn
Post:
If CPDN is missing from your list of projects on the credit aggregation sites, you'll need to tick the "Do you consent to exporting your data to BOINC statistics aggregation Web sites?" checkbox on the project preferences page for it to reappear.
8) Message boards : Number crunching : transient HTTP error (Message 59197)
Posted 17 Dec 2018 by Thyme Lawn
Post:
I have been stuck uploading some zips for batch 691 for several days. Transient HTTP error.

As George mentioned, I also have stuck uploads for a task in this batch. The task has completed and everything has been successfully sent to upload6 except for:

  • wah2_cam25_a0fm_200405_18_691_011370111_2_r66429025_13.zip is stuck at 54.19%, first failed at 13:07:05 on 15th, has been retried 24 times and is getting up to 54.41% on each attempt.

  • wah2_cam25_a0fm_200405_18_691_011370111_2_r66429025_16.zip is stuck at 8.56%, first failed at 05:58:34 on 16th, has been retried 15 times and is getting up to 8.68% on each attempt.

In both cases the BOINC event log shows that the file was "locked by file_upload_handler" for 2 hours after the first attempt and gives no indication why subsequent attempts have been failing.

Wireshark shows that retries are successfully negotiating the restart offset and data is being sent before timing out. The same restart point is negotiated on every retry, indicating that none of the retransmitted data is being received.

9) Message boards : Number crunching : Batch 774 (safr50) (Message 59116)
Posted 27 Nov 2018 by Thyme Lawn
Post:
If you installed BOINC to run as a service, the batch 774 tasks seem to get stuck in the initialisation of the regional model instead of crashing out with a Fortran runtime error popup dialog box.

If you are running BOINC as a service, BOINC Manager will show the elapsed time and progress increasing as expected, but if you open the task properties dialog box the CPU time isn't changing from "---". If checkpoint or task debug is enabled, BOINC's event log shows that no checkpoints are being made, and the elapsed time and progress will revert to 0 if you restart BOINC.

If this applies to your system you should abort all of its batch 774 tasks.
10) Message boards : Number crunching : New work Discussion (Message 59115)
Posted 27 Nov 2018 by Thyme Lawn
Post:
BOINC Manager shows the elapsed time and progress increasing as expected, but if you open the task properties dialog box the CPU time isn't changing from "---". If checkpoint or task debug is enabled BOINC's event log shows that no checkpoints are being made. When BOINC is restarted the elapsed time and progress revert to 0.


Do they still use a whole core of cpu time when doing this?

No. SysInternals Process Explorer was showing 2 idle cores of the 8 on my i7 with less than 0.1 seconds of CPU time for both of the batch 774 models' controller + global + regional processes.
11) Message boards : Number crunching : New work Discussion (Message 59108)
Posted 27 Nov 2018 by Thyme Lawn
Post:
@WB8ILI I run BOINC as a service. Yes, it does mean I can't run GPU applications on other projects, but it also means that tasks which fail with a Windows runtime error don't generate the pop-up dialog box you can get in a non-service install.
12) Message boards : Number crunching : New work Discussion (Message 59106)
Posted 27 Nov 2018 by Thyme Lawn
Post:
Batch 774 SAFR50 region. These are using the restart files from batch 741.

If you are running any tasks from this batch please check them because I have 2 batch 774 tasks stuck in the initialisation of the regional model (project team notified).

BOINC Manager shows the elapsed time and progress increasing as expected, but if you open the task properties dialog box the CPU time isn't changing from "---". If checkpoint or task debug is enabled BOINC's event log shows that no checkpoints are being made. When BOINC is restarted the elapsed time and progress revert to 0.
13) Message boards : Number crunching : Stuck upload issue (Message 57759)
Posted 1 Feb 2018 by Thyme Lawn
Post:
The project team are aware of the upload problem for WAH2 SAS50 batch 706, 707 and 708 files. Upload server upload2.cpdn.org has reached its disk quota and they are making more space available.
14) Message boards : News : Welcome to the News message board! (Message 57217)
Posted 27 Oct 2017 by Thyme Lawn
Post:
The credit problem users have been posting about here has been raised with the project team and I have been assured that all of the information required to calculate credit is intact.

The problem is related to the new daily credit script they're trying to phase in. The old weekly script is being re-run and credits should return to the correct level when it's complete.
15) Message boards : Number crunching : Credit Status (Message 57216)
Posted 27 Oct 2017 by Thyme Lawn
Post:
The credit problem has been raised with the project team and I have been assured that all of the information required to calculate credit is intact.

The current problem is related to the new daily credit script they're trying to phase in. The old weekly script is being re-run and credits should return to the correct level when it's complete.

BTW, the total credit for the project dropped from 32,814,617,008 on Wednesday to 10,938,204,204 yesterday ...
16) Message boards : Number crunching : Meet the CPDN team! (Message 56754)
Posted 30 Aug 2017 by Thyme Lawn
Post:
Curiosity Explorer Tickets can be obtained at https://www.eventbrite.co.uk/e/curiosity-carnival-tickets-33578014746
17) Message boards : Number crunching : No trickles on webpage (Message 55676)
Posted 12 Feb 2017 by Thyme Lawn
Post:
It looks like the project team have forgotten to enable trickle processing for batch 514 on the server.
18) Message boards : Number crunching : Project communication failed: attempting access to reference site (Message 54819)
Posted 21 Sep 2016 by Thyme Lawn
Post:
Thanks a lot, Vitalii. Your image makes everything much clearer.

Your stuck file is 105.26 MB, the server has already received 100.99 MB and BOINC can't send the last 4,540,690 bytes of the file. This points conclusively to a file size limit on the server as the cause. The project team have been notified and it should hopefully be sorted out tomorrow.
19) Message boards : Number crunching : Project communication failed: attempting access to reference site (Message 54816)
Posted 21 Sep 2016 by Thyme Lawn
Post:
Andy has increased the HTTP timeout on upload6. Could anyone who has a stuck mex25 upload please check whether that allows it to complete?
20) Message boards : Number crunching : Project communication failed: attempting access to reference site (Message 54808)
Posted 21 Sep 2016 by Thyme Lawn
Post:
The message logs show that 4,540,690 bytes are being transferred for Vitalii's wah2_mex25_c0fh_199012_13_410_010609033_1_5.zip file and 69,646,991 bytes for Lockley's wah2_mex25_c0il_199112_13_410_010609145_0_12.zip file. What are their sizes showing as on BOINC Manager's Transfers tab? If the sizes shown there are larger, it'll indicate that the server has partially accepted the files.


©2024 climateprediction.net