Posts by xii5ku

21) Message boards : Number crunching : The uploads are stuck (Message 67937)
Posted 21 Jan 2023 by xii5ku
Post:
Here we go again:
21 Jan 2023 17:43 UTC Error reported by file upload server: can't write file oifs_43r3_ps_[…].zip: No space left on server
22) Message boards : Number crunching : w/u failed at the 89th zip file (Message 67932)
Posted 21 Jan 2023 by xii5ku
Post:
"double free or corruption (out)" means:
– Either the program attempted to free (i.e., to deallocate) a memory segment more than once.
– Or something illegally overwrote the heap bookkeeping data immediately before the memory segment which was to be freed.

Could be a programming error. Or could be a secondary symptom of some earlier program failure.
(Edit: It can also be caused by hardware defects, e.g. overclocked RAM or CPU, but IIRC this failure mode has also been seen on Xeon hosts, which seem unlikely to be operated in an unstable configuration or with defective hardware.)

nairb wrote:
Anybody had one of these?
There are several more reports of this on this message board, and it shows up in the stderr.txt of several failed tasks from users who mentioned that they had failures.
23) Message boards : Number crunching : If you have used VirtualBox for BOINC and have had issues, please can you share these? (Message 67931)
Posted 21 Jan 2023 by xii5ku
Post:
SG Felix wrote:
If I want to do VBox BOINC WUs, I have to start the client with admin rights. Also, if I want to start a regular virtual machine, I have to do it with admin rights. If I don't, I get the following VBox error:

Critical error

The COM object for VirtualBox could not be created

Error code:
E_ACCESSDENIED (0x80070005)
Component:
VirtualBoxClientWrap
Check with "id" (for your own user) and with "id boinc" (for the boinc user) whether or not they are members of the vboxusers group.
If they are not, add them to the group:
sudo usermod -a -G vboxusers $USER
sudo usermod -a -G vboxusers boinc
To test whether this solved the problem for your own user: (1) either simply open a new terminal with a login shell, or log out and back in entirely; (2) then, from the new login, try starting a VM without elevated privileges.

To test whether this solved the problem for boinc: (1) shut down and restart the client; (2) try starting a VirtualBox based task with the client running normally under the "boinc" user ID.
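(For reference, a minimal shell check of the group memberships; this sketch assumes the client runs under the distro-default "boinc" account, so adjust the user name if your setup differs.)

# assumption: the BOINC client runs as user "boinc" (common distro default)
id -nG "$USER" | grep -qw vboxusers && echo "$USER: in vboxusers" || echo "$USER: NOT in vboxusers"
id -nG boinc | grep -qw vboxusers && echo "boinc: in vboxusers" || echo "boinc: NOT in vboxusers"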
24) Message boards : Number crunching : The uploads are stuck (Message 67840)
Posted 18 Jan 2023 by xii5ku
Post:
ncoded.com wrote:
Please think about having one section on your website that lists all the major issues that one could have, with clear solutions.

These are 3 issues I have hit in the last week or so, all of which stopped us crunching

1) 100GB UI Limit
2) The 2x core_count limit: can't download new work due to too many pending uploads
3) Invalidated tasks by switching out HDD
1) That'd be good to have in a FAQ, although this issue is shared with other projects whose workunits have a large rsc_disk_bound.

2) A generic problem, mostly encountered in corner cases like server outages. However, in the case of oifs_43r3_ps with its extremely large result data size per task, why were people so keen on downloading more new work when it was clear that the upload file server had been down for more than a week, that its recovery would take more than a week, and that the success of that recovery was entirely uncertain?

3) A corner case which works trouble-free at CPDN as long as either the filesystem is enlarged while the client is down, or a second client instance is created according to the guidelines for multiple client instances per physical host.

[I am of course speaking just for myself; I am not suggesting what CPDN should or shouldn't do WRT user communications.]
25) Message boards : Number crunching : The uploads are stuck (Message 67838)
Posted 18 Jan 2023 by xii5ku
Post:
ncoded.com wrote:
[...] I posted that I had just bought a new HD and was about to swap it out due to so many uploads. Not one mod or team member mentioned the problems that would happen if I did this.
Your problems were unexpected.

ncoded.com wrote:
[...] In terms of this problem, from what I gather this "option" has been turned on by CPDN and GPUGrid, but has been left off by most other projects.
No. As mentioned by others before, multiple boinc client instances on a single physical host *are* in fact treated as separate boinc host instances. (Unless these client instances are created such that they make themselves look the same to the project server.) Just follow the widely available guides for the setup and operation of multiple boinc client instances, and you are fine at CPDN.

GPUGrid and one or two other projects collapse such host entries into a single one. (Or rather, attempt to collapse them; there are still workarounds to prevent this.) But CPDN, like the majority of other BOINC projects, does not do this (again, *if* the client instances don't make themselves look identical to the server).

ncoded.com wrote:
[...] Either way it is a bitter pill to swallow which is why there is no point us running this project now.
The trouble you encountered is not specific to CPDN, as others mentioned before.
26) Message boards : Number crunching : oifs_43r3_ps v1.05 slot dir cleanup (Message 67822)
Posted 17 Jan 2023 by xii5ku
Post:
Glenn,
you mentioned in another thread that the current OpenIFS application leaves some superfluous files in the slot directory if the task is restarted several times. (And you mentioned that this will be addressed in an upcoming application update.) It was said that it is possible to remove the older files manually; only the newest one in the respective set of files must be left intact.

Since this info is buried in the various subtopics of the more general threads, I opened this new one.

I now compared the contents of the slot directory of a task which was started only once in its entire lifetime with that of a task which was started twice.

The latter task had the following files in pairs:
BLS20110501000000_000002230000.1_1 (41 MBytes)
BLS20110501000000_000003230000.1_1 (41 MBytes)
LAW20110501000000_000002230000.1_1 (1.6 MBytes)
LAW20110501000000_000003230000.1_1 (1.6 MBytes)
srf00030000.0001 (768 MBytes)
srf00040000.0001 (768 MBytes)
while the former task had just one BLS*, LAW*, and srf* file, respectively.

The newest of the BLS*, LAW*, and srf* files would be replaced with a differently named file while the task is running.

So are these the files of which all but the newest could be deleted if disk space is tight?
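(If that turns out to be correct, here is a cautious sketch which only lists, and does not delete, the candidates; the slot path is just an example and has to be adjusted to your BOINC data directory and slot number.)

# list everything except the newest file of each set: "ls -t" sorts newest
# first, "tail -n +2" drops that newest entry
cd /var/lib/boinc/slots/0 || exit 1   # example path, adjust to your installation
for prefix in BLS LAW srf; do
    ls -t "${prefix}"* 2>/dev/null | tail -n +2
done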


________

Oh, and one more thing:

The task which was started twice happened to have the following files in its slot directory:
-rw-r--r-- 1 boinc boinc        19 Jan 17 20:31 boinc_ufs_upload_file_0.zip
-rw-r--r-- 1 boinc boinc         0 Jan 17 20:39 boinc_ufs_upload_file_1.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 20:49 boinc_ufs_upload_file_2.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 20:55 boinc_ufs_upload_file_3.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 21:02 boinc_ufs_upload_file_4.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 21:12 boinc_ufs_upload_file_5.zip
-rw-r--r-- 1 boinc boinc        19 Jan 17 21:14 boinc_ufs_upload_file_6.zip
Note the 0 size of *_1.zip.
And its stderr.txt contains a few occasions of
handle_upload_file_status: can't parse boinc_ufs_upload_file_1.zip
Does this mean that the *_1.zip file wasn't properly created, and that the task will fail with a computation error after it has completed the rest of the computation?
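(Independent of the answer, a quick way to spot such zero-length intermediate upload zips across all slot directories; the path assumes a default Linux package install of the client and may need adjusting.)

# find empty boinc_ufs_upload_file_*.zip files in any slot directory
find /var/lib/boinc/slots -maxdepth 2 -name 'boinc_ufs_upload_file_*.zip' -size 0 -ls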
27) Message boards : Number crunching : The uploads are stuck (Message 67815)
Posted 17 Jan 2023 by xii5ku
Post:
AndreyOR wrote:
It seems to me that BOINC upload is for the most part a background process that does its job relatively well. Pretty much the only times uploading generates user complaints are when upload servers aren't working right. The length of this upload outage is rather unique but even so the progress has been very good so far.
It's not only the length of the server outage which is a one-off edge case here.
The extreme ratio of result data size to CPU time is also unique.
AFAIK, it's very unlike any of the currently active projects. (And it's atypical for Distributed Computing, which requires client-server communications to be minimal in order to be effective. Client-server bandwidth and latency in Distributed Computing are, obviously, worlds apart from those within an HPC cluster.)


(On a positive note, both the result data size and the CPU time of oifs_43r3_ps tasks are very predictable, making it easy for users to control their output accordingly, if they care.)
28) Message boards : Number crunching : The uploads are stuck (Message 67778)
Posted 16 Jan 2023 by xii5ku
Post:
Since this morning (UTC+1 time zone), my two hosts have been uploading continuously, with only a very small portion of transfers still getting stuck. But these remaining hiccups don't decrease my effective upload rate anymore, which is now limited by my own internet uplink bandwidth again.

_________________________________

Overall progress:

[chart]
purple: oifs_43r3_ps ready-to-send
yellow: oifs_43r3_ps in-progress

Cumulative view of both:

[chart]
yellow: oifs_43r3_ps to-be-done (ready-to-send + in-progress)
purple: oifs_43r3_ps ready-to-send

Source: grafana.kiska.pw/d/boinc/boinc (made by Kiska)
29) Message boards : Number crunching : The uploads are stuck (Message 67744)
Posted 15 Jan 2023 by xii5ku
Post:
Richard Haselgrove wrote:
Richard Haselgrove wrote:
I've found a technique which seems to help. Go through this sequence:

  • Suspend network activity (BOINC Manager, Advanced view, Activity menu)
  • Retry all transfers (Tools Menu)
  • Allow network activity

That just cleared the last six tasks due by 25 January, on one machine.

And it's just worked on my second machine as well - tied off the loose ends from 14 tasks in a single hour.

14/01/2023 17:12:27 | climateprediction.net | Reporting 14 completed tasks
I usually wait until both connections have stalled, and the queue has gone into 'project backoff'. Not sure if that's a significant part of the procedure, but it can't spoil it.
I just tried it once on two computers, and it did not set anything in motion.
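(For anyone who wants to script Richard's sequence on a headless host, a rough boinccmd approximation; note that there is no single "retry all transfers" command, individual files can be retried with "boinccmd --file_transfer <project_url> <file_name> retry", and the sketch assumes a local client with default RPC authentication.)

boinccmd --set_network_mode never    # suspend network activity
sleep 30                             # give running transfers time to stall out
boinccmd --set_network_mode auto     # allow network activity again
boinccmd --network_available         # retry deferred network communication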


Jean-David Beyer wrote:
I am getting pretty good response from the upload server, though not as good as it was about 10 (?) days ago. I have high speed (75 megabit/second) fiber optic Internet connection, but I am in USA and the server is in England. So right now, traceroute does not make it all the way to the server. It did recently.

But notice the big delay from New York to London. Step 8 to step 9. This is usual and unchanged. IIRC, the server is at about step 22.
[...]
16  ral-r26.ja.net (146.97.41.34)  80.929 ms  83.553 ms  79.803 ms
wujj123456 wrote:
Thanks for the traceroute output. The server doesn't respond to ICMP packets, probably blocked for security. ral-r26.ja.net seems to be the last hop everyone sees from traceroute. Your latency is pretty low, compared to my 140-150ms to reach ral-r26.ja.net.
My roundtrip times are even lower:
14  ral-r26.ja.net (146.97.41.34)  34.969 ms  34.954 ms  34.178 ms
Yet I don't get anything uploaded anyway.
(To be precise, one computer of mine uploaded 86 files and the other computer 41 files, since Wednesday night.)
30) Message boards : Number crunching : Upload server is out of disk space (Message 67743)
Posted 15 Jan 2023 by xii5ku
Post:
MiB1734 wrote:
I have 1400 tasks to upload. This means 2.5 TB. So it is no wonder the backlog is taking forever.
MiB1734 wrote:
I have about 2.5 TB of result files and can upload about 10 GB per day. This means resolving the backlog takes 250 days.
Is the 10 GB/day limit the one which is imposed by your internet uplink? Or is it your actual upload during the current period of deliberately downgraded server connectivity (see posts 67636 and 67649)?

If it is the limit of your Internet link, the best course of action _in December_ would have been to
– configure the computers to complete only 5 tasks per day (total of all computers on this internet link),
– configure only small download buffers on these computers accordingly,
– stop computation soon after it became evident that there would be a multi-day server outage.

If it is your current actual average upload rate, then
– stop or throttle computation if you haven't done so yet and
– keep hoping that upload server performance can be recovered later next week.
(Personally, I am hoping this as well but am expecting that upload server performance remains degraded, periodically or the whole time until the current set of OpenIFS work batches is done. My expectation is based on what has been achieved so far by the operators of the server.)


Dave Jackson wrote:
I am now down to 16 tasks uploading. I think I will be clear by the end of play tomorrow. Keeping to just one task running till backlog is cleared.
The part which I bolded ("Keeping to just one task running till backlog is cleared") is what everybody who runs OpenIFS should be doing currently.
(Alternatively: Halt computation entirely, retry backed-off transfers once or twice a day via boincmgr, re-enable computation after the backlog is cleared.)
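(A minimal sketch of that alternative with boinccmd, assuming a local client with default RPC authentication:)

boinccmd --set_run_mode never        # halt computation entirely
boinccmd --network_available         # retry backed-off transfers; repeat once or twice a day
# once the upload backlog is cleared:
boinccmd --set_run_mode auto         # resume computation according to preferences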


leloft wrote:
I think I've got a workaround to the 'too many uploads' issue. Thanks to all who contributed bits towards this. It appeared that actively crunching clients had more success at securing upload slots, so I changed <ncpus> from 24 to 40 in cc_config and reread it. The client downloaded 8 units and started to process them. The host has been uploading solidly since 21.00 last night and has no trouble regaining an upload slot within seconds of dropping it. I have no real idea why this should have worked, except to guess that the ability to secure an upload slot is somehow enhanced by having an actively crunching client.
Best,
fraser
You are lucky. — I have been logging the number of pending file transfers on my two active computers since Wednesday night. As far as I can tell from this log, there was only one short window so far during which my computers uploaded anything. The window lasted less than 2 hours, 123 files were uploaded, out of 6,600 pending files.
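(In case anyone wants to collect the same numbers, a hypothetical logging loop; counting "name:" lines matches the boinccmd --get_file_transfers output of my client, other client versions may format it slightly differently.)

# log the number of pending file transfers every 30 minutes
while true; do
    n=$(boinccmd --get_file_transfers | grep -c "name:")
    echo "$(date -u '+%F %T') ${n} pending transfers" >> "$HOME/cpdn_transfers.log"
    sleep 1800
done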
31) Message boards : Number crunching : The uploads are stuck (Message 67658)
Posted 13 Jan 2023 by xii5ku
Post:
Glenn Carver wrote:
A staggered start for OpenIFS would not help. OpenIFS hits its peak memory every timestep; the tasks will not stay the same 'distance' apart and would still crash if memory is over-provisioned. This is a current feature of the boinc client that needs fixing. It's not a criticism of the code; PC hardware & projects have moved on from early boinc apps and the code needs to adapt.

The only sane way to do this is for the client to sum up what the task has told it are its memory requirements and not launch any more than the machine has space for. OpenIFS needs the client to be smarter in its use of volunteer machine's memory.
Memory usage outside of BOINC (hence, memory available to BOINC) may fluctuate rapidly too, if the host is not a dedicated BOINC machine.

Glenn Carver wrote:
And I don't agree this is for the user to manage. I don't want to have to manage the ruddy thing myself, it should be the client looking after it for me.

I think all we can do at present is provide a 'Project preferences' set of options on the CPDN volunteer page and set suitable defaults for no. of workunits per host, default them to low. With clear warnings about running too many openifs tasks at once.
Some projects have a "max number of tasks in progress" option enabled in their user-facing project web preferences. But that's of course something entirely different. So far, only the host-side app_config::project_max_concurrent and app_config::app::max_concurrent come close to what's needed.
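(For reference, a hedged sketch of those two host-side options in app_config.xml; the data directory path and the app name, which I assume here to be oifs_43r3_ps, may need adjusting to your installation.)

# write an app_config.xml for CPDN; path and app name are assumptions
cat > /var/lib/boinc/projects/climateprediction.net/app_config.xml <<'EOF'
<app_config>
    <project_max_concurrent>2</project_max_concurrent>
    <app>
        <name>oifs_43r3_ps</name>
        <max_concurrent>2</max_concurrent>
    </app>
</app_config>
EOF
# afterwards, restart the client or re-read the config files from the Manager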
32) Message boards : Number crunching : The uploads are stuck (Message 67635)
Posted 13 Jan 2023 by xii5ku
Post:
Glenn Carver wrote:
xii5ku wrote:
It's specific to CPDN. IIRC I had basically flawless transfers early after the start of the current OpenIFS campaign, but soon this changed to a state in which a certain amount of transfers got stuck at random transfer percentage. I worked around this by increasing max_file_xfers_per_project a little. This period ended at the server's X-Mas outage. — When I came home from work yesterday (evening in the UTC+0100 zone), I found one of the two active computers transferring flawlessly, the other one having frequent but not too many transfers getting stuck at random percentage.
Maybe the first is occupying available bandwidth/slot. I see similar with my faster broadband.
One of the things which I tried on Wednesday was to bandwidth-limit the 'good' computer to less than half of my uplink's bandwidth and leave the 'bad' computer unlimited. This did not improve the 'bad' computer's situation, though.
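(For completeness, one way to set such a per-host upload cap is via global_prefs_override.xml; a sketch, with the path assuming a default Linux package install and 500 kB/s chosen purely as an example value.)

# cap this host's upload rate at ~500 kB/s (~4 Mbit/s); path is an assumption
cat > /var/lib/boinc/global_prefs_override.xml <<'EOF'
<global_preferences>
    <max_bytes_sec_up>500000</max_bytes_sec_up>
</global_preferences>
EOF
boinccmd --read_global_prefs_override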

Getting stuck is perhaps to be expected given your remote location??
I recall two other Germans commenting here on the Wednesday situation: one was stuck exactly like me, the other had transferred everything at very high speed. I know of one user on the US East coast, on a big commercial datacenter pipe, who also had an increasing rate of transfers getting stuck on Wednesday, hours before the server ran out of disk space.

As for other projects: I don't have projects with large uploads active at the moment. TN-Grid and Asteroids with infrequent small transfers and Universe with frequent smallish transfers are working fine right now on the same two computers, in the same boinc client instances as CPDN. Earlier in December, I took part in the PrimeGrid challenge which had one large result file for each task, and they transferred flawlessly too.
Depends what you mean by 'large' & small in relation to transfer size.
– Asteroids@home: one file per result, 60…300 kB
– PrimeGrid: one 128 MB file and one tiny file per result; very long task duration but I had multiple computers running, therefore there were occasions during which I saturated the upstream of my cable link for a few hours
– TN-Grid: one file with <10KB per result
– Universe@home: six files per result, ranging from 20 B to <60 kB; current task duration 45 minutes, resulting in a quite high connection rate

Again, no problem between my clients and these project servers.

[Back in May 2022, there was a competition at Universe@home (BOINC Pentathlon's marathon), during which the foreseeable happened: Its server had to drop most of the connection attempts of participants in this competition, as it could not handle the high combined request rate.]

Also, over the handful of years during which I have been using BOINC now, there were multiple occasions during which I saturated my upload bandwidth with result uploads for longer periods (hours, days) at various projects without the problems which I encountered with upload11.cpdn.org.

We opted for more 'smaller' files to upload rather than fewer 'larger' files to upload, precisely for people further afield. Each upload file is ~15 MB for this project. I still think that's the right approach, even though I recognise people got a bit alarmed by the number of upload files that was building up. Despite the problem with 'too many uploads' blocking new tasks, I am reluctant to up the maximum upload size limit, esp. if you are having problems now.
It doesn't matter much for the failure mode which I described (and was apparently experienced by some others too). If a transfer got stuck after having transferred some amount of data, the next successful retry (once there was one) would pick up the transfer where it was left off. I.e. the previously transferred portion did not have to be retransmitted.

Furthermore, regarding the client's built-in blocking of new work requests: that's a function of the number of tasks (a.k.a. results) which are queued for upload, not of the number of files which are queued for upload. (It's also a function of the number of logical CPUs usable by BOINC, as you recall: the client stops requesting new work from a project once the number of its results with pending uploads exceeds 2x the host's core count.)

Personally, I don't have a preference for how you split the result data into files. For my amount of contribution, during periods in which the upload server works at least somewhat, only the total data size per result matters. As for periods during which the rate of stuck transfers is very high but not 100%, I have no idea whether fewer or more files per result would help. And for completeness: users who are network bandwidth constrained (such as myself), or who are disk size constrained, cannot contribute at all for as long as the upload server is unavailable; then the file split obviously doesn't matter at all.

So to summarize: It's a problem specifically with CPDN's upload server, and this problem existed with lower severity before the Holidays outage.
I think that's part of your answer there. When there isn't a backlog of transfers clogging the upload server, it can tick along fine.
For most of the time of the current OpenIFS campaign during which the upload server wasn't down, it seemed to be "ticking along fine" to one part of the contributors, while all along showing signs of "just barely limping along" to another part of the contributors.

The server's behaviour on Wednesday _might_ be an indication that some of the problems which led to the Great Holiday Outage have not actually been resolved. (In plain terms, somebody somewhere _seems_ to be flogging a dead horse.)

If it really troubles people to see so many uploads building up in number we can modify it for these longer model runs (3 month forecasts).
The precise file split isn't too critical. What matters is: how much result data do the scientists need, and in which time frame? Based on that, what rate of data transfers does the server infrastructure need to support? Take into account that many client hosts can only operate for as long as they can upload. — That's all pretty obvious to everybody reading this message board, but to me it's clear that somebody somewhere must have cut at least one corner too many last December.
________
PS: I'm not complaining, just observing. The scientist needs the data; I don't.
33) Message boards : Number crunching : Upload server is out of disk space (Message 67617)
Posted 12 Jan 2023 by xii5ku
Post:
wujj123456 wrote:
It's kinda funny I was not able to upload anything due to transient HTTP error, but can see these messages like everyone else. ¯\_(ツ)_/¯
The web server, scheduler, feeder, validator, transitioner, download file handler… are on www.cpdn.org (status), but the upload file handler for the current OIFS work is on upload11.cpdn.org. They are physically different machines.
34) Message boards : Number crunching : The uploads are stuck (Message 67616)
Posted 12 Jan 2023 by xii5ku
Post:
Glenn Carver wrote:
xii5ku wrote:
It's not for a lack of the client's trying. First, I still have one running task on each of my two active computers, which causes a new untried file to be produced every ~7.5 minutes. Second, even if I intervene when the client applies an increasing "project backoff", i.e. make it retry regardless, nothing changes. Earlier yesterday, while I was still able to transfer some files (with decreasing success), the unsuccessful transfers got stuck at random transfer percentages. Later on, still long before the new 'server is out of disk space' issue came up, eventually all of the transfer attempts got stuck at 0 bytes transferred.

Unlike some other contributors here, I have a somewhat limited upload bandwidth of about 8 Mbit/s. Maybe this plays a role why I have been less successful than others, maybe not.
Do have these issues with other projects? And, when CPDN isn't playing catch up with their uploads, do you have the problem then? Just trying to understand if this is a general problem you have or whether it's just related to what's going on with CPDN at the moment.
It's specific to CPDN. IIRC I had basically flawless transfers early after the start of the current OpenIFS campaign, but soon this changed to a state in which a certain amount of transfers got stuck at random transfer percentage. I worked around this by increasing max_file_xfers_per_project a little. This period ended at the server's X-Mas outage. — When I came home from work yesterday (evening in the UTC+0100 zone), I found one of the two active computers transferring flawlessly, the other one having frequent but not too many transfers getting stuck at random percentage. This worsened within a matter of hours on both computers, to the point that all transfer attempts got stuck at 0%. (Another few hours later, the server ran out of disk space, which changed the client log messages accordingly.)

I just came home from work again and am seeing "connect() failed" messages now. Gotta read up in the message board for the current server status.

As for other projects: I don't have projects with large uploads active at the moment. TN-Grid and Asteroids with infrequent small transfers and Universe with frequent smallish transfers are working fine right now on the same two computers, in the same boinc client instances as CPDN. Earlier in December, I took part in the PrimeGrid challenge which had one large result file for each task, and they transferred flawlessly too.

So to summarize: It's a problem specifically with CPDN's upload server, and this problem existed with lower severity before the Holidays outage.
________

Edit: So… is the new "connect() failed" failure mode expected due to the current need to move data off of the upload server?

Since last night, I have been logging the number of files to transfer on both of my active computers at 30-minute intervals, and these numbers have been monotonically increasing. (Each computer has one OIFS task running.) I haven't done the math to check whether the growth of the backlog exactly matches the file creation rate of the running tasks, but I guess it does. The client logs of the past 2 or 3 hours don't show a single success.
35) Message boards : Number crunching : Hardware for new models. (Message 67598)
Posted 12 Jan 2023 by xii5ku
Post:
I guess you are used to getting by with very shallow work buffers, and/or rarely deal with periods of heavily congested project servers.¹

Edit: The solution to this is obvious (partition one wide host into several narrow hosts) and its implementation is straightforward — but arguably not something which one should call trivial.

Edit 2: ¹) Those are the common problems. In addition there are a few special cases, e.g. the attachment of a new host to CPDN: guess how long it takes until, say, the first 64 tasks are assigned to a host, given the 3636 s request_delay and the small number of tasks assigned per work request. (At one task per request, that alone is 64 × 3636 s ≈ 65 hours of scheduler requests.)

Edit 3: This is my first and last off-topic post in this thread, which already suffers from a low SNR.

Edit 4: "Easy to solve - set a bigger buffer" – Wrong. Goes to show that the poster knows little about the quota of tasks in progress, about the significance of scheduler request rates, and more.
36) Message boards : Number crunching : The uploads are stuck (Message 67597)
Posted 12 Jan 2023 by xii5ku
Post:
Glenn Carver wrote:
xii5ku wrote:
FWIW, when I came home from work 3 hours ago it worked somewhat. It quickly worsened, and now _all_ transfer attempts fail. I've got the exact same situation now as the one @Stony666 described.

The bad: The fact that _nothing_ is moving anymore for myself and evidently for some others doesn't make me optimistic that the backlog (however modest or enormous it might be) would clear anytime soon. And as long as nothing is moving, folks as myself can't resume computation.
There may be some boinc-ness things going on. I vaguely remember Richard saying something about uploads will stop processing if it tries & fails 3 times? Or something like that? Uploads are still ok for me, I've got another 1000 to do.
It's not for a lack of the client's trying. First, I still have one running task on each of my two active computers, which causes a new untried file to be produced every ~7.5 minutes. Second, even if I intervene when the client applies an increasing "project backoff", i.e. make it retry regardless, nothing changes. Earlier yesterday, while I was still able to transfer some files (with decreasing success), the unsuccessful transfers got stuck at random transfer percentages. Later on, still long before the new 'server is out of disk space' issue came up, eventually all of the transfer attempts got stuck at 0 bytes transferred.

Unlike some other contributors here, I have a somewhat limited upload bandwidth of about 8 Mbit/s. Maybe this plays a role why I have been less successful than others, maybe not.
37) Message boards : Number crunching : Hardware for new models. (Message 67595)
Posted 12 Jan 2023 by xii5ku
Post:
xii5ku wrote:
Hardware requirements for current "OpenIFS 43r3 Perturbed Surface" work:

The following items need to be taken into account, in descending order of concern:
1. Upload bandwidth of your internet link.
2. Disk space.
3. RAM capacity.
99. CPU. This one doesn't really matter, except of course that CPU core count has got an influence on how many tasks you may want to run in parallel, and that core count × core speed influences how many tasks you can complete per day at most. One or both of these factors (concurrent tasks, average rate of task completions) influence the sizing of items 1…3 in this list.
Note, I bolded one word in the first sentence after the fact.

This priority order has been criticised here. I admit that my perspective is somewhat biased, as I own several computers with relatively high core counts and high computational throughput and am used to being able to fully utilize them. (Although that's not always trivial to accomplish, because many BOINC projects are focused on low core count/ low throughput hosts.)

However, given how the current "OpenIFS 43r3 Perturbed Surface" campaign is going so far, my priority list is – empirically; refer to thread 9167, thread 9178 – indeed quite generally applicable.
38) Message boards : Number crunching : The uploads are stuck (Message 67568)
Posted 11 Jan 2023 by xii5ku
Post:
FWIW, when I came home from work 3 hours ago it worked somewhat. It quickly worsened, and now _all_ transfer attempts fail. I've got the exact same situation now as the one @Stony666 described.

Glenn Carver wrote:
The upload server is still up and functioning ok. There is an enormous backlog, so please be patient for a couple of days. I'm sure your uploads will happen soon.
There is some good news and some bad news:

The good: Relative to the upload server outage (now almost three weeks), the backlog can't actually be very big. That's because many of us had to stop computing rather soon after the upload server outage started. (Everybody whose production is either constrained by Internet connection throughput, or by disk space.)

The bad: The fact that _nothing_ is moving anymore for myself and evidently for some others doesn't make me optimistic that the backlog (however modest or enormous it might be) would clear anytime soon. And as long as nothing is moving, folks as myself can't resume computation.
39) Message boards : Number crunching : OpenIFS Discussion (Message 67517)
Posted 10 Jan 2023 by xii5ku
Post:
Glenn mentioned that the upload server has got 25 TB storage attached. My understanding is that data are moved off to other storage in time.

My guess is that the critical requirements on the upload server's storage subsystem are high IOPS and perhaps limited latency. (Accompanied by, of course, data integrity = error detection and correction, as a common requirement on file servers.)
40) Message boards : Number crunching : The uploads are stuck (Message 67478)
Posted 9 Jan 2023 by xii5ku
Post:
wujj123456 wrote:
There are completed WUs approaching their deadline soon. I have WUs due in 11 days, and 11 days no longer looks that long given that the upload server has been down for two weeks. If the new storage array somehow has problems again, or some new issues show up, which is not that uncommon for new systems, we probably need the server to extend deadlines so as not to waste work.
After the reporting deadline, your work isn't obsolete right away. The server would create a replica task from the same workunit, would have to wait for a work-requesting host to assign this new task to, and would then wait for that host to return a valid result. Until that happens, which can potentially be a long time after your reporting deadline, the server will still opportunistically accept a result from your original task. (And give credit for it if valid… normally. Not sure about CPDN, where credit is assigned separately.)

PS, if you return a valid result for the original task, after the server already assigned a replica task to another host, three things can follow: 1) The other host returns a result too. AFAIK it will get credit if valid. 2) The other host issues an unrelated scheduler request to the server. In the response, the server informs the other host that the replica is no longer needed. 2.a) If the host hasn't started the replica task yet, the task is aborted, and thereby no CPU cycles are wasted on it. 2.b) If the host already started the replica task, same as 1: it will finish and report it and AFAIK get credit if valid.

