The uploads are stuck

Message boards : Number crunching : The uploads are stuck
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 67656 - Posted: 13 Jan 2023, 15:15:09 UTC - in response to Message 67642.  
Last modified: 13 Jan 2023, 15:15:46 UTC

If the team don't want to meddle with the database in the middle of this (and I wouldn't either), at the very least they should set a large number for

<report_grace_period>x</report_grace_period>
<grace_period_hours>x</grace_period_hours>
A "grace period" (in seconds or hours respectively) for task reporting. A task is considered timed out (and a new replica generated) if it is not reported by client_deadline + x.
(from https://boinc.berkeley.edu/trac/wiki/ProjectOptions)
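As an illustration, such a setting might look like this as a fragment of the project's server-side config.xml (the 7-day value here is invented, purely to show the shape):

```xml
<config>
  <!-- Hypothetical: accept tasks reported up to 7 days (604800 s)
       past their deadline before generating a replica. -->
  <report_grace_period>604800</report_grace_period>
</config>
```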
Richard, I checked with Andy. CPDN have not specified this in the scheduler settings. Nor can I find a default value for it in the boinc code, so I assume the default is 0.

We don't hit the first batch deadlines for 950 & 951 until 19th January. I think that'll be fine, if a little tight. I'd prefer to keep it like that so we know sooner rather than later which tasks are never coming back, as this project must finish by the end of Feb.
ID: 67656
xii5ku

Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 67658 - Posted: 13 Jan 2023, 17:57:47 UTC - in response to Message 67655.  
Last modified: 13 Jan 2023, 17:59:10 UTC

Glenn Carver wrote:
A staggered start for OpeniFS would not help. OpenIFS hits its peak memory every timestep, the tasks will not stay the same 'distance' apart and would still crash if memory is over-provisioned. This is a current feature of boinc client that needs fixing. It's not a criticism of the code, PC hardware & projects have moved on from early boinc apps and the code needs to adapt.

The only sane way to do this is for the client to sum up what the task has told it are its memory requirements and not launch any more than the machine has space for. OpenIFS needs the client to be smarter in its use of volunteer machine's memory.
Memory usage outside of BOINC (hence, memory available to BOINC) may fluctuate rapidly too, if the host is not a dedicated BOINC machine.

Glenn Carver wrote:
And I don't agree this is for the user to manage. I don't want to have to manage the ruddy thing myself, it should be the client looking after it for me.

I think all we can do at present is provide a 'Project preferences' set of options on the CPDN volunteer page and set suitable defaults for no. of workunits per host, default them to low. With clear warnings about running too many openifs tasks at once.
Some projects have a "max number of tasks in progress" option enabled in their user-facing project web preferences. But that's of course something entirely different. So far, only the host-side app_config::project_max_concurrent and app_config::app::max_concurrent come close to what's needed.
ID: 67658
wujj123456

Joined: 14 Sep 08
Posts: 87
Credit: 32,981,759
RAC: 14,695
Message 67660 - Posted: 13 Jan 2023, 18:08:49 UTC - in response to Message 67656.  

We don't hit the first batch deadlines for 950 & 951 until 19th January. I think that'll be fine, if a little tight. I'd prefer to keep it like that so we know sooner rather than later which tasks are never coming back, as this project must finish by the end of Feb.

Any estimate of how much backlog we might be able to drain by that day, or is there a way for the server to prioritize uploads for WUs closer to their deadline? I am guessing no to the latter, so to avoid wasted work, most of the backlog, or at least most machines, need a chance to upload.

I am not really sure it helps to let results expire and send out more duplicates, generating even more uploads, while the upload servers are struggling. We can always set the grace period back to 0 once the upload backlog drains, at which point we would know that the tasks which haven't shown up mostly didn't finish in time. Hopefully the catch-up wouldn't take so long as to put the end-of-Feb deadline at risk.
ID: 67660
wujj123456

Joined: 14 Sep 08
Posts: 87
Credit: 32,981,759
RAC: 14,695
Message 67661 - Posted: 13 Jan 2023, 18:43:10 UTC - in response to Message 67655.  

A staggered start for OpeniFS would not help. OpenIFS hits its peak memory every timestep, the tasks will not stay the same 'distance' apart and would still crash if memory is over-provisioned. This is a current feature of boinc client that needs fixing. It's not a criticism of the code, PC hardware & projects have moved on from early boinc apps and the code needs to adapt.

The only sane way to do this is for the client to sum up what the task has told it are its memory requirements and not launch any more than the machine has space for. OpenIFS needs the client to be smarter in its use of volunteer machine's memory.

No, that's not the sane way for most workloads. Leaving a lot of memory on the table for the worst-case scenario is a huge waste. BOINC is doing the optimization for workloads whose memory usage is relatively stable, even if it varies across different WUs or changes very infrequently. It certainly made the mistake of not accounting for a potential slow start, and I would also argue its 30-second polling interval for working set size is way too sparse, but the general idea is not insane. A lot of WUs do not need their worst-case memory most of the time. Over-committing is unfortunately required to use memory efficiently.

The problem here is the bursty memory usage of OpenIFS. A greatly varying working set size is just a harder problem for memory management. Perhaps the multi-generational LRU can soon help, if the BOINC working set size is retrieved from the kernel, but the feature is not there yet and it certainly won't help on Windows. If VBox can effectively reserve the full memory all the time, the problem won't exist on Windows either.

If OpenIFS could pre-allocate instead of continually growing and shrinking its memory usage, that would help a lot. That's not always possible though, so we do need to change the BOINC client a bit. See below.

And I don't agree this is for the user to manage. I don't want to have to manage the ruddy thing myself, it should be the client looking after it for me.

Totally agree. IMO, all it takes might just be two flags controlled by projects. The key is to communicate to the client how this specific app should be managed.

1) A per-WU or per-app flag that instructs client to reserve the full rsc_memory_bound for this WU/app and completely ignore the measured working set size.
2) Another flag to communicate the minimal memory requirement. Always use at least this number for working set size even if measured one is lower.

The client can treat all apps without the new flags as it does today. This is hopefully also a less intrusive change that would be easier to implement. OpenIFS might need both, but a lot of other apps could potentially benefit from 2) to avoid starting tasks when there isn't enough memory left.
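A sketch of how a client could apply flag 2) when sampling a task's memory use (names invented for illustration; this is not actual BOINC client code):

```python
GB = 1024 ** 3

def effective_working_set(measured, min_working_set=None):
    """Working-set size the scheduler would use under the proposed
    flag 2): never report less than the project-declared minimum,
    even when the sampled value dips between memory peaks."""
    if min_working_set is None:
        return measured            # no flag: behave as today
    return max(measured, min_working_set)

# A bursty app sampled at 1 GB between peaks, but declaring a 5 GB floor,
# would still be accounted as needing 5 GB:
print(effective_working_set(1 * GB, 5 * GB) // GB)  # prints 5
```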
ID: 67661
nairb

Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 67662 - Posted: 13 Jan 2023, 19:31:09 UTC - in response to Message 67649.  


The current limit is 50 concurrent connections.

Well, on an optimistic note, I finally snared a connection and, at a rattling 30+ kbit/s, have managed to upload a whole WU. And still the connection is holding. I might make it up to 65 kbit/s later in the evening.

With luck all the rest of the zips will go overnight.
ID: 67662
Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4346
Credit: 16,541,921
RAC: 6,087
Message 67663 - Posted: 13 Jan 2023, 19:42:10 UTC - in response to Message 67662.  

With luck all the rest of the zips will go overnight.
Mine won't, even if I keep going at the maximum my connection allows. (Another 19 tasks to go, though I haven't checked how many zips each has left.)
ID: 67663
SolarSyonyk

Joined: 7 Sep 16
Posts: 257
Credit: 31,958,609
RAC: 36,807
Message 67665 - Posted: 13 Jan 2023, 19:51:22 UTC - in response to Message 67656.  

We don't hit the first batch deadlines for 950 & 951 until 19th January. I think that'll be fine, if a little tight. I'd prefer to keep it like that so we know sooner rather than later which tasks are never coming back, as this project must finish by the end of Feb.


I don't see any way to have all my WUs uploaded by the 19th at the current rates, short of physically moving a bunch of machines to places with a faster internet connection and hoping they can connect. Some of those are 19th WUs, expiring in 6d per the client.

Can I ship you guys a burned BluRay disk or two of my boinc client directories and just reset them?
ID: 67665
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 67668 - Posted: 13 Jan 2023, 20:45:47 UTC - in response to Message 67661.  
Last modified: 13 Jan 2023, 20:53:05 UTC

wujj123456 wrote:
The only sane way to do this is for the client to sum up what the task has told it are its memory requirements and not launch any more than the machine has space for. OpenIFS needs the client to be smarter in its use of volunteer machine's memory.
No, that's not the sane way for most workloads. Leaving a lot of memory on the table for worst case scenario is a huge waste.
That's incorrectly implying that OpenIFS tasks are 'wasting' a lot of memory. I can assure you they are not. Maybe other projects don't calculate accurate app memory usage, but we do. CPDN (me) calculates the rsc_memory_bound by observing the task processes and adding a small overhead to allow for any syslib differences & leaks. As I've repeatedly said, the model hits peak memory every timestep, which is very frequent. If you or boinc don't respect these values, the task risks a crash. And I see that happening in ~25-30% of tasks sent out, so it's a serious problem that's costing a lot of resends.

I don't understand what you mean by 'worst case' for memory. Each task is predictable in its use of memory - there are no 'worst case' tasks; each task uses the same amount of memory because each takes an identical code path. Only the starting conditions are different.

If OpenIFS can pre-allocate instead of keeping growing and shrinking memory usage, that would help a lot.
OpenIFS makes use of dynamic memory, largely through heap memory and certain parts of the code use more heap than others. No-one codes numerical models to 'pre-allocate' memory for the entire code path - that makes no sense and is impractical. The only time I've seen that done is on early parallel machines which didn't support dynamically allocated memory, but that wasn't by choice.

1) A per-WU or per-app flag that instructs client to reserve the full rsc_memory_bound for this WU/app and completely ignore the measured working set size.
I'm talking about the decision to start the task. Since the task hasn't run yet, the client doesn't know its working set size, so it should respect the rsc_memory_bound before starting it.
2) Another flag to communicate the minimal memory requirement. Always use at least this number for working set size even if measured one is lower.
The minimum requirement IS the rsc_memory_bound that comes with the task download!

The client can treat all apps without the new flags as it does today. This is hopefully also a less intrusive change that would be easier to implement. OpenIFS might need both, but a lot of other apps could potentially benefit from 2) to avoid starting tasks when there isn't enough memory left.
No! OIFS doesn't need both. It has already given the client all the information it needs. It's the decisions the client takes that are the problem - not the information given to the client.

At the moment, the client is capable of starting, say, 3 OIFS tasks concurrently, each needing 5 GB, on a 4-core machine with 8 GB of memory. That's clearly going to crash the tasks, and possibly the machine, and boinc should really be capable of catching that. Unfortunately, it may also prevent CPDN from rolling out the higher-resolution configurations, which need more memory and which the scientists are looking for. So I hope we can solve this soon and lessen the impact on the volunteers.
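The admission check being asked for here can be sketched in a few lines (illustrative logic only, not the actual BOINC client implementation):

```python
GB = 1024 ** 3

def can_start(candidate_bound, running_bounds, available):
    """Admit a new task only if the declared rsc_memory_bound of all
    running tasks, plus the candidate's, fits in available RAM."""
    return sum(running_bounds) + candidate_bound <= available

# The example above: an 8 GB machine and tasks declaring 5 GB each.
print(can_start(5 * GB, [], 8 * GB))        # True  - the first task fits
print(can_start(5 * GB, [5 * GB], 8 * GB))  # False - a second would over-commit
```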
ID: 67668
Glenn Carver

Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 67669 - Posted: 13 Jan 2023, 20:49:55 UTC - in response to Message 67665.  
Last modified: 13 Jan 2023, 20:50:53 UTC

SolarSyonyk wrote:
We don't hit the first batch deadlines for 950 & 951 until 19th January.
I don't see any way to have all my WUs uploaded by the 19th at the current rates, short of physically moving a bunch of machines to places with a faster internet connection and hoping they can connect. Some of those are 19th WUs, expiring in 6d per the client.
OK, do you have a rough idea of how many extra days you need? I've already raised this with CPDN following Richard's message about the grace period. I'm in a meeting with them on Monday, so I can talk about it then if you give me a rough estimate.
ID: 67669
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67670 - Posted: 13 Jan 2023, 21:03:56 UTC
Last modified: 13 Jan 2023, 21:33:51 UTC

Can you confirm what you are saying here? It sounds like the WUs issued on Dec 20th 2022 (#950) and 21st 2022 (#951) will hit their deadline to be uploaded by 19th January 2023? I thought CPDN WUs had a 12-month deadline.

We have over 200 WUs to upload, across 2 different drives; I am not sure which, if any, are from the 2 batches stated.

Just to confirm, I have not seen any uploads (on the main server with the majority of completed tasks); all I see is 'back off...'.
ID: 67670
computezrmle

Joined: 9 Mar 22
Posts: 30
Credit: 963,113
RAC: 46,932
Message 67671 - Posted: 13 Jan 2023, 21:09:17 UTC - in response to Message 67669.  

... do you have a rough idea of how many extra days you need?

I suspect that's the wrong question.
As far as I understand the various comments, there are volunteers getting a connection but at very low bandwidth.
Others (including myself) are not able to upload anything, at least since throttling was activated.

You may need to stop sending out new tasks or resends until the backlog has been processed.
ID: 67671
wujj123456

Joined: 14 Sep 08
Posts: 87
Credit: 32,981,759
RAC: 14,695
Message 67672 - Posted: 13 Jan 2023, 21:23:04 UTC - in response to Message 67668.  
Last modified: 13 Jan 2023, 21:55:18 UTC

Let me put this first, since a lot of your points seem to be based on this misunderstanding.

The minimum requirement IS the rsc_memory_bound that comes with the task download!

No. rsc_memory_bound is the maximum; there is no parameter for the minimum from what I can see.

rsc_memory_bound: an estimate of the app's largest working set size
From: https://boinc.berkeley.edu/trac/wiki/MemoryManagement

max_mem_usage = rp->wup->rsc_memory_bound;
From https://github.com/BOINC/boinc/blob/master/client/app.cpp#L293

The lack of a minimum, using the measured working set size instead, is IMO the root cause of why OpenIFS is being scheduled when it shouldn't be. We may disagree on the technical details, but I do agree with you that the client shouldn't have started as many OpenIFS tasks as it currently does. That's a problem that can only be solved by a change to the client.

On the other hand, I am not sure this bound is even used to kill tasks: https://github.com/BOINC/boinc/blob/master/client/app_control.cpp#L910. The commented-out code was implementing the abort as described in the wiki link above. This might explain why I've seen LHC Theory tasks stuck with way more memory than rsc_memory_bound but never aborted. If a lot of developers misunderstood rsc_memory_bound, well, I guess that could also be why this code is commented out... 😂

Perhaps my two suggestions might make more sense to you now, but if my information is incorrect, feel free to point me to the code/doc.

Edit: Well, suggestion 1) in my original reply was indeed unnecessary. As long as we get the memory lower-bound parameter, a project/task can simply set the lower bound equal to the maximum bound to achieve the same effect as 1). Fewer parameters and less code is better.

That's incorrectly implying that OpenIFS tasks are 'wasting' a lot of memory.

First, that's not my assumption. My point is not specific to OpenIFS; it's only about the part where you mentioned the 'sane way for most workloads'. You are assuming that a whole lot of workloads, or at least BOINC workloads, behave like OpenIFS. That's definitely wrong based on what I've observed in other applications. For a lot of them, even ones requiring a large amount of memory like LHC, the active working set is pretty much constant, yet they do not hit rsc_memory_bound. Scheduling based on rsc_memory_bound would waste a lot of opportunity.

"Wasting memory" has a lot of meanings. An aggressive interpretation is that any non-active working set kept resident is a waste of memory, even if the application allocated it and used it in the past. At any moment there will be a lot of cold pages outside the active working set that could be swapped out. The whole topic of memory management efficiency is about correctly identifying that cold memory so that actual physical memory is used by the active working set. What you mentioned about OpenIFS is why I said "That's not always possible though". That's fair. It just makes the job of memory management harder, or requires specific knowledge to be communicated from the application to the memory management code, which here is the boinc client.
ID: 67672
SolarSyonyk

Joined: 7 Sep 16
Posts: 257
Credit: 31,958,609
RAC: 36,807
Message 67673 - Posted: 13 Jan 2023, 22:18:09 UTC - in response to Message 67668.  
Last modified: 13 Jan 2023, 22:19:39 UTC

OpenIFS makes use of dynamic memory, largely through heap memory and certain parts of the code use more heap than others. No-one codes numerical models to 'pre-allocate' memory for the entire code path - that makes no sense and is impractical. The only time I've seen that done is on early parallel machines which didn't support dynamically allocated memory, but that wasn't by choice.


It's a style and operational difference, but "allocate all your memory up front and simply reuse it throughout execution" is a valid design choice, as is "malloc and free as you go". Personally, I prefer the first, because a stable memory footprint through execution is a lot nicer to work with for long-running tasks - though I've also worked in a lot of weird places where things like this matter. You can often get better performance from it as well, at least on Linux, by making use of the "HugeTLB" options (2MB or 1GB mappings, depending on your memory use). Avoiding TLB misses with large/huge pages can be worth some very real performance gains - and even if you don't avoid all the TLB misses, the page-table walks for 1GB or 2MB pages are a good bit quicker, as there are fewer levels to walk in the process.

If nobody's looked at the OpenIFS codebase with an eye towards making use of large pages and more or less static allocations, it might be worth a pass. It shouldn't be too hard to achieve that in a sane codebase, and I really do think the performance gains may be worth it.

Ok. do you have a rough idea of how many extra days you need? I've already raised this with CPDN following Richard's message about the grace period. I'm in a meeting with them on Monday so I can talk about it then if you give me a rough estimate?


It depends on how many of my machines can get upload slots. Running constantly, 100 GB or so on 6 Mbit is about 2 days, but so far I've spent half my time not being able to upload because there are no slots available. I'm hoping this clears out, but I'm also not seeing full utilization of the links - so... the short answer is I have no idea. I would expect an extra week would get everything in, but it just depends on how the networks behave.
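The arithmetic behind that estimate, assuming a fully saturated link and decimal units:

```python
def upload_hours(gigabytes, mbit_per_s):
    """Hours to push `gigabytes` (decimal GB) over a saturated
    `mbit_per_s` link - no protocol overhead or retries counted."""
    bits = gigabytes * 1e9 * 8
    return bits / (mbit_per_s * 1e6) / 3600

# 100 GB at 6 Mbit/s is about 37 hours of pure transfer,
# i.e. roughly 2 days once stalls and overhead are factored in.
print(round(upload_hours(100, 6)))  # prints 37
```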

Can you confirm what you are saying here, it sounds like you are saying that the WUs issued on Dec 20th 2022 (#950) and 21st 2022 (#951), will hit their deadline to be uploaded by 19th January 2023? I thought CPDN WUs had a 12 month deadline..


New OpenIFS ones are 1 month.
ID: 67673
wujj123456

Joined: 14 Sep 08
Posts: 87
Credit: 32,981,759
RAC: 14,695
Message 67674 - Posted: 13 Jan 2023, 22:41:32 UTC - in response to Message 67673.  

You can often get better performance from that as well, at least on Linux, by making use of the "HugeTLB" options (2MB or 1GB mappings, depending on your memory use). Being able to avoid TLB misses with large/huge pages can be worth some very real performance gains - and, even if you don't avoid all the TLB misses, the page table walks for 1GB or 2MB pages are a good bit quicker as there are less tables to walk in the process.

Good point on performance. Actually, most distros should have transparent huge pages enabled by now, so if an application keeps using the same range of pages, the kernel will combine them into huge pages and the performance gain comes without any need to change the application or its config. One can verify whether transparent huge pages are in use by checking AnonHugePages in /proc/meminfo. It's enabled on my Ubuntu 22.04 and Arch.
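A quick way to do that check programmatically (Linux only; a small sketch, not part of any BOINC tooling):

```python
def anon_huge_pages_kb(meminfo="/proc/meminfo"):
    """Return the AnonHugePages figure (in kB) from a meminfo file,
    or None if the file or the field is missing."""
    try:
        with open(meminfo) as f:
            for line in f:
                if line.startswith("AnonHugePages:"):
                    return int(line.split()[1])
    except OSError:
        pass
    return None

kb = anon_huge_pages_kb()
print(f"AnonHugePages: {kb} kB" if kb is not None else "not reported")
```

A non-zero value means the kernel is currently backing some anonymous memory with transparent huge pages.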
ID: 67674
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67676 - Posted: 13 Jan 2023, 22:56:05 UTC
Last modified: 13 Jan 2023, 23:08:54 UTC

Thank you SolarSyonyk, I honestly completely missed that and just assumed it would be the usual 12 months.
ID: 67676
Vato

Joined: 4 Oct 19
Posts: 13
Credit: 7,300,561
RAC: 14,819
Message 67677 - Posted: 13 Jan 2023, 23:03:16 UTC

One of my machines only has 8 GB of RAM and it has been working flawlessly (apart from the file upload issues) because I have max_concurrent set to 1 in my app_config.xml (project_max_concurrent would also work). But there is no way for this to be set by the server AFAIK - it has to be done by the user.
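For reference, a host-side app_config.xml along those lines might look like this (the app name here matches the oifs_43r3_ps tasks mentioned elsewhere in the thread; adjust it to whatever name your client shows):

```xml
<app_config>
  <app>
    <name>oifs_43r3_ps</name>
    <!-- run at most one OpenIFS task at a time on this host -->
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>
```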
ID: 67677
wateroakley

Joined: 6 Aug 04
Posts: 185
Credit: 27,123,458
RAC: 3,218
Message 67678 - Posted: 13 Jan 2023, 23:30:04 UTC - in response to Message 67669.  

23:15 GMT 13 Jan 2023. Here, the backlog of uploads cleared a few minutes ago. There is about six days of OpenIFS work cached, with deadlines from Sat 21 Jan 2023 to Mon 23 Jan 2023, followed by the next OpenIFS task deadlines from 10 to 12 Feb 2023. Provided everything is left running 24/7, it all looks doable here within the deadlines.
Ok. do you have a rough idea of how many extra days you need? I've already raised this with CPDN following Richard's message about the grace period. I'm in a meeting with them on Monday so I can talk about it then if you give me a rough estimate?

ID: 67678
Jean-David Beyer

Joined: 5 Aug 04
Posts: 1060
Credit: 16,542,479
RAC: 2,180
Message 67679 - Posted: 14 Jan 2023, 0:23:41 UTC - in response to Message 67647.  

On the positive side, I seem to be one of the presumably few(er) who are able to latch on to trickle in right now. Hopefully the latch is strong enough to finally drain it all in this one shot.


I forget how many tasks I had to upload on this current go-around - 10 to 15, I would guess. I managed to upload them all.
I am currently running 12 BOINC tasks at a time, of which 5 are oifs_43r3_ps, four are WCG, one Einstein, and two MilkyWay.

The trickles do not go up instantly, but on a longer-term average they are keeping up. At the moment, there are none waiting to upload.

My machine is:

Computer 1511241

Total credit 	6,612,421
Average credit 	101.08
CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16
Operating System Red Hat Enterprise Linux 8.7 (Ootpa) [4.18.0-425.10.1.el8_7.x86_64|libc 2.28]
BOINC version 	7.20.2
Memory 	62.4 GB
Cache 	16896 KB
Swap space 	15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	475.33 GB
Measured floating point speed 	6.05 billion ops/sec
Measured integer speed 	24.32 billion ops/sec
Average upload rate 	2106.58 KB/sec
Average download rate 	8283.46 KB/sec

ID: 67679
Stony666

Joined: 9 Feb 21
Posts: 9
Credit: 10,334,808
RAC: 880,522
Message 67693 - Posted: 14 Jan 2023, 10:31:07 UTC - in response to Message 67679.  

Hi,

I posted a few days ago that nothing works for me.

After connection timeouts, it's now the old HTTP transient error again.

I have more than 400 WUs waiting for upload. Not a single file has found its way to the server.
In 6 days most of the WUs run out.

I will lose 470 days of computing when that happens.

Regards Jörg
ID: 67693
ncoded.com

Joined: 16 Aug 16
Posts: 73
Credit: 52,932,477
RAC: 8,823
Message 67695 - Posted: 14 Jan 2023, 10:46:29 UTC
Last modified: 14 Jan 2023, 10:54:44 UTC

Stony666, we were in a similar position, but in just the last hour 3 out of 5 hosts started to upload, so hopefully you'll get a connection shortly.

We are seeing a single connection uploading at around 500-1500 kB/s (as reported by BOINC).
ID: 67695
©2024 climateprediction.net