Posts by xii5ku

1) Message boards : Number crunching : OpenIFS Discussion (Message 68163)
Posted 31 Jan 2023 by xii5ku
Post:
Glenn Carver wrote:
It's not possible to 'filter-out' the triple-errors (if I understand what you mean).
I mean filtering after the failures occurred, not before, for example by grep'ing through the stderr.txt of the three results of a failed workunit. It might be possible to pick up on a few keywords which distinguish reproducible model errors from non-reproducible 'operational' failures (such as suspend/resume related issues if they are still relevant after the upcoming application updates, OOM, out of disk space, empty stderr.txt, etc.).
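A minimal sketch of what such post-hoc filtering could look like, assuming the stderr.txt files of a failed workunit's results have been saved locally; the keyword lists are illustrative guesses, not a vetted classification:

#!/bin/bash
# Rough triage of stderr.txt files passed as arguments.
# Keyword patterns below are hypothetical examples only.
model_errors='negative theta|levels crossing|timestep'
operational='double free|out of memory|No space left on device'

for f in "$@"
do
	if [ ! -s "$f" ]; then
		echo "$f: empty stderr (operational failure?)"
	elif grep -Eqi "$model_errors" "$f"; then
		echo "$f: looks like a reproducible model failure"
	elif grep -Eqi "$operational" "$f"; then
		echo "$f: looks like a non-reproducible operational failure"
	else
		echo "$f: unclassified"
	fi
done

Run as e.g. ./triage.sh wu_12345_result*.stderr.txt; only workunits where all results land in the "model failure" bucket would then be excluded from re-issuing.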
2) Message boards : Number crunching : The uploads are stuck (Message 68162)
Posted 31 Jan 2023 by xii5ku
Post:
Upload11 is back up.
It had a little more than 2 hours of availability today. Time to celebrate...?

Yes I see 62 people have managed to report tasks and the number listed as in progress has dropped significantly.
From 4192 to 3952 (-240) in the last 24 hours.

________

On the idea of turnaround-time based credit: The trick to wean people off month-deep work buffers is to set reporting deadlines that strike a good compromise between required completion times (including the time needed to get result data offloaded) on one hand and project goals on the other hand.

________

So there is talk of new oifs_43r3_bl work batches coming up some time soon. Did CPDN engage a new provider for their upload file handler server yet? It can't realistically stay at the current one, can it? (December 24: Server becomes inaccessible to CPDN's admin. January 30: Server becomes inaccessible to CPDN's admin. Not to mention everything else which happened in between, or the warning signs that were already showing before December 24.) It seems almost as if deeper-rooted issues of the server implementation keep getting ignored.
3) Message boards : Number crunching : w/u failed at the 89th zip file (Message 68133)
Posted 30 Jan 2023 by xii5ku
Post:
Jean-David Beyer wrote:
Is 208 consecutive tasks out of 208 total tasks a few or a lot?
It's all relative. Right now I am seeing this on your single Linux host:
oifs_43r3_ps results: 244 total, 243 valid, 1 in progress

This is mine across 2 (earlier partly 3) hosts:
oifs_43r3_ps results: 1493 total, 954 valid, 484 in progress (would have been done by now if not for the permanent upload server absence), 55 error

My last errors were mostly from November, when I shut down and resumed one of my hosts. Then a few errors from Dec 1, one from Dec 3, and no errors since. But I have successfully avoided suspending tasks to disk ever since November, with the exception of 1 deliberate test which AFAICT didn't fail. (It might still fail, if it is among the pending uploads.)

My upload link width allows me to return 48 results per day. (This is rather little relative to the CPUs, RAM, and disk space which I could spare.) This means my 954 valid tasks translate to merely 20 days' production; the rest was server downtime. My best CPUs are 32-core CPUs, one of which alone could produce slightly more than 48 results per day. Some folks have even larger CPUs, or similarly large ones but with a higher power budget than mine.

If we go by credit of the last week or last month, my 48 results/day during the brief times when the upload server is functioning put me above the average. But a few big producers are missing from these 3rd party stats because they didn't enable statistics export. E.g., based on last week's credit of my team, there was 1.3 M credit given to one or more users on my team without stats export, compared to my 400 k or your 80 k of last week.
4) Message boards : Number crunching : The uploads are stuck (Message 68129)
Posted 30 Jan 2023 by xii5ku
Post:
(on the avoidance of overly deep work buffers at client computers)
wujj123456 wrote:
This reminds me of some other project's trick. They grant additional credits if tasks are returned within X days to disincentivize excessive hoarding. No matter how people fake it, whether it's a bogus core count or multiple clients per machine, they can't fake the actual compute throughput of the machine. Given CPDN has its credit-granting script run once a week instead of continuously, it might even be possible to adjust for upload server downtime if necessary.
GPUGrid used to apply different credit depending on whether the turnaround time of a result was <24 h, 24…48 h, or >48 h. I am not sure if they are still doing it; AFAICS the corresponding FAQ vanished from their message board. Folding@Home practically prevents work buffering in their client and applies an extremely nonlinear credit based on turnaround time (to a degree which is ridiculous; little credit is given to the work done, much credit is given to the speed at which the work was done). Presumably GPUGrid did this, and F@H does this, because newer workunit batches are built based on results from previous workunit batches.
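To illustrate how such a scheme works (the brackets and multipliers below are invented for the example and are not GPUGrid's or F@H's actual values):

#!/bin/bash
# Hypothetical turnaround-based credit bonus: base credit times a
# multiplier that depends on how quickly the result came back.
base_credit=$1     # credit for the work itself
turnaround_h=$2    # hours between sending the task and reporting the result

if   (( turnaround_h < 24 )); then mult=150   # +50% bonus
elif (( turnaround_h < 48 )); then mult=125   # +25% bonus
else                               mult=100   # no bonus
fi
echo $(( base_credit * mult / 100 ))

With a curve like this, a host sitting on a two-week work buffer forfeits the bonus on every task, while its actual compute throughput is still credited.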
5) Message boards : Number crunching : OpenIFS Discussion (Message 68127)
Posted 30 Jan 2023 by xii5ku
Post:
Glenn Carver wrote:
xii5ku wrote:
About next OpenIFS batches:
One or another frequenter of this board mentioned it already: Consider to increase the "max # of error tasks" workunit parameter (and total tasks of course). 3 as in the current WUs isn't a lot. (Too high "max # of error tasks" would of course be bad if crashes were highly repeatable on independent hosts, such as with bad input parameters, but that's evidently not a problem at least currently.)
It's staying at 3. Some of the model perturbations lead to the model aborting, negative theta, levels crossing, too short timestep etc. We don't want to send out too many repeats of tasks which will always fail. 3 is usually enough to get past any wobbles and if necessary another small batch to rerun can be sent out.
Thanks, sounds good! If it's feasible to filter out such triple-errors which were not repeats of one and the same reproducible model failure, and turn these back into extra workunits, then that's obviously a lot better than a higher "max # of error tasks" setting.
6) Message boards : Number crunching : w/u failed at the 89th zip file (Message 68126)
Posted 30 Jan 2023 by xii5ku
Post:
Jean-David Beyer wrote:
It is really puzzling to me that I have such good luck with these, and others have bad.
While it certainly is partly a matter of bad luck vs. good luck, it partly is also a simple matter of statistics. The more tasks a user runs, the more error tasks this user is likely to encounter.

(I for one am one of the users who complete comparatively few tasks, because of my upload bandwidth limit, which means I don't run a lot of tasks while the upload server is up, and am down to running only 1 "pilot" task per computer while the worthless upload server is down. Which it is most of the time. Plus, by now I practically never suspend a task to disk, only to RAM. Consequently, I have had very few errors so far.)
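(For reference, "suspend only to RAM, never to disk" corresponds to BOINC's "Leave non-GPU tasks in memory while suspended" preference. In global_prefs_override.xml it would look roughly like this; a minimal sketch, other preference elements omitted:

<global_preferences>
	<!-- keep suspended tasks resident in RAM instead of writing them out -->
	<leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>

The same switch is also available in BOINC Manager's computing preferences.)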
7) Message boards : Number crunching : OpenIFS Discussion (Message 68090)
Posted 27 Jan 2023 by xii5ku
Post:
About next OpenIFS batches:
One or another frequenter of this board mentioned it already: Consider to increase the "max # of error tasks" workunit parameter (and total tasks of course). 3 as in the current WUs isn't a lot. (Too high "max # of error tasks" would of course be bad if crashes were highly repeatable on independent hosts, such as with bad input parameters, but that's evidently not a problem at least currently.)
8) Message boards : Number crunching : How to optimize credit production with OpenIFS tasks (Message 68088)
Posted 27 Jan 2023 by xii5ku
Post:
if you're concentrating on one BOINC project at a time, it's probably not worth using hyperthreading at all.
This is not universally true. Quite often there are gains,¹ at least with Haswell and later and even more so with Zen and later.

________
¹) WRT host throughput. Task energy is another question.
9) Message boards : Number crunching : OpenIFS Discussion (Message 68086)
Posted 27 Jan 2023 by xii5ku
Post:
@Richard Haselgrove,
note, CPDN's overall oifs_43r3_ps progress _right now_ is likely not subject to one of the three modes which I described, because a few things were apparently changed after the tape storage disaster.

________

Glenn Carver wrote:
If you altered the <ncpus> tag in cc_config.xml from -1 to a large number, as a way of bypassing the 'no more tasks too many uploads in progress' problem when upload11 was down, could I please remind everyone to change that tag back to <ncpus>-1</ncpus>.

There are some more OpenIFS batches coming soon and we don't want 100+ tasks landing on volunteer machines that really don't have 100 cores: e.g. https://www.cpdn.org/show_host_detail.php?hostid=1524863.
This host (it is not one of mine) had ncpus set to 100 when I looked at this link just now.
This *may* have been done out of a desire to download new work while lots of uploads were pending. (Fetching new work in such a situation is risky though, given the history of upload11's operations.) However, there is also another possible explanation for why the user did this:

Boinc-client and its control interfaces (web control, boincmgr, global_prefs_override.xml, you name it) offer control of the number of CPUs usable by boinc only as a percentage, not as an absolute number of CPUs. Hence some users apply this simple trick: Set <ncpus> to 100, et voilà, <max_ncpus_pct> suddenly equals the actual absolute CPU count which boinc shall use.
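A rough sketch of that trick (only the relevant elements shown; 32 threads is just an example value):

cc_config.xml:
<cc_config>
	<options>
		<!-- pretend the machine has 100 logical CPUs -->
		<ncpus>100</ncpus>
	</options>
</cc_config>

global_prefs_override.xml:
<global_preferences>
	<!-- with <ncpus> at 100, this percentage means "use exactly 32 threads" -->
	<max_ncpus_pct>32.0</max_ncpus_pct>
</global_preferences>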

So if you see such hosts and wonder whether their operator is doing something silly or undesirable: it may very well be that the host is in fact configured properly. (I guess project admins could check the scheduler logs; <max_ncpus_pct> is sent by the host in each scheduler request.)
10) Message boards : Number crunching : OpenIFS Discussion (Message 68052)
Posted 25 Jan 2023 by xii5ku
Post:
I did not make this statement after looking at one or two hosts.
I looked at recorded server_status.php history. (Sum of 'tasks ready to send' and 'tasks in progress', plotted over time, oifs_43r3_ps only. grafana.kiska.pw has got the record.)
In other words, by "we" I don't refer to myself, but to everyone combined who is, or has been, computing oifs_43r3_ps.
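For anyone who wants to keep such a record themselves: a stock BOINC server exposes the same page in machine-readable form via server_status.php?xml=1. A minimal logger sketch; the element names below are the stock project-wide totals, not per-app figures, so adjust them to what CPDN's page actually emits:

#!/bin/bash
# Append the current "ready to send" and "in progress" counts to a log once per hour.
# Requires GNU grep (for -P).
url='https://www.cpdn.org/server_status.php?xml=1'
while true
do
	xml=$(curl -s "$url")
	rts=$(echo "$xml" | grep -oPm1 '(?<=<results_ready_to_send>)[0-9]+')
	rip=$(echo "$xml" | grep -oPm1 '(?<=<results_in_progress>)[0-9]+')
	echo "$(date -Is) ready_to_send=$rts in_progress=$rip" >>cpdn_progress.log
	sleep 3600
done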

We had three modes of progress in January:
– upload11 was down. Progress rate was 0. (We had four periods of this in January so far.)
– upload11 was online and ran without notable limit of connections. Progress rate was ~3,300 results during 14 hours, followed by upload11 going down again. (This mode was played out twice in January.)
– upload11 was online and ran with a throttled connection limit. Progress rate was a fairly constant ~1,500...~2,000 results/day. (There was a single period in this mode. It lasted 8d3h, until the tape storage issue.)

The latter constant progress rate cannot be the rate at which we are actually producing. If it were, there would have been noticeably steeper progress at the start of that stage, when everybody still had previously stuck files to upload.
11) Message boards : Number crunching : OpenIFS Discussion (Message 68048)
Posted 25 Jan 2023 by xii5ku
Post:
Richard Haselgrove wrote:
I think the continued issue of resends is a very minor concern in the grand scheme of things.
Except that if uploads are attempted for two results of the same workunit, this aggravates upload11.cpdn.org's troubles.

This month so far, whenever upload11.cpdn.org was up at all, it was _never_ able to take our result data as fast as we were able to compute it. If we now start to compute redundant tasks, this only gets worse.
12) Message boards : Number crunching : oifs_43r3_ps v1.05 slot dir cleanup (Message 68020)
Posted 24 Jan 2023 by xii5ku
Post:
I put a trivial cleanup script together after all.
#!/bin/bash
# Reclaim disk space in the BOINC slot directories: for each of the file
# name prefixes below, keep only the newest matching file per slot and
# delete the older ones.

echo "=== before ==="
df -h /var/lib/
echo

# loop over all slot directories of the BOINC client
for d in /var/lib/boinc/slots/*/
do
	# prefixes of the files which accumulate; only the newest is needed
	for p in BLS LAW srf
	do
		# skip this prefix if the slot has no matching files
		ls "${d}${p}"* >/dev/null 2>&1 || continue
		# list matching files, newest first
		f=($(ls -t "${d}${p}"*))
		# delete everything except the newest file (index 0)
		for ((i=1; i<${#f[*]}; i++))
		do
			rm -f "${f[i]}"
		done
	done
done

echo
echo "=== after ==="
df -h /var/lib/
13) Message boards : Number crunching : oifs_43r3_ps v1.05 slot dir cleanup (Message 68010)
Posted 23 Jan 2023 by xii5ku
Post:
Thank you for looking at this.
Unfortunately I did not make the connection between slot number and task identity at the time. I'll see if I can find another one of these.

I thought of setting up a little periodic cleanup but haven't put it together yet, because it's better to avoid suspension in the first place, more to lower the risk of task failures than because of the disk space requirement.
14) Message boards : Number crunching : How to Prevent OpenIFS Download (Message 68009)
Posted 23 Jan 2023 by xii5ku
Post:
Glenn Carver wrote:
[...] there's an issue with the boinc client that it will start up as many tasks as free cores available to boinc. It does not respect the memory limit of the task, leaving it to volunteers like yourself to fix it. The problem with the client was unexpected and we're looking into workarounds we can put in place on the server to deal with this. [...] We've since found that LHC also hit this problem so we'll probably follow their approach to limiting tasks downloaded to machines.
I haven't been at lhc@home for a while, so I don't know what their approach looks like. But a limit on tasks in progress is not a good replacement for the desired limit on tasks which are executing.

Stages of a "task in progress":
(ready to send)
– assigned to host
– downloading
– ready to run
– executing
– uploading
– ready to report
(reported)

Each of the stages can take unpredictably long for a variety of reasons. Hence it's clear that # in progress cannot control # executing very well, to put it mildly.
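(On the volunteer side, the number of executing tasks can be capped directly with an app_config.xml in the project directory. A minimal sketch, assuming the application's short name really is oifs_43r3_ps; check client_state.xml if in doubt, and size the number to the host's RAM:

<app_config>
	<app>
		<name>oifs_43r3_ps</name>
		<!-- run at most 4 OpenIFS tasks at a time on this host -->
		<max_concurrent>4</max_concurrent>
	</app>
</app_config>

Re-read the config files with boinccmd --read_cc_config or restart the client for it to take effect. But of course this relies on each volunteer knowing about it and setting it, which is exactly the problem.)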

Also, oifs_43r3_ps concurrency is only part of the equation. The other part is what else is going on on the host. It makes a big difference whether the host is running a desktop environment or is a dedicated cruncher.
15) Message boards : Number crunching : The uploads are stuck (Message 67977)
Posted 22 Jan 2023 by xii5ku
Post:
Saenger wrote:
I got this with the last .zip for one WU: [...]
While at the same time 3 new WUs got downloaded without any problem.

Jean-David Beyer wrote:
I am confused.

I used to go to climateprediction.net to get here and yesterday evening that failed. I could not even get anywhere. I had to change it to cpdn.org to get here today. Could that be why I cannot upload anything?

Checking if climateprediction.net is down or it is just you...
It's not just you! climateprediction.net is down.
There are (at least) four physically different servers:

    www.climateprediction.net — Just a web site, basically unrelated to the CPDN BOINC operations. It's currently down for unknown reasons.
    (Actually, it is related to the CPDN BOINC functions in the sense that the BOINC project URL is www.climateprediction.net too. I suppose it is impossible to attach new clients to CPDN for as long as this web server is down.)


    www.cpdn.org — The main CPDN BOINC site. Hosts scheduler, BOINC's own web pages and message board, download server, validator, assimilator… This one is up and running well.


    upload11.cpdn.org — Currently hosts the upload file handler for Linux OpenIFS work. It's currently up but configured to accept only very few simultaneous HTTP connections. So few that most of our connection attempts are rejected. The reason for this is that this server ran out of disk space yesterday and first needs to offload a lot of data to another external storage server. It can accomplish this only if there isn't too much data incoming from client computers at the same time. Eventually this situation will be over and the admin will increase the allowed connection count again, somewhat.
    Expect this sort of unavailability to happen again and again until the current OpenIFS work is done. (Unless CPDN can afford a storage subsystem which has orders of magnitude more temporary space, or can set up an orders-of-magnitude faster outbound data link from the upload file handler to the backing store.)


    upload???.cpdn.org — Currently hosts the upload file handler for Windows Weather@Home work. I take it from user reports here that this server is down right now too. (I'm just guessing that because I don't have any W@H uploads myself.)


________

I hope this gives a picture why some things work and others don't.

16) Message boards : Number crunching : The uploads are stuck (Message 67965)
Posted 22 Jan 2023 by xii5ku
Post:
Richard Haselgrove wrote:
Saw Andy's message, timed at 21:55 last night. It doesn't seem to have made much difference - I'm still in multi-hour project backoffs. I'll check exactly how many uploads are getting through when I've woken up a bit more. [...] Edit - one machine got a 5-minute burst of uploads around 22:45 last night, and another around 04:45 this morning, but nothing since then. (A total of 100 files across the two bursts)
The situation turned from a certain portion of transfers failing (and going into retry loops), to a large portion of connection requests being rejected.

Ever since the upload server was revived, it has evidently been working near or at its throughput limit, and only the details of how it copes vary slightly over time. From the project's infrastructure point of view there is one good aspect of this: The upload infrastructure is well utilized (as long as it doesn't go down like on Christmas eve and during the first recovery attempt, or attempts, in early January). For us client operators it's of course bad, because we have to constantly watch and possibly re-adjust the compute clients to prevent overly large transfer backlogs, or even outright task failures if disk space runs out. The client can deal with a situation like this somewhat on its own, but not particularly well.
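A minimal sketch of such client babysitting with boinccmd (the threshold of 100 pending files is arbitrary, the project URL must match what boinccmd --get_project_status reports, and the exact --get_file_transfers output format may vary between client versions):

#!/bin/bash
# Stop fetching new CPDN work while the upload backlog is large,
# allow it again once the backlog has drained.
url='https://www.climateprediction.net/'
limit=100

# count pending file transfers (one "name:" line per transfer)
pending=$(boinccmd --get_file_transfers | grep -c 'name: ')
if (( pending > limit ))
then
	boinccmd --project "$url" nomorework
else
	boinccmd --project "$url" allowmorework
fi

Run it from cron every few minutes; it only toggles work fetch and never touches running or uploading tasks.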


SolarSyonyk wrote:
I've got a dual core EPYC VM running with 10GB RAM at $11.63/mo. It's running about 20h per task, with two going at any given time: https://www.cpdn.org/results.php?hostid=1538282
So either you will be lucky and the upload server availability recovers soon enough. Or you will need to jump through hoops to add storage to the VM while it is up and running. Or you will have to suspend the unstarted tasks, wait for the running tasks to complete, and then shut the VM down. Or you could shut down the VM right away and risk the tasks erroring out after resumption. Or you could suspend the VM, at the extra charge of the provider storing your VM state.
17) Message boards : Number crunching : w/u failed at the 89th zip file (Message 67963)
Posted 22 Jan 2023 by xii5ku
Post:
AndreyOR wrote:
xii5ku wrote:
"double free or corruption (out)" is not caused by lack of free RAM.
It's something else.
If not directly then perhaps indirectly? Could it be that pushing the RAM limits is more likely to bring out these types of memory related problems?
It's hard to tell, but I believe this is unlikely. Keep in mind: When there is a lack of free RAM but some processes on the system request new RAM allocations, what follows is not that the allocations fail.¹ Instead, the OS first tries to make more RAM available by swapping less recently accessed pages out to swap space (at the price of the entire system becoming less and less responsive, possibly to a degree that users mistake the system for being frozen entirely). When all swap space is used up (or if there isn't any swap space attached in the first place), the kernel proceeds to act on the out-of-memory situation by picking processes with a large memory footprint and terminating them. (This is known as Linux's "OOM killer". It is as if SIGKILL were sent to the affected process, which the process cannot catch. Therefore the process no longer has a chance to exercise any (possibly buggy) code paths; it simply goes away immediately.)²
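(Whether the OOM killer was involved in a given task failure can be checked in the kernel log. A quick sketch; an OOM-killed task leaves a trace here, whereas a "double free" abort does not:

dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
# or, on systemd machines, including earlier boots:
journalctl -k | grep -iE 'out of memory|oom-killer|killed process'

Both may need root, depending on the distribution's kernel log permissions.)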

That said, the period during which swap space starts to be used and system responsiveness degrades could uncover bugs or misbehaviour in programs with realtime functionality. An example of such functionality could be I/O watchdogs which consider an I/O operation failed if it doesn't succeed within a certain time frame. (In turn, the handling of such an assumed failure could easily contain bugs, such as memory corruptions, because error handling code paths like this may be rarely exercised in testing.) I don't know if this class of "realtime functionality" is relevant to the OpenIFS application or the CPDN wrapper.

________
¹) A failed memory allocation would lead to various random program misbehaviour: The program might catch the failure but might not have a good strategy to back out of such a situation. Or the allocation error handler might contain a programming bug. Or the program may not check for failure of the allocation attempt and use the returned null pointer as if it pointed to successfully allocated memory. In the latter case, the program would most likely crash with a segfault. But see the next footnote.

²) Consequently, if a program attempts to allocate memory when the system doesn't have any available anymore, two things can happen: Either the allocation succeeds but the required system call takes a rather long time to complete. Or the process which performs the allocation attempt is terminated by the OOM killer. That is, with Linux's default overcommit settings, memory allocation requests by userspace processes practically never fail with a returned null pointer.

AFAIK.
18) Message boards : Number crunching : If you have used VirtualBox for BOINC and have had issues, please can you share these? (Message 67954)
Posted 21 Jan 2023 by xii5ku
Post:
SG Felix wrote:
windows 10 is my main System, on which VBox runs :)
So no sudo usermod :)
Hm, not sure then. (The last time I used VBox on Windows myself was a while ago, on Win 7 Pro.) According to a superficial web search, uninstalling and reinstalling VBox, running the installer as admin while doing so, might help. Or perhaps overwriting the contents of C:\Users\%USERNAME%\.VirtualBox\VirtualBox.xml with that of VirtualBox.xml-prev. Or resetting the access permissions of the .VirtualBox folder and everything in it.
19) Message boards : Number crunching : Hardware for new models. (Message 67950)
Posted 21 Jan 2023 by xii5ku
Post:
SolarSyonyk wrote:
I guess avoid AMD, that double free thing is being a royal pain on them for some reason, though it's happening on enough different boxes (including cloud systems I know very well are ECC) that I don't believe it's RAM errors, just something in whatever code path AMD chips end up on.
Here is an Intel Broadwell-EP (host 1534812) which has got a bunch of "double free or corruption (out)" too. (In this host's list of oifs_43r3_ps results with error status, look at those which finished on January 12 and took less than 70,000 seconds.)
20) Message boards : Number crunching : w/u failed at the 89th zip file (Message 67939)
Posted 21 Jan 2023 by xii5ku
Post:
"double free or corruption (out)" is not caused by lack of free RAM.
It's something else.

