OpenIFS Discussion

Richard Haselgrove
Joined: 1 Jan 07
Posts: 925
Credit: 34,100,818
RAC: 11,270
Message 67807 - Posted: 17 Jan 2023, 16:16:06 UTC

As threatened, here's a look at the first minute or so of six tasks starting at once in a 64 GB machine. I don't think I've yet seen a working set size above 4.2 GB per task on this measure, but the log is still running and I'll scan through it later.

@ Glenn, how long would you expect it to take to reach the first peak memory use?

Tue 17 Jan 2023 15:59:55 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0668_2008050100_123_977_12194312_0: WS 1192.73MB, smoothed 596.37MB, swap 1663.22MB, 0.00 page faults/sec, user CPU 4.900, kernel CPU 0.660
Tue 17 Jan 2023 15:59:55 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0085_1987050100_123_956_12172729_0: WS 1190.67MB, smoothed 595.33MB, swap 1663.22MB, 0.00 page faults/sec, user CPU 4.840, kernel CPU 0.720
Tue 17 Jan 2023 15:59:55 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0454_1999050100_123_968_12185098_0: WS 1203.55MB, smoothed 601.78MB, swap 1663.22MB, 0.00 page faults/sec, user CPU 4.930, kernel CPU 0.680
Tue 17 Jan 2023 15:59:55 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0748_1983050100_123_952_12169392_0: WS 1184.74MB, smoothed 592.37MB, swap 1663.22MB, 0.00 page faults/sec, user CPU 4.870, kernel CPU 0.630
Tue 17 Jan 2023 15:59:55 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0781_2015050100_123_984_12201425_0: WS 1188.60MB, smoothed 594.30MB, swap 1663.22MB, 0.00 page faults/sec, user CPU 4.870, kernel CPU 0.660
Tue 17 Jan 2023 15:59:55 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0855_2006050100_123_975_12192499_0: WS 1196.59MB, smoothed 598.30MB, swap 1663.22MB, 0.00 page faults/sec, user CPU 4.900, kernel CPU 0.660
Tue 17 Jan 2023 15:59:55 GMT |  | [mem_usage] BOINC totals: WS 7156.88MB, smoothed 3578.44MB, swap 9979.33MB, 0.00 page faults/sec
Tue 17 Jan 2023 15:59:55 GMT |  | [mem_usage] All others: WS 2701.28MB, swap 258864.39MB, user 64.660s, kernel 33.670s
Tue 17 Jan 2023 15:59:55 GMT |  | [mem_usage] non-BOINC CPU usage: 0.78%
Tue 17 Jan 2023 16:00:05 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0668_2008050100_123_977_12194312_0: WS 2656.43MB, smoothed 1626.40MB, swap 3228.59MB, 0.00 page faults/sec, user CPU 14.200, kernel CPU 1.300
Tue 17 Jan 2023 16:00:05 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0085_1987050100_123_956_12172729_0: WS 2656.42MB, smoothed 1625.88MB, swap 3228.59MB, 0.00 page faults/sec, user CPU 14.130, kernel CPU 1.320
Tue 17 Jan 2023 16:00:05 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0454_1999050100_123_968_12185098_0: WS 2656.42MB, smoothed 1629.10MB, swap 3228.59MB, 0.00 page faults/sec, user CPU 14.180, kernel CPU 1.370
Tue 17 Jan 2023 16:00:05 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0748_1983050100_123_952_12169392_0: WS 2656.42MB, smoothed 1624.40MB, swap 3228.59MB, 0.00 page faults/sec, user CPU 14.040, kernel CPU 1.240
Tue 17 Jan 2023 16:00:05 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0781_2015050100_123_984_12201425_0: WS 2656.43MB, smoothed 1625.36MB, swap 3228.59MB, 0.00 page faults/sec, user CPU 14.050, kernel CPU 1.360
Tue 17 Jan 2023 16:00:05 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0855_2006050100_123_975_12192499_0: WS 2656.43MB, smoothed 1627.36MB, swap 3228.59MB, 0.00 page faults/sec, user CPU 14.180, kernel CPU 1.310
Tue 17 Jan 2023 16:00:05 GMT |  | [mem_usage] BOINC totals: WS 15938.54MB, smoothed 9758.49MB, swap 19371.55MB, 0.00 page faults/sec
Tue 17 Jan 2023 16:00:05 GMT |  | [mem_usage] All others: WS 2701.54MB, swap 258864.39MB, user 64.900s, kernel 33.810s
Tue 17 Jan 2023 16:00:05 GMT |  | [mem_usage] non-BOINC CPU usage: 0.63%
Tue 17 Jan 2023 16:00:15 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0668_2008050100_123_977_12194312_0: WS 2750.01MB, smoothed 2188.20MB, swap 3236.07MB, 0.00 page faults/sec, user CPU 23.690, kernel CPU 1.780
Tue 17 Jan 2023 16:00:15 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0085_1987050100_123_956_12172729_0: WS 2753.11MB, smoothed 2189.49MB, swap 3236.08MB, 0.00 page faults/sec, user CPU 23.730, kernel CPU 1.730
Tue 17 Jan 2023 16:00:15 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0454_1999050100_123_968_12185098_0: WS 2755.95MB, smoothed 2192.52MB, swap 3236.07MB, 0.00 page faults/sec, user CPU 23.700, kernel CPU 1.780
Tue 17 Jan 2023 16:00:15 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0748_1983050100_123_952_12169392_0: WS 2746.14MB, smoothed 2185.27MB, swap 3236.08MB, 0.00 page faults/sec, user CPU 23.590, kernel CPU 1.640
Tue 17 Jan 2023 16:00:15 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0781_2015050100_123_984_12201425_0: WS 2755.17MB, smoothed 2190.27MB, swap 3236.07MB, 0.00 page faults/sec, user CPU 23.550, kernel CPU 1.780
Tue 17 Jan 2023 16:00:15 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0855_2006050100_123_975_12192499_0: WS 2761.35MB, smoothed 2194.36MB, swap 3236.07MB, 0.00 page faults/sec, user CPU 23.640, kernel CPU 1.740
Tue 17 Jan 2023 16:00:15 GMT |  | [mem_usage] BOINC totals: WS 16521.73MB, smoothed 13140.11MB, swap 19416.44MB, 0.00 page faults/sec
Tue 17 Jan 2023 16:00:15 GMT |  | [mem_usage] All others: WS 2700.32MB, swap 258862.36MB, user 65.080s, kernel 33.910s
Tue 17 Jan 2023 16:00:15 GMT |  | [mem_usage] non-BOINC CPU usage: 0.47%
Tue 17 Jan 2023 16:00:25 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0668_2008050100_123_977_12194312_0: WS 3414.84MB, smoothed 2801.52MB, swap 3988.23MB, 0.00 page faults/sec, user CPU 32.870, kernel CPU 2.500
Tue 17 Jan 2023 16:00:25 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0085_1987050100_123_956_12172729_0: WS 3361.16MB, smoothed 2775.33MB, swap 3988.24MB, 0.00 page faults/sec, user CPU 33.040, kernel CPU 2.370
Tue 17 Jan 2023 16:00:25 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0454_1999050100_123_968_12185098_0: WS 2682.67MB, smoothed 2437.59MB, swap 3104.65MB, 0.00 page faults/sec, user CPU 32.870, kernel CPU 2.500
Tue 17 Jan 2023 16:00:25 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0748_1983050100_123_952_12169392_0: WS 3409.42MB, smoothed 2797.34MB, swap 3988.24MB, 0.00 page faults/sec, user CPU 32.810, kernel CPU 2.350
Tue 17 Jan 2023 16:00:25 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0781_2015050100_123_984_12201425_0: WS 2683.42MB, smoothed 2436.84MB, swap 3236.07MB, 0.00 page faults/sec, user CPU 32.720, kernel CPU 2.560
Tue 17 Jan 2023 16:00:25 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0855_2006050100_123_975_12192499_0: WS 2689.60MB, smoothed 2441.98MB, swap 3236.07MB, 0.00 page faults/sec, user CPU 32.760, kernel CPU 2.500
Tue 17 Jan 2023 16:00:25 GMT |  | [mem_usage] BOINC totals: WS 18241.10MB, smoothed 15690.61MB, swap 21541.51MB, 0.00 page faults/sec
Tue 17 Jan 2023 16:00:25 GMT |  | [mem_usage] All others: WS 2700.34MB, swap 258862.32MB, user 65.250s, kernel 34.270s
Tue 17 Jan 2023 16:00:25 GMT |  | [mem_usage] non-BOINC CPU usage: 0.88%
Tue 17 Jan 2023 16:00:35 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0668_2008050100_123_977_12194312_0: WS 1636.18MB, smoothed 2218.85MB, swap 1871.29MB, 0.00 page faults/sec, user CPU 42.490, kernel CPU 2.820
Tue 17 Jan 2023 16:00:35 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0085_1987050100_123_956_12172729_0: WS 1640.56MB, smoothed 2207.94MB, swap 1871.29MB, 0.00 page faults/sec, user CPU 42.730, kernel CPU 2.690
Tue 17 Jan 2023 16:00:35 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0454_1999050100_123_968_12185098_0: WS 1635.78MB, smoothed 2036.69MB, swap 1865.15MB, 0.00 page faults/sec, user CPU 42.630, kernel CPU 2.780
Tue 17 Jan 2023 16:00:35 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0748_1983050100_123_952_12169392_0: WS 1635.91MB, smoothed 2216.63MB, swap 1871.29MB, 0.00 page faults/sec, user CPU 42.530, kernel CPU 2.640
Tue 17 Jan 2023 16:00:35 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0781_2015050100_123_984_12201425_0: WS 1643.66MB, smoothed 2040.25MB, swap 1871.29MB, 0.00 page faults/sec, user CPU 42.400, kernel CPU 2.840
Tue 17 Jan 2023 16:00:35 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0855_2006050100_123_975_12192499_0: WS 1741.96MB, smoothed 2091.97MB, swap 1953.28MB, 0.00 page faults/sec, user CPU 42.480, kernel CPU 2.790
Tue 17 Jan 2023 16:00:35 GMT |  | [mem_usage] BOINC totals: WS 9934.05MB, smoothed 12812.33MB, swap 11303.59MB, 0.00 page faults/sec
Tue 17 Jan 2023 16:00:35 GMT |  | [mem_usage] All others: WS 2700.48MB, swap 258862.32MB, user 65.410s, kernel 34.650s
Tue 17 Jan 2023 16:00:35 GMT |  | [mem_usage] non-BOINC CPU usage: 0.89%
Tue 17 Jan 2023 16:00:45 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0668_2008050100_123_977_12194312_0: WS 4268.91MB, smoothed 3243.88MB, swap 4859.94MB, 0.00 page faults/sec, user CPU 51.590, kernel CPU 3.680
Tue 17 Jan 2023 16:00:45 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0085_1987050100_123_956_12172729_0: WS 4272.00MB, smoothed 3239.97MB, swap 4859.95MB, 0.00 page faults/sec, user CPU 51.910, kernel CPU 3.460
Tue 17 Jan 2023 16:00:45 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0454_1999050100_123_968_12185098_0: WS 4274.55MB, smoothed 3155.62MB, swap 4860.07MB, 0.00 page faults/sec, user CPU 51.870, kernel CPU 3.550
Tue 17 Jan 2023 16:00:45 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0748_1983050100_123_952_12169392_0: WS 4273.29MB, smoothed 3244.96MB, swap 4859.95MB, 0.00 page faults/sec, user CPU 51.720, kernel CPU 3.460
Tue 17 Jan 2023 16:00:45 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0781_2015050100_123_984_12201425_0: WS 4290.31MB, smoothed 3165.28MB, swap 4859.94MB, 0.00 page faults/sec, user CPU 51.430, kernel CPU 3.740
Tue 17 Jan 2023 16:00:45 GMT | climateprediction.net | [mem_usage] oifs_43r3_ps_0855_2006050100_123_975_12192499_0: WS 3569.75MB, smoothed 2830.86MB, swap 4107.27MB, 0.00 page faults/sec, user CPU 51.700, kernel CPU 3.570
Tue 17 Jan 2023 16:00:45 GMT |  | [mem_usage] BOINC totals: WS 24948.80MB, smoothed 18880.56MB, swap 28407.11MB, 0.00 page faults/sec
Tue 17 Jan 2023 16:00:45 GMT |  | [mem_usage] All others: WS 2700.48MB, swap 258862.32MB, user 65.550s, kernel 34.740s
Tue 17 Jan 2023 16:00:45 GMT |  | [mem_usage] non-BOINC CPU usage: 0.38%
Tue 17 Jan 2023 16:00:49 GMT |  | [mem_usage] enforce: available RAM 57819.62MB swap 1536.00MB
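
To scan a longer log for the largest values, a short script can help. The following is a minimal sketch, assuming the event log has been saved to a text file in the format shown above; the function name `peak_memory` and the regular expression are illustrative and not part of BOINC.

```python
import re
from collections import defaultdict

# Per-task "[mem_usage]" lines; the summary lines ("BOINC totals", "All others") do not match.
TASK_LINE = re.compile(
    r"\[mem_usage\] (?P<task>\S+): WS (?P<ws>[\d.]+)MB, smoothed (?P<smoothed>[\d.]+)MB"
)


def peak_memory(log_lines):
    """Return {task name: (peak WS in MB, peak smoothed WS in MB)}."""
    peaks = defaultdict(lambda: (0.0, 0.0))
    for line in log_lines:
        match = TASK_LINE.search(line)
        if match is None:
            continue  # not a per-task memory line
        ws, smoothed = float(match["ws"]), float(match["smoothed"])
        old_ws, old_smoothed = peaks[match["task"]]
        peaks[match["task"]] = (max(old_ws, ws), max(old_smoothed, smoothed))
    return dict(peaks)
```

For example, `peak_memory(open("boinc.log"))` would report one pair of peaks per task, where `boinc.log` is a hypothetical file name for the saved log. Note that these are peaks of the 10-second samples, not of the true memory use between samples.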
ID: 67807
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 67817 - Posted: 17 Jan 2023, 20:21:39 UTC - in response to Message 67807.  

Richard Haselgrove wrote:
I don't think I've yet seen a working set size above 4.2 GB per task on this measure, but the log is still running and I'll scan through it later.

If I just watch the output of top (which on my machine refreshes every 10 seconds), I quite frequently see 4.6 GB for the largest task. This is the RES column.
ID: 67817
Richard Haselgrove
Joined: 1 Jan 07
Posts: 925
Credit: 34,100,818
RAC: 11,270
Message 67827 - Posted: 18 Jan 2023, 8:11:27 UTC

Another possible memory-related issue. One of the six tasks I started in the simultaneous stress test yesterday (22270385) failed in the very last second.

Four of the tasks finished within a five second interval. All the others ended normally, but this one got a "process exited with code 9 (0x9, -247)". The full stderr is present, but the last two uploads were caught in the apparent upload failure this morning.
ID: 67827
wujj123456
Joined: 14 Sep 08
Posts: 87
Credit: 32,677,418
RAC: 29,639
Message 67859 - Posted: 18 Jan 2023, 18:41:19 UTC - in response to Message 67807.  

The working set size reported by the BOINC client is smoothed, and on my system a measurement is taken only every 10 seconds. This systematically underestimates the peak memory usage, which is what matters if we want to make sure the hosts never run out of memory. Even worse, the BOINC client uses that smoothed working set size for scheduling, which causes all kinds of issues for OpenIFS and forces us to use app_config instead of relying on the client to handle memory properly.
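
The smoothed values in the log posted earlier in the thread are consistent with each smoothed figure being the average of the previous smoothed figure and the new WS sample. That is an inference from the numbers, not something confirmed from the BOINC source. Under that assumption, this sketch replays the six samples logged for one task and shows how far the smoothed peak lags behind the sampled peak.

```python
def smooth(samples, weight=0.5):
    """Exponentially smooth samples, starting from zero.

    weight=0.5 is an assumption inferred from the log values,
    not taken from the BOINC source code.
    """
    smoothed, current = [], 0.0
    for ws in samples:
        current = weight * ws + (1 - weight) * current
        smoothed.append(current)
    return smoothed


# WS values (MB) logged for task oifs_43r3_ps_0668_... at 10-second intervals.
ws_mb = [1192.73, 2656.43, 2750.01, 3414.84, 1636.18, 4268.91]
# "smoothed" values from the same log lines, for comparison.
logged_smoothed_mb = [596.37, 1626.40, 2188.20, 2801.52, 2218.85, 3243.88]

model = smooth(ws_mb)
for ws, logged, modelled in zip(ws_mb, logged_smoothed_mb, model):
    print(f"WS {ws:8.2f}  logged smoothed {logged:8.2f}  modelled {modelled:8.2f}")

print(f"peak WS {max(ws_mb):.2f} MB vs peak smoothed {max(model):.2f} MB")
```

In this short excerpt the smoothed peak is about 1 GB below the highest sampled WS, and the sampled WS can itself miss peaks that fall between two 10-second samples.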

For folks interested in debugging memory usage, I would recommend installing atop or below; both give you historical snapshots whenever you want to check back. atop is widely available, though you might need to shorten the default sampling interval for it to be useful. Also note that atop captures per-thread information, which can be a lot of data and can wear out SSDs really fast. I personally use below, which doesn't have these problems, but not many distros package it, so you might have to install rustc, build it, and install the unit files yourself. There is probably an Ubuntu PPA though.
ID: 67859
Jean-David Beyer
Joined: 5 Aug 04
Posts: 1045
Credit: 16,506,818
RAC: 5,001
Message 67862 - Posted: 18 Jan 2023, 20:26:10 UTC - in response to Message 67859.  
Last modified: 18 Jan 2023, 20:26:44 UTC

wujj123456 wrote:
For folks interested in debugging memory usage, I would recommend installing atop or below; both give you historical snapshots whenever you want to check back. atop is widely available, though you might need to shorten the default sampling interval for it to be useful. Also note that atop captures per-thread information, which can be a lot of data and can wear out SSDs really fast. I personally use below, which doesn't have these problems, but not many distros package it, so you might have to install rustc, build it, and install the unit files yourself. There is probably an Ubuntu PPA though.


I tried atop and below. below does not work and I do not feel like finding out why.

Here is an excerpt from atop. Is the line starting with `MEM | tot 62.4G` the one you have in mind? I assume free plus cache is (about) the amount of RAM available, but over what interval are the values measured? Since the most recent boot-up (in this case, 5d6h42m18s)? Or over the interval between repeats of atop?
(It appears to be since system boot-up for the first sample, and over the current interval thereafter.)

What is shrss?
ATOP - localhost                       2023/01/18  14:54:47                       -----------------                       5d6h42m18s elapsed
PRC | sys    5h38m | user  92h04m | #proc    460 | #trun     13 | #tslpi   647 | #tslpu   180 | #zombie    0 | clones 423e3 | no  procacct |
CPU | sys      97% | user   1094% | irq       4% | idle    403% | wait      2% | steal     0% | guest     0% |              | curf 4.23GHz |
CPL | avg1   12.36 | avg5   12.43 | avg15  12.44 |              | csw 120772e4 |              | intr 62723e5 |              | numcpu    16 |
MEM | tot    62.4G | free    4.5G | cache  36.3G | dirty  46.3M | buff  142.3M | slab    1.4G | shmem  87.2M | shrss   2.0M | numnode    1 |
SWP | tot    15.6G | free   14.0G | swcac  18.7M |              |              |              |              | vmcom  27.0G | vmlim  46.8G |

ID: 67862
wujj123456
Joined: 14 Sep 08
Posts: 87
Credit: 32,677,418
RAC: 29,639
Message 67865 - Posted: 18 Jan 2023, 21:53:47 UTC - in response to Message 67862.  

below requires cgroup v2, though that should be the default in most distros by now. Both tools are also useful for looking at per-process stats, such as finding the peak RSS of each OpenIFS task. For that we probably need short intervals to be recorded. To look at history, though, you want to start with `atop -r <timestamp>`; otherwise, if you are monitoring live, the top rows aren't any more useful than top or other tools.

The atop man page explains the meaning of the fields, except that it doesn't cover shrss. I guess it's small enough that we can just ignore it.
ID: 67865
alanb1951
Joined: 31 Aug 04
Posts: 32
Credit: 9,526,696
RAC: 109,831
Message 67878 - Posted: 19 Jan 2023, 0:08:00 UTC

Jean-David Beyer wrote:
What is shrss?

According to the man page for atop on Xubuntu 22.04, it is "the resident size of shared memory (`shrss`)" (the same as SHR in top?).

Cheers - Al.
ID: 67878
wujj123456
Joined: 14 Sep 08
Posts: 87
Credit: 32,677,418
RAC: 29,639
Message 67879 - Posted: 19 Jan 2023, 0:25:04 UTC - in response to Message 67865.  

It turns out that if all we care about is getting the RSS of the OpenIFS app over some period, it's much faster to just write a script than to try to observe it live or from history. I had meant to do this for a while, just to understand how much the memory usage swings, and I guess the discussion finally pushed me to do it.

Shitty script here: https://pastebin.com/GtAiv5XB. It takes one RSS sample per second; the number of samples in each bucket is shown in parentheses. --help lists some flags you can tune. There are probably lots of rough edges in corner cases, and it's Linux only. This is what I got for the current public app after running it for 5 minutes:
$ ./boinc_task_memory.py --slot 15
2023-01-18 16:17:51,760 [INFO] pid of slot 15: 495869
2568212 - 2714144: ***************** (51)
2714145 - 2860076:  (2)
2860077 - 3006008:  (0)
3006009 - 3151940:  (0)
3151941 - 3297872:  (2)
3297873 - 3443804: ****** (19)
3443805 - 3589736: * (3)
3589737 - 3735668: * (4)
3735669 - 3881600: ** (8)
3881601 - 4027532: ************************** (78)
4027533 - 4173464: *********************************** (107)
4173465 - 4319396: * (5)
4319397 - 4465328:  (2)
4465329 - 4611260: ** (6)
4611261 - 4757192: * (5)
4757193 - 4903125: ** (8)
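
For readers who cannot reach the pastebin link, the general technique can be sketched as follows. This is not the script linked above: it is a minimal illustration that assumes Linux, reads `VmRSS` from `/proc/<pid>/status`, and uses made-up function names. The equal-width buckets mirror the layout of the output above.

```python
import time
from pathlib import Path


def read_rss_kb(pid):
    """Return the resident set size of a process in kB, read from /proc (Linux only)."""
    for line in Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])  # the line looks like "VmRSS:   123456 kB"
    raise RuntimeError(f"no VmRSS entry for pid {pid}")


def sample_rss(pid, seconds, interval=1.0):
    """Collect one RSS sample per interval for roughly the given number of seconds."""
    samples = []
    for _ in range(int(seconds / interval)):
        samples.append(read_rss_kb(pid))
        time.sleep(interval)
    return samples


def histogram(samples, buckets=16):
    """Count samples in equal-width buckets spanning min(samples)..max(samples)."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / buckets or 1  # avoid a zero width when all samples are equal
    counts = [0] * buckets
    for s in samples:
        idx = min(int((s - lo) / width), buckets - 1)  # the maximum goes in the last bucket
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


def print_histogram(rows):
    """Print rows from histogram() as 'low - high: bar (count)' lines."""
    for low, high, count in rows:
        print(f"{low:.0f} - {high:.0f}: {'*' * count} ({count})")
```

A call such as `print_histogram(histogram(sample_rss(pid, seconds=300)))` would sample for five minutes, where `pid` is the process ID of the model process you want to watch.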
ID: 67879
Alan K
Joined: 22 Feb 06
Posts: 484
Credit: 29,579,234
RAC: 4,572
Message 67922 - Posted: 20 Jan 2023, 17:27:50 UTC

Not sure if this is the right place for this but I have had a task fail with a compute error after the last zip file (122) was written. Stderr message is:

<![CDATA[
<message>
Process still present 5 min after writing finish file; aborting</message>
<stderr_txt>
irectory: /var/lib/boinc-client/slots/1/ICMSHhq0f+002316

WU 12189428 task 22274970.
ID: 67922
Glenn Carver
Joined: 29 Oct 17
Posts: 774
Credit: 13,433,329
RAC: 7,110
Message 67995 - Posted: 23 Jan 2023, 16:22:28 UTC - in response to Message 67922.  
Last modified: 23 Jan 2023, 16:26:35 UTC

Yep, I'm aware of this one. I've seen it happen on one of my machines and from our error stats it's responsible for less than 5% of the total failed tasks. That error message occurs because there are two models executing in the same slot directory. One of the model processes shouldn't be running, so the client kills it when the task itself is complete (because the real model process finished normally).

In detail: there are two processes involved in a task. One is the model; the other is the controlling process that monitors the model and zips and transfers results to the CPDN server. This controller process also executes the suspend/resume instructions from the client. Sometimes, for some reason, the controller loses track of the model process ID. I'm not sure why. It might be related to the 'memory faults' that also occur because on my machine I had the 'process still running' error right after I saw a task fail with 'double corruption'. So my working theory is that one of the tasks clobbered a bit of memory belonging to another task, and the controlling process then couldn't control the model any longer. The BOINC client shows the task as suspended, but it's actually only the controller process that's suspended, as the client knows nothing about the model. Only the controller does, but as it has 'lost' the model, the model runs free.

I spotted this because I suspended the project and then wondered why my PC fans were still running. I checked the processes with 'ps' and noticed that one 'oifs_43r3_model.exe' process was still running, while all the other processes were suspended correctly.

If you see this happen, you can safely kill the running model, as the client will just start the model up again when the task resumes (which is why a second model process starts up). Just make sure to kill the right process: only the 'oifs_43r3_model.exe' process and not the 'oifs_43r3_ps_1.05_x86_64-pc-linux-gnu' one, as that's the controller. If you kill that by mistake, it will abort the task. Also make sure the project is suspended, because if it isn't and the model process is killed, the controller will detect that and also abort the task.

I have fixed a number of issues in the controller code lately and a new version is about to be tested. One was in the process control, though I am not 100% certain it will deal with this issue.

Alan K wrote:
Not sure if this is the right place for this but I have had a task fail with a compute error after the last zip file (122) was written. Stderr message is:

<![CDATA[
<message>
Process still present 5 min after writing finish file; aborting</message>
<stderr_txt>
irectory: /var/lib/boinc-client/slots/1/ICMSHhq0f+002316

WU 12189428 task 22274970.
ID: 67995
Glenn Carver
Joined: 29 Oct 17
Posts: 774
Credit: 13,433,329
RAC: 7,110
Message 68003 - Posted: 23 Jan 2023, 17:30:53 UTC

As a P.S. to my previous message about fixing issues: the forum posts about the amount of I/O from the model were noted, and I have changed the model configuration for future batches so that it will produce less logging output. I can't alter the result file sizes or the checkpoints, but the log information contributed a notable percentage of the read and write I/O.
ID: 68003
BellyNitpicker
Joined: 13 Jun 20
Posts: 6
Credit: 5,301,352
RAC: 176,529
Message 68022 - Posted: 24 Jan 2023, 22:06:46 UTC

Re previous observations on OIFS uploads failing, I now have well over a week's worth of pending uploads across two Ubuntu virtual machines - between 500 and 1,000 in all. No storage problems at my end yet, but they are only virtual, and do only have a fraction of a real SSD each. Do we have any news on when uploads might resume?

Nick
ID: 68022
cetus
Joined: 7 Aug 04
Posts: 9
Credit: 139,753,972
RAC: 19,927
Message 68023 - Posted: 24 Jan 2023, 23:16:07 UTC - in response to Message 67995.  

Glenn Carver wrote:
It might be related to the 'memory faults' that also occur because on my machine I had the 'process still running' error right after I saw a task fail with 'double corruption'.

I have also seen and killed several detached model.exe processes that seem to occur after the model fails with a "double free or corruption (out)" error. I've started looking with "ps -efl | grep boinc" whenever I see a task with a computation error. The bad process is pretty easy to find because the parent PID is set to "1", instead of the PID of a controlling process. It also has the same slot number as another process. I suspect that there is a detached process every time the corruption error happens, but I haven't looked consistently enough to be certain.
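
The manual check described above can be automated. What follows is a minimal sketch, assuming Linux and assuming the model shows up with `oifs_43r3_model.exe` as the first element of its command line; the function name and its parameters are hypothetical. It only lists candidate PIDs and deliberately kills nothing, since the project should be suspended before a stray model process is killed.

```python
from pathlib import Path


def find_orphaned_models(proc_root="/proc", exe_name="oifs_43r3_model.exe"):
    """Return PIDs whose command line names exe_name and whose parent PID is 1."""
    orphans = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric directories are processes
        try:
            cmdline = (entry / "cmdline").read_bytes().split(b"\0")
            status = (entry / "status").read_text()
        except OSError:
            continue  # the process exited, or we may not read it
        if Path(cmdline[0].decode(errors="replace")).name != exe_name:
            continue
        ppid = next(
            (int(line.split()[1]) for line in status.splitlines() if line.startswith("PPid:")),
            None,
        )
        if ppid == 1:
            orphans.append(int(entry.name))
    return sorted(orphans)
```

On systems where orphaned processes are adopted by a process other than PID 1, the `ppid == 1` test would need adjusting.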
Do you have any insight into how the intermediate data is used? It's easy to imagine looking at final results of 40,000 runs, and it's easy to imagine looking at the intermediate results of a few runs, but I have a hard time imagining sorting through the massive amount of data that we are generating here.
ID: 68023
AndreyOR
Joined: 12 Apr 21
Posts: 243
Credit: 11,418,997
RAC: 26,689
Message 68028 - Posted: 25 Jan 2023, 7:11:36 UTC - in response to Message 68022.  

BellyNitpicker wrote:
Do we have any news on when uploads might resume?

The last update posted didn't specify a number; it just said several days. My guess would be not until next week.
ID: 68028
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 68029 - Posted: 25 Jan 2023, 7:42:18 UTC - in response to Message 68028.  

AndreyOR wrote:
BellyNitpicker wrote:
Do we have any news on when uploads might resume?

The last update posted didn't specify a number; it just said several days. My guess would be not until next week.

That is likely. I, one of the other moderators, or Glenn will post when we hear anything. The first place to look will be the "Uploads are stuck" thread.
ID: 68029
AndreyOR
Joined: 12 Apr 21
Posts: 243
Credit: 11,418,997
RAC: 26,689
Message 68034 - Posted: 25 Jan 2023, 9:33:41 UTC

It seems that even though sending out work has been paused, this only applies to new work. Reruns are still being sent out. Unfortunately many, if not all, of them seem to come from users who have probably already completed them but haven't been able to upload and report before the deadline. It seems I was wrong in my confidence that this wouldn't happen. So now others will have to redo the work. At least this time the 30-day grace period applies.
ID: 68034
Richard Haselgrove
Joined: 1 Jan 07
Posts: 925
Credit: 34,100,818
RAC: 11,270
Message 68036 - Posted: 25 Jan 2023, 9:39:37 UTC - in response to Message 68034.  

Most of the active crunchers will be well into "too many uploads" by now, and active readers of these boards will know the risks of trying to circumvent that limit. I think the continued issue of resends is a very minor concern in the grand scheme of things.
ID: 68036
AndreyOR
Joined: 12 Apr 21
Posts: 243
Credit: 11,418,997
RAC: 26,689
Message 68037 - Posted: 25 Jan 2023, 9:57:00 UTC - in response to Message 68036.  

Richard Haselgrove wrote:
Most of the active crunchers will be well into "too many uploads" by now, and active readers of these boards will know the risks of trying to circumvent that limit. I think the continued issue of resends is a very minor concern in the grand scheme of things.

Sure, but it'll be whoever is lucky enough to upload and report first who gets the credit. There's a good chance that many of the users who did the work first will lose out.
ID: 68037
Glenn Carver
Joined: 29 Oct 17
Posts: 774
Credit: 13,433,329
RAC: 7,110
Message 68039 - Posted: 25 Jan 2023, 11:03:19 UTC

Update:
Data backup has been reduced sufficiently that the batch & upload servers will be restarted today, if not tomorrow (depending on some last checks).
ID: 68039
xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68048 - Posted: 25 Jan 2023, 19:45:30 UTC - in response to Message 68036.  
Last modified: 25 Jan 2023, 19:48:58 UTC

Richard Haselgrove wrote:
I think the continued issue of resends is a very minor concern in the grand scheme of things.
Except that if two results for the same workunit are uploaded, this aggravates upload11.cpdn.org's troubles.

So far this month, whenever upload11.cpdn.org was up at all, it was _never_ able to take our result data as fast as we were able to compute it. If we now start to compute redundant tasks, this only gets worse.
ID: 68048

©2024 climateprediction.net