climateprediction.net home page
OpenIFS Discussion

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 66713 - Posted: 2 Dec 2022, 6:30:58 UTC - in response to Message 66710.  
Last modified: 2 Dec 2022, 6:35:49 UTC

I suppose that the largest group of CPDN contributors run CPDN alongside several other projects, which, for all practical purposes, puts them into the "bottlenecked by CPU" category.

Jean-David Beyer wrote:
cache-misses # 49.552 % of all cache refs
(That's on a Xeon W2245 with 8c/16t, 16.5 MB last-level cache, mixed workload with 3x OpenIFS. I suspect that some Rosetta work can be quite cache hungry too. Not sure of any of the others. You could check by )

Sample from a dual-Epyc 7452 = 2x {32c/64t, 8x 16 MB last-level cache}, running merely 5 OpenIFS at the moment because I want to clear a backlog of uploads, plus 59 PrimeGrid llrSGS which have a known cache footprint of 1 MBytes. That is, only 64 of 128 logical CPUs are used at a time. Also, it's a headless system; display-manager service is shut down.

    system-wide: ~10% cache misses
    looking at one of the master.exe processes: ~11% cache misses
    looking at one of the sllr processes: ~10% cache misses

It's remarkable that master.exe and sllr processes have the same cache miss rate. But it's of course only a very small and quick sample which I took for now.

Here is an idea: It seems as if watching the stderr.txt could be a useful tool to estimate progress rate of an OpenIFS task. I could use that to explore relative performance of different task distributions. E.g. 1.) pin all OpenIFS tasks to logical CPUs which belong to different last-level caches; 2.) pin 4 OpenIFS tasks to one and the same 4 last-level cache sibling CPUs. Check if both progress rate and cache miss rate change noticeably. Maybe I'll try this at the weekend.
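
The pinning experiment above needs to know which logical CPUs share a last-level cache. The following is only a rough sketch of how that could be looked up on Linux, not something taken from the thread. It assumes that the sysfs entry `cache/index3/shared_cpu_list` describes the L3 cache, which is typical on x86 but not guaranteed. The helper names `parse_cpu_list`, `llc_groups` and `pin` are made up for this illustration.

```python
import glob
import os


def parse_cpu_list(text):
    """Expand a kernel CPU list such as '0-3,64-67' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def llc_groups(sysfs_root="/sys/devices/system/cpu"):
    """Return the distinct groups of logical CPUs that share one L3 cache.

    Assumption: cache/index3 is the last-level cache on this machine.
    """
    pattern = os.path.join(
        sysfs_root, "cpu[0-9]*", "cache", "index3", "shared_cpu_list"
    )
    groups = set()
    for path in glob.glob(pattern):
        with open(path) as fh:
            groups.add(tuple(parse_cpu_list(fh.read())))
    return sorted(groups)


def pin(pid, cpus):
    """Restrict an already running process to the given logical CPUs (Linux only)."""
    os.sched_setaffinity(pid, set(cpus))


if __name__ == "__main__":
    for group in llc_groups():
        print("CPUs sharing one last-level cache:", list(group))
```

For variant 1.) each OpenIFS PID would be pinned to a CPU taken from a different group; for variant 2.) all four PIDs would be pinned to CPUs from one and the same group.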

Glenn Carver
Joined: 29 Oct 17
Posts: 816
Credit: 13,668,962
RAC: 8,030
Message 66714 - Posted: 2 Dec 2022, 10:34:17 UTC - in response to Message 66698.  
Last modified: 2 Dec 2022, 11:08:36 UTC

Steven,

That's a nice summary of the different problems we're seeing. I'm going to print that out as it's better than my notes!

As Dave says, the memory corruption is the more serious one. This kills the controlling process (not the boinc client), which looks after running the model. The consequence of this is the model runs 'alone' in the slot, while another task might start running in the same slot corrupting the files. This causes some of the other problems you are seeing. The client does kill off the rogue model process.

There is also a problem at the very end of the task where the last upload file seems to go missing. I have seen this happen once now on my test machine. It's some kind of race condition between the multiple processes managing the task: one process doesn't get information when it should, but I haven't pinned down exactly what's going on.

My impression is that the tasks run better if there are only 1 or 2 running at a time. Mine are running fine, I've only had 1 failure so far this way. I was looking at the batch statistics and the success/failure ratio has improved for the latest batch compared to the first batch. Perhaps that's in part because everyone has got better at managing the tasks after the first batch?

Apologies for the failures and your time, but the summary was very useful and maybe the moderators can refer others to that post.

I have fixed the memory and disk bounds for these tasks and I've started looking at these other issues with the CPDN folk.

Regards, Glenn

Edit: and thanks for the posts about broadband & upload speeds. CPDN have a tool for computing workunit output sizes to advise scientists on what's acceptable. I'll pass that info on to them.

Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,357,052
RAC: 9,908
Message 66715 - Posted: 2 Dec 2022, 11:00:09 UTC - in response to Message 66710.  

Memory 	62.28 GB
Cache 	16896 KB
Swap space 	15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	477.76 GB
Measured floating point speed 	6.13 billion ops/sec
Measured integer speed 	26.09 billion ops/sec
Average upload rate 	4480.76 KB/sec
Average download rate 	45235.53 KB/sec
(I think the data rates reported by Boinc-CPDN are really Kilobits per second, not KiloBytes per second.)
Where are you copying those figures from? I see these for your two machines on the CPDN website:

Average upload rate 4308.41 KB/sec (Xeon)
Average upload rate 136.81 KB/sec (Windows 10)

I get this on a 15.9 Mbps uplink fibre ADSL line:

Average upload rate 603.55 KB/sec

I think that's consistent with the BOINC measurement being based on bytes. Some figures may be skewed if there's a proxy server in the loop.

Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,357,052
RAC: 9,908
Message 66716 - Posted: 2 Dec 2022, 11:07:00 UTC - in response to Message 66714.  

The consequence of this is the model runs 'alone' in the slot, while another task might start running in the same slot corrupting the files. This causes some of the other problems you are seeing. The client does kill off the rogue model process.
I think that's unlikely. The BOINC client is pretty robust about checking that a proposed slot is genuinely empty before starting a new task in it: if there's any doubt, it creates a new slot directory and starts the task there instead.

We did have a problem a few years ago where files over 4 GB (!) were invisible to the checking routine, but that's been long fixed.

Glenn Carver
Joined: 29 Oct 17
Posts: 816
Credit: 13,668,962
RAC: 8,030
Message 66717 - Posted: 2 Dec 2022, 11:15:36 UTC - in response to Message 66716.  

The consequence of this is the model runs 'alone' in the slot, while another task might start running in the same slot corrupting the files. This causes some of the other problems you are seeing. The client does kill off the rogue model process.
I think that's unlikely. The BOINC client is pretty robust about checking that a proposed slot is genuinely empty before starting a new task in it: if there's any doubt, it creates a new slot directory and starts the task there instead.
Quite prepared to accept that's the case. But looking at the task logs there's some output in some of them that suggests the model is still running.

Possible scenario is: wrapper dies, leaving the master.exe process still running. The boinc client detects that the wrapper (i.e. the task) has died and clears out the slot directory. However, master.exe will write to the same slot every, let's say, 1 min, writing to a couple of text output log files. So the slot dir appears empty until the process does the write. In the meantime, the client has started another task, and then 30 secs later the first master.exe writes to the files (which now exist because the new model task has started). Possible?
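
The sequence of events in this scenario can be replayed as a toy model. To be clear, this is not BOINC or OpenIFS code, only a sketch of the suspected check-then-launch ordering using a temporary directory. It assumes that the orphaned model reopens its log file by path on every write; a process that kept an already open handle to a deleted file would write to the old, unlinked file instead. The names `slot_is_empty`, `append_line` and `simulate` are invented for the illustration.

```python
import os
import tempfile


def slot_is_empty(slot_dir):
    """Single emptiness test, done once just before a new task is launched."""
    return not os.listdir(slot_dir)


def append_line(path, text):
    """Open the file by path, append one line, close it again."""
    with open(path, "a") as fh:
        fh.write(text + "\n")


def simulate():
    with tempfile.TemporaryDirectory() as slot:
        log = os.path.join(slot, "ifs.stat")

        # 1. The old task is running and has written a log line.
        append_line(log, "old-task step 512")

        # 2. The wrapper dies; the client clears out the slot directory.
        os.remove(log)

        # 3. The client checks the slot once. The orphan has not written
        #    again yet, so the slot looks empty.
        assert slot_is_empty(slot)

        # 4. A new task starts in the same slot and creates its own log.
        append_line(log, "new-task step 1")

        # 5. The orphaned model does its periodic write (assumed to reopen
        #    the file by path), landing in the new task's file.
        append_line(log, "old-task step 513")

        with open(log) as fh:
            return fh.read().splitlines()


if __name__ == "__main__":
    for line in simulate():
        print(line)
```

In this toy run the new task's log ends up containing a line from the old task, which is the kind of file corruption described above. Whether the real processes behave this way is exactly the open question.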

Conan
Joined: 6 Jul 06
Posts: 141
Credit: 3,511,752
RAC: 144,072
Message 66718 - Posted: 2 Dec 2022, 11:27:13 UTC
Last modified: 2 Dec 2022, 11:28:00 UTC

Just downloaded a resend of a Work Unit that failed due to an error.

This Task 22245903

It failed due to running longer than 5 minutes after the work unit had finished.

The WU was run by mikey and other than the longer run time after finishing seemed to have run successfully after over 2 days run time.

The run time seems overly long on a Ryzen but did complete.

It is now running as Task 22249047 on my Ryzen computer.

Will see how it runs for me.

Conan

Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,357,052
RAC: 9,908
Message 66719 - Posted: 2 Dec 2022, 11:30:32 UTC - in response to Message 66717.  

Possible scenario is: wrapper dies, leaving the master.exe process still running. The boinc client detects that the wrapper (i.e. the task) has died and clears out the slot directory. However, master.exe will write to the same slot every, let's say, 1 min, writing to a couple of text output log files. So the slot dir appears empty until the process does the write. In the meantime, the client has started another task, and then 30 secs later the first master.exe writes to the files (which now exist because the new model task has started). Possible?
Yes, I suppose so. If master.exe is still running in memory, but the slot directory has had all files deleted, then the BOINC client could nip in and start launching a new task in the split second before the next write (and the final 'empty folder' check is a single test just before the next task is launched - there's no ongoing verification). I think the slots mostly contain symlinks to files stored in the project directory? Does anything check that those links are still valid?

Glenn Carver
Joined: 29 Oct 17
Posts: 816
Credit: 13,668,962
RAC: 8,030
Message 66720 - Posted: 2 Dec 2022, 11:38:14 UTC - in response to Message 66713.  
Last modified: 2 Dec 2022, 11:48:58 UTC

looking at one of the master.exe processes: ~11% cache misses
Should look at TLB misses, they are expensive. I'm not sure if that's what you mean by a 'cache miss'.

Here is an idea: It seems as if watching the stderr.txt could be a useful tool to estimate progress rate of an OpenIFS task. I could use that to explore relative performance of different task distributions.
Find the slot directory for the task and find the 'ifs.stat' file, this is the file that the controlling wrapper process is watching.

% tail -f ifs.stat
 11:24:28 0AAA00AAA STEPO      512    16.886   16.886   26.918    177:55   1298:49 0.11147937005926E-04       2GB       0MB
 11:24:58 0AAA00AAA STEPO      513    17.573   17.573   29.619    178:12   1299:18 0.11141352198812E-04       2GB       0MB

The 4th column is the current model step. The 5th column is what you want, this is the CPU time of the last step. I optimize my setup by watching this number, aim to get it as low as possible. Note that when the model is doing output, there are multiple lines per step.

For info the rest of the columns are (not all may work well outside ECMWF):
1 : wall-clock time
2 : model configuration (short code for exactly what the model is doing)
3 : name of calling routine
4 : timestep count
5 : CPU time of last step
6 : vector CPU time of last step (throwback to the old days when the model ran on vector hardware)
7 : wall-clock time of last step
8 : accumulated cpu time
9 : accumulated wall-clock time
10 : L2 norm of global divergence field (used to check bit-reproducibility)
11 & 12 : heap and stack memory, these don't work well outside of ECMWF.
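
For anyone who would rather average the per-step CPU time than eyeball `tail -f`, here is a minimal sketch that reads the columns as listed above (4th = step, 5th = CPU time of last step, 7th = wall-clock time of last step). The function names are made up, and the handling of output steps is an assumption: because several lines can appear for one step, the sketch simply keeps the last line seen for each step. The units of the timing columns are not stated in the post, so the sketch does not convert them.

```python
def parse_ifs_stat_line(line):
    """Return step number, CPU time and wall time of one ifs.stat line, or None."""
    fields = line.split()
    if len(fields) < 7:
        return None
    try:
        return {
            "step": int(fields[3]),    # column 4: timestep count
            "cpu": float(fields[4]),   # column 5: CPU time of last step
            "wall": float(fields[6]),  # column 7: wall-clock time of last step
        }
    except ValueError:
        return None


def mean_cpu_per_step(lines):
    """Average column 5 over all steps, keeping the last line seen per step."""
    last_by_step = {}
    for line in lines:
        record = parse_ifs_stat_line(line)
        if record is not None:
            last_by_step[record["step"]] = record["cpu"]
    if not last_by_step:
        return None
    return sum(last_by_step.values()) / len(last_by_step)


# The two sample lines from the post above.
SAMPLE = [
    " 11:24:28 0AAA00AAA STEPO      512    16.886   16.886   26.918    177:55   1298:49 0.11147937005926E-04       2GB       0MB",
    " 11:24:58 0AAA00AAA STEPO      513    17.573   17.573   29.619    178:12   1299:18 0.11141352198812E-04       2GB       0MB",
]

if __name__ == "__main__":
    print("mean CPU time per step:", mean_cpu_per_step(SAMPLE))
```

To use it on a live task, pass the lines of the `ifs.stat` file from the slot directory instead of `SAMPLE`.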

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 66721 - Posted: 2 Dec 2022, 13:54:55 UTC - in response to Message 66713.  

cache-misses # 49.552 % of all cache refs

(That's on a Xeon W2245 with 8c/16t, 16.5 MB last-level cache, mixed workload with 3x OpenIFS. I suspect that some Rosetta work can be quite cache hungry too. Not sure of any of the others. You could check by )


Right. Is this more helpful? (Taken at a different time from the previous one I showed, but similar work load.)

# perf stat -aB -d -d -e cache-references,cache-misses
 Performance counter stats for 'system wide':

    21,080,333,138      cache-references                                              (36.36%)
    10,457,921,236      cache-misses              #   49.610 % of all cache refs      (36.37%)
 1,372,155,371,235      L1-dcache-loads                                               (36.37%)
    78,486,820,592      L1-dcache-load-misses     #    5.72% of all L1-dcache accesses  (36.37%)
     4,986,357,186      LLC-loads                                                     (36.37%)
     3,181,661,273      LLC-load-misses           #   63.81% of all LL-cache accesses  (36.36%)
   <not supported>      L1-icache-loads                                             
     5,060,458,674      L1-icache-load-misses                                         (36.36%)
 1,373,019,705,796      dTLB-loads                                                    (36.36%)
       117,604,750      dTLB-load-misses          #    0.01% of all dTLB cache accesses  (36.36%)
       158,451,773      iTLB-loads                                                    (36.36%)
        29,511,730      iTLB-load-misses          #   18.63% of all iTLB cache accesses  (36.36%)

      62.707952810 seconds time elapsed


I then suspended all Rosetta tasks and got this:

# perf stat -aB -d -d -e cache-references,cache-misses
 Performance counter stats for 'system wide':

    20,554,374,124      cache-references                                              (36.36%)
    10,415,226,289      cache-misses              #   50.672 % of all cache refs      (36.36%)
 1,205,539,957,850      L1-dcache-loads                                               (36.36%)
    70,253,511,063      L1-dcache-load-misses     #    5.83% of all L1-dcache accesses  (36.37%)
     4,768,042,362      LLC-loads                                                     (36.37%)
     3,149,369,211      LLC-load-misses           #   66.05% of all LL-cache accesses  (36.36%)
   <not supported>      L1-icache-loads                                             
     4,194,578,638      L1-icache-load-misses                                         (36.36%)
 1,206,853,267,867      dTLB-loads                                                    (36.36%)
        50,345,945      dTLB-load-misses          #    0.00% of all dTLB cache accesses  (36.36%)
        58,005,275      iTLB-loads                                                    (36.36%)
        20,350,600      iTLB-load-misses          #   35.08% of all iTLB cache accesses  (36.36%)

      62.973239730 seconds time elapsed
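
perf rounds the dTLB miss rate to two decimals (0.01% and 0.00%), which hides the difference between the two runs. As a small worked check, the sketch below recomputes a few of the ratios from the raw counts pasted above. The function and dictionary names are invented for the example. Note that every counter was only active about 36% of the time (the last column), so perf has scaled the counts and the ratios are estimates.

```python
def miss_rate_percent(misses, accesses):
    """Miss rate in percent from a (misses, accesses) pair of raw counts."""
    return 100.0 * misses / accesses


# (misses, accesses) copied from the two perf outputs above.
RUNS = {
    "mixed workload": {
        "cache": (10_457_921_236, 21_080_333_138),
        "LLC-load": (3_181_661_273, 4_986_357_186),
        "dTLB-load": (117_604_750, 1_373_019_705_796),
    },
    "Rosetta suspended": {
        "cache": (10_415_226_289, 20_554_374_124),
        "LLC-load": (3_149_369_211, 4_768_042_362),
        "dTLB-load": (50_345_945, 1_206_853_267_867),
    },
}

if __name__ == "__main__":
    for run, counters in RUNS.items():
        for name, (misses, accesses) in counters.items():
            rate = miss_rate_percent(misses, accesses)
            print(f"{run:18s} {name:10s} {rate:.4f} %")
```

With more digits the dTLB load miss rate comes out at roughly 0.0086% for the mixed workload and roughly 0.0042% with Rosetta suspended, so it about halves, while the overall cache miss rate stays near 50% in both runs.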

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 66722 - Posted: 2 Dec 2022, 14:18:50 UTC - in response to Message 66715.  

Where are you copying those figures from? I see these for your two machines on the CPDN website:

Average upload rate 4308.41 KB/sec (Xeon)
Average upload rate 136.81 KB/sec (Windows 10)

I get this on a 15.9 Mbps uplink fibre ADSL line:

Average upload rate 603.55 KB/sec

I think that's consistent with the BOINC measurement being based on bytes. Some figures may be skewed if there's a proxy server in the loop.


I do not think it is bytes. Consider the reported download rate of 25457 KB/sec. If that were bytes, it would be 203656 Kbits per second, about 203 Megabits per second. The most I could possibly get from my Internet connection (fibre-optic) is 75 Megabits per second.

Those numbers I got were from the "(Xeon)" machine, not the "(Windows 10)" machine. The Windows 10 machine is a pipsqueak and will not run the Oifs models anyway. Right now those figures for my 1511241 machine are
Average upload rate 	4796.28 KB/sec
Average download rate 	25457.84 KB/sec


These speeds seem to have gone up since I started getting Oifs work units. I do not know if this means the increased traffic to the CPDN servers caused this, or just that the Internet has speeded up. In either case, I do not understand it.

Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,357,052
RAC: 9,908
Message 66723 - Posted: 2 Dec 2022, 15:35:53 UTC - in response to Message 66722.  

Does your Xeon have any IFS tasks left? If it does (or next time you have any, if not), could you:

Open BOINC Manager
Switch to 'Advanced' view (if not using it already)
Watch the 'Transfers' tab as the task progresses

You have to be quick - the recent batch had fairly consistent file sizes around 14 MB, and left my machine in around 10 or 11 seconds. The exact time can be checked in the Event log later. That would seem to imply a speed of around 1.2 MB/sec, or 1,200 KB / sec. The figures will flash up on the transfers tab while the transfer is active.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 66724 - Posted: 2 Dec 2022, 17:31:16 UTC - in response to Message 66723.  

You have to be quick - the recent batch had fairly consistent file sizes around 14 MB, and left my machine in around 10 or 11 seconds. The exact time can be checked in the Event log later. That would seem to imply a speed of around 1.2 MB/sec, or 1,200 KB / sec. The figures will flash up on the transfers tab while the transfer is active.


My transfers seem to be about 5 seconds. Sometimes 4 seconds; sometimes 6 seconds. And any one task seems to send a trickle about every 8 minutes.
But even if I leave the 'Transfers' tab displaying, they go by too fast for me to act.

HA! I tricked it! Watching the transfer tab, whenever it displayed anything, I hit the Print Screen button.

One of them was 14.32 MB, 5826 KBps, ...
Now if only we knew the definition of B. 5.826 MBps. If it is bits, my 75 Megabits per second fibre-optic Internet can handle it with ease. If it is bytes, it can still handle it (46.6 Megabits per second).

I conclude that I cannot conclude anything from these data. 8-(

Fri 02 Dec 2022 11:47:15 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_51.zip
Fri 02 Dec 2022 11:47:20 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_51.zip
Fri 02 Dec 2022 11:49:58 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1574_2021050100_123_946_12164663_0_r972931216_118.zip
Fri 02 Dec 2022 11:50:04 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1574_2021050100_123_946_12164663_0_r972931216_118.zip
Fri 02 Dec 2022 11:52:32 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_2830_2021050100_123_947_12165919_0_r111030085_118.zip
Fri 02 Dec 2022 11:52:37 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_2830_2021050100_123_947_12165919_0_r111030085_118.zip
Fri 02 Dec 2022 11:54:46 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_52.zip
Fri 02 Dec 2022 11:54:51 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_52.zip
Fri 02 Dec 2022 11:57:29 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1574_2021050100_123_946_12164663_0_r972931216_119.zip
Fri 02 Dec 2022 11:57:34 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1574_2021050100_123_946_12164663_0_r972931216_119.zip
Fri 02 Dec 2022 12:00:14 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_2830_2021050100_123_947_12165919_0_r111030085_119.zip
Fri 02 Dec 2022 12:00:19 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_2830_2021050100_123_947_12165919_0_r111030085_119.zip
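
The upload durations can also be pulled out of the event log instead of timing them by eye. This is only a sketch that assumes the log line layout shown above (timestamp, project, message separated by `|`) and English day and month names. It drops the time zone abbreviation rather than interpreting it, which is fine for differences within one log. The function names are made up.

```python
from datetime import datetime


def parse_event(line):
    """Return (verb, file name, timestamp) for an upload start/finish line, else None."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) != 3:
        return None
    stamp, _project, message = parts
    stamp = stamp.rsplit(" ", 1)[0]  # drop the trailing time zone, e.g. "EST"
    for verb in ("Started", "Finished"):
        prefix = verb + " upload of "
        if message.startswith(prefix):
            when = datetime.strptime(stamp, "%a %d %b %Y %I:%M:%S %p")
            return verb, message[len(prefix):], when
    return None


def upload_durations(lines):
    """Map each uploaded file name to its upload duration in seconds."""
    started = {}
    durations = {}
    for line in lines:
        event = parse_event(line)
        if event is None:
            continue
        verb, name, when = event
        if verb == "Started":
            started[name] = when
        elif name in started:
            durations[name] = (when - started.pop(name)).total_seconds()
    return durations


# Three start/finish pairs copied from the log above.
LOG = [
    "Fri 02 Dec 2022 11:47:15 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_51.zip",
    "Fri 02 Dec 2022 11:47:20 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_51.zip",
    "Fri 02 Dec 2022 11:49:58 AM EST | climateprediction.net | Started upload of oifs_43r3_ps_1574_2021050100_123_946_12164663_0_r972931216_118.zip",
    "Fri 02 Dec 2022 11:50:04 AM EST | climateprediction.net | Finished upload of oifs_43r3_ps_1574_2021050100_123_946_12164663_0_r972931216_118.zip",
    "Fri 02 Dec 2022 12:00:14 PM EST | climateprediction.net | Started upload of oifs_43r3_ps_2830_2021050100_123_947_12165919_0_r111030085_119.zip",
    "Fri 02 Dec 2022 12:00:19 PM EST | climateprediction.net | Finished upload of oifs_43r3_ps_2830_2021050100_123_947_12165919_0_r111030085_119.zip",
]

if __name__ == "__main__":
    for name, seconds in upload_durations(LOG).items():
        print(f"{seconds:4.0f} s  {name}")
```

Dividing the file size on disk by these durations then gives an upload rate that does not depend on how BOINC Manager labels its units.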

Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,357,052
RAC: 9,908
Message 66725 - Posted: 2 Dec 2022, 17:50:08 UTC - in response to Message 66724.  

The easiest way is to temporarily suspend network activity (BOINC Manager again, Activity menu) - this keeps the files on your disk so you can check them with the usual file system tools. The usual location is:

/var/lib/boinc-client/projects/climateprediction.net
but YMMV. You have the generic form of the file names from your log.

Remember to turn network activity back on when you've satisfied your curiosity!

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 66726 - Posted: 2 Dec 2022, 18:17:44 UTC - in response to Message 66725.  

Do you mean like this?

-rw-r--r--. 1 boinc boinc 14962469 Dec  2 12:54 oifs_43r3_ps_1304_2021050100_123_946_12164393_1_r579744359_60.zip
-rw-r--r--. 1 boinc boinc 14868279 Dec  2 12:55 oifs_43r3_ps_2039_2021050100_123_947_12165128_1_r1193171530_3.zip
-rw-r--r--. 1 boinc boinc 14849068 Dec  2 12:57 oifs_43r3_ps_1734_2021050100_123_946_12164823_2_r333244089_3.zip


So those files are almost 15 MegaBytes each, or 120 Megabits.
Since they take an average of 5 seconds to send, that is about 24 Megabits/second, which will easily squeeze through my 75 Megabit/second fibre-optic Internet link.
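
The bits-versus-bytes arithmetic used in the last few posts can be written down in a few lines. This is just a worked check with made-up helper names; it assumes decimal prefixes (1 KB = 1000 bytes), which may not be exactly what BOINC uses, though the difference is too small to change the conclusion.

```python
def megabits_per_second(size_bytes, seconds):
    """Transfer rate in Mbit/s for a file of size_bytes sent in the given time."""
    return size_bytes * 8 / seconds / 1_000_000


def kbytes_to_megabits(kb_per_sec):
    """Convert a rate in KB/sec to Mbit/s, assuming 1 KB = 1000 bytes."""
    return kb_per_sec * 8 / 1000


LINK_MBIT = 75  # stated capacity of the fibre-optic link

if __name__ == "__main__":
    # File sizes from the directory listing above, each sent in about 5 seconds.
    for size in (14962469, 14868279, 14849068):
        print(f"{size} bytes in 5 s = {megabits_per_second(size, 5):.1f} Mbit/s")

    # Rate shown in the Transfers tab, read as KiloBytes per second.
    print(f"5826 KB/sec = {kbytes_to_megabits(5826):.1f} Mbit/s")

    # Reported average download rate, read as KiloBytes per second.
    download = kbytes_to_megabits(25457.84)
    print(f"25457.84 KB/sec = {download:.1f} Mbit/s, link is {LINK_MBIT} Mbit/s")
```

Read as bytes, the upload figures fit comfortably within the 75 Mbit/s link, while the reported average download rate would exceed it, which is the part that remains unexplained.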

Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,357,052
RAC: 9,908
Message 66727 - Posted: 2 Dec 2022, 18:35:48 UTC - in response to Message 66726.  

Yup, I think we've reached the definitive answer. But it still doesn't explain your average download speed ...

4TX75586Qp61ADs93WEnnQM2vLs4
Joined: 29 Jan 06
Posts: 1
Credit: 607,579
RAC: 46,960
Message 66728 - Posted: 2 Dec 2022, 19:06:08 UTC
Last modified: 2 Dec 2022, 19:06:51 UTC

I crunched a WU of "OpenIFS 43r3 Perturbed Surface v1.01" on my Ubuntu server (Xeon X5650 @ 2.67GHz)
and it finished as valid after 2 days,
but the credit is 0!?
https://www.cpdn.org/result.php?resultid=22247911

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4352
Credit: 16,567,599
RAC: 5,265
Message 66729 - Posted: 2 Dec 2022, 19:17:40 UTC - in response to Message 66728.  

I crunched a WU of "OpenIFS 43r3 Perturbed Surface v1.01" on my Ubuntu server (Xeon X5650 @ 2.67GHz)
and it finished as valid after 2 days,
but the credit is 0!?
https://www.cpdn.org/result.php?resultid=22247911


The credit script for CPDN is only run once a week, usually on Sundays, so wait a couple more days and credit should appear.

Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4352
Credit: 16,567,599
RAC: 5,265
Message 66730 - Posted: 2 Dec 2022, 19:19:31 UTC

@Glenn.

On my Ryzen at least, it looks as though there aren't many failures when only two tasks are running. With three or more running at a time, one of them will often crash. I will tail the file you mentioned a couple of posts back when I get the chance and look at that.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1067
Credit: 16,546,621
RAC: 2,321
Message 66731 - Posted: 2 Dec 2022, 19:20:51 UTC - in response to Message 66728.  

I crunched a WU of "OpenIFS 43r3 Perturbed Surface v1.01" on my Ubuntu server (Xeon X5650 @ 2.67GHz)
and it finished as valid after 2 days, but the credit is 0!?


You are not alone. Don't worry about it. I have completed 17 of these tasks successfully, no errors, on my main machine, 1511241, and also have no credits assigned yet. I think credits are updated only once a week on weekends.

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 66732 - Posted: 2 Dec 2022, 19:22:02 UTC
Last modified: 2 Dec 2022, 19:35:56 UTC

xii5ku wrote:
(That's on a Xeon W2245 with 8c/16t, 16.5 MB last-level cache, mixed workload with 3x OpenIFS. I suspect that some Rosetta work can be quite cache hungry too. Not sure of any of the others. You could check by )
I forgot to finish that sentence. It should have been something like "You could check by looking up a PID of a rosetta worker process and add -p PID to the perf command line". Although the processes which fight over cache affect each other's hit/miss rates, of course. A more conclusive way would be to investigate homogeneous workloads.

Glenn Carver wrote:
Should look at TLB misses, they are expensive. I'm not sure if that's what you mean by a 'cache miss'.
perf's "-e cache-misses" appears to count across all cache levels; i.e. those accesses which couldn't be satisfied by any one or another of the cache levels. (There are other performance counters for distinct cache levels.) Re: TLB: Good idea to watch these in general. In my particular current case of # cpu-time consuming processes = # physical cores, and because Linux doesn't move processes from core to core too frequently, it's not an issue, as TLB is a per-core resource, not shared between cores like e.g. L3$ or memory controllers. That is, each process has got an entire TLB for itself for most of its runtime, and I can't actually improve on that. TLB misses would be good to watch though if I ran more CPU time consuming processes than there are physical cores. Or if I was an OpenIFS developer, rather than just a user of the binary.

Glenn Carver wrote:
Find the slot directory for the task and find the 'ifs.stat' file, this is the file that the controlling wrapper process is watching. [...] For info the rest of the columns are [...]
Thank you for this detailed info.

--------

About transfer speed display:
IME the speeds shown at the show_host_detail webpage as well as those in boincmgr's transfers tab are not very accurate. However, as far as I can tell, the "B" is for Bytes, at both places.

--------

About OpenIFS failure modes:
My current OpenIFS task count is 195 in progress (of those: 52 uploading), 93 valid, and 54 error, alas. All of the error results come from only one out of three hosts. All three hosts have the same hardware, OS, boinc client configs, and same split workload of OpenIFS and PrimeGrid llrSGS. The one host with errors was the only one on which I suspended all tasks to disk, rebooted the host, and resumed the tasks. I strongly believe that all of these 54 tasks went through this suspend–resume cycle. But I don't have a record of this to verify it. The stderr.txts of these tasks are of two types: One type contains just "--". The other shows that the last one to five zip files were missing.

The host with errors has reported only successful tasks for a while now, which is another hint that the error episode was just the aftermath of the suspend-resume cycle.

All three of the hosts which I have active at OpenIFS have plenty of RAM, and are set to "leave non-GPU tasks in memory while suspended".¹ That's possibly a factor why they run error-free.

¹) On a side note, my boinc clients would never suspend OpenIFS tasks on their own. I run OpenIFS and llrSGS in two separate boinc client instances, so that I have full control over work buffers and number of running tasks. (If I used the same client instance for both projects, the client could decide to suspend some OpenIFS in favor of more llrSGS.) But I triggered suspend-to-RAM of OpenIFSs once or twice now when I reduced the number of running OpenIFSs in order to cut down the upload backlog.

©2024 climateprediction.net