w/u failed at the 89th zip file

Author	Message
PDW Send message Joined: 29 Nov 17 Posts: 55 Credit: 6,505,083 RAC: 1,294	Message 68104 - Posted: 29 Jan 2023, 13:37:44 UTC - in response to Message 68103.
Well, the BOINC code comments say:
// If we already found a finish file, abort the app;
// it must be hung somewhere in boinc_finish();
I do find this comment thought-provoking:
// process is still there 5 min after it wrote finish file.
// abort the job
// Note: actually we should treat it as successful.
// But this would be tricky.
ID: 68104 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,279,167 RAC: 11,006	Message 68105 - Posted: 29 Jan 2023, 14:25:27 UTC - in response to Message 68104. That line was added, and the timeout increased, around 4 years ago: https://github.com/BOINC/boinc/commit/db4c3d0c22d772f77d6d65e6adf9f23280530a7f
client: increase finish-file timeout
When an app finishes, it writes a "finish file", which ensures the client that the app really finished. If the app process is still there N seconds after the finish file appears, the client assumes that something went wrong, and it aborts the job. Previously N was 10. This was too small during periods of heavy paging. I increased it to 300.
It has been pointed out that if the app creates the finish file, and its output files are present, it should be treated as successful regardless of whether it exits. This is probably true, but right now we don't have a mechanism for killing a job and marking it as success. The longer timeout makes this moot.
'ensures' --> 'assures'? 'The longer timeout makes this moot'. Or not, as the case may be. With 64 GB of RAM, and all solid-state drives, I doubt paging is the trigger. We should consider other possible causes too. ID: 68105 · Reply Quote
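For anyone who wants the rule from that commit message spelled out, here is a minimal sketch in Python. It is not BOINC's actual code (the client is written in C++); the function name `check_finished_task`, its arguments, and the returned labels are all hypothetical, and the only values taken from the commit message are the old timeout of 10 seconds and the new one of 300.

```python
from typing import Optional

# Value quoted in the commit message: N was raised from 10 to 300 seconds.
FINISH_FILE_TIMEOUT = 300.0


def check_finished_task(finish_file_age: Optional[float],
                        process_alive: bool,
                        timeout: float = FINISH_FILE_TIMEOUT) -> str:
    """Hypothetical illustration of the finish-file rule, not BOINC's code.

    finish_file_age: seconds since the finish file appeared, or None if
                     the app has not written one yet.
    process_alive:   whether the app process still exists.
    """
    if finish_file_age is None:
        # No finish file yet: the rule described in the commit does not apply.
        return "no_finish_file"
    if not process_alive:
        # Finish file written and the process has gone: normal completion.
        return "success"
    if finish_file_age > timeout:
        # Process still present N seconds after the finish file appeared:
        # the client assumes something went wrong and aborts the job.
        return "abort"
    # Finish file present, process still shutting down, timeout not reached.
    return "wait"


if __name__ == "__main__":
    print(check_finished_task(12.0, True))                # still within 300 s
    print(check_finished_task(12.0, True, timeout=10.0))  # old 10 s limit
    print(check_finished_task(301.0, True))               # past the new limit
```

The sketch only covers the case the commit message describes. The alternative it mentions, treating a task with a finish file and complete output files as successful even if the process never exits, would need an extra branch that the commit says the client has no mechanism for.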

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1060 Credit: 16,538,338 RAC: 2,071	Message 68106 - Posted: 29 Jan 2023, 15:04:26 UTC - in response to Message 68105. 'The longer timeout makes this moot'. Or not, as the case may be. With 64 GB of RAM, and all solid-state drives, I doubt paging is the trigger. We should consider other possible causes too.
My machine has a 512 GByte solid state drive, but other than the BOINC client software and the swap space, all the BOINC stuff is on a 5400 rpm SATA spinning drive. Yet I do next to no paging. I have no CPDN tasks running at the moment because there are none. I started getting swap space usage when I had about 20 completed tasks that were unable to upload their "trickles" for about a week. I have no idea what was on that swap space. There was certainly no thrashing of running BOINC tasks. This is my current RAM usage:
$ free -hw
        total   used    free    shared  buffers  cache   available
Mem:    62Gi    4.0Gi   1.0Gi   85Mi    162Mi    57Gi    57Gi
Swap:   15Gi    3.0Mi   15Gi
ID: 68106 · Reply Quote

nairb Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 68114 - Posted: 30 Jan 2023, 8:12:02 UTC With 1 CPDN task running by itself, it failed after zip no. 83 with:
13:30:43 STEP 2039 H=2039:00 +CPU= 18.156
double free or corruption (out)
I will be glad to see this problem solved... the machine will have been running with endless free memory and several idle threads. ID: 68114 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1060 Credit: 16,538,338 RAC: 2,071	Message 68115 - Posted: 30 Jan 2023, 8:38:35 UTC - in response to Message 68114. With 1 CPDN task running by itself, it failed after zip no. 83 with: 13:30:43 STEP 2039 H=2039:00 +CPU= 18.156 double free or corruption (out) I will be glad to see this problem solved... the machine will have been running with endless free memory and several idle threads.
It is really puzzling to me that I have such good luck with these, and others have bad. I really wonder what the difference is.
CPU type: GenuineIntel Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors: 16
Operating System: Linux Red Hat Enterprise Linux Red Hat Enterprise Linux 8.7 (Ootpa) [4.18.0-425.10.1.el8_7.x86_64|libc 2.28]
BOINC version: 7.20.2
Memory: 62.4 GB
Cache: 16896 KB
Swap space: 15.62 GB
Total disk space: 488.04 GB
Free Disk Space: 479.52 GB
OpenIFS 43r3 Perturbed Surface 1.05 x86_64-pc-linux-gnu
Number of tasks completed: 208
Max tasks per day: 212
Number of tasks today: 0
Consecutive valid tasks: 208
Average processing rate: 28.33 GFLOPS
Average turnaround time: 3.74 days
ID: 68115 · Reply Quote

Dave Jackson Volunteer moderator Send message Joined: 15 May 09 Posts: 4346 Credit: 16,533,637 RAC: 5,933	Message 68116 - Posted: 30 Jan 2023, 10:07:40 UTC My current error-free run is now up to 18. I think all of the errors happened while I was running four or more tasks at once. I am now sticking to a maximum of two because even that is slightly more than my ADSL can cope with. ID: 68116 · Reply Quote

xii5ku Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077	Message 68126 - Posted: 30 Jan 2023, 16:40:22 UTC - in response to Message 68115. Jean-David Beyer wrote: It is really puzzling to me that I have such good luck with these, and others have bad. While it certainly is partly a matter of bad luck vs. good luck, it is also partly a simple matter of statistics. The more tasks a user runs, the more error tasks this user is likely to encounter. (I for one am one of the users who complete comparatively few tasks because of my upload bandwidth limit. That means I don't run a lot of tasks while the upload server is up, and I am down to running only 1 "pilot" task per computer while the worthless upload server is down, which it is most of the time. Plus, by now I practically never suspend a task to disk, only to RAM. Consequently, I have had very few errors until now.) ID: 68126 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1060 Credit: 16,538,338 RAC: 2,071	Message 68128 - Posted: 30 Jan 2023, 17:15:09 UTC - in response to Message 68126. It is really puzzling to me that I have such good luck with these, and others have bad. While it certainly is partly a matter of bad luck vs. good luck, it partly is also a simple matter of statistics. The more tasks a user runs, the more error tasks this user is likely to encounter. I do not know about statistics -- and I have a BA in mathematics and took both a course in statistics and another in probability. Is 208 consecutive tasks out of 208 total tasks a few or a lot? In any case, a 100% success rate seems pretty good. How many need I run before I get a failure? I seem to run them faster than the server delivers them, even during the periods when the upload server will not accept the "trickles." And in many cases, two prior users have attempted the same work unit and failed. ID: 68128 · Reply Quote

xii5ku Send message Joined: 27 Mar 21 Posts: 79 Credit: 78,302,757 RAC: 1,077	Message 68133 - Posted: 30 Jan 2023, 18:05:04 UTC - in response to Message 68128. Last modified: 30 Jan 2023, 18:15:59 UTC Jean-David Beyer wrote: Is 208 consecutive tasks out of 208 total tasks a few or a lot? It's all relative. Right now I am seeing this on your single Linux host:
oifs_43r3_ps results: 244 total, 243 valid, 1 in progress
This is mine across 2 (earlier partly 3) hosts:
oifs_43r3_ps results: 1493 total, 954 valid, 484 in progress (would have been done by now if not for the permanent upload server absence), 55 error
My last errors were mostly from November, when I shut down and resumed one of my hosts. Then came a few errors from Dec 1, one from Dec 3, and no error since. But I have successfully avoided suspending tasks to disk ever since November, with the exception of 1 deliberate test which AFAICT didn't fail. (It might still fail, if it is among the pending uploads.)
My upload link width allows me to return 48 results per day. (This is rather little relative to the CPUs, RAM, and disk space which I could spare.) This means my 954 valid tasks translate to merely 20 days of production. The rest was server downtime. My best CPUs are 32-core CPUs, of which one alone could produce slightly more than 48 results per day. Some folks have even larger CPUs, or similarly large ones but with a higher power budget than mine.
If we go by credit of the last week or last month, my 48 results/day during the brief times when the upload server is functioning put me above the average. But a few big producers are missing from these 3rd party stats because they didn't enable statistics export. E.g., based on last week's credit of my team, there was 1.3 M credit given to one or more users on my team without stats export, compared to my 400 k or your 80 k of last week. ID: 68133 · Reply Quote
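The arithmetic behind the "20 days" figure can be checked with a short Python sketch. The task counts and the 48 results per day come straight from the post above; the function names are hypothetical, and computing the error share over finished tasks only (valid plus error, leaving out those still in progress) is an assumption of this sketch rather than something the post states.

```python
def production_days(valid_tasks: int, results_per_day: int) -> float:
    """Days of upload capacity needed to return the given number of results."""
    return valid_tasks / results_per_day


def error_share(valid_tasks: int, error_tasks: int) -> float:
    """Fraction of finished tasks (valid + error) that ended in error."""
    return error_tasks / (valid_tasks + error_tasks)


if __name__ == "__main__":
    # Figures quoted in the post for oifs_43r3_ps across 2 to 3 hosts.
    total, valid, in_progress, errors = 1493, 954, 484, 55
    assert valid + in_progress + errors == total  # the breakdown adds up

    days = production_days(valid, 48)
    share = error_share(valid, errors)
    print(f"{days:.1f} days of production at 48 results/day")
    print(f"{share:.1%} of finished tasks ended in error")
```

At 48 results per day, 954 valid tasks come to a little under 20 days, which matches the figure in the post.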

nairb Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 68453 - Posted: 25 Feb 2023, 1:09:37 UTC Old friend is back..... "double free or corruption (out)" I have been missing these. ID: 68453 · Reply Quote

nairb Send message Joined: 3 Sep 04 Posts: 105 Credit: 5,646,090 RAC: 102,785	Message 68514 - Posted: 28 Feb 2023, 20:59:41 UTC At least this w/u ran to the end before aborting with "Process still present 5 min after writing finish file; aborting". No other error messages. Only 50% success so far with the latest bunch. ID: 68514 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 806 Credit: 13,593,584 RAC: 7,495	Message 68515 - Posted: 28 Feb 2023, 21:49:11 UTC - in response to Message 68514. At least this w/u ran to the end before aborting with "Process still present 5 min after writing finish file; aborting". No other error messages. Only 50% success so far with the latest bunch. That error message points to an issue in the BOINC client. The task has finished and told the client, but then it gets stuck somewhere in the client code. CPDN isn't the only project to see this behaviour, but I've not seen any good explanation for it on the forums. --- CPDN Visiting Scientist ID: 68515 · Reply Quote

Jean-David Beyer Send message Joined: 5 Aug 04 Posts: 1060 Credit: 16,538,338 RAC: 2,071	Message 68516 - Posted: 28 Feb 2023, 22:22:19 UTC - in response to Message 68514. Only 50% success so far with the latest bunch. Same here, but the failures all came first. They all had very short execution times.
Task      Work unit  Sent                       Time reported              Status                 Run time (sec)  CPU time (sec)  Credit    Application
22317174  12214156   27 Feb 2023, 2:24:01 UTC   27 Feb 2023, 17:23:19 UTC  Completed              51,028.30       50,403.60       0.00      OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22316084  12214703   24 Feb 2023, 22:24:03 UTC  25 Feb 2023, 13:43:22 UTC  Completed              53,538.60       52,734.92       2,353.00  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22314976  12213630   24 Feb 2023, 12:25:29 UTC  25 Feb 2023, 3:03:19 UTC   Completed              52,615.79       51,784.63       2,353.00  OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22314676  12213385   22 Feb 2023, 6:23:59 UTC   22 Feb 2023, 7:24:41 UTC   Error while computing  66.16           1.15            ---       OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22314647  12213316   22 Feb 2023, 3:24:44 UTC   22 Feb 2023, 3:49:31 UTC   Error while computing  66.61           1.28            ---       OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22314608  12213345   22 Feb 2023, 0:25:23 UTC   22 Feb 2023, 1:23:20 UTC   Error while computing  66.38           1.15            ---       OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
ID: 68516 · Reply Quote

Glenn Carver Send message Joined: 29 Oct 17 Posts: 806 Credit: 13,593,584 RAC: 7,495	Message 68521 - Posted: 1 Mar 2023, 12:57:43 UTC - in response to Message 68516. Last modified: 1 Mar 2023, 12:58:52 UTC Only 50% success so far with the latest bunch. Same here, but the failures all came first. They all had very short execution times. I would expect that. The perturbations are most likely to generate instability soon after the model gets started. Once the model has balanced its mass & wind fields it will run on ok. The baroclinic life-cycle experiment was different. In that setup, it started from a very simple atmospheric state which was then perturbed to generate large atmospheric 'storms'. Some of which became too strong for the model to resolve with the timestep length it was using. So for those batches, the model would tend to fail nearer the end of the run. --- CPDN Visiting Scientist ID: 68521 · Reply Quote

Richard Haselgrove Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,279,167 RAC: 11,006	Message 68524 - Posted: 1 Mar 2023, 14:07:39 UTC - in response to Message 68521. In this case, it's much easier than that. The older tasks, like WU 12213345 (issued on 22 Feb) are resends from the failed batch 992, which was withdrawn because of a missing data file in the package. The newer tasks, like WU 12213630 (issued on 24 Feb) are from the corrected replacement batch issued on that day. ID: 68524 · Reply Quote