w/u failed at the 89th zip file

PDW
Joined: 29 Nov 17
Posts: 55
Credit: 6,505,083
RAC: 1,294
Message 68104 - Posted: 29 Jan 2023, 13:37:44 UTC - in response to Message 68103.  

Well, the BOINC code comments say:

// If we already found a finish file, abort the app;
// it must be hung somewhere in boinc_finish();

I do find this comment thought-provoking:

// process is still there 5 min after it wrote finish file.
// abort the job
// Note: actually we should treat it as successful.
// But this would be tricky.

Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,279,167
RAC: 11,006
Message 68105 - Posted: 29 Jan 2023, 14:25:27 UTC - in response to Message 68104.  

That line was added, and the timeout increased, around 4 years ago:

https://github.com/BOINC/boinc/commit/db4c3d0c22d772f77d6d65e6adf9f23280530a7f

client: increase finish-file timeout
When an app finishes, it writes a "finish file",
which ensures the client that the app really finished.

If the app process is still there N seconds after the finish file appears,
the client assumes that something went wrong, and it aborts the job.

Previously N was 10.
This was too small during periods of heavy paging.
I increased it to 300.

It has been pointed out that if the app creates the finish file,
and its output files are present,
it should be treated as successful regardless of whether it exits.
This is probably true, but right now we don't have a mechanism
for killing a job and marking it as success.
The longer timeout makes this moot.
'ensures' --> 'assures'?
'The longer timeout makes this moot'. Or not, as the case may be. With 64 GB of RAM, and all solid-state drives, I doubt paging is the trigger. We should consider other possible causes too.
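
To make the rule in that commit message concrete, here is a minimal Python sketch of the decision it describes. It is not the BOINC client code (which is C++). The names `FINISH_FILE_TIMEOUT` and `check_finished_task` and the returned status strings are hypothetical; only the 300-second value and the abort-after-timeout behaviour come from the commit message.

```python
# Illustrative sketch only; not the actual BOINC client implementation.
FINISH_FILE_TIMEOUT = 300.0  # seconds; the commit message says it was 10 before


def check_finished_task(finish_file_time, process_alive, now):
    """Classify a task from its finish file and process state.

    finish_file_time: time (seconds) the finish file appeared, or None.
    process_alive:    True if the app process still exists.
    now:              current time on the same clock, in seconds.
    """
    if finish_file_time is None:
        # No finish file: the app has not signalled completion.
        return "running" if process_alive else "exited_without_finish_file"
    if not process_alive:
        # Finish file written and the process has gone: normal completion.
        return "finished"
    if now - finish_file_time > FINISH_FILE_TIMEOUT:
        # Process still present after the timeout: the job is aborted,
        # even though its output files may be complete.
        return "abort"
    # Finish file present, process still there, timeout not yet reached.
    return "waiting"
```

The "abort" branch is the case the commit message calls out: the finish file exists, but there is no mechanism for killing the job and still marking it as a success.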

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1060
Credit: 16,538,338
RAC: 2,071
Message 68106 - Posted: 29 Jan 2023, 15:04:26 UTC - in response to Message 68105.  

Richard Haselgrove wrote:
'The longer timeout makes this moot'. Or not, as the case may be. With 64 GB of RAM, and all solid-state drives, I doubt paging is the trigger. We should consider other possible causes too.


My machine has a 512 GByte solid-state drive, but other than the BOINC client software and the swap space, all the BOINC stuff is on a 5400 rpm SATA spinning drive.
Yet I do next to no paging. I have no CPDN tasks running at the moment because there are none. I started getting swap space usage when I had about 20 completed tasks that were unable to upload their "trickles" for about a week. I have no idea what was on that swap space. There was certainly no thrashing of running BOINC tasks.

This is my current RAM usage:
$ free -hw
              total        used        free      shared     buffers       cache   available
Mem:           62Gi       4.0Gi       1.0Gi        85Mi       162Mi        57Gi        57Gi
Swap:          15Gi       3.0Mi        15Gi
 


nairb
Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 68114 - Posted: 30 Jan 2023, 8:12:02 UTC

With 1 CPDN task running by itself, it failed after zip no. 83 with:
13:30:43 STEP 2039 H=2039:00 +CPU= 18.156
double free or corruption (out)

I will be glad to see this problem solved... The machine had been running with plenty of free memory and several idle threads.

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1060
Credit: 16,538,338
RAC: 2,071
Message 68115 - Posted: 30 Jan 2023, 8:38:35 UTC - in response to Message 68114.  

nairb wrote:
With 1 CPDN task running by itself, it failed after zip no. 83 with:
13:30:43 STEP 2039 H=2039:00 +CPU= 18.156
double free or corruption (out)

I will be glad to see this problem solved... The machine had been running with plenty of free memory and several idle threads.


It is really puzzling to me that I have such good luck with these, and others have bad. I really wonder what the difference is.

CPU type 	GenuineIntel
Intel(R) Xeon(R) W-2245 CPU @ 3.90GHz [Family 6 Model 85 Stepping 7]
Number of processors 	16
Operating System 	Linux Red Hat Enterprise Linux
Red Hat Enterprise Linux 8.7 (Ootpa) [4.18.0-425.10.1.el8_7.x86_64|libc 2.28]
BOINC version 	7.20.2
Memory 	62.4 GB
Cache 	16896 KB
Swap space 	15.62 GB
Total disk space 	488.04 GB
Free Disk Space 	479.52 GB

OpenIFS 43r3 Perturbed Surface 1.05 x86_64-pc-linux-gnu
Number of tasks completed 	208
Max tasks per day 	212
Number of tasks today 	0
Consecutive valid tasks 	208
Average processing rate 	28.33 GFLOPS
Average turnaround time 	3.74 days


Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4346
Credit: 16,533,637
RAC: 5,933
Message 68116 - Posted: 30 Jan 2023, 10:07:40 UTC

My current error-free run is now up to 18. I think all of the errors happened while I was running four or more tasks at once. I am now sticking to a maximum of two because even that is slightly more than my ADSL can cope with.

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68126 - Posted: 30 Jan 2023, 16:40:22 UTC - in response to Message 68115.  

Jean-David Beyer wrote:
It is really puzzling to me that I have such good luck with these, and others have bad.
While it certainly is partly a matter of bad luck vs. good luck, it partly is also a simple matter of statistics. The more tasks a user runs, the more error tasks this user is likely to encounter.

(I for one am one of the users who complete comparatively few tasks, because of my upload bandwidth limit, which means I don't run a lot of tasks while the upload server is up, and am down to running only 1 "pilot" task per computer while the worthless upload server is down. Which it is most of the time. Plus, by now I practically never suspend a task to disk, only to RAM. Consequently, I have had very few errors until now.)

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1060
Credit: 16,538,338
RAC: 2,071
Message 68128 - Posted: 30 Jan 2023, 17:15:09 UTC - in response to Message 68126.  

It is really puzzling to me that I have such good luck with these, and others have bad.

While it certainly is partly a matter of bad luck vs. good luck, it partly is also a simple matter of statistics. The more tasks a user runs, the more error tasks this user is likely to encounter.


I do not know about statistics -- and I have a BA in mathematics and took both a course in statistics, and another in probability.

Is 208 consecutive tasks out of 208 total tasks a few or a lot? In any case, a 100% success rate seems pretty good. How many need I run before I get a failure? I seem to run them faster than the server delivers them, even during periods when the upload server will not accept the "trickles." And in many cases, two prior users have attempted the same work unit and failed.
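
One way to put a number on "how many need I run", as a rough sketch only: assume every task fails independently with the same unknown probability p. That is a simplifying assumption, and the failures discussed in this thread (suspending to disk, a bad batch) are probably not independent. Under it, the chance of n clean tasks in a row is (1 - p)^n, and a clean run of 208 only rules out failure rates above roughly 1.4% at 95% confidence. The function names below are made up for illustration.

```python
def prob_no_failures(p, n):
    """Probability of n consecutive successes if each task fails with probability p."""
    return (1.0 - p) ** n


def upper_bound_failure_rate(n, confidence=0.95):
    """Largest per-task failure rate still consistent with n straight successes.

    Solves (1 - p)**n = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


n = 208  # consecutive valid tasks reported above
bound = upper_bound_failure_rate(n)
print(f"95% upper bound on failure rate: {bound:.4f}")
# With a constant rate p, the mean number of tasks up to the first failure is 1/p.
print(f"Mean tasks to first failure at that rate: {1.0 / bound:.0f}")
```

So a host with a true failure rate of about 1% could quite plausibly show 208 out of 208, while one at 5% almost certainly would not.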

xii5ku
Joined: 27 Mar 21
Posts: 79
Credit: 78,302,757
RAC: 1,077
Message 68133 - Posted: 30 Jan 2023, 18:05:04 UTC - in response to Message 68128.  
Last modified: 30 Jan 2023, 18:15:59 UTC

Jean-David Beyer wrote:
Is 208 consecutive tasks out of 208 total tasks a few or a lot?
It's all relative. Right now I am seeing this on your single Linux host:
oifs_43r3_ps results: 244 total, 243 valid, 1 in progress

This is mine across 2 (earlier partly 3) hosts:
oifs_43r3_ps results: 1493 total, 954 valid, 484 in progress (would have been done by now if not for the permanent upload server absence), 55 error

My last errors were mostly from November, when I shut down and resumed one of my hosts. Then there were a few errors from Dec 1, one from Dec 3, and no errors since. But I have successfully avoided suspending tasks to disk ever since November, with the exception of 1 deliberate test which AFAICT didn't fail. (It might still fail, if it is among the pending uploads.)

My upload link width allows me to return 48 results per day. (This is rather little relative to the CPUs, RAM, and disk space which I could spare.) This means my 954 valid tasks translate to merely 20 days production. The rest was server downtime. My best CPUs are 32 core CPUs, of which one alone could produce slightly more than 48 results per day. Some folks have even larger CPUs, or similarly large ones but with higher power budget than mine.
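
For anyone who wants to check the arithmetic, here is a small sketch using only the totals quoted in this post. The variable names are just for illustration, and the error share is computed over finished results only (valid plus error), which is one possible choice of denominator.

```python
# Figures as quoted in this post.
total_results = 1493
valid_results = 954
in_progress = 484
error_results = 55
uploads_per_day = 48  # results per day that the upload link can return

# The three categories account for every result.
assert valid_results + in_progress + error_results == total_results

# Days of production needed to return the valid results at 48 per day.
days_of_production = valid_results / uploads_per_day

# Share of finished results (valid + error) that ended in error.
error_share = error_results / (valid_results + error_results)

print(f"{days_of_production:.1f} days of production")
print(f"{error_share:.1%} of finished results were errors")
```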

If we go by credit of the last week or last month, my 48 results/day during the brief times when the upload server is functioning put me above the average. But a few big producers are missing from these 3rd party stats because they didn't enable statistics export. E.g., based on last week's credit of my team, there was 1.3 M credit given to one or more users on my team without stats export, compared to my 400 k or your 80 k of last week.

nairb
Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 68453 - Posted: 25 Feb 2023, 1:09:37 UTC

Old friend is back.....
"double free or corruption (out)"

I have been missing these.

nairb
Joined: 3 Sep 04
Posts: 105
Credit: 5,646,090
RAC: 102,785
Message 68514 - Posted: 28 Feb 2023, 20:59:41 UTC

At least this w/u ran to the end before aborting with

"Process still present 5 min after writing finish file; aborting"

No other error messages.

Only 50% success so far with the latest bunch.

Glenn Carver
Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 68515 - Posted: 28 Feb 2023, 21:49:11 UTC - in response to Message 68514.  

At least this w/u ran to the end before aborting with
"Process still present 5 min after writing finish file; aborting"
No other error messages.
Only 50% success so far with the latest bunch.
That error message points to an issue in the BOINC client. The task has finished and told the client, but then it gets stuck somewhere in the client code. CPDN isn't the only project to see this behaviour, but I've not seen any good explanation for it on the forums.
---
CPDN Visiting Scientist

Jean-David Beyer
Joined: 5 Aug 04
Posts: 1060
Credit: 16,538,338
RAC: 2,071
Message 68516 - Posted: 28 Feb 2023, 22:22:19 UTC - in response to Message 68514.  

Only 50% success so far with the latest bunch.

Same here, but the failures all came first. They all had very short execution times.

22317174 	12214156 	27 Feb 2023, 2:24:01 UTC 	27 Feb 2023, 17:23:19 UTC 	Completed 	51,028.30 	50,403.60 	0.00 	OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22316084 	12214703 	24 Feb 2023, 22:24:03 UTC 	25 Feb 2023, 13:43:22 UTC 	Completed 	53,538.60 	52,734.92 	2,353.00 	OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22314976 	12213630 	24 Feb 2023, 12:25:29 UTC 	25 Feb 2023, 3:03:19 UTC 	Completed 	52,615.79 	51,784.63 	2,353.00 	OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22314676 	12213385 	22 Feb 2023, 6:23:59 UTC 	22 Feb 2023, 7:24:41 UTC 	Error while computing 	66.16 	1.15 	--- 	OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22314647 	12213316 	22 Feb 2023, 3:24:44 UTC 	22 Feb 2023, 3:49:31 UTC 	Error while computing 	66.61 	1.28 	--- 	OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu
22314608 	12213345 	22 Feb 2023, 0:25:23 UTC 	22 Feb 2023, 1:23:20 UTC 	Error while computing 	66.38 	1.15 	--- 	OpenIFS 43r3 v1.21 x86_64-pc-linux-gnu


Glenn Carver
Joined: 29 Oct 17
Posts: 806
Credit: 13,593,584
RAC: 7,495
Message 68521 - Posted: 1 Mar 2023, 12:57:43 UTC - in response to Message 68516.  
Last modified: 1 Mar 2023, 12:58:52 UTC

Only 50% success so far with the latest bunch.
Same here, but the failures all came first. They all had very short execution times.
I would expect that. The perturbations are most likely to generate instability soon after the model gets started. Once the model has balanced its mass & wind fields it will run on ok.

The baroclinic life-cycle experiment was different. In that setup, the model started from a very simple atmospheric state which was then perturbed to generate large atmospheric 'storms', some of which became too strong for the model to resolve with the timestep length it was using. So for those batches, the model would tend to fail nearer the end of the run.
---
CPDN Visiting Scientist

Richard Haselgrove
Joined: 1 Jan 07
Posts: 943
Credit: 34,279,167
RAC: 11,006
Message 68524 - Posted: 1 Mar 2023, 14:07:39 UTC - in response to Message 68521.  

In this case, it's much easier than that.

The older tasks, like WU 12213345 (issued on 22 Feb) are resends from the failed batch 992, which was withdrawn because of a missing data file in the package.

The newer tasks, like WU 12213630 (issued on 24 Feb) are from the corrected replacement batch issued on that day.