exit code -5 (0xfffffffb)

adrianxw
Joined: 31 Aug 04
Posts: 145
Credit: 2,021,020
RAC: 816
Message 9077 - Posted: 10 Feb 2005, 11:19:31 UTC
Last modified: 10 Feb 2005, 19:57:17 UTC

I don't think that is necessarily true. I run 4 BOINC projects, and only have unexplained trouble with CPDN.

The point of an open architecture that accommodates many projects must be to provide a standard container. Modifying the container to fit individual projects is the wrong way round.

The BOINC API is published, and it should be up to the client applications to conform to it.

That said, I agree, there are still issues with BOINC, but I don't think this is one.

I would also agree that it would be nice if, when pre-emption was due, the client was informed and could checkpoint before being removed (not as big an issue if the "leave in memory" option is enabled, of course). That is a wish for all projects, not just CPDN.
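As a rough illustration of that wish, the Python sketch below shows a work loop that checks a pre-emption flag and writes a checkpoint before stopping. It is only a sketch under assumptions: `run_model`, `load_checkpoint`, the JSON checkpoint file and the `threading.Event` flag are all hypothetical and are not part of the BOINC API.

```python
import json
import os
import threading


def run_model(total_steps, preempt, checkpoint_path, start_step=0, on_step=None):
    """Advance a toy model; write a checkpoint and stop if pre-emption is signalled."""
    step = start_step
    while step < total_steps:
        if preempt.is_set():
            # Write to a temporary file first, then rename it, so a
            # half-written checkpoint never replaces a good one.
            tmp_path = checkpoint_path + ".tmp"
            with open(tmp_path, "w") as fh:
                json.dump({"step": step}, fh)
            os.replace(tmp_path, checkpoint_path)
            return step
        step += 1  # stand-in for one unit of real model work
        if on_step is not None:
            on_step(step)
    return step


def load_checkpoint(checkpoint_path):
    """Return the step number stored by the last checkpoint."""
    with open(checkpoint_path) as fh:
        return json.load(fh)["step"]
```

On restart, the application would call `run_model` again with `start_step=load_checkpoint(path)`, so only the work done since the last check of the flag is repeated.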

I have a long wish list of BOINC improvements, but I think this thread is a CPDN issue. A start would be to differentiate between the various conditions that produce the catchall -5 error message, then at least people know what went wrong.
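For anyone puzzled by the two numbers in the thread title: -5 and 0xfffffffb are the same 32-bit value, read once as signed and once as unsigned. A minimal Python sketch of the conversion follows; the function names are made up for illustration.

```python
def to_unsigned32(code):
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


print(hex(to_unsigned32(-5)))   # 0xfffffffb
print(to_signed32(0xFFFFFFFB))  # -5
```

The conversion says nothing about the cause of the failure, which is exactly why separate codes for the separate conditions would help.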
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 9077
Friedrich S.
Joined: 22 Jan 05
Posts: 38
Credit: 3,914,110
RAC: 4,770
Message 9078 - Posted: 10 Feb 2005, 11:30:34 UTC - in response to Message 9063.  

Hello Les,

> Also, -5 is a "catch all" error message, so it isn't necessarily a file
> write.
> Sometimes it's caused by a negative pressure in one of the cells.

But is that not a big difference? A model that failed because of a hardware glitch or a file error could have been important & valid whereas a negative pressure certainly shows the model is flawed.
Do they check that and run models again if they failed because of hardware glitches etc.?

> There are several threads about success rates on the phpBB, (which is down),
> and one of the admins said the ratio
> is about 1 in 7 successful, so don't get too discouraged.

Well, I hope the scientists and their programmers know what they are doing. If they consider that acceptable, I can live with it.
I am just surprised that it happens relatively often.
And I am wondering if there are not strategies to compensate for the hardware-related failures.

Friedrich

I love CPDN!
--
ID: 9078
Andrew Hingston
Volunteer moderator
Joined: 17 Aug 04
Posts: 753
Credit: 9,804,700
RAC: 0
Message 9082 - Posted: 10 Feb 2005, 14:07:03 UTC
Last modified: 10 Feb 2005, 14:07:30 UTC

The 1:7 ratio is a little misleading. Looking at the WUs that have been resent, which is possible when we are allocated one to do ourselves, it seems that most fail either immediately or within quite a short time. It is unlikely that these are the result of parameterisation, and the fact that they can generally be run by somebody else reinforces that.

Clients have a download limit of 5 WUs a day. Somebody who is having trouble can easily reach this limit on successive days, perhaps without realising that there is a problem. In contrast, someone who runs CPDN successfully may go through one WU a month. Statistically, it would be possible to have a failure rate of 7 out of 8 measured on WU volume, and a success rate of 7 out of 8 measured by user. I'm not suggesting that the actual figures are that, as I haven't seen the data, but it is also true that those of us who run CPDN successfully tend to do so consistently, and where a WU does fail we know why.
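To make that arithmetic concrete, here is a small Python sketch. The figures are assumptions chosen so the numbers come out evenly (seven users who each complete one WU, and one user whose 49 downloads all fail, which is about ten days at 5 WUs a day); they are not project data, and `rates` is a hypothetical helper.

```python
from fractions import Fraction


def rates(good_users, bad_users, wus_per_good_user, failed_wus_per_bad_user):
    """Return (failure rate by WU volume, success rate by user)."""
    successes = good_users * wus_per_good_user
    failures = bad_users * failed_wus_per_bad_user
    wu_failure_rate = Fraction(failures, successes + failures)
    user_success_rate = Fraction(good_users, good_users + bad_users)
    return wu_failure_rate, user_success_rate


wu_fail, user_ok = rates(7, 1, 1, 49)
print(wu_fail, user_ok)  # 7/8 7/8
```

With these inputs, 49 of the 56 WUs fail while 7 of the 8 users succeed, so both "7 out of 8" figures hold at once.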

That is no consolation to those who are having trouble, but we should not run away with the idea that failure is the norm. If it were, few of us would bother.


ID: 9082
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 9096 - Posted: 10 Feb 2005, 18:37:25 UTC

Sorry Andrew,
I was attempting to quote Carl from the phpBB forum. I may have the figures wrong.

I still think that adrianxw's problem lies with the fact that he is running all the projects.
But I don't know why it's a problem.
I only run CP with BOINC V4.05, and don't have a problem.
I'll stay out of this, I think.

Les

ID: 9096
Andrew Hingston
Volunteer moderator
Joined: 17 Aug 04
Posts: 753
Credit: 9,804,700
RAC: 0
Message 9100 - Posted: 10 Feb 2005, 19:44:31 UTC - in response to Message 9096.  


> I was attempting to quote Carl from the phpBB forum. I may have the figures
> wrong.

No, I don't question that. But Carl was, I think, quoting figures for units, not users - my point is that his figure could be correct and this still be a minority experience for users.

Maybe Carl wants to answer for himself ;)
ID: 9100
crandles
Volunteer moderator
Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 9108 - Posted: 10 Feb 2005, 22:48:40 UTC - in response to Message 9100.  

> > I was attempting to quote Carl from the phpBB forum. I may have the
> > figures wrong.
>
> No, I don't question that. But Carl was, I think, quoting figures for units,
> not users - my point is that his figure could be correct and this still be a
> minority experience for users.
>
> Maybe Carl wants to answer for himself ;)

Reducing the 7:1 ratio is easy: just change the maximum number of downloads per day down from 4 to 3 or 2. But this doesn't really make things any better.

A much better measure is the percentage of model years that are in completed runs, compared with the model years in all finished runs (including incomplete runs). This figure is something like 76%, better than the 72.5% that classic manages. Unfortunately it is not easy to estimate, or to see which way, if any, it is moving. 76% seems pretty good to me.
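A minimal Python sketch of that measure, assuming each finished run is recorded as a pair of model years done and a completed flag. The function name and the sample runs are invented for illustration and are not CPDN data.

```python
def completed_model_year_share(runs):
    """Share of all model years that belong to runs which completed.

    runs: iterable of (model_years_done, completed) pairs, one per finished run.
    """
    runs = list(runs)
    total = sum(years for years, _ in runs)
    if total == 0:
        raise ValueError("no model years in the finished runs")
    in_completed = sum(years for years, done in runs if done)
    return in_completed / total


# Hypothetical sample: two completed runs and two that stopped early.
sample_runs = [(40, True), (40, True), (15, False), (5, False)]
print(completed_model_year_share(sample_runs))  # 0.8
```

Unlike a count of failed WUs, this weights each run by the work it actually contained, so a unit that crashes in its first minutes barely moves the figure.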


Visit BOINC WIKI for help

And join BOINC Synergy for all the news in one place.
ID: 9108
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 9417 - Posted: 16 Feb 2005, 18:39:41 UTC
Last modified: 16 Feb 2005, 18:42:22 UTC

I just caught a -5 error before it could happen.

After closing (exiting) BOINC I checked the task manager and found one hadsm3um_4.04_windows_intelx86.exe still running.

This is the surest way (I would give 100% on it!) to crash a model with error -5: the next BOINC restart would have destroyed the model, because the still-running program had some files locked.
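The same check can be scripted rather than done by eye in the task manager. The Python sketch below is an assumption-laden example: it relies on the Windows `tasklist` command with CSV output, and it treats any image name starting with `hadsm3um_` as a leftover worker. The helper names are hypothetical.

```python
import csv
import io
import subprocess

WORKER_PREFIX = "hadsm3um_"  # prefix taken from the executable name above


def parse_tasklist_csv(text):
    """Return (image_name, pid) pairs from `tasklist /FO CSV /NH` output."""
    found = []
    for row in csv.reader(io.StringIO(text)):
        # Informational lines (e.g. "no tasks match") have no PID column.
        if len(row) >= 2 and row[1].strip().isdigit():
            found.append((row[0], int(row[1])))
    return found


def leftover_workers(processes, prefix=WORKER_PREFIX):
    """Keep only the processes whose image name starts with the worker prefix."""
    return [(name, pid) for name, pid in processes
            if name.lower().startswith(prefix)]


def list_windows_processes():
    """Windows only: ask tasklist for every running process in CSV form."""
    result = subprocess.run(["tasklist", "/FO", "CSV", "/NH"],
                            capture_output=True, text=True, check=True)
    return parse_tasklist_csv(result.stdout)
```

After exiting BOINC, `leftover_workers(list_windows_processes())` should return an empty list; anything else means a worker is still holding files and should be ended before BOINC is restarted.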


Dual Athlon MP 2600+, Win2k SP4, one CPDN model and one Einstein WU running, BOINC 4.19 GUI


I already lost several models from this nasty bug, it really needs to be fixed.
ID: 9417
crandles
Volunteer moderator
Joined: 16 Oct 04
Posts: 692
Credit: 277,679
RAC: 0
Message 9424 - Posted: 16 Feb 2005, 19:04:11 UTC

>I already lost several models from this nasty bug, it really needs to be fixed.

The good news: Tolu has said he has 'resolved this for good'. The bad news: this was on the Sulphur cycle alpha test.
Visit BOINC WIKI for help

And join BOINC Synergy for all the news in one place.
ID: 9424
Ananas
Volunteer moderator
Joined: 31 Oct 04
Posts: 336
Credit: 3,316,482
RAC: 0
Message 9426 - Posted: 16 Feb 2005, 19:12:40 UTC - in response to Message 9424.  
Last modified: 16 Feb 2005, 19:14:56 UTC

> > I already lost several models from this nasty bug, it really needs to be
> > fixed.
>
> The good news Tolu has said he has 'resolved this for good'. The bad this was
> on the Sulphur cycle alpha test.


This is very good news. I had already become quite frustrated, as nobody seemed to care about all my nagging about this problem. I can reproduce it anytime; the harder problem is not reproducing it ;-)

Now that I know it will be solved, I will be patient and not nag anymore :-)


My idea for compound projects would have been to give control over the process IDs of secondary started programs back to BOINC, like a PID list that tells BOINC which additional tasks to kill on exit. Will it be something like this?
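To show the kind of thing I mean, here is a minimal Python sketch of a parent that keeps a list of the helpers it started and stops them all on exit. `ChildRegistry` is a hypothetical class for illustration only; nothing here reflects how BOINC or the CPDN applications actually work.

```python
import subprocess
import sys


class ChildRegistry:
    """Track helper processes so the parent can stop them all on exit."""

    def __init__(self):
        self._children = []

    def spawn(self, args):
        """Start a helper process and remember it."""
        child = subprocess.Popen(args)
        self._children.append(child)
        return child

    def pids(self):
        """PIDs of registered helpers that are still running."""
        return [c.pid for c in self._children if c.poll() is None]

    def terminate_all(self, timeout=5.0):
        """Ask every helper to stop, then force any that do not comply."""
        for child in self._children:
            if child.poll() is None:
                child.terminate()
        for child in self._children:
            try:
                child.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                child.kill()
                child.wait()
        self._children.clear()
```

The parent would call `spawn` for every secondary program and `terminate_all` in its shutdown path, so no helper is left behind holding file locks.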
ID: 9426
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 9431 - Posted: 16 Feb 2005, 21:23:38 UTC
Last modified: 16 Feb 2005, 21:25:33 UTC

Chris,
Did Tolu say what the problem was? As a former programmer, I'm curious. Or is that nosy? :-)

Les

ID: 9431
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 9443 - Posted: 17 Feb 2005, 7:53:44 UTC - in response to Message 9431.  

> Did Tolu say what the problem was? As a former programmer, I'm curious. Or is that nosy? :-)

He didn't, but as another programmer I can imagine what it's likely to have been.

The hadsm3_* process controls the complete model, spawning hadsm3um_* to do the work for each of the model phases. The controller can detect when the worker stops running relatively easily, but you need a custom mechanism for the worker to be able to detect an abnormal termination of the controller.

The processes communicate using shared memory, and my guess is that Tolu fixed a problem with the 'are you there' handshake between the processes.
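For the curious, one common shape for such a handshake is a heartbeat counter: the controller keeps incrementing a number in the shared region, and the worker gives up if the number stops moving. The Python sketch below illustrates that idea only. It is my guess at the general technique, not Tolu's code; the names, the 64-bit counter layout and the timeout are all assumptions, and a `bytearray` stands in for the shared memory.

```python
import struct

HEARTBEAT = struct.Struct("<Q")  # one unsigned 64-bit counter at offset 0


def beat(shared):
    """Controller side: increment the counter at the start of the buffer."""
    (count,) = HEARTBEAT.unpack_from(shared, 0)
    HEARTBEAT.pack_into(shared, 0, (count + 1) % (1 << 64))


class ControllerWatch:
    """Worker side: treat the controller as gone if the counter stops moving."""

    def __init__(self, shared, timeout_s, now):
        self._shared = shared
        self._timeout_s = timeout_s
        (self._last_count,) = HEARTBEAT.unpack_from(shared, 0)
        self._last_change = now

    def controller_alive(self, now):
        """Return True while a beat has been seen within the timeout."""
        (count,) = HEARTBEAT.unpack_from(self._shared, 0)
        if count != self._last_count:
            self._last_count = count
            self._last_change = now
        return (now - self._last_change) <= self._timeout_s
```

The worker would call `controller_alive(time.monotonic())` between units of work and shut itself down cleanly on the first `False`. A real shared-memory region, for example the `buf` of a `multiprocessing.shared_memory.SharedMemory` object, could be passed in place of the `bytearray`.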
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 9443
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 9466 - Posted: 17 Feb 2005, 14:50:13 UTC

Thanks, Thyme Lawn.

Les
ID: 9466