If stderr had two extra lines for each data trickle (the big uploads), it would be easier to figure out which heartbeat or restart events are critical. One line when it starts to collect the data for the upload, one when the upload files are ready for upload.
For example, I received 5 ANZ models within a short time on one box that is very likely to produce heartbeat problems whenever a new CPDN model starts.
The first model that started most likely will have 4 heartbeat messages, the second one 3 ... and the last one no heartbeat message at all.
After returning 3 of the models, the box downloaded one more ANZ but, quite unexpectedly, the remaining two "old" ANZ models were not hit by heartbeat problems when the new one started; only a few other projects were affected this time. Lucky me, perfect timing.
But it could as well have happened that the initialisation of the new model caused problems for the two older ones and those interruptions might even have caused a crash.
Unfortunately, the timestamps in BOINC stderr messages are useless, but trickle messages in stderr would still help to show in which order events occurred.
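To show what I mean, here is a rough Python sketch of how such markers could be used, relying only on line order and not on timestamps. The two trickle marker texts are made up by me (nothing like them exists in stderr today), and matching on "No heartbeat from" is my assumption about how the heartbeat message is worded.

```python
# Hypothetical marker texts: the model does not write these lines today.
TRICKLE_START = "trickle: collecting upload data"
TRICKLE_READY = "trickle: upload files ready"
# Assumed wording of the heartbeat message in stderr.
HEARTBEAT = "No heartbeat from"


def classify_heartbeats(stderr_lines):
    """Return (line_index, phase) for every heartbeat message.

    phase is "upload-prep" if the message appears between a start
    marker and the following ready marker, otherwise "calculation".
    Only the order of the lines is used, never their timestamps.
    """
    in_upload_prep = False
    events = []
    for index, line in enumerate(stderr_lines):
        if TRICKLE_START in line:
            in_upload_prep = True
        elif TRICKLE_READY in line:
            in_upload_prep = False
        elif HEARTBEAT in line:
            phase = "upload-prep" if in_upload_prep else "calculation"
            events.append((index, phase))
    return events
```

Feeding it the stderr of one model would give a list of heartbeat events, each tagged with the phase it fell into, so the ones that hit during upload preparation stand out.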
My guess would be that a heartbeat error while it prepares the upload data is absolutely destructive, whereas there is a good chance to survive it in the calculation phase.
The reason why this is especially interesting is that the project client might be able to set a "doing critical work" condition that is not interrupted by heartbeat checks. I think I have seen something like that in the project API sources (long ago).
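If I remember correctly, the BOINC API has boinc_begin_critical_section() and boinc_end_critical_section() for this, but I have not checked whether CPDN uses them around the upload preparation, or how exactly the heartbeat check behaves inside them. The following Python toy model only illustrates the idea of a nesting "critical work" counter that makes the periodic heartbeat check a no-op. The class and method names are hypothetical and this is not BOINC code.

```python
class HeartbeatGuard:
    """Toy model of a 'doing critical work' condition (not BOINC code)."""

    def __init__(self):
        # Nesting depth of critical sections; 0 means "not critical".
        self.depth = 0

    def begin_critical(self):
        self.depth += 1

    def end_critical(self):
        if self.depth == 0:
            raise RuntimeError("end_critical without begin_critical")
        self.depth -= 1

    def should_exit(self, client_alive):
        """Periodic check: True if the task should exit now.

        While critical work is in progress the check is skipped, so a
        missing heartbeat cannot interrupt the upload preparation.
        """
        if self.depth > 0:
            return False
        return not client_alive
```

In this model a lost heartbeat during the critical part is simply noticed at the first check after end_critical(), so the task would still exit, just not in the middle of writing the upload files.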