climateprediction.net home page
System clock adjustments - problematic?

old_user59948
Joined: 3 Mar 05
Posts: 76
Credit: 127,896
RAC: 0
Message 14550 - Posted: 20 Jul 2005, 20:26:38 UTC
Last modified: 20 Jul 2005, 20:49:39 UTC

Hi. Two models I have on a P4 HT PC both exited with 0 status but no finished file. I got the usual warning that if it happens repeatedly, etc. Interestingly (I think), it was in the same minute that the system clock was adjusted. I know this because the clock had slowed by over 60 seconds and I forced an update from my local NTP server. I think I can imagine what went on here: my guess is that systems adjust their clocks on a pretty regular basis, and if the app/cc is using that clock, as I think it may well be, then it could be the source of some strange behaviour.

EDIT: Just checked the logs and when it did this with cpdn, boinc then picked up 2 seti units to run and did the same with them... and restarted those seti units happily. I have kept the log entries if anyone is interested.

@Paul B
If the clock can affect things this way, it may be a good one for the Wiki?

Anyone else seen this at all? Known already? Or is this just a coincidence, do you think?
ID: 14550
old_user1102
Joined: 25 Aug 04
Posts: 6
Credit: 473,435
RAC: 0
Message 14566 - Posted: 21 Jul 2005, 12:22:27 UTC - in response to Message 14550.  
Last modified: 21 Jul 2005, 12:23:59 UTC

> Hi. Two models I have on a P4 HT PC both exited with 0 status but no finished
> file. I got the usual warning that if it happens repeatedly etc etc.
> Interestingly...I think.....It was in the same minute that the system clock
> was adjusted. I know this because it had slowed by over 60 seconds and I
> forced an update with my local NTP server. I can imagine I think what went on
> here but my guess is that systems adjust clocks on a pretty regular basis and
> if the app/cc is using that clock, as I think it may well be, then it could be
> the source of some strange behaviour.
>
> EDIT: Just checked the logs and when it did this with cpdn, boinc then picked
> up 2 seti units to run and did the same with them....and restarted those seti
> units happily. I have kept the log entries if anyone is interested.
>
> @Paul B
> If the clock can affect this way it may be a good one for the Wiki?
>
> Any one else seen this at all? Known already? Or is this just a coincidence
> do you think?

Well spotted Ian. I've been puzzled about this myself for some time, but never thought to look for links to what the system time was up to. Anyway, I've just risked a model, for experiment's sake, and changed the system time a bit on one of my crunchers. As it happened it was running LHC at the time, but it DID error out on cue, with "exited with zero status but no 'finished' file".
It looks like you're next in line for a "finder's fee" (beer?) from Paul!

TTFN, Ken

ID: 14566
old_user5994
Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 14567 - Posted: 21 Jul 2005, 12:24:14 UTC
Last modified: 21 Jul 2005, 12:26:52 UTC

It may just be coincidence. Hmmm... maybe not... need to think on this one ... I don't recall whether BOINC looks at the clock to determine switching times or not...

And, Yes I am interested in the logs... heck, I take the logs even if you are not sure there is something interesting in them. If there is nothing interesting ... I "grep" them and delete them ...

Before I started to rebuild my tool I was looking at roughly 10 logs before I saw something interesting and had a good example. In the e-mail when you send them over, mention the time and day so I can scan that part ... I may have to try the experiment here ...
{edit}
Looks like a "smoking Gun" and this may explain why that problem has been such a BEAR to locate ... this buglet has driven the developers nuts as they could not explain its "random" behavior.
ID: 14567
old_user59948
Joined: 3 Mar 05
Posts: 76
Credit: 127,896
RAC: 0
Message 14575 - Posted: 21 Jul 2005, 18:56:18 UTC
Last modified: 21 Jul 2005, 19:01:32 UTC

Great beer.....ahem....Paul doesn't drink beer.....damn!
Oh well I'll pretend.

@Paul sent you the logs.

Let's hope it is something, tbh, as it's always nice to find a reason for a problem even if you cannot necessarily fix it. Should I have a scan of the boinc code to see if I can find out where it is happening? Worthwhile?

Regards
Ian

EDIT: Just checked Win 2003 server... it will do a clock update using the normal MS client every 7 days. So these incidents may be occurring every 7 days on some systems, and more frequently if, say, an NTP client is in use, where even hourly updates are sometimes requested. This would certainly add to the "random" look, as it does not always follow that there is a time change. Hmmm.

ID: 14575
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 14577 - Posted: 21 Jul 2005, 20:33:46 UTC

I've just done some testing on Win 2K and didn't get this problem moving the time forward or back. That host does an NTP sync every 24 hours and I can't remember it ever exiting with no finished file.

The only strange thing I noticed was that client_state.xml wasn't updated on CPDN checkpoints after moving the clock back an hour (the model XML file was updated).
ID: 14577
old_user59948
Joined: 3 Mar 05
Posts: 76
Credit: 127,896
RAC: 0
Message 14578 - Posted: 21 Jul 2005, 20:46:07 UTC - in response to Message 14577.  
Last modified: 21 Jul 2005, 20:55:03 UTC

> I've just done some testing on Win 2K and didn't get this problem moving the
> time forward or back. That host does an NTP sync every 24 hours and I can't
> remember it ever exiting with no finished file.
>
> The only strange thing I noticed was that client_state.xml wasn't updated on
> CPDN checkpoints after moving the clock back an hour (the model XML file was
> updated).

Hmmmm, interesting again I guess. Tolerant to big changes but not to small changes. I moved mine by 65 or so seconds as I recall. My systems update every hour from my NTP server and I don't get this behaviour normally, or at least I have not detected it. But still... it sounds a decent candidate for further investigation, as it has happened for two of us over three projects. Try a smaller change... 60 secs?

EDIT:
Did you move backwards then forwards? May be sensitive to which way you go first.

Regards
Ian
ID: 14578
old_user5994
Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 14579 - Posted: 21 Jul 2005, 21:41:50 UTC

Well, I just posted to the developers mailing list ... let them check it out ...

If you all find more I will (or you can) add a post there ...
ID: 14579
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 14586 - Posted: 22 Jul 2005, 6:59:47 UTC - in response to Message 14578.  

> Did you move backwards then forwards? May be sensitive to which way you go first.

Rewound one hour, then 2 steps of an hour forward and 6 * 10 minute steps backwards, allowing at least one timestep to be completed each time. I also had BOINC client debugging enabled to try and capture any unusual activity. If anyone who can reproduce this wants to try it with client debug and needs help to enable it, send me a PM (http://www.climateprediction.net/board/privmsg.php?mode=post&u=1070) on the phpBB forum (you'll need to register if you haven't already).

I'll try again with smaller time increments later today.
ID: 14586
old_user1102
Joined: 25 Aug 04
Posts: 6
Credit: 473,435
RAC: 0
Message 14591 - Posted: 22 Jul 2005, 11:57:51 UTC - in response to Message 14586.  
Last modified: 22 Jul 2005, 12:00:10 UTC

> > Did you move backwards then forwards? May be sensitive to which way you
> > go first.
>
> Rewound one hour, then 2 steps of an hour forward and 6 * 10 minute steps
> backwards, allowing at least one timestep to be completed each time. I also
> had BOINC client debugging enabled to try and capture any unusual activity.
> If anyone who can reproduce this wants to try it with client debug and needs
> help to enable it, send me a PM
> (http://www.climateprediction.net/board/privmsg.php?mode=post&u=1070) on the
> phpBB forum (you'll need to register if you haven't already).
>
> I'll try again with smaller time increments later today.

Thyme,

For me (win2k adv server, service pack 4), this error is reproducible by simply turning the system clock back 2 to 3 minutes and then waiting; the currently running project then does its "exited with zero status but no 'finished' file" bit shortly (up to a few minutes) afterwards. I've now been able to cause this effect 3 times on different projects, which is a bit more than pure coincidence, but unfortunately I have not managed to have a CPDN unit actively running at the times in question.
I can't seem to make it happen by setting the clock forwards.
For those of us who, knowingly or otherwise, automatically synchronise system time with NTP servers, this is significant.

Ken Phillips
ID: 14591
old_user70741
Joined: 16 Apr 05
Posts: 7
Credit: 13,526
RAC: 0
Message 14592 - Posted: 22 Jul 2005, 13:14:01 UTC

Just reproduced "exited with zero status but no 'finished' file" while running only CPDN and Crash.

Running 4.45 on XP with SP2.

I had set the system clock back 5 minutes, left it alone for about 10 minutes then updated the clock via the web.

The strange thing is, that I had earlier tried the same while setting the system clock forward 5 minutes, but could not reproduce this then.

Last night I also reproduced this with LHC. But comp has been rebooted since - so logs lost.

Over to those much better qualified than me.

It may not be the only reason that "exited with zero status but no 'finished' file" happens but it certainly is starting to look like a good proportion.


Well spotted Tigher.


22/07/2005 13:28:29|LHC@home|Throughput 21130 bytes/sec
22/07/2005 13:28:30|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
22/07/2005 13:28:30|LHC@home|Requesting 0 seconds of work, returning 1 results
22/07/2005 13:28:31|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
22/07/2005 13:30:34||request_reschedule_cpus: result op
22/07/2005 13:30:34|climateprediction.net|Resuming result 2v1x_200154905_0 using hadsm3 version 4.12
22/07/2005 13:49:38|climateprediction.net|Result 2v1x_200154905_0 exited with zero status but no 'finished' file
22/07/2005 13:49:38|climateprediction.net|If this happens repeatedly you may need to reset the project.
22/07/2005 13:49:38|crashcollection|Result crash_collection_20050622183003_62 exited with zero status but no 'finished' file
22/07/2005 13:49:38|crashcollection|If this happens repeatedly you may need to reset the project.
22/07/2005 13:49:38||request_reschedule_cpus: process exited
22/07/2005 13:49:38|climateprediction.net|Restarting result 2v1x_200154905_0 using hadsm3 version 4.12
22/07/2005 13:49:38|crashcollection|Restarting result crash_collection_20050622183003_62 using minidumps version 4.65
22/07/2005 13:49:42|climateprediction.net|Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
22/07/2005 13:49:42|climateprediction.net|Requesting 0 seconds of work, returning 0 results
22/07/2005 13:49:43|climateprediction.net|Scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded
22/07/2005 13:49:46|climateprediction.net|Sending scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi
22/07/2005 13:49:46|climateprediction.net|Requesting 0 seconds of work, returning 0 results
22/07/2005 13:49:47|climateprediction.net|Scheduler request to http://climateapps2.oucs.ox.ac.uk/cpdnboinc_cgi/cgi succeeded
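
For anyone wanting to scan their own logs for this message, here is a minimal sketch in Python. It assumes the `timestamp|project|message` layout and the day/month/year date format seen in the excerpt above; other BOINC versions or locales may differ. The function name `find_unfinished_exits` is made up for illustration and is not part of BOINC.

```python
from datetime import datetime

# The message text as it appears in the log excerpt above.
MARKER = "exited with zero status but no 'finished' file"


def find_unfinished_exits(log_text):
    """Return (timestamp, project, message) for every line containing MARKER."""
    hits = []
    for line in log_text.splitlines():
        parts = line.strip().split("|", 2)
        if len(parts) != 3:
            continue  # not a "timestamp|project|message" line
        stamp, project, message = parts
        if MARKER not in message:
            continue
        # Assumed format: DD/MM/YYYY HH:MM:SS, as in the excerpt.
        when = datetime.strptime(stamp, "%d/%m/%Y %H:%M:%S")
        hits.append((when, project, message))
    return hits


sample = """22/07/2005 13:30:34||request_reschedule_cpus: result op
22/07/2005 13:49:38|climateprediction.net|Result 2v1x_200154905_0 exited with zero status but no 'finished' file
22/07/2005 13:49:38|crashcollection|Result crash_collection_20050622183003_62 exited with zero status but no 'finished' file"""

for when, project, _ in find_unfinished_exits(sample):
    print(when.isoformat(), project)
```

Comparing the printed timestamps with the times at which the clock was adjusted is then a manual step; the sketch only pulls out the relevant lines.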



It's not the speed, but the quality - Until I get a faster computer
ID: 14592
old_user5994
Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 14593 - Posted: 22 Jul 2005, 13:29:06 UTC

Cool! OK, folks, I posted your experiments and write-ups on the developers mailing list ... also direct to Dr. Anderson. Obviously, we will not see an immediate "fix" but the check is in the mail.

Janus also pointed to another possible situation that might arise related to this, and my memory is so bad I can't remember what it was !!!!

:)
ID: 14593
old_user59948
Joined: 3 Mar 05
Posts: 76
Credit: 127,896
RAC: 0
Message 14594 - Posted: 22 Jul 2005, 13:35:33 UTC
Last modified: 22 Jul 2005, 13:35:51 UTC

Paul
A little help

"As a sidenote to this i hear some people talking about WUs failing after
resuming suspended laptops (either hibernation or suspend modes) due to
the BOINC application library triggering a heartbeat timeout. This could
easily be related.

-- Janus"

Cheers
ID: 14594
old_user59948
Joined: 3 Mar 05
Posts: 76
Credit: 127,896
RAC: 0
Message 14595 - Posted: 22 Jul 2005, 13:49:25 UTC - in response to Message 14593.  

> Cool! Ok, folks, I posted your experiments and write ups on the developers
> mailing list ... also direct to Dr. Anderson. Obviously, we will not see an
> immediate "fix" but the check is in the mail.
>
> Janus also pointed to another possible situation that might arise related to
> this and my memory is so bad I can't remember what it was !!!!
>
> :)

Ok sounds good.

My guess, having worked with both real-time and simulation systems, is that system time changes need to be trapped/detected by keeping an artificial time within the model and using that to avoid hardware clock inadequacies. Hourly checks by the cc should be adequate, as even the most fiendish watcher of time (hehe, ME!) will only adjust hourly, I would guess. I know from my old job, where time to a thousandth of a second was important for judicial reasons, that an hour is as far as you need to go unless you have some pretty cheap hardware.
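
As a rough illustration of trapping a system time change, here is a minimal Python sketch. It swaps in a monotonic clock as the "artificial time", which is an assumption on my part rather than anything BOINC is known to do; `ClockStepDetector` and `FakeClocks` are hypothetical names, and the fake clocks exist only so the example runs without touching the real system clock.

```python
import time


class ClockStepDetector:
    """Flags wall-clock adjustments by comparing against a monotonic clock."""

    def __init__(self, wall=time.time, mono=time.monotonic, tolerance=2.0):
        self._wall = wall
        self._mono = mono
        self._tolerance = tolerance  # seconds of disagreement to ignore
        self._last_wall = wall()
        self._last_mono = mono()

    def check(self):
        """Return the wall-clock step (seconds) since the last call, or 0.0."""
        now_wall, now_mono = self._wall(), self._mono()
        step = (now_wall - self._last_wall) - (now_mono - self._last_mono)
        self._last_wall, self._last_mono = now_wall, now_mono
        return step if abs(step) > self._tolerance else 0.0


class FakeClocks:
    """Stand-in clocks: the wall clock can be stepped, the monotonic one cannot."""

    def __init__(self):
        self.wall_now = 1000.0
        self.mono_now = 50.0

    def wall(self):
        return self.wall_now

    def mono(self):
        return self.mono_now

    def advance(self, seconds):
        self.wall_now += seconds
        self.mono_now += seconds

    def set_back(self, seconds):
        self.wall_now -= seconds


clocks = FakeClocks()
detector = ClockStepDetector(clocks.wall, clocks.mono)
clocks.advance(10)
print(detector.check())  # 0.0: both clocks moved together
clocks.set_back(65)      # step the wall clock back by 65 seconds
clocks.advance(1)
print(detector.check())  # -65.0: the wall clock was stepped back
```

With the default arguments the detector reads the real clocks, so how often `check()` is called (hourly or otherwise) only affects how quickly a step is noticed, not whether it is.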
ID: 14595
old_user5994
Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 14597 - Posted: 22 Jul 2005, 15:03:28 UTC

Look here. My reply to Dr. Anderson:
------------------

YEA!

Way cool ... I will post your message on the CPDN board ... and maybe two bugs for the price of one investigation ...

On Jul 22, 2005, at 7:30 AM, David Anderson wrote:

I think I know how to fix this.
The part of the API that checks for loss of heartbeat
from the core client should use its own "tick count"
(number of timer events) rather than system time.
That should fix both the clock-changing
and the wakeup-after-sleep problem.
Thanks to everyone for investigating this.

-- David
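
To make the idea concrete, here is a toy Python comparison of a heartbeat check based on system time with one based on a count of timer events. It is only a sketch of the principle described in the message: the class names, the 30 unit timeout, and the logic are all assumptions, not the actual BOINC API code. In this toy version it is a forward jump (as after sleep) that trips the system-time check; it does not attempt to reproduce the backward-step case reported in this thread.

```python
class WallClockHeartbeat:
    """Gives up when system time is more than `timeout` seconds past the last beat."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_beat = now

    def beat(self, now):
        self.last_beat = now

    def lost(self, now):
        return now - self.last_beat > self.timeout


class TickCountHeartbeat:
    """Gives up after more than `timeout_ticks` timer events without a beat."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.ticks_since_beat = 0

    def beat(self):
        self.ticks_since_beat = 0

    def tick(self):
        # Called once per timer event; never consults the system clock.
        self.ticks_since_beat += 1

    def lost(self):
        return self.ticks_since_beat > self.timeout_ticks


wall = WallClockHeartbeat(timeout=30, now=1000.0)
ticks = TickCountHeartbeat(timeout_ticks=30)

# One timer event later, the system clock reads an hour ahead (e.g. after sleep).
ticks.tick()
print(wall.lost(now=1000.0 + 3600))  # True: looks like a lost heartbeat
print(ticks.lost())                  # False: only one timer event has passed
```

The tick counter only advances while the process is actually running its timer, so neither a clock adjustment nor time spent suspended counts towards the timeout.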


ID: 14597
old_user5994
Joined: 31 Aug 04
Posts: 239
Credit: 2,933,299
RAC: 0
Message 14598 - Posted: 22 Jul 2005, 15:08:15 UTC - in response to Message 14595.  

> My guess having worked with both real-time and simulation systems is that
> system time changes need to be trapped/detected by keeping an artificial time
> within the model and using that to avoid hardware clock inadequacies. Hourly
> checks by the cc should be adequate as even the most fiendish watcher of time
> (hehe ME!) will only adjust hourly I would guess. I know from my old job where
> time to a thousandth of a second was important, for judicial reasons, an hour
> is as far as you need to go unless you have some pretty cheap hardware.

When I worked on the OTH-B radar we scheduled behaviors about a minute and a half in the future ... and since there were 3 different sites being coordinated, the time requirement was pretty important. The transmit and receive sites were on a 90 NMi baseline with the ops center roughly in the middle... I forget what clocks were used, but one was to get us into the right basic area (of time) and then another was used to slice up that second to some finer granularity.

I forget the exact parameters (it was a decade ago that I worked on the darn thing ) but it was a beautiful radar ... but I digress again ...
ID: 14598
old_user59948
Joined: 3 Mar 05
Posts: 76
Credit: 127,896
RAC: 0
Message 14599 - Posted: 22 Jul 2005, 15:32:59 UTC - in response to Message 14597.  

> Look here My reply to Dr. Anderson:
> ------------------
>
> YEA!
>
> Way cool ... I will post your message on the CPDN board ... and maybe two bugs
> for the price of one investigation ...
>
> On Jul 22, 2005, at 7:30 AM, David Anderson wrote:
>
> I think I know how to fix this.
> The part of the API that checks for loss of heartbeat
> from the core client should use its own "tick count"
> (number of timer events) rather than system time.
> That should fix both the clock-changing
> and the wakeup-after-sleep problem.
> Thanks to everyone for investigating this.
>
> -- David

How do we invoice....LOL!
Ooooops we're volunteers!
Regards
Ian
ID: 14599
Thyme Lawn
Volunteer moderator
Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 14600 - Posted: 22 Jul 2005, 17:38:11 UTC

Nice find Ian :)

And yes, I did manage to replicate the problem by winding back 3 minutes. Apps stop a good while before the 'exited with 0 status but no finished file' message is generated, and the only thing that client debug revealed was that the message is output precisely when the time passes what it had been just before the change.
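
One hypothetical way such a stall could arise is a periodic action scheduled against system time: if the next due time was computed before the clock was wound back, nothing fires until the clock catches up to where it had been. The Python simulation below illustrates only that general pattern; it is a guess, not taken from the BOINC source, and `simulate` and its parameters are invented for the example.

```python
def simulate(interval, step_back, step_at, run_for):
    """Return the real elapsed seconds at which a periodic action fires.

    The wall clock starts at 0 and ticks once per real second. At real time
    `step_at` it is set back by `step_back` seconds. The action fires whenever
    the wall clock reaches `next_due`, which is then moved `interval` ahead.
    """
    wall = 0
    next_due = interval
    fired = []
    for elapsed in range(1, run_for + 1):
        wall += 1
        if elapsed == step_at:
            wall -= step_back  # the clock adjustment
        if wall >= next_due:
            fired.append(elapsed)
            next_due = wall + interval
    return fired


# A once-per-second action, with the clock wound back 3 minutes at t = 25 s.
fired = simulate(interval=1, step_back=180, step_at=25, run_for=210)
print(fired[23], fired[24])  # 24 205
```

In this model the action last fires at 24 seconds and resumes at 205 seconds, which is the moment the wall clock again reads 25: a stall equal to the size of the backward step, ending when the clock passes its pre-change value.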
ID: 14600


©2024 climateprediction.net