climateprediction.net home page
HadAM3P-HadRM3P restart loop on Windows 7

MaynardVizzutti

Joined: 29 Mar 15
Posts: 3
Credit: 859,479
RAC: 0
Message 51723 - Posted: 30 Mar 2015, 0:06:31 UTC

The program starts, runs for around 10 seconds, and fails, then restarts immediately and fails again in a never-ending loop. I couldn't find instructions on what diagnostic information to gather, but I'll be happy to collect whatever is needed.

Windows 7 SP1, 8-core Intel i7, 8GB, BOINC 7.4.42 x64. HadAM3P 7.22. It failed on the first try and has never worked on this machine.

The Coupled Model program seems to be proceeding normally, so I'll run that in the meantime. Thanks.
astroWX
Volunteer moderator

Joined: 5 Aug 04
Posts: 1496
Credit: 95,522,203
RAC: 0
Message 51724 - Posted: 30 Mar 2015, 4:22:56 UTC - in response to Message 51723.  

Hi, Maynard,

Welcome to the project and to the boards.

Checked your machine, found one task running and four aborted by user. Guaranteed: User aborts will kill tasks every time.

How many times did each task crash/restart on its own? What does your 'Messages' tab show?
"We have met the enemy and he is us." -- Pogo
Greetings from coastal Washington state, the scenic US Pacific Northwest.
MaynardVizzutti

Joined: 29 Mar 15
Posts: 3
Credit: 859,479
RAC: 0
Message 51727 - Posted: 30 Mar 2015, 21:51:31 UTC - in response to Message 51724.  

Thanks for your quick reply.

At first, I was assigned one task, which I allowed to restart for approximately 5-7 minutes before aborting. I estimate something like 30-50 restarts for that task. The system's memory-in-use display oscillated up and down with the same period, which is what made me notice in the first place.

I hoped it was an isolated incident, but on the next batch, I received three more such jobs, which I saw behaving the same way and terminated much sooner, probably within one minute.

The fourth job I received was hadcm3n_um_6.07_windows_intelx86 *32. It is running normally, but on a side note, the deadline calls for 400 hours of CPU over 92 days, which I'm not sure I can deliver. The shorter tasks had deadlines a year away and would easily make it.

Nothing appears in the BOINC event log (with only the default logging enabled), nor did I find any log files in the project/task directories. If you have instructions for enabling better logging, I'll be happy to do it.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51728 - Posted: 30 Mar 2015, 22:02:25 UTC - in response to Message 51727.  
Last modified: 31 Mar 2015, 2:19:49 UTC

As posted VERY regularly, there is NO "deadline" for returning the data. It's just an unbypassable BOINC requirement that there be one.

As for the error messages, they appear under Stderr on each model's page. Click the plus sign to expand the list.
MaynardVizzutti

Joined: 29 Mar 15
Posts: 3
Credit: 859,479
RAC: 0
Message 51730 - Posted: 31 Mar 2015, 0:53:45 UTC - in response to Message 51728.  

Thanks for pointing me to the error messages. Here is a sample:

18:41:05 (5248): BOINC client no longer exists - exiting
18:41:05 (5248): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
18:41:49 (10872): BOINC client no longer exists - exiting
18:41:49 (10872): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=9760, selfPID=9760, iMonCtr=2
18:42:00 (8224): BOINC client no longer exists - exiting
18:42:00 (8224): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
18:42:11 (6156): BOINC client no longer exists - exiting
18:42:11 (6156): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
Regional Worker:: CPDN process is not running, exiting, bRetVal = 1, checkPID=8332, selfPID=8332, iMonCtr=2

And so on.

I saw a similar sequence in another thread, but the program identified in that case as the culprit is not installed on my machine. I'll assume the virus/firewall protection is a good place to start looking and will try some things when my current task nears completion. Thanks to both of you for your help.
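
For anyone who wants to quantify the loop from a saved copy of the Stderr text, here is a minimal Python sketch. It assumes the lines look exactly like the excerpt above (an HH:MM:SS timestamp, a PID in parentheses, then the message) and that all timestamps fall on the same day, so a run crossing midnight is not handled. The function names are hypothetical and are not part of BOINC or CPDN.

```python
import re
from datetime import datetime

# Sample copied from the Stderr excerpt in this post.
STDERR_SAMPLE = """\
18:41:05 (5248): BOINC client no longer exists - exiting
18:41:05 (5248): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
18:41:49 (10872): BOINC client no longer exists - exiting
18:41:49 (10872): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
18:42:00 (8224): BOINC client no longer exists - exiting
18:42:00 (8224): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
18:42:11 (6156): BOINC client no longer exists - exiting
18:42:11 (6156): timer handler: client dead, exiting
CPDN Monitor - No 'heartbeat' from BOINC...
"""

# Matches only the "client no longer exists" line, one per failed start.
EXIT_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) \((\d+)\): BOINC client no longer exists"
)


def heartbeat_exits(text):
    """Return a list of (timestamp, pid) for each heartbeat-loss exit."""
    events = []
    for line in text.splitlines():
        match = EXIT_RE.match(line.strip())
        if match:
            when = datetime.strptime(match.group(1), "%H:%M:%S")
            events.append((when, int(match.group(2))))
    return events


def gaps_in_seconds(events):
    """Return the seconds elapsed between consecutive exits."""
    return [
        int((later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(events, events[1:])
    ]


events = heartbeat_exits(STDERR_SAMPLE)
pids = [pid for _, pid in events]
gaps = gaps_in_seconds(events)
print(len(events), "exits, PIDs:", pids)
print("seconds between exits:", gaps)
```

On the sample, this finds four exits under four different PIDs, spaced 44, 11, and 11 seconds apart, which fits a process that is relaunched and killed roughly every 10 seconds.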
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 51734 - Posted: 1 Apr 2015, 1:51:38 UTC

Your problem has been identified and, coincidentally, was also posted about by another cruncher.
I have answered him here.




