Reporting - Errors while computing -

Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 46712 - Posted: 27 Jul 2013, 2:58:48 UTC - in response to Message 46711.  

The server, which is very elderly, is having difficulties.
I saw that red error a few hours ago on a couple of pages.

I'll email the project, but it's <sigh> the weekend.
There may be downtime associated with fixing this.

ID: 46712
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,496,276
RAC: 6,460
Message 46713 - Posted: 27 Jul 2013, 6:18:01 UTC - in response to Message 46712.  

Presumably the server status page showing blank is part of the same server problem? As you say Les, "It is a weekend." At least it isn't joined on to a public holiday!
ID: 46713
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 46714 - Posted: 27 Jul 2013, 6:44:04 UTC - in response to Message 46713.  

climateapps2 contains EVERYTHING boinc.
Including this forum, so we're lucky at present.


ID: 46714
WB8ILI
Joined: 1 Sep 04
Posts: 161
Credit: 81,421,805
RAC: 1,225
Message 46715 - Posted: 27 Jul 2013, 11:00:20 UTC - in response to Message 46711.  

I have the "Server can't open log file (../log_climateapps2/scheduler.log)" error too. I haven't seen it before either.
ID: 46715
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,496,276
RAC: 6,460
Message 46718 - Posted: 27 Jul 2013, 14:04:13 UTC - in response to Message 46715.  

As you say Les, we are lucky at the moment. I noticed two trickles have not gone through, with the message "Internet access OK" etc.
ID: 46718
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 24,508,497
RAC: 1,340
Message 47262 - Posted: 9 Oct 2013, 6:23:21 UTC - in response to Message 46364.  

Hi,
It seems my workunit got stuck at the 75% mark and, as far as I understand, I should abort it. However, I want to try your suggestions with vm.swappiness etc., but I use the BOINC GUI manager and I'm not sure how to make the changes. Can you give me some advice?

Thanks
ID: 47262
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,496,276
RAC: 6,460
Message 47263 - Posted: 9 Oct 2013, 6:52:04 UTC - in response to Message 47262.  

It seems my workunit got stuck at the 75% mark


This is almost certainly one of the ways these models can crash at the decade points. Go into the graphics and see if it is stuck in a loop. If so, the only thing to do is to abort it. Once it has reported as being aborted, you may then have to go into the BOINC data folder and delete the folder for that particular model if you want to avoid it taking up space.
ID: 47263
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 24,508,497
RAC: 1,340
Message 47264 - Posted: 9 Oct 2013, 7:51:05 UTC - in response to Message 47263.  

The "Show graphics" command (button) has never worked. But I can say that the work is stuck at 75.761% and the elapsed time keeps restarting from one particular value, so this is a loop. Most probably it is the "25%" mark issue. I will follow your suggestion. Can you show me where and how to tune "the sysctls vm.swappiness, vm.dirty_background_ratio, and vm.dirty_ratio"?
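Before tuning anything, it helps to see what the three values currently are. The following is only a minimal sketch, assuming a Linux host where these sysctls are exposed as files under /proc/sys/vm; the function name and the `base` parameter are made up for illustration. It only reads the values. Changing them normally needs root, for example with `sysctl -w` or an entry in /etc/sysctl.conf, and which values suit these models is not something this sketch tries to answer.

```python
from pathlib import Path

# The three sysctl names quoted in the post above.
VM_KEYS = ("swappiness", "dirty_background_ratio", "dirty_ratio")


def read_vm_sysctls(base="/proc/sys/vm"):
    """Return {"vm.<key>": int or None}; None means the file could not be read."""
    values = {}
    for key in VM_KEYS:
        try:
            values[f"vm.{key}"] = int((Path(base) / key).read_text().strip())
        except (OSError, ValueError):
            # Not Linux, no permission, or unexpected file content.
            values[f"vm.{key}"] = None
    return values


if __name__ == "__main__":
    for name, value in read_vm_sysctls().items():
        print(f"{name} = {value}")
```

The `base` argument exists only so the function can be pointed at a test directory instead of the real /proc tree.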
ID: 47264
Steve in Pimlico
Joined: 17 Sep 04
Posts: 9
Credit: 19,604,231
RAC: 296
Message 47294 - Posted: 12 Oct 2013, 17:34:40 UTC


Dear Sir,
Every model run on my home computer reports an error.
Can anyone suggest what I should do? I've been running since 2004!

steve

Task ID | Work unit ID | Sent | Time reported or deadline | Status | Run time (sec) | CPU time (sec) | Claimed credit | Granted credit | Application
16060662 | 8607938 | 7 Oct 2013 21:10:43 UTC | 7 Jan 2014 4:37:54 UTC | In progress | --- | --- | --- | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07
16060078 | 8498903 | 7 Oct 2013 10:37:43 UTC | 8 Oct 2013 22:21:05 UTC | Error while computing | 73.38 | 0.44 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07
16060077 | 8553663 | 7 Oct 2013 10:37:43 UTC | 6 Jan 2014 18:04:54 UTC | In progress | --- | --- | --- | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07
16060069 | 8608608 | 7 Oct 2013 10:37:43 UTC | 6 Jan 2014 18:04:54 UTC | In progress | --- | --- | --- | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07
16055638 | 8455826 | 3 Oct 2013 22:36:38 UTC | 8 Oct 2013 22:27:56 UTC | Error while computing | 211,148.33 | 193,029.90 | 1,866.24 | 1,866.24 | UK Met Office Coupled Model Full Resolution Ocean v6.07
16046545 | 8626670 | 27 Sep 2013 11:32:06 UTC | 27 Dec 2013 18:59:17 UTC | In progress | --- | --- | 2,177.28 | 2,177.28 | UK Met Office Coupled Model Full Resolution Ocean v6.07
16046297 | 8626424 | 27 Sep 2013 13:15:05 UTC | 8 Oct 2013 22:27:56 UTC | Error while computing | 242,333.37 | 221,644.30 | 2,177.28 | 2,177.28 | UK Met Office Coupled Model Full Resolution Ocean v6.07
16044336 | 8624471 | 27 Sep 2013 10:31:22 UTC | 7 Oct 2013 10:37:43 UTC | Error while computing | 247,812.48 | 225,412.40 | 2,177.28 | 2,177.28 | UK Met Office Coupled Model Full Resolution Ocean v6.07
16042451 | 8622601 | 30 Sep 2013 21:24:27 UTC | 7 Oct 2013 10:37:43 UTC | Error while computing | 215,726.37 | 198,771.50 | 1,866.24 | 1,866.24 | UK Met Office Coupled Model Full Resolution Ocean v6.07
16040051 | 8620218 | 3 Oct 2013 21:35:51 UTC | 7 Oct 2013 10:37:43 UTC | Error while computing | 218,947.11 | 203,252.70 | 1,866.24 | 1,866.24 | UK Met Office Coupled Model Full Resolution Ocean v6.07
16037500 | 8617683 | 27 Sep 2013 9:30:28 UTC | 3 Oct 2013 21:35:51 UTC | Error while computing | 28,158.34 | 23,817.70 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07
16000291 | 8613251 | 2 Sep 2013 10:57:22 UTC | 2 Dec 2013 18:24:33 UTC | In progress | --- | --- | 2,488.32 | 2,488.32 | UK Met Office Coupled Model Full Resolution Ocean v6.07
15998475 | 8613934 | 31 Aug 2013 20:42:38 UTC | 2 Sep 2013 10:57:22 UTC | Error while computing | 412.77 | 296.89 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07
15998473 | 8469688 | 31 Aug 2013 20:42:38 UTC | 27 Sep 2013 8:00:17 UTC | Error while computing | 7,265.02 | 5,740.59 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07
15998470 | 8613873 | 31 Aug 2013 20:42:38 UTC | 3 Oct 2013 22:36:38 UTC | Error while computing | 32,718.18 | 27,907.19 | 311.04 | 311.04 | UK Met Office Coupled Model Full Resolution Ocean v6.07
15998347 | 8613875 | 31 Aug 2013 18:23:35 UTC | 27 Sep 2013 8:00:17 UTC | Error while computing | 2,440.82 | 2,327.43 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07
15997975 | 8528069 | 31 Aug 2013 10:25:14 UTC | 31 Aug 2013 20:42:38 UTC | Error while computing | 49.79 | 0.06 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07
15997974 | 8613961 | 31 Aug 2013 10:25:14 UTC | 31 Aug 2013 18:23:35 UTC | Error | 0.00 | 0.00 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07
15996983 | 8613807 | 31 Aug 2013 10:25:14 UTC | 31 Aug 2013 18:23:35 UTC | Error | 0.00 | 0.00 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07
15996847 | 8613676 | 31 Aug 2013 14:59:20 UTC | 31 Aug 2013 18:23:35 UTC | Error | 0.00 | 0.00 | 0.00 | --- | UK Met Office Coupled Model Full Resolution Ocean v6.07
ID: 47294
DadX
Joined: 30 Aug 06
Posts: 27
Credit: 1,513,025
RAC: 1,466
Message 47295 - Posted: 12 Oct 2013, 18:10:13 UTC

A while back I changed the various programs on my Windows machines that scan folders (Virus Scans, Indexing) to exclude the WCG directories from the real time scans. I also set BOINC to suspend work when the full disk scans are scheduled. I have had better luck completing models since then.
ID: 47295
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 4341
Credit: 16,496,276
RAC: 6,460
Message 47308 - Posted: 13 Oct 2013, 7:16:18 UTC

Steve in Pimlico

Many of your tasks have made it quite a long way through before giving up. If you click on the + under STDERR you can see the errors. Those I looked at all showed the same pattern. I suspect the most likely cause is something that puts a lock on one of the files when BOINC is trying to write to it (many antivirus programs do this). The second possibility is something very intensive being run on the computer.

It is worth excluding the BOINC data directory from any virus scans, and suspending computation and exiting BOINC before shutting down. Also, under Tools > Disk and memory usage, ensure "Leave applications in memory while suspended" is ticked.

I am sure I have missed something out. If so someone with a better memory than me will add it soon.
ID: 47308
MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 47310 - Posted: 13 Oct 2013, 9:20:17 UTC
Last modified: 13 Oct 2013, 9:28:15 UTC

I agree with the above two. Just to expand on Dave's point, after you have finished looking at your antivirus's options, if you are looking on the website, the boinc settings are found in Account / computing options.

* Suspend work while computer is in use? no

* Suspend work if CPU usage is above 0 %

* Leave tasks in memory while suspended? yes
Suspended tasks will consume swap space if 'yes'



Having these three settings means that the task will stay in memory rather than being pushed out and reloaded repeatedly. You have plenty of memory, so it should be fine to keep them in memory.
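For anyone who prefers a local preferences file to the website, the three settings above can be sketched as a small XML fragment. This is a hedged illustration, not the project's own tooling: it assumes the BOINC client reads a `global_prefs_override.xml` file with a `global_preferences` root and the tag names `run_if_user_active`, `suspend_cpu_usage` and `leave_apps_in_memory`. Check those names against the BOINC client documentation before relying on them.

```python
import xml.etree.ElementTree as ET

# Assumed tag names (verify against the BOINC documentation).
# Each comment gives the website wording from the post above.
SETTINGS = {
    "run_if_user_active": "1",    # Suspend work while computer is in use? no
    "suspend_cpu_usage": "0",     # Suspend work if CPU usage is above 0 %
    "leave_apps_in_memory": "1",  # Leave tasks in memory while suspended? yes
}


def build_override(settings=SETTINGS):
    """Return the preferences as an XML string with a global_preferences root."""
    root = ET.Element("global_preferences")
    for tag, value in settings.items():
        ET.SubElement(root, tag).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_override())
```

The sketch only builds the text. Where the file lives and how the client is told to re-read it depend on the installation, so those steps are left out.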
I'm a volunteer and my views are my own.
News and Announcements and FAQ
ID: 47310
Steve in Pimlico
Joined: 17 Sep 04
Posts: 9
Credit: 19,604,231
RAC: 296
Message 47442 - Posted: 30 Oct 2013, 1:23:11 UTC - in response to Message 47308.  

Thanks, will try this. No luck so far.
ID: 47442
glaesum
Joined: 24 Feb 06
Posts: 47
Credit: 782,082
RAC: 0
Message 47568 - Posted: 14 Nov 2013, 3:56:20 UTC

I've returned to the project after a bit of a gap.
A couple of models have errored out, one annoyingly close to the end, so I'll post a clip of the message log to gauge any opinions. I don't know what all the error codes mean.
I do most of the hygiene things anyway, though perhaps I should look at exempting Boinc data from MSE a/virus.
One model completed successfully and was stuck in the recent upload blockage and one of the others failed just after everything cleared. Just coincidence?
My tasks should be set to visible to see the stderr exit files <which I don't understand!>.

message log, starting with the successful model completing and reporting:

12/11/2013 07:30:19 climateprediction.net Started upload of hadcm3n_o1km_1980_40_008401621_1_4.zip
12/11/2013 07:30:21 climateprediction.net [error] Error reported by file upload server: Server is out of disk space
12/11/2013 07:30:21 climateprediction.net Temporarily failed upload of hadcm3n_o1km_1980_40_008401621_1_4.zip: transient upload error
12/11/2013 07:30:21 climateprediction.net Backing off 3 hr 30 min 42 sec on upload of hadcm3n_o1km_1980_40_008401621_1_4.zip
12/11/2013 11:01:04 climateprediction.net Started upload of hadcm3n_o1km_1980_40_008401621_1_4.zip
12/11/2013 11:04:50 climateprediction.net Finished upload of hadcm3n_o1km_1980_40_008401621_1_4.zip
12/11/2013 12:59:28 climateprediction.net task hadcm3n_7x75_1980_40_008454308_3 resumed by user
12/11/2013 13:02:59 climateprediction.net Restarting task hadcm3n_7x75_1980_40_008454308_3 using hadcm3n version 607
12/11/2013 21:39:31 climateprediction.net Sending scheduler request: To send trickle-up message.
12/11/2013 21:39:31 climateprediction.net Reporting 1 completed tasks, not requesting new tasks
12/11/2013 21:39:34 climateprediction.net Scheduler request completed
13/11/2013 01:15:22 climateprediction.net Task hadcm3n_o525_1940_40_008380310_2 exited with zero status but no 'finished' file
13/11/2013 01:15:22 climateprediction.net If this happens repeatedly you may need to reset the project.
13/11/2013 01:15:22 climateprediction.net Task hadcm3n_7x75_1980_40_008454308_3 exited with zero status but no 'finished' file
13/11/2013 01:15:22 climateprediction.net If this happens repeatedly you may need to reset the project.
13/11/2013 01:15:23 climateprediction.net Restarting task hadcm3n_o525_1940_40_008380310_2 using hadcm3n version 607
13/11/2013 01:15:24 climateprediction.net Restarting task hadcm3n_7x75_1980_40_008454308_3 using hadcm3n version 607
13/11/2013 01:16:28 climateprediction.net Task hadcm3n_ofqn_1900_40_008475522_1 exited with zero status but no 'finished' file
13/11/2013 01:16:28 climateprediction.net If this happens repeatedly you may need to reset the project.
13/11/2013 01:16:28 climateprediction.net Restarting task hadcm3n_ofqn_1900_40_008475522_1 using hadcm3n version 607
13/11/2013 01:20:33 climateprediction.net Sending scheduler request: To send trickle-up message.
13/11/2013 01:20:33 climateprediction.net Not reporting or requesting tasks
13/11/2013 01:20:37 climateprediction.net Scheduler request completed
13/11/2013 01:20:51 climateprediction.net Computation for task hadcm3n_ofqn_1900_40_008475522_1 finished
13/11/2013 01:20:51 climateprediction.net Output file hadcm3n_ofqn_1900_40_008475522_1_3.zip for task hadcm3n_ofqn_1900_40_008475522_1 absent
13/11/2013 01:20:51 climateprediction.net Output file hadcm3n_ofqn_1900_40_008475522_1_4.zip for task hadcm3n_ofqn_1900_40_008475522_1 absent
13/11/2013 09:25:47 climateprediction.net Sending scheduler request: To send trickle-up message.
13/11/2013 09:25:47 climateprediction.net Reporting 1 completed tasks, not requesting new tasks
13/11/2013 09:25:50 climateprediction.net Scheduler request completed
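A log like the one above is easier to read once it is boiled down to "which tasks restarted, and which output files went missing". The following is a rough sketch, assuming the message wording stays exactly as in the lines quoted here; `summarise` and `SAMPLE` are hypothetical names, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Patterns match the message wording seen in the log above.
RESTART = re.compile(r"Task (\S+) exited with zero status but no 'finished' file")
ABSENT = re.compile(r"Output file (\S+) for task (\S+) absent")

SAMPLE = [
    "13/11/2013 01:15:22 climateprediction.net Task hadcm3n_o525_1940_40_008380310_2 exited with zero status but no 'finished' file",
    "13/11/2013 01:15:22 climateprediction.net Task hadcm3n_7x75_1980_40_008454308_3 exited with zero status but no 'finished' file",
    "13/11/2013 01:16:28 climateprediction.net Task hadcm3n_ofqn_1900_40_008475522_1 exited with zero status but no 'finished' file",
    "13/11/2013 01:20:51 climateprediction.net Output file hadcm3n_ofqn_1900_40_008475522_1_3.zip for task hadcm3n_ofqn_1900_40_008475522_1 absent",
    "13/11/2013 01:20:51 climateprediction.net Output file hadcm3n_ofqn_1900_40_008475522_1_4.zip for task hadcm3n_ofqn_1900_40_008475522_1 absent",
]


def summarise(log_lines):
    """Count unexpected exits per task and collect missing output files per task."""
    restarts = Counter()
    missing = {}
    for line in log_lines:
        match = RESTART.search(line)
        if match:
            restarts[match.group(1)] += 1
            continue
        match = ABSENT.search(line)
        if match:
            missing.setdefault(match.group(2), []).append(match.group(1))
    return restarts, missing


if __name__ == "__main__":
    restarts, missing = summarise(SAMPLE)
    for task, count in restarts.items():
        print(f"{task}: {count} unexpected exit(s)")
    for task, files in missing.items():
        print(f"{task}: missing {', '.join(files)}")
```

A task that shows both an unexpected exit and missing zip files, as ofqn does here, is the one worth looking up in the stderr output.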


ID: 47568
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 47570 - Posted: 14 Nov 2013, 4:33:24 UTC - in response to Message 47568.  

13/11/2013 01:20:51 climateprediction.net Output file hadcm3n_ofqn_1900_40_008475522_1_3.zip for task hadcm3n_ofqn_1900_40_008475522_1 absent

That model, ofqn, has crashed between zips 2 & 3.

Judging by the trickle list, it was while getting the data ready to zip up (for zip3), to return to the project.

In other words, "the 25% problem". (In this case the 75% point.)
The error list shows BOINC stopping a lot, indicative of the option "Suspend work if CPU usage is above" still being set to the default of 25%.
Which is fine for other projects, but not here.
These programs DON'T like being interrupted at certain critical points.

So it's possible that you started to use the computer at that moment, the cpu load went above 25%, and BOINC, (and the model), stopped. In the case of the model, permanently.
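The arithmetic behind "the 25% problem" can be sketched as follows. It assumes a 40-year model that produces one zip per decade, which is an inference from the task names (`_40_`), the output files `_3.zip` and `_4.zip`, and the earlier remark about crashes at the decade points; the function names are hypothetical.

```python
def zip_boundaries(total_years=40, years_per_zip=10):
    """Percent-complete points at which each zip would be produced."""
    count = total_years // years_per_zip
    return [100 * (i + 1) / count for i in range(count)]


def next_boundary(progress_percent, boundaries=None):
    """First zip point at or after the given progress, or None if past the last."""
    if boundaries is None:
        boundaries = zip_boundaries()
    for point in boundaries:
        if point >= progress_percent:
            return point
    return None


if __name__ == "__main__":
    print(zip_boundaries())     # [25.0, 50.0, 75.0, 100.0]
    print(next_boundary(60.0))  # 75.0
```

Under those assumptions the risky moments are the 25%, 50% and 75% points, which matches the crash between zips 2 and 3 described above.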

ID: 47570
glaesum
Joined: 24 Feb 06
Posts: 47
Credit: 782,082
RAC: 0
Message 47576 - Posted: 14 Nov 2013, 13:40:34 UTC - in response to Message 47570.  
Last modified: 14 Nov 2013, 13:42:41 UTC

The error list shows BOINC stopping a lot, indicative of the option "Suspend work if CPU usage is above" still being set to the default of 25%.
Which is fine for other projects, but not here.
These programs DON'T like being interrupted at certain critical points.

So it's possible that you started to use the computer at that moment, the cpu load went above 25%, and BOINC, (and the model), stopped. In the case of the model, permanently.


Les, thanks for looking at things.

Current setting:

"Computing allowed"
1] while computer is in use
2] while processor usage is less than 0 percent

I'll change "Only after computer has been idle for" to 0 minutes, it was on 3.00mins.
{not entirely sure what this latter setting actually means or really remember why it was on 3.00}

also, applications are left in memory on suspend.

It's true, there does tend to be a whole lot of stuff running on the pc most times. I do need to get myself a new desktop pc which will share the workload of everything going on! :)

Pete
ID: 47576
Professor Desty Nova
Joined: 19 Sep 04
Posts: 92
Credit: 1,934,516
RAC: 390
Message 47583 - Posted: 15 Nov 2013, 9:50:29 UTC - in response to Message 47576.  


I'll change "Only after computer has been idle for" to 0 minutes, it was on 3.00mins.
{not entirely sure what this latter setting actually means or really remember why it was on 3.00}


If you have selected "while computer is in use", the setting doesn't matter.



Professor Desty Nova
Researching Karma the Hard Way
ID: 47583
glaesum
Joined: 24 Feb 06
Posts: 47
Credit: 782,082
RAC: 0
Message 47587 - Posted: 16 Nov 2013, 1:28:02 UTC - in response to Message 47583.  


I'll change "Only after computer has been idle for" to 0 minutes, it was on 3.00mins.
{not entirely sure what this latter setting actually means or really remember why it was on 3.00}


If you have selected "while computer is in use", the setting doesn't matter.


Thanks prof. Desty, I sensed some double speak in there...
ID: 47587
Ingleside
Joined: 5 Aug 04
Posts: 108
Credit: 18,983,150
RAC: 37,089
Message 47632 - Posted: 22 Nov 2013, 7:25:36 UTC - in response to Message 47576.  

"Computing allowed"
1] while computer is in use
2] while processor usage is less than 0 percent

I'll change "Only after computer has been idle for" to 0 minutes, it was on 3.00mins.
{not entirely sure what this latter setting actually means or really remember why it was on 3.00}

You're not allowed to set "has been idle for" to zero minutes, though, as has already been mentioned, this setting isn't used if you don't suspend computing for any reason. While it's possible to manually edit the preference file (either override or general) and set it to zero, if you do this the client default is used instead, and this is probably 3 minutes.

Some other settings, on the other hand, do accept zero minutes, and, a little inconsistently, zero percent for processor usage means 100%.
ID: 47632
bernard_ivo
Joined: 18 Jul 13
Posts: 438
Credit: 24,508,497
RAC: 1,340
Message 47723 - Posted: 4 Dec 2013, 17:19:49 UTC

Hi guys,
It seems my fourth project attempt got stuck, this time at 75%: it is running but there is no progress on the percentage. I have a Core Duo system with 3 GB RAM and have allocated one of the CPUs to be used by BOINC at 90% (so as not to overheat). I set BOINC to run while the computer is in use and to suspend when CPU usage is above 50%. I use my computer all the time for work and sometimes I need to shut it down 2-3 times per day. Usually I start it in the morning and shut it down at night. I have ticked "leave application in memory while suspended". But while working, at some point the whole computer becomes slow.

So far none of my 4 attempts to complete a task has been successful. I wonder whether my computations are of any use at all if not a single task has been completed.
If these models are that sensitive, isn't it possible for the client to request something that can be completed?

I tried leaving the computer running longer at the 25% and 50% points to get past these thresholds, but at 75% I could not. It computes around 2% per day, so it is hard to avoid stopping at thresholds unless I leave the computer running for more than 24 hours, which is rarely possible. Any suggestions to overcome this?

Cheers

ID: 47723

©2024 climateprediction.net