Posts by mmonnin

21) Message boards : Number crunching : Upload failures (Message 60628)
Posted 10 Jul 2019 by mmonnin
Post:
Thanks for the history on the cam25's. I found the one in question, and it has returned 18 zips.
https://www.cpdn.org/cpdnboinc/result.php?resultid=21709022

However, the ones that are stuck are #12 and #13. So it looks like they got lost in the shuffle.
If they have not uploaded by the time my other work has finished tomorrow, I will just can them
(as in trash can; I just realized that may not be clear to non-native English speakers).

What I've done with previous CAM25s that have persistently stuck uploads is just abort the transfers that are obviously stuck. After aborting the transfers, it will report the task, possibly as a success, and the scientists can determine whether the output is useful without the missing zips. I've only done this with the CAM25 tasks however since those are the ones that seem to occasionally have the rogue stuck uploads.


This worked for me too. Trickle _15 was stuck at 51%. After reading this I checked and 18 trickles were uploaded according to the task stats. Aborted and it updated to successful.
22) Message boards : Number crunching : Upload failures (Message 60585)
Posted 4 Jul 2019 by mmonnin
Post:
CPDN has a track record of computation errors when suspending tasks, so I sure ignored that part. I don't blame anyone else for it either.

I left the 2 that had not started yet suspended, but I let the ones that had started complete, even if the uploads will take a while.


I've never seen a problem from suspending tasks if BOINC isn't stopped and restarted. It has also been a long time since I lost a Windows task even when doing that.

Just to reiterate, so the information stays near the top of the thread: between clearing the data and the backlog of people still uploading data, some of whom have several hundred gigabytes, it could easily be a week or more before the problems stop completely. Also, there is no need to suspend any tasks other than sam50's, as they go to different servers.


How soon you forget. You started this thread.
https://www.cpdn.org/cpdnboinc/forum_thread.php?id=8701#59554

CPDN tasks are some of the most fragile tasks of all the BOINC projects. Most other projects' tasks have no issues suspending, or at least they go back to the last checkpoint. Even if CPDN tasks did go back to the last checkpoint, no one wants to lose several days of work. There's a higher chance of losing work from suspending than from a task's trickle upload being lost.
23) Message boards : Number crunching : Upload failures (Message 60576)
Posted 3 Jul 2019 by mmonnin
Post:
CPDN has a track record of computation errors when suspending tasks, so I sure ignored that part. I don't blame anyone else for it either.

I left the 2 that had not started yet suspended, but I let the ones that had started complete, even if the uploads will take a while.
24) Message boards : Number crunching : Upload failures (Message 60511)
Posted 30 Jun 2019 by mmonnin
Post:
The client version has nothing to do with a full disk on a project server.
25) Questions and Answers : Getting started : "Communications deferred" ... continuously (Message 60451)
Posted 26 Jun 2019 by mmonnin
Post:
See this thread. You're not alone.
https://www.cpdn.org/cpdnboinc/forum_thread.php?id=8744
26) Message boards : Number crunching : Upload failures (Message 60435)
Posted 24 Jun 2019 by mmonnin
Post:
All of my Linux zips on three machines have gone, so that is progress. But I have over 50 WAH2 zips on my Windows machine still stuck. It seems to be discrimination against North America.


Same here. SAM50 and SAFR50 are pending. Those files are several times as big. ~17MB compared to 76/92MB each.
27) Message boards : Number crunching : Free-DC reports negative credits today for CPDN (Message 60434)
Posted 24 Jun 2019 by mmonnin
Post:
A week or two ago someone also reported that Free-DC cannot get new stats from CPDN and that the issue is on the CPDN site. So issues ;)


The difference here is that credit reversed on Free-DC while other weeks there was just no update at Free-DC. This time the CPDN site was down. I'm not sure how often Free-DC queries CPDN. It might be a week until it goes back to the data from a couple of weeks ago.

Free-DC also has 2 sets of data to update and display the site. At times they can get out of sync.
28) Message boards : Number crunching : Free-DC reports negative credits today for CPDN (Message 60420)
Posted 24 Jun 2019 by mmonnin
Post:
It just loaded an old copy of data since the site was down when it tried to pull stats. It's happened plenty of times at Free-DC. Mine went back to a familiar # for me.
29) Message boards : Number crunching : New Model Type HadAM4 (Message 59584)
Posted 11 Feb 2019 by mmonnin
Post:
It's hard for them not to be interrupted when there isn't enough CPDN work to fill the queue, they have a runtime of 18 days, and they have a one-year deadline. Something will end up interrupting them.
30) Message boards : Number crunching : New Model Type HadAM4 (Message 59580)
Posted 11 Feb 2019 by mmonnin
Post:
I realized that, due to these CPDN tasks, tasks from other projects were taking many times as long, so I paused all but one. Today I received tasks from BURP, which paused that one task since they have a short deadline and are mt (multithreaded). Upon resuming, it had an error, along with every other one I had. Pretty much a waste of time.

Model crashed: READDUMP: BAD BUFFIN OF DATA
Sorry, too many model crashes! :-
31) Message boards : Number crunching : New Model Type HadAM4 (Message 59566)
Posted 8 Feb 2019 by mmonnin
Post:
My Linux PC grabbed 6 of these this morning. About 18.5 day ETA. Still running after 12 hours. ~630 MB of memory usage.
32) Message boards : Number crunching : The big crash of 2018, and credits lost (Message 58869)
Posted 19 Oct 2018 by mmonnin
Post:
It should be feasible to search for tasks with a completed status and credit = 0.
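
To illustrate the kind of search meant here, below is a rough sketch only. It assumes a simplified, hypothetical `result` table with `outcome` and `granted_credit` columns and an assumed "success" value of 1; the real project database may be laid out differently, and nothing here is CPDN's actual credit script.

```python
import sqlite3

# Hypothetical, simplified stand-in for a project's results table.
# The column names and the "success" outcome value are assumptions
# made for illustration only.
OUTCOME_SUCCESS = 1


def find_zero_credit_successes(conn):
    """Return (id, name) for results marked successful but granted no credit."""
    cur = conn.execute(
        "SELECT id, name FROM result "
        "WHERE outcome = ? AND granted_credit = 0 "
        "ORDER BY id",
        (OUTCOME_SUCCESS,),
    )
    return cur.fetchall()


def build_demo_db():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE result (id INTEGER PRIMARY KEY, name TEXT, "
        "outcome INTEGER, granted_credit REAL)"
    )
    conn.executemany(
        "INSERT INTO result VALUES (?, ?, ?, ?)",
        [
            (1, "task_a", 1, 1234.5),  # successful, credited
            (2, "task_b", 1, 0.0),     # successful, no credit
            (3, "task_c", 3, 0.0),     # not successful, no credit
            (4, "task_d", 1, 0.0),     # successful, no credit
        ],
    )
    return conn


if __name__ == "__main__":
    for row in find_zero_credit_successes(build_demo_db()):
        print(row)
```

The list this returns would be the set of tasks to review for some arbitrary or recalculated credit.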
33) Message boards : Number crunching : Download server (Message 58812)
Posted 26 Sep 2018 by mmonnin
Post:
I'll third that.
34) Message boards : Number crunching : New work Discussion (Message 58789)
Posted 20 Sep 2018 by mmonnin
Post:
There is now a batch of 28 month pnw25 tasks out there. Batch 757

Some of these are already on their third attempt with a combination of download errors and being aborted. I suspect this is the download server being out of action referred to elsewhere.


I've had some of these for 3 days waiting on the download server.
35) Message boards : Number crunching : Why no credit in Statistics? (Message 58779)
Posted 19 Sep 2018 by mmonnin
Post:
Yes, the March/April stuff may be a problem, as there was no "real" server to return data to around about that time.

Which is why I kept posting that people shouldn't try and get new work, or return old results.

Still, I don't think that recovery is over yet.


Those are the results I mentioned as being lost, but that was dismissed on the grounds that old results get transferred elsewhere. With the server up and down, though, they probably couldn't be saved elsewhere.

I had returned some work during that time. Besides using a hosts file to inhibit access to just this site, there wouldn't be a way to stop uploads from happening while still running another BOINC project. And once the site came back up, the restored db probably had no record of users having said tasks anyway. They would have uploaded and been thrown away.

Similar thing just happened to Cosmology and any work in queue was dismissed when the server came back. Here it at least shows some record.
36) Message boards : Number crunching : Download server (Message 58777)
Posted 19 Sep 2018 by mmonnin
Post:
Yes I have 5 stuck in download for several days. Some files partially downloaded. Upload trickles are still working.

The whole project site went down the other day but these are still stuck. Abort them?
37) Message boards : Number crunching : Why no credit in Statistics? (Message 58713)
Posted 6 Sep 2018 by mmonnin
Post:
Now awaiting the clamour from those who think they should have more.


There should be more. I posted a link of a task returned for zero credit. Most of us pay out of pocket to support projects. At least provide some arbitrary credit that says volunteers helped.
38) Message boards : Number crunching : Why no credit in Statistics? (Message 58701)
Posted 6 Sep 2018 by mmonnin
Post:
It looks like there was a credit update for the work done since the project came back up? Just the new work, it seems. Does that fit with others' estimates?

Some tasks returned while the server was up and down in March seem to have been lost.

https://www.cpdn.org/cpdnboinc/workunit.php?wuid=11531038
39) Message boards : Number crunching : Lots of pending WUs (Message 58684)
Posted 3 Sep 2018 by mmonnin
Post:
or if we got credits it wouldn't really be that big a deal...


Once things are fully sorted, credit will be granted again. There was an attempt to run the credit script last week but it didn't complete. Andy has started to go through the logs looking for the problem. I think he said it got to about the 7 hour point before it failed.


Maybe it should be run more often so it doesn't take so long. *GASP*
40) Message boards : Number crunching : Why no credit in Statistics? (Message 58618)
Posted 16 Aug 2018 by mmonnin
Post:
I don't know how many credits you think that you lost; get over it. I lost about 1.3 million.


I KNOW how much. Get over yourself. It's a shame you think it's OK for actual science to be lost.

