climateprediction.net
Upload Failure

Message boards : Number crunching : Upload Failure

old_user633787

Joined: 14 Sep 10
Posts: 11
Credit: 1,812,972
RAC: 0
Message 44208 - Posted: 21 May 2012, 17:43:57 UTC - in response to Message 44207.  

The upload handler string does not appear to be misspelled on the problem systems. It reads "http://cpdn-restarts.oerc.ox.ac.uk/cgi-bin/file_upload_handler". The problem is that the upload shouldn't go to "cpdn-restarts.oerc..." at all; it should go to "uploader.oerc...". I read this at http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=5447&nowrap=true#44076, fixed the client_state.xml files, and everything is uploading fine now.

Please update the server-status page to indicate which files go where, so that users can fix this problem more easily.
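Until such a list exists, a read-only check can at least show which host each pending file is addressed to. The Python sketch below assumes that client_state.xml parses as ordinary XML and that each upload is described by a `<file>` element (possibly `<file_info>` on older clients) with `<name>` and `<upload_url>` children, as in the excerpt quoted later in this thread. The function name `uploads_by_host` is hypothetical. It only reads the text it is given and never writes anything back.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlparse

# Assumption: upload entries are <file> elements (older BOINC clients
# may use <file_info>) with <name> and <upload_url> children.
FILE_TAGS = ("file", "file_info")


def uploads_by_host(client_state_xml: str) -> Dict[str, List[str]]:
    """Group file names by the host named in their <upload_url>.

    Read-only: the XML text is parsed but never written back.
    """
    root = ET.fromstring(client_state_xml)
    groups: Dict[str, List[str]] = defaultdict(list)
    for tag in FILE_TAGS:
        # iter() walks the whole tree, so nesting depth does not matter.
        for entry in root.iter(tag):
            name = entry.findtext("name")
            url = entry.findtext("upload_url")
            if not name or not url:
                continue  # skip entries with no upload URL (typically input files)
            host = urlparse(url.strip()).hostname or "(no host)"
            groups[host].append(name.strip())
    return dict(groups)
```

One way to use it is to pass in the text of a copy of client_state.xml taken from the BOINC data directory. Working on a copy is the safer habit, because the running client rewrites the live file.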
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 44209 - Posted: 21 May 2012, 19:53:01 UTC - in response to Message 44208.  

Please update the server-status page to indicate which files go where, so that users can fix this problem more easily.

What you say makes perfect sense to me. However, I am sure there are more than a few who would make a total hash of this if the information were that easily available. Making people ask questions here probably weeds out some of those who would produce the proverbial dog's dinner.

I also wonder why it was going to the wrong server. Is this to do with the fixing of the server problems recently, in particular the interim fix before it was properly fixed? If so, problems due to that source have in the past been fixed at the CPDN end and the uploads have eventually gone through.

Dave
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 44210 - Posted: 21 May 2012, 19:57:24 UTC - in response to Message 44209.  

And despite what I just said, it would be nice to see what was meant to go where. Then, if a server were down, we could see whether or not that was the cause of our problems, rather than just assuming.

Dave
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 44211 - Posted: 21 May 2012, 22:23:40 UTC

The mods do have a list, but it's a bit dynamic.

The real problem is that during the recent server failure, redirects were put in place for all servers, so that in the event of future problems the data could quickly be sent to another server elsewhere in Oxford.
It appears that this needs to be checked again, against the current working server list.

The upload server used for every upload is listed in client_state.xml, where BOINC uses it to address its mail.

The labels on the servers are (mostly) the physical location of each server, which makes it easier to know where to go when a server needs a physical reboot, as opposed to an SSH wakeup.

I've just informed the project, so it'll get looked at in half a day. But there may be more data sets that are affected, in which case, it shouldn't be a problem for too long.


Backups: Here
old_user633787

Joined: 14 Sep 10
Posts: 11
Credit: 1,812,972
RAC: 0
Message 44212 - Posted: 21 May 2012, 23:01:22 UTC - in response to Message 44211.  

I've just informed the project, so it'll get looked at in half a day. But there may be more data sets that are affected, in which case, it shouldn't be a problem for too long.

Thanks. It does seem very likely that others are also experiencing this problem, perhaps without noticing it.
Snow Crash

Joined: 19 Dec 09
Posts: 4
Credit: 256,301
RAC: 0
Message 44237 - Posted: 28 May 2012, 8:49:47 UTC

5/28/2012 4:47:56 AM | climateprediction.net | [error] Error reported by file upload server: can't open file /storage/cpdn-restarts/incoming/uploader/hadam3p_pnw_azhc_1967_1_007885059_1_13.zip: No such file or directory
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 44238 - Posted: 28 May 2012, 8:52:23 UTC - in response to Message 44237.  
Last modified: 28 May 2012, 8:58:27 UTC

It's a power failure in one of the buildings.

edit
Which now seems to be fixed. 2 of my zip 13s are currently uploading.
Backups: Here
Snow Crash

Joined: 19 Dec 09
Posts: 4
Credit: 256,301
RAC: 0
Message 44246 - Posted: 28 May 2012, 13:30:20 UTC - in response to Message 44238.  

Sure, they upload... the client says so, but once it gets to 100% I still get the same error.
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 44247 - Posted: 28 May 2012, 13:38:43 UTC - in response to Message 44246.  

Servers keep going up and down. My guess is that they are rebuilding the RAID array in case it went off during a write, but that is no more than a guess.

Dave
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 44248 - Posted: 28 May 2012, 13:49:09 UTC - in response to Message 44247.  

Zip1 has just gone through without an error message afterwards; waiting to see about the 13s.
Profile ex_brit
Joined: 26 Aug 04
Posts: 84
Credit: 351,331
RAC: 0
Message 44249 - Posted: 28 May 2012, 18:07:37 UTC
Last modified: 28 May 2012, 18:07:58 UTC

Well, that cpdnupload2.oerc server keeps going up and down like a yo-yo, so my WUs never complete the upload.
ChinookFoehn

Joined: 7 Aug 04
Posts: 83
Credit: 410,895
RAC: 0
Message 44250 - Posted: 28 May 2012, 19:00:35 UTC

All upload servers are shown as operating, yet I still cannot complete the 13.zip upload of the PNW unit. It keeps reaching 100% and then fails.

-ChinookFöhn
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 44251 - Posted: 28 May 2012, 19:09:53 UTC - in response to Message 44250.  

As Les helpfully reminded me in this thread: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/forum_thread.php?id=7378

Zip13s go to the restart server to generate the next data set. Following the power outage in the server room, the server has been going on and off a lot today; I would guess that is to do with the work they are doing to restore it to full functionality. In the last 10 minutes it has gone from down to up again on the server status page, but I think I will keep network activity suspended till tomorrow and try again then.

Dave
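Going by the rule quoted above (zip13s go to the restart server), a short Python sketch can flag whether a final zip is addressed to the restart host. The assumptions are loud ones: that the trailing `_NN.zip` in a result file name is the zip number, that 13 is the final zip for these models, and that the restart host is the `cpdn-restarts.oerc.ox.ac.uk` name seen in client_state.xml. The constants and function names are hypothetical, and nothing here is taken from project documentation.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Assumptions drawn from this thread, not from project documentation.
FINAL_ZIP = 13
RESTART_HOST = "cpdn-restarts.oerc.ox.ac.uk"

# Trailing "_<digits>.zip", e.g. the "_13.zip" in
# hadam3p_pnw_azhc_1967_1_007885059_1_13.zip
ZIP_SUFFIX = re.compile(r"_(\d+)\.zip$")


def zip_number(filename: str) -> Optional[int]:
    """Return the trailing zip index, or None if the name has none."""
    match = ZIP_SUFFIX.search(filename)
    return int(match.group(1)) if match else None


def final_zip_targets_restart(filename: str, upload_url: str) -> Optional[bool]:
    """For a final zip, say whether its upload URL points at the restart host.

    Returns None for any other file, because the thread only states
    where zip13s should go.
    """
    if zip_number(filename) != FINAL_ZIP:
        return None
    return urlparse(upload_url).hostname == RESTART_HOST
```

Returning `None` for other zips is deliberate: the thread says where the 13s go but does not pin down a destination for the rest.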
Snow Crash

Joined: 19 Dec 09
Posts: 4
Credit: 256,301
RAC: 0
Message 44254 - Posted: 29 May 2012, 8:54:49 UTC

Every time I check the server status page, it shows all servers as "running"; I do not see a server named "restart", and my final upload is not being accepted.

It would be nice if they turned uploads off entirely, so I don't have to keep re-uploading the file only to have it fail at the end.

Does the project staff typically post when these types of problems occur or get corrected?
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 44256 - Posted: 29 May 2012, 9:27:27 UTC - in response to Message 44254.  

This problem was being discussed in the adjacent thread, had3pam_eu models not uploading, before you started to post.
Look there for news.

Sometimes one of the two project people post, but that's in the News and Announcements thread.
Most of the time they leave it to the moderators to pass on news. In both directions.

You'll only see the cpdn-restarts server name mentioned in the client_state.xml file, because all of the servers have aliases to make it simpler and faster to redirect data to an alternative server when one fails for any reason.


Backups: Here
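One way to see the aliasing described above is to resolve each public name and group the names that land on the same address. The Python sketch below assumes the aliases are DNS names (the thread does not say how they are implemented) and uses the standard `socket.gethostbyname`. The resolver is a parameter so the grouping can be tried offline, and the function name `group_by_address` is hypothetical. The host names are those listed in the maintenance notice later in this thread.

```python
import socket
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_address(
    hostnames: Iterable[str],
    resolve: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, List[str]]:
    """Map each resolved IPv4 address to the host names that point at it.

    Names that fail to resolve are collected under "unresolved".
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in hostnames:
        try:
            address = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            address = "unresolved"
        groups[address].append(name)
    return dict(groups)


# Host names from the maintenance notice in this thread.
CPDN_HOSTS = [
    "cpdn-uploads2.oerc.ox.ac.uk",
    "uploader.oerc.ox.ac.uk",
    "cpdn-restarts.oerc.ox.ac.uk",
    "climateapps1.oerc.ox.ac.uk",
]
```

Names that share an address at the moment of the lookup are probably aliases of one machine. Because redirects can change at any time, the result is only a snapshot.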
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 44257 - Posted: 29 May 2012, 9:37:18 UTC - in response to Message 44254.  
Last modified: 29 May 2012, 9:39:37 UTC

My guess is that turning the uploads off completely might interfere with something else, or it may just be that, under-resourced as the project is, they have more important things to do, such as working on resolving the problem. If you look in your client_state.xml file, you will see entries such as these, which give hints as to which server is being used.

<file>
<name>hadam3p_pnw_c3rx_1988_1_007937852_0_13.zip</name>
<nbytes>0.000000</nbytes>
<max_nbytes>150000000.000000</max_nbytes>
<status>0</status>
<upload_url>http://cpdn-restarts.oerc.ox.ac.uk/cgi-bin/file_upload_handler</upload_url>
</file>
I think this refers to uploader.oerc, but it could be upload2.oerc.

It was suggested in another thread that a drive-mounting problem or a daemon problem may be causing the uploads to fail at 100%. The best thing is to turn off network access for BOINC until things are resolved. If you are running multiple projects, there is a way to do it for just one of them, but as I only run CPDN I have never learned it.

Beat me to it Les, feel free to contradict anything I may have got wrong on the technical side.
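For anyone who prefers the command line to BOINC Manager, the global network switch can be flipped with BOINC's `boinccmd` tool, whose `--set_network_mode` option takes `always`, `auto` or `never`. The Python wrapper below is only a sketch. It assumes `boinccmd` is on the PATH and can reach the local client, which depending on the setup may mean running it from the BOINC data directory or supplying the client's password. The helper names are hypothetical, and it does not cover the per-project case mentioned above.

```python
import shutil
import subprocess
from typing import List

# Modes accepted by `boinccmd --set_network_mode`.
NETWORK_MODES = ("always", "auto", "never")


def network_mode_command(mode: str) -> List[str]:
    """Build the boinccmd argument list for a network-mode change."""
    if mode not in NETWORK_MODES:
        raise ValueError(f"mode must be one of {NETWORK_MODES}, got {mode!r}")
    # The optional duration argument is omitted, intending the change to
    # last until it is switched back (e.g. with mode "auto").
    return ["boinccmd", "--set_network_mode", mode]


def set_network_mode(mode: str) -> None:
    """Run boinccmd; raises if it is missing or exits with an error."""
    if shutil.which("boinccmd") is None:
        raise FileNotFoundError("boinccmd not found on PATH")
    subprocess.run(network_mode_command(mode), check=True)
```

In use, `set_network_mode("never")` would hold uploads back while the servers are unsettled, and `set_network_mode("auto")` would hand control back to the normal preferences afterwards.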
Snow Crash

Joined: 19 Dec 09
Posts: 4
Credit: 256,301
RAC: 0
Message 44262 - Posted: 29 May 2012, 17:07:21 UTC - in response to Message 44256.  

This problem was being discussed in the adjacent thread, had3pam_eu models not uploading, before you started to post.
Look there for news.

Sometimes one of the two project people post, but that's in the News and Announcements thread.
Most of the time they leave it to the moderators to pass on news. In both directions.

You'll only see the cpdn-restarts server name mentioned in the client_state.xml file, because all of the servers have aliases to make it simpler and faster to redirect data to an alternative server when one fails for any reason.


I get that you are busy and covering lots of ground, but your tone kind of sounds like I'm at fault for not reading a thread about EU models when I'm talking about finals for PNW. Perhaps redirecting me to that thread first would have worked better for all. I guess it doesn't really matter much, as there are still reports saying it's not working.

While I understand short budgets and time constraints, being an IT professional myself, this has turned into just one more typical alienating BOINC forum experience... love the science, frustrated with the communication.
Profile Iain Inglis
Volunteer moderator

Joined: 16 Jan 10
Posts: 1079
Credit: 6,906,534
RAC: 6,466
Message 44263 - Posted: 29 May 2012, 17:49:12 UTC - in response to Message 44262.  

... While I understand short budgets and time constraints as an IT professional myself ...

The moderators are volunteers and have no budgets; their time is as constrained as their real life and motivation allow.

On the other (PHP) board there was recently an exchange in which a user posted a question and got what seemed to me to be a very complete answer from another user, which provided not only the direct information required but also the context in which to interpret the answer. The first user then posted back to complain about "long-winded waffle". Provide a direct answer, as Les did here, and that will offend someone else for being abrupt. Essentially, you can't win.

If your complaint is about the lack of communication from project staff, then I'm with you ++ - but please give other volunteers a break.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 44267 - Posted: 29 May 2012, 22:20:30 UTC - in response to Message 44262.  

Both time and health constraints. It was getting cold, and past time to go to bed where it's warmer.

People discuss matters in lots of threads whose title doesn't always reflect a current problem. It's always a good idea to look at the most recent thread or two when their time/date stamp shows that they're recent.

As you don't seem familiar with this board, perhaps I should say:

Warning

Essential maintenance on our storage infrastructure will require our main uploads servers to go out of service at the following time:

Wed 30 May 2012; 8:30-10:30 GMT

The following servers will be affected

cpdn-uploads2.oerc.ox.ac.uk
uploader.oerc.ox.ac.uk
cpdn-restarts.oerc.ox.ac.uk
climateapps1.oerc.ox.ac.uk

I anticipate that the service will be resumed within the allocated time period, and that downloads deferred during this period will catch up over the next few hours.

If you experience any problems after this scheduled downtime, please let us know by posting on the boards.

Jonathan Miller


This message was posted to the following places:
News and Announcements (at the top of the Number crunching section)
News and Announcements (near the top of our alternative PHP board)
News (linked from the front page of this project.)

The 1st of these can be subscribed to, and you'll get an email every time something is posted there.
BUT you MUST have email messages activated.

The 2nd location, on the PHP board, requires a separate login, as it's the original board from before BOINC started and has nothing to do with this board. Any of the threads there can also be subscribed to.

The 3rd location has an RSS feed.


Backups: Here
Profile Dave Jackson
Volunteer moderator

Joined: 15 May 09
Posts: 4314
Credit: 16,380,160
RAC: 3,563
Message 44268 - Posted: 30 May 2012, 5:59:06 UTC - in response to Message 44267.  

Good to see that my zip13s have now gone through.

©2024 climateprediction.net