Posts by Greg van Paassen

41) Message boards : Number crunching : Cannot locate specific track - and similar disk-like failures on client side (Message 46163)
Posted 6 May 2013 by Profile Greg van Paassen
Post:
In my (anecdotal) experience, "exit 25" errors tend to be correlated with the presence of many "Suspended CPDN Monitor - Suspend request from BOINC..." messages.

For example: 1151 suspend messages in 62619 CPU seconds, an average of one suspension every 54 CPU seconds.
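The average quoted above is just CPU seconds divided by the number of suspend messages. A minimal sketch of that arithmetic, where the function name `mean_interval` is only an illustrative choice:

```python
def mean_interval(cpu_seconds: float, suspend_count: int) -> float:
    """Average CPU seconds between suspensions."""
    if suspend_count <= 0:
        raise ValueError("need at least one suspend message")
    return cpu_seconds / suspend_count


# Figures from the post: 1151 suspend messages in 62619 CPU seconds.
print(f"{mean_interval(62619, 1151):.1f} CPU seconds per suspension")
```

Rounded to the nearest second, this gives the 54 CPU seconds mentioned in the post.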

There may be a race condition. If Boinc suspends a task just after it requests a disk read, at the point where the operating system thinks it has successfully delivered the disk data to the requesting task, the disk data might vanish from in-memory buffers before the task is re-animated. (Especially if "leave suspended applications in memory" is not selected, so that the task itself is written to disk -- which uses disk buffers.)

The remedy for "exit 25s" may be the same as for "exit 193s": set Boinc preferences to (1) allow high levels of CPU utilisation and (2) leave applications in memory when suspended. That is, reduce the number of task suspensions and reduce the amount of code executed when they are re-animated.

Exit 22s seem to be different. From what I have seen, there is little correlation with anything. They may be due to power failures, or to conflicts with other software. I lost four tasks running in a virtual machine with exit 22, due to repeated power failures. (Those big switches on the power mains are so tempting to toddlers...) Interestingly, tasks running on the host machine at the same time were unscathed.
42) Message boards : Number crunching : WORTH THE TROUBLE???? (Message 46011)
Posted 21 Apr 2013 by Profile Greg van Paassen
Post:
2. Is there a tool to help estimate how configuration changes will affect performance - if one can assume some typical usage pattern for a specific machine?

Boinc's built-in benchmarking tool has given me some very strange results. It's best treated as a rough indication. The best method requires a calculator:-

Go to the web page for a task that has trickled twice since the tweak.
Calculate (CPU time of last trickle - CPU time of previous trickle) divided by (Timestep of last trickle - timestep of previous trickle).

Compare the result to the same calculation for another pair of consecutive trickles, both of which were before the tweak.

Do this for several tasks on the same computer. CPDN tasks can speed up and slow down at different stages in their "lives" -- but not always at the same stage.
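If you would rather not use a calculator, the trickle comparison can be sketched in a few lines of Python. The `Trickle` type, the function names and all the numbers below are made up for illustration; the real CPU times and timesteps have to be copied by hand from the task's web page.

```python
from typing import NamedTuple


class Trickle(NamedTuple):
    cpu_time: float  # cumulative CPU seconds shown for the trickle
    timestep: int    # model timestep shown for the trickle


def seconds_per_timestep(previous: Trickle, last: Trickle) -> float:
    """(CPU time difference) divided by (timestep difference)."""
    steps = last.timestep - previous.timestep
    if steps <= 0:
        raise ValueError("trickles must be consecutive and in order")
    return (last.cpu_time - previous.cpu_time) / steps


def percent_change(before: float, after: float) -> float:
    """Negative means the model is running faster after the tweak."""
    return 100.0 * (after - before) / before


# Hypothetical values: one pair of trickles before the tweak, one pair after.
before = seconds_per_timestep(Trickle(100000, 10000), Trickle(143200, 20800))
after = seconds_per_timestep(Trickle(200000, 40000), Trickle(238880, 50800))
print(f"before: {before:.2f} s/step, after: {after:.2f} s/step")
print(f"change: {percent_change(before, after):.1f}%")
```

Repeating this for several tasks and looking at the spread is probably more useful than trusting any single pair of trickles, for the reason given above.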

Of course this method is not instant. My i7-2600 running at stock takes about 12 hours between trickles for a HadCM3N, so doing this measurement typically requires a wait of 36 hours. (The event log will tell you when a trickle for one or more tasks has been sent; unfortunately it won't tell you which tasks. You have to check on their web pages.)

The best way to maximise your credits is to focus on stability first and foremost. Run 24x7. Run at stock. Don't run flaky software on the same box, and uninstall everything not essential. Don't allow Boinc to suspend tasks automatically -- do it manually when required. Use a good power supply. (And train visiting toddlers not to touch the oh-so-tempting big red switch on the mains supply box outside your door... :) )

As for backups: in the days when computers had one core, backups were a good idea. Nowadays it would be very difficult to restart a crashed task without affecting other tasks on the same box. I say "affecting", I mean "completely trashing". I recommend not backing up the Boinc data folder.
43) Message boards : Number crunching : Don't receive a packet (Message 45872)
Posted 10 Apr 2013 by Profile Greg van Paassen
Post:
Mamph,

KWSN - Sir Frank of the Wood pointed out how to see what's going on with the climateprediction.net servers.

To see what's going on on your own computer, in Boinc Manager change to "Advanced view" and then press Ctrl+Shift+E to get the 'Event Log' window. (Or, select "Event log..." from the "Advanced" menu.)

Read the messages there. Among them you should see one that tells you "The project has no work available" (or similar words). If you see that message, then your computer is configured correctly. All you can do is wait, along with the rest of us. :-) Messages like "Unable to connect to server" indicate that there is a problem that needs fixing.

It's a good idea to check the Event Log regularly. Sometimes software updates break things that used to work.
44) Message boards : Number crunching : Sound playback choppy with all cores crunching. (Message 45689)
Posted 22 Mar 2013 by Profile Greg van Paassen
Post:
Dave - you could try increasing the "latency" for the sound & video cards (lspci -v / setpci). It won't work for video if you're using the i3's built-in graphics, though.

http://www.mythtv.org/wiki/PCI_Latency

Also you could try changing to a 'lowlatency' kernel, but that will reduce throughput of the models by 5% or so. There's a trade-off between responsiveness and throughput, unfortunately.
45) Message boards : Number crunching : Notice: Problems with PNW 'd' series Weather at Home models issued on Feb 22 (Message 45572)
Posted 22 Feb 2013 by Profile Greg van Paassen
Post:
These models appear to have multiple issues: missing download files, and missing files within the zip files that are present and do get downloaded.

See this thread in the phpBB forums.
46) Message boards : Number crunching : Reporting - Errors while computing - (Message 45551)
Posted 12 Feb 2013 by Profile Greg van Paassen
Post:
Intel's turbo boost won't be a problem. According to Intel's literature it operates for up to two or three seconds when one process is using a lot of one core and the other cores are idle. In this situation the chip won't get too hot and unstable.

But if you are running more than one CPDN model at a time, turbo boost won't operate. And even with only one model, it will exceed the "a few seconds" time limit, so the CPU will cycle: a few seconds on turbo, then 10 or 20 seconds at normal speed, turbo for 2 or 3 seconds, back to normal... Interesting to watch, if you like that kind of thing.

Manually overclocking to the turbo boost frequency is not recommended. Together with underclocking the RAM, you may get just the results you are seeing.

Now I've given you the overclocking lecture. :-)

Several other people have reported the C++ DLL crash over the last few years (that I have seen).

Solving the problem was always difficult. Sometimes the problem was blamed on video drivers, but I can't remember whether ATI/AMD or Nvidia is the bigger suspect. Sometimes the screen saver was suspected, or other software such as Microsoft SQL Server, which will try to grab all the memory for itself. Sometimes a corrupt download of the BOINC software was suspected.

If you are confident in your video card, its drivers, and in the power supply, then the way forward is probably to disconnect from CPDN and all other projects, uninstall boinc, delete its data folder and program folder via windows explorer, download a fresh copy of boinc, and re-connect to CPDN.

But that may not work either. Some combinations of CPU, RAM, and motherboard just seem less reliable. I had a core i3 (Clarkdale) on a Gigabyte H55 board with Hynix memory that was like that. Worked perfectly for everything except CPDN.
47) Message boards : Number crunching : Reporting - Errors while computing - (Message 45549)
Posted 12 Feb 2013 by Profile Greg van Paassen
Post:
Looking at the tasks page for your computer, GuruFin, there has been a great variety of reasons for models crashing on your computer. Possibly there is more than one issue.

For best results with climate models, which stress the CPU and memory more heavily than almost anything else, and which are fussy about disk access, the following are recommended, in this order:

1. Do not overclock.

2. Ensure that your virus scanner excludes the Boinc data folder and all sub-folders. (That is the folder with two sub-folders "projects" and "slots".)

3. In Boinc preferences - disk and memory usage, ensure that "leave applications in memory when suspended" is selected, and allow Boinc to use up to 75% of memory. (At least 1 GB per running task is best; mostly 500MB works too. Mostly.) Also ensure that Boinc has enough disk space, 2 GB per CPU at least.

4. Shut down Boinc (suspend all work) when you play games that have demanding video requirements.

5. For a multi-processor system such as yours, in processor usage preferences, set Boinc to "Use at most 100% of CPU time", and control the amount of work with for example "use at most 75 % of processors" (change the 75 to whatever you like).
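As a rough cross-check of items 3 and 5, the rules of thumb above (75% of memory, 1 GB per running task, 2 GB of disk per CPU, 75% of processors) can be turned into a small calculation. This is only a sketch: the function names are illustrative, and rounding down to whole tasks is an assumption, not a statement of how Boinc itself rounds.

```python
import math


def tasks_supported(ram_gb: float, cores: int,
                    ram_fraction: float = 0.75,
                    gb_per_task: float = 1.0,
                    processor_percent: float = 75.0) -> int:
    """Tasks that fit within both the memory and the processor limits."""
    by_memory = math.floor(ram_gb * ram_fraction / gb_per_task)
    by_cores = math.floor(cores * processor_percent / 100)
    return max(0, min(by_memory, by_cores))


def disk_needed_gb(cores: int, gb_per_cpu: float = 2.0) -> float:
    """Minimum disk space to allow Boinc, at 2 GB per CPU."""
    return cores * gb_per_cpu


# Hypothetical machine: 4 GB of RAM and 4 cores.
print(tasks_supported(ram_gb=4, cores=4), "tasks;",
      disk_needed_gb(4), "GB of disk")
```

If the memory limit comes out lower than the processor limit, it is the memory figure that should decide how many models to run.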

If you have done these, and are still getting errors, your RAM may be running out of specification. Run a memory test program such as memtest86+ for at least 48 hours to check. Alternatively, the power supply for your computer may be unable to supply enough power, or the motherboard is using the not-recommended "voltage boost" feature that some have.

Edit: I should point out that there will still be some apparent failures even after doing all of this. Some climate models fail because they generate physically impossible atmospheric pressures or potential temperatures. A few others have been sent out with the wrong data files - these normally crash straight away, though.
48) Questions and Answers : Windows : Upload error (Message 44907)
Posted 26 Sep 2012 by Profile Greg van Paassen
Post:
Server status says all is okay...
Yes, it's a bit misleading. Green means that the server program is running. Unfortunately, the status-checking page is not smart enough to look at disk space.
49) Message boards : Number crunching : Uploads not working (Message 44866)
Posted 21 Sep 2012 by Profile Greg van Paassen
Post:
What you say is true for data that must remain readily accessible, Eirik. But it seems to me (from the outside) that CPDN's main requirement is for somewhere to put data that no-one has wanted during the last few months, and that is unlikely to be wanted for the next few months or years -- but it might be wanted sometime. Most likely, when a scientist does want it, they'll be able to give plenty of notice.

Back in the day, IBM used to sell the concept of tiered storage: on-line, near-line and off-line. The idea was that 'hot' data would stay on the on-line storage, and when people stopped accessing it it would migrate to progressively less responsive (but cheaper) storage.

Of course IBM sold fancy systems to 'migrate' unneeded data automatically. But I don't think CPDN needs that. It does need some kind of systematic archiving process, though.

I'd caution that archiving is an ongoing process, not a one-time event, and resources should be allocated and processes set up accordingly.

For non-critical data such as CPDN run results, two copies on consumer-grade storage, kept in separate file store-rooms in separate buildings in separate campuses and tested annually, should provide enough of a guarantee of future accessibility.

100 TB of non-critical offline storage is then some checksum files, a hard-back book, a label maker, 100+ 2TB disks and a USB3 dock, and two cupboards -- plus a high-school student volunteer for a few weeks each year (to stock-take and checksum the archives, replace any failed disks and archive new data). And the instructions for the student.
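The "checksum files" part of that could be as simple as the sketch below. It assumes SHA-256 and one manifest per disk; neither is anything CPDN has specified, and the function names are illustrative only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large archives need little memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the files that are missing or whose checksum has changed."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items()
                  if current.get(name) != digest)
```

The annual check would then be: build the manifest when a disk is first written, keep a copy of it somewhere other than on that disk, and each year run `verify` against the mounted disk. An empty list means the copy is still good.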
50) Message boards : Number crunching : Upload Failure (Message 44388)
Posted 13 Jun 2012 by Profile Greg van Paassen
Post:
Hi glaesum,

I'm in NZ, so it often takes me a while to catch up with the day's posts.

How are things now? I gather from other posts that most people's files are getting through now. Certainly mine have been, since about 6 hours ago when I turned networking back on.

The two tests I gave just confirm that there is a path through the internet from your machine to the cpdn-upload2 server. But there are other servers, and maybe the problem is with one of them. If you are still having problems, try re-starting the boinc service - rebooting your machine is the easiest way of doing that. It worked for me, once! :) If that doesn't work, the next step is to set up "http transfer debugging" in cc_config.xml, so we can get more information about what boinc is trying to do, and where it's trying to do it.
51) Message boards : Number crunching : Upload Failure (Message 44347)
Posted 10 Jun 2012 by Profile Greg van Paassen
Post:
Reed, welcome to climateprediction.net, a.k.a. CPDN.

You've already discovered that this is not a 'set and forget' project, nor even 'plug and play'. :-/

If you have a data cap on your ISP account, I recommend doing one of two things pro tem., to prevent your data cap being used up with fruitless upload attempts by BOINC. For either of these you'll need to change to the BOINC Manager's "advanced view".

Either: in the activity menu, select "Network Activity Suspended". This will isolate your BOINC work from the internet entirely.

Or, in the "Tools" menu, select "Computing Preferences". On the dialogue box that pops up, select the "network usage" tab. Just below half-way, there is a section "Network Usage Allowed". Set this to only allow network usage for one or two hours per day. That will limit the damage but still upload trickles and allow BOINC to download work ... once the project has some more.

52) Message boards : Number crunching : Upload Failure (Message 44346)
Posted 10 Jun 2012 by Profile Greg van Paassen
Post:
Of course that affects all projects. And of course there is no option in the 'Projects' tab to hold uploads for an individual project, and never will be...

I submitted a BOINC client patch to implement that almost 3 years ago ...
Precisely, T. L. That was ... three, nearly four BOINC versions ago?

I guess we should be grateful that BOINC still mostly does the basics OK. (Except on Macs, where the BOINC project randomly changes the required ownership and permissions on the client's data folder from version to version.)

:-/
53) Message boards : Number crunching : Upload Failure (Message 44337)
Posted 9 Jun 2012 by Profile Greg van Paassen
Post:
To save bandwidth/data caps being wasted one could set 'Network Activity Suspended' in the Boinc Manager's Activity menu.

Of course that affects all projects. And of course there is no option in the 'Projects' tab to hold uploads for an individual project, and never will be...
54) Message boards : Number crunching : Upload Failure (Message 44318)
Posted 6 Jun 2012 by Profile Greg van Paassen
Post:
All flowing smoothly here, too. :)
55) Message boards : Number crunching : Upload Failure (Message 44309)
Posted 6 Jun 2012 by Profile Greg van Paassen
Post:
Harri - try this:

Open another tab in your browser and go to this address: http://cpdn-upload2.oerc.ox.ac.uk.

The response should be a page that says
Climate Prediction.net Upload Server

This server is part of the Climate Prediction.net project. Please visit climateprediction.net to participate.
If you don't get that, there is a connection problem - perhaps a blacklisting. To check, open a command prompt (Win+R, then cmd.exe) and run
tracert cpdn-upload2.oerc.ox.ac.uk

The trace should complete in less than a minute, listing fewer than 30 numbered lines and with not too many '*'s in the output.

If that is OK, you'll need to set the file_xfer_debug and http_xfer_debug options in cc_config.xml and get boinc to re-read it. Let me know if you want help with that.
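For reference, here is a sketch of what that cc_config.xml might contain, generated with Python so that the XML is well-formed. The layout (a `log_flags` section inside `cc_config`, with each flag set to 1) is my understanding of the Boinc client configuration format; check it against the Boinc documentation for your version. The function name `build_cc_config` is just illustrative.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags=("file_xfer_debug", "http_xfer_debug")) -> str:
    """Return cc_config.xml text with the given log flags switched on."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    ET.indent(root)  # pretty-printing; needs Python 3.9 or later
    return ET.tostring(root, encoding="unicode")


print(build_cc_config())
```

The printed text would be saved as cc_config.xml in the Boinc data folder. If a cc_config.xml already exists there, merge the two flags into its existing log_flags section rather than overwriting the file.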
56) Message boards : Number crunching : Upload Failure (Message 44305)
Posted 6 Jun 2012 by Profile Greg van Paassen
Post:
Thanks again, Les.

I suspended all work and re-started the boinc client (i.e., the background process, not the boinc manager). That seems to have done the trick --- seven failures, but the eighth file uploaded OK and now the ninth file is uploading.

EDIT: The successful uploads are going to cpdn-upload2.oerc.ox.ac.uk. The problems are with uploader1.atm.ox.ac.uk.
57) Message boards : Number crunching : Upload Failure (Message 44298)
Posted 5 Jun 2012 by Profile Greg van Paassen
Post:
Thanks, Les, I appreciate the help.

I added the time-out option to cc_config.xml, and set Boinc to try three transfers at a time as well. It's not helping, though: when I turned network activity back on, the first three files changed status to 'Uploading', and then within two seconds changed back to 'Retry in 09:30:00'. (With the time being different for each, of course.) Since the upload fails straight away it can't be a time-out problem.

It's not the "hnndler" problem, either. The problem is with uploader1. Its http server appears to be running OK, but attempting to access the file upload handler returns HTTP 500, internal server error.

I've set the http_xfer_debug option and I will see what new insight that brings when the first time-out expires.
58) Message boards : Number crunching : Upload Failure (Message 44294)
Posted 4 Jun 2012 by Profile Greg van Paassen
Post:
Not exactly "transitory" in the usual meaning of the word... I have had the same two EU zip files failing to upload since early Saturday my time - that would be Friday afternoon in UTC (Friday evening in British Daylight Time). Currently there are another 104 zips in line behind these two. Oh, well.
59) Message boards : Number crunching : had3pam_eu models not uploading (Message 44252)
Posted 28 May 2012 by Profile Greg van Paassen
Post:
hadam3p_pnw not uploading ..., reached 100%, before the upload failed.
I have this problem too, and so do others. It also happens with zip no. 13 on HadAM3P EU.

I'm sure the project team is looking into it. Meanwhile, I've set my network preferences to only communicate between 3:00 AM and 5:00 AM, to cut down on bandwidth usage.
60) Questions and Answers : Getting started : user-id and e-mail address (Message 43825)
Posted 20 Feb 2012 by Profile Greg van Paassen
Post:
If you click on "Your account" in the blue bar on the left, you will be able to see your email address.

