Posts by Jonathan Brier

1) Questions and Answers : Unix/Linux : Computational Error exit status 193 (0xc1) various Linux computers (Message 54474)
Posted 11 Jul 2016 by Jonathan Brier
Post:
Worth running memtest for several hours to exclude a problem with one of your memory modules.

Both computers passed memtest many times over. One is a new workstation that was stress tested for hours, and it only has issues on climateprediction.net, which I'm chalking up to the app's error handling and robustness.

The problem with this is that the researchers are no longer at Oxford. They're climate physicists from all over the planet.
The Oxford people are the IT people who look after the servers and connections.

None of these has a reason for compiling a list of errors, or how many there are of each error.
If a model runs to completion, good. If not, "why not" can be checked, and another one added to the next batch.


The physical location of the researchers should not be an issue, as the Internet allows distributed work on software.

In the interest of getting the science done efficiently and using volunteer resources ethically, someone should be monitoring the errors and making sure the needed fixes get implemented. Knowing which errors occur, and at what rate, helps direct the time investment toward the largest return in additional computing. You don't know their impact without measuring them.
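As a rough illustration of the kind of measurement I mean, here is a minimal Python sketch that tallies failed tasks by exit status. The function name and the sample data are made up for illustration; a real breakdown would have to come from the project's own result records.

```python
from collections import Counter

def error_breakdown(exit_statuses):
    """Return (status, count, share_of_all_results) rows, most frequent first."""
    # Treat exit status 0 as success; everything else counts as a failure.
    failures = [s for s in exit_statuses if s != 0]
    counts = Counter(failures)
    total = len(exit_statuses)
    return [(status, n, n / total) for status, n in counts.most_common()]

# Made-up sample of task exit statuses, not real project data.
sample = [0, 193, 193, 0, 22, 193, 0, 0]
for status, n, share in error_breakdown(sample):
    print(f"exit {status} (0x{status:x}): {n} tasks, {share:.0%} of all results")
```

Even a table this simple would show which error deserves attention first.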

Additionally, people should care because these are volunteer resources, and software is not a static thing, especially when dealing with heterogeneous environments such as those BOINC was designed for. Disregard for volunteers' electricity and for the efficient use of their hardware will breed ill will toward the project, no matter the science.


Most of the problems can be divided into 2 groups:
1) People who look at the results regularly, and
2) Those who just join and then forget about it.

Neither group should have to worry about results failing. That is on those running the project: they must make sure their software operates correctly and is robust enough to use the donated resources efficiently, with the least amount of waste. Otherwise the researchers should not be recruiting the general public. BOINC projects are designed to be installed and then left alone. Anything less and the project is not mature enough for public participation.


The 2 main problems are:
1) Those who run 64 bit Linux and don't know that they also need 32 bit libraries.
And a sub-group of these who only need one more lib, but don't check to see this.

There should be zero expectation that participants install extra libraries. Instead, the project should detect when they are missing and either not send work units to those computers or provide the necessary libraries locally. There should not be a constant stream of errors and wasted volunteer resources because the project sends work units to computers that do not have a sufficient environment. If providing the libraries locally is not possible, then the project should run in a virtual machine to have full control of the environment.

The 32-bit libraries needed on a 64-bit Linux machine are insufficiently documented, and that documentation needs to move to a more visible location than the sticky in the Linux section of the forums. The join instructions would be one additional place to note this for reference. BOINC notifications would be an appropriate starting point for notifying computers that are missing the libraries, to help bring them into compliance until an automated means is implemented. Participants shouldn't have to sift through the entire thread to find libraries that may need to be installed.
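To show that detection does not have to be complicated, here is a minimal Python sketch of a pre-flight check. It is only an assumption of how such a check might look: the directory list covers common 32-bit library locations but distributions differ, and the library names are placeholders, not the project's actual requirements.

```python
import os

# Common 32-bit library directories on 64-bit Linux; distributions differ.
LIB32_DIRS = [
    "/lib32",
    "/usr/lib32",
    "/lib/i386-linux-gnu",
    "/usr/lib/i386-linux-gnu",
]

def missing_libraries(required, search_dirs=LIB32_DIRS):
    """Return the names in `required` that are not found in any search dir."""
    return [
        name for name in required
        if not any(os.path.exists(os.path.join(d, name)) for d in search_dirs)
    ]

# Placeholder names; the real list would have to come from the project.
required = ["libstdc++.so.6", "libz.so.1"]
missing = missing_libraries(required)
if missing:
    print("Missing 32-bit libraries:", ", ".join(missing))
else:
    print("All listed 32-bit libraries found.")
```

A check like this only looks for file names in fixed directories, so it is a starting point for a notification rather than a guarantee that the app will load.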


2) Windows users who let MS update their computer whenever MS wants to, meaning a re-boot while models are running.

In all of the above, there's also failing hardware, incorrect permissions, running out of disk space due to lots of failed models taking up HD space, and not enough RAM.


The whole point of checkpointing is to resume after an unexpected interruption. The app should recognise an invalid exit and attempt to resume from the last checkpoint.
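For anyone unfamiliar with the idea, here is a minimal Python sketch of checkpoint and resume. It is a generic illustration, not how the climateprediction.net app or BOINC implements it; the function names, the JSON state, and the `stop_after` parameter (which simulates an unexpected reboot) are all invented for the example.

```python
import json
import os

def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # The rename replaces the old file in one step, so a crash mid-write
    # leaves the previous checkpoint intact.
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return the saved state, or a fresh one if no usable checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {"step": 0, "total": 0}

def run(path, n_steps, stop_after=None):
    """Continue from the last checkpoint; stop_after simulates an interruption."""
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return state  # simulated unexpected exit
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        save_checkpoint(path, state)
    return state
```

Calling `run` once with `stop_after` set and then again without it finishes the work from the saved step instead of starting over, which is the behaviour I would expect after a forced reboot.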

I'm well aware of the issues that can arise, given that I've watched BOINC evolve from day one. Given how long climateprediction.net has been using BOINC, I expect a more robust approach to project maintenance, not just a pretty website and a highly valuable scientific project that is perfectly suited to engaging the public.
2) Questions and Answers : Unix/Linux : Computational Error exit status 193 (0xc1) various Linux computers (Message 54467)
Posted 9 Jul 2016 by Jonathan Brier
Post:
Looking over past workunits, I see that a majority of my devices are exiting with a computation error, exit status 193 (0xc1). It appears to be memory related: http://boincfaq.mundayweb.com/index.php?view=238

http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1256213
http://climateapps2.oerc.ox.ac.uk/cpdnboinc/results.php?hostid=1401337

The virtualLHC project recently published a breakdown of their computational errors at http://lhcathome2.cern.ch/vLHCathome/forum_thread.php?id=1846 which was quite informative about the problems they were encountering and tackling, and which partly explained what those errors were. Could we get such a breakdown for climateprediction.net, to see how pervasive the various computational errors are across tasks or computer types?
3) Message boards : Number crunching : Any warnings about upgrade from BOINC 6.10.56 to 6.12.34 ? (Message 43146)
Posted 6 Oct 2011 by Jonathan Brier
Post:
I have not experienced any issues when upgrading. Did you shut down the old BOINC instance before installing the new version? What type of system is it: PC, Mac, etc.?