Posts by rjs5

1) Message boards : Number crunching : The uploads are stuck (Message 67410)
Posted 7 Jan 2023 by rjs5
Post:
Latest update from JASMIN Support:

'We are anticipating that it will take until Monday to get the machine back now. Sorry for the inconvenience.'


Latest communication to Andy. Sorry folks.


The UPLOADS went fine for a day or so, and now the behavior is the same "transient HTTP error" as before.

Sat 07 Jan 2023 12:53:46 AM PST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0547_1993050100_123_962_12179191_1_r1371948884_42.zip: transient HTTP error
2) Message boards : Number crunching : Tasks failing on Ubuntu 22 (Message 67381)
Posted 5 Jan 2023 by rjs5
Post:
Thanks for your idea!

Just to make it clear again:
This machine hasn't seen memory usage exceeding 30% of the 64GB while swap is at 0%.
RAM isn't an issue here.
Leave non-GPU tasks in memory has been ticked between the first and second task.

Also, I had a look at the log file, and the only thing I find is:
02-Jan-2023 17:10:34 [climateprediction.net] Aborting task oifs_43r3_ps_0094_2007050100_123_976_12192738_0: exceeded disk limit: 7590.70MB > 7168.00MB

with nothing unusual preceding it.

ATM I'm waiting for the third task to finish in hope it will be ok.


Have you looked at the BOINC Manager DISK pie chart screen? It shows Disk Usage pie charts and you can visually see how much space you have left on the disk partition.
You can increase the DISK available with the OPTIONS : COMPUTING PREFERENCES popup. Look at the DISK AND MEMORY options.

I think I remember some problems with leaving any of the 3 disk options blank. Try setting those 3 values and check the DISK USAGE pie charts.
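
If you would rather check free space from a program than from the pie chart, something like the following could do it. This is only a minimal sketch, assuming a POSIX system with statvfs; the helper names free_megabytes and has_room_for are made up for illustration, and BOINC's own accounting of its disk limits may well differ from a raw free-space figure.

```cpp
#include <sys/statvfs.h>
#include <iostream>

// Hypothetical helper: free space (in MB) available to an unprivileged
// user on the filesystem that holds `path`. Returns -1.0 on failure.
double free_megabytes(const char* path)
{
    struct statvfs info;
    if (statvfs(path, &info) != 0)
        return -1.0;
    // f_bavail = free blocks for unprivileged users, f_frsize = block size
    return static_cast<double>(info.f_bavail) * info.f_frsize / (1024.0 * 1024.0);
}

// Hypothetical helper: report whether `path` has at least `needed_mb` MB free.
bool has_room_for(const char* path, double needed_mb)
{
    double free_mb = free_megabytes(path);
    std::cout << path << ": " << free_mb << " MB free, "
              << needed_mb << " MB wanted\n";
    return free_mb >= needed_mb;
}
```

You would call has_room_for with the BOINC data directory (its location varies by distribution, so that path is an assumption) and a figure such as the 7590.70 MB reported in the log line above.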
3) Message boards : Number crunching : The uploads are stuck (Message 67103)
Posted 28 Dec 2022 by rjs5
Post:
Dear web master and project admin,
The uploads are stuck

Project communication failed: attempting access to reference site
Temporarily failed upload of oifs_.......zip: transient HTTP error
Backing off 00:02:15 on upload of oifs_......zip
Internet access OK - project servers may be temporarily down.

Paolo

I am seeing the TRANSIENT HTTP errors too. I have removed and re-added the project without success. The behavior continues on my Fedora 37 Linux system. It looks to me like there is some problem with the UPLOAD servers.



Wed 28 Dec 2022 01:18:49 PM PST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0833_2015050100_123_984_12201477_0_r1783308090_16.zip: transient HTTP error
Wed 28 Dec 2022 01:18:49 PM PST | climateprediction.net | Backing off 03:47:52 on upload of oifs_43r3_ps_0833_2015050100_123_984_12201477_0_r1783308090_16.zip
Wed 28 Dec 2022 01:18:49 PM PST | climateprediction.net | Temporarily failed upload of oifs_43r3_ps_0140_2013050100_123_982_12198784_0_r166110094_49.zip: transient HTTP error
Wed 28 Dec 2022 01:18:49 PM PST | climateprediction.net | Backing off 04:48:10 on upload of oifs_43r3_ps_0140_2013050100_123_982_12198784_0_r166110094_49.zip
Wed 28 Dec 2022 01:18:49 PM PST | climateprediction.net | Started upload of oifs_43r3_ps_0695_2012050100_123_981_12198339_0_r1801321405_52.zip
Wed 28 Dec 2022 01:18:49 PM PST | climateprediction.net | Started upload of oifs_43r3_ps_0833_2015050100_123_984_12201477_0_r1783308090_17.zip
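
To see how long the client is actually waiting between retries, the "Backing off HH:MM:SS" messages can be turned into seconds. A minimal sketch, assuming the event log lines look like the ones pasted above; the function name backoff_seconds is hypothetical and this is not part of BOINC.

```cpp
#include <cstdio>
#include <string>

// Returns the back-off interval in seconds if `line` contains a
// "Backing off HH:MM:SS" message, or -1 if it does not.
long backoff_seconds(const std::string& line)
{
    const std::string marker = "Backing off ";
    std::string::size_type pos = line.find(marker);
    if (pos == std::string::npos)
        return -1;
    int h = 0, m = 0, s = 0;
    // Parse the three colon-separated fields that follow the marker.
    if (std::sscanf(line.c_str() + pos + marker.size(), "%d:%d:%d", &h, &m, &s) != 3)
        return -1;
    return h * 3600L + m * 60L + s;
}
```

Feeding each log line through this and adding up the results would give a rough idea of how much time the uploads spend idle.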
4) Questions and Answers : Unix/Linux : Thinking about Processor Cache hit misses... (Message 64014)
Posted 29 May 2021 by rjs5
Post:
Intel has made its VTUNE Analyzer available for free with Forum support. You can get some interesting information about what the program is doing.

https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html#gs.2mf7da
5) Message boards : Number crunching : Discussion thread for new global atmospheric models (Message 59840)
Posted 18 Mar 2019 by rjs5
Post:
They can spend 10 minutes and add some test code that executes first and prints messages in the event log. The messages would tell what libraries are missing. It would be a trivial couple lines of code to add.


While I am sure the average Linux user knows more about their system than the average Windows user and so a great many of us will be able to find the 32 bit libs, with Ubuntu withdrawing support for 32 bit computing, adding those libraries will not be straightforward for some even if the event log would tell us what the missing libs were.

Also, unless I am mistaken, the "trivial couple of lines of code" would need to be in the BOINC client rather than the executables sent by CPDN. Even if a request is put in to the volunteers maintaining BOINC now, I don't see it happening in a hurry!


I was thinking of weakly linked libraries and CPDN issuing BOINC EVENT LOG messages when they detect the libraries are absent. It seems that adding the WEAK option to the link and then testing each of the libraries before plowing into computation is a one-time effort. If the user knew which libraries were missing, it would be far easier to add them.
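
A rough sketch of the kind of start-up test I mean. Note that it uses dlopen at run time instead of weak linking, which is a swapped-in technique that gets the same "tell the user what is missing" result. The function name missing_libraries is hypothetical, the library names are whatever the app needs, and it only prints to stderr; getting the text into the BOINC event log is not shown here.

```cpp
#include <dlfcn.h>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Try to load each library by name and return the names that could not
// be loaded, printing the dynamic loader's reason for each failure.
std::vector<std::string> missing_libraries(const std::vector<std::string>& names)
{
    std::vector<std::string> missing;
    for (std::size_t i = 0; i < names.size(); ++i) {
        void* handle = dlopen(names[i].c_str(), RTLD_LAZY);
        if (handle == NULL) {
            const char* why = dlerror();
            std::cerr << "missing library: " << names[i]
                      << " (" << (why ? why : "unknown error") << ")\n";
            missing.push_back(names[i]);
        } else {
            dlclose(handle);   // it loaded, so release it again
        }
    }
    return missing;
}
```

Built with -m32 (and -ldl on older glibc), a check like this would be run before the model starts, and the app would stop with the list of names if the result is not empty.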

I think the average BOINC user is able to read the BOINC log for error messages.

Whatever CPDN does is fine with me. I agree with you, I don't see it happening in the near future.
6) Message boards : Number crunching : Discussion thread for new global atmospheric models (Message 59838)
Posted 18 Mar 2019 by rjs5
Post:
the new models are going to be 32 bits. (Because this works.)


But think how boring it would be with no computers without the 32bit libs to report.


They could do something they should have done years ago.

They can spend 10 minutes and add some test code that executes first and prints messages in the event log. The messages would tell what libraries are missing. It would be a trivial couple lines of code to add.
7) Message boards : Number crunching : Credits (Message 59796)
Posted 12 Mar 2019 by rjs5
Post:
Andy had been working on a script for daily credit updates that didn't involve the massive amount of processing the weekly system does. It was being tried out on the testing site which is currently down. I will ask what the current status of this is.


It would speed up the processing if Andy deleted all the past-DEADLINE tasks still "IN PROGRESS" on computers that no longer exist. I have tasks that had DEADLINES in 2013. I suspect I am not the only one the project is tracking dead tasks for.
8) Message boards : Number crunching : New work Discussion (Message 58622)
Posted 16 Aug 2018 by rjs5
Post:
And server status page showing just one left now, though if as I suspect these are Linux or Linux and Mac only, some will come back in because of people without the 32bit libs installed. Mine are running at just over 3.5 hours/1%. Currently a bit over 0.6% completed. The danger point when some batches have failed is just before completion of the first zip.


I got one of the Linux WU on my 64-bit Linux machine and it seems to be running fine.

There is a lot of "HOW TO run 32-bit dynamic apps on 64-bit Linux" information about making sure that a 64-bit installation has the right 32-bit libraries. It seems a pretty easy way to check that the right 32-bit libraries are installed is to write a small 32-bit test app that needs the same libraries.

Seems like the KEY would be the build. 32-bit COMPILED to "a.out" with the command line that forces the correct libraries to be present:

g++ h.cpp -m32 -lpthread -ldl -lstdc++ -lm -lgcc_s -lc -lz -lnsl



Example of any c++ program (Hello World): h.cpp
cat h.cpp

#include <iostream>
using namespace std;
int main (int argc, char** argv)
{
    cout << "Hello world!" << endl;
    return 0;
}



32-bit libraries I needed for my 32-bit creation to say "Hello World":

ldd a.out
linux-gate.so.1 (0xf7fcd000)
libpthread.so.0 => /lib/libpthread.so.0 (0xf7f7d000)
libdl.so.2 => /lib/libdl.so.2 (0xf7f78000)
libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7df3000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7dd6000)
libz.so.1 => /lib/libz.so.1 (0xf7dbd000)
libnsl.so.1 => /lib/libnsl.so.1 (0xf7da2000)
libm.so.6 => /lib/libm.so.6 (0xf7ca8000)
libc.so.6 => /lib/libc.so.6 (0xf7b10000)
/lib/ld-linux.so.2 (0xf7fcf000)

Notice these are the same 32-bit libraries I needed for the CPDN application:

ldd *gnu *gnu.so
hadcm3s_8.34_i686-pc-linux-gnu:
linux-gate.so.1 (0xf7fd1000)
libpthread.so.0 => /lib/libpthread.so.0 (0xf7f81000)
libdl.so.2 => /lib/libdl.so.2 (0xf7f7c000)
libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7df7000)
libm.so.6 => /lib/libm.so.6 (0xf7cfd000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7ce0000)
libc.so.6 => /lib/libc.so.6 (0xf7b48000)
/lib/ld-linux.so.2 (0xf7fd3000)
hadcm3s_um_8.34_i686-pc-linux-gnu:
linux-gate.so.1 (0xf7f69000)
libdl.so.2 => /lib/libdl.so.2 (0xf7f33000)
libm.so.6 => /lib/libm.so.6 (0xf7e39000)
libpthread.so.0 => /lib/libpthread.so.0 (0xf7e1a000)
libc.so.6 => /lib/libc.so.6 (0xf7c82000)
/lib/ld-linux.so.2 (0xf7f6b000)
hadcm3s_se_8.34_i686-pc-linux-gnu.so:
linux-gate.so.1 (0xf7f2d000)
libz.so.1 => /lib/libz.so.1 (0xf7e59000)
libnsl.so.1 => /lib/libnsl.so.1 (0xf7e3e000)
libstdc++.so.6 => /lib/libstdc++.so.6 (0xf7cb9000)
libm.so.6 => /lib/libm.so.6 (0xf7bbf000)
libgcc_s.so.1 => /lib/libgcc_s.so.1 (0xf7ba2000)
libc.so.6 => /lib/libc.so.6 (0xf7a0a000)
/lib/ld-linux.so.2 (0xf7f2f000)
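
In the listings above every library resolved to a path. When one is absent, ldd prints "not found" after the arrow instead, so the check can be scripted. A minimal sketch that scans saved ldd output for those entries; the function name not_found_in_ldd is hypothetical and capturing the ldd output itself is left to the caller.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Scan text in the format printed by `ldd` and collect the names of
// libraries reported as "not found".
std::vector<std::string> not_found_in_ldd(const std::string& ldd_output)
{
    std::vector<std::string> missing;
    std::istringstream lines(ldd_output);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.find("=> not found") == std::string::npos)
            continue;
        std::istringstream fields(line);
        std::string name;
        if (fields >> name)   // first whitespace-separated field is the library name
            missing.push_back(name);
    }
    return missing;
}
```

An empty result for the test app and for the three hadcm3s files would suggest the 32-bit libraries are all in place.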
9) Message boards : Number crunching : Useful work being done? (Message 58548)
Posted 5 Aug 2018 by rjs5
Post:
I have been working on Climate Prediction since 5 Aug 2004 and run up 2,964,384 credits. None lately as there have been no Linux work units in a long time. I used to get enough work units to keep three of my 4 processors busy 24/7, but not in many many months.

I wonder if they will ever supply Linux work units again?



There is a 32-bit Linux app. If you present a 32-bit Linux environment to CPDN, you should get some work. Any method to present a 32-bit "machine" to CPDN should work ... Wine, Virtualbox, ... I think you just have to be able to run the 32-bit version of BOINC.

The CPDN code is dynamically linked and it takes some work for crunchers to get the right 32-bit packages installed on a 64-bit machine to make it work. It would be trivial for CPDN to add some assert code and display a helpful message about what 32-bit library support is missing on the 64-bit system, but CPDN is really struggling to make the simple stuff work.
10) Message boards : Number crunching : New work Discussion (Message 58521)
Posted 2 Aug 2018 by rjs5
Post:
Only the researchers can do that when they run various programs against the data received.

With climate modeling, answers aren't just right or wrong. There's a wide range of possible answers. Which is what makes this project such a tricky little beast.

****************

On reflection

Perhaps I used words that have a different "obvious meaning" to others. So let's try again.

Most of the failures are at about 3 minutes. (I've seen a few that were further along.)

For the "average" computer speed, with the user doing something as well, and with BOINC set to start and stop the program frequently, perhaps the program is at a critical point at about 3 minutes, (saving data, swapping data across cell boundaries, etc), and just then BOINC says STOP. And then when the program is allowed to restart, data "or something else" is missing/corrupted/whatever, and the program goes to the next step in the current if/then/else decision statement and aborts.

All of mine are running OK, so I don't need to worry/think about all of those that are failing. That's the researcher's job.


Thanks for the reply. It just seemed like a program bug where the results COULD possibly be from crunching some garbage.

If it is a rogue pointer that points into random program code/data without triggering a SIGSEGV, rather than aborting with the SIGSEGV ... then it seems like the computed results will not be what the researcher wants.
11) Message boards : Number crunching : New work Discussion (Message 58515)
Posted 2 Aug 2018 by rjs5
Post:
Sardis73

All of yours that I looked at had: Suspended CPDN Monitor - Suspend request from BOINC..., which usually indicates that you're using the default setting for the option Suspend when non-BOINC CPU usage is above.

It may be that these models are more sensitive than most to being interrupted at a crucial moment that's about 3 minutes into the running.

Try setting it to 100% to turn it off.


I had 2 of the sam25 tasks generate compute errors (Signal 11 received: Segment violation).

It is easy to ignore a "sensitive" model when it generates an error and aborts. How do you detect when the "sensitive" model just gets the wrong answer and does NOT abort?

Seems like something important to isolate.
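
One partial answer, for the "interrupted at a crucial moment" theory only: store a checksum with the saved state and verify it on restart, so damaged restart data is rejected instead of being crunched. This is a minimal sketch under that assumption and not a description of what CPDN does. The Checkpoint struct and function names are hypothetical, and FNV-1a is just a convenient hash for the example.

```cpp
#include <cstddef>
#include <stdint.h>
#include <vector>

// 64-bit FNV-1a hash over a byte buffer.
uint64_t fnv1a(const unsigned char* data, std::size_t n)
{
    uint64_t h = 0xcbf29ce484222325ULL;    // FNV offset basis
    for (std::size_t i = 0; i < n; ++i) {
        h ^= data[i];
        h *= 0x100000001b3ULL;             // FNV prime
    }
    return h;
}

// Hypothetical checkpoint record: the saved values plus their checksum.
struct Checkpoint {
    std::vector<double> state;
    uint64_t checksum;
};

// Hash the raw bytes of the saved values.
uint64_t checksum_of(const std::vector<double>& v)
{
    if (v.empty())
        return fnv1a(NULL, 0);
    return fnv1a(reinterpret_cast<const unsigned char*>(&v[0]),
                 v.size() * sizeof(double));
}

// Record the state together with its checksum at save time.
Checkpoint make_checkpoint(const std::vector<double>& state)
{
    Checkpoint c;
    c.state = state;
    c.checksum = checksum_of(state);
    return c;
}

// On restart: true only if the state still matches its stored checksum.
bool checkpoint_is_intact(const Checkpoint& c)
{
    return checksum_of(c.state) == c.checksum;
}
```

This only catches state that changed between save and restart. It does nothing for a rogue pointer that scribbles on live data before the checkpoint is written, which is the harder case.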
12) Message boards : Number crunching : Which URL do we use? (Message 58248)
Posted 25 May 2018 by rjs5
Post:
http://ithaqua.oerc.ox.ac.uk/cpdnboinc/

Use this to view the boards and some information about your account. (This information is still patchy/wrong in places on mine at least and probably on most.) Leave BOINC attached to the old url for now unless the server status page starts showing lots of tasks available for download. I think this is unlikely to happen as I don't expect new work to be released until the project moves back to the main server. As discussed elsewhere, the main server has serious stability problems and I have for some reason been dropped from the email list that might have given me any hints about what is happening. I have been too busy to chase this up yet.


I used the other URL from BOINC ADD PROJECT and it worked too.
Using the BOINC ADVANCED VIEW, I select the CPDNBOINC project, hit the PREFERENCES BUTTON under the LEFT COMMANDS section and it shows the URL you say to use.

IMO, my 7.8.3 BOINC added the project correctly. There seems to be some aliasing being done after connecting.
13) Message boards : Number crunching : ANOTHER UPLOAD PROBLEM (Message 50777)
Posted 10 Nov 2014 by rjs5
Post:
The "eu" path seems to be working fine for trickle up.

11/10/2014 7:32:32 AM | climateprediction.net | Started upload of hadam3p_eu_h6e3_2013_1_008862108_0_3.zip
11/10/2014 7:36:25 AM | climateprediction.net | Finished upload of hadam3p_eu_h6e3_2013_1_008862108_0_3.zip
14) Message boards : Number crunching : ANOTHER UPLOAD PROBLEM (Message 50775)
Posted 10 Nov 2014 by rjs5
Post:
Les,
Sorry, but I could not figure out how to start a new thread.
Answers at end.



I think this may be a lot worse than the Australian server being down.
Or, in the case of the 2 models listed, the re-start server at Oxford.

That computer has a LOT of "still running" models showing on its list from way back.
1263799

So, some questions:

Is that computer also running work for other projects?
Are there 12 climate models showing in its Tasks tab?
What message(s) is/are showing in the Event Log when there's an upload attempt?

And I think that you should start a new thread for this, as it may take several posts to sort out.



1. Yes, the computer is also running work for other projects. PrimeGrid, World Community Grid, MilkyWay (GPU only), Rosetta, and Einstein (GPU only).
2. Yes, there are 12 climate models showing in the TASKs tab. Two have completed after 200 hours each (a week or so ago) and 10 more are in the QUEUE. It looks like about 300 compute hours left.
3. When I push the RETRY NOW button to upload the completed tasks, I get the sequence of messages:

11/10/2014 5:57:32 AM | climateprediction.net | Started upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip
11/10/2014 5:57:36 AM | climateprediction.net | Started upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip
11/10/2014 6:02:38 AM | climateprediction.net | Temporarily failed upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip: transient HTTP error
11/10/2014 6:02:38 AM | climateprediction.net | Backing off 05:03:32 on upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip
11/10/2014 6:02:39 AM | | Project communication failed: attempting access to reference site
11/10/2014 6:02:40 AM | | Internet access OK - project servers may be temporarily down.
11/10/2014 6:02:43 AM | climateprediction.net | Temporarily failed upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip: transient HTTP error
11/10/2014 6:02:43 AM | climateprediction.net | Backing off 03:47:13 on upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip
11/10/2014 6:02:44 AM | | Project communication failed: attempting access to reference site
11/10/2014 6:02:45 AM | | Internet access OK - project servers may be temporarily down.
15) Message boards : Number crunching : ANOTHER UPLOAD PROBLEM (Message 50739)
Posted 5 Nov 2014 by rjs5
Post:
My log only goes back through 11/4 but I have some ANZ completed workloads that continue to be hung. Is there something that has to be done on my end to free them?


11/5/2014 3:55:32 AM | climateprediction.net | Backing off 03:39:11 on upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip
11/5/2014 3:55:33 AM | | Project communication failed: attempting access to reference site
11/5/2014 3:55:34 AM | | Internet access OK - project servers may be temporarily down.
11/5/2014 6:17:13 AM | climateprediction.net | Started upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip
11/5/2014 6:17:13 AM | climateprediction.net | Started upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip
11/5/2014 6:22:20 AM | climateprediction.net | Temporarily failed upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip: transient HTTP error
11/5/2014 6:22:20 AM | climateprediction.net | Backing off 05:57:48 on upload of hadam3p_anz_r719_2012_1_008738731_0_13.zip
11/5/2014 6:22:20 AM | climateprediction.net | Temporarily failed upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip: transient HTTP error
11/5/2014 6:22:20 AM | climateprediction.net | Backing off 03:42:10 on upload of hadam3p_anz_r0ra_2012_1_008730596_0_13.zip
16) Message boards : Number crunching : ANOTHER UPLOAD PROBLEM (Message 50030)
Posted 5 Sep 2014 by rjs5
Post:
All the failures I have and have seen are related to trying to create a file on their network drives. I have seen instances of this exact directory complaint back in Nov 2013. I am just going to TURN OFF EU processing until the existing files drain.

9/4/2014 6:19:23 PM | climateprediction.net | [error] Error reported by file upload server: can't open file /storage/cpdn-restarts/incoming/uploader/hadam3p_eu_p4z6_2013_.....zip: No such file or directory

It looks like a simple missing directory or access rights problem somewhere in the path that prevents the file being created on that volume.
/storage/cpdn-restarts/incoming/uploader/
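
For whoever has access to that server, the check I have in mind is roughly the following: walk the path one component at a time and report the first piece that is not an existing directory. A minimal sketch assuming a POSIX system; the function name first_missing_directory is hypothetical, and it tests existence only, not write permission.

```cpp
#include <sys/stat.h>
#include <string>

// Walk a directory path one component at a time and return the first
// prefix that is not an existing directory, or "" if every prefix exists.
std::string first_missing_directory(const std::string& dir_path)
{
    std::string::size_type pos = 0;
    while (pos != std::string::npos) {
        pos = dir_path.find('/', pos + 1);          // end of the next component
        std::string prefix = dir_path.substr(0, pos);
        if (prefix.empty())
            continue;
        struct stat st;
        if (stat(prefix.c_str(), &st) != 0 || !S_ISDIR(st.st_mode))
            return prefix;
    }
    return "";
}
```

Called with the directory above, an empty result would point toward access rights instead of a missing directory.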



