Posts by Yeti

21) Message boards : Number crunching : OpenIFS Discussion (Message 68331)
Posted 15 Feb 2023 by Yeti
Post:
Yes, that's one of them. There's a line that says
forrtl: error (72): floating overflow
'forrtl' is 'Fortran runtime library'.
This WU will go the same way: https://www.cpdn.org/workunit.php?wuid=12207576
22) Message boards : Number crunching : no credit awarded? (Message 68247)
Posted 10 Feb 2023 by Yeti
Post:
Jumping back to a checkpoint is usually very obvious (representing many hours of work lost), and I didn't notice that happening without obvious cause (such as a power failure or manual suspension). I wonder if it's something to do with hibernation causing a small jump back every evening, but not one that's big enough to notice on the progress? The longer a task runs, the more that would multiply up.

I have just taken a quick scan through the Universe@Home results for the host, and I can't see any sign that the virtual machine started running faster when the last CPDN task finished (unfortunately I can't go back very far, and other projects have an even shorter results record).
Or you started every day with the same checkpoint because you didn't reach the next one
23) Message boards : Number crunching : OpenIFS Discussion (Message 68234)
Posted 9 Feb 2023 by Yeti
Post:
Further OpenIFS batches will be released this week. A short batch for the missing forecasts from the recent batches, plus new batches for the OpenIFS BL app.


Can you please fill in the missing details:

OpenIFS_PS: 4.5 GB RAM, 7.5 GB hard disk
OpenIFS_BL: ??? GB RAM, ??? GB hard disk
24) Message boards : Number crunching : What does "Didn't need" mean on work-unit status webpage? (Message 68211)
Posted 5 Feb 2023 by Yeti
Post:
Meanwhile I have received lots of "Didn't need" WUs; they run fine.
25) Message boards : Number crunching : The uploads are stuck (Message 68190)
Posted 2 Feb 2023 by Yeti
Post:
I got several resends. Unfortunately, one of my older CPUs got three and the faster machine only one.
26) Message boards : Number crunching : How to Prevent OpenIFS Download (Message 68121)
Posted 30 Jan 2023 by Yeti
Post:
Also set use of the Swap to 1%.
Not sure if this is really a good idea.

As long as I have enough free main memory, swap isn't needed and all is okay.

If you reduce swap usage to 1% and your memory gets low, there is not much that BOINC can do, so it will put a task to sleep or error something out.

If you allow swapping, there is a chance that the running WU may survive, and that chance is better than with swapping blocked.

Sure, swapping is no good for a lot of projects, but if BOINC has to think about swapping, you/we have already made a big mistake.
27) Message boards : Number crunching : How to Prevent OpenIFS Download (Message 68094)
Posted 27 Jan 2023 by Yeti
Post:
I think one very important setting regarding memory is:

Memory
When computer is in use, use at most 100 %
When computer is not in use, use at most 100 %

It doesn't have to be 100%, but both should be set to the same value.

If they differ from each other and you start using your machine, BOINC has to free up memory down to the lower limit, and this seems to be very risky with CPDN tasks / OpenIFS.
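
The rule about keeping both limits equal can be checked mechanically. This is only a minimal sketch: it assumes the limits are stored in BOINC's global_prefs_override.xml under the tag names ram_max_used_busy_pct and ram_max_used_idle_pct (please verify against your own client's file), and the sample XML and the function name ram_limits_match are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; a real file would be read from the BOINC data directory.
# The tag names are an assumption to be checked against your own client.
SAMPLE_PREFS = """
<global_preferences>
    <ram_max_used_busy_pct>100</ram_max_used_busy_pct>
    <ram_max_used_idle_pct>100</ram_max_used_idle_pct>
</global_preferences>
"""

def ram_limits_match(xml_text):
    """Return (busy, idle, equal) for the two RAM limits in a prefs document."""
    root = ET.fromstring(xml_text)
    busy = float(root.findtext("ram_max_used_busy_pct"))
    idle = float(root.findtext("ram_max_used_idle_pct"))
    return busy, idle, busy == idle

busy, idle, equal = ram_limits_match(SAMPLE_PREFS)
print(f"in use: {busy:.0f}%, not in use: {idle:.0f}%, same limit: {equal}")
```

If the last value is False, BOINC would have to shrink its memory use whenever the machine switches from "not in use" to "in use".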
28) Message boards : Number crunching : OpenIFS Discussion (Message 68093)
Posted 27 Jan 2023 by Yeti
Post:
I appreciate that, I also find %age cpus a pain (why wasn't it just a plain number). But there are other cases where that's not the case.
Nope, with only a plain number but boxes with different core counts, that would be a real pain.

I have all my boxes set to "Use only 75% of the real existing cores", and I really want exactly this behaviour as a maximum for BOINC.
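
To illustrate why one percentage is convenient across boxes with different core counts, here is a small sketch. Rounding down with a minimum of one core is an assumption about how the client turns the percentage into a core count, and usable_cores is a hypothetical helper, not a BOINC function.

```python
def usable_cores(total_cores, percent):
    """Cores available at a given percentage, rounded down, but at least 1.

    The rounding rule is an assumption made for this illustration.
    """
    return max(1, (total_cores * percent) // 100)

# The same 75% setting applied to boxes of different sizes.
for cores in (4, 8, 16, 32):
    print(f"{cores:2d} cores at 75% -> {usable_cores(cores, 75)} usable")
```

With a plain number instead, each box would need its own value.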
29) Message boards : Number crunching : The uploads are stuck (Message 68065)
Posted 26 Jan 2023 by Yeti
Post:
There are some more OpenIFS batches coming soon
Hopefully you will publish the requirements, such as RAM per WU and HD space per WU, before starting the batch(es), together with an identifier by which we can recognize the different batches. Perhaps even as a sticky post or news item?
30) Message boards : Number crunching : The uploads are stuck (Message 67978)
Posted 22 Jan 2023 by Yeti
Post:
Hm, guys, I think at the moment CPDN seems to have more crunching power than the infrastructure can handle. So I think it is better for the project if I pause CPDN crunching for quite a while, until the infrastructure can handle the load.

For now, I am letting all my clients finish the tasks they have already downloaded, but not download any new ones.
31) Message boards : Number crunching : Why does this task fail ? (Message 67953)
Posted 21 Jan 2023 by Yeti
Post:
I'm very optimistic because on HOST01 / ATLAS1_L1 and HOST05 / ATLAS5_L1 this never happened and at the moment they have a success-rate of 100%
This really was the right solution. Since I fixed this on my side, I have already crunched 7 or more WUs with 100% success.
32) Message boards : Number crunching : The uploads are stuck (Message 67951)
Posted 21 Jan 2023 by Yeti
Post:
Here we go again:
21 Jan 2023 17:43 UTC Error reported by file upload server: can't write file oifs.....zip: No space left on server
Same here; we seem to upload faster than the internal processes move files to other places.

occasional uploads go through
33) Message boards : Number crunching : Why does this task fail ? (Message 67898)
Posted 19 Jan 2023 by Yeti
Post:
I wish you good results. For me the time on both host and guest OSs always matched up and was correct but yet the BOINC time mismatch kept showing up.
I'm very optimistic because on HOST01 / ATLAS1_L1 and HOST05 / ATLAS5_L1 this never happened and at the moment they have a success-rate of 100%
34) Message boards : Number crunching : Why does this task fail ? (Message 67889)
Posted 19 Jan 2023 by Yeti
Post:
The HOST from ATLAS7_L1 had a wrong time (1 hour difference) and it seems as if this has made it into the VM although this feature was deactivated.

We have corrected this and now a new try begins

Thanks to all for your help
35) Message boards : Number crunching : Why does this task fail ? (Message 67875)
Posted 18 Jan 2023 by Yeti
Post:
18-Jan-2023 01:22:30 [---] New system time (1674001351) < old system time (1674004963); clearing timeouts
At the moment I have no idea why this happens, but the difference seems to be 60.2 minutes. Could it have something to do with UTC versus GMT or similar?
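
The 60.2 minutes can be reproduced directly from the two values in the log line. The variable names are illustrative; the numbers are copied from the log.

```python
new_system_time = 1674001351  # "New system time" from the log line
old_system_time = 1674004963  # "old system time" from the log line

jump_seconds = old_system_time - new_system_time
print(jump_seconds, "seconds =", jump_seconds / 60, "minutes")
print("beyond a full hour:", jump_seconds - 3600, "seconds")
```

So the clock stepped back by one hour plus 12 seconds. UTC and GMT do not differ by an hour, so an offset between UTC and a local time zone looks like a more plausible candidate, though that is only a guess from the numbers.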
36) Message boards : Number crunching : The uploads are stuck (Message 67874)
Posted 18 Jan 2023 by Yeti
Post:
But in otherways many users cannot tell if something is BOINC or CPDN. Which is perhaps an argument for not splitting them up.
But if you don't split it up, you force the project admin of each project to build up this knowledge. I don't think that this is a good way.

It would be better to have a central, general BOINC site where you can find tutorials and descriptions for common / general situations.

Then the project admins could focus on the project-relevant information, and I'm sure they would do so with joy.

As has been said somewhere in these discussions, the best knowledge about general BOINC matters lies with crunchers like me, who take part in races and work with hundreds of instances. We have all had a lot of these problems and have spent a great deal of time finding out how things work. So even races have their good sides.
37) Message boards : Number crunching : Why does this task fail ? (Message 67872)
Posted 18 Jan 2023 by Yeti
Post:
Yeti,
Are you still getting that system time mismatch issue in the Event log, like you posted last time?
18-Jan-2023 01:22:30 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip
18-Jan-2023 01:22:30 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_36.zip
18-Jan-2023 01:22:30 [---] New system time (1674001351) < old system time (1674004963); clearing timeouts
18-Jan-2023 01:26:24 [climateprediction.net] Temporarily failed upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip: transient HTTP error
18-Jan-2023 01:26:24 [climateprediction.net] Backing off 00:56:55 on upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip
18-Jan-2023 01:26:51 [climateprediction.net] Finished upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_36.zip

now CPDN seems to sleep for nearly an hour. Look at the time in relation to the new system time:

18-Jan-2023 02:22:45 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_37.zip
18-Jan-2023 02:23:21 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip
18-Jan-2023 02:26:13 [climateprediction.net] Finished upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_37.zip
18-Jan-2023 02:26:57 [climateprediction.net] Finished upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_31.zip

and here comes the next trickle

18-Jan-2023 02:27:22 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_38.zip
18-Jan-2023 02:30:25 [climateprediction.net] Finished upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_38.zip
18-Jan-2023 02:34:14 [climateprediction.net] Started upload of oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_39.zip
18-Jan-2023 02:34:14 [climateprediction.net] Sending scheduler request: To send trickle-up message.
18-Jan-2023 02:34:14 [climateprediction.net] Not requesting tasks: don't need ()
18-Jan-2023 02:34:15 [climateprediction.net] Scheduler request completed
18-Jan-2023 02:34:15 [climateprediction.net] Project requested delay of 3636 seconds
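
One possible reading of the gap, at least for the _31 file: its retry lines up with the logged back-off. The sketch below only redoes the arithmetic from the log lines above; the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Values copied from the log lines above.
backoff_logged = datetime(2023, 1, 18, 1, 26, 24)   # "Backing off 00:56:55 on upload of ..._31.zip"
backoff = timedelta(minutes=56, seconds=55)
retry_logged = datetime(2023, 1, 18, 2, 23, 21)     # "Started upload of ..._31.zip"

expected_retry = backoff_logged + backoff
print("expected retry:", expected_retry)
print("logged retry:  ", retry_logged)
print("difference:    ", retry_logged - expected_retry)
```

The retry of _31 comes two seconds after the back-off expires. This does not by itself explain why the _37 upload also started only at 02:22:45.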
38) Message boards : Number crunching : The uploads are stuck (Message 67870)
Posted 18 Jan 2023 by Yeti
Post:
If there were a sticky thread called "Instant Invalidated Tasks" then that would give a definite answer.

Nope, I would never search for such a post, I would search for cancelled Tasks and that is the start of the Dilemma.

About 10 days ago I (re-)started crunching CPDN after having paused for several years, and I was searching for all the important details needed to do it right. I found a FAQ, but it was so basic about BOINC that it didn't help.

I think the information has to be divided into two or even more sections:

A) Running BOINC in general. That is all the knowledge that is not project-relevant.
B) Running this special project

Your problem with the disc running full is a general task like A)

How to change a disc is general task like A)

How much memory OpenIFS needs, or whether special settings like "Keep tasks in memory" are needed, is project-specific, so B)

And things that work on Windows do not work on Linux, and things that work with Ubuntu 20.x don't work with Ubuntu 22.x; all this makes it more and more complicated.

For example: your disc swap under Windows would have been very easy: stop BOINC, copy the whole BOINC directory to the new location, (re-)install BOINC if needed and point it to the "new" location as the data directory, and that's it. You could continue at your last point and nothing would change. Under Linux this won't work.
39) Message boards : Number crunching : Why does this task fail ? (Message 67861)
Posted 18 Jan 2023 by Yeti
Post:
And once again this task has failed: https://www.cpdn.org/result.php?resultid=22262774

Here is the log snippet from the end of the run:

18-Jan-2023 10:55:53 [climateprediction.net] Computation for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 finished
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_115.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_116.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_117.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_118.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_119.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_120.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_121.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_122.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
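
For a long event log, the absent output files can be collected with a few lines of Python. This is only a sketch: the regular expression is an assumption based on the message format seen above, absent_outputs is a hypothetical helper, and the sample contains just the first three lines of the snippet.

```python
import re
from collections import defaultdict

# First three lines copied from the snippet above.
LOG = """\
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_115.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_116.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
18-Jan-2023 10:55:54 [climateprediction.net] Output file oifs_43r3_ps_0294_1992050100_123_961_12177938_0_r149474446_117.zip for task oifs_43r3_ps_0294_1992050100_123_961_12177938_0 absent
"""

# Assumed message format: "Output file <name>_<n>.zip for task <task> absent"
ABSENT = re.compile(r"Output file \S+_(\d+)\.zip for task (\S+) absent")

def absent_outputs(log_text):
    """Map each task name to the sorted list of missing upload numbers."""
    missing = defaultdict(list)
    for line in log_text.splitlines():
        match = ABSENT.search(line)
        if match:
            missing[match.group(2)].append(int(match.group(1)))
    return {task: sorted(numbers) for task, numbers in missing.items()}

for task, numbers in absent_outputs(LOG).items():
    print(task, "is missing uploads", numbers)
```

Run against the full stdoutdae.txt, this would show at a glance from which upload number on a task stopped producing output.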

Before starting this task, the box was completely rebooted, and I never did anything like pausing or suspending the WU or BOINC. I have the complete BOINC stdoutdae.txt from the run, so if anyone is interested in taking a look through it, I can provide it.

I have checked syslog but couldn't find anything related. I kept a copy of the syslog; if someone is interested, I can provide it.

Some more thoughts:

At the moment, I'm running OpenIFS on three virtual boxes. They are all clones of one single master Ubuntu 22.04 LTS and sit on different (hardware) hosts.

All machines should have more memory than their tasks need and also enough free space on the HD. "Leave application in memory" is selected / activated. BOINC may use 100%/100% of available RAM.

ATLAS1_L1 works fine, 8 OpenIFS successful, 0 failed, runs 1 OpenIFS and 1x4-Core Atlas-Native, 16 GB RAM
ATLAS5_L1 works fine, 7 OpenIFS successful, 0 failed, runs 1 OpenIFS and 1x4-Core Atlas-Native, 32 GB RAM
ATLAS7_L1 struggles, 3 OpenIFS successful, 7 failed, runs 1 OpenIFS and 2x4-Core Atlas-Native, 32 GB RAM
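
Expressed as success rates, the three hosts compare as follows. The counts are taken from the list above; the helper success_rate is only illustrative.

```python
# (successful, failed) OpenIFS tasks per host, as listed above.
results = {
    "ATLAS1_L1": (8, 0),
    "ATLAS5_L1": (7, 0),
    "ATLAS7_L1": (3, 7),
}

def success_rate(ok, failed):
    """Percentage of successful tasks; 0.0 if no task has finished yet."""
    total = ok + failed
    return 100 * ok / total if total else 0.0

for host, (ok, failed) in results.items():
    print(f"{host}: {success_rate(ok, failed):.0f}% of {ok + failed} tasks")
```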

What I could still test is running only 1x4-Core Atlas-Native on ATLAS7_L1. Another option would be to run OpenIFS in a second instance and check whether this brings any progress.

Any thoughts or ideas?

@Dave: Thanks for your links, I already checked them, but could so far find nothing that helped
40) Message boards : Number crunching : Why does this task fail ? (Message 67819)
Posted 17 Jan 2023 by Yeti
Post:
Okay, here is the next failed WU, and I ask again: why has this WU failed? https://www.cpdn.org/result.php?resultid=22268795

The machine has 32 GB RAM and was running 2 AtlasNative-tasks together with only 1 OpenIFS.
....
Same client, same configuration, and the next WU finished successfully: https://www.cpdn.org/result.php?resultid=22269176

No idea what the reason could be.

One thing I saw in BOINC-Log after a restart:

142 17-01-2023 20:56 New system time (1673985414) < old system time (1673989026); clearing timeouts

Could this have something to do with the failing OpenIFS? Atlas-Native does not seem to be affected in any way.
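
To see what the two numbers in that log line mean as clock times, they can be converted from Unix timestamps to UTC. This is a small sketch using only the values from the log line; the variable names are illustrative.

```python
from datetime import datetime, timezone

new_system_time = 1673985414  # "New system time" from the log line
old_system_time = 1673989026  # "old system time" from the log line

for label, ts in (("new", new_system_time), ("old", old_system_time)):
    # Interpret the value as seconds since 1970-01-01 00:00:00 UTC.
    print(label, datetime.fromtimestamp(ts, tz=timezone.utc).isoformat())
```

The new system time corresponds to 19:56:54 UTC, which would match the 20:56 in the log if the log is written in a local time one hour ahead of UTC (an assumption). The old system time lies about one hour later.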



©2024 climateprediction.net