climateprediction.net home page
New work Discussion

Thomas Wiegand
Joined: 4 Jul 19
Posts: 31
Credit: 252,192
RAC: 0
Message 61102 - Posted: 30 Sep 2019, 7:53:34 UTC
Last modified: 30 Sep 2019, 7:53:52 UTC

Another freeze on my main machine, no fun.
So I set "no new tasks" and hope the last 4 tasks on this computer might survive.

My 2 other computers seem to have fewer problems, but they also had a lot of errors, and those tasks are gone.

1 computer is gone entirely; I have to open it and search for the failure: ... a used HP desktop ... 5 beeps at start ... and all it can run is a crazy fan.
ID: 61102
wolfman1360
Joined: 18 Feb 17
Posts: 81
Credit: 11,351,817
RAC: 2,861
Message 61108 - Posted: 30 Sep 2019, 19:27:22 UTC

How are these new Linux tasks on RAM and disk space?
I've got an i7-2600 with Ubuntu 18.04 with all the proper fixes for 32-bit work installed (I think).
From what I remember, the WAH 2 tasks use around 1 GB per core on Windows, unless I'm mistaken; I forget how much disk space, though. Should I expect similar runtimes under Linux, e.g. 3-5 days?
Thanks, and I look forward to giving this a shot.
ID: 61108
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7585
Credit: 24,102,463
RAC: 2,603
Message 61110 - Posted: 30 Sep 2019, 20:02:00 UTC - in response to Message 61102.  

Thomas

One of your computers had 2 models fail with Model crashed: ATM_DYN : INVALID THETA DETECTED.

This is a perfectly normal science result. ATM_DYN is Atmospheric Dynamics, and the message means that the starting values used for that model eventually led to an impossible physical condition of some sort.
This is one of the reasons for running these models: to find out what happens at some time in the model's future.
ID: 61110
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7585
Credit: 24,102,463
RAC: 2,603
Message 61112 - Posted: 30 Sep 2019, 20:11:43 UTC - in response to Message 61108.  

wolfman

The time taken will depend on the particular research being done in a given batch.
If the researcher gets ambitious and wants a long run, then that's what it will end up as.

Disk space - give them plenty of room to start with, because you never know.
One test batch of OpenIFS models used a bit over 9 Gigs per model.

Some more work is being assembled at present, but I don't have any details.
ID: 61112
Jean-David Beyer
Joined: 5 Aug 04
Posts: 612
Credit: 9,610,669
RAC: 8,481
Message 61115 - Posted: 30 Sep 2019, 22:15:46 UTC - in response to Message 61108.  

I am running Red Hat Enterprise Linux 6.10 on a machine with a 4-core 64-bit Xeon processor. There are 4 CPDN tasks running, and this is the disk space they are using:
988M    ./hadam4_a08h_209410_12_838_011900611
982M    ./hadam4_a0hb_209810_12_838_011900929
778M    ./hadam4_a0hb_209810_12_838_011900929/datain
778M    ./hadam4_a08h_209410_12_838_011900611/datain
640K    ./hadcm3s_hd57_190012_240_835_011892279/datain/ancil/ctldata
640K    ./hadcm3s_hd55_190012_240_835_011892277/datain/ancil/ctldata
624K    ./hadcm3s_hd57_190012_240_835_011892279/jobs
624K    ./hadcm3s_hd55_190012_240_835_011892277/jobs
624K    ./hadam4_a0hb_209810_12_838_011900929/datain/ancil/ctldata
624K    ./hadam4_a08h_209410_12_838_011900611/datain/ancil/ctldata
603M    ./hadam4_a0hb_209810_12_838_011900929/datain/ancil
603M    ./hadam4_a08h_209410_12_838_011900611/datain/ancil
552K    ./hadcm3s_hd57_190012_240_835_011892279/datain/ancil/ctldata/STASHmaster
552K    ./hadcm3s_hd55_190012_240_835_011892277/datain/ancil/ctldata/STASHmaster
536K    ./hadam4_a0hb_209810_12_838_011900929/datain/ancil/ctldata/STASHmaster
536K    ./hadam4_a08h_209410_12_838_011900611/datain/ancil/ctldata/STASHmaster
470M    ./hadcm3s_hd55_190012_240_835_011892277
351M    ./hadcm3s_hd57_190012_240_835_011892279
276K    ./hadam4_a0hb_209810_12_838_011900929/jobs
276K    ./hadam4_a08h_209410_12_838_011900611/jobs
245M    ./hadcm3s_hd57_190012_240_835_011892279/datain
245M    ./hadcm3s_hd55_190012_240_835_011892277/datain
210M    ./hadam4_a08h_209410_12_838_011900611/dataout
208M    ./hadcm3s_hd55_190012_240_835_011892277/dataout
205M    ./hadam4_a0hb_209810_12_838_011900929/dataout
180K    ./hadcm3s_hd57_190012_240_835_011892279/tmp
180K    ./hadcm3s_hd55_190012_240_835_011892277/tmp
175M    ./hadam4_a0hb_209810_12_838_011900929/datain/dumps
175M    ./hadam4_a08h_209410_12_838_011900611/datain/dumps
143M    ./hadcm3s_hd57_190012_240_835_011892279/datain/masks
143M    ./hadcm3s_hd55_190012_240_835_011892277/datain/masks
88M     ./hadcm3s_hd57_190012_240_835_011892279/dataout
84K     ./hadcm3s_hd57_190012_240_835_011892279/datain/ancil/ctldata/stasets
84K     ./hadcm3s_hd55_190012_240_835_011892277/datain/ancil/ctldata/stasets
84K     ./hadam4_a0hb_209810_12_838_011900929/datain/ancil/ctldata/stasets
84K     ./hadam4_a08h_209410_12_838_011900611/datain/ancil/ctldata/stasets
70M     ./hadcm3s_hd57_190012_240_835_011892279/datain/dumps
70M     ./hadcm3s_hd55_190012_240_835_011892277/datain/dumps
33M     ./hadcm3s_hd57_190012_240_835_011892279/datain/ancil
33M     ./hadcm3s_hd55_190012_240_835_011892277/datain/ancil
28K     ./hadam4_a0hb_209810_12_838_011900929/tmp
28K     ./hadam4_a08h_209410_12_838_011900611/tmp
3.2G    total


The two big ones have been running about a day, and the two little ones for about 5 days.
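
For anyone who wants per-task totals from a listing like the one above, here is a minimal Python sketch. It assumes the listing is `du`-style output with K/M/G suffixes (the exact command used was not stated), and the helper names `parse_size` and `top_level_sizes` are made up for illustration.

```python
import re

# Multipliers for the binary-unit suffixes seen in the listing (assumed du -h style).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(text):
    """Convert a size such as '988M' or '3.2G' to a whole number of bytes."""
    match = re.fullmatch(r"([\d.]+)([KMG])", text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    return int(float(number) * UNITS[unit])


def top_level_sizes(du_lines):
    """Return {task_dir: bytes} for entries exactly one level below '.'."""
    sizes = {}
    for line in du_lines:
        size, path = line.split(None, 1)
        parts = path.strip().split("/")
        # Keep './taskdir' only; skip subdirectories and the 'total' line.
        if len(parts) == 2 and parts[0] == ".":
            sizes[parts[1]] = parse_size(size)
    return sizes


# A few lines copied from the listing above.
listing = [
    "988M    ./hadam4_a08h_209410_12_838_011900611",
    "778M    ./hadam4_a08h_209410_12_838_011900611/datain",
    "470M    ./hadcm3s_hd55_190012_240_835_011892277",
    "3.2G    total",
]
per_task = top_level_sizes(listing)
```

Only the top-level task directories are kept, so the subdirectory lines are not counted twice.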
ID: 61115
wolfman1360
Joined: 18 Feb 17
Posts: 81
Credit: 11,351,817
RAC: 2,861
Message 61133 - Posted: 2 Oct 2019, 2:04:23 UTC - in response to Message 60296.  
Last modified: 2 Oct 2019, 2:16:52 UTC

Thank you both. Looks like my Linux machine hasn't gotten anything just yet, but there are a few hundred tasks available, so one can hope. I theoretically have the 32-bit libraries installed for Ubuntu 18.04. I didn't get the 'no tasks available for this operating system' message when adding.
Is there a way to tell, similar to WAH 2, whether I got a shorter or longer task, or how exactly these workunits differ from the ones run on Windows? For instance: number of months run, number of km that it covers (similar to sam25 being 25 km), and batch number?

The machine I'm most worried about as far as RAM usage goes is a Ryzen 7 1800X with 16 GB. It's running Windows, and I'd rather it not start thrashing a ton, so maybe I will limit CPDN to 12 or so tasks just to be on the safe side. I have considered having Linux run off a USB drive when I'm not using it, for alternative crunching and projects that don't run on Windows.
Are these similar to, for instance, Rosetta, where RAM usage slowly creeps up until a checkpoint (or, in this case, a zip file) is created?
Thanks, and sorry for the barrage of questions.
ID: 61133
geophi
Volunteer moderator
Joined: 7 Aug 04
Posts: 2100
Credit: 58,039,568
RAC: 1,756
Message 61135 - Posted: 2 Oct 2019, 3:23:23 UTC - in response to Message 61133.  
Last modified: 2 Oct 2019, 3:23:59 UTC

> Thank you both. Looks like my Linux machine hasn't gotten anything just yet, but there are a few hundred tasks available, so one can hope. I theoretically have the 32-bit libraries installed for Ubuntu 18.04. I didn't get the 'no tasks available for this operating system' message when adding.
> Is there a way to tell, similar to WAH 2, whether I got a shorter or longer task, or how exactly these workunits differ from the ones run on Windows? For instance: number of months run, number of km that it covers (similar to sam25 being 25 km), and batch number?

The hadcm3s models now on Linux have been run on Windows (and Linux and Mac) before. These are "simpler" models and run faster for a given model day/month/year. The ones we are running right now are 240 months, so 20 years, but I've seen other numbers of months in other batches in the past. The 240-month models might take 5 days or less on a very fast PC, as long as it's not running hyperthreading/SMT. They take less than 200 MB of RAM per task. It's a global model with no regional component, and has a rather large spacing between grid points.

The N144 models are currently being run for 12 model months and are also global models with no regional component. They have a grid spacing of N144, which is on the order of 100 km or so at the equator. These take about 650 MB of RAM per task, and loading up a multi-core PC with a whole bunch of them will slow down the progress considerably. We're talking 5 to 10 days on fast PCs that aren't loaded up with too many of them.

The N216 models have a grid spacing about 2/3 of the N144 ones and can take up to 1.5 GB of RAM per task. On the development site, we've been running them for 4 months. The most I've run at a time is 2 as it is desired to get back results quickly on some of these beta tests. I imagine they will really slow down progress if you load up a multi-core PC with a whole bunch of them, let alone trying to run them on HT/SMT logical cores. Just running a couple is 5+ days for a fast PC.

> The machine I'm most worried about as far as RAM usage goes is a Ryzen 7 1800X with 16 GB. It's running Windows, and I'd rather it not start thrashing a ton, so maybe I will limit CPDN to 12 or so tasks just to be on the safe side. I have considered having Linux run off a USB drive when I'm not using it, for alternative crunching and projects that don't run on Windows.
> Are these similar to, for instance, Rosetta, where RAM usage slowly creeps up until a checkpoint (or, in this case, a zip file) is created?
> Thanks, and sorry for the barrage of questions.


For the hadcm3s, N144, and N216 models, there isn't much fluctuation in the RAM used per task over the course of the run. The OpenIFS models, which are 64-bit and may be coming later, do fluctuate in RAM as the model runs along; we have run some that *only* take up 3.5 GB per task and others that take up to 5.2 GB per task.
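
As a rough illustration of the per-task RAM figures in this post, the Python sketch below estimates how many tasks of each type would fit in 16 GB. The 2 GB reserve for the operating system and the helper name `max_tasks_by_ram` are assumptions for illustration, not project guidance, and fitting in RAM says nothing about the slowdown from loading every core.

```python
def max_tasks_by_ram(total_ram_mb, per_task_mb, reserve_mb=2048):
    """Largest task count whose combined RAM fits after reserving headroom.

    reserve_mb is an assumed allowance for the OS and other programs.
    """
    usable = max(total_ram_mb - reserve_mb, 0)
    return usable // per_task_mb


# Approximate per-task figures quoted in the post above, in MB.
PER_TASK_MB = {
    "hadcm3s": 200,          # "less than 200 MB"
    "N144": 650,             # "about 650 MB"
    "N216": 1536,            # "up to 1.5 GB"
    "OpenIFS (high)": 5325,  # "up to 5.2 GB", rounded up
}

for model, per_task in PER_TASK_MB.items():
    print(model, max_tasks_by_ram(16 * 1024, per_task))
```

On these assumptions RAM alone would allow far more hadcm3s or N144 tasks than a 16-thread CPU has threads, so for those models the limit worth setting is about speed rather than memory.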
ID: 61135
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7585
Credit: 24,102,463
RAC: 2,603
Message 61136 - Posted: 2 Oct 2019, 4:52:30 UTC

> Looks like my Linux machine hasn't gotten anything just yet, but there are a few hundred tasks available

VERY important: DON'T click the Update button.

Doing this will reset the 1-hour backoff to 1 hour and a couple of minutes, when you may have been only a few seconds away from getting work.
There should be a message in that case, though, along the lines of "too soon since last request".

And the model names contain a lot of what you asked about. You just need to decode them.
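
A hedged sketch of what such decoding might look like in Python, using task names that appear earlier in this thread. The only field confirmed by the discussion is the run length in months (240 for hadcm3s, 12 for the N144 hadam4 tasks). Reading the third field as a YYYYMM start date, and the labels given to the other fields, are guesses; `decode_task_name` and `TaskName` are hypothetical names.

```python
from typing import NamedTuple


class TaskName(NamedTuple):
    model: str           # e.g. 'hadcm3s' or 'hadam4'
    run_id: str          # guess: short identifier of the individual run
    start_year: int      # guess: third field read as YYYYMM
    start_month: int
    months: int          # run length in model months (matches the thread)
    field5: str          # meaning not confirmed in this thread
    field6: str          # meaning not confirmed in this thread


def decode_task_name(name):
    """Split an underscore-separated task name into its six fields."""
    model, run_id, start, months, field5, field6 = name.split("_")
    return TaskName(
        model=model,
        run_id=run_id,
        start_year=int(start[:4]),
        start_month=int(start[4:]),
        months=int(months),
        field5=field5,
        field6=field6,
    )


task = decode_task_name("hadcm3s_hd57_190012_240_835_011892279")
print(task.model, task.months, "months, i.e.", task.months // 12, "years")
```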
ID: 61136
wolfman1360
Joined: 18 Feb 17
Posts: 81
Credit: 11,351,817
RAC: 2,861
Message 61137 - Posted: 2 Oct 2019, 6:54:52 UTC - in response to Message 61136.  

> > Looks like my Linux machine hasn't gotten anything just yet, but there are a few hundred tasks available
>
> VERY important: DON'T click the Update button.
>
> Doing this will reset the 1 hour backoff to 1 hour and a couple of minutes, and you may only be a few seconds away from getting work.
> Although there should be a message in that case, along the lines of "too soon since last request".
>
> And the model names contain a lot of what you asked about. You just need to decode it.

Excellent. Of course, about 5 minutes after posting my previous message I did exactly that. I have since left it alone, though, and it doesn't seem to want to attempt to schedule more work, even hours later. I've got Asteroids queued up at the moment, but there is no more work available over there, so I'm hoping this straightens itself out before the tasks here are all gone. Both projects are set equally as far as resource management goes. I do have Rosetta set up, but it's at 0%, so if work were available here I should hope I would get it beforehand.
I see 271 workunits are still available, so I'm hoping that I'll have at least 1 or 2 when I wake up tomorrow.
This is an i7-2600 with 16 GB of RAM, so certainly not the fastest, but I doubt I'll get enough work to hit all the threads at this rate. HT is enabled because I run other projects that benefit from it.
Perhaps tomorrow I'll set up an app_config with max_concurrent, assuming I get anything. Work seems to be coming out fairly regularly, though.
Thank you for the help.
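
For reference, a minimal Python sketch that writes out the text of a BOINC `app_config.xml` capping the number of tasks the project runs at once. It uses the `project_max_concurrent` element so that no application names have to be guessed; the limit of 4 is an arbitrary example, and where the file goes (the project's folder inside the BOINC data directory) depends on the installation.

```python
import xml.etree.ElementTree as ET


def build_app_config(limit):
    """Return app_config.xml text limiting concurrent tasks for one project."""
    root = ET.Element("app_config")
    # Caps the total number of this project's tasks running at the same time.
    ET.SubElement(root, "project_max_concurrent").text = str(limit)
    return ET.tostring(root, encoding="unicode")


# Example value only; pick a limit that suits the machine's RAM and cores.
xml_text = build_app_config(4)
print(xml_text)
```

After saving the text as `app_config.xml`, the client has to re-read its config files (or be restarted) before the limit takes effect.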
ID: 61137
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7585
Credit: 24,102,463
RAC: 2,603
Message 61138 - Posted: 2 Oct 2019, 7:45:08 UTC

One Update now and then doesn't hurt; it depends on what you're doing.

E.g., if I'm setting up the computer after reloading the OS, I'll waste some of the backoff by doing an Update to get changes to the Venue, or anything else that I want set differently from the default.
THEN I leave the Update button alone.

But if you haven't had a response in that long, then it's time to get suspicious and try things.
Start with an Update; then, if that doesn't work, try shutting down BOINC and rebooting the computer.
ID: 61138
Jean-David Beyer
Joined: 5 Aug 04
Posts: 612
Credit: 9,610,669
RAC: 8,481
Message 61143 - Posted: 2 Oct 2019, 14:17:56 UTC - in response to Message 61135.  

> The N144 models are currently being run for 12 model months and are also global models with no regional component. They have a grid spacing of N144, which is on the order of 100 km or so at the equator. These take about 650 MB of RAM per task, and loading up a multi-core PC with a whole bunch of them will slow down the progress considerably.

Why does loading up a multi-core PC with a bunch of N144 models slow the progress down? (I am not referring to hyperthreaded processors.) It seems to me that if all my cores are running N144 models, the on-chip cache (Memory 15.5 GB, Cache 10240 KB) would get a greater hit rate (for the instructions, not the data, of course) than if, say, only one core were running N144 and the other three were running WCG or something. And I do not recall that the hadam4 models ran the hard drives very hard. Each takes 4% of my RAM, which is not a big deal. I have not seen any N216 models yet.
ID: 61143
wolfman1360
Joined: 18 Feb 17
Posts: 81
Credit: 11,351,817
RAC: 2,861
Message 61144 - Posted: 2 Oct 2019, 16:27:45 UTC

Okay, this is a little odd. This is what I woke up to. Perhaps it's normal and I'm just not sure what I'm seeing.
2019-10-02 9:28:12 AM | climateprediction.net | [sched_op] Starting scheduler request
2019-10-02 9:28:12 AM | climateprediction.net | Sending scheduler request: To fetch work.
2019-10-02 9:28:12 AM | climateprediction.net | Requesting new tasks for CPU
2019-10-02 9:28:12 AM | climateprediction.net | [sched_op] CPU work request: 9387264.63 seconds; 8.00 devices
2019-10-02 9:28:13 AM | climateprediction.net | Scheduler request completed: got 0 new tasks
2019-10-02 9:28:13 AM | climateprediction.net | [sched_op] Server version 713
2019-10-02 9:28:13 AM | climateprediction.net | No tasks sent
2019-10-02 9:28:13 AM | climateprediction.net | Project requested delay of 3636 seconds
2019-10-02 9:28:13 AM | climateprediction.net | [sched_op] Deferring communication for 01:00:36
2019-10-02 9:28:13 AM | climateprediction.net | [sched_op] Reason: requested by project
Does this mean I have too much work queued up in the meantime? Since it isn't saying that the project has no work or no tasks available, I'm trying to figure out why I'm not receiving any work: Asteroids has been running with 0 new tasks in its queue for several days now, and I only have enough Rosetta tasks to fill the empty cores.
Any help appreciated.
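
The two delay figures in the log above are the same number in different units. A small Python sketch makes the relationship explicit by pulling the requested delay out of the log text and formatting it as hours, minutes, and seconds; the function names `requested_delay` and `as_hms` are illustrative only.

```python
import re

DELAY_RE = re.compile(r"Project requested delay of (\d+) seconds")


def requested_delay(log_lines):
    """Return the first project-requested delay found, in seconds, or None."""
    for line in log_lines:
        match = DELAY_RE.search(line)
        if match:
            return int(match.group(1))
    return None


def as_hms(seconds):
    """Format a number of seconds as HH:MM:SS."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


# Two lines copied from the event log above.
log = [
    "2019-10-02 9:28:13 AM | climateprediction.net | No tasks sent",
    "2019-10-02 9:28:13 AM | climateprediction.net | Project requested delay of 3636 seconds",
]
delay = requested_delay(log)
print(delay, as_hms(delay))
```

So the "1 hour and a couple of minutes" backoff mentioned earlier in the thread corresponds to the 3636 seconds the project asks for.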
ID: 61144
Jim1348
Joined: 15 Jan 06
Posts: 608
Credit: 26,475,326
RAC: 5,766
Message 61145 - Posted: 2 Oct 2019, 16:49:25 UTC - in response to Message 61144.  

I have noticed that it starts out with a one-hour delay in any case. You may get something after that.
ID: 61145
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 3372
Credit: 10,104,924
RAC: 8,765
Message 61146 - Posted: 2 Oct 2019, 17:48:56 UTC

> Why does loading up a multi-core PC with a bunch of N144 models slow the progress down?

I only noticed a small hit on performance running 4 N144 tasks. With the IFS, having only 8 GB of RAM for 4 cores, there was a massive hit, though total throughput of tasks did still increase with each additional CPU. However, I think the hit on my SSD would be excessive if I did so more than occasionally.
ID: 61146
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7585
Credit: 24,102,463
RAC: 2,603
Message 61148 - Posted: 2 Oct 2019, 20:11:16 UTC - in response to Message 61144.  

wolfman

It IS quite possible that BOINC has enough work to be getting on with.
CPDN doesn't play well with other projects whose tasks last only minutes to a few hours.

Having other projects active at the same time only really works when CPDN has a large, constant stream of work.
Then BOINC can slot in a climate model when other project work is scheduled to run out "soon".

And now we're out of work again.
ID: 61148
Jean-David Beyer
Joined: 5 Aug 04
Posts: 612
Credit: 9,610,669
RAC: 8,481
Message 61149 - Posted: 2 Oct 2019, 22:04:51 UTC - in response to Message 61146.  

> I only noticed a small hit on performance running 4 N144 tasks. With the IFS, having only 8 GB of RAM for 4 cores, there was a massive hit, though total throughput of tasks did still increase with each additional CPU. However, I think the hit on my SSD would be excessive if I did so more than occasionally.

OK; I have 4 cores, and I ran 4 N144 tasks a while ago with 16 GBytes of RAM and the machine doing little else than some web browsing and e-mail. And how fast can I read or type? I am currently running two hadam4 N144 tasks and two hadcm3s tasks. When the hadam4 tasks complete, I will see whether they ran any faster than when I was running four. But I still wonder what the mechanism is by which running four N144 tasks at a time slows things down. Cache poisoning seems unlikely. Their disk requirements seem low, so that is not likely to be the problem. Shortage of memory is not a problem right now. The hadam4 tasks take about 4% of my RAM each, and the hadcm3s tasks take 1.1% of my RAM each, so it is not memory thrashing to disk. Could it be that your processor(s) overheat and get automatically throttled down due to the work load?

So far, I have received no IFS work units. I run Linux on a 64-bit Xeon processor, so I should be able to run them when they become more available.
ID: 61149
Les Bayliss
Volunteer moderator
Joined: 5 Sep 04
Posts: 7585
Credit: 24,102,463
RAC: 2,603
Message 61151 - Posted: 3 Oct 2019, 0:48:36 UTC - in response to Message 61149.  

If you're not having problems, then don't worry.

And the easiest way to tell if you have a problem may well be the sec/TS number on your Task page.

I found it was obvious when running the big introductory batch of OpenIFS earlier in the year.
ID: 61151
Jean-David Beyer
Joined: 5 Aug 04
Posts: 612
Credit: 9,610,669
RAC: 8,481
Message 61152 - Posted: 3 Oct 2019, 2:02:44 UTC - in response to Message 61151.  

> If you're not having problems, then don't worry.


Thank you.
ID: 61152
Dave Jackson
Volunteer moderator
Joined: 15 May 09
Posts: 3372
Credit: 10,104,924
RAC: 8,765
Message 61154 - Posted: 3 Oct 2019, 6:07:06 UTC

> Could it be that your processor(s) overheat and get automatically throttled down due to the work load?


Pretty sure in my case it was the disk writes slowing things down. Though this was around the time when the temperature in Cambridge set a new UK record. The fan in my laptop is quieter now with 4 cores running than it was then with two! With the IFS I know it was swapping data out to the swap partition from main memory.
ID: 61154
wolfman1360
Joined: 18 Feb 17
Posts: 81
Credit: 11,351,817
RAC: 2,861
Message 61155 - Posted: 3 Oct 2019, 6:21:40 UTC - in response to Message 61148.  

> wolfman
>
> It IS quite possible that BOINC has enough work to be getting on with.
> cpdn doesn't play well with other projects, with their tasks that only last minutes to a few hours.
>
> Having other projects active at the same time only really works when cpdn has a large, constant stream of work.
> Then BOINC can slot in a climate model when other project work is scheduled to run out "soon".
>
> And now we're out of work again.

Would setting another project to, say, 10% and CPDN to, say, 190 work at all? Along those same lines, would having a third backup project set at 0% resources, just in case the former two are out of work (which in this case can happen with a fair bit of regularity), harm fetching of new work? It was a toss-up between this and LHC.
I've got quite a few Windows 10 machines. I know BOINC can harness the Windows 10 Linux subsystem; I'm just not sure exactly how seamless it actually is. I'll throw the libraries into the "Linux" install in the terminal tomorrow and see what happens from there.
Hopefully, with more than one machine looking for work, I'll get somewhere when work is available.
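
To put numbers on the resource-share idea, here is a small Python sketch. It assumes, as I understand BOINC's behaviour, that resource shares act as relative weights, so each project's long-run fraction is its share divided by the sum of all shares, and that a share of 0 marks a backup project that is only asked for work when the others have none. The project labels and the helper name `share_fractions` are placeholders.

```python
def share_fractions(shares):
    """Map each project's resource share to its fraction of the total."""
    total = sum(shares.values())
    if total == 0:
        # Every project is a zero-share backup; no fractions to compute.
        return {name: 0.0 for name in shares}
    return {name: value / total for name, value in shares.items()}


# Values from the question above: 190 for CPDN, 10 for a second project,
# and 0 for a backup project.
fractions = share_fractions({"CPDN": 190, "second project": 10, "backup": 0})
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.0%}")
```

On that reading, 190 versus 10 would aim for roughly 95% of the processing going to CPDN over the long run, whenever CPDN actually has work to send.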
ID: 61155

©2022 climateprediction.net