climateprediction.net home page
Posts by leloft

1) Message boards : Number crunching : The uploads are stuck (Message 67887)
Posted 19 Jan 2023 by leloft
Post:
The 100 GB limit is -

If you DO NOT check the "Use no more than XXXX GB" box, the default value is 100 GB.

In other words, checking the box with a value of 100 GB is the same as not checking it at all.


When I learnt of this limit, I just set it to 1000G in the preferences and controlled disk usage through the 'use no more than xG' or 'leave at least yG free' settings. The preferences clearly state that the lowest of the 3 limits will be used.
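
As a rough illustration of the "lowest of the 3 limits" rule, here is a minimal shell sketch. The function name, its arguments and the arithmetic are assumptions made for illustration; they are not taken from the BOINC client itself.

```shell
# allowed_mb TOTAL FREE USED MAX_USED MIN_FREE MAX_PCT
# All sizes in whole MB, MAX_PCT in percent. Prints how many more MB the
# client could use under the strictest of the three preferences.
# (Hypothetical helper; the body runs in a subshell so its variables stay local.)
allowed_mb() (
    total=$1; free=$2; used=$3; max_used=$4; min_free=$5; max_pct=$6
    by_cap=$(( max_used - used ))                  # 'use no more than X'
    by_free=$(( free - min_free ))                 # 'leave at least Y free'
    by_pct=$(( total * max_pct / 100 - used ))     # 'use no more than Z% of total'
    lowest=$by_cap
    if [ "$by_free" -lt "$lowest" ]; then lowest=$by_free; fi
    if [ "$by_pct" -lt "$lowest" ]; then lowest=$by_pct; fi
    if [ "$lowest" -lt 0 ]; then lowest=0; fi      # never report a negative allowance
    echo "$lowest"
)
```

For example, `allowed_mb 47000 18054 26511 1000000 1000 99` prints 17054: with a very large 'use no more than' value, the 'leave at least' setting becomes the limit that actually bites.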
2) Message boards : Number crunching : Why does this task fail ? (Message 67860)
Posted 18 Jan 2023 by leloft
Post:
Hello. Could I please ask for clarification? I am generating several 'Error while Computing' results per day per host. Here is a typical one:
22286062 | 12199858 | 1534812 | 12 Jan 2023, 5:38:05 UTC | 18 Jan 2023, 15:26:09 UTC | Error while computing | 73,196.55 | 73,196.55 | --- | OpenIFS 43r3 Perturbed Surface v1.05 x86_64-pc-linux-gnu

The last few lines of the stderr output for this task are these

[...]
Zipping up the final file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_0214_2014050100_123_983_12199858_0_r1018304785_122.zip
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
14:47:17 (32691): called boinc_finish(0)

</stderr_txt>
]]>

Could someone please explain why the model finishes with a 'final file' ***_122.zip? Are these errors only detectable after the model has run to completion and been uploaded? I'm not sure where to start trying to reduce this high rate of errors.

I have restricted all hosts with app_configs that allow 5.5G memory per task leaving ~10G free for the system. Where should I start unpicking these errors?

Many thanks
fraser
3) Message boards : Number crunching : Upload server is out of disk space (Message 67739)
Posted 15 Jan 2023 by leloft
Post:
I think I've got a workaround to the 'too many uploads' issue. Thanks to all who contributed bits towards this. It appeared that actively crunching clients had more success at securing upload slots, so I changed <ncpus> from 24 to 40 in cc_config and reread it. The client downloaded 8 units and started to process them. The host has been uploading solidly since 21.00 last night and has no trouble regaining an upload slot within seconds of dropping it. I have no real idea why this should have worked, except to guess that the ability to secure an upload slot is somehow enhanced by having an actively crunching client.
Best,
fraser
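
For anyone wanting to try the same change, here is a minimal sketch of the kind of cc_config.xml involved. The helper name is made up for illustration, and the fragment shows only the <ncpus> option; a real cc_config.xml may well contain other options that should be kept.

```shell
# print_cc_config NCPUS: write a minimal cc_config.xml to stdout.
# (Hypothetical helper; only the <ncpus> option from the post is shown.)
print_cc_config() {
    cat <<EOF
<cc_config>
  <options>
    <ncpus>$1</ncpus>
  </options>
</cc_config>
EOF
}
```

Running `print_cc_config 40` shows the fragment. After editing the real file in the BOINC data directory, `boinccmd --read_cc_config` should make the client reread it without a restart.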
4) Message boards : Number crunching : Upload server is out of disk space (Message 67722)
Posted 14 Jan 2023 by leloft
Post:
Always check with the User manual.

That's a shortened output of boinccmd --help.
The command 'boinccmd --set_network_mode always' doesn't do anything, but that's because it's already set to 'always' in boinctui. I was after a boinccmd option that would do the same as the 'retry' tools in BOINC Manager, but there doesn't seem to be one, which seems strange. The nearest seemed to be '--network_available' ('retry deferred network communication'). I'll just wait it out.
5) Message boards : Number crunching : Upload server is out of disk space (Message 67707)
Posted 14 Jan 2023 by leloft
Post:
You can try and clear things, by using the 'retry' tools in BOINC Manager


What would that be in boinccmd? --network_available seems to do nothing, I assumed it was a toggle;
--file_transfer requires a filename and doesn't work with wildcards. I was hoping to set up a cronjob to try and improve the chances of getting a slot.

It seems to be a case of giving to those who already have. Is there some way the backing-off period could be reduced to a few minutes for those machines that have failed to upload and a few tens of minutes for those that succeeded? If the question is simply a correlation between number of attempts and successful uploads, then allowing unsuccessful hosts shorter times between tries would stand a better chance of clearing some of these 'too many uploads' errors, at least enough to allow the stalled hosts to resume active duty. Just a thought.

fraser
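
A possible way round the lack of wildcards, sketched here under assumptions: list the pending transfers, pull out the file names, and issue one '--file_transfer ... retry' per file. The two function names are hypothetical, and the "name:" line format of the '--get_file_transfers' output is an assumption that should be checked against a real client before use.

```shell
# transfer_names: read `boinccmd --get_file_transfers` output on stdin and
# print one file name per line.
# ASSUMPTION: each transfer is listed with a line of the form "name: <file>".
transfer_names() {
    sed -n 's/^[[:space:]]*name:[[:space:]]*//p'
}

# retry_names URL: read file names on stdin and ask the client to retry each.
# Set BOINCCMD=echo for a dry run that only prints the arguments.
retry_names() (
    url=$1
    while IFS= read -r name; do
        # stdin is redirected so the command cannot swallow the list of names
        "${BOINCCMD:-boinccmd}" --file_transfer "$url" "$name" retry < /dev/null
    done
)
```

A cron entry could then run something like `boinccmd --get_file_transfers | transfer_names | retry_names https://climateprediction.net/` every few minutes. Whether more frequent retries actually improve the odds of getting a slot is untested here.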
6) Message boards : Number crunching : Upload server is out of disk space (Message 67688)
Posted 14 Jan 2023 by leloft
Post:
I'll post updates if I get them to the 'Uploads are stuck' thread, am busy with other things. I'm sure Dave will update when he hears anything too.


Here is an observation: I have five hosts with WU in uploading status. Of these five, three of them are successfully uploading files and as they are disgorging their backlog, they are able to download new WU, process and upload them. The two other hosts that are failing to secure an upload slot are blocked from downloading as they are up to capacity and therefore idle. Can anyone confirm that actively crunching machines are more successful at elbowing their way in to an upload slot? If so, it seems that it would be a shame that these machines are uploading 20 hours into a 28 day deadline, while backlog-enforced idling hosts are unable to fight their way onto the server. Just an observation, but it feels that it is more than just a sampling error.

fraser
7) Message boards : Number crunching : The uploads are stuck (Message 67571)
Posted 11 Jan 2023 by leloft
Post:
Fix for - Need more disk space. You currently have 0.00 MB available.

In the BOINC Manager, Options -> Computing Preferences -> Disk and memory -

Check the box "Use no more than" and put a number in the number box equal to about 3/4 of your disk size (or some other number you are comfortable with).

If you leave this box UNCHECKED, it is the same as having it checked with 100 (GB) in the number box.

At least that is how it works for me.


Good advice, thank you. I set 'use no more than' to 1000G, 'leave at least' to 1G free, 'use at most' to 99% of disk, updated the project, and now the two hosts in question are processing new units.
2/4 hosts uploading as well.
Onwards and forwards.
fraser
8) Message boards : Number crunching : The uploads are stuck (Message 67555)
Posted 11 Jan 2023 by leloft
Post:

I think you were just unlucky you got a resend from the first batch. I suspect if you try again, you might get a couple of 'corrected' tasks from the other batches. Try it?

Doubly unlucky: I've just had the same refusal from both the first machine and now a second one, both refer to the same value 7168.00 MB. The good news is that one of the hosts has managed to upload 8 tasks.
I'll keep trying, but I'm limited by the 3636 seconds rule.

Best
fraser
9) Message boards : Number crunching : The uploads are stuck (Message 67544)
Posted 11 Jan 2023 by leloft
Post:
Thanks for your reply.

Hi Fraser,
I suggest removing any boinc limits on disk space (temporarily if need be). In the boincmgr app (or equiv for boinccmd), untick to remove any disk limits for: 'Use no more than', 'Leave at least', & 'Use no more than'. If those are all disabled, the messages about insufficient disk should disappear.

There are no limits on disk space: /var/lib/boinc-client has its own 46G partition. These restrictions have been 'unticked' in the account preferences for all 'locations' for a while (days/weeks) since the upload issues went long term.

I'm puzzled boinc gave you the tasks if there wasn't enough memory. Did you by any chance change your disk limits lately?

It's not a memory issue, the refusal was based on disk space. I've checked to see if it was a swap issue but swap is at 0.45% (of 12G). Host has 16G RAM, of which 10.5% (1.7G) in use.

If that doesn't work, let us know.

It's not working, but I haven't changed anything, so no surprises there. The host is this one ID: 1523000. If you want any logs, let me know and I'll send you the last 12 hours worth. I'll report any changes if it clears itself.

Best
fraser
10) Message boards : Number crunching : The uploads are stuck (Message 67540)
Posted 11 Jan 2023 by leloft
Post:
Hi. I'm seeing an error message that there is insufficient space on one of my hosts from the project update process, but df, boinccmd and boinctui all report that there is over 17GB available. No movement on all four hosts, three of which are in the 'too many uploads' loop.

update requested by user
11-Jan-2023 12:27:39 [climateprediction.net] Sending scheduler request: Requested by user.
11-Jan-2023 12:27:39 [climateprediction.net] Requesting new tasks for CPU
11-Jan-2023 12:27:41 [climateprediction.net] Scheduler request completed: got 0 new tasks
11-Jan-2023 12:27:41 [climateprediction.net] No tasks sent
11-Jan-2023 12:27:41 [climateprediction.net] OpenIFS 43r3 Perturbed Surface needs 38146.97MB more disk space. You currently have 0.00 MB available and it needs 38146.97 MB.
11-Jan-2023 12:27:41 [climateprediction.net] OpenIFS 43r3 Perturbed Surface needs 7168.00MB more disk space. You currently have 0.00 MB available and it needs 7168.00 MB.
11-Jan-2023 12:27:41 [climateprediction.net] Project requested delay of 3636 seconds


boinccmd --get_disk_usage
======== Disk usage ========
total: 47000.71MB
free: 18054.40MB
1) -----------
master URL: https://climateprediction.net/
disk usage: 26511.11MB

Any ideas?

fraser
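
The comparison made by hand above can be scripted. This is a small sketch with hypothetical function names; it relies only on the 'free: ...MB' line format shown in the '--get_disk_usage' output quoted in this post, and it does not explain why the scheduler reports 0.00 MB.

```shell
# free_mb: read `boinccmd --get_disk_usage` output on stdin and print the
# figure from the first "free: <number>MB" line.
free_mb() {
    sed -n 's/^[[:space:]]*free:[[:space:]]*\([0-9.][0-9.]*\)MB.*/\1/p' | head -n 1
}

# has_room FREE_MB NEEDED_MB: succeed if FREE_MB >= NEEDED_MB.
# awk is used because the figures have decimals.
has_room() {
    awk -v free="$1" -v need="$2" 'BEGIN { exit !(free + 0 >= need + 0) }'
}
```

With the figures above, `has_room 18054.40 7168.00` succeeds and `has_room 18054.40 38146.97` fails, so by the client's own numbers the smaller request should have fitted.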
11) Message boards : Number crunching : The uploads are stuck (Message 67488)
Posted 10 Jan 2023 by leloft
Post:
Edit: Just realized if you can't write state file, any messing within BOINC might be hopeless. So have to find the space elsewhere from the system.


Indeed, that's what I've done. The loss of the state file has caused problems: presumably, the .old state file was accessed as the client downloaded some hadam files; it also couldn't locate some of the oifs files and so 20 or so were abandoned as errors, with the loss of 20 results.

My next move is to split the /boinc-client folder: I'm thinking to leave the boinc-client directory on the /var/lib partition but mount the /projects folder on a separate partition. At the moment, the whole of the boinc-client folder is on a separate partition. This arrangement would have meant that the state file could still have been written, much like mounting /var/log separately to /var. Any thoughts?
12) Questions and Answers : Unix/Linux : Help requested - using new hard disk under Linux Mint 21 [SOLVED] (Message 67477)
Posted 9 Jan 2023 by leloft
Post:

Can you confirm whether those changes are persistent - i.e. will the new disk become the active BOINC data directory for subsequent restarts? Or, if not, can it be scripted? I'd prefer not to have to go through it after every restart.

Provided that the fstab entry for the disk and mountpoint are correctly entered and saved, the changes should persist between reboots. At least, they do on my sysvinit hosts; it might be worth checking the situation with someone who understands systemd.mount, although there doesn't seem to be any conflict.

https://unix.stackexchange.com/questions/90723/is-there-any-reason-to-move-away-from-fstab-on-a-systemd-system

Don't forget to confirm that you've backed up the old fstab before unleashing the blkid >>

best
fraser
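
One way to check the fstab side of this before rebooting, as a sketch: the function name is hypothetical, and it only tests that some uncommented line names the mountpoint in its second field. It does not validate the device or the options.

```shell
# fstab_has_mount FSTAB MOUNTPOINT: succeed if an uncommented entry in FSTAB
# uses MOUNTPOINT as its second field.
fstab_has_mount() {
    awk -v mp="$2" '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$1"
}
```

Usage would be along the lines of `fstab_has_mount /etc/fstab /var/lib/boinc-client && echo ok`. A raw line appended by blkid, not yet edited into a proper entry, should not pass this check, because its second field is not a mountpoint.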
13) Questions and Answers : Unix/Linux : Help requested - using new hard disk under Linux Mint 21 [SOLVED] (Message 67456)
Posted 9 Jan 2023 by leloft
Post:
As before, all suggestions are welcome.


Hi. I have had to do this quite often with boinc. This is what I do. /dev/sdX is your new drive. You'll need to adapt this for systemd. Before this procedure, I used parted to make the /dev/sdX partition and ran # mkfs -t ext4 /dev/sdX.

# service boinc-client stop
# mkdir /tmp/bc
# mount /dev/sdX /tmp/bc
# rsync -a /var/lib/boinc-client/ /tmp/bc (note trailing /)
# mv /var/lib/boinc-client /path/to/backup/directory/for/safe/keeping
# blkid /dev/sdX >> /etc/fstab (NOTE double arrows!!!)
# vi /etc/fstab (edit last entry to mount /var/lib/boinc-client on /dev/sdX)
# umount /tmp/bc
# mkdir /var/lib/boinc-client
# chown boinc:boinc /var/lib/boinc-client
# mount -a
# service boinc-client start

*Only* when you are confident that the data transfer has worked properly, remove the backup. Before doing this, please wait for a couple of confirmations that these instructions are good to go. I've copied them from a bash history, so I can confirm that they worked on that machine. YMMV

Good luck...
fraser
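
To help decide when the transfer "has worked properly", one option is to compare the backup with the newly mounted directory before starting the client again, since the files change once boinc-client is running. This is only a sketch: the function name is made up, and diff compares file contents, not ownership or permissions.

```shell
# trees_match DIR_A DIR_B: succeed if both trees hold the same files with the
# same contents. lost+found is skipped because a fresh ext4 filesystem has one
# at its root and the backup does not.
trees_match() {
    diff -rq -x lost+found "$1" "$2" > /dev/null
}
```

For example, run `trees_match /path/to/backup/directory/for/safe/keeping/boinc-client /var/lib/boinc-client && echo identical` after 'mount -a' and before 'service boinc-client start'.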
14) Message boards : Number crunching : The uploads are stuck (Message 67451)
Posted 9 Jan 2023 by leloft
Post:
2) The admins shouldn't assume once the upload server is up, everyone is good. It might take a while to fully drain pending uploads before things go back to normal.


Hello. A related issue: On one of my hosts, the boinc-client partition is full and so the state file cannot be written and boinc exits. There are over 120 WU waiting to upload (174G), but if the state file cannot be written, I am concerned that these uploads will not get initiated when the upload server comes back online. Can anyone suggest a workaround for this? My first instinct is to move a chunk of the data to an adjacent partition, but I do not know how to ensure that the data structures will remain intact. In other words: How do I move exactly 100% of half of the completed WU from the data directory to a holding directory? I have done this several times (including twice this week) to move an entire data partition to a new, bigger one using rsync -a which works reliably, but as I do not know how to move these completed WU in their entirety, I'd appreciate some feedback.

Many thanks
15) Message boards : Number crunching : The uploads are stuck (Message 67255)
Posted 3 Jan 2023 by leloft
Post:
I'm Uploading.
16) Questions and Answers : Unix/Linux : Running 32-bit MacOS Tasks on Linux with KVM (Message 65303)
Posted 16 Mar 2022 by leloft
Post:
I'm not sure it's fair to call it a distraction,

Haha, my bad. The distraction is mine: I am not used to having to be in front of any one machine to manage boinc. The machines are spread over two locations 15km apart, two labs, three classrooms and two offices, all of which have their own timetables. I manage everything over ssh via boinctui from wherever I happen to be working. Having boinc running as a service means that workunits are likely to survive a reboot on a remote host, even if with a small loss of work since the last checkpoint. I guess I'll just wait for the next batch of WU, they've got to be coming soon!
17) Questions and Answers : Unix/Linux : Running 32-bit MacOS Tasks on Linux with KVM (Message 65297)
Posted 16 Mar 2022 by leloft
Post:
boinccmd should work the same as in the Linux version

Edit: There may or may not be permissions issues. I have never been close enough to a Mac for long enough, but if there are any, anyone capable of getting the VM set up can probably navigate them.


Thanks for the edit. The 'mac' terminal doesn't seem to understand the command 'boinccmd' and I'm just using Boinc manager, but it's complaining that it needs to be reinstalled. I'm not satisfied that the VM route is anything other than a distraction (although an interesting one). It's very volatile: the machine needed a kernel update and after the reboot, the VM dropped a couple of WU that had been downloaded but not started; they are no longer visible in Boinc manager. I'd like to finish the 8 WU under computation and then close the VM. So here are the questions: how in Boinc manager do I set the equivalent of 'No New Tasks' in boinctui? And can I abort those two dropped WU from the cpdn site so that they get re-assigned promptly? Or can anyone direct me to where they might be stored on the 'mac' and how to re-attach them in BM?

Many thanks

leloft
18) Questions and Answers : Unix/Linux : Running 32-bit MacOS Tasks on Linux with KVM (Message 65260)
Posted 10 Mar 2022 by leloft
Post:
Hello. Strange events: I've just set up a qemu/kvm instance of mojave, installed boinc manager, attached to the project and apparently successfully downloaded files (there are four marked as in progress; computer ID: 1528682). However, Boinc manager is empty and holds the message 'No work available to process'. I am used to managing boinc through boinctui and I wondered if there was a terminal interface available in the mac version so I can see what's going on.

Many thanks

leloft
19) Message boards : Number crunching : Download errors on UK Met Office HadAM4 at N216 resolution v8.52 tasks (Message 64680)
Posted 21 Oct 2021 by leloft
Post:
Same stuff repeating on my end on a Linux VM (Host 1519938). Tasks are stuck in the "Download: retry in xx:xx" loop. According to the event log some files started to download okay, but got suddenly stuck with the message "Temporarily failed download of …" & "Backing off xx:xx on download of …".

Me too. Same issue on two hosts (ID: 1522999; ID: 1523002), although other hosts have received work units after the first one reported problems. Do I need to do anything, or does this get resolved server-side?
Many thanks
20) Questions and Answers : Getting started : New to Climate Prediction - No tasks for a week (Message 64493)
Posted 23 Sep 2021 by leloft
Post:
If this is just because of workunit availability, I'll just reattach and do other projects while I wait.

The other machines have been assigned additional workunits since last night, but this machine is still being denied any. I cannot see any reason other than that the cpdn server doesn't 'like' this machine's configuration. Is there any way I can check? There is nothing except an app_config.xml in the climateprediction.net folder.

Many thanks
leloft

