1)
Message boards :
Number crunching :
The uploads are stuck
(Message 67887)
Posted 19 Jan 2023 by leloft Post: The 100 GB limit is - When I learnt of this limit, I just set it to 1000G in the preferences and controlled disk usage through the 'use no more than xG' or 'leave at least yG free' settings. The preferences clearly state that the lowest of the 3 limits will be used. |
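A rough illustration of the "lowest of the 3 limits" rule mentioned in this post. This is only a sketch of how the three preferences might combine, not the BOINC client's actual code; the function name and parameter names are made up for the example.

```python
def boinc_disk_allowance_gb(total_gb, free_gb, boinc_used_gb,
                            max_used_gb, min_free_gb, max_used_pct):
    """Estimate how many more GB BOINC may use under the three disk preferences.

    Hypothetical helper: the formula is an assumption based on the wording
    of the preferences page, not taken from the BOINC source.
    """
    # 'use no more than X GB'
    limit_abs = max_used_gb
    # 'leave at least Y GB free': BOINC may grow into the free space minus Y
    limit_free = boinc_used_gb + free_gb - min_free_gb
    # 'use at most Z% of total'
    limit_pct = total_gb * max_used_pct / 100.0
    # The most restrictive (lowest) of the three wins
    allowed_total = min(limit_abs, limit_free, limit_pct)
    return max(0.0, allowed_total - boinc_used_gb)


# Example with round, illustrative figures for a 47 GB partition
remaining = boinc_disk_allowance_gb(
    total_gb=47.0, free_gb=18.0, boinc_used_gb=26.5,
    max_used_gb=1000.0, min_free_gb=1.0, max_used_pct=99.0,
)
print(remaining)
```

With a generous 'use no more than' value such as 1000G, that limit never becomes the lowest one, so the other two settings end up controlling disk usage.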
2)
Message boards :
Number crunching :
Why does this task fail ?
(Message 67860)
Posted 18 Jan 2023 by leloft Post: Hello. Could I please ask for clarification? I am generating several 'Error while computing' results per day per host. Here is a typical one:
22286062 12199858 1534812 12 Jan 2023, 5:38:05 UTC 18 Jan 2023, 15:26:09 UTC Error while computing 73,196.55 73,196.55 --- OpenIFS 43r3 Perturbed Surface v1.05 x86_64-pc-linux-gnu
The last few lines of the stderr output for this task are these:
[...]
Zipping up the final file: /var/lib/boinc-client/projects/climateprediction.net/oifs_43r3_ps_0214_2014050100_123_983_12199858_0_r1018304785_122.zip
Uploading the final file: upload_file_122.zip
Uploading trickle at timestep: 10623600
14:47:17 (32691): called boinc_finish(0)
</stderr_txt>
]]>
Could someone please explain why the model finishes with a 'final file' ***_122.zip? Are these errors only detectable after the model has run to completion and been uploaded? I'm not sure where to start trying to reduce this high rate of errors. I have restricted all hosts with app_configs that allow 5.5G memory per task, leaving ~10G free for the system. Where should I start unpicking these errors? Many thanks fraser |
3)
Message boards :
Number crunching :
Upload server is out of disk space
(Message 67739)
Posted 15 Jan 2023 by leloft Post: I think I've got a workaround to the 'too many uploads' issue. Thanks to all who contributed bits towards this. It appeared that actively crunching clients had more success at securing upload slots, so I changed <ncpus> from 24 to 40 in cc_config and reread it. The client downloaded 8 units and started to process them. The host has been uploading solidly since 21.00 last night and has no trouble regaining an upload slot within seconds of dropping it. I have no real idea why this should have worked, except to guess that the ability to secure an upload slot is somehow enhanced by having an actively crunching client. Best, fraser |
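For anyone wanting to script the <ncpus> change described in this post, here is a minimal sketch. It assumes the usual cc_config.xml layout with <ncpus> inside <options>; the function name is hypothetical, and the sample XML is illustrative rather than copied from the host in question.

```python
import xml.etree.ElementTree as ET


def set_ncpus(cc_config_xml, ncpus):
    """Return cc_config XML text with <options><ncpus> set to the given value.

    Hypothetical helper; creates <options> and <ncpus> if they are missing.
    """
    root = ET.fromstring(cc_config_xml)
    options = root.find("options")
    if options is None:
        options = ET.SubElement(root, "options")
    node = options.find("ncpus")
    if node is None:
        node = ET.SubElement(options, "ncpus")
    node.text = str(ncpus)
    return ET.tostring(root, encoding="unicode")


# Illustrative input: a config that currently advertises 24 CPUs
before = "<cc_config><options><ncpus>24</ncpus></options></cc_config>"
print(set_ncpus(before, 40))
```

After writing the result back to cc_config.xml, the client still has to reread the file (for example with boinccmd --read_cc_config) before the new value takes effect.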
4)
Message boards :
Number crunching :
Upload server is out of disk space
(Message 67722)
Posted 14 Jan 2023 by leloft Post: Always check with the User manual. That's a shortened output of boinccmd --help. The command 'boinccmd --set_network_mode always' doesn't do anything, but that's because it's already set to 'always' in boinctui. I was after a boinccmd option that would do the same as the 'retry' tools in BOINC Manager, but there doesn't seem to be one, which seems strange. The nearest seemed to be '--network_available' ('retry deferred network communication'). I'll just wait it out. |
5)
Message boards :
Number crunching :
Upload server is out of disk space
(Message 67707)
Posted 14 Jan 2023 by leloft Post: You can try and clear things, by using the 'retry' tools in BOINC Manager What would that be in boinccmd? --network_available seems to do nothing (I assumed it was a toggle); --file_transfer requires a filename and doesn't work with wildcards. I was hoping to set up a cron job to try to improve the chances of getting a slot. It seems to be a case of giving to those who already have. Is there some way the back-off period could be reduced to a few minutes for those machines that have failed to upload, and set to a few tens of minutes for those that succeeded? If the question is simply a correlation between number of attempts and successful uploads, then allowing unsuccessful hosts shorter times between tries would stand a better chance of clearing some of these 'too many uploads' errors, at least enough to allow the stalled hosts to resume active duty. Just a thought. fraser |
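Since --file_transfer wants one filename at a time, one possible way round the missing wildcard support is to list the pending transfers first and build one retry command per upload. The sketch below assumes that `boinccmd --get_file_transfers` prints `name:` and `direction:` lines for each transfer; that output layout is an assumption, the function name is hypothetical, and the sample text is made up for illustration.

```python
def retry_commands(transfers_output, project_url):
    """Build one 'boinccmd --file_transfer URL NAME retry' command per upload.

    Assumes each transfer is printed as a 'name: ...' line followed by a
    'direction: upload' or 'direction: download' line.
    """
    commands = []
    name = None
    for raw in transfers_output.splitlines():
        key, sep, value = raw.strip().partition(":")
        if not sep:
            continue  # headers and separators carry no key/value pair
        value = value.strip()
        if key == "name":
            name = value
        elif key == "direction":
            if value == "upload" and name is not None:
                commands.append(
                    ["boinccmd", "--file_transfer", project_url, name, "retry"]
                )
            name = None  # this transfer record has been handled
    return commands


# Made-up sample in the assumed layout
sample = """======== File transfers ========
1) -----------
   name: upload_file_122.zip
   direction: upload
2) -----------
   name: some_input_file.zip
   direction: download
"""
for cmd in retry_commands(sample, "https://climateprediction.net/"):
    print(" ".join(cmd))
```

Each returned list could be handed to subprocess.run from a cron-driven script; whether retrying more often actually improves the odds of getting an upload slot is untested here.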
6)
Message boards :
Number crunching :
Upload server is out of disk space
(Message 67688)
Posted 14 Jan 2023 by leloft Post: I'll post updates if I get them to the 'Uploads are stuck' thread, am busy with other things. I'm sure Dave will update when he hears anything too. Here is an observation: I have five hosts with WU in uploading status. Of these five, three of them are successfully uploading files and as they are disgorging their backlog, they are able to download new WU, process and upload them. The two other hosts that are failing to secure an upload slot are blocked from downloading as they are up to capacity and therefore idle. Can anyone confirm that actively crunching machines are more successful at elbowing their way in to an upload slot? If so, it seems that it would be a shame that these machines are uploading 20 hours into a 28 day deadline, while backlog-enforced idling hosts are unable to fight their way onto the server. Just an observation, but it feels that it is more than just a sampling error. fraser |
7)
Message boards :
Number crunching :
The uploads are stuck
(Message 67571)
Posted 11 Jan 2023 by leloft Post: Fix for - Need more disk space. You currently have 0.00 MB available. Good advice, thank you. I set 'use no more than' to 1000G, 'leave at least' to 1G free and 'use at most' to 99% of disk, then updated the project, and now the two hosts in question are processing new units. 2/4 hosts are uploading as well. Onwards and forwards. fraser |
8)
Message boards :
Number crunching :
The uploads are stuck
(Message 67555)
Posted 11 Jan 2023 by leloft Post:
Doubly unlucky: I've just had the same refusal from both the first machine and now a second one, both refer to the same value 7168.00 MB. The good news is that one of the hosts has managed to upload 8 tasks. I'll keep trying, but I'm limited by the 3636 seconds rule. Best fraser |
9)
Message boards :
Number crunching :
The uploads are stuck
(Message 67544)
Posted 11 Jan 2023 by leloft Post: Thanks for your reply. Hi Fraser, There are no limits on disk space: /var/lib/boinc-client has its own 46G partition. These restrictions have been 'unticked' in the account preferences for all 'locations' for a while (days/weeks) since the upload issues went long term. I'm puzzled boinc gave you the tasks if there wasn't enough memory. Did you by any chance change your disk limits lately? It's not a memory issue, the refusal was based on disk space. I've checked to see if it was a swap issue but swap is at 0.45% (of 12G). Host has 16G RAM, of which 10.5% (1.7G) is in use. If that doesn't work, let us know. It's not working, but I haven't changed anything, so no surprises there. The host is this one ID: 1523000. If you want any logs, let me know and I'll send you the last 12 hours' worth. I'll report any changes if it clears itself. Best fraser |
10)
Message boards :
Number crunching :
The uploads are stuck
(Message 67540)
Posted 11 Jan 2023 by leloft Post: Hi. I'm seeing an error message that there is insufficient space on one of my hosts from the project update process, but df, boinccmd and boinctui all report that there is over 17GB available. No movement on all four hosts, three of which are in the 'too many uploads' loop.
update requested by user
11-Jan-2023 12:27:39 [climateprediction.net] Sending scheduler request: Requested by user.
11-Jan-2023 12:27:39 [climateprediction.net] Requesting new tasks for CPU
11-Jan-2023 12:27:41 [climateprediction.net] Scheduler request completed: got 0 new tasks
11-Jan-2023 12:27:41 [climateprediction.net] No tasks sent
11-Jan-2023 12:27:41 [climateprediction.net] OpenIFS 43r3 Perturbed Surface needs 38146.97MB more disk space. You currently have 0.00 MB available and it needs 38146.97 MB.
11-Jan-2023 12:27:41 [climateprediction.net] OpenIFS 43r3 Perturbed Surface needs 7168.00MB more disk space. You currently have 0.00 MB available and it needs 7168.00 MB.
11-Jan-2023 12:27:41 [climateprediction.net] Project requested delay of 3636 seconds
boinccmd --get_disk_usage
======== Disk usage ========
total: 47000.71MB
free: 18054.40MB
1) -----------
master URL: https://climateprediction.net/
disk usage: 26511.11MB
Any ideas? fraser |
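To compare the scheduler's figures with what the client reports locally, the `boinccmd --get_disk_usage` output quoted in this post can be parsed with a few regular expressions. This is a small sketch based only on the output shown here; the function name is hypothetical and it reads just the first project's usage.

```python
import re


def parse_disk_usage(text):
    """Extract total, free and first-project usage (in MB) from the output text."""
    fields = {}
    labels = (("total_mb", "total"), ("free_mb", "free"), ("project_mb", "disk usage"))
    for key, label in labels:
        match = re.search(rf"\b{label}:\s*([0-9.]+)MB", text)
        if match:
            fields[key] = float(match.group(1))
    return fields


# Output as quoted in the post
output = (
    "======== Disk usage ======== total: 47000.71MB free: 18054.40MB "
    "1) ----------- master URL: https://climateprediction.net/ "
    "disk usage: 26511.11MB"
)
usage = parse_disk_usage(output)
for needed_mb in (38146.97, 7168.00):
    # Does the amount the scheduler asked for fit in the locally reported free space?
    print(needed_mb, "fits" if needed_mb <= usage["free_mb"] else "does not fit")
```

On these figures the 7168.00 MB request would fit in the free space the client reports, which is why the scheduler's '0.00 MB available' looked wrong from the host's side.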
11)
Message boards :
Number crunching :
The uploads are stuck
(Message 67488)
Posted 10 Jan 2023 by leloft Post: Edit: Just realized if you can't write state file, any messing within BOINC might be hopeless. So have to find the space elsewhere from the system. Indeed, that's what I've done. The loss of the state file has caused problems: presumably, the .old state file was accessed as the client downloaded some hadam files; it also couldn't locate some of the oifs files and so 20 or so were abandoned as errors, with the loss of 20 results. My next move is to split the /boinc-client folder: I'm thinking to leave the boinc-client directory on the /var/lib partition but mount the /projects folder on a separate partition. At the moment, the whole of the boinc-client folder is on a separate partition. This arrangement would have meant that the state file could still have been written, much like mounting /var/log separately to /var. Any thoughts? |
12)
Questions and Answers :
Unix/Linux :
Help requested - using new hard disk under Linux Mint 21 [SOLVED]
(Message 67477)
Posted 9 Jan 2023 by leloft Post:
Provided that the fstab entry for the disk and mountpoint are correctly entered and saved, the changes should persist between reboots. At least, they do on my sysvinit hosts; it might be worth checking the situation with someone who understands systemd.mount, although there doesn't seem to be any conflict. https://unix.stackexchange.com/questions/90723/is-there-any-reason-to-move-away-from-fstab-on-a-systemd-system Don't forget to confirm that you've backed up the old fstab before unleashing the blkid >> best fraser |
13)
Questions and Answers :
Unix/Linux :
Help requested - using new hard disk under Linux Mint 21 [SOLVED]
(Message 67456)
Posted 9 Jan 2023 by leloft Post: As before, all suggestions are welcome. Hi. I have had to do this quite often with boinc. This is what I do. /dev/sdX is your new drive. You'll need to adapt this for systemd. I used parted to make the /dev/sdX partition and # mkfs -t ext4 /dev/sdX before this procedure.
# service boinc-client stop
# mkdir /tmp/bc
# mount /dev/sdX /tmp/bc
# rsync -a /var/lib/boinc-client/ /tmp/bc (note trailing /)
# mv /var/lib/boinc-client /path/to/backup/directory/for/safe/keeping
# blkid /dev/sdX >> /etc/fstab (NOTE double arrows!!!)
# vi /etc/fstab (edit last entry to mount /var/lib/boinc-client on /dev/sdX)
# umount /tmp/bc
# mkdir /var/lib/boinc-client
# chown boinc:boinc /var/lib/boinc-client
# mount -a
# service boinc-client start
*Only* when you are confident that the data transfer has worked properly, remove the backup. Before doing this, please wait for a couple of confirmations that these instructions are good to go. I've copied them from a bash history, so I can confirm that they worked on that machine. YMMV Good luck... fraser |
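The 'edit last entry' step is the fiddly part of the procedure: the line that blkid appends is not a valid fstab entry until it has been rewritten by hand. As a sketch of what that edit amounts to, the helper below turns a blkid line into an fstab line. The function name, the UUID in the example and the 'defaults 0 2' options are all illustrative assumptions, not values from the original post.

```python
import re


def fstab_entry_from_blkid(blkid_line, mountpoint,
                           options="defaults", dump=0, passno=2):
    """Convert one line of blkid output into a UUID-based fstab entry."""
    # \b stops PARTUUID= and PTTYPE= from matching in place of UUID= and TYPE=
    uuid = re.search(r'\bUUID="([^"]+)"', blkid_line)
    fstype = re.search(r'\bTYPE="([^"]+)"', blkid_line)
    if uuid is None or fstype is None:
        raise ValueError("blkid line has no UUID= or TYPE= field")
    return (f"UUID={uuid.group(1)} {mountpoint} {fstype.group(1)} "
            f"{options} {dump} {passno}")


# Hypothetical blkid output for the new drive
line = ('/dev/sdX: UUID="0a1b2c3d-1111-2222-3333-444455556666" '
        'BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="deadbeef-01"')
print(fstab_entry_from_blkid(line, "/var/lib/boinc-client"))
```

Mounting by UUID rather than by /dev/sdX name keeps the entry valid if the kernel enumerates the disks in a different order after a reboot.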
14)
Message boards :
Number crunching :
The uploads are stuck
(Message 67451)
Posted 9 Jan 2023 by leloft Post: 2) The admins shouldn't assume once the upload server is up, everyone is good. It might take a while to fully drain pending uploads before things go back to normal. Hello. A related issue: On one of my hosts, the boinc-client partition is full and so the state file cannot be written and boinc exits. There are over 120 WU waiting to upload (174G), but if the state file cannot be written, I am concerned that these uploads will not get initiated when the upload server comes back online. Can anyone suggest a workaround for this? My first instinct is to move a chunk of the data to an adjacent partition, but I do not know how to ensure that the data structures will remain intact. In other words: How do I move exactly 100% of half of the completed WU from the data directory to a holding directory? I have done this several times (including twice this week) to move an entire data partition to a new, bigger one using rsync -a which works reliably, but as I do not know how to move these completed WU in their entirety, I'd appreciate some feedback. Many thanks |
15)
Message boards :
Number crunching :
The uploads are stuck
(Message 67255)
Posted 3 Jan 2023 by leloft Post: I'm Uploading. |
16)
Questions and Answers :
Unix/Linux :
Running 32-bit MacOS Tasks on Linux with KVM
(Message 65303)
Posted 16 Mar 2022 by leloft Post: I'm not sure it's fair to call it a distraction, Haha, my bad. The distraction is mine: I am not used to having to be in front of any one machine to manage boinc. The machines are spread over two locations 15km apart, two labs, three classrooms and two offices, all of which have their own timetables. I manage everything over ssh via boinctui from wherever I happen to be working. Having boinc running as a service means that workunits are likely to survive a reboot on a remote host, even if with a small loss of work since the last checkpoint. I guess I'll just wait for the next batch of WU, they've got to be coming soon! |
17)
Questions and Answers :
Unix/Linux :
Running 32-bit MacOS Tasks on Linux with KVM
(Message 65297)
Posted 16 Mar 2022 by leloft Post: boinccmd should work the same as in the Linux version Thanks for the edit. The 'mac' terminal doesn't seem to understand the command 'boinccmd' and I'm just using BOINC Manager, but it's complaining that it needs to be reinstalled. I'm not satisfied that the VM route is anything other than a distraction (although an interesting one); it's very volatile: the machine needed a kernel update and, after the reboot, the VM dropped a couple of WU that had been downloaded but not started; they are no longer visible in BOINC Manager. I'd like to finish the 8 WU under computation and then close the VM. So here are the questions: how do I set the equivalent of boinctui's 'No New Tasks' in BOINC Manager? And can I abort those two dropped WU from the cpdn site so that they get re-assigned promptly? Or can anyone direct me to where they might be stored on the 'mac' and how to re-attach them in BM? Many thanks leloft |
18)
Questions and Answers :
Unix/Linux :
Running 32-bit MacOS Tasks on Linux with KVM
(Message 65260)
Posted 10 Mar 2022 by leloft Post: Hello. Strange events: I've just set up a qemu/kvm instance of Mojave, installed BOINC Manager, attached to the project and apparently successfully downloaded files (there are four marked as in progress (computer ID: 1528682)). However, BOINC Manager is empty and shows the message 'No work available to process'. I am used to managing boinc through boinctui and I wondered if there was a terminal interface available in the mac version so I can see what's going on. Many thanks leloft |
19)
Message boards :
Number crunching :
Download errors on UK Met Office HadAM4 at N216 resolution v8.52 tasks
(Message 64680)
Posted 21 Oct 2021 by leloft Post: Same stuff repeating on my end on a Linux VM (Host 1519938). Tasks are stuck in the "Download: retry in xx:xx" loop. According to the event log some files started to download okay, but got suddenly stuck with the message "Temporarily failed download of …" & "Backing off xx:xx on download of …". Me too. Same issue on two hosts (ID: 1522999; ID: 1523002), although other hosts have received work units after the first one reported problems. Do I need to do anything, or does this get resolved server-side? Many thanks |
20)
Questions and Answers :
Getting started :
New to Climate Prediction - No tasks for a week
(Message 64493)
Posted 23 Sep 2021 by leloft Post: If this is just because of workunit availability, I'll just reattach and do other projects while I wait. The other machines have been assigned additional workunits since last night, but this machine is still being denied any. I cannot see any reason other than that the cpdn server doesn't 'like' this machine's configuration. Is there any way I can check? There is nothing except an app_config.xml in the climateprediction.net folder. Many thanks leloft |
©2024 climateprediction.net