Batch 1008, and test batches 1009 to 1014 for Windows - issues
Author | Message |
---|---|
Send message Joined: 29 Oct 17 Posts: 816 Credit: 13,672,275 RAC: 8,057 |
Batch 1012 will fail on Intel machines. Batches 1013 & 1014 should continue running. --- CPDN Visiting Scientist |
Send message Joined: 7 Sep 16 Posts: 259 Credit: 32,091,992 RAC: 22,910 |
I have a 1008 task (https://www.cpdn.org/result.php?resultid=22417298) that seems "stuck" - it's been at 6% and change for 6 days now while the rest of the tasks blow past it. Windows 10 VM on an AMD system. Is there any value to letting it continue spinning, or should I just abort it and let some other system try? |
Send message Joined: 29 Oct 17 Posts: 816 Credit: 13,672,275 RAC: 8,057 |
Rather than abort, could you please do me a favour? Open up Resource Monitor, click the 'CPU' tab and scroll down to find the 'wah2' list of processes. You should have 3 processes per task. For your task, n15e, please let me know how many you see. I think you'll only see one process: wah2_8.29_windows_intelx86.exe and not the wah2am_* and wah2rm_* processes. Can you confirm?
Also, if you know your way around the BOINC folder layout, it would be great if you could locate the task directory and check a file for me. I'd like to see the last few lines of a file called 'stdout_mon.txt'. It can be found in the task directory, which will be under your BOINC install 'data' directory, e.g. c:\Program Files\BOINC\data\projects\climateprediction.net\wah2_eas25_n15e_201212_24_1008_012272746_0\stdout_mon.txt
Note that the 'data' directory under BOINC is usually hidden, so you'll need to unhide folders in File Explorer.
The reason I ask this is because up to now it's appeared that only AMD PCs can run the 1008 tasks (though not produce correct results). It sounds like yours has crashed and I'd like to see how far it got. After this, rather than Abort, I suggest doing 'End Task' in Resource Monitor (right click on the correct process name). I *think* this will avoid your host being marked down for aborting tasks. Many thanks. --- CPDN Visiting Scientist |
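For anyone who would rather script the check than browse in File Explorer, here is a minimal sketch of reading the last few lines of 'stdout_mon.txt'. The data directory in the example is an assumption (it differs between installs), and `tail_lines` is a hypothetical helper name, not anything shipped with BOINC or CPDN.

```python
from collections import deque
from pathlib import Path


def tail_lines(path, n=10):
    """Return the last n lines of a text file as a list of strings."""
    # deque with maxlen keeps only the final n lines while streaming the file.
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


if __name__ == "__main__":
    # Assumed location: adjust the data directory and task name to your host.
    data_dir = Path(r"C:\ProgramData\BOINC")
    task = "wah2_eas25_n15e_201212_24_1008_012272746_0"
    log = data_dir / "projects" / "climateprediction.net" / task / "stdout_mon.txt"
    if log.exists():
        for line in tail_lines(log, 10):
            print(line)
    else:
        print(f"Not found: {log}")
```

If the file is not where the script looks, search the BOINC data directory for the task name; the script only reads the file and does not touch the running task.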
Send message Joined: 7 Sep 16 Posts: 259 Credit: 32,091,992 RAC: 22,910 |
I have 9 tasks currently running on the system (it's a bit short on disk, I should fix that). I see 9 wah2_8.29 tasks, 9 wah2am3m2_8.29 tasks, and 9 of the wah2am3m2t_8.29 tasks. One or two of them are showing no CPU use, though. Not sure how to link PIDs to tasks, though. But looking more closely at the task, "CPU Time" reports "1d 03:51:19" on an elapsed time of 6d and change.
There's no C:\Program Files\BOINC\data directory, but I've got a C:\ProgramData\BOINC directory with that sort of stuff. stderr_rm and stderr_um are both empty.
stdout_mon.txt:
worker: Created shared memory region key = wah2_eas25_n15e_201212_24_1008_012272746 of size 73278744 bytes (version 608)
Run for 2 Years and 0 Months
pShMem->PRECIS_LATITUDE 185
pShMem->PRECIS_LONGITUDE 285
pShMem->EWSPACEA 0.220000
pShMem->NSSPACEA 0.220000
pShMem->FRSTLATA 19.100000
pShMem->FRSTLONA 328.500000
pShMem->POLELATA 55.500000
pShMem->POLELONA 308.000000
pShMem->L_RUN_REGION 1
pShMem->UPLOAD_INTERVAL 0
ulTotalPhaseTimestep 276864
Starting model ID wah2_eas25_n15e_201212_24_1008_012272746 Phase 1
Launching model "C:\ProgramData\BOINC/projects/climateprediction.net/wah2am3m2_um_8.29_windows_intelx86.exe" wah2_eas25_n15e_201212_24_1008_012272746 generic_phase1_spinup_eas25_global_aabaka ic19610319_14_N96 NATclim_ancil_168months_CMIP6-HadGEM3-GC31-LL_SST_2009-01-01_2022-12-30_v2403 NATclim_ancil_168months_CMIP6-HadGEM3-GC31-LL_SIC_2009-01-01_2022-12-30_v2403 so2dms_prei_N96_1855_0000P oxi.addfa ozone_preind_N96_1879_0000Pv5
Launching model "C:\ProgramData\BOINC/projects/climateprediction.net/wah2rm3m2t_um_8.29_windows_intelx86.exe" wah2_eas25_n15e_201212_24_1008_012272746
executeModelProcess: MonID=7868, GCM_PID=9088, RCM_PID=9824
stdout_rm.txt:
Starting HadRM3 model for ID# wah2_eas25_n15e_201212_24_1008_012272746...
Attached to shared memory segment with ID
Setting run-time Fortran environment...
UM environment variables in use: ASTART=dataout/region_restart.day UNIT11=dataout/xacxf.phist UM_SECTOR_SIZE=2048 UNIT02=jobs/xacxf UM_LBC_COUP=0 VN=4.5 TYPE=CRUN UNIT09=tmp/xacxf.namelists UNIT22=datain/ancil/ctldata/STASHmaster STASETS_DIR=datain/ancil/ctldata/stasets CACHE2=tmp/xacxf.cache2 UNIT08=tmp/xacxf.pipe_dummy UNIT14=tmp/xacxf.errors APSUM1=tmp/xacxf.apsum1 APSTMP1=tmp/xacxf.apstmp1 AOTRANS=tmp/xacxf.aotrans UNIT04=jobs/xacxf.stashc UNIT05=jobs/xacxf.namelists DATAM=dataout/ UNIT12=dataout/xacxf.thist UNIT10=dataout/xacxf.phist UNIT06=dataout/xacxf.out UNIT00=dataout/xacxf.err AINITIAL=dataout/region_restart.day UNIT57=jobs/spec3a_sw_3_asol2c_hadcm3 UNIT80=jobs/spec3a_lw_3_asol2c_hadcm3 SWSPECTD=jobs/spec3a_sw_3_asol2c_hadcm3 LWSPECTD=jobs/spec3a_lw_3_asol2c_hadcm3
Changing to slots dir C:\ProgramData\BOINC\slots\1
stdout_um.txt:
Starting HadAM3P model for ID# wah2_eas25_n15e_201212_24_1008_012272746...
Attached to shared memory segment with ID
Setting run-time Fortran environment...
UM environment variables in use: ASTART=datain/dumps/generic_phase1_spinup_eas25_global_aabaka UNIT15=datain/ancil/ic19610319_14_N96 SSTIN=datain/ancil/NATclim_ancil_168months_CMIP6-HadGEM3-GC31-LL_SST_2009-01-01_2022-12-30_v2403 SICEIN=datain/ancil/NATclim_ancil_168months_CMIP6-HadGEM3-GC31-LL_SIC_2009-01-01_2022-12-30_v2403 SULPEMIS=datain/ancil/so2dms_prei_N96_1855_0000P CHEMOXID=datain/ancil/oxi.addfa OZONE=datain/ancil/ozone_preind_N96_1879_0000Pv5 UM_LBC_COUP=0 UNIT11=dataout/xadae.phist UM_SECTOR_SIZE=2048 UNIT02=jobs/xadae VN=4.5 TYPE=CRUN UNIT09=tmp/xadae.namelists AINITIAL=dataout/atmos_restart.day UNIT57=jobs/spec3a_sw_3_asol2c_hadcm3 UNIT80=jobs/spec3a_lw_3_asol2c_hadcm3 SWSPECTD=jobs/spec3a_sw_3_asol2c_hadcm3 LWSPECTD=jobs/spec3a_lw_3_asol2c_hadcm3 UNIT22=datain/ancil/ctldata/STASHmaster STASETS_DIR=datain/ancil/ctldata/stasets CACHE2=tmp/xadae.cache2 UNIT08=tmp/xadae.pipe_dummy UNIT14=tmp/xadae.errors APSUM1=tmp/xadae.apsum1 APSTMP1=tmp/xadae.apstmp1 AOTRANS=tmp/xadae.aotrans UNIT04=jobs/xadae.stashc UNIT05=jobs/xadae.namelists DATAM=dataout/ UNIT12=dataout/xadae.thist UNIT10=dataout/xadae.phist UNIT06=dataout/xadae.out UNIT00=dataout/xadae.err UM_ATM_NPROCX=1 UM_ATM_NPROCY=1 UM_NPES=1 RUNID=xadae
Changing to slots dir C:\ProgramData\BOINC\slots\1
Closing model...
Detaching shared memory segment...
I don't see anything obviously wrong in them... I'm tempted to suspend and resume that task, see if it comes back up properly. |
Send message Joined: 29 Oct 17 Posts: 816 Credit: 13,672,275 RAC: 8,057 |
Thanks. I can see the problem.
"I have 9 tasks ... I see 9 wah2_8.29 tasks, 9 wah2am3m2_8.29 tasks, and 9 of the wah2am3m2t_8.29 tasks. One or two of them are showing no CPU use, though. Not sure how to link PIDs to tasks, though."
There are several ways to link a PID to a task; I like Resource Monitor. Start it up. On the CPU tab, scroll to find the CPDN task process you are interested in and click the little checkbox to the left of it. Once clicked, open up the 'Associated Handles' section (little up/down arrow on the title bar below), and it will show you all the files and folders associated with the process.
The last lines of the output from the global model, 'stdout_um.txt', show the problem: 'Closing model'. That means the model has stopped but for some reason BOINC hasn't recognised this and the process hasn't exited. That's why the model has hung up. The global model isn't running, so the other two processes are just sitting waiting.
Rather than suspend/resume, I would shut down the client to kill the processes. Make sure they really have gone (check Resource Monitor) and then start up the client again. It's possible the tasks will then error, but that's what you need anyway. HTH
p.s. I've just checked the machine this was running on. I noticed it's only got 8 GB RAM. How many CPDN tasks are running simultaneously, and how much of that 8 GB is BOINC allowed to use? I'm thinking you might have hit a memory limit, causing this odd behaviour. --- CPDN Visiting Scientist |
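The two checks discussed here can also be done from the log text itself. This is a rough sketch, not a project tool: it pulls the PIDs out of the 'executeModelProcess' line seen in the posted 'stdout_mon.txt', and looks for 'Closing model' near the end of 'stdout_um.txt'. The function names are hypothetical, and reading MonID, GCM_PID and RCM_PID as the monitor, global model and regional model process IDs is an assumption based on the field names.

```python
import re

# Matches lines such as:
#   executeModelProcess: MonID=7868, GCM_PID=9088, RCM_PID=9824
PID_LINE = re.compile(
    r"executeModelProcess:\s*MonID=(\d+),\s*GCM_PID=(\d+),\s*RCM_PID=(\d+)"
)


def parse_pids(mon_text):
    """Return the PIDs from the most recent executeModelProcess line, or None."""
    matches = PID_LINE.findall(mon_text)
    if not matches:
        return None
    # A restarted task may log the line more than once; use the latest entry.
    mon, gcm, rcm = (int(value) for value in matches[-1])
    return {"monitor": mon, "global": gcm, "regional": rcm}


def global_model_closed(um_text, window=5):
    """True if 'Closing model' appears in the last few non-empty lines."""
    lines = [ln.strip() for ln in um_text.splitlines() if ln.strip()]
    return any("Closing model" in ln for ln in lines[-window:])
```

The PIDs returned by `parse_pids` can then be matched against the PID column in Resource Monitor. If `global_model_closed` is true while the task still shows as running, that matches the hang described in this post.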
Send message Joined: 7 Sep 16 Posts: 259 Credit: 32,091,992 RAC: 22,910 |
It's running 9 tasks, due to disk limits, showing 6.8GB of 8 in use, and BOINC is allowed 90% of RAM. I'll just reboot the VM and up the RAM to it - it's able to have at least 12GB, the system isn't running anything else. |
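As a back-of-envelope check on those numbers: 90% of 8 GB is 7.2 GB, so with 6.8 GB in use there is only about 0.4 GB to spare. The sketch below just repeats that arithmetic for the 8 GB and 12 GB cases. It treats the reported 6.8 GB as if all of it counted against BOINC's limit, which is a simplifying assumption, and the function names are made up for illustration.

```python
def boinc_ram_limit_gb(total_ram_gb, allowed_fraction):
    """RAM BOINC may use under a 'use at most X% of memory' preference."""
    return total_ram_gb * allowed_fraction


def headroom_gb(total_ram_gb, allowed_fraction, in_use_gb):
    """Limit minus current usage; a value near zero suggests memory pressure."""
    return boinc_ram_limit_gb(total_ram_gb, allowed_fraction) - in_use_gb


if __name__ == "__main__":
    for total in (8, 12):
        limit = boinc_ram_limit_gb(total, 0.90)
        spare = headroom_gb(total, 0.90, 6.8)
        print(f"{total} GB VM: limit {limit:.1f} GB, headroom {spare:.1f} GB")
```

With 12 GB the same preference allows 10.8 GB, which leaves roughly 4 GB of slack at the current usage.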
Send message Joined: 17 Feb 06 Posts: 2 Credit: 576,179 RAC: 3,313 |
"The reason I ask this is because up to now it's appeared that only AMD PCs can run the 1008 tasks (though not produce correct results)."
Hello, I have 4 1008 tasks at about 60% on a Ryzen 3600x. Is it worth continuing them if they don't output correct results or should I abort them? Thanks! |
Send message Joined: 15 May 09 Posts: 4352 Credit: 16,590,792 RAC: 6,226 |
I am going to let mine continue unless I see a message from Glen here or someone else at the project asking for them to be aborted. (Mine are suspended currently to let some testing branch tasks go through.) |
Send message Joined: 29 Oct 17 Posts: 816 Credit: 13,672,275 RAC: 8,057 |
"The reason I ask this is because up to now it's appeared that only AMD PCs can run the 1008 tasks (though not produce correct results)."
Please let it run. We're not 100% certain of the results. They are a useful comparison to the failed Intel runs. --- CPDN Visiting Scientist |
Send message Joined: 17 Feb 06 Posts: 2 Credit: 576,179 RAC: 3,313 |
Ok, will do! Thanks 👍 |
Send message Joined: 1 Jan 07 Posts: 943 Credit: 34,360,365 RAC: 9,337 |
Task 22425060 (test batch 1014, Intel) has finished successfully. |
©2024 climateprediction.net