Memory Allocation Failure

Lockleys

Joined: 13 Jan 07
Posts: 195
Credit: 10,581,566
RAC: 0
Message 31429 - Posted: 19 Nov 2007, 19:10:38 UTC

I've just had a slab model (hadsm3fub_0316_005909008_6) which has failed with a Memory Allocation Failure. I have been struggling to resuscitate it for several days, restoring it from several different backups, but it fails at exactly the same place each time: during the post-processing at the end of Phase 2. I have rerun it to the fail point 4 times, and on the last occasion I allowed it to report. The only difference is that sometimes it appears to freeze at the failure point, showing 66.666% complete, and other times it goes to 100%. I have also tried it on Intel and AMD systems.

Does anybody have any ideas about what might make a difference if I were to try it a fifth time?
ID: 31429
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 31430 - Posted: 19 Nov 2007, 20:24:57 UTC


Highlighting Memory Allocation Failure in your post and then clicking the Google button on my browser gives a large number of 'hits'.
Just reading the summary of a few seems to indicate that your climate program keeps running out of memory.
I remember that you posted a while back about lack of memory.

So perhaps if you make sure that nothing else is running, that work units for any other project are suspended, and that you have set your prefs option to give the maximum amount of memory to the program when the computer is not in use, it will succeed. Better still, suspend the climate model, and let other project WUs complete to get them out of the way. This obviously means first of all setting everything in the Projects tab to No new tasks.

Your computer specs say 991.48 MB, which is less than 1 GB (1024 MB), so unless you have an odd amount of memory installed, some of it is 'missing'.
Is this because you don't have a separate graphics card? If so, the onboard chips may be using a fair bit of memory at a crucial point, for some reason.
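As a quick check of the arithmetic, the gap between a nominal 1 GB and the reported figure can be worked out directly. This is only a sketch that restates the two numbers quoted above; the function name is made up for illustration, and the graphics explanation remains a guess.

```python
# Difference between a nominal 1 GB and the amount the host reports.
NOMINAL_MB = 1024.0    # 1 GB expressed in MB
REPORTED_MB = 991.48   # figure shown in the computer specs


def missing_mb(nominal: float, reported: float) -> float:
    """Return how many MB are unaccounted for, rounded to 2 decimals."""
    return round(nominal - reported, 2)


print(missing_mb(NOMINAL_MB, REPORTED_MB))
```

A gap of roughly 32 MB would be consistent with a block reserved for onboard graphics, though nothing in the thread confirms it.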

As for BOINC sometimes showing 100%, this happens when BOINC loses contact with the model. It then assumes that the model has completed, and so sets its percentage to 100% to show this 'completion'.

ID: 31430
Lockleys
Message 31431 - Posted: 19 Nov 2007, 21:31:02 UTC

Thanks, Les. I had assumed the error message was specific to BOINC/CPDN, so I didn't do what you've done.

The 991.48 MB (?!?) relates to the AMD Sempron machine I tried it on last. Mostly, this model runs on a 2 GB Core 2 Duo, alongside a Coupled Model which has about 3 months still to go, but otherwise with nothing much else running (apart from email). Normal practice is Network off, Screensaver off, Windows and antivirus updates off. The message I posted some time ago about lack of memory was for a completely different slab model running on a 512 MB laptop, which incidentally is still running.

I can try it one more time when I get back to my Dual in a few days, as you suggest, and change the memory setting in Preferences to see what that does.

ID: 31431
MikeMarsUK
Volunteer moderator

Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 31432 - Posted: 19 Nov 2007, 22:44:31 UTC


That's a very interesting crash, the first time I've ever seen one like it. A lot of debug info.

http://climateapps2.oucs.ox.ac.uk/cpdnboinc/result.php?resultid=6932855
...
- exit code -1073741819 (0xc0000005)
...
MainError: 06:17:03 PM Memory allocation (malloc) failure
MainError: 06:17:03 PM Memory allocation (malloc) failure
MainError: 06:17:03 PM Memory allocation (malloc) failure
MainError: 06:17:03 PM Memory allocation (malloc) failure
MainError: 06:17:03 PM Memory allocation (malloc) failure
MainError: 06:17:03 PM Variable not found
...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0044AD38 read attempt to address 0x00000003
...
- Paged Pool Usage -
QuotaPagedPoolUsage: 69024, QuotaPeakPagedPoolUsage: 69024
QuotaNonPagedPoolUsage: 3136, QuotaPeakNonPagedPoolUsage: 3440
...
- Callstack -
ChildEBP RetAddr Args to Child
0318fef4 7c90e9c0 7c8025cb 000000a0 00000000 0318ff2c ntdll!_KiFastSystemCallRet@0+0x0 FPO: [0,0,0]
0318fef8 7c8025cb 000000a0 00000000 0318ff2c 00001000 ntdll!_ZwWaitForSingleObject@12+0x0 FPO: [3,0,0]
0318ff5c 7c802532 000000a0 000003e8 00000000 0318ffb4 kernel32!_WaitForSingleObjectEx@12+0x0
0318ff70 00428e21 000000a0 000003e8 02d0bde8 00471888 kernel32!_WaitForSingleObject@8+0x0
0318ffb4 7c80b683 02d0bde8 00001000 02d0b890 02d0bde8 hadsm3_5.06_windows_intelx86!+0x0
0318ffec 00000000 00471819 02d0bde8 00000000 73646168 kernel32!_BaseThreadStart@8+0x0
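The exit code and the exception code in the log above are the same value: -1073741819 is 0xc0000005 (access violation) read as a signed 32-bit integer. A minimal sketch of the conversion follows; the function name is hypothetical.

```python
def to_unsigned_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


print(to_unsigned_hex(-1073741819))  # 0xc0000005
```

Masking with 0xFFFFFFFF keeps the low 32 bits, which is why the negative decimal number and the hex code in the exception record match.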


The most likely reason for a malloc failure is an out-of-memory condition, as Les indicates. You could also get an error if the memory allocation pool has been corrupted somehow, or if the pool has been fragmented to the point where malloc can't get a single block of memory large enough to satisfy a request.

I'm a volunteer and my views are my own.
ID: 31432
Lockleys
Message 31433 - Posted: 20 Nov 2007, 8:17:10 UTC

Thanks, MikeMars. I've tried it again this morning after a reboot and after pushing the Memory Preference up from 50% to 70%, albeit on my 1 GB single-core AMD system, but it just froze at the 66.666% point, i.e. the end of Phase 2. Freezing is the most common symptom. If I suspend the offending slab model and resume my coupled model, the CM carries on just fine. Do slabs have a larger memory requirement?
ID: 31433
Thyme Lawn
Volunteer moderator

Joined: 5 Aug 04
Posts: 1283
Credit: 15,824,334
RAC: 0
Message 31434 - Posted: 20 Nov 2007, 9:26:01 UTC
Last modified: 20 Nov 2007, 9:28:08 UTC

Slab uses 50MB for the worker process, with another 2MB for the controller process (increasing to 12MB if you open the graphics window). Coupled uses 100MB for the worker and 15MB for the controller after opening the graphics window.
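Adding up the figures just quoted gives the per-model totals. This is a sketch using only the numbers in this post; the controller size for a coupled model without the graphics window is not stated, so it is left out.

```python
# Memory figures quoted in the post, in MB.
SLAB_WORKER = 50
SLAB_CONTROLLER = 2
SLAB_CONTROLLER_GRAPHICS = 12
COUPLED_WORKER = 100
COUPLED_CONTROLLER_GRAPHICS = 15

slab_total = SLAB_WORKER + SLAB_CONTROLLER
slab_total_graphics = SLAB_WORKER + SLAB_CONTROLLER_GRAPHICS
coupled_total_graphics = COUPLED_WORKER + COUPLED_CONTROLLER_GRAPHICS

print("slab:", slab_total, "MB")
print("slab with graphics window:", slab_total_graphics, "MB")
print("coupled with graphics window:", coupled_total_graphics, "MB")
```

On these numbers a slab model needs roughly half the memory of a coupled model while it is running normally.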

As far as freezing at 66.66% goes, that's the end of the second phase. The controller process does lots of post-processing at that point. If you're monitoring the percentage completed in BOINC manager, slab models can appear to be stuck at that point for a few minutes before phase 3 starts and the percentage starts increasing again.
"The ultimate test of a moral society is the kind of world that it leaves to its children." - Dietrich Bonhoeffer
ID: 31434
Lockleys
Message 31435 - Posted: 20 Nov 2007, 12:15:27 UTC - in response to Message 31434.  


Thanks, Thyme Lawn. Trouble is, the slab froze at 66.666% for about 7 hours, i.e. overnight, and Windows Task Manager showed 0% CPU being used.
ID: 31435
Thyme Lawn
Volunteer moderator
Message 31436 - Posted: 20 Nov 2007, 13:06:30 UTC

Is there still a directory BOINC\projects\climateprediction.net\hadsm3fub_0316_005909008\dataout on that system? If so, check the names of the 0316ba.* files. If post-processing completed they should all have been given the suffix .x2.nc, but if something went wrong you'll have some without that suffix.

If something did go wrong, have a look at the BOINC manager messages tab (or the file BOINC/stdoutdae.txt) to check whether anything strange happened around the timestamp on the newest file with the .x2.nc suffix (e.g. did a benchmark force the application to be removed from memory?).
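The suffix check described above can be done by eye, but a short script makes it repeatable. The sketch below is an assumption-laden helper, not a CPDN tool: the function name is invented, and the folder path is the relative one from this post and must be adjusted to the real BOINC data directory.

```python
from pathlib import Path


def files_missing_suffix(dataout: Path, pattern: str = "0316ba.*",
                         suffix: str = ".x2.nc") -> list[str]:
    """Return names matching `pattern` that do not end with `suffix`."""
    return sorted(p.name for p in dataout.glob(pattern)
                  if p.is_file() and not p.name.endswith(suffix))


if __name__ == "__main__":
    # Hypothetical location; adjust to the actual BOINC data directory.
    folder = Path(r"BOINC\projects\climateprediction.net"
                  r"\hadsm3fub_0316_005909008\dataout")
    if folder.is_dir():
        for name in files_missing_suffix(folder):
            print("no .x2.nc suffix:", name)
```

An empty result would suggest that every matching file received the .x2.nc suffix.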
ID: 31436
Lockleys
Message 31437 - Posted: 20 Nov 2007, 19:40:29 UTC - in response to Message 31436.  


Thanks, Thyme Lawn. The directory is still there and the suffixes are all .x2.nc. stdoutdae.txt shows no messages at the relevant time.
ID: 31437
DJStarfox

Joined: 27 Jan 07
Posts: 300
Credit: 3,288,263
RAC: 26,370
Message 31442 - Posted: 21 Nov 2007, 4:33:15 UTC - in response to Message 31433.  


Malloc errors would not be related to BOINC settings, as CPDN requests memory from the OS, not from BOINC. So BOINC preferences should make no difference in this case. I would be very curious to see how much memory the Windows kernel is using and how much free memory is available on the system after a reboot with CPDN running. You can find kernel memory usage in the Task Manager under the Performance tab.
ID: 31442
old_user81336

Joined: 10 Jun 05
Posts: 10
Credit: 4,863
RAC: 0
Message 31443 - Posted: 21 Nov 2007, 5:35:32 UTC - in response to Message 31442.  

Malloc errors would not be related to BOINC settings, as CPDN requests memory from the OS, not BOINC. So, BOINC preferences should make no difference in this case.


If you mean that CPDN requests memory from the OS only when BOINC has told it that it is allowed to ask for more, then I would approve of that. If CPDN completely ignores the user's preferences, that would be bad.
ID: 31443
MikeMarsUK
Volunteer moderator
Message 31444 - Posted: 21 Nov 2007, 8:38:07 UTC
Last modified: 21 Nov 2007, 8:49:12 UTC


BOINC itself monitors memory usage and, when it gets too high, suspends enough tasks to bring usage back below the limit. As DJStarfox says, CPDN asks the OS directly for memory.

However, BOINC does not know how much system memory is free, or how much is used by non-BOINC tasks. It just uses fixed percentages of the total memory. If you are running something else which uses an excessive amount of memory, or if you run out of disk space on the drive which holds your swap file (virtual memory), then the system will be in big trouble but BOINC won't realise.
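To put numbers on the "fixed percentage of total memory" rule, the sketch below applies the preference values mentioned in this thread to the memory sizes mentioned in this thread. It is a simplified model of the behaviour described above, not BOINC's actual code, and it assumes the 2 GB machine reports exactly 2048 MB.

```python
def boinc_memory_limit_mb(total_mb: float, percent: float) -> float:
    """Memory BOINC tasks may use: a fixed share of total RAM."""
    return total_mb * percent / 100.0


# (total RAM in MB, preference in percent) pairs taken from the thread.
for total, pct in [(991.48, 50), (991.48, 70), (2048, 90)]:
    limit = boinc_memory_limit_mb(total, pct)
    print(f"{pct}% of {total} MB -> {limit:.1f} MB")
```

Each of these limits is several times the 50 to 115 MB footprints quoted earlier for the models, so under this simplified model the preference setting itself seems unlikely to be the constraint.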

Various things I'd suggest trying:

* Run a memory tester (Memtest86), or failing that a stress tester with a large memory size set (Prime95's torture test), for 24 hours.

* As Thyme says, look in the Task Manager on the Processes tab after the system has been running for a while to see if anything is using a lot of memory. You may need to add the 'peak memory usage' column to the display.

* I don't know how much memory is used during the post-processing phase. If you watch the Task Manager as above while post-processing is taking place, this may give you additional information.

* How long does the system stay up at a stretch? If this is too long (weeks), you may suffer memory fragmentation after a while. Similarly, if something has a memory leak (such as vsmon), it may make the system crash with memory-related errors.
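The fragmentation point in the last item can be illustrated with a toy model: an allocation needs one contiguous block, so plenty of free memory in total does not guarantee success. The block sizes below are invented for illustration and say nothing about the real allocator.

```python
def can_allocate(free_blocks: list[int], request: int) -> bool:
    """A request succeeds only if a single free block is large enough."""
    return any(block >= request for block in free_blocks)


free_blocks = [16, 8, 24, 12]          # hypothetical free block sizes, in MB
print(sum(free_blocks))                # 60 MB free in total
print(can_allocate(free_blocks, 50))   # False: largest single block is 24 MB
print(can_allocate(free_blocks, 20))   # True
```

A reboot starts from an unfragmented state, which is why uptime is worth asking about.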
ID: 31444
Thyme Lawn
Volunteer moderator
Message 31446 - Posted: 21 Nov 2007, 9:11:53 UTC - in response to Message 31437.  

The directory is still there and the suffixes are all .x2.nc. stdoutdae.txt shows no messages at the relevant time.

Which indicates that post-processing completed. There are 10 hours between the last phase 2 trickle and the model crashing, but going by the distribution of your earlier trickles it might have been significantly longer.

Check the file BOINC\projects\climateprediction.net\hadsm3fub_0316_005909008.xml. It will indicate exactly where you got to in the processing. Post the lines from <PH> to <RSYT> if you need any help deciphering it.
ID: 31446
DJStarfox
Message 31448 - Posted: 21 Nov 2007, 13:47:54 UTC - in response to Message 31446.  


That's very interesting if true. I thought the problem was in post-processing, which on my Linux system more than doubles the memory requirement (for just that step). If the error happens afterward, then I'm not sure his system is really out of memory.

I have a slab at 50% now. If it's still relevant to this thread when my model hits 66%, I'll watch the memory usage more closely.
ID: 31448
Lockleys
Message 31460 - Posted: 23 Nov 2007, 17:11:59 UTC

I have now had the opportunity to rerun the slab model (hadsm3fub_0316_005909008_6) from a backup to the 66.666% point once more, using my Core 2 Duo with 2 GB of memory. All other BOINC tasks were detached and the other coupled model deleted before a system reboot. No other apps were running apart from AVG antivirus. Memory preferences were set to "use 90%".

A new CPID was allocated at the start of the run.

It has now frozen again at 66.666%. CPU usage has dropped to 0-1%, so processing has materially ceased. There does not appear to be a Malloc failure this run.

Once again, all the dataout/----0316da.* files have the .x2.nc suffix.

The lines in file hadsm3fub_0316_005909008.xml which Thyme Lawn enquired about are as follows:
<PH>2</PH>
<TS>259248</TS>
<DAY>2</DAY>
<MTH>12</MTH>
<YR>1840</YR>
<HR>0</HR>
<MIN>0</MIN>
<SEC>0</SEC>
<CSF>372</CSF>
<TR>259248</TR>
<ST>1</ST>
<RS>3</RS>
<RSC>0</RSC>
<RSDT>259200</RSDT>
<RSMT>259056</RSMT>
<RSYT>259056</RSYT>
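Fields like these can also be pulled out with a few lines of code rather than copied by hand. The sketch below assumes the file consists of simple, non-nested <TAG>value</TAG> pairs like the lines above; the rest of the file's structure is unknown here, and the function name is hypothetical.

```python
import re

# Matches a simple, non-nested <TAG>value</TAG> pair.
TAG_RE = re.compile(r"<(\w+)>([^<]*)</\1>")


def read_state(text: str) -> dict[str, str]:
    """Collect <TAG>value</TAG> pairs into a dictionary of strings."""
    return {name: value.strip() for name, value in TAG_RE.findall(text)}


sample = "<PH>2</PH>\n<TS>259248</TS>\n<YR>1840</YR>"
state = read_state(sample)
print("phase:", state["PH"])
print("model year:", state["YR"])
```

With the values posted above, the phase reads as 2.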

I'm now off to restore the coupled model from its backup and see if I can continue with it.

Thanks everyone for your assistance.
ID: 31460
Thyme Lawn
Volunteer moderator
Message 31461 - Posted: 23 Nov 2007, 17:54:19 UTC

Those values indicate that it hasn't progressed into phase 3 (the <PH> tag is still set to 2). If you sort the dataout folder by date the oldest compressed working data file should be 0316ba.pa26c10.x2.nc, with a set of more recent files generated at the end of post-processing in the following increasing age order:

0316ba.pa.8yac.x2.nc
0316aa.pc.8yac.x2.nc (yes, it is 'aa' rather than 'ba'!)
0316ba.pa.gmts.x2.nc
0316ba.pa.rmts.x2.nc
0316ba.pd.gmts.x2.nc
0316ba.pd.rmts.x2.nc
0316ba.pe.gmts.x2.nc
0316ba.pe.rmts.x2.nc
0316ba.pf.gmts.x2.nc
0316ba.pf.rmts.x2.nc
0316ba.pg.gmts.x2.nc
0316ba.pg.rmts.x2.nc
0316ba.pw.8yac.x2.nc
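Sorting the folder by date, as suggested above, can also be scripted so the order is easy to compare against this list. This is a minimal sketch; the function name is invented, and it uses each file's last-modified time as "the date".

```python
from pathlib import Path


def by_modification_time(folder: Path, suffix: str = ".x2.nc") -> list[str]:
    """Return the names of files ending in `suffix`, oldest first."""
    files = [p for p in folder.iterdir()
             if p.is_file() and p.name.endswith(suffix)]
    return [p.name for p in sorted(files, key=lambda p: p.stat().st_mtime)]


# Example (hypothetical path): print(by_modification_time(Path("dataout")))
```

Reverse the returned list to see the newest files first.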
ID: 31461
Iain Inglis

Joined: 9 Jan 07
Posts: 467
Credit: 14,549,176
RAC: 317
Message 31462 - Posted: 23 Nov 2007, 18:09:18 UTC
Last modified: 23 Nov 2007, 18:26:02 UTC

I've just watched three slab models run through the Phase 1 Zip upload (at 33.333%). This may, of course, differ from the Phase 2 situation.

The sequence of events was as follows:

1. At 33.327% there is a processing drop, which possibly coincides with a checkpoint.

2. At 33.333% the 'hadsm3_um_5.06_windows_intelx86.exe' process stops. The memory used drops by ~49 MB, which is approximately the reported 'Mem Usage' and 'VM Size' for that process. The BOINC Manager progress percentage doesn't change.

3. Processing remains at very low levels for about 12 minutes, during which time the 'hadsm3_5.06_windows_intelx86.exe' process writes about 1 GB, with occasional interventions by the 'hadsm3_se_5.06_windows_intelx86.exe' process. Numerous trickles are uploaded. Memory occasionally increases by a few MB (presumably the '_se_' process running) but never approaches the original amount.

4. After using about 80-90 seconds of CPU, 'hadsm3_5.06_windows_intelx86.exe' finishes writing, a new 'hadsm3_um_5.06_windows_intelx86.exe' process starts, the progress percentage begins to increment, and the Zip file is uploaded.

5. The memory used settles back at about the original value plus ~ 2 MB.

6. Quitting/resuming BOINC Manager returns the memory used to the original value.

So, CPU usage dropping to approximately zero is usual, but Lockleys' model should evidently restart for Phase 3 ...

[Edit: Are the folder permissions OK? You can safely clear any read-only flags and also ensure that the running user has 'full control'.]
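Clearing read-only flags can be done from Explorer, or with a short script such as the sketch below. The function name is hypothetical, and it only deals with the read-only attribute; it does not change access control lists, so the 'full control' check still has to be made separately.

```python
import os
import stat
from pathlib import Path


def clear_read_only(folder: Path) -> list[str]:
    """Make read-only files under `folder` writable; return the names changed."""
    changed = []
    for path in folder.rglob("*"):
        mode = path.stat().st_mode
        if path.is_file() and not mode & stat.S_IWRITE:
            # Setting the owner-write bit clears the read-only flag on Windows.
            os.chmod(path, mode | stat.S_IWRITE)
            changed.append(path.name)
    return sorted(changed)


# Example (hypothetical path): print(clear_read_only(Path("dataout")))
```

Running it a second time should report nothing changed.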
ID: 31462

©2024 climateprediction.net