climateprediction.net home page
Task won't restart

ktf

Joined: 28 Jun 07
Posts: 6
Credit: 929,653
RAC: 20,169
Message 46566 - Posted: 2 Jul 2013, 8:07:25 UTC
Last modified: 2 Jul 2013, 8:18:57 UTC

Hi,

I have this task running: http://climateapps2.oerc.ox.ac.uk/cpdnboinc/result.php?resultid=15813485

But it won't restart. I have restarted my computer several times, but the task stays stuck at 25.097%. ps aux on my computer shows:

_user_ 3540 0.2 0.1 9836 7144 ? SNl 09:58 0:00 ../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu hadcm3n_o4ep_1940_40_008382057 ocean_o4ep_1940_40_008382057_0 atmos_o4ep_1940_40_008382057_0 spec3a_sw_3_asol2c_hadcm3 spec3a_lw_3_asol2c_hadcm3 waterfix.ancil.be.32 NAT_VOLC DMSSO2NH3_1900_RCP sulpc_oxidants_19_A2_1990f SPARC_O3_rebuild_1900


It isn't using any CPU, so it isn't really running: as you can see, the CPU time is zero.

The last lines of stderr.txt in the slots directory (while the task is still in the 'running' state) are:

[...]
Signal 15 received, exiting...
Called boinc_finish
*** Error in `../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu': double free or corruption (out): 0x09821548 ***
hadcm3n_6.07_i686-pc-linux-gnu: malloc.c:2369: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
SIGABRT: abort called
Signal 1 received, exiting...
Called boinc_finish
Signal 1 received, exiting...
Called boinc_finish
Signal 15 received, exiting...
Called boinc_finish
*** Error in `../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu': double free or corruption (out): 0x0a04b548 ***
hadcm3n_6.07_i686-pc-linux-gnu: malloc.c:2369: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
SIGABRT: abort called
*** Error in `../../projects/climateprediction.net/hadcm3n_6.07_i686-pc-linux-gnu': double free or corruption (out): 0x09c36548 ***
hadcm3n_6.07_i686-pc-linux-gnu: malloc.c:2369: sysmalloc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
SIGABRT: abort called


What should I do? Is there a way to start this one again? Should I abort? Is one of the devs interested in more information?
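
One way to check the "not really running" observation is to sample the process's accumulated CPU time twice and see whether it grows. The following is a minimal sketch, assuming a Linux /proc filesystem; the function names and the 10-second interval are illustrative assumptions, not part of BOINC or the climateprediction.net application.

```python
import time


def cpu_ticks_from_stat(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')', so split on the LAST ')' before tokenizing.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(rest[11]), int(rest[12])
    return utime + stime


def is_accumulating_cpu(pid: int, interval: float = 10.0) -> bool:
    """Sample the process twice and report whether its CPU time increased."""
    def read_ticks() -> int:
        with open(f"/proc/{pid}/stat") as f:
            return cpu_ticks_from_stat(f.read())

    before = read_ticks()
    time.sleep(interval)
    return read_ticks() > before
```

For the process listed above, calling is_accumulating_cpu(3540) would return False if the task is truly idle; the PID changes every time the client restarts the task.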
Profile MikeMarsUK
Volunteer moderator
Joined: 13 Jan 06
Posts: 1498
Credit: 15,613,038
RAC: 0
Message 46568 - Posted: 2 Jul 2013, 8:29:00 UTC



The models have fragile points at 25%, 50%, 75%, and 100%, when certain key output files are being generated. It is not uncommon for a task to get stuck at one of these points. The only solutions are to revert to a backup or to abort the task.


I'm a volunteer and my views are my own.
Les Bayliss
Volunteer moderator

Joined: 5 Sep 04
Posts: 7629
Credit: 24,240,330
RAC: 0
Message 46569 - Posted: 2 Jul 2013, 8:31:58 UTC - in response to Message 46566.  
Last modified: 2 Jul 2013, 8:32:39 UTC

The parts of interest are:
25.097%
malloc and failed

Each of the 25% points (25%, 50%, 75%, 100%) is a very common place for this model type to fail.

The "malloc ... failed" assertion is a common failure message.

The only cure is to abort it.

edit
Beaten to it. :)
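
The two markers discussed in this thread, a progress value close to a 25% point and the malloc failure lines in stderr.txt, can be checked mechanically. The following is a minimal sketch, assuming the contents of stderr.txt have already been read into a string; the pattern names, the function names, and the 0.5 percentage-point tolerance are illustrative assumptions, not anything defined by BOINC or the model.

```python
import re

# Signatures taken from the stderr.txt excerpt quoted earlier in the thread.
CRASH_PATTERNS = {
    "double_free": re.compile(r"double free or corruption"),
    "malloc_assertion": re.compile(r"malloc\.c:\d+: sysmalloc: Assertion .* failed"),
    "sigabrt": re.compile(r"SIGABRT: abort called"),
}


def summarize_stderr(text: str) -> dict:
    """Count how many lines match each known crash signature."""
    counts = {name: 0 for name in CRASH_PATTERNS}
    for line in text.splitlines():
        for name, pattern in CRASH_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def near_quarter_point(percent: float, tolerance: float = 0.5) -> bool:
    """True if percent lies within `tolerance` of 25, 50, 75, or 100."""
    return any(abs(percent - q) <= tolerance for q in (25.0, 50.0, 75.0, 100.0))
```

A task that reports near_quarter_point(25.097) as True and has non-zero counts from summarize_stderr fits the failure pattern described above; this only helps recognize the pattern and does not repair the task.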

©2024 climateprediction.net