Segfault caused by deserialize (?) or too many tasks (?) #8551

@eschnett

For the past two weeks I've been trying to find the cause of a rather persistent segfault in a Julia program of mine. So far, I have whittled it down to fewer than 1,000 lines of Julia code, with no external dependencies except Deque from DataStructures.jl (as a replacement for MPI.jl).

I create many tasks, many of them active (and yielding from time to time). I'm sure there's a lot of memory allocation and garbage collection going on. I also serialize and de-serialize many objects, including functions and lambdas.

The error is a segfault. I assume that there is a safe subset of Julia programs that should never segfault (no ccall, no @inbounds, etc.). I believe my program is safe in this respect.

Here is the code: https://bitbucket.org/eschnett/funhpc.jl/branch/memdebug. I apologize for the size -- I've already greatly reduced it. This is how to run it, and how it fails (after a few seconds):

$ ~/julia/bin/julia-debug Wave.jl
Wave[FunHPC.jl]
Initialization

signal (11): Segmentation fault: 11
unknown function (ip: 0)
Segmentation fault: 11

This is with the current development version of Julia.

I believe the problem is somehow connected to or triggered by deserialization and/or by using many tasks. Deserialization is called from the file Comm.jl, routine recv_item, line 39. The result of the deserialize call is never used. If this call is commented out, the program runs fine (and outputs "Done."). (In my attempts to reduce code size, termination detection probably got wonky, but if the text "Done." is output, everything is fine and the segfault is avoided.)

Similarly, when I reduce the number of concurrent tasks (i.e. tasks that are runnable simultaneously, as opposed to waiting), the segfault becomes more sporadic, and disappears if there are just a few tasks. The current example runs 1000 tasks simultaneously.
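
For readers who want a feel for the shape of this workload without reading the repository, the following is a minimal sketch of the pattern described above: many concurrently runnable tasks, each serializing and deserializing a closure and yielding between iterations. It is not taken from FunHPC.jl and is not expected to reproduce the segfault. The names roundtrip and stress are hypothetical, and the sketch assumes a Julia 1.x release where serialize and deserialize live in the Serialization standard library (in the development version discussed here they were part of Base).

```julia
# Hypothetical sketch of the workload shape; not code from FunHPC.jl.
using Serialization  # assumption: Julia 1.x; older versions had these functions in Base

# Serialize a value into an in-memory buffer and read it back.
function roundtrip(x)
    buf = IOBuffer()
    serialize(buf, x)
    seekstart(buf)          # rewind so deserialize reads from the start
    return deserialize(buf)
end

# Start `ntasks` tasks that are all runnable at once. Each task round-trips
# a closure `niters` times and yields after every iteration.
function stress(ntasks::Integer, niters::Integer)
    results = zeros(Int, ntasks)
    @sync for t in 1:ntasks
        @async begin
            for i in 1:niters
                f = roundtrip(x -> x + t)   # serialize and deserialize a lambda
                results[t] = f(i)
                yield()                     # let the other tasks run
            end
        end
    end
    return results
end

results = stress(1000, 10)
println(results == [t + 10 for t in 1:1000] ? "Done." : "Mismatch")
```

Lowering the first argument of stress corresponds to reducing the number of simultaneously runnable tasks, which is the knob that made the segfault more sporadic in the real program.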

When I attach a debugger, the segfault happens in array.c or gc.c in an allocation routine, with a seemingly impossible internal state. I assume that malloc's internal heap data structures have been destroyed by then. Adding assert statements didn't reveal anything. I can't use MEMDEBUG because it has not bootstrapped for about five weeks (#8422).

I tried SANITIZE=1, but this didn't bootstrap for me either.

I tried building Julia with both LLVM 3.3 and LLVM 3.5.

I tried running this via Travis instead of on my laptop (OS X) and my workstation (Ubuntu), since I thought something might be wrong with my local setups, and Travis's setup should be well tested. However, I didn't have much luck there either -- debugging via Travis is too indirect to be productive.

I have looked in detail into the files array.c and gc.c (where the segfault is reported), as well as the files serialize.jl and iobuffer.jl (which handle deserialization), but I have not found any problems there. I would describe the programming style of these files as "real world" rather than "Julia showcase", and the number of comments as "strictly for experts only" (#8492), but they seem otherwise sound and well optimized.

I would be grateful for any comments or suggestions. At the moment, the two most promising courses of action seem to be (1) making MEMDEBUG work again, or (2) investigating what happens during deserialization.

Labels: bug
