Discussion:
[C++-sig] subprocess fork() sometimes hangs when called from within boost::python::exec_file
Peter Schüller
2015-04-02 19:12:49 UTC
Dear all,

In the hexhex project on GitHub [1] we execute a Python program using
boost::python::exec_file; the respective code is in
src/PythonPlugin.cpp, in the function PythonPlugin::runPythonMain(...).

In this Python program I use subprocess.call() to run a shell
application (actually Graphviz).
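
Roughly, the relevant part of that Python program looks like the
following (a simplified sketch; the function name, file names and the
exact dot invocation are placeholders, not the real code):

import subprocess

def render_graph(dot_file, png_file):
    # subprocess.call() forks and then execs "dot";
    # it is this fork() that sometimes hangs
    rc = subprocess.call(["dot", "-Tpng", dot_file, "-o", png_file])
    if rc != 0:
        raise RuntimeError("dot failed with exit code %d" % rc)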

Sometimes it works, and sometimes it hangs, spinning at 100% CPU. It
seems to depend on how much work the C++ program had to do before it
executed the Python part. In both cases the C++ program uses threads
(is that relevant with respect to fork() in Python? I found some hints
online that it could be).

I traced it down to os.fork() in subprocess.py, which returns in one
case and does not return at all (in neither the parent nor the child)
in the other case.

If I do strace -f on the process I get the following (the
SIGPROF/rt_sigreturn/clone sequence repeats until the disk is full):

[pid 18646] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f96a5fa2a50) = ? ERESTARTNOINTR (To be restarted)
[pid 18646] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 18646] rt_sigreturn() = 56
[pid 18646] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f96a5fa2a50) = ? ERESTARTNOINTR (To be restarted)
[pid 18646] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 18646] rt_sigreturn() = 56
[pid 18646] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f96a5fa2a50) = ? ERESTARTNOINTR (To be restarted)
[pid 18646] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 18646] rt_sigreturn() = 56
[pid 18646] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f96a5fa2a50) = ? ERESTARTNOINTR (To be restarted)
[pid 18646] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---

If I do strace -f on the working process, it looks like this:

[pid 21442] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f5f4c1e5a50) = ? ERESTARTNOINTR (To be restarted)
[pid 21442] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 21442] rt_sigreturn() = 56
[pid 21442] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f5f4c1e5a50) = ? ERESTARTNOINTR (To be restarted)
[pid 21442] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 21442] rt_sigreturn() = 56
[pid 21442] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f5f4c1e5a50) = ? ERESTARTNOINTR (To be restarted)
[pid 21442] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 21442] rt_sigreturn() = 56
[pid 21442] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f5f4c1e5a50) = 21448
Process 21448 attached
[pid 21448] set_robust_list(0x7f5f4c1e5a60, 24 <unfinished ...>
[pid 21442] write(2, "got pid", 7got pid <unfinished ...>
[pid 21448] <... set_robust_list resumed> ) = 0
[pid 21442] <... write resumed> ) = 7
[pid 21442] write(2, " ", 1 <unfinished ...>
[pid 21448] close(7 <unfinished ...>

If I do not use strace but attach with gdb instead, the backtrace is as follows:

(gdb) bt
#0 0x00007fa03c9c5025 in __libc_fork () at
../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:130
#1 0x00007fa03ccdb155 in __fork () at
../nptl/sysdeps/unix/sysv/linux/pt-fork.c:25
#2 0x00007fa03c36d6eb in posix_fork (self=<optimized out>,
noargs=<optimized out>) at ../Modules/posixmodule.c:3832
#3 0x00007fa03c2f2d33 in call_function (oparg=<optimized out>,
pp_stack=0x7fffbbc90ae0) at ../Python/ceval.c:4004
#4 PyEval_EvalFrameEx (f=***@entry=0x225e950,
throwflag=***@entry=0) at ../Python/ceval.c:2666
#5 0x00007fa03c2f517d in PyEval_EvalCodeEx (co=<optimized out>,
globals=<optimized out>, locals=***@entry=0x0, args=<optimized
out>,
argcount=***@entry=18, kws=0x2693418, kwcount=0, defs=0x0,
defcount=0, closure=***@entry=0x0) at ../Python/ceval.c:3252
#6 0x00007fa03c2f2dd8 in fast_function (nk=<optimized out>, na=18,
n=18, pp_stack=0x7fffbbc90d00, func=0x7fa0351e6410)
at ../Python/ceval.c:4116
#7 call_function (oparg=<optimized out>, pp_stack=0x7fffbbc90d00) at
../Python/ceval.c:4041
#8 PyEval_EvalFrameEx (f=***@entry=0x2693140,
throwflag=***@entry=0) at ../Python/ceval.c:2666
#9 0x00007fa03c2f517d in PyEval_EvalCodeEx (co=<optimized out>,
globals=<optimized out>, locals=***@entry=0x0,
args=***@entry=0x7fa0317ee530, argcount=2,
kws=***@entry=0x7fa0317ee578, kwcount=***@entry=1,
defs=***@entry=0x7fa03545bce8,
defcount=***@entry=13, closure=0x0) at ../Python/ceval.c:3252
#10 0x00007fa03c2f53e5 in function_call.15737 (func=0x7fa0351cff50,
arg=0x7fa0317ee518, kw=0x7fa0318247f8) at ../Objects/funcobject.c:526
#11 0x00007fa03c2bce23 in PyObject_Call
(func=***@entry=0x7fa0351cff50, arg=***@entry=0x7fa0317ee518,
kw=***@entry=0x7fa0318247f8)
at ../Objects/abstract.c:2529
#12 0x00007fa03c21030d in instancemethod_call.8988
(func=0x7fa0351cff50, arg=0x7fa0317ee518, kw=0x7fa0318247f8)
at ../Objects/classobject.c:2602
#13 0x00007fa03c2bce23 in PyObject_Call
(func=***@entry=0x7fa035438f00, arg=***@entry=0x7fa0317e8e50,
kw=***@entry=0x7fa0318247f8)
at ../Objects/abstract.c:2529
#14 0x00007fa03c2c148f in slot_tp_init.26761 (self=0x7fa0317f0110,
args=0x7fa0317e8e50, kwds=0x7fa0318247f8) at
../Objects/typeobject.c:5692
#15 0x00007fa03c2bf4df in type_call.26491 (type=<optimized out>,
args=0x7fa0317e8e50, kwds=0x7fa0318247f8) at
../Objects/typeobject.c:745
#16 0x00007fa03c2bce23 in PyObject_Call (func=***@entry=0x1d7aab0,
arg=***@entry=0x7fa0317e8e50, kw=***@entry=0x7fa0318247f8)
at ../Objects/abstract.c:2529
#17 0x00007fa03c2ef3b1 in ext_do_call (nk=<optimized out>,
na=<optimized out>, flags=<optimized out>, pp_stack=0x7fffbbc91300,
func=0x1d7aab0) at ../Python/ceval.c:4333
#18 PyEval_EvalFrameEx (f=***@entry=0x7fa0317f2050,
throwflag=***@entry=0) at ../Python/ceval.c:2705
#19 0x00007fa03c2f517d in PyEval_EvalCodeEx (co=<optimized out>,
globals=<optimized out>, locals=***@entry=0x0, args=<optimized
out>,
argcount=***@entry=1, kws=0x1da91e0, kwcount=1, defs=0x0,
defcount=0, closure=***@entry=0x0) at ../Python/ceval.c:3252
#20 0x00007fa03c2f2dd8 in fast_function (nk=<optimized out>, na=1,
n=3, pp_stack=0x7fffbbc91520, func=0x7fa0351cfcf8)
at ../Python/ceval.c:4116
#21 call_function (oparg=<optimized out>, pp_stack=0x7fffbbc91520) at
../Python/ceval.c:4041
#22 PyEval_EvalFrameEx (f=***@entry=0x1da8fc0,
throwflag=***@entry=0) at ../Python/ceval.c:2666
#23 0x00007fa03c2f517d in PyEval_EvalCodeEx (co=<optimized out>,
globals=<optimized out>, locals=***@entry=0x0,
args=***@entry=0x7fa03ebdd068, argcount=0, kws=***@entry=0x0,
kwcount=***@entry=0, defs=***@entry=0x0,
defcount=***@entry=0,
closure=0x0) at ../Python/ceval.c:3252
#24 0x00007fa03c2f5310 in function_call.15737 (func=0x7fa0351cf5f0,
arg=0x7fa03ebdd050, kw=0x0) at ../Objects/funcobject.c:526
#25 0x00007fa03c2bce23 in PyObject_Call
(func=***@entry=0x7fa0351cf5f0, arg=***@entry=0x7fa03ebdd050,
kw=<optimized out>)
at ../Objects/abstract.c:2529
#26 0x00007fa03c27b837 in PyEval_CallObjectWithKeywords
(func=***@entry=0x7fa0351cf5f0, arg=***@entry=0x7fa03ebdd050,
kw=***@entry=0x0)
at ../Python/ceval.c:3889
#27 0x00007fa03c2feb57 in PyEval_CallFunction (obj=0x7fa0351cf5f0,
format=<optimized out>) at ../Python/modsupport.c:557
#28 0x00007fa03e58b94b in dlvhex::PythonPlugin::runPythonMain(std::string) ()
from /home/ps/_science/projects/hex/installdir_benchmark/lib/libdlvhex2-base.so.11
#29 0x0000000000421eda in main ()

I am using Python 2.7 on Ubuntu 14.04.

How can I find out why it is hanging? Could it be a bug in the
os.fork() implementation?

Best Regards,
Peter

[1] https://github.com/hexhex/core/blob/master/src/PythonPlugin.cpp
--
http://www.peterschueller.com/
Stefan Ring
2015-05-02 12:43:17 UTC
Post by Peter Schüller
Dear all,
In the hexhex project on GitHub [1] we execute a Python program using
boost::python::exec_file; the respective code is in
src/PythonPlugin.cpp, in the function PythonPlugin::runPythonMain(...).
In this Python program I use subprocess.call() to run a shell
application (actually Graphviz).
Sometimes it works, and sometimes it hangs, spinning at 100% CPU. It
seems to depend on how much work the C++ program had to do before it
executed the Python part. In both cases the C++ program uses threads
(is that relevant with respect to fork() in Python? I found some hints
online that it could be).
I traced it down to os.fork() in subprocess.py, which returns in one
case and does not return at all (in neither the parent nor the child)
in the other case.
If I do strace -f on the process I get the following (the
[pid 18646] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f96a5fa2a50) = ? ERESTARTNOINTR (To be restarted)
[pid 18646] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 18646] rt_sigreturn() = 56
[pid 18646] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f96a5fa2a50) = ? ERESTARTNOINTR (To be restarted)
[pid 18646] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 18646] rt_sigreturn() = 56
[pid 18646] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f96a5fa2a50) = ? ERESTARTNOINTR (To be restarted)
[pid 18646] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 18646] rt_sigreturn() = 56
[pid 18646] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f96a5fa2a50) = ? ERESTARTNOINTR (To be restarted)
[pid 18646] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
There is no loop in Python's fork code, so it cannot really be blamed.
It sure looks like there is a pending signal which interferes for some
reason.
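
While it hangs, one thing you could do is dump the signal bitmasks of
the stuck process from /proc and see what is pending, blocked and
caught. A minimal sketch (pass the PID of the hanging process as the
argument):

import sys

def show_signal_masks(pid):
    # SigPnd/ShdPnd = pending, SigBlk = blocked,
    # SigIgn = ignored, SigCgt = caught (as hex bitmasks)
    with open("/proc/%s/status" % pid) as f:
        for line in f:
            if line.startswith(("SigPnd", "ShdPnd", "SigBlk",
                                "SigIgn", "SigCgt")):
                sys.stdout.write(line)

show_signal_masks(sys.argv[1])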
Peter Schüller
2015-05-04 06:45:38 UTC
Post by Stefan Ring
Post by Peter Schüller
In the hexhex project on GitHub [1] we execute a Python program using
boost::python::exec_file; the respective code is in
src/PythonPlugin.cpp, in the function PythonPlugin::runPythonMain(...).
In this Python program I use subprocess.call() to run a shell
application (actually Graphviz).
Sometimes it works, and sometimes it hangs, spinning at 100% CPU. It
seems to depend on how much work the C++ program had to do before it
executed the Python part. In both cases the C++ program uses threads
(is that relevant with respect to fork() in Python? I found some hints
online that it could be).
I traced it down to os.fork() in subprocess.py, which returns in one
case and does not return at all (in neither the parent nor the child)
in the other case.
If I do strace -f on the process I get the following (the
[pid 18646] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f96a5fa2a50) = ? ERESTARTNOINTR (To be restarted)
[pid 18646] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 18646] rt_sigreturn() = 56
[pid 18646] clone(child_stack=0,
flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f96a5fa2a50) = ? ERESTARTNOINTR (To be restarted)
[pid 18646] --- SIGPROF {si_signo=SIGPROF, si_code=SI_KERNEL} ---
[pid 18646] rt_sigreturn() = 56
[pid 18646] clone(child_stack=0,
There is no loop in Python's fork code, so it cannot really be blamed.
It sure looks like there is a pending signal which interferes for some
reason.
Thank you.

I have a signal handler for SIGINT in the program; could this interfere?
Or should I set up another signal handler?
Or is it necessary to set up signal handlers in a special way when
using boost::python?
Could you suggest a way to get more useful debugging information to
track down the problem?

Best,
Peter
--
http://www.peterschueller.com/
Stefan Ring
2015-05-05 13:48:46 UTC
Post by Peter Schüller
Could you suggest a way to get more useful debugging information to
track down the problem?
I'd try to find out if there is a signal handler for signal 56 (I'm
guessing that's the signal number) and where it's coming from.
Stefan Ring
2015-05-05 13:50:35 UTC
Post by Stefan Ring
Post by Peter Schüller
Could you suggest a way to get more useful debugging information to
track down the problem?
I'd try to find out if there is a signal handler for signal 56 (I'm
guessing that's the signal number) and where it's coming from.
Actually, it's SIGPROF (27), as I just noticed.
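
From inside the program you could check, for example, what the current
SIGPROF disposition is and whether a profiling interval timer is armed.
A rough sketch (getsignal/getitimer are in the standard signal module
on Unix, Python 2.6+):

import signal

def inspect_sigprof():
    # Python-level handler for SIGPROF (SIG_DFL, SIG_IGN, a handler,
    # or None if the handler was installed outside Python, e.g. in C++)
    print("SIGPROF handler: %r" % (signal.getsignal(signal.SIGPROF),))
    # (interval, value) of the profiling timer; non-zero means SIGPROF
    # is being delivered periodically, e.g. by gprof-style profiling
    print("ITIMER_PROF: %r" % (signal.getitimer(signal.ITIMER_PROF),))

inspect_sigprof()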
Stefan Ring
2015-05-05 14:00:40 UTC
See also <http://caml.inria.fr/mantis/view.php?id=5893>. Maybe
RedHat's bugfix never made it into mainline, or the problem
reappeared.
Stefan Ring
2015-05-07 15:27:31 UTC
Post by Stefan Ring
See also <http://caml.inria.fr/mantis/view.php?id=5893>. Maybe
RedHat's bugfix never made it into mainline, or the problem
reappeared.
Completely unrelated to boost::python, but I guess you could just
block the SIGPROF signal before forking and unblock it afterwards. The
RHEL 5 kernel has apparently been patched with some sort of
workaround, but this has not made it into mainline or the later RHEL
branches.
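
As a sketch of that idea in Python itself: signal.pthread_sigmask only
exists from Python 3.3 on, so with Python 2.7 the equivalent
sigprocmask call would have to be done via ctypes or on the C++ side
before calling into the interpreter, but it would look roughly like
this:

import signal
import subprocess

def call_without_sigprof(cmd):
    # block SIGPROF so it cannot be delivered while fork()/exec() run...
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGPROF})
    try:
        return subprocess.call(cmd)
    finally:
        # ...and restore the previous signal mask afterwards
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)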
Peter Schüller
2015-05-13 10:05:32 UTC
Post by Stefan Ring
Post by Stefan Ring
See also <http://caml.inria.fr/mantis/view.php?id=5893>. Maybe
RedHat's bugfix never made it into mainline, or the problem
reappeared.
Completely unrelated to boost::python, but I guess you could just
block the SIGPROF signal before forking and unblock it afterwards. The
RHEL 5 kernel has apparently been patched with some sort of
workaround, but this has not made it into mainline or the later RHEL
branches.
Unfortunately I can no longer reproduce the problem on the computer
where it originally happened (on my other computer it never happened).

I noticed that I had compiled with -pg, which seems to be the source of
the profiling signals (SIGPROF) that apparently prevent fork() from
succeeding.

So if someone else runs into the same problem, removing -pg from the
compilation flags might be enough to fix it.

Thank you for the support!
Peter
--
Peter SCHÜLLER, Assistant Professor Dr
Marmara University, Computer Engineering Department
http://www.peterschueller.com/