24 Sep, 2009

40 commits

  • id_reg.if_mode might be uninitialized when (*mrq)->error is nonzero. Move
    dev_dbg() inside the if so that we are sure we can use id_reg values.

    Signed-off-by: Jiri Slaby
    Cc: Alex Dubov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • unsigned block cannot be less than 0.

    Signed-off-by: Roel Kluin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     
  • Module edac_core.ko uses call_rcu() callbacks in edac_device.c, edac_mc.c
    and edac_pci.c.

    They all use a wait_for_completion() scheme, but this scheme is not 100%
    safe on multiple CPUs. See the _rcu_barrier() implementation which
    explains why extra precaution is needed.

    The patch adds a comment about rcu_barrier() and as a precaution calls
    rcu_barrier(). A maintainer needs to look at removing the
    wait_for_completion code.

    [dougthompson@xmission.com: remove the wait_for_completion code]
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Doug Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • A driver for the Intel 3200 and 3210 memory controllers. It has only had
    light testing so far, and currently makes no attempt to decode error
    addresses at anything finer than csrow granularity.

    Signed-off-by: Jason Uhlenkott
    Signed-off-by: Doug Thompson
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Uhlenkott
     
  • Use the function resource_size, which reduces the chance of introducing
    off-by-one errors in calculating the resource size.

    The semantic patch that makes this change is as follows:
    (http://www.emn.fr/x-info/coccinelle/)

    //
    @@
    struct resource *res;
    @@

    - (res->end - res->start) + 1
    + resource_size(res)
    //

    Signed-off-by: Julia Lawall
    Signed-off-by: Doug Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Julia Lawall
     
  • Add support for the Freescale MPC83xx memory controller to the existing
    driver for the Freescale MPC85xx memory controller. The only difference
    between the two processors is in the CS_BNDS register parsing code, which
    has been changed so it will work on both processors.

    The L2 cache controller does not exist on the MPC83xx, but the OF
    subsystem will not use the driver if the device is not present in the OF
    device tree.

    I had to change the nr_pages calculation to make the math work out. I
    checked it on my board and did the math by hand for a 64GB 85xx using 64K
    pages. In both cases, nr_pages * PAGE_SIZE comes out to the correct
    value.

    Signed-off-by: Ira W. Snyder
    Signed-off-by: Doug Thompson
    Cc: Kumar Gala
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ira W. Snyder
     
  • Based on Kumar's new compatible types patch, add P2020 into MPC85xx EDAC
    compatible lists so that EDAC can recognize the P2020 memory controller and
    L2 cache controller and export the relevant fields to sysfs.

    EDAC MPC85xx DDR3 support is needed if a DDR3 memory stick is installed on a
    P2020DS board so that EDAC core can recognize the DDR3 memory type.

    Signed-off-by: Yang Shi
    Acked-by: Dave Jiang
    Signed-off-by: Doug Thompson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • The forward decls for some kernel types are only needed by the code behind
    __KERNEL__, so don't bleed these types to userspace.

    Signed-off-by: Mike Frysinger
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • CLONE_PARENT was used to implement an older threading model. For
    consistency with the CLONE_THREAD check in copy_pid_ns(), disable
    CLONE_PARENT with CLONE_NEWPID, at least until the required semantics of
    pid namespaces are clear.

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Roland McGrath
    Acked-by: Serge Hallyn
    Cc: Oren Laadan
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • When global or container-init processes use CLONE_PARENT, they create a
    multi-rooted process tree. Besides, siblings of global init remain as
    zombies on exit since they are not reaped by their parent (swapper). So
    prevent global and container-inits from creating siblings.

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Eric W. Biederman
    Acked-by: Roland McGrath
    Cc: Oren Laadan
    Cc: Oleg Nesterov
    Cc: Serge Hallyn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
     
  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places at arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Signed-off-by: Joe Perches
    Cc: David Daney
    Cc: Herbert Xu
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • The on-chip OTP may be written at runtime, so enable support for it in the
    driver. However, since writing should really be done only on development
    systems, don't bend over backwards to make sure the simple software lock
    is per-fd -- per-device is OK.

    Signed-off-by: Mike Frysinger
    Signed-off-by: Bryan Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • This driver memory maps the UV Hub RTC.

    Signed-off-by: Dimitri Sivanich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dimitri Sivanich
     
  • If DownLoad.ProductCode == MAX_PRODUCT, that would be a problem when we do
    RIOBootTable[DownLoad.ProductCode] a couple lines down.

    Found by smatch (http://repo.or.cz/w/smatch.git).

    Signed-off-by: Dan Carpenter
    Cc: Jiri Slaby
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • The periodic interrupt from drivers/char/hpet.c does not work correctly,
    both when using the periodic capability of the hardware and while
    emulating the periodic interrupt (when hardware does not support periodic
    mode).

    With timers capable of periodic interrupts, the comparator field is first
    set with the period value, followed by a write to the hidden accumulator,
    which has the side effect of overwriting the comparator value. This results
    in wrong periodicity for the interrupts. For periodic interrupts to work,
    the following steps are necessary, in this order:

    * Set config with Tn_VAL_SET_CNF bit

    * Write to hidden accumulator, the value written is the time when the
    first interrupt should be generated

    * Write the comparator with the period interval for subsequent interrupts
    (http://www.intel.com/hardwaredesign/hpetspec_1.pdf )

    When emulating a periodic timer with timers not capable of periodic
    interrupts, the driver adds the period to the counter value instead of the
    comparator value, which causes slow drift when using this emulation.

    Also, the driver seems to add hpetp->hp_delta both while setting up the
    periodic interrupt and while emulating periodic interrupts with timers not
    capable of doing periodic interrupts. This hp_delta will result in a slower
    than expected interrupt rate and should not be used while setting the interval.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Nils Carlson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nils Carlson
     
  • Check whether index is within bounds before grabbing the element.

    Signed-off-by: Roel Kluin
    Cc: Kay Sievers
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     
  • There are two useless lines in fs/char_dev.c.

    In register_chrdev there is a loop to change all '/' into '!' in the
    kernel object name.
    This code is useless as the same substitution is in kobject_set_name_vargs in
    lib/kobject.c:
    /* ewww... some of these buggers have '/' in the name ... */
    while ((s = strchr(kobj->name, '/')))
            s[0] = '!';

    kobject_set_name_vargs is called by kobject_set_name.
    kobject_set_name is called just above the useless loop.

    [hidave.darkstar@gmail.com: fix warning, remove the unused char *s]
    Signed-off-by: Renzo Davoli
    Cc: Al Viro
    Signed-off-by: Dave Young
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Renzo Davoli
     
  • In read_zero, we check for access_ok() once for the count bytes. It is
    unnecessarily checked again in clear_user. Use __clear_user, which does
    not check for access_ok().

    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan
     
  • There is a common macro now for testing mixed pointer/errno values, so use
    that rather than handling the casts ourself.

    Signed-off-by: Mike Frysinger
    Acked-by: David McCullough
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Ignore the loader's PT_GNU_STACK when calculating the stack size, and only
    consider the executable's PT_GNU_STACK, assuming the executable has one.

    Currently the behaviour is to take the largest stack size and use that,
    but that means you can't reduce the stack size in the executable. The
    loader's stack size should probably only be used when executing the loader
    directly.

    WARNING: This patch is slightly dangerous - it may render a system
    inoperable if the loader's stack size is larger than that of important
    executables, and the system relies unknowingly on this increasing the size
    of the stack.

    Signed-off-by: David Howells
    Signed-off-by: Mike Frysinger
    Acked-by: Paul Mundt
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Introduce a helper function elf_note_info_init() to help fill_note_info()
    to do initializations, also fix the potential memory leaks.

    [akpm@linux-foundation.org: remove NUM_NOTES]
    Signed-off-by: WANG Cong
    Cc: Alexander Viro
    Cc: David Howells
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     
  • __fatal_signal_pending inlines to one instruction on x86, probably two
    instructions on other machines. It takes two longer x86 instructions just
    to call it and test its return value, not to mention the function itself.

    On my random x86_64 config, this saved 70 bytes of text (59 of those being
    __fatal_signal_pending itself).

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • In order to direct the SIGIO signal to a particular thread of a
    multi-threaded application we cannot, as suggested by the manpage, put a
    TID into the regular fcntl(F_SETOWN) call. It will still be sent to the
    whole process of which that thread is part.

    Since people do want to properly direct SIGIO we introduce F_SETOWN_EX.

    The need to direct SIGIO comes from self-monitoring profiling such as with
    perf-counters. Perf-counters uses SIGIO to notify that new sample data is
    available. If the signal is delivered to the same task that generated the
    new sample it can augment that data by inspecting the task's user-space
    state right after it returns from the kernel. This is esp. convenient
    for interpreted or virtual machine driven environments.

    Both F_SETOWN_EX and F_GETOWN_EX take a pointer to a struct f_owner_ex
    as argument:

    struct f_owner_ex {
    int type;
    pid_t pid;
    };

    Where type is one of F_OWNER_TID, F_OWNER_PID or F_OWNER_GID.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Oleg Nesterov
    Tested-by: stephane eranian
    Cc: Michael Kerrisk
    Cc: Roland McGrath
    Cc: Al Viro
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • group_send_sig_info()->check_kill_permission() assumes that current is the
    sender and uses current_cred().

    This is not true in send_sigio_to_task() case. From the security pov the
    sender is not current, but the task which did fcntl(F_SETOWN), that is why
    we have sigio_perm() which uses the right creds to check.

    Fortunately, send_sigio() always sends either SEND_SIG_PRIV or
    SI_FROMKERNEL() signal, so check_kill_permission() does nothing. But
    still it would be tidier to avoid this bogus security check and save a
    couple of cycles.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: stephane eranian
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Introduce do_send_sig_info() and convert group_send_sig_info(),
    send_sig_info(), do_send_specific() to use this helper.

    Hopefully it will have more users soon; it allows specifying
    specific/group behaviour via the "bool group" argument.

    Shaves 80 bytes from .text.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: stephane eranian
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • sys_delete_module() can set MODULE_STATE_GOING after
    search_binary_handler() does try_module_get(). In this case
    set_binfmt()->try_module_get() fails but since none of the callers
    check the returned error, the task will run with the wrong old
    ->binfmt.

    The proper fix should change all ->load_binary() methods, but we can
    rely on the fact that the caller must hold a reference to binfmt->module
    and use __module_get() which never fails.

    Signed-off-by: Oleg Nesterov
    Acked-by: Rusty Russell
    Cc: Hiroshi Shimamoto
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Allow core_pattern pipes to wait for user space to complete

    One of the things that user space processes like to do is look at metadata
    for a crashing process in their /proc/ directory. This is racy,
    however, since do_coredump in the kernel doesn't wait for the user space
    process to complete before it reaps the crashing process. This patch
    corrects that by allowing the kernel to wait for the user space process to
    complete before cleaning up the crashing process. This is a bit tricky to
    do for a few reasons:

    1) The user space process isn't our child, so we can't sys_wait4 on it
    2) We need to close the pipe before waiting for the user process to complete,
    since the user process may rely on an EOF condition

    I've discussed several solutions with Oleg Nesterov off-list about this,
    and this is the one we've come up with. We add ourselves as a pipe reader
    (to prevent premature cleanup of the pipe_inode_info), and remove
    ourselves as a writer (to provide an EOF condition to the writer in user
    space), then we iterate until the user space process exits (which we
    detect by pipe->readers == 1, hence the > 1 check in the loop). When we
    exit the loop, we restore the proper reader/writer values, then we return
    and let filp_close in do_coredump clean up the pipe data properly.

    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Introduce core pipe limiting sysctl.

    Since we can dump cores to pipe, rather than directly to the filesystem,
    we create a condition in which a user can create a very high load on the
    system simply by running bad applications.

    If the pipe reader specified in core_pattern is poorly written, we can
    have lots of outstanding resources and processes in the system.

    This sysctl introduces an ability to limit that resource consumption.
    core_pipe_limit defines how many in-flight dumps may be run in parallel;
    dumps beyond this value are skipped and a note is made in the kernel log.
    A special value of 0 in core_pipe_limit means that an unlimited number of
    core dumps may be handled (this is the default value).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Change how we detect recursive dumps.

    Currently we have a mechanism by which we try to compare pathnames of the
    crashing process to the core_pattern path. This is broken for a dozen
    reasons, and just doesn't work in any sort of robust way.

    I'm replacing it with the use of a 0 RLIMIT_CORE value. Since helper apps
    set RLIMIT_CORE to zero, we don't write out core files for any process
    with that particular limit set. If the core_pattern is a pipe, any
    non-zero limit is translated to RLIM_INFINITY.

    This allows complete dumps to be captured, but prevents infinite recursion
    in the event that the core_pattern process itself crashes.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Neil Horman
    Reported-by: Earl Chew
    Cc: Oleg Nesterov
    Cc: Andi Kleen
    Cc: Alan Cox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • This changes tracehook_notify_jctl() so it's called with the siglock held,
    and changes its argument and return value definition. These clean-ups
    make it a better fit for what new tracing hooks need to check.

    Tracing needs the siglock here, held from the time TASK_STOPPED was set,
    to avoid potential SIGCONT races if it wants to allow any blocking in its
    tracing hooks.

    This also folds the finish_stop() function into its caller
    do_signal_stop(). The function is short, called only once and only
    unconditionally. It aids readability to fold it in.

    [oleg@redhat.com: do not call tracehook_notify_jctl() in TASK_STOPPED state]
    [oleg@redhat.com: introduce tracehook_finish_jctl() helper]
    Signed-off-by: Roland McGrath
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • Current behaviour of sys_waitid() looks odd. If user passes infop ==
    NULL, sys_waitid() returns success. When user additionally specifies flag
    WNOWAIT, sys_waitid() returns -EFAULT on the same conditions. When user
    combines WNOWAIT with WCONTINUED, sys_waitid() again returns success.

    This patch adds a check for ->wo_info in wait_noreap_copyout().

    User-visible change: starting from this commit, sys_waitid() always checks
    infop != NULL and does not fail if it is NULL.

    Signed-off-by: Vitaly Mayatskikh
    Reviewed-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Mayatskikh
     
  • do_wait() checks ->wo_info to figure out who is the caller. If it's not
    NULL the caller should be sys_waitid(), in that case do_wait() fixes up
    the retval or zeros ->wo_info, depending on retval from underlying
    function.

    This is a bug: the user can pass ->wo_info == NULL and sys_waitid() will
    return an incorrect value.

    man 2 waitid says:

    waitid(): returns 0 on success

    Test-case:

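    #include <assert.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            /* Parent: wait for the child without asking for siginfo. */
            if (fork())
                    assert(waitid(P_ALL, 0, NULL, WEXITED) == 0);

            return 0;
    }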

    Result:

    Assertion `waitid(P_ALL, 0, ((void *)0), 4) == 0' failed.

    Move that code to sys_waitid().

    User-visible change: sys_waitid() will return 0 on success, whether or
    not infop is set.

    Note, there's another bug in wait_noreap_copyout() which affects
    return value of sys_waitid(). It will be fixed in next patch.

    Signed-off-by: Vitaly Mayatskikh
    Reviewed-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Mayatskikh
     
  • Kill the unused "parent" argument in wait_consider_task(), it was never used.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • task_pid_type() is only used by eligible_pid() which has to check wo_type
    != PIDTYPE_MAX anyway. Remove this check from task_pid_type() and factor
    out ->pids[type] access, this shrinks .text a bit and simplifies the code.

    This matches the behaviour of other similar helpers, say get_task_pid().
    The caller must ensure that pid_type is valid, not the callee.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • child_wait_callback()->eligible_child() is not right, we can miss the
    wakeup if the task was detached before __wake_up_parent() and the caller
    of do_wait() didn't use __WALL.

    Move ->wo_pid checks from eligible_child() to the new helper,
    eligible_pid(), and change child_wait_callback() to use it instead of
    eligible_child().

    Note: actually I think it would be better to fix the __WCLONE check in
    eligible_child(), it doesn't look exactly right. But it is not clear what
    is the supposed behaviour, and any change is user-visible.

    Reported-by: KAMEZAWA Hiroyuki
    Tested-by: KAMEZAWA Hiroyuki
    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Suggested by Roland.

    do_wait(__WNOTHREAD) can only succeed if the caller is either ptracer, or
    it is ->real_parent and the child is not traced. IOW, caller == p->parent
    otherwise we should not wake up.

    Change child_wait_callback() to check this. Ratan reports a workload with
    CPU load >99% caused by unnecessary wakeups; this patch should fix it.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Ratan Nalumasu reported unnecessary wakeups in a process with many
    threads. Every waiting thread in the process wakes up to loop
    through the children and see that the only ones it cares about are still
    not ready.

    Now that we have struct wait_opts we can change do_wait/__wake_up_parent
    to use filtered wakeups.

    We can make child_wait_callback() more clever later, right now it only
    checks eligible_child().

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Acked-by: James Morris
    Tested-by: Valdis Kletnieks
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • …to wait_consider_task()

    Preparation, no functional changes.

    eligible_child() has a single caller, wait_consider_task(). We can move
    security_task_wait() out from eligible_child(), this allows us to use it
    for filtered wake_up().

    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Roland McGrath <roland@redhat.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Ratan Nalumasu <rnalumasu@gmail.com>
    Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Oleg Nesterov
     
  • The bug is old, it wasn't caused by recent changes.

    Test case:

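    #include <assert.h>
    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Sub-thread: attach to the child as tracer, then kill it. */
    static void *tfunc(void *arg)
    {
            int pid = (long)arg;

            assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
            kill(pid, SIGKILL);

            sleep(1);
            return NULL;
    }

    int main(void)
    {
            pthread_t th;
            long pid = fork();

            if (!pid)
                    pause();

            /* Ignore SIGCHLD so the zombie is reaped on detach. */
            signal(SIGCHLD, SIG_IGN);
            assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0);

            int r = waitpid(-1, NULL, __WNOTHREAD);
            printf("waitpid: %d %m\n", r);

            return 0;
    }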

    Before the patch this program hangs; after the patch waitpid() correctly
    fails with errno == ECHILD.

    The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
    ->real_parent is our sub-thread and we ignore SIGCHLD. But in this case
    we should wake up other threads which can sleep in do_wait().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov