28 May, 2010

40 commits

  • The wake-up part of semtimedop() consists of two steps:

    - the right tasks must be identified.
    - they must be woken up.

    Right now, both steps run while the array spinlock is held. This patch
    reorders the code and moves the actual wake_up_process() behind the point
    where the spinlock is dropped.

    The code also moves the setting of sem->sem_otime to one place: it does
    not make sense to set the last-modify time multiple times.
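
    The resulting two-phase pattern can be sketched in plain userspace C.
    Everything below (the waiter struct, collect_complete(), the fake lock)
    is an illustrative stand-in, not the actual ipc/sem.c code:

```c
#include <assert.h>
#include <stddef.h>

struct waiter {
	struct waiter *next;
	int satisfied;	/* the semaphore operation can now complete */
	int woken;	/* our stand-in for wake_up_process() */
};

static int array_lock_held;	/* models the semaphore array spinlock */
static void array_lock(void)   { array_lock_held = 1; }
static void array_unlock(void) { array_lock_held = 0; }

/* Step 1, under the lock: identify the right tasks and unlink them. */
static struct waiter *collect_complete(struct waiter **queue)
{
	struct waiter *done = NULL, **pp = queue;

	while (*pp) {
		struct waiter *w = *pp;

		if (w->satisfied) {
			*pp = w->next;	/* unlink from the sleep queue */
			w->next = done;	/* move to a private wake-up list */
			done = w;
		} else {
			pp = &w->next;
		}
	}
	return done;
}

/* Step 2, after the lock is dropped: actually wake the tasks up. */
static void wake_up_list(struct waiter *done)
{
	assert(!array_lock_held);	/* the point of the patch */
	for (; done; done = done->next)
		done->woken = 1;
}

static void update_queue(struct waiter **queue)
{
	struct waiter *done;

	array_lock();
	done = collect_complete(queue);
	array_unlock();
	wake_up_list(done);
}
```

    With the wake-ups moved to a private list, wake_up_list() runs entirely
    outside the lock, shrinking the locked section to pure list
    manipulation.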

    [akpm@linux-foundation.org: repair kerneldoc]
    [akpm@linux-foundation.org: fix uninitialised retval]
    Signed-off-by: Manfred Spraul
    Cc: Chris Mason
    Cc: Zach Brown
    Cc: Jens Axboe
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • The following series of patches tries to fix the spinlock contention
    reported by Chris Mason - his benchmark exposes problems of the current
    code:

    - In the worst case, the algorithm used by update_queue() is O(N^2).
    Bulk wake-up calls can enter this worst case. The patch series fixes
    that.

    Note that the benchmark app doesn't expose this problem; it should
    still be fixed: real-world apps might do the wake-ups in an order other
    than perfect FIFO.

    - The part of the code that runs within the semaphore array spinlock is
    significantly larger than necessary.

    The patch series fixes that. This change is responsible for the main
    improvement.

    - The cacheline with the spinlock is also used for a variable that is
    read in the hot path (sem_base) and for a variable that is unnecessarily
    written to multiple times (sem_otime). The last step of the series
    cacheline-aligns the spinlock.

    This patch:

    The SysV semaphore code allows multiple operations on the semaphores in
    an array to be performed as one atomic operation. After a modification,
    update_queue() checks which of the waiting tasks can complete.

    The algorithm that is used to identify the tasks is O(N^2) in the worst
    case. For some cases, it is simple to avoid the O(N^2).

    The patch adds a detection logic for some cases, especially for the case
    of an array where all sleeping tasks are single sembuf operations and a
    multi-sembuf operation is used to wake up multiple tasks.

    A big database application uses that approach.

    The patch fixes the wakeup due to semctl(,,SETALL,) - the initial
    version of the patch broke that.
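
    The single-sembuf detection can be modeled with per-semaphore wait
    lists. The sketch below is a toy (hypothetical names, and it ignores
    semaphore values, waking every single-sembuf sleeper on the touched
    semaphores); the point is that a wakeup visits only the queues of the
    modified semaphores instead of scanning all sleeping tasks:

```c
#include <assert.h>
#include <stddef.h>

#define NSEMS 4

struct single_waiter {
	struct single_waiter *next;
	int woken;
};

/* One wait list per semaphore for tasks sleeping on a single sembuf,
 * so a wakeup only scans the semaphores that were actually modified. */
static struct single_waiter *per_sem[NSEMS];
static int scanned;	/* counts how many sleepers we had to look at */

static void wake_sem(int semnum)
{
	struct single_waiter *w = per_sem[semnum];

	per_sem[semnum] = NULL;
	for (; w; w = w->next) {
		scanned++;
		w->woken = 1;
	}
}

/* A multi-sembuf operation modified the semaphores listed in sops[]:
 * only their private queues are visited, not every sleeping task. */
static void do_smart_update(const int *sops, int nsops)
{
	for (int i = 0; i < nsops; i++)
		wake_sem(sops[i]);
}
```

    In this toy model, waking k tasks costs O(k), instead of rescanning the
    whole queue after every wakeup as in the O(N^2) worst case.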

    [akpm@linux-foundation.org: make do_smart_update() static]
    Signed-off-by: Manfred Spraul
    Cc: Chris Mason
    Cc: Zach Brown
    Cc: Jens Axboe
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Manfred Spraul
     
  • Currently idr_remove_all will fail with a use-after-free error if
    idr::layers is bigger than 2, which on 32 bit systems corresponds to
    more than 1024 items. This is due to stepping back too many levels
    during backtracking. For simplicity let's assume that IDR_BITS=1 -> we have 2
    nodes at each level below the root node and each leaf node stores two IDs.
    (In reality for 32 bit systems IDR_BITS=5, with 32 nodes at each sub-root
    level and 32 IDs in each leaf node). The sequence of freeing the nodes at
    the moment is as follows:

    layer
    1 -> a(7)
    2 -> b(3) c(5)
    3 -> d(1) e(2) f(4) g(6)

    Until step 4 things go fine, but then node c is freed, whereas node g
    should be freed first. Since node c contains the pointer to node g,
    we'll have a use-after-free error at step 6.

    How many levels we step back after visiting the leaf nodes is currently
    determined by the msb of the id we are currently visiting:

    Step
    1. node d with IDs 0,1 is freed; the current ID is advanced to 2.
    The msb of the current ID is bit 1. This means we need to step back
    1 level to node b and take the next sibling, node e.
    2-3. node e with IDs 2,3 is freed, current ID is 4, msb is bit 2.
    This means we need to step back 2 levels to node a, freeing
    node b on the way.
    4-5. node f with IDs 4,5 is freed, current ID is 6, msb is still
    bit 2. This means we again need to step back 2 levels to node
    a and free c on the way.
    6. We should visit node g, but its pointer is not available as
    node c was freed.

    The fix changes how we determine the number of levels to step back.
    Instead of deducting this merely from the msb of the current ID, we should
    really check if advancing the ID causes an overflow to a bit position
    corresponding to a given layer. In the above example overflow from bit 0
    to bit 1 should mean stepping back 1 level. Overflow from bit 1 to bit 2
    should mean stepping back 2 levels and so on.

    The fix was tested with IDs up to 1 << 20, which corresponds to 4 layers
    on 32 bit systems.
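
    The corrected step-back rule can be modeled directly. With IDR_BITS=1
    (the binary tree of the example above), advancing the ID by one carries
    through its trailing 1-bits, and each carry-out is one level to step
    back. The toy function below is illustrative, not the actual lib/idr.c
    code:

```c
#include <assert.h>

/* Illustrative model with IDR_BITS=1, i.e. a binary tree where each
 * level consumes one bit of the ID. The number of levels to step back
 * after advancing the ID by one is the number of bit positions that
 * overflow (carry out), not simply the position of the msb. */
#define IDR_BITS 1

static int levels_to_step_back(unsigned int id)
{
	int levels = 0;

	/* advancing id -> id+1 carries through the trailing 1-bits */
	while (id & 1) {
		id >>= IDR_BITS;
		levels++;
	}
	return levels;
}
```

    For the example tree this gives 1 level at step 1 (ID 1 -> 2), 2 levels
    at steps 2-3 (ID 3 -> 4), but only 1 level at steps 4-5 (ID 5 -> 6),
    which is exactly where the msb rule went wrong.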

    Signed-off-by: Imre Deak
    Reviewed-by: Tejun Heo
    Cc: Eric Paris
    Cc: "Paul E. McKenney"
    Cc: [2.6.34.1]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Imre Deak
     
  • When CONFIG_HOTPLUG_CPU=n, get_online_cpus() does nothing, so we don't
    need cpu_hotplug_begin() either.

    This patch moves cpu_hotplug_begin()/cpu_hotplug_done() into the code
    block of CONFIG_HOTPLUG_CPU=y.

    Signed-off-by: Lai Jiangshan
    Cc: Gautham R Shenoy
    Cc: Ingo Molnar

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • I used this module to test the series of modification to the cpu notifiers
    code.

    Example1: inject CPU offline error (-1 == -EPERM)

    # modprobe cpu-notifier-error-inject cpu_down_prepare_error=-1
    # echo 0 > /sys/devices/system/cpu/cpu1/online
    bash: echo: write error: Operation not permitted

    Example2: inject CPU online error (-2 == -ENOENT)

    # modprobe cpu-notifier-error-inject cpu_up_prepare_error=-2
    # echo 1 > /sys/devices/system/cpu/cpu1/online
    bash: echo: write error: No such file or directory

    [akpm@linux-foundation.org: fix Kconfig help text]
    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • With the previous modification, cpu notifiers can return an
    encapsulated errno value. This converts the cpu notifiers for raid5.

    Signed-off-by: Akinobu Mita
    Cc: Neil Brown
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • With the previous modification, cpu notifiers can return an
    encapsulated errno value. This converts the cpu notifiers for s390.

    Signed-off-by: Akinobu Mita
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • With the previous modification, cpu notifiers can return an
    encapsulated errno value. This converts the cpu notifiers for ehca.

    Signed-off-by: Akinobu Mita
    Cc: Hoang-Nam Nguyen
    Cc: Christoph Raisch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • With the previous modification, cpu notifiers can return an
    encapsulated errno value. This converts the cpu notifiers for iucv.

    Signed-off-by: Akinobu Mita
    Cc: Ursula Braun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • With the previous modification, cpu notifiers can return an
    encapsulated errno value. This converts the cpu notifiers for slab.

    Signed-off-by: Akinobu Mita
    Cc: Christoph Lameter
    Acked-by: Pekka Enberg
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • With the previous modification, cpu notifiers can return an
    encapsulated errno value. This converts the cpu notifiers for
    kernel/*.c.

    Signed-off-by: Akinobu Mita
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • With the previous modification, cpu notifiers can return an
    encapsulated errno value. This converts the cpu notifiers for topology.

    Signed-off-by: Akinobu Mita
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • With the previous modification, cpu notifiers can return an
    encapsulated errno value. This converts the cpu notifiers for msr,
    cpuid, and therm_throt.

    Signed-off-by: Akinobu Mita
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • This changes notifier_from_errno(0) to be NOTIFY_OK instead of
    NOTIFY_STOP_MASK | NOTIFY_OK.

    Currently, notifiers that return an encapsulated errno value have to
    do something like this:

        err = do_something(); /* returns -errno */
        if (err)
                return notifier_from_errno(err);
        else
                return NOTIFY_OK;

    This change makes the above code simpler:

        err = do_something(); /* returns -errno */

        return notifier_from_errno(err);
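
    The resulting encoding can be exercised with a userspace copy of the
    helpers; the constants and the two functions below mirror
    include/linux/notifier.h after this change:

```c
#include <assert.h>
#include <errno.h>

/* Values as in include/linux/notifier.h */
#define NOTIFY_DONE		0x0000
#define NOTIFY_OK		0x0001
#define NOTIFY_STOP_MASK	0x8000
#define NOTIFY_BAD		(NOTIFY_STOP_MASK | 0x0002)

/* After this change: err == 0 maps to plain NOTIFY_OK, without
 * NOTIFY_STOP_MASK, so a successful notifier no longer stops the chain. */
static int notifier_from_errno(int err)
{
	if (err)
		return NOTIFY_STOP_MASK | (NOTIFY_OK - err);
	return NOTIFY_OK;
}

/* Restore the (negative) errno value from a notifier return value. */
static int notifier_to_errno(int ret)
{
	ret &= ~NOTIFY_STOP_MASK;
	return ret > NOTIFY_OK ? NOTIFY_OK - ret : 0;
}
```

    The two helpers round-trip any -errno, and notifier_from_errno(-EPERM)
    happens to encode as NOTIFY_BAD.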

    Signed-off-by: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Currently, when onlining or offlining a CPU fails because one of the
    cpu notifiers returns an error, the caller always gets -EINVAL (i.e.
    writing 0 or 1 to /sys/devices/system/cpu/cpuX/online returns EINVAL).

    To get better error reporting rather than always -EINVAL, this changes
    cpu_notify() to return the -errno value via notifier_to_errno() and
    fixes the callers, now that cpu notifiers can return an encapsulated
    errno value.

    Currently, all cpu hotplug notifiers return NOTIFY_OK, NOTIFY_BAD, or
    NOTIFY_DONE, so with this change cpu_notify() can return 0 or -EPERM
    for now.

    (notifier_to_errno(NOTIFY_OK) == 0, notifier_to_errno(NOTIFY_DONE) == 0,
    notifier_to_errno(NOTIFY_BAD) == -EPERM)

    Forthcoming patches convert several cpu notifiers to return
    encapsulated errno values with notifier_from_errno().
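
    The propagation can be sketched with a toy notifier chain in userspace
    C; call_chain(), cpu_notify() and the sample notifiers below are
    hypothetical simplifications of the kernel's notifier machinery:

```c
#include <assert.h>
#include <errno.h>

#define NOTIFY_DONE		0x0000
#define NOTIFY_OK		0x0001
#define NOTIFY_STOP_MASK	0x8000
#define NOTIFY_BAD		(NOTIFY_STOP_MASK | 0x0002)

static int notifier_to_errno(int ret)
{
	ret &= ~NOTIFY_STOP_MASK;
	return ret > NOTIFY_OK ? NOTIFY_OK - ret : 0;
}

typedef int (*notifier_fn)(void);

/* Toy chain: call each notifier until one asks to stop. */
static int call_chain(notifier_fn *fns, int n)
{
	int ret = NOTIFY_DONE;

	for (int i = 0; i < n; i++) {
		ret = fns[i]();
		if (ret & NOTIFY_STOP_MASK)
			break;
	}
	return ret;
}

/* cpu_notify() analogue: callers now see a real -errno, not -EINVAL. */
static int cpu_notify(notifier_fn *fns, int n)
{
	return notifier_to_errno(call_chain(fns, n));
}

static int ok_notifier(void)  { return NOTIFY_OK; }
static int bad_notifier(void) { return NOTIFY_BAD; }
```

    A chain where every notifier succeeds yields 0; a NOTIFY_BAD in the
    chain surfaces as -EPERM, matching the behavior described above.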

    Signed-off-by: Akinobu Mita
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • No functional change. These are just wrappers of
    raw_cpu_notifier_call_chain.

    Signed-off-by: Akinobu Mita
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Extend KCORE_TEXT to cover the pages between _text and _stext, to allow
    examining some important page table pages.

    `readelf -a` output on x86_64 before and after patch:
    Type Offset VirtAddr PhysAddr
    before LOAD 0x00007fff8100c000 0xffffffff81009000 0x0000000000000000
    after LOAD 0x00007fff81003000 0xffffffff81000000 0x0000000000000000

    The newly covered pages are:

    0xffffffff81000000 etc.
    0xffffffff81001000
    0xffffffff81002000
    0xffffffff81003000
    0xffffffff81004000
    0xffffffff81005000
    0xffffffff81006000
    0xffffffff81007000
    0xffffffff81008000

    Before patch, /proc/kcore shows outdated contents for the above page
    table pages, for example:

    (gdb) p level3_ident_pgt
    $1 = {} 0xffffffff81002000
    (gdb) p/x *((pud_t *)&level3_ident_pgt)@512
    $2 = {{pud = 0x1006063}, {pud = 0x0} }

    while the real content is:

    root@hp /home/wfg# hexdump -s 0x1002000 -n 4096 /dev/mem
    1002000 6063 0100 0000 0000 8067 0000 0000 0000
    1002010 0000 0000 0000 0000 0000 0000 0000 0000
    *
    1003000

    That is, on an x86_64 box with 2GB memory, we see a first-1GB /
    full-2GB identity mapping before/after the patch:

    (gdb) p/x *((pud_t *)&level3_ident_pgt)@512
    before $1 = {{pud = 0x1006063}, {pud = 0x0} }
    after $1 = {{pud = 0x1006063}, {pud = 0x8067}, {pud = 0x0} }

    Obviously the content before patch is wrong.

    Signed-off-by: Wu Fengguang
    Cc: Andi Kleen
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • A quick test shows these comments are obsolete, so just remove them.

    Signed-off-by: WANG Cong
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Amerigo Wang
     
  • I removed 3 unused assignments. The first two get reset on the first
    statement of their functions. For "err" in root.c we don't return an
    error and we don't use the variable again.

    Signed-off-by: Dan Carpenter
    Cc: Oleg Nesterov
    Acked-by: Serge Hallyn
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Carpenter
     
  • No functional changes, just s/atomic_t count/int nr_threads/.

    With the recent changes this counter has a single user,
    get_nr_threads(), and none of its callers need the really accurate
    number of threads, not to mention that each caller obviously races with
    fork/exit. It is only used to report this value to user-space, except
    that first_tid() uses it to avoid an unnecessary while_each_thread()
    loop in the unlikely case.

    It is a bit sad we need a word in struct signal_struct for this, perhaps
    we can change get_nr_threads() to approximate the number of threads using
    signal->live and kill ->nr_threads later.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No functional changes.

    keyctl_session_to_parent() is the only user of signal->count which needs
    the correct value. Change it to use thread_group_empty() instead, this
    must be strictly equivalent under tasklist, and imho looks better.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Howells
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Trivial, use get_nr_threads() helper to read signal->count which we are
    going to change.

    Like other callers, proc_sched_show_task() doesn't need an exactly
    precise nr_threads.

    David said:

    : Note that get_nr_threads() isn't completely equivalent (it can return 0
    : where proc_sched_show_task() will display a 1). But I don't think this
    : should be a problem.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Howells
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that task->signal can't go away get_nr_threads() doesn't need
    ->siglock to read signal->count.

    Also, make it inline, move into sched.h, and convert 2 other proc users of
    signal->count to use this (now trivial) helper.

    Henceforth get_nr_threads() is the only valid user of signal->count; we
    are ready to turn it into "int nr_threads" or, perhaps, kill it.

    Signed-off-by: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • check_unshare_flags(CLONE_SIGHAND) adds CLONE_THREAD to *flags_ptr if the
    task is multithreaded to ensure unshare_thread() will fail.

    Not only is this a strange way to return the error, it is absolutely
    meaningless. If signal->count > 1 then sighand->count must also be > 1,
    and unshare_sighand() will fail anyway.

    In fact, all CLONE_THREAD/SIGHAND/VM checks inside sys_unshare() do not
    look right. Fortunately this code doesn't really work anyway.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move taskstats_tgid_free() from __exit_signal() to free_signal_struct().

    This way signal->stats never points to nowhere and we can read ->stats
    lockless.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Kill the empty thread_group_cputime_free() helper. It was needed to free
    the per-cpu data which we no longer have.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cleanup:

    - Add the boolean, group_dead = thread_group_leader(), for clarity.

    - Do not test/set sig == NULL to detect the all-dead case, use this
    boolean.

    - Pass this boolean to __unhash_process() and use it instead of another
    thread_group_leader() call which needs ->group_leader.

    This can be considered a micro-optimization, but hopefully it also
    allows us to do other cleanups later.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that task->signal can't go away we can revert the horrible hack added
    by ad474caca3e2a0550b7ce0706527ad5ab389a4d4 ("fix for
    account_group_exec_runtime(), make sure ->signal can't be freed under
    rq->lock").

    And we can do more cleanups sched_stats.h/posix-cpu-timers.c later.

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When the last thread exits signal->tty is freed, but the pointer is not
    cleared and points to nowhere.

    This is OK. Nobody should use signal->tty lockless, and it is no longer
    possible to take ->siglock. However this looks wrong even if correct, and
    the nice OOPS is better than subtle and hard to find bugs.

    Change __exit_signal() to clear signal->tty under ->siglock.

    Note: __exit_signal() needs more cleanups. It should not check "sig !=
    NULL" to detect the all-dead case and we have the same issues with
    signal->stats.

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Acked-by: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • We have a lot of problems with accessing task_struct->signal, it can
    "disappear" at any moment. Even current can't use its ->signal safely
    after exit_notify(). ->siglock helps, but it is not convenient, not
    always possible, and sometimes it makes sense to use task->signal even
    after this task is already dead.

    This patch adds the reference counter, sigcnt, into signal_struct. This
    reference is owned by task_struct and it is dropped in
    __put_task_struct(). Perhaps it makes sense to export
    get/put_signal_struct() later, but currently I don't see the immediate
    reason.

    Rename __cleanup_signal() to free_signal_struct() and unexport it. With
    the previous changes it does nothing except kmem_cache_free().

    Change __exit_signal() to not clear/free ->signal, it will be freed when
    the last reference to any thread in the thread group goes away.

    Note:
    - when the last thread exits signal->tty can point to nowhere, see
    the next patch.

    - with or without this patch signal_struct->count should go away,
    or at least it should be "int nr_threads" for fs/proc. This will
    be addressed later.

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • tty_kref_put() has two callsites in copy_process() paths,

    1. if copy_process() succeeds it is called before we copy
    signal->tty from the parent

    2. otherwise it is called from __cleanup_signal() under
    bad_fork_cleanup_signal: label

    In both cases tty_kref_put() is not right and unneeded because we don't
    have the balancing tty_kref_get(). Fortunately, this is harmless because
    this can only happen without CLONE_THREAD, and in this case signal->tty
    must be NULL.

    Remove tty_kref_put() from copy_process() and __cleanup_signal(), and
    change another caller of __cleanup_signal(), __exit_signal(), to call
    tty_kref_put() by hand.

    I hope this change makes sense by itself, but it is also needed to make
    ->signal refcountable.

    Signed-off-by: Oleg Nesterov
    Acked-by: Alan Cox
    Acked-by: Roland McGrath
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Preparation to make task->signal immutable, no functional changes.

    It doesn't matter which pointer we check under tasklist to ensure the task
    was not released, ->signal or ->sighand. But we are going to make
    ->signal refcountable, change the code to use ->sighand.

    Note: this code doesn't need this check and tasklist_lock at all, it
    should be converted to use lock_task_sighand(). And, the code under
    SIGNAL_STOP_STOPPED check looks wrong.

    Signed-off-by: Oleg Nesterov
    Cc: Fenghua Yu
    Cc: Roland McGrath
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Preparation to make task->signal immutable, no functional changes.

    posix-cpu-timers.c checks task->signal != NULL to ensure this task is
    alive and didn't pass __exit_signal(). This is correct but we are going
    to change the lifetime rules for ->signal and never reset this pointer.

    Change the code to check ->sighand instead, it doesn't matter which
    pointer we check under tasklist, they both are cleared simultaneously.

    As Roland pointed out, some of these changes are not strictly needed and
    probably it makes sense to revert them later, when ->signal will be pinned
    to task_struct. But this patch tries to ensure the subsequent changes in
    fork/exit can't make any visible impact on posix cpu timers.

    Signed-off-by: Oleg Nesterov
    Cc: Fenghua Yu
    Acked-by: Roland McGrath
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change __exit_signal() to check thread_group_leader() instead of
    atomic_dec_and_test(&sig->count). This must be equivalent, the group
    leader must be released only after all other threads have exited and
    passed __exit_signal().

    Henceforth sig->count is not actually used, except in fs/proc for
    get_nr_threads/etc.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • de_thread() and __exit_signal() use signal_struct->count/notify_count for
    synchronization. We can simplify the code and use ->notify_count only.
    Instead of comparing these two counters, we can change de_thread() to set
    ->notify_count = nr_of_sub_threads, then change __exit_signal() to
    dec-and-test this counter and notify group_exit_task.

    Note that __exit_signal() checks "notify_count > 0" just for symmetry with
    exit_notify(), we could just check it is != 0.
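
    A toy single-threaded model of the simplified scheme; the names are
    illustrative stand-ins, not the actual fs/exec.c and kernel/exit.c
    code:

```c
#include <assert.h>

/* de_thread() stores the number of live sub-threads in notify_count;
 * each exiting thread dec-and-tests it, and the last one notifies
 * group_exit_task. No second counter to compare against. */
static int notify_count;
static int leader_notified;

static void de_thread_wait_setup(int nr_sub_threads)
{
	notify_count = nr_sub_threads;
}

/* __exit_signal() analogue for one exiting sub-thread. */
static void exit_signal_one_thread(void)
{
	if (notify_count > 0 && --notify_count == 0)
		leader_notified = 1;	/* wake_up_process(group_exit_task) */
}
```

    Only the thread whose decrement drops the counter to zero performs the
    notification, which is the whole synchronization.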

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change zap_other_threads() to return the number of other sub-threads found
    on ->thread_group list.

    Other changes are cosmetic:

    - change the code to use while_each_thread() helper

    - remove the obsolete comment about SIGKILL/SIGSTOP
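
    The while_each_thread() pattern over the circular ->thread_group list
    can be modeled in userspace; the task struct and helpers below are a
    toy rendition, not the kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the circular thread-group list and the
 * while_each_thread() iteration pattern. */
struct task {
	struct task *next;	/* circular list through the thread group */
	int killed;
};

#define next_thread(t) ((t)->next)
#define while_each_thread(g, t) \
	while ((t = next_thread(t)) != (g))

/* Returns the number of other sub-threads found on the list. */
static int zap_other_threads(struct task *p)
{
	struct task *t = p;
	int count = 0;

	while_each_thread(p, t) {
		count++;
		t->killed = 1;	/* stands in for sending SIGKILL */
	}
	return count;
}
```

    Starting from any group member, the walk visits every other thread
    exactly once and stops when it wraps back around.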

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • signal_struct->count in its current form must die.

    - it has no reasons to be atomic_t

    - it looks like a reference counter, but it is not

    - otoh, we really need to make task->signal refcountable, just look at
    the extremely ugly task_rq_unlock_wait() called from __exit_signals().

    - we should change the lifetime rules for task->signal, it should be
    pinned to task_struct. We have a lot of code which can be simplified
    after that.

    - it is not needed! While the code is correct, any usage of this
    counter is artificial, except that fs/proc uses it correctly to show
    the number of threads.

    This series removes the usage of sig->count from the exit paths.

    This patch:

    Now that Veaceslav changed copy_signal() to use zalloc(), exit_notify()
    can just check notify_count < 0 to see whether the execing sub-thread
    needs the notification from us. No need to do other checks;
    notify_count != 0 must always mean ->group_exit_task != NULL is waiting
    for us.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • - move the cprm.mm_flags checks up, before we take mmap_sem

    - move down_write(mmap_sem) and ->core_state check from do_coredump()
    to coredump_wait()

    This simplifies the code and makes the locking symmetrical.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Given that do_coredump() calls put_cred() on exit path, it is a bit ugly
    to do put_cred() + "goto fail" twice, just add the new "fail_creds" label.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • - kill "helper_argc", argv_split(argcp) accepts argcp == NULL.

    - move "int dump_count" under the "if (ispipe)" branch, fail_dropcount
    can check ispipe.

    - move "char **helper_argv" as well, change the code to do argv_free()
    right after call_usermodehelper_fns().

    - If call_usermodehelper_fns() fails goto close_fail label instead
    of closing the file by hand.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov