26 Jul, 2008

40 commits

  • schedule_on_each_cpu() can use schedule_work_on() to avoid the code
    duplication.
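
    A minimal sketch of how the helper might look after this series (hedged;
    the per-work flush_work() part comes from a later entry below, and the
    exact shape of the patch may differ):

    int schedule_on_each_cpu(work_func_t func)
    {
            int cpu;
            struct work_struct *works = alloc_percpu(struct work_struct);

            if (!works)
                    return -ENOMEM;

            get_online_cpus();
            for_each_online_cpu(cpu) {
                    struct work_struct *work = per_cpu_ptr(works, cpu);

                    INIT_WORK(work, func);
                    schedule_work_on(cpu, work);    /* no open-coded queueing */
            }
            for_each_online_cpu(cpu)
                    flush_work(per_cpu_ptr(works, cpu));
            put_online_cpus();
            free_percpu(works);
            return 0;
    }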

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • queue_work() can use queue_work_on() to avoid the code duplication.
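
    One plausible shape of the forwarding, assuming queue_work_on() takes the
    target CPU first (a sketch, not the literal patch):

    int queue_work(struct workqueue_struct *wq, struct work_struct *work)
    {
            int ret;

            /* queue on the local CPU, which is what queue_work() did before */
            ret = queue_work_on(get_cpu(), wq, work);
            put_cpu();

            return ret;
    }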

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Add lockdep annotations to flush_work() and update the comment.

    Signed-off-by: Oleg Nesterov
    Cc: Jarek Poplawski
    Acked-by: Johannes Berg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that it is safe to use get_online_cpus() we can revert

    [S390] cpu topology: Fix possible deadlock.
    commit: fd781fa25c9e9c6fd1599df060b05e7c4ad724e5

    and call arch_reinit_sched_domains() directly from topology_work_fn().

    Signed-off-by: Oleg Nesterov
    Cc: Gautham R Shenoy
    Tested-by: Heiko Carstens
    Cc: Max Krasnyansky
    Cc: Paul Jackson
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: Vegard Nossum
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • workqueue_cpu_callback(CPU_DEAD) flushes cwq->thread under
    cpu_maps_update_begin(). This means that multithreaded workqueues
    can't use get_online_cpus() due to a possible deadlock, which is a very
    bad and very old problem.

    Introduce the new state, CPU_POST_DEAD, which is called after
    cpu_hotplug_done() but before cpu_maps_update_done().

    Change workqueue_cpu_callback() to use CPU_POST_DEAD instead of CPU_DEAD.
    This means that create/destroy functions can't rely on get_online_cpus()
    any longer and should take cpu_add_remove_lock instead.
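
    A hedged sketch of the callback's new branch; the list and helper names
    follow kernel/workqueue.c of that era and are assumptions here:

    static int workqueue_cpu_callback(struct notifier_block *nfb,
                                      unsigned long action, void *hcpu)
    {
            unsigned int cpu = (unsigned long)hcpu;
            struct workqueue_struct *wq;

            switch (action) {
            case CPU_POST_DEAD:     /* was CPU_DEAD */
                    /*
                     * Delivered after cpu_hotplug_done() but still under
                     * cpu_maps_update_begin(), so flushing the dead CPU's
                     * cwq->thread cannot deadlock with get_online_cpus() users.
                     */
                    list_for_each_entry(wq, &workqueues, list)
                            cleanup_workqueue_thread(per_cpu_ptr(wq->cpu_wq, cpu));
                    break;
            }

            return NOTIFY_OK;
    }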

    [akpm@linux-foundation.org: fix CONFIG_SMP=n]
    Signed-off-by: Oleg Nesterov
    Acked-by: Gautham R Shenoy
    Cc: Heiko Carstens
    Cc: Max Krasnyansky
    Cc: Paul Jackson
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: Vegard Nossum
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change schedule_on_each_cpu() to use flush_work() instead of
    flush_workqueue(); this way we don't wait for other work_structs which
    may be queued meanwhile.
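
    Within the percpu-works sketch shown under the schedule_work_on() entry
    above, this amounts to (hedged):

    for_each_online_cpu(cpu)
            flush_work(per_cpu_ptr(works, cpu));    /* was: flush_workqueue(keventd_wq) */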

    Signed-off-by: Oleg Nesterov
    Cc: Jarek Poplawski
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Most users of flush_workqueue() can be changed to use cancel_work_sync(),
    but sometimes we really need to wait for the completion and cancelling is
    not an option. schedule_on_each_cpu() is a good example.

    Add the new helper, flush_work(work), which waits for the completion of the
    specific work_struct. More precisely, it "flushes" the result of the last
    queue_work() which is visible to the caller.

    For example, this code

    queue_work(wq, work);
    /* WINDOW */
    queue_work(wq, work);

    flush_work(work);

    doesn't necessarily work "as expected". What can happen in the WINDOW above is

    - wq starts the execution of work->func()

    - the caller migrates to another CPU

    now, after the 2nd queue_work() this work is active on the previous CPU,
    and at the same time it is queued on another CPU. In this case
    flush_work(work) may return before the first work->func() completes.

    It is trivial to add another helper

    int flush_work_sync(struct work_struct *work)
    {
            return flush_work(work) || wait_on_work(work);
    }

    which works "more correctly", but it has to iterate over all CPUs and
    thus is much slower than flush_work().

    Signed-off-by: Oleg Nesterov
    Acked-by: Max Krasnyansky
    Acked-by: Jarek Poplawski
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • insert_work() inserts the new work_struct before or after cwq->worklist,
    depending on the "int tail" parameter. Change it to accept "list_head *"
    instead; this shrinks .text a bit and allows us to insert the barrier
    after a specific work_struct.
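
    A sketch of the interface change, prototypes only (hedged):

    /* before: position encoded as a flag */
    static void insert_work(struct cpu_workqueue_struct *cwq,
                            struct work_struct *work, int tail);

    /* after: the caller names the exact list position, so a barrier can be
     * inserted right after a specific work_struct */
    static void insert_work(struct cpu_workqueue_struct *cwq,
                            struct work_struct *work, struct list_head *head);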

    Signed-off-by: Oleg Nesterov
    Cc: Jarek Poplawski
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • I don't understand why the multi-thread coredump implies the core_uses_pid
    behaviour, but we shouldn't use mm->mm_users for that. This counter can
    be incremented by get_task_mm(). Use the value returned by
    coredump_wait() instead.

    Also, remove the "const char *pattern" argument, format_corename() can use
    core_pattern directly.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Alan Cox
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that we have core_state->dumper list we can use it to wake up the
    sub-threads waiting for the coredump completion.

    This uglifies the code and .text grows by 47 bytes, but otoh mm_struct
    shrinks by sizeof(struct completion). Also, with this change we can
    decouple exit_mm() from the coredumping code.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Kill the nasty rcu_read_lock() + do_each_thread() loop, use the list
    encoded in mm->core_state instead, s/GFP_ATOMIC/GFP_KERNEL/.

    This patch allows further cleanups in binfmt_elf_fdpic.c.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Kill the nasty rcu_read_lock() + do_each_thread() loop, use the list
    encoded in mm->core_state instead, s/GFP_ATOMIC/GFP_KERNEL/.

    This patch allows further cleanups in binfmt_elf.c, in particular we can
    kill the parallel info->threads list.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • binfmt->core_dump() has to iterate over all threads in the system in
    order to find the coredumping threads and construct the list using
    GFP_ATOMIC allocations.

    With this patch each thread allocates the list node on exit_mm()'s stack and
    adds itself to the list.

    This allows us to do further changes:

    - simplify ->core_dump()

    - change exit_mm() to clear ->mm first, then wait for ->core_done.
      This makes the coredumping process visible to oom_kill

    - kill mm->core_done

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move the "struct core_state core_state" from coredump_wait() to
    do_coredump(), this makes mm->core_state visible to binfmt->core_dump().

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Turn core_state->nr_threads into atomic_t and kill now unneeded
    down_write(&mm->mmap_sem) in exit_mm().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change zap_process() to return int instead of incrementing
    mm->core_state->nr_threads directly. Change zap_threads() to set
    mm->core_state only on success.

    This patch restores the original size of .text, and more importantly now
    ->nr_threads is used in two places only.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move mm->core_waiters into "struct core_state" allocated on stack. This
    shrinks mm_struct a little bit and allows further changes.

    This patch mostly does s/core_waiters/core_state. The only essential
    change is that coredump_wait() must clear mm->core_state before return.

    The coredump_wait()'s path is uglified and .text grows by 30 bytes, this
    is fixed by the next patch.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • mm->core_startup_done points to "struct completion startup_done" allocated
    on the coredump_wait()'s stack. Introduce the new structure, core_state,
    which holds this "struct completion". This way we can add more info
    visible to the threads participating in coredump without enlarging
    mm_struct.

    No changes in affected .o files.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • linux_binfmt->core_dump() runs before the process does exit_aio(); this
    means that we can hit a kernel thread which shares the same ->mm.
    Afaics, nothing really bad can happen, but perhaps it makes sense to fix
    this minor bug.

    It is sad we have to iterate over all threads in the system and use
    GFP_ATOMIC. Hopefully we can kill these ugly do_each_thread()s, but this
    needs some nontrivial changes in mm_struct and do_coredump.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The main loop in zap_threads() must skip kthreads which may use the same
    mm. Otherwise we "kill" this thread erroneously (for example, it cannot
    fork or exec after that), and the coredumping task gets stuck in the
    TASK_UNINTERRUPTIBLE state forever because of the wrong ->core_waiters
    count.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Kill PF_BORROWED_MM. Change use_mm/unuse_mm to not play with ->flags, and
    do s/PF_BORROWED_MM/PF_KTHREAD/ for a couple of other users.

    No functional changes yet. But this allows us to do further
    fixes/cleanups.

    oom_kill/ptrace/etc often check "p->mm != NULL" to filter out the
    kthreads, this is wrong because of use_mm(). The problem with
    PF_BORROWED_MM is that we need task_lock() to avoid races. With this
    patch we can check PF_KTHREAD directly, or use a simple lockless helper:

    /* The result must not be dereferenced !!! */
    struct mm_struct *__get_task_mm(struct task_struct *tsk)
    {
            if (tsk->flags & PF_KTHREAD)
                    return NULL;
            return tsk->mm;
    }

    Note also ecard_task(). It runs with ->mm != NULL, but it's a kernel
    thread without PF_BORROWED_MM.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Introduce the new PF_KTHREAD flag to mark kernel threads. It is set by
    INIT_TASK() and copied to forked children (we could set it in kthreadd()
    along with PF_NOFREEZE instead).

    daemonize() was changed as well. In that case testing of PF_KTHREAD is
    racy, but daemonize() is hopeless anyway.

    This flag is cleared in do_execve(), before search_binary_handler().
    Probably not the best place; we could do this in exec_mmap() or in
    start_thread(), or clear it along with PF_FORKNOEXEC. But I think this
    doesn't matter in practice, and if do_execve() fails the kthread should
    die soon.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. SIGKILL can't be blocked, remove this check from sigkill_pending().

    2. When ptrace_stop() sees sigkill_pending() == T, it can just return.
    Kill "int killed" and simplify the code. This also is more correct,
    the tracer shouldn't see us in TASK_TRACED if we are not going to
    stop.

    I strongly believe this code needs further changes. We should do the "was
    this task killed" check unconditionally; currently it depends on
    arch_ptrace_stop_needed(). On the other hand, sigkill_pending() isn't
    very clever. If the task was killed with tkill(SIGKILL), the signal may
    already have been dequeued if the caller is do_exit().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • ptrace_stop() has some complicated checks to prevent scheduling in the
    TASK_TRACED state with a pending SIGKILL, but these checks are racy, and
    they depend on arch_ptrace_stop_needed().

    This patch assumes that the traced task should die asap if it was killed by
    SIGKILL, in that case schedule()->signal_pending_state() has no reason to
    ignore the TASK_WAKEKILL part of TASK_TRACED, and we can kill this nasty
    special case.

    Note: do_exit()->ptrace_notify() is special, the killed task may already
    have dequeued SIGKILL at this point. Another indication that fatal_signal_pending()
    is not exactly right.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Matthew Wilcox
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch contains the following cleanups for the asm/ptrace.h
    userspace headers:

    - include/asm-generic/Kbuild.asm already lists ptrace.h, remove
      the superfluous listings in the Kbuild files of the following
      architectures:
        - cris
        - frv
        - powerpc
        - x86
    - don't expose function prototypes and macros to userspace:
        - arm
        - blackfin
        - cris
        - mn10300
        - parisc
    - remove #ifdef CONFIG_'s around #define's:
        - blackfin
        - m68knommu
    - sh: AFAIK __SH5__ should work in both kernel and userspace,
      no need to leak CONFIG_SUPERH64 to userspace
    - xtensa: cosmetic change to remove empty
      #ifndef __ASSEMBLY__ #else #endif
      from the userspace headers

    Not changed by this patch is the fact that the following architectures
    have a different struct pt_regs depending on CONFIG_ variables:
    - h8300
    - m68knommu
    - mips

    This does not work in userspace.

    Signed-off-by: Adrian Bunk
    Cc: Roland McGrath
    Cc: Oleg Nesterov
    Acked-by: Greg Ungerer
    Acked-by: Paul Mundt
    Acked-by: Grant Grundler
    Acked-by: Jesper Nilsson
    Acked-by: Chris Zankel
    Acked-by: David Howells
    Acked-by: Paul Mackerras
    Acked-by: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Bunk
     
  • Change the type of pid and tgid variables from int to the POSIX type
    pid_t.

    Signed-off-by: Gustavo F. Padovan
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gustavo Fernando Padovan
     
  • In the switch to configurable HZ in 2.6, the treatment of the si_utime and
    si_stime fields that are exposed to userland via the siginfo structure
    looks to have been botched. As things stand, these fields report times in
    units of HZ, so that userland gets information that varies depending on
    the HZ that the kernel was configured with. This patch changes the
    reported values to use USER_HZ units.
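
    A plausible shape of the fix at the affected call sites, assuming
    cputime_to_clock_t() is the conversion used (a sketch, not necessarily
    the literal patch):

    info.si_utime = cputime_to_clock_t(tsk->utime);    /* USER_HZ (clock_t) units */
    info.si_stime = cputime_to_clock_t(tsk->stime);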

    Signed-off-by: Michael Kerrisk
    Acked-by: Oleg Nesterov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Kerrisk
     
  • No changes in fs/exec.o

    The for_each_process() loop in zap_threads() is very subtle, it is not
    clear why we don't race with fork/exit/exec. Add the fat comment.

    Also, change the code to use while_each_thread().
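
    The resulting loop structure, roughly (a sketch of the shape only;
    locking and the handling of zap_process()'s result are omitted):

    for_each_process(g) {
            if (g == tsk->group_leader)
                    continue;

            p = g;
            do {
                    if (p->mm) {
                            if (unlikely(p->mm == mm))
                                    zap_process(p);
                            break;  /* all threads of g share one ->mm */
                    }
            } while_each_thread(g, p);
    }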

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • fae5fa44f1fd079ffbed8e0add929dd7bbd1347f changed do_signal_stop() to check
    SIGNAL_UNKILLABLE; this wasn't needed. If signal_group_exit() == F, a signal
    sent to a SIGNAL_UNKILLABLE task must already have been filtered out by the
    caller, get_signal_to_deliver(). And if signal_group_exit() == T we are
    not going to stop.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • dequeue_signal() checks SIGNAL_GROUP_EXIT before setting
    SIGNAL_STOP_DEQUEUED. This was added by
    788e05a67c343fa22f2ae1d3ca264e7f15c25eaf a long ago to avoid the
    coredump/SIGSTOP race.

    Since then the related code was changed, and now this subtle check is both
    incomplete and unneeded at the same time. It is incomplete because
    nowadays exec() doesn't set SIGNAL_GROUP_EXIT, so in fact we should check
    signal_group_exit() to avoid a similar race. Fortunately, we don't need
    the check at all. The only function which relies on SIGNAL_STOP_DEQUEUED
    is do_signal_stop(), and it ignores this flag if signal_group_exit() == T,
    which covers the SIGNAL_GROUP_EXIT case.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • There is no reason for rcu_read_lock() in __exit_signal(). tsk->sighand
    can only be changed if tsk does exec, which is obviously not possible here.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • With the recent changes collect_signal() always returns true. Change it
    to return void and update the single caller.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Factor out sigdelset() calls and remove the "still_pending" variable.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • collect_signal() checks sigismember(&list->signal, sig); this is not
    needed. This "sig" was just found by next_signal(), so it must be valid.

    We have a (completely broken) call to ->notifier in between, but it must
    not play with sigpending->signal bits or unlock ->siglock.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • release_posix_timer() can't be called with ->it_process != NULL. Once
    sys_timer_create() sets ->it_process it must not call
    release_posix_timer(); otherwise we can race with another thread doing
    sys_timer_delete(), since this timer is visible to idr_find() and unlocked.

    The same is true for two other callers (actually, for any possible
    caller), sys_timer_delete() and itimer_delete(). They must clear
    ->it_process before unlock_timer() + release_posix_timer().

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: john stultz
    Cc: Thomas Gleixner
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • sys_timer_delete() and itimer_delete() check "timer->it_process != NULL";
    this looks completely bogus. ->it_process == NULL means that this timer is
    already under destruction or is not fully initialized, and this must not
    happen.

    sys_timer_delete: the timer is locked, and lock_timer() can't succeed
    if ->it_process == NULL.

    itimer_delete: it is called by exit_itimers() when there are no other
    threads which can play with signal_struct->posix_timers.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: john stultz
    Cc: Thomas Gleixner
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • cpuset_update_task_memory_state() already has the local variable

    struct task_struct *tsk = current;

    tsk is used 14 times in this function and task_cs(tsk) twice, so using
    task_cs(tsk) instead of task_cs(current) is better for readability.

    The cast "(struct cgroup_scanner *)&scan" is also bad for readability;
    container_of() is used in cpuset_do_move_task() instead of the
    "(cpuset_hotplug_scanner *)scan" cast.

    Signed-off-by: Lai Jiangshan
    Acked-by: Paul Menage
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • The cgroup code (cgroup_scan_tasks()) will initialize heap->gt for us.
    This patch removes started_after() and its helper function.

    Signed-off-by: Lai Jiangshan
    Acked-by: Paul Menage
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • I created lots of empty cpusets (with empty cpumasks) and turned off
    "sched_load_balance" in the top cpuset.

    I found that all these empty cpumasks were passed to
    partition_sched_domains() in rebuild_sched_domains(); this is very
    time-consuming for partition_sched_domains() and is not needed.

    Skipping them also reduces memory consumption and some of the work done
    in rebuild_sched_domains().
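
    A hedged sketch of the kind of check this implies in the loop that
    collects candidate cpusets for partition_sched_domains() ("cp" being the
    cpuset currently under scan; the actual patch may differ):

    /* skip cpusets with an empty cpumask: they can't contribute a domain */
    if (cpus_empty(cp->cpus_allowed))
            continue;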

    Signed-off-by: Lai Jiangshan
    Acked-by: Paul Menage
    Cc: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • When changing 'sched_relax_domain_level', don't rebuild sched domains if
    'cpus' is empty or 'sched_load_balance' is not set.

    Also make the comments of rebuild_sched_domains() more readable.
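
    A hedged sketch of the added condition in update_relax_domain_level()
    (function and helper names as in kernel/cpuset.c of that era; details may
    differ from the actual patch):

    if (val != cs->relax_domain_level) {
            cs->relax_domain_level = val;
            /* rebuilding only matters if this cpuset can own a sched domain */
            if (!cpus_empty(cs->cpus_allowed) && is_sched_load_balance(cs))
                    rebuild_sched_domains();
    }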

    Signed-off-by: Li Zefan
    Cc: Hidetoshi Seto
    Cc: Paul Jackson
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan