28 May, 2010

36 commits

  • Most distros turn the console verbosity down, which means a backtrace
    after a panic never makes it to the console. I assume we haven't seen
    this because a panic is often preceded by an oops, which will have called
    console_verbose. There are, however, a lot of places that call panic
    directly, and they are broken.

    Use console_verbose like we do in the oops path to ensure a directly
    called panic will print a backtrace.

    Signed-off-by: Anton Blanchard
    Acked-by: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
  • copy_process(pid => &init_struct_pid) doesn't do attach_pid/etc.

    It shouldn't, but this means that the idle threads run with the wrong
    pids copied from the caller's task_struct. In the x86 case the caller
    is either the kernel_init() thread or keventd.

    In particular, this means that after a series of cpu_up/cpu_down cycles
    an idle thread (which never exits) can run with .pid pointing to nowhere.

    Change fork_idle() to initialize idle->pids[] correctly. We only set
    .pid = &init_struct_pid but do not add .node to the list; INIT_TASK()
    does the same for the boot-cpu idle thread (swapper).

    Signed-off-by: Oleg Nesterov
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Eric Biederman
    Cc: Herbert Poetzl
    Cc: Mathias Krause
    Acked-by: Roland McGrath
    Acked-by: Serge Hallyn
    Cc: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • On a system with a substantial number of processors, the early default
    pid_max of 32k will not be enough. On a system with 1664 CPUs, 25163
    processes are started before the login prompt. It's estimated that with
    2048 CPUs we will pass the 32k limit. With 4096, we'll reach that limit
    very early during the boot cycle, and processes would stall waiting for
    an available pid.

    This patch increases the early maximum number of pids available, and
    increases the minimum number of pids that can be set during runtime.

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Hedi Berriche
    Signed-off-by: Mike Travis
    Signed-off-by: Robin Holt
    Acked-by: Linus Torvalds
    Cc: Ingo Molnar
    Cc: Pavel Machek
    Cc: Alan Cox
    Cc: Greg KH
    Cc: Rik van Riel
    Cc: John Stoffel
    Cc: Jack Steiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hedi Berriche
     
  • Since get_online_cpus() does nothing when CONFIG_HOTPLUG_CPU=n, we don't
    need cpu_hotplug_begin() either.

    This patch moves cpu_hotplug_begin()/cpu_hotplug_done() into the code
    block of CONFIG_HOTPLUG_CPU=y.

    Signed-off-by: Lai Jiangshan
    Cc: Gautham R Shenoy
    Cc: Ingo Molnar

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     
  • With the previous modification, the cpu notifiers can return an
    encapsulated errno value. This converts the cpu notifiers in kernel/*.c.

    Signed-off-by: Akinobu Mita
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Currently, when onlining or offlining a CPU fails because one of the cpu
    notifiers returns an error, the caller always sees -EINVAL (i.e. writing
    0 or 1 to /sys/devices/system/cpu/cpuX/online gets EINVAL).

    To get better error reporting than a blanket -EINVAL, this changes
    cpu_notify() to return a -errno value via notifier_to_errno() and fixes
    the callers. Cpu notifiers can now return an encapsulated errno value.

    Currently, all cpu hotplug notifiers return NOTIFY_OK, NOTIFY_BAD, or
    NOTIFY_DONE, so for now cpu_notify() can only return 0 or -EPERM with
    this change.

    (notifier_to_errno(NOTIFY_OK) == 0, notifier_to_errno(NOTIFY_DONE) == 0,
    notifier_to_errno(NOTIFY_BAD) == -EPERM)

    Forthcoming patches convert several cpu notifiers to return encapsulated
    errno values via notifier_from_errno().

    Signed-off-by: Akinobu Mita
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • No functional change. These are just wrappers of
    raw_cpu_notifier_call_chain.

    Signed-off-by: Akinobu Mita
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • No functional changes, just s/atomic_t count/int nr_threads/.

    With the recent changes this counter has a single user, get_nr_threads().
    And none of its callers need a really accurate number of threads, not to
    mention that each caller obviously races with fork/exit. It is only used
    to report this value to user space, except that first_tid() uses it to
    avoid an unnecessary while_each_thread() loop in the unlikely case.

    It is a bit sad we need a word in struct signal_struct for this; perhaps
    we can change get_nr_threads() to approximate the number of threads using
    signal->live and kill ->nr_threads later.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Trivial: use the get_nr_threads() helper to read signal->count, which we
    are going to change.

    Like other callers, proc_sched_show_task() doesn't need an exactly
    precise nr_threads.

    David said:

    : Note that get_nr_threads() isn't completely equivalent (it can return 0
    : where proc_sched_show_task() will display a 1). But I don't think this
    : should be a problem.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Howells
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • check_unshare_flags(CLONE_SIGHAND) adds CLONE_THREAD to *flags_ptr if the
    task is multithreaded to ensure unshare_thread() will fail.

    Not only is this a strange way to return the error, it is absolutely
    meaningless. If signal->count > 1 then sighand->count must also be > 1,
    and unshare_sighand() will fail anyway.

    In fact, all CLONE_THREAD/SIGHAND/VM checks inside sys_unshare() do not
    look right. Fortunately this code doesn't really work anyway.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move taskstats_tgid_free() from __exit_signal() to free_signal_struct().

    This way signal->stats never points to nowhere and we can read ->stats
    lockless.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Kill the empty thread_group_cputime_free() helper. It was needed to free
    the per-cpu data which we no longer have.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cleanup:

    - Add the boolean, group_dead = thread_group_leader(), for clarity.

    - Do not test/set sig == NULL to detect the all-dead case, use this
    boolean.

    - Pass this boolean to __unhash_process() and use it instead of another
    thread_group_leader() call, which needs ->group_leader.

    This can be considered a micro-optimization, but hopefully it also
    allows us to do other cleanups later.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that task->signal can't go away we can revert the horrible hack added
    by ad474caca3e2a0550b7ce0706527ad5ab389a4d4 ("fix for
    account_group_exec_runtime(), make sure ->signal can't be freed under
    rq->lock").

    And we can do more cleanups in sched_stats.h/posix-cpu-timers.c later.

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When the last thread exits, signal->tty is freed, but the pointer is not
    cleared and points to nowhere.

    This is OK: nobody should use signal->tty locklessly, and it is no longer
    possible to take ->siglock. However this looks wrong even if correct, and
    a nice OOPS is better than subtle, hard-to-find bugs.

    Change __exit_signal() to clear signal->tty under ->siglock.

    Note: __exit_signal() needs more cleanups. It should not check "sig !=
    NULL" to detect the all-dead case and we have the same issues with
    signal->stats.

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Acked-by: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • We have a lot of problems with accessing task_struct->signal: it can
    "disappear" at any moment. Even current can't use its ->signal safely
    after exit_notify(). ->siglock helps, but it is not convenient, not
    always possible, and sometimes it makes sense to use task->signal even
    after this task is already dead.

    This patch adds the reference counter, sigcnt, into signal_struct. This
    reference is owned by task_struct and it is dropped in
    __put_task_struct(). Perhaps it makes sense to export
    get/put_signal_struct() later, but currently I don't see the immediate
    reason.

    Rename __cleanup_signal() to free_signal_struct() and unexport it. With
    the previous changes it does nothing except kmem_cache_free().

    Change __exit_signal() to not clear/free ->signal, it will be freed when
    the last reference to any thread in the thread group goes away.

    Note:
    - when the last thread exits, signal->tty can point to nowhere; see
    the next patch.

    - with or without this patch signal_struct->count should go away,
    or at least it should be "int nr_threads" for fs/proc. This will
    be addressed later.

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • tty_kref_put() has two callsites in copy_process() paths,

    1. if copy_process() succeeds, it is called before we copy
    signal->tty from the parent

    2. otherwise it is called from __cleanup_signal() under
    bad_fork_cleanup_signal: label

    In both cases tty_kref_put() is not right and unneeded because we don't
    have the balancing tty_kref_get(). Fortunately, this is harmless because
    this can only happen without CLONE_THREAD, and in this case signal->tty
    must be NULL.

    Remove tty_kref_put() from copy_process() and __cleanup_signal(), and
    change another caller of __cleanup_signal(), __exit_signal(), to call
    tty_kref_put() by hand.

    I hope this change makes sense by itself, but it is also needed to make
    ->signal refcountable.

    Signed-off-by: Oleg Nesterov
    Acked-by: Alan Cox
    Acked-by: Roland McGrath
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Preparation to make task->signal immutable, no functional changes.

    posix-cpu-timers.c checks task->signal != NULL to ensure this task is
    alive and didn't pass __exit_signal(). This is correct but we are going
    to change the lifetime rules for ->signal and never reset this pointer.

    Change the code to check ->sighand instead; it doesn't matter which
    pointer we check under tasklist_lock, they are both cleared
    simultaneously.

    As Roland pointed out, some of these changes are not strictly needed and
    probably it makes sense to revert them later, when ->signal will be pinned
    to task_struct. But this patch tries to ensure the subsequent changes in
    fork/exit can't make any visible impact on posix cpu timers.

    Signed-off-by: Oleg Nesterov
    Cc: Fenghua Yu
    Acked-by: Roland McGrath
    Cc: Stanislaw Gruszka
    Cc: Tony Luck
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change __exit_signal() to check thread_group_leader() instead of
    atomic_dec_and_test(&sig->count). This must be equivalent, the group
    leader must be released only after all other threads have exited and
    passed __exit_signal().

    Henceforth sig->count is not actually used, except in fs/proc for
    get_nr_threads/etc.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • de_thread() and __exit_signal() use signal_struct->count/notify_count for
    synchronization. We can simplify the code and use ->notify_count only.
    Instead of comparing these two counters, we can change de_thread() to set
    ->notify_count = nr_of_sub_threads, then change __exit_signal() to
    dec-and-test this counter and notify group_exit_task.

    Note that __exit_signal() checks "notify_count > 0" just for symmetry
    with exit_notify(); we could just check that it is != 0.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change zap_other_threads() to return the number of other sub-threads found
    on ->thread_group list.

    Other changes are cosmetic:

    - change the code to use while_each_thread() helper

    - remove the obsolete comment about SIGKILL/SIGSTOP

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • signal_struct->count in its current form must die.

    - it has no reasons to be atomic_t

    - it looks like a reference counter, but it is not

    - otoh, we really need to make task->signal refcountable, just look at
    the extremely ugly task_rq_unlock_wait() called from __exit_signals().

    - we should change the lifetime rules for task->signal, it should be
    pinned to task_struct. We have a lot of code which can be simplified
    after that.

    - it is not needed! While the code is correct, any usage of this
    counter is artificial, except that fs/proc uses it correctly to show
    the number of threads.

    This series removes the usage of sig->count from the exit paths.

    This patch:

    Now that Veaceslav changed copy_signal() to use zalloc(), exit_notify()
    can just check notify_count < 0 to see whether the execing sub-thread
    needs the notification from us. No need to do other checks; notify_count
    != 0 must always mean ->group_exit_task != NULL is waiting for us.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • UMH_WAIT_EXEC should report the error if kernel_thread() fails, like
    UMH_WAIT_PROC does.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • __call_usermodehelper(UMH_NO_WAIT) has 2 problems:

    - if kernel_thread() fails, call_usermodehelper_freeinfo()
    is not called.

    - for some unknown reason UMH_NO_WAIT has UMH_WAIT_PROC logic:
    we spawn yet another thread which waits until the user
    mode application exits.

    Change the UMH_NO_WAIT code to use ____call_usermodehelper() instead of
    wait_for_helper(), and call call_usermodehelper_freeinfo()
    unconditionally. We can rely on CLONE_VFORK: do_fork(CLONE_VFORK) blocks
    until the child exits or execs.

    With or without this patch UMH_NO_WAIT does not report the error if
    kernel_thread() fails; this is correct since the caller doesn't wait for
    the result.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. wait_for_helper() calls allow_signal(SIGCHLD) to ensure the child
    can't autoreap itself.

    However, this means that a spurious SIGCHLD from user space can
    set TIF_SIGPENDING and:

    - kernel_thread() or sys_wait4() can fail due to signal_pending()

    - worse, wait4() can fail before ____call_usermodehelper() execs
    or exits. In this case the caller may kfree(subprocess_info)
    while the child still uses this memory.

    Change the code to use SIG_DFL instead of the magic "(void __user *)2"
    set by allow_signal(). This means that SIGCHLD won't be delivered,
    yet the child won't autoreap itself.

    The problem is minor, only root can send a signal to this kthread.

    2. If sys_wait4(&ret) fails it doesn't populate "ret"; in this case
    wait_for_helper() reports a random value from an uninitialized variable.

    With this patch sys_wait4() should never fail, but it still makes
    sense to initialize ret = -ECHILD so that the caller can notice
    the problem.

    Signed-off-by: Oleg Nesterov
    Acked-by: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • ____call_usermodehelper() correctly calls flush_signal_handlers() to set
    SIG_DFL, but sigemptyset(->blocked) and recalc_sigpending() are not
    needed.

    This kthread was forked by the workqueue thread; all signals must be
    unblocked and ignored, and no pending signal is possible.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that nobody ever changes subprocess_info->cred we can kill this member
    and related code. ____call_usermodehelper() always runs in the context
    of a freshly forked kernel thread; it has the proper ->cred copied from
    its parent kthread, keventd.

    Signed-off-by: Oleg Nesterov
    Acked-by: Neil Horman
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • call_usermodehelper_keys() uses call_usermodehelper_setkeys() to change
    subprocess_info->cred in advance. Now that we have info->init() we can
    change this code to set tgcred->session_keyring in context of execing
    kernel thread.

    Note: since currently call_usermodehelper_keys() is never called with
    UMH_NO_WAIT, call_usermodehelper_keys()->key_get() and umh_keys_cleanup()
    are not really needed, we could rely on install_session_keyring_to_cred()
    which does key_get() on success.

    Signed-off-by: Oleg Nesterov
    Acked-by: Neil Horman
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The first patch in this series introduced an init function to the
    call_usermodehelper api so that processes could be customized by the
    caller. This patch takes advantage of that fact by customizing the
    helper in do_coredump to create the pipe and set its core limit to one
    (for our recursion check). This lets us clean up the previous ugliness
    in the usermodehelper internals and factor call_usermodehelper out
    entirely. While I'm at it, we can also modify the helper setup to look
    for a core limit value of 1 rather than zero for our recursion check.

    Signed-off-by: Neil Horman
    Reviewed-by: Oleg Nesterov
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • About 6 months ago, I made a set of changes to how the core-dump-to-a-pipe
    feature in the kernel works. We had reports of several races, including
    some reports of apps bypassing our recursion check so that a process that
    was forked as part of a core_pattern setup could infinitely crash and
    refork until the system crashed.

    We fixed those by improving our recursion checks. The new check basically
    refuses to fork a process if its core limit is zero, which works well.

    Unfortunately, I've been getting grief from maintainers of user space
    programs that are inserted as the forked process of core_pattern. They
    contend that in order for their programs (such as abrt and apport) to
    work, all the running processes in a system must have their core limits
    set to a non-zero value, to which I say 'yes'. I did this by design, and
    think that's the right way to do things.

    But I've been asked to ease this burden on user space enough times that
    I thought I would take a look at it. The first suggestion was to make
    the recursion check fail on a non-zero 'special' number, like one. That
    way the core collector process could set its core size ulimit to 1, and
    enable the kernel's recursion detection. This isn't a bad idea on the
    surface, but I don't like it since it's opt-in: if a program like abrt
    or apport has a bug and fails to set such a core limit, we're left with
    a recursively crashing system again.

    So I've come up with this. What I've done is modify the
    call_usermodehelper api such that an extra parameter is added: a
    function pointer which will be called by the user helper task after it
    forks, but before it execs the required process. This gives the caller
    the opportunity to get a callback in the process's context, allowing it
    to do whatever it needs to do to the process in the kernel prior to
    exec-ing the user space code. In the case of do_coredump, this callback
    is used to set the core ulimit of the helper process to 1. This
    eliminates the opt-in problem above, as it allows the ulimit for core
    sizes to be set to the value of 1, which is what the recursion check
    looks for in do_coredump.

    This patch:

    Create a new function, call_usermodehelper_fns(), and allow it to assign
    both an init and a cleanup function, as well as arbitrary data.

    The init function is called from the context of the forked process and
    allows for customization of the helper process prior to calling exec.
    Its return code gates the continuation of the process, or causes its
    exit. Also add an arbitrary data pointer to the subprocess_info struct,
    allowing data to be passed from the caller to the new process and the
    subsequent cleanup process.

    Also, use this patch to clean up the cleanup function. It currently
    takes an argp and envp pointer for freeing, which is ugly. Let's instead
    just make the subprocess_info structure public, and pass that to the
    cleanup and init routines.

    Signed-off-by: Neil Horman
    Reviewed-by: Oleg Nesterov
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • Andrew Tridgell reports that aio_read(SIGEV_SIGNAL) can fail if the
    notification from the helper thread races with setresuid(), see
    http://samba.org/~tridge/junkcode/aio_uid.c

    This happens because check_kill_permission() doesn't permit sending a
    signal to a task with different cred->xids. But there is no security
    reason to check creds when the task sends a signal (private or
    group-wide) to its own sub-thread. Whatever we do, any thread can bypass
    all security checks and send SIGKILL to all threads, or it can block a
    signal SIG and do kill(gettid(), SIG) to deliver this signal to another
    sub-thread. Not to mention that CLONE_THREAD implies CLONE_VM.

    Change check_kill_permission() to avoid the credentials check when the
    sender and the target are from the same thread group.

    Also, move "cred = current_cred()" down to avoid calling get_current()
    twice.

    Note: David Howells pointed out we could relax this even more; the
    CLONE_SIGHAND (without CLONE_THREAD) case probably does not need
    these checks either.

    Roland said:
    : The glibc (libpthread) that does set*id across threads has
    : been in use for a while (2.3.4?), probably in distro's using kernels as old
    : or older than any active -stable streams. In the race in question, this
    : kernel bug is breaking valid POSIX application expectations.

    Reported-by: Andrew Tridgell
    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Acked-by: David Howells
    Cc: Eric Paris
    Cc: Jakub Jelinek
    Cc: James Morris
    Cc: Roland McGrath
    Cc: Stephen Smalley
    Cc: [all kernel versions]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that Mike Frysinger unified the FDPIC ptrace code, we can fix the
    unsafe usage of child->mm in ptrace_request(PTRACE_GETFDPIC).

    We have the reference to task_struct, and ptrace_check_attach() verified
    the tracee is stopped. But nothing can protect from SIGKILL after that,
    we must not assume child->mm != NULL.

    Signed-off-by: Oleg Nesterov
    Acked-by: Mike Frysinger
    Acked-by: David Howells
    Cc: Paul Mundt
    Cc: Greg Ungerer
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The Blackfin/FRV/SuperH guys all have the same exact FDPIC ptrace code in
    their arch handlers (since they were probably copied & pasted). Since
    these ptrace interfaces are an arch independent aspect of the FDPIC code,
    unify them in the common ptrace code so new FDPIC ports don't need to copy
    and paste this fundamental stuff yet again.

    Signed-off-by: Mike Frysinger
    Acked-by: Roland McGrath
    Acked-by: David Howells
    Acked-by: Paul Mundt
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Frysinger
     
  • Some workloads that create a large number of small files tend to assign
    too many pages to node 0 (multi-node systems). Part of the reason is that
    the rotor (in cpuset_mem_spread_node()) used to assign nodes starts at
    node 0 for newly created tasks.

    This patch changes the rotor to be initialized to a random node number
    within the cpuset.

    [akpm@linux-foundation.org: fix layout]
    [Lee.Schermerhorn@hp.com: Define stub numa_random() for !NUMA configuration]
    Signed-off-by: Jack Steiner
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     
  • We have observed several workloads running on multi-node systems where
    memory is assigned unevenly across the nodes in the system. There are
    numerous reasons for this but one is the round-robin rotor in
    cpuset_mem_spread_node().

    For example, a simple test that writes a multi-page file will allocate
    pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it
    allocates on odd nodes & skips even nodes).

    An example is shown below. The program "lfile" writes a file consisting
    of 10 pages. The program then mmaps the file & uses get_mempolicy(...,
    MPOL_F_NODE) to determine the nodes where the file pages were allocated.
    The output is shown below:

    # ./lfile
    allocated on nodes: 2 4 6 0 1 2 6 0 2

    There is a single rotor that is used for allocating both file pages & slab
    pages. Writing the file allocates both a data page & a slab page
    (buffer_head). This advances the RR rotor 2 nodes for each page
    allocated.

    A quick test confirms this is the cause of the uneven
    allocation:

    # echo 0 >/dev/cpuset/memory_spread_slab
    # ./lfile
    allocated on nodes: 6 7 8 9 0 1 2 3 4 5

    This patch introduces a second rotor that is used for slab allocations.

    Signed-off-by: Jack Steiner
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Paul Menage
    Cc: Jack Steiner
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jack Steiner
     
  • Since we are unable to handle an error returned by
    cftype.unregister_event() properly, let's make the callback
    void-returning.

    mem_cgroup_unregister_event() has been rewritten to be a "never fail"
    function. On mem_cgroup_usage_register_event() we save old buffer for
    thresholds array and reuse it in mem_cgroup_usage_unregister_event() to
    avoid allocation.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Phil Carmody
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

26 May, 2010

3 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (63 commits)
    drivers/net/usb/asix.c: Fix pointer cast.
    be2net: Bug fix to avoid disabling bottom half during firmware upgrade.
    proc_dointvec: write a single value
    hso: add support for new products
    Phonet: fix potential use-after-free in pep_sock_close()
    ath9k: remove VEOL support for ad-hoc
    ath9k: change beacon allocation to prefer the first beacon slot
    sock.h: fix kernel-doc warning
    cls_cgroup: Fix build error when built-in
    macvlan: do proper cleanup in macvlan_common_newlink() V2
    be2net: Bug fix in init code in probe
    net/dccp: expansion of error code size
    ath9k: Fix rx of mcast/bcast frames in PS mode with auto sleep
    wireless: fix sta_info.h kernel-doc warnings
    wireless: fix mac80211.h kernel-doc warnings
    iwlwifi: testing the wrong variable in iwl_add_bssid_station()
    ath9k_htc: rare leak in ath9k_hif_usb_alloc_tx_urbs()
    ath9k_htc: dereferencing before check in hif_usb_tx_cb()
    rt2x00: Fix rt2800usb TX descriptor writing.
    rt2x00: Fix failed SLEEP->AWAKE and AWAKE->SLEEP transitions.
    ...

    Linus Torvalds
     
  • This reverts commit 480b02df3aa9f07d1c7df0cd8be7a5ca73893455, since
    Rafael reports that it causes occasional kernel paging request faults in
    load_module().

    Dropping the module lock and re-taking it deep in the call-chain is
    definitely not the right thing to do. That just turns the mutex from a
    lock into a "random non-locking data structure" that doesn't actually
    protect what it's supposed to protect.

    Requested-and-tested-by: Rafael J. Wysocki
    Cc: Rusty Russell
    Cc: Brandon Philips
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The commit 00b7c3395aec3df43de5bd02a3c5a099ca51169f
    "sysctl: refactor integer handling proc code"
    modified the behaviour of writing to /proc.
    Before the commit, write("1\n") to /proc/sys/kernel/printk succeeded. But
    now it returns EINVAL.

    This commit supports writing a single value to a multi-valued entry.

    Signed-off-by: J. R. Okajima
    Reviewed-and-tested-by: WANG Cong
    Signed-off-by: David S. Miller

    J. R. Okajima
     

25 May, 2010

1 commit