21 May, 2011

1 commit

  • Merge branch 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc

    * 'ptrace' of git://git.kernel.org/pub/scm/linux/kernel/git/oleg/misc: (41 commits)
    signal: trivial, fix the "timespec declared inside parameter list" warning
    job control: reorganize wait_task_stopped()
    ptrace: fix signal->wait_chldexit usage in task_clear_group_stop_trapping()
    signal: sys_sigprocmask() needs retarget_shared_pending()
    signal: cleanup sys_sigprocmask()
    signal: rename signandsets() to sigandnsets()
    signal: do_sigtimedwait() needs retarget_shared_pending()
    signal: introduce do_sigtimedwait() to factor out compat/native code
    signal: sys_rt_sigtimedwait: simplify the timeout logic
    signal: cleanup sys_rt_sigprocmask()
    x86: signal: sys_rt_sigreturn() should use set_current_blocked()
    x86: signal: handle_signal() should use set_current_blocked()
    signal: sigprocmask() should do retarget_shared_pending()
    signal: sigprocmask: narrow the scope of ->siglock
    signal: retarget_shared_pending: optimize while_each_thread() loop
    signal: retarget_shared_pending: consider shared/unblocked signals only
    signal: introduce retarget_shared_pending()
    ptrace: ptrace_check_attach() should not do s/STOPPED/TRACED/
    signal: Turn SIGNAL_STOP_DEQUEUED into GROUP_STOP_DEQUEUED
    signal: do_signal_stop: Remove the unneeded task_clear_group_stop_pending()
    ...

    Linus Torvalds
     

14 May, 2011

1 commit

  • wait_consider_task() tested task_stopped_code() without acquiring
    siglock and, if a stop condition existed, called wait_task_stopped()
    and directly returned the result. This patch moves the initial
    task_stopped_code() test into wait_task_stopped() and makes
    wait_consider_task() fall through to wait_task_continued() on 0
    return.

    This is done for the following two reasons (a sketch of the
    resulting flow follows the list).

    * Because the initial task_stopped_code() test is done without
    acquiring siglock, it may race against SIGCONT generation. The
    stopped condition might have been replaced by continued state by the
    time wait_task_stopped() acquired siglock. This may lead to
    unexpected failure of WNOHANG waits.

    This reorganization closes this particular race, but other races
    remain - the TASK_RUNNING -> TASK_STOPPED transition and the
    EXIT_* transitions.

    * Scheduled ptrace updates require changes to the initial test which
    would fit better inside wait_task_stopped().
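
    A minimal sketch of the reorganized flow, with simplified signatures
    and details elided (an illustration, not the exact kernel code):

    static int wait_task_stopped(struct wait_opts *wo, int ptrace,
                                 struct task_struct *p)
    {
        int exit_code;

        spin_lock_irq(&p->sighand->siglock);
        /* the stopped-code test now runs under siglock, so a
           concurrent SIGCONT cannot invalidate it */
        exit_code = task_stopped_code(p, ptrace);
        if (!exit_code) {
            spin_unlock_irq(&p->sighand->siglock);
            return 0;        /* nothing to report */
        }
        /* consume the stop event and build the wait status ... */
    }

    /* ... and wait_consider_task() falls through on 0: */
    ret = wait_task_stopped(wo, ptrace, p);
    if (ret)
        return ret;
    return wait_task_continued(wo, p);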

    Signed-off-by: Tejun Heo
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Oleg Nesterov

    Tejun Heo
     

25 Apr, 2011

1 commit

  • When a task is traced and is in a stopped state, the tracer
    may execute a ptrace request to examine the tracee state and
    get its task struct. Right after, the tracee can be killed
    and thus its breakpoints released.
    This can happen concurrently when the tracer is in the middle
    of reading or modifying these breakpoints, leading to dereferencing
    a freed pointer.

    Hence, to prepare the fix, create a generic breakpoint reference
    holding API. When a reference on the breakpoints of a task is
    held, the breakpoints won't be released until the last reference
    is dropped. After that, no further ptrace requests on the task's
    breakpoints can be serviced for the tracer.
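
    The actual API added here differs in detail; the following is a
    self-contained sketch of the hold/put pattern it describes, with
    illustrative names (struct bp_ctx, bp_get() and bp_put() are made up):

    #include <stdatomic.h>
    #include <stdlib.h>

    struct bp_ctx {
        atomic_int refcount;    /* holders of the task's breakpoints */
        /* per-task breakpoint state would live here */
    };

    static void bp_get(struct bp_ctx *ctx)
    {
        atomic_fetch_add(&ctx->refcount, 1);
    }

    static void bp_put(struct bp_ctx *ctx)
    {
        /* the breakpoints are freed only by the last holder, so an
           exiting tracee cannot free them under a ptrace request */
        if (atomic_fetch_sub(&ctx->refcount, 1) == 1)
            free(ctx);
    }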

    Reported-by: Oleg Nesterov
    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: Prasad
    Cc: Paul Mundt
    Cc: v2.6.33..
    Link: http://lkml.kernel.org/r/1302284067-7860-2-git-send-email-fweisbec@gmail.com

    Frederic Weisbecker
     

08 Apr, 2011

1 commit


31 Mar, 2011

1 commit


23 Mar, 2011

3 commits

  • Currently a real parent can't access job control stopped/continued
    events through a ptraced child. This utterly breaks job control when
    the children are ptraced.

    For example, if a program is run from an interactive shell and then
    strace(1) attaches to it, pressing ^Z would send SIGTSTP and strace(1)
    would notice it but the shell has no way to tell whether the child
    entered job control stop and thus can't tell when to take over the
    terminal - leading to awkward lone ^Z on the terminal.

    Because the job control and ptrace stopped states are independent,
    there is no reason to prevent real parents from accessing the stopped
    state regardless of ptrace. The continued state isn't separate, but
    ptracers don't have any use for it, as ptracees can never resume
    without an explicit command from their ptracers; so as long as
    ptracers don't consume it, it should be fine.

    Although this is a behavior change, because the previous behavior is
    utterly broken when viewed from real parents and the change is only
    visible to real parents, I don't think it's necessary to make this
    behavior optional.

    One situation to be careful about is when a task from the real
    parent's group is ptracing. The parent group is the recipient of both
    ptrace and job control stop events and one stop can be reported as
    both job control and ptrace stops. As this can break the current
    ptrace users, suppress job control stopped events for these cases.

    If a real parent ptracer wants to know about both job control and
    ptrace stops, it can create a separate process to serve the role of
    real parent.

    Note that this only updates the wait(2) side of things. The real
    parent can access the states via wait(2) but still is not properly
    notified (woken up and delivered a signal). The test case below polls
    wait(2) with WNOHANG to work around this. Notification will be
    updated by future patches.

    Test case follows.

    #include <stdio.h>
    #include <unistd.h>
    #include <time.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    int main(void)
    {
        const struct timespec ts100ms = { .tv_nsec = 100000000 };
        pid_t tracee, tracer;
        siginfo_t si;
        int i;

        tracee = fork();
        if (tracee == 0) {
            while (1) {
                printf("tracee: SIGSTOP\n");
                raise(SIGSTOP);
                nanosleep(&ts100ms, NULL);
                printf("tracee: SIGCONT\n");
                raise(SIGCONT);
                nanosleep(&ts100ms, NULL);
            }
        }

        waitid(P_PID, tracee, &si, WSTOPPED | WNOHANG | WNOWAIT);

        tracer = fork();
        if (tracer == 0) {
            nanosleep(&ts100ms, NULL);
            ptrace(PTRACE_ATTACH, tracee, NULL, NULL);

            for (i = 0; i < 11; i++) {
                si.si_pid = 0;
                waitid(P_PID, tracee, &si, WSTOPPED);
                if (si.si_pid && si.si_code == CLD_TRAPPED)
                    ptrace(PTRACE_CONT, tracee, NULL,
                           (void *)(long)si.si_status);
            }
            printf("tracer: EXITING\n");
            return 0;
        }

        while (1) {
            si.si_pid = 0;
            waitid(P_PID, tracee, &si,
                   WSTOPPED | WCONTINUED | WEXITED | WNOHANG);
            if (si.si_pid)
                printf("mommy : WAIT status=%02d code=%02d\n",
                       si.si_status, si.si_code);
            nanosleep(&ts100ms, NULL);
        }
        return 0;
    }

    Before the patch, while ptraced, the parent can't see any job control
    events.

    tracee: SIGSTOP
    mommy : WAIT status=19 code=05
    tracee: SIGCONT
    tracee: SIGSTOP
    tracee: SIGCONT
    tracee: SIGSTOP
    tracee: SIGCONT
    tracee: SIGSTOP
    tracer: EXITING
    mommy : WAIT status=19 code=05
    ^C

    After the patch,

    tracee: SIGSTOP
    mommy : WAIT status=19 code=05
    tracee: SIGCONT
    mommy : WAIT status=18 code=06
    tracee: SIGSTOP
    mommy : WAIT status=19 code=05
    tracee: SIGCONT
    mommy : WAIT status=18 code=06
    tracee: SIGSTOP
    mommy : WAIT status=19 code=05
    tracee: SIGCONT
    mommy : WAIT status=18 code=06
    tracee: SIGSTOP
    tracer: EXITING
    mommy : WAIT status=19 code=05
    ^C

    -v2: Oleg pointed out that wait(2) should be suppressed for the real
    parent's group instead of only the real parent task itself.
    Updated accordingly.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov

    Tejun Heo
     
  • wait(2) and friends allow access to stopped/continued states through
    zombies, which is required as the states are process-wide and should
    be accessible whether the leader task is alive or undead.
    wait_consider_task() implements this by always clearing notask_error
    and going through wait_task_stopped/continued() for unreaped zombies.

    However, while ptraced, the stopped state is per-task and as such if
    the ptracee became a zombie, there's no further stopped event to
    listen to and wait(2) and friends should return -ECHILD on the tracee.

    Fix it by clearing notask_error only if WCONTINUED | WEXITED is set
    for ptraced zombies. While at it, document why clearing notask_error
    is safe for each case.

    Test case follows.

    #include <stdio.h>
    #include <unistd.h>
    #include <pthread.h>
    #include <time.h>
    #include <signal.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>

    static void *nooper(void *arg)
    {
        pause();
        return NULL;
    }

    int main(void)
    {
        const struct timespec ts1s = { .tv_sec = 1 };
        pid_t tracee, tracer;
        siginfo_t si;

        tracee = fork();
        if (tracee == 0) {
            pthread_t thr;

            pthread_create(&thr, NULL, nooper, NULL);
            nanosleep(&ts1s, NULL);
            printf("tracee exiting\n");
            pthread_exit(NULL); /* let subthread run */
        }

        tracer = fork();
        if (tracer == 0) {
            ptrace(PTRACE_ATTACH, tracee, NULL, NULL);
            while (1) {
                if (waitid(P_PID, tracee, &si, WSTOPPED) < 0) {
                    perror("waitid");
                    break;
                }
                ptrace(PTRACE_CONT, tracee, NULL,
                       (void *)(long)si.si_status);
            }
            return 0;
        }

        waitid(P_PID, tracer, &si, WEXITED);
        kill(tracee, SIGKILL);
        return 0;
    }

    Before the patch, after the tracee becomes a zombie, the tracer's
    waitid(WSTOPPED) never returns and the program doesn't terminate.

    tracee exiting
    ^C

    After the patch, tracee exiting triggers waitid() to fail.

    tracee exiting
    waitid: No child processes

    -v2: Oleg pointed out that exited in addition to continued can happen
    for ptraced dead group leader. Clear notask_error for ptraced
    child on WEXITED too.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov

    Tejun Heo
     
  • Move EXIT_DEAD test in wait_consider_task() above ptrace check. As
    ptraced tasks can't be EXIT_DEAD, this change doesn't cause any
    behavior change. This is to prepare for further changes.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov

    Tejun Heo
     

10 Mar, 2011

1 commit

  • This patch adds support for creating a queuing context outside
    of the queue itself. This enables us to batch up pieces of IO
    before grabbing the block device queue lock and submitting them to
    the IO scheduler.

    The context is created on the stack of the process and assigned in
    the task structure, so that we can auto-unplug it if we hit a schedule
    event.

    The current queue plugging happens implicitly if IO is submitted to
    an empty device, yet callers have to remember to unplug that IO when
    they are going to wait for it. This is an ugly API and has caused bugs
    in the past. Additionally, it requires hacks in the vm (->sync_page()
    callback) to handle that logic. By switching to an explicit plugging
    scheme we make the API a lot nicer and can get rid of the ->sync_page()
    hack in the vm.
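
    The usage pattern this enables looks roughly like the following
    (schematic kernel code; submit_batched_io() is an illustrative
    stand-in for the caller's submissions):

    struct blk_plug plug;

    blk_start_plug(&plug);    /* context lives on the submitter's stack */
    submit_batched_io();      /* IO is batched up in the plug */
    blk_finish_plug(&plug);   /* flush the batch to the IO scheduler */

    If the task schedules while plugged, the plug is flushed
    automatically - the auto-unplug on a schedule event mentioned above.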

    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Jan, 2011

1 commit

  • Merge branch 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (28 commits)
    perf session: Fix infinite loop in __perf_session__process_events
    perf evsel: Support perf_evsel__open(cpus > 1 && threads > 1)
    perf sched: Use PTHREAD_STACK_MIN to avoid pthread_attr_setstacksize() fail
    perf tools: Emit clearer message for sys_perf_event_open ENOENT return
    perf stat: better error message for unsupported events
    perf sched: Fix allocation result check
    perf, x86: P4 PMU - Fix unflagged overflows handling
    dynamic debug: Fix build issue with older gcc
    tracing: Fix TRACE_EVENT power tracepoint creation
    tracing: Fix preempt count leak
    tracepoint: Add __rcu annotation
    tracing: remove duplicate null-pointer check in skb tracepoint
    tracing/trivial: Add missing comma in TRACE_EVENT comment
    tracing: Include module.h in define_trace.h
    x86: Save rbp in pt_regs on irq entry
    x86, dumpstack: Fix unused variable warning
    x86, NMI: Clean-up default_do_nmi()
    x86, NMI: Allow NMI reason io port (0x61) to be processed on any CPU
    x86, NMI: Remove DIE_NMI_IPI
    x86, NMI: Add priorities to handlers
    ...

    Linus Torvalds
     

07 Jan, 2011

1 commit

  • In particular, this patch moves perf_event_exit_task() before
    cgroup_exit() to allow for cgroup support. The cgroup_exit()
    function detaches the cgroups attached to a task.

    Other movements include hoisting some definitions and inlines
    to the top of perf_event.c.

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

17 Dec, 2010

1 commit

  • __get_cpu_var() can be replaced with this_cpu_read() and will then use
    a single read instruction with implied address calculation to access
    the correct per-cpu instance.

    However, the address of a per-cpu variable passed to __this_cpu_read()
    cannot be determined (since it's an implied address conversion through
    segment prefixes). Therefore, apply this only to uses of __get_cpu_var()
    where the address of the variable is not used.
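
    An illustrative before/after for the conversion (my_counter is a
    made-up per-cpu variable):

    DEFINE_PER_CPU(int, my_counter);

    int val;
    int *ptr;

    /* before: generic per-cpu access via computed address */
    val = __get_cpu_var(my_counter);

    /* after: a single segment-prefixed read on x86 */
    val = __this_cpu_read(my_counter);

    /* NOT converted: the variable's address is taken, which
       __this_cpu_read() cannot express */
    ptr = &__get_cpu_var(my_counter);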

    Cc: Pekka Enberg
    Cc: Hugh Dickins
    Cc: Thomas Gleixner
    Acked-by: H. Peter Anvin
    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

03 Dec, 2010

1 commit

  • If a user manages to trigger an oops with fs set to KERNEL_DS, fs is not
    otherwise reset before do_exit(). do_exit() may later (via mm_release()
    in fork.c) do a put_user() to a user-controlled address, potentially
    allowing a user to leverage an oops into a controlled write into kernel
    memory.

    This is only triggerable in the presence of another bug, but this
    potentially turns a lot of DoS bugs into privilege escalations, so it's
    worth fixing. I have proof-of-concept code which uses this bug along
    with CVE-2010-3849 to write a zero to an arbitrary kernel address, so
    I've tested that this is not theoretical.

    A more logical place to put this fix might be when we know an oops has
    occurred, before we call do_exit(), but that would involve changing
    every architecture, in multiple places.

    Let's just stick it in do_exit instead.
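
    The essence of the fix, schematically (the exact placement within
    do_exit() differs):

    void do_exit(long code)
    {
        /* an oops can leave get_fs() == KERNEL_DS; reset it before
           anything below (e.g. put_user() from mm_release()) can
           write through it */
        set_fs(USER_DS);

        /* ... the rest of the exit path ... */
    }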

    [akpm@linux-foundation.org: update code comment]
    Signed-off-by: Nelson Elhage
    Cc: KOSAKI Motohiro
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nelson Elhage
     

06 Nov, 2010

1 commit

  • posix-cpu-timers.c correctly assumes that the dying process does
    posix_cpu_timers_exit_group() and removes all !CPUCLOCK_PERTHREAD
    timers from signal->cpu_timers list.

    But, it also assumes that timer->it.cpu.task is always the group
    leader, and thus the dead ->task means the dead thread group.

    This is obviously not true after de_thread() changes the leader.
    After that almost every posix_cpu_timer_ method has problems.

    It is not simple to fix this bug correctly. First of all, I think
    that timer->it.cpu should use struct pid instead of task_struct.
    Also, the locking should be reworked completely. In particular,
    tasklist_lock should not be used at all. This all needs a lot of
    nontrivial and hard-to-test changes.

    Change __exit_signal() to do posix_cpu_timers_exit_group() when
    the old leader dies during exec. This is not the fix, just the
    temporary hack to hide the problem for 2.6.37 and stable. IOW,
    this is obviously wrong but this is what we currently have anyway:
    cpu timers do not work after mt exec.

    In theory this change adds another race. The exiting leader can
    detach the timers which were attached to the new leader. However,
    the window between de_thread() and release_task() is small, we
    can pretend that sys_timer_create() was called before de_thread().
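
    A sketch of what the workaround amounts to in __exit_signal()
    (illustrative; has_group_leader_pid() is assumed to identify the old
    leader dying during exec):

    if (group_dead) {
        posix_cpu_timers_exit_group(tsk);
    } else {
        /* temporary hack: the old leader is dying during exec
           (de_thread() has changed the leader), so detach the
           group-wide CPU timers here too */
        if (unlikely(has_group_leader_pid(tsk)))
            posix_cpu_timers_exit_group(tsk);
    }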

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

28 Oct, 2010

1 commit

  • find_new_reaper() releases and regrabs tasklist_lock but was missing
    proper annotations. Add them. This removes the following sparse
    warning:

    warning: context imbalance in 'find_new_reaper' - unexpected unlock
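
    The annotations in question, shown schematically on the function
    definition:

    static struct task_struct *find_new_reaper(struct task_struct *father)
        __releases(&tasklist_lock)
        __acquires(&tasklist_lock)
    {
        /* may drop and re-take tasklist_lock while choosing the
           new reaper ... */
    }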

    Signed-off-by: Namhyung Kim
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     

27 Oct, 2010

1 commit

  • It's pointless to kill a task if another thread sharing its mm cannot
    be killed to allow future memory freeing. A subsequent patch will
    prevent kills in such cases, but first it's necessary to have a way of
    flagging a task that shares memory with an OOM_DISABLE task without
    incurring an additional tasklist scan, which would make
    select_bad_process() an O(n^2) function.

    This patch adds an atomic counter to struct mm_struct that tracks how
    many threads attached to it have an oom_score_adj of
    OOM_SCORE_ADJ_MIN. Such threads cannot be killed by the kernel, so
    their memory cannot be freed in OOM conditions.

    This only requires task_lock() on the task that we're operating on; it
    does not require mm->mmap_sem, since task_lock() pins the mm and the
    operation is atomic.
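
    Schematically (the field name, oom_disable_count, appears in the
    notes below; the update site here is illustrative):

    struct mm_struct {
        /* ... */
        atomic_t oom_disable_count;  /* threads with OOM_SCORE_ADJ_MIN */
    };

    /* e.g. on exit of a thread whose oom_score_adj is OOM_SCORE_ADJ_MIN */
    task_lock(tsk);
    if (tsk->signal->oom_score_adj == OOM_SCORE_ADJ_MIN && tsk->mm)
        atomic_dec(&tsk->mm->oom_disable_count);
    task_unlock(tsk);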

    [rientjes@google.com: changelog and sys_unshare() code]
    [rientjes@google.com: protect oom_disable_count with task_lock in fork]
    [rientjes@google.com: use old_mm for oom_disable_count in exec]
    Signed-off-by: Ying Han
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

10 Sep, 2010

1 commit


18 Aug, 2010

1 commit

  • Using a program like the following:

    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>
    #include <sys/wait.h>

    int main() {
        id_t id;
        siginfo_t infop;
        pid_t res;

        id = fork();
        if (id == 0) { sleep(1); exit(0); }
        kill(id, SIGSTOP);
        alarm(1);
        waitid(P_PID, id, &infop, WCONTINUED);
        return 0;
    }

    to call waitid() on a stopped process results in access to the child task's
    credentials without the RCU read lock being held - which may be replaced in the
    meantime - eliciting the following warning:

    ===================================================
    [ INFO: suspicious rcu_dereference_check() usage. ]
    ---------------------------------------------------
    kernel/exit.c:1460 invoked rcu_dereference_check() without protection!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 1
    2 locks held by waitid02/22252:
    #0: (tasklist_lock){.?.?..}, at: [] do_wait+0xc5/0x310
    #1: (&(&sighand->siglock)->rlock){-.-...}, at: []
    wait_consider_task+0x19a/0xbe0

    stack backtrace:
    Pid: 22252, comm: waitid02 Not tainted 2.6.35-323cd+ #3
    Call Trace:
    [] lockdep_rcu_dereference+0xa4/0xc0
    [] wait_consider_task+0xaf1/0xbe0
    [] do_wait+0xf5/0x310
    [] sys_waitid+0x86/0x1f0
    [] ? child_wait_callback+0x0/0x70
    [] system_call_fastpath+0x16/0x1b

    This is fixed by holding the RCU read lock in wait_task_continued() to ensure
    that the task's current credentials aren't destroyed between us reading the
    cred pointer and us reading the UID from those credentials.

    Furthermore, protect wait_task_stopped() in the same way.

    We don't need to keep holding the RCU read lock once we've read the UID from
    the credentials as holding the RCU read lock doesn't stop the target task from
    changing its creds under us - so the credentials may be outdated immediately
    after we've read the pointer, lock or no lock.
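
    The shape of the fix, as a sketch using the kernel's cred accessors
    (the exact lines differ):

    uid_t uid;

    rcu_read_lock();
    uid = __task_cred(p)->uid;    /* snapshot the UID under RCU */
    rcu_read_unlock();
    /* only the copied uid is used from here on; the cred pointer
       itself may already be stale, lock or no lock */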

    Signed-off-by: Daniel J Blueman
    Signed-off-by: David Howells
    Acked-by: Paul E. McKenney
    Acked-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Daniel J Blueman
     

11 Aug, 2010

1 commit

  • exit_ptrace() takes tasklist_lock unconditionally. We need this lock
    to avoid the race with ptrace_traceme(); it acts as a barrier.

    Change its caller, forget_original_parent(), to call exit_ptrace() under
    tasklist_lock. Change exit_ptrace() to drop and reacquire this lock if
    needed.

    This allows us to add the fastpath list_empty(ptraced) check. In the
    likely no-tracees case exit_ptrace() just returns and we avoid the lock()
    + unlock() sequence.

    "Zhang, Yanmin" suggested to add this
    check, and he reports that this change adds about 11% improvement in some
    tests.
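
    A sketch of the resulting fastpath (the caller,
    forget_original_parent(), now holds tasklist_lock):

    void exit_ptrace(struct task_struct *tracer)
    {
        /* likely case: no tracees, avoid the unlock + lock sequence */
        if (likely(list_empty(&tracer->ptraced)))
            return;

        /* ... drop tasklist_lock, detach the tracees, re-take it ... */
    }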

    Suggested-and-tested-by: "Zhang, Yanmin"
    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

28 May, 2010

10 commits

  • No functional changes, just s/atomic_t count/int nr_threads/.

    With the recent changes this counter has a single user,
    get_nr_threads(), and none of its callers need a really accurate
    number of threads, not to mention that each caller obviously races
    with fork/exit. It is only used to report this value to user space,
    except that first_tid() uses it to avoid an unnecessary
    while_each_thread() loop in the unlikely case.

    It is a bit sad that we need a word in struct signal_struct for this;
    perhaps we can change get_nr_threads() to approximate the number of
    threads using signal->live and kill ->nr_threads later.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move taskstats_tgid_free() from __exit_signal() to free_signal_struct().

    This way signal->stats never points to nowhere and we can read ->stats
    lockless.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cleanup:

    - Add the boolean, group_dead = thread_group_leader(), for clarity.

    - Do not test/set sig == NULL to detect the all-dead case; use this
    boolean.

    - Pass this boolean to __unhash_process() and use it instead of
    another thread_group_leader() call which needs ->group_leader.

    This can be considered a micro-optimization, but hopefully it also
    allows us to do other cleanups later.

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that task->signal can't go away we can revert the horrible hack added
    by ad474caca3e2a0550b7ce0706527ad5ab389a4d4 ("fix for
    account_group_exec_runtime(), make sure ->signal can't be freed under
    rq->lock").

    And we can do more cleanups in sched_stats.h/posix-cpu-timers.c later.

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When the last thread exits, signal->tty is freed, but the pointer is
    not cleared and points to nowhere.

    This is OK. Nobody should use signal->tty locklessly, and it is no
    longer possible to take ->siglock. However, this looks wrong even if
    correct, and a nice OOPS is better than subtle, hard-to-find bugs.

    Change __exit_signal() to clear signal->tty under ->siglock.

    Note: __exit_signal() needs more cleanups. It should not check "sig !=
    NULL" to detect the all-dead case and we have the same issues with
    signal->stats.

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Acked-by: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • We have a lot of problems with accessing task_struct->signal, it can
    "disappear" at any moment. Even current can't use its ->signal safely
    after exit_notify(). ->siglock helps, but it is not convenient, not
    always possible, and sometimes it makes sense to use task->signal
    even after this task is already dead.

    This patch adds a reference counter, sigcnt, into signal_struct. This
    reference is owned by task_struct and is dropped in
    __put_task_struct(). Perhaps it makes sense to export
    get/put_signal_struct() later, but currently I don't see an immediate
    reason.

    Rename __cleanup_signal() to free_signal_struct() and unexport it. With
    the previous changes it does nothing except kmem_cache_free().

    Change __exit_signal() to not clear/free ->signal, it will be freed when
    the last reference to any thread in the thread group goes away.

    Note:
    - when the last thread exits, signal->tty can point to nowhere; see
    the next patch.

    - with or without this patch signal_struct->count should go away,
    or at least it should be "int nr_threads" for fs/proc. This will
    be addressed later.
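
    The refcounting pattern this describes, sketched with an assumed put
    helper (free_signal_struct() is named in the changelog):

    static inline void put_signal_struct(struct signal_struct *sig)
    {
        if (atomic_dec_and_test(&sig->sigcnt))
            free_signal_struct(sig);    /* now just kmem_cache_free() */
    }

    void __put_task_struct(struct task_struct *tsk)
    {
        /* ... */
        put_signal_struct(tsk->signal); /* last task ref frees ->signal */
        /* ... */
    }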

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • tty_kref_put() has two callsites in the copy_process() paths:

    1. if copy_process() succeeds, it is called before we copy
    signal->tty from the parent

    2. otherwise, it is called from __cleanup_signal() under the
    bad_fork_cleanup_signal: label

    In both cases tty_kref_put() is wrong and unneeded because we don't
    have the balancing tty_kref_get(). Fortunately, this is harmless
    because it can only happen without CLONE_THREAD, and in that case
    signal->tty must be NULL.

    Remove tty_kref_put() from copy_process() and __cleanup_signal(), and
    change another caller of __cleanup_signal(), __exit_signal(), to call
    tty_kref_put() by hand.

    I hope this change makes sense by itself, but it is also needed to make
    ->signal refcountable.

    Signed-off-by: Oleg Nesterov
    Acked-by: Alan Cox
    Acked-by: Roland McGrath
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change __exit_signal() to check thread_group_leader() instead of
    atomic_dec_and_test(&sig->count). This must be equivalent: the group
    leader must be released only after all other threads have exited and
    passed __exit_signal().

    Henceforth sig->count is not actually used, except in fs/proc for
    get_nr_threads/etc.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • de_thread() and __exit_signal() use signal_struct->count/notify_count for
    synchronization. We can simplify the code and use ->notify_count only.
    Instead of comparing these two counters, we can change de_thread() to set
    ->notify_count = nr_of_sub_threads, then change __exit_signal() to
    dec-and-test this counter and notify group_exit_task.

    Note that __exit_signal() checks "notify_count > 0" just for symmetry with
    exit_notify(), we could just check it is != 0.
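
    Schematically, the __exit_signal() side becomes a dec-and-test (a
    sketch; the real code sits among the other exit bookkeeping):

    if (sig->notify_count > 0 && !--sig->notify_count)
        wake_up_process(sig->group_exit_task);  /* wake de_thread() */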

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • signal_struct->count in its current form must die.

    - it has no reason to be atomic_t

    - it looks like a reference counter, but it is not

    - otoh, we really need to make task->signal refcountable; just look at
    the extremely ugly task_rq_unlock_wait() called from __exit_signal().

    - we should change the lifetime rules for task->signal, it should be
    pinned to task_struct. We have a lot of code which can be simplified
    after that.

    - it is not needed! While the code is correct, any usage of this
    counter is artificial, except that fs/proc uses it correctly to show
    the number of threads.

    This series removes the usage of sig->count from the exit paths.

    This patch:

    Now that Veaceslav changed copy_signal() to use zalloc(), exit_notify()
    can just check notify_count < 0 to see whether an execing sub-thread
    needs the notification from us. No need to do other checks;
    notify_count != 0 must always mean that ->group_exit_task != NULL is
    waiting for us.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

25 May, 2010

1 commit

  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing
    all old disallowed bits later. But along the way, the allocator may
    find that there is no node to allocate memory on.

    The reason is that when cpuset rebinds the task's mempolicy, it clears
    the nodes which the allocator can allocate pages on; for example:

    (mpol: mempolicy)
    task1                      task1's mpol    task2
    alloc page                 1
      alloc on node0? NO       1
                               1               change mems from 1 to 0
                               1               rebind task1's mpol
                               0-1               set new bits
                               0                 clear disallowed bits
      alloc on node1? NO       0
      ...
    can't alloc page
      goto oom

    This patch fixes the problem by expanding the node range first (set
    newly allowed bits) and shrinking it lazily (clear newly disallowed
    bits). We use a variable to tell the write-side task that a read-side
    task is reading the nodemask, and the write-side task clears the newly
    disallowed nodes only after the read-side task ends its current memory
    allocation.
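
    A sketch of the two-step update (nodes_or() is the kernel's nodemask
    helper; set_task_mems() is an illustrative stand-in for the actual
    update path):

    nodemask_t tmp;

    /* step 1: grow - old | new, so an allocation in flight always
       sees at least one allowed node */
    nodes_or(tmp, old_mems, new_mems);
    set_task_mems(task, &tmp);

    /* wait until no read-side task is inside an allocation */

    /* step 2: shrink - clear the newly disallowed bits */
    set_task_mems(task, &new_mems);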

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

15 Apr, 2010

1 commit


07 Apr, 2010

1 commit

  • - We weren't zeroing p->rss_stat[] at fork()

    - Consequently sync_mm_rss() was dereferencing tsk->mm for kernel
    threads and was oopsing.

    - Make __sync_task_rss_stat() static, too.

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648

    [akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)]
    Reported-by: Troels Liebe Bentsen
    Signed-off-by: KAMEZAWA Hiroyuki
    "Michael S. Tsirkin"
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

03 Apr, 2010

1 commit


14 Mar, 2010

1 commit

  • Merge branch 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    locking: Make sparse work with inline spinlocks and rwlocks
    x86/mce: Fix RCU lockdep splats
    rcu: Increase RCU CPU stall timeouts if PROVE_RCU
    ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
    rcu: Suppress RCU lockdep warnings during early boot
    rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
    rcu: Suppress __mpol_dup() false positive from RCU lockdep
    rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
    rcu: Add control variables to lockdep_rcu_dereference() diagnostics
    rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
    rcu: Use wrapper function instead of exporting tasklist_lock
    sched, rcu: Fix rcu_dereference() for RCU-lockdep
    rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
    rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
    x86/gart: Unexport gart_iommu_aperture

    Fix trivial conflicts in kernel/trace/ftrace.c

    Linus Torvalds
     

07 Mar, 2010

2 commits

  • kernel/exit.c:1183:26: warning: symbol 'status' shadows an earlier one
    kernel/exit.c:1173:21: originally declared here

    Signed-off-by: Thiago Farina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thiago Farina
     
  • Considering the nature of per-mm stats, they are a shared object among
    threads and can be a cache-miss point in the page fault path.

    This patch adds a per-thread cache for mm_counter. The RSS value is
    accumulated into a struct in task_struct and synchronized with the
    mm's counters at certain events.

    In this patch, the event is the number of calls to handle_mm_fault:
    the per-thread value is folded into the mm every 64 calls.

    A rough estimate with a small benchmark on parallel threads (2
    threads) shows:
    [before]
    4.5 cache-misses/fault
    [after]
    4.0 cache-misses/fault
    Anyway, the most contended object is mmap_sem if the number of
    threads grows.
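
    A sketch of the synchronization trigger (the 64-call threshold comes
    from the description above; the function and field names are
    assumptions):

    #define TASK_RSS_EVENTS_THRESH  64

    static void check_sync_rss_stat(struct task_struct *task)
    {
        /* fold the per-thread deltas into mm every 64 faults */
        if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
            __sync_task_rss_stat(task, task->mm);
    }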

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

04 Mar, 2010

1 commit

  • Lockdep-RCU commit d11c563d exported tasklist_lock, which is not
    a good thing. This patch instead exports a function that uses
    lockdep to check whether tasklist_lock is held.
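
    The wrapper is small; roughly (a sketch consistent with the
    description):

    #ifdef CONFIG_PROVE_RCU
    int lockdep_tasklist_lock_is_held(void)
    {
        return lockdep_is_held(&tasklist_lock);
    }
    EXPORT_SYMBOL_GPL(lockdep_tasklist_lock_is_held);
    #endif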

    Suggested-by: Christoph Hellwig
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    Cc: Christoph Hellwig
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

25 Feb, 2010

1 commit

  • Update the rcu_dereference() usages to take advantage of the new
    lockdep-based checking.
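
    A typical conversion of this kind (illustrative):

    /* before: */
    parent = rcu_dereference(tsk->real_parent);

    /* after: also legal with tasklist_lock held, not just under
       rcu_read_lock() */
    parent = rcu_dereference_check(tsk->real_parent,
                                   rcu_read_lock_held() ||
                                   lockdep_tasklist_lock_is_held());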

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

18 Dec, 2009

1 commit

  • Thanks to Roland who pointed out de_thread() issues.

    Currently we add sub-threads to ->real_parent->children list. This buys
    nothing but slows down do_wait().

    With this patch ->children contains only main threads (group leaders).
    The only complication is that forget_original_parent() should iterate over
    sub-threads by hand, and de_thread() needs another list_replace() when it
    changes ->group_leader.

    Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
    tasks, so we can remove this check (and we can unify do_wait_thread()
    and ptrace_do_wait()).

    This change can confuse the optimistic search in mm_update_next_owner(),
    but this is fixable and minor.

    Perhaps badness() and oom_kill_process() should be updated, but they
    should be fixed in any case.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

15 Dec, 2009

1 commit