09 Aug, 2014

1 commit


07 Aug, 2014

1 commit

  • The oom killer scans each process and determines whether it is eligible
    for oom kill or whether the oom killer should abort because of
    concurrent memory freeing. It will abort when an eligible process is
    found to have TIF_MEMDIE set, meaning it has already been oom killed and
    we're waiting for it to exit.

    Processes with task->mm == NULL should not be considered because they
    are either kthreads or have already detached their memory and killing
    them would not lead to memory freeing. That memory is only freed after
    exit_mm() has returned, however, and not when task->mm is first set to
    NULL.

    Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
    is no longer considered for oom kill, but only until exit_mm() has
    returned. This was fragile in the past because it relied on
    exit_notify() to be reached before no longer considering TIF_MEMDIE
    processes.

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
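
    The scan order described in this commit can be sketched as a small
    userspace model. This is only an illustration under stated assumptions:
    struct model_task, model_select_victim() and model_exit_mm() are
    hypothetical names and not kernel code. The sketch checks the
    "already being killed" flag before the "no mm" test, and drops the
    flag only once the memory has been released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel's per-task state. */
struct model_task {
    bool has_mm;      /* models task->mm != NULL */
    bool tif_memdie;  /* models the TIF_MEMDIE thread flag */
    long badness;     /* models the oom score; higher is a better victim */
};

enum scan_result { SCAN_NONE = -1, SCAN_ABORT = -2 };

/*
 * Returns the index of the chosen victim, SCAN_ABORT if a task is already
 * being killed, or SCAN_NONE if nothing is eligible.
 */
int model_select_victim(const struct model_task *tasks, size_t n)
{
    int best = SCAN_NONE;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* A victim is still exiting: wait for it instead of killing more. */
        if (tasks[i].tif_memdie)
            return SCAN_ABORT;
        /* kthreads and tasks that detached their memory free nothing. */
        if (!tasks[i].has_mm)
            continue;
        if (best == SCAN_NONE || tasks[i].badness > tasks[best].badness)
            best = (int)i;
    }
    return best;
}

/* Models the fix: the flag is dropped only after the memory is released. */
void model_exit_mm(struct model_task *t)
{
    t->has_mm = false;      /* models task->mm = NULL */
    /* ... the address space would be released here (mmput()) ... */
    t->tif_memdie = false;  /* cleared once the memory is really gone */
}
```

    In this model a task with tif_memdie set and has_mm already false
    still makes the scan abort, which is the window the commit message
    describes between task->mm becoming NULL and exit_mm() returning.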
     

07 Jun, 2014

1 commit

  • Move the declaration/definition of allow_signal/disallow_signal to
    signal.h/signal.c. The new place is more logical and allows us to use
    the static helpers in signal.c (see the next changes).

    While at it, make them return void and remove the valid_signal() check.
    Nobody checks the returned value, and in-kernel users must not pass the
    wrong signal number.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: David Woodhouse
    Cc: Frederic Weisbecker
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    Cc: Richard Weinberger
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

05 Jun, 2014

3 commits

  • for_each_process_thread() is sub-optimal. All threads share the same
    ->mm, so we can switch to the next process once we have found a thread
    with ->mm != NULL and ->mm != mm.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Peter Chiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
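
    A minimal userspace sketch of the early exit described above. The
    types and the function model_find_mm_user() are hypothetical and only
    model the idea; the real code walks the kernel task list under the
    appropriate locking. The counter exists only to make the saved work
    visible.

```c
#include <assert.h>
#include <stddef.h>

struct model_mm { int id; };

struct model_thread {
    struct model_mm *mm;  /* NULL models a thread that already dropped its mm */
};

struct model_process {
    const struct model_thread *threads;
    size_t nr_threads;
};

/*
 * Returns the first thread that uses mm, or NULL if there is none.
 * *inspected is set to the number of threads that were looked at.
 */
const struct model_thread *
model_find_mm_user(const struct model_process *procs, size_t nr_procs,
                   const struct model_mm *mm, size_t *inspected)
{
    assert(mm != NULL && inspected != NULL);
    *inspected = 0;
    for (size_t p = 0; p < nr_procs; p++) {
        for (size_t t = 0; t < procs[p].nr_threads; t++) {
            const struct model_thread *th = &procs[p].threads[t];

            (*inspected)++;
            if (th->mm == mm)
                return th;
            /*
             * Threads of one process share a single mm, so a different
             * non-NULL mm rules out the rest of this process.
             */
            if (th->mm != NULL)
                break;
        }
    }
    return NULL;
}
```

    A thread with a NULL mm does not end the inner loop, because its
    siblings may still hold the mm being searched for.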
     
  • "Search through everything else" in mm_update_next_owner() can hit a
    kthread which adopted this "mm" via use_mm(), it should not be used as
    mm->owner. Add the PF_KTHREAD check.

    While at it, change this code to use for_each_process_thread() instead
    of deprecated do_each_thread/while_each_thread.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Michal Hocko
    Cc: Balbir Singh
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Michal Hocko
    Cc: Peter Chiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • CONFIG_MM_OWNER makes no sense. It is not user-selectable, it is only
    selected by CONFIG_MEMCG automatically. So we can kill this option in
    init/Kconfig and do s/CONFIG_MM_OWNER/CONFIG_MEMCG/ globally.

    Signed-off-by: Oleg Nesterov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

08 Apr, 2014

9 commits

  • Even if the main thread is dead the process still can stop/continue.
    However, if the leader is ptraced wait_consider_task(ptrace => false)
    always skips wait_task_stopped/wait_task_continued, so WSTOPPED or
    WCONTINUED can never work for the natural parent in this case.

    Move the "A zombie ptracee is only visible to its ptracer" check into the
    "if (!delay_group_leader(p))" block. ->notask_error is cleared by the
    "fall through" code below.

    This depends on the previous change, wait_task_stopped/continued must be
    avoided if !delay_group_leader() and the tracer is ->real_parent.
    Otherwise WSTOPPED|WEXITED could wrongly report "stopped" when the child
    is already dead (single-threaded or not). If it is traced by another task
    then the "stopped" state is fine until the debugger detaches and reveals a
    zombie state.

    Stupid test-case:

    void *tfunc(void *arg)
    {
        sleep(1); // wait for zombie leader
        raise(SIGSTOP);
        exit(0x13);
        return NULL;
    }

    int run_child(void)
    {
        pthread_t thread;

        if (!fork()) {
            int tracee = getppid();

            assert(ptrace(PTRACE_ATTACH, tracee, 0,0) == 0);
            do
                ptrace(PTRACE_CONT, tracee, 0,0);
            while (wait(NULL) > 0);

            return 0;
        }

        sleep(1); // wait for PTRACE_ATTACH
        assert(pthread_create(&thread, NULL, tfunc, NULL) == 0);
        pthread_exit(NULL);
    }

    int main(void)
    {
        int child, stat;

        child = fork();
        if (!child)
            return run_child();

        assert(child == waitpid(-1, &stat, WSTOPPED));
        assert(stat == 0x137f);

        kill(child, SIGCONT);

        assert(child == waitpid(-1, &stat, WCONTINUED));
        assert(stat == 0xffff);

        assert(child == waitpid(-1, &stat, 0));
        assert(stat == 0x1300);

        return 0;
    }

    Without this patch it hangs in waitpid(WSTOPPED), wait_task_stopped() is
    never called.

    Note: this doesn't fix all problems with a zombie delay_group_leader(),
    WCONTINUED | WEXITED check is not exactly right. debugger can't assume it
    will be notified if another thread reaps the whole thread group.

    Signed-off-by: Oleg Nesterov
    Cc: Al Viro
    Cc: Jan Kratochvil
    Cc: Lennart Poettering
    Cc: Michal Schmidt
    Cc: Roland McGrath
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • "A zombie is only visible to its ptracer" logic in wait_consider_task()
    is very wrong. Trivial test-case:

    #include <unistd.h>
    #include <sys/ptrace.h>
    #include <sys/wait.h>
    #include <assert.h>

    int main(void)
    {
        int child = fork();

        if (!child) {
            assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
            return 0x23;
        }

        assert(waitid(P_ALL, child, NULL, WEXITED | WNOWAIT) == 0);
        assert(waitid(P_ALL, 0, NULL, WSTOPPED) == -1);
        return 0;
    }

    It hangs in waitid(WSTOPPED) despite the fact that it has a single zombie
    child. This is because wait_consider_task(ptrace => 0) sees p->ptrace and
    clears ->notask_error, assuming that the debugger should detach and notify
    us.

    Change wait_consider_task(ptrace => 0) to pretend that ptrace == T if the
    child is traced by us. This really simplifies the logic and allows us to
    do more fixes, see the next changes. This also hides the unwanted group
    stop state automatically, we can remove another ptrace_reparented() check.

    Unfortunately, this adds the following behavioural changes:

    1. Before this patch wait(WEXITED | __WNOTHREAD) does not reap
    a natural child if it is traced by the caller's sub-thread.

    Hopefully nobody will ever notice this change, and I think
    that nobody should rely on this behaviour anyway.

    2. SIGNAL_STOP_CONTINUED is no longer hidden from debugger if
    it is real parent.

    While this change comes as a side effect, I think it is good
    by itself. The group continued state can not be consumed by
    another process in this case, it doesn't depend on ptrace,
    it doesn't make sense to hide it from real parent.

    Perhaps we should add the thread_group_leader() check before
    wait_task_continued()? Maybe, but this shouldn't depend on
    ptrace_reparented().

    Signed-off-by: Oleg Nesterov
    Cc: Al Viro
    Cc: Jan Kratochvil
    Cc: Lennart Poettering
    Cc: Michal Schmidt
    Cc: Roland McGrath
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that EXIT_DEAD is the terminal state it doesn't make sense to call
    eligible_child() or security_task_wait() if the task is really dead.

    Signed-off-by: Oleg Nesterov
    Tested-by: Michal Schmidt
    Cc: Jan Kratochvil
    Cc: Al Viro
    Cc: Lennart Poettering
    Cc: Roland McGrath
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • wait_task_zombie() always uses EXIT_TRACE/ptrace_unlink() if
    ptrace_reparented(). This is suboptimal and a bit confusing: we do not
    need do_notify_parent(p) if !thread_group_leader(p) and in this case we
    also do not need ptrace_unlink(), we can rely on ptrace_release_task().

    Change wait_task_zombie() to check thread_group_leader() along with
    ptrace_reparented() and simplify the final p->exit_state transition.

    Signed-off-by: Oleg Nesterov
    Tested-by: Michal Schmidt
    Cc: Jan Kratochvil
    Cc: Al Viro
    Cc: Lennart Poettering
    Cc: Roland McGrath
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
    drops tasklist_lock. If this task is not the natural child and it is
    traced, we change its state back to EXIT_ZOMBIE for ->real_parent.

    The last transition is racy, this is even documented in 50b8d257486a
    "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
    race". wait_consider_task() tries to detect this transition and clear
    ->notask_error but we can't rely on ptrace_reparented(), debugger can
    exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.

    And there is another problem which was missed before: this transition
    can also race with reparent_leader() which doesn't reset ->exit_signal if
    EXIT_DEAD, assuming that this task must be reaped by someone else. So
    the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
    /sbin/init doesn't use __WALL it becomes unreapable. This was fixed by
    the previous commit, but it was the temporary hack.

    1. Add the new exit_state, EXIT_TRACE. It means that the task is the
    traced zombie, debugger is going to detach and notify its natural
    parent.

    This new state is actually EXIT_ZOMBIE | EXIT_DEAD. This way we
    can avoid the changes in proc/kgdb code, get_task_state() still
    reports "X (dead)" in this case.

    Note: with or without this change userspace can see Z -> X -> Z
    transition. Not really bad, but probably makes sense to fix.

    2. Change wait_task_zombie() to use EXIT_TRACE instead of EXIT_DEAD
    if we need to notify the ->real_parent.

    3. Revert the previous hack in reparent_leader(), now that EXIT_DEAD
    is always the final state we can safely ignore such a task.

    4. Change wait_consider_task() to check EXIT_TRACE separately and kill
    the racy and no longer needed ptrace_reparented() case.

    If ptrace == T an EXIT_TRACE thread should be simply ignored, the
    owner of this state is going to ptrace_unlink() this task. We can
    pretend that it was already removed from ->ptraced list.

    Otherwise we should skip this thread too but clear ->notask_error,
    we must be the natural parent and debugger is going to untrace and
    notify us. IOW, this doesn't differ from "EXIT_ZOMBIE && p->ptrace"
    even if the task was already untraced.

    Signed-off-by: Oleg Nesterov
    Reported-by: Jan Kratochvil
    Reported-by: Michal Schmidt
    Tested-by: Michal Schmidt
    Cc: Al Viro
    Cc: Lennart Poettering
    Cc: Roland McGrath
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
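
    Points 1 and 4 above can be illustrated with a tiny userspace model.
    Everything here is an assumption made for the sake of the example: the
    bit values are illustrative, the names are hypothetical, and the
    reporting helper simply assumes that the dead bit is the higher of the
    two and that the reporting code names the highest set bit. The
    handling of plain zombies is deliberately simplified.

```c
#include <stddef.h>

/* Illustrative bit values; the real ones live in the kernel headers. */
#define MODEL_EXIT_ZOMBIE 0x10u
#define MODEL_EXIT_DEAD   0x20u
#define MODEL_EXIT_TRACE  (MODEL_EXIT_ZOMBIE | MODEL_EXIT_DEAD)

/* Names the highest set state bit, so the combined state reads as dead. */
const char *model_state_name(unsigned int exit_state)
{
    if (exit_state & MODEL_EXIT_DEAD)
        return "X (dead)";
    if (exit_state & MODEL_EXIT_ZOMBIE)
        return "Z (zombie)";
    return "R (running)";
}

enum consider {
    CONSIDER_REAP,              /* a zombie the caller may reap */
    CONSIDER_SKIP,              /* ignore the task entirely */
    CONSIDER_SKIP_KEEP_WAITING  /* ignore it now, but a notification will come */
};

/* ptrace != 0 models a caller that is scanning its list of tracees. */
enum consider model_consider_exit_state(unsigned int exit_state, int ptrace)
{
    if (exit_state == MODEL_EXIT_DEAD)
        return CONSIDER_SKIP;   /* final state, nothing left to wait for */
    if (exit_state == MODEL_EXIT_TRACE)
        /* The tracer ignores it; the natural parent keeps waiting. */
        return ptrace ? CONSIDER_SKIP : CONSIDER_SKIP_KEEP_WAITING;
    if (exit_state == MODEL_EXIT_ZOMBIE)
        return CONSIDER_REAP;
    return CONSIDER_SKIP;
}
```

    Because the combined state contains the dead bit, a reader that only
    looks for "dead" keeps working without any change, which is the
    motivation given in point 1.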
     
  • wait_task_zombie() first does EXIT_ZOMBIE->EXIT_DEAD transition and
    drops tasklist_lock. If this task is not the natural child and it is
    traced, we change its state back to EXIT_ZOMBIE for ->real_parent.

    The last transition is racy, this is even documented in 50b8d257486a
    "ptrace: partially fix the do_wait(WEXITED) vs EXIT_DEAD->EXIT_ZOMBIE
    race". wait_consider_task() tries to detect this transition and clear
    ->notask_error but we can't rely on ptrace_reparented(), debugger can
    exit and do ptrace_unlink() before its sub-thread sets EXIT_ZOMBIE.

    And there is another problem which was missed before: this transition
    can also race with reparent_leader() which doesn't reset ->exit_signal if
    EXIT_DEAD, assuming that this task must be reaped by someone else. So
    the tracee can be re-parented with ->exit_signal != SIGCHLD, and if
    /sbin/init doesn't use __WALL it becomes unreapable.

    Change reparent_leader() to update ->exit_signal even if EXIT_DEAD.
    Note: this is the simple temporary hack for -stable, it doesn't try to
    solve all problems, it will be reverted by the next changes.

    Signed-off-by: Oleg Nesterov
    Reported-by: Jan Kratochvil
    Reported-by: Michal Schmidt
    Tested-by: Michal Schmidt
    Cc: Al Viro
    Cc: Lennart Poettering
    Cc: Roland McGrath
    Cc: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The process events connector delivers a notification when a process
    exits. This is really convenient for a process that spawns and wants to
    monitor its children through an epoll-able() interface.

    Unfortunately, there is a small window between when the event is
    delivered and when the child becomes wait()-able.

    This creates a race if the parent wants to make sure that it knows
    about the exit, e.g.:

    pid_t pid = fork();
    if (pid > 0) {
        register_interest_for_pid(pid);
        if (waitpid(pid, NULL, WNOHANG) > 0)
        {
            /* We might have raced with exit() */
        }
        return;
    }

    /* Child */
    execve(...)

    register_interest_for_pid() would be telling the connector socket
    reader to pay attention to events related to pid.

    Though this is not a bug, I think it would make the connector a bit more
    usable if this race was closed by simply moving the call to
    proc_exit_connector() from just before exit_notify() to right after.

    Oleg said:

    : Even with this patch the code above is still "racy" if the child is
    : multi-threaded. Plus it should obviously filter-out subthreads. And
    : afaics there is no way to make it reliable, even if you change the code
    : above so that waitpid() is called only after the last thread exits WNOHANG
    : still can fail.

    Signed-off-by: Guillaume Morin
    Cc: Matt Helsley
    Cc: Oleg Nesterov
    Cc: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Guillaume Morin
     
  • It is not clear why check_stack_usage() is called so early and thus it
    never checks the stack usage in, say, exit_notify() or
    flush_ptrace_hw_breakpoint() or other functions which are only called by
    do_exit().

    Move the callsite down to the last preempt_disable/schedule.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Commit 8aac62706ada ("move exit_task_namespaces() outside of
    exit_notify()") breaks pppd and the exiting service crashes the kernel:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    IP: ppp_register_channel+0x13/0x20 [ppp_generic]
    Call Trace:
    ppp_asynctty_open+0x12b/0x170 [ppp_async]
    tty_ldisc_open.isra.2+0x27/0x60
    tty_ldisc_hangup+0x1e3/0x220
    __tty_hangup+0x2c4/0x440
    disassociate_ctty+0x61/0x270
    do_exit+0x7f2/0xa50

    ppp_register_channel() needs ->net_ns and current->nsproxy == NULL.

    Move disassociate_ctty() before exit_task_namespaces(), it doesn't make
    sense to delay it after perf_event_exit_task() or cgroup_exit().

    This also allows the use of task_work_add() inside the (nontrivial) code
    paths in disassociate_ctty().

    Investigated by Peter Hurley.

    Signed-off-by: Oleg Nesterov
    Reported-by: Sree Harsha Totakura
    Cc: Peter Hurley
    Cc: Sree Harsha Totakura
    Cc: "Eric W. Biederman"
    Cc: Jeff Dike
    Cc: Ingo Molnar
    Cc: Andrey Vagin
    Cc: Al Viro
    Cc: [v3.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

29 Mar, 2014

1 commit


22 Jan, 2014

1 commit

  • while_each_thread() and next_thread() should die, almost every lockless
    usage is wrong.

    1. Unless g == current, the lockless while_each_thread() is not safe.

    while_each_thread(g, t) can loop forever if g exits, next_thread()
    can't reach the unhashed thread in this case. Note that this can
    happen even if g is the group leader, it can exec.

    2. Even if while_each_thread() itself was correct, people often use
    it wrongly.

    It was never safe to just take rcu_read_lock() and loop unless
    you verify that pid_alive(g) == T, even the first next_thread()
    can point to the already freed/reused memory.

    This patch adds signal_struct->thread_head and task->thread_node to
    create the normal rcu-safe list with the stable head. The new
    for_each_thread(g, t) helper is always safe under rcu_read_lock() as
    long as this task_struct can't go away.

    Note: of course it is ugly to have both task_struct->thread_node and the
    old task_struct->thread_group, we will kill it later, after we change
    the users of while_each_thread() to use for_each_thread().

    Perhaps we can kill it even before we convert all users, we can
    reimplement next_thread(t) using the new thread_head/thread_node. But
    we can't do this right now because this will lead to subtle behavioural
    changes. For example, do/while_each_thread() always sees at least one
    task, while for_each_thread() can do nothing if the whole thread group
    has died. Or thread_group_empty(), currently its semantics is not clear
    unless thread_group_leader(p) and we need to audit the callers before we
    can change it.

    So this patch adds the new interface which has to coexist with the old
    one for some time, hopefully the next changes will be more or less
    straightforward and the old one will go away soon.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Sergey Dyasly
    Tested-by: Sergey Dyasly
    Reviewed-by: Sameer Nanda
    Acked-by: David Rientjes
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Cc: Michal Hocko
    Cc: "Tu, Xiaobing"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
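
    The behavioural difference mentioned above (a loop over a list with a
    stable head can run zero times) can be sketched in userspace. This is
    a simplified model under stated assumptions: it uses a plain singly
    linked list, it does not model RCU at all, and struct model_task,
    struct model_signal, model_for_each_thread() and the helper functions
    are hypothetical names.

```c
#include <stddef.h>

struct model_task {
    int pid;
    struct model_task *next_thread;  /* models task->thread_node */
};

struct model_signal {
    struct model_task *thread_head;  /* models signal_struct->thread_head */
};

/* Visits every thread reachable from the stable head; may visit none. */
#define model_for_each_thread(sig, t) \
    for ((t) = (sig)->thread_head; (t) != NULL; (t) = (t)->next_thread)

size_t model_count_threads(const struct model_signal *sig)
{
    const struct model_task *t;
    size_t n = 0;

    model_for_each_thread(sig, t)
        n++;
    return n;
}

/* Unlinks a thread; the head stays usable whichever thread exits. */
void model_unhash_thread(struct model_signal *sig, struct model_task *victim)
{
    struct model_task **link = &sig->thread_head;

    while (*link != NULL && *link != victim)
        link = &(*link)->next_thread;
    if (*link != NULL)
        *link = victim->next_thread;
}
```

    Since the iteration starts from the head stored in the shared
    structure rather than from a particular thread, removing any thread,
    including the first one, cannot strand the loop. Once every thread is
    unlinked the loop body never runs, unlike a do/while style walk that
    always sees its starting task.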
     

10 Jul, 2013

1 commit

  • This reverts commit bf26c018490c ("Prepare to fix racy accesses on task
    breakpoints").

    The patch was fine but we can no longer race with SIGKILL after commit
    9899d11f6544 ("ptrace: ensure arch_ptrace/ptrace_request can never race
    with SIGKILL"), the __TASK_TRACED tracee can't be woken up and
    ->ptrace_bps[] can't go away.

    Now that ptrace_get_breakpoints/ptrace_put_breakpoints have no callers,
    we can kill them and remove task->ptrace_bp_refcnt.

    Signed-off-by: Oleg Nesterov
    Acked-by: Frederic Weisbecker
    Acked-by: Michael Neuling
    Cc: Benjamin Herrenschmidt
    Cc: Ingo Molnar
    Cc: Jan Kratochvil
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Cc: Will Deacon
    Cc: Prasad
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

04 Jul, 2013

2 commits

  • Merge first patch-bomb from Andrew Morton:
    - various misc bits
    - I've been patchmonkeying ocfs2 for a while, as Joel and Mark have been
    distracted. There has been quite a bit of activity.
    - About half the MM queue
    - Some backlight bits
    - Various lib/ updates
    - checkpatch updates
    - zillions more little rtc patches
    - ptrace
    - signals
    - exec
    - procfs
    - rapidio
    - nbd
    - aoe
    - pps
    - memstick
    - tools/testing/selftests updates

    * emailed patches from Andrew Morton: (445 commits)
    tools/testing/selftests: don't assume the x bit is set on scripts
    selftests: add .gitignore for kcmp
    selftests: fix clean target in kcmp Makefile
    selftests: add .gitignore for vm
    selftests: add hugetlbfstest
    self-test: fix make clean
    selftests: exit 1 on failure
    kernel/resource.c: remove the unneeded assignment in function __find_resource
    aio: fix wrong comment in aio_complete()
    drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode
    drivers/memstick/host/r592.c: convert to module_pci_driver
    drivers/memstick/host/jmb38x_ms: convert to module_pci_driver
    pps-gpio: add device-tree binding and support
    drivers/pps/clients/pps-gpio.c: convert to module_platform_driver
    drivers/pps/clients/pps-gpio.c: convert to devm_* helpers
    drivers/parport/share.c: use kzalloc
    Documentation/accounting/getdelays.c: avoid strncpy in accounting tool
    aoe: update internal version number to v83
    aoe: update copyright date
    aoe: perform I/O completions in parallel
    ...

    Linus Torvalds
     
  • Move __set_special_pids() from exit.c to sys.c close to its single caller
    and make it static.

    And rename it to set_special_pids(), another helper with this name has
    gone away.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

28 Jun, 2013

1 commit

  • * freezer:
    af_unix: use freezable blocking calls in read
    sigtimedwait: use freezable blocking call
    nanosleep: use freezable blocking call
    futex: use freezable blocking call
    select: use freezable blocking call
    epoll: use freezable blocking call
    binder: use freezable blocking calls
    freezer: add new freezable helpers using freezer_do_not_count()
    freezer: convert freezable helpers to static inline where possible
    freezer: convert freezable helpers to freezer_do_not_count()
    freezer: skip waking up tasks with PF_FREEZER_SKIP set
    freezer: shorten freezer sleep time using exponential backoff
    lockdep: check that no locks held at freeze time
    lockdep: remove task argument from debug_check_no_locks_held
    freezer: add unsafe versions of freezable helpers for CIFS
    freezer: add unsafe versions of freezable helpers for NFS

    Rafael J. Wysocki
     

15 Jun, 2013

1 commit

  • exit_notify() does exit_task_namespaces() after
    forget_original_parent(). This was needed to ensure that ->nsproxy
    can't be cleared prematurely, an exiting child we are going to
    reparent can do do_notify_parent() and use the parent's (ours) pid_ns.

    However, after 32084504 "pidns: use task_active_pid_ns in
    do_notify_parent" ->nsproxy != NULL is no longer needed, we rely
    on task_active_pid_ns().

    Move exit_task_namespaces() from exit_notify() to do_exit(), after
    exit_fs() and before exit_task_work().

    This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy()
    does fput() which needs task_work_add().

    Note: this particular problem can be fixed if we change fput(), and
    that change makes sense anyway. But there is another reason to move
    the callsite. The original reason for calling exit_task_namespaces() from
    the middle of exit_notify() was subtle and it has already gone away,
    now this looks confusing. And this allows us to simplify exit_notify(),
    we can avoid unlock/lock(tasklist) and we can use ->exit_state instead
    of PF_EXITING in forget_original_parent().

    Reported-by: Andrey Vagin
    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Acked-by: Andrey Vagin
    Signed-off-by: Al Viro

    Oleg Nesterov
     

12 May, 2013

1 commit

  • The only existing caller to debug_check_no_locks_held calls it
    with 'current' as the task, and the freezer needs to call
    debug_check_no_locks_held but doesn't already have a current
    task pointer, so remove the argument. It is already assuming
    that the current task is relevant by dumping the current stack
    trace as part of the warning.

    This was originally part of 6aa9707099c (lockdep: check that
    no locks held at freeze time) which was reverted in
    dbf520a9d7d4.

    Original-author: Mandeep Singh Baines
    Acked-by: Pavel Machek
    Acked-by: Tejun Heo
    Signed-off-by: Colin Cross
    Signed-off-by: Rafael J. Wysocki

    Colin Cross
     

02 May, 2013

1 commit

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds
     

01 May, 2013

1 commit

  • Pull compat cleanup from Al Viro:
    "Mostly about syscall wrappers this time; there will be another pile
    with patches in the same general area from various people, but I'd
    rather push those after both that and vfs.git pile are in."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
    syscalls.h: slightly reduce the jungles of macros
    get rid of union semop in sys_semctl(2) arguments
    make do_mremap() static
    sparc: no need to sign-extend in sync_file_range() wrapper
    ppc compat wrappers for add_key(2) and request_key(2) are pointless
    x86: trim sys_ia32.h
    x86: sys32_kill and sys32_mprotect are pointless
    get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC
    merge compat sys_ipc instances
    consolidate compat lookup_dcookie()
    convert vmsplice to COMPAT_SYSCALL_DEFINE
    switch getrusage() to COMPAT_SYSCALL_DEFINE
    switch epoll_pwait to COMPAT_SYSCALL_DEFINE
    convert sendfile{,64} to COMPAT_SYSCALL_DEFINE
    switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE
    make SYSCALL_DEFINE-generated wrappers do asmlinkage_protect
    make HAVE_SYSCALL_WRAPPERS unconditional
    consolidate cond_syscall and SYSCALL_ALIAS declarations
    teach SYSCALL_DEFINE how to deal with long long/unsigned long long
    get rid of duplicate logics in __SC_....[1-6] definitions

    Linus Torvalds
     

10 Apr, 2013

1 commit


01 Apr, 2013

1 commit

  • This reverts commit 6aa9707099c4b25700940eb3d016f16c4434360d.

    Commit 6aa9707099c4 ("lockdep: check that no locks held at freeze time")
    causes problems with NFS root filesystems. The failures were noticed on
    OMAP2 and 3 boards during kernel init:

    [ BUG: swapper/0/1 still has locks held! ]
    3.9.0-rc3-00344-ga937536 #1 Not tainted
    -------------------------------------
    1 lock held by swapper/0/1:
    #0: (&type->s_umount_key#13/1){+.+.+.}, at: [] sget+0x248/0x574

    stack backtrace:
    rpc_wait_bit_killable
    __wait_on_bit
    out_of_line_wait_on_bit
    __rpc_execute
    rpc_run_task
    rpc_call_sync
    nfs_proc_get_root
    nfs_get_root
    nfs_fs_mount_common
    nfs_try_mount
    nfs_fs_mount
    mount_fs
    vfs_kern_mount
    do_mount
    sys_mount
    do_mount_root
    mount_root
    prepare_namespace
    kernel_init_freeable
    kernel_init

    Although the rootfs mounts, the system is unstable. Here's a transcript
    from a PM test:

    http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt

    Here's what the test log should look like:

    http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt

    Mailing list discussion is here:

    http://lkml.org/lkml/2013/3/4/221

    Deal with this for v3.9 by reverting the problem commit, until folks can
    figure out the right long-term course of action.

    Signed-off-by: Paul Walmsley
    Cc: Mandeep Singh Baines
    Cc: Jeff Layton
    Cc: Shawn Guo
    Cc:
    Cc: Fengguang Wu
    Cc: Trond Myklebust
    Cc: Ingo Molnar
    Cc: Ben Chan
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Rafael J. Wysocki
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Walmsley
     

04 Mar, 2013

1 commit


28 Feb, 2013

2 commits

  • Prevents the hung_task detector from panicking the machine. This is also
    needed to prevent this wait from blocking suspend.

    (It doesn't currently block suspend but it would once the next
    patch in this series is applied.)

    [yongjun_wei@trendmicro.com.cn: kernel/exit.c: remove duplicated include]
    Signed-off-by: Mandeep Singh Baines
    Cc: Ben Chan
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Rafael J. Wysocki
    Cc: Ingo Molnar
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     
  • We shouldn't try_to_freeze if locks are held. Holding a lock can cause a
    deadlock if the lock is later acquired in the suspend or hibernate path
    (e.g. by dpm). Holding a lock can also cause a deadlock in the case of
    cgroup_freezer if a lock is held inside a frozen cgroup that is later
    acquired by a process outside that group.

    [akpm@linux-foundation.org: export debug_check_no_locks_held]
    Signed-off-by: Mandeep Singh Baines
    Cc: Ben Chan
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Rafael J. Wysocki
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mandeep Singh Baines
     

28 Jan, 2013

1 commit

  • This is in preparation for the full dynticks feature. While
    remotely reading the cputime of a task running in a full
    dynticks CPU, we'll need to do some extra-computation. This
    way we can account the time it spent tickless in userspace
    since its last cputime snapshot.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

18 Dec, 2012

1 commit

  • Pull user namespace changes from Eric Biederman:
    "While small this set of changes is very significant with respect to
    containers in general and user namespaces in particular. The user
    space interface is now complete.

    This set of changes adds support for unprivileged users to create user
    namespaces and as a user namespace root to create other namespaces.
    The tyranny of supporting suid root preventing unprivileged users from
    using cool new kernel features is broken.

    This set of changes completes the work on setns, adding support for
    the pid, user, mount namespaces.

    This set of changes includes a bunch of basic pid namespace
    cleanups/simplifications. Of particular significance is the rework of
    the pid namespace cleanup so it no longer requires sending out
    tendrils into all kinds of unexpected cleanup paths for operation. At
    least one case of broken error handling is fixed by this cleanup.

    The files under /proc/<pid>/ns/ have been converted from regular files
    to magic symlinks which prevents incorrect caching by the VFS,
    ensuring the files always refer to the namespace the process is
    currently using and ensuring that the ptrace_mayaccess permission
    checks are always applied.

    The files under /proc/<pid>/ns/ have been given stable inode numbers
    so it is now possible to see if different processes share the same
    namespaces.

    Through David Miller's net tree are changes to relax many of the
    permission checks in the networking stack to allow the user
    namespace root to usefully use the networking stack. Similar changes
    for the mount namespace and the pid namespace are coming through my
    tree.

    Two small changes to add user namespace support were committed here and
    in David Miller's -net tree so that I could complete the work on the
    /proc/<pid>/ns/ files in this tree.

    Work remains to make it safe to build user namespaces and 9p, afs,
    ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the
    Kconfig guard remains in place preventing user namespaces from
    being built when any of those filesystems are enabled.

    Future design work remains to allow root users outside of the initial
    user namespace to mount more than just /proc and /sys."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits)
    proc: Usable inode numbers for the namespace file descriptors.
    proc: Fix the namespace inode permission checks.
    proc: Generalize proc inode allocation
    userns: Allow unprivilged mounts of proc and sysfs
    userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file
    procfs: Print task uids and gids in the userns that opened the proc file
    userns: Implement unshare of the user namespace
    userns: Implent proc namespace operations
    userns: Kill task_user_ns
    userns: Make create_new_namespaces take a user_ns parameter
    userns: Allow unprivileged use of setns.
    userns: Allow unprivileged users to create new namespaces
    userns: Allow setting a userns mapping to your current uid.
    userns: Allow chown and setgid preservation
    userns: Allow unprivileged users to create user namespaces.
    userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped
    userns: fix return value on mntns_install() failure
    vfs: Allow unprivileged manipulation of the mount namespace.
    vfs: Only support slave subtrees across different user namespaces
    vfs: Add a user namespace reference from struct mnt_namespace
    ...

    Linus Torvalds
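
    The stable inode numbers mentioned in the merge message above can be
    used from user space to check whether two processes share a namespace.
    The following is a minimal sketch, not code from the merged series; the
    helper names same_namespace() and ns_path() are hypothetical and
    introduced only for illustration.

```python
import os


def ns_path(pid, ns):
    """Build the procfs path of a namespace entry, e.g. /proc/self/ns/net.

    `pid` may be a number or the string "self"; `ns` is a namespace name
    such as "pid", "net" or "mnt". (Hypothetical helper for illustration.)
    """
    return f"/proc/{pid}/ns/{ns}"


def same_namespace(path_a, path_b):
    """Return True when both paths resolve to the same (device, inode) pair.

    os.stat() follows the symlink, so for entries under /proc/<pid>/ns/
    the comparison is made on the namespace's own inode number.
    """
    a = os.stat(path_a)
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
```

    For example, same_namespace(ns_path("self", "net"), ns_path(1, "net"))
    would report whether the caller shares a network namespace with pid 1,
    assuming the caller is permitted to inspect that process.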
     

13 Dec, 2012

1 commit

  • Pull big execve/kernel_thread/fork unification series from Al Viro:
    "All architectures are converted to new model. Quite a bit of that
    stuff is actually shared with architecture trees; in such cases it's
    literally shared branch pulled by both, not a cherry-pick.

    A lot of ugliness and black magic is gone (-3KLoC total in this one):

    - kernel_thread()/kernel_execve()/sys_execve() redesign.

    We don't do syscalls from kernel anymore for either kernel_thread()
    or kernel_execve():

    kernel_thread() is essentially clone(2) with callback run before we
    return to userland, the callbacks either never return or do
    successful do_execve() before returning.

    kernel_execve() is a wrapper for do_execve() - it doesn't need to
    do transition to user mode anymore.

    As a result kernel_thread() and kernel_execve() are
    arch-independent now - they live in kernel/fork.c and fs/exec.c
    resp. sys_execve() is also in fs/exec.c and it's completely
    architecture-independent.

    - daemonize() is gone, along with its parts in fs/*.c

    - struct pt_regs * is no longer passed to do_fork/copy_process/
    copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump.

    - sys_fork()/sys_vfork()/sys_clone() unified; some architectures
    still need wrappers (ones with callee-saved registers not saved in
    pt_regs on syscall entry), but the main part of those suckers is in
    kernel/fork.c now."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits)
    do_coredump(): get rid of pt_regs argument
    print_fatal_signal(): get rid of pt_regs argument
    ptrace_signal(): get rid of unused arguments
    get rid of ptrace_signal_deliver() arguments
    new helper: signal_pt_regs()
    unify default ptrace_signal_deliver
    flagday: kill pt_regs argument of do_fork()
    death to idle_regs()
    don't pass regs to copy_process()
    flagday: don't pass regs to copy_thread()
    bfin: switch to generic vfork, get rid of pointless wrappers
    xtensa: switch to generic clone()
    openrisc: switch to use of generic fork and clone
    unicore32: switch to generic clone(2)
    score: switch to generic fork/vfork/clone
    c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone()
    take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h
    mn10300: switch to generic fork/vfork/clone
    h8300: switch to generic fork/vfork/clone
    tile: switch to generic clone()
    ...

    Conflicts:
    arch/microblaze/include/asm/Kbuild

    Linus Torvalds
     

29 Nov, 2012

2 commits

  • Signed-off-by: Al Viro

    Al Viro
     
  • We have thread_group_cputime() and thread_group_times(). The naming
    doesn't provide enough information about the difference between
    these two APIs.

    To lower the confusion, rename thread_group_times() to
    thread_group_cputime_adjusted(). This name better suggests that
    it's a version of thread_group_cputime() that does some stabilization
    on the raw cputime values, i.e. here: scaling on top of CFS runtime
    stats and bounding the lower value for monotonicity.

    Signed-off-by: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Paul Gortmaker

    Frederic Weisbecker
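
    The "scale, then bound for monotonicity" idea described above can be
    modelled in a few lines. This is a simplified sketch under stated
    assumptions, not the kernel implementation: the names PrevCputime and
    cputime_adjusted() are hypothetical, times are plain integers, and
    rtime stands in for the runtime reported by the scheduler.

```python
from dataclasses import dataclass


@dataclass
class PrevCputime:
    """Last values reported to user space (hypothetical model)."""
    utime: int = 0
    stime: int = 0


def cputime_adjusted(utime, stime, rtime, prev):
    """Scale tick-based utime/stime onto rtime and keep results monotonic.

    utime/stime are the raw sampled values, rtime is the more precise
    total runtime. The user share of rtime is derived from the raw ratio,
    then both outputs are bounded below by the previously reported values.
    """
    total = utime + stime
    if total:
        scaled_utime = rtime * utime // total
    else:
        # No ticks sampled yet: attribute all runtime to user time.
        scaled_utime = rtime

    # Never report a value lower than one already reported.
    prev.utime = max(prev.utime, scaled_utime)
    prev.stime = max(prev.stime, rtime - prev.utime)
    return prev.utime, prev.stime
```

    Calling the function repeatedly with the same PrevCputime object shows
    the bound at work: a later sample whose scaled user time would be
    lower than an earlier one still returns the earlier value.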
     

19 Nov, 2012

1 commit


03 Oct, 2012

1 commit

  • Pull vfs update from Al Viro:

    - big one - consolidation of descriptor-related logics; almost all of
    that is moved to fs/file.c

    (BTW, I'm seriously tempted to rename the result to fd.c. As it is,
    we have a situation when file_table.c is about handling of struct
    file and file.c is about handling of descriptor tables; the reasons
    are historical - file_table.c used to be about a static array of
    struct file we used to have way back).

    A lot of stray ends got cleaned up and converted to saner primitives,
    disgusting mess in android/binder.c is still disgusting, but at least
    doesn't poke so much in descriptor table guts anymore. A bunch of
    relatively minor races got fixed in process, plus an ext4 struct file
    leak.

    - related thing - fget_light() partially unuglified; see fdget() in
    there (and yes, it generates the code as good as we used to have).

    - also related - bits of Cyrill's procfs stuff that got entangled into
    that work; _not_ all of it, just the initial move to fs/proc/fd.c and
    switch of fdinfo to seq_file.

    - Alex's fs/coredump.c splitoff - the same story, had been easier to
    take that commit than mess with conflicts. The rest is a separate
    pile, this was just a mechanical code movement.

    - a few misc patches all over the place. Not all for this cycle,
    there'll be more (and quite a few currently sit in akpm's tree)."

    Fix up trivial conflicts in the android binder driver, and some fairly
    simple conflicts due to two different changes to the sock_alloc_file()
    interface ("take descriptor handling from sock_alloc_file() to callers"
    vs "net: Providing protocol type via system.sockprotoname xattr of
    /proc/PID/fd entries" adding a dentry name to the socket)

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits)
    MAX_LFS_FILESIZE should be a loff_t
    compat: fs: Generic compat_sys_sendfile implementation
    fs: push rcu_barrier() from deactivate_locked_super() to filesystems
    btrfs: reada_extent doesn't need kref for refcount
    coredump: move core dump functionality into its own file
    coredump: prevent double-free on an error path in core dumper
    usb/gadget: fix misannotations
    fcntl: fix misannotations
    ceph: don't abuse d_delete() on failure exits
    hypfs: ->d_parent is never NULL or negative
    vfs: delete surplus inode NULL check
    switch simple cases of fget_light to fdget
    new helpers: fdget()/fdput()
    switch o2hb_region_dev_write() to fget_light()
    proc_map_files_readdir(): don't bother with grabbing files
    make get_file() return its argument
    vhost_set_vring(): turn pollstart/pollstop into bool
    switch prctl_set_mm_exe_file() to fget_light()
    switch xfs_find_handle() to fget_light()
    switch xfs_swapext() to fget_light()
    ...

    Linus Torvalds
     

27 Sep, 2012

2 commits


25 Sep, 2012

1 commit

  • We currently use a per socket order-0 page cache for tcp_sendmsg()
    operations.

    This page is used to build fragments for skbs.

    It's done to increase the probability of coalescing small write() calls
    into single segments in skbs still in the write queue (not yet sent).

    But it wastes a lot of memory for applications handling many mostly
    idle sockets, since each socket holds one page in sk->sk_sndmsg_page.

    It's also quite inefficient to build TSO 64KB packets, because we need
    about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
    page allocator more than wanted.

    This patch adds a per task frag allocator and uses bigger pages,
    if available. An automatic fallback is done in case of memory pressure.

    (up to 32768 bytes per frag, that's order-3 pages on x86)

    This increases TCP stream performance by 20% on the loopback device,
    but also benefits other network devices, since 8x fewer frags are
    mapped on transmit and unmapped on tx completion. Alexander Duyck
    mentioned a probable performance win on systems with IOMMU enabled.

    It's possible some SG enabled hardware can't cope with bigger fragments,
    but their ndo_start_xmit() should already handle this, splitting a
    fragment into sub fragments, since some arches have PAGE_SIZE=65536.

    Successfully tested on various ethernet devices.
    (ixgbe, igb, bnx2x, tg3, mellanox mlx4)

    Signed-off-by: Eric Dumazet
    Cc: Ben Hutchings
    Cc: Vijay Subramanian
    Cc: Alexander Duyck
    Tested-by: Vijay Subramanian
    Signed-off-by: David S. Miller

    Eric Dumazet
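
    The fragment counts quoted in the message above follow from simple
    arithmetic, worked through here as a sketch. The function name
    frags_needed() and the constants are illustrative only and do not
    come from the patch.

```python
PAGE_SIZE = 4096          # bytes, as on x86 in the message above
FRAG_ORDER = 3            # order-3 allocation: 2**3 contiguous pages
TSO_PACKET = 64 * 1024    # a 64KB TSO packet


def frags_needed(payload_bytes, frag_bytes):
    """Number of fragments needed to hold payload_bytes (ceiling division)."""
    return -(-payload_bytes // frag_bytes)


big_frag = PAGE_SIZE << FRAG_ORDER                    # 32768 bytes
order0_frags = frags_needed(TSO_PACKET, PAGE_SIZE)    # one page per frag
order3_frags = frags_needed(TSO_PACKET, big_frag)     # one order-3 block per frag
```

    With order-0 pages the packet needs 16 fragments, with 32768-byte
    fragments it needs 2, which is the 8x reduction in mapped and unmapped
    frags that the message refers to.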