22 Nov, 2016

1 commit

  • Exactly because for_each_thread() in autogroup_move_group() can't see it
    and update its ->sched_task_group before _put() and possibly free().

    So the exiting task needs another sched_move_task() before exit_notify()
    and we need to re-introduce the PF_EXITING (or similar) check removed by
    the previous change for another reason.
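
    Concretely, the fix adds an exit-path hook along these lines (a sketch
    based on the upstream patch; details may differ):

        /* called from do_exit() before exit_notify() */
        void sched_autogroup_exit_task(struct task_struct *p)
        {
                /*
                 * autogroup_move_group() can no longer see this thread
                 * after exit_notify(), so move it to its final task_group
                 * now, while signal->autogroup is still valid.
                 */
                sched_move_task(p);
        }

    and re-introduces a PF_EXITING check in task_wants_autogroup() so that
    an already-moved zombie is never moved back to autogroup->tg.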

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hartsjc@redhat.com
    Cc: vbendel@redhat.com
    Cc: vlovejoy@redhat.com
    Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

08 Oct, 2016

1 commit

  • There are no users of exit_oom_victim() on a !current task anymore, so
    enforce the API to always work on the current task.
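
    The resulting API drops the task argument entirely; roughly (a sketch,
    assuming the oom_victims counter and waitqueue names used by the OOM
    killer code at the time):

        void exit_oom_victim(void)
        {
                /* always operates on current */
                clear_thread_flag(TIF_MEMDIE);

                if (!atomic_dec_return(&oom_victims))
                        wake_up_all(&oom_victims_wait);
        }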

    Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

22 Sep, 2016

1 commit

  • Oleg noted that by making do_exit() use __schedule() for the TASK_DEAD
    context switch, we can avoid the TASK_DEAD special case currently in
    __schedule() because that avoids the extra preempt_disable() from
    schedule().

    In order to facilitate this, create a do_task_dead() helper which we
    place in the scheduler code, such that it can access __schedule().

    Also add some __noreturn annotations to the functions; there's no
    coming back from do_exit().
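
    The helper ends up looking roughly like this (a sketch of the upstream
    code):

        void __noreturn do_task_dead(void)
        {
                /* causes the final put_task_struct() in finish_task_switch() */
                __set_current_state(TASK_DEAD);
                current->flags |= PF_NOFREEZE;  /* tell freezer to ignore us */
                __schedule(false);
                BUG();
                /* avoid "noreturn function does return" warnings */
                for (;;)
                        cpu_relax();
        }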

    Suggested-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Cheng Chao
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: chris@chris-wilson.co.uk
    Cc: tj@kernel.org
    Link: http://lkml.kernel.org/r/20160913163729.GB5012@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

02 Sep, 2016

1 commit

  • KASAN allocates memory from the page allocator as part of
    kmem_cache_free(), and that can reference current->mempolicy through any
    number of allocation functions. It needs to be NULL'd out before the
    final reference is dropped to prevent a use-after-free bug:

    BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
    CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
    ...
    Call Trace:
    dump_stack
    kasan_object_err
    kasan_report_error
    __asan_report_load2_noabort
    alloc_pages_current
    [...]

    Set current->mempolicy to NULL before dropping the final reference.
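
    In do_exit() that amounts to something like (a sketch; the actual hunk
    is guarded by CONFIG_NUMA):

        #ifdef CONFIG_NUMA
                task_lock(tsk);
                mpol_put(tsk->mempolicy);
                tsk->mempolicy = NULL;
                task_unlock(tsk);
        #endif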

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
    Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
    Signed-off-by: David Rientjes
    Reported-by: Vegard Nossum
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

03 Aug, 2016

1 commit

  • Many targets enable CONFIG_DEBUG_STACK_USAGE, and while the information
    is useful, it isn't worthy of pr_warn(). Reduce it to pr_info().
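
    I.e., roughly (a sketch of check_stack_usage() in kernel/exit.c):

        pr_info("%s (%d) used greatest stack depth: %lu bytes left\n",
                tsk->comm, task_pid_nr(tsk), free);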

    Link: http://lkml.kernel.org/r/1466982072-29836-1-git-send-email-anton@ozlabs.org
    Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     

26 Jul, 2016

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - introduce and use task_rcu_dereference()/try_get_task_struct() to fix
    and generalize task_struct handling (Oleg Nesterov)

    - do various per entity load tracking (PELT) fixes and optimizations
    (Peter Zijlstra)

    - cputime virt-steal time accounting enhancements/fixes (Wanpeng Li)

    - introduce consolidated cputime output file cpuacct.usage_all and
    related refactorings (Zhao Lei)

    - ... plus misc fixes and enhancements

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is set
    sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together
    sched/cpuacct: Use loop to consolidate code in cpuacct_stats_show()
    sched/cpuacct: Merge cpuacct_usage_index and cpuacct_stat_index enums
    sched/fair: Rework throttle_count sync
    sched/core: Fix sched_getaffinity() return value kerneldoc comment
    sched/fair: Reorder cgroup creation code
    sched/fair: Apply more PELT fixes
    sched/fair: Fix PELT integrity for new tasks
    sched/cgroup: Fix cpu_cgroup_fork() handling
    sched/fair: Fix PELT integrity for new groups
    sched/fair: Fix and optimize the fork() path
    sched/cputime: Add steal time support to full dynticks CPU time accounting
    sched/cputime: Fix prev steal time accouting during CPU hotplug
    KVM: Fix steal clock warp during guest CPU hotplug
    sched/debug: Always show 'nr_migrations'
    sched/fair: Use task_rcu_dereference()
    sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
    sched/idle: Optimize the generic idle loop
    sched/fair: Fix the wrong throttled clock time for cfs_rq_clock_task()

    Linus Torvalds
     

14 Jun, 2016

1 commit

  • With the modified semantics of spin_unlock_wait() a number of
    explicit barriers can be removed. Also update the comment for the
    do_exit() usecase, as that was somewhat stale/obscure.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Jun, 2016

1 commit

  • Generally, task_struct is only protected by RCU if it was found on an
    RCU-protected list (say, for_each_process() or find_task_by_vpid()).

    As Kirill pointed out, rq->curr isn't protected by RCU: the scheduler
    drops the (potentially) last reference without an RCU grace period.
    This means that we need to fix the code which uses foreign_rq->curr
    under rcu_read_lock().

    Add a new helper which can be used to dereference rq->curr or any
    other pointer to task_struct assuming that it should be cleared or
    updated before the final put_task_struct(). It returns non-NULL
    only if this task can't go away before rcu_read_unlock().

    ( Also add try_get_task_struct() to make it easier to use this API
    correctly. )
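
    The intended usage pattern is roughly (a sketch; do_something() is a
    placeholder):

        struct task_struct *p;

        /* p is only stable inside the RCU read-side critical section */
        rcu_read_lock();
        p = task_rcu_dereference(&rq->curr);
        if (p)
                do_something(p);
        rcu_read_unlock();

        /* or take a real reference that outlives the critical section */
        p = try_get_task_struct(&rq->curr);
        if (p) {
                do_something(p);
                put_task_struct(p);
        }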

    Suggested-by: Kirill Tkhai
    Signed-off-by: Oleg Nesterov
    [ Updated comments; added try_get_task_struct()]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

24 May, 2016

2 commits

  • I see no reason why waitid() can't support the other Linux-specific
    flags allowed in sys_wait4().

    In particular this change can help if we reconsider the previous change
    ("wait/ptrace: assume __WALL if the child is traced"), which adds the
    "automagical" __WALL for the debugger.

    Signed-off-by: Oleg Nesterov
    Cc: Dmitry Vyukov
    Cc: Denys Vlasenko
    Cc: Jan Kratochvil
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Pedro Alves
    Cc: Roland McGrath
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The following program (a simplified version of one generated by
    syzkaller)

        #include <stdio.h>
        #include <signal.h>
        #include <unistd.h>
        #include <sys/ptrace.h>
        #include <pthread.h>

        void *thread_func(void *arg)
        {
                ptrace(PTRACE_TRACEME, 0, 0, 0);
                return 0;
        }

        int main(void)
        {
                pthread_t thread;

                if (fork())
                        return 0;

                while (getppid() != 1)
                        ;

                pthread_create(&thread, NULL, thread_func, NULL);
                pthread_join(thread, NULL);
                return 0;
        }

    creates an unreapable zombie if /sbin/init doesn't use __WALL.

    This is not a kernel bug, at least in the sense that everything works
    as expected: the debugger should reap a traced sub-thread before it can
    reap the leader, but without __WALL/__WCLONE do_wait() ignores
    sub-threads.

    Unfortunately, it seems that /sbin/init in most (all?) distributions
    doesn't use it and we have to change the kernel to avoid the problem.
    Note also that most inits use sys_waitid(), which doesn't allow __WALL,
    so the necessary user-space fix is not that trivial.

    This patch just adds the "ptrace" check into eligible_child(). To some
    degree this matches the "tsk->ptrace" check in exit_notify():
    ->exit_signal is mostly ignored when the tracee reports to the
    debugger. Likewise WSTOPPED: the tracer doesn't need to set this flag
    to wait for the stopped tracee.

    This obviously means a user-visible change: __WCLONE and __WALL no
    longer have any meaning for a debugger. And I can only hope that this
    won't break something, but at least strace/gdb won't suffer.

    We could make a more conservative change. Say, we could take __WCLONE
    into account, or !thread_group_leader(). But it would be nice to not
    complicate these historical/confusing checks.

    Signed-off-by: Oleg Nesterov
    Reported-by: Dmitry Vyukov
    Cc: Denys Vlasenko
    Cc: Jan Kratochvil
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Pedro Alves
    Cc: Roland McGrath
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

21 May, 2016

1 commit

  • We need to call exit_thread() from copy_process() in a failure path,
    so make it accept a task_struct parameter.

    [v2]
    * s390: exit_thread_runtime_instr doesn't make sense to be called for
    non-current tasks.
    * arm: fix the comment in vfp_thread_copy
    * change 'me' to 'tsk' for task_struct
    * now we can change only archs that actually have exit_thread
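
    The signature change, in short (a sketch):

        /* before: always acted on current */
        void exit_thread(void);

        /*
         * after: explicit task, so copy_process() can call it on the
         * half-constructed child
         */
        void exit_thread(struct task_struct *tsk);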

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jiri Slaby
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"
    Cc: "James E.J. Bottomley"
    Cc: Aurelien Jacquiot
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Liqin
    Cc: Chris Metcalf
    Cc: Chris Zankel
    Cc: David Howells
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Guan Xuetao
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James Hogan
    Cc: Jeff Dike
    Cc: Jesper Nilsson
    Cc: Jiri Slaby
    Cc: Jonas Bonn
    Cc: Koichi Yasutake
    Cc: Lennox Wu
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mikael Starvik
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Richard Henderson
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Russell King
    Cc: Steven Miao
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

26 Mar, 2016

1 commit

  • When the oom reaper manages to unmap all the eligible vmas, there
    shouldn't be much freeable memory held by the oom victim left anymore,
    so it makes sense to clear the TIF_MEMDIE flag for the victim and allow
    the OOM killer to select another task.

    The lack of TIF_MEMDIE also means that the victim cannot access memory
    reserves anymore, but that shouldn't be a problem because it would get
    the access again if it needs to allocate and hits the OOM killer again
    due to the fatal_signal_pending resp. PF_EXITING check. We can safely
    hide the task from the OOM killer because it is clearly not a good
    candidate anymore, as everything reclaimable has been torn down
    already.

    This patch makes it possible to cap the time an OOM victim can keep
    TIF_MEMDIE and thus hold off further global OOM killer actions,
    provided the oom reaper is able to take mmap_sem for the associated mm
    struct. This is not guaranteed now, but further steps should make sure
    that taking mmap_sem for write is blocked killable, which will help to
    reduce such lock contention. That is not done by this patch.

    Note that exit_oom_victim might be called on a remote task from
    __oom_reap_task now, so we have to check and clear the flag atomically;
    otherwise we might race and underflow oom_victims or wake up waiters
    too early.
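
    A sketch of the resulting helper (counter and waitqueue names assumed
    from the surrounding OOM code; treat as illustrative):

        void exit_oom_victim(struct task_struct *tsk)
        {
                /* may now be called on a remote task by the oom reaper */
                if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
                        return;

                if (!atomic_dec_return(&oom_victims))
                        wake_up_all(&oom_victims_wait);
        }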

    Signed-off-by: Michal Hocko
    Suggested-by: Johannes Weiner
    Suggested-by: Tetsuo Handa
    Cc: Andrea Argangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

23 Mar, 2016

1 commit

  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall
    inputs. To achieve this goal it does not collect coverage in soft/hard
    interrupts, and instrumentation of some inherently non-deterministic
    or non-interesting parts of the kernel is disabled (e.g. scheduler,
    locking).

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on the kernel side. The
    complementary compiler support was added in gcc revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage can be just a dozen basic blocks (e.g. an invalid
    input). In such a context gcov becomes prohibitively expensive, as the
    reset/collect coverage steps depend on the total number of basic
    blocks/edges in the program (about 2M in the case of the kernel). The
    cost of kcov depends only on the number of executed basic blocks/edges.
    On top of that, the kernel requires per-thread coverage because there
    are always background threads and unrelated processes that also produce
    coverage. With inlined gcov instrumentation per-thread coverage is not
    possible.

    kcov exposes kernel PCs and control flow to user-space which is
    insecure. But debugfs should not be mapped as user accessible.
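
    A condensed version of the usage example that ships with the patch's
    documentation (error handling trimmed; treat as a sketch):

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define KCOV_INIT_TRACE  _IOR('c', 1, unsigned long)
        #define KCOV_ENABLE      _IO('c', 100)
        #define KCOV_DISABLE     _IO('c', 101)
        #define COVER_SIZE       (64 << 10)

        int main(void)
        {
                unsigned long *cover, n, i;
                int fd = open("/sys/kernel/debug/kcov", O_RDWR);

                ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
                cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                ioctl(fd, KCOV_ENABLE, 0);
                cover[0] = 0;           /* reset coverage */

                read(-1, NULL, 0);      /* the traced syscall */

                n = cover[0];           /* number of collected PCs */
                for (i = 0; i < n; i++)
                        printf("0x%lx\n", cover[i + 1]);

                ioctl(fd, KCOV_DISABLE, 0);
                munmap(cover, COVER_SIZE * sizeof(unsigned long));
                close(fd);
                return 0;
        }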

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

21 Jan, 2016

2 commits


04 Nov, 2015

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this cycle were:

    - sched/fair load tracking fixes and cleanups (Byungchul Park)

    - Make load tracking frequency scale invariant (Dietmar Eggemann)

    - sched/deadline updates (Juri Lelli)

    - stop machine fixes, cleanups and enhancements for bugs triggered by
    CPU hotplug stress testing (Oleg Nesterov)

    - scheduler preemption code rework: remove PREEMPT_ACTIVE and related
    cleanups (Peter Zijlstra)

    - Rework the sched_info::run_delay code to fix races (Peter Zijlstra)

    - Optimize per entity utilization tracking (Peter Zijlstra)

    - ... misc other fixes, cleanups and smaller updates"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
    sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
    sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
    sched: Start stopper early
    stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
    stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
    stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
    stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
    stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
    sched/x86: Fix typo in __switch_to() comments
    sched/core: Remove a parameter in the migrate_task_rq() function
    sched/core: Drop unlikely behind BUG_ON()
    sched/core: Fix task and run queue sched_info::run_delay inconsistencies
    sched/numa: Fix task_tick_fair() from disabling numa_balancing
    sched/core: Add preempt_count invariant check
    sched/core: More notrace annotations
    sched/core: Kill PREEMPT_ACTIVE
    sched/core, sched/x86: Kill thread_info::saved_preempt_count
    sched/core: Simplify preempt_count tests
    sched/core: Robustify preemption leak checks
    sched/core: Stop setting PREEMPT_ACTIVE
    ...

    Linus Torvalds
     

07 Oct, 2015

1 commit

  • Currently, __srcu_read_lock() cannot be invoked from restricted
    environments because it contains calls to preempt_disable() and
    preempt_enable(), both of which can invoke lockdep, which is a bad
    idea in some restricted execution modes. This commit therefore moves
    the preempt_disable() and preempt_enable() from __srcu_read_lock()
    to srcu_read_lock(). It also inserts the preempt_disable() and
    preempt_enable() around the call to __srcu_read_lock() in do_exit().
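
    I.e., the wrapper now looks roughly like this (a sketch):

        static inline int srcu_read_lock(struct srcu_struct *sp)
        {
                int idx;

                preempt_disable();      /* moved out of __srcu_read_lock() */
                idx = __srcu_read_lock(sp);
                preempt_enable();
                return idx;
        }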

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

06 Oct, 2015

1 commit

  • When we warn about a preempt_count leak, reset the preempt_count to
    the known good value such that the problem does not ripple forward.

    This is most important on x86 which has a per cpu preempt_count that is
    not saved/restored (after this series). So if you schedule with an
    invalid (!2*PREEMPT_DISABLE_OFFSET) preempt_count the next task is
    messed up too.

    Enforcing this invariant limits the borkage to just the one task.
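
    A sketch of the check, per the upstream schedule_debug() change:

        if (unlikely(in_atomic_preempt_off())) {
                __schedule_bug(prev);
                /* reset to the known good value; don't ripple forward */
                preempt_count_set(PREEMPT_DISABLED);
        }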

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Frederic Weisbecker
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Aug, 2015

1 commit


26 Jun, 2015

1 commit

  • There is a helpful comment in do_exit() that states we sync the mm's RSS
    info before statistics gathering.

    The function that does the statistics gathering is called right above that
    comment.

    Change the code to obey the comment.
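
    I.e., in do_exit() (a sketch of the reordering):

        /* sync mm's RSS info before statistics gathering */
        if (tsk->mm)
                sync_mm_rss(tsk->mm);

        acct_update_integrals(tsk);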

    Signed-off-by: Rik van Riel
    Cc: Oleg Nesterov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 Jun, 2015

1 commit

  • Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking
    are related in functionality, but the interface is not symmetrical at
    all: one is an internal OOM killer function used during the killing, the
    other is for an OOM victim to signal its own death on exit later on.
    This has locking implications, see follow-up changes.

    While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
    is easier on the eye.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Apr, 2015

1 commit

  • All users of exec_domain are gone, so we can finally get rid of that
    abandoned feature. To avoid breaking existing userspace we keep a dummy
    /proc/execdomains file which will always contain "0-0 Linux [kernel]".

    Signed-off-by: Richard Weinberger

    Richard Weinberger
     

12 Feb, 2015

2 commits

  • Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM
    suspend") left a race window where the OOM killer manages to
    note_oom_kill after freeze_processes checks the counter. The race
    window is quite small and really unlikely, and a partial solution was
    deemed sufficient at the time of submission.

    Tejun wasn't happy about this partial solution though and insisted on a
    full solution. That requires the full OOM and freezer's task freezing
    exclusion, though. This is done by this patch which introduces oom_sem
    RW lock and turns oom_killer_disable() into a full OOM barrier.

    oom_killer_disabled check is moved from the allocation path to the OOM
    level and we take oom_sem for reading for both the check and the whole
    OOM invocation.

    oom_killer_disable() takes oom_sem for writing, so it waits for all
    currently running OOM killer invocations. Then it disables all further
    OOMs by setting oom_killer_disabled and checks for any oom victims.
    Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim.
    The last victim wakes up all waiters enqueued by oom_killer_disable().
    Therefore this function acts as the full OOM barrier.
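
    A sketch of the barrier (the waitqueue name is assumed; the counter
    behaviour is as described above):

        bool oom_killer_disable(void)
        {
                /* wait for all currently running OOM killer invocations */
                down_write(&oom_sem);
                oom_killer_disabled = true;
                up_write(&oom_sem);

                /* the last victim's unmark wakes us up */
                wait_event(oom_victims_wait, !atomic_read(&oom_victims));
                return true;
        }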

    The page fault path is covered now as well although it was assumed to be
    safe before. As per Tejun, "We used to have freezing points deep in file
    system code which may be reacheable from page fault." so it would be
    better and more robust to not rely on freezing points here. Same applies
    to the memcg OOM killer.

    out_of_memory tells the caller whether the OOM was allowed to trigger and
    the callers are supposed to handle the situation. The page allocation
    path simply fails the allocation same as before. The page fault path will
    retry the fault (more on that later) and Sysrq OOM trigger will simply
    complain to the log.

    Normally there wouldn't be any unfrozen user tasks after
    try_to_freeze_tasks so the function will not block. But if there was an
    OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
    finish yet then we have to wait for it. This should complete in a finite
    time, though, because

    - the victim cannot loop in the page fault handler (it would die
    on the way out from the exception)
    - it cannot loop in the page allocator because all the further
    allocation would fail and __GFP_NOFAIL allocations are not
    acceptable at this stage
    - it shouldn't be blocked on any locks held by frozen tasks
    (try_to_freeze expects lockless context) and kernel threads and
    work queues are not frozen yet

    Signed-off-by: Michal Hocko
    Suggested-by: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This patchset addresses a race which was described in the changelog for
    5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"):

    : PM freezer relies on having all tasks frozen by the time devices are
    : getting frozen so that no task will touch them while they are getting
    : frozen. But OOM killer is allowed to kill an already frozen task in order
    : to handle OOM situtation. In order to protect from late wake ups OOM
    : killer is disabled after all tasks are frozen. This, however, still keeps
    : a window open when a killed task didn't manage to die by the time
    : freeze_processes finishes.

    The original patch didn't close the race window completely, because
    that would require a more complex solution, as can be seen from this
    patchset.

    The primary motivation was to close the race condition between the OOM
    killer and the PM freezer _completely_. As Tejun pointed out, even
    though the race condition is unlikely, it would be that much harder to
    debug weird bugs deep in the PM freezer when the debugging options are
    reduced considerably. I can only speculate what might happen when a
    task is still runnable unexpectedly.

    On the plus side, and as a side effect, the oom enable/disable now has
    a better (full barrier) semantic without polluting hot paths.

    I have tested the series in KVM with 100M RAM:
    - many small tasks (20M anon mmap) which are triggering OOM continually
    - s2ram which resumes automatically is triggered in a loop:

          echo processors > /sys/power/pm_test
          while true
          do
                  echo mem > /sys/power/state
                  sleep 1s
          done

    - simple module which allocates and frees 20M in 8K chunks. If it sees
      freezing(current) then it tries another round of allocation before
      calling try_to_freeze
    - debugging messages of PM stages and OOM killer enable/disable/fail
      added, and unmark_oom_victim is delayed by 1s after it clears
      TIF_MEMDIE and before it wakes up waiters
    - rebased on top of the current mmotm, which means some necessary
      updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under
      task_lock, but I think this should be OK because __thaw_task
      shouldn't interfere with any locking down wake_up_process. Oleg?

    As expected there are no OOM killed tasks after oom is disabled and
    allocations requested by the kernel thread are failing after all the tasks
    are frozen and OOM disabled. I wasn't able to catch a race where
    oom_killer_disable would really have to wait, but then I kind of
    expected that, as the race is really unlikely.

    [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
    [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1
    [ 243.636072] (elapsed 2.837 seconds) done.
    [ 243.641985] Trying to disable OOM killer
    [ 243.643032] Waiting for concurent OOM victims
    [ 243.644342] OOM killer disabled
    [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
    [ 243.652983] Suspending console(s) (use no_console_suspend to debug)
    [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
    [...]
    [ 243.992600] PM: suspend of devices complete after 336.667 msecs
    [ 243.993264] PM: late suspend of devices complete after 0.660 msecs
    [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs
    [ 243.994717] ACPI: Preparing to enter system sleep state S3
    [ 243.994795] PM: Saving platform NVS memory
    [ 243.994796] Disabling non-boot CPUs ...

    The first 2 patches are simple cleanups for OOM. They should go in
    regardless of the rest, IMO.

    Patches 3 and 4 are trivial printk -> pr_info conversions and should
    go in ditto.

    The main patch is the last one and I would appreciate acks from Tejun
    and Rafael. I think the OOM part should be OK (except for __thaw_task
    vs. task_lock, where a look from Oleg would be appreciated) but I am
    not so sure I haven't screwed anything in the freezer code. I have
    found several surprises there.

    This patch (of 5):

    This patch is just preparatory and it doesn't introduce any functional
    change.

    Note:
    I am utterly unhappy about the lowmemory killer abusing TIF_MEMDIE just
    to wait for the oom victim and to prevent new killing. This is just a
    side effect of the flag. The primary meaning is to give the oom victim
    access to memory reserves, and that shouldn't be necessary here.

    Signed-off-by: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Jan, 2015

1 commit

  • wait_consider_task() checks EXIT_ZOMBIE after EXIT_DEAD/EXIT_TRACE,
    and both checks can fail if we race with an EXIT_ZOMBIE ->
    EXIT_DEAD/EXIT_TRACE change in between: gcc is allowed to reload
    p->exit_state after security_task_wait(). In this case ->notask_error
    will be wrongly cleared and do_wait() can hang forever if it was the
    last eligible child.

    Many thanks to Arne who carefully investigated the problem.

    Note: this bug is very old but it was purely theoretical until commit
    b3ab03160dfa ("wait: completely ignore the EXIT_DEAD tasks"). Before
    that commit, "-O2" was probably enough to guarantee that the compiler
    wouldn't read ->exit_state twice.
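
    The fix is to snapshot the state once (a sketch; the original patch
    used the then-current ACCESS_ONCE()):

        int exit_state = READ_ONCE(p->exit_state);

        if (exit_state == EXIT_DEAD)
                return 0;
        /* all later checks use exit_state, not p->exit_state */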

    Signed-off-by: Oleg Nesterov
    Reported-by: Arne Goedeke
    Tested-by: Arne Goedeke
    Cc: [3.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

15 Dec, 2014

1 commit

  • Pull tty/serial driver updates from Greg KH:
    "Here's the big tty/serial driver update for 3.19-rc1.

    There are a number of TTY core changes/fixes in here from Peter Hurley
    that have all been teted in linux-next for a long time now. There are
    also the normal serial driver updates as well, full details in the
    changelog below"

    * tag 'tty-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (219 commits)
    serial: pxa: hold port.lock when reporting modem line changes
    tty-hvsi_lib: Deletion of an unnecessary check before the function call "tty_kref_put"
    tty: Deletion of unnecessary checks before two function calls
    n_tty: Fix read_buf race condition, increment read_head after pushing data
    serial: of-serial: add PM suspend/resume support
    Revert "serial: of-serial: add PM suspend/resume support"
    Revert "serial: of-serial: fix up PM ops on no_console_suspend and port type"
    serial: 8250: don't attempt a trylock if in sysrq
    serial: core: Add big-endian iotype
    serial: samsung: use port->fifosize instead of hardcoded values
    serial: samsung: prefer to use fifosize from driver data
    serial: samsung: fix style problems
    serial: samsung: wait for transfer completion before clock disable
    serial: icom: fix error return code
    serial: tegra: clean up tty-flag assignments
    serial: Fix io address assign flow with Fintek PCI-to-UART Product
    serial: mxs-auart: fix tx_empty against shift register
    serial: mxs-auart: fix gpio change detection on interrupt
    serial: mxs-auart: Fix mxs_auart_set_ldisc()
    serial: 8250_dw: Use 64-bit access for OCTEON.
    ...

    Linus Torvalds
     

11 Dec, 2014

14 commits

  • After the previous change we can add just the exiting EXIT_DEAD task to
    the "dead" list and remove another release_task(tsk).

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Shift "release dead children" loop from forget_original_parent() to its
    caller, exit_notify(). It is safe to reap them even if our parent reaps
    us right after we drop tasklist_lock, those children no longer have any
    connection to the exiting task.

    And this allows us to avoid write_lock_irq(tasklist_lock) right after it
    was released by forget_original_parent(), we can simply call it with
    tasklist_lock held.

    While at it, move the comment about forget_original_parent() up to
    this function.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that the pid_ns logic was isolated, we can change
    forget_original_parent() to return right after find_child_reaper()
    when father->children is empty; there is nothing to reparent in this
    case.

    In particular this avoids find_alive_thread(), and this can help if
    the whole process exits and it has a lot of PF_EXITING threads at the
    start of the thread list; this can easily lead to O(nr_threads**2)
    iterations.

    Trivial test case (tested under KVM, 2 CPUs):

        #include <stdlib.h>
        #include <unistd.h>
        #include <signal.h>
        #include <pthread.h>
        #include <sys/wait.h>
        #include <assert.h>

        void *tfunc(void *arg)
        {
                pause();
                return NULL;
        }

        int child(unsigned int nt)
        {
                pthread_t pt;

                while (nt--)
                        assert(pthread_create(&pt, NULL, tfunc, NULL) == 0);

                pthread_kill(pt, SIGTRAP);
                pause();
                return 0;
        }

        int main(int argc, const char *argv[])
        {
                int stat;
                unsigned int nf = atoi(argv[1]);
                unsigned int nt = atoi(argv[2]);

                while (nf--) {
                        if (!fork())
                                return child(nt);

                        wait(&stat);
                        assert(stat == SIGTRAP);
                }

                return 0;
        }

    $ time ./test 16 16536 shows:

                 real       user        sys
        -   5m37.628s   0m4.437s   8m5.560s
        +   0m50.032s   0m7.130s   1m4.927s

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Add the new simple helper to factor out the for_each_thread() code in
    find_child_reaper() and find_new_reaper(). It can also simplify the
    potential PF_EXITING -> exit_state change, plus perhaps we can change this
    code to take SIGNAL_GROUP_EXIT into account.
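
    Presumably along these lines (the helper as introduced upstream;
    treat as a sketch):

        static struct task_struct *find_alive_thread(struct task_struct *p)
        {
                struct task_struct *t;

                for_each_thread(p, t) {
                        if (!(t->flags & PF_EXITING))
                                return t;
                }
                return NULL;
        }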

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Kay Sievers
    Cc: Lennart Poettering
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • find_new_reaper() does two completely different things. Not only does
    it find a reaper, it also updates pid_ns->child_reaper, or kills the
    whole namespace if the caller is ->child_reaper.

    Now that the has_child_subreaper logic doesn't depend on the
    child_reaper check, we can move that pid_ns code into a separate
    helper. IMHO this makes the code cleaner, and this allows the next
    changes.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Kay Sievers
    Cc: Lennart Poettering
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Swap the "init_task" and same_thread_group() checks. This way it is more
    simple to document these checks and we can remove the link to the previous
    discussion on lkml.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Kay Sievers
    Cc: Lennart Poettering
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change find_new_reaper() to use for_each_thread() instead of the
    deprecated while_each_thread(). We do not bother to check
    "thread != father" in the 1st loop; we can rely on the PF_EXITING
    check.

    Note: this means a minor behavioural change: for_each_thread() starts
    from the group leader. But this should be fine, nobody should make any
    assumption about do_wait(__WNOTHREAD) when it comes to reparented
    tasks, and this can avoid the pointless reparenting to a short-living
    thread; zombie leaders are not that common.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Kay Sievers
    Cc: Lennart Poettering
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • find_new_reaper() assumes that "has_child_subreaper" logic is safe as
    long as we are not the exiting ->child_reaper and this is doubly wrong:

    1. In fact it is safe if "pid_ns->child_reaper == father"; there must
    be no children after zap_pid_ns_processes() returns, so it doesn't
    matter what we return in this case and even pid_ns->child_reaper is
    wrong otherwise: we can't reparent to ->child_reaper == current.

    This is not a bug, but this is confusing.

    2. It is not safe if we are not pid_ns->child_reaper but from the same
    thread group. We drop tasklist_lock before zap_pid_ns_processes(),
    so another thread can lock it and choose the new reaper from the
    upper namespace if has_child_subreaper == T, and this is obviously
    wrong.

    This is not that bad; zap_pid_ns_processes() won't return until the
    new reaper reaps all zombies, but this should be fixed anyway.

    We could change for_each_thread() loop to use ->exit_state instead of
    PF_EXITING which we had to use until 8aac62706ada, or we could change
    copy_signal() to check CLONE_NEWPID before setting has_child_subreaper,
    but let's change this code so that it is clear we can't look outside of
    our namespace, otherwise same_thread_group(reaper, child_reaper) check
    will look wrong and confusing anyway.

    We can simply start from "father" and fix the problem. We can't wrongly
    return a thread from the same thread group if ->is_child_subreaper == T;
    we know that all threads have PF_EXITING set.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Kay Sievers
    Cc: Lennart Poettering
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The ->has_child_subreaper code in find_new_reaper() finds alive "thread"
    but returns another "reaper" thread which can be dead.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Kay Sievers
    Cc: Lennart Poettering
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Contrary to what the comment in __exit_signal() says, we do account
    the group leader. Fix this and explain why.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Rik van Riel
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • wait_task_zombie() no longer needs tasklist_lock to accumulate the
    psig->c* counters; we can drop it right after cmpxchg(exit_state).
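
    Roughly (a sketch of the resulting wait_task_zombie() flow):

        if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
                return 0;

        /*
         * We own the zombie now; the psig->c* accounting below no
         * longer needs tasklist_lock.
         */
        read_unlock(&tasklist_lock);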

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Rik van Riel
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. wait_task_zombie() uses p->real_parent to get psig/siglock. This is
    correct but needs tasklist_lock; ->real_parent can exit.

    We can use "current" instead. This is our natural child; its parent
    must be our sub-thread.

    2. Read psig/sig outside of ->siglock; ->signal is no longer protected
    by this lock.

    3. Fix the outdated comments about tasklist_lock. We cannot race with
    __exit_signal(); the whole thread group is dead, nobody but us can
    call it.

    Also clarify the usage of ->stats_lock and ->siglock.

    Note: thread_group_cputime_adjusted() is sub-optimal in this case; we
    probably want to export cputime_adjust() to avoid thread_group_cputime().
    The comment says "all threads" but there are no other threads.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Rik van Riel
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that EXIT_DEAD is the terminal state we can kill the "int traced"
    variable and check "state == EXIT_DEAD" instead to clean up the code.
    In particular, this way it is clear that the check obviously doesn't
    need tasklist_lock.

    Also fix the type to "unsigned long state"; "long" was always wrong,
    although this doesn't matter because cmpxchg/xchg uses typeof(*ptr).

    [akpm@linux-foundation.org: don't make me google the C Operator Precedence table]
    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Rik van Riel
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD
    tasks, we can simply pass the "dead_children" list to exit_ptrace()
    and remove another release_task() loop. Plus this way we do not need
    to drop and reacquire tasklist_lock.

    Also shift the list_empty(ptraced) check; if we want this optimization,
    it makes sense to eliminate the function call altogether.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov