22 Nov, 2016

1 commit

  • Exactly because for_each_thread() in autogroup_move_group() can't see it
    and update its ->sched_task_group before _put() and possibly free().

    So the exiting task needs another sched_move_task() before exit_notify()
    and we need to re-introduce the PF_EXITING (or similar) check removed by
    the previous change for another reason.
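
    Concretely, the fix adds an exit-path hook along these lines (a sketch
    based on the upstream patch; details may differ):

        /* called from do_exit() before exit_notify() */
        void sched_autogroup_exit_task(struct task_struct *p)
        {
                /*
                 * autogroup_move_group() can no longer see this thread
                 * after exit_notify(), so move it to its final task_group
                 * now, while signal->autogroup is still valid.
                 */
                sched_move_task(p);
        }

    and re-introduces a PF_EXITING check in task_wants_autogroup() so that
    an already-moved zombie is never moved back to autogroup->tg.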

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hartsjc@redhat.com
    Cc: vbendel@redhat.com
    Cc: vlovejoy@redhat.com
    Link: http://lkml.kernel.org/r/20161114184612.GA15968@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

08 Oct, 2016

1 commit

  • There are no users of exit_oom_victim() on a !current task anymore, so
    enforce the API to always work on the current task.
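
    The resulting API drops the task argument entirely; roughly (a sketch,
    assuming the oom_victims counter and waitqueue names used by the OOM
    killer code at the time):

        void exit_oom_victim(void)
        {
                /* always operates on current */
                clear_thread_flag(TIF_MEMDIE);

                if (!atomic_dec_return(&oom_victims))
                        wake_up_all(&oom_victims_wait);
        }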

    Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

22 Sep, 2016

1 commit

  • Oleg noted that by making do_exit() use __schedule() for the TASK_DEAD
    context switch, we can avoid the TASK_DEAD special case currently in
    __schedule() because that avoids the extra preempt_disable() from
    schedule().

    In order to facilitate this, create a do_task_dead() helper which we
    place in the scheduler code, such that it can access __schedule().

    Also add some __noreturn annotations to the functions; there's no
    coming back from do_exit().
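
    The helper ends up looking roughly like this (a sketch of the upstream
    code):

        void __noreturn do_task_dead(void)
        {
                /* causes the final put_task_struct() in finish_task_switch() */
                __set_current_state(TASK_DEAD);
                current->flags |= PF_NOFREEZE;  /* tell freezer to ignore us */
                __schedule(false);
                BUG();
                /* avoid "noreturn function does return" warnings */
                for (;;)
                        cpu_relax();
        }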

    Suggested-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Cheng Chao
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: chris@chris-wilson.co.uk
    Cc: tj@kernel.org
    Link: http://lkml.kernel.org/r/20160913163729.GB5012@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

02 Sep, 2016

1 commit

  • KASAN allocates memory from the page allocator as part of
    kmem_cache_free(), and that can reference current->mempolicy through any
    number of allocation functions. It needs to be NULL'd out before the
    final reference is dropped to prevent a use-after-free bug:

    BUG: KASAN: use-after-free in alloc_pages_current+0x363/0x370 at addr ffff88010b48102c
    CPU: 0 PID: 15425 Comm: trinity-c2 Not tainted 4.8.0-rc2+ #140
    ...
    Call Trace:
    dump_stack
    kasan_object_err
    kasan_report_error
    __asan_report_load2_noabort
    alloc_pages_current
    [...]

    Set current->mempolicy to NULL before dropping the final reference.
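
    In do_exit() that amounts to something like (a sketch; the actual hunk
    is guarded by CONFIG_NUMA):

        #ifdef CONFIG_NUMA
                task_lock(tsk);
                mpol_put(tsk->mempolicy);
                tsk->mempolicy = NULL;
                task_unlock(tsk);
        #endif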

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1608301442180.63329@chino.kir.corp.google.com
    Fixes: cd11016e5f52 ("mm, kasan: stackdepot implementation. Enable stackdepot for SLAB")
    Signed-off-by: David Rientjes
    Reported-by: Vegard Nossum
    Acked-by: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: [4.6+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

03 Aug, 2016

1 commit

  • Many targets enable CONFIG_DEBUG_STACK_USAGE, and while the information
    is useful, it isn't worthy of pr_warn(). Reduce it to pr_info().
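
    I.e., roughly (a sketch of check_stack_usage() in kernel/exit.c):

        pr_info("%s (%d) used greatest stack depth: %lu bytes left\n",
                tsk->comm, task_pid_nr(tsk), free);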

    Link: http://lkml.kernel.org/r/1466982072-29836-1-git-send-email-anton@ozlabs.org
    Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     

26 Jul, 2016

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - introduce and use task_rcu_dereference()/try_get_task_struct() to fix
    and generalize task_struct handling (Oleg Nesterov)

    - do various per entity load tracking (PELT) fixes and optimizations
    (Peter Zijlstra)

    - cputime virt-steal time accounting enhancements/fixes (Wanpeng Li)

    - introduce consolidated cputime output file cpuacct.usage_all and
    related refactorings (Zhao Lei)

    - ... plus misc fixes and enhancements

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is set
    sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together
    sched/cpuacct: Use loop to consolidate code in cpuacct_stats_show()
    sched/cpuacct: Merge cpuacct_usage_index and cpuacct_stat_index enums
    sched/fair: Rework throttle_count sync
    sched/core: Fix sched_getaffinity() return value kerneldoc comment
    sched/fair: Reorder cgroup creation code
    sched/fair: Apply more PELT fixes
    sched/fair: Fix PELT integrity for new tasks
    sched/cgroup: Fix cpu_cgroup_fork() handling
    sched/fair: Fix PELT integrity for new groups
    sched/fair: Fix and optimize the fork() path
    sched/cputime: Add steal time support to full dynticks CPU time accounting
    sched/cputime: Fix prev steal time accouting during CPU hotplug
    KVM: Fix steal clock warp during guest CPU hotplug
    sched/debug: Always show 'nr_migrations'
    sched/fair: Use task_rcu_dereference()
    sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
    sched/idle: Optimize the generic idle loop
    sched/fair: Fix the wrong throttled clock time for cfs_rq_clock_task()

    Linus Torvalds
     

14 Jun, 2016

1 commit

  • With the modified semantics of spin_unlock_wait() a number of
    explicit barriers can be removed. Also update the comment for the
    do_exit() usecase, as that was somewhat stale/obscure.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Jun, 2016

1 commit

  • Generally, task_struct is only protected by RCU if it was found on an
    RCU-protected list (say, for_each_process() or find_task_by_vpid()).

    As Kirill pointed out, rq->curr isn't protected by RCU: the scheduler
    drops the (potentially) last reference without an RCU grace period.
    This means that we need to fix the code which uses foreign_rq->curr
    under rcu_read_lock().

    Add a new helper which can be used to dereference rq->curr or any
    other pointer to task_struct assuming that it should be cleared or
    updated before the final put_task_struct(). It returns non-NULL
    only if this task can't go away before rcu_read_unlock().

    ( Also add try_get_task_struct() to make it easier to use this API
    correctly. )
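
    The intended usage pattern is roughly (a sketch; do_something() is a
    placeholder):

        struct task_struct *p;

        /* p is only stable inside the RCU read-side critical section */
        rcu_read_lock();
        p = task_rcu_dereference(&rq->curr);
        if (p)
                do_something(p);
        rcu_read_unlock();

        /* or take a real reference that outlives the critical section */
        p = try_get_task_struct(&rq->curr);
        if (p) {
                do_something(p);
                put_task_struct(p);
        }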

    Suggested-by: Kirill Tkhai
    Signed-off-by: Oleg Nesterov
    [ Updated comments; added try_get_task_struct()]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

24 May, 2016

2 commits

  • I see no reason why waitid() can't support the other Linux-specific
    flags allowed in sys_wait4().

    In particular this change can help if we reconsider the previous change
    ("wait/ptrace: assume __WALL if the child is traced"), which adds the
    "automagical" __WALL for the debugger.

    Signed-off-by: Oleg Nesterov
    Cc: Dmitry Vyukov
    Cc: Denys Vlasenko
    Cc: Jan Kratochvil
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Pedro Alves
    Cc: Roland McGrath
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The following program (a simplified version of one generated by
    syzkaller)

        #include <stdio.h>
        #include <signal.h>
        #include <unistd.h>
        #include <sys/ptrace.h>
        #include <pthread.h>

        void *thread_func(void *arg)
        {
                ptrace(PTRACE_TRACEME, 0, 0, 0);
                return 0;
        }

        int main(void)
        {
                pthread_t thread;

                if (fork())
                        return 0;

                while (getppid() != 1)
                        ;

                pthread_create(&thread, NULL, thread_func, NULL);
                pthread_join(thread, NULL);
                return 0;
        }

    creates an unreapable zombie if /sbin/init doesn't use __WALL.

    This is not a kernel bug, at least in the sense that everything works
    as expected: the debugger should reap a traced sub-thread before it can
    reap the leader, but without __WALL/__WCLONE do_wait() ignores
    sub-threads.

    Unfortunately, it seems that /sbin/init in most (all?) distributions
    doesn't use it and we have to change the kernel to avoid the problem.
    Note also that most inits use sys_waitid(), which doesn't allow __WALL,
    so the necessary user-space fix is not that trivial.

    This patch just adds the "ptrace" check into eligible_child(). To some
    degree this matches the "tsk->ptrace" check in exit_notify():
    ->exit_signal is mostly ignored when the tracee reports to the
    debugger. Likewise WSTOPPED: the tracer doesn't need to set this flag
    to wait for the stopped tracee.

    This obviously means a user-visible change: __WCLONE and __WALL no
    longer have any meaning for a debugger. And I can only hope that this
    won't break something, but at least strace/gdb won't suffer.

    We could make a more conservative change. Say, we could take __WCLONE
    into account, or !thread_group_leader(). But it would be nice to not
    complicate these historical/confusing checks.

    Signed-off-by: Oleg Nesterov
    Reported-by: Dmitry Vyukov
    Cc: Denys Vlasenko
    Cc: Jan Kratochvil
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Pedro Alves
    Cc: Roland McGrath
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

21 May, 2016

1 commit

  • We need to call exit_thread() from copy_process() in a failure path,
    so make it accept a task_struct parameter.

    [v2]
    * s390: exit_thread_runtime_instr doesn't make sense to be called for
    non-current tasks.
    * arm: fix the comment in vfp_thread_copy
    * change 'me' to 'tsk' for task_struct
    * now we can change only archs that actually have exit_thread
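
    The signature change, in short (a sketch):

        /* before: always acted on current */
        void exit_thread(void);

        /*
         * after: explicit task, so copy_process() can call it on the
         * half-constructed child
         */
        void exit_thread(struct task_struct *tsk);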

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jiri Slaby
    Cc: "David S. Miller"
    Cc: "H. Peter Anvin"
    Cc: "James E.J. Bottomley"
    Cc: Aurelien Jacquiot
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Chen Liqin
    Cc: Chris Metcalf
    Cc: Chris Zankel
    Cc: David Howells
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Guan Xuetao
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: James Hogan
    Cc: Jeff Dike
    Cc: Jesper Nilsson
    Cc: Jiri Slaby
    Cc: Jonas Bonn
    Cc: Koichi Yasutake
    Cc: Lennox Wu
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mikael Starvik
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Richard Henderson
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Russell King
    Cc: Steven Miao
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

26 Mar, 2016

1 commit

  • When the oom reaper manages to unmap all the eligible vmas, there
    shouldn't be much freeable memory held by the oom victim left anymore,
    so it makes sense to clear the TIF_MEMDIE flag for the victim and allow
    the OOM killer to select another task.

    The lack of TIF_MEMDIE also means that the victim cannot access memory
    reserves anymore, but that shouldn't be a problem because it would get
    the access again if it needs to allocate and hits the OOM killer again
    due to the fatal_signal_pending resp. PF_EXITING check. We can safely
    hide the task from the OOM killer because it is clearly not a good
    candidate anymore, as everything reclaimable has been torn down
    already.

    This patch makes it possible to cap the time an OOM victim can keep
    TIF_MEMDIE and thus hold off further global OOM killer actions,
    provided the oom reaper is able to take mmap_sem for the associated mm
    struct. This is not guaranteed now, but further steps should make sure
    that taking mmap_sem for write is blocked killable, which will help to
    reduce such lock contention. That is not done by this patch.

    Note that exit_oom_victim might be called on a remote task from
    __oom_reap_task now, so we have to check and clear the flag atomically;
    otherwise we might race and underflow oom_victims or wake up waiters
    too early.
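
    A sketch of the resulting helper (counter and waitqueue names assumed
    from the surrounding OOM code; treat as illustrative):

        void exit_oom_victim(struct task_struct *tsk)
        {
                /* may now be called on a remote task by the oom reaper */
                if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
                        return;

                if (!atomic_dec_return(&oom_victims))
                        wake_up_all(&oom_victims_wait);
        }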

    Signed-off-by: Michal Hocko
    Suggested-by: Johannes Weiner
    Suggested-by: Tetsuo Handa
    Cc: Andrea Argangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

23 Mar, 2016

1 commit

  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall
    inputs. To achieve this goal it does not collect coverage in soft/hard
    interrupts, and instrumentation of some inherently non-deterministic
    or non-interesting parts of the kernel is disabled (e.g. scheduler,
    locking).

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on the kernel side. The
    complementary compiler support was added in gcc revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat. A
    typical coverage can be just a dozen basic blocks (e.g. an invalid
    input). In such a context gcov becomes prohibitively expensive, as the
    reset/collect coverage steps depend on the total number of basic
    blocks/edges in the program (about 2M in the case of the kernel). The
    cost of kcov depends only on the number of executed basic blocks/edges.
    On top of that, the kernel requires per-thread coverage because there
    are always background threads and unrelated processes that also produce
    coverage. With inlined gcov instrumentation per-thread coverage is not
    possible.

    kcov exposes kernel PCs and control flow to user-space which is
    insecure. But debugfs should not be mapped as user accessible.
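
    A condensed version of the usage example that ships with the patch's
    documentation (error handling trimmed; treat as a sketch):

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define KCOV_INIT_TRACE  _IOR('c', 1, unsigned long)
        #define KCOV_ENABLE      _IO('c', 100)
        #define KCOV_DISABLE     _IO('c', 101)
        #define COVER_SIZE       (64 << 10)

        int main(void)
        {
                unsigned long *cover, n, i;
                int fd = open("/sys/kernel/debug/kcov", O_RDWR);

                ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
                cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                ioctl(fd, KCOV_ENABLE, 0);
                cover[0] = 0;           /* reset coverage */

                read(-1, NULL, 0);      /* the traced syscall */

                n = cover[0];           /* number of collected PCs */
                for (i = 0; i < n; i++)
                        printf("0x%lx\n", cover[i + 1]);

                ioctl(fd, KCOV_DISABLE, 0);
                munmap(cover, COVER_SIZE * sizeof(unsigned long));
                close(fd);
                return 0;
        }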

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

21 Jan, 2016

2 commits


04 Nov, 2015

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this cycle were:

    - sched/fair load tracking fixes and cleanups (Byungchul Park)

    - Make load tracking frequency scale invariant (Dietmar Eggemann)

    - sched/deadline updates (Juri Lelli)

    - stop machine fixes, cleanups and enhancements for bugs triggered by
    CPU hotplug stress testing (Oleg Nesterov)

    - scheduler preemption code rework: remove PREEMPT_ACTIVE and related
    cleanups (Peter Zijlstra)

    - Rework the sched_info::run_delay code to fix races (Peter Zijlstra)

    - Optimize per entity utilization tracking (Peter Zijlstra)

    - ... misc other fixes, cleanups and smaller updates"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
    sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
    sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
    sched: Start stopper early
    stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
    stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
    stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
    stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
    stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
    sched/x86: Fix typo in __switch_to() comments
    sched/core: Remove a parameter in the migrate_task_rq() function
    sched/core: Drop unlikely behind BUG_ON()
    sched/core: Fix task and run queue sched_info::run_delay inconsistencies
    sched/numa: Fix task_tick_fair() from disabling numa_balancing
    sched/core: Add preempt_count invariant check
    sched/core: More notrace annotations
    sched/core: Kill PREEMPT_ACTIVE
    sched/core, sched/x86: Kill thread_info::saved_preempt_count
    sched/core: Simplify preempt_count tests
    sched/core: Robustify preemption leak checks
    sched/core: Stop setting PREEMPT_ACTIVE
    ...

    Linus Torvalds
     

07 Oct, 2015

1 commit

  • Currently, __srcu_read_lock() cannot be invoked from restricted
    environments because it contains calls to preempt_disable() and
    preempt_enable(), both of which can invoke lockdep, which is a bad
    idea in some restricted execution modes. This commit therefore moves
    the preempt_disable() and preempt_enable() from __srcu_read_lock()
    to srcu_read_lock(). It also inserts the preempt_disable() and
    preempt_enable() around the call to __srcu_read_lock() in do_exit().
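
    I.e., the wrapper now looks roughly like this (a sketch):

        static inline int srcu_read_lock(struct srcu_struct *sp)
        {
                int idx;

                preempt_disable();      /* moved out of __srcu_read_lock() */
                idx = __srcu_read_lock(sp);
                preempt_enable();
                return idx;
        }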

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

06 Oct, 2015

1 commit

  • When we warn about a preempt_count leak, reset the preempt_count to
    the known good value such that the problem does not ripple forward.

    This is most important on x86 which has a per cpu preempt_count that is
    not saved/restored (after this series). So if you schedule with an
    invalid (!2*PREEMPT_DISABLE_OFFSET) preempt_count the next task is
    messed up too.

    Enforcing this invariant limits the borkage to just the one task.
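
    A sketch of the check, per the upstream schedule_debug() change:

        if (unlikely(in_atomic_preempt_off())) {
                __schedule_bug(prev);
                /* reset to the known good value; don't ripple forward */
                preempt_count_set(PREEMPT_DISABLED);
        }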

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Frederic Weisbecker
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Aug, 2015

1 commit


26 Jun, 2015

1 commit

  • There is a helpful comment in do_exit() that states we sync the mm's RSS
    info before statistics gathering.

    The function that does the statistics gathering is called right above that
    comment.

    Change the code to obey the comment.
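
    I.e., in do_exit() (a sketch of the reordering):

        /* sync mm's RSS info before statistics gathering */
        if (tsk->mm)
                sync_mm_rss(tsk->mm);

        acct_update_integrals(tsk);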

    Signed-off-by: Rik van Riel
    Cc: Oleg Nesterov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 Jun, 2015

1 commit

  • Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking
    are related in functionality, but the interface is not symmetrical at
    all: one is an internal OOM killer function used during the killing, the
    other is for an OOM victim to signal its own death on exit later on.
    This has locking implications, see follow-up changes.

    While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
    is easier on the eye.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Apr, 2015

1 commit

  • All users of exec_domain are gone, so we can finally get rid of that
    abandoned feature. To avoid breaking existing userspace we keep a dummy
    /proc/execdomains file which will always contain "0-0 Linux [kernel]".

    Signed-off-by: Richard Weinberger

    Richard Weinberger
     

12 Feb, 2015

2 commits

  • Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM
    suspend") left a race window where the OOM killer manages to
    note_oom_kill after freeze_processes checks the counter. The race
    window is quite small and really unlikely, and a partial solution was
    deemed sufficient at the time of submission.

    Tejun wasn't happy about this partial solution though and insisted on a
    full solution. That requires the full OOM and freezer's task freezing
    exclusion, though. This is done by this patch which introduces oom_sem
    RW lock and turns oom_killer_disable() into a full OOM barrier.

    oom_killer_disabled check is moved from the allocation path to the OOM
    level and we take oom_sem for reading for both the check and the whole
    OOM invocation.

    oom_killer_disable() takes oom_sem for writing, so it waits for all
    currently running OOM killer invocations. Then it disables all further
    OOMs by setting oom_killer_disabled and checks for any oom victims.
    Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim.
    The last victim wakes up all waiters enqueued by oom_killer_disable().
    Therefore this function acts as the full OOM barrier.
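
    A sketch of the barrier (the waitqueue name is assumed; the counter
    behaviour is as described above):

        bool oom_killer_disable(void)
        {
                /* wait for all currently running OOM killer invocations */
                down_write(&oom_sem);
                oom_killer_disabled = true;
                up_write(&oom_sem);

                /* the last victim's unmark wakes us up */
                wait_event(oom_victims_wait, !atomic_read(&oom_victims));
                return true;
        }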

    The page fault path is covered now as well although it was assumed to be
    safe before. As per Tejun, "We used to have freezing points deep in file
    system code which may be reacheable from page fault." so it would be
    better and more robust to not rely on freezing points here. Same applies
    to the memcg OOM killer.

    out_of_memory tells the caller whether the OOM was allowed to trigger and
    the callers are supposed to handle the situation. The page allocation
    path simply fails the allocation same as before. The page fault path will
    retry the fault (more on that later) and Sysrq OOM trigger will simply
    complain to the log.

    Normally there wouldn't be any unfrozen user tasks after
    try_to_freeze_tasks so the function will not block. But if there was an
    OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
    finish yet then we have to wait for it. This should complete in a finite
    time, though, because

    - the victim cannot loop in the page fault handler (it would die
    on the way out from the exception)
    - it cannot loop in the page allocator because all the further
    allocation would fail and __GFP_NOFAIL allocations are not
    acceptable at this stage
    - it shouldn't be blocked on any locks held by frozen tasks
    (try_to_freeze expects lockless context) and kernel threads and
    work queues are not frozen yet

    Signed-off-by: Michal Hocko
    Suggested-by: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This patchset addresses a race which was described in the changelog for
    5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"):

    : PM freezer relies on having all tasks frozen by the time devices are
    : getting frozen so that no task will touch them while they are getting
    : frozen. But OOM killer is allowed to kill an already frozen task in order
    : to handle OOM situtation. In order to protect from late wake ups OOM
    : killer is disabled after all tasks are frozen. This, however, still keeps
    : a window open when a killed task didn't manage to die by the time
    : freeze_processes finishes.

    The original patch didn't close the race window completely, because
    that would require a more complex solution, as can be seen from this
    patchset.

    The primary motivation was to close the race condition between the OOM
    killer and the PM freezer _completely_. As Tejun pointed out, even
    though the race condition is unlikely, it would be that much harder to
    debug weird bugs deep in the PM freezer when the debugging options are
    reduced considerably. I can only speculate what might happen when a
    task is still runnable unexpectedly.

    On the plus side, and as a side effect, the oom enable/disable now has
    a better (full barrier) semantic without polluting hot paths.

    I have tested the series in KVM with 100M RAM:
    - many small tasks (20M anon mmap) which are triggering OOM continually
    - s2ram which resumes automatically is triggered in a loop:

          echo processors > /sys/power/pm_test
          while true
          do
                  echo mem > /sys/power/state
                  sleep 1s
          done

    - simple module which allocates and frees 20M in 8K chunks. If it sees
      freezing(current) then it tries another round of allocation before
      calling try_to_freeze
    - debugging messages of PM stages and OOM killer enable/disable/fail
      added, and unmark_oom_victim is delayed by 1s after it clears
      TIF_MEMDIE and before it wakes up waiters
    - rebased on top of the current mmotm, which means some necessary
      updates in mm/oom_kill.c. mark_tsk_oom_victim is now called under
      task_lock, but I think this should be OK because __thaw_task
      shouldn't interfere with any locking down wake_up_process. Oleg?

    As expected there are no OOM killed tasks after oom is disabled and
    allocations requested by the kernel thread are failing after all the tasks
    are frozen and OOM disabled. I wasn't able to catch a race where
    oom_killer_disable would really have to wait, but then I kind of
    expected that, as the race is really unlikely.

    [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
    [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1
    [ 243.636072] (elapsed 2.837 seconds) done.
    [ 243.641985] Trying to disable OOM killer
    [ 243.643032] Waiting for concurent OOM victims
    [ 243.644342] OOM killer disabled
    [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
    [ 243.652983] Suspending console(s) (use no_console_suspend to debug)
    [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
    [...]
    [ 243.992600] PM: suspend of devices complete after 336.667 msecs
    [ 243.993264] PM: late suspend of devices complete after 0.660 msecs
    [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs
    [ 243.994717] ACPI: Preparing to enter system sleep state S3
    [ 243.994795] PM: Saving platform NVS memory
    [ 243.994796] Disabling non-boot CPUs ...

    The first 2 patches are simple cleanups for OOM. They should go in
    regardless of the rest, IMO.

    Patches 3 and 4 are trivial printk -> pr_info conversions and should
    go in ditto.

    The main patch is the last one and I would appreciate acks from Tejun
    and Rafael. I think the OOM part should be OK (except for __thaw_task
    vs. task_lock, where a look from Oleg would be appreciated) but I am
    not so sure I haven't screwed anything in the freezer code. I have
    found several surprises there.

    This patch (of 5):

    This patch is just preparatory and it doesn't introduce any functional
    change.

    Note:
    I am utterly unhappy about the lowmemory killer abusing TIF_MEMDIE just
    to wait for the oom victim and to prevent new killing. This is just a
    side effect of the flag. The primary meaning is to give the oom victim
    access to memory reserves, and that shouldn't be necessary here.

    Signed-off-by: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Jan, 2015

1 commit

  • wait_consider_task() checks EXIT_ZOMBIE after EXIT_DEAD/EXIT_TRACE,
    and both checks can fail if we race with an EXIT_ZOMBIE ->
    EXIT_DEAD/EXIT_TRACE change in between: gcc is allowed to reload
    p->exit_state after security_task_wait(). In this case ->notask_error
    will be wrongly cleared and do_wait() can hang forever if it was the
    last eligible child.

    Many thanks to Arne who carefully investigated the problem.

    Note: this bug is very old but it was purely theoretical until commit
    b3ab03160dfa ("wait: completely ignore the EXIT_DEAD tasks"). Before
    that commit, "-O2" was probably enough to guarantee that the compiler
    wouldn't read ->exit_state twice.
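
    The fix is to snapshot the state once (a sketch; the original patch
    used the then-current ACCESS_ONCE()):

        int exit_state = READ_ONCE(p->exit_state);

        if (exit_state == EXIT_DEAD)
                return 0;
        /* all later checks use exit_state, not p->exit_state */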

    Signed-off-by: Oleg Nesterov
    Reported-by: Arne Goedeke
    Tested-by: Arne Goedeke
    Cc: [3.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

15 Dec, 2014

1 commit

  • Pull tty/serial driver updates from Greg KH:
    "Here's the big tty/serial driver update for 3.19-rc1.

    There are a number of TTY core changes/fixes in here from Peter Hurley
    that have all been teted in linux-next for a long time now. There are
    also the normal serial driver updates as well, full details in the
    changelog below"

    * tag 'tty-3.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (219 commits)
    serial: pxa: hold port.lock when reporting modem line changes
    tty-hvsi_lib: Deletion of an unnecessary check before the function call "tty_kref_put"
    tty: Deletion of unnecessary checks before two function calls
    n_tty: Fix read_buf race condition, increment read_head after pushing data
    serial: of-serial: add PM suspend/resume support
    Revert "serial: of-serial: add PM suspend/resume support"
    Revert "serial: of-serial: fix up PM ops on no_console_suspend and port type"
    serial: 8250: don't attempt a trylock if in sysrq
    serial: core: Add big-endian iotype
    serial: samsung: use port->fifosize instead of hardcoded values
    serial: samsung: prefer to use fifosize from driver data
    serial: samsung: fix style problems
    serial: samsung: wait for transfer completion before clock disable
    serial: icom: fix error return code
    serial: tegra: clean up tty-flag assignments
    serial: Fix io address assign flow with Fintek PCI-to-UART Product
    serial: mxs-auart: fix tx_empty against shift register
    serial: mxs-auart: fix gpio change detection on interrupt
    serial: mxs-auart: Fix mxs_auart_set_ldisc()
    serial: 8250_dw: Use 64-bit access for OCTEON.
    ...

    Linus Torvalds
     

11 Dec, 2014

14 commits

  • After the previous change we can add just the exiting EXIT_DEAD task to
    the "dead" list and remove another release_task(tsk).

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Shift "release dead children" loop from forget_original_parent() to its
    caller, exit_notify(). It is safe to reap them even if our parent reaps
    us right after we drop tasklist_lock, those children no longer have any
    connection to the exiting task.

    And this allows us to avoid write_lock_irq(tasklist_lock) right after it
    was released by forget_original_parent(), we can simply call it with
    tasklist_lock held.

    While at it, move the comment about forget_original_parent() up to
    this function.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that the pid_ns logic was isolated, we can change
    forget_original_parent() to return right after find_child_reaper()
    when father->children is empty; there is nothing to reparent in this
    case.

    In particular this avoids find_alive_thread(), and this can help if
    the whole process exits and it has a lot of PF_EXITING threads at the
    start of the thread list; this can easily lead to O(nr_threads**2)
    iterations.

    Trivial test case (tested under KVM, 2 CPUs):

        #include <stdlib.h>
        #include <unistd.h>
        #include <signal.h>
        #include <pthread.h>
        #include <sys/wait.h>
        #include <assert.h>

        void *tfunc(void *arg)
        {
                pause();
                return NULL;
        }

        int child(unsigned int nt)
        {
                pthread_t pt;

                while (nt--)
                        assert(pthread_create(&pt, NULL, tfunc, NULL) == 0);

                pthread_kill(pt, SIGTRAP);
                pause();
                return 0;
        }

        int main(int argc, const char *argv[])
        {
                int stat;
                unsigned int nf = atoi(argv[1]);
                unsigned int nt = atoi(argv[2]);

                while (nf--) {
                        if (!fork())
                                return child(nt);

                        wait(&stat);
                        assert(stat == SIGTRAP);
                }

                return 0;
        }

    $ time ./test 16 16536 shows:

                 real       user        sys
        -   5m37.628s   0m4.437s   8m5.560s
        +   0m50.032s   0m7.130s   1m4.927s

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Add the new simple helper to factor out the for_each_thread() code in
    find_child_reaper() and find_new_reaper(). It can also simplify the
    potential PF_EXITING -> exit_state change, plus perhaps we can change this
    code to take SIGNAL_GROUP_EXIT into account.
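
    Presumably along these lines (the helper as introduced upstream;
    treat as a sketch):

        static struct task_struct *find_alive_thread(struct task_struct *p)
        {
                struct task_struct *t;

                for_each_thread(p, t) {
                        if (!(t->flags & PF_EXITING))
                                return t;
                }
                return NULL;
        }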

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Kay Sievers
    Cc: Lennart Poettering
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • find_new_reaper() does two completely different things. Not only does
    it find a reaper, it also updates pid_ns->child_reaper, or kills the
    whole namespace if the caller is ->child_reaper.

    Now that the has_child_subreaper logic doesn't depend on the
    child_reaper check, we can move that pid_ns code into a separate
    helper. IMHO this makes the code cleaner, and this allows the next
    changes.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Kay Sievers
    Cc: Lennart Poettering
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Swap the "init_task" and same_thread_group() checks. This way it is more
    simple to document these checks and we can remove the link to the previous
    discussion on lkml.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Kay Sievers
    Cc: Lennart Poettering
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change find_new_reaper() to use for_each_thread() instead of the
    deprecated while_each_thread(). We do not bother to check
    "thread != father" in the 1st loop; we can rely on the PF_EXITING
    check.

    Note: this means a minor behavioural change: for_each_thread() starts
    from the group leader. But this should be fine, nobody should make any
    assumption about do_wait(__WNOTHREAD) when it comes to reparented
    tasks, and this can avoid the pointless reparenting to a short-living
    thread; zombie leaders are not that common.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Kay Sievers
    Cc: Lennart Poettering
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • find_new_reaper() assumes that "has_child_subreaper" logic is safe as
    long as we are not the exiting ->child_reaper and this is doubly wrong:

    1. In fact it is safe if "pid_ns->child_reaper == father"; there must
    be no children after zap_pid_ns_processes() returns, so it doesn't
    matter what we return in this case and even pid_ns->child_reaper is
    wrong otherwise: we can't reparent to ->child_reaper == current.

    This is not a bug, but this is confusing.

    2. It is not safe if we are not pid_ns->child_reaper but from the same
    thread group. We drop tasklist_lock before zap_pid_ns_processes(),
    so another thread can lock it and choose the new reaper from the
    upper namespace if has_child_subreaper == T, and this is obviously
    wrong.

    This is not that bad; zap_pid_ns_processes() won't return until the
    new reaper reaps all zombies, but this should be fixed anyway.

    We could change for_each_thread() loop to use ->exit_state instead of
    PF_EXITING which we had to use until 8aac62706ada, or we could change
    copy_signal() to check CLONE_NEWPID before setting has_child_subreaper,
    but let's change this code so that it is clear we can't look outside of
    our namespace, otherwise same_thread_group(reaper, child_reaper) check
    will look wrong and confusing anyway.

    We can simply start from "father" and fix the problem. We can't wrongly
    return a thread from the same thread group if ->is_child_subreaper == T;
    we know that all threads have PF_EXITING set.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Kay Sievers
    Cc: Lennart Poettering
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The ->has_child_subreaper code in find_new_reaper() finds alive "thread"
    but returns another "reaper" thread which can be dead.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Kay Sievers
    Cc: Lennart Poettering
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Contrary to what the comment in __exit_signal() says, we do account
    the group leader. Fix this and explain why.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Rik van Riel
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • wait_task_zombie() no longer needs tasklist_lock to accumulate the
    psig->c* counters; we can drop it right after cmpxchg(exit_state).
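
    Roughly (a sketch of the resulting wait_task_zombie() flow):

        if (cmpxchg(&p->exit_state, EXIT_ZOMBIE, state) != EXIT_ZOMBIE)
                return 0;

        /*
         * We own the zombie now; the psig->c* accounting below no
         * longer needs tasklist_lock.
         */
        read_unlock(&tasklist_lock);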

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Rik van Riel
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. wait_task_zombie() uses p->real_parent to get psig/siglock. This is
    correct but needs tasklist_lock; ->real_parent can exit.

    We can use "current" instead. This is our natural child; its parent
    must be our sub-thread.

    2. Read psig/sig outside of ->siglock; ->signal is no longer protected
    by this lock.

    3. Fix the outdated comments about tasklist_lock. We cannot race with
    __exit_signal(); the whole thread group is dead, nobody but us can
    call it.

    Also clarify the usage of ->stats_lock and ->siglock.

    Note: thread_group_cputime_adjusted() is sub-optimal in this case; we
    probably want to export cputime_adjust() to avoid thread_group_cputime().
    The comment says "all threads" but there are no other threads.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Rik van Riel
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that EXIT_DEAD is the terminal state we can kill the "int traced"
    variable and check "state == EXIT_DEAD" instead to clean up the code.
    In particular, this way it is clear that the check obviously doesn't
    need tasklist_lock.

    Also fix the type to "unsigned long state"; "long" was always wrong,
    although this doesn't matter because cmpxchg/xchg uses typeof(*ptr).

    [akpm@linux-foundation.org: don't make me google the C Operator Precedence table]
    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: "Eric W. Biederman"
    Cc: Rik van Riel
    Cc: Sterling Alexander
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that forget_original_parent() uses ->ptrace_entry for EXIT_DEAD
    tasks, we can simply pass the "dead_children" list to exit_ptrace()
    and remove another release_task() loop. Plus this way we do not need
    to drop and reacquire tasklist_lock.

    Also shift the list_empty(ptraced) check; if we want this optimization,
    it makes sense to eliminate the function call altogether.

    Signed-off-by: Oleg Nesterov
    Cc: Aaron Tomlin
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman" ,
    Cc: Sterling Alexander
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov