04 Jan, 2020

1 commit

  • Pull thread fixes from Christian Brauner:
    "Here are two fixes:

    - Panic earlier when global init exits to generate useable coredumps.

    Currently, when global init and all threads in its thread-group
    have exited we panic via:

    do_exit()
      -> exit_notify()
         -> forget_original_parent()
            -> find_child_reaper()

    This makes it hard to extract a useable coredump for global init
    from a kernel crashdump because by the time we panic exit_mm() will
    have already released global init's mm. We now panic slightly
    earlier. This has been a problem in certain environments such as
    Android.

    - Fix a race in assigning and reading taskstats for thread-groups
    with more than one thread.

    This patch has been waiting for quite a while since people
    disagreed on what the correct fix was at first"

    * tag 'for-linus-2020-01-03' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    exit: panic before exit_mm() on global init exit
    taskstats: fix data-race

    Linus Torvalds
     

21 Dec, 2019

1 commit

  • Currently, when global init and all threads in its thread-group have exited
    we panic via:
    do_exit()
      -> exit_notify()
         -> forget_original_parent()
            -> find_child_reaper()
    This makes it hard to extract a useable coredump for global init from a
    kernel crashdump because by the time we panic exit_mm() will have already
    released global init's mm.
    This patch moves the panic further up, before exit_mm() is called. As was the
    case previously, we only panic when global init and all its threads in the
    thread-group have exited.
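
    For orientation, here is a rough sketch of where the check now sits in
    do_exit(); this is illustrative pseudocode of the shape of the change,
    not the exact kernel diff (the real condition is slightly broader):

        /* Sketch only -- not the verbatim kernel change. */
        void do_exit(long code)
        {
            struct task_struct *tsk = current;

            /* ... earlier exit work ... */

            /*
             * Moved up: panic while global init's mm is still intact so a
             * kernel crashdump contains a usable coredump of init.
             */
            if (unlikely(is_global_init(tsk) && thread_group_empty(tsk)))
                panic("Attempted to kill init! exitcode=0x%08x\n", (int)code);

            exit_mm();    /* previously the panic was only reached after this */

            /* ... exit_notify() -> forget_original_parent() ... */
        }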

    Signed-off-by: chenqiwu
    Acked-by: Christian Brauner
    Acked-by: Oleg Nesterov
    [christian.brauner@ubuntu.com: fix typo, rewrite commit message]
    Link: https://lore.kernel.org/r/1576736993-10121-1-git-send-email-qiwuchen55@gmail.com
    Signed-off-by: Christian Brauner

    chenqiwu
     

01 Dec, 2019

1 commit

  • …ux/kernel/git/dhowells/linux-fs

    Pull pipe rework from David Howells:
    "This is my set of preparatory patches for building a general
    notification queue on top of pipes. It makes a number of significant
    changes:

    - It removes the nr_exclusive argument from __wake_up_sync_key() as
    this is always 1. This prepares for the next step:

    - Adds wake_up_interruptible_sync_poll_locked() so that poll can be
    woken up from a function that's holding the poll waitqueue
    spinlock.

    - Change the pipe buffer ring to be managed in terms of unbounded
    head and tail indices rather than bounded index and length. This
    means that reading the pipe only needs to modify one index, not
    two.

    - A selection of helper functions are provided to query the state of
    the pipe buffer, plus a couple to apply updates to the pipe
    indices.

    - The pipe ring is allowed to have kernel-reserved slots. This allows
    many notification messages to be spliced in by the kernel without
    allowing userspace to pin too many pages if it writes to the same
    pipe.

    - Advance the head and tail indices inside the pipe waitqueue lock
    and use wake_up_interruptible_sync_poll_locked() to poke poll
    without having to take the lock twice.

    - Rearrange pipe_write() to preallocate the buffer it is going to
    write into and then drop the spinlock. This allows kernel
    notifications to then be added to the ring whilst it is filling the
    buffer it allocated. The read side is stalled because the pipe
    mutex is still held.

    - Don't wake up readers on a pipe if there was already data in it
    when we added more.

    - Don't wake up writers on a pipe if the ring wasn't full before we
    removed a buffer"
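
    To make the head/tail scheme concrete, here is a standalone sketch of a
    ring managed by unbounded indices. The names are hypothetical and this is
    not the kernel's pipe code, but the arithmetic is the idea described
    above: occupancy is head minus tail, and the index is masked only when a
    slot is actually dereferenced.

        #include <stdbool.h>

        #define RING_SIZE 16u                     /* must be a power of two */

        struct ring {
            unsigned int head;    /* bumped by the writer, never wrapped back */
            unsigned int tail;    /* bumped by the reader, never wrapped back */
            void *bufs[RING_SIZE];
        };

        static inline unsigned int ring_occupancy(const struct ring *r)
        {
            return r->head - r->tail;         /* correct across unsigned wrap */
        }

        static inline bool ring_empty(const struct ring *r)
        {
            return r->head == r->tail;
        }

        static inline bool ring_full(const struct ring *r)
        {
            return ring_occupancy(r) >= RING_SIZE;
        }

        static inline void **ring_slot(struct ring *r, unsigned int index)
        {
            return &r->bufs[index & (RING_SIZE - 1)];   /* mask, don't bound the index */
        }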

    * tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    pipe: Remove sync on wake_ups
    pipe: Increase the writer-wakeup threshold to reduce context-switch count
    pipe: Check for ring full inside of the spinlock in pipe_write()
    pipe: Remove redundant wakeup from pipe_write()
    pipe: Rearrange sequence in pipe_write() to preallocate slot
    pipe: Conditionalise wakeup in pipe_read()
    pipe: Advance tail pointer inside of wait spinlock in pipe_read()
    pipe: Allow pipes to have kernel-reserved slots
    pipe: Use head and tail pointers for the ring, not cursor and length
    Add wake_up_interruptible_sync_poll_locked()
    Remove the nr_exclusive argument from __wake_up_sync_key()
    pipe: Reduce #inclusion of pipe_fs_i.h

    Linus Torvalds
     

27 Nov, 2019

1 commit

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - A comprehensive rewrite of the robust/PI futex code's exit handling
    to fix various exit races. (Thomas Gleixner et al)

    - Rework the generic REFCOUNT_FULL implementation using
    atomic_fetch_* operations so that the performance impact of the
    cmpxchg() loops is mitigated for common refcount operations.

    With these performance improvements the generic implementation of
    refcount_t should be good enough for everybody - and this got
    confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
    REFCOUNT_FULL entirely, leaving the generic implementation enabled
    unconditionally. (Will Deacon)

    - Other misc changes, fixes, cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    lkdtm: Remove references to CONFIG_REFCOUNT_FULL
    locking/refcount: Remove unused 'refcount_error_report()' function
    locking/refcount: Consolidate implementations of refcount_t
    locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
    locking/refcount: Move saturation warnings out of line
    locking/refcount: Improve performance of generic REFCOUNT_FULL code
    locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the header
    locking/refcount: Remove unused refcount_*_checked() variants
    locking/refcount: Ensure integer operands are treated as signed
    locking/refcount: Define constants for saturation and max refcount values
    futex: Prevent exit livelock
    futex: Provide distinct return value when owner is exiting
    futex: Add mutex around futex exit
    futex: Provide state handling for exec() as well
    futex: Sanitize exit state handling
    futex: Mark the begin of futex exit explicitly
    futex: Set task::futex_state to DEAD right after handling futex exit
    futex: Split futex_mm_release() for exit/exec
    exit/exec: Seperate mm_release()
    futex: Replace PF_EXITPIDONE with a state
    ...

    Linus Torvalds
     

20 Nov, 2019

4 commits

  • Instead of relying on PF_EXITING use an explicit state for the futex exit
    and set it in the futex exit function. This moves the smp barrier and the
    lock/unlock serialization into the futex code.

    As with the DEAD state this is restricted to the exit path as exec
    continues to use the same task struct.

    This allows that logic to be simplified in a later step.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de

    Thomas Gleixner
     
  • Setting task::futex_state in do_exit() is rather arbitrarily placed for no
    reason. Move it into the futex code.

    Note, this is only done for the exit cleanup as the exec cleanup cannot set
    the state to FUTEX_STATE_DEAD because the task struct is still in active
    use.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de

    Thomas Gleixner
     
  • mm_release() contains the futex exit handling. mm_release() is called from
    do_exit()->exit_mm() and from exec()->exec_mm().

    In the exit_mm() case PF_EXITING and the futex state are updated. In the
    exec_mm() case these states are not touched.

    As the futex exit code needs further protections against exit races, this
    needs to be split into two functions.

    Preparatory only, no functional change.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de

    Thomas Gleixner
     
  • The futex exit handling relies on PF_ flags. That's suboptimal as it
    requires a smp_mb() and an ugly lock/unlock of the exiting task's pi_lock in
    the middle of do_exit() to enforce the observability of PF_EXITING in the
    futex code.

    Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic
    over to the new state. The PF_EXITING dependency will be cleaned up in a
    later step.

    This prepares for handling various futex exit issues later.
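
    As a rough sketch of the direction of this series (the state names follow
    the commit descriptions, but the code below is an assumption rather than
    the verbatim kernel change), the implicit PF_ flag tests give way to an
    explicit per-task state that only the futex exit/exec paths advance:

        /* Assumed sketch, not verbatim kernel code. */
        enum {
            FUTEX_STATE_OK,        /* task alive, futexes usable          */
            FUTEX_STATE_EXITING,   /* futex exit handling in progress     */
            FUTEX_STATE_DEAD,      /* futex exit handling finished        */
        };

        /* task_struct grows something like: unsigned int futex_state; */

        static void futex_exit_begin(struct task_struct *tsk)
        {
            tsk->futex_state = FUTEX_STATE_EXITING;
        }

        static void futex_exit_done(struct task_struct *tsk)
        {
            /* only exit reaches DEAD; exec keeps using the same task_struct */
            tsk->futex_state = FUTEX_STATE_DEAD;
        }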

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de

    Thomas Gleixner
     

17 Oct, 2019

1 commit

  • Replace hlist_empty() with the new pid_has_task() helper which is more
    idiomatic, easier to grep for, and unifies how callers perform this check.
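
    For reference, such a helper essentially wraps the open-coded check; a
    sketch of its shape (not necessarily the verbatim definition):

        static inline bool pid_has_task(struct pid *pid, enum pid_type type)
        {
            /* true if at least one task is still attached for this pid type */
            return !hlist_empty(&pid->tasks[type]);
        }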

    Signed-off-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Link: https://lore.kernel.org/r/20191017101832.5985-4-christian.brauner@ubuntu.com

    Christian Brauner
     

25 Sep, 2019

2 commits

  • Remove workarounds that were written before there was a grace period
    after tasks left the runqueue in finish_task_switch().

    In particular, now that tasks exiting the runqueue experience an RCU
    grace period, none of the work performed by task_rcu_dereference()
    except the rcu_dereference() is necessary, so replace
    task_rcu_dereference() with rcu_dereference().

    Remove the code in rcuwait_wait_event() that checks to ensure the current
    task has not exited. It is no longer necessary as it is guaranteed
    that any running task will experience an RCU grace period after it
    leaves the runqueue.

    Remove the comment in rcuwait_wake_up() as it is no longer relevant.

    Ref: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
    Ref: 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Davidlohr Bueso
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/87lfurdpk9.fsf_-_@x220.int.ebiederm.org
    Signed-off-by: Ingo Molnar

    Eric W. Biederman
     
  • Add a count of the number of RCU users (currently 1) of the task
    struct so that we can later add the scheduler case and get rid of the
    very subtle task_rcu_dereference(), and just use rcu_dereference().

    As suggested by Oleg have the count overlap rcu_head so that no
    additional space in task_struct is required.
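
    A sketch of the resulting layout (field names assumed from the
    description, not quoted from the header):

        struct task_struct {
            /* ... */
            union {
                refcount_t rcu_users;   /* counts RCU users while lookups are possible */
                struct rcu_head rcu;    /* reused for the delayed free afterwards      */
            };
            /* ... */
        };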

    Inspired-by: Linus Torvalds
    Inspired-by: Oleg Nesterov
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Davidlohr Bueso
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org
    Signed-off-by: Ingo Molnar

    Eric W. Biederman
     

17 Sep, 2019

1 commit

  • Pull pidfd/waitid updates from Christian Brauner:
    "This contains two features and various tests.

    First, it adds support for waiting on process through pidfds by adding
    the P_PIDFD type to the waitid() syscall. This completes the basic
    functionality of the pidfd api (cf. [1]). In the meantime we also have
    a new addition to the userspace projects that make use of the pidfd
    api. The qt project was nice enough to send a mail pointing out that
    they have a pr up to switch to the pidfd api (cf. [2]).

    Second, this tag contains an extension to the waitid() syscall to make
    it possible to wait on the current process group in a race free manner
    (even though the actual problem is very unlikely) by specifying 0
    together with the P_PGID type. This extension traces back to a
    discussion on the glibc development mailing list.

    There are also a range of tests for the features above. Additionally,
    the test-suite which detected the pidfd-polling race we fixed in [3]
    is included in this tag"

    [1] https://lwn.net/Articles/794707/
    [2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
    [3] commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state")

    * tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    waitid: Add support for waiting for the current process group
    tests: add pidfd poll tests
    tests: move common definitions and functions into pidfd.h
    pidfd: add pidfd_wait tests
    pidfd: add P_PIDFD to waitid()

    Linus Torvalds
     

10 Sep, 2019

1 commit

  • It was recently discovered that the linux version of waitid is not a
    superset of the other wait functions because it does not include support
    for waiting for the current process group. This has two downsides:
    1. An extra system call is needed to get the current process group.
    2. After the current process group is received and before it is passed
    to waitid, a signal could arrive causing the current process group to change.
    Inherent race conditions such as these make it impossible for userspace to
    emulate this functionality and thus violate the async-signal-safety
    requirements for waitpid.

    Arguments can be made for using a different choice of idtype and id
    for this case, but the BSDs already use P_PGID with an id of 0 to indicate
    waiting for the current process's process group. So be nice to user
    space programmers and don't introduce an unnecessary incompatibility.

    Some people have noted that the posix description is that
    waitpid will wait for the current process group, and that in
    the presence of pthreads that process group can change. To get
    clarity on this issue I looked at XNU, FreeBSD, and Luminos. All of
    those flavors of unix waited for the current process group at the
    time of call and as written could not adapt to the process group
    changing after the call.

    At one point Linux did adapt to the current process group changing but
    that stopped in 161550d74c07 ("pid: sys_wait... fixes"). It has been
    over 11 years since Linux last had that behavior, no programs that fail
    with the change in behavior have been reported, and I could not
    find any other unix that does this. So I think it is safe to clarify
    the definition of current process group, to current process group
    at the time of the wait function.
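
    A small userspace example of the new semantics; it assumes a kernel with
    this change and a libc that passes the waitid() arguments straight
    through:

        #define _GNU_SOURCE
        #include <signal.h>
        #include <stdio.h>
        #include <sys/wait.h>

        int main(void)
        {
            siginfo_t info = { 0 };

            /* id == 0 with P_PGID now means "my process group as of this call" */
            if (waitid(P_PGID, 0, &info, WEXITED) == 0)
                printf("reaped pid %d (status %d)\n",
                       (int)info.si_pid, info.si_status);
            else
                perror("waitid");   /* e.g. ECHILD if there are no children */
            return 0;
        }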

    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Cc: "H. Peter Anvin"
    Cc: Arnd Bergmann
    Cc: Palmer Dabbelt
    Cc: Rich Felker
    Cc: Alistair Francis
    Cc: Zong Li
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: Florian Weimer
    Cc: Adhemerval Zanella
    Cc: GNU C Library
    Link: https://lore.kernel.org/r/20190814154400.6371-2-christian.brauner@ubuntu.com

    Eric W. Biederman
     

02 Aug, 2019

1 commit

  • This adds the P_PIDFD type to waitid().
    One of the last remaining bits for the pidfd api is to make it possible
    to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace
    that want to use the pidfd api to exclusively manage processes can do so
    now.

    One of the things this will unblock in the future is the ability to make
    it possible to retrieve the exit status via waitid(P_PIDFD) for
    non-parent processes if handed a _suitable_ pidfd that has this feature
    set. This is similar to what you can do on FreeBSD with kqueue(). It
    might even end up being possible to wait on a process as a non-parent if
    an appropriate property is enabled on the pidfd.

    With P_PIDFD no scoping of the process identified by the pidfd is
    possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
    waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
    equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
    rely on pid-based wait*() syscalls for now.
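
    A minimal userspace sketch of the call; it assumes Linux 5.3+ for
    pidfd_open() and headers that expose SYS_pidfd_open, and the P_PIDFD
    fallback definition below is an assumption for older headers:

        #define _GNU_SOURCE
        #include <signal.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <sys/wait.h>
        #include <unistd.h>

        #ifndef P_PIDFD
        #define P_PIDFD 3    /* uapi value; only defined here if the headers lack it */
        #endif

        int main(void)
        {
            pid_t pid = fork();
            if (pid == 0)
                _exit(42);

            int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
            if (pidfd < 0) {
                perror("pidfd_open");
                return 1;
            }

            siginfo_t info = { 0 };
            if (waitid(P_PIDFD, pidfd, &info, WEXITED) == 0)
                printf("child exited, si_status=%d\n", info.si_status);
            else
                perror("waitid(P_PIDFD)");
            return 0;
        }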

    Signed-off-by: Christian Brauner
    Reviewed-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Joel Fernandes (Google)
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: Jann Horn
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro
    Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io

    Christian Brauner
     

31 Jul, 2019

1 commit

  • Since commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state")
    we unconditionally set exit_state to EXIT_ZOMBIE before calling into
    do_notify_parent(). This was done to eliminate a race when querying
    exit_state in do_notify_pidfd().
    Back then we decided to do the absolute minimal thing to fix this and
    not touch the rest of the exit_notify() function where exit_state is
    set.
    Since this fix has not caused any issues, change the setting of
    exit_state to EXIT_DEAD in the autoreap case to account for the fact that
    exit_state is set to EXIT_ZOMBIE unconditionally. This fix was planned
    but also explicitly requested in [1] and makes the whole code more
    consistent.

    /* References */
    [1]: https://lore.kernel.org/lkml/CAHk-=wigcxGFR2szue4wavJtH5cYTTeNES=toUBVGsmX0rzX+g@mail.gmail.com

    Signed-off-by: Christian Brauner
    Acked-by: Oleg Nesterov
    Cc: Linus Torvalds

    Christian Brauner
     

22 Jul, 2019

1 commit

  • There is a race between reading task->exit_state in pidfd_poll and
    writing it after do_notify_parent calls do_notify_pidfd. Expected
    sequence of events is:

    CPU 0                                     CPU 1
    ------------------------------------------------
    exit_notify
      do_notify_parent
        do_notify_pidfd
      tsk->exit_state = EXIT_DEAD
                                              pidfd_poll
                                                if (tsk->exit_state)

    However nothing prevents the following sequence:

    CPU 0                                     CPU 1
    ------------------------------------------------
    exit_notify
      do_notify_parent
        do_notify_pidfd
                                              pidfd_poll
                                                if (tsk->exit_state)
      tsk->exit_state = EXIT_DEAD

    This causes a polling task to wait forever, since poll blocks because
    exit_state is 0 and the waiting task is not notified again. A stress
    test continuously doing pidfd poll and process exits uncovered this bug.

    To fix it, we make sure that the task's exit_state is always set before
    calling do_notify_pidfd.
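
    For context, "pidfd poll" is the userspace pattern of polling a pidfd for
    readability to learn that the process has exited, roughly as below
    (assumes Linux 5.3+ for pidfd_open()):

        #define _GNU_SOURCE
        #include <poll.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
            pid_t pid = fork();
            if (pid == 0)
                _exit(0);

            int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
            struct pollfd pfd = { .fd = pidfd, .events = POLLIN };

            /* Becomes readable once the task has a non-zero exit_state; the
             * race described above could leave this poll() blocked forever. */
            if (poll(&pfd, 1, -1) == 1 && (pfd.revents & POLLIN))
                printf("child %d has exited\n", (int)pid);
            return 0;
        }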

    Fixes: b53b0b9d9a6 ("pidfd: add polling support")
    Cc: kernel-team@android.com
    Cc: Oleg Nesterov
    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Joel Fernandes (Google)
    Link: https://lore.kernel.org/r/20190717172100.261204-1-joel@joelfernandes.org
    [christian@brauner.io: adapt commit message and drop unneeded changes from wait_task_zombie]
    Signed-off-by: Christian Brauner

    Suren Baghdasaryan
     

01 Jun, 2019

1 commit

  • cgroup_release() calls cgroup_subsys->release() which is used by the
    pids controller to uncharge its pid. We want to use it to manage
    iteration of dying tasks which requires putting it before
    __unhash_process(). Move cgroup_release() above __exit_signal().
    While this makes it uncharge before the pid is freed, pid is RCU freed
    anyway and the window is very narrow.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov

    Tejun Heo
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
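
    Concretely, each affected C source file now starts with a tag of the
    form below (header files carry the same tag in a /* ... */ comment):

        // SPDX-License-Identifier: GPL-2.0-only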

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit

  • The RCU reader uses rcu_dereference() inside rcu_read_lock critical
    sections, so the writer shall use WRITE_ONCE. Just a cleanup, we still
    rely on gcc to emit atomic writes in other places.
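
    A generic sketch of the pairing being enforced (gp, new_ptr and use() are
    placeholders; this is not the specific site touched by the patch):

        /* writer: publish the update with a marked store */
        WRITE_ONCE(gp, new_ptr);

        /* reader: marked load inside an RCU read-side critical section */
        rcu_read_lock();
        p = rcu_dereference(gp);
        if (p)
            use(p);
        rcu_read_unlock();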

    Link: http://lkml.kernel.org/r/20190325225636.11635-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Jann Horn
    Cc: Jason Gunthorpe
    Cc: "Kirill A . Shutemov"
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Oleg Nesterov
    Cc: Peter Xu
    Cc: zhong jiang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

08 Mar, 2019

1 commit

  • Pull cgroup updates from Tejun Heo:

    - Oleg's pids controller accounting update which gets rid of rcu delay
    in pids accounting updates

    - rstat (cgroup hierarchical stat collection mechanism) optimization

    - Doc updates

    * 'for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cpuset: remove unused task_has_mempolicy()
    cgroup, rstat: Don't flush subtree root unless necessary
    cgroup: add documentation for pids.events file
    Documentation: cgroup-v2: eliminate markup warnings
    MAINTAINERS: Update cgroup entry
    cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting

    Linus Torvalds
     

02 Feb, 2019

1 commit

  • Currently, exit_ptrace() adds all ptraced tasks to a dead list, then
    zap_pid_ns_processes() waits on all tasks in the current pidns, and only
    then are tasks from the dead list released.

    zap_pid_ns_processes() can get stuck on waiting tasks from the dead
    list. In this case, we will have one unkillable process with one or
    more dead children.

    Thanks to Oleg for the advice to release tasks in find_child_reaper().

    Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
    Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
    Signed-off-by: Andrei Vagin
    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrei Vagin
     

31 Jan, 2019

1 commit

  • The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
    needs pids_free() to uncharge the pid.

    However, ->free() is called from __put_task_struct()->cgroup_free() and this
    is too late. Even the trivial program which does

    for (;;) {
        int pid = fork();
        assert(pid >= 0);
        if (pid)
            wait(NULL);
        else
            exit(0);
    }

    can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
    implies an RCU gp after the task/pid goes away and before the final put().

    Test-case:

    mkdir -p /tmp/CG
    mount -t cgroup2 none /tmp/CG
    echo '+pids' > /tmp/CG/cgroup.subtree_control

    mkdir /tmp/CG/PID
    echo 2 > /tmp/CG/PID/pids.max

    perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
    echo $! > /tmp/CG/PID/cgroup.procs

    Without this patch the forking process fails soon after migration.

    Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
    into the new helper, cgroup_release(), called by release_task() which actually
    frees the pid(s).

    Reported-by: Herton R. Krzesinski
    Reported-by: Jan Stancek
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Tejun Heo

    Oleg Nesterov
     

21 Jan, 2019

1 commit

  • For some peculiar reason rcuwait_wake_up() has the right barrier in
    the comment, but not in the code.

    This mistake has been observed to cause a deadlock in the following
    situation:

    P1                                        P2

    percpu_up_read()                          percpu_down_write()
      rcu_sync_is_idle() // false
                                                rcu_sync_enter()
                                                ...
      __percpu_up_read()

    [S]   ,-  __this_cpu_dec(*sem->read_count)
          |   smp_rmb();
    [L]   |   task = rcu_dereference(w->task) // NULL
          |
          |                                   [S] w->task = current
          |                                       smp_mb();
          |                                   [L] readers_active_check() // fail
          `->

    Where the smp_rmb() (obviously) fails to constrain the store.

    [ peterz: Added changelog. ]

    Signed-off-by: Prateek Sood
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Andrea Parri
    Acked-by: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
    Link: https://lkml.kernel.org/r/1543590656-7157-1-git-send-email-prsood@codeaurora.org
    Signed-off-by: Ingo Molnar

    Prateek Sood
     

16 Jan, 2019

1 commit

  • Pull networking fixes from David Miller:

    1) Fix regression in multi-SKB responses to RTM_GETADDR, from Arthur
    Gautier.

    2) Fix ipv6 frag parsing in openvswitch, from Yi-Hung Wei.

    3) Unbounded recursion in ipv4 and ipv6 GUE tunnels, from Stefano
    Brivio.

    4) Use after free in hns driver, from Yonglong Liu.

    5) icmp6_send() needs to handle the case of NULL skb, from Eric
    Dumazet.

    6) Missing rcu read lock in __inet6_bind() when operating on mapped
    addresses, from David Ahern.

    7) Memory leak in tipc-nl_compat_publ_dump(), from Gustavo A. R. Silva.

    8) Fix PHY vs r8169 module loading ordering issues, from Heiner
    Kallweit.

    9) Fix bridge vlan memory leak, from Ido Schimmel.

    10) Dev refcount leak in AF_PACKET, from Jason Gunthorpe.

    11) Infoleak in ipv6_local_error(), flow label isn't completely
    initialized. From Eric Dumazet.

    12) Handle mv88e6390 errata, from Andrew Lunn.

    13) Making vhost/vsock CID hashing consistent, from Zha Bin.

    14) Fix lack of UMH cleanup when it unexpectedly exits, from Taehee Yoo.

    15) Bridge forwarding must clear skb->tstamp, from Paolo Abeni.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
    bnxt_en: Fix context memory allocation.
    bnxt_en: Fix ring checking logic on 57500 chips.
    mISDN: hfcsusb: Use struct_size() in kzalloc()
    net: clear skb->tstamp in bridge forwarding path
    net: bpfilter: disallow to remove bpfilter module while being used
    net: bpfilter: restart bpfilter_umh when error occurred
    net: bpfilter: use cleanup callback to release umh_info
    umh: add exit routine for UMH process
    isdn: i4l: isdn_tty: Fix some concurrency double-free bugs
    vhost/vsock: fix vhost vsock cid hashing inconsistent
    net: stmmac: Prevent RX starvation in stmmac_napi_poll()
    net: stmmac: Fix the logic of checking if RX Watchdog must be enabled
    net: stmmac: Check if CBS is supported before configuring
    net: stmmac: dwxgmac2: Only clear interrupts that are active
    net: stmmac: Fix PCI module removal leak
    tools/bpf: fix bpftool map dump with bitfields
    tools/bpf: test btf bitfield with >=256 struct member offset
    bpf: fix bpffs bitfield pretty print
    net: ethernet: mediatek: fix warning in phy_start_aneg
    tcp: change txhash on SYN-data timeout
    ...

    Linus Torvalds
     

12 Jan, 2019

1 commit

  • A UMH process created by fork_usermode_blob(), such as bpfilter, needs to
    release the members of its umh_info when the process is terminated.
    But do_exit() does not release the members of the umh_info, so a module
    which uses UMH needs its own code to detect whether the UMH process has
    terminated or not. This extra status-checking code eventually makes the
    module more complex.

    The new PF_UMH flag is added and is used to identify UMH processes.
    exit_umh() itself does not release the members of the umh_info; hence the
    umh_info->cleanup callback should release both the members of the
    umh_info and the private data.

    Suggested-by: David S. Miller
    Signed-off-by: Taehee Yoo
    Signed-off-by: David S. Miller

    Taehee Yoo
     

05 Jan, 2019

1 commit

  • Originally, the rule used to be that you'd have to do access_ok()
    separately, and then user_access_begin() before actually doing the
    direct (optimized) user access.

    But experience has shown that people then decide not to do access_ok()
    at all, and instead rely on it being implied by other operations or
    similar. Which makes it very hard to verify that the access has
    actually been range-checked.

    If you use the unsafe direct user accesses, hardware features (either
    SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged
    Access Never - on ARM) do force you to use user_access_begin(). But
    nothing really forces the range check.

    By putting the range check into user_access_begin(), we actually force
    people to do the right thing (tm), and the range check will be visible
    near the actual accesses. We have way too long a history of people
    trying to avoid them.
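
    The resulting pattern looks roughly like this sketch (unsafe_put_user()
    takes a label to jump to on a fault; the range check now lives inside
    user_access_begin() itself):

        if (!user_access_begin(uptr, sizeof(*uptr)))
            return -EFAULT;
        unsafe_put_user(val, uptr, efault);   /* direct, unchecked access */
        user_access_end();
        return 0;

        efault:
        user_access_end();
        return -EFAULT;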

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.
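
    In code terms the conversion is simply:

        /* before */
        if (!access_ok(VERIFY_WRITE, uptr, size))
            return -EFAULT;

        /* after: the type argument is gone */
        if (!access_ok(uptr, size))
            return -EFAULT;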

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Jul, 2018

2 commits

  • Everywhere except in the pid array we distinguish between a task's pid and
    a task's tgid (thread group id). Even in the enumeration we want that
    distinction sometimes so we have added __PIDTYPE_TGID. With leader_pid
    we almost have an implementation of PIDTYPE_TGID in struct signal_struct.

    Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
    into the pids array. Then remove the __PIDTYPE_TGID special case and the
    leader_pid in signal_struct.

    The net size increase is just an extra pointer added to struct pid and
    an extra pair of pointers of an hlist_node added to task_struct.

    The effect on code maintenance is the removal of a number of special
    cases today and the potential to remove many more special cases as
    PIDTYPE_TGID gets used to its fullest. The long term potential
    is allowing zombie thread group leaders to exit, which will remove
    a lot more special cases in the code.
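
    After this change the enumeration looks essentially as follows
    (reproduced from memory, so treat it as a sketch):

        enum pid_type {
            PIDTYPE_PID,
            PIDTYPE_TGID,   /* new: the thread group id as a first-class pid type */
            PIDTYPE_PGID,
            PIDTYPE_SID,
            PIDTYPE_MAX,
        };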

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • The function is general and inline so there is no need
    to hide it inside of exit.c

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

03 Apr, 2018

1 commit

  • All call sites of sys_wait4() set *rusage to NULL. Therefore, there is
    no need for the copy_to_user() handling of *rusage, and we can use
    kernel_wait4() directly.

    This patch is part of a series which removes in-kernel calls to syscalls.
    On this basis, the syscall entry path can be streamlined. For details, see
    http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
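
    For such a call site the conversion is roughly (a sketch, assuming
    kernel_wait4() mirrors the sys_wait4() argument order):

        /* before: in-kernel call through the syscall wrapper */
        ret = sys_wait4(pid, NULL, options, NULL);

        /* after: call the internal helper directly; with *rusage == NULL
         * there is nothing to copy back to userspace anyway */
        ret = kernel_wait4(pid, NULL, options, NULL);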

    Acked-by: Luis R. Rodriguez
    Cc: Al Viro
    Cc: Andrew Morton
    Signed-off-by: Dominik Brodowski

    Dominik Brodowski
     

05 Jan, 2018

1 commit

  • gcc -fisolate-erroneous-paths-dereference can generate calls to abort()
    from modular code too.

    [arnd@arndb.de: drop duplicate exports of abort()]
    Link: http://lkml.kernel.org/r/20180102103311.706364-1-arnd@arndb.de
    Reported-by: Vineet Gupta
    Cc: Sudip Mukherjee
    Cc: Arnd Bergmann
    Cc: Alexey Brodkin
    Cc: Russell King
    Cc: Jose Abreu
    Signed-off-by: Andrew Morton
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

15 Dec, 2017

1 commit

  • The gcc option -fisolate-erroneous-paths-dereference (default at -O2
    onwards) isolates faulty code paths such as null pointer accesses, divide
    by zero, etc. If a gcc port doesn't implement __builtin_trap, an abort()
    call is generated instead, which causes a kernel link error.

    In this case, gcc is generating abort due to 'divide by zero' in
    lib/mpi/mpih-div.c.

    Currently 'frv' and 'arc' are failing. Other architectures have been
    broken in the past as well; m32r, for example, was fixed by commit
    d22e3d69ee1a ("m32r: fix build failure").

    Let's define this weak function, which is common to all architectures, and
    fix the problem permanently. We can even remove the arch-specific 'abort'
    implementations after this is done.
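
    The weak definition in question looks roughly like this (a sketch shown
    from memory; architectures can still provide their own stronger
    definition):

        __weak void abort(void)
        {
            BUG();

            /* if that doesn't kill us, halt */
            panic("Oops failed to kill thread");
        }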

    Link: http://lkml.kernel.org/r/1513118956-8718-1-git-send-email-sudipm.mukherjee@gmail.com
    Signed-off-by: Sudip Mukherjee
    Cc: Alexey Brodkin
    Cc: Vineet Gupta
    Cc: Sudip Mukherjee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sudip Mukherjee
     

25 Oct, 2017

1 commit

  • …READ_ONCE()/WRITE_ONCE()

    Please do not apply this to mainline directly, instead please re-run the
    coccinelle script shown below and apply its output.

    For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
    preference to ACCESS_ONCE(), and new code is expected to use one of the
    former. So far, there's been no reason to change most existing uses of
    ACCESS_ONCE(), as these aren't harmful, and changing them results in
    churn.

    However, for some features, the read/write distinction is critical to
    correct operation. To distinguish these cases, separate read/write
    accessors must be used. This patch migrates (most) remaining
    ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
    coccinelle script:

    ----
    // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
    // WRITE_ONCE()

    // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch

    virtual patch

    @ depends on patch @
    expression E1, E2;
    @@

    - ACCESS_ONCE(E1) = E2
    + WRITE_ONCE(E1, E2)

    @ depends on patch @
    expression E;
    @@

    - ACCESS_ONCE(E)
    + READ_ONCE(E)
    ----
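
    Applied to a typical use, the conversion looks like:

        /* before */
        val = ACCESS_ONCE(p->counter);
        ACCESS_ONCE(p->counter) = val + 1;

        /* after */
        val = READ_ONCE(p->counter);
        WRITE_ONCE(p->counter, val + 1);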

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: davem@davemloft.net
    Cc: linux-arch@vger.kernel.org
    Cc: mpe@ellerman.id.au
    Cc: shuah@kernel.org
    Cc: snitzer@redhat.com
    Cc: thor.thayer@linux.intel.com
    Cc: tj@kernel.org
    Cc: viro@zeniv.linux.org.uk
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Mark Rutland
     

21 Oct, 2017

1 commit

  • As pointed out by Linus and David, the earlier waitid() fix resulted in
    a (currently harmless) unbalanced user_access_end() call. This fixes it
    to just directly return EFAULT on access_ok() failure.

    Fixes: 96ca579a1ecc ("waitid(): Add missing access_ok() checks")
    Acked-by: David Daney
    Cc: Al Viro
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

10 Oct, 2017

1 commit

  • Adds missing access_ok() checks.

    CVE-2017-5123

    Reported-by: Chris Salls
    Signed-off-by: Kees Cook
    Acked-by: Al Viro
    Fixes: 4c48abe91be0 ("waitid(): switch copyout of siginfo to unsafe_put_user()")
    Cc: stable@kernel.org # 4.13
    Signed-off-by: Linus Torvalds

    Kees Cook
     

30 Sep, 2017

1 commit

  • kernel_waitid() can return a PID, an error or 0. rusage is filled in the first
    case and waitid(2) rusage should've been copied out exactly in that case, *not*
    whenever kernel_waitid() has not returned an error. Compat variant shares that
    braino; none of kernel_wait4() callers do, so the below ought to fix it.

    Reported-and-tested-by: Alexander Potapenko
    Fixes: ce72a16fa705 ("wait4(2)/waitid(2): separate copying rusage to userland")
    Cc: stable@vger.kernel.org # v4.13
    Signed-off-by: Al Viro

    Al Viro
     

12 Sep, 2017

1 commit

  • Pull namespace updates from Eric Biederman:
    "Life has been busy and I have not gotten half as much done this round
    as I would have liked. I delayed it so that a minor conflict
    resolution with the mips tree could spend a little time in linux-next
    before I sent this pull request.

    This includes two long delayed user namespace changes from Kirill
    Tkhai. It also includes a very useful change from Serge Hallyn that
    allows the security capability attribute to be used inside of user
    namespaces. The practical effect of this is people can now untar
    tarballs and install rpms in user namespaces. It had been suggested to
    generalize this and encode some of the namespace information in the
    xattr name. Upon close inspection that makes the
    things that should be hard easy and the things that should be easy
    more expensive.

    Then there is my bugfix/cleanup for signal injection that removes the
    magic encoding of the siginfo union member from the kernel internal
    si_code. The mips folks reported that the case where I had used FPE_FIXME
    is impossible, so I have removed FPE_FIXME from mips, while at the
    same time including a return statement in that case to keep gcc from
    complaining about uninitialized variables.

    I almost finished the work to make copy_siginfo_to_user a trivial
    copy to user. The code is available at:

    git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3

    But I did not have time/energy to get the code posted and reviewed
    before the merge window opened.

    I was able to see that the security excuse for just copying fields
    that we know are initialized doesn't work in practice: there are buggy
    initializations that don't initialize the proper fields in siginfo. So
    we still sometimes copy uninitialized data to userspace"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    Introduce v3 namespaced file capabilities
    mips/signal: In force_fcr31_sig return in the impossible case
    signal: Remove kernel interal si_code magic
    fcntl: Don't use ambiguous SIG_POLL si_codes
    prctl: Allow local CAP_SYS_ADMIN changing exe_file
    security: Use user_namespace::level to avoid redundant iterations in cap_capable()
    userns,pidns: Verify the userns for new pid namespaces
    signal/testing: Don't look for __SI_FAULT in userspace
    signal/mips: Document a conflict with SI_USER with SIGFPE
    signal/sparc: Document a conflict with SI_USER with SIGFPE
    signal/ia64: Document a conflict with SI_USER with SIGFPE
    signal/alpha: Document a conflict with SI_USER for SIGTRAP

    Linus Torvalds
     

05 Sep, 2017

1 commit

  • Pull locking updates from Ingo Molnar:

    - Add 'cross-release' support to lockdep, which allows APIs like
    completions, where it's not the 'owner' who releases the lock, to be
    tracked. It's all activated automatically under
    CONFIG_PROVE_LOCKING=y.

    - Clean up (restructure) the x86 atomics op implementation to be more
    readable, in preparation of KASAN annotations. (Dmitry Vyukov)

    - Fix static keys (Paolo Bonzini)

    - Add killable versions of down_read() et al (Kirill Tkhai)

    - Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)

    - Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)

    - Remove smp_mb__before_spinlock() and convert its usages, introduce
    smp_mb__after_spinlock() (Peter Zijlstra)

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
    locking/lockdep/selftests: Fix mixed read-write ABBA tests
    sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
    acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
    locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
    smp: Avoid using two cache lines for struct call_single_data
    locking/lockdep: Untangle xhlock history save/restore from task independence
    locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
    futex: Remove duplicated code and fix undefined behaviour
    Documentation/locking/atomic: Finish the document...
    locking/lockdep: Fix workqueue crossrelease annotation
    workqueue/lockdep: 'Fix' flush_work() annotation
    locking/lockdep/selftests: Add mixed read-write ABBA tests
    mm, locking/barriers: Clarify tlb_flush_pending() barriers
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
    locking/lockdep: Explicitly initialize wq_barrier::done::map
    locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
    locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
    locking/refcounts, x86/asm: Implement fast refcount overflow protection
    locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
    ...

    Linus Torvalds