15 Oct, 2020

1 commit

  • Pull kernel_clone() updates from Christian Brauner:
    "During the v5.9 merge window we reworked the process creation
    codepaths across multiple architectures. After this work we were only
    left with the _do_fork() helper based on the struct kernel_clone_args
    calling convention. As was pointed out, _do_fork() isn't valid
    kernelese, especially for a helper that isn't just static.

    This series removes the _do_fork() helper and introduces the new
    kernel_clone() helper. The process creation cleanup didn't change the
    name to something more reasonable mainly because _do_fork() was used
    in quite a few places. So sending this as a separate series seemed the
    better strategy.

    I originally intended to send this early in the v5.9 development cycle
    after the merge window had closed but given that this was touching
    quite a few places I decided to defer this until the v5.10 merge
    window"

    * tag 'kernel-clone-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    sched: remove _do_fork()
    tracing: switch to kernel_clone()
    kgdbts: switch to kernel_clone()
    kprobes: switch to kernel_clone()
    x86: switch to kernel_clone()
    sparc: switch to kernel_clone()
    nios2: switch to kernel_clone()
    m68k: switch to kernel_clone()
    ia64: switch to kernel_clone()
    h8300: switch to kernel_clone()
    fork: introduce kernel_clone()

    Linus Torvalds
     

14 Oct, 2020

3 commits

  • Currently __set_oom_adj loops through all processes in the system to keep
    oom_score_adj and oom_score_adj_min in sync between processes sharing
    their mm. This is done for any task with more than one mm_user, which
    includes processes with multiple threads (sharing mm and signals).
    However, for such processes the loop is unnecessary because their signal
    structure is shared as well.

    Android updates oom_score_adj whenever a task changes its role
    (background/foreground/...) or binds to/unbinds from a service, making it
    more/less important. Such operations can happen frequently. We noticed
    that updates to oom_score_adj became more expensive and after further
    investigation found out that the patch mentioned in "Fixes" introduced a
    regression. Using Pixel 4 with a typical Android workload, write time to
    oom_score_adj increased from ~3.57us to ~362us. Moreover this regression
    linearly depends on the number of multi-threaded processes running on the
    system.

    Mark the mm with a new MMF_MULTIPROCESS flag bit when a task is created with
    (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK). Change __set_oom_adj to use
    MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
    update should be synchronized between multiple processes. To prevent
    races between clone() and __set_oom_adj(), when oom_score_adj of the
    process being cloned might be modified from userspace, we use
    oom_adj_mutex. Its scope is changed to global.

    The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
    the case of vfork(). To prevent performance regressions of vfork(), we
    skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
    specified. Clearing the MMF_MULTIPROCESS flag (when the last process
    sharing the mm exits) is left out of this patch to keep it simple and
    because it is believed that this threading model is rare. Should there
    ever be a need for optimizing that case as well, it can be done by hooking
    into the exit path, likely following the mm_update_next_owner pattern.

    With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
    quite rare, the regression is gone after the change is applied.
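
    A minimal sketch of the clone-time hook described above (simplified and
    illustrative; names follow the patch, but details may differ):

    static void copy_oom_score_adj(u64 clone_flags, struct task_struct *tsk)
    {
            /* Kernel threads have no mm to synchronize. */
            if (!tsk->mm)
                    return;

            /* Skip threads (shared signals) and vfork() children. */
            if ((clone_flags & (CLONE_VM | CLONE_THREAD | CLONE_VFORK)) != CLONE_VM)
                    return;

            /* Serialize against a concurrent __set_oom_adj(). */
            mutex_lock(&oom_adj_mutex);
            set_bit(MMF_MULTIPROCESS, &tsk->mm->flags);
            tsk->signal->oom_score_adj = current->signal->oom_score_adj;
            tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;
            mutex_unlock(&oom_adj_mutex);
    }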

    [surenb@google.com: v3]
    Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com

    Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
    Reported-by: Tim Murray
    Suggested-by: Michal Hocko
    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Andrew Morton
    Acked-by: Christian Brauner
    Acked-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Eugene Syromiatnikov
    Cc: Christian Kellner
    Cc: Adrian Reber
    Cc: Shakeel Butt
    Cc: Aleksa Sarai
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Alexey Gladkov
    Cc: Michel Lespinasse
    Cc: Daniel Jordan
    Cc: Andrei Vagin
    Cc: Bernd Edlinger
    Cc: John Johansen
    Cc: Yafang Shao
    Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
    Debugged-by: Minchan Kim
    Signed-off-by: Linus Torvalds

    Suren Baghdasaryan
     
  • Neither of the mm pointers is needed after commit 7a4830c380f3
    ("mm/fork: Pass new vma pointer into copy_page_range()").

    Jason Gunthorpe also reported that the parameter ordering of
    copy_page_range() is odd. Since we are working on it anyway, reorder the
    parameters to be logical: (1) always put the dst_* fields before the
    src_* fields, and (2) keep parameters of the same type together.
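
    For illustration, the prototype change looks roughly like this (a sketch;
    the dropped mm pointers are recoverable from the vmas via vma->vm_mm):

    /* before: mm pointers plus vmas, mixed ordering */
    int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                        struct vm_area_struct *vma, struct vm_area_struct *new);

    /* after: dst_* before src_*, same-typed parameters together */
    int copy_page_range(struct vm_area_struct *dst_vma,
                        struct vm_area_struct *src_vma);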

    [peterx@redhat.com: further reorder some parameters and line format, per Jason]
    Link: https://lkml.kernel.org/r/20201002192647.7161-1-peterx@redhat.com
    [peterx@redhat.com: fix warnings]
    Link: https://lkml.kernel.org/r/20201006200138.GA6026@xz-x1

    Reported-by: Kirill A. Shutemov
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Acked-by: Kirill A. Shutemov
    Link: https://lkml.kernel.org/r/20200930204950.6668-1-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Commit 4bb5f5d9395b ("mm: allow drivers to prevent new writable mappings")
    changed i_mmap_writable from unsigned int to atomic_t and added the helper
    function mapping_allow_writable() to atomic_inc i_mmap_writable, but it
    forgot to use this helper in dup_mmap() and __vma_link_file().
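
    The helper itself is a one-line wrapper, so the fix is a straight
    substitution at both call sites (sketch of the helper as added by
    commit 4bb5f5d9395b):

    static inline void mapping_allow_writable(struct address_space *mapping)
    {
            atomic_inc(&mapping->i_mmap_writable);
    }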

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Cc: Christian Brauner
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Christian Kellner
    Cc: Suren Baghdasaryan
    Cc: Adrian Reber
    Cc: Shakeel Butt
    Cc: Aleksa Sarai
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200917112736.7789-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

01 Oct, 2020

1 commit

  • Grab actual references to the files_struct. To avoid circular reference
    issues due to this, we add a per-task note that keeps track of which
    io_uring contexts a task has used. When the task execs or exits its
    assigned files, we cancel requests based on this tracking.

    With that, we can grab proper references to the files table, and no
    longer need to rely on stashing away ring_fd and ring_file to check
    if the ring_fd may have been closed.

    Cc: stable@vger.kernel.org # v5.5+
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Sep, 2020

2 commits

  • This prepares for the future work to trigger early cow on pinned pages
    during fork().

    No functional change intended.

    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • (Commit message largely collected from Jason Gunthorpe)

    Reduce the chance of false positive from page_maybe_dma_pinned() by
    keeping track if the mm_struct has ever been used with pin_user_pages().
    This allows cases that might drive up the page ref_count to avoid any
    penalty from handling dma_pinned pages.

    Future work is planned, to provide a more sophisticated solution, likely
    to turn it into a real counter. For now, make it atomic_t but use it as
    a boolean for simplicity.
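
    A sketch of the tracking described above (simplified; the flag lives in
    mm_struct and is set on the pin_user_pages() path):

    struct mm_struct {
            /* ... */
            atomic_t has_pinned;    /* atomic_t for now, used as a boolean */
    };

    /* on the pin_user_pages() path, before pinning: */
    if (!atomic_read(&mm->has_pinned))
            atomic_set(&mm->has_pinned, 1);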

    Suggested-by: Jason Gunthorpe
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

06 Sep, 2020

1 commit

  • Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    changed ctl_table.proc_handler to take a kernel pointer. Adjust the
    definition of sysctl_max_threads to match its prototype in
    linux/sysctl.h which fixes the following sparse error/warning:

    kernel/fork.c:3050:47: warning: incorrect type in argument 3 (different address spaces)
    kernel/fork.c:3050:47: expected void *
    kernel/fork.c:3050:47: got void [noderef] __user *buffer
    kernel/fork.c:3036:5: error: symbol 'sysctl_max_threads' redeclared with different type (incompatible argument 3 (different address spaces)):
    kernel/fork.c:3036:5: int extern [addressable] [signed] [toplevel] sysctl_max_threads( ... )
    kernel/fork.c: note: in included file (through include/linux/key.h, include/linux/cred.h, include/linux/sched/signal.h, include/linux/sched/cputime.h):
    include/linux/sysctl.h:242:5: note: previously declared as:
    include/linux/sysctl.h:242:5: int extern [addressable] [signed] [toplevel] sysctl_max_threads( ... )
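
    The fix drops the __user annotation from the buffer argument so that the
    definition matches the declaration in include/linux/sysctl.h:

    int sysctl_max_threads(struct ctl_table *table, int write,
                           void *buffer, size_t *lenp, loff_t *ppos);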

    Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    Signed-off-by: Tobias Klauser
    Signed-off-by: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Al Viro
    Link: https://lkml.kernel.org/r/20200825093647.24263-1-tklauser@distanz.ch
    Signed-off-by: Linus Torvalds

    Tobias Klauser
     

20 Aug, 2020

1 commit

  • The old _do_fork() helper doesn't follow naming conventions of in-kernel
    helpers for syscalls. The process creation cleanup in [1] didn't change the
    name to something more reasonable mainly because _do_fork() was used in quite a
    few places. So sending this as a separate series seemed the better strategy.

    This commit does two things:
    1. Renames _do_fork() to kernel_clone() but keeps _do_fork() as a simple
    static inline wrapper around kernel_clone().
    2. Changes the return type from long to pid_t. This aligns kernel_thread()
    and kernel_clone(). Also, the return value from kernel_clone() that is
    surfaced in fork(), vfork(), clone(), and clone3() is taken from pid_vnr(),
    which returns a pid_t too.

    Follow-up patches will switch each caller of _do_fork() and each place where it
    is referenced over to kernel_clone(). After all these changes are done, we can
    remove _do_fork() completely and will only be left with kernel_clone().

    [1]: 9ba27414f2ec ("Merge tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux")
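
    The transitional state after this commit looks roughly like this (sketch):

    pid_t kernel_clone(struct kernel_clone_args *kargs);

    /* temporary wrapper, removed once all callers are converted */
    static inline long _do_fork(struct kernel_clone_args *kargs)
    {
            return kernel_clone(kargs);
    }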

    Signed-off-by: Christian Brauner
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Matthew Wilcox (Oracle)
    Cc: "Peter Zijlstra (Intel)"
    Link: https://lore.kernel.org/r/20200819104655.436656-2-christian.brauner@ubuntu.com

    Christian Brauner
     

11 Aug, 2020

1 commit

  • Pull locking updates from Thomas Gleixner:
    "A set of locking fixes and updates:

    - Untangle the header spaghetti which causes build failures in
    various situations caused by the lockdep additions to seqcount to
    validate that the write side critical sections are non-preemptible.

    - The seqcount associated lock debug addons which were blocked by the
    above fallout.

    seqcount writers, contrary to seqlock writers, must be externally
    serialized, which usually happens via locking - except for strict
    per-CPU seqcounts. As the lock is not part of the seqcount, lockdep
    cannot validate that the lock is held.

    This new debug mechanism adds the concept of associated locks. The
    sequence count now has lock type variants and corresponding
    initializers which take a pointer to the associated lock used for
    writer serialization. If lockdep is enabled the pointer is stored
    and write_seqcount_begin() has a lockdep assertion to validate that
    the lock is held.

    Aside of the type and the initializer no other code changes are
    required at the seqcount usage sites. The rest of the seqcount API
    is unchanged and determines the type at compile time with the help
    of _Generic which is possible now that the minimal GCC version has
    been moved up.

    Adding this lockdep coverage unearthed a handful of seqcount bugs
    which have been addressed already independent of this.

    While generally useful this comes with a Trojan Horse twist: On RT
    kernels the write side critical section can become preemptible if
    the writers are serialized by an associated lock, which leads to
    the well known reader preempts writer livelock. RT prevents this by
    storing the associated lock pointer independent of lockdep in the
    seqcount and changing the reader side to block on the lock when a
    reader detects that a writer is in the write side critical section.

    - Conversion of seqcount usage sites to associated types and
    initializers"

    * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    locking/seqlock, headers: Untangle the spaghetti monster
    locking, arch/ia64: Reduce header dependencies by moving XTP bits into the new header
    x86/headers: Remove APIC headers from <asm/smp.h>
    seqcount: More consistent seqprop names
    seqcount: Compress SEQCNT_LOCKNAME_ZERO()
    seqlock: Fold seqcount_LOCKNAME_init() definition
    seqlock: Fold seqcount_LOCKNAME_t definition
    seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
    hrtimer: Use sequence counter with associated raw spinlock
    kvm/eventfd: Use sequence counter with associated spinlock
    userfaultfd: Use sequence counter with associated spinlock
    NFSv4: Use sequence counter with associated spinlock
    iocost: Use sequence counter with associated spinlock
    raid5: Use sequence counter with associated spinlock
    vfs: Use sequence counter with associated spinlock
    timekeeping: Use sequence counter with associated raw spinlock
    xfrm: policy: Use sequence counters with associated lock
    netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
    netfilter: conntrack: Use sequence counter with associated spinlock
    sched: tasks: Use sequence counter with associated spinlock
    ...

    Linus Torvalds
     

08 Aug, 2020

2 commits

  • Patch series "kasan: support stack instrumentation for tag-based mode", v2.

    This patch (of 5):

    Prepare Software Tag-Based KASAN for stack tagging support.

    With Tag-Based KASAN when kernel stacks are allocated via pagealloc (which
    happens when CONFIG_VMAP_STACK is not enabled), they get tagged. KASAN
    instrumentation doesn't expect the sp register to be tagged, and this
    leads to false-positive reports.

    Fix by resetting the tag of kernel stack pointers after allocation.
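
    The fix amounts to passing the freshly allocated stack pointer through
    kasan_reset_tag(), roughly (sketch of the pagealloc path in
    alloc_thread_stack_node()):

    struct page *page = alloc_pages_node(node, THREADINFO_GFP,
                                         THREAD_SIZE_ORDER);

    if (likely(page)) {
            /* Stack accesses via sp must use an untagged pointer. */
            tsk->stack = kasan_reset_tag(page_address(page));
            return tsk->stack;
    }
    return NULL;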

    Signed-off-by: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Marco Elver
    Cc: Walter Wu
    Cc: Elena Petrova
    Cc: Vincenzo Frascino
    Cc: Catalin Marinas
    Cc: Ard Biesheuvel
    Link: http://lkml.kernel.org/r/cover.1596199677.git.andreyknvl@google.com
    Link: http://lkml.kernel.org/r/cover.1596544734.git.andreyknvl@google.com
    Link: http://lkml.kernel.org/r/12d8c678869268dd0884b01271ab592f30792abf.1596544734.git.andreyknvl@google.com
    Link: http://lkml.kernel.org/r/01c678b877755bcf29009176592402cdf6f2cb15.1596199677.git.andreyknvl@google.com
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=203497
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Currently the kernel stack is being accounted per-zone. There is no need
    to do that. In addition due to being per-zone, memcg has to keep a
    separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
    MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
    node_stat_item. In addition localize the kernel stack stats updates to
    account_kernel_stack().

    Signed-off-by: Shakeel Butt
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

05 Aug, 2020

4 commits

  • Pull close_range() implementation from Christian Brauner:
    "This adds the close_range() syscall. It allows to efficiently close a
    range of file descriptors up to all file descriptors of a calling
    task.

    This is coordinated with the FreeBSD folks which have copied our
    version of this syscall and in the meantime have already merged it in
    April 2019:

    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836

    The syscall originally came up in a discussion around the new mount
    API and making new file descriptor types cloexec by default. During
    this discussion, Al suggested the close_range() syscall.

    First, it helps to close all file descriptors of an exec()ing task.
    This can be done safely via (quoting Al's example from [1] verbatim):

    /* that exec is sensitive */
    unshare(CLONE_FILES);
    /* we don't want anything past stderr here */
    close_range(3, ~0U);
    execve(....);

    The code snippet above is one way of working around the problem that
    file descriptors are not cloexec by default. This is aggravated by the
    fact that we can't just switch them over without massively regressing
    userspace. For a whole class of programs having an in-kernel method of
    closing all file descriptors is very helpful (e.g. daemons, service
    managers, programming language standard libraries, container managers
    etc.).

    Second, it allows userspace to avoid implementing closing all file
    descriptors by parsing through /proc/<pid>/fd/* and calling close() on
    each file descriptor and other hacks. From looking at various
    large(ish) userspace code bases this or similar patterns are very
    common in service managers, container runtimes, and programming
    language runtimes/standard libraries such as Python or Rust.

    In addition, the syscall will also work for tasks that do not have
    procfs mounted and on kernels that do not have procfs support compiled
    in. In such situations the only way to make sure that all file
    descriptors are closed is to call close() on each file descriptor up
    to UINT_MAX, or to resort to RLIMIT_NOFILE/OPEN_MAX trickery.

    Based on Linus' suggestion close_range() also comes with a new flag
    CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping
    right before exec. This would usually be expressed in the sequence:

    unshare(CLONE_FILES);
    close_range(3, ~0U);

    as pointed out by Linus it might be desirable to have this be a part
    of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which
    gets especially handy when we're closing all file descriptors above a
    certain threshold.

    Test-suite as always included"
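
    For reference, a minimal use of the new flag (a sketch; the raw syscall
    is shown since libc wrappers may not exist yet):

    /* unshare the fd table and close everything past stderr in one call */
    syscall(__NR_close_range, 3, ~0U, CLOSE_RANGE_UNSHARE);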

    * tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add CLOSE_RANGE_UNSHARE tests
    close_range: add CLOSE_RANGE_UNSHARE
    tests: add close_range() tests
    arch: wire-up close_range()
    open: add close_range()

    Linus Torvalds
     
  • Pull fork cleanups from Christian Brauner:
    "This is cleanup series from when we reworked a chunk of the process
    creation paths in the kernel and switched to struct
    {kernel_}clone_args.

    High-level this does two main things:

    - Remove the double export of both do_fork() and _do_fork() where
    do_fork() used the inconsistent legacy clone calling convention.

    Now we only export _do_fork() which is based on struct
    kernel_clone_args.

    - Remove the copy_thread_tls()/copy_thread() split making the
    architecture specific HAVE_COPY_THREAD_TLS config option obsolete.

    This switches all remaining architectures to select
    HAVE_COPY_THREAD_TLS and thus to the copy_thread_tls() calling
    convention. The current split makes the process creation codepaths
    more convoluted than they need to be. Each architecture has its own
    copy_thread() function unless it selects HAVE_COPY_THREAD_TLS, in which
    case it has a copy_thread_tls() function.

    The split is not needed anymore nowadays, all architectures support
    CLONE_SETTLS but quite a few of them never bothered to select
    HAVE_COPY_THREAD_TLS and instead simply continued to use copy_thread()
    and use the old calling convention. Removing this split cleans up the
    process creation codepaths and paves the way for implementing clone3()
    on such architectures since it requires the copy_thread_tls() calling
    convention.

    After having made each architectures support copy_thread_tls() this
    series simply renames that function back to copy_thread(). It also
    switches all architectures that call do_fork() directly over to
    _do_fork() and the struct kernel_clone_args calling convention. This
    is a corollary of switching the architectures that did not yet support
    it over to copy_thread_tls() since do_fork() is conditional on not
    supporting copy_thread_tls() (Mostly because it lacks a separate
    argument for tls which is trivial to fix but there's no need for this
    function to exist.).

    The do_fork() removal is in itself already useful as it allows us to
    remove the export of both do_fork() and _do_fork() we currently have
    in favor of only _do_fork(). This has already been discussed back when
    we added clone3(). The legacy clone() calling convention is - as is
    probably well-known - somewhat odd:

    #
    # ABI hall of shame
    #
    config CLONE_BACKWARDS
    config CLONE_BACKWARDS2
    config CLONE_BACKWARDS3

    that is aggravated by the fact that some architectures such as sparc
    follow the CLONE_BACKWARDSx calling convention but don't really select
    the corresponding config option since they call do_fork() directly.

    So do_fork() enforces a somewhat arbitrary calling convention in the
    first place that doesn't really help the individual architectures that
    deviate from it. They can thus simply be switched to _do_fork()
    enforcing a single calling convention. (I really hope that any new
    architectures will __not__ try to implement their own calling
    conventions...)

    Most architectures already have made a similar switch (m68k comes to
    mind).

    Overall this removes more code than it adds even with a good portion
    of added comments. It simplifies a chunk of arch specific assembly
    either by moving the code into C or by simply rewriting the assembly.

    Architectures that have been touched in non-trivial ways have all been
    actually boot and stress tested: sparc and ia64 have been tested with
    Debian 9 images. They are the two architectures which have been
    touched the most. All non-trivial changes to architectures have seen
    acks from the relevant maintainers. nios2 with a custom built
    buildroot image. h8300 I couldn't get something bootable to test on
    but the changes have been fairly automatic and I'm sure we'll hear
    people yell if I broke something there.

    All other architectures that have been touched in trivial ways have
    been compile tested for each single patch of the series via git rebase
    -x "make ..." v5.8-rc2. arm{64} and x86{_64} have been boot tested
    even though they have just been trivially touched (removal of the
    HAVE_COPY_THREAD_TLS macro from their Kconfig) because well they are
    basically "core architectures" and since it is trivial to get your
    hands on a useable image"

    * tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    arch: rename copy_thread_tls() back to copy_thread()
    arch: remove HAVE_COPY_THREAD_TLS
    unicore: switch to copy_thread_tls()
    sh: switch to copy_thread_tls()
    nds32: switch to copy_thread_tls()
    microblaze: switch to copy_thread_tls()
    hexagon: switch to copy_thread_tls()
    c6x: switch to copy_thread_tls()
    alpha: switch to copy_thread_tls()
    fork: remove do_fork()
    h8300: select HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
    nios2: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
    ia64: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
    sparc: unconditionally enable HAVE_COPY_THREAD_TLS
    sparc: share process creation helpers between sparc and sparc64
    sparc64: enable HAVE_COPY_THREAD_TLS
    fork: fold legacy_clone_args_valid() into _do_fork()

    Linus Torvalds
     
  • Pull execve updates from Eric Biederman:
    "During the development of v5.7 I ran into bugs and quality of
    implementation issues related to exec that could not be easily fixed
    because of the way exec is implemented. So I have been diggin into
    exec and cleaning up what I can.

    This cycle I have been looking at different ideas and different
    implementations to see what is possible to improve exec, and cleaning
    up the way exec interfaces with in-kernel users. Only cleaning up the
    interfaces of exec with the rest of the kernel has managed to stabilize
    and make it through review in time for v5.9-rc1, resulting in 2 sets of
    changes this cycle.

    - Implement kernel_execve

    - Make the user mode driver code a better citizen

    With kernel_execve the code size got a little larger as the copying of
    parameters from userspace and the copying of parameters from kernel
    space are now separate. The good news is kernel threads no longer need
    to play games with set_fs to use exec. Which, when combined with the
    rest of Christoph's set_fs changes, should make security bugs with
    set_fs much more difficult"

    * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
    exec: Implement kernel_execve
    exec: Factor bprm_stack_limits out of prepare_arg_pages
    exec: Factor bprm_execve out of do_execve_common
    exec: Move bprm_mm_init into alloc_bprm
    exec: Move initialization of bprm->filename into alloc_bprm
    exec: Factor out alloc_bprm
    exec: Remove unnecessary spaces from binfmts.h
    umd: Stop using split_argv
    umd: Remove exit_umh
    bpfilter: Take advantage of the facilities of struct pid
    exit: Factor thread_group_exited out of pidfd_poll
    umd: Track user space drivers with struct pid
    bpfilter: Move bpfilter_umh back into init data
    exec: Remove do_execve_file
    umh: Stop calling do_execve_file
    umd: Transform fork_usermode_blob into fork_usermode_driver
    umd: Rename umd_info.cmdline umd_info.driver_name
    umd: For clarity rename umh_info umd_info
    umh: Separate the user mode driver and the user mode helper support
    umh: Remove call_usermodehelper_setup_file.
    ...

    Linus Torvalds
     
  • Pull seccomp updates from Kees Cook:
    "There are a bunch of clean ups and selftest improvements along with
    two major updates to the SECCOMP_RET_USER_NOTIF filter return:
    EPOLLHUP support to more easily detect the death of a monitored
    process, and being able to inject fds when intercepting syscalls that
    expect an fd-opening side-effect (needed by both container folks and
    Chrome). The latter continued the refactoring of __scm_install_fd()
    started by Christoph, and in the process found and fixed a handful of
    bugs in various callers.

    - Improved selftest coverage, timeouts, and reporting

    - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)

    - Refactor __scm_install_fd() into __receive_fd() and fix buggy
    callers

    - Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun
    Dhillon)"

    * tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits)
    selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
    seccomp: Introduce addfd ioctl to seccomp user notifier
    fs: Expand __receive_fd() to accept existing fd
    pidfd: Replace open-coded receive_fd()
    fs: Add receive_fd() wrapper for __receive_fd()
    fs: Move __scm_install_fd() to __receive_fd()
    net/scm: Regularize compat handling of scm_detach_fds()
    pidfd: Add missing sock updates for pidfd_getfd()
    net/compat: Add missing sock updates for SCM_RIGHTS
    selftests/seccomp: Check ENOSYS under tracing
    selftests/seccomp: Refactor to use fixture variants
    selftests/harness: Clean up kern-doc for fixtures
    seccomp: Use -1 marker for end of mode 1 syscall list
    seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
    selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
    selftests/seccomp: Make kcmp() less required
    seccomp: Use pr_fmt
    selftests/seccomp: Improve calibration loop
    selftests/seccomp: use 90s as timeout
    selftests/seccomp: Expand benchmark to per-filter measurements
    ...

    Linus Torvalds
     

04 Aug, 2020

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - Improve uclamp performance by using a static key for the fast path

    - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
    better power efficiency of RT tasks on battery powered devices.
    (The default is to maximize performance & reduce RT latencies.)

    - Improve utime and stime tracking accuracy, which had a fixed boundary
    of error, which created larger and larger relative errors as the
    values become larger. This is now replaced with more precise
    arithmetics, using the new mul_u64_u64_div_u64() helper in math64.h.

    - Improve the deadline scheduler, such as making it capacity aware

    - Improve frequency-invariant scheduling

    - Misc cleanups in energy/power aware scheduling

    - Add sched_update_nr_running tracepoint to track changes to nr_running

    - Documentation additions and updates

    - Misc cleanups and smaller fixes

    * tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
    sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst
    sched/doc: Document capacity aware scheduling
    sched: Document arch_scale_*_capacity()
    arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE
    Documentation/sysctl: Document uclamp sysctl knobs
    sched/uclamp: Add a new sysctl to control RT default boost value
    sched/uclamp: Fix a deadlock when enabling uclamp static key
    sched: Remove duplicated tick_nohz_full_enabled() check
    sched: Fix a typo in a comment
    sched/uclamp: Remove unnecessary mutex_init()
    arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE
    sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry
    arch_topology, sched/core: Cleanup thermal pressure definition
    trace/events/sched.h: fix duplicated word
    linux/sched/mm.h: drop duplicated words in comments
    smp: Fix a potential usage of stale nr_cpus
    sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
    sched: nohz: stop passing around unused "ticks" parameter.
    sched: Better document ttwu()
    sched: Add a tracepoint to track rq->nr_running
    ...

    Linus Torvalds
     

31 Jul, 2020

1 commit

  • Refactor the IRQ trace events fields, used for printing information
    about the IRQ trace events, into a separate struct 'irqtrace_events'.

    This improves readability by separating the information only used in
    reporting, as well as enables (simplified) storing/restoring of
    irqtrace_events snapshots.

    No functional change intended.
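
    The new struct gathers the previously loose task_struct fields, roughly
    (sketch of the layout):

    struct irqtrace_events {
            unsigned int    irq_events;
            unsigned long   hardirq_enable_ip;
            unsigned long   hardirq_disable_ip;
            unsigned int    hardirq_enable_event;
            unsigned int    hardirq_disable_event;
            unsigned long   softirq_disable_ip;
            unsigned long   softirq_enable_ip;
            unsigned int    softirq_disable_event;
            unsigned int    softirq_enable_event;
    };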

    Signed-off-by: Marco Elver
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20200729110916.3920464-1-elver@google.com
    Signed-off-by: Ingo Molnar

    Marco Elver
     

29 Jul, 2020

2 commits

  • A sequence counter write side critical section must be protected by some
    form of locking to serialize writers. A plain seqcount_t does not
    contain the information of which lock must be held when entering a write
    side critical section.

    Use the new seqcount_spinlock_t data type, which allows to associate a
    spinlock with the sequence counter. This enables lockdep to verify that
    the spinlock used for writer serialization is held when the write side
    critical section is entered.

    If lockdep is disabled this lock association is compiled out and has
    neither storage size nor runtime overhead.
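
    A usage sketch of the associated-lock type (hypothetical names):

    static DEFINE_SPINLOCK(stats_lock);
    static seqcount_spinlock_t stats_seq =
            SEQCNT_SPINLOCK_ZERO(stats_seq, &stats_lock);

    /* write side: lockdep can now assert that stats_lock is held */
    spin_lock(&stats_lock);
    write_seqcount_begin(&stats_seq);
    /* ... update the protected data ... */
    write_seqcount_end(&stats_seq);
    spin_unlock(&stats_lock);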

    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200720155530.1173732-14-a.darwish@linutronix.de

    Ahmed S. Darwish
     
  • RT tasks by default run at the highest capacity/performance level. When
    uclamp is selected this default behavior is retained by enforcing the
    requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
    uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
    value.

    This is also referred to as 'the default boost value of RT tasks'.

    See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

    On battery powered devices, it is desired to control this default
    (currently hardcoded) behavior at runtime to reduce energy consumed by
    RT tasks.

    For example, for a mobile device manufacturer where the big.LITTLE
    architecture is dominant, the performance of the little cores varies
    across SoCs, and on high end ones the big cores could be too power
    hungry.

    Given the diversity of SoCs, the new knob allows manufacturers to tune
    the best performance/power for RT tasks for the particular hardware they
    run on.

    They could opt to further tune the value when the user selects
    a different power saving mode or when the device is actively charging.

    The runtime aspect of it further helps in creating a single kernel image
    that can be run on multiple devices that require different tuning.

    Keep in mind that a lot of RT tasks in the system are created by the
    kernel. On Android for instance I can see over 50 RT tasks, only
    a handful of which are created by the Android framework.

    To control the default behavior globally by system admins and device
    integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
    to change the default boost value of the RT tasks.

    I anticipate this to be mostly in the form of modifying the init script
    of a particular device.

    To avoid polluting the fast path with unnecessary code, the approach
    taken is to synchronously do the update by traversing all the existing
    tasks in the system. This could race with a concurrent fork(), which is
    dealt with by introducing a sched_post_fork() function which will ensure
    the racy fork will get the right update applied.

    Tested on Juno-r2 in combination with the RT capacity awareness [1].
    By default an RT task will go to the highest capacity CPU and run at the
    maximum frequency, which is particularly energy inefficient on high end
    mobile devices because the biggest core[s] are 'huge' and power hungry.

    With this patch the RT task can be controlled to run anywhere by
    default, and doesn't cause the frequency to be maximum all the time.
    Yet any task that really needs to be boosted can easily escape this
    default behavior by modifying its requested uclamp.min value
    (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

    [1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")
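
    For illustration, a task opting back into the maximum boost might do
    something like this (a sketch; sched_setattr() is usually invoked via
    syscall(2) since glibc provides no wrapper):

    struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_FIFO,
            .sched_priority = 10,
            .sched_flags    = SCHED_FLAG_UTIL_CLAMP_MIN,
            .sched_util_min = 1024, /* SCHED_CAPACITY_SCALE, i.e. max boost */
    };
    syscall(__NR_sched_setattr, 0, &attr, 0);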

    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com

    Qais Yousef
     

11 Jul, 2020

1 commit

  • The seccomp filter used to be released in free_task() which is called
    asynchronously via call_rcu() and assorted mechanisms. Since we need
    to inform tasks waiting on the seccomp notifier when a filter goes empty
    we will notify them as soon as a task has been marked fully dead in
    release_task(). To not split seccomp cleanup into two parts, move
    filter release out of free_task() and into release_task() after we've
    unhashed struct task from struct pid, exited signals, and unlinked it
    from the threadgroups' thread list. We'll put the empty filter
    notification infrastructure into it in a follow up patch.

    This also renames put_seccomp_filter() to seccomp_filter_release() which
    is a more descriptive name of what we're doing here especially once
    we've added the empty filter notification mechanism in there.

    We're also NULL-ing the task's filter tree entrypoint which seems
    cleaner than leaving a dangling pointer in there. Note that this shouldn't
    need any memory barriers since we're calling this when the task is in
    release_task() which means it's EXIT_DEAD. So it can't modify its seccomp
    filters anymore. You can also see this from the point where we're calling
    seccomp_filter_release(). It's after __exit_signal() and at this point,
    tsk->sighand will already have been NULLed which is required for
    thread-sync and filter installation alike.
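
    The move boils down to something like this (sketch; simplified from the
    patch):

    void seccomp_filter_release(struct task_struct *tsk)
    {
            struct seccomp_filter *orig = tsk->seccomp.filter;

            /* Detach the dying task from its filter tree... */
            tsk->seccomp.filter = NULL;
            /* ...and drop the reference that free_task() used to drop. */
            __put_seccomp_filter(orig);
    }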

    Cc: Tycho Andersen
    Cc: Kees Cook
    Cc: Matt Denton
    Cc: Sargun Dhillon
    Cc: Jann Horn
    Cc: Chris Palmer
    Cc: Aleksa Sarai
    Cc: Robert Sesek
    Cc: Jeffrey Vander Stoep
    Cc: Linux Containers
    Signed-off-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com
    Signed-off-by: Kees Cook

    Christian Brauner
     

10 Jul, 2020

1 commit

  • Currently all IRQ-tracking state is in task_struct, this means that
    task_struct needs to be defined before we use it.

    Especially for lockdep_assert_irq*() this can lead to header-hell.

    Move the hardirq state into per-cpu variables to avoid the task_struct
    dependency.
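
    The state in question moves along these lines (sketch):

    /* previously fields in struct task_struct */
    DECLARE_PER_CPU(int, hardirqs_enabled);
    DECLARE_PER_CPU(int, hardirq_context);

    /* lockdep_assert_irq*() can now test this_cpu_read(hardirqs_enabled)
       without pulling in the full task_struct definition. */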

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200623083721.512673481@infradead.org

    Peter Zijlstra
     

08 Jul, 2020

1 commit

  • Create an independent helper thread_group_exited which returns true
    when all threads have passed exit_notify in do_exit. AKA all of the
    threads are at least zombies and might be dead or completely gone.

    Create this helper by taking the logic out of pidfd_poll where it is
    already tested, and adding a READ_ONCE on the read of
    task->exit_state.

    I will be changing the user mode driver code to use this same logic
    to know when a user mode driver needs to be restarted.

    Place the new helper thread_group_exited in kernel/exit.c and
    EXPORT it so it can be used by modules.
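
    The resulting helper is small (sketch; close to the pidfd_poll() logic
    it was factored from):

    bool thread_group_exited(struct pid *pid)
    {
            struct task_struct *task;
            bool exited;

            rcu_read_lock();
            task = pid_task(pid, PIDTYPE_PID);
            exited = !task ||
                    (READ_ONCE(task->exit_state) && thread_group_empty(task));
            rcu_read_unlock();

            return exited;
    }
    EXPORT_SYMBOL(thread_group_exited);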

    Link: https://lkml.kernel.org/r/20200702164140.4468-13-ebiederm@xmission.com
    Acked-by: Christian Brauner
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

05 Jul, 2020

3 commits

  • Now that HAVE_COPY_THREAD_TLS has been removed, rename copy_thread_tls()
    back to simply copy_thread(). It's a simpler name, and doesn't imply that only
    tls is copied here. This finishes an outstanding chunk of internal process
    creation work since we've added clone3().

    Cc: linux-arch@vger.kernel.org
    Acked-by: Thomas Bogendoerfer
    Acked-by: Stafford Horne
    Acked-by: Greentime Hu
    Acked-by: Geert Uytterhoeven
    Reviewed-by: Kees Cook
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • All architectures support copy_thread_tls() now, so remove the legacy
    copy_thread() function and the HAVE_COPY_THREAD_TLS config option. Everyone
    uses the same process creation calling convention based on
    copy_thread_tls() and struct kernel_clone_args. This will make it easier to
    maintain the core process creation code under kernel/, simplifies the
    callpaths, and makes them identical for all architectures.

    Cc: linux-arch@vger.kernel.org
    Acked-by: Thomas Bogendoerfer
    Acked-by: Greentime Hu
    Acked-by: Geert Uytterhoeven
    Reviewed-by: Kees Cook
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • Now that all architectures have been switched to use _do_fork() and the new
    struct kernel_clone_args calling convention we can remove the legacy
    do_fork() helper completely. The calling convention used to be brittle and
    do_fork() didn't buy us anything. The only calling convention accepted
    should be based on struct kernel_clone_args going forward. It's cleaner and
    uniform.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Peter Zijlstra (Intel)"
    Signed-off-by: Christian Brauner

    Christian Brauner
     

30 Jun, 2020

1 commit

  • struct vm_area_struct could be accessed concurrently as noticed by
    KCSAN,

    write to 0xffff9cf8bba08ad8 of 8 bytes by task 14263 on cpu 35:
    vma_interval_tree_insert+0x101/0x150:
    rb_insert_augmented_cached at include/linux/rbtree_augmented.h:58
    (inlined by) vma_interval_tree_insert at mm/interval_tree.c:23
    __vma_link_file+0x6e/0xe0
    __vma_link_file at mm/mmap.c:629
    vma_link+0xa2/0x120
    mmap_region+0x753/0xb90
    do_mmap+0x45c/0x710
    vm_mmap_pgoff+0xc0/0x130
    ksys_mmap_pgoff+0x1d1/0x300
    __x64_sys_mmap+0x33/0x40
    do_syscall_64+0x91/0xc44
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    read to 0xffff9cf8bba08a80 of 200 bytes by task 14262 on cpu 122:
    vm_area_dup+0x6a/0xe0
    vm_area_dup at kernel/fork.c:362
    __split_vma+0x72/0x2a0
    __split_vma at mm/mmap.c:2661
    split_vma+0x5a/0x80
    mprotect_fixup+0x368/0x3f0
    do_mprotect_pkey+0x263/0x420
    __x64_sys_mprotect+0x51/0x70
    do_syscall_64+0x91/0xc44
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    vm_area_dup() blindly copies all fields of original VMA to the new one.
    This includes copying vm_area_struct::shared.rb which is normally
    protected by i_mmap_lock. But this is fine because the read value will
    be overwritten on the following __vma_link_file() under proper
    protection. Thus, mark it as an intentional data race and insert a few
    assertions for the fields that should not be modified concurrently.

    Signed-off-by: Qian Cai
    Signed-off-by: Paul E. McKenney

    Qian Cai
     

26 Jun, 2020

1 commit

  • KCSAN reported data race reading and writing nr_threads and max_threads.
    The data race is intentional and benign. This is obvious from the comment
    above it and based on general consensus when discussing this issue. So
    there's no need for any heavy atomic or *_ONCE() machinery here.

    In accordance with the newly introduced data_race() annotation consensus,
    mark the offending line with data_race(). Here it's actually useful not
    just to silence KCSAN but to also clearly communicate that the race is
    intentional. This is especially helpful since nr_threads is otherwise
    protected by tasklist_lock.
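
    Concretely, the annotated check in copy_process() ends up looking
    roughly like this (sketch):

    /* racy, but the comment above the check explains why that is fine */
    if (data_race(nr_threads >= max_threads))
            goto bad_fork_cleanup_count;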

    BUG: KCSAN: data-race in copy_process / copy_process

    write to 0xffffffff86205cf8 of 4 bytes by task 14779 on cpu 1:
    copy_process+0x2eba/0x3c40 kernel/fork.c:2273
    _do_fork+0xfe/0x7a0 kernel/fork.c:2421
    __do_sys_clone kernel/fork.c:2576 [inline]
    __se_sys_clone kernel/fork.c:2557 [inline]
    __x64_sys_clone+0x130/0x170 kernel/fork.c:2557
    do_syscall_64+0xcc/0x3a0 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    read to 0xffffffff86205cf8 of 4 bytes by task 6944 on cpu 0:
    copy_process+0x94d/0x3c40 kernel/fork.c:1954
    _do_fork+0xfe/0x7a0 kernel/fork.c:2421
    __do_sys_clone kernel/fork.c:2576 [inline]
    __se_sys_clone kernel/fork.c:2557 [inline]
    __x64_sys_clone+0x130/0x170 kernel/fork.c:2557
    do_syscall_64+0xcc/0x3a0 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    Link: https://groups.google.com/forum/#!msg/syzkaller-upstream-moderation/thvp7AHs5Ew/aPdYLXfYBQAJ

    Reported-by: syzbot+52fced2d288f8ecd2b20@syzkaller.appspotmail.com
    Signed-off-by: Zefan Li
    Signed-off-by: Weilong Chen
    Acked-by: Christian Brauner
    Cc: Qian Cai
    Cc: Oleg Nesterov
    Cc: Christian Brauner
    Cc: Marco Elver
    [christian.brauner@ubuntu.com: rewrite commit message]
    Link: https://lore.kernel.org/r/20200623041240.154294-1-chenweilong@huawei.com
    Signed-off-by: Christian Brauner

    Weilong Chen
     

22 Jun, 2020

1 commit

  • This separate helper only existed to guarantee the mutual exclusivity of
    CLONE_PIDFD and CLONE_PARENT_SETTID for legacy clone since CLONE_PIDFD
    abuses the parent_tid field to return the pidfd. But we can actually handle
    this uniformly, thus removing the helper. For legacy clone we can detect
    that CLONE_PIDFD is specified in conjunction with CLONE_PARENT_SETTID
    because they will share the same memory which is invalid and for clone3()
    setting the separate pidfd and parent_tid fields to the same memory is
    bogus as well. So fold that helper directly into _do_fork() by detecting
    this case.
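
    Folded into _do_fork(), the check reads roughly like this (sketch):

    /*
     * For legacy clone(), CLONE_PIDFD reuses the parent_tid slot to return
     * the pidfd, so combining it with CLONE_PARENT_SETTID would share the
     * same memory; for clone3(), aliasing the two fields is equally bogus.
     */
    if ((args->flags & CLONE_PIDFD) &&
        (args->flags & CLONE_PARENT_SETTID) &&
        (args->pidfd == args->parent_tid))
            return -EINVAL;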

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: Geert Uytterhoeven
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Peter Zijlstra (Intel)"
    Cc: linux-m68k@lists.linux-m68k.org
    Cc: x86@kernel.org
    Signed-off-by: Christian Brauner

    Christian Brauner
     

17 Jun, 2020

1 commit

  • One of the use-cases of close_range() is to drop file descriptors just before
    execve(). This would usually be expressed in the sequence:

    unshare(CLONE_FILES);
    close_range(3, ~0U);

    as pointed out by Linus it might be desirable to have this be a part of
    close_range() itself under a new flag CLOSE_RANGE_UNSHARE.

    This expands {dup,unshare}_fd() to take a max_fds argument that indicates the
    maximum number of file descriptors to copy from the old struct files. When the
    user requests that all file descriptors be closed via close_range(min, max),
    we can cap via unshare_fd(min) and hence don't need to do any of the heavy
    fput() work for everything above min.

    The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
    fact currently share our file descriptor table we create a new private copy.
    We then close all fds in the requested range and finally after we're done we
    install the new fd table.
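
    The expanded copy helper takes the cap described above, roughly (sketch
    of the prototype):

    /* only descriptors below max_fds are copied into the new table */
    struct files_struct *dup_fd(struct files_struct *oldf,
                                unsigned int max_fds, int *errorp);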

    Suggested-by: Linus Torvalds
    Signed-off-by: Christian Brauner

    Christian Brauner
     

10 Jun, 2020

3 commits

  • Add API for nested write locks and convert the few call sites doing that.
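
    A sketch of the nested variant (at this point the rwsem is still named
    mmap_sem inside mm_struct):

    static inline void mmap_write_lock_nested(struct mm_struct *mm,
                                              int subclass)
    {
            down_write_nested(&mm->mmap_sem, subclass);
    }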

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-7-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definitions of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always the
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

05 Jun, 2020

1 commit

  • Pull proc updates from Eric Biederman:
    "This has four sets of changes:

    - modernize proc to support multiple private instances

    - ensure we see the exit of each process tid exactly once

    - remove has_group_leader_pid

    - use pids not tasks in posix-cpu-timers lookup

    Alexey updated proc so each mount of proc uses a new superblock. This
    allows people to actually use mount options with proc with no fear of
    messing up another mount of proc. Given the kernel's internal mounts
    of proc for things like uml this was a real problem, and resulted in
    Android's hidepid mount options being ignored and introducing security
    issues.

    The rest of the changes are small cleanups and fixes that came out of
    my work to allow this change to proc. In essence it is swapping the
    pids in de_thread during exec which removes a special case the code
    had to handle. Then updating the code to stop handling that special
    case"

    * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: proc_pid_ns takes super_block as an argument
    remove the no longer needed pid_alive() check in __task_pid_nr_ns()
    posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
    posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
    posix-cpu-timers: Extend rcu_read_lock removing task_struct references
    signal: Remove has_group_leader_pid
    exec: Remove BUG_ON(has_group_leader_pid)
    posix-cpu-timer: Unify the now redundant code in lookup_task
    posix-cpu-timer: Tidy up group_leader logic in lookup_task
    proc: Ensure we see the exit of each process tid exactly once
    rculist: Add hlists_swap_heads_rcu
    proc: Use PIDTYPE_TGID in next_tgid
    Use proc_pid_ns() to get pid_namespace from the proc superblock
    proc: use named enums for better readability
    proc: use human-readable values for hidepid
    docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
    proc: add option to mount only a pids subset
    proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
    proc: allow to mount many instances of proc in one pid namespace
    proc: rename struct proc_fs_info to proc_fs_opts

    Linus Torvalds
     

02 Jun, 2020

2 commits

  • Pull arm64 updates from Will Deacon:
    "A sizeable pile of arm64 updates for 5.8.

    Summary below, but the big two features are support for Branch Target
    Identification and Clang's Shadow Call stack. The latter is currently
    arm64-only, but the high-level parts are all in core code so it could
    easily be adopted by other architectures pending toolchain support

    Branch Target Identification (BTI):

    - Support for ARMv8.5-BTI in both user- and kernel-space. This allows
    branch targets to limit the types of branch from which they can be
    called and additionally prevents branching to arbitrary code,
    although kernel support requires a very recent toolchain.

    - Function annotation via SYM_FUNC_START() so that assembly functions
    are wrapped with the relevant "landing pad" instructions.

    - BPF and vDSO updates to use the new instructions.

    - Addition of a new HWCAP and exposure of BTI capability to userspace
    via ID register emulation, along with ELF loader support for the
    BTI feature in .note.gnu.property.

    - Non-critical fixes to CFI unwind annotations in the sigreturn
    trampoline.

    Shadow Call Stack (SCS):

    - Support for Clang's Shadow Call Stack feature, which reserves
    platform register x18 to point at a separate stack for each task
    that holds only return addresses. This protects function return
    control flow from buffer overruns on the main stack.

    - Save/restore of x18 across problematic boundaries (user-mode,
    hypervisor, EFI, suspend, etc).

    - Core support for SCS, should other architectures want to use it
    too.

    - SCS overflow checking on context-switch as part of the existing
    stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.

    CPU feature detection:

    - Removed numerous "SANITY CHECK" errors when running on a system
    with mismatched AArch32 support at EL1. This is primarily a concern
    for KVM, which disabled support for 32-bit guests on such a system.

    - Addition of new ID registers and fields as the architecture has
    been extended.

    Perf and PMU drivers:

    - Minor fixes and cleanups to system PMU drivers.

    Hardware errata:

    - Unify KVM workarounds for VHE and nVHE configurations.

    - Sort vendor errata entries in Kconfig.

    Secure Monitor Call Calling Convention (SMCCC):

    - Update to the latest specification from Arm (v1.2).

    - Allow PSCI code to query the SMCCC version.

    Software Delegated Exception Interface (SDEI):

    - Unexport a bunch of unused symbols.

    - Minor fixes to handling of firmware data.

    Pointer authentication:

    - Add support for dumping the kernel PAC mask in vmcoreinfo so that
    the stack can be unwound by tools such as kdump.

    - Simplification of key initialisation during CPU bringup.

    BPF backend:

    - Improve immediate generation for logical and add/sub instructions.

    vDSO:

    - Minor fixes to the linker flags for consistency with other
    architectures and support for LLVM's unwinder.

    - Clean up logic to initialise and map the vDSO into userspace.

    ACPI:

    - Work around for an ambiguity in the IORT specification relating to
    the "num_ids" field.

    - Support _DMA method for all named components rather than only PCIe
    root complexes.

    - Minor other IORT-related fixes.

    Miscellaneous:

    - Initialise debug traps early for KGDB and fix KDB cacheflushing
    deadlock.

    - Minor tweaks to early boot state (documentation update, set
    TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).

    - Refactoring and cleanup"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
    KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
    KVM: arm64: Check advertised Stage-2 page size capability
    arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
    ACPI/IORT: Remove the unused __get_pci_rid()
    arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
    arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
    arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
    arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
    arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
    arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
    arm64/cpufeature: Introduce ID_MMFR5 CPU register
    arm64/cpufeature: Introduce ID_DFR1 CPU register
    arm64/cpufeature: Introduce ID_PFR2 CPU register
    arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
    arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
    arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
    arm64: mm: Add asid_gen_match() helper
    firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
    arm64: vdso: Fix CFI directives in sigreturn trampoline
    arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The RCU updates for this cycle were:

    - RCU-tasks update, including the addition of RCU Tasks Trace for
    BPF use and TASKS_RUDE_RCU.

    - kfree_rcu() updates (see the usage sketch after this list).

    - Removal of a scheduler locking restriction.

    - RCU CPU stall warning updates.

    - Torture-test updates.

    - Miscellaneous fixes and other updates"
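
    As a side note on the kfree_rcu() item above, a minimal usage sketch
    ("struct foo" and its fields are illustrative, not taken from this
    series): kfree_rcu() frees an object after a grace period without
    requiring a hand-written call_rcu() callback.

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
            int data;
            struct rcu_head rcu;    /* needed for deferred freeing */
    };

    static void release_foo(struct foo *p)
    {
            /* Free p only after all pre-existing RCU readers finish. */
            kfree_rcu(p, rcu);
    }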

    * tag 'core-rcu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    rcu: Allow for smp_call_function() running callbacks from idle
    rcu: Provide rcu_irq_exit_check_preempt()
    rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()
    rcu: Provide __rcu_is_watching()
    rcu: Provide rcu_irq_exit_preempt()
    rcu: Make RCU IRQ enter/exit functions rely on in_nmi()
    rcu/tree: Mark the idle relevant functions noinstr
    x86: Replace ist_enter() with nmi_enter()
    x86/mce: Send #MC signal from task work
    x86/entry: Get rid of ist_begin/end_non_atomic()
    sched,rcu,tracing: Avoid tracing before in_nmi() is correct
    sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception
    lockdep: Always inline lockdep_{off,on}()
    hardirq/nmi: Allow nested nmi_enter()
    arm64: Prepare arch_nmi_enter() for recursion
    printk: Disallow instrumenting print_nmi_enter()
    printk: Prepare for nested printk_nmi_enter()
    rcutorture: Convert ULONG_CMP_LT() to time_before()
    torture: Add a --kasan argument
    torture: Save a few lines by using config_override_param initially
    ...

    Linus Torvalds
     

19 May, 2020

1 commit

  • syzbot found that

    touch /proc/testfile

    causes a NULL pointer dereference at tomoyo_get_local_path()
    because the inode of the dentry is NULL.

    Before c59f415a7cb6, Tomoyo received the pid_ns directly from
    proc's s_fs_info. Since proc_pid_ns() could only work with an
    inode, using it in tomoyo_get_local_path() was wrong.

    To avoid creating more functions for getting the proc pid_ns,
    change the argument type of proc_pid_ns() so that it takes the
    super_block. Then Tomoyo can use the existing super_block to get
    the pid_ns.
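
    A sketch of the resulting interface (the exact shape is an
    assumption based on the description above): proc_pid_ns() takes the
    proc super_block directly, so callers such as
    tomoyo_get_local_path() that only have a super_block no longer need
    an inode.

    static inline struct pid_namespace *proc_pid_ns(struct super_block *sb)
    {
            return proc_sb_info(sb)->pid_ns;    /* pid_ns of this proc mount */
    }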

    Link: https://lkml.kernel.org/r/0000000000002f0c7505a5b0e04c@google.com
    Link: https://lkml.kernel.org/r/20200518180738.2939611-1-gladkov.alexey@gmail.com
    Reported-by: syzbot+c1af344512918c61362c@syzkaller.appspotmail.com
    Fixes: c59f415a7cb6 ("Use proc_pid_ns() to get pid_namespace from the proc superblock")
    Signed-off-by: Alexey Gladkov
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     

15 May, 2020

1 commit

  • This change adds generic support for Clang's Shadow Call Stack,
    which uses a shadow stack to protect return addresses from being
    overwritten by an attacker. Details are available here:

    https://clang.llvm.org/docs/ShadowCallStack.html

    Note that the security guarantees in the kernel differ from the
    ones documented for user space. The kernel must store the addresses
    of shadow stacks in memory, which means an attacker capable of
    reading and writing arbitrary memory may be able to locate them and
    hijack control flow by modifying the stacks.
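
    Per the linked Clang documentation, on AArch64 the instrumentation
    amounts to pushing the return address onto the shadow stack (reached
    via the reserved register x18) in the prologue and reloading it in
    the epilogue; roughly:

            str     x30, [x18], #8          // prologue: push return address
            // ... function body ...
            ldr     x30, [x18, #-8]!        // epilogue: reload return address
            ret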

    Signed-off-by: Sami Tolvanen
    Reviewed-by: Kees Cook
    Reviewed-by: Miguel Ojeda
    [will: Numerous cosmetic changes]
    Signed-off-by: Will Deacon

    Sami Tolvanen
     

08 May, 2020

1 commit

  • Jan reported an issue where an interaction between sign-extending clone's
    flag argument on ppc64le and the new CLONE_INTO_CGROUP feature causes
    clone() to consistently fail with EBADF.

    The whole story is a little longer. The legacy clone() syscall is odd in a
    bunch of ways, and here two things interact. First, legacy clone's flag
    argument is word-size dependent, i.e. it's an unsigned long, whereas most
    system calls with flag arguments use int or unsigned int. Second, legacy
    clone() ignores unknown and deprecated flags. Taken together, this means
    that users on 64-bit systems have always been able to pass garbage in the
    upper 32 bits of the clone() flag argument and things would just work
    fine. Try the program below on a 64-bit kernel: prior to v5.7-rc1 it
    succeeds, while on v5.7-rc1 it fails with EBADF:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            pid_t pid;

            /* Note that legacy clone() has different argument ordering on
             * different architectures so this won't work everywhere.
             *
             * Only set the upper 32 bits.
             */
            pid = syscall(__NR_clone, 0xffffffff00000000 | SIGCHLD,
                          NULL, NULL, NULL, NULL);
            if (pid < 0)
                    exit(EXIT_FAILURE);     /* clone() rejected the flags */
            if (pid == 0)
                    exit(EXIT_SUCCESS);     /* child */
            if (wait(NULL) != pid)
                    exit(EXIT_FAILURE);

            exit(EXIT_SUCCESS);
    }

    Since legacy clone() couldn't be extended, this was not a problem so far,
    and nobody really noticed or cared because nothing in the kernel ever
    bothered to look at the upper 32 bits.

    But once we introduced clone3() and expanded the flag argument in struct
    clone_args to 64 bits, we opened this can of worms. With the first
    flag-based extension to clone3() making use of the upper 32 bits of the
    flag argument, we effectively made it possible for the legacy clone()
    syscall to reach clone3()-only flags. The sign-extension scenario is just
    the odd corner case through which we figured this out.

    The reason we only realized this now, and not already when we introduced
    CLONE_CLEAR_SIGHAND, is that CLONE_INTO_CGROUP assumes that a valid cgroup
    file descriptor has been given. So the sign extension (or the user
    accidentally passing garbage in the upper 32 bits) caused the
    CLONE_INTO_CGROUP bit to be raised and the kernel to error out when it
    didn't find a valid cgroup file descriptor.

    Let's fix this by always capping the upper 32 bits for all codepaths that
    are not aware of clone3() features. This ensures that we can't reach
    clone3()-only features by accident via legacy clone(), as in the sign
    extension case, and also that legacy clone() works exactly as before,
    i.e. ignoring any unknown flags. This solution risks no regressions and
    is also pretty clean.
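
    A simplified sketch of that capping, modeled on the legacy clone()
    wrapper around the process-creation helper in kernel/fork.c
    (_do_fork() at the time of this fix; the details below are
    illustrative, not the full patch): mask the flag argument to its
    lower 32 bits before building struct kernel_clone_args, so
    clone3()-only flags can never be set via legacy clone().

    #include <linux/kernel.h>       /* lower_32_bits() */
    #include <linux/sched.h>        /* CSIGNAL */
    #include <linux/sched/task.h>   /* struct kernel_clone_args */

    static long legacy_clone(unsigned long clone_flags, unsigned long newsp,
                             int __user *parent_tidptr,
                             int __user *child_tidptr, unsigned long tls)
    {
            struct kernel_clone_args args = {
                    /* Legacy clone() can never see the upper 32 bits. */
                    .flags          = (lower_32_bits(clone_flags) & ~CSIGNAL),
                    .exit_signal    = (lower_32_bits(clone_flags) & CSIGNAL),
                    .parent_tid     = parent_tidptr,
                    .child_tid      = child_tidptr,
                    .stack          = newsp,
                    .tls            = tls,
            };

            return _do_fork(&args);
    }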

    Fixes: 7f192e3cd316 ("fork: add clone3")
    Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups")
    Reported-by: Jan Stancek
    Signed-off-by: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Dmitry V. Levin
    Cc: Andreas Schwab
    Cc: Florian Weimer
    Cc: libc-alpha@sourceware.org
    Cc: stable@vger.kernel.org # 5.3+
    Link: https://sourceware.org/pipermail/libc-alpha/2020-May/113596.html
    Link: https://lore.kernel.org/r/20200507103214.77218-1-christian.brauner@ubuntu.com

    Christian Brauner