29 Oct, 2020

1 commit

  • Coredump logic needs to report not only the registers of the dumping
    thread, but (since 2.5.43) those of other threads getting killed.

    Doing that might require extra state saved on the stack in asm glue at
    kernel entry; signal delivery logic does that (we need to be able to
    save sigcontext there, at the very least) and so does seccomp.

    That covers all callers of do_coredump(). Secondary threads get hit with
    SIGKILL and caught as soon as they reach exit_mm(), which normally happens
    in signal delivery, so those are also fine most of the time. Unfortunately,
    it is possible to end up with a secondary thread zapped when it has already
    entered exit(2) (or, worse yet, is oopsing). In those cases we reach exit_mm()
    when mm->core_state is already set, but the stack contents are not what
    we would have in signal delivery.

    At least on two architectures (alpha and m68k) it leads to infoleaks - we
    end up with a chunk of kernel stack written into coredump, with the contents
    consisting of normal C stack frames of the call chain leading to exit_mm()
    instead of the expected copy of userland registers. In the case of alpha we
    leak 312 bytes of stack. Other architectures (including the regset-using
    ones) might have similar problems - the normal user of regsets is ptrace
    and the state of tracee at the time of such calls is special in the same
    way signal delivery is.

    Note that had the zapper gotten to the exiting thread slightly later,
    it wouldn't have been included into coredump anyway - we skip the threads
    that have already cleared their ->mm. So let's pretend that zapper always
    loses the race. IOW, have exit_mm() only insert into the dumper list if
    we'd gotten there from handling a fatal signal[*].

    As a result, the callers of do_exit() that have *not* gone through get_signal()
    are not seen by coredump logic as secondary threads. That excludes voluntary
    exit()/oopsen/traps/etc. The dumper thread itself is unaffected by that,
    so seccomp is fine.

    [*] originally I intended to add a new flag in tsk->flags, but ebiederman pointed
    out that PF_SIGNALED is already doing just what we need.

    Cc: stable@vger.kernel.org
    Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
    History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Al Viro
     

19 Oct, 2020

1 commit

  • The process_madvise syscall needs the pidfd_get_pid function to translate a
    pidfd to a pid, so this patch moves the function to kernel/pid.c.
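    A minimal userspace sketch of the resulting call path (illustrative only:
    it advises the calling process's own memory through a pidfd; the
    syscall-number and MADV_COLD fallbacks below are assumptions for older
    headers, and a Linux 5.10+ kernel is required):

    ```c
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef SYS_pidfd_open
    #define SYS_pidfd_open 434		/* assumed: unified syscall number */
    #endif
    #ifndef SYS_process_madvise
    #define SYS_process_madvise 440	/* assumed: unified syscall number */
    #endif
    #ifndef MADV_COLD
    #define MADV_COLD 20		/* assumed: value from <asm-generic/mman-common.h> */
    #endif

    int main(void)
    {
    	size_t len = 1 << 20;
    	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
    			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    	if (buf == MAP_FAILED)
    		return 1;
    	memset(buf, 0xaa, len);		/* fault the pages in */

    	/* A pidfd for the current process; on the kernel side
    	 * pidfd_get_pid() translates it back to a struct pid. */
    	int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);

    	struct iovec iov = { .iov_base = buf, .iov_len = len };
    	long ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLD, 0);

    	/* On success the return value is the number of bytes advised */
    	printf("process_madvise returned %ld\n", ret);
    	return 0;
    }
    ```

    Advising another process's memory the same way additionally requires
    CAP_SYS_NICE over the target.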

    Suggested-by: Alexander Duyck
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Alexander Duyck
    Reviewed-by: Vlastimil Babka
    Acked-by: Christian Brauner
    Acked-by: David Rientjes
    Cc: Jens Axboe
    Cc: Jann Horn
    Cc: Brian Geffon
    Cc: Daniel Colascione
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Cc:
    Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

04 Sep, 2020

1 commit

  • Passing a non-blocking pidfd to waitid() currently has no effect, i.e. is not
    supported. There are users which would like to use waitid() on pidfds that are
    O_NONBLOCK and mix it with pidfds that are blocking and both pass them to
    waitid().
    The expected behavior is to have waitid() return -EAGAIN for non-blocking
    pidfds and to block for blocking pidfds, without needing to perform any
    additional checks for flags set on the pidfd before passing it to waitid().
    Non-blocking pidfds will return -EAGAIN from waitid() when no child process is
    ready yet. Returning -EAGAIN for non-blocking pidfds makes life easier for
    event loops that handle EAGAIN specially.

    It also makes the API more consistent and uniform. In essence, waitid() is
    treated like a read on a non-blocking pidfd or a recvmsg() on a non-blocking
    socket.
    With the addition of support for non-blocking pidfds we support the same
    functionality that sockets do: for sockets, recvmsg() supports MSG_DONTWAIT;
    for pidfds, waitid() supports WNOHANG. Both flags are per-call options. In
    contrast, non-blocking pidfds and non-blocking sockets are a setting on an open
    file description, affecting all threads in the calling process as well as other
    processes that hold file descriptors referring to the same open file
    description. Both behaviors, per call and per open file description, have
    genuine use-cases.

    The implementation is straightforward:
    - If a non-blocking pidfd is passed and WNOHANG is not raised, we simply raise
    the WNOHANG flag internally. When do_wait() returns indicating that there are
    eligible child processes but none have exited yet, we set -EAGAIN. If no child
    process exists we continue returning -ECHILD.
    - If a non-blocking pidfd is passed and WNOHANG is raised, waitid() will
    continue returning 0, i.e. it will not set -EAGAIN. This ensures backwards
    compatibility with applications passing WNOHANG explicitly with pidfds.

    A concrete use-case that was brought on-list was Josh's async pidfd library.
    Ever since the introduction of pidfds and more advanced async io various
    programming languages such as Rust have grown support for async event
    libraries. These libraries are created to help build epoll-based event loops
    around file descriptors. A common pattern is to automatically set all file
    descriptors they manage to O_NONBLOCK.

    For such libraries the EAGAIN error code is treated specially. When a function
    returns EAGAIN it isn't called again until the event loop indicates that the
    file descriptor is ready. Supporting EAGAIN when waiting on pidfds makes such
    libraries just work with little effort.
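    The behavior above can be sketched from userspace (assuming a kernel with
    non-blocking pidfd support; PIDFD_NONBLOCK shares the value of O_NONBLOCK,
    and the P_PIDFD and syscall-number fallbacks below are assumptions for
    older libc headers):

    ```c
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #ifndef SYS_pidfd_open
    #define SYS_pidfd_open 434		/* assumed: unified syscall number */
    #endif
    #ifndef P_PIDFD
    #define P_PIDFD 3			/* waitid() idtype for pidfds */
    #endif

    int main(void)
    {
    	siginfo_t info = { 0 };
    	pid_t pid = fork();

    	if (pid == 0) {			/* child: linger briefly, then exit(7) */
    		usleep(200 * 1000);
    		_exit(7);
    	}

    	/* PIDFD_NONBLOCK has the same value as O_NONBLOCK */
    	int pidfd = (int)syscall(SYS_pidfd_open, pid, O_NONBLOCK);

    	/* Child is still running: a non-blocking pidfd fails with EAGAIN
    	 * instead of blocking; an event loop would go back to epoll here. */
    	if (waitid((idtype_t)P_PIDFD, pidfd, &info, WEXITED) < 0 &&
    	    errno == EAGAIN)
    		printf("no child ready yet\n");

    	/* A pidfd becomes readable once the process is waitable */
    	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    	poll(&pfd, 1, -1);

    	if (waitid((idtype_t)P_PIDFD, pidfd, &info, WEXITED) == 0)
    		printf("child exited with status %d\n", info.si_status);

    	close(pidfd);
    	return 0;
    }
    ```

    The first waitid() fails with EAGAIN while the child is alive; once the
    pidfd polls readable, the second call reaps the child without blocking.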

    Suggested-by: Josh Triplett
    Signed-off-by: Christian Brauner
    Reviewed-by: Josh Triplett
    Reviewed-by: Oleg Nesterov
    Cc: Kees Cook
    Cc: Sargun Dhillon
    Cc: Jann Horn
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Oleg Nesterov
    Cc: "Peter Zijlstra (Intel)"
    Link: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
    Link: https://github.com/joshtriplett/async-pidfd
    Link: https://lore.kernel.org/r/20200902102130.147672-3-christian.brauner@ubuntu.com

    Christian Brauner
     

13 Aug, 2020

2 commits

  • Add a helper that waits for a pid and stores the status in the passed in
    kernel pointer. Use it to fix the usage of kernel_wait4 in
    call_usermodehelper_exec_sync that only happens to work due to the
    implicit set_fs(KERNEL_DS) for kernel threads.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: "Eric W. Biederman"
    Cc: Luis Chamberlain
    Link: http://lkml.kernel.org/r/20200721130449.5008-1-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Both exec and exit want to ensure that the uaccess routines actually do
    access user pointers. Use the newly added force_uaccess_begin helper
    instead of an open coded set_fs for that to prepare for kernel builds
    where set_fs() does not exist.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Nick Hu
    Cc: Greentime Hu
    Cc: Vincent Chen
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20200710135706.537715-7-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

05 Aug, 2020

2 commits

  • Pull execve updates from Eric Biederman:
    "During the development of v5.7 I ran into bugs and quality of
    implementation issues related to exec that could not be easily fixed
    because of the way exec is implemented. So I have been digging into
    exec and cleaning up what I can.

    This cycle I have been looking at different ideas and different
    implementations to see what is possible to improve exec, and cleaning
    up the way exec interfaces with in-kernel users. Only cleaning up the
    interfaces of exec with the rest of the kernel has managed to stabilize
    and make it through review in time for v5.9-rc1, resulting in 2 sets of
    changes this cycle.

    - Implement kernel_execve

    - Make the user mode driver code a better citizen

    With kernel_execve the code size got a little larger, as the copying of
    parameters from userspace and the copying of parameters from kernel space
    now have separate code paths. The good news is kernel threads no longer
    need to play games with set_fs to use exec, which, combined with the rest
    of Christoph's set_fs changes, should make security bugs with set_fs much
    more difficult"

    * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
    exec: Implement kernel_execve
    exec: Factor bprm_stack_limits out of prepare_arg_pages
    exec: Factor bprm_execve out of do_execve_common
    exec: Move bprm_mm_init into alloc_bprm
    exec: Move initialization of bprm->filename into alloc_bprm
    exec: Factor out alloc_bprm
    exec: Remove unnecessary spaces from binfmts.h
    umd: Stop using split_argv
    umd: Remove exit_umh
    bpfilter: Take advantage of the facilities of struct pid
    exit: Factor thread_group_exited out of pidfd_poll
    umd: Track user space drivers with struct pid
    bpfilter: Move bpfilter_umh back into init data
    exec: Remove do_execve_file
    umh: Stop calling do_execve_file
    umd: Transform fork_usermode_blob into fork_usermode_driver
    umd: Rename umd_info.cmdline umd_info.driver_name
    umd: For clarity rename umh_info umd_info
    umh: Separate the user mode driver and the user mode helper support
    umh: Remove call_usermodehelper_setup_file.
    ...

    Linus Torvalds
     
  • Pull seccomp updates from Kees Cook:
    "There are a bunch of clean ups and selftest improvements along with
    two major updates to the SECCOMP_RET_USER_NOTIF filter return:
    EPOLLHUP support to more easily detect the death of a monitored
    process, and being able to inject fds when intercepting syscalls that
    expect an fd-opening side-effect (needed by both container folks and
    Chrome). The latter continued the refactoring of __scm_install_fd()
    started by Christoph, and in the process found and fixed a handful of
    bugs in various callers.

    - Improved selftest coverage, timeouts, and reporting

    - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)

    - Refactor __scm_install_fd() into __receive_fd() and fix buggy
    callers

    - Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun
    Dhillon)"

    * tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits)
    selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
    seccomp: Introduce addfd ioctl to seccomp user notifier
    fs: Expand __receive_fd() to accept existing fd
    pidfd: Replace open-coded receive_fd()
    fs: Add receive_fd() wrapper for __receive_fd()
    fs: Move __scm_install_fd() to __receive_fd()
    net/scm: Regularize compat handling of scm_detach_fds()
    pidfd: Add missing sock updates for pidfd_getfd()
    net/compat: Add missing sock updates for SCM_RIGHTS
    selftests/seccomp: Check ENOSYS under tracing
    selftests/seccomp: Refactor to use fixture variants
    selftests/harness: Clean up kern-doc for fixtures
    seccomp: Use -1 marker for end of mode 1 syscall list
    seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
    selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
    selftests/seccomp: Make kcmp() less required
    seccomp: Use pr_fmt
    selftests/seccomp: Improve calibration loop
    selftests/seccomp: use 90s as timeout
    selftests/seccomp: Expand benchmark to per-filter measurements
    ...

    Linus Torvalds
     

17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

11 Jul, 2020

1 commit

  • The seccomp filter used to be released in free_task() which is called
    asynchronously via call_rcu() and assorted mechanisms. Since we need
    to inform tasks waiting on the seccomp notifier when a filter goes empty
    we will notify them as soon as a task has been marked fully dead in
    release_task(). To not split seccomp cleanup into two parts, move
    filter release out of free_task() and into release_task() after we've
    unhashed struct task from struct pid, exited signals, and unlinked it
    from the threadgroups' thread list. We'll put the empty filter
    notification infrastructure into it in a follow up patch.

    This also renames put_seccomp_filter() to seccomp_filter_release() which
    is a more descriptive name of what we're doing here especially once
    we've added the empty filter notification mechanism in there.

    We're also NULL-ing the task's filter tree entrypoint which seems
    cleaner than leaving a dangling pointer in there. Note that this shouldn't
    need any memory barriers since we're calling this when the task is in
    release_task() which means it's EXIT_DEAD. So it can't modify its seccomp
    filters anymore. You can also see this from the point where we're calling
    seccomp_filter_release(). It's after __exit_signal() and at this point,
    tsk->sighand will already have been NULLed which is required for
    thread-sync and filter installation alike.

    Cc: Tycho Andersen
    Cc: Kees Cook
    Cc: Matt Denton
    Cc: Sargun Dhillon
    Cc: Jann Horn
    Cc: Chris Palmer
    Cc: Aleksa Sarai
    Cc: Robert Sesek
    Cc: Jeffrey Vander Stoep
    Cc: Linux Containers
    Signed-off-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com
    Signed-off-by: Kees Cook

    Christian Brauner
     

08 Jul, 2020

2 commits

  • The bpfilter code no longer uses the umd_info.cleanup callback. This
    callback is what exit_umh exists to call. So remove exit_umh and all
    of its associated bookkeeping.

    v1: https://lkml.kernel.org/r/87bll6dlte.fsf_-_@x220.int.ebiederm.org
    v2: https://lkml.kernel.org/r/87y2o53abg.fsf_-_@x220.int.ebiederm.org
    Link: https://lkml.kernel.org/r/20200702164140.4468-15-ebiederm@xmission.com
    Reviewed-by: Greg Kroah-Hartman
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Create an independent helper thread_group_exited which returns true
    when all threads have passed exit_notify in do_exit. AKA all of the
    threads are at least zombies and might be dead or completely gone.

    Create this helper by taking the logic out of pidfd_poll where it is
    already tested, and adding a READ_ONCE on the read of
    task->exit_state.

    I will be changing the user mode driver code to use this same logic
    to know when a user mode driver needs to be restarted.

    Place the new helper thread_group_exited in kernel/exit.c and
    EXPORT it so it can be used by modules.

    Link: https://lkml.kernel.org/r/20200702164140.4468-13-ebiederm@xmission.com
    Acked-by: Christian Brauner
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Jul, 2020

2 commits

  • Use struct pid instead of user space pid values that are prone to wrap
    around.

    In addition, track the entire thread group instead of just the first
    thread that is started by exec. There are no multi-threaded user mode
    drivers today, but there is nothing precluding user mode drivers from being
    multi-threaded, so it is just a good idea to track the entire process.

    Take a reference count on the tgid's in question to make it possible
    to remove exit_umh in a future change.

    As a struct pid is available directly use kill_pid_info.

    The prior process signalling code was iffy in using a userspace pid
    known to be in the initial pid namespace and then looking up its task
    in whatever the current pid namespace is. It worked only because
    kernel threads always run in the initial pid namespace.

    As the tgid is now refcounted, verify the tgid is NULL at the start of
    fork_usermode_driver to avoid the possibility of silent pid leaks.

    v1: https://lkml.kernel.org/r/87mu4qdlv2.fsf_-_@x220.int.ebiederm.org
    v2: https://lkml.kernel.org/r/a70l4oy8.fsf_-_@x220.int.ebiederm.org
    Link: https://lkml.kernel.org/r/20200702164140.4468-12-ebiederm@xmission.com
    Reviewed-by: Greg Kroah-Hartman
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • This makes it clear which code is part of the core user mode
    helper support and which code is needed to implement user mode
    drivers.

    This makes the kernel smaller for everyone who does not use a usermode
    driver.

    v1: https://lkml.kernel.org/r/87tuyyf0ln.fsf_-_@x220.int.ebiederm.org
    v2: https://lkml.kernel.org/r/87imf963s6.fsf_-_@x220.int.ebiederm.org
    Link: https://lkml.kernel.org/r/20200702164140.4468-5-ebiederm@xmission.com
    Reviewed-by: Greg Kroah-Hartman
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Jun, 2020

3 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definition of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Jun, 2020

2 commits

  • Pull kvm updates from Paolo Bonzini:
    "ARM:
    - Move the arch-specific code into arch/arm64/kvm

    - Start the post-32bit cleanup

    - Cherry-pick a few non-invasive pre-NV patches

    x86:
    - Rework of TLB flushing

    - Rework of event injection, especially with respect to nested
    virtualization

    - Nested AMD event injection facelift, building on the rework of
    generic code and fixing a lot of corner cases

    - Nested AMD live migration support

    - Optimization for TSC deadline MSR writes and IPIs

    - Various cleanups

    - Asynchronous page fault cleanups (from tglx, common topic branch
    with tip tree)

    - Interrupt-based delivery of asynchronous "page ready" events (host
    side)

    - Hyper-V MSRs and hypercalls for guest debugging

    - VMX preemption timer fixes

    s390:
    - Cleanups

    Generic:
    - switch vCPU thread wakeup from swait to rcuwait

    The other architectures, and the guest side of the asynchronous page
    fault work, will come next week"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (256 commits)
    KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test
    KVM: check userspace_addr for all memslots
    KVM: selftests: update hyperv_cpuid with SynDBG tests
    x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls
    x86/kvm/hyper-v: enable hypercalls regardless of hypercall page
    x86/kvm/hyper-v: Add support for synthetic debugger interface
    x86/hyper-v: Add synthetic debugger definitions
    KVM: selftests: VMX preemption timer migration test
    KVM: nVMX: Fix VMX preemption timer migration
    x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
    KVM: x86/pmu: Support full width counting
    KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
    KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
    KVM: x86: acknowledgment mechanism for async pf page ready notifications
    KVM: x86: interrupt based APF 'page ready' event delivery
    KVM: introduce kvm_read_guest_offset_cached()
    KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present()
    KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
    Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously"
    KVM: VMX: Replace zero-length array with flexible-array
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The changes in this cycle are:

    - Optimize the task wakeup CPU selection logic, to improve
    scalability and reduce wakeup latency spikes

    - PELT enhancements

    - CFS bandwidth handling fixes

    - Optimize the wakeup path by removing rq->wake_list and replacing it
    with ->ttwu_pending

    - Optimize IPI cross-calls by making flush_smp_call_function_queue()
    process sync callbacks first.

    - Misc fixes and enhancements"

    * tag 'sched-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    irq_work: Define irq_work_single() on !CONFIG_IRQ_WORK too
    sched/headers: Split out open-coded prototypes into kernel/sched/smp.h
    sched: Replace rq::wake_list
    sched: Add rq::ttwu_pending
    irq_work, smp: Allow irq_work on call_single_queue
    smp: Optimize send_call_function_single_ipi()
    smp: Move irq_work_run() out of flush_smp_call_function_queue()
    smp: Optimize flush_smp_call_function_queue()
    sched: Fix smp_call_function_single_async() usage for ILB
    sched/core: Offload wakee task activation if it the wakee is descheduling
    sched/core: Optimize ttwu() spinning on p->on_cpu
    sched: Defend cfs and rt bandwidth quota against overflow
    sched/cpuacct: Fix charge cpuacct.usage_sys
    sched/fair: Replace zero-length array with flexible-array
    sched/pelt: Sync util/runnable_sum with PELT window when propagating
    sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr()
    sched/fair: Optimize enqueue_task_fair()
    sched: Make scheduler_ipi inline
    sched: Clean up scheduler_ipi()
    sched/core: Simplify sched_init()
    ...

    Linus Torvalds
     

02 Jun, 2020

1 commit

  • Pull uaccess/readdir updates from Al Viro:
    "Finishing the conversion of readdir.c to unsafe_... API.

    This includes the uaccess_{read,write}_begin series by Christophe
    Leroy"

    * 'uaccess.readdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    readdir.c: get rid of the last __put_user(), drop now-useless access_ok()
    readdir.c: get compat_filldir() more or less in sync with filldir()
    switch readdir(2) to unsafe_copy_dirent_name()
    drm/i915/gem: Replace user_access_begin by user_write_access_begin
    uaccess: Selectively open read or write user access
    uaccess: Add user_read_access_begin/end and user_write_access_begin/end

    Linus Torvalds
     



01 May, 2020

2 commits

  • When opening user access to only perform reads, only open read access.
    When opening user access to only perform writes, only open write
    access.

    Signed-off-by: Christophe Leroy
    Reviewed-by: Kees Cook
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/2e73bc57125c2c6ab12a587586a4eed3a47105fc.1585898438.git.christophe.leroy@c-s.fr

    Christophe Leroy
     
  • With CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_CGROUPS=y, kernel oopses in
    non-preemptible context look untidy; after the main oops, the kernel prints
    a "sleeping function called from invalid context" report because
    exit_signals() -> cgroup_threadgroup_change_begin() -> percpu_down_read()
    can sleep, and that happens before the preempt_count_set(PREEMPT_ENABLED)
    fixup.

    It looks like the same thing applies to profile_task_exit() and
    kcov_task_exit().

    Fix it by moving the preemption fixup up and the calls to
    profile_task_exit() and kcov_task_exit() down.

    Fixes: 1dc0fffc48af ("sched/core: Robustify preemption leak checks")
    Signed-off-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200305220657.46800-1-jannh@google.com

    Jann Horn
     

25 Apr, 2020

1 commit

  • Oleg pointed out that in the unlikely event the kernel is compiled
    with CONFIG_PROC_FS unset that release_task will now leak the pid.

    Move the put_pid out of proc_flush_pid into release_task to fix this
    and to guarantee I don't make that mistake again.

    When possible it makes sense to keep get and put in the same function
    so it can easily been seen how they pair up.

    Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
    Reported-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

03 Apr, 2020

1 commit

  • Pull exec/proc updates from Eric Biederman:
    "This contains two significant pieces of work: the work to sort out
    proc_flush_task, and the work to solve a deadlock between strace and
    exec.

    Fixing proc_flush_task so that it no longer requires a persistent
    mount makes improvements to proc possible. The removal of the
    persistent mount solves an old regression that caused the hidepid
    mount option to only work on remount not on mount. The regression was
    found and reported by the Android folks. This further allows Alexey
    Gladkov's work making proc mount options specific to an individual
    mount of proc to move forward.

    The work on exec starts solving a long standing issue with exec that
    it takes mutexes of blocking userspace applications, which makes exec
    extremely deadlock prone. For the moment this adds a second mutex with
    a narrower scope that handles all of the easy cases. Which makes the
    tricky cases easy to spot. With a little luck the code to solve those
    deadlocks will be ready by next merge window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
    signal: Extend exec_id to 64bits
    pidfd: Use new infrastructure to fix deadlocks in execve
    perf: Use new infrastructure to fix deadlocks in execve
    proc: io_accounting: Use new infrastructure to fix deadlocks in execve
    proc: Use new infrastructure to fix deadlocks in execve
    kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
    kernel: doc: remove outdated comment cred.c
    mm: docs: Fix a comment in process_vm_rw_core
    selftests/ptrace: add test cases for dead-locks
    exec: Fix a deadlock in strace
    exec: Add exec_update_mutex to replace cred_guard_mutex
    exec: Move exec_mmap right after de_thread in flush_old_exec
    exec: Move cleanup of posix timers on exec out of de_thread
    exec: Factor unshare_sighand out of de_thread and call it separately
    exec: Only compute current once in flush_old_exec
    pid: Improve the comment about waiting in zap_pid_ns_processes
    proc: Remove the now unnecessary internal mount of proc
    uml: Create a private mount of proc for mconsole
    uml: Don't consult current to find the proc_mnt in mconsole_proc
    proc: Use a list of inodes to flush from proc
    ...

    Linus Torvalds
     

31 Mar, 2020

2 commits

  • Pull timekeeping and timer updates from Thomas Gleixner:
    "Core:

    - Consolidation of the vDSO build infrastructure to address the
    difficulties of cross-builds for ARM64 compat vDSO libraries by
    restricting the exposure of header content to the vDSO build.

    This is achieved by splitting out header content into separate
    headers, which contain only the minimal information necessary to
    build the vDSO. These new headers are included from the kernel
    headers and the vDSO specific files.

    - Enhancements to the generic vDSO library allowing more fine grained
    control over the compiled in code, further reducing architecture
    specific storage and preparing for adopting the generic library by
    PPC.

    - Cleanup and consolidation of the exit related code in posix CPU
    timers.

    - Small cleanups and enhancements here and there

    Drivers:

    - The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support

    - Correct the clock rate of PIT64b global clock

    - setup_irq() cleanup

    - Preparation for PWM and suspend support for the TI DM timer

    - Expand the fttmr010 driver to support ast2600 systems

    - The usual small fixes, enhancements and cleanups all over the
    place"

    * tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
    Revert "clocksource/drivers/timer-probe: Avoid creating dead devices"
    vdso: Fix clocksource.h macro detection
    um: Fix header inclusion
    arm64: vdso32: Enable Clang Compilation
    lib/vdso: Enable common headers
    arm: vdso: Enable arm to use common headers
    x86/vdso: Enable x86 to use common headers
    mips: vdso: Enable mips to use common headers
    arm64: vdso32: Include common headers in the vdso library
    arm64: vdso: Include common headers in the vdso library
    arm64: Introduce asm/vdso/processor.h
    arm64: vdso32: Code clean up
    linux/elfnote.h: Replace elf.h with UAPI equivalent
    scripts: Fix the inclusion order in modpost
    common: Introduce processor.h
    linux/ktime.h: Extract common header for vDSO
    linux/jiffies.h: Extract common header for vDSO
    linux/time64.h: Extract common header for vDSO
    linux/time32.h: Extract common header for vDSO
    linux/time.h: Extract common header for vDSO
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Continued user-access cleanups in the futex code.

    - percpu-rwsem rewrite that uses its own waitqueue and atomic_t
    instead of an embedded rwsem. This addresses a couple of
    weaknesses, but the primary motivation was complications on the -rt
    kernel.

    - Introduce raw lock nesting detection on lockdep
    (CONFIG_PROVE_RAW_LOCK_NESTING=y), document the raw_lock vs. normal
    lock differences. This too originates from -rt.

    - Reuse lockdep zapped chain_hlocks entries, to conserve RAM
    footprint on distro-ish kernels running into the "BUG:
    MAX_LOCKDEP_CHAIN_HLOCKS too low!" depletion of the lockdep
    chain-entries pool.

    - Misc cleanups, smaller fixes and enhancements - see the changelog
    for details"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
    fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t
    thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t
    Documentation/locking/locktypes: Minor copy editor fixes
    Documentation/locking/locktypes: Further clarifications and wordsmithing
    m68knommu: Remove mm.h include from uaccess_no.h
    x86: get rid of user_atomic_cmpxchg_inatomic()
    generic arch_futex_atomic_op_inuser() doesn't need access_ok()
    x86: don't reload after cmpxchg in unsafe_atomic_op2() loop
    x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end()
    objtool: whitelist __sanitizer_cov_trace_switch()
    [parisc, s390, sparc64] no need for access_ok() in futex handling
    sh: no need of access_ok() in arch_futex_atomic_op_inuser()
    futex: arch_futex_atomic_op_inuser() calling conventions change
    completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all()
    lockdep: Add posixtimer context tracing bits
    lockdep: Annotate irq_work
    lockdep: Add hrtimer context tracing bits
    lockdep: Introduce wait-type checks
    completion: Use simple wait queues
    sched/swait: Prepare usage in completions
    ...

    Linus Torvalds
     

04 Mar, 2020

1 commit

  • The reasons why the extra posix_cpu_timers_exit_group() invocation has been
    added are not entirely clear from the commit message. Today all that
    posix_cpu_timers_exit_group() does is stop timers that are tracking the
    task from firing. Every other operation on those timers is still allowed.

    The practical implication is that posix_cpu_timer_del(), which could
    not get the siglock after the thread group leader has exited (because
    sighand == NULL), would still be able to run successfully because the
    timer was already dequeued.

    With that locking issue fixed there is no point in disabling all of
    the timers, so remove this "temporary" hack.

    Fixes: e0a70217107e ("posix-cpu-timers: workaround to suppress the problems with mt exec")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/87o8tityzs.fsf@x220.int.ebiederm.org

    Eric W. Biederman
     

28 Feb, 2020

1 commit

  • This patch fixes the following sparse error:
    kernel/exit.c:627:25: error: incompatible types in comparison expression

    And the following warning:
    kernel/exit.c:626:40: warning: incorrect type in assignment

    Signed-off-by: Madhuparna Bhowmik
    Acked-by: Oleg Nesterov
    Acked-by: Christian Brauner
    [christian.brauner@ubuntu.com: edit commit message]
    Link: https://lore.kernel.org/r/20200130062028.4870-1-madhuparnabhowmik10@gmail.com
    Signed-off-by: Christian Brauner

    Madhuparna Bhowmik
     

25 Feb, 2020

1 commit

  • Rework the flushing of proc to use a list of directory inodes that
    need to be flushed.

    The list is kept on struct pid not on struct task_struct, as there is
    a fixed connection between proc inodes and pids but at least for the
    case of de_thread the pid of a task_struct changes.

    This removes the dependency on proc_mnt, which allows different
    mounts of proc to have different mount options even in the same pid
    namespace, and allows for the removal of proc_mnt, which will
    trivially let the first mount of proc honor its mount options.

    This flushing remains an optimization. The functions
    pid_delete_dentry and pid_revalidate ensure that ordinary dcache
    management will not attempt to use dentries past the point their
    respective task has died. When unused the shrinker will
    eventually be able to remove these dentries.

    There is a case in de_thread where proc_flush_pid can be called
    early for a given pid, which winds up being safe (if suboptimal)
    as this is just an optimization.

    Only pid directories are put on the list as the other
    per pid files are children of those directories and
    d_invalidate on the directory will get them as well.

    So that the pid can be used during flushing, its reference count is
    taken in release_task and dropped in proc_flush_pid. Further, the call
    of proc_flush_pid is moved after the tasklist_lock is released in
    release_task so that it is certain that the pid has already been
    unhashed when the flushing takes place. This removes a small race
    where a dentry could be recreated.

    As struct pid is supposed to be small and I need a per pid lock,
    I reuse the only lock that currently exists in struct pid:
    the wait_pidfd.lock.

    The net result is that this adds all of this functionality
    with just a little extra list management overhead and
    a single extra pointer in struct pid.

    v2: Initialize pid->inodes. I somehow failed to get that
    initialization into the initial version of the patch. A boot
    failure was reported by "kernel test robot ", and
    failure to initialize that pid->inodes matches all of the reported
    symptoms.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

11 Feb, 2020

1 commit

  • Now that __percpu_up_read() is only ever used from percpu_up_read()
    merge them, it's a small function.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Will Deacon
    Acked-by: Waiman Long
    Link: https://lkml.kernel.org/r/20200131151540.212415454@infradead.org

    Davidlohr Bueso
     

04 Jan, 2020

1 commit

  • Pull thread fixes from Christian Brauner:
    "Here are two fixes:

    - Panic earlier when global init exits to generate usable coredumps.

    Currently, when global init and all threads in its thread-group
    have exited we panic via:

    do_exit()
    -> exit_notify()
    -> forget_original_parent()
    -> find_child_reaper()

    This makes it hard to extract a usable coredump for global init
    from a kernel crashdump because by the time we panic exit_mm() will
    have already released global init's mm. We now panic slightly
    earlier. This has been a problem in certain environments such as
    Android.

    - Fix a race in assigning and reading taskstats for thread-groups
    with more than one thread.

    This patch has been waiting for quite a while since people
    disagreed on what the correct fix was at first"

    * tag 'for-linus-2020-01-03' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    exit: panic before exit_mm() on global init exit
    taskstats: fix data-race

    Linus Torvalds
     

21 Dec, 2019

1 commit

  • Currently, when global init and all threads in its thread-group have exited
    we panic via:
    do_exit()
    -> exit_notify()
    -> forget_original_parent()
    -> find_child_reaper()
    This makes it hard to extract a usable coredump for global init from a
    kernel crashdump because by the time we panic exit_mm() will have already
    released global init's mm.
    This patch moves the panic further up, before exit_mm() is called. As was the
    case previously, we only panic when global init and all its threads in the
    thread-group have exited.

    Signed-off-by: chenqiwu
    Acked-by: Christian Brauner
    Acked-by: Oleg Nesterov
    [christian.brauner@ubuntu.com: fix typo, rewrite commit message]
    Link: https://lore.kernel.org/r/1576736993-10121-1-git-send-email-qiwuchen55@gmail.com
    Signed-off-by: Christian Brauner

    chenqiwu
     

01 Dec, 2019

1 commit

  • …ux/kernel/git/dhowells/linux-fs

    Pull pipe rework from David Howells:
    "This is my set of preparatory patches for building a general
    notification queue on top of pipes. It makes a number of significant
    changes:

    - It removes the nr_exclusive argument from __wake_up_sync_key() as
    this is always 1. This prepares for the next step:

    - Adds wake_up_interruptible_sync_poll_locked() so that poll can be
    woken up from a function that's holding the poll waitqueue
    spinlock.

    - Change the pipe buffer ring to be managed in terms of unbounded
    head and tail indices rather than bounded index and length. This
    means that reading the pipe only needs to modify one index, not
    two.

    - A selection of helper functions are provided to query the state of
    the pipe buffer, plus a couple to apply updates to the pipe
    indices.

    - The pipe ring is allowed to have kernel-reserved slots. This allows
    many notification messages to be spliced in by the kernel without
    allowing userspace to pin too many pages if it writes to the same
    pipe.

    - Advance the head and tail indices inside the pipe waitqueue lock
    and use wake_up_interruptible_sync_poll_locked() to poke poll
    without having to take the lock twice.

    - Rearrange pipe_write() to preallocate the buffer it is going to
    write into and then drop the spinlock. This allows kernel
    notifications to be added to the ring whilst it is filling the
    buffer it allocated. The read side is stalled because the pipe
    mutex is still held.

    - Don't wake up readers on a pipe if there was already data in it
    when we added more.

    - Don't wake up writers on a pipe if the ring wasn't full before we
    removed a buffer"

    * tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    pipe: Remove sync on wake_ups
    pipe: Increase the writer-wakeup threshold to reduce context-switch count
    pipe: Check for ring full inside of the spinlock in pipe_write()
    pipe: Remove redundant wakeup from pipe_write()
    pipe: Rearrange sequence in pipe_write() to preallocate slot
    pipe: Conditionalise wakeup in pipe_read()
    pipe: Advance tail pointer inside of wait spinlock in pipe_read()
    pipe: Allow pipes to have kernel-reserved slots
    pipe: Use head and tail pointers for the ring, not cursor and length
    Add wake_up_interruptible_sync_poll_locked()
    Remove the nr_exclusive argument from __wake_up_sync_key()
    pipe: Reduce #inclusion of pipe_fs_i.h

    Linus Torvalds
     

27 Nov, 2019

1 commit

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - A comprehensive rewrite of the robust/PI futex code's exit handling
    to fix various exit races. (Thomas Gleixner et al)

    - Rework the generic REFCOUNT_FULL implementation using
    atomic_fetch_* operations so that the performance impact of the
    cmpxchg() loops is mitigated for common refcount operations.

    With these performance improvements the generic implementation of
    refcount_t should be good enough for everybody - and this got
    confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
    REFCOUNT_FULL entirely, leaving the generic implementation enabled
    unconditionally. (Will Deacon)

    - Other misc changes, fixes, cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    lkdtm: Remove references to CONFIG_REFCOUNT_FULL
    locking/refcount: Remove unused 'refcount_error_report()' function
    locking/refcount: Consolidate implementations of refcount_t
    locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
    locking/refcount: Move saturation warnings out of line
    locking/refcount: Improve performance of generic REFCOUNT_FULL code
    locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the header
    locking/refcount: Remove unused refcount_*_checked() variants
    locking/refcount: Ensure integer operands are treated as signed
    locking/refcount: Define constants for saturation and max refcount values
    futex: Prevent exit livelock
    futex: Provide distinct return value when owner is exiting
    futex: Add mutex around futex exit
    futex: Provide state handling for exec() as well
    futex: Sanitize exit state handling
    futex: Mark the begin of futex exit explicitly
    futex: Set task::futex_state to DEAD right after handling futex exit
    futex: Split futex_mm_release() for exit/exec
    exit/exec: Seperate mm_release()
    futex: Replace PF_EXITPIDONE with a state
    ...

    Linus Torvalds
     

20 Nov, 2019

4 commits

  • Instead of relying on PF_EXITING use an explicit state for the futex exit
    and set it in the futex exit function. This moves the smp barrier and the
    lock/unlock serialization into the futex code.

    As with the DEAD state this is restricted to the exit path as exec
    continues to use the same task struct.

    This allows that logic to be simplified in a later step.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de

    Thomas Gleixner
     
  • Setting task::futex_state in do_exit() is rather arbitrarily placed for no
    reason. Move it into the futex code.

    Note, this is only done for the exit cleanup as the exec cleanup cannot set
    the state to FUTEX_STATE_DEAD because the task struct is still in active
    use.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de

    Thomas Gleixner
     
  • mm_release() contains the futex exit handling. mm_release() is called from
    do_exit()->exit_mm() and from exec()->exec_mm().

    In the exit_mm() case PF_EXITING is set and the futex state is updated.
    In the exec_mm() case these states are not touched.

    As the futex exit code needs further protections against exit races, this
    needs to be split into two functions.

    Preparatory only, no functional change.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de

    Thomas Gleixner
     
  • The futex exit handling relies on PF_ flags. That's suboptimal as it
    requires a smp_mb() and an ugly lock/unlock of the exiting task's pi_lock in
    the middle of do_exit() to enforce the observability of PF_EXITING in the
    futex code.

    Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic
    over to the new state. The PF_EXITING dependency will be cleaned up in a
    later step.

    This prepares for handling various futex exit issues later.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de

    Thomas Gleixner