29 Oct, 2020

1 commit

  • Coredump logic needs to report not only the registers of the dumping
    thread, but (since 2.5.43) those of other threads getting killed.

    Doing that might require extra state saved on the stack in asm glue at
    kernel entry; signal delivery logic does that (we need to be able to
    save sigcontext there, at the very least) and so does seccomp.

    That covers all callers of do_coredump(). Secondary threads get hit with
    SIGKILL and caught as soon as they reach exit_mm(), which normally happens
    in signal delivery, so those are also fine most of the time. Unfortunately,
    it is possible to end up with a secondary thread zapped when it has already
    entered exit(2) (or, worse yet, is oopsing). In those cases we reach exit_mm()
    when mm->core_state is already set, but the stack contents are not what
    we would have in signal delivery.

    At least on two architectures (alpha and m68k) it leads to infoleaks - we
    end up with a chunk of kernel stack written into coredump, with the contents
    consisting of normal C stack frames of the call chain leading to exit_mm()
    instead of the expected copy of userland registers. In the case of alpha we
    leak 312 bytes of stack. Other architectures (including the regset-using
    ones) might have similar problems - the normal user of regsets is ptrace
    and the state of tracee at the time of such calls is special in the same
    way signal delivery is.

    Note that had the zapper gotten to the exiting thread slightly later,
    it wouldn't have been included into coredump anyway - we skip the threads
    that have already cleared their ->mm. So let's pretend that zapper always
    loses the race. IOW, have exit_mm() only insert into the dumper list if
    we'd gotten there from handling a fatal signal[*].

    As a result, the callers of do_exit() that have *not* gone through get_signal()
    are not seen by coredump logic as secondary threads. That excludes voluntary
    exit()/oopsen/traps/etc. The dumper thread itself is unaffected by that,
    so seccomp is fine.

    [*] originally I intended to add a new flag in tsk->flags, but ebiederman pointed
    out that PF_SIGNALED is already doing just what we need.

    Cc: stable@vger.kernel.org
    Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
    History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Al Viro

    Al Viro
     

19 Oct, 2020

1 commit

  • The process_madvise syscall needs the pidfd_get_pid function to translate a
    pidfd to a pid, so this patch moves the function to kernel/pid.c.
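    A minimal userspace sketch of the resulting call path (illustrative only:
    it advises the calling process's own memory through a pidfd; the
    syscall-number and MADV_COLD fallbacks below are assumptions for older
    headers, and a Linux 5.10+ kernel is required):

    ```c
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef SYS_pidfd_open
    #define SYS_pidfd_open 434		/* assumed: unified syscall number */
    #endif
    #ifndef SYS_process_madvise
    #define SYS_process_madvise 440	/* assumed: unified syscall number */
    #endif
    #ifndef MADV_COLD
    #define MADV_COLD 20		/* assumed: value from <asm-generic/mman-common.h> */
    #endif

    int main(void)
    {
    	size_t len = 1 << 20;
    	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
    			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    	if (buf == MAP_FAILED)
    		return 1;
    	memset(buf, 0xaa, len);		/* fault the pages in */

    	/* A pidfd for the current process; on the kernel side
    	 * pidfd_get_pid() translates it back to a struct pid. */
    	int pidfd = (int)syscall(SYS_pidfd_open, getpid(), 0);

    	struct iovec iov = { .iov_base = buf, .iov_len = len };
    	long ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLD, 0);

    	/* On success the return value is the number of bytes advised */
    	printf("process_madvise returned %ld\n", ret);
    	return 0;
    }
    ```

    Advising another process's memory the same way additionally requires
    CAP_SYS_NICE over the target.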

    Suggested-by: Alexander Duyck
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Alexander Duyck
    Reviewed-by: Vlastimil Babka
    Acked-by: Christian Brauner
    Acked-by: David Rientjes
    Cc: Jens Axboe
    Cc: Jann Horn
    Cc: Brian Geffon
    Cc: Daniel Colascione
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Cc:
    Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

04 Sep, 2020

1 commit

  • Passing a non-blocking pidfd to waitid() currently has no effect, i.e. is not
    supported. There are users which would like to use waitid() on pidfds that are
    O_NONBLOCK and mix it with pidfds that are blocking and both pass them to
    waitid().
    The expected behavior is to have waitid() return -EAGAIN for non-blocking
    pidfds and to block for blocking pidfds, without needing to perform any
    additional checks for flags set on the pidfd before passing it to waitid().
    Non-blocking pidfds will return -EAGAIN from waitid() when no child process is
    ready yet. Returning -EAGAIN for non-blocking pidfds makes life easier for
    event loops that handle EAGAIN specially.

    It also makes the API more consistent and uniform. In essence, waitid() is
    treated like a read on a non-blocking pidfd or a recvmsg() on a non-blocking
    socket.
    With the addition of support for non-blocking pidfds we support the same
    functionality that sockets do: for sockets, recvmsg() supports MSG_DONTWAIT;
    for pidfds, waitid() supports WNOHANG. Both flags are per-call options. In
    contrast, non-blocking pidfds and non-blocking sockets are a setting on an open
    file description, affecting all threads in the calling process as well as other
    processes that hold file descriptors referring to the same open file
    description. Both behaviors, per call and per open file description, have
    genuine use-cases.

    The implementation is straightforward:
    - If a non-blocking pidfd is passed and WNOHANG is not raised, we simply raise
    the WNOHANG flag internally. When do_wait() returns indicating that there are
    eligible child processes but none have exited yet, we set -EAGAIN. If no child
    process exists we continue returning -ECHILD.
    - If a non-blocking pidfd is passed and WNOHANG is raised, waitid() will
    continue returning 0, i.e. it will not set -EAGAIN. This ensures backwards
    compatibility with applications passing WNOHANG explicitly with pidfds.

    A concrete use-case that was brought on-list was Josh's async pidfd library.
    Ever since the introduction of pidfds and more advanced async io various
    programming languages such as Rust have grown support for async event
    libraries. These libraries are created to help build epoll-based event loops
    around file descriptors. A common pattern is to automatically set all file
    descriptors they manage to O_NONBLOCK.

    For such libraries the EAGAIN error code is treated specially. When a function
    returns EAGAIN it isn't called again until the event loop indicates that the
    file descriptor is ready. Supporting EAGAIN when waiting on pidfds makes such
    libraries just work with little effort.
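    The behavior above can be sketched from userspace (assuming a kernel with
    non-blocking pidfd support; PIDFD_NONBLOCK shares the value of O_NONBLOCK,
    and the P_PIDFD and syscall-number fallbacks below are assumptions for
    older libc headers):

    ```c
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #ifndef SYS_pidfd_open
    #define SYS_pidfd_open 434		/* assumed: unified syscall number */
    #endif
    #ifndef P_PIDFD
    #define P_PIDFD 3			/* waitid() idtype for pidfds */
    #endif

    int main(void)
    {
    	siginfo_t info = { 0 };
    	pid_t pid = fork();

    	if (pid == 0) {			/* child: linger briefly, then exit(7) */
    		usleep(200 * 1000);
    		_exit(7);
    	}

    	/* PIDFD_NONBLOCK has the same value as O_NONBLOCK */
    	int pidfd = (int)syscall(SYS_pidfd_open, pid, O_NONBLOCK);

    	/* Child is still running: a non-blocking pidfd fails with EAGAIN
    	 * instead of blocking; an event loop would go back to epoll here. */
    	if (waitid((idtype_t)P_PIDFD, pidfd, &info, WEXITED) < 0 &&
    	    errno == EAGAIN)
    		printf("no child ready yet\n");

    	/* A pidfd becomes readable once the process is waitable */
    	struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
    	poll(&pfd, 1, -1);

    	if (waitid((idtype_t)P_PIDFD, pidfd, &info, WEXITED) == 0)
    		printf("child exited with status %d\n", info.si_status);

    	close(pidfd);
    	return 0;
    }
    ```

    The first waitid() fails with EAGAIN while the child is alive; once the
    pidfd polls readable, the second call reaps the child without blocking.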

    Suggested-by: Josh Triplett
    Signed-off-by: Christian Brauner
    Reviewed-by: Josh Triplett
    Reviewed-by: Oleg Nesterov
    Cc: Kees Cook
    Cc: Sargun Dhillon
    Cc: Jann Horn
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Oleg Nesterov
    Cc: "Peter Zijlstra (Intel)"
    Link: https://lore.kernel.org/lkml/20200811181236.GA18763@localhost/
    Link: https://github.com/joshtriplett/async-pidfd
    Link: https://lore.kernel.org/r/20200902102130.147672-3-christian.brauner@ubuntu.com

    Christian Brauner
     

13 Aug, 2020

2 commits

  • Add a helper that waits for a pid and stores the status in the passed in
    kernel pointer. Use it to fix the usage of kernel_wait4 in
    call_usermodehelper_exec_sync that only happens to work due to the
    implicit set_fs(KERNEL_DS) for kernel threads.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: "Eric W. Biederman"
    Cc: Luis Chamberlain
    Link: http://lkml.kernel.org/r/20200721130449.5008-1-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Both exec and exit want to ensure that the uaccess routines actually do
    access user pointers. Use the newly added force_uaccess_begin helper
    instead of an open coded set_fs for that to prepare for kernel builds
    where set_fs() does not exist.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Nick Hu
    Cc: Greentime Hu
    Cc: Vincent Chen
    Cc: Paul Walmsley
    Cc: Palmer Dabbelt
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20200710135706.537715-7-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

05 Aug, 2020

2 commits

  • Pull execve updates from Eric Biederman:
    "During the development of v5.7 I ran into bugs and quality of
    implementation issues related to exec that could not be easily fixed
    because of the way exec is implemented. So I have been digging into
    exec and cleaning up what I can.

    This cycle I have been looking at different ideas and different
    implementations to see what is possible to improve exec, and cleaning
    up the way exec interfaces with in-kernel users. Only cleaning up the
    interfaces of exec with the rest of the kernel has managed to stabilize
    and make it through review in time for v5.9-rc1, resulting in 2 sets of
    changes this cycle.

    - Implement kernel_execve

    - Make the user mode driver code a better citizen

    With kernel_execve the code size got a little larger, as the copying of
    parameters from userspace and the copying of parameters from kernel space
    now have separate code paths. The good news is kernel threads no longer
    need to play games with set_fs to use exec, which, combined with the rest
    of Christoph's set_fs changes, should make security bugs with set_fs much
    more difficult"

    * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
    exec: Implement kernel_execve
    exec: Factor bprm_stack_limits out of prepare_arg_pages
    exec: Factor bprm_execve out of do_execve_common
    exec: Move bprm_mm_init into alloc_bprm
    exec: Move initialization of bprm->filename into alloc_bprm
    exec: Factor out alloc_bprm
    exec: Remove unnecessary spaces from binfmts.h
    umd: Stop using split_argv
    umd: Remove exit_umh
    bpfilter: Take advantage of the facilities of struct pid
    exit: Factor thread_group_exited out of pidfd_poll
    umd: Track user space drivers with struct pid
    bpfilter: Move bpfilter_umh back into init data
    exec: Remove do_execve_file
    umh: Stop calling do_execve_file
    umd: Transform fork_usermode_blob into fork_usermode_driver
    umd: Rename umd_info.cmdline umd_info.driver_name
    umd: For clarity rename umh_info umd_info
    umh: Separate the user mode driver and the user mode helper support
    umh: Remove call_usermodehelper_setup_file.
    ...

    Linus Torvalds
     
  • Pull seccomp updates from Kees Cook:
    "There are a bunch of clean ups and selftest improvements along with
    two major updates to the SECCOMP_RET_USER_NOTIF filter return:
    EPOLLHUP support to more easily detect the death of a monitored
    process, and being able to inject fds when intercepting syscalls that
    expect an fd-opening side-effect (needed by both container folks and
    Chrome). The latter continued the refactoring of __scm_install_fd()
    started by Christoph, and in the process found and fixed a handful of
    bugs in various callers.

    - Improved selftest coverage, timeouts, and reporting

    - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)

    - Refactor __scm_install_fd() into __receive_fd() and fix buggy
    callers

    - Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun
    Dhillon)"

    * tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits)
    selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
    seccomp: Introduce addfd ioctl to seccomp user notifier
    fs: Expand __receive_fd() to accept existing fd
    pidfd: Replace open-coded receive_fd()
    fs: Add receive_fd() wrapper for __receive_fd()
    fs: Move __scm_install_fd() to __receive_fd()
    net/scm: Regularize compat handling of scm_detach_fds()
    pidfd: Add missing sock updates for pidfd_getfd()
    net/compat: Add missing sock updates for SCM_RIGHTS
    selftests/seccomp: Check ENOSYS under tracing
    selftests/seccomp: Refactor to use fixture variants
    selftests/harness: Clean up kern-doc for fixtures
    seccomp: Use -1 marker for end of mode 1 syscall list
    seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
    selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
    selftests/seccomp: Make kcmp() less required
    seccomp: Use pr_fmt
    selftests/seccomp: Improve calibration loop
    selftests/seccomp: use 90s as timeout
    selftests/seccomp: Expand benchmark to per-filter measurements
    ...

    Linus Torvalds
     

17 Jul, 2020

1 commit

  • Using uninitialized_var() is dangerous as it papers over real bugs[1]
    (or can in the future), and suppresses unrelated compiler warnings
    (e.g. "unused variable"). If the compiler thinks it is uninitialized,
    either simply initialize the variable or make compiler changes.

    In preparation for removing[2] the[3] macro[4], remove all remaining
    needless uses with the following script:

    git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
    xargs perl -pi -e \
    's/\buninitialized_var\(([^\)]+)\)/\1/g;
    s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

    drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
    pathological white-space.

    No outstanding warnings were found building allmodconfig with GCC 9.3.0
    for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64,
    alpha, and m68k.

    [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
    [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
    [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
    [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

    Reviewed-by: Leon Romanovsky # drivers/infiniband and mlx4/mlx5
    Acked-by: Jason Gunthorpe # IB
    Acked-by: Kalle Valo # wireless drivers
    Reviewed-by: Chao Yu # erofs
    Signed-off-by: Kees Cook

    Kees Cook
     

11 Jul, 2020

1 commit

  • The seccomp filter used to be released in free_task() which is called
    asynchronously via call_rcu() and assorted mechanisms. Since we need
    to inform tasks waiting on the seccomp notifier when a filter goes empty
    we will notify them as soon as a task has been marked fully dead in
    release_task(). To not split seccomp cleanup into two parts, move
    filter release out of free_task() and into release_task() after we've
    unhashed struct task from struct pid, exited signals, and unlinked it
    from the threadgroups' thread list. We'll put the empty filter
    notification infrastructure into it in a follow up patch.

    This also renames put_seccomp_filter() to seccomp_filter_release() which
    is a more descriptive name of what we're doing here especially once
    we've added the empty filter notification mechanism in there.

    We're also NULL-ing the task's filter tree entrypoint which seems
    cleaner than leaving a dangling pointer in there. Note that this shouldn't
    need any memory barriers since we're calling this when the task is in
    release_task() which means it's EXIT_DEAD. So it can't modify its seccomp
    filters anymore. You can also see this from the point where we're calling
    seccomp_filter_release(). It's after __exit_signal() and at this point,
    tsk->sighand will already have been NULLed which is required for
    thread-sync and filter installation alike.

    Cc: Tycho Andersen
    Cc: Kees Cook
    Cc: Matt Denton
    Cc: Sargun Dhillon
    Cc: Jann Horn
    Cc: Chris Palmer
    Cc: Aleksa Sarai
    Cc: Robert Sesek
    Cc: Jeffrey Vander Stoep
    Cc: Linux Containers
    Signed-off-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com
    Signed-off-by: Kees Cook

    Christian Brauner
     

08 Jul, 2020

2 commits

  • The bpfilter code no longer uses the umd_info.cleanup callback. This
    callback is what exit_umh exists to call. So remove exit_umh and all
    of its associated bookkeeping.

    v1: https://lkml.kernel.org/r/87bll6dlte.fsf_-_@x220.int.ebiederm.org
    v2: https://lkml.kernel.org/r/87y2o53abg.fsf_-_@x220.int.ebiederm.org
    Link: https://lkml.kernel.org/r/20200702164140.4468-15-ebiederm@xmission.com
    Reviewed-by: Greg Kroah-Hartman
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • Create an independent helper thread_group_exited which returns true
    when all threads have passed exit_notify in do_exit. AKA all of the
    threads are at least zombies and might be dead or completely gone.

    Create this helper by taking the logic out of pidfd_poll where it is
    already tested, and adding a READ_ONCE on the read of
    task->exit_state.

    I will be changing the user mode driver code to use this same logic
    to know when a user mode driver needs to be restarted.

    Place the new helper thread_group_exited in kernel/exit.c and
    EXPORT it so it can be used by modules.

    Link: https://lkml.kernel.org/r/20200702164140.4468-13-ebiederm@xmission.com
    Acked-by: Christian Brauner
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

04 Jul, 2020

2 commits

  • Use struct pid instead of user space pid values that are prone to wrap
    around.

    In addition, track the entire thread group instead of just the first
    thread that is started by exec. There are no multi-threaded user mode
    drivers today, but there is nothing precluding user mode drivers from being
    multi-threaded, so it is just a good idea to track the entire process.

    Take a reference count on the tgid's in question to make it possible
    to remove exit_umh in a future change.

    As a struct pid is available directly use kill_pid_info.

    The prior process signalling code was iffy in using a userspace pid
    known to be in the initial pid namespace and then looking up its task
    in whatever the current pid namespace is. It worked only because
    kernel threads always run in the initial pid namespace.

    As the tgid is now refcounted, verify the tgid is NULL at the start of
    fork_usermode_driver to avoid the possibility of silent pid leaks.

    v1: https://lkml.kernel.org/r/87mu4qdlv2.fsf_-_@x220.int.ebiederm.org
    v2: https://lkml.kernel.org/r/a70l4oy8.fsf_-_@x220.int.ebiederm.org
    Link: https://lkml.kernel.org/r/20200702164140.4468-12-ebiederm@xmission.com
    Reviewed-by: Greg Kroah-Hartman
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     
  • This makes it clear which code is part of the core user mode
    helper support and which code is needed to implement user mode
    drivers.

    This makes the kernel smaller for everyone who does not use a usermode
    driver.

    v1: https://lkml.kernel.org/r/87tuyyf0ln.fsf_-_@x220.int.ebiederm.org
    v2: https://lkml.kernel.org/r/87imf963s6.fsf_-_@x220.int.ebiederm.org
    Link: https://lkml.kernel.org/r/20200702164140.4468-5-ebiederm@xmission.com
    Reviewed-by: Greg Kroah-Hartman
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

10 Jun, 2020

3 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definition of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Jun, 2020

2 commits

  • Pull kvm updates from Paolo Bonzini:
    "ARM:
    - Move the arch-specific code into arch/arm64/kvm

    - Start the post-32bit cleanup

    - Cherry-pick a few non-invasive pre-NV patches

    x86:
    - Rework of TLB flushing

    - Rework of event injection, especially with respect to nested
    virtualization

    - Nested AMD event injection facelift, building on the rework of
    generic code and fixing a lot of corner cases

    - Nested AMD live migration support

    - Optimization for TSC deadline MSR writes and IPIs

    - Various cleanups

    - Asynchronous page fault cleanups (from tglx, common topic branch
    with tip tree)

    - Interrupt-based delivery of asynchronous "page ready" events (host
    side)

    - Hyper-V MSRs and hypercalls for guest debugging

    - VMX preemption timer fixes

    s390:
    - Cleanups

    Generic:
    - switch vCPU thread wakeup from swait to rcuwait

    The other architectures, and the guest side of the asynchronous page
    fault work, will come next week"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (256 commits)
    KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test
    KVM: check userspace_addr for all memslots
    KVM: selftests: update hyperv_cpuid with SynDBG tests
    x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls
    x86/kvm/hyper-v: enable hypercalls regardless of hypercall page
    x86/kvm/hyper-v: Add support for synthetic debugger interface
    x86/hyper-v: Add synthetic debugger definitions
    KVM: selftests: VMX preemption timer migration test
    KVM: nVMX: Fix VMX preemption timer migration
    x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
    KVM: x86/pmu: Support full width counting
    KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
    KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
    KVM: x86: acknowledgment mechanism for async pf page ready notifications
    KVM: x86: interrupt based APF 'page ready' event delivery
    KVM: introduce kvm_read_guest_offset_cached()
    KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present()
    KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
    Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously"
    KVM: VMX: Replace zero-length array with flexible-array
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The changes in this cycle are:

    - Optimize the task wakeup CPU selection logic, to improve
    scalability and reduce wakeup latency spikes

    - PELT enhancements

    - CFS bandwidth handling fixes

    - Optimize the wakeup path by removing rq->wake_list and replacing it
    with ->ttwu_pending

    - Optimize IPI cross-calls by making flush_smp_call_function_queue()
    process sync callbacks first.

    - Misc fixes and enhancements"

    * tag 'sched-core-2020-06-02' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    irq_work: Define irq_work_single() on !CONFIG_IRQ_WORK too
    sched/headers: Split out open-coded prototypes into kernel/sched/smp.h
    sched: Replace rq::wake_list
    sched: Add rq::ttwu_pending
    irq_work, smp: Allow irq_work on call_single_queue
    smp: Optimize send_call_function_single_ipi()
    smp: Move irq_work_run() out of flush_smp_call_function_queue()
    smp: Optimize flush_smp_call_function_queue()
    sched: Fix smp_call_function_single_async() usage for ILB
    sched/core: Offload wakee task activation if it the wakee is descheduling
    sched/core: Optimize ttwu() spinning on p->on_cpu
    sched: Defend cfs and rt bandwidth quota against overflow
    sched/cpuacct: Fix charge cpuacct.usage_sys
    sched/fair: Replace zero-length array with flexible-array
    sched/pelt: Sync util/runnable_sum with PELT window when propagating
    sched/cpuacct: Use __this_cpu_add() instead of this_cpu_ptr()
    sched/fair: Optimize enqueue_task_fair()
    sched: Make scheduler_ipi inline
    sched: Clean up scheduler_ipi()
    sched/core: Simplify sched_init()
    ...

    Linus Torvalds
     

02 Jun, 2020

1 commit

  • Pull uaccess/readdir updates from Al Viro:
    "Finishing the conversion of readdir.c to unsafe_... API.

    This includes the uaccess_{read,write}_begin series by Christophe
    Leroy"

    * 'uaccess.readdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    readdir.c: get rid of the last __put_user(), drop now-useless access_ok()
    readdir.c: get compat_filldir() more or less in sync with filldir()
    switch readdir(2) to unsafe_copy_dirent_name()
    drm/i915/gem: Replace user_access_begin by user_write_access_begin
    uaccess: Selectively open read or write user access
    uaccess: Add user_read_access_begin/end and user_write_access_begin/end

    Linus Torvalds
     



01 May, 2020

2 commits

  • When opening user access to only perform reads, only open read access.
    When opening user access to only perform writes, only open write
    access.

    Signed-off-by: Christophe Leroy
    Reviewed-by: Kees Cook
    Signed-off-by: Michael Ellerman
    Link: https://lore.kernel.org/r/2e73bc57125c2c6ab12a587586a4eed3a47105fc.1585898438.git.christophe.leroy@c-s.fr

    Christophe Leroy
     
  • With CONFIG_DEBUG_ATOMIC_SLEEP=y and CONFIG_CGROUPS=y, kernel oopses in
    non-preemptible context look untidy; after the main oops, the kernel prints
    a "sleeping function called from invalid context" report because
    exit_signals() -> cgroup_threadgroup_change_begin() -> percpu_down_read()
    can sleep, and that happens before the preempt_count_set(PREEMPT_ENABLED)
    fixup.

    It looks like the same thing applies to profile_task_exit() and
    kcov_task_exit().

    Fix it by moving the preemption fixup up and the calls to
    profile_task_exit() and kcov_task_exit() down.

    Fixes: 1dc0fffc48af ("sched/core: Robustify preemption leak checks")
    Signed-off-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200305220657.46800-1-jannh@google.com

    Jann Horn
     

25 Apr, 2020

1 commit

  • Oleg pointed out that in the unlikely event the kernel is compiled
    with CONFIG_PROC_FS unset that release_task will now leak the pid.

    Move the put_pid out of proc_flush_pid into release_task to fix this
    and to guarantee I don't make that mistake again.

    When possible it makes sense to keep get and put in the same function
    so it can easily been seen how they pair up.

    Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
    Reported-by: Oleg Nesterov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

03 Apr, 2020

1 commit

  • Pull exec/proc updates from Eric Biederman:
    "This contains two significant pieces of work: the work to sort out
    proc_flush_task, and the work to solve a deadlock between strace and
    exec.

    Fixing proc_flush_task so that it no longer requires a persistent
    mount makes improvements to proc possible. The removal of the
    persistent mount solves an old regression that caused the hidepid
    mount option to only work on remount not on mount. The regression was
    found and reported by the Android folks. This further allows Alexey
    Gladkov's work making proc mount options specific to an individual
    mount of proc to move forward.

    The work on exec starts solving a long standing issue with exec that
    it takes mutexes of blocking userspace applications, which makes exec
    extremely deadlock prone. For the moment this adds a second mutex with
    a narrower scope that handles all of the easy cases. Which makes the
    tricky cases easy to spot. With a little luck the code to solve those
    deadlocks will be ready by next merge window"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (25 commits)
    signal: Extend exec_id to 64bits
    pidfd: Use new infrastructure to fix deadlocks in execve
    perf: Use new infrastructure to fix deadlocks in execve
    proc: io_accounting: Use new infrastructure to fix deadlocks in execve
    proc: Use new infrastructure to fix deadlocks in execve
    kernel/kcmp.c: Use new infrastructure to fix deadlocks in execve
    kernel: doc: remove outdated comment cred.c
    mm: docs: Fix a comment in process_vm_rw_core
    selftests/ptrace: add test cases for dead-locks
    exec: Fix a deadlock in strace
    exec: Add exec_update_mutex to replace cred_guard_mutex
    exec: Move exec_mmap right after de_thread in flush_old_exec
    exec: Move cleanup of posix timers on exec out of de_thread
    exec: Factor unshare_sighand out of de_thread and call it separately
    exec: Only compute current once in flush_old_exec
    pid: Improve the comment about waiting in zap_pid_ns_processes
    proc: Remove the now unnecessary internal mount of proc
    uml: Create a private mount of proc for mconsole
    uml: Don't consult current to find the proc_mnt in mconsole_proc
    proc: Use a list of inodes to flush from proc
    ...

    Linus Torvalds
     

31 Mar, 2020

2 commits

  • Pull timekeeping and timer updates from Thomas Gleixner:
    "Core:

    - Consolidation of the vDSO build infrastructure to address the
    difficulties of cross-builds for ARM64 compat vDSO libraries by
    restricting the exposure of header content to the vDSO build.

    This is achieved by splitting out header content into separate
    headers, which contain only the minimal information necessary to
    build the vDSO. These new headers are included from the kernel
    headers and the vDSO specific files.

    - Enhancements to the generic vDSO library allowing more fine grained
    control over the compiled in code, further reducing architecture
    specific storage and preparing for adopting the generic library by
    PPC.

    - Cleanup and consolidation of the exit related code in posix CPU
    timers.

    - Small cleanups and enhancements here and there

    Drivers:

    - The obligatory new drivers: Ingenic JZ47xx and X1000 TCU support

    - Correct the clock rate of PIT64b global clock

    - setup_irq() cleanup

    - Preparation for PWM and suspend support for the TI DM timer

    - Expand the fttmr010 driver to support ast2600 systems

    - The usual small fixes, enhancements and cleanups all over the
    place"

    * tag 'timers-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (80 commits)
    Revert "clocksource/drivers/timer-probe: Avoid creating dead devices"
    vdso: Fix clocksource.h macro detection
    um: Fix header inclusion
    arm64: vdso32: Enable Clang Compilation
    lib/vdso: Enable common headers
    arm: vdso: Enable arm to use common headers
    x86/vdso: Enable x86 to use common headers
    mips: vdso: Enable mips to use common headers
    arm64: vdso32: Include common headers in the vdso library
    arm64: vdso: Include common headers in the vdso library
    arm64: Introduce asm/vdso/processor.h
    arm64: vdso32: Code clean up
    linux/elfnote.h: Replace elf.h with UAPI equivalent
    scripts: Fix the inclusion order in modpost
    common: Introduce processor.h
    linux/ktime.h: Extract common header for vDSO
    linux/jiffies.h: Extract common header for vDSO
    linux/time64.h: Extract common header for vDSO
    linux/time32.h: Extract common header for vDSO
    linux/time.h: Extract common header for vDSO
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Continued user-access cleanups in the futex code.

    - percpu-rwsem rewrite that uses its own waitqueue and atomic_t
    instead of an embedded rwsem. This addresses a couple of
    weaknesses, but the primary motivation was complications on the -rt
    kernel.

    - Introduce raw lock nesting detection on lockdep
    (CONFIG_PROVE_RAW_LOCK_NESTING=y), document the raw_lock vs. normal
    lock differences. This too originates from -rt.

    - Reuse lockdep zapped chain_hlocks entries, to conserve RAM
    footprint on distro-ish kernels running into the "BUG:
    MAX_LOCKDEP_CHAIN_HLOCKS too low!" depletion of the lockdep
    chain-entries pool.

    - Misc cleanups, smaller fixes and enhancements - see the changelog
    for details"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (55 commits)
    fs/buffer: Make BH_Uptodate_Lock bit_spin_lock a regular spinlock_t
    thermal/x86_pkg_temp: Make pkg_temp_lock a raw_spinlock_t
    Documentation/locking/locktypes: Minor copy editor fixes
    Documentation/locking/locktypes: Further clarifications and wordsmithing
    m68knommu: Remove mm.h include from uaccess_no.h
    x86: get rid of user_atomic_cmpxchg_inatomic()
    generic arch_futex_atomic_op_inuser() doesn't need access_ok()
    x86: don't reload after cmpxchg in unsafe_atomic_op2() loop
    x86: convert arch_futex_atomic_op_inuser() to user_access_begin/user_access_end()
    objtool: whitelist __sanitizer_cov_trace_switch()
    [parisc, s390, sparc64] no need for access_ok() in futex handling
    sh: no need of access_ok() in arch_futex_atomic_op_inuser()
    futex: arch_futex_atomic_op_inuser() calling conventions change
    completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all()
    lockdep: Add posixtimer context tracing bits
    lockdep: Annotate irq_work
    lockdep: Add hrtimer context tracing bits
    lockdep: Introduce wait-type checks
    completion: Use simple wait queues
    sched/swait: Prepare usage in completions
    ...

    Linus Torvalds
     

04 Mar, 2020

1 commit

  • The reasons why the extra posix_cpu_timers_exit_group() invocation has been
    added are not entirely clear from the commit message. Today all that
    posix_cpu_timers_exit_group() does is stop timers that are tracking the
    task from firing. Every other operation on those timers is still allowed.

    The practical implication is that posix_cpu_timer_del(), which could
    not get the siglock after the thread group leader has exited (because
    sighand == NULL), would still be able to run successfully because the
    timer was already dequeued.

    With that locking issue fixed there is no point in disabling all of
    the timers, so remove this "temporary" hack.

    Fixes: e0a70217107e ("posix-cpu-timers: workaround to suppress the problems with mt exec")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/87o8tityzs.fsf@x220.int.ebiederm.org

    Eric W. Biederman
     

28 Feb, 2020

1 commit

  • This patch fixes the following sparse error:
    kernel/exit.c:627:25: error: incompatible types in comparison expression

    And the following warning:
    kernel/exit.c:626:40: warning: incorrect type in assignment

    Signed-off-by: Madhuparna Bhowmik
    Acked-by: Oleg Nesterov
    Acked-by: Christian Brauner
    [christian.brauner@ubuntu.com: edit commit message]
    Link: https://lore.kernel.org/r/20200130062028.4870-1-madhuparnabhowmik10@gmail.com
    Signed-off-by: Christian Brauner

    Madhuparna Bhowmik
     

25 Feb, 2020

1 commit

  • Rework the flushing of proc to use a list of directory inodes that
    need to be flushed.

    The list is kept on struct pid not on struct task_struct, as there is
    a fixed connection between proc inodes and pids but at least for the
    case of de_thread the pid of a task_struct changes.

    This removes the dependency on proc_mnt, which allows different
    mounts of proc to have different mount options even in the same pid
    namespace, and allows for the removal of proc_mnt, which will
    trivially let the first mount of proc honor its mount options.

    This flushing remains an optimization. The functions
    pid_delete_dentry and pid_revalidate ensure that ordinary dcache
    management will not attempt to use dentries past the point their
    respective task has died. When unused the shrinker will
    eventually be able to remove these dentries.

    There is a case in de_thread where proc_flush_pid can be called
    early for a given pid, which winds up being safe (if suboptimal)
    as this is just an optimization.

    Only pid directories are put on the list as the other
    per pid files are children of those directories and
    d_invalidate on the directory will get them as well.

    So that the pid can be used during flushing, its reference count is
    taken in release_task and dropped in proc_flush_pid. Further, the call
    of proc_flush_pid is moved after the tasklist_lock is released in
    release_task so that it is certain that the pid has already been
    unhashed when the flushing takes place. This removes a small race
    where a dentry could be recreated.

    As struct pid is supposed to be small and I need a per pid lock,
    I reuse the only lock that currently exists in struct pid:
    the wait_pidfd.lock.

    The net result is that this adds all of this functionality
    with just a little extra list management overhead and
    a single extra pointer in struct pid.

    v2: Initialize pid->inodes. I somehow failed to get that
    initialization into the initial version of the patch. A boot
    failure was reported by "kernel test robot ", and
    failure to initialize that pid->inodes matches all of the reported
    symptoms.

    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

11 Feb, 2020

1 commit

  • Now that __percpu_up_read() is only ever used from percpu_up_read()
    merge them, it's a small function.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Will Deacon
    Acked-by: Waiman Long
    Link: https://lkml.kernel.org/r/20200131151540.212415454@infradead.org

    Davidlohr Bueso
     

04 Jan, 2020

1 commit

  • Pull thread fixes from Christian Brauner:
    "Here are two fixes:

    - Panic earlier when global init exits to generate usable coredumps.

    Currently, when global init and all threads in its thread-group
    have exited we panic via:

    do_exit()
    -> exit_notify()
    -> forget_original_parent()
    -> find_child_reaper()

    This makes it hard to extract a usable coredump for global init
    from a kernel crashdump because by the time we panic exit_mm() will
    have already released global init's mm. We now panic slightly
    earlier. This has been a problem in certain environments such as
    Android.

    - Fix a race in assigning and reading taskstats for thread-groups
    with more than one thread.

    This patch has been waiting for quite a while since people
    disagreed on what the correct fix was at first"

    * tag 'for-linus-2020-01-03' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    exit: panic before exit_mm() on global init exit
    taskstats: fix data-race

    Linus Torvalds
     

21 Dec, 2019

1 commit

  • Currently, when global init and all threads in its thread-group have exited
    we panic via:
    do_exit()
    -> exit_notify()
    -> forget_original_parent()
    -> find_child_reaper()
    This makes it hard to extract a usable coredump for global init from a
    kernel crashdump because by the time we panic exit_mm() will have already
    released global init's mm.
    This patch moves the panic further up, before exit_mm() is called. As was the
    case previously, we only panic when global init and all its threads in the
    thread-group have exited.

    Signed-off-by: chenqiwu
    Acked-by: Christian Brauner
    Acked-by: Oleg Nesterov
    [christian.brauner@ubuntu.com: fix typo, rewrite commit message]
    Link: https://lore.kernel.org/r/1576736993-10121-1-git-send-email-qiwuchen55@gmail.com
    Signed-off-by: Christian Brauner

    chenqiwu
     

01 Dec, 2019

1 commit

  • …ux/kernel/git/dhowells/linux-fs

    Pull pipe rework from David Howells:
    "This is my set of preparatory patches for building a general
    notification queue on top of pipes. It makes a number of significant
    changes:

    - It removes the nr_exclusive argument from __wake_up_sync_key() as
    this is always 1. This prepares for the next step:

    - Adds wake_up_interruptible_sync_poll_locked() so that poll can be
    woken up from a function that's holding the poll waitqueue
    spinlock.

    - Change the pipe buffer ring to be managed in terms of unbounded
    head and tail indices rather than bounded index and length. This
    means that reading the pipe only needs to modify one index, not
    two.

    - A selection of helper functions are provided to query the state of
    the pipe buffer, plus a couple to apply updates to the pipe
    indices.

    - The pipe ring is allowed to have kernel-reserved slots. This allows
    many notification messages to be spliced in by the kernel without
    allowing userspace to pin too many pages if it writes to the same
    pipe.

    - Advance the head and tail indices inside the pipe waitqueue lock
    and use wake_up_interruptible_sync_poll_locked() to poke poll
    without having to take the lock twice.

    - Rearrange pipe_write() to preallocate the buffer it is going to
    write into and then drop the spinlock. This allows kernel
    notifications to be added to the ring whilst it is filling the
    buffer it allocated. The read side is stalled because the pipe
    mutex is still held.

    - Don't wake up readers on a pipe if there was already data in it
    when we added more.

    - Don't wake up writers on a pipe if the ring wasn't full before we
    removed a buffer"

    * tag 'notifications-pipe-prep-20191115' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    pipe: Remove sync on wake_ups
    pipe: Increase the writer-wakeup threshold to reduce context-switch count
    pipe: Check for ring full inside of the spinlock in pipe_write()
    pipe: Remove redundant wakeup from pipe_write()
    pipe: Rearrange sequence in pipe_write() to preallocate slot
    pipe: Conditionalise wakeup in pipe_read()
    pipe: Advance tail pointer inside of wait spinlock in pipe_read()
    pipe: Allow pipes to have kernel-reserved slots
    pipe: Use head and tail pointers for the ring, not cursor and length
    Add wake_up_interruptible_sync_poll_locked()
    Remove the nr_exclusive argument from __wake_up_sync_key()
    pipe: Reduce #inclusion of pipe_fs_i.h

    Linus Torvalds
     

27 Nov, 2019

1 commit

  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - A comprehensive rewrite of the robust/PI futex code's exit handling
    to fix various exit races. (Thomas Gleixner et al)

    - Rework the generic REFCOUNT_FULL implementation using
    atomic_fetch_* operations so that the performance impact of the
    cmpxchg() loops is mitigated for common refcount operations.

    With these performance improvements the generic implementation of
    refcount_t should be good enough for everybody - and this got
    confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
    REFCOUNT_FULL entirely, leaving the generic implementation enabled
    unconditionally. (Will Deacon)

    - Other misc changes, fixes, cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    lkdtm: Remove references to CONFIG_REFCOUNT_FULL
    locking/refcount: Remove unused 'refcount_error_report()' function
    locking/refcount: Consolidate implementations of refcount_t
    locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
    locking/refcount: Move saturation warnings out of line
    locking/refcount: Improve performance of generic REFCOUNT_FULL code
    locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the header
    locking/refcount: Remove unused refcount_*_checked() variants
    locking/refcount: Ensure integer operands are treated as signed
    locking/refcount: Define constants for saturation and max refcount values
    futex: Prevent exit livelock
    futex: Provide distinct return value when owner is exiting
    futex: Add mutex around futex exit
    futex: Provide state handling for exec() as well
    futex: Sanitize exit state handling
    futex: Mark the begin of futex exit explicitly
    futex: Set task::futex_state to DEAD right after handling futex exit
    futex: Split futex_mm_release() for exit/exec
    exit/exec: Seperate mm_release()
    futex: Replace PF_EXITPIDONE with a state
    ...

    Linus Torvalds
     

20 Nov, 2019

4 commits

  • Instead of relying on PF_EXITING use an explicit state for the futex exit
    and set it in the futex exit function. This moves the smp barrier and the
    lock/unlock serialization into the futex code.

    As with the DEAD state this is restricted to the exit path as exec
    continues to use the same task struct.

    This allows that logic to be simplified in a later step.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de

    Thomas Gleixner
     
  • Setting task::futex_state in do_exit() is rather arbitrarily placed for no
    reason. Move it into the futex code.

    Note, this is only done for the exit cleanup as the exec cleanup cannot set
    the state to FUTEX_STATE_DEAD because the task struct is still in active
    use.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de

    Thomas Gleixner
     
  • mm_release() contains the futex exit handling. mm_release() is called from
    do_exit()->exit_mm() and from exec()->exec_mm().

    In the exit_mm() case PF_EXITING is set and the futex state is updated.
    In the exec_mm() case these states are not touched.

    As the futex exit code needs further protections against exit races, this
    needs to be split into two functions.

    Preparatory only, no functional change.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de

    Thomas Gleixner
     
  • The futex exit handling relies on PF_ flags. That's suboptimal as it
    requires a smp_mb() and an ugly lock/unlock of the exiting task's pi_lock in
    the middle of do_exit() to enforce the observability of PF_EXITING in the
    futex code.

    Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic
    over to the new state. The PF_EXITING dependency will be cleaned up in a
    later step.

    This prepares for handling various futex exit issues later.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de

    Thomas Gleixner