15 Oct, 2020

1 commit

  • Pull kernel_clone() updates from Christian Brauner:
    "During the v5.9 merge window we reworked the process creation
    codepaths across multiple architectures. After this work we were only
    left with the _do_fork() helper based on the struct kernel_clone_args
    calling convention. As was pointed out, _do_fork() isn't valid
    kernelese, especially for a helper that isn't just static.

    This series removes the _do_fork() helper and introduces the new
    kernel_clone() helper. The process creation cleanup didn't change the
    name to something more reasonable mainly because _do_fork() was used
    in quite a few places. So sending this as a separate series seemed the
    better strategy.

    I originally intended to send this early in the v5.9 development cycle
    after the merge window had closed but given that this was touching
    quite a few places I decided to defer this until the v5.10 merge
    window"

    * tag 'kernel-clone-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    sched: remove _do_fork()
    tracing: switch to kernel_clone()
    kgdbts: switch to kernel_clone()
    kprobes: switch to kernel_clone()
    x86: switch to kernel_clone()
    sparc: switch to kernel_clone()
    nios2: switch to kernel_clone()
    m68k: switch to kernel_clone()
    ia64: switch to kernel_clone()
    h8300: switch to kernel_clone()
    fork: introduce kernel_clone()

    Linus Torvalds
     

14 Oct, 2020

3 commits

  • Currently __set_oom_adj loops through all processes in the system to keep
    oom_score_adj and oom_score_adj_min in sync between processes sharing
    their mm. This is done for any task with more than one mm_user, which
    includes processes with multiple threads (sharing mm and signals).
    However, for such processes the loop is unnecessary because their signal
    structure is shared as well.

    Android updates oom_score_adj whenever a task changes its role
    (background/foreground/...) or binds to/unbinds from a service, making it
    more/less important. Such operations can happen frequently. We noticed
    that updates to oom_score_adj became more expensive and after further
    investigation found out that the patch mentioned in "Fixes" introduced a
    regression. Using Pixel 4 with a typical Android workload, write time to
    oom_score_adj increased from ~3.57us to ~362us. Moreover this regression
    linearly depends on the number of multi-threaded processes running on the
    system.

    Mark the mm with a new MMF_MULTIPROCESS flag bit when a task is created with
    (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK). Change __set_oom_adj to use
    MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
    update should be synchronized between multiple processes. To prevent
    races between clone() and __set_oom_adj(), when oom_score_adj of the
    process being cloned might be modified from userspace, we use
    oom_adj_mutex. Its scope is changed to global.

    The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
    the case of vfork(). To prevent performance regressions of vfork(), we
    skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
    specified. Clearing the MMF_MULTIPROCESS flag (when the last process
    sharing the mm exits) is left out of this patch to keep it simple and
    because it is believed that this threading model is rare. Should there
    ever be a need for optimizing that case as well, it can be done by hooking
    into the exit path, likely following the mm_update_next_owner pattern.

    With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
    quite rare, the regression is gone after the change is applied.
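
    A minimal sketch of the clone-time hook described above (simplified and
    illustrative; names follow the patch, but details may differ):

    static void copy_oom_score_adj(u64 clone_flags, struct task_struct *tsk)
    {
            /* Kernel threads have no mm to synchronize. */
            if (!tsk->mm)
                    return;

            /* Skip threads (shared signals) and vfork() children. */
            if ((clone_flags & (CLONE_VM | CLONE_THREAD | CLONE_VFORK)) != CLONE_VM)
                    return;

            /* Serialize against a concurrent __set_oom_adj(). */
            mutex_lock(&oom_adj_mutex);
            set_bit(MMF_MULTIPROCESS, &tsk->mm->flags);
            tsk->signal->oom_score_adj = current->signal->oom_score_adj;
            tsk->signal->oom_score_adj_min = current->signal->oom_score_adj_min;
            mutex_unlock(&oom_adj_mutex);
    }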

    [surenb@google.com: v3]
    Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com

    Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
    Reported-by: Tim Murray
    Suggested-by: Michal Hocko
    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Andrew Morton
    Acked-by: Christian Brauner
    Acked-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Eugene Syromiatnikov
    Cc: Christian Kellner
    Cc: Adrian Reber
    Cc: Shakeel Butt
    Cc: Aleksa Sarai
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Alexey Gladkov
    Cc: Michel Lespinasse
    Cc: Daniel Jordan
    Cc: Andrei Vagin
    Cc: Bernd Edlinger
    Cc: John Johansen
    Cc: Yafang Shao
    Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
    Debugged-by: Minchan Kim
    Signed-off-by: Linus Torvalds

    Suren Baghdasaryan
     
  • Neither of the mm pointers is needed after commit 7a4830c380f3
    ("mm/fork: Pass new vma pointer into copy_page_range()").

    Jason Gunthorpe also reported that the parameter ordering of
    copy_page_range() is odd. Since we are working on it anyway, reorder the
    parameters to be logical: (1) always put the dst_* fields before the
    src_* fields, and (2) keep parameters of the same type together.
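
    For illustration, the prototype change looks roughly like this (a sketch;
    the dropped mm pointers are recoverable from the vmas via vma->vm_mm):

    /* before: mm pointers plus vmas, mixed ordering */
    int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                        struct vm_area_struct *vma, struct vm_area_struct *new);

    /* after: dst_* before src_*, same-typed parameters together */
    int copy_page_range(struct vm_area_struct *dst_vma,
                        struct vm_area_struct *src_vma);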

    [peterx@redhat.com: further reorder some parameters and line format, per Jason]
    Link: https://lkml.kernel.org/r/20201002192647.7161-1-peterx@redhat.com
    [peterx@redhat.com: fix warnings]
    Link: https://lkml.kernel.org/r/20201006200138.GA6026@xz-x1

    Reported-by: Kirill A. Shutemov
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Acked-by: Kirill A. Shutemov
    Link: https://lkml.kernel.org/r/20200930204950.6668-1-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Commit 4bb5f5d9395b ("mm: allow drivers to prevent new writable mappings")
    changed i_mmap_writable from unsigned int to atomic_t and added the helper
    function mapping_allow_writable() to atomic_inc i_mmap_writable, but it
    forgot to use this helper in dup_mmap() and __vma_link_file().
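
    The helper itself is a one-line wrapper, so the fix is a straight
    substitution at both call sites (sketch of the helper as added by
    commit 4bb5f5d9395b):

    static inline void mapping_allow_writable(struct address_space *mapping)
    {
            atomic_inc(&mapping->i_mmap_writable);
    }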

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Cc: Christian Brauner
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Christian Kellner
    Cc: Suren Baghdasaryan
    Cc: Adrian Reber
    Cc: Shakeel Butt
    Cc: Aleksa Sarai
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200917112736.7789-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

01 Oct, 2020

1 commit

  • Grab actual references to the files_struct. To avoid circular reference
    issues due to this, we add a per-task note that keeps track of which
    io_uring contexts a task has used. When the task execs or exits its
    assigned files, we cancel requests based on this tracking.

    With that, we can grab proper references to the files table, and no
    longer need to rely on stashing away ring_fd and ring_file to check
    if the ring_fd may have been closed.

    Cc: stable@vger.kernel.org # v5.5+
    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Sep, 2020

2 commits

  • This prepares for the future work to trigger early cow on pinned pages
    during fork().

    No functional change intended.

    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • (Commit message largely collected from Jason Gunthorpe)

    Reduce the chance of false positive from page_maybe_dma_pinned() by
    keeping track if the mm_struct has ever been used with pin_user_pages().
    This allows cases that might drive up the page ref_count to avoid any
    penalty from handling dma_pinned pages.

    Future work is planned, to provide a more sophisticated solution, likely
    to turn it into a real counter. For now, make it atomic_t but use it as
    a boolean for simplicity.
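
    A sketch of the tracking described above (simplified; the flag lives in
    mm_struct and is set on the pin_user_pages() path):

    struct mm_struct {
            /* ... */
            atomic_t has_pinned;    /* atomic_t for now, used as a boolean */
    };

    /* on the pin_user_pages() path, before pinning: */
    if (!atomic_read(&mm->has_pinned))
            atomic_set(&mm->has_pinned, 1);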

    Suggested-by: Jason Gunthorpe
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

06 Sep, 2020

1 commit

  • Commit 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    changed ctl_table.proc_handler to take a kernel pointer. Adjust the
    definition of sysctl_max_threads to match its prototype in
    linux/sysctl.h which fixes the following sparse error/warning:

    kernel/fork.c:3050:47: warning: incorrect type in argument 3 (different address spaces)
    kernel/fork.c:3050:47: expected void *
    kernel/fork.c:3050:47: got void [noderef] __user *buffer
    kernel/fork.c:3036:5: error: symbol 'sysctl_max_threads' redeclared with different type (incompatible argument 3 (different address spaces)):
    kernel/fork.c:3036:5: int extern [addressable] [signed] [toplevel] sysctl_max_threads( ... )
    kernel/fork.c: note: in included file (through include/linux/key.h, include/linux/cred.h, include/linux/sched/signal.h, include/linux/sched/cputime.h):
    include/linux/sysctl.h:242:5: note: previously declared as:
    include/linux/sysctl.h:242:5: int extern [addressable] [signed] [toplevel] sysctl_max_threads( ... )
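
    The fix drops the __user annotation from the buffer argument so that the
    definition matches the declaration in include/linux/sysctl.h:

    int sysctl_max_threads(struct ctl_table *table, int write,
                           void *buffer, size_t *lenp, loff_t *ppos);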

    Fixes: 32927393dc1c ("sysctl: pass kernel pointers to ->proc_handler")
    Signed-off-by: Tobias Klauser
    Signed-off-by: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Al Viro
    Link: https://lkml.kernel.org/r/20200825093647.24263-1-tklauser@distanz.ch
    Signed-off-by: Linus Torvalds

    Tobias Klauser
     

20 Aug, 2020

1 commit

  • The old _do_fork() helper doesn't follow naming conventions of in-kernel
    helpers for syscalls. The process creation cleanup in [1] didn't change the
    name to something more reasonable mainly because _do_fork() was used in quite a
    few places. So sending this as a separate series seemed the better strategy.

    This commit does two things:
    1. Renames _do_fork() to kernel_clone() but keeps _do_fork() as a simple
    static inline wrapper around kernel_clone().
    2. Changes the return type from long to pid_t. This aligns kernel_thread()
    and kernel_clone(). Also, the return value from kernel_clone() that is
    surfaced in fork(), vfork(), clone(), and clone3() is taken from pid_vnr(),
    which returns a pid_t too.

    Follow-up patches will switch each caller of _do_fork() and each place where it
    is referenced over to kernel_clone(). After all these changes are done, we can
    remove _do_fork() completely and will only be left with kernel_clone().

    [1]: 9ba27414f2ec ("Merge tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux")
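
    The transitional state after this commit looks roughly like this (sketch):

    pid_t kernel_clone(struct kernel_clone_args *kargs);

    /* temporary wrapper, removed once all callers are converted */
    static inline long _do_fork(struct kernel_clone_args *kargs)
    {
            return kernel_clone(kargs);
    }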

    Signed-off-by: Christian Brauner
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Matthew Wilcox (Oracle)
    Cc: "Peter Zijlstra (Intel)"
    Link: https://lore.kernel.org/r/20200819104655.436656-2-christian.brauner@ubuntu.com

    Christian Brauner
     

11 Aug, 2020

1 commit

  • Pull locking updates from Thomas Gleixner:
    "A set of locking fixes and updates:

    - Untangle the header spaghetti which causes build failures in
    various situations caused by the lockdep additions to seqcount to
    validate that the write side critical sections are non-preemptible.

    - The seqcount associated lock debug addons which were blocked by the
    above fallout.

    seqcount writers, contrary to seqlock writers, must be externally
    serialized, which usually happens via locking - except for strict
    per-CPU seqcounts. As the lock is not part of the seqcount, lockdep
    cannot validate that the lock is held.

    This new debug mechanism adds the concept of associated locks. The
    sequence count now has lock type variants and corresponding
    initializers which take a pointer to the associated lock used for
    writer serialization. If lockdep is enabled the pointer is stored
    and write_seqcount_begin() has a lockdep assertion to validate that
    the lock is held.

    Aside of the type and the initializer no other code changes are
    required at the seqcount usage sites. The rest of the seqcount API
    is unchanged and determines the type at compile time with the help
    of _Generic which is possible now that the minimal GCC version has
    been moved up.

    Adding this lockdep coverage unearthed a handful of seqcount bugs
    which have been addressed already independent of this.

    While generally useful this comes with a Trojan Horse twist: On RT
    kernels the write side critical section can become preemptible if
    the writers are serialized by an associated lock, which leads to
    the well known reader preempts writer livelock. RT prevents this by
    storing the associated lock pointer independent of lockdep in the
    seqcount and changing the reader side to block on the lock when a
    reader detects that a writer is in the write side critical section.

    - Conversion of seqcount usage sites to associated types and
    initializers"

    * tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    locking/seqlock, headers: Untangle the spaghetti monster
    locking, arch/ia64: Reduce header dependencies by moving XTP bits into the new header
    x86/headers: Remove APIC headers from <asm/smp.h>
    seqcount: More consistent seqprop names
    seqcount: Compress SEQCNT_LOCKNAME_ZERO()
    seqlock: Fold seqcount_LOCKNAME_init() definition
    seqlock: Fold seqcount_LOCKNAME_t definition
    seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
    hrtimer: Use sequence counter with associated raw spinlock
    kvm/eventfd: Use sequence counter with associated spinlock
    userfaultfd: Use sequence counter with associated spinlock
    NFSv4: Use sequence counter with associated spinlock
    iocost: Use sequence counter with associated spinlock
    raid5: Use sequence counter with associated spinlock
    vfs: Use sequence counter with associated spinlock
    timekeeping: Use sequence counter with associated raw spinlock
    xfrm: policy: Use sequence counters with associated lock
    netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
    netfilter: conntrack: Use sequence counter with associated spinlock
    sched: tasks: Use sequence counter with associated spinlock
    ...

    Linus Torvalds
     

08 Aug, 2020

2 commits

  • Patch series "kasan: support stack instrumentation for tag-based mode", v2.

    This patch (of 5):

    Prepare Software Tag-Based KASAN for stack tagging support.

    With Tag-Based KASAN when kernel stacks are allocated via pagealloc (which
    happens when CONFIG_VMAP_STACK is not enabled), they get tagged. KASAN
    instrumentation doesn't expect the sp register to be tagged, and this
    leads to false-positive reports.

    Fix by resetting the tag of kernel stack pointers after allocation.
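
    The fix amounts to passing the freshly allocated stack pointer through
    kasan_reset_tag(), roughly (sketch of the pagealloc path in
    alloc_thread_stack_node()):

    struct page *page = alloc_pages_node(node, THREADINFO_GFP,
                                         THREAD_SIZE_ORDER);

    if (likely(page)) {
            /* Stack accesses via sp must use an untagged pointer. */
            tsk->stack = kasan_reset_tag(page_address(page));
            return tsk->stack;
    }
    return NULL;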

    Signed-off-by: Andrey Konovalov
    Signed-off-by: Andrew Morton
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Marco Elver
    Cc: Walter Wu
    Cc: Elena Petrova
    Cc: Vincenzo Frascino
    Cc: Catalin Marinas
    Cc: Ard Biesheuvel
    Link: http://lkml.kernel.org/r/cover.1596199677.git.andreyknvl@google.com
    Link: http://lkml.kernel.org/r/cover.1596544734.git.andreyknvl@google.com
    Link: http://lkml.kernel.org/r/12d8c678869268dd0884b01271ab592f30792abf.1596544734.git.andreyknvl@google.com
    Link: http://lkml.kernel.org/r/01c678b877755bcf29009176592402cdf6f2cb15.1596199677.git.andreyknvl@google.com
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=203497
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     
  • Currently the kernel stack is being accounted per-zone. There is no need
    to do that. In addition due to being per-zone, memcg has to keep a
    separate MEMCG_KERNEL_STACK_KB. Make the stat per-node and deprecate
    MEMCG_KERNEL_STACK_KB as memcg_stat_item is an extension of
    node_stat_item. In addition localize the kernel stack stats updates to
    account_kernel_stack().

    Signed-off-by: Shakeel Butt
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200630161539.1759185-1-shakeelb@google.com
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     

05 Aug, 2020

4 commits

  • Pull close_range() implementation from Christian Brauner:
    "This adds the close_range() syscall. It allows to efficiently close a
    range of file descriptors up to all file descriptors of a calling
    task.

    This is coordinated with the FreeBSD folks which have copied our
    version of this syscall and in the meantime have already merged it in
    April 2019:

    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836

    The syscall originally came up in a discussion around the new mount
    API and making new file descriptor types cloexec by default. During
    this discussion, Al suggested the close_range() syscall.

    First, it helps to close all file descriptors of an exec()ing task.
    This can be done safely via (quoting Al's example from [1] verbatim):

    /* that exec is sensitive */
    unshare(CLONE_FILES);
    /* we don't want anything past stderr here */
    close_range(3, ~0U);
    execve(....);

    The code snippet above is one way of working around the problem that
    file descriptors are not cloexec by default. This is aggravated by the
    fact that we can't just switch them over without massively regressing
    userspace. For a whole class of programs having an in-kernel method of
    closing all file descriptors is very helpful (e.g. daemons, service
    managers, programming language standard libraries, container managers
    etc.).

    Second, it allows userspace to avoid implementing closing all file
    descriptors by parsing through /proc/<pid>/fd/* and calling close() on
    each file descriptor and other hacks. From looking at various
    large(ish) userspace code bases this or similar patterns are very
    common in service managers, container runtimes, and programming
    language runtimes/standard libraries such as Python or Rust.

    In addition, the syscall will also work for tasks that do not have
    procfs mounted and on kernels that do not have procfs support compiled
    in. In such situations the only way to make sure that all file
    descriptors are closed is to call close() on each file descriptor up
    to UINT_MAX, or to resort to RLIMIT_NOFILE/OPEN_MAX trickery.

    Based on Linus' suggestion close_range() also comes with a new flag
    CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping
    right before exec. This would usually be expressed in the sequence:

    unshare(CLONE_FILES);
    close_range(3, ~0U);

    as pointed out by Linus it might be desirable to have this be a part
    of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which
    gets especially handy when we're closing all file descriptors above a
    certain threshold.

    Test-suite as always included"
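
    For reference, a minimal use of the new flag (a sketch; the raw syscall
    is shown since libc wrappers may not exist yet):

    /* unshare the fd table and close everything past stderr in one call */
    syscall(__NR_close_range, 3, ~0U, CLOSE_RANGE_UNSHARE);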

    * tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add CLOSE_RANGE_UNSHARE tests
    close_range: add CLOSE_RANGE_UNSHARE
    tests: add close_range() tests
    arch: wire-up close_range()
    open: add close_range()

    Linus Torvalds
     
  • Pull fork cleanups from Christian Brauner:
    "This is cleanup series from when we reworked a chunk of the process
    creation paths in the kernel and switched to struct
    {kernel_}clone_args.

    High-level this does two main things:

    - Remove the double export of both do_fork() and _do_fork() where
    do_fork() used the inconsistent legacy clone calling convention.

    Now we only export _do_fork() which is based on struct
    kernel_clone_args.

    - Remove the copy_thread_tls()/copy_thread() split making the
    architecture specific HAVE_COPY_THREAD_TLS config option obsolete.

    This switches all remaining architectures to select
    HAVE_COPY_THREAD_TLS and thus to the copy_thread_tls() calling
    convention. The current split makes the process creation codepaths
    more convoluted than they need to be. Each architecture has its own
    copy_thread() function unless it selects HAVE_COPY_THREAD_TLS, in which
    case it has a copy_thread_tls() function.

    The split is not needed anymore nowadays, all architectures support
    CLONE_SETTLS but quite a few of them never bothered to select
    HAVE_COPY_THREAD_TLS and instead simply continued to use copy_thread()
    and use the old calling convention. Removing this split cleans up the
    process creation codepaths and paves the way for implementing clone3()
    on such architectures since it requires the copy_thread_tls() calling
    convention.

    After having made each architectures support copy_thread_tls() this
    series simply renames that function back to copy_thread(). It also
    switches all architectures that call do_fork() directly over to
    _do_fork() and the struct kernel_clone_args calling convention. This
    is a corollary of switching the architectures that did not yet support
    it over to copy_thread_tls() since do_fork() is conditional on not
    supporting copy_thread_tls() (Mostly because it lacks a separate
    argument for tls which is trivial to fix but there's no need for this
    function to exist.).

    The do_fork() removal is in itself already useful as it allows us to
    remove the export of both do_fork() and _do_fork() we currently have
    in favor of only _do_fork(). This has already been discussed back when
    we added clone3(). The legacy clone() calling convention is - as is
    probably well-known - somewhat odd:

    #
    # ABI hall of shame
    #
    config CLONE_BACKWARDS
    config CLONE_BACKWARDS2
    config CLONE_BACKWARDS3

    that is aggravated by the fact that some architectures such as sparc
    follow the CLONE_BACKWARDSx calling convention but don't really select
    the corresponding config option since they call do_fork() directly.

    So do_fork() enforces a somewhat arbitrary calling convention in the
    first place that doesn't really help the individual architectures that
    deviate from it. They can thus simply be switched to _do_fork()
    enforcing a single calling convention. (I really hope that any new
    architectures will __not__ try to implement their own calling
    conventions...)

    Most architectures already have made a similar switch (m68k comes to
    mind).

    Overall this removes more code than it adds even with a good portion
    of added comments. It simplifies a chunk of arch specific assembly
    either by moving the code into C or by simply rewriting the assembly.

    Architectures that have been touched in non-trivial ways have all been
    actually boot and stress tested: sparc and ia64 have been tested with
    Debian 9 images. They are the two architectures which have been
    touched the most. All non-trivial changes to architectures have seen
    acks from the relevant maintainers. nios2 with a custom built
    buildroot image. h8300 I couldn't get something bootable to test on
    but the changes have been fairly automatic and I'm sure we'll hear
    people yell if I broke something there.

    All other architectures that have been touched in trivial ways have
    been compile tested for each single patch of the series via git rebase
    -x "make ..." v5.8-rc2. arm{64} and x86{_64} have been boot tested
    even though they have just been trivially touched (removal of the
    HAVE_COPY_THREAD_TLS macro from their Kconfig) because well they are
    basically "core architectures" and since it is trivial to get your
    hands on a useable image"

    * tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    arch: rename copy_thread_tls() back to copy_thread()
    arch: remove HAVE_COPY_THREAD_TLS
    unicore: switch to copy_thread_tls()
    sh: switch to copy_thread_tls()
    nds32: switch to copy_thread_tls()
    microblaze: switch to copy_thread_tls()
    hexagon: switch to copy_thread_tls()
    c6x: switch to copy_thread_tls()
    alpha: switch to copy_thread_tls()
    fork: remove do_fork()
    h8300: select HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
    nios2: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
    ia64: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
    sparc: unconditionally enable HAVE_COPY_THREAD_TLS
    sparc: share process creation helpers between sparc and sparc64
    sparc64: enable HAVE_COPY_THREAD_TLS
    fork: fold legacy_clone_args_valid() into _do_fork()

    Linus Torvalds
     
  • Pull execve updates from Eric Biederman:
    "During the development of v5.7 I ran into bugs and quality of
    implementation issues related to exec that could not be easily fixed
    because of the way exec is implemented. So I have been diggin into
    exec and cleaning up what I can.

    This cycle I have been looking at different ideas and different
    implementations to see what is possible to improve exec, and cleaning
    up the way exec interfaces with in-kernel users. Only cleaning up the
    interfaces of exec with the rest of the kernel has managed to stabilize
    and make it through review in time for v5.9-rc1, resulting in 2 sets of
    changes this cycle.

    - Implement kernel_execve

    - Make the user mode driver code a better citizen

    With kernel_execve the code size got a little larger as the copying of
    parameters from userspace and the copying of parameters from kernel
    space are now separate. The good news is kernel threads no longer need
    to play games with set_fs to use exec. Which, when combined with the
    rest of Christoph's set_fs changes, should make security bugs with
    set_fs much more difficult"

    * 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
    exec: Implement kernel_execve
    exec: Factor bprm_stack_limits out of prepare_arg_pages
    exec: Factor bprm_execve out of do_execve_common
    exec: Move bprm_mm_init into alloc_bprm
    exec: Move initialization of bprm->filename into alloc_bprm
    exec: Factor out alloc_bprm
    exec: Remove unnecessary spaces from binfmts.h
    umd: Stop using split_argv
    umd: Remove exit_umh
    bpfilter: Take advantage of the facilities of struct pid
    exit: Factor thread_group_exited out of pidfd_poll
    umd: Track user space drivers with struct pid
    bpfilter: Move bpfilter_umh back into init data
    exec: Remove do_execve_file
    umh: Stop calling do_execve_file
    umd: Transform fork_usermode_blob into fork_usermode_driver
    umd: Rename umd_info.cmdline umd_info.driver_name
    umd: For clarity rename umh_info umd_info
    umh: Separate the user mode driver and the user mode helper support
    umh: Remove call_usermodehelper_setup_file.
    ...

    Linus Torvalds
     
  • Pull seccomp updates from Kees Cook:
    "There are a bunch of clean ups and selftest improvements along with
    two major updates to the SECCOMP_RET_USER_NOTIF filter return:
    EPOLLHUP support to more easily detect the death of a monitored
    process, and being able to inject fds when intercepting syscalls that
    expect an fd-opening side-effect (needed by both container folks and
    Chrome). The latter continued the refactoring of __scm_install_fd()
    started by Christoph, and in the process found and fixed a handful of
    bugs in various callers.

    - Improved selftest coverage, timeouts, and reporting

    - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)

    - Refactor __scm_install_fd() into __receive_fd() and fix buggy
    callers

    - Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun
    Dhillon)"

    * tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits)
    selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
    seccomp: Introduce addfd ioctl to seccomp user notifier
    fs: Expand __receive_fd() to accept existing fd
    pidfd: Replace open-coded receive_fd()
    fs: Add receive_fd() wrapper for __receive_fd()
    fs: Move __scm_install_fd() to __receive_fd()
    net/scm: Regularize compat handling of scm_detach_fds()
    pidfd: Add missing sock updates for pidfd_getfd()
    net/compat: Add missing sock updates for SCM_RIGHTS
    selftests/seccomp: Check ENOSYS under tracing
    selftests/seccomp: Refactor to use fixture variants
    selftests/harness: Clean up kern-doc for fixtures
    seccomp: Use -1 marker for end of mode 1 syscall list
    seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
    selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
    selftests/seccomp: Make kcmp() less required
    seccomp: Use pr_fmt
    selftests/seccomp: Improve calibration loop
    selftests/seccomp: use 90s as timeout
    selftests/seccomp: Expand benchmark to per-filter measurements
    ...

    Linus Torvalds
     

04 Aug, 2020

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - Improve uclamp performance by using a static key for the fast path

    - Add the "sched_util_clamp_min_rt_default" sysctl, to optimize for
    better power efficiency of RT tasks on battery powered devices.
    (The default is to maximize performance & reduce RT latencies.)

    - Improve utime and stime tracking accuracy, which had a fixed boundary
    of error, which created larger and larger relative errors as the
    values become larger. This is now replaced with more precise
    arithmetics, using the new mul_u64_u64_div_u64() helper in math64.h.

    - Improve the deadline scheduler, such as making it capacity aware

    - Improve frequency-invariant scheduling

    - Misc cleanups in energy/power aware scheduling

    - Add sched_update_nr_running tracepoint to track changes to nr_running

    - Documentation additions and updates

    - Misc cleanups and smaller fixes

    * tag 'sched-core-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
    sched/doc: Factorize bits between sched-energy.rst & sched-capacity.rst
    sched/doc: Document capacity aware scheduling
    sched: Document arch_scale_*_capacity()
    arm, arm64: Fix selection of CONFIG_SCHED_THERMAL_PRESSURE
    Documentation/sysctl: Document uclamp sysctl knobs
    sched/uclamp: Add a new sysctl to control RT default boost value
    sched/uclamp: Fix a deadlock when enabling uclamp static key
    sched: Remove duplicated tick_nohz_full_enabled() check
    sched: Fix a typo in a comment
    sched/uclamp: Remove unnecessary mutex_init()
    arm, arm64: Select CONFIG_SCHED_THERMAL_PRESSURE
    sched: Cleanup SCHED_THERMAL_PRESSURE kconfig entry
    arch_topology, sched/core: Cleanup thermal pressure definition
    trace/events/sched.h: fix duplicated word
    linux/sched/mm.h: drop duplicated words in comments
    smp: Fix a potential usage of stale nr_cpus
    sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal
    sched: nohz: stop passing around unused "ticks" parameter.
    sched: Better document ttwu()
    sched: Add a tracepoint to track rq->nr_running
    ...

    Linus Torvalds
     

31 Jul, 2020

1 commit

  • Refactor the IRQ trace events fields, used for printing information
    about the IRQ trace events, into a separate struct 'irqtrace_events'.

    This improves readability by separating the information only used in
    reporting, as well as enables (simplified) storing/restoring of
    irqtrace_events snapshots.

    No functional change intended.
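
    The new struct gathers the previously loose task_struct fields, roughly
    (sketch of the layout):

    struct irqtrace_events {
            unsigned int    irq_events;
            unsigned long   hardirq_enable_ip;
            unsigned long   hardirq_disable_ip;
            unsigned int    hardirq_enable_event;
            unsigned int    hardirq_disable_event;
            unsigned long   softirq_disable_ip;
            unsigned long   softirq_enable_ip;
            unsigned int    softirq_disable_event;
            unsigned int    softirq_enable_event;
    };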

    Signed-off-by: Marco Elver
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20200729110916.3920464-1-elver@google.com
    Signed-off-by: Ingo Molnar

    Marco Elver
     

29 Jul, 2020

2 commits

  • A sequence counter write side critical section must be protected by some
    form of locking to serialize writers. A plain seqcount_t does not
    contain the information of which lock must be held when entering a write
    side critical section.

    Use the new seqcount_spinlock_t data type, which allows to associate a
    spinlock with the sequence counter. This enables lockdep to verify that
    the spinlock used for writer serialization is held when the write side
    critical section is entered.

    If lockdep is disabled this lock association is compiled out and has
    neither storage size nor runtime overhead.
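
    A usage sketch of the associated-lock type (hypothetical names):

    static DEFINE_SPINLOCK(stats_lock);
    static seqcount_spinlock_t stats_seq =
            SEQCNT_SPINLOCK_ZERO(stats_seq, &stats_lock);

    /* write side: lockdep can now assert that stats_lock is held */
    spin_lock(&stats_lock);
    write_seqcount_begin(&stats_seq);
    /* ... update the protected data ... */
    write_seqcount_end(&stats_seq);
    spin_unlock(&stats_lock);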

    Signed-off-by: Ahmed S. Darwish
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200720155530.1173732-14-a.darwish@linutronix.de

    Ahmed S. Darwish
     
  • RT tasks by default run at the highest capacity/performance level. When
    uclamp is selected this default behavior is retained by enforcing the
    requested uclamp.min (p->uclamp_req[UCLAMP_MIN]) of the RT tasks to be
    uclamp_none(UCLAMP_MAX), which is SCHED_CAPACITY_SCALE; the maximum
    value.

    This is also referred to as 'the default boost value of RT tasks'.

    See commit 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks").

    On battery powered devices, it is desired to control this default
    (currently hardcoded) behavior at runtime to reduce energy consumed by
    RT tasks.

    For example, for a mobile device manufacturer where the big.LITTLE
    architecture is dominant, the performance of the little cores varies
    across SoCs, and on high end ones the big cores could be too power
    hungry.

    Given the diversity of SoCs, the new knob allows manufacturers to tune
    the best performance/power for RT tasks for the particular hardware they
    run on.

    They could opt to further tune the value when the user selects
    a different power saving mode or when the device is actively charging.

    The runtime aspect of it further helps in creating a single kernel image
    that can be run on multiple devices that require different tuning.

    Keep in mind that a lot of RT tasks in the system are created by the
    kernel. On Android for instance I can see over 50 RT tasks, only
    a handful of which are created by the Android framework.

    To control the default behavior globally by system admins and device
    integrator, introduce the new sysctl_sched_uclamp_util_min_rt_default
    to change the default boost value of the RT tasks.

    I anticipate this to be mostly in the form of modifying the init script
    of a particular device.

    To avoid polluting the fast path with unnecessary code, the approach
    taken is to synchronously do the update by traversing all the existing
    tasks in the system. This could race with a concurrent fork(), which is
    dealt with by introducing a sched_post_fork() function which will ensure
    the racy fork will get the right update applied.

    Tested on Juno-r2 in combination with the RT capacity awareness [1].
    By default an RT task will go to the highest capacity CPU and run at the
    maximum frequency, which is particularly energy inefficient on high end
    mobile devices because the biggest core[s] are 'huge' and power hungry.

    With this patch the RT task can be controlled to run anywhere by
    default, and doesn't cause the frequency to be maximum all the time.
    Yet any task that really needs to be boosted can easily escape this
    default behavior by modifying its requested uclamp.min value
    (p->uclamp_req[UCLAMP_MIN]) via sched_setattr() syscall.

    [1] 804d402fb6f6: ("sched/rt: Make RT capacity-aware")
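
    For illustration, a task opting back into the maximum boost might do
    something like this (a sketch; sched_setattr() is usually invoked via
    syscall(2) since glibc provides no wrapper):

    struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_FIFO,
            .sched_priority = 10,
            .sched_flags    = SCHED_FLAG_UTIL_CLAMP_MIN,
            .sched_util_min = 1024, /* SCHED_CAPACITY_SCALE, i.e. max boost */
    };
    syscall(__NR_sched_setattr, 0, &attr, 0);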

    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200716110347.19553-2-qais.yousef@arm.com

    Qais Yousef
     

11 Jul, 2020

1 commit

  • The seccomp filter used to be released in free_task() which is called
    asynchronously via call_rcu() and assorted mechanisms. Since we need
    to inform tasks waiting on the seccomp notifier when a filter goes empty
    we will notify them as soon as a task has been marked fully dead in
    release_task(). To not split seccomp cleanup into two parts, move
    filter release out of free_task() and into release_task() after we've
    unhashed struct task from struct pid, exited signals, and unlinked it
    from the threadgroups' thread list. We'll put the empty filter
    notification infrastructure into it in a follow up patch.

    This also renames put_seccomp_filter() to seccomp_filter_release() which
    is a more descriptive name of what we're doing here especially once
    we've added the empty filter notification mechanism in there.

    We're also NULL-ing the task's filter tree entrypoint which seems
    cleaner than leaving a dangling pointer in there. Note that this shouldn't
    need any memory barriers since we're calling this when the task is in
    release_task() which means it's EXIT_DEAD. So it can't modify its seccomp
    filters anymore. You can also see this from the point where we're calling
    seccomp_filter_release(). It's after __exit_signal() and at this point,
    tsk->sighand will already have been NULLed which is required for
    thread-sync and filter installation alike.
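
    The move boils down to something like this (sketch; simplified from the
    patch):

    void seccomp_filter_release(struct task_struct *tsk)
    {
            struct seccomp_filter *orig = tsk->seccomp.filter;

            /* Detach the dying task from its filter tree... */
            tsk->seccomp.filter = NULL;
            /* ...and drop the reference that free_task() used to drop. */
            __put_seccomp_filter(orig);
    }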

    Cc: Tycho Andersen
    Cc: Kees Cook
    Cc: Matt Denton
    Cc: Sargun Dhillon
    Cc: Jann Horn
    Cc: Chris Palmer
    Cc: Aleksa Sarai
    Cc: Robert Sesek
    Cc: Jeffrey Vander Stoep
    Cc: Linux Containers
    Signed-off-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200531115031.391515-2-christian.brauner@ubuntu.com
    Signed-off-by: Kees Cook

    Christian Brauner
     

10 Jul, 2020

1 commit

  • Currently all IRQ-tracking state is in task_struct, this means that
    task_struct needs to be defined before we use it.

    Especially for lockdep_assert_irq*() this can lead to header-hell.

    Move the hardirq state into per-cpu variables to avoid the task_struct
    dependency.
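
    The state in question moves along these lines (sketch):

    /* previously fields in struct task_struct */
    DECLARE_PER_CPU(int, hardirqs_enabled);
    DECLARE_PER_CPU(int, hardirq_context);

    /* lockdep_assert_irq*() can now test this_cpu_read(hardirqs_enabled)
       without pulling in the full task_struct definition. */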

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200623083721.512673481@infradead.org

    Peter Zijlstra
     

08 Jul, 2020

1 commit

  • Create an independent helper thread_group_exited which returns true
    when all threads have passed exit_notify in do_exit. AKA all of the
    threads are at least zombies and might be dead or completely gone.

    Create this helper by taking the logic out of pidfd_poll where it is
    already tested, and adding a READ_ONCE on the read of
    task->exit_state.

    I will be changing the user mode driver code to use this same logic
    to know when a user mode driver needs to be restarted.

    Place the new helper thread_group_exited in kernel/exit.c and
    EXPORT it so it can be used by modules.
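
    The resulting helper is small (sketch; close to the pidfd_poll() logic
    it was factored from):

    bool thread_group_exited(struct pid *pid)
    {
            struct task_struct *task;
            bool exited;

            rcu_read_lock();
            task = pid_task(pid, PIDTYPE_PID);
            exited = !task ||
                    (READ_ONCE(task->exit_state) && thread_group_empty(task));
            rcu_read_unlock();

            return exited;
    }
    EXPORT_SYMBOL(thread_group_exited);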

    Link: https://lkml.kernel.org/r/20200702164140.4468-13-ebiederm@xmission.com
    Acked-by: Christian Brauner
    Acked-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

05 Jul, 2020

3 commits

  • Now that HAVE_COPY_THREAD_TLS has been removed, rename copy_thread_tls()
    back to simply copy_thread(). It's a simpler name, and doesn't imply that only
    tls is copied here. This finishes an outstanding chunk of internal process
    creation work since we've added clone3().

    Cc: linux-arch@vger.kernel.org
    Acked-by: Thomas Bogendoerfer
    Acked-by: Stafford Horne
    Acked-by: Greentime Hu
    Acked-by: Geert Uytterhoeven
    Reviewed-by: Kees Cook
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • All architectures support copy_thread_tls() now, so remove the legacy
    copy_thread() function and the HAVE_COPY_THREAD_TLS config option. Everyone
    uses the same process creation calling convention based on
    copy_thread_tls() and struct kernel_clone_args. This will make it easier to
    maintain the core process creation code under kernel/, simplifies the
    callpaths, and makes them identical for all architectures.

    Cc: linux-arch@vger.kernel.org
    Acked-by: Thomas Bogendoerfer
    Acked-by: Greentime Hu
    Acked-by: Geert Uytterhoeven
    Reviewed-by: Kees Cook
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • Now that all architectures have been switched to use _do_fork() and the new
    struct kernel_clone_args calling convention we can remove the legacy
    do_fork() helper completely. The calling convention used to be brittle and
    do_fork() didn't buy us anything. The only calling convention accepted
    should be based on struct kernel_clone_args going forward. It's cleaner and
    uniform.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Peter Zijlstra (Intel)"
    Signed-off-by: Christian Brauner

    Christian Brauner
     

30 Jun, 2020

1 commit

  • struct vm_area_struct could be accessed concurrently as noticed by
    KCSAN,

    write to 0xffff9cf8bba08ad8 of 8 bytes by task 14263 on cpu 35:
    vma_interval_tree_insert+0x101/0x150:
    rb_insert_augmented_cached at include/linux/rbtree_augmented.h:58
    (inlined by) vma_interval_tree_insert at mm/interval_tree.c:23
    __vma_link_file+0x6e/0xe0
    __vma_link_file at mm/mmap.c:629
    vma_link+0xa2/0x120
    mmap_region+0x753/0xb90
    do_mmap+0x45c/0x710
    vm_mmap_pgoff+0xc0/0x130
    ksys_mmap_pgoff+0x1d1/0x300
    __x64_sys_mmap+0x33/0x40
    do_syscall_64+0x91/0xc44
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    read to 0xffff9cf8bba08a80 of 200 bytes by task 14262 on cpu 122:
    vm_area_dup+0x6a/0xe0
    vm_area_dup at kernel/fork.c:362
    __split_vma+0x72/0x2a0
    __split_vma at mm/mmap.c:2661
    split_vma+0x5a/0x80
    mprotect_fixup+0x368/0x3f0
    do_mprotect_pkey+0x263/0x420
    __x64_sys_mprotect+0x51/0x70
    do_syscall_64+0x91/0xc44
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    vm_area_dup() blindly copies all fields of original VMA to the new one.
    This includes copying vm_area_struct::shared.rb which is normally
    protected by i_mmap_lock. But this is fine because the read value will
    be overwritten on the following __vma_link_file() under proper
    protection. Thus, mark it as an intentional data race and insert a few
    assertions for the fields that should not be modified concurrently.

    Signed-off-by: Qian Cai
    Signed-off-by: Paul E. McKenney

    Qian Cai
     

26 Jun, 2020

1 commit

  • KCSAN reported data race reading and writing nr_threads and max_threads.
    The data race is intentional and benign. This is obvious from the comment
    above it and based on general consensus when discussing this issue. So
    there's no need for any heavy atomic or *_ONCE() machinery here.

    In accordance with the newly introduced data_race() annotation consensus,
    mark the offending line with data_race(). Here it's actually useful not
    just to silence KCSAN but to also clearly communicate that the race is
    intentional. This is especially helpful since nr_threads is otherwise
    protected by tasklist_lock.
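
    Concretely, the annotated check in copy_process() ends up looking
    roughly like this (sketch):

    /* racy, but the comment above the check explains why that is fine */
    if (data_race(nr_threads >= max_threads))
            goto bad_fork_cleanup_count;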

    BUG: KCSAN: data-race in copy_process / copy_process

    write to 0xffffffff86205cf8 of 4 bytes by task 14779 on cpu 1:
    copy_process+0x2eba/0x3c40 kernel/fork.c:2273
    _do_fork+0xfe/0x7a0 kernel/fork.c:2421
    __do_sys_clone kernel/fork.c:2576 [inline]
    __se_sys_clone kernel/fork.c:2557 [inline]
    __x64_sys_clone+0x130/0x170 kernel/fork.c:2557
    do_syscall_64+0xcc/0x3a0 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    read to 0xffffffff86205cf8 of 4 bytes by task 6944 on cpu 0:
    copy_process+0x94d/0x3c40 kernel/fork.c:1954
    _do_fork+0xfe/0x7a0 kernel/fork.c:2421
    __do_sys_clone kernel/fork.c:2576 [inline]
    __se_sys_clone kernel/fork.c:2557 [inline]
    __x64_sys_clone+0x130/0x170 kernel/fork.c:2557
    do_syscall_64+0xcc/0x3a0 arch/x86/entry/common.c:294
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    Link: https://groups.google.com/forum/#!msg/syzkaller-upstream-moderation/thvp7AHs5Ew/aPdYLXfYBQAJ

    Reported-by: syzbot+52fced2d288f8ecd2b20@syzkaller.appspotmail.com
    Signed-off-by: Zefan Li
    Signed-off-by: Weilong Chen
    Acked-by: Christian Brauner
    Cc: Qian Cai
    Cc: Oleg Nesterov
    Cc: Christian Brauner
    Cc: Marco Elver
    [christian.brauner@ubuntu.com: rewrite commit message]
    Link: https://lore.kernel.org/r/20200623041240.154294-1-chenweilong@huawei.com
    Signed-off-by: Christian Brauner

    Weilong Chen
     

22 Jun, 2020

1 commit

  • This separate helper only existed to guarantee the mutual exclusivity of
    CLONE_PIDFD and CLONE_PARENT_SETTID for legacy clone since CLONE_PIDFD
    abuses the parent_tid field to return the pidfd. But we can actually handle
    this uniformly, thus removing the helper. For legacy clone we can detect
    that CLONE_PIDFD is specified in conjunction with CLONE_PARENT_SETTID
    because they will share the same memory which is invalid and for clone3()
    setting the separate pidfd and parent_tid fields to the same memory is
    bogus as well. So fold that helper directly into _do_fork() by detecting
    this case.
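
    Folded into _do_fork(), the check reads roughly like this (sketch):

    /*
     * For legacy clone(), CLONE_PIDFD reuses the parent_tid slot to return
     * the pidfd, so combining it with CLONE_PARENT_SETTID would share the
     * same memory; for clone3(), aliasing the two fields is equally bogus.
     */
    if ((args->flags & CLONE_PIDFD) &&
        (args->flags & CLONE_PARENT_SETTID) &&
        (args->pidfd == args->parent_tid))
            return -EINVAL;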

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: Geert Uytterhoeven
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Peter Zijlstra (Intel)"
    Cc: linux-m68k@lists.linux-m68k.org
    Cc: x86@kernel.org
    Signed-off-by: Christian Brauner

    Christian Brauner
     

17 Jun, 2020

1 commit

  • One of the use-cases of close_range() is to drop file descriptors just before
    execve(). This would usually be expressed in the sequence:

    unshare(CLONE_FILES);
    close_range(3, ~0U);

    as pointed out by Linus it might be desirable to have this be a part of
    close_range() itself under a new flag CLOSE_RANGE_UNSHARE.

    This expands {dup,unshare}_fd() to take a max_fds argument that indicates the
    maximum number of file descriptors to copy from the old struct files. When the
    user requests that all file descriptors be closed via close_range(min, max),
    we can cap via unshare_fd(min) and hence don't need to do any of the heavy
    fput() work for everything above min.

    The patch makes it so that if CLOSE_RANGE_UNSHARE is requested and we do in
    fact currently share our file descriptor table we create a new private copy.
    We then close all fds in the requested range and finally after we're done we
    install the new fd table.
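
    The expanded copy helper takes the cap described above, roughly (sketch
    of the prototype):

    /* only descriptors below max_fds are copied into the new table */
    struct files_struct *dup_fd(struct files_struct *oldf,
                                unsigned int max_fds, int *errorp);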

    Suggested-by: Linus Torvalds
    Signed-off-by: Christian Brauner

    Christian Brauner
     

10 Jun, 2020

3 commits

  • Add API for nested write locks and convert the few call sites doing that.
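
    A sketch of the nested variant (at this point the rwsem is still named
    mmap_sem inside mm_struct):

    static inline void mmap_write_lock_nested(struct mm_struct *mm,
                                              int subclass)
    {
            down_write_nested(&mm->mmap_sem, subclass);
    }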

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-7-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definitions of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
            return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
            return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always the
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes <asm/pgtable.h> to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include <asm/pgtable.h>
    in the files that include <linux/mm.h>.

    The include statements in such cases are removed with a simple loop:

    for f in $(git grep -l "include <asm/pgtable.h>") ; do
            sed -i -e '/include <asm\/pgtable.h>/ d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

05 Jun, 2020

1 commit

  • Pull proc updates from Eric Biederman:
    "This has four sets of changes:

    - modernize proc to support multiple private instances

    - ensure we see the exit of each process tid exactly once

    - remove has_group_leader_pid

    - use pids not tasks in posix-cpu-timers lookup

    Alexey updated proc so each mount of proc uses a new superblock. This
    allows people to actually use mount options with proc with no fear of
    messing up another mount of proc. Given the kernel's internal mounts
    of proc for things like uml this was a real problem, and resulted in
    Android's hidepid mount options being ignored and introducing security
    issues.

    The rest of the changes are small cleanups and fixes that came out of
    my work to allow this change to proc. In essence it is swapping the
    pids in de_thread during exec which removes a special case the code
    had to handle. Then updating the code to stop handling that special
    case"

    * 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: proc_pid_ns takes super_block as an argument
    remove the no longer needed pid_alive() check in __task_pid_nr_ns()
    posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
    posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
    posix-cpu-timers: Extend rcu_read_lock removing task_struct references
    signal: Remove has_group_leader_pid
    exec: Remove BUG_ON(has_group_leader_pid)
    posix-cpu-timer: Unify the now redundant code in lookup_task
    posix-cpu-timer: Tidy up group_leader logic in lookup_task
    proc: Ensure we see the exit of each process tid exactly once
    rculist: Add hlists_swap_heads_rcu
    proc: Use PIDTYPE_TGID in next_tgid
    Use proc_pid_ns() to get pid_namespace from the proc superblock
    proc: use named enums for better readability
    proc: use human-readable values for hidepid
    docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
    proc: add option to mount only a pids subset
    proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
    proc: allow to mount many instances of proc in one pid namespace
    proc: rename struct proc_fs_info to proc_fs_opts

    Linus Torvalds
     

02 Jun, 2020

2 commits

  • Pull arm64 updates from Will Deacon:
    "A sizeable pile of arm64 updates for 5.8.

    Summary below, but the big two features are support for Branch Target
    Identification and Clang's Shadow Call stack. The latter is currently
    arm64-only, but the high-level parts are all in core code so it could
    easily be adopted by other architectures pending toolchain support

    Branch Target Identification (BTI):

    - Support for ARMv8.5-BTI in both user- and kernel-space. This allows
    branch targets to limit the types of branch from which they can be
    called and additionally prevents branching to arbitrary code,
    although kernel support requires a very recent toolchain.

    - Function annotation via SYM_FUNC_START() so that assembly functions
    are wrapped with the relevant "landing pad" instructions.

    - BPF and vDSO updates to use the new instructions.

    - Addition of a new HWCAP and exposure of BTI capability to userspace
    via ID register emulation, along with ELF loader support for the
    BTI feature in .note.gnu.property.

    - Non-critical fixes to CFI unwind annotations in the sigreturn
    trampoline.

    Shadow Call Stack (SCS):

    - Support for Clang's Shadow Call Stack feature, which reserves
    platform register x18 to point at a separate stack for each task
    that holds only return addresses. This protects function return
    control flow from buffer overruns on the main stack.

    - Save/restore of x18 across problematic boundaries (user-mode,
    hypervisor, EFI, suspend, etc).

    - Core support for SCS, should other architectures want to use it
    too.

    - SCS overflow checking on context-switch as part of the existing
    stack limit check if CONFIG_SCHED_STACK_END_CHECK=y.

    CPU feature detection:

    - Removed numerous "SANITY CHECK" errors when running on a system
    with mismatched AArch32 support at EL1. This is primarily a concern
    for KVM, which disabled support for 32-bit guests on such a system.

    - Addition of new ID registers and fields as the architecture has
    been extended.

    Perf and PMU drivers:

    - Minor fixes and cleanups to system PMU drivers.

    Hardware errata:

    - Unify KVM workarounds for VHE and nVHE configurations.

    - Sort vendor errata entries in Kconfig.

    Secure Monitor Call Calling Convention (SMCCC):

    - Update to the latest specification from Arm (v1.2).

    - Allow PSCI code to query the SMCCC version.

    Software Delegated Exception Interface (SDEI):

    - Unexport a bunch of unused symbols.

    - Minor fixes to handling of firmware data.

    Pointer authentication:

    - Add support for dumping the kernel PAC mask in vmcoreinfo so that
    the stack can be unwound by tools such as kdump.

    - Simplification of key initialisation during CPU bringup.

    BPF backend:

    - Improve immediate generation for logical and add/sub instructions.

    vDSO:

    - Minor fixes to the linker flags for consistency with other
    architectures and support for LLVM's unwinder.

    - Clean up logic to initialise and map the vDSO into userspace.

    ACPI:

    - Work around for an ambiguity in the IORT specification relating to
    the "num_ids" field.

    - Support _DMA method for all named components rather than only PCIe
    root complexes.

    - Minor other IORT-related fixes.

    Miscellaneous:

    - Initialise debug traps early for KGDB and fix KDB cacheflushing
    deadlock.

    - Minor tweaks to early boot state (documentation update, set
    TEXT_OFFSET to 0x0, increase alignment of PE/COFF sections).

    - Refactoring and cleanup"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
    KVM: arm64: Move __load_guest_stage2 to kvm_mmu.h
    KVM: arm64: Check advertised Stage-2 page size capability
    arm64/cpufeature: Add get_arm64_ftr_reg_nowarn()
    ACPI/IORT: Remove the unused __get_pci_rid()
    arm64/cpuinfo: Add ID_MMFR4_EL1 into the cpuinfo_arm64 context
    arm64/cpufeature: Add remaining feature bits in ID_AA64PFR1 register
    arm64/cpufeature: Add remaining feature bits in ID_AA64PFR0 register
    arm64/cpufeature: Add remaining feature bits in ID_AA64ISAR0 register
    arm64/cpufeature: Add remaining feature bits in ID_MMFR4 register
    arm64/cpufeature: Add remaining feature bits in ID_PFR0 register
    arm64/cpufeature: Introduce ID_MMFR5 CPU register
    arm64/cpufeature: Introduce ID_DFR1 CPU register
    arm64/cpufeature: Introduce ID_PFR2 CPU register
    arm64/cpufeature: Make doublelock a signed feature in ID_AA64DFR0
    arm64/cpufeature: Drop TraceFilt feature exposure from ID_DFR0 register
    arm64/cpufeature: Add explicit ftr_id_isar0[] for ID_ISAR0 register
    arm64: mm: Add asid_gen_match() helper
    firmware: smccc: Fix missing prototype warning for arm_smccc_version_init
    arm64: vdso: Fix CFI directives in sigreturn trampoline
    arm64: vdso: Don't prefix sigreturn trampoline with a BTI C instruction
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The RCU updates for this cycle were:

    - RCU-tasks update, including the addition of RCU Tasks Trace for
    BPF use and TASKS_RUDE_RCU.

    - kfree_rcu() updates (see the usage sketch after this list).

    - Removal of a scheduler locking restriction.

    - RCU CPU stall warning updates.

    - Torture-test updates.

    - Miscellaneous fixes and other updates"
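
    As a side note on the kfree_rcu() item above, a minimal usage sketch
    ("struct foo" and its fields are illustrative, not taken from this
    series): kfree_rcu() frees an object after a grace period without
    requiring a hand-written call_rcu() callback.

    #include <linux/rcupdate.h>
    #include <linux/slab.h>

    struct foo {
            int data;
            struct rcu_head rcu;    /* needed for deferred freeing */
    };

    static void release_foo(struct foo *p)
    {
            /* Free p only after all pre-existing RCU readers finish. */
            kfree_rcu(p, rcu);
    }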

    * tag 'core-rcu-2020-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    rcu: Allow for smp_call_function() running callbacks from idle
    rcu: Provide rcu_irq_exit_check_preempt()
    rcu: Abstract out rcu_irq_enter_check_tick() from rcu_nmi_enter()
    rcu: Provide __rcu_is_watching()
    rcu: Provide rcu_irq_exit_preempt()
    rcu: Make RCU IRQ enter/exit functions rely on in_nmi()
    rcu/tree: Mark the idle relevant functions noinstr
    x86: Replace ist_enter() with nmi_enter()
    x86/mce: Send #MC signal from task work
    x86/entry: Get rid of ist_begin/end_non_atomic()
    sched,rcu,tracing: Avoid tracing before in_nmi() is correct
    sh/ftrace: Move arch_ftrace_nmi_{enter,exit} into nmi exception
    lockdep: Always inline lockdep_{off,on}()
    hardirq/nmi: Allow nested nmi_enter()
    arm64: Prepare arch_nmi_enter() for recursion
    printk: Disallow instrumenting print_nmi_enter()
    printk: Prepare for nested printk_nmi_enter()
    rcutorture: Convert ULONG_CMP_LT() to time_before()
    torture: Add a --kasan argument
    torture: Save a few lines by using config_override_param initially
    ...

    Linus Torvalds
     

19 May, 2020

1 commit

  • syzbot found that

    touch /proc/testfile

    causes a NULL pointer dereference at tomoyo_get_local_path()
    because the inode of the dentry is NULL.

    Before c59f415a7cb6, Tomoyo received the pid_ns directly from
    proc's s_fs_info. Since proc_pid_ns() could only work with an
    inode, using it in tomoyo_get_local_path() was wrong.

    To avoid creating more functions for getting the proc pid_ns,
    change the argument type of proc_pid_ns() so that it takes the
    super_block. Then Tomoyo can use the existing super_block to get
    the pid_ns.
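
    A sketch of the resulting interface (the exact shape is an
    assumption based on the description above): proc_pid_ns() takes the
    proc super_block directly, so callers such as
    tomoyo_get_local_path() that only have a super_block no longer need
    an inode.

    static inline struct pid_namespace *proc_pid_ns(struct super_block *sb)
    {
            return proc_sb_info(sb)->pid_ns;    /* pid_ns of this proc mount */
    }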

    Link: https://lkml.kernel.org/r/0000000000002f0c7505a5b0e04c@google.com
    Link: https://lkml.kernel.org/r/20200518180738.2939611-1-gladkov.alexey@gmail.com
    Reported-by: syzbot+c1af344512918c61362c@syzkaller.appspotmail.com
    Fixes: c59f415a7cb6 ("Use proc_pid_ns() to get pid_namespace from the proc superblock")
    Signed-off-by: Alexey Gladkov
    Signed-off-by: Eric W. Biederman

    Alexey Gladkov
     

15 May, 2020

1 commit

  • This change adds generic support for Clang's Shadow Call Stack,
    which uses a shadow stack to protect return addresses from being
    overwritten by an attacker. Details are available here:

    https://clang.llvm.org/docs/ShadowCallStack.html

    Note that the security guarantees in the kernel differ from the
    ones documented for user space. The kernel must store the addresses
    of shadow stacks in memory, which means an attacker capable of
    reading and writing arbitrary memory may be able to locate them and
    hijack control flow by modifying the stacks.
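
    Per the linked Clang documentation, on AArch64 the instrumentation
    amounts to pushing the return address onto the shadow stack (reached
    via the reserved register x18) in the prologue and reloading it in
    the epilogue; roughly:

            str     x30, [x18], #8          // prologue: push return address
            // ... function body ...
            ldr     x30, [x18, #-8]!        // epilogue: reload return address
            ret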

    Signed-off-by: Sami Tolvanen
    Reviewed-by: Kees Cook
    Reviewed-by: Miguel Ojeda
    [will: Numerous cosmetic changes]
    Signed-off-by: Will Deacon

    Sami Tolvanen
     

08 May, 2020

1 commit

  • Jan reported an issue where an interaction between sign-extending clone's
    flag argument on ppc64le and the new CLONE_INTO_CGROUP feature causes
    clone() to consistently fail with EBADF.

    The whole story is a little longer. The legacy clone() syscall is odd in a
    bunch of ways, and here two things interact. First, legacy clone's flag
    argument is word-size dependent, i.e. it's an unsigned long, whereas most
    system calls with flag arguments use int or unsigned int. Second, legacy
    clone() ignores unknown and deprecated flags. Taken together, this means
    that users on 64-bit systems have always been able to pass garbage in the
    upper 32 bits of the clone() flag argument and things would just work
    fine. Try the program below on a 64-bit kernel: prior to v5.7-rc1 it
    succeeds, while on v5.7-rc1 it fails with EBADF:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            pid_t pid;

            /* Note that legacy clone() has different argument ordering on
             * different architectures so this won't work everywhere.
             *
             * Only set the upper 32 bits.
             */
            pid = syscall(__NR_clone, 0xffffffff00000000 | SIGCHLD,
                          NULL, NULL, NULL, NULL);
            if (pid < 0)
                    exit(EXIT_FAILURE);     /* clone() rejected the flags */
            if (pid == 0)
                    exit(EXIT_SUCCESS);     /* child */
            if (wait(NULL) != pid)
                    exit(EXIT_FAILURE);

            exit(EXIT_SUCCESS);
    }

    Since legacy clone() couldn't be extended, this was not a problem so far,
    and nobody really noticed or cared because nothing in the kernel ever
    bothered to look at the upper 32 bits.

    But once we introduced clone3() and expanded the flag argument in struct
    clone_args to 64 bits, we opened this can of worms. With the first
    flag-based extension to clone3() making use of the upper 32 bits of the
    flag argument, we effectively made it possible for the legacy clone()
    syscall to reach clone3()-only flags. The sign-extension scenario is just
    the odd corner case through which we figured this out.

    The reason we only realized this now, and not already when we introduced
    CLONE_CLEAR_SIGHAND, is that CLONE_INTO_CGROUP assumes that a valid cgroup
    file descriptor has been given. So the sign extension (or the user
    accidentally passing garbage in the upper 32 bits) caused the
    CLONE_INTO_CGROUP bit to be raised and the kernel to error out when it
    didn't find a valid cgroup file descriptor.

    Let's fix this by always capping the upper 32 bits for all codepaths that
    are not aware of clone3() features. This ensures that we can't reach
    clone3()-only features by accident via legacy clone(), as in the sign
    extension case, and also that legacy clone() works exactly as before,
    i.e. ignoring any unknown flags. This solution risks no regressions and
    is also pretty clean.
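
    A simplified sketch of that capping, modeled on the legacy clone()
    wrapper around the process-creation helper in kernel/fork.c
    (_do_fork() at the time of this fix; the details below are
    illustrative, not the full patch): mask the flag argument to its
    lower 32 bits before building struct kernel_clone_args, so
    clone3()-only flags can never be set via legacy clone().

    #include <linux/kernel.h>       /* lower_32_bits() */
    #include <linux/sched.h>        /* CSIGNAL */
    #include <linux/sched/task.h>   /* struct kernel_clone_args */

    static long legacy_clone(unsigned long clone_flags, unsigned long newsp,
                             int __user *parent_tidptr,
                             int __user *child_tidptr, unsigned long tls)
    {
            struct kernel_clone_args args = {
                    /* Legacy clone() can never see the upper 32 bits. */
                    .flags          = (lower_32_bits(clone_flags) & ~CSIGNAL),
                    .exit_signal    = (lower_32_bits(clone_flags) & CSIGNAL),
                    .parent_tid     = parent_tidptr,
                    .child_tid      = child_tidptr,
                    .stack          = newsp,
                    .tls            = tls,
            };

            return _do_fork(&args);
    }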

    Fixes: 7f192e3cd316 ("fork: add clone3")
    Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups")
    Reported-by: Jan Stancek
    Signed-off-by: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Dmitry V. Levin
    Cc: Andreas Schwab
    Cc: Florian Weimer
    Cc: libc-alpha@sourceware.org
    Cc: stable@vger.kernel.org # 5.3+
    Link: https://sourceware.org/pipermail/libc-alpha/2020-May/113596.html
    Link: https://lore.kernel.org/r/20200507103214.77218-1-christian.brauner@ubuntu.com

    Christian Brauner