24 Oct, 2020

1 commit

  • Pull arch task_work cleanups from Jens Axboe:
    "Two cleanups that don't fit other categories:

    - Finally get the task_work_add() cleanup done properly, so we don't
    have random 0/1/false/true/TWA_SIGNAL confusing use cases. Updates
    all callers, and also fixes up the documentation for
    task_work_add().

    - While working on some TIF related changes for 5.11, this
    TIF_NOTIFY_RESUME cleanup fell out of that. Remove some arch
    duplication for how that is handled"

    * tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block:
    task_work: cleanup notification modes
    tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume()

    Linus Torvalds
     

23 Oct, 2020

1 commit

  • Pull Kbuild updates from Masahiro Yamada:

    - Support 'make compile_commands.json' to generate the compilation
    database more easily, avoiding stale entries

    - Support 'make clang-analyzer' and 'make clang-tidy' for static checks
    using clang-tidy

    - Preprocess scripts/modules.lds.S to allow CONFIG options in the
    module linker script

    - Drop cc-option tests from compiler flags supported by our minimal
    GCC/Clang versions

    - Use always 12-digits commit hash for CONFIG_LOCALVERSION_AUTO=y

    - Use sha1 build id for both BFD linker and LLD

    - Improve deb-pkg for reproducible builds and rootless builds

    - Remove stale, useless scripts/namespace.pl

    - Turn -Wreturn-type warning into error

    - Fix build error of deb-pkg when CONFIG_MODULES=n

    - Replace 'hostname' command with more portable 'uname -n'

    - Various Makefile cleanups

    * tag 'kbuild-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits)
    kbuild: Use uname for LINUX_COMPILE_HOST detection
    kbuild: Only add -fno-var-tracking-assignments for old GCC versions
    kbuild: remove leftover comment for filechk utility
    treewide: remove DISABLE_LTO
    kbuild: deb-pkg: clean up package name variables
    kbuild: deb-pkg: do not build linux-headers package if CONFIG_MODULES=n
    kbuild: enforce -Werror=return-type
    scripts: remove namespace.pl
    builddeb: Add support for all required debian/rules targets
    builddeb: Enable rootless builds
    builddeb: Pass -n to gzip for reproducible packages
    kbuild: split the build log of kallsyms
    kbuild: explicitly specify the build id style
    scripts/setlocalversion: make git describe output more reliable
    kbuild: remove cc-option test of -Werror=date-time
    kbuild: remove cc-option test of -fno-stack-check
    kbuild: remove cc-option test of -fno-strict-overflow
    kbuild: move CFLAGS_{KASAN,UBSAN,KCSAN} exports to relevant Makefiles
    kbuild: remove redundant CONFIG_KASAN check from scripts/Makefile.kasan
    kbuild: do not create built-in objects for external module builds
    ...

    Linus Torvalds
     

20 Oct, 2020

1 commit

  • Pull m68knommu updates from Greg Ungerer:
    "A collection of fixes for 5.10:

    - switch to using asm-generic uaccess code

    - fix sparse warnings in signal code

    - fix compilation of ColdFire MMC support

    - support sysrq in ColdFire serial driver"

    * tag 'm68knommu-for-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/gerg/m68knommu:
    serial: mcf: add sysrq capability
    m68knommu: include SDHC support only when hardware has it
    m68knommu: fix sparse warnings in signal code
    m68knommu: switch to using asm-generic/uaccess.h

    Linus Torvalds
     

19 Oct, 2020

1 commit

  • There is usecase that System Management Software(SMS) want to give a
    memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the
    case of Android, it is the ActivityManagerService.

    The information required to make the reclaim decision is not known to the
    app. Instead, it is known to the centralized userspace
    daemon(ActivityManagerService), and that daemon must be able to initiate
    reclaim on its own without any app involvement.

    To solve the issue, this patch introduces a new syscall
    process_madvise(2). It uses pidfd of an external process to give the
    hint. It also supports vector address range because Android app has
    thousands of vmas due to zygote so it's totally waste of CPU and power if
    we should call the syscall one by one for each vma.(With testing 2000-vma
    syscall vs 1-vector syscall, it showed 15% performance improvement. I
    think it would be bigger in real practice because the testing ran very
    cache friendly environment).

    Another potential use case for the vector range is to amortize the cost
    ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
    benefit users like TCP receive zerocopy and malloc implementations. In
    future, we could find more usecases for other advises so let's make it
    happens as API since we introduce a new syscall at this moment. With
    that, existing madvise(2) user could replace it with process_madvise(2)
    with their own pid if they want to have batch address ranges support
    feature.

    ince it could affect other process's address range, only privileged
    process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same
    UID) gives it the right to ptrace the process could use it successfully.
    The flag argument is reserved for future use if we need to extend the API.

    I think supporting all hints madvise has/will supported/support to
    process_madvise is rather risky. Because we are not sure all hints make
    sense from external process and implementation for the hint may rely on
    the caller being in the current context so it could be error-prone. Thus,
    I just limited hints as MADV_[COLD|PAGEOUT] in this patch.

    If someone want to add other hints, we could hear the usecase and review
    it for each hint. It's safer for maintenance rather than introducing a
    buggy syscall but hard to fix it later.

    So finally, the API is as follows,

    ssize_t process_madvise(int pidfd, const struct iovec *iovec,
    unsigned long vlen, int advice, unsigned int flags);

    DESCRIPTION
    The process_madvise() system call is used to give advice or directions
    to the kernel about the address ranges from external process as well as
    local process. It provides the advice to address ranges of process
    described by iovec and vlen. The goal of such advice is to improve
    system or application performance.

    The pidfd selects the process referred to by the PID file descriptor
    specified in pidfd. (See pidofd_open(2) for further information)

    The pointer iovec points to an array of iovec structures, defined in
    as:

    struct iovec {
    void *iov_base; /* starting address */
    size_t iov_len; /* number of bytes to be advised */
    };

    The iovec describes address ranges beginning at address(iov_base)
    and with size length of bytes(iov_len).

    The vlen represents the number of elements in iovec.

    The advice is indicated in the advice argument, which is one of the
    following at this moment if the target process specified by pidfd is
    external.

    MADV_COLD
    MADV_PAGEOUT

    Permission to provide a hint to external process is governed by a
    ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

    The process_madvise supports every advice madvise(2) has if target
    process is in same thread group with calling process so user could
    use process_madvise(2) to extend existing madvise(2) to support
    vector address ranges.

    RETURN VALUE
    On success, process_madvise() returns the number of bytes advised.
    This return value may be less than the total number of requested
    bytes, if an error occurred. The caller should check return value
    to determine whether a partial advice occurred.

    FAQ:

    Q.1 - Why does any external entity have better knowledge?

    Quote from Sandeep

    "For Android, every application (including the special SystemServer)
    are forked from Zygote. The reason of course is to share as many
    libraries and classes between the two as possible to benefit from the
    preloading during boot.

    After applications start, (almost) all of the APIs end up calling into
    this SystemServer process over IPC (binder) and back to the
    application.

    In a fully running system, the SystemServer monitors every single
    process periodically to calculate their PSS / RSS and also decides
    which process is "important" to the user for interactivity.

    So, because of how these processes start _and_ the fact that the
    SystemServer is looping to monitor each process, it does tend to *know*
    which address range of the application is not used / useful.

    Besides, we can never rely on applications to clean things up
    themselves. We've had the "hey app1, the system is low on memory,
    please trim your memory usage down" notifications for a long time[1].
    They rely on applications honoring the broadcasts and very few do.

    So, if we want to avoid the inevitable killing of the application and
    restarting it, some way to be able to tell the OS about unimportant
    memory in these applications will be useful.

    - ssp

    Q.2 - How to guarantee the race(i.e., object validation) between when
    giving a hint from an external process and get the hint from the target
    process?

    process_madvise operates on the target process's address space as it
    exists at the instant that process_madvise is called. If the space
    target process can run between the time the process_madvise process
    inspects the target process address space and the time that
    process_madvise is actually called, process_madvise may operate on
    memory regions that the calling process does not expect. It's the
    responsibility of the process calling process_madvise to close this
    race condition. For example, the calling process can suspend the
    target process with ptrace, SIGSTOP, or the freezer cgroup so that it
    doesn't have an opportunity to change its own address space before
    process_madvise is called. Another option is to operate on memory
    regions that the caller knows a priori will be unchanged in the target
    process. Yet another option is to accept the race for certain
    process_madvise calls after reasoning that mistargeting will do no
    harm. The suggested API itself does not provide synchronization. It
    also apply other APIs like move_pages, process_vm_write.

    The race isn't really a problem though. Why is it so wrong to require
    that callers do their own synchronization in some manner? Nobody
    objects to write(2) merely because it's possible for two processes to
    open the same file and clobber each other's writes --- instead, we tell
    people to use flock or something. Think about mmap. It never
    guarantees newly allocated address space is still valid when the user
    tries to access it because other threads could unmap the memory right
    before. That's where we need synchronization by using other API or
    design from userside. It shouldn't be part of API itself. If someone
    needs more fine-grained synchronization rather than process level,
    there were two ideas suggested - cookie[2] and anon-fd[3]. Both are
    applicable via using last reserved argument of the API but I don't
    think it's necessary right now since we have already ways to prevent
    the race so don't want to add additional complexity with more
    fine-grained optimization model.

    To make the API extend, it reserved an unsigned long as last argument
    so we could support it in future if someone really needs it.

    Q.3 - Why doesn't ptrace work?

    Injecting an madvise in the target process using ptrace would not work
    for us because such injected madvise would have to be executed by the
    target process, which means that process would have to be runnable and
    that creates the risk of the abovementioned race and hinting a wrong
    VMA. Furthermore, we want to act the hint in caller's context, not the
    callee's, because the callee is usually limited in cpuset/cgroups or
    even freezed state so they can't act by themselves quick enough, which
    causes more thrashing/kill. It doesn't work if the target process are
    ptraced(e.g., strace, debugger, minidump) because a process can have at
    most one ptracer.

    [1] https://developer.android.com/topic/performance/memory"

    [2] process_getinfo for getting the cookie which is updated whenever
    vma of process address layout are changed - Daniel Colascione -
    https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

    [3] anonymous fd which is used for the object(i.e., address range)
    validation - Michal Hocko -
    https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

    [minchan@kernel.org: fix process_madvise build break for arm64]
    Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
    [minchan@kernel.org: fix build error for mips of process_madvise]
    Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
    [akpm@linux-foundation.org: fix patch ordering issue]
    [akpm@linux-foundation.org: fix arm64 whoops]
    [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
    [akpm@linux-foundation.org: fix i386 build]
    [sfr@canb.auug.org.au: fix syscall numbering]
    Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
    [sfr@canb.auug.org.au: madvise.c needs compat.h]
    Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
    [minchan@kernel.org: fix mips build]
    Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
    [yuehaibing@huawei.com: remove duplicate header which is included twice]
    Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
    [minchan@kernel.org: do not use helper functions for process_madvise]
    Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
    [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
    [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
    Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

    Signed-off-by: Minchan Kim
    Signed-off-by: YueHaibing
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Alexander Duyck
    Cc: Brian Geffon
    Cc: Christian Brauner
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Jens Axboe
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Cc:
    Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
    Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

18 Oct, 2020

1 commit


16 Oct, 2020

1 commit

  • Pull dma-mapping updates from Christoph Hellwig:

    - rework the non-coherent DMA allocator

    - move private definitions out of

    - lower CMA_ALIGNMENT (Paul Cercueil)

    - remove the omap1 dma address translation in favor of the common code

    - make dma-direct aware of multiple dma offset ranges (Jim Quinlan)

    - support per-node DMA CMA areas (Barry Song)

    - increase the default seg boundary limit (Nicolin Chen)

    - misc fixes (Robin Murphy, Thomas Tai, Xu Wang)

    - various cleanups

    * tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping: (63 commits)
    ARM/ixp4xx: add a missing include of dma-map-ops.h
    dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handling
    dma-direct: factor out a dma_direct_alloc_from_pool helper
    dma-direct check for highmem pages in dma_direct_alloc_pages
    dma-mapping: merge into
    dma-mapping: move large parts of to kernel/dma
    dma-mapping: move dma-debug.h to kernel/dma/
    dma-mapping: remove
    dma-mapping: merge into
    dma-contiguous: remove dma_contiguous_set_default
    dma-contiguous: remove dev_set_cma_area
    dma-contiguous: remove dma_declare_contiguous
    dma-mapping: split
    cma: decrease CMA_ALIGNMENT lower limit to 2
    firewire-ohci: use dma_alloc_pages
    dma-iommu: implement ->alloc_noncoherent
    dma-mapping: add new {alloc,free}_noncoherent dma_map_ops methods
    dma-mapping: add a new dma_alloc_pages API
    dma-mapping: remove dma_cache_sync
    53c700: convert to dma_alloc_noncoherent
    ...

    Linus Torvalds
     

15 Oct, 2020

1 commit

  • Pull kernel_clone() updates from Christian Brauner:
    "During the v5.9 merge window we reworked the process creation
    codepaths across multiple architectures. After this work we were only
    left with the _do_fork() helper based on the struct kernel_clone_args
    calling convention. As was pointed out _do_fork() isn't valid
    kernelese especially for a helper that isn't just static.

    This series removes the _do_fork() helper and introduces the new
    kernel_clone() helper. The process creation cleanup didn't change the
    name to something more reasonable mainly because _do_fork() was used
    in quite a few places. So sending this as a separate series seemed the
    better strategy.

    I originally intended to send this early in the v5.9 development cycle
    after the merge window had closed but given that this was touching
    quite a few places I decided to defer this until the v5.10 merge
    window"

    * tag 'kernel-clone-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    sched: remove _do_fork()
    tracing: switch to kernel_clone()
    kgdbts: switch to kernel_clone()
    kprobes: switch to kernel_clone()
    x86: switch to kernel_clone()
    sparc: switch to kernel_clone()
    nios2: switch to kernel_clone()
    m68k: switch to kernel_clone()
    ia64: switch to kernel_clone()
    h8300: switch to kernel_clone()
    fork: introduce kernel_clone()

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Pull m68k updates from Geert Uytterhoeven:

    - Conversion of the Mac IDE driver to a platform driver

    - Minor cleanups and fixes

    * tag 'm68k-for-v5.10-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
    ide/macide: Convert Mac IDE driver to platform driver
    m68k: Replace HTTP links with HTTPS ones
    m68k: mm: Remove superfluous memblock_alloc*() casts
    m68k: mm: Use PAGE_ALIGNED() helper
    m68k: Sort selects in main Kconfig
    m68k: amiga: Clean up Amiga hardware configuration
    m68k: Revive _TIF_* masks
    m68k: Correct some typos in comments
    m68k: Use get_kernel_nofault() in show_registers()
    zorro: Fix address space collision message with RAM expansion boards
    m68k: amiga: Fix Denise detection on OCS

    Linus Torvalds
     

06 Oct, 2020

1 commit


05 Oct, 2020

1 commit

  • Commit 858b810bf63f ("m68knommu: switch to using asm-generic/uaccess.h")
    cleaned up a number of sparse warnings associated with lack of __user
    annotations. It also uncovered a couple of more in the signal handling
    code:

    arch/m68k/kernel/signal.c:923:16: expected char [noderef] __user *__x
    arch/m68k/kernel/signal.c:923:16: got void *
    arch/m68k/kernel/signal.c:1007:16: warning: incorrect type in initializer (different address spaces)
    arch/m68k/kernel/signal.c:1007:16: expected char [noderef] __user *__x
    arch/m68k/kernel/signal.c:1007:16: got void *
    arch/m68k/kernel/signal.c:1132:6: warning: symbol 'do_notify_resume' was not declared. Should it be static?

    These are specific to a non-MMU configuration. Fix these inserting the
    correct __user annotations as required.

    Reported-by: kernel test robot
    Signed-off-by: Greg Ungerer

    Greg Ungerer
     

24 Sep, 2020

1 commit

  • There was a request to preprocess the module linker script like we
    do for the vmlinux one. (https://lkml.org/lkml/2020/8/21/512)

    The difference between vmlinux.lds and module.lds is that the latter
    is needed for external module builds, thus must be cleaned up by
    'make mrproper' instead of 'make clean'. Also, it must be created
    by 'make modules_prepare'.

    You cannot put it in arch/$(SRCARCH)/kernel/, which is cleaned up by
    'make clean'. I moved arch/$(SRCARCH)/kernel/module.lds to
    arch/$(SRCARCH)/include/asm/module.lds.h, which is included from
    scripts/module.lds.S.

    scripts/module.lds is fine because 'make clean' keeps all the
    build artifacts under scripts/.

    You can add arch-specific sections in .

    Signed-off-by: Masahiro Yamada
    Tested-by: Jessica Yu
    Acked-by: Will Deacon
    Acked-by: Geert Uytterhoeven
    Acked-by: Palmer Dabbelt
    Reviewed-by: Kees Cook
    Acked-by: Jessica Yu

    Masahiro Yamada
     

26 Aug, 2020

2 commits


24 Aug, 2020

1 commit

  • Replace the existing /* fall through */ comments and its variants with
    the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
    fall-through markings when it is the case.

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

    Signed-off-by: Gustavo A. R. Silva

    Gustavo A. R. Silva
     

20 Aug, 2020

1 commit

  • The old _do_fork() helper is removed in favor of the new kernel_clone() helper.
    The latter adheres to naming conventions for kernel internal syscall helpers.

    Signed-off-by: Christian Brauner
    Acked-by: Geert Uytterhoeven
    Cc: Kars de Jong
    Cc: linux-m68k@lists.linux-m68k.org
    Link: https://lore.kernel.org/r/20200819104655.436656-5-christian.brauner@ubuntu.com

    Christian Brauner
     

15 Aug, 2020

1 commit

  • Since commit 61a47c1ad3a4dc ("sysctl: Remove the sysctl system call"),
    sys_sysctl is actually unavailable: any input can only return an error.

    We have been warning about people using the sysctl system call for years
    and believe there are no more users. Even if there are users of this
    interface if they have not complained or fixed their code by now they
    probably are not going to, so there is no point in warning them any
    longer.

    So completely remove sys_sysctl on all architectures.

    [nixiaoming@huawei.com: s390: fix build error for sys_call_table_emu]
    Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.com

    Signed-off-by: Xiaoming Ni
    Signed-off-by: Andrew Morton
    Acked-by: Will Deacon [arm/arm64]
    Acked-by: "Eric W. Biederman"
    Cc: Aleksa Sarai
    Cc: Alexander Shishkin
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Bin Meng
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Catalin Marinas
    Cc: chenzefeng
    Cc: Christian Borntraeger
    Cc: Christian Brauner
    Cc: Chris Zankel
    Cc: David Howells
    Cc: David S. Miller
    Cc: Diego Elio Pettenò
    Cc: Dmitry Vyukov
    Cc: Dominik Brodowski
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Iurii Zaikin
    Cc: Ivan Kokshaysky
    Cc: James Bottomley
    Cc: Jens Axboe
    Cc: Jiri Olsa
    Cc: Kars de Jong
    Cc: Kees Cook
    Cc: Krzysztof Kozlowski
    Cc: Luis Chamberlain
    Cc: Marco Elver
    Cc: Mark Rutland
    Cc: Martin K. Petersen
    Cc: Masahiro Yamada
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Miklos Szeredi
    Cc: Minchan Kim
    Cc: Namhyung Kim
    Cc: Naveen N. Rao
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Olof Johansson
    Cc: Paul Burton
    Cc: "Paul E. McKenney"
    Cc: Paul Mackerras
    Cc: Peter Zijlstra (Intel)
    Cc: Randy Dunlap
    Cc: Ravi Bangoria
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Russell King
    Cc: Sami Tolvanen
    Cc: Sargun Dhillon
    Cc: Stephen Rothwell
    Cc: Sudeep Holla
    Cc: Sven Schnelle
    Cc: Thiago Jung Bauermann
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vasily Gorbik
    Cc: Vlastimil Babka
    Cc: Yoshinori Sato
    Cc: Zhou Yanjie
    Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.com
    Signed-off-by: Linus Torvalds

    Xiaoming Ni
     

08 Aug, 2020

1 commit

  • Patch series "mm: cleanup usage of "

    Most architectures have very similar versions of pXd_alloc_one() and
    pXd_free_one() for intermediate levels of page table. These patches add
    generic versions of these functions in and enable
    use of the generic functions where appropriate.

    In addition, functions declared and defined in headers are
    used mostly by core mm and early mm initialization in arch and there is no
    actual reason to have the included all over the place.
    The first patch in this series removes unneeded includes of

    In the end it didn't work out as neatly as I hoped and moving
    pXd_alloc_track() definitions to would require
    unnecessary changes to arches that have custom page table allocations, so
    I've decided to move lib/ioremap.c to mm/ and make pgalloc-track.h local
    to mm/.

    This patch (of 8):

    In most cases header is required only for allocations of
    page table memory. Most of the .c files that include that header do not
    use symbols declared in and do not require that header.

    As for the other header files that used to include , it is
    possible to move that include into the .c file that actually uses symbols
    from and drop the include from the header file.

    The process was somewhat automated using

    sed -i -E '/[
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Acked-by: Geert Uytterhoeven [m68k]
    Cc: Abdul Haleem
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Christophe Leroy
    Cc: Joerg Roedel
    Cc: Max Filippov
    Cc: Peter Zijlstra
    Cc: Satheesh Rajendran
    Cc: Stafford Horne
    Cc: Stephen Rothwell
    Cc: Steven Rostedt
    Cc: Joerg Roedel
    Cc: Matthew Wilcox
    Link: http://lkml.kernel.org/r/20200627143453.31835-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200627143453.31835-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

05 Aug, 2020

2 commits

  • Pull close_range() implementation from Christian Brauner:
    "This adds the close_range() syscall. It allows to efficiently close a
    range of file descriptors up to all file descriptors of a calling
    task.

    This is coordinated with the FreeBSD folks which have copied our
    version of this syscall and in the meantime have already merged it in
    April 2019:

    https://reviews.freebsd.org/D21627
    https://svnweb.freebsd.org/base?view=revision&revision=359836

    The syscall originally came up in a discussion around the new mount
    API and making new file descriptor types cloexec by default. During
    this discussion, Al suggested the close_range() syscall.

    First, it helps to close all file descriptors of an exec()ing task.
    This can be done safely via (quoting Al's example from [1] verbatim):

    /* that exec is sensitive */
    unshare(CLONE_FILES);
    /* we don't want anything past stderr here */
    close_range(3, ~0U);
    execve(....);

    The code snippet above is one way of working around the problem that
    file descriptors are not cloexec by default. This is aggravated by the
    fact that we can't just switch them over without massively regressing
    userspace. For a whole class of programs having an in-kernel method of
    closing all file descriptors is very helpful (e.g. demons, service
    managers, programming language standard libraries, container managers
    etc.).

    Second, it allows userspace to avoid implementing closing all file
    descriptors by parsing through /proc//fd/* and calling close() on
    each file descriptor and other hacks. From looking at various
    large(ish) userspace code bases this or similar patterns are very
    common in service managers, container runtimes, and programming
    language runtimes/standard libraries such as Python or Rust.

    In addition, the syscall will also work for tasks that do not have
    procfs mounted and on kernels that do not have procfs support compiled
    in. In such situations the only way to make sure that all file
    descriptors are closed is to call close() on each file descriptor up
    to UINT_MAX or RLIMIT_NOFILE, OPEN_MAX trickery.

    Based on Linus' suggestion close_range() also comes with a new flag
    CLOSE_RANGE_UNSHARE to more elegantly handle file descriptor dropping
    right before exec. This would usually be expressed in the sequence:

    unshare(CLONE_FILES);
    close_range(3, ~0U);

    as pointed out by Linus it might be desirable to have this be a part
    of close_range() itself under a new flag CLOSE_RANGE_UNSHARE which
    gets especially handy when we're closing all file descriptors above a
    certain threshold.

    Test-suite as always included"

    * tag 'close-range-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add CLOSE_RANGE_UNSHARE tests
    close_range: add CLOSE_RANGE_UNSHARE
    tests: add close_range() tests
    arch: wire-up close_range()
    open: add close_range()

    Linus Torvalds
     
  • Pull fork cleanups from Christian Brauner:
    "This is cleanup series from when we reworked a chunk of the process
    creation paths in the kernel and switched to struct
    {kernel_}clone_args.

    High-level this does two main things:

    - Remove the double export of both do_fork() and _do_fork() where
    do_fork() used the incosistent legacy clone calling convention.

    Now we only export _do_fork() which is based on struct
    kernel_clone_args.

    - Remove the copy_thread_tls()/copy_thread() split making the
    architecture specific HAVE_COYP_THREAD_TLS config option obsolete.

    This switches all remaining architectures to select
    HAVE_COPY_THREAD_TLS and thus to the copy_thread_tls() calling
    convention. The current split makes the process creation codepaths
    more convoluted than they need to be. Each architecture has their own
    copy_thread() function unless it selects HAVE_COPY_THREAD_TLS then it
    has a copy_thread_tls() function.

    The split is not needed anymore nowadays, all architectures support
    CLONE_SETTLS but quite a few of them never bothered to select
    HAVE_COPY_THREAD_TLS and instead simply continued to use copy_thread()
    and use the old calling convention. Removing this split cleans up the
    process creation codepaths and paves the way for implementing clone3()
    on such architectures since it requires the copy_thread_tls() calling
    convention.

    After having made each architectures support copy_thread_tls() this
    series simply renames that function back to copy_thread(). It also
    switches all architectures that call do_fork() directly over to
    _do_fork() and the struct kernel_clone_args calling convention. This
    is a corollary of switching the architectures that did not yet support
    it over to copy_thread_tls() since do_fork() is conditional on not
    supporting copy_thread_tls() (Mostly because it lacks a separate
    argument for tls which is trivial to fix but there's no need for this
    function to exist.).

    The do_fork() removal is in itself already useful as it allows to to
    remove the export of both do_fork() and _do_fork() we currently have
    in favor of only _do_fork(). This has already been discussed back when
    we added clone3(). The legacy clone() calling convention is - as is
    probably well-known - somewhat odd:

    #
    # ABI hall of shame
    #
    config CLONE_BACKWARDS
    config CLONE_BACKWARDS2
    config CLONE_BACKWARDS3

    that is aggravated by the fact that some architectures such as sparc
    follow the CLONE_BACKWARDSx calling convention but don't really select
    the corresponding config option since they call do_fork() directly.

    So do_fork() enforces a somewhat arbitrary calling convention in the
    first place that doesn't really help the individual architectures that
    deviate from it. They can thus simply be switched to _do_fork()
    enforcing a single calling convention. (I really hope that any new
    architectures will __not__ try to implement their own calling
    conventions...)

    Most architectures already have made a similar switch (m68k comes to
    mind).

    Overall this removes more code than it adds even with a good portion
    of added comments. It simplifies a chunk of arch specific assembly
    either by moving the code into C or by simply rewriting the assembly.

    Architectures that have been touched in non-trivial ways have all been
    actually boot and stress tested: sparc and ia64 have been tested with
    Debian 9 images. They are the two architectures which have been
    touched the most. All non-trivial changes to architectures have seen
    acks from the relevant maintainers. nios2 with a custom built
    buildroot image. h8300 I couldn't get something bootable to test on
    but the changes have been fairly automatic and I'm sure we'll hear
    people yell if I broke something there.

    All other architectures that have been touched in trivial ways have
    been compile tested for each single patch of the series via git rebase
    -x "make ..." v5.8-rc2. arm{64} and x86{_64} have been boot tested
    even though they have just been trivially touched (removal of the
    HAVE_COPY_THREAD_TLS macro from their Kconfig) because well they are
    basically "core architectures" and since it is trivial to get your
    hands on a useable image"

    * tag 'fork-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    arch: rename copy_thread_tls() back to copy_thread()
    arch: remove HAVE_COPY_THREAD_TLS
    unicore: switch to copy_thread_tls()
    sh: switch to copy_thread_tls()
    nds32: switch to copy_thread_tls()
    microblaze: switch to copy_thread_tls()
    hexagon: switch to copy_thread_tls()
    c6x: switch to copy_thread_tls()
    alpha: switch to copy_thread_tls()
    fork: remove do_fork()
    h8300: select HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
    nios2: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
    ia64: enable HAVE_COPY_THREAD_TLS, switch to kernel_clone_args
    sparc: unconditionally enable HAVE_COPY_THREAD_TLS
    sparc: share process creation helpers between sparc and sparc64
    sparc64: enable HAVE_COPY_THREAD_TLS
    fork: fold legacy_clone_args_valid() into _do_fork()

    Linus Torvalds
     

04 Aug, 2020

1 commit

  • Pull m68k updates from Geert Uytterhoeven:

    - several Kbuild improvements

    - several Mac fixes

    - minor cleanups and fixes

    - defconfig updates

    * tag 'm68k-for-v5.9-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/linux-m68k:
    m68k: defconfig: Update defconfigs for v5.8-rc3
    m68k: Use CLEAN_FILES to clean up files
    m68k: mac: Improve IOP debug messages
    m68k: mac: Don't send uninitialized data in IOP message reply
    m68k: mac: Fix IOP status/control register writes
    m68k: mac: Don't send IOP message until channel is idle
    m68k: atari: Annotate dummy read in ROM port IO code as __maybe_unused
    m68k: Use sizeof_field() helper
    m68k: Pass -D options to KBUILD_CPPFLAGS instead of KBUILD_{A,C}FLAGS
    m68k: Optimize cc-option calls for cpuflags-y
    m68k: sun3: Descend to prom from arch/m68k/sun3
    m68k: Add arch/m68k/Kbuild

    Linus Torvalds
     

13 Jul, 2020

1 commit


05 Jul, 2020

1 commit

  • Now that HAVE_COPY_THREAD_TLS has been removed, rename copy_thread_tls()
    back simply copy_thread(). It's a simpler name, and doesn't imply that only
    tls is copied here. This finishes an outstanding chunk of internal process
    creation work since we've added clone3().

    Cc: linux-arch@vger.kernel.org
    Acked-by: Thomas Bogendoerfer A
    Acked-by: Stafford Horne
    Acked-by: Greentime Hu
    Acked-by: Geert Uytterhoeven A
    Reviewed-by: Kees Cook
    Signed-off-by: Christian Brauner

    Christian Brauner
     

29 Jun, 2020

1 commit

  • The m68k nommu setup code didn't register the beginning of the physical
    memory with memblock because it was anyway occupied by the kernel. However,
    commit fa3354e4ea39 ("mm: free_area_init: use maximal zone PFNs rather than
    zone sizes") changed zones initialization to use memblock.memory to detect
    the zone extents and this caused inconsistency between zone PFNs and the
    actual PFNs:

    BUG: Bad page state in process swapper pfn:20165
    page:41fe0ca0 refcount:0 mapcount:1 mapping:00000000 index:0x0 flags: 0x0()
    raw: 00000000 00000100 00000122 00000000 00000000 00000000 00000000 00000000
    page dumped because: nonzero mapcount
    CPU: 0 PID: 1 Comm: swapper Not tainted 5.8.0-rc1-00001-g3a38f8a60c65-dirty #1
    Stack from 404c9ebc:
    404c9ebc 4029ab28 4029ab28 40088470 41fe0ca0 40299e21 40299df1 404ba2a4
    00020165 00000000 41fd2c10 402c7ba0 41fd2c04 40088504 41fe0ca0 40299e21
    00000000 40088a12 41fe0ca0 41fe0ca4 0000020a 00000000 00000001 402ca000
    00000000 41fe0ca0 41fd2c10 41fd2c10 00000000 00000000 402b2388 00000001
    400a0934 40091056 404c9f44 404c9f44 40088db4 402c7ba0 00000001 41fd2c04
    41fe0ca0 41fd2000 41fe0ca0 40089e02 4026ecf4 40089e4e 41fe0ca0 ffffffff
    Call Trace:
    [] 0x40088470
    [] 0x40088504
    [] 0x40088a12
    [] 0x402ca000
    [] 0x400a0934

    Adjust the memory registration with memblock to include the beginning of
    the physical memory and make sure that the area occupied by the kernel is
    marked as reserved.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Greg Ungerer

    Mike Rapoport
     

22 Jun, 2020

1 commit

  • This separate helper only existed to guarantee the mutual exclusivity of
    CLONE_PIDFD and CLONE_PARENT_SETTID for legacy clone since CLONE_PIDFD
    abuses the parent_tid field to return the pidfd. But we can actually handle
    this uniformely thus removing the helper. For legacy clone we can detect
    that CLONE_PIDFD is specified in conjunction with CLONE_PARENT_SETTID
    because they will share the same memory which is invalid and for clone3()
    setting the separate pidfd and parent_tid fields to the same memory is
    bogus as well. So fold that helper directly into _do_fork() by detecting
    this case.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Al Viro
    Cc: Geert Uytterhoeven
    Cc: "Matthew Wilcox (Oracle)"
    Cc: "Peter Zijlstra (Intel)"
    Cc: linux-m68k@lists.linux-m68k.org
    Cc: x86@kernel.org
    Signed-off-by: Christian Brauner

    Christian Brauner
     

17 Jun, 2020

1 commit

  • This wires up the close_range() syscall into all arches at once.

    Suggested-by: Arnd Bergmann
    Signed-off-by: Christian Brauner
    Reviewed-by: Oleg Nesterov
    Acked-by: Arnd Bergmann
    Acked-by: Michael Ellerman (powerpc)
    Cc: Jann Horn
    Cc: David Howells
    Cc: Dmitry V. Levin
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: Florian Weimer
    Cc: linux-api@vger.kernel.org
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-m68k@lists.linux-m68k.org
    Cc: linux-mips@vger.kernel.org
    Cc: linux-parisc@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-sh@vger.kernel.org
    Cc: sparclinux@vger.kernel.org
    Cc: linux-xtensa@linux-xtensa.org
    Cc: linux-arch@vger.kernel.org
    Cc: x86@kernel.org

    Christian Brauner
     

10 Jun, 2020

6 commits

  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The replacement of with made the include
    of the latter in the middle of asm includes. Fix this up with the aid of
    the below script and manual adjustments here and there.

    import sys
    import re

    if len(sys.argv) is not 3:
    print "USAGE: %s " % (sys.argv[0])
    sys.exit(1)

    hdr_to_move="#include " % sys.argv[2]
    moved = False
    in_hdrs = False

    with open(sys.argv[1], "r") as f:
    lines = f.readlines()
    for _line in lines:
    line = _line.rstrip('
    ')
    if line == hdr_to_move:
    continue
    if line.startswith("#include
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-4-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The include/linux/pgtable.h is going to be the home of generic page table
    manipulation functions.

    Start with moving asm-generic/pgtable.h to include/linux/pgtable.h and
    make the latter include asm/pgtable.h.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-3-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "mm: consolidate definitions of page table accessors", v2.

    The low level page table accessors (pXY_index(), pXY_offset()) are
    duplicated across all architectures and sometimes more than once. For
    instance, we have 31 definition of pgd_offset() for 25 supported
    architectures.

    Most of these definitions are actually identical and typically it boils
    down to, e.g.

    static inline unsigned long pmd_index(unsigned long address)
    {
    return (address >> PMD_SHIFT) & (PTRS_PER_PMD - 1);
    }

    static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address)
    {
    return (pmd_t *)pud_page_vaddr(*pud) + pmd_index(address);
    }

    These definitions can be shared among 90% of the arches provided
    XYZ_SHIFT, PTRS_PER_XYZ and xyz_page_vaddr() are defined.

    For architectures that really need a custom version there is always
    possibility to override the generic version with the usual ifdefs magic.

    These patches introduce include/linux/pgtable.h that replaces
    include/asm-generic/pgtable.h and add the definitions of the page table
    accessors to the new header.

    This patch (of 12):

    The linux/mm.h header includes to allow inlining of the
    functions involving page table manipulations, e.g. pte_alloc() and
    pmd_alloc(). So, there is no point to explicitly include
    in the files that include .

    The include statements in such cases are remove with a simple loop:

    for f in $(git grep -l "include ") ; do
    sed -i -e '/include / d' $f
    done

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Brian Cain
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Ingo Molnar
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Mike Rapoport
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200514170327.31389-1-rppt@kernel.org
    Link: http://lkml.kernel.org/r/20200514170327.31389-2-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Now the last users of show_stack() got converted to use an explicit log
    level, show_stack_loglvl() can drop it's redundant suffix and become once
    again well known show_stack().

    Signed-off-by: Dmitry Safonov
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200418201944.482088-51-dima@arista.com
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     
  • Currently, the log-level of show_stack() depends on a platform
    realization. It creates situations where the headers are printed with
    lower log level or higher than the stacktrace (depending on a platform or
    user).

    Furthermore, it forces the logic decision from user to an architecture
    side. In result, some users as sysrq/kdb/etc are doing tricks with
    temporary rising console_loglevel while printing their messages. And in
    result it not only may print unwanted messages from other CPUs, but also
    omit printing at all in the unlucky case where the printk() was deferred.

    Introducing log-level parameter and KERN_UNSUPPRESSED [1] seems an easier
    approach than introducing more printk buffers. Also, it will consolidate
    printings with headers.

    Introduce show_stack_loglvl(), that eventually will substitute
    show_stack().

    [1]: https://lore.kernel.org/lkml/20190528002412.1625-1-dima@arista.com/T/#u

    Signed-off-by: Dmitry Safonov
    Signed-off-by: Andrew Morton
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20200418201944.482088-18-dima@arista.com
    Signed-off-by: Linus Torvalds

    Dmitry Safonov
     

14 May, 2020

1 commit

  • POSIX defines faccessat() as having a fourth "flags" argument, while the
    linux syscall doesn't have it. Glibc tries to emulate AT_EACCESS and
    AT_SYMLINK_NOFOLLOW, but AT_EACCESS emulation is broken.

    Add a new faccessat(2) syscall with the added flags argument and implement
    both flags.

    The value of AT_EACCESS is defined in glibc headers to be the same as
    AT_REMOVEDIR. Use this value for the kernel interface as well, together
    with the explanatory comment.

    Also add AT_EMPTY_PATH support, which is not documented by POSIX, but can
    be useful and is trivial to implement.

    Signed-off-by: Miklos Szeredi

    Miklos Szeredi
     

25 Mar, 2020

1 commit


04 Feb, 2020

1 commit

  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

30 Jan, 2020

3 commits

  • Pull thread management updates from Christian Brauner:
    "Sargun Dhillon over the last cycle has worked on the pidfd_getfd()
    syscall.

    This syscall allows for the retrieval of file descriptors of a process
    based on its pidfd. A task needs to have ptrace_may_access()
    permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and
    Andy) on the target.

    One of the main use-cases is in combination with seccomp's user
    notification feature. As a reminder, seccomp's user notification
    feature was made available in v5.0. It allows a task to retrieve a
    file descriptor for its seccomp filter. The file descriptor is usually
    handed of to a more privileged supervising process. The supervisor can
    then listen for syscall events caught by the seccomp filter of the
    supervisee and perform actions in lieu of the supervisee, usually
    emulating syscalls. pidfd_getfd() is needed to expand its uses.

    There are currently two major users that wait on pidfd_getfd() and one
    future user:

    - Netflix, Sargun said, is working on a service mesh where users
    should be able to connect to a dns-based VIP. When a user connects
    to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be
    redirected to an envoy process. This service mesh uses seccomp user
    notifications and pidfd to intercept all connect calls and instead
    of connecting them to 1.2.3.4:80 connects them to e.g.
    127.0.0.1:8080.

    - LXD uses the seccomp notifier heavily to intercept and emulate
    mknod() and mount() syscalls for unprivileged containers/processes.
    With pidfd_getfd() more uses-cases e.g. bridging socket connections
    will be possible.

    - The patchset has also seen some interest from the browser corner.
    Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a
    broker process. In the future glibc will start blocking all signals
    during dlopen() rendering this type of sandbox impossible. Hence,
    in the future Firefox will switch to a seccomp-user-nofication
    based sandbox which also makes use of file descriptor retrieval.
    The thread for this can be found at
    https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html

    With pidfd_getfd() it is e.g. possible to bridge socket connections
    for the supervisee (binding to a privileged port) and taking actions
    on file descriptors on behalf of the supervisee in general.

    Sargun's first version was using an ioctl on pidfds but various people
    pushed for it to be a proper syscall which he duely implemented as
    well over various review cycles. Selftests are of course included.
    I've also added instructions how to deal with merge conflicts below.

    There's also a small fix coming from the kernel mentee project to
    correctly annotate struct sighand_struct with __rcu to fix various
    sparse warnings. We've received a few more such fixes and even though
    they are mostly trivial I've decided to postpone them until after -rc1
    since they came in rather late and I don't want to risk introducing
    build warnings.

    Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is
    needed to avoid allocation recursions triggerable by storage drivers
    that have userspace parts that run in the IO path (e.g. dm-multipath,
    iscsi, etc). These allocation recursions deadlock the device.

    The new prctl() allows such privileged userspace components to avoid
    allocation recursions by setting the PF_MEMALLOC_NOIO and
    PF_LESS_THROTTLE flags. The patch carries the necessary acks from the
    relevant maintainers and is routed here as part of prctl()
    thread-management."

    * tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
    sched.h: Annotate sighand_struct with __rcu
    test: Add test for pidfd getfd
    arch: wire up pidfd_getfd syscall
    pid: Implement pidfd_getfd syscall
    vfs, fdtable: Add fget_task helper

    Linus Torvalds
     
  • Pull openat2 support from Al Viro:
    "This is the openat2() series from Aleksa Sarai.

    I'm afraid that the rest of namei stuff will have to wait - it got
    zero review the last time I'd posted #work.namei, and there had been a
    leak in the posted series I'd caught only last weekend. I was going to
    repost it on Monday, but the window opened and the odds of getting any
    review during that... Oh, well.

    Anyway, openat2 part should be ready; that _did_ get sane amount of
    review and public testing, so here it comes"

    From Aleksa's description of the series:
    "For a very long time, extending openat(2) with new features has been
    incredibly frustrating. This stems from the fact that openat(2) is
    possibly the most famous counter-example to the mantra "don't silently
    accept garbage from userspace" -- it doesn't check whether unknown
    flags are present[1].

    This means that (generally) the addition of new flags to openat(2) has
    been fraught with backwards-compatibility issues (O_TMPFILE has to be
    defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
    kernels gave errors, since it's insecure to silently ignore the
    flag[2]). All new security-related flags therefore have a tough road
    to being added to openat(2).

    Furthermore, the need for some sort of control over VFS's path
    resolution (to avoid malicious paths resulting in inadvertent
    breakouts) has been a very long-standing desire of many userspace
    applications.

    This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
    (which was a variant of David Drysdale's O_BENEATH patchset[4] which
    was a spin-off of the Capsicum project[5]) with a few additions and
    changes made based on the previous discussion within [6] as well as
    others I felt were useful.

    In line with the conclusions of the original discussion of
    AT_NO_JUMPS, the flag has been split up into separate flags. However,
    instead of being an openat(2) flag it is provided through a new
    syscall openat2(2) which provides several other improvements to the
    openat(2) interface (see the patch description for more details). The
    following new LOOKUP_* flags are added:

    LOOKUP_NO_XDEV:

    Blocks all mountpoint crossings (upwards, downwards, or through
    absolute links). Absolute pathnames alone in openat(2) do not
    trigger this. Magic-link traversal which implies a vfsmount jump is
    also blocked (though magic-link jumps on the same vfsmount are
    permitted).

    LOOKUP_NO_MAGICLINKS:

    Blocks resolution through /proc/$pid/fd-style links. This is done
    by blocking the usage of nd_jump_link() during resolution in a
    filesystem. The term "magic-links" is used to match with the only
    reference to these links in Documentation/, but I'm happy to change
    the name.

    It should be noted that this is different to the scope of
    ~LOOKUP_FOLLOW in that it applies to all path components. However,
    you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
    will *not* fail (assuming that no parent component was a
    magic-link), and you will have an fd for the magic-link.

    In order to correctly detect magic-links, the introduction of a new
    LOOKUP_MAGICLINK_JUMPED state flag was required.

    LOOKUP_BENEATH:

    Disallows escapes to outside the starting dirfd's
    tree, using techniques such as ".." or absolute links. Absolute
    paths in openat(2) are also disallowed.

    Conceptually this flag is to ensure you "stay below" a certain
    point in the filesystem tree -- but this requires some additional
    to protect against various races that would allow escape using
    "..".

    Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
    can trivially beam you around the filesystem (breaking the
    protection). In future, there might be similar safety checks done
    as in LOOKUP_IN_ROOT, but that requires more discussion.

    In addition, two new flags are added that expand on the above ideas:

    LOOKUP_NO_SYMLINKS:

    Does what it says on the tin. No symlink resolution is allowed at
    all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
    can still be used with NOFOLLOW to open an fd for the symlink as
    long as no parent path had a symlink component.

    LOOKUP_IN_ROOT:

    This is an extension of LOOKUP_BENEATH that, rather than blocking
    attempts to move past the root, forces all such movements to be
    scoped to the starting point. This provides chroot(2)-like
    protection but without the cost of a chroot(2) for each filesystem
    operation, as well as being safe against race attacks that
    chroot(2) is not.

    If a race is detected (as with LOOKUP_BENEATH) then an error is
    generated, and similar to LOOKUP_BENEATH it is not permitted to
    cross magic-links with LOOKUP_IN_ROOT.

    The primary need for this is from container runtimes, which
    currently need to do symlink scoping in userspace[7] when opening
    paths in a potentially malicious container.

    There is a long list of CVEs that could have bene mitigated by
    having RESOLVE_THIS_ROOT (such as CVE-2017-1002101,
    CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
    few).

    In order to make all of the above more usable, I'm working on
    libpathrs[8] which is a C-friendly library for safe path resolution.
    It features a userspace-emulated backend if the kernel doesn't support
    openat2(2). Hopefully we can get userspace to switch to using it, and
    thus get openat2(2) support for free once it's ready.

    Future work would include implementing things like
    RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
    programs to be sure they don't hit DoSes though stale NFS handles)"

    * 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    Documentation: path-lookup: include new LOOKUP flags
    selftests: add openat2(2) selftests
    open: introduce openat2(2) syscall
    namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
    namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
    namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
    namei: LOOKUP_NO_XDEV: block mountpoint crossing
    namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
    namei: LOOKUP_NO_SYMLINKS: block symlink resolution
    namei: allow set_root() to produce errors
    namei: allow nd_jump_link() to produce errors
    nsfs: clean-up ns_get_path() signature to return int
    namei: only return -ECHILD from follow_dotdot_rcu()

    Linus Torvalds
     
  • Pull tty/serial driver updates from Greg KH:
    "Here are the big set of tty and serial driver updates for 5.6-rc1

    Included in here are:
    - dummy_con cleanups (touches lots of arch code)
    - sysrq logic cleanups (touches lots of serial drivers)
    - samsung driver fixes (wasn't really being built)
    - conmakeshash move to tty subdir out of scripts
    - lots of small tty/serial driver updates

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'tty-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (140 commits)
    tty: n_hdlc: Use flexible-array member and struct_size() helper
    tty: baudrate: SPARC supports few more baud rates
    tty: baudrate: Synchronise baud_table[] and baud_bits[]
    tty: serial: meson_uart: Add support for kernel debugger
    serial: imx: fix a race condition in receive path
    serial: 8250_bcm2835aux: Document struct bcm2835aux_data
    serial: 8250_bcm2835aux: Use generic remapping code
    serial: 8250_bcm2835aux: Allocate uart_8250_port on stack
    serial: 8250_bcm2835aux: Suppress register_port error on -EPROBE_DEFER
    serial: 8250_bcm2835aux: Suppress clk_get error on -EPROBE_DEFER
    serial: 8250_bcm2835aux: Fix line mismatch on driver unbind
    serial_core: Remove unused member in uart_port
    vt: Correct comment documenting do_take_over_console()
    vt: Delete comment referencing non-existent unbind_con_driver()
    arch/xtensa/setup: Drop dummy_con initialization
    arch/x86/setup: Drop dummy_con initialization
    arch/unicore32/setup: Drop dummy_con initialization
    arch/sparc/setup: Drop dummy_con initialization
    arch/sh/setup: Drop dummy_con initialization
    arch/s390/setup: Drop dummy_con initialization
    ...

    Linus Torvalds
     

18 Jan, 2020

1 commit

  • /* Background. */
    For a very long time, extending openat(2) with new features has been
    incredibly frustrating. This stems from the fact that openat(2) is
    possibly the most famous counter-example to the mantra "don't silently
    accept garbage from userspace" -- it doesn't check whether unknown flags
    are present[1].

    This means that (generally) the addition of new flags to openat(2) has
    been fraught with backwards-compatibility issues (O_TMPFILE has to be
    defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
    kernels gave errors, since it's insecure to silently ignore the
    flag[2]). All new security-related flags therefore have a tough road to
    being added to openat(2).

    Userspace also has a hard time figuring out whether a particular flag is
    supported on a particular kernel. While it is now possible with
    contemporary kernels (thanks to [3]), older kernels will expose unknown
    flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
    openat(2) time matches modern syscall designs and is far more
    fool-proof.

    In addition, the newly-added path resolution restriction LOOKUP flags
    (which we would like to expose to user-space) don't feel related to the
    pre-existing O_* flag set -- they affect all components of path lookup.
    We'd therefore like to add a new flag argument.

    Adding a new syscall allows us to finally fix the flag-ignoring problem,
    and we can make it extensible enough so that we will hopefully never
    need an openat3(2).

    /* Syscall Prototype. */
    /*
    * open_how is an extensible structure (similar in interface to
    * clone3(2) or sched_setattr(2)). The size parameter must be set to
    * sizeof(struct open_how), to allow for future extensions. All future
    * extensions will be appended to open_how, with their zero value
    * acting as a no-op default.
    */
    struct open_how { /* ... */ };

    int openat2(int dfd, const char *pathname,
    struct open_how *how, size_t size);

    /* Description. */
    The initial version of 'struct open_how' contains the following fields:

    flags
    Used to specify openat(2)-style flags. However, any unknown flag
    bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
    will result in -EINVAL. In addition, this field is 64-bits wide to
    allow for more O_ flags than currently permitted with openat(2).

    mode
    The file mode for O_CREAT or O_TMPFILE.

    Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

    resolve
    Restrict path resolution (in contrast to O_* flags they affect all
    path components). The current set of flags are as follows (at the
    moment, all of the RESOLVE_ flags are implemented as just passing
    the corresponding LOOKUP_ flag).

    RESOLVE_NO_XDEV => LOOKUP_NO_XDEV
    RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS
    RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
    RESOLVE_BENEATH => LOOKUP_BENEATH
    RESOLVE_IN_ROOT => LOOKUP_IN_ROOT

    open_how does not contain an embedded size field, because it is of
    little benefit (userspace can figure out the kernel open_how size at
    runtime fairly easily without it). It also only contains u64s (even
    though ->mode arguably should be a u16) to avoid having padding fields
    which are never used in the future.

    Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
    is no longer permitted for openat(2). As far as I can tell, this has
    always been a bug and appears to not be used by userspace (and I've not
    seen any problems on my machines by disallowing it). If it turns out
    this breaks something, we can special-case it and only permit it for
    openat(2) but not openat2(2).

    After input from Florian Weimer, the new open_how and flag definitions
    are inside a separate header from uapi/linux/fcntl.h, to avoid problems
    that glibc has with importing that header.

    /* Testing. */
    In a follow-up patch there are over 200 selftests which ensure that this
    syscall has the correct semantics and will correctly handle several
    attack scenarios.

    In addition, I've written a userspace library[4] which provides
    convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
    because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
    must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
    syscalls). During the development of this patch, I've run numerous
    verification tests using libpathrs (showing that the API is reasonably
    usable by userspace).

    /* Future Work. */
    Additional RESOLVE_ flags have been suggested during the review period.
    These can be easily implemented separately (such as blocking auto-mount
    during resolution).

    Furthermore, there are some other proposed changes to the openat(2)
    interface (the most obvious example is magic-link hardening[5]) which
    would be a good opportunity to add a way for userspace to restrict how
    O_PATH file descriptors can be re-opened.

    Another possible avenue of future work would be some kind of
    CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
    which openat2(2) flags and fields are supported by the current kernel
    (to avoid userspace having to go through several guesses to figure it
    out).

    [1]: https://lwn.net/Articles/588444/
    [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
    [3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
    [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
    [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
    [6]: https://youtu.be/ggD-eb3yPVs

    Suggested-by: Christian Brauner
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     

14 Jan, 2020

2 commits

  • con_init in tty/vt.c will now set conswitchp to dummy_con if it's unset.
    Drop it from arch setup code.

    Signed-off-by: Arvind Sankar
    Link: https://lore.kernel.org/r/20191218214506.49252-11-nivedita@alum.mit.edu
    Signed-off-by: Greg Kroah-Hartman

    Arvind Sankar
     
  • This is required for clone3(), which passes the TLS value through a
    struct rather than a register.

    As do_fork() is only available if CONFIG_HAVE_COPY_THREAD_TLS is set,
    m68k_clone() must be changed to call _do_fork() directly.

    Signed-off-by: Geert Uytterhoeven
    Acked-by: Christian Brauner
    Acked-by: Greg Ungerer
    Link: https://lore.kernel.org/r/20200113103040.23661-1-geert@linux-m68k.org

    Geert Uytterhoeven