31 Mar, 2020

1 commit

  • Pull irq updates from Thomas Gleixner:
    "Updates for the interrupt subsystem:

    Treewide:

    - Cleanup of setup_irq() which is no longer required because the
    memory allocator is available early.

    Most cleanup changes come through the various maintainer trees, so
    the final removal of setup_irq() is postponed towards the end of
    the merge window.

    Core:

    - Protection against unsafe invocation of interrupt handlers and
    unsafe interrupt injection including a fixup of the offending
    PCI/AER error injection mechanism.

    Invoking interrupt handlers from arbitrary contexts, i.e. outside
    of an actual interrupt, can cause inconsistent state on the
    fragile x86 interrupt affinity changing hardware trainwreck.

    Drivers:

    - Second wave of support for the new ARM GICv4.1

    - Multi-instance support for Xilinx and PLIC interrupt controllers

    - CPU-Hotplug support for PLIC

    - The obligatory new driver for X1000 TCU

    - Enhancements, cleanups and fixes all over the place"

    * tag 'irq-core-2020-03-30' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
    unicore32: Replace setup_irq() by request_irq()
    sh: Replace setup_irq() by request_irq()
    hexagon: Replace setup_irq() by request_irq()
    c6x: Replace setup_irq() by request_irq()
    alpha: Replace setup_irq() by request_irq()
    irqchip/gic-v4.1: Eagerly vmap vPEs
    irqchip/gic-v4.1: Add VSGI property setup
    irqchip/gic-v4.1: Add VSGI allocation/teardown
    irqchip/gic-v4.1: Move doorbell management to the GICv4 abstraction layer
    irqchip/gic-v4.1: Plumb set_vcpu_affinity SGI callbacks
    irqchip/gic-v4.1: Plumb get/set_irqchip_state SGI callbacks
    irqchip/gic-v4.1: Plumb mask/unmask SGI callbacks
    irqchip/gic-v4.1: Add initial SGI configuration
    irqchip/gic-v4.1: Plumb skeletal VSGI irqchip
    irqchip/stm32: Retrigger both in eoi and unmask callbacks
    irqchip/gic-v3: Move irq_domain_update_bus_token to after checking for NULL domain
    irqchip/xilinx: Do not call irq_set_default_host()
    irqchip/xilinx: Enable generic irq multi handler
    irqchip/xilinx: Fill error code when irq domain registration fails
    irqchip/xilinx: Add support for multiple instances
    ...

    Linus Torvalds
     

30 Mar, 2020

1 commit

  • request_irq() is preferred over setup_irq(). Invocations of setup_irq()
    occur after memory allocators are ready.

    setup_irq() was required in older kernels as the memory allocator was not
    available during early boot.

    Hence replace setup_irq() by request_irq().

    Signed-off-by: afzal mohammed
    Signed-off-by: Thomas Gleixner
    Acked-by: Matt Turner
    Link: https://lkml.kernel.org/r/51f8ae7da9f47a23596388141933efa2bdef317b.1585320721.git.afzal.mohd.ma@gmail.com

    afzal mohammed
     

28 Mar, 2020

1 commit

  • Move access_ok() in and pagefault_enable()/pagefault_disable() out.
    Mechanical conversion only - some instances don't really need
    a separate access_ok() at all (e.g. the ones only using
    get_user()/put_user(), or architectures where access_ok()
    is always true); we'll deal with that in followups.

    Signed-off-by: Al Viro

    Al Viro
     

10 Feb, 2020

1 commit

  • Pull more Kbuild updates from Masahiro Yamada:

    - fix randconfig to generate a sane .config

    - rename hostprogs-y / always to hostprogs / always-y, which is a more
    natural syntax.

    - optimize scripts/kallsyms

    - fix yes2modconfig and mod2yesconfig

    - make multiple directory targets ('make foo/ bar/') work

    * tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
    kbuild: make multiple directory targets work
    kconfig: Invalidate all symbols after changing to y or m.
    kallsyms: fix type of kallsyms_token_table[]
    scripts/kallsyms: change table to store (strcut sym_entry *)
    scripts/kallsyms: rename local variables in read_symbol()
    kbuild: rename hostprogs-y/always to hostprogs/always-y
    kbuild: fix the document to use extra-y for vmlinux.lds
    kconfig: fix broken dependency in randconfig-generated .config

    Linus Torvalds
     

04 Feb, 2020

2 commits

  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • In the old days, the "host-progs" syntax was used for specifying host
    programs. It was renamed to the current "hostprogs-y" in 2004.

    It is typically useful in scripts/Makefile because it allows Kbuild to
    selectively compile host programs based on the kernel configuration.

    This commit renames them as follows:

    always -> always-y
    hostprogs-y -> hostprogs

    So, scripts/Makefile will look like this:

    always-$(CONFIG_BUILD_BIN2C) += ...
    always-$(CONFIG_KALLSYMS) += ...
    ...
    hostprogs := $(always-y) $(always-m)

    I think this makes more sense because a host program is always a host
    program, irrespective of the kernel configuration. We want to specify
    which ones to compile by CONFIG options, so always-y will be handier.

    The "always", "hostprogs-y", "hostprogs-m" will be kept for backward
    compatibility for a while.

    Signed-off-by: Masahiro Yamada

    Masahiro Yamada
     

30 Jan, 2020

3 commits

  • Pull thread management updates from Christian Brauner:
    "Sargun Dhillon over the last cycle has worked on the pidfd_getfd()
    syscall.

    This syscall allows for the retrieval of file descriptors of a process
    based on its pidfd. A task needs to have ptrace_may_access()
    permissions with PTRACE_MODE_ATTACH_REALCREDS (suggested by Oleg and
    Andy) on the target.

    One of the main use-cases is in combination with seccomp's user
    notification feature. As a reminder, seccomp's user notification
    feature was made available in v5.0. It allows a task to retrieve a
    file descriptor for its seccomp filter. The file descriptor is usually
    handed off to a more privileged supervising process. The supervisor can
    then listen for syscall events caught by the seccomp filter of the
    supervisee and perform actions in lieu of the supervisee, usually
    emulating syscalls. pidfd_getfd() is needed to expand its uses.

    There are currently two major users that wait on pidfd_getfd() and one
    future user:

    - Netflix, Sargun said, is working on a service mesh where users
    should be able to connect to a dns-based VIP. When a user connects
    to e.g. 1.2.3.4:80 that runs e.g. service "foo" they will be
    redirected to an envoy process. This service mesh uses seccomp user
    notifications and pidfd to intercept all connect calls and instead
    of connecting them to 1.2.3.4:80 connects them to e.g.
    127.0.0.1:8080.

    - LXD uses the seccomp notifier heavily to intercept and emulate
    mknod() and mount() syscalls for unprivileged containers/processes.
    With pidfd_getfd() more use-cases e.g. bridging socket connections
    will be possible.

    - The patchset has also seen some interest from the browser corner.
    Right now, Firefox is using a SECCOMP_RET_TRAP sandbox managed by a
    broker process. In the future glibc will start blocking all signals
    during dlopen() rendering this type of sandbox impossible. Hence,
    in the future Firefox will switch to a seccomp-user-notification
    based sandbox which also makes use of file descriptor retrieval.
    The thread for this can be found at
    https://sourceware.org/ml/libc-alpha/2019-12/msg00079.html

    With pidfd_getfd() it is e.g. possible to bridge socket connections
    for the supervisee (binding to a privileged port) and taking actions
    on file descriptors on behalf of the supervisee in general.
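As a rough illustration of the descriptor retrieval described above, the following C sketch duplicates a file descriptor out of a process through its pidfd. The helper name grab_fd_from() is hypothetical, and the fallback syscall numbers (434 and 438) are the generic values used on most architectures, assumed here only for libc headers that predate these calls. The caller still needs the ptrace-attach permission on the target that the text mentions.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumed generic syscall numbers, used only if the headers lack them. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/*
 * Duplicate descriptor `targetfd` of process `pid` into the caller.
 * Returns the new local descriptor, or -errno on failure.
 * (Hypothetical helper, not part of any library.)
 */
static int grab_fd_from(pid_t pid, int targetfd)
{
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
    if (pidfd < 0)
        return -errno;

    /* The flags argument of pidfd_getfd() must currently be 0. */
    int fd = (int)syscall(SYS_pidfd_getfd, pidfd, targetfd, 0);
    int err = errno; /* save before close() can clobber it */

    close(pidfd);
    return fd < 0 ? -err : fd;
}
```

A supervisor would pass the pid of the supervised task; on kernels without the syscall the helper is expected to return -ENOSYS.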

    Sargun's first version was using an ioctl on pidfds but various people
    pushed for it to be a proper syscall which he duly implemented as
    well over various review cycles. Selftests are of course included.
    I've also added instructions on how to deal with merge conflicts below.

    There's also a small fix coming from the kernel mentee project to
    correctly annotate struct sighand_struct with __rcu to fix various
    sparse warnings. We've received a few more such fixes and even though
    they are mostly trivial I've decided to postpone them until after -rc1
    since they came in rather late and I don't want to risk introducing
    build warnings.

    Finally, there's a new prctl() command PR_{G,S}ET_IO_FLUSHER which is
    needed to avoid allocation recursions triggerable by storage drivers
    that have userspace parts that run in the IO path (e.g. dm-multipath,
    iscsi, etc). These allocation recursions deadlock the device.

    The new prctl() allows such privileged userspace components to avoid
    allocation recursions by setting the PF_MEMALLOC_NOIO and
    PF_LESS_THROTTLE flags. The patch carries the necessary acks from the
    relevant maintainers and is routed here as part of prctl()
    thread-management."
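A minimal sketch of how a userspace storage daemon might use the new prctl() commands follows. The helper names are hypothetical; the numeric fallbacks (57 and 58) are the values from the kernel's prctl header, supplied only for older libc headers. The calls are expected to fail with -EPERM when the caller lacks CAP_SYS_RESOURCE.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Values from the kernel UAPI header; older libc headers may lack them. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Mark (on != 0) or unmark the calling thread as an IO flusher.
 * Returns 0 on success or -errno. (Hypothetical helper.)
 */
static int set_io_flusher(int on)
{
    /* The unused prctl() arguments must be zero. */
    if (prctl(PR_SET_IO_FLUSHER, on ? 1UL : 0UL, 0UL, 0UL, 0UL) < 0)
        return -errno;
    return 0;
}

/* Returns 1 if the IO flusher state is set, 0 if not, or -errno. */
static int get_io_flusher(void)
{
    int r = prctl(PR_GET_IO_FLUSHER, 0UL, 0UL, 0UL, 0UL);
    return r < 0 ? -errno : r;
}
```

A daemon in the IO path would typically call set_io_flusher(1) once at startup, before it begins servicing requests.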

    * tag 'threads-v5.6' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    prctl: PR_{G,S}ET_IO_FLUSHER to support controlling memory reclaim
    sched.h: Annotate sighand_struct with __rcu
    test: Add test for pidfd getfd
    arch: wire up pidfd_getfd syscall
    pid: Implement pidfd_getfd syscall
    vfs, fdtable: Add fget_task helper

    Linus Torvalds
     
  • Pull openat2 support from Al Viro:
    "This is the openat2() series from Aleksa Sarai.

    I'm afraid that the rest of namei stuff will have to wait - it got
    zero review the last time I'd posted #work.namei, and there had been a
    leak in the posted series I'd caught only last weekend. I was going to
    repost it on Monday, but the window opened and the odds of getting any
    review during that... Oh, well.

    Anyway, the openat2 part should be ready; that _did_ get a sane amount of
    review and public testing, so here it comes"

    From Aleksa's description of the series:
    "For a very long time, extending openat(2) with new features has been
    incredibly frustrating. This stems from the fact that openat(2) is
    possibly the most famous counter-example to the mantra "don't silently
    accept garbage from userspace" -- it doesn't check whether unknown
    flags are present[1].

    This means that (generally) the addition of new flags to openat(2) has
    been fraught with backwards-compatibility issues (O_TMPFILE has to be
    defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
    kernels gave errors, since it's insecure to silently ignore the
    flag[2]). All new security-related flags therefore have a tough road
    to being added to openat(2).
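The O_TMPFILE trick mentioned above can be checked directly against the kernel's own flag definitions. The sketch below, with hypothetical helper names, includes the UAPI header rather than the libc one (an assumption made so that no feature-test macro is needed) and tests that O_TMPFILE carries the O_DIRECTORY bits plus at least one bit of its own.

```c
#include <linux/fcntl.h> /* kernel UAPI definitions of the O_* flags */
#include <stdbool.h>

/* True if every O_DIRECTORY bit is also set in O_TMPFILE. */
static bool tmpfile_includes_directory(void)
{
    return (O_TMPFILE & O_DIRECTORY) == O_DIRECTORY;
}

/* True if O_TMPFILE has bits beyond O_DIRECTORY (the __O_TMPFILE part). */
static bool tmpfile_has_extra_bits(void)
{
    return (O_TMPFILE & ~O_DIRECTORY) != 0;
}
```

Because an old kernel only understands the O_DIRECTORY part, a write-mode open of a directory fails there instead of silently ignoring the new flag.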

    Furthermore, the need for some sort of control over VFS's path
    resolution (to avoid malicious paths resulting in inadvertent
    breakouts) has been a very long-standing desire of many userspace
    applications.

    This patchset is a revival of Al Viro's old AT_NO_JUMPS[3] patchset
    (which was a variant of David Drysdale's O_BENEATH patchset[4] which
    was a spin-off of the Capsicum project[5]) with a few additions and
    changes made based on the previous discussion within [6] as well as
    others I felt were useful.

    In line with the conclusions of the original discussion of
    AT_NO_JUMPS, the flag has been split up into separate flags. However,
    instead of being an openat(2) flag it is provided through a new
    syscall openat2(2) which provides several other improvements to the
    openat(2) interface (see the patch description for more details). The
    following new LOOKUP_* flags are added:

    LOOKUP_NO_XDEV:

    Blocks all mountpoint crossings (upwards, downwards, or through
    absolute links). Absolute pathnames alone in openat(2) do not
    trigger this. Magic-link traversal which implies a vfsmount jump is
    also blocked (though magic-link jumps on the same vfsmount are
    permitted).

    LOOKUP_NO_MAGICLINKS:

    Blocks resolution through /proc/$pid/fd-style links. This is done
    by blocking the usage of nd_jump_link() during resolution in a
    filesystem. The term "magic-links" is used to match with the only
    reference to these links in Documentation/, but I'm happy to change
    the name.

    It should be noted that this is different to the scope of
    ~LOOKUP_FOLLOW in that it applies to all path components. However,
    you can do openat2(NO_FOLLOW|NO_MAGICLINKS) on a magic-link and it
    will *not* fail (assuming that no parent component was a
    magic-link), and you will have an fd for the magic-link.

    In order to correctly detect magic-links, the introduction of a new
    LOOKUP_MAGICLINK_JUMPED state flag was required.

    LOOKUP_BENEATH:

    Disallows escapes to outside the starting dirfd's
    tree, using techniques such as ".." or absolute links. Absolute
    paths in openat(2) are also disallowed.

    Conceptually this flag is to ensure you "stay below" a certain
    point in the filesystem tree -- but this requires some additional checks
    to protect against various races that would allow escape using
    "..".

    Currently LOOKUP_BENEATH implies LOOKUP_NO_MAGICLINKS, because it
    can trivially beam you around the filesystem (breaking the
    protection). In future, there might be similar safety checks done
    as in LOOKUP_IN_ROOT, but that requires more discussion.

    In addition, two new flags are added that expand on the above ideas:

    LOOKUP_NO_SYMLINKS:

    Does what it says on the tin. No symlink resolution is allowed at
    all, including magic-links. Just as with LOOKUP_NO_MAGICLINKS this
    can still be used with NOFOLLOW to open an fd for the symlink as
    long as no parent path had a symlink component.

    LOOKUP_IN_ROOT:

    This is an extension of LOOKUP_BENEATH that, rather than blocking
    attempts to move past the root, forces all such movements to be
    scoped to the starting point. This provides chroot(2)-like
    protection but without the cost of a chroot(2) for each filesystem
    operation, as well as being safe against race attacks that
    chroot(2) is not.

    If a race is detected (as with LOOKUP_BENEATH) then an error is
    generated, and similar to LOOKUP_BENEATH it is not permitted to
    cross magic-links with LOOKUP_IN_ROOT.

    The primary need for this is from container runtimes, which
    currently need to do symlink scoping in userspace[7] when opening
    paths in a potentially malicious container.

    There is a long list of CVEs that could have been mitigated by
    having RESOLVE_IN_ROOT (such as CVE-2017-1002101,
    CVE-2017-1002102, CVE-2018-15664, and CVE-2019-5736, just to name a
    few).

    In order to make all of the above more usable, I'm working on
    libpathrs[8] which is a C-friendly library for safe path resolution.
    It features a userspace-emulated backend if the kernel doesn't support
    openat2(2). Hopefully we can get userspace to switch to using it, and
    thus get openat2(2) support for free once it's ready.

    Future work would include implementing things like
    RESOLVE_NO_AUTOMOUNT and possibly a RESOLVE_NO_REMOTE (to allow
    programs to be sure they don't hit DoSes through stale NFS handles)"
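To make the scoped-resolution behaviour concrete, here is a hedged C sketch that opens a path with RESOLVE_BENEATH through the raw syscall. The struct and macro names are local, hypothetical mirrors of the kernel's struct open_how and RESOLVE_BENEATH (value 0x08), and 437 is the assumed generic syscall number, so the example does not depend on recent headers being installed.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437 /* assumed generic number for old headers */
#endif

/* Local mirror of the kernel's struct open_how: three 64-bit fields. */
struct how_args {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;
};

/* Local copy of the kernel's RESOLVE_BENEATH value. */
#define HOW_RESOLVE_BENEATH UINT64_C(0x08)

/*
 * Open `path` read-only, refusing any resolution that leaves the tree
 * rooted at `dirfd`. Returns the new fd or -errno. (Hypothetical helper.)
 */
static int open_beneath(int dirfd, const char *path)
{
    struct how_args how = {
        .flags = O_RDONLY | O_CLOEXEC,
        .resolve = HOW_RESOLVE_BENEATH,
    };
    long fd = syscall(SYS_openat2, dirfd, path, &how, sizeof(how));

    return fd < 0 ? -errno : (int)fd;
}
```

With this helper, a path such as ".." or an absolute path is expected to fail with -EXDEV, while kernels without openat2(2) return -ENOSYS.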

    * 'work.openat2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    Documentation: path-lookup: include new LOOKUP flags
    selftests: add openat2(2) selftests
    open: introduce openat2(2) syscall
    namei: LOOKUP_{IN_ROOT,BENEATH}: permit limited ".." resolution
    namei: LOOKUP_IN_ROOT: chroot-like scoped resolution
    namei: LOOKUP_BENEATH: O_BENEATH-like scoped resolution
    namei: LOOKUP_NO_XDEV: block mountpoint crossing
    namei: LOOKUP_NO_MAGICLINKS: block magic-link resolution
    namei: LOOKUP_NO_SYMLINKS: block symlink resolution
    namei: allow set_root() to produce errors
    namei: allow nd_jump_link() to produce errors
    nsfs: clean-up ns_get_path() signature to return int
    namei: only return -ECHILD from follow_dotdot_rcu()

    Linus Torvalds
     
  • Pull tty/serial driver updates from Greg KH:
    "Here are the big set of tty and serial driver updates for 5.6-rc1

    Included in here are:
    - dummy_con cleanups (touches lots of arch code)
    - sysrq logic cleanups (touches lots of serial drivers)
    - samsung driver fixes (wasn't really being built)
    - conmakeshash move to tty subdir out of scripts
    - lots of small tty/serial driver updates

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'tty-5.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (140 commits)
    tty: n_hdlc: Use flexible-array member and struct_size() helper
    tty: baudrate: SPARC supports few more baud rates
    tty: baudrate: Synchronise baud_table[] and baud_bits[]
    tty: serial: meson_uart: Add support for kernel debugger
    serial: imx: fix a race condition in receive path
    serial: 8250_bcm2835aux: Document struct bcm2835aux_data
    serial: 8250_bcm2835aux: Use generic remapping code
    serial: 8250_bcm2835aux: Allocate uart_8250_port on stack
    serial: 8250_bcm2835aux: Suppress register_port error on -EPROBE_DEFER
    serial: 8250_bcm2835aux: Suppress clk_get error on -EPROBE_DEFER
    serial: 8250_bcm2835aux: Fix line mismatch on driver unbind
    serial_core: Remove unused member in uart_port
    vt: Correct comment documenting do_take_over_console()
    vt: Delete comment referencing non-existent unbind_con_driver()
    arch/xtensa/setup: Drop dummy_con initialization
    arch/x86/setup: Drop dummy_con initialization
    arch/unicore32/setup: Drop dummy_con initialization
    arch/sparc/setup: Drop dummy_con initialization
    arch/sh/setup: Drop dummy_con initialization
    arch/s390/setup: Drop dummy_con initialization
    ...

    Linus Torvalds
     

29 Jan, 2020

1 commit

  • Pull EFI updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Cleanup of the GOP [graphics output] handling code in the EFI stub

    - Complete refactoring of the mixed mode handling in the x86 EFI stub

    - Overhaul of the x86 EFI boot/runtime code

    - Increase robustness for mixed mode code

    - Add the ability to disable DMA at the root port level in the EFI
    stub

    - Get rid of RWX mappings in the EFI memory map and page tables,
    where possible

    - Move the support code for the old EFI memory mapping style into its
    only user, the SGI UV1+ support code.

    - plus misc fixes, updates, smaller cleanups.

    ... and due to interactions with the RWX changes, another round of PAT
    cleanups make a guest appearance via the EFI tree - with no side
    effects intended"

    * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
    efi/x86: Disable instrumentation in the EFI runtime handling code
    efi/libstub/x86: Fix EFI server boot failure
    efi/x86: Disallow efi=old_map in mixed mode
    x86/boot/compressed: Relax sed symbol type regex for LLVM ld.lld
    efi/x86: avoid KASAN false positives when accessing the 1:1 mapping
    efi: Fix handling of multiple efi_fake_mem= entries
    efi: Fix efi_memmap_alloc() leaks
    efi: Add tracking for dynamically allocated memmaps
    efi: Add a flags parameter to efi_memory_map
    efi: Fix comment for efi_mem_type() wrt absent physical addresses
    efi/arm: Defer probe of PCIe backed efifb on DT systems
    efi/x86: Limit EFI old memory map to SGI UV machines
    efi/x86: Avoid RWX mappings for all of DRAM
    efi/x86: Don't map the entire kernel text RW for mixed mode
    x86/mm: Fix NX bit clearing issue in kernel_map_pages_in_pgd
    efi/libstub/x86: Fix unused-variable warning
    efi/libstub/x86: Use mandatory 16-byte stack alignment in mixed mode
    efi/libstub/x86: Use const attribute for efi_is_64bit()
    efi: Allow disabling PCI busmastering on bridges during boot
    efi/x86: Allow translating 64-bit arguments for mixed mode calls
    ...

    Linus Torvalds
     

18 Jan, 2020

1 commit

  • /* Background. */
    For a very long time, extending openat(2) with new features has been
    incredibly frustrating. This stems from the fact that openat(2) is
    possibly the most famous counter-example to the mantra "don't silently
    accept garbage from userspace" -- it doesn't check whether unknown flags
    are present[1].

    This means that (generally) the addition of new flags to openat(2) has
    been fraught with backwards-compatibility issues (O_TMPFILE has to be
    defined as __O_TMPFILE|O_DIRECTORY|[O_RDWR or O_WRONLY] to ensure old
    kernels gave errors, since it's insecure to silently ignore the
    flag[2]). All new security-related flags therefore have a tough road to
    being added to openat(2).

    Userspace also has a hard time figuring out whether a particular flag is
    supported on a particular kernel. While it is now possible with
    contemporary kernels (thanks to [3]), older kernels will expose unknown
    flag bits through fcntl(F_GETFL). Giving a clear -EINVAL during
    openat(2) time matches modern syscall designs and is far more
    fool-proof.

    In addition, the newly-added path resolution restriction LOOKUP flags
    (which we would like to expose to user-space) don't feel related to the
    pre-existing O_* flag set -- they affect all components of path lookup.
    We'd therefore like to add a new flag argument.

    Adding a new syscall allows us to finally fix the flag-ignoring problem,
    and we can make it extensible enough so that we will hopefully never
    need an openat3(2).

    /* Syscall Prototype. */
    /*
    * open_how is an extensible structure (similar in interface to
    * clone3(2) or sched_setattr(2)). The size parameter must be set to
    * sizeof(struct open_how), to allow for future extensions. All future
    * extensions will be appended to open_how, with their zero value
    * acting as a no-op default.
    */
    struct open_how { /* ... */ };

    int openat2(int dfd, const char *pathname,
    struct open_how *how, size_t size);

    /* Description. */
    The initial version of 'struct open_how' contains the following fields:

    flags
    Used to specify openat(2)-style flags. However, any unknown flag
    bits or otherwise incorrect flag combinations (like O_PATH|O_RDWR)
    will result in -EINVAL. In addition, this field is 64-bits wide to
    allow for more O_ flags than currently permitted with openat(2).

    mode
    The file mode for O_CREAT or O_TMPFILE.

    Must be set to zero if flags does not contain O_CREAT or O_TMPFILE.

    resolve
    Restrict path resolution (in contrast to O_* flags they affect all
    path components). The current set of flags is as follows (at the
    moment, all of the RESOLVE_ flags are implemented as just passing
    the corresponding LOOKUP_ flag).

    RESOLVE_NO_XDEV => LOOKUP_NO_XDEV
    RESOLVE_NO_SYMLINKS => LOOKUP_NO_SYMLINKS
    RESOLVE_NO_MAGICLINKS => LOOKUP_NO_MAGICLINKS
    RESOLVE_BENEATH => LOOKUP_BENEATH
    RESOLVE_IN_ROOT => LOOKUP_IN_ROOT

    open_how does not contain an embedded size field, because it is of
    little benefit (userspace can figure out the kernel open_how size at
    runtime fairly easily without it). It also only contains u64s (even
    though ->mode arguably should be a u16) to avoid having padding fields
    which are never used in the future.
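The strict argument checking described above can be observed from userspace with a small sketch like the following. The struct is a local, hypothetical mirror of struct open_how; 437 is the assumed generic syscall number; and bit 30 is assumed not to be assigned to any O_* flag, so it stands in for an "unknown" flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_openat2
#define SYS_openat2 437 /* assumed generic number for old headers */
#endif

/* Local mirror of the kernel's struct open_how: three 64-bit fields. */
struct how_mirror {
    uint64_t flags;
    uint64_t mode;
    uint64_t resolve;
};

/* Assumption: bit 30 is not assigned to any O_* flag. */
#define UNKNOWN_OPEN_FLAG (UINT64_C(1) << 30)

/* Thin wrapper: returns the new fd, or -errno on failure. */
static int openat2_raw(int dirfd, const char *path,
                       uint64_t flags, uint64_t mode, uint64_t resolve)
{
    struct how_mirror how = {
        .flags = flags,
        .mode = mode,
        .resolve = resolve,
    };
    /* The size argument tells the kernel which struct layout we use. */
    long fd = syscall(SYS_openat2, dirfd, path, &how, sizeof(how));

    return fd < 0 ? -errno : (int)fd;
}
```

Calling openat2_raw(AT_FDCWD, ".", O_RDONLY | UNKNOWN_OPEN_FLAG, 0, 0), or passing a non-zero mode without O_CREAT or O_TMPFILE, is expected to yield -EINVAL, whereas openat(2) would quietly ignore the unknown bit.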

    Note that as a result of the new how->flags handling, O_PATH|O_TMPFILE
    is no longer permitted for openat(2). As far as I can tell, this has
    always been a bug and appears to not be used by userspace (and I've not
    seen any problems on my machines by disallowing it). If it turns out
    this breaks something, we can special-case it and only permit it for
    openat(2) but not openat2(2).

    After input from Florian Weimer, the new open_how and flag definitions
    are inside a separate header from uapi/linux/fcntl.h, to avoid problems
    that glibc has with importing that header.

    /* Testing. */
    In a follow-up patch there are over 200 selftests which ensure that this
    syscall has the correct semantics and will correctly handle several
    attack scenarios.

    In addition, I've written a userspace library[4] which provides
    convenient wrappers around openat2(RESOLVE_IN_ROOT) (this is necessary
    because no other syscalls support RESOLVE_IN_ROOT, and thus lots of care
    must be taken when using RESOLVE_IN_ROOT'd file descriptors with other
    syscalls). During the development of this patch, I've run numerous
    verification tests using libpathrs (showing that the API is reasonably
    usable by userspace).

    /* Future Work. */
    Additional RESOLVE_ flags have been suggested during the review period.
    These can be easily implemented separately (such as blocking auto-mount
    during resolution).

    Furthermore, there are some other proposed changes to the openat(2)
    interface (the most obvious example is magic-link hardening[5]) which
    would be a good opportunity to add a way for userspace to restrict how
    O_PATH file descriptors can be re-opened.

    Another possible avenue of future work would be some kind of
    CHECK_FIELDS[6] flag which causes the kernel to indicate to userspace
    which openat2(2) flags and fields are supported by the current kernel
    (to avoid userspace having to go through several guesses to figure it
    out).

    [1]: https://lwn.net/Articles/588444/
    [2]: https://lore.kernel.org/lkml/CA+55aFyyxJL1LyXZeBsf2ypriraj5ut1XkNDsunRBqgVjZU_6Q@mail.gmail.com
    [3]: commit 629e014bb834 ("fs: completely ignore unknown open flags")
    [4]: https://sourceware.org/bugzilla/show_bug.cgi?id=17523
    [5]: https://lore.kernel.org/lkml/20190930183316.10190-2-cyphar@cyphar.com/
    [6]: https://youtu.be/ggD-eb3yPVs

    Suggested-by: Christian Brauner
    Signed-off-by: Aleksa Sarai
    Signed-off-by: Al Viro

    Aleksa Sarai
     

14 Jan, 2020

2 commits


06 Jan, 2020

1 commit


10 Dec, 2019

1 commit


05 Dec, 2019

1 commit

  • Patch series "mm: remove __ARCH_HAS_4LEVEL_HACK", v13.

    These patches convert several architectures to use page table folding
    and remove __ARCH_HAS_4LEVEL_HACK along with
    include/asm-generic/4level-fixup.h.

    For the nommu configurations the folding is already implemented by the
    generic code so the only change was to use the appropriate header file.

    As for the rest, the changes are mostly about mechanical replacement of
    pgd accessors with pud/pmd ones and the addition of higher levels to
    page table traversals.

    With Vineet's patches from "elide extraneous generated code for folded
    p4d/pud/pmd" series [1] there is a small shrink of the kernel size of
    about -0.01% for the defconfig builds.

    This patch (of 13):

    It is not likely alpha will have 5-level page tables.

    Replace usage of include/asm-generic/4level-fixup.h and implied
    __ARCH_HAS_4LEVEL_HACK with include/asm-generic/pgtable-nopud.h and
    adjust page table manipulation macros and functions accordingly.

    Link: http://lkml.kernel.org/r/1572938135-31886-2-git-send-email-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Cc: Anton Ivanov
    Cc: Arnd Bergmann
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Helge Deller
    Cc: "James E.J. Bottomley"
    Cc: Jeff Dike
    Cc: "Kirill A. Shutemov"
    Cc: Mark Salter
    Cc: Matt Turner
    Cc: Michal Simek
    Cc: Peter Rosin
    Cc: Richard Weinberger
    Cc: Rolf Eike Beer
    Cc: Russell King
    Cc: Sam Creasey
    Cc: Vincent Chen
    Cc: Vineet Gupta
    Cc: Anatoly Pugachev
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Dec, 2019

1 commit

  • Pull PCI updates from Bjorn Helgaas:
    "Enumeration:

    - Warn if a host bridge has no NUMA info (Yunsheng Lin)

    - Add PCI_STD_NUM_BARS for the number of standard BARs (Denis
    Efremov)

    Resource management:

    - Fix boot-time Embedded Controller GPE storm caused by incorrect
    resource assignment after ACPI Bus Check Notification (Mika
    Westerberg)

    - Protect pci_reassign_bridge_resources() against concurrent
    addition/removal (Benjamin Herrenschmidt)

    - Fix bridge dma_ranges resource list cleanup (Rob Herring)

    - Add "pci=hpmmiosize" and "pci=hpmmioprefsize" parameters to control
    the MMIO and prefetchable MMIO window sizes of hotplug bridges
    independently (Nicholas Johnson)

    - Fix MMIO/MMIO_PREF window assignment that assigned more space than
    desired (Nicholas Johnson)

    - Only enforce bus numbers from bridge EA if the bridge has EA
    devices downstream (Subbaraya Sundeep)

    - Consolidate DT "dma-ranges" parsing and convert all host drivers to
    use shared parsing (Rob Herring)

    Error reporting:

    - Restore AER capability after resume (Mayurkumar Patel)

    - Add PoisonTLPBlocked AER counter (Rajat Jain)

    - Use for_each_set_bit() to simplify AER code (Andy Shevchenko)

    - Fix AER kernel-doc (Andy Shevchenko)

    - Add "pcie_ports=dpc-native" parameter to allow native use of DPC
    even if platform didn't grant control over AER (Olof Johansson)

    Hotplug:

    - Avoid returning prematurely from sysfs requests to enable or
    disable a PCIe hotplug slot (Lukas Wunner)

    - Don't disable interrupts twice when suspending hotplug ports (Mika
    Westerberg)

    - Fix deadlocks when PCIe ports are hot-removed while suspended (Mika
    Westerberg)

    Power management:

    - Remove unnecessary ASPM locking (Bjorn Helgaas)

    - Add support for disabling L1 PM Substates (Heiner Kallweit)

    - Allow re-enabling Clock PM after it has been disabled (Heiner
    Kallweit)

    - Add sysfs attributes for controlling ASPM link states (Heiner
    Kallweit)

    - Remove CONFIG_PCIEASPM_DEBUG, including "link_state" and "clk_ctl"
    sysfs files (Heiner Kallweit)

    - Avoid AMD FCH XHCI USB PME# from D0 defect that prevents wakeup on
    USB 2.0 or 1.1 connect events (Kai-Heng Feng)

    - Move power state check out of pci_msi_supported() (Bjorn Helgaas)

    - Fix incorrect MSI-X masking on resume and revert related nvme quirk
    for Kingston NVME SSD running FW E8FK11.T (Jian-Hong Pan)

    - Always return devices to D0 when thawing to fix hibernation with
    drivers like mlx4 that used legacy power management (previously we
    only did it for drivers with new power management ops) (Dexuan Cui)

    - Clear PCIe PME Status even for legacy power management (Bjorn
    Helgaas)

    - Fix PCI PM documentation errors (Bjorn Helgaas)

    - Use dev_printk() for more power management messages (Bjorn Helgaas)

    - Apply D2 delay as milliseconds, not microseconds (Bjorn Helgaas)

    - Convert xen-platform from legacy to generic power management (Bjorn
    Helgaas)

    - Remove unused .resume_early() and .suspend_late() legacy power
    management hooks (Bjorn Helgaas)

    - Rearrange power management code for clarity (Rafael J. Wysocki)

    - Decode power states more clearly ("4" or "D4" really refers to
    "D3cold") (Bjorn Helgaas)

    - Notice when reading PM Control register returns an error (~0)
    instead of interpreting it as being in D3hot (Bjorn Helgaas)

    - Add missing link delays required by the PCIe spec (Mika Westerberg)

    Virtualization:

    - Move pci_prg_resp_pasid_required() to CONFIG_PCI_PRI (Bjorn
    Helgaas)

    - Allow VFs to use PRI (the PF PRI is shared by the VFs, but the code
    previously didn't recognize that) (Kuppuswamy Sathyanarayanan)

    - Allow VFs to use PASID (the PF PASID capability is shared by the
    VFs, but the code previously didn't recognize that) (Kuppuswamy
    Sathyanarayanan)

    - Disconnect PF and VF ATS enablement, since ATS in PFs and
    associated VFs can be enabled independently (Kuppuswamy
    Sathyanarayanan)

    - Cache PRI and PASID capability offsets (Kuppuswamy Sathyanarayanan)

    - Cache the PRI PRG Response PASID Required bit (Bjorn Helgaas)

    - Consolidate ATS declarations in linux/pci-ats.h (Krzysztof
    Wilczynski)

    - Remove unused PRI and PASID stubs (Bjorn Helgaas)

    - Removed unnecessary EXPORT_SYMBOL_GPL() from ATS, PRI, and PASID
    interfaces that are only used by built-in IOMMU drivers (Bjorn
    Helgaas)

    - Hide PRI and PASID state restoration functions used only inside the
    PCI core (Bjorn Helgaas)

    - Add a DMA alias quirk for the Intel VCA NTB (Slawomir Pawlowski)

    - Serialize sysfs sriov_numvfs reads vs writes (Pierre Crégut)

    - Update Cavium ACS quirk for ThunderX2 and ThunderX3 (George
    Cherian)

    - Fix the UPDCR register address in the Intel ACS quirk (Steffen
    Liebergeld)

    - Unify ACS quirk implementations (Bjorn Helgaas)

    Amlogic Meson host bridge driver:

    - Fix meson PERST# GPIO polarity problem (Remi Pommarel)

    - Add DT bindings for Amlogic Meson G12A (Neil Armstrong)

    - Fix meson clock names to match DT bindings (Neil Armstrong)

    - Add meson support for Amlogic G12A SoC with separate shared PHY
    (Neil Armstrong)

    - Add meson extended PCIe PHY functions for Amlogic G12A USB3+PCIe
    combo PHY (Neil Armstrong)

    - Add arm64 DT for Amlogic G12A PCIe controller node (Neil Armstrong)

    - Add commented-out description of VIM3 USB3/PCIe mux in arm64 DT
    (Neil Armstrong)

    Broadcom iProc host bridge driver:

    - Invalidate iProc PAXB address mapping before programming it
    (Abhishek Shah)

    - Fix iproc-msi and mvebu __iomem annotations (Ben Dooks)

    Cadence host bridge driver:

    - Refactor Cadence PCIe host controller to use as a library for both
    host and endpoint (Tom Joseph)

    Freescale Layerscape host bridge driver:

    - Add layerscape LS1028a support (Xiaowei Bao)

    Intel VMD host bridge driver:

    - Add VMD bus 224-255 restriction decode (Jon Derrick)

    - Add VMD 8086:9A0B device ID (Jon Derrick)

    - Remove Keith from VMD maintainer list (Keith Busch)

    Marvell ARMADA 3700 / Aardvark host bridge driver:

    - Use LTSSM state to build link training flag since Aardvark doesn't
    implement the Link Training bit (Remi Pommarel)

    - Delay before training Aardvark link in case PERST# was asserted
    before the driver probe (Remi Pommarel)

    - Fix Aardvark issues with Root Control reads and writes (Remi
    Pommarel)

    - Don't rely on jiffies in Aardvark config access path since
    interrupts may be disabled (Remi Pommarel)

    - Fix Aardvark big-endian support (Grzegorz Jaszczyk)

    Marvell ARMADA 370 / XP host bridge driver:

    - Make mvebu_pci_bridge_emul_ops static (Ben Dooks)

    Microsoft Hyper-V host bridge driver:

    - Add hibernation support for Hyper-V virtual PCI devices (Dexuan
    Cui)

    - Track Hyper-V pci_protocol_version per-hbus, not globally (Dexuan
    Cui)

    - Avoid kmemleak false positive on hv hbus buffer (Dexuan Cui)

    Mobiveil host bridge driver:

    - Change mobiveil csr_read()/write() function names that conflict
    with riscv arch functions (Kefeng Wang)

    NVIDIA Tegra host bridge driver:

    - Fix Tegra CLKREQ dependency programming (Vidya Sagar)

    Renesas R-Car host bridge driver:

    - Remove unnecessary header include from rcar (Andrew Murray)

    - Tighten register index checking for rcar inbound range programming
    (Marek Vasut)

    - Fix rcar inbound range alignment calculation to improve packing of
    multiple entries (Marek Vasut)

    - Update rcar MACCTLR setting to match documentation (Yoshihiro
    Shimoda)

    - Clear bit 0 of MACCTLR before PCIETCTLR.CFINIT per manual
    (Yoshihiro Shimoda)

    - Add Marek Vasut and Yoshihiro Shimoda as R-Car maintainers (Simon
    Horman)

    Rockchip host bridge driver:

    - Make rockchip 0V9 and 1V8 power regulators non-optional (Robin
    Murphy)

    Socionext UniPhier host bridge driver:

    - Set uniphier to host (RC) mode always (Kunihiko Hayashi)

    Endpoint drivers:

    - Fix endpoint driver sign extension problem when shifting page
    number to phys_addr_t (Alan Mikhak)

    Misc:

    - Add NumaChip SPDX header (Krzysztof Wilczynski)

    - Replace EXTRA_CFLAGS with ccflags-y (Krzysztof Wilczynski)

    - Remove unused includes (Krzysztof Wilczynski)

    - Removed unused sysfs attribute groups (Ben Dooks)

    - Remove PTM and ASPM dependencies on PCIEPORTBUS (Bjorn Helgaas)

    - Add PCIe Link Control 2 register field definitions to replace magic
    numbers in AMDGPU and Radeon CIK/SI (Bjorn Helgaas)

    - Fix incorrect Link Control 2 Transmit Margin usage in AMDGPU and
    Radeon CIK/SI PCIe Gen3 link training (Bjorn Helgaas)

    - Use pcie_capability_read_word() instead of pci_read_config_word()
    in AMDGPU and Radeon CIK/SI (Frederick Lawler)

    - Remove unused pci_irq_get_node() (Greg Kroah-Hartman)

    - Make asm/msi.h mandatory and simplify PCI_MSI_IRQ_DOMAIN Kconfig
    (Palmer Dabbelt, Michal Simek)

    - Read all 64 bits of Switchtec part_event_bitmap (Logan Gunthorpe)

    - Fix erroneous intel-iommu dependency on CONFIG_AMD_IOMMU (Bjorn
    Helgaas)

    - Fix bridge emulation big-endian support (Grzegorz Jaszczyk)

    - Fix dwc find_next_bit() usage (Niklas Cassel)

    - Fix pcitest.c fd leak (Hewenliang)

    - Fix typos and comments (Bjorn Helgaas)

    - Fix Kconfig whitespace errors (Krzysztof Kozlowski)"

    * tag 'pci-v5.5-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (160 commits)
    PCI: Remove PCI_MSI_IRQ_DOMAIN architecture whitelist
    asm-generic: Make msi.h a mandatory include/asm header
    Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T"
    PCI/MSI: Fix incorrect MSI-X masking on resume
    PCI/MSI: Move power state check out of pci_msi_supported()
    PCI/MSI: Remove unused pci_irq_get_node()
    PCI: hv: Avoid a kmemleak false positive caused by the hbus buffer
    PCI: hv: Change pci_protocol_version to per-hbus
    PCI: hv: Add hibernation support
    PCI: hv: Reorganize the code in preparation of hibernation
    MAINTAINERS: Remove Keith from VMD maintainer
    PCI/ASPM: Remove PCIEASPM_DEBUG Kconfig option and related code
    PCI/ASPM: Add sysfs attributes for controlling ASPM link states
    PCI: Fix indentation
    drm/radeon: Prefer pcie_capability_read_word()
    drm/radeon: Replace numbers with PCI_EXP_LNKCTL2 definitions
    drm/radeon: Correct Transmit Margin masks
    drm/amdgpu: Prefer pcie_capability_read_word()
    PCI: uniphier: Set mode register to host mode
    drm/amdgpu: Replace numbers with PCI_EXP_LNKCTL2 definitions
    ...

    Linus Torvalds
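
The endpoint fix listed above ("sign extension problem when shifting page number to phys_addr_t") belongs to a class of bug that can be sketched in plain C. This is only an illustration, assuming a 32-bit int page number and a 64-bit physical address type; phys_addr_demo_t and both helper names are hypothetical and are not taken from the kernel source.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 64-bit phys_addr_t. */
typedef uint64_t phys_addr_demo_t;

/*
 * Shift in 32 bits first, widen afterwards. The shift is done on an
 * unsigned value so this demo avoids undefined behaviour; the result is
 * then viewed as a signed 32-bit int, which is how the narrow expression
 * would be typed. If bit 31 ends up set, widening sign-extends it.
 */
phys_addr_demo_t offset_shift_then_widen(int pageno, unsigned int page_shift)
{
    int32_t narrow = (int32_t)((uint32_t)pageno << page_shift);

    return (phys_addr_demo_t)narrow;
}

/* Widen first, shift afterwards: bit 31 is just another address bit. */
phys_addr_demo_t offset_widen_then_shift(int pageno, unsigned int page_shift)
{
    return (phys_addr_demo_t)pageno << page_shift;
}
```

With a page number of 0x80000 and a shift of 12, the first helper yields 0xFFFFFFFF80000000 while the second yields 0x80000000; for small page numbers the two agree, which is why such a bug can stay hidden for a long time.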
     

02 Dec, 2019

1 commit

  • Pull y2038 cleanups from Arnd Bergmann:
    "y2038 syscall implementation cleanups

    This is a series of cleanups for the y2038 work, mostly intended for
    namespace cleaning: the kernel defines the traditional time_t, timeval
    and timespec types that often lead to y2038-unsafe code. Even though
    the unsafe usage is mostly gone from the kernel, having the types and
    associated functions around means that we can still grow new users,
    and that we may be missing conversions to safe types that actually
    matter.

    There are still a number of driver specific patches needed to get the
    last users of these types removed, those have been submitted to the
    respective maintainers"

    Link: https://lore.kernel.org/lkml/20191108210236.1296047-1-arnd@arndb.de/

    * tag 'y2038-cleanups-5.5' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (26 commits)
    y2038: alarm: fix half-second cut-off
    y2038: ipc: fix x32 ABI breakage
    y2038: fix typo in powerpc vdso "LOPART"
    y2038: allow disabling time32 system calls
    y2038: itimer: change implementation to timespec64
    y2038: move itimer reset into itimer.c
    y2038: use compat_{get,set}_itimer on alpha
    y2038: itimer: compat handling to itimer.c
    y2038: time: avoid timespec usage in settimeofday()
    y2038: timerfd: Use timespec64 internally
    y2038: elfcore: Use __kernel_old_timeval for process times
    y2038: make ns_to_compat_timeval use __kernel_old_timeval
    y2038: socket: use __kernel_old_timespec instead of timespec
    y2038: socket: remove timespec reference in timestamping
    y2038: syscalls: change remaining timeval to __kernel_old_timeval
    y2038: rusage: use __kernel_old_timeval
    y2038: uapi: change __kernel_time_t to __kernel_old_time_t
    y2038: stat: avoid 'time_t' in 'struct stat'
    y2038: ipc: remove __kernel_time_t reference from headers
    y2038: vdso: powerpc: avoid timespec references
    ...

    Linus Torvalds
     

29 Nov, 2019

1 commit

  • Pull generic ioremap support from Christoph Hellwig:
    "This adds the remaining bits for an entirely generic ioremap and
    iounmap to lib/ioremap.c. To facilitate that, it cleans up the giant
    mess of weird ioremap variants we had with no users outside the arch
    code.

    For now just the three newest ports use the code, but there is more
    than a handful others that can be converted without too much work.

    Summary:

    - clean up various obsolete ioremap and iounmap variants

    - add a new generic ioremap implementation and switch csky, nds32 and
    riscv over to it"

    * tag 'ioremap-5.5' of git://git.infradead.org/users/hch/ioremap: (21 commits)
    nds32: use generic ioremap
    csky: use generic ioremap
    csky: remove ioremap_cache
    riscv: use the generic ioremap code
    lib: provide a simple generic ioremap implementation
    sh: remove __iounmap
    nios2: remove __iounmap
    hexagon: remove __iounmap
    m68k: rename __iounmap and mark it static
    arch: rely on asm-generic/io.h for default ioremap_* definitions
    asm-generic: don't provide ioremap for CONFIG_MMU
    asm-generic: ioremap_uc should behave the same with and without MMU
    xtensa: clean up ioremap
    x86: Clean up ioremap()
    parisc: remove __ioremap
    nios2: remove __ioremap
    alpha: remove the unused __ioremap wrapper
    hexagon: clean up ioremap
    ia64: rename ioremap_nocache to ioremap_uc
    unicore32: remove ioremap_cached
    ...

    Linus Torvalds
     

27 Nov, 2019

1 commit

  • Pull x86 asm updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Cross-arch changes to move the linker sections for NOTES and
    EXCEPTION_TABLE into the RO_DATA area, where they belong on most
    architectures. (Kees Cook)

    - Switch the x86 linker fill byte from 0x90 (NOP) to 0xcc (INT3), to
    trap jumps into the middle of those padding areas instead of
    sliding execution. (Kees Cook)

    - A thorough cleanup of symbol definitions within x86 assembler code.
    The rather randomly named macros got streamlined around a
    (hopefully) straightforward naming scheme:

    SYM_START(name, linkage, align...)
    SYM_END(name, sym_type)

    SYM_FUNC_START(name)
    SYM_FUNC_END(name)

    SYM_CODE_START(name)
    SYM_CODE_END(name)

    SYM_DATA_START(name)
    SYM_DATA_END(name)

    etc - plus roughly three times as many variants of these basic
    primitives for labels, local symbols or attributes, expressed via
    postfixes.

    No change in functionality intended. (Jiri Slaby)

    - Misc other changes, cleanups and smaller fixes"

    * 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
    x86/entry/64: Remove pointless jump in paranoid_exit
    x86/entry/32: Remove unused resume_userspace label
    x86/build/vdso: Remove meaningless CFLAGS_REMOVE_*.o
    m68k: Convert missed RODATA to RO_DATA
    x86/vmlinux: Use INT3 instead of NOP for linker fill bytes
    x86/mm: Report actual image regions in /proc/iomem
    x86/mm: Report which part of kernel image is freed
    x86/mm: Remove redundant address-of operators on addresses
    xtensa: Move EXCEPTION_TABLE to RO_DATA segment
    powerpc: Move EXCEPTION_TABLE to RO_DATA segment
    parisc: Move EXCEPTION_TABLE to RO_DATA segment
    microblaze: Move EXCEPTION_TABLE to RO_DATA segment
    ia64: Move EXCEPTION_TABLE to RO_DATA segment
    h8300: Move EXCEPTION_TABLE to RO_DATA segment
    c6x: Move EXCEPTION_TABLE to RO_DATA segment
    arm64: Move EXCEPTION_TABLE to RO_DATA segment
    alpha: Move EXCEPTION_TABLE to RO_DATA segment
    x86/vmlinux: Move EXCEPTION_TABLE to RO_DATA segment
    x86/vmlinux: Actually use _etext for the end of the text segment
    vmlinux.lds.h: Allow EXCEPTION_TABLE to live in RO_DATA
    ...

    Linus Torvalds
     

26 Nov, 2019

1 commit

  • Pull printk updates from Petr Mladek:

    - Allow to print symbolic error names via new %pe modifier.

    - Use pr_warn() instead of the remaining pr_warning() calls. Fix
    formatting of the related lines.

    - Add VSPRINTF entry to MAINTAINERS.

    * tag 'printk-for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/pmladek/printk: (32 commits)
    checkpatch: don't warn about new vsprintf pointer extension '%pe'
    MAINTAINERS: Add VSPRINTF
    tools lib api: Renaming pr_warning to pr_warn
    ASoC: samsung: Use pr_warn instead of pr_warning
    lib: cpu_rmap: Use pr_warn instead of pr_warning
    trace: Use pr_warn instead of pr_warning
    dma-debug: Use pr_warn instead of pr_warning
    vgacon: Use pr_warn instead of pr_warning
    fs: afs: Use pr_warn instead of pr_warning
    sh/intc: Use pr_warn instead of pr_warning
    scsi: Use pr_warn instead of pr_warning
    platform/x86: intel_oaktrail: Use pr_warn instead of pr_warning
    platform/x86: asus-laptop: Use pr_warn instead of pr_warning
    platform/x86: eeepc-laptop: Use pr_warn instead of pr_warning
    oprofile: Use pr_warn instead of pr_warning
    of: Use pr_warn instead of pr_warning
    macintosh: Use pr_warn instead of pr_warning
    idsn: Use pr_warn instead of pr_warning
    ide: Use pr_warn instead of pr_warning
    crypto: n2: Use pr_warn instead of pr_warning
    ...

    Linus Torvalds
     

15 Nov, 2019

2 commits

  • The itimer handling for the old alpha osf_setitimer/osf_getitimer
    system calls is identical to the compat version of getitimer/setitimer,
    so just use those directly.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • There are two 'struct timeval' fields in 'struct rusage'.

    Unfortunately the definition of timeval is now ambiguous when used in
    user space with a libc that has a 64-bit time_t, and this also changes
    the 'rusage' definition in user space in a way that is incompatible with
    the system call interface.

    While there is no good solution to avoid all ambiguity here, change
    the definition in the kernel headers to be compatible with the kernel
    ABI, using __kernel_old_timeval as an unambiguous base type.

    In previous discussions, there was also a plan to add a replacement
    for rusage based on 64-bit timestamps and nanosecond resolution,
    i.e. 'struct __kernel_timespec'. I have patches for that as well,
    if anyone thinks we should do that.

    Reviewed-by: Cyrill Gorcunov
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
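
The layout mismatch described in this commit can be modelled with fixed-width types. This is a rough sketch, assuming a 32-bit architecture where the kernel ABI uses two 32-bit fields per timeval and a libc with 64-bit time_t uses two 64-bit fields; all demo_ names and the trimmed-down rusage structs are hypothetical, and real libc layouts may differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of the kernel ABI timeval on a 32-bit architecture (assumption). */
struct demo_old_timeval32 {
    int32_t tv_sec;
    int32_t tv_usec;
};

/* Model of a timeval as seen through a libc with 64-bit time_t (assumption). */
struct demo_timeval64 {
    int64_t tv_sec;
    int64_t tv_usec;
};

/* Only the first three rusage fields are modelled here. */
struct demo_rusage_kernel_abi {
    struct demo_old_timeval32 ru_utime;
    struct demo_old_timeval32 ru_stime;
    int32_t ru_maxrss;
};

struct demo_rusage_libc_time64 {
    struct demo_timeval64 ru_utime;
    struct demo_timeval64 ru_stime;
    int32_t ru_maxrss;
};

/* Offset at which the kernel writes ru_maxrss. */
size_t demo_maxrss_offset_kernel_abi(void)
{
    return offsetof(struct demo_rusage_kernel_abi, ru_maxrss);
}

/* Offset at which user space would read ru_maxrss with the wider timeval. */
size_t demo_maxrss_offset_libc_time64(void)
{
    return offsetof(struct demo_rusage_libc_time64, ru_maxrss);
}
```

In this model the kernel places ru_maxrss at offset 16 while the wider definition expects it at offset 32, which is the kind of incompatibility that an unambiguous base type in the kernel headers avoids.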
     

12 Nov, 2019

1 commit


05 Nov, 2019

1 commit

  • Since the EXCEPTION_TABLE is read-only, collapse it into RO_DATA.

    Signed-off-by: Kees Cook
    Signed-off-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: Geert Uytterhoeven
    Cc: Heiko Carstens
    Cc: Ivan Kokshaysky
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-c6x-dev@linux-c6x.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-s390@vger.kernel.org
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Richard Henderson
    Cc: Rick Edgecombe
    Cc: Segher Boessenkool
    Cc: Will Deacon
    Cc: x86@kernel.org
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20191029211351.13243-18-keescook@chromium.org

    Kees Cook
     

04 Nov, 2019

6 commits

  • Rename RW_DATA_SECTION to RW_DATA. (Calling this a "section" is a lie,
    since it's multiple sections and section flags cannot be applied to
    the macro.)

    Signed-off-by: Kees Cook
    Signed-off-by: Borislav Petkov
    Acked-by: Heiko Carstens # s390
    Acked-by: Geert Uytterhoeven # m68k
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-c6x-dev@linux-c6x.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Rick Edgecombe
    Cc: Segher Boessenkool
    Cc: Will Deacon
    Cc: x86-ml
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20191029211351.13243-14-keescook@chromium.org

    Kees Cook
     
  • There's no reason to keep the RODATA macro: replace the callers with
    the expected RO_DATA macro.

    Signed-off-by: Kees Cook
    Signed-off-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-c6x-dev@linux-c6x.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Rick Edgecombe
    Cc: Segher Boessenkool
    Cc: Will Deacon
    Cc: x86-ml
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20191029211351.13243-12-keescook@chromium.org

    Kees Cook
     
  • The .notes section should be non-executable read-only data. As such,
    move it to the RO_DATA macro instead of being per-architecture defined.

    Signed-off-by: Kees Cook
    Signed-off-by: Borislav Petkov
    Acked-by: Heiko Carstens # s390
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-c6x-dev@linux-c6x.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Rick Edgecombe
    Cc: Segher Boessenkool
    Cc: Will Deacon
    Cc: x86-ml
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20191029211351.13243-11-keescook@chromium.org

    Kees Cook
     
  • In preparation for moving NOTES into RO_DATA, make the Program Header
    assignment restoration be part of the NOTES macro itself.

    Signed-off-by: Kees Cook
    Signed-off-by: Borislav Petkov
    Acked-by: Heiko Carstens # s390
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-c6x-dev@linux-c6x.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Rick Edgecombe
    Cc: Segher Boessenkool
    Cc: Will Deacon
    Cc: x86-ml
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20191029211351.13243-10-keescook@chromium.org

    Kees Cook
     
  • In preparation for moving NOTES into RO_DATA, provide a mechanism for
    architectures that want to emit a PT_NOTE Program Header to do so.

    Signed-off-by: Kees Cook
    Signed-off-by: Borislav Petkov
    Acked-by: Heiko Carstens # s390
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-c6x-dev@linux-c6x.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Rick Edgecombe
    Cc: Segher Boessenkool
    Cc: Will Deacon
    Cc: x86-ml
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20191029211351.13243-9-keescook@chromium.org

    Kees Cook
     
  • In preparation for moving NOTES into RO_DATA, rename the linker script
    internal identifier for the PT_LOAD Program Header from "kernel" to
    "text" to match other architectures.

    Signed-off-by: Kees Cook
    Signed-off-by: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Dave Hansen
    Cc: Heiko Carstens
    Cc: Ivan Kokshaysky
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-c6x-dev@linux-c6x.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-s390@vger.kernel.org
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Richard Henderson
    Cc: Rick Edgecombe
    Cc: Segher Boessenkool
    Cc: Will Deacon
    Cc: x86-ml
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20191029211351.13243-5-keescook@chromium.org

    Kees Cook
     

18 Oct, 2019

1 commit

  • As said in commit f2c2cbcc35d4 ("powerpc: Use pr_warn instead of
    pr_warning"), remove pr_warning so that all logging messages use a
    consistent <prefix>_warn style. Let's do it.

    Link: http://lkml.kernel.org/r/20191018031850.48498-1-wangkefeng.wang@huawei.com
    To: linux-kernel@vger.kernel.org
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Kefeng Wang
    Reviewed-by: Sergey Senozhatsky
    Signed-off-by: Petr Mladek

    Kefeng Wang
     

14 Oct, 2019

1 commit

  • Code that iterates over all standard PCI BARs typically uses
    PCI_STD_RESOURCE_END. However, that requires the unusual test
    "i <= PCI_STD_RESOURCE_END" rather than something like
    "i < PCI_STD_NUM_BARS".

    Add a definition for PCI_STD_NUM_BARS and change loops to use the more
    idiomatic C style to help avoid fencepost errors.

    Link: https://lore.kernel.org/r/20190927234026.23342-1-efremov@linux.com
    Link: https://lore.kernel.org/r/20190927234308.23935-1-efremov@linux.com
    Link: https://lore.kernel.org/r/20190916204158.6889-3-efremov@linux.com
    Signed-off-by: Denis Efremov
    Signed-off-by: Bjorn Helgaas
    Acked-by: Sebastian Ott # arch/s390/
    Acked-by: Bartlomiej Zolnierkiewicz # video/fbdev/
    Acked-by: Gustavo Pimentel # pci/controller/dwc/
    Acked-by: Jack Wang # scsi/pm8001/
    Acked-by: Martin K. Petersen # scsi/pm8001/
    Acked-by: Ulf Hansson # memstick/

    Denis Efremov
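
The two loop styles contrasted in this commit can be sketched side by side. The DEMO_ constants mirror the values of the kernel's PCI_STD_RESOURCE_END (5) and PCI_STD_NUM_BARS (6) for illustration only, and both helper functions are hypothetical.

```c
#include <stddef.h>

#define DEMO_PCI_STD_RESOURCE_END 5  /* index of the last standard BAR */
#define DEMO_PCI_STD_NUM_BARS     6  /* number of standard BARs */

/* Old style: inclusive upper bound, easy to mistype as "<". */
size_t demo_count_bars_inclusive(const unsigned long *bar_len)
{
    size_t n = 0;
    int i;

    for (i = 0; i <= DEMO_PCI_STD_RESOURCE_END; i++)
        if (bar_len[i])
            n++;
    return n;
}

/* New style: the usual half-open C loop over a count. */
size_t demo_count_bars_half_open(const unsigned long *bar_len)
{
    size_t n = 0;
    int i;

    for (i = 0; i < DEMO_PCI_STD_NUM_BARS; i++)
        if (bar_len[i])
            n++;
    return n;
}
```

Both functions visit the same six entries; writing "<" with the END constant by mistake would silently skip the last BAR, which is the fencepost error the new definition is meant to prevent.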
     

26 Sep, 2019

2 commits

  • When a process expects no accesses to a certain memory range for a long
    time, it could hint kernel that the pages can be reclaimed instantly but
    data should be preserved for future use. This could reduce workingset
    eviction so it ends up increasing performance.

    This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
    MADV_PAGEOUT can be used by a process to mark a memory range as not
    expected to be used for a long time so that kernel reclaims *any LRU*
    pages instantly. The hint can help kernel in deciding which pages to
    evict proactively.

    A note: It intentionally doesn't apply the SWAP_CLUSTER_MAX LRU page
    isolation limit because it's automatically bounded by the PMD size. If
    the PMD size (e.g., 256) causes trouble, we could fix it later by
    limiting it to SWAP_CLUSTER_MAX [1].

    - man-page material

    MADV_PAGEOUT (since Linux x.x)

    Do not expect access in the near future so pages in the specified
    regions could be reclaimed instantly regardless of memory pressure.
    Thus, access in the range after successful operation could cause
    major page fault but never lose the up-to-date contents unlike
    MADV_DONTNEED. Pages belonging to a shared mapping are only processed
    if a write access is allowed for the calling process.

    MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
    VM_PFNMAP pages.

    [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/

    [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
    Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
    [akpm@linux-foundation.org: resolve conflicts with hmm.git]
    Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: kbuild test robot
    Acked-by: Michal Hocko
    Cc: James E.J. Bottomley
    Cc: Richard Henderson
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Joel Fernandes (Google)
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
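
From user space, the hint described in this commit might be exercised roughly as follows. This is a minimal sketch, not the patch's own test: demo_hint_preserves() is a hypothetical helper, and the fallback value 21 for MADV_PAGEOUT is an assumption based on the generic Linux UAPI headers, used only when the libc headers predate the flag.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for madvise() and MAP_ANONYMOUS in strict modes */
#endif
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21     /* assumption: value from the generic UAPI headers */
#endif

/*
 * Fill a private anonymous mapping, apply the given madvise() hint, and
 * report whether the bytes are still there afterwards.
 * Returns 1 if the contents survived, 0 if they changed, -1 if mmap() failed.
 */
int demo_hint_preserves(int advice, size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    size_t i;
    int intact = 1;

    if (p == MAP_FAILED)
        return -1;

    memset(p, 0xAB, len);

    /* Only a hint: a kernel that does not know the advice value rejects it
     * with EINVAL, in which case the data is simply left untouched. */
    (void)madvise(p, len, advice);

    /* Touching the range again may fault pages back in, but the data must
     * be the data that was written. */
    for (i = 0; i < len; i++) {
        if (p[i] != 0xAB) {
            intact = 0;
            break;
        }
    }

    munmap(p, len);
    return intact;
}
```

Calling demo_hint_preserves(MADV_PAGEOUT, 1 << 16) is expected to report intact contents whether or not the kernel actually reclaimed the pages. By contrast, madvise(2) documents that MADV_DONTNEED on a private anonymous mapping makes later reads see zero-filled pages.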
     
  • Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

    - Background

    The Android terminology used for forking a new process and starting an app
    from scratch is a cold start, while resuming an existing app is a hot
    start. While we continually try to improve the performance of cold
    starts, hot starts will always be significantly less power hungry as well
    as faster so we are trying to make hot start more likely than cold start.

    To increase hot start, Android userspace manages the order that apps
    should be killed in a process called ActivityManagerService.
    ActivityManagerService tracks every Android app or service that the user
    could be interacting with at any time and translates that into a ranked
    list for lmkd (low memory killer daemon). They are likely to be killed by
    lmkd if the system has to reclaim memory. In that sense they are similar
    to entries in any other cache. Those apps are kept alive for
    opportunistic performance improvements but those performance improvements
    will vary based on the memory requirements of individual workloads.

    - Problem

    Naturally, cached apps were dominant consumers of memory on the system.
    However, they were not significant consumers of swap even though they are
    good candidates for swap. Under investigation, swapping out only begins
    once the low zone watermark is hit and kswapd wakes up, but the overall
    allocation rate in the system might trip lmkd thresholds and cause a
    cached process to be killed (we measured the performance of swapping out
    vs. zapping the memory by killing a process; unsurprisingly, zapping is
    10x faster even though we use zram, which is much faster than real
    storage), so a kill from lmkd will often satisfy the high zone watermark,
    resulting in very few pages actually being moved to swap.

    - Approach

    The approach we chose was to use a new interface to allow userspace to
    proactively reclaim entire processes by leveraging platform information.
    This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
    that are known to be cold from userspace and to avoid races with lmkd by
    reclaiming apps as soon as they entered the cached state. Additionally,
    it could provide many chances for platform to use much information to
    optimize memory efficiency.

    To achieve the goal, the patchset introduces two new options for madvise.
    One is MADV_COLD which will deactivate activated pages and the other is
    MADV_PAGEOUT which will reclaim private pages instantly. These new
    options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
    ways to gain some free memory space. MADV_PAGEOUT is similar to
    MADV_DONTNEED in a way that it hints the kernel that memory region is not
    currently needed and should be reclaimed immediately; MADV_COLD is similar
    to MADV_FREE in a way that it hints the kernel that memory region is not
    currently needed and should be reclaimed when memory pressure rises.

    This patch (of 5):

    When a process expects no accesses to a certain memory range, it could
    give a hint to kernel that the pages can be reclaimed when memory pressure
    happens but data should be preserved for future use. This could reduce
    workingset eviction so it ends up increasing performance.

    This patch introduces the new MADV_COLD hint to madvise(2) syscall.
    MADV_COLD can be used by a process to mark a memory range as not expected
    to be used in the near future. The hint can help kernel in deciding which
    pages to evict early during memory pressure.

    It works for every LRU page like MADV_[DONTNEED|FREE]. IOW, it moves

    active file page -> inactive file LRU
    active anon page -> inactive anon LRU

    Unlike MADV_FREE, it doesn't move active anonymous pages to the head of
    the inactive file LRU because MADV_COLD has slightly different semantics.
    MADV_FREE means it's okay to discard the pages under memory pressure
    because the content of the page is *garbage*, so freeing such pages has
    almost zero overhead: we don't need to swap out, and a later access
    causes just a minor fault. Thus, it makes sense to put those freeable
    pages on the inactive file LRU to compete with other used-once pages. It
    makes sense from an implementation point of view, too, because the
    memory is no longer swap-backed until it is re-dirtied. It even gives a
    bonus: such pages can be reclaimed on a swapless system. However,
    MADV_COLD doesn't mean garbage, so reclaiming those pages requires
    swap-out/in in the end, which is a bigger cost. Since we have designed
    VM LRU aging based on a cost model, anonymous cold pages are better
    placed on the inactive anon LRU list, not the file LRU. Furthermore, it
    helps to avoid unnecessary scanning if the system doesn't have a swap
    device. Let's start the simpler way without adding complexity at this
    moment. However, keep in mind the caveat that workloads with a lot of
    page cache are likely to ignore MADV_COLD on anonymous memory because we
    rarely age anonymous LRU lists.

    * man-page material

    MADV_COLD (since Linux x.x)

    Pages in the specified regions will be treated as less-recently-accessed
    compared to pages in the system with similar access frequencies. In
    contrast to MADV_FREE, the contents of the region are preserved regardless
    of subsequent writes to pages.

    MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
    pages.

    [akpm@linux-foundation.org: resolve conflicts with hmm.git]
    Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: kbuild test robot
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: James E.J. Bottomley
    Cc: Richard Henderson
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: Johannes Weiner
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Joel Fernandes (Google)
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
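
A process could apply the MADV_COLD hint to just part of a mapping, for example the half it does not expect to touch soon. The sketch below assumes the fallback value 20 for MADV_COLD (taken from the generic Linux UAPI headers, used only if the libc headers lack it); demo_mark_cold() and demo_cold_tail_keeps_data() are hypothetical helpers, not part of the patch.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* for madvise() and MAP_ANONYMOUS in strict modes */
#endif
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLD
#define MADV_COLD 20        /* assumption: value from the generic UAPI headers */
#endif

/*
 * Returns 0 if the kernel accepted the hint, otherwise the errno value.
 * EINVAL can mean an older kernel, or a range the hint cannot be applied to.
 */
int demo_mark_cold(void *addr, size_t len)
{
    return madvise(addr, len, MADV_COLD) == 0 ? 0 : errno;
}

/*
 * Map 2 * npages pages, fill them, mark only the second half cold, and
 * check that every byte is still intact.
 * Returns 1 if intact, 0 if not, -1 on setup failure.
 */
int demo_cold_tail_keeps_data(size_t npages)
{
    long psz = sysconf(_SC_PAGESIZE);
    size_t half, total, i;
    unsigned char *p;
    int intact = 1;

    if (psz <= 0 || npages == 0)
        return -1;

    half = npages * (size_t)psz;
    total = 2 * half;

    p = mmap(NULL, total, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memset(p, 0x5A, total);

    /* The start address must be page aligned; p + half is. The result is
     * ignored here because the hint is purely advisory. */
    (void)demo_mark_cold(p + half, half);

    for (i = 0; i < total; i++) {
        if (p[i] != 0x5A) {
            intact = 0;
            break;
        }
    }

    munmap(p, total);
    return intact;
}
```

Because the hint only changes how the pages are aged, the data check passes regardless of whether the running kernel knows MADV_COLD. A caller that cares can inspect the return value of demo_mark_cold() to decide whether to fall back to some other strategy.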
     

25 Sep, 2019

2 commits

  • Both pgtable_cache_init() and pgd_cache_init() are used to initialize kmem
    cache for page table allocations on several architectures that do not use
    PAGE_SIZE tables for one or more levels of the page table hierarchy.

    Most architectures do not implement these functions and use the __weak
    default NOP implementation of pgd_cache_init(). Since there is no such
    default for pgtable_cache_init(), its empty stub is duplicated among most
    architectures.

    Rename the definitions of pgd_cache_init() to pgtable_cache_init() and
    drop empty stubs of pgtable_cache_init().
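
    The weak-default pattern described above can be sketched in user-space C
    as follows. This is an illustration under stated assumptions, not the
    kernel code: the kernel's __weak is spelled out here as the GCC/Clang
    attribute, and caches_created and init_page_tables() are hypothetical
    names added so the effect is observable.

```c
/* Hypothetical counter; a strong override in another object file could
 * increment it to show that it ran. */
int caches_created;

/*
 * Weak default: a no-op that is used only when nothing else in the link
 * provides pgtable_cache_init(). An architecture that needs a kmem cache
 * would supply a strong (non-weak) definition in its own object file and
 * the linker would pick that one instead, so no empty per-arch stub is
 * required.
 */
__attribute__((weak)) void pgtable_cache_init(void)
{
}

/* Hypothetical stand-in for early init code that calls the hook
 * unconditionally. */
int init_page_tables(void)
{
    pgtable_cache_init();
    return caches_created;
}
```

    With only the weak definition linked in, init_page_tables() calls the
    no-op and leaves the counter at zero.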

    Link: http://lkml.kernel.org/r/1566457046-22637-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Will Deacon [arm64]
    Acked-by: Thomas Gleixner [x86]
    Cc: Catalin Marinas
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "mm: remove quicklist page table caches".

    A while ago Nicholas proposed to remove quicklist page table caches [1].

    I've rebased his patch on the current upstream and switched ia64 and sh to
    use generic versions of PTE allocation.

    [1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com

    This patch (of 3):

    Remove page table allocator "quicklists". These have been around for a
    long time, but have not got much traction in the last decade and are only
    used on ia64 and sh architectures.

    The numbers in the initial commit look interesting but probably don't
    apply anymore. If anybody wants to resurrect this it's in the git
    history, but it's unhelpful to have this code and divergent allocator
    behaviour for minor archs.

    Also it might be better to instead make more general improvements to the
    page allocator if this is still so slow.

    Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Mike Rapoport
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

20 Sep, 2019

1 commit

  • Pull Kbuild updates from Masahiro Yamada:

    - add a modpost warning for exported symbols marked as 'static', because
    'static' and EXPORT_SYMBOL is an odd combination

    - break the build early if gold linker is used

    - optimize the Bison rule to produce .c and .h files by a single
    pattern rule

    - handle PREEMPT_RT in the module vermagic and UTS_VERSION

    - warn about CONFIG options leaked to user space, except existing ones

    - make single targets work properly

    - rebuild modules when module linker scripts are updated

    - split the module final link stage into scripts/Makefile.modfinal

    - fix the missed error code in merge_config.sh

    - improve the error message displayed when an O= build is attempted in
    an unclean source tree

    - remove 'clean-dirs' syntax

    - disable -Wimplicit-fallthrough warning for Clang

    - add CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3 for ARC

    - remove ARCH_{CPP,A,C}FLAGS variables

    - add $(BASH) to run bash scripts

    - change *CFLAGS_.o to take the relative path to $(obj)
    instead of the basename

    - stop suppressing Clang's -Wunused-function warnings when W=1

    - fix linux/export.h to avoid genksyms calculating CRC of trimmed
    exported symbols

    - misc cleanups

    * tag 'kbuild-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (63 commits)
    genksyms: convert to SPDX License Identifier for lex.l and parse.y
    modpost: use __section in the output to *.mod.c
    modpost: use MODULE_INFO() for __module_depends
    export.h, genksyms: do not make genksyms calculate CRC of trimmed symbols
    export.h: remove defined(__KERNEL__), which is no longer needed
    kbuild: allow Clang to find unused static inline functions for W=1 build
    kbuild: rename KBUILD_ENABLE_EXTRA_GCC_CHECKS to KBUILD_EXTRA_WARN
    kbuild: refactor scripts/Makefile.extrawarn
    merge_config.sh: ignore unwanted grep errors
    kbuild: change *FLAGS_.o to take the path relative to $(obj)
    modpost: add NOFAIL to strndup
    modpost: add guid_t type definition
    kbuild: add $(BASH) to run scripts with bash-extension
    kbuild: remove ARCH_{CPP,A,C}FLAGS
    kbuild,arc: add CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3 for ARC
    kbuild: Do not enable -Wimplicit-fallthrough for clang for now
    kbuild: clean up subdir-ymn calculation in Makefile.clean
    kbuild: remove unneeded '+' marker from cmd_clean
    kbuild: remove clean-dirs syntax
    kbuild: check clean srctree even earlier
    ...

    Linus Torvalds
     

04 Sep, 2019

1 commit

  • While the default ->mmap and ->get_sgtable implementations work for the
    majority of our dma_map_ops implementations, they are inherently unsafe
    for others that don't use the page allocator or CMA and/or use their
    own way of remapping not covered by the common code. So remove the
    defaults if these methods are not wired up, but instead wire up the
    default implementations for all safe instances.
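
    The resulting dispatch rule (no silent fallback, explicit opt-in to the
    common helper) might look roughly like the sketch below. All names
    prefixed with demo_ are hypothetical, the ops table is heavily
    simplified compared to struct dma_map_ops, and the -ENXIO error code is
    an assumption chosen for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified ops table; the real one has more members and
 * different signatures. */
struct demo_map_ops {
    int (*mmap)(void *cpu_addr, size_t size);
};

/* Hypothetical common implementation, only valid for memory that came
 * from the page allocator or CMA. */
int demo_common_mmap(void *cpu_addr, size_t size)
{
    (void)cpu_addr;
    (void)size;
    return 0;
}

/* Safe instance: opts in by wiring up the common helper explicitly. */
const struct demo_map_ops demo_safe_ops = { .mmap = demo_common_mmap };

/* Instance with its own remapping scheme: leaves the method unset. */
const struct demo_map_ops demo_special_ops = { .mmap = NULL };

/* Dispatcher: a missing method is reported as an error instead of
 * falling back to a default that may be unsafe for this instance. */
int demo_mmap(const struct demo_map_ops *ops, void *cpu_addr, size_t size)
{
    if (!ops->mmap)
        return -ENXIO; /* assumption: error code for illustration */
    return ops->mmap(cpu_addr, size);
}
```

    Here demo_mmap() succeeds for demo_safe_ops and fails with -ENXIO for
    demo_special_ops, which mirrors the intent of wiring up defaults only
    where they are known to be safe.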

    Fixes: e1c7e324539a ("dma-mapping: always provide the dma_map_ops based implementation")
    Signed-off-by: Christoph Hellwig

    Christoph Hellwig
     

22 Aug, 2019

1 commit

  • Add CONFIG_ASM_MODVERSIONS. This allows removing one level of
    if-conditional nesting in scripts/Makefile.build.

    scripts/Makefile.build is run every time Kbuild descends into a
    sub-directory. So, I want to avoid $(wildcard ...) evaluation
    where possible, although computing $(wildcard ...) is so cheap that
    it may not make a measurable performance difference.

    Signed-off-by: Masahiro Yamada
    Acked-by: Geert Uytterhoeven

    Masahiro Yamada