06 Sep, 2019

1 commit

  • If a request_key authentication token key gets revoked, there's a window in
    which request_key_auth_describe() can see it with a NULL payload - but it
    makes no check for this and something like the following oops may occur:

    BUG: Kernel NULL pointer dereference at 0x00000038
    Faulting instruction address: 0xc0000000004ddf30
    Oops: Kernel access of bad area, sig: 11 [#1]
    ...
    NIP [...] request_key_auth_describe+0x90/0xd0
    LR [...] request_key_auth_describe+0x54/0xd0
    Call Trace:
    [...] request_key_auth_describe+0x54/0xd0 (unreliable)
    [...] proc_keys_show+0x308/0x4c0
    [...] seq_read+0x3d0/0x540
    [...] proc_reg_read+0x90/0x110
    [...] __vfs_read+0x3c/0x70
    [...] vfs_read+0xb4/0x1b0
    [...] ksys_read+0x7c/0x130
    [...] system_call+0x5c/0x70

    Fix this by checking for a NULL pointer when describing such a key.

    Also make the read routine check for a NULL pointer to be on the safe side.

    [DH: Modified to not take already-held rcu lock and modified to also check
    in the read routine]

    Fixes: 04c567d9313e ("[PATCH] Keys: Fix race between two instantiators of a key")
    Reported-by: Sachin Sant
    Signed-off-by: Hillf Danton
    Signed-off-by: David Howells
    Tested-by: Sachin Sant
    Signed-off-by: Linus Torvalds

    Hillf Danton
     

31 Aug, 2019

1 commit


14 Aug, 2019

1 commit

  • Commit c78719203fc6 ("KEYS: trusted: allow trusted.ko to initialize w/o a
    TPM") allows the trusted module to be loaded even if a TPM is not found, to
    avoid module dependency problems.

    However, trusted module initialization can still fail if the TPM is
    inactive or deactivated, because tpm_get_random() returns an error.

    This patch removes the call to tpm_get_random() and instead extends the PCR
    specified by the user with zeros. The security of this alternative is
    equivalent to the previous one, as either option uses a PCR update to
    prevent unsealing and misuse of sealed data by a user space process.

    Even if a PCR is extended with zeros, instead of random data, it is still
    computationally infeasible to find a value as input for a new PCR extend
    operation, to obtain again the PCR value that would allow unsealing.

    Cc: stable@vger.kernel.org
    Fixes: 240730437deb ("KEYS: trusted: explicitly use tpm_chip structure...")
    Signed-off-by: Roberto Sassu
    Reviewed-by: Tyler Hicks
    Suggested-by: Mimi Zohar
    Reviewed-by: Jarkko Sakkinen
    Signed-off-by: Jarkko Sakkinen

    Roberto Sassu
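
The argument in the commit above can be illustrated with a small userspace model. This is only a sketch, assuming the TPM 1.2 extend rule (new PCR = SHA-1 of the old PCR concatenated with the extend data); `pcr_extend`, `sealed_pcr` and `locked_pcr` are hypothetical names for illustration, not kernel or TPM APIs.

```python
import hashlib

PCR_SIZE = 20  # SHA-1 digest length, as used by TPM 1.2 PCRs


def pcr_extend(pcr: bytes, data: bytes) -> bytes:
    """Model of a TPM 1.2 extend: new_pcr = SHA1(old_pcr || data)."""
    if len(pcr) != PCR_SIZE or len(data) != PCR_SIZE:
        raise ValueError("PCR and extend data must both be 20 bytes")
    return hashlib.sha1(pcr + data).digest()


# Hypothetical PCR value that some blob was sealed against.
sealed_pcr = hashlib.sha1(b"example measurement").digest()

# Extending with zeros (instead of random data) still moves the PCR
# away from the value required for unsealing.
locked_pcr = pcr_extend(sealed_pcr, bytes(PCR_SIZE))
```

Getting back to `sealed_pcr` from `locked_pcr` would require finding an input whose hash equals the sealed value, which is the preimage problem the commit message refers to as computationally infeasible.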
     

03 Aug, 2019

1 commit


01 Aug, 2019

1 commit

  • Since roles_init() adds some entries to the role hash table, we also need
    to destroy its keys/values on error, otherwise we get a memory leak in
    the error path.

    Cc:
    Reported-by: syzbot+fee3a14d4cdf92646287@syzkaller.appspotmail.com
    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Ondrej Mosnacek
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     

29 Jul, 2019

1 commit


27 Jul, 2019

1 commit


26 Jul, 2019

1 commit

  • The combination of KASAN_STACK and GCC_PLUGIN_STRUCTLEAK_BYREF
    leads to much larger kernel stack usage, as seen from the warnings
    about functions that now exceed the 2048 byte limit:

    drivers/media/i2c/tvp5150.c:253:1: error: the frame size of 3936 bytes is larger than 2048 bytes
    drivers/media/tuners/r820t.c:1327:1: error: the frame size of 2816 bytes is larger than 2048 bytes
    drivers/net/wireless/broadcom/brcm80211/brcmsmac/phy/phy_n.c:16552:1: error: the frame size of 3144 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
    fs/ocfs2/aops.c:1892:1: error: the frame size of 2088 bytes is larger than 2048 bytes
    fs/ocfs2/dlm/dlmrecovery.c:737:1: error: the frame size of 2088 bytes is larger than 2048 bytes
    fs/ocfs2/namei.c:1677:1: error: the frame size of 2584 bytes is larger than 2048 bytes
    fs/ocfs2/super.c:1186:1: error: the frame size of 2640 bytes is larger than 2048 bytes
    fs/ocfs2/xattr.c:3678:1: error: the frame size of 2176 bytes is larger than 2048 bytes
    net/bluetooth/l2cap_core.c:7056:1: error: the frame size of 2144 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]
    net/bluetooth/l2cap_core.c: In function 'l2cap_recv_frame':
    net/bridge/br_netlink.c:1505:1: error: the frame size of 2448 bytes is larger than 2048 bytes
    net/ieee802154/nl802154.c:548:1: error: the frame size of 2232 bytes is larger than 2048 bytes
    net/wireless/nl80211.c:1726:1: error: the frame size of 2224 bytes is larger than 2048 bytes
    net/wireless/nl80211.c:2357:1: error: the frame size of 4584 bytes is larger than 2048 bytes
    net/wireless/nl80211.c:5108:1: error: the frame size of 2760 bytes is larger than 2048 bytes
    net/wireless/nl80211.c:6472:1: error: the frame size of 2112 bytes is larger than 2048 bytes

    The structleak plugin was previously disabled for CONFIG_COMPILE_TEST,
    but that meant we missed some bugs, so this time we should address them.

    The frame size warnings are distracting, and risking a kernel stack
    overflow is generally not beneficial to performance, so it may be best
    to disallow that particular combination. This can be done by turning
    off either one. I picked the dependency in GCC_PLUGIN_STRUCTLEAK_BYREF
    and GCC_PLUGIN_STRUCTLEAK_BYREF_ALL, as this option is designed to
    make uninitialized stack usage less harmful when enabled on its own,
    but it also prevents KASAN from detecting those cases in which it was
    in fact needed.

    KASAN_STACK is currently implied by KASAN on gcc, but could be made a
    user selectable option if we want to allow combining (non-stack) KASAN
    with GCC_PLUGIN_STRUCTLEAK_BYREF.

    Note that it would be possible to specifically address the files that
    print the warning, but presumably the overall stack usage is still
    significantly higher than in other configurations, so this would not
    address the full problem.

    I could not test this with CONFIG_INIT_STACK_ALL, which may or may not
    suffer from a similar problem.

    Fixes: 81a56f6dcd20 ("gcc-plugins: structleak: Generalize to all variable types")
    Signed-off-by: Arnd Bergmann
    Link: https://lore.kernel.org/r/20190722114134.3123901-1-arnd@arndb.de
    Signed-off-by: Kees Cook

    Arnd Bergmann
     

24 Jul, 2019

1 commit

  • We need to error out when trying to add an entry above SIDTAB_MAX in
    sidtab_reverse_lookup() to avoid overflow on the odd chance that this
    happens.

    Cc: stable@vger.kernel.org
    Fixes: ee1a84fdfeed ("selinux: overhaul sidtab to fix bug and improve performance")
    Signed-off-by: Ondrej Mosnacek
    Reviewed-by: Kees Cook
    Signed-off-by: Paul Moore

    Ondrej Mosnacek
     

20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

19 Jul, 2019

1 commit

  • In the sysctl code the proc_dointvec_minmax() function is often used to
    validate that the user-supplied value lies within an allowed range. This
    function uses the extra1 and extra2 members from struct ctl_table as
    the minimum and maximum allowed values.

    On sysctl handler declaration, in every source file there are some
    read-only variables containing just an integer whose address is assigned
    to the extra1 and extra2 members, so the sysctl range is enforced.

    The special values 0, 1 and INT_MAX are very often used as range
    boundaries, leading to duplication of variables like zero=0, one=1,
    int_max=INT_MAX in different source files:

    $ git grep -E '\.extra[12].*&(zero|one|int_max)' |wc -l
    248

    Add a const int array containing the most commonly used values, some
    macros to refer more easily to the correct array member, and use them
    instead of creating a local one for every object file.

    This is the bloat-o-meter output comparing the old and new binary
    compiled with the default Fedora config:

    # scripts/bloat-o-meter -d vmlinux.o.old vmlinux.o
    add/remove: 2/2 grow/shrink: 0/2 up/down: 24/-188 (-164)
    Data                          old     new   delta
    sysctl_vals                     -      12     +12
    __kstrtab_sysctl_vals           -      12     +12
    max                            14      10      -4
    int_max                        16       -     -16
    one                            68       -     -68
    zero                          128      28    -100
    Total: Before=20583249, After=20583085, chg -0.00%

    [mcroce@redhat.com: tipc: remove two unused variables]
    Link: http://lkml.kernel.org/r/20190530091952.4108-1-mcroce@redhat.com
    [akpm@linux-foundation.org: fix net/ipv6/sysctl_net_ipv6.c]
    [arnd@arndb.de: proc/sysctl: make firmware loader table conditional]
    Link: http://lkml.kernel.org/r/20190617130014.1713870-1-arnd@arndb.de
    [akpm@linux-foundation.org: fix fs/eventpoll.c]
    Link: http://lkml.kernel.org/r/20190430180111.10688-1-mcroce@redhat.com
    Signed-off-by: Matteo Croce
    Signed-off-by: Arnd Bergmann
    Acked-by: Kees Cook
    Reviewed-by: Aaron Tomlin
    Cc: Matthew Wilcox
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
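
The idea of one shared table of boundary values can be sketched as a userspace model. This is an illustration under assumptions, not the kernel implementation: in the kernel, extra1 and extra2 are pointers into a const int array, whereas here they are modeled as indices into a tuple, and `dointvec_minmax` is a hypothetical stand-in for the range check only.

```python
import errno

INT_MAX = 2**31 - 1

# One shared table instead of per-file zero/one/int_max variables.
SYSCTL_VALS = (0, 1, INT_MAX)

# Indices into SYSCTL_VALS (the kernel uses pointers to array members).
SYSCTL_ZERO, SYSCTL_ONE, SYSCTL_INT_MAX = range(3)


def dointvec_minmax(value, extra1=None, extra2=None):
    """Return 0 if value lies within the configured bounds, else -EINVAL.

    extra1 and extra2 select the minimum and maximum from SYSCTL_VALS;
    None means that side of the range is unbounded.
    """
    if extra1 is not None and value < SYSCTL_VALS[extra1]:
        return -errno.EINVAL
    if extra2 is not None and value > SYSCTL_VALS[extra2]:
        return -errno.EINVAL
    return 0
```

A boolean-style sysctl would then be declared with `SYSCTL_ZERO` and `SYSCTL_ONE` as its bounds, with no file-local variables needed.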
     

17 Jul, 2019

1 commit

  • Pull rst conversion of docs from Mauro Carvalho Chehab:
    "As agreed with Jon, I'm sending this big series directly to you, c/c
    him, as this series required special care, in order to avoid
    conflicts with other trees"

    * tag 'docs/v5.3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (77 commits)
    docs: kbuild: fix build with pdf and fix some minor issues
    docs: block: fix pdf output
    docs: arm: fix a breakage with pdf output
    docs: don't use nested tables
    docs: gpio: add sysfs interface to the admin-guide
    docs: locking: add it to the main index
    docs: add some directories to the main documentation index
    docs: add SPDX tags to new index files
    docs: add a memory-devices subdir to driver-api
    docs: phy: place documentation under driver-api
    docs: serial: move it to the driver-api
    docs: driver-api: add remaining converted dirs to it
    docs: driver-api: add xilinx driver API documentation
    docs: driver-api: add a series of orphaned documents
    docs: admin-guide: add a series of orphaned documents
    docs: cgroup-v1: add it to the admin-guide book
    docs: aoe: add it to the driver-api book
    docs: add some documentation dirs to the driver-api book
    docs: driver-model: move it to the driver-api book
    docs: lp855x-driver.rst: add it to the driver-api book
    ...

    Linus Torvalds
     

15 Jul, 2019

12 commits

  • The capable() hook returns an error number. -EPERM is actually the same as
    -1, so this doesn't make a difference in behavior.

    Signed-off-by: Jann Horn
    Signed-off-by: Micah Morton

    Jann Horn
     
  • Someone might write a ruleset like the following, expecting that it
    securely constrains UID 1 to UIDs 1, 2 and 3:

    1:2
    1:3

    However, because no constraints are applied to UIDs 2 and 3, an attacker
    with UID 1 can simply first switch to UID 2, then switch to any UID from
    there. The secure way to write this ruleset would be:

    1:2
    1:3
    2:2
    3:3

    , which uses "transition to self" as a way to inhibit the default-allow
    policy without allowing anything specific.

    This is somewhat unintuitive. To make sure that policy authors don't
    accidentally write insecure policies because of this, let the kernel verify
    that a new ruleset does not contain any entries that are constrained, but
    transitively unconstrained.

    Signed-off-by: Jann Horn
    Signed-off-by: Micah Morton

    Jann Horn
     
  • For debugging a running system, it is very helpful to be able to see what
    policy the system is using. Add a read handler that can dump out a copy of
    the loaded policy.

    Signed-off-by: Jann Horn
    Signed-off-by: Micah Morton

    Jann Horn
     
  • The current API of the SafeSetID LSM uses one write() per rule, and applies
    each written rule instantly. This has several downsides:

    - While a policy is being loaded, once a single parent-child pair has been
    loaded, the parent is restricted to that specific child, even if
    subsequent rules would allow transitions to other child UIDs. This means
    that during policy loading, set*uid() can randomly fail.
    - To replace the policy without rebooting, it is necessary to first flush
    all old rules. This creates a time window in which no constraints are
    placed on the use of CAP_SETUID.
    - If we want to perform sanity checks on the final policy, this requires
    that the policy isn't constructed in a piecemeal fashion without telling
    the kernel when it's done.

    Other kernel APIs - including things like the userns code and netfilter -
    avoid this problem by performing updates atomically. Luckily, SafeSetID
    hasn't landed in a stable (upstream) release yet, so maybe it's not too
    late to completely change the API.

    The new API for SafeSetID is: If you want to change the policy, open
    "safesetid/whitelist_policy" and write the entire policy,
    newline-delimited, in there.

    Signed-off-by: Jann Horn
    Signed-off-by: Micah Morton

    Jann Horn
     
  • Looking at current_cred() in write handlers is bad form, stop doing that.

    Also, let's just require that the write is coming from the initial user
    namespace. Especially SAFESETID_WHITELIST_FLUSH requires privilege over all
    namespaces, and SAFESETID_WHITELIST_ADD should probably require it as well.

    Signed-off-by: Jann Horn
    Signed-off-by: Micah Morton

    Jann Horn
     
  • In preparation for changing the policy parsing logic, refactor the line
    parsing logic to be less verbose and move it into a separate function.

    Signed-off-by: Jann Horn
    Signed-off-by: Micah Morton

    Jann Horn
     
  • At the moment, safesetid_security_capable() has two nested conditional
    blocks, and one big comment for all the logic. Chop it up and reduce the
    amount of indentation.

    Signed-off-by: Jann Horn
    Signed-off-by: Micah Morton

    Jann Horn
     
  • parent_kuid and child_kuid are kuids, there is no reason to make them
    uint64_t. (And anyway, in the kernel, the normal name for that would be
    u64, not uint64_t.)

    check_setuid_policy_hashtable_key() and
    check_setuid_policy_hashtable_key_value() are basically the same thing,
    merge them.

    Also fix the comment that claimed that (1<
    Signed-off-by: Micah Morton

    Jann Horn
     
  • With the old code, when a process with the (real,effective,saved) UID set
    (1,1,1) calls setresuid(2,3,4), safesetid_task_fix_setuid() only checks
    whether the transition 1->2 is permitted; the transitions 1->3 and 1->4 are
    not checked. Fix this.

    This is also a good opportunity to refactor safesetid_task_fix_setuid() to
    be less verbose - having one branch per set*uid() syscall is unnecessary.

    Note that this slightly changes semantics: The UID transition check for
    UIDs that were not in the old cred struct is now always performed against
    the policy of the RUID. I think that's more consistent anyway, since the
    RUID is also the one that decides whether any policy is enforced at all.

    Signed-off-by: Jann Horn
    Signed-off-by: Micah Morton

    Jann Horn
     
  • Fix the pr_warn() calls in the SafeSetID LSM to have newlines at the end.
    Without this, denial messages will be buffered as incomplete lines in
    log_output(), and will then only show up once something else prints into
    dmesg.

    Signed-off-by: Jann Horn
    Signed-off-by: Micah Morton

    Jann Horn
     
  • Those files belong to the admin guide, so add them.

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Those two docs belong to the x86 architecture:

    Documentation/Intel-IOMMU.txt -> Documentation/x86/intel-iommu.rst
    Documentation/intel_txt.txt -> Documentation/x86/intel_txt.rst

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
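
Two of the SafeSetID commits listed for this day describe checks that are easy to misread: rejecting rulesets whose entries are "constrained, but transitively unconstrained", and checking every new UID passed to setresuid() against the policy of the RUID. The following is a minimal userspace model of both, assuming the newline-delimited "parent:child" policy format quoted in those commit messages; the function names are hypothetical and this is not the kernel code.

```python
def parse_policy(text):
    """Parse newline-delimited 'parent:child' rules into (int, int) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parent, child = line.split(":")
        rules.append((int(parent), int(child)))
    return rules


def unconstrained_children(rules):
    """Child UIDs that a constrained UID may switch to, but that have no
    rule of their own and therefore fall back to default-allow."""
    parents = {p for p, _ in rules}
    return sorted({c for _, c in rules if c not in parents})


def transition_allowed(rules, old_ids, new_ids):
    """Check every requested UID, not only the first one.

    old_ids and new_ids are (real, effective, saved) tuples. UIDs already
    present in old_ids are always acceptable; any other UID must be a
    permitted child of the old real UID.
    """
    ruid = old_ids[0]
    if ruid not in {p for p, _ in rules}:
        return True  # no policy for this RUID: default-allow
    allowed = {c for p, c in rules if p == ruid}
    return all(uid in old_ids or uid in allowed for uid in new_ids)
```

With the ruleset "1:2, 1:3" the model reports UIDs 2 and 3 as unconstrained, while adding the "2:2" and "3:3" self-transitions closes the gap. A request such as setresuid(2,3,4) from (1,1,1) is refused in this model because 4 is not a permitted child of UID 1.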
     

13 Jul, 2019

2 commits

  • Merge updates from Andrew Morton:
    "Am experimenting with splitting MM up into identifiable subsystems
    perhaps with a view to gitifying it in complex ways. Also with more
    verbose "incoming" emails.

    Most of MM is here and a few other trees.

    Subsystems affected by this patch series:
    - hotfixes
    - iommu
    - scripts
    - arch/sh
    - ocfs2
    - mm:slab-generic
    - mm:slub
    - mm:kmemleak
    - mm:kasan
    - mm:cleanups
    - mm:debug
    - mm:pagecache
    - mm:swap
    - mm:memcg
    - mm:gup
    - mm:pagemap
    - mm:infrastructure
    - mm:vmalloc
    - mm:initialization
    - mm:pagealloc
    - mm:vmscan
    - mm:tools
    - mm:proc
    - mm:ras
    - mm:oom-kill

    hotfixes:
    mm: vmscan: scan anonymous pages on file refaults
    mm/nvdimm: add is_ioremap_addr and use that to check ioremap address
    mm/memcontrol: fix wrong statistics in memory.stat
    mm/z3fold.c: lock z3fold page before __SetPageMovable()
    nilfs2: do not use unexported cpu_to_le32()/le32_to_cpu() in uapi header
    MAINTAINERS: nilfs2: update email address

    iommu:
    include/linux/dmar.h: replace single-char identifiers in macros

    scripts:
    scripts/decode_stacktrace: match basepath using shell prefix operator, not regex
    scripts/decode_stacktrace: look for modules with .ko.debug extension
    scripts/spelling.txt: drop "sepc" from the misspelling list
    scripts/spelling.txt: add spelling fix for prohibited
    scripts/decode_stacktrace: Accept dash/underscore in modules
    scripts/spelling.txt: add more spellings to spelling.txt

    arch/sh:
    arch/sh/configs/sdk7786_defconfig: remove CONFIG_LOGFS
    sh: config: remove left-over BACKLIGHT_LCD_SUPPORT
    sh: prevent warnings when using iounmap

    ocfs2:
    fs: ocfs: fix spelling mistake "hearbeating" -> "heartbeat"
    ocfs2/dlm: use struct_size() helper
    ocfs2: add last unlock times in locking_state
    ocfs2: add locking filter debugfs file
    ocfs2: add first lock wait time in locking_state
    ocfs: no need to check return value of debugfs_create functions
    fs/ocfs2/dlmglue.c: unneeded variable: "status"
    ocfs2: use kmemdup rather than duplicating its implementation

    mm:slab-generic:
    Patch series "mm/slab: Improved sanity checking":
    mm/slab: validate cache membership under freelist hardening
    mm/slab: sanity-check page type when looking up cache
    lkdtm/heap: add tests for freelist hardening

    mm:slub:
    mm/slub.c: avoid double string traverse in kmem_cache_flags()
    slub: don't panic for memcg kmem cache creation failure

    mm:kmemleak:
    mm/kmemleak.c: fix check for softirq context
    mm/kmemleak.c: change error at _write when kmemleak is disabled
    docs: kmemleak: add more documentation details

    mm:kasan:
    mm/kasan: print frame description for stack bugs
    Patch series "Bitops instrumentation for KASAN", v5:
    lib/test_kasan: add bitops tests
    x86: use static_cpu_has in uaccess region to avoid instrumentation
    asm-generic, x86: add bitops instrumentation for KASAN
    Patch series "mm/kasan: Add object validation in ksize()", v3:
    mm/kasan: introduce __kasan_check_{read,write}
    mm/kasan: change kasan_check_{read,write} to return boolean
    lib/test_kasan: Add test for double-kzfree detection
    mm/slab: refactor common ksize KASAN logic into slab_common.c
    mm/kasan: add object validation in ksize()

    mm:cleanups:
    include/linux/pfn_t.h: remove pfn_t_to_virt()
    Patch series "remove ARCH_SELECT_MEMORY_MODEL where it has no effect":
    arm: remove ARCH_SELECT_MEMORY_MODEL
    s390: remove ARCH_SELECT_MEMORY_MODEL
    sparc: remove ARCH_SELECT_MEMORY_MODEL
    mm/gup.c: make follow_page_mask() static
    mm/memory.c: trivial clean up in insert_page()
    mm: make !CONFIG_HUGE_PAGE wrappers into static inlines
    include/linux/mm_types.h: ifdef struct vm_area_struct::swap_readahead_info
    mm: remove the account_page_dirtied export
    mm/page_isolation.c: change the prototype of undo_isolate_page_range()
    include/linux/vmpressure.h: use spinlock_t instead of struct spinlock
    mm: remove the exporting of totalram_pages
    include/linux/pagemap.h: document trylock_page() return value

    mm:debug:
    mm/failslab.c: by default, do not fail allocations with direct reclaim only
    Patch series "debug_pagealloc improvements":
    mm, debug_pagelloc: use static keys to enable debugging
    mm, page_alloc: more extensive free page checking with debug_pagealloc
    mm, debug_pagealloc: use a page type instead of page_ext flag

    mm:pagecache:
    Patch series "fix filler_t callback type mismatches", v2:
    mm/filemap.c: fix an overly long line in read_cache_page
    mm/filemap: don't cast ->readpage to filler_t for do_read_cache_page
    jffs2: pass the correct prototype to read_cache_page
    9p: pass the correct prototype to read_cache_page
    mm/filemap.c: correct the comment about VM_FAULT_RETRY

    mm:swap:
    mm, swap: fix race between swapoff and some swap operations
    mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device()
    mm, swap: use rbtree for swap_extent
    mm/mincore.c: fix race between swapoff and mincore

    mm:memcg:
    memcg, oom: no oom-kill for __GFP_RETRY_MAYFAIL
    memcg, fsnotify: no oom-kill for remote memcg charging
    mm, memcg: introduce memory.events.local
    mm: memcontrol: dump memory.stat during cgroup OOM
    Patch series "mm: reparent slab memory on cgroup removal", v7:
    mm: memcg/slab: postpone kmem_cache memcg pointer initialization to memcg_link_cache()
    mm: memcg/slab: rename slab delayed deactivation functions and fields
    mm: memcg/slab: generalize postponed non-root kmem_cache deactivation
    mm: memcg/slab: introduce __memcg_kmem_uncharge_memcg()
    mm: memcg/slab: unify SLAB and SLUB page accounting
    mm: memcg/slab: don't check the dying flag on kmem_cache creation
    mm: memcg/slab: synchronize access to kmem_cache dying flag using a spinlock
    mm: memcg/slab: rework non-root kmem_cache lifecycle management
    mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages
    mm: memcg/slab: reparent memcg kmem_caches on cgroup removal
    mm, memcg: add a memcg_slabinfo debugfs file

    mm:gup:
    Patch series "switch the remaining architectures to use generic GUP", v4:
    mm: use untagged_addr() for get_user_pages_fast addresses
    mm: simplify gup_fast_permitted
    mm: lift the x86_32 PAE version of gup_get_pte to common code
    MIPS: use the generic get_user_pages_fast code
    sh: add the missing pud_page definition
    sh: use the generic get_user_pages_fast code
    sparc64: add the missing pgd_page definition
    sparc64: define untagged_addr()
    sparc64: use the generic get_user_pages_fast code
    mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP
    mm: reorder code blocks in gup.c
    mm: consolidate the get_user_pages* implementations
    mm: validate get_user_pages_fast flags
    mm: move the powerpc hugepd code to mm/gup.c
    mm: switch gup_hugepte to use try_get_compound_head
    mm: mark the page referenced in gup_hugepte
    mm/gup: speed up check_and_migrate_cma_pages() on huge page
    mm/gup.c: remove some BUG_ONs from get_gate_page()
    mm/gup.c: mark undo_dev_pagemap as __maybe_unused

    mm:pagemap:
    asm-generic, x86: introduce generic pte_{alloc,free}_one[_kernel]
    alpha: switch to generic version of pte allocation
    arm: switch to generic version of pte allocation
    arm64: switch to generic version of pte allocation
    csky: switch to generic version of pte allocation
    m68k: sun3: switch to generic version of pte allocation
    mips: switch to generic version of pte allocation
    nds32: switch to generic version of pte allocation
    nios2: switch to generic version of pte allocation
    parisc: switch to generic version of pte allocation
    riscv: switch to generic version of pte allocation
    um: switch to generic version of pte allocation
    unicore32: switch to generic version of pte allocation
    mm/pgtable: drop pgtable_t variable from pte_fn_t functions
    mm/memory.c: fail when offset == num in first check of __vm_map_pages()

    mm:infrastructure:
    mm/mmu_notifier: use hlist_add_head_rcu()

    mm:vmalloc:
    Patch series "Some cleanups for the KVA/vmalloc", v5:
    mm/vmalloc.c: remove "node" argument
    mm/vmalloc.c: preload a CPU with one object for split purpose
    mm/vmalloc.c: get rid of one single unlink_va() when merge
    mm/vmalloc.c: switch to WARN_ON() and move it under unlink_va()
    mm/vmalloc.c: spelling> s/informaion/information/

    mm:initialization:
    mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist
    mm/large system hash: clear hashdist when only one node with memory is booted

    mm:pagealloc:
    arm64: move jump_label_init() before parse_early_param()
    Patch series "add init_on_alloc/init_on_free boot options", v10:
    mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options
    mm: init: report memory auto-initialization features at boot time

    mm:vmscan:
    mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
    mm: vmscan: correct some vmscan counters for THP swapout

    mm:tools:
    tools/vm/slabinfo: order command line options
    tools/vm/slabinfo: add partial slab listing to -X
    tools/vm/slabinfo: add option to sort by partial slabs
    tools/vm/slabinfo: add sorting info to help menu

    mm:proc:
    proc: use down_read_killable mmap_sem for /proc/pid/maps
    proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
    proc: use down_read_killable mmap_sem for /proc/pid/pagemap
    proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
    proc: use down_read_killable mmap_sem for /proc/pid/map_files
    mm: use down_read_killable for locking mmap_sem in access_remote_vm
    mm: smaps: split PSS into components
    mm: vmalloc: show number of vmalloc pages in /proc/meminfo

    mm:ras:
    mm/memory-failure.c: clarify error message

    mm:oom-kill:
    mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
    mm, oom: refactor dump_tasks for memcg OOMs
    mm, oom: remove redundant task_in_mem_cgroup() check
    oom: decouple mems_allowed from oom_unkillable_task
    mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()"

    * akpm: (147 commits)
    mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process()
    oom: decouple mems_allowed from oom_unkillable_task
    mm, oom: remove redundant task_in_mem_cgroup() check
    mm, oom: refactor dump_tasks for memcg OOMs
    mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks()
    mm/memory-failure.c: clarify error message
    mm: vmalloc: show number of vmalloc pages in /proc/meminfo
    mm: smaps: split PSS into components
    mm: use down_read_killable for locking mmap_sem in access_remote_vm
    proc: use down_read_killable mmap_sem for /proc/pid/map_files
    proc: use down_read_killable mmap_sem for /proc/pid/clear_refs
    proc: use down_read_killable mmap_sem for /proc/pid/pagemap
    proc: use down_read_killable mmap_sem for /proc/pid/smaps_rollup
    proc: use down_read_killable mmap_sem for /proc/pid/maps
    tools/vm/slabinfo: add sorting info to help menu
    tools/vm/slabinfo: add option to sort by partial slabs
    tools/vm/slabinfo: add partial slab listing to -X
    tools/vm/slabinfo: order command line options
    mm: vmscan: correct some vmscan counters for THP swapout
    mm: vmscan: remove double slab pressure by inc'ing sc->nr_scanned
    ...

    Linus Torvalds
     
  • Patch series "add init_on_alloc/init_on_free boot options", v10.

    Provide init_on_alloc and init_on_free boot options.

    These are aimed at preventing possible information leaks and making the
    control-flow bugs that depend on uninitialized values more deterministic.

    Enabling either of the options guarantees that the memory returned by the
    page allocator and SL[AU]B is initialized with zeroes. SLOB allocator
    isn't supported at the moment, as its emulation of kmem caches complicates
    handling of SLAB_TYPESAFE_BY_RCU caches correctly.

    Enabling init_on_free also guarantees that pages and heap objects are
    initialized right after they're freed, so it won't be possible to access
    stale data by using a dangling pointer.

    As suggested by Michal Hocko, right now we don't let heap users
    disable initialization for certain allocations. There's not enough
    evidence that doing so can speed up real-life cases, and introducing ways
    to opt-out may result in things going out of control.

    This patch (of 2):

    The new options are needed to prevent possible information leaks and make
    control-flow bugs that depend on uninitialized values more deterministic.

    This is expected to be on-by-default on Android and Chrome OS. And it
    gives the opportunity for anyone else to use it under distros too via the
    boot args. (The init_on_free feature is regularly requested by folks
    where memory forensics is included in their threat models.)

    init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
    objects with zeroes. Initialization is done at allocation time at the
    places where checks for __GFP_ZERO are performed.

    init_on_free=1 makes the kernel initialize freed pages and heap objects
    with zeroes upon their deletion. This helps to ensure sensitive data
    doesn't leak via use-after-free accesses.

    Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
    returns zeroed memory. The two exceptions are slab caches with
    constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
    zero-initialized to preserve their semantics.

    Both init_on_alloc and init_on_free default to zero, but those defaults
    can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
    CONFIG_INIT_ON_FREE_DEFAULT_ON.

    If either SLUB poisoning or page poisoning is enabled, those options take
    precedence over init_on_alloc and init_on_free: initialization is only
    applied to unpoisoned allocations.

    Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:

    hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
    hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

    Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
    Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
    Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
    Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

    The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
    is within the standard error.

    The new features are also going to pave the way for hardware memory
    tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
    hooks to set the tags for heap objects. With MTE, tagging will have the
    same cost as memory initialization.

    Although init_on_free is rather costly, there are paranoid use-cases where
    in-memory data lifetime is desired to be minimized. There are various
    arguments for/against the realism of the associated threat models, but
    given that we'll need the infrastructure for MTE anyway, and there are
    people who want wipe-on-free behavior no matter what the performance cost,
    it seems reasonable to include it in this series.

    [glider@google.com: v8]
    Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
    [glider@google.com: v10]
    Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
    Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
    Signed-off-by: Alexander Potapenko
    Acked-by: Kees Cook
    Acked-by: Michal Hocko [page and dmapool parts]
    Acked-by: James Morris
    Cc: Christoph Lameter
    Cc: Masahiro Yamada
    Cc: "Serge E. Hallyn"
    Cc: Nick Desaulniers
    Cc: Kostya Serebryany
    Cc: Dmitry Vyukov
    Cc: Sandeep Patil
    Cc: Laura Abbott
    Cc: Randy Dunlap
    Cc: Jann Horn
    Cc: Mark Rutland
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     

12 Jul, 2019

2 commits

  • Pull security/loadpin updates from Kees Cook:

    - Allow exclusion of specific file types (Ke Wu)

    * tag 'loadpin-v5.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    security/loadpin: Allow to exclude specific file types

    Linus Torvalds
     
  • Pull networking updates from David Miller:
    "Some highlights from this development cycle:

    1) Big refactoring of ipv6 route and neigh handling to support
    nexthop objects configurable as units from userspace. From David
    Ahern.

    2) Convert explored_states in BPF verifier into a hash table,
    significantly decreased state held for programs with bpf2bpf
    calls, from Alexei Starovoitov.

    3) Implement bpf_send_signal() helper, from Yonghong Song.

    4) Various classifier enhancements to mvpp2 driver, from Maxime
    Chevallier.

    5) Add aRFS support to hns3 driver, from Jian Shen.

    6) Fix use after free in inet frags by allocating fqdirs dynamically
    and reworking how rhashtable dismantle occurs, from Eric Dumazet.

    7) Add act_ctinfo packet classifier action, from Kevin
    Darbyshire-Bryant.

    8) Add TFO key backup infrastructure, from Jason Baron.

    9) Remove several old and unused ISDN drivers, from Arnd Bergmann.

    10) Add devlink notifications for flash update status to mlxsw driver,
    from Jiri Pirko.

    11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.

    12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.

    13) Various enhancements to ipv6 flow label handling, from Eric
    Dumazet and Willem de Bruijn.

    14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
    der Merwe, and others.

    15) Various improvements to axienet driver including converting it to
    phylink, from Robert Hancock.

    16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.

    17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
    Radulescu.

    18) Add devlink health reporting to mlx5, from Moshe Shemesh.

    19) Convert stmmac over to phylink, from Jose Abreu.

    20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
    Shalom Toledo.

    21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.

    22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.

    23) Track spill/fill of constants in BPF verifier, from Alexei
    Starovoitov.

    24) Support bounded loops in BPF, from Alexei Starovoitov.

    25) Various page_pool API fixes and improvements, from Jesper Dangaard
    Brouer.

    26) Just like ipv4, support ref-countless ipv6 route handling. From
    Wei Wang.

    27) Support VLAN offloading in aquantia driver, from Igor Russkikh.

    28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.

    29) Add flower GRE encap/decap support to nfp driver, from Pieter
    Jansen van Vuuren.

    30) Protect against stack overflow when using act_mirred, from John
    Hurley.

    31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.

    32) Use page_pool API in netsec driver, Ilias Apalodimas.

    33) Add Google gve network driver, from Catherine Sullivan.

    34) More indirect call avoidance, from Paolo Abeni.

    35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.

    36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.

    37) Add MPLS manipulation actions to TC, from John Hurley.

    38) Add sending a packet to connection tracking from TC actions, and
    then allow flower classifier matching on conntrack state. From
    Paul Blakey.

    39) Netfilter hw offload support, from Pablo Neira Ayuso"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
    net/mlx5e: Return in default case statement in tx_post_resync_params
    mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
    net: dsa: add support for BRIDGE_MROUTER attribute
    pkt_sched: Include const.h
    net: netsec: remove static declaration for netsec_set_tx_de()
    net: netsec: remove superfluous if statement
    netfilter: nf_tables: add hardware offload support
    net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
    net: flow_offload: add flow_block_cb_is_busy() and use it
    net: sched: remove tcf block API
    drivers: net: use flow block API
    net: sched: use flow block API
    net: flow_offload: add flow_block_cb_{priv, incref, decref}()
    net: flow_offload: add list handling functions
    net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
    net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
    net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
    net: flow_offload: add flow_block_cb_setup_simple()
    net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
    net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
    ...

    Linus Torvalds
     

11 Jul, 2019

1 commit

  • …el/git/dhowells/linux-fs"

    This reverts merge 0f75ef6a9cff49ff612f7ce0578bced9d0b38325 (and thus
    effectively commits

    7a1ade847596 ("keys: Provide KEYCTL_GRANT_PERMISSION")
    2e12256b9a76 ("keys: Replace uid/gid/perm permissions checking with an ACL")

    that the merge brought in).

    It turns out that it breaks booting with an encrypted volume, and Eric
    Biggers reports that it also breaks the fscrypt tests [1] and loading of
    in-kernel X.509 certificates [2].

    The root cause of all the breakage is likely the same, but David Howells
    is off email so rather than try to work it out it's getting reverted in
    order to not impact the rest of the merge window.

    [1] https://lore.kernel.org/lkml/20190710011559.GA7973@sol.localdomain/
    [2] https://lore.kernel.org/lkml/20190710013225.GB7973@sol.localdomain/

    Link: https://lore.kernel.org/lkml/CAHk-=wjxoeMJfeBahnWH=9zShKp2bsVy527vo3_y8HfOdhwAAw@mail.gmail.com/
    Reported-by: Eric Biggers <ebiggers@kernel.org>
    Cc: David Howells <dhowells@redhat.com>
    Cc: James Morris <jmorris@namei.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Linus Torvalds
     

10 Jul, 2019

2 commits

  • Pull Documentation updates from Jonathan Corbet:
    "It's been a relatively busy cycle for docs:

    - A fair pile of RST conversions, many from Mauro. These create more
    than the usual number of simple but annoying merge conflicts with
    other trees, unfortunately. He has a lot more of these waiting on
    the wings that, I think, will go to you directly later on.

    - A new document on how to use merges and rebases in kernel repos,
    and one on Spectre vulnerabilities.

    - Various improvements to the build system, including automatic
    markup of function() references because some people, for reasons I
    will never understand, were of the opinion that
    :c:func:``function()`` is unattractive and not fun to type.

    - We now recommend using sphinx 1.7, but still support back to 1.4.

    - Lots of smaller improvements, warning fixes, typo fixes, etc"

    * tag 'docs-5.3' of git://git.lwn.net/linux: (129 commits)
    docs: automarkup.py: ignore exceptions when seeking for xrefs
    docs: Move binderfs to admin-guide
    Disable Sphinx SmartyPants in HTML output
    doc: RCU callback locks need only _bh, not necessarily _irq
    docs: format kernel-parameters -- as code
    Doc : doc-guide : Fix a typo
    platform: x86: get rid of a non-existent document
    Add the RCU docs to the core-api manual
    Documentation: RCU: Add TOC tree hooks
    Documentation: RCU: Rename txt files to rst
    Documentation: RCU: Convert RCU UP systems to reST
    Documentation: RCU: Convert RCU linked list to reST
    Documentation: RCU: Convert RCU basic concepts to reST
    docs: filesystems: Remove uneeded .rst extension on toctables
    scripts/sphinx-pre-install: fix out-of-tree build
    docs: zh_CN: submitting-drivers.rst: Remove a duplicated Documentation/
    Documentation: PGP: update for newer HW devices
    Documentation: Add section about CPU vulnerabilities for Spectre
    Documentation: platform: Delete x86-laptop-drivers.txt
    docs: Note that :c:func: should no longer be used
    ...

    Linus Torvalds
     
  • Pull capabilities update from James Morris:
    "Minor fixes for capabilities:

    - Update the commoncap.c code to utilize XATTR_SECURITY_PREFIX_LEN,
    from Carmeli Tamir.

    - Make the capability hooks static, from Yue Haibing"

    * 'next-lsm' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
    security/commoncap: Use xattr security prefix len
    security: Make capability_hooks static

    Linus Torvalds
     

09 Jul, 2019

9 commits

  • …iederm/user-namespace

    Pull force_sig() argument change from Eric Biederman:
    "A source of error over the years has been that force_sig has taken a
    task parameter when it is only safe to use force_sig with the current
    task.

    The force_sig function is built for delivering synchronous signals
    such as SIGSEGV where the userspace application caused a synchronous
    fault (such as a page fault) and the kernel responded with a signal.

    Because the name force_sig does not make this clear, and because
    force_sig takes a task parameter, the function has been abused for
    sending other kinds of signals over the years. Slowly those have been
    fixed when the oopses have been tracked down.

    This set of changes fixes the remaining abusers of force_sig and
    carefully rips out the task parameter from force_sig and friends
    making this kind of error almost impossible in the future"
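
    The shape of the change can be sketched as follows. This is a
    userspace toy model, not kernel code: struct task, current_task and
    the _old/_new suffixes are hypothetical names used only to contrast
    the two signatures.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Minimal task model; it only records the last forced signal. */
struct task {
    int forced_sig;
};

/* Stand-in for the kernel's `current` (the task that is running now). */
static struct task *current_task;

/* Old shape: any task could be passed, including one that is not running. */
static void force_sig_old(int sig, struct task *t)
{
    assert(t != NULL);
    t->forced_sig = sig;
}

/* New shape: no task argument, so only the current task can be targeted. */
static void force_sig_new(int sig)
{
    assert(current_task != NULL);
    current_task->forced_sig = sig;
}
```

    With the task argument gone, a caller has no way to express "force
    this signal on some other task", so that class of misuse no longer
    compiles.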

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
    signal: Remove the signal number and task parameters from force_sig_info
    signal: Factor force_sig_info_to_task out of force_sig_info
    signal: Generate the siginfo in force_sig
    signal: Move the computation of force into send_signal and correct it.
    signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
    signal: Remove the task parameter from force_sig_fault
    signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
    signal: Explicitly call force_sig_fault on current
    signal/unicore32: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from ptrace_break
    signal/nds32: Remove tsk parameter from send_sigtrap
    signal/riscv: Remove tsk parameter from do_trap
    signal/sh: Remove tsk parameter from force_sig_info_fault
    signal/um: Remove task parameter from send_sigtrap
    signal/x86: Remove task parameter from send_sigtrap
    signal: Remove task parameter from force_sig_mceerr
    signal: Remove task parameter from force_sig
    signal: Remove task parameter from force_sigsegv
    ...

    Linus Torvalds
     
  • Pull cgroup updates from Tejun Heo:
    "Documentation updates and the addition of cgroup_parse_float() which
    will be used by new controllers including blk-iocost"

    * 'for-5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    docs: cgroup-v1: convert docs to ReST and rename to *.rst
    cgroup: Move cgroup_parse_float() implementation out of CONFIG_SYSFS
    cgroup: add cgroup_parse_float()

    Linus Torvalds
     
  • Pull integrity updates from Mimi Zohar:
    "Bug fixes, code clean up, and new features:

    - IMA policy rules can be defined in terms of LSM labels, making the
    IMA policy dependent on LSM policy label changes, in particular LSM
    label deletions. The new environment, in which IMA-appraisal is
    being used, frequently updates the LSM policy and permits LSM label
    deletions.

    - Prevent an mmap'ed shared file opened for write from also being
    mmap'ed execute. In the long term, making this and other similar
    changes at the VFS layer would be preferable.

    - The IMA per policy rule template format support is needed for a
    couple of new/proposed features (eg. kexec boot command line
    measurement, appended signatures, and VFS provided file hashes).

    - Other than the "boot-aggregate" record in the IMA measurement
    list, all other measurements are of file data. Measuring and
    storing the kexec boot command line in the IMA measurement list is
    the first buffer based measurement included in the measurement
    list"

    * 'next-integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
    integrity: Introduce struct evm_xattr
    ima: Update MAX_TEMPLATE_NAME_LEN to fit largest reasonable definition
    KEXEC: Call ima_kexec_cmdline to measure the boot command line args
    IMA: Define a new template field buf
    IMA: Define a new hook to measure the kexec boot command line arguments
    IMA: support for per policy rule template formats
    integrity: Fix __integrity_init_keyring() section mismatch
    ima: Use designated initializers for struct ima_event_data
    ima: use the lsm policy update notifier
    LSM: switch to blocking policy update notifiers
    x86/ima: fix the Kconfig dependency for IMA_ARCH_POLICY
    ima: Make arch_policy_entry static
    ima: prevent a file already mmap'ed write to be mmap'ed execute
    x86/ima: check EFI SetupMode too

    Linus Torvalds
     
  • Pull keyring ACL support from David Howells:
    "This changes the permissions model used by keys and keyrings to be
    based on an internal ACL by the following means:

    - Replace the permissions mask internally with an ACL that contains a
    list of ACEs, each with a specific subject with a permissions mask.
    Potted default ACLs are available for new keys and keyrings.

    ACE subjects can be macroised to indicate the UID and GID specified
    on the key (which remain). Future commits will be able to add
    additional subject types, such as specific UIDs or domain
    tags/namespaces.

    Also split a number of permissions to give finer control. Examples
    include splitting the revocation permit from the change-attributes
    permit, thereby allowing someone to be granted permission to revoke
    a key without allowing them to change the owner; also the ability
    to join a keyring is split from the ability to link to it, thereby
    stopping a process accessing a keyring by joining it and thus
    acquiring use of possessor permits.

    - Provide a keyctl to allow the granting or denial of one or more
    permits to a specific subject. Direct access to the ACL is not
    granted, and the ACL cannot be viewed"
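
    The ACL model described above can be sketched as a list of ACEs,
    each pairing a subject with a permit mask. This is an illustrative
    userspace model only: the PERMIT_* bits, enum ace_subject, struct
    accessor and acl_permits() are hypothetical names, and taking the
    union of all matching ACEs is an assumption rather than the merged
    code's behavior.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permit bits; revoke is split from set-attribute, join from link. */
#define PERMIT_VIEW    0x01u
#define PERMIT_LINK    0x02u
#define PERMIT_JOIN    0x04u
#define PERMIT_REVOKE  0x08u
#define PERMIT_SETATTR 0x10u

enum ace_subject { SUBJ_POSSESSOR, SUBJ_OWNER, SUBJ_GROUP, SUBJ_EVERYONE };

/* One access control entry: a subject plus the permits granted to it. */
struct ace {
    enum ace_subject subject;
    uint32_t permits;
};

#define ACL_MAX_ACES 8
struct acl {
    size_t nr_aces;
    struct ace aces[ACL_MAX_ACES];
};

/* How the caller relates to the key being checked. */
struct accessor {
    bool possessor;
    bool owner;
    bool group;
};

static bool ace_applies(const struct ace *ace, const struct accessor *who)
{
    switch (ace->subject) {
    case SUBJ_POSSESSOR: return who->possessor;
    case SUBJ_OWNER:     return who->owner;
    case SUBJ_GROUP:     return who->group;
    case SUBJ_EVERYONE:  return true;
    }
    return false;
}

/* True if the union of all matching ACEs covers every wanted permit. */
static bool acl_permits(const struct acl *acl, const struct accessor *who,
                        uint32_t wanted)
{
    uint32_t granted = 0;

    assert(acl != NULL && who != NULL);
    for (size_t i = 0; i < acl->nr_aces; i++)
        if (ace_applies(&acl->aces[i], who))
            granted |= acl->aces[i].permits;
    return (granted & wanted) == wanted;
}
```

    Because revoke and set-attribute are separate bits here, an ACE can
    grant PERMIT_REVOKE without PERMIT_SETATTR, which is the kind of
    finer control the message describes.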

    * tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    keys: Provide KEYCTL_GRANT_PERMISSION
    keys: Replace uid/gid/perm permissions checking with an ACL

    Linus Torvalds
     
  • …/git/dhowells/linux-fs

    Pull keyring namespacing from David Howells:
    "These patches help make keys and keyrings more namespace aware.

    Firstly some miscellaneous patches to make the process easier:

    - Simplify key index_key handling so that the word-sized chunks
    assoc_array requires don't have to be shifted about, making it
    easier to add more bits into the key.

    - Cache the hash value in the key so that we don't have to calculate
    on every key we examine during a search (it involves a bunch of
    multiplications).

    - Allow keyring_search() to search non-recursively.

    Then the main patches:

    - Make it so that keyring names are per-user_namespace from the point
    of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
    accessible cross-user_namespace.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

    - Move the user and user-session keyrings to the user_namespace
    rather than the user_struct. This prevents them propagating
    directly across user_namespace boundaries (ie. the KEY_SPEC_*
    flags will only pick from the current user_namespace).

    - Make it possible to include the target namespace in which the key
    shall operate in the index_key. This will allow the possibility of
    multiple keys with the same description, but different target
    domains to be held in the same keyring.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

    - Make it so that keys are implicitly invalidated by removal of a
    domain tag, causing them to be garbage collected.

    - Institute a network namespace domain tag that allows keys to be
    differentiated by the network namespace in which they operate. New
    keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
    the network domain in force when they are created.

    - Make it so that the desired network namespace can be handed down
    into the request_key() mechanism. This allows AFS, NFS, etc. to
    request keys specific to the network namespace of the superblock.

    This also means that the keys in the DNS record cache are
    thenceforth namespaced, provided network filesystems pass the
    appropriate network namespace down into dns_query().

    For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
    cache keyrings, such as idmapper keyrings, also need to set the
    domain tag - for which they need access to the network namespace of
    the superblock"
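
    The effect of putting a domain tag into the index key can be
    sketched with a toy keyring. This is a userspace model under stated
    assumptions: struct domain_tag, struct keyring, keyring_link() and
    keyring_find() are hypothetical, and refusing a duplicate link
    (rather than replacing the old key) is a simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical domain tag, e.g. one per network namespace. */
struct domain_tag {
    bool removed;
};

struct key {
    const char *type;
    const char *description;
    const struct domain_tag *domain_tag; /* NULL means no domain */
    int payload;
};

#define KEYRING_MAX 8
struct keyring {
    size_t nr_keys;
    const struct key *keys[KEYRING_MAX];
};

/* Two keys collide only if type, description and domain tag all match. */
static bool same_index(const struct key *a, const struct key *b)
{
    return strcmp(a->type, b->type) == 0 &&
           strcmp(a->description, b->description) == 0 &&
           a->domain_tag == b->domain_tag;
}

/* Returns false if an identically indexed key is linked or the ring is full. */
static bool keyring_link(struct keyring *ring, const struct key *key)
{
    assert(ring != NULL && key != NULL);
    for (size_t i = 0; i < ring->nr_keys; i++)
        if (same_index(ring->keys[i], key))
            return false;
    if (ring->nr_keys == KEYRING_MAX)
        return false;
    ring->keys[ring->nr_keys++] = key;
    return true;
}

/* Keys whose domain tag has been removed are treated as invalidated. */
static const struct key *keyring_find(const struct keyring *ring,
                                      const char *type,
                                      const char *description,
                                      const struct domain_tag *tag)
{
    struct key probe = { type, description, tag, 0 };

    for (size_t i = 0; i < ring->nr_keys; i++) {
        const struct key *k = ring->keys[i];

        if (k->domain_tag && k->domain_tag->removed)
            continue;
        if (same_index(k, &probe))
            return k;
    }
    return NULL;
}
```

    Two keys with the same type and description but different tags can
    sit in one keyring, and marking a tag as removed hides every key
    that carries it.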

    * tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    keys: Pass the network namespace into request_key mechanism
    keys: Network namespace domain tag
    keys: Garbage collect keys for which the domain has been removed
    keys: Include target namespace in match criteria
    keys: Move the user and user-session keyrings to the user_namespace
    keys: Namespace keyring names
    keys: Add a 'recurse' flag for keyring searches
    keys: Cache the hash value to avoid lots of recalculation
    keys: Simplify key description management

    Linus Torvalds
     
  • Pull request_key improvements from David Howells:
    "These are all request_key()-related, including a fix and some improvements:

    - Fix the lack of a Link permission check on a key found by
    request_key(), thereby enabling request_key() to link keys that
    don't grant this permission to the target keyring (which must still
    grant Write permission).

    Note that the key must be in the caller's keyrings already to be
    found.

    - Invalidate used request_key authentication keys rather than
    revoking them, so that they get cleaned up immediately rather than
    hanging around till the expiry time is passed.

    - Move the RCU locks outwards from the keyring search functions so
    that a request_key_rcu() can be provided. This can be called in RCU
    mode, so it can't sleep and can't upcall - but it can be called
    from LOOKUP_RCU pathwalk mode.

    - Cache the latest positive result of request_key*() temporarily in
    task_struct so that filesystems that make a lot of request_key()
    calls during pathwalk can take advantage of it to avoid having to
    redo the searching. This requires CONFIG_KEYS_REQUEST_CACHE=y.

    It is assumed that the key just found is likely to be used multiple
    times in each step in an RCU pathwalk, and is likely to be reused
    for the next step too.

    Note that the cleanup of the cache is done on TIF_NOTIFY_RESUME,
    just before userspace resumes, and on exit"
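
    The caching idea can be sketched as a one-entry, per-task cache in
    front of a slow search. This is a userspace toy, not the kernel
    code: struct task, slow_search(), request_key_cached() and
    task_resume_userspace() are hypothetical names, and the key table
    is made-up sample data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct key {
    const char *type;
    const char *description;
};

/* Toy task holding only the one-entry cache the message describes. */
struct task {
    const struct key *cached_key;
};

/* Made-up sample keys standing in for the caller's keyrings. */
static const struct key all_keys[] = {
    { "rxrpc", "afs@example.org" },
    { "user", "token" },
};
static unsigned int slow_searches;

/* Stand-in for the full keyring search; counts how often it runs. */
static const struct key *slow_search(const char *type, const char *description)
{
    slow_searches++;
    for (size_t i = 0; i < sizeof(all_keys) / sizeof(all_keys[0]); i++)
        if (strcmp(all_keys[i].type, type) == 0 &&
            strcmp(all_keys[i].description, description) == 0)
            return &all_keys[i];
    return NULL;
}

static const struct key *request_key_cached(struct task *t, const char *type,
                                            const char *description)
{
    const struct key *k;

    assert(t != NULL);
    k = t->cached_key;
    if (k && strcmp(k->type, type) == 0 &&
        strcmp(k->description, description) == 0)
        return k; /* cache hit: no search needed */

    k = slow_search(type, description);
    if (k)
        t->cached_key = k; /* only positive results are cached */
    return k;
}

/* Models the cleanup done just before userspace resumes (and on exit). */
static void task_resume_userspace(struct task *t)
{
    t->cached_key = NULL;
}
```

    Repeated lookups of the same key within one pass through the kernel
    hit the cache, and the entry does not outlive the return to
    userspace.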

    * tag 'keys-request-20190626' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    keys: Kill off request_key_async{,_with_auxdata}
    keys: Cache result of request_key*() temporarily in task_struct
    keys: Provide request_key_rcu()
    keys: Move the RCU locks outwards from the keyring search functions
    keys: Invalidate used request_key authentication keys
    keys: Fix request_key() lack of Link perm check on found key

    Linus Torvalds
     
  • Pull misc keyring updates from David Howells:
    "These are some miscellaneous keyrings fixes and improvements:

    - Fix a bunch of warnings from sparse, including missing RCU bits and
    kdoc-function argument mismatches

    - Implement a keyctl to allow a key to be moved from one keyring to
    another, with the option of prohibiting key replacement in the
    destination keyring.

    - Grant Link permission to possessors of request_key_auth tokens so
    that upcall servicing daemons can more easily arrange things such
    that only the necessary auth key is passed to the actual service
    program, and not all the auth keys a daemon might possess.

    - Improvement in lookup_user_key().

    - Implement a keyctl to allow keyrings subsystem capabilities to be
    queried.

    The keyutils next branch has commits to make available, document and
    test the move-key and capabilities code:

    https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/keyutils.git/log

    They're currently on the 'next' branch"
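
    The move operation with its "prohibit replacement" option can be
    sketched with toy keyrings. This is a userspace model, not the
    keyctl implementation: key_move(), MOVE_EXCL, struct keyring and
    the error codes chosen here are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: refuse to replace a matching key in the destination. */
#define MOVE_EXCL 0x1u

struct key {
    const char *description;
};

#define KEYRING_MAX 8
struct keyring {
    size_t nr_keys;
    const struct key *keys[KEYRING_MAX];
};

/* Index of the key with this description, or -1 if there is none. */
static int keyring_index(const struct keyring *ring, const char *description)
{
    for (size_t i = 0; i < ring->nr_keys; i++)
        if (strcmp(ring->keys[i]->description, description) == 0)
            return (int)i;
    return -1;
}

/* Returns 0 on success or a negative errno-style code. */
static int key_move(const struct key *key, struct keyring *from,
                    struct keyring *to, unsigned int flags)
{
    int src = -1;
    int dst;

    assert(key != NULL && from != NULL && to != NULL);
    for (size_t i = 0; i < from->nr_keys; i++)
        if (from->keys[i] == key)
            src = (int)i;
    if (src < 0)
        return -ENOENT;
    if (from == to)
        return 0; /* nothing to do */

    dst = keyring_index(to, key->description);
    if (dst >= 0 && (flags & MOVE_EXCL))
        return -EEXIST; /* replacement prohibited by the caller */
    if (dst < 0 && to->nr_keys == KEYRING_MAX)
        return -ENOSPC;

    if (dst >= 0)
        to->keys[dst] = key;            /* replace the matching key */
    else
        to->keys[to->nr_keys++] = key;  /* plain link */

    /* Unlink from the source by moving the last entry into the gap. */
    from->keys[src] = from->keys[--from->nr_keys];
    return 0;
}
```

    All checks run before either keyring is modified, so a refused move
    leaves both keyrings as they were.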

    * tag 'keys-misc-20190619' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    keys: Add capability-checking keyctl function
    keys: Reuse keyring_index_key::desc_len in lookup_user_key()
    keys: Grant Link permission to possessers of request_key auth keys
    keys: Add a keyctl to move a key between keyrings
    keys: Hoist locking out of __key_link_begin()
    keys: Break bits out of key_unlink()
    keys: Change keyring_serialise_link_sem to a mutex
    keys: sparse: Fix kdoc mismatches
    keys: sparse: Fix incorrect RCU accesses
    keys: sparse: Fix key_fs[ug]id_changed()

    Linus Torvalds
     
  • Pull selinux updates from Paul Moore:
    "Like the audit pull request this is a little early due to some
    upcoming vacation plans and uncertain network access while I'm away.
    Also like the audit PR, the list of patches here is pretty minor, the
    highlights include:

    - Explicitly use __le variables to make sure "sparse" can verify
    proper byte endian handling.

    - Remove some BUG_ON()s that are no longer needed.

    - Allow zero-byte writes to the "keycreate" procfs attribute without
    requiring key:create to make it easier for userspace to reset the
    keycreate label.

    - Consistently log the "invalid_context" field as an untrusted string
    in the AUDIT_SELINUX_ERR audit records"

    * tag 'selinux-pr-20190702' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
    selinux: format all invalid context as untrusted
    selinux: fix empty write to keycreate file
    selinux: remove some no-op BUG_ONs
    selinux: provide __le variables explicitly

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle are:

    - rwsem scalability improvements, phase #2, by Waiman Long, which are
    rather impressive:

    "On a 2-socket 40-core 80-thread Skylake system with 40 reader
    and writer locking threads, the min/mean/max locking operations
    done in a 5-second testing window before the patchset were:

    40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
    40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

    After the patchset, they became:

    40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
    40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"

    There are a lot of changes to the locking implementation that make
    it similar to qrwlock, including owner handoff for fairer
    locking.

    Another microbenchmark shows that the improvements hold across the
    spectrum:

    "With a locking microbenchmark running on 5.1 based kernel, the
    total locking rates (in kops/s) on a 2-socket Skylake system
    with equal numbers of readers and writers (mixed) before and
    after this patchset were:

    # of Threads  Before Patch  After Patch
    ------------  ------------  -----------
               2         2,618        4,193
               4         1,202        3,726
               8           802        3,622
              16           729        3,359
              32           319        2,826
              64           102        2,744"

    The changes are extensive and the patch-set has been through
    several iterations addressing various locking workloads. There
    might be more regressions, but unless they are pathological I
    believe we want to use this new implementation as the baseline
    going forward.

    - jump-label optimizations by Daniel Bristot de Oliveira: the primary
    motivation was to remove IPI disturbance of isolated RT-workload
    CPUs, which resulted in the implementation of batched jump-label
    updates. Beyond improving the kernel's real-time characteristics,
    in one test this patchset reduced static key update overhead from
    57 msecs to just 1.4 msecs - which is a nice speedup as well.

    - atomic64_t cross-arch type cleanups by Mark Rutland: over the last
    ~10 years of atomic64_t existence the various types used by the
    APIs only had to be self-consistent within each architecture -
    which means they became wildly inconsistent across architectures.
    Mark puts an end to this by reworking all the atomic64
    implementations to use 's64' as the base type for atomic64_t, and
    to ensure that this type is consistently used for parameters and
    return values in the API, avoiding further problems in this area.

    - A large set of small improvements to lockdep by Yuyang Du: type
    cleanups, output cleanups, function return type and other cleanups
    all around the place.

    - A set of percpu ops cleanups and fixes by Peter Zijlstra.

    - Misc other changes - please see the Git log for more details"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
    locking/lockdep: increase size of counters for lockdep statistics
    locking/atomics: Use sed(1) instead of non-standard head(1) option
    locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
    x86/jump_label: Make tp_vec_nr static
    x86/percpu: Optimize raw_cpu_xchg()
    x86/percpu, sched/fair: Avoid local_clock()
    x86/percpu, x86/irq: Relax {set,get}_irq_regs()
    x86/percpu: Relax smp_processor_id()
    x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
    locking/rwsem: Guard against making count negative
    locking/rwsem: Adaptive disabling of reader optimistic spinning
    locking/rwsem: Enable time-based spinning on reader-owned rwsem
    locking/rwsem: Make rwsem->owner an atomic_long_t
    locking/rwsem: Enable readers spinning on writer
    locking/rwsem: Clarify usage of owner's nonspinaable bit
    locking/rwsem: Wake up almost all readers in wait queue
    locking/rwsem: More optimal RT task handling of null owner
    locking/rwsem: Always release wait_lock before waking up tasks
    locking/rwsem: Implement lock handoff to prevent lock starvation
    locking/rwsem: Make rwsem_spin_on_owner() return owner state
    ...

    Linus Torvalds