09 Jul, 2019

7 commits

  • Pull force_sig() argument change from Eric Biederman:
    "A source of error over the years has been that force_sig has taken a
    task parameter when it is only safe to use force_sig with the current
    task.

    The force_sig function is built for delivering synchronous signals
    such as SIGSEGV where the userspace application caused a synchronous
    fault (such as a page fault) and the kernel responded with a signal.

    Because the name force_sig does not make this clear, and because the
    force_sig takes a task parameter the function force_sig has been
    abused for sending other kinds of signals over the years. Slowly those
    have been fixed when the oopses have been tracked down.

    This set of changes fixes the remaining abusers of force_sig and
    carefully rips out the task parameter from force_sig and friends
    making this kind of error almost impossible in the future"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (27 commits)
    signal/x86: Move tsk inside of CONFIG_MEMORY_FAILURE in do_sigbus
    signal: Remove the signal number and task parameters from force_sig_info
    signal: Factor force_sig_info_to_task out of force_sig_info
    signal: Generate the siginfo in force_sig
    signal: Move the computation of force into send_signal and correct it.
    signal: Properly set TRACE_SIGNAL_LOSE_INFO in __send_signal
    signal: Remove the task parameter from force_sig_fault
    signal: Use force_sig_fault_to_task for the two calls that don't deliver to current
    signal: Explicitly call force_sig_fault on current
    signal/unicore32: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from __do_user_fault
    signal/arm: Remove tsk parameter from ptrace_break
    signal/nds32: Remove tsk parameter from send_sigtrap
    signal/riscv: Remove tsk parameter from do_trap
    signal/sh: Remove tsk parameter from force_sig_info_fault
    signal/um: Remove task parameter from send_sigtrap
    signal/x86: Remove task parameter from send_sigtrap
    signal: Remove task parameter from force_sig_mceerr
    signal: Remove task parameter from force_sig
    signal: Remove task parameter from force_sigsegv
    ...

    Linus Torvalds
     
  • Pull crypto updates from Herbert Xu:
    "Here is the crypto update for 5.3:

    API:
    - Test shash interface directly in testmgr
    - cra_driver_name is now mandatory

    Algorithms:
    - Replace arc4 crypto_cipher with library helper
    - Implement 5 way interleave for ECB, CBC and CTR on arm64
    - Add xxhash
    - Add continuous self-test on noise source to drbg
    - Update jitter RNG

    Drivers:
    - Add support for SHA204A random number generator
    - Add support for 7211 in iproc-rng200
    - Fix fuzz test failures in inside-secure
    - Fix fuzz test failures in talitos
    - Fix fuzz test failures in qat"

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (143 commits)
    crypto: stm32/hash - remove interruptible condition for dma
    crypto: stm32/hash - Fix hmac issue more than 256 bytes
    crypto: stm32/crc32 - rename driver file
    crypto: amcc - remove memset after dma_alloc_coherent
    crypto: ccp - Switch to SPDX license identifiers
    crypto: ccp - Validate the the error value used to index error messages
    crypto: doc - Fix formatting of new crypto engine content
    crypto: doc - Add parameter documentation
    crypto: arm64/aes-ce - implement 5 way interleave for ECB, CBC and CTR
    crypto: arm64/aes-ce - add 5 way interleave routines
    crypto: talitos - drop icv_ool
    crypto: talitos - fix hash on SEC1.
    crypto: talitos - move struct talitos_edesc into talitos.h
    lib/scatterlist: Fix mapping iterator when sg->offset is greater than PAGE_SIZE
    crypto/NX: Set receive window credits to max number of CRBs in RxFIFO
    crypto: asymmetric_keys - select CRYPTO_HASH where needed
    crypto: serpent - mark __serpent_setkey_sbox noinline
    crypto: testmgr - dynamically allocate crypto_shash
    crypto: testmgr - dynamically allocate testvec_config
    crypto: talitos - eliminate unneeded 'done' functions at build time
    ...

    Linus Torvalds
     
  • Pull keyring ACL support from David Howells:
    "This changes the permissions model used by keys and keyrings to be
    based on an internal ACL by the following means:

    - Replace the permissions mask internally with an ACL that contains a
    list of ACEs, each with a specific subject with a permissions mask.
    Potted default ACLs are available for new keys and keyrings.

    ACE subjects can be macroised to indicate the UID and GID specified
    on the key (which remain). Future commits will be able to add
    additional subject types, such as specific UIDs or domain
    tags/namespaces.

    Also split a number of permissions to give finer control. Examples
    include splitting the revocation permit from the change-attributes
    permit, thereby allowing someone to be granted permission to revoke
    a key without allowing them to change the owner; also the ability
    to join a keyring is split from the ability to link to it, thereby
    stopping a process accessing a keyring by joining it and thus
    acquiring use of possessor permits.

    - Provide a keyctl to allow the granting or denial of one or more
    permits to a specific subject. Direct access to the ACL is not
    granted, and the ACL cannot be viewed"

    * tag 'keys-acl-20190703' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    keys: Provide KEYCTL_GRANT_PERMISSION
    keys: Replace uid/gid/perm permissions checking with an ACL

    Linus Torvalds
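
The ACL model in the keyring ACL pull above can be pictured as a list of ACEs, each pairing a subject with a permissions mask. What follows is only a minimal userspace sketch of that idea. Every name in it (`struct ace_sketch`, `SUBJ_*`, `PERM_*`, `acl_permits()`) is hypothetical and chosen for illustration; none of it is the kernel's data structure or the keyctl interface.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical subject kinds; the kernel's real subject types may differ. */
enum subj_kind { SUBJ_POSSESSOR, SUBJ_OWNER, SUBJ_GROUP, SUBJ_EVERYONE };

/* Hypothetical permits, split finely in the spirit of the pull message. */
enum {
    PERM_VIEW     = 1u << 0,
    PERM_REVOKE   = 1u << 1,  /* kept separate from changing attributes */
    PERM_SET_ATTR = 1u << 2,
    PERM_JOIN     = 1u << 3,  /* kept separate from linking */
    PERM_LINK     = 1u << 4,
};

/* One access control entry: a subject plus the permits it is granted. */
struct ace_sketch {
    enum subj_kind subject;
    unsigned int perms;
};

/* Returns true if some ACE for 'who' grants every bit in 'wanted'. */
static bool acl_permits(const struct ace_sketch *acl, size_t n,
                        enum subj_kind who, unsigned int wanted)
{
    for (size_t i = 0; i < n; i++)
        if (acl[i].subject == who && (acl[i].perms & wanted) == wanted)
            return true;
    return false;
}
```

With an ACL that grants the owner `PERM_VIEW | PERM_REVOKE`, `acl_permits()` allows a revoke by the owner but refuses `PERM_SET_ATTR`, which is the kind of split the pull message describes.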
     
  • Pull keyring namespacing from David Howells:
    "These patches help make keys and keyrings more namespace aware.

    Firstly some miscellaneous patches to make the process easier:

    - Simplify key index_key handling so that the word-sized chunks
    assoc_array requires don't have to be shifted about, making it
    easier to add more bits into the key.

    - Cache the hash value in the key so that we don't have to calculate
    on every key we examine during a search (it involves a bunch of
    multiplications).

    - Allow keyring_search() to search non-recursively.

    Then the main patches:

    - Make it so that keyring names are per-user_namespace from the point
    of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
    accessible cross-user_namespace.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

    - Move the user and user-session keyrings to the user_namespace
    rather than the user_struct. This prevents them propagating
    directly across user_namespaces boundaries (ie. the KEY_SPEC_*
    flags will only pick from the current user_namespace).

    - Make it possible to include the target namespace in which the key
    shall operate in the index_key. This will allow the possibility of
    multiple keys with the same description, but different target
    domains to be held in the same keyring.

    keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

    - Make it so that keys are implicitly invalidated by removal of a
    domain tag, causing them to be garbage collected.

    - Institute a network namespace domain tag that allows keys to be
    differentiated by the network namespace in which they operate. New
    keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
    the network domain in force when they are created.

    - Make it so that the desired network namespace can be handed down
    into the request_key() mechanism. This allows AFS, NFS, etc. to
    request keys specific to the network namespace of the superblock.

    This also means that the keys in the DNS record cache are
    thenceforth namespaced, provided network filesystems pass the
    appropriate network namespace down into dns_query().

    For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
    cache keyrings, such as idmapper keyrings, also need to set the
    domain tag - for which they need access to the network namespace of
    the superblock"

    * tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    keys: Pass the network namespace into request_key mechanism
    keys: Network namespace domain tag
    keys: Garbage collect keys for which the domain has been removed
    keys: Include target namespace in match criteria
    keys: Move the user and user-session keyrings to the user_namespace
    keys: Namespace keyring names
    keys: Add a 'recurse' flag for keyring searches
    keys: Cache the hash value to avoid lots of recalculation
    keys: Simplify key description management

    Linus Torvalds
     
  • Pull x86 AVX512 status update from Ingo Molnar:
    "This adds a new ABI that the main scheduler probably doesn't want to
    deal with but HPC job schedulers might want to use: the
    AVX512_elapsed_ms field in the new /proc/<pid>/arch_status task status
    file, which allows the user-space job scheduler to cluster such tasks,
    to avoid turbo frequency drops"

    * 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Documentation/filesystems/proc.txt: Add arch_status file
    x86/process: Add AVX-512 usage elapsed time to /proc/pid/arch_status
    proc: Add /proc/<pid>/arch_status

    Linus Torvalds
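
A user-space job scheduler that wants the AVX512_elapsed_ms value from the pull above has to pick it out of the text of the arch_status file. The helper below is a minimal sketch of that step. It assumes the file contains a line of the form `AVX512_elapsed_ms:` followed by whitespace and a decimal number; the function name `parse_avx512_elapsed_ms()` is hypothetical, and reading the file into a buffer is left to the caller.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the AVX512_elapsed_ms value from the text of an arch_status file.
 * Assumed layout: "AVX512_elapsed_ms:<whitespace><decimal number>".
 * Returns 0 and stores the value in *ms on success, or -1 if the field is
 * missing or is not followed by a number.
 */
static int parse_avx512_elapsed_ms(const char *text, long *ms)
{
    static const char key[] = "AVX512_elapsed_ms:";
    const char *p = strstr(text, key);
    char *end;

    if (!p)
        return -1;                 /* field not present */
    p += sizeof(key) - 1;          /* skip past the key and its colon */
    errno = 0;
    *ms = strtol(p, &end, 10);     /* strtol() skips leading whitespace */
    if (end == p || errno == ERANGE)
        return -1;                 /* no digits, or value out of range */
    return 0;
}
```

A caller would read /proc/<pid>/arch_status into a NUL-terminated buffer and pass it in; treating a missing field as an error keeps the helper usable on machines where the entry is not shown.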
     
  • Pull scheduler updates from Ingo Molnar:

    - Remove the unused per rq load array and all its infrastructure, by
    Dietmar Eggemann.

    - Add utilization clamping support by Patrick Bellasi. This is a
    refinement of the energy aware scheduling framework with support for
    boosting of interactive and capping of background workloads: to make
    sure critical GUI threads get maximum frequency ASAP, and to make
    sure background processing doesn't unnecessarily push the cpufreq
    governor to higher frequencies and less energy-efficient CPU modes.

    - Add the bare minimum of tracepoints required for LISA EAS regression
    testing, by Qais Yousef - which allows automated testing of various
    power management features, including energy aware scheduling.

    - Restructure the former tsk_nr_cpus_allowed() facility that the -rt
    kernel used to modify the scheduler's CPU affinity logic such as
    migrate_disable() - introduce the task->cpus_ptr value instead of
    taking the address of &task->cpus_allowed directly - by Sebastian
    Andrzej Siewior.

    - Misc optimizations, fixes, cleanups and small enhancements - see the
    Git log for details.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    sched/uclamp: Add uclamp support to energy_compute()
    sched/uclamp: Add uclamp_util_with()
    sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
    sched/uclamp: Set default clamps for RT tasks
    sched/uclamp: Reset uclamp values on RESET_ON_FORK
    sched/uclamp: Extend sched_setattr() to support utilization clamping
    sched/core: Allow sched_setattr() to use the current policy
    sched/uclamp: Add system default clamps
    sched/uclamp: Enforce last task's UCLAMP_MAX
    sched/uclamp: Add bucket local max tracking
    sched/uclamp: Add CPU's clamp buckets refcounting
    sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
    sched/debug: Export the newly added tracepoints
    sched/debug: Add sched_overutilized tracepoint
    sched/debug: Add new tracepoint to track PELT at se level
    sched/debug: Add new tracepoints to track PELT at rq level
    sched/debug: Add a new sched_trace_*() helper functions
    sched/autogroup: Make autogroup_path() always available
    sched/wait: Deduplicate code with do-while
    sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
    ...

    Linus Torvalds
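
The utilization clamping item in the scheduler pull above boils down to bounding a task's utilization between a minimum (boost) and a maximum (cap) before it is used for frequency selection. The function below is a minimal sketch of that arithmetic only. It assumes a 0..1024 utilization scale, and the names `UTIL_SCALE` and `clamp_util()` are hypothetical; it is not the kernel's implementation.

```c
/* Assumed utilization scale: values run from 0 to 1024. */
#define UTIL_SCALE 1024u

/*
 * Bound 'util' to the range [util_min, util_max].
 * A non-zero util_min boosts a lightly loaded task; a util_max below
 * UTIL_SCALE caps a heavily loaded one. Out-of-range or inverted bounds
 * are normalised first so the result is always within 0..UTIL_SCALE.
 */
static unsigned int clamp_util(unsigned int util, unsigned int util_min,
                               unsigned int util_max)
{
    if (util_max > UTIL_SCALE)
        util_max = UTIL_SCALE;
    if (util_min > util_max)
        util_min = util_max;
    if (util < util_min)
        return util_min;
    if (util > util_max)
        return util_max;
    return util;
}
```

For example, an interactive task with utilization 100 and a minimum of 512 is treated as 512, while a background task at 900 with a maximum of 256 is treated as 256.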
     
  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle are:

    - rwsem scalability improvements, phase #2, by Waiman Long, which are
    rather impressive:

    "On a 2-socket 40-core 80-thread Skylake system with 40 reader
    and writer locking threads, the min/mean/max locking operations
    done in a 5-second testing window before the patchset were:

    40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
    40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

    After the patchset, they became:

    40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
    40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"

    There are a lot of changes to the locking implementation that make
    it similar to qrwlock, including owner handoff for fairer
    locking.

    Another microbenchmark shows how broad the improvements are across
    the spectrum:

    "With a locking microbenchmark running on 5.1 based kernel, the
    total locking rates (in kops/s) on a 2-socket Skylake system
    with equal numbers of readers and writers (mixed) before and
    after this patchset were:

    # of Threads  Before Patch  After Patch
    ------------  ------------  -----------
               2         2,618        4,193
               4         1,202        3,726
               8           802        3,622
              16           729        3,359
              32           319        2,826
              64           102        2,744"

    The changes are extensive and the patch-set has been through
    several iterations addressing various locking workloads. There
    might be more regressions, but unless they are pathological I
    believe we want to use this new implementation as the baseline
    going forward.

    - jump-label optimizations by Daniel Bristot de Oliveira: the primary
    motivation was to remove IPI disturbance of isolated RT-workload
    CPUs, which resulted in the implementation of batched jump-label
    updates. Beyond improving the kernel's real-time characteristics,
    in one test this patchset reduced static key update overhead from
    57 msecs to just 1.4 msecs - which is a nice speedup as well.

    - atomic64_t cross-arch type cleanups by Mark Rutland: over the last
    ~10 years of atomic64_t existence the various types used by the
    APIs only had to be self-consistent within each architecture -
    which means they became wildly inconsistent across architectures.
    Mark puts an end to this by reworking all the atomic64
    implementations to use 's64' as the base type for atomic64_t, and
    to ensure that this type is consistently used for parameters and
    return values in the API, avoiding further problems in this area.

    - A large set of small improvements to lockdep by Yuyang Du: type
    cleanups, output cleanups, function return type and other cleanups
    all around the place.

    - A set of percpu ops cleanups and fixes by Peter Zijlstra.

    - Misc other changes - please see the Git log for more details"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
    locking/lockdep: increase size of counters for lockdep statistics
    locking/atomics: Use sed(1) instead of non-standard head(1) option
    locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
    x86/jump_label: Make tp_vec_nr static
    x86/percpu: Optimize raw_cpu_xchg()
    x86/percpu, sched/fair: Avoid local_clock()
    x86/percpu, x86/irq: Relax {set,get}_irq_regs()
    x86/percpu: Relax smp_processor_id()
    x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
    locking/rwsem: Guard against making count negative
    locking/rwsem: Adaptive disabling of reader optimistic spinning
    locking/rwsem: Enable time-based spinning on reader-owned rwsem
    locking/rwsem: Make rwsem->owner an atomic_long_t
    locking/rwsem: Enable readers spinning on writer
    locking/rwsem: Clarify usage of owner's nonspinaable bit
    locking/rwsem: Wake up almost all readers in wait queue
    locking/rwsem: More optimal RT task handling of null owner
    locking/rwsem: Always release wait_lock before waking up tasks
    locking/rwsem: Implement lock handoff to prevent lock starvation
    locking/rwsem: Make rwsem_spin_on_owner() return owner state
    ...

    Linus Torvalds
     

07 Jul, 2019

1 commit


06 Jul, 2019

1 commit

  • Pull nfsd fixes from Bruce Fields:
    "Two more quick bugfixes for nfsd: fixing a regression causing mount
    failures on high-memory machines and fixing the DRC over RDMA"

    * tag 'nfsd-5.2-2' of git://linux-nfs.org/~bfields/linux:
    nfsd: Fix overflow causing non-working mounts on 1 TB machines
    svcrdma: Ignore source port when computing DRC hash

    Linus Torvalds
     

05 Jul, 2019

5 commits

  • CONFIG_VALIDATE_FS_PARSER is a debugging tool to check that the parser
    tables are vaguely sane. It was set to default to 'Y' for the moment to
    catch errors in upcoming fs conversion development.

    Make sure it is not enabled by default in the final release of v5.1.

    Fixes: 31d921c7fb969172 ("vfs: Add configuration parser helpers")
    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Al Viro

    Geert Uytterhoeven
     
  • Merge more fixes from Andrew Morton:
    "5 fixes"

    * emailed patches from Andrew Morton:
    swap_readpage(): avoid blk_wake_io_task() if !synchronous
    devres: allow const resource arguments
    mm/vmscan.c: prevent useless kswapd loops
    fs/userfaultfd.c: disable irqs for fault_pending and event locks
    mm/page_alloc.c: fix regression with deferred struct page init

    Linus Torvalds
     
  • Pull dax fix from Dan Williams:
    "A single dax fix that has been soaking awaiting other fixes under
    discussion to join it. As it is getting late in the cycle let's proceed
    with this fix and save follow-on changes for post-v5.3-rc1.

    - Fix xarray entry association for mixed mappings"

    * tag 'dax-fix-5.2-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax: Fix xarray entry association for mixed mappings

    Linus Torvalds
     
  • Pull do_move_mount() fix from Al Viro:
    "Regression fix"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: move_mount: reject moving kernel internal mounts

    Linus Torvalds
     
  • When IOCB_CMD_POLL is used on a userfaultfd, aio_poll() disables IRQs
    and takes kioctx::ctx_lock, then userfaultfd_ctx::fd_wqh.lock.

    This may have to wait for userfaultfd_ctx::fd_wqh.lock to be released by
    userfaultfd_ctx_read(), which in turn can be waiting for
    userfaultfd_ctx::fault_pending_wqh.lock or
    userfaultfd_ctx::event_wqh.lock.

    But elsewhere the fault_pending_wqh and event_wqh locks are taken with
    IRQs enabled. Since the IRQ handler may take kioctx::ctx_lock, lockdep
    reports that a deadlock is possible.

    Fix it by always disabling IRQs when taking the fault_pending_wqh and
    event_wqh locks.

    Commit ae62c16e105a ("userfaultfd: disable irqs when taking the
    waitqueue lock") didn't fix this because it only accounted for the
    fd_wqh lock, not the other locks nested inside it.

    Link: http://lkml.kernel.org/r/20190627075004.21259-1-ebiggers@kernel.org
    Fixes: bfe4037e722e ("aio: implement IOCB_CMD_POLL")
    Signed-off-by: Eric Biggers
    Reported-by: syzbot+fab6de82892b6b9c6191@syzkaller.appspotmail.com
    Reported-by: syzbot+53c0b767f7ca0dc0c451@syzkaller.appspotmail.com
    Reported-by: syzbot+a3accb352f9c22041cfa@syzkaller.appspotmail.com
    Reviewed-by: Andrew Morton
    Cc: Christoph Hellwig
    Cc: Andrea Arcangeli
    Cc: [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     

04 Jul, 2019

1 commit

  • Since commit 10a68cdf10 (nfsd: fix performance-limiting session
    calculation) (Linux 5.1-rc1 and 4.19.31), shares from NFS servers with
    1 TB of memory cannot be mounted anymore. The mount just hangs on the
    client.

    The gist of commit 10a68cdf10 is the change below.

    -avail = clamp_t(int, avail, slotsize, avail/3);
    +avail = clamp_t(int, avail, slotsize, total_avail/3);

    Here are the macros.

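    #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), <)
    #define max_t(type, x, y) __careful_cmp((type)(x), (type)(y), >)
    #define clamp_t(type, val, lo, hi) min_t(type, max_t(type, val, lo), hi)
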
    Signed-off-by: J. Bruce Fields

    Paul Menzel
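
The overflow behind the nfsd commit above comes from clamp_t(int, ...) casting every operand to int before comparing. The sketch below mimics that casting in plain C to show how a large upper bound can turn negative. The helper names and the sample value are hypothetical, picked only to make the effect visible; they are not taken from the commit.

```c
static int min_int(int x, int y) { return x < y ? x : y; }
static int max_int(int x, int y) { return x > y ? x : y; }

/*
 * Mimics clamp_t(int, val, lo, hi): each operand is cast to int first.
 * Converting a value above INT_MAX to int is implementation-defined in C;
 * common compilers wrap it, which produces a negative number.
 */
static int clamp_as_int(unsigned long long val, unsigned long long lo,
                        unsigned long long hi)
{
    return min_int(max_int((int)val, (int)lo), (int)hi);
}
```

With a hypothetical upper bound of 6442450944 / 3 = 2147483648, which is one more than the largest 32-bit int, the cast wraps on typical compilers and `clamp_as_int(4096, 2048, 6442450944ULL / 3)` returns a negative value instead of 4096.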
     

03 Jul, 2019

1 commit


01 Jul, 2019

1 commit

  • sys_move_mount() crashes by dereferencing the pointer MNT_NS_INTERNAL,
    a.k.a. ERR_PTR(-EINVAL), if the old mount is specified by fd for a
    kernel object with an internal mount, such as a pipe or memfd.

    Fix it by checking for this case and returning -EINVAL.

    [AV: what we want is is_mounted(); use that instead of making the
    condition even more convoluted]

    Reproducer:

    #include <unistd.h>

    #define __NR_move_mount 429
    #define MOVE_MOUNT_F_EMPTY_PATH 0x00000004

    int main()
    {
        int fds[2];

        pipe(fds);
        syscall(__NR_move_mount, fds[0], "", -1, "/", MOVE_MOUNT_F_EMPTY_PATH);
    }

    Reported-by: syzbot+6004acbaa1893ad013f0@syzkaller.appspotmail.com
    Fixes: 2db154b3ea8e ("vfs: syscall: Add move_mount(2) to move mounts around")
    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro

    Eric Biggers
     

29 Jun, 2019

8 commits

  • Pull XArray fixes from Matthew Wilcox:

    - Account XArray nodes for the page cache to the appropriate cgroup
    (Johannes Weiner)

    - Fix idr_get_next() when called under the RCU lock (Matthew Wilcox)

    - Add a test for xa_insert() (Matthew Wilcox)

    * tag 'xarray-5.2-rc6' of git://git.infradead.org/users/willy/linux-dax:
    XArray tests: Add check_insert
    idr: Fix idr_get_next race with idr_remove
    mm: fix page cache convergence regression

    Linus Torvalds
     
  • Merge misc fixes from Andrew Morton:
    "15 fixes"

    * emailed patches from Andrew Morton:
    linux/kernel.h: fix overflow for DIV_ROUND_UP_ULL
    mm, swap: fix THP swap out
    fork,memcg: alloc_thread_stack_node needs to set tsk->stack
    MAINTAINERS: add CLANG/LLVM BUILD SUPPORT info
    mm/vmalloc.c: avoid bogus -Wmaybe-uninitialized warning
    mm/page_idle.c: fix oops because end_pfn is larger than max_pfn
    initramfs: fix populate_initrd_image() section mismatch
    mm/oom_kill.c: fix uninitialized oc->constraint
    mm: hugetlb: soft-offline: dissolve_free_huge_page() return zero on !PageHuge
    mm: soft-offline: return -EBUSY if set_hwpoison_free_buddy_page() fails
    signal: remove the wrong signal_pending() check in restore_user_sigmask()
    fs/binfmt_flat.c: make load_flat_shared_library() work
    mm/mempolicy.c: fix an incorrect rebind node in mpol_rebind_nodemask
    fs/proc/array.c: allow reporting eip/esp for all coredumping threads
    mm/dev_pfn: exclude MEMORY_DEVICE_PRIVATE while computing virtual address

    Linus Torvalds
     
  • Pull two more NFS client fixes from Anna Schumaker:
    "These are both stable fixes.

    One to calculate the correct client message length in the case of
    partial transmissions. And the other to set the proper TCP timeout for
    flexfiles"

    * tag 'nfs-for-5.2-4' of git://git.linux-nfs.org/projects/anna/linux-nfs:
    NFS/flexfiles: Use the correct TCP timeout for flexfiles I/O
    SUNRPC: Fix up calculation of client message length

    Linus Torvalds
     
  • Pull ceph fix from Ilya Dryomov:
    "A small fix for a potential -rc1 regression from Jeff"

    * tag 'ceph-for-5.2-rc7' of git://github.com/ceph/ceph-client:
    ceph: fix ceph_mdsc_build_path to not stop on first component

    Linus Torvalds
     
  • Pull block fixes from Jens Axboe:
    "Just two small fixes.

    One from Paolo, fixing a silly mistake in BFQ. The other one is from
    me, ensuring that we have ->file cleared in the io_uring request a bit
    earlier. That avoids a use-before-free, if we encounter an error
    before ->file is assigned"

    * tag 'for-linus-20190628' of git://git.kernel.dk/linux-block:
    block, bfq: fix operator in BFQQ_TOTALLY_SEEKY
    io_uring: ensure req->file is cleared on allocation

    Linus Torvalds
     
  • This is the minimal fix for stable, I'll send cleanups later.

    Commit 854a6ed56839 ("signal: Add restore_user_sigmask()") introduced
    the visible change which breaks user-space: a signal temporarily unblocked
    by set_user_sigmask() can be delivered even if the caller returns
    success or timeout.

    Change restore_user_sigmask() to accept the additional "interrupted"
    argument which should be used instead of signal_pending() check, and
    update the callers.

    Eric said:

    : For clarity. I don't think this is required by posix, or fundamentally to
    : remove the races in select. It is what linux has always done and we have
    : applications who care so I agree this fix is needed.
    :
    : Further in any case where the semantic change that this patch rolls back
    : (aka where allowing a signal to be delivered and the select like call to
    : complete) would be advantage we can do as well if not better by using
    : signalfd.
    :
    : Michael is there any chance we can get this guarantee of the linux
    : implementation of pselect and friends clearly documented. The guarantee
    : that if the system call completes successfully we are guaranteed that no
    : signal that is unblocked by using sigmask will be delivered?

    Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com
    Fixes: 854a6ed56839a40f6b5d02a2962f48841482eec4 ("signal: Add restore_user_sigmask()")
    Signed-off-by: Oleg Nesterov
    Reported-by: Eric Wong
    Tested-by: Eric Wong
    Acked-by: "Eric W. Biederman"
    Acked-by: Arnd Bergmann
    Acked-by: Deepa Dinamani
    Cc: Michael Kerrisk
    Cc: Jens Axboe
    Cc: Davidlohr Bueso
    Cc: Jason Baron
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: David Laight
    Cc: [5.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • load_flat_shared_library() is broken: It only calls load_flat_file() if
    prepare_binprm() returns zero, but prepare_binprm() returns the number of
    bytes read - so this only happens if the file is empty.

    Instead, call into load_flat_file() if the number of bytes read is
    non-negative. (Even if the number of bytes is zero - in that case,
    load_flat_file() will see null bytes and return a nice -ENOEXEC.)

    In addition, remove the code related to bprm creds and stop using
    prepare_binprm() - this code is loading a library, not a main executable,
    and it only actually uses the members "buf", "file" and "filename" of the
    linux_binprm struct. Instead, call kernel_read() directly.

    Link: http://lkml.kernel.org/r/20190524201817.16509-1-jannh@google.com
    Fixes: 287980e49ffc ("remove lots of IS_ERR_VALUE abuses")
    Signed-off-by: Jann Horn
    Cc: Alexander Viro
    Cc: Kees Cook
    Cc: Nicolas Pitre
    Cc: Arnd Bergmann
    Cc: Geert Uytterhoeven
    Cc: Russell King
    Cc: Greg Ungerer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
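
The load_flat_shared_library() bug described above is a return-value convention mix-up: the read step reports the number of bytes read, but the caller treated only zero as success. The two predicates below sketch the difference in isolation. Both function names are hypothetical, and `res` stands in for "bytes read, or a negative errno".

```c
#include <stdbool.h>

/* Old check: proceeds only when nothing was read, i.e. an empty file. */
static bool should_load_before_fix(long res)
{
    return res == 0;
}

/* Fixed check: proceeds for any successful read, including zero bytes. */
static bool should_load_after_fix(long res)
{
    return res >= 0;
}
```

A read of 128 bytes is rejected by the old predicate and accepted by the fixed one, while a negative error code is rejected by both.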
     
  • 0a1eb2d474ed ("fs/proc: Stop reporting eip and esp in /proc/PID/stat")
    stopped reporting eip/esp and fd7d56270b52 ("fs/proc: Report eip/esp in
    /prod/PID/stat for coredumping") reintroduced the feature to fix a
    regression with userspace core dump handlers (such as minicoredumper).

    Because PF_DUMPCORE is only set for the primary thread, this didn't fix
    the original problem for secondary threads. Allow reporting the eip/esp
    for all threads by checking for PF_EXITING as well. This is set for all
    the other threads when they are killed. coredump_wait() waits for all the
    tasks to become inactive before proceeding to invoke a core dumper.

    Link: http://lkml.kernel.org/r/87y32p7i7a.fsf@linutronix.de
    Link: http://lkml.kernel.org/r/20190522161614.628-1-jlu@pengutronix.de
    Fixes: fd7d56270b526ca3 ("fs/proc: Report eip/esp in /prod/PID/stat for coredumping")
    Signed-off-by: John Ogness
    Reported-by: Jan Luebbe
    Tested-by: Jan Luebbe
    Cc: Alexey Dobriyan
    Cc: Andy Lutomirski
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Ogness
     

28 Jun, 2019

7 commits

  • Fix a typo where we're confusing the default TCP retrans value
    (NFS_DEF_TCP_RETRANS) for the default TCP timeout value.

    Fixes: 15d03055cf39f ("pNFS/flexfiles: Set reasonable default ...")
    Cc: stable@vger.kernel.org # 4.8+
    Signed-off-by: Trond Myklebust
    Signed-off-by: Anna Schumaker

    Trond Myklebust
     
    We never parsed/returned any data from .get_link() when the object is a Windows reparse point
    containing a symlink. This results in the VFS layer oopsing when it accesses an uninitialized buffer:

    ...
    [ 171.407172] Call Trace:
    [ 171.408039] readlink_copy+0x29/0x70
    [ 171.408872] vfs_readlink+0xc1/0x1f0
    [ 171.409709] ? readlink_copy+0x70/0x70
    [ 171.410565] ? simple_attr_release+0x30/0x30
    [ 171.411446] ? getname_flags+0x105/0x2a0
    [ 171.412231] do_readlinkat+0x1b7/0x1e0
    [ 171.412938] ? __ia32_compat_sys_newfstat+0x30/0x30
    ...

    Fix this by adding code to handle these buffers and make sure we do return a valid buffer
    to .get_link()

    CC: Stable
    Signed-off-by: Ronnie Sahlberg
    Signed-off-by: Steve French

    Ronnie Sahlberg
     
  • Pull pidfd fixes from Christian Brauner:
    "Userspace tools and libraries such as strace or glibc need a cheap and
    reliable way to tell whether CLONE_PIDFD is supported. The easiest way
    is to pass an invalid fd value in the return argument, perform the
    syscall and verify the value in the return argument has been changed
    to a valid fd.

    However, if CLONE_PIDFD is specified we currently check if pidfd == 0
    and return EINVAL if not.

    The check for pidfd == 0 was originally added to enable us to abuse
    the return argument for passing additional flags along with
    CLONE_PIDFD in the future.

    However, extending legacy clone this way would be a terrible idea and
    with clone3 on the horizon and the ability to reuse CLONE_DETACHED
    with CLONE_PIDFD there's no real need for this crutch. So remove the
    pidfd == 0 check and help userspace out.

    Also, according to Al, anon_inode_getfd() should only be used past the
    point of no failure and ksys_close() should not be used at all since
    it is far too easy to get wrong. Al's motto being "basically, once
    it's in descriptor table, it's out of your control". So Al's patch
    switches back to what we already had in v1 of the original patchset
    and uses an anon_inode_getfile() + put_user() + fd_install() sequence
    in the success path and a fput() + put_unused_fd() in the failure
    path.

    The other two changes should be trivial"

    * tag 'for-linus-20190627' of gitolite.kernel.org:pub/scm/linux/kernel/git/brauner/linux:
    proc: remove useless d_is_dir() check
    copy_process(): don't use ksys_close() on cleanups
    samples: make pidfd-metadata fail gracefully on older kernels
    fork: don't check parent_tidptr with CLONE_PIDFD

    Linus Torvalds
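
The feature test described in the pidfd pull above (pass an invalid fd value in the return argument, perform the syscall, and see whether the value was replaced by a valid fd) can be sketched in user space as follows. This is a hedged illustration, not code from strace or glibc. The function name `clone_pidfd_supported()` is hypothetical, and the sketch assumes the raw clone argument order (flags, stack, parent_tid, ...) used on x86-64, which does not hold on every architecture.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_PIDFD
#define CLONE_PIDFD 0x00001000  /* fallback for older userspace headers */
#endif

/*
 * Returns 1 if the kernel stored a pidfd in the return argument, else 0.
 * Assumption: raw clone takes (flags, stack, parent_tid, ...) in that order.
 */
static int clone_pidfd_supported(void)
{
    int pidfd = -1;  /* deliberately invalid starting value */
    long pid = syscall(SYS_clone, CLONE_PIDFD | SIGCHLD, NULL, &pidfd,
                       NULL, 0UL);

    if (pid == 0)
        _exit(0);                    /* child: exit immediately */
    if (pid < 0)
        return 0;                    /* clone itself was refused */
    waitpid((pid_t)pid, NULL, 0);    /* reap the short-lived child */
    if (pidfd < 0)
        return 0;                    /* value untouched: flag was ignored */
    close(pidfd);
    return 1;
}
```

A kernel that still carries the pidfd == 0 check would refuse this call because the return argument starts at -1, which is exactly the situation the pull message sets out to remove.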
     
  • Pull AFS fixes from David Howells:
    "The in-kernel AFS client has been undergoing testing on opendev.org on
    one of their mirror machines. They are using AFS to hold data that is
    then served via apache, and Ian Wienand had reported seeing oopses,
    spontaneous machine reboots and updates to volumes going missing. This
    patch series appears to have fixed the problem, very probably due to
    patch (2), but it's not 100% certain.

    (1) Fix the printing of the "vnode modified" warning to exclude checks
    on files for which we don't have a callback promise from the
    server (and so don't expect the server to tell us when it
    changes).

    Without this, for every file or directory for which we still have
    an in-core inode that gets changed on the server, we may get a
    message logged when we next look at it. This can happen in bulk
    if, for instance, someone does "vos release" to update a R/O
    volume from a R/W volume and a whole set of files are all changed
    together.

    We only really want to log a message if the file changed and the
    server didn't tell us about it or we failed to track the state
    internally.

    (2) Fix accidental corruption of either afs_vlserver struct objects or
    the following memory locations (which could hold anything).
    The issue is caused by a union that points to two different
    structs in struct afs_call (to save space in the struct). The call
    cleanup code assumes that it can simply call the cleanup for one
    of those structs if not NULL - when it might be actually pointing
    to the other struct.

    This means that every Volume Location RPC op is going to corrupt
    something.

    (3) Fix an uninitialised spinlock. This isn't too bad, it just causes
    a one-off warning if lockdep is enabled when "vos release" is
    called, but the spinlock still behaves correctly.

    (4) Fix the setting of i_blocks in the inode. This causes du, for
    example, to produce incorrect results, but otherwise should not be
    dangerous to the kernel"

    * tag 'afs-fixes-20190620' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    afs: Fix setting of i_blocks
    afs: Fix uninitialised spinlock afs_volume::cb_break_lock
    afs: Fix vlserver record corruption
    afs: Fix over zealous "vnode modified" warnings

    Linus Torvalds
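
The hazard in item (2) can be sketched with a union of two pointer types. This is an illustration only: the `toy_*` names are hypothetical, and the discriminant tag shown here is one generic way to keep cleanup honest, not necessarily what the actual patch does.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; these are not the AFS structures. */
struct toy_server   { int put_count; };
struct toy_vlserver { int put_count; };

enum toy_call_kind { TOY_CALL_FS, TOY_CALL_VL };

struct toy_call {
    enum toy_call_kind kind;          /* records which union member is live */
    union {
        struct toy_server   *server;
        struct toy_vlserver *vlserver;
    } u;
};

static void toy_put_server(struct toy_server *s)
{
    if (s)
        s->put_count++;
}

static void toy_put_vlserver(struct toy_vlserver *v)
{
    if (v)
        v->put_count++;
}

/*
 * The buggy shape would test only "u.server != NULL" and call
 * toy_put_server() on it, even when the pointer is really a vlserver.
 * Dispatching on the tag releases the object through the right type.
 */
static void toy_call_cleanup(struct toy_call *call)
{
    assert(call != NULL);

    switch (call->kind) {
    case TOY_CALL_FS:
        toy_put_server(call->u.server);
        call->u.server = NULL;
        break;
    case TOY_CALL_VL:
        toy_put_vlserver(call->u.vlserver);
        call->u.vlserver = NULL;
        break;
    }
}
```

A non-NULL check alone cannot tell the two members apart, since both occupy the same storage.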
     
  • Replace the uid/gid/perm permissions checking on a key with an ACL to allow
    the SETATTR and SEARCH permissions to be split. This will also allow a
    greater range of subjects to be represented.

    ============
    WHY DO THIS?
    ============

    The problem is that SETATTR and SEARCH cover a slew of actions, not all of
    which should be grouped together.

    For SETATTR, this includes actions that are about controlling access to a
    key:

    (1) Changing a key's ownership.

    (2) Changing a key's security information.

    (3) Setting a keyring's restriction.

    And actions that are about managing a key's lifetime:

    (4) Setting an expiry time.

    (5) Revoking a key.

    and (proposed) managing a key as part of a cache:

    (6) Invalidating a key.

    Managing a key's lifetime doesn't really have anything to do with
    controlling access to that key.

    Expiry time is awkward since it's more about the lifetime of the content
    and so, in some ways, goes better with WRITE permission. It can, however,
    be set unconditionally by a process with an appropriate authorisation token
    for instantiating a key, and can also be set by the key type driver when a
    key is instantiated, so lumping it with the access-controlling actions is
    probably okay.

    As for SEARCH permission, that currently covers:

    (1) Finding keys in a keyring tree during a search.

    (2) Permitting keyrings to be joined.

    (3) Invalidation.

    But these don't really belong together either, since these actions really
    need to be controlled separately.

    Finally, there are a number of special cases to do with granting the
    administrator special rights to invalidate or clear keys that I would like
    to handle with the ACL rather than key flags and special checks.

    ===============
    WHAT IS CHANGED
    ===============

    The SETATTR permission is split to create two new permissions:

    (1) SET_SECURITY - which allows the key's owner, group and ACL to be
    changed and a restriction to be placed on a keyring.

    (2) REVOKE - which allows a key to be revoked.

    The SEARCH permission is split to create:

    (1) SEARCH - which allows a keyring to be searched and a key to be found.

    (2) JOIN - which allows a keyring to be joined as a session keyring.

    (3) INVAL - which allows a key to be invalidated.

    The WRITE permission is also split to create:

    (1) WRITE - which allows a key's content to be altered and links to be
    added, removed and replaced in a keyring.

    (2) CLEAR - which allows a keyring to be cleared completely. This is
    split out to make it possible to give just this to an administrator.

    (3) REVOKE - see above.

    Keys acquire ACLs which consist of a series of ACEs, and all that apply are
    unioned together. An ACE specifies a subject, such as:

    (*) Possessor - permitted to anyone who 'possesses' a key
    (*) Owner - permitted to the key owner
    (*) Group - permitted to the key group
    (*) Everyone - permitted to everyone

    Note that 'Other' has been replaced with 'Everyone' on the assumption that
    you wouldn't grant a permit to 'Other' that you wouldn't also grant to
    everyone else.

    Further subjects may be made available by later patches.

    The ACE also specifies a permissions mask. The set of permissions is now:

    VIEW          Can view the key metadata
    READ          Can read the key content
    WRITE         Can update/modify the key content
    SEARCH        Can find the key by searching/requesting
    LINK          Can make a link to the key
    SET_SECURITY  Can change owner, ACL, expiry
    INVAL         Can invalidate
    REVOKE        Can revoke
    JOIN          Can join this keyring
    CLEAR         Can clear this keyring
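
The rule that all applicable ACEs are unioned together can be sketched as follows. This is a userspace model under stated assumptions: the bit values, the `toy_*` names, and the `toy_caller` description of who is asking are all hypothetical and do not come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values; the real constants may differ. */
enum {
    TOY_VIEW         = 0x001,
    TOY_READ         = 0x002,
    TOY_WRITE        = 0x004,
    TOY_SEARCH       = 0x008,
    TOY_LINK         = 0x010,
    TOY_SET_SECURITY = 0x020,
    TOY_INVAL        = 0x040,
    TOY_REVOKE       = 0x080,
    TOY_JOIN         = 0x100,
    TOY_CLEAR        = 0x200
};

enum toy_subject { TOY_POSSESSOR, TOY_OWNER, TOY_GROUP, TOY_EVERYONE };

struct toy_ace {
    enum toy_subject subject;
    uint32_t perms;
};

/* How the caller relates to the key (illustrative only). */
struct toy_caller {
    bool possesses;
    bool is_owner;
    bool in_group;
};

static bool toy_ace_applies(const struct toy_ace *ace,
                            const struct toy_caller *who)
{
    switch (ace->subject) {
    case TOY_POSSESSOR: return who->possesses;
    case TOY_OWNER:     return who->is_owner;
    case TOY_GROUP:     return who->in_group;
    case TOY_EVERYONE:  return true;
    }
    return false;
}

/* OR together the masks of every ACE whose subject matches the caller. */
static uint32_t toy_effective_perms(const struct toy_ace *acl, size_t n,
                                    const struct toy_caller *who)
{
    uint32_t allowed = 0;

    assert(acl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (toy_ace_applies(&acl[i], who))
            allowed |= acl[i].perms;
    }
    return allowed;
}
```

With an ACL of Owner = VIEW|READ|WRITE, Everyone = VIEW and Possessor = SEARCH, an owner who also possesses the key ends up with all four permissions, while an unrelated caller gets VIEW only.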

    The KEYCTL_SETPERM function is then deprecated.

    The KEYCTL_SET_TIMEOUT function is then permitted if SET_SECURITY is set,
    or if the caller has a valid instantiation auth token.

    The KEYCTL_INVALIDATE function then requires INVAL.

    The KEYCTL_REVOKE function then requires REVOKE.

    The KEYCTL_JOIN_SESSION_KEYRING function then requires JOIN to join an
    existing keyring.

    The JOIN permission is enabled by default for session keyrings and manually
    created keyrings only.

    ======================
    BACKWARD COMPATIBILITY
    ======================

    To maintain backward compatibility, KEYCTL_SETPERM will translate the
    permissions mask it is given into a new ACL for a key - unless
    KEYCTL_SET_ACL has been called on that key, in which case an error will be
    returned.

    It will convert possessor, owner, group and other permissions into separate
    ACEs, if each portion of the mask is non-zero.

    SETATTR permission turns on all of INVAL, REVOKE and SET_SECURITY. WRITE
    permission turns on WRITE, REVOKE and, if a keyring, CLEAR. JOIN is turned
    on if a keyring is being altered.

    The KEYCTL_DESCRIBE function translates the ACL back into a permissions
    mask to return depending on possessor, owner, group and everyone ACEs.

    It will make the following mappings:

    (1) INVAL, JOIN -> SEARCH

    (2) SET_SECURITY -> SETATTR

    (3) REVOKE -> WRITE if SETATTR isn't already set

    (4) CLEAR -> WRITE

    Note that the value subsequently returned by KEYCTL_DESCRIBE may not match
    the value set with KEYCTL_SETPERM.
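
The two translations above, and why a round trip may not return the original mask, can be sketched for a single subject's portion of the old mask. The `OLD_*` values follow the classic 6-bit per-subject layout; the `ACE_*` values and function names are hypothetical. Carrying VIEW, READ, SEARCH and LINK over unchanged is an assumption the text does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One subject's portion of the old permissions mask. */
#define OLD_VIEW    0x01u
#define OLD_READ    0x02u
#define OLD_WRITE   0x04u
#define OLD_SEARCH  0x08u
#define OLD_LINK    0x10u
#define OLD_SETATTR 0x20u

/* New ACE bits; these values are made up for the sketch. */
#define ACE_VIEW         0x001u
#define ACE_READ         0x002u
#define ACE_WRITE        0x004u
#define ACE_SEARCH       0x008u
#define ACE_LINK         0x010u
#define ACE_SET_SECURITY 0x020u
#define ACE_INVAL        0x040u
#define ACE_REVOKE       0x080u
#define ACE_JOIN         0x100u
#define ACE_CLEAR        0x200u

/* SETPERM direction: old portion -> ACE mask (0 means no ACE is made). */
static uint32_t old_to_ace(uint32_t old, bool is_keyring)
{
    uint32_t ace = 0;

    assert((old & ~0x3fu) == 0);   /* exactly one 6-bit portion expected */
    if (old == 0)
        return 0;
    if (old & OLD_VIEW)   ace |= ACE_VIEW;    /* assumed to carry over */
    if (old & OLD_READ)   ace |= ACE_READ;    /* assumed to carry over */
    if (old & OLD_SEARCH) ace |= ACE_SEARCH;  /* assumed to carry over */
    if (old & OLD_LINK)   ace |= ACE_LINK;    /* assumed to carry over */
    if (old & OLD_WRITE) {
        ace |= ACE_WRITE | ACE_REVOKE;
        if (is_keyring)
            ace |= ACE_CLEAR;
    }
    if (old & OLD_SETATTR)
        ace |= ACE_INVAL | ACE_REVOKE | ACE_SET_SECURITY;
    if (is_keyring)
        ace |= ACE_JOIN;
    return ace;
}

/* DESCRIBE direction: ACE mask -> old portion, using mappings (1) to (4). */
static uint32_t ace_to_old(uint32_t ace)
{
    uint32_t old = 0;

    if (ace & ACE_VIEW)   old |= OLD_VIEW;
    if (ace & ACE_READ)   old |= OLD_READ;
    if (ace & ACE_WRITE)  old |= OLD_WRITE;
    if (ace & ACE_SEARCH) old |= OLD_SEARCH;
    if (ace & ACE_LINK)   old |= OLD_LINK;
    if (ace & (ACE_INVAL | ACE_JOIN)) old |= OLD_SEARCH;   /* (1) */
    if (ace & ACE_SET_SECURITY)       old |= OLD_SETATTR;  /* (2) */
    if ((ace & ACE_REVOKE) && !(old & OLD_SETATTR))
        old |= OLD_WRITE;                                  /* (3) */
    if (ace & ACE_CLEAR)              old |= OLD_WRITE;    /* (4) */
    return old;
}
```

In this model, setting only VIEW on a keyring yields VIEW plus JOIN in the ACE, which then reads back as VIEW plus SEARCH; that is one way the described value can differ from the one that was set.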

    =======
    TESTING
    =======

    This passes the keyutils testsuite for all but a couple of tests:

    (1) tests/keyctl/dh_compute/badargs: The first wrong-key-type test now
    returns EOPNOTSUPP rather than ENOKEY as READ permission isn't removed
    if the type doesn't have ->read(). You still can't actually read the
    key.

    (2) tests/keyctl/permitting/valid: The view-other-permissions test doesn't
    work as Other has been replaced with Everyone in the ACL.

    Signed-off-by: David Howells

    David Howells
     
  • Create a request_key_net() function and use it to pass the network
    namespace domain tag into DNS resolver keys and rxrpc/AFS keys so that keys
    for different domains can coexist in the same keyring.

    Signed-off-by: David Howells
    cc: netdev@vger.kernel.org
    cc: linux-nfs@vger.kernel.org
    cc: linux-cifs@vger.kernel.org
    cc: linux-afs@lists.infradead.org

    David Howells
     
  • When ceph_mdsc_build_path is handed a positive dentry, it will return a
    zero-length path string with the base set to that dentry. This is not
    what we want. Always include at least one path component in the string.

    ceph_mdsc_build_path has behaved this way for a long time but it didn't
    matter until the recent d_name handling rework.

    Fixes: 964fff7491e4 ("ceph: use ceph_mdsc_build_path instead of clone_dentry_name")
    Signed-off-by: Jeff Layton
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov

    Jeff Layton
     

27 Jun, 2019

1 commit


25 Jun, 2019

1 commit


22 Jun, 2019

5 commits

  • Pull more NFS client fixes from Anna Schumaker:
    "These are mostly refcounting issues that people have found recently.
    The revert fixes a suspend recovery performance issue.

    - SUNRPC: Fix a credential refcount leak

    - Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"

    - SUNRPC: Fix xps refcount imbalance on the error path

    - NFS4: Only set creation opendata if O_CREAT"

    * tag 'nfs-for-5.2-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
    SUNRPC: Fix a credential refcount leak
    Revert "SUNRPC: Declare RPC timers as TIMER_DEFERRABLE"
    net :sunrpc :clnt :Fix xps refcount imbalance on the error path
    NFS4: Only set creation opendata if O_CREAT

    Linus Torvalds
     
  • Stephen reports:

    I hit the following General Protection Fault when testing io_uring via
    the io_uring engine in fio. This was on a VM running 5.2-rc5 and the
    latest version of fio. The issue occurs for both null_blk and fake NVMe
    drives. I have not tested bare metal or real NVMe SSDs. The fio script
    used is given below.

    [io_uring]
    time_based=1
    runtime=60
    filename=/dev/nvme2n1 (note /dev/nullb0 also fails)
    ioengine=io_uring
    bs=4k
    rw=readwrite
    direct=1
    fixedbufs=1
    sqthread_poll=1
    sqthread_poll_cpu=0

    general protection fault: 0000 [#1] SMP PTI
    CPU: 0 PID: 872 Comm: io_uring-sq Not tainted 5.2.0-rc5-cpacket-io-uring #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    RIP: 0010:fput_many+0x7/0x90
    Code: 01 48 85 ff 74 17 55 48 89 e5 53 48 8b 1f e8 a0 f9 ff ff 48 85 db 48 89 df 75 f0 5b 5d f3 c3 0f 1f 40 00 0f 1f 44 00 00 89 f6 48 29 77 38 74 01 c3 55 48 89 e5 53 48 89 fb 65 48 \

    RSP: 0018:ffffadeb817ebc50 EFLAGS: 00010246
    RAX: 0000000000000004 RBX: ffff8f46ad477480 RCX: 0000000000001805
    RDX: 0000000000000000 RSI: 0000000000000001 RDI: f18b51b9a39552b5
    RBP: ffffadeb817ebc58 R08: ffff8f46b7a318c0 R09: 000000000000015d
    R10: ffffadeb817ebce8 R11: 0000000000000020 R12: ffff8f46ad4cd000
    R13: 00000000fffffff7 R14: ffffadeb817ebe30 R15: 0000000000000004
    FS: 0000000000000000(0000) GS:ffff8f46b7a00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055828f0bbbf0 CR3: 0000000232176004 CR4: 00000000003606f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    ? fput+0x13/0x20
    io_free_req+0x20/0x40
    io_put_req+0x1b/0x20
    io_submit_sqe+0x40a/0x680
    ? __switch_to_asm+0x34/0x70
    ? __switch_to_asm+0x40/0x70
    io_submit_sqes+0xb9/0x160
    ? io_submit_sqes+0xb9/0x160
    ? __switch_to_asm+0x40/0x70
    ? __switch_to_asm+0x34/0x70
    ? __schedule+0x3f2/0x6a0
    ? __switch_to_asm+0x34/0x70
    io_sq_thread+0x1af/0x470
    ? __switch_to_asm+0x34/0x70
    ? wait_woken+0x80/0x80
    ? __switch_to+0x85/0x410
    ? __switch_to_asm+0x40/0x70
    ? __switch_to_asm+0x34/0x70
    ? __schedule+0x3f2/0x6a0
    kthread+0x105/0x140
    ? io_submit_sqes+0x160/0x160
    ? kthread+0x105/0x140
    ? io_submit_sqes+0x160/0x160
    ? kthread_destroy_worker+0x50/0x50
    ret_from_fork+0x35/0x40

    which occurs because using a kernel side submission thread isn't valid
    without using fixed files (registered through io_uring_register()). This
    causes io_uring to put the request after logging an error, but before
    the file field is set in the request. If it happens to be non-zero, we
    attempt to fput() garbage.

    Fix this by ensuring that req->file is initialized when the request is
    allocated.

    Cc: stable@vger.kernel.org # 5.1+
    Reported-by: Stephen Bates
    Tested-by: Stephen Bates
    Signed-off-by: Jens Axboe

    Jens Axboe
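
The fix described above amounts to giving the file pointer a known value at allocation time, so that an early error path cannot drop a reference on garbage. A minimal userspace sketch, with hypothetical `toy_*` names and a poison pattern standing in for recycled allocator memory:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins; these are not the io_uring structures. */
struct toy_file {
    int refcount;
};

struct toy_req {
    struct toy_file *file;
    int error;
};

static struct toy_req *toy_alloc_req(void)
{
    struct toy_req *req = malloc(sizeof(*req));

    if (!req)
        return NULL;
    /* Poison the memory to mimic a recycled object with stale contents. */
    memset(req, 0xa5, sizeof(*req));
    /* The essential step: never leave the pointer holding stale bytes. */
    req->file = NULL;
    req->error = 0;
    return req;
}

/* Returns 1 if a file reference was dropped, 0 if there was none. */
static int toy_free_req(struct toy_req *req)
{
    int dropped = 0;

    if (req->file) {
        assert(req->file->refcount > 0);
        req->file->refcount--;
        dropped = 1;
    }
    free(req);
    return dropped;
}
```

Without the explicit `req->file = NULL`, the non-NULL check in `toy_free_req()` would pass on the poison value and dereference an invalid pointer, which mirrors the fput() on garbage seen in the trace.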
     
  • We can end up in nfs4_opendata_alloc during task exit, in which case
    current->fs has already been cleaned up. This leads to a crash in
    current_umask().

    Fix this by only setting creation opendata if we are actually doing an open
    with O_CREAT. We can drop the check for NULL nfs4_open_createattrs, since
    O_CREAT will never be set for the recovery path.

    Suggested-by: Trond Myklebust
    Signed-off-by: Benjamin Coddington
    Signed-off-by: Anna Schumaker

    Benjamin Coddington
     
  • Pull still more SPDX updates from Greg KH:
    "Another round of SPDX updates for 5.2-rc6

    Here is what I am guessing is going to be the last "big" SPDX update
    for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates
    that were "easy" to determine by pattern matching. The ones after this
    are going to be a bit more difficult and the people on the spdx list
    will be discussing them on a case-by-case basis now.

    Another 5000+ files are fixed up, so our overall totals are:
    Files checked: 64545
    Files with SPDX: 45529

    Compared to the 5.1 kernel which was:
    Files checked: 63848
    Files with SPDX: 22576

    This is a huge improvement.

    Also, we deleted another 20000 lines of boilerplate license crud,
    always nice to see in a diffstat"

    * tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits)
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486
    treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485
    ...

    Linus Torvalds
     
  • Pull cifs fixes from Steve French:
    "Four small SMB3 fixes, all for stable"

    * tag '5.2-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
    cifs: fix GlobalMid_Lock bug in cifs_reconnect
    SMB3: retry on STATUS_INSUFFICIENT_RESOURCES instead of failing write
    cifs: add spinlock for the openFileList to cifsInodeInfo
    cifs: fix panic in smb2_reconnect

    Linus Torvalds
     

21 Jun, 2019

1 commit