15 Jan, 2020

1 commit

  • commit dd499f7a7e34270208350a849ef103c0b3ae477f upstream.

    copy_thread implementations handle CLONE_SETTLS by reading the TLS
    value from the registers containing the syscall arguments for
    clone. This doesn't work with clone3 since the TLS value is passed
    in clone_args instead.
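
    A compile-only sketch of the idea, with simplified, assumed type and
    function names (the real fix touches each architecture's
    copy_thread()): per-arch code consumes the TLS value resolved in
    kernel/fork.c, which clone() fills from the argument register and
    clone3() from clone_args.tls, rather than reading the register
    directly.

    struct kernel_clone_args_sketch {
            /* clone(): from the TLS argument register;
             * clone3(): from the userspace struct's tls field. */
            unsigned long tls;
    };

    static unsigned long child_tls(const struct kernel_clone_args_sketch *args)
    {
            return args->tls;   /* valid for both entry points */
    }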

    Signed-off-by: Amanieu d'Antras
    Cc: # 5.3.x
    Link: https://lore.kernel.org/r/20200102172413.654385-8-amanieu@gmail.com
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Amanieu d'Antras
     

29 Nov, 2019

3 commits

  • commit 150d71584b12809144b8145b817e83b81158ae5f upstream.

    To allow separate handling of the futex exit state in the futex exit code
    for exit and exec, split futex_mm_release() into two functions and invoke
    them from the corresponding exit/exec_mm_release() callsites.

    Preparatory only, no functional change.
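
    A hedged, compile-only sketch of the resulting call shape (opaque stub
    declarations; the real bodies live in the futex and fork code):

    struct task_struct;
    struct mm_struct;

    void futex_exit_release(struct task_struct *tsk);   /* exit path */
    void futex_exec_release(struct task_struct *tsk);   /* exec path */
    void mm_release(struct task_struct *tsk, struct mm_struct *mm);

    /* Only the exit path marks the futex exit state. */
    static void exit_mm_release_sketch(struct task_struct *tsk,
                                       struct mm_struct *mm)
    {
            futex_exit_release(tsk);
            mm_release(tsk, mm);
    }

    static void exec_mm_release_sketch(struct task_struct *tsk,
                                       struct mm_struct *mm)
    {
            futex_exec_release(tsk);
            mm_release(tsk, mm);
    }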

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.332094221@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 4610ba7ad877fafc0a25a30c6c82015304120426 upstream.

    mm_release() contains the futex exit handling. mm_release() is called from
    do_exit()->exit_mm() and from exec()->exec_mm().

    In the exit_mm() case PF_EXITING is set and the futex state is updated.
    In the exec_mm() case these states are not touched.

    As the futex exit code needs further protections against exit races, this
    needs to be split into two functions.

    Preparatory only, no functional change.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.240518241@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit ba31c1a48538992316cc71ce94fa9cd3e7b427c0 upstream.

    The futex exit handling is #ifdeffed into mm_release() which is not pretty
    to begin with. But upcoming changes to address futex exit races need to add
    more functionality to this exit code.

    Split it out into a function, move it into futex code and make the various
    futex exit functions static.

    Preparatory only and no functional change.

    Folded build fix from Borislav.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

20 Nov, 2019

1 commit

  • pidfd_poll() is defined as returning 'unsigned int' but the
    .poll method is declared as returning '__poll_t', a bitwise type.

    Fix this by using the proper return type and using the EPOLL
    constants instead of the POLL ones, as required for __poll_t.
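
    A hedged sketch of the corrected shape (userspace stand-ins for the
    kernel's __poll_t and EPOLL* definitions; the real method also takes
    the struct file and poll_table arguments):

    typedef unsigned int __poll_t;          /* bitwise type in the kernel */
    #define EPOLLIN ((__poll_t)0x00000001)

    static __poll_t pidfd_poll_sketch(int task_has_exited)
    {
            __poll_t poll_flags = 0;

            if (task_has_exited)
                    poll_flags = EPOLLIN;   /* EPOLL* constants, not POLL* */
            return poll_flags;
    }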

    Fixes: b53b0b9d9a61 ("pidfd: add polling support")
    Cc: Joel Fernandes (Google)
    Cc: stable@vger.kernel.org # 5.3
    Signed-off-by: Luc Van Oostenryck
    Reviewed-by: Christian Brauner
    Link: https://lore.kernel.org/r/20191120003320.31138-1-luc.vanoostenryck@gmail.com
    Signed-off-by: Christian Brauner

    Luc Van Oostenryck
     

05 Nov, 2019

1 commit

  • Validate the stack arguments and set up the stack depending on whether
    it grows down or up.

    Legacy clone() required userspace to know in which direction the stack
    grows and pass down the stack pointer appropriately. To make things
    more confusing, microblaze uses a variant of the clone() syscall
    selected by CONFIG_CLONE_BACKWARDS3 that takes an additional stack_size
    argument. IA64 has a separate clone2() syscall which also takes an
    additional stack_size argument. Finally, parisc has a stack that grows
    upwards. Userspace therefore has a lot of nasty code like the following:

    #define __STACK_SIZE (8 * 1024 * 1024)

    pid_t sys_clone(int (*fn)(void *), void *arg, int flags, int *pidfd)
    {
            pid_t ret;
            void *stack;

            stack = malloc(__STACK_SIZE);
            if (!stack)
                    return -ENOMEM;

    #ifdef __ia64__
            ret = __clone2(fn, stack, __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
    #elif defined(__parisc__) /* stack grows up */
            ret = clone(fn, stack, flags | SIGCHLD, arg, pidfd);
    #else
            ret = clone(fn, stack + __STACK_SIZE, flags | SIGCHLD, arg, pidfd);
    #endif
            return ret;
    }

    or even crazier variants such as [3].

    With clone3() we have the ability to validate the stack. We can check that
    when stack_size is passed, the stack pointer is valid and the other way
    around. We can also check that the memory area userspace gave us is fine to
    use via access_ok(). Furthermore, we probably should not require
    userspace to know in which direction the stack is growing. It is easy
    for us to do this in the kernel and I couldn't find the original
    reasoning behind exposing this detail to userspace.

    /* Intentional user visible API change */
    clone3() was released with 5.3. Currently, it is not documented and very
    unclear to userspace how the stack and stack_size argument have to be
    passed. After talking to glibc folks we concluded that trying to change
    clone3() to setup the stack instead of requiring userspace to do this is
    the right course of action.
    Note, that this is an explicit change in user visible behavior we introduce
    with this patch. If it breaks someone's use-case we will revert! (And then
    e.g. place the new behavior under an appropriate flag.)
    Breaking someone's use-case is very unlikely though. First, neither
    glibc nor musl currently exposes a wrapper for clone3(). Second, there
    is no real motivation for anyone to use clone3() directly yet, since it
    provides no feature that legacy clone() doesn't. New features for
    clone3() will first land in v5.5, which is why v5.4 is still a good
    time to make this change and backport it to v5.3. Searches on [4] did
    not reveal any packages calling clone3().
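
    A hedged usage sketch of the new semantics (raw syscall, since no libc
    wrapper exists; the __NR_clone3 fallback of 435 matches the syscall
    number reserved for clone3): userspace passes the stack *base* plus
    stack_size, and the kernel derives the initial stack pointer, whatever
    the architecture's stack-growth direction.

    #define _GNU_SOURCE
    #include <linux/sched.h>        /* struct clone_args (v5.3+ headers) */
    #include <signal.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #ifndef __NR_clone3
    #define __NR_clone3 435
    #endif

    #define STACK_SIZE (8 * 1024 * 1024)

    int main(void)
    {
            void *stack = malloc(STACK_SIZE);
            if (!stack)
                    return EXIT_FAILURE;

            struct clone_args args = {
                    .exit_signal = SIGCHLD,
                    .stack       = (uintptr_t)stack,   /* base, not top */
                    .stack_size  = STACK_SIZE,
            };

            pid_t pid = syscall(__NR_clone3, &args, sizeof(args));
            if (pid < 0)
                    return EXIT_FAILURE;
            if (pid == 0)
                    _exit(EXIT_SUCCESS);    /* child: exit right away */

            waitpid(pid, NULL, 0);
            free(stack);
            return EXIT_SUCCESS;
    }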

    [1]: https://lore.kernel.org/r/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@mail.gmail.com
    [2]: https://lore.kernel.org/r/20191028172143.4vnnjpdljfnexaq5@wittgenstein
    [3]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/raw-clone.h#L31
    [4]: https://codesearch.debian.net
    Fixes: 7f192e3cd316 ("fork: add clone3")
    Cc: Kees Cook
    Cc: Jann Horn
    Cc: David Howells
    Cc: Ingo Molnar
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Florian Weimer
    Cc: Peter Zijlstra
    Cc: linux-api@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: # 5.3
    Cc: GNU C Library
    Signed-off-by: Christian Brauner
    Acked-by: Arnd Bergmann
    Acked-by: Aleksa Sarai
    Link: https://lore.kernel.org/r/20191031113608.20713-1-christian.brauner@ubuntu.com

    Christian Brauner
     

08 Oct, 2019

1 commit

  • Partially revert 16db3d3f1170 ("kernel/sysctl.c: threads-max observe
    limits") because the patch is causing a regression for any workload
    which needs to override the kernel's auto-tuned limit.

    set_max_threads() implements a boot-time guesstimate to provide a
    sensible limit on concurrently running threads so that runaways will
    not deplete all the memory. This is a good thing in general, but there
    are workloads which might need to increase this limit for an
    application to run (reportedly WebSphere MQ is affected) and that is
    simply not possible after the mentioned change. It is also very
    dubious to override an admin decision with an estimate that has no
    direct relation to the correctness of kernel operation.

    Fix this by dropping set_max_threads() from sysctl_max_threads() so
    any value is accepted as long as it fits into MAX_THREADS, which is
    important to check because allowing more threads could break the
    internal robust futex restriction. While at it, do not use MIN_THREADS
    as the lower boundary, because it too is only a heuristic for
    automatic estimation, and an admin might have a good reason to stop
    new threads from being created even below this limit.

    This became more severe when x86 switched from 8k to 16k kernel
    stacks: since 6538b8ea886e ("x86_64: expand kernel stack to 16K")
    (3.16) we use THREAD_SIZE_ORDER = 2, which halved the auto-tuned
    value.

    In the particular case:

        3.12: kernel.threads-max = 515561
        4.4:  kernel.threads-max = 200000

    Neither of the two values is really insane on a 32GB machine.

    I am not sure we want/need to tune the max_thread value further. If
    anything the tuning should be removed altogether if proven not useful in
    general. But we definitely need a way to override this auto-tuning.
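
    A hedged usage sketch (hypothetical standalone program; the write
    requires root): after the revert, a value above the boot-time estimate
    is accepted as long as it stays below MAX_THREADS.

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/sys/kernel/threads-max", "r+");
            long cur;

            if (!f || fscanf(f, "%ld", &cur) != 1)
                    return 1;
            printf("auto-tuned threads-max = %ld\n", cur);

            rewind(f);
            fprintf(f, "%ld\n", 515561L);   /* the 3.12-era value above */
            fclose(f);
            return 0;
    }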

    Link: http://lkml.kernel.org/r/20190922065801.GB18814@dhcp22.suse.cz
    Fixes: 16db3d3f1170 ("kernel/sysctl.c: threads-max observe limits")
    Signed-off-by: Michal Hocko
    Reviewed-by: "Eric W. Biederman"
    Cc: Heinrich Schuchardt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

05 Oct, 2019

1 commit

  • …/kernel/git/brauner/linux

    Pull copy_struct_from_user() helper from Christian Brauner:
    "This contains the copy_struct_from_user() helper which got split out
    from the openat2() patchset. It is a generic interface designed to
    copy a struct from userspace.

    The helper will be especially useful for structs versioned by size of
    which we have quite a few. This allows for backwards compatibility,
    i.e. an extended struct can be passed to an older kernel, or a legacy
    struct can be passed to a newer kernel. For the first case (extended
    struct, older kernel) the new fields in an extended struct can be set
    to zero and the struct safely passed to an older kernel.

    The most obvious benefit is that this helper lets us get rid of
    duplicate code present in at least sched_setattr(), perf_event_open(),
    and clone3(). More importantly it will also help to ensure that users
    implementing versioning-by-size end up with the same core semantics.

    This point is especially crucial since we have at least one case where
    versioning-by-size is used but with slightly different semantics:
    sched_setattr(), perf_event_open(), and clone3() all do similar
    checks to copy_struct_from_user() while rt_sigprocmask(2) always
    rejects differently-sized struct arguments.

    With this pull request we also switch over sched_setattr(),
    perf_event_open(), and clone3() to use the new helper"

    * tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    usercopy: Add parentheses around assignment in test_copy_struct_from_user
    perf_event_open: switch to copy_struct_from_user()
    sched_setattr: switch to copy_struct_from_user()
    clone3: switch to copy_struct_from_user()
    lib: introduce copy_struct_from_user() helper
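
    A hedged userspace restatement of the helper's core semantics (the
    kernel version additionally validates the user pointer; the function
    name here is illustrative):

    #include <errno.h>
    #include <stddef.h>
    #include <string.h>

    /* usize > ksize: extended struct on an older kernel -- accept only if
     * the unknown trailing bytes are zero.  usize < ksize: legacy struct
     * on a newer kernel -- zero-fill the kernel-side tail. */
    static int copy_struct_sketch(void *dst, size_t ksize,
                                  const void *src, size_t usize)
    {
            size_t size = ksize < usize ? ksize : usize;

            if (usize > ksize) {
                    const unsigned char *tail = (const unsigned char *)src + ksize;
                    size_t rest = usize - ksize;

                    while (rest--)
                            if (*tail++ != 0)
                                    return -E2BIG;
            }
            memcpy(dst, src, size);
            if (ksize > size)
                    memset((unsigned char *)dst + size, 0, ksize - size);
            return 0;
    }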

    Linus Torvalds
     

01 Oct, 2019

1 commit

  • Switch the clone3() syscall from its own logic for copying struct
    clone_args from userspace to the new dedicated copy_struct_from_user()
    helper.

    The change is very straightforward, and helps unify the syscall
    interface for struct-from-userspace syscalls. Additionally, explicitly
    define CLONE_ARGS_SIZE_VER0 to match the other users of the
    struct-extension pattern.

    Signed-off-by: Aleksa Sarai
    Reviewed-by: Kees Cook
    Reviewed-by: Christian Brauner
    [christian.brauner@ubuntu.com: improve commit message]
    Link: https://lore.kernel.org/r/20191001011055.19283-3-cyphar@cyphar.com
    Signed-off-by: Christian Brauner

    Aleksa Sarai
     

29 Sep, 2019

1 commit

  • Pull scheduler fixes from Ingo Molnar:

    - Apply a number of membarrier related fixes and cleanups, which fixes
    a use-after-free race in the membarrier code

    - Introduce proper RCU protection for tasks on the runqueue - to get
    rid of the subtle task_rcu_dereference() interface that was easy to
    get wrong

    - Misc fixes, but also an EAS speedup

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/fair: Avoid redundant EAS calculation
    sched/core: Remove double update_max_interval() call on CPU startup
    sched/core: Fix preempt_schedule() interrupt return comment
    sched/fair: Fix -Wunused-but-set-variable warnings
    sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
    sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
    sched/membarrier: Skip IPIs when mm->mm_users == 1
    selftests, sched/membarrier: Add multi-threaded test
    sched/membarrier: Fix p->mm->membarrier_state racy load
    sched/membarrier: Call sync_core only before usermode for same mm
    sched/membarrier: Remove redundant check
    sched/membarrier: Fix private expedited registration check
    tasks, sched/core: RCUify the assignment of rq->curr
    tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
    tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
    tasks: Add a count of task RCU users
    sched/core: Convert vcpu_is_preempted() from macro to an inline function
    sched/fair: Remove unused cfs_rq_clock_task() function

    Linus Torvalds
     

26 Sep, 2019

1 commit

  • When a user process exits, the kernel cleans up the mm_struct of the user
    process and during cleanup, check_mm() checks the page tables of the user
    process for corruption (E.g: unexpected page flags set/cleared). For
    corrupted page tables, the error message printed by check_mm() isn't very
    clear as it prints the loop index instead of page table type (E.g:
    Resident file mapping pages vs Resident shared memory pages). The loop
    index in check_mm() is used to index rss_stat[] which represents
    individual memory type stats. Hence, instead of printing index, print
    memory type, thereby improving error message.

    Without patch:
    --------------
    [ 204.836425] mm/pgtable-generic.c:29: bad p4d 0000000089eb4e92(800000025f941467)
    [ 204.836544] BUG: Bad rss-counter state mm:00000000f75895ea idx:0 val:2
    [ 204.836615] BUG: Bad rss-counter state mm:00000000f75895ea idx:1 val:5
    [ 204.836685] BUG: non-zero pgtables_bytes on freeing mm: 20480

    With patch:
    -----------
    [ 69.815453] mm/pgtable-generic.c:29: bad p4d 0000000084653642(800000025ca37467)
    [ 69.815872] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_FILEPAGES val:2
    [ 69.815962] BUG: Bad rss-counter state mm:00000000014a6c03 type:MM_ANONPAGES val:5
    [ 69.816050] BUG: non-zero pgtables_bytes on freeing mm: 20480

    Also, change print function (from printk(KERN_ALERT, ..) to pr_alert()) so
    that it matches the other print statement.
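
    A hedged sketch of the change's shape (the type names mirror the
    "with patch" output above; the real table covers the kernel's
    NR_MM_COUNTERS entries):

    #include <stdio.h>

    static const char *const resident_page_types[] = {
            "MM_FILEPAGES",
            "MM_ANONPAGES",
            "MM_SWAPENTS",
            "MM_SHMEMPAGES",
    };

    static void report_bad_rss_counter(unsigned int i, long val)
    {
            /* Print the memory type instead of the raw loop index. */
            printf("BUG: Bad rss-counter state type:%s val:%ld\n",
                   resident_page_types[i], val);
    }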

    Link: http://lkml.kernel.org/r/da75b5153f617f4c5739c08ee6ebeb3d19db0fbc.1565123758.git.sai.praneeth.prakhya@intel.com
    Signed-off-by: Sai Praneeth Prakhya
    Reviewed-by: Anshuman Khandual
    Suggested-by: Dave Hansen
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Dave Hansen
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sai Praneeth Prakhya
     

25 Sep, 2019

2 commits

  • In the ordinary case today the RCU grace period for a task_struct is
    triggered when another process waits for its zombie and causes the
    kernel to call release_task(). As the waiting task has to receive a
    signal and then act upon it before this happens, typically this will
    occur after the original task has been removed from the runqueue.

    Unfortunately, in some cases, such as self-reaping tasks, it can be
    shown that release_task() will be called, starting the grace period
    for the task_struct, long before the task leaves the runqueue.

    Therefore use put_task_struct_rcu_user() in finish_task_switch() to
    guarantee that there is an RCU lifetime after the task leaves the
    runqueue.

    Besides the change in the start of the RCU grace period for the
    task_struct, this change may cause perf_event_delayed_put() and
    trace_sched_process_free() to run later. The function
    perf_event_delayed_put() boils down to just a WARN_ON for cases that I
    assume can never happen, so I don't see any problem with delaying it.

    The function trace_sched_process_free() is a tracepoint and thus
    visible to user space. Occasionally userspace has the strangest
    dependencies, so this has a minuscule chance of causing a regression.
    This change only changes the timing of when the tracepoint is called;
    the change in timing arguably gives userspace a more accurate picture
    of what is going on. So I don't expect there to be a regression.

    In the case where a task self-reaps we are pretty much guaranteed that
    the RCU grace period is delayed, so we should get quite a bit of
    coverage of this worst case for the change in a normal threaded
    workload. I expect any issues to turn up quickly or not at all.

    I have lightly tested this change and everything appears to work
    fine.

    Inspired-by: Linus Torvalds
    Inspired-by: Oleg Nesterov
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Davidlohr Bueso
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/87r24jdpl5.fsf_-_@x220.int.ebiederm.org
    Signed-off-by: Ingo Molnar

    Eric W. Biederman
     
  • Add a count of the number of RCU users (currently 1) of the task
    struct so that we can later add the scheduler case and get rid of the
    very subtle task_rcu_dereference(), and just use rcu_dereference().

    As suggested by Oleg have the count overlap rcu_head so that no
    additional space in task_struct is required.
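
    A hedged sketch of the overlap (simplified stand-in types; the real
    union lives in struct task_struct): the rcu_head storage can be shared
    because it is only needed once the last user has dropped its count.

    #include <stdatomic.h>

    struct rcu_head_sketch {
            struct rcu_head_sketch *next;
            void (*func)(struct rcu_head_sketch *head);
    };

    struct task_struct_sketch {
            union {
                    atomic_int rcu_users;           /* while references remain */
                    struct rcu_head_sketch rcu;     /* reused for the callback */
            };
    };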

    Inspired-by: Linus Torvalds
    Inspired-by: Oleg Nesterov
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Davidlohr Bueso
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org
    Signed-off-by: Ingo Molnar

    Eric W. Biederman
     

22 Sep, 2019

1 commit

  • Pull hmm updates from Jason Gunthorpe:
    "This is more cleanup and consolidation of the hmm APIs and the very
    strongly related mmu_notifier interfaces. Many places across the tree
    using these interfaces are touched in the process. Beyond that a
    cleanup to the page walker API and a few memremap related changes
    round out the series:

    - General improvement of hmm_range_fault() and related APIs, more
    documentation, bug fixes from testing, API simplification &
    consolidation, and unused API removal

    - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
    and make them internal kconfig selects

    - Hoist a lot of code related to mmu notifier attachment out of
    drivers by using a refcount get/put attachment idiom and remove the
    convoluted mmu_notifier_unregister_no_release() and related APIs.

    - General API improvement for the migrate_vma API and revision of its
    only user in nouveau

    - Annotate mmu_notifiers with lockdep and sleeping region debugging

    Two series unrelated to HMM or mmu_notifiers came along due to
    dependencies:

    - Allow pagemap's memremap_pages family of APIs to work without
    providing a struct device

    - Make walk_page_range() and related use a constant structure for
    function pointers"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
    libnvdimm: Enable unit test infrastructure compile checks
    mm, notifier: Catch sleeping/blocking for !blockable
    kernel.h: Add non_block_start/end()
    drm/radeon: guard against calling an unpaired radeon_mn_unregister()
    csky: add missing brackets in a macro for tlb.h
    pagewalk: use lockdep_assert_held for locking validation
    pagewalk: separate function pointers from iterator data
    mm: split out a new pagewalk.h header from mm.h
    mm/mmu_notifiers: annotate with might_sleep()
    mm/mmu_notifiers: prime lockdep
    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
    mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
    mm/hmm: hmm_range_fault() infinite loop
    mm/hmm: hmm_range_fault() NULL pointer bug
    mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
    mm/mmu_notifiers: remove unregister_no_release
    RDMA/odp: remove ib_ucontext from ib_umem
    RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    ...

    Linus Torvalds
     

18 Sep, 2019

1 commit

  • Pull core timer updates from Thomas Gleixner:
    "Timers and timekeeping updates:

    - A large overhaul of the posix CPU timer code which is a preparation
    for moving the CPU timer expiry out into task work so it can be
    properly accounted on the task/process.

    An update to the bogus permission checks will come later during the
    merge window, as feedback was not complete before heading off for
    travel.

    - Switch the timerqueue code to use cached rbtrees and get rid of the
    homebrewed caching of the leftmost node.

    - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
    single function

    - Implement the separation of hrtimers to be forced to expire in hard
    interrupt context even when PREEMPT_RT is enabled and mark the
    affected timers accordingly.

    - Implement a mechanism for hrtimers and the timer wheel to protect
    RT against priority inversion and live lock issues when a (hr)timer
    which should be canceled is currently executing the callback.
    Instead of infinitely spinning, the task which tries to cancel the
    timer blocks on a per cpu base expiry lock which is held and
    released by the (hr)timer expiry code.

    - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
    resulting in faster access to timekeeping functions.

    - Updates to various clocksource/clockevent drivers and their device
    tree bindings.

    - The usual small improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
    posix-cpu-timers: Fix permission check regression
    posix-cpu-timers: Always clear head pointer on dequeue
    hrtimer: Add a missing bracket and hide `migration_base' on !SMP
    posix-cpu-timers: Make expiry_active check actually work correctly
    posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
    tick: Mark sched_timer to expire in hard interrupt context
    hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
    x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
    posix-cpu-timers: Utilize timerqueue for storage
    posix-cpu-timers: Move state tracking to struct posix_cputimers
    posix-cpu-timers: Deduplicate rlimit handling
    posix-cpu-timers: Remove pointless comparisons
    posix-cpu-timers: Get rid of 64bit divisions
    posix-cpu-timers: Consolidate timer expiry further
    posix-cpu-timers: Get rid of zero checks
    rlimit: Rewrite non-sensical RLIMIT_CPU comment
    posix-cpu-timers: Respect INFINITY for hard RTTIME limit
    posix-cpu-timers: Switch thread group sampling to array
    posix-cpu-timers: Restructure expiry array
    posix-cpu-timers: Remove cputime_expires
    ...

    Linus Torvalds
     

17 Sep, 2019

2 commits

  • Pull ia64 updates from Tony Luck:
    "The big change here is removal of support for SGI Altix"

    * tag 'please-pull-ia64_for_5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux: (33 commits)
    genirq: remove the is_affinity_mask_valid hook
    ia64: remove CONFIG_SWIOTLB ifdefs
    ia64: remove support for machvecs
    ia64: move the screen_info setup to common code
    ia64: move the ROOT_DEV setup to common code
    ia64: rework iommu probing
    ia64: remove the unused sn_coherency_id symbol
    ia64: remove the SGI UV simulator support
    ia64: remove the zx1 swiotlb machvec
    ia64: remove CONFIG_ACPI ifdefs
    ia64: remove CONFIG_PCI ifdefs
    ia64: remove the hpsim platform
    ia64: remove now unused machvec indirections
    ia64: remove support for the SGI SN2 platform
    drivers: remove the SGI SN2 IOC4 base support
    drivers: remove the SGI SN2 IOC3 base support
    qla2xxx: remove SGI SN2 support
    qla1280: remove SGI SN2 support
    misc/sgi-xp: remove SGI SN2 support
    char/mspec: remove SGI SN2 support
    ...

    Linus Torvalds
     
  • Pull pidfd/waitid updates from Christian Brauner:
    "This contains two features and various tests.

    First, it adds support for waiting on processes through pidfds by
    adding the P_PIDFD type to the waitid() syscall. This completes the
    basic functionality of the pidfd api (cf. [1]). In the meantime we
    also have a new addition to the userspace projects that make use of
    the pidfd api. The Qt project was nice enough to send a mail pointing
    out that they have a PR up to switch to the pidfd api (cf. [2]).

    Second, this tag contains an extension to the waitid() syscall to make
    it possible to wait on the current process group in a race-free manner
    (even though the actual problem is very unlikely) by specifying 0
    together with the P_PGID type. This extension traces back to a
    discussion on the glibc development mailing list.

    There are also a range of tests for the features above. Additionally,
    the test-suite which detected the pidfd-polling race we fixed in [3]
    is included in this tag"

    [1] https://lwn.net/Articles/794707/
    [2] https://codereview.qt-project.org/c/qt/qtbase/+/108456
    [3] commit b191d6491be6 ("pidfd: fix a poll race when setting exit_state")

    * tag 'core-process-v5.4' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    waitid: Add support for waiting for the current process group
    tests: add pidfd poll tests
    tests: move common definitions and functions into pidfd.h
    pidfd: add pidfd_wait tests
    pidfd: add P_PIDFD to waitid()

    Linus Torvalds
     

12 Sep, 2019

1 commit

  • Previously, the higher 32 bits of the exit_signal field were lost when
    copied to the kernel args structure (which uses int as the type for
    the respective field). Moreover, as Oleg has noted, exit_signal is
    used unchecked, so it has to be checked for sanity before use; for the
    legacy syscalls, applying the CSIGNAL mask guarantees that it is at
    least non-negative; however, no such check is done in the clone3()
    code path, and that can break at least thread_group_leader().

    This commit adds a check to copy_clone_args_from_user() to verify that
    the exit signal is limited by CSIGNAL as with legacy clone() and that
    the signal is valid. With this we don't get the legacy clone behavior
    where an invalid signal could be handed down and would only be
    detected and ignored in do_notify_parent(). Users of clone3() will now
    get a proper error when they pass an invalid exit signal. Note that
    this is not user-visible behavior since no kernel with clone3() has
    been released yet.

    The following program will cause a splat on a non-fixed clone3() version
    and will fail correctly on a fixed version:

    #define _GNU_SOURCE
    /* Header names below were lost in formatting; this is a plausible set. */
    #include <linux/sched.h>
    #include <linux/types.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
            pid_t pid = -1;
            struct clone_args args = {0};
            args.exit_signal = -1;

            pid = syscall(__NR_clone3, &args, sizeof(struct clone_args));
            if (pid < 0)
                    exit(EXIT_FAILURE);

            if (pid == 0)
                    exit(EXIT_SUCCESS);

            wait(NULL);

            exit(EXIT_SUCCESS);
    }

    Fixes: 7f192e3cd316 ("fork: add clone3")
    Reported-by: Oleg Nesterov
    Suggested-by: Oleg Nesterov
    Suggested-by: Dmitry V. Levin
    Signed-off-by: Eugene Syromiatnikov
    Link: https://lore.kernel.org/r/4b38fa4ce420b119a4c6345f42fe3cec2de9b0b5.1568223594.git.esyr@redhat.com
    [christian.brauner@ubuntu.com: simplify check and rework commit message]
    Signed-off-by: Christian Brauner

    Eugene Syromiatnikov
     

28 Aug, 2019

3 commits

  • Put it where it belongs and clean up the ifdeffery in fork completely.

    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190821192922.743229404@linutronix.de

    Thomas Gleixner
     
  • The expiry cache belongs in the posix_cputimers container where the
    other cpu timer information lives.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Frederic Weisbecker
    Link: https://lkml.kernel.org/r/20190821192921.014444012@linutronix.de

    Thomas Gleixner
     
  • Per task/process data of posix CPU timers is all over the place which
    makes the code hard to follow and requires ifdeffery.

    Create a container to hold all this information in one place, so data is
    consolidated and the ifdeffery can be confined to the posix timer header
    file and removed from places like fork.

    As a first step, move the cpu_timers list head array into the new struct
    and clean up the initializers and simplify fork. The remaining #ifdef in
    fork will be removed later.
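
    A hedged, compile-only sketch of the consolidation step (stub list
    type; the real container gains more fields later in the series):

    struct list_head_sketch {
            struct list_head_sketch *next, *prev;
    };

    /* One container for the per-task posix CPU timer data, so the
     * CONFIG_POSIX_TIMERS ifdeffery can be confined to one header. */
    struct posix_cputimers_sketch {
            struct list_head_sketch cpu_timers[3];  /* PROF, VIRT, SCHED */
    };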

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Frederic Weisbecker
    Link: https://lkml.kernel.org/r/20190821192920.819418976@linutronix.de

    Thomas Gleixner
     

22 Aug, 2019

1 commit

  • From rdma.git

    Jason Gunthorpe says:

    ====================
    This is a collection of general cleanups for ODP to clarify some of the
    flows around umem creation and use of the interval tree.
    ====================

    The branch is based on v5.3-rc5 due to dependencies, and is being taken
    into hmm.git due to dependencies in the next patches.

    * odp_fixes:
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    RDMA/core: Make invalidate_range a device operation
    RDMA/odp: Use kvcalloc for the dma_list and page_list
    RDMA/odp: Check for overflow when computing the umem_odp end
    RDMA/odp: Provide ib_umem_odp_release() to undo the allocs
    RDMA/odp: Split creating a umem_odp from ib_umem_get
    RDMA/odp: Make the three ways to create a umem_odp clear
    RMDA/odp: Consolidate umem_odp initialization
    RDMA/odp: Make it clearer when a umem is an implicit ODP umem
    RDMA/odp: Iterate over the whole rbtree directly
    RDMA/odp: Use the common interval tree library instead of generic
    RDMA/mlx5: Fix MR npages calculation for IB_ACCESS_HUGETLB

    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

20 Aug, 2019

1 commit

  • This is a significant simplification, it eliminates all the remaining
    'hmm' stuff in mm_struct, eliminates krefing along the critical notifier
    paths, and takes away all the ugly locking and abuse of page_table_lock.

    mmu_notifier_get() provides the single struct hmm per struct mm which
    eliminates mm->hmm.

    It also directly guarantees that no mmu_notifier op callback is callable
    while concurrent free is possible, this eliminates all the krefs inside
    the mmu_notifier callbacks.

    The remaining krefs in the range code were overly cautious, drivers are
    already not permitted to free the mirror while a range exists.

    Link: https://lore.kernel.org/r/20190806231548.25242-6-jgg@ziepe.ca
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

02 Aug, 2019

1 commit

  • This adds the P_PIDFD type to waitid().
    One of the last remaining bits for the pidfd api is to make it possible
    to wait on pidfds. With P_PIDFD added to waitid() the parts of userspace
    that want to use the pidfd api to exclusively manage processes can do so
    now.

    One of the things this will unblock in the future is the ability to make
    it possible to retrieve the exit status via waitid(P_PIDFD) for
    non-parent processes if handed a _suitable_ pidfd that has this feature
    set. This is similar to what you can do on FreeBSD with kqueue(). It
    might even end up being possible to wait on a process as a non-parent if
    an appropriate property is enabled on the pidfd.

    With P_PIDFD no scoping of the process identified by the pidfd is
    possible, i.e. it explicitly blocks things such as wait4(-1), wait4(0),
    waitid(P_ALL), waitid(P_PGID) etc. It only allows for semantics
    equivalent to wait4(pid), waitid(P_PID). Users that need scoping should
    rely on pid-based wait*() syscalls for now.
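
    A hedged usage sketch (P_PIDFD took the value 3 in this series;
    pidfd_open(), syscall 434 from v5.3, is assumed here as a convenient
    way to get a pidfd for a fork()ed child):

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #ifndef P_PIDFD
    #define P_PIDFD 3
    #endif
    #ifndef __NR_pidfd_open
    #define __NR_pidfd_open 434
    #endif

    int main(void)
    {
            pid_t pid = fork();
            if (pid == 0)
                    _exit(42);

            int pidfd = (int)syscall(__NR_pidfd_open, pid, 0);
            if (pidfd < 0)
                    return 1;

            siginfo_t info = { 0 };
            if (waitid(P_PIDFD, pidfd, &info, WEXITED) == 0)
                    printf("child exited with status %d\n", info.si_status);
            return 0;
    }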

    Signed-off-by: Christian Brauner
    Reviewed-by: Kees Cook
    Reviewed-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Joel Fernandes (Google)
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: Jann Horn
    Cc: Andy Lutomirsky
    Cc: Andrew Morton
    Cc: Aleksa Sarai
    Cc: Linus Torvalds
    Cc: Al Viro
    Link: https://lore.kernel.org/r/20190727222229.6516-2-christian@brauner.io

    Christian Brauner
     

25 Jul, 2019

1 commit

  • When going through execve(), zero out the NUMA fault statistics instead of
    freeing them.

    During execve, the task is reachable through procfs and the scheduler. A
    concurrent /proc/*/sched reader can read data from a freed ->numa_faults
    allocation (confirmed by KASAN) and write it back to userspace.
    I believe that it would also be possible for a use-after-free read to occur
    through a race between a NUMA fault and execve(): task_numa_fault() can
    lead to task_numa_compare(), which invokes task_weight() on the currently
    running task of a different CPU.

    Another way to fix this would be to make ->numa_faults RCU-managed or add
    extra locking, but it seems easier to wipe the NUMA fault statistics on
    execve.
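
    A hedged sketch of the two cleanup modes the fix distinguishes
    (hypothetical helper; the real change keys the behavior of
    task_numa_free() on a "final" flag):

    #include <stdlib.h>
    #include <string.h>

    static void numa_faults_cleanup(unsigned long *faults, size_t n, int final)
    {
            if (final) {
                    /* Task teardown: no concurrent readers remain. */
                    free(faults);
            } else {
                    /* execve(): wipe in place so a concurrent reader
                     * sees zeros, never freed memory. */
                    memset(faults, 0, n * sizeof(*faults));
            }
    }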

    Signed-off-by: Jann Horn
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Petr Mladek
    Cc: Sergey Senozhatsky
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Fixes: 82727018b0d3 ("sched/numa: Call task_numa_free() from do_execve()")
    Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.com
    Signed-off-by: Ingo Molnar

    Jann Horn
     

17 Jul, 2019

1 commit

  • Pull pidfd and clone3 fixes from Christian Brauner:
    "This contains a bugfix for CLONE_PIDFD when used with the legacy clone
    syscall, two fixes to ensure that syscall numbering and clone3
    entrypoint implementations will stay consistent, and an update for the
    maintainers file:

    - The addition of clone3 broke CLONE_PIDFD for legacy clone on all
    architectures that use do_fork() directly instead of calling the
    clone syscall itself. (Fwiw, cleaning do_fork() up is on my todo.)

    The reason this happened was that during conversion of _do_fork()
    to use struct kernel_clone_args we missed that do_fork() is called
    directly by various architectures. This is fixed by making sure
    that the pidfd argument in struct kernel_clone_args is correctly
    initialized with the parent_tidptr argument passed down from
    do_fork(). Additionally, do_fork() missed a check to make
    CLONE_PIDFD and CLONE_PARENT_SETTID mutually exclusive just as
    clone() does. This is now fixed too.

    - When clone3() was introduced we skipped architectures that require
    special handling for fork-like syscalls. Their syscall tables did
    not contain any mention of clone3().

    To make sure that Arnd's work to make syscall numbers on all
    architectures identical (minus alpha) was not for naught we are
    placing a comment in all syscall tables that do not yet implement
    clone3(). The comment makes it clear that 435 is reserved for
    clone3 and should not be used.

    - Also, this contains a patch to make the clone3() syscall definition
    in asm-generic/unistd.h conditional on __ARCH_WANT_SYS_CLONE3. This
    lets us catch new architectures that implicitly make use of clone3
    without setting __ARCH_WANT_SYS_CLONE3 which is a good indicator
    that they did not check whether it needs special treatment or not.

    - Finally, this contains a patch to add me as maintainer for pidfd
    stuff so people can start blaming me (more)"

    * tag 'for-linus-20190715' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    MAINTAINERS: add new entry for pidfd api
    unistd: protect clone3 via __ARCH_WANT_SYS_CLONE3
    arch: mark syscall number 435 reserved for clone3
    clone: fix CLONE_PIDFD support

    Linus Torvalds
     

15 Jul, 2019

2 commits

  • Pull HMM updates from Jason Gunthorpe:
    "Improvements and bug fixes for the hmm interface in the kernel:

    - Improve clarity, locking and APIs related to the 'hmm mirror'
    feature merged last cycle. In linux-next we now see AMDGPU and
    nouveau to be using this API.

    - Remove old or transitional hmm APIs. These are holdovers from the
    past with no users, or APIs that existed only to manage cross tree
    conflicts. There are still a few more of these cleanups that didn't
    make the merge window cut off.

    - Improve some core mm APIs:
    - export alloc_pages_vma() for driver use
    - refactor into devm_request_free_mem_region() to manage
    DEVICE_PRIVATE resource reservations
    - refactor duplicative driver code into the core dev_pagemap
    struct

    - Remove hmm wrappers of improved core mm APIs, instead have drivers
    use the simplified API directly

    - Remove DEVICE_PUBLIC

    - Simplify the kconfig flow for the hmm users and core code"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
    mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
    mm: remove the HMM config option
    mm: sort out the DEVICE_PRIVATE Kconfig mess
    mm: simplify ZONE_DEVICE page private data
    mm: remove hmm_devmem_add
    mm: remove hmm_vma_alloc_locked_page
    nouveau: use devm_memremap_pages directly
    nouveau: use alloc_page_vma directly
    PCI/P2PDMA: use the dev_pagemap internal refcount
    device-dax: use the dev_pagemap internal refcount
    memremap: provide an optional internal refcount in struct dev_pagemap
    memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
    memremap: remove the data field in struct dev_pagemap
    memremap: add a migrate_to_ram method to struct dev_pagemap_ops
    memremap: lift the devmap_enable manipulation into devm_memremap_pages
    memremap: pass a struct dev_pagemap to ->kill and ->cleanup
    memremap: move dev_pagemap callbacks into a separate structure
    memremap: validate the pagemap type passed to devm_memremap_pages
    mm: factor out a devm_request_free_mem_region helper
    mm: export alloc_pages_vma
    ...

    Linus Torvalds
     
  • The introduction of the clone3 syscall accidentally broke CLONE_PIDFD
    support in the traditional clone syscall on compat x86 and those
    architectures that use do_fork to implement the clone syscall.

    This bug was found by strace test suite.

    Link: https://strace.io/logs/strace/2019-07-12
    Fixes: 7f192e3cd316 ("fork: add clone3")
    Bisected-and-tested-by: Anatoly Pugachev
    Signed-off-by: Dmitry V. Levin
    Link: https://lore.kernel.org/r/20190714162047.GB10389@altlinux.org
    Signed-off-by: Christian Brauner

    Dmitry V. Levin
     

12 Jul, 2019

1 commit

  • Pull clone3 system call from Christian Brauner:
    "This adds the clone3 syscall which is an extensible successor to clone
    after we snagged the last flag with CLONE_PIDFD during the 5.2 merge
    window for clone(). It cleanly supports all of the flags from clone()
    and thus all legacy workloads.

    There are a few user-visible differences between clone3 and clone.
    First, CLONE_DETACHED will cause EINVAL with clone3 so we can reuse
    this flag. Second, the CSIGNAL flag is deprecated and will cause
    EINVAL to be reported. It is superseded by a dedicated "exit_signal"
    argument in struct clone_args, thus freeing up even more flags. And
    third, clone3 gives CLONE_PIDFD a dedicated return argument in struct
    clone_args instead of abusing CLONE_PARENT_SETTID's parent_tidptr
    argument.

    The clone3 uapi is designed to be easy to handle on 32- and 64-bit:

    /* uapi */
    struct clone_args {
            __aligned_u64 flags;
            __aligned_u64 pidfd;
            __aligned_u64 child_tid;
            __aligned_u64 parent_tid;
            __aligned_u64 exit_signal;
            __aligned_u64 stack;
            __aligned_u64 stack_size;
            __aligned_u64 tls;
    };

    and a separate kernel struct is used that uses proper kernel typing:

    /* kernel internal */
    struct kernel_clone_args {
            u64 flags;
            int __user *pidfd;
            int __user *child_tid;
            int __user *parent_tid;
            int exit_signal;
            unsigned long stack;
            unsigned long stack_size;
            unsigned long tls;
    };

    The system call comes with a size argument which enables the kernel to
    detect what version of clone_args userspace is passing in. clone3
    validates that any additional bytes a given kernel does not know about
    are set to zero and that the size never exceeds a page.

    A nice feature is that this patchset allowed us to clean up and
    simplify various core kernel codepaths in kernel/fork.c by making the
    internal _do_fork() function take struct kernel_clone_args even for
    legacy clone().

    This patch also unblocks the time namespace patchset which wants to
    introduce a new CLONE_TIMENS flag.

    Note, that clone3 has only been wired up for x86{_32,64}, arm{64}, and
    xtensa. These were the architectures that did not require special
    massaging.

    Other architectures treat fork-like system calls individually and
    after some back and forth neither Arnd nor I felt confident that we
    dared to add clone3 unconditionally to all architectures. We agreed to
    leave this up to individual architecture maintainers. This is why
    there's an additional patch that introduces __ARCH_WANT_SYS_CLONE3
    which any architecture can set once it has implemented support for
    clone3. The patch also adds a cond_syscall(clone3) for architectures
    such as nios2 or h8300 that generate their syscall table by simply
    including asm-generic/unistd.h. The hope is to get rid of
    __ARCH_WANT_SYS_CLONE3 and cond_syscall() rather soon"

    * tag 'clone3-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    arch: handle arches who do not yet define clone3
    arch: wire-up clone3() syscall
    fork: add clone3

    Linus Torvalds
     

11 Jul, 2019

1 commit

  • Pull pidfd updates from Christian Brauner:
    "This adds two main features.

    - First, it adds polling support for pidfds. This allows process
    managers to know when a (non-parent) process dies in a race-free
    way.

    The notification mechanism used follows the same logic that is
    currently used when the parent of a task is notified of a child's
    death. With this patchset it is possible to put pidfds in an
    {e}poll loop and get reliable notifications for process (i.e.
    thread-group) exit.

    - The second feature complements the first one by making it possible
    to retrieve pollable pidfds for processes that were not created
    using CLONE_PIDFD.

    A lot of processes get created with traditional PID-based calls
    such as fork() or clone() (without CLONE_PIDFD). For these
    processes a caller can currently not create a pollable pidfd. This
    is a problem for Android's low memory killer (LMK) and service
    managers such as systemd.

    Both patchsets are accompanied by selftests.

    It's perhaps worth noting that the work done so far and the work done
    in this branch for pidfd_open() and polling support do already see
    some adoption:

    - Android is in the process of backporting this work to all their LTS
    kernels [1]

    - Service managers make use of pidfd_send_signal but will need to
    wait until we enable waiting on pidfds for full adoption.

    - And projects I maintain make use of both pidfd_send_signal and
    CLONE_PIDFD [2] and will use polling support and pidfd_open() too"

    [1] https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.9+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.14+backport%22
    https://android-review.googlesource.com/q/topic:%22pidfd+polling+support+4.19+backport%22

    [2] https://github.com/lxc/lxc/blob/aab6e3eb73c343231cdde775db938994fc6f2803/src/lxc/start.c#L1753

    * tag 'pidfd-updates-v5.3' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    tests: add pidfd_open() tests
    arch: wire-up pidfd_open()
    pid: add pidfd_open()
    pidfd: add polling selftests
    pidfd: add polling support

    Linus Torvalds
     

09 Jul, 2019

3 commits

  • Pull scheduler updates from Ingo Molnar:

    - Remove the unused per rq load array and all its infrastructure, by
    Dietmar Eggemann.

    - Add utilization clamping support by Patrick Bellasi. This is a
    refinement of the energy aware scheduling framework with support for
    boosting of interactive and capping of background workloads: to make
    sure critical GUI threads get maximum frequency ASAP, and to make
    sure background processing doesn't unnecessarily move the cpufreq
    governor to higher frequencies and less energy-efficient CPU modes.

    - Add the bare minimum of tracepoints required for LISA EAS regression
    testing, by Qais Yousef - which allows automated testing of various
    power management features, including energy aware scheduling.

    - Restructure the former tsk_nr_cpus_allowed() facility that the -rt
    kernel used to modify the scheduler's CPU affinity logic such as
    migrate_disable() - introduce the task->cpus_ptr value instead of
    taking the address of &task->cpus_allowed directly - by Sebastian
    Andrzej Siewior.

    - Misc optimizations, fixes, cleanups and small enhancements - see the
    Git log for details.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    sched/uclamp: Add uclamp support to energy_compute()
    sched/uclamp: Add uclamp_util_with()
    sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks
    sched/uclamp: Set default clamps for RT tasks
    sched/uclamp: Reset uclamp values on RESET_ON_FORK
    sched/uclamp: Extend sched_setattr() to support utilization clamping
    sched/core: Allow sched_setattr() to use the current policy
    sched/uclamp: Add system default clamps
    sched/uclamp: Enforce last task's UCLAMP_MAX
    sched/uclamp: Add bucket local max tracking
    sched/uclamp: Add CPU's clamp buckets refcounting
    sched/fair: Rename weighted_cpuload() to cpu_runnable_load()
    sched/debug: Export the newly added tracepoints
    sched/debug: Add sched_overutilized tracepoint
    sched/debug: Add new tracepoint to track PELT at se level
    sched/debug: Add new tracepoints to track PELT at rq level
    sched/debug: Add a new sched_trace_*() helper functions
    sched/autogroup: Make autogroup_path() always available
    sched/wait: Deduplicate code with do-while
    sched/topology: Remove unused 'sd' parameter from arch_scale_cpu_capacity()
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The main changes in this cycle are:

    - rwsem scalability improvements, phase #2, by Waiman Long, which are
    rather impressive:

    "On a 2-socket 40-core 80-thread Skylake system with 40 reader
    and writer locking threads, the min/mean/max locking operations
    done in a 5-second testing window before the patchset were:

    40 readers, Iterations Min/Mean/Max = 1,807/1,808/1,810
    40 writers, Iterations Min/Mean/Max = 1,807/50,344/151,255

    After the patchset, they became:

    40 readers, Iterations Min/Mean/Max = 30,057/31,359/32,741
    40 writers, Iterations Min/Mean/Max = 94,466/95,845/97,098"

    There are a lot of changes to the locking implementation that make
    it similar to qrwlock, including owner handoff for more fair
    locking.

    Another microbenchmark shows how across the spectrum the
    improvements are:

    "With a locking microbenchmark running on 5.1 based kernel, the
    total locking rates (in kops/s) on a 2-socket Skylake system
    with equal numbers of readers and writers (mixed) before and
    after this patchset were:

    # of Threads Before Patch After Patch
    ------------ ------------ -----------
    2 2,618 4,193
    4 1,202 3,726
    8 802 3,622
    16 729 3,359
    32 319 2,826
    64 102 2,744"

    The changes are extensive and the patch-set has been through
    several iterations addressing various locking workloads. There
    might be more regressions, but unless they are pathological I
    believe we want to use this new implementation as the baseline
    going forward.

    - jump-label optimizations by Daniel Bristot de Oliveira: the primary
    motivation was to remove IPI disturbance of isolated RT-workload
    CPUs, which resulted in the implementation of batched jump-label
    updates. Beyond the improvement to the kernel's real-time
    characteristics, in one test this patchset improved static key update
    overhead from 57 msecs to just 1.4 msecs - which is a nice speedup
    as well.

    - atomic64_t cross-arch type cleanups by Mark Rutland: over the last
    ~10 years of atomic64_t existence the various types used by the
    APIs only had to be self-consistent within each architecture -
    which means they became wildly inconsistent across architectures.
    Mark puts an end to this by reworking all the atomic64
    implementations to use 's64' as the base type for atomic64_t, and
    to ensure that this type is consistently used for parameters and
    return values in the API, avoiding further problems in this area.

    - A large set of small improvements to lockdep by Yuyang Du: type
    cleanups, output cleanups, function return type and other cleanups
    all around the place.

    - A set of percpu ops cleanups and fixes by Peter Zijlstra.

    - Misc other changes - please see the Git log for more details"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
    locking/lockdep: increase size of counters for lockdep statistics
    locking/atomics: Use sed(1) instead of non-standard head(1) option
    locking/lockdep: Move mark_lock() inside CONFIG_TRACE_IRQFLAGS && CONFIG_PROVE_LOCKING
    x86/jump_label: Make tp_vec_nr static
    x86/percpu: Optimize raw_cpu_xchg()
    x86/percpu, sched/fair: Avoid local_clock()
    x86/percpu, x86/irq: Relax {set,get}_irq_regs()
    x86/percpu: Relax smp_processor_id()
    x86/percpu: Differentiate this_cpu_{}() and __this_cpu_{}()
    locking/rwsem: Guard against making count negative
    locking/rwsem: Adaptive disabling of reader optimistic spinning
    locking/rwsem: Enable time-based spinning on reader-owned rwsem
    locking/rwsem: Make rwsem->owner an atomic_long_t
    locking/rwsem: Enable readers spinning on writer
    locking/rwsem: Clarify usage of owner's nonspinaable bit
    locking/rwsem: Wake up almost all readers in wait queue
    locking/rwsem: More optimal RT task handling of null owner
    locking/rwsem: Always release wait_lock before waking up tasks
    locking/rwsem: Implement lock handoff to prevent lock starvation
    locking/rwsem: Make rwsem_spin_on_owner() return owner state
    ...

    Linus Torvalds
     
  • Pull timer updates from Thomas Gleixner:
    "The timer and timekeeping departement delivers:

    Core:

    - The consolidation of the VDSO code into a generic library including
    the conversion of x86 and ARM64. Conversion of ARM and MIPS are en
    route through the relevant maintainer trees and should end up in
    5.4.

    This gets rid of the unnecessary different copies of the same code
    and brings all architectures on the same level of VDSO
    functionality.

    - Make the NTP user space interface more robust by restricting the
    TAI offset to prevent undefined behaviour. Includes a selftest.

    - Validate user input in the compat settimeofday() syscall to catch
    invalid values which would be turned into valid values by a
    multiplication overflow

    - Consolidate the time accessors

    - Small fixes, improvements and cleanups all over the place

    Drivers:

    - Support for the NXP system counter, TI davinci timer

    - Move the Microsoft HyperV clocksource/events code into the
    drivers/clocksource directory so it can be shared between x86 and
    ARM64.

    - Overhaul of the Tegra driver

    - Delay timer support for IXP4xx

    - Small fixes, improvements and cleanups as usual"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
    time: Validate user input in compat_settimeofday()
    timer: Document TIMER_PINNED
    clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
    clocksource/drivers: Make Hyper-V clocksource ISA agnostic
    MAINTAINERS: Fix Andy's surname and the directory entries of VDSO
    hrtimer: Use a bullet for the returns bullet list
    arm64: vdso: Fix compilation with clang older than 8
    arm64: compat: Fix __arch_get_hw_counter() implementation
    arm64: Fix __arch_get_hw_counter() implementation
    lib/vdso: Make delta calculation work correctly
    MAINTAINERS: Add entry for the generic VDSO library
    arm64: compat: No need for pre-ARMv7 barriers on an ARMv8 system
    arm64: vdso: Remove unnecessary asm-offsets.c definitions
    vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h
    clocksource/drivers/davinci: Add support for clocksource
    clocksource/drivers/davinci: Add support for clockevents
    clocksource/drivers/tegra: Set up maximum-ticks limit properly
    clocksource/drivers/tegra: Cycles can't be 0
    clocksource/drivers/tegra: Restore base address before cleanup
    clocksource/drivers/tegra: Add verbose definition for 1MHz constant
    ...

    Linus Torvalds
     

01 Jul, 2019

1 commit

  • Make sure to return a proper negative error code from copy_process()
    when anon_inode_getfile() fails with CLONE_PIDFD.
    Otherwise _do_fork() will not detect an error and get_task_pid() will
    operate on a nonsensical pointer:

    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
    R13: 00007ffc15fbb0ff R14: 00007ff07e47e9c0 R15: 0000000000000000
    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 7990 Comm: syz-executor290 Not tainted 5.2.0-rc6+ #9
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
    RIP: 0010:get_task_pid+0xe1/0x210 kernel/pid.c:372
    Code: 89 ff e8 62 27 5f 00 49 8b 07 44 89 f1 4c 8d bc c8 90 01 00 00 eb 0c
    e8 0d fe 25 00 49 81 c7 38 05 00 00 4c 89 f8 48 c1 e8 03 3c 18 00 74
    08 4c 89 ff e8 31 27 5f 00 4d 8b 37 e8 f9 47 12 00
    RSP: 0018:ffff88808a4a7d78 EFLAGS: 00010203
    RAX: 00000000000000a7 RBX: dffffc0000000000 RCX: ffff888088180600
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
    RBP: ffff88808a4a7d90 R08: ffffffff814fb3a8 R09: ffffed1015d66bf8
    R10: ffffed1015d66bf8 R11: 1ffff11015d66bf7 R12: 0000000000041ffc
    R13: 1ffff11011494fbc R14: 0000000000000000 R15: 000000000000053d
    FS: 00007ff07e47e700(0000) GS:ffff8880aeb00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000004b5100 CR3: 0000000094df2000 CR4: 00000000001406e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    _do_fork+0x1b9/0x5f0 kernel/fork.c:2360
    __do_sys_clone kernel/fork.c:2454 [inline]
    __se_sys_clone kernel/fork.c:2448 [inline]
    __x64_sys_clone+0xc1/0xd0 kernel/fork.c:2448
    do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:301
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Link: https://lore.kernel.org/lkml/000000000000e0dc0d058c9e7142@google.com
    Reported-and-tested-by: syzbot+002e636502bc4b64eb5c@syzkaller.appspotmail.com
    Fixes: 6fd2fe494b17 ("copy_process(): don't use ksys_close() on cleanups")
    Cc: Jann Horn
    Cc: Al Viro
    Signed-off-by: Christian Brauner

    Christian Brauner
     

29 Jun, 2019

1 commit

  • Commit 5eed6f1dff87 ("fork,memcg: fix crash in free_thread_stack on
    memcg charge fail") corrected two instances, but there was a third
    instance of this bug.

    Without setting tsk->stack, if memcg_charge_kernel_stack fails, it'll
    execute free_thread_stack() on a dangling pointer.

    Enterprise kernels are compiled with VMAP_STACK=y so this isn't
    critical, but custom VMAP_STACK=n builds should have some performance
    advantage, with the drawback of risking fork failure when compaction
    doesn't succeed. So as long as VMAP_STACK=n is a supported option it's
    worth fixing it upstream.

    Link: http://lkml.kernel.org/r/20190619011450.28048-1-aarcange@redhat.com
    Fixes: 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
    Signed-off-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

28 Jun, 2019

1 commit

  • This patch adds polling support to pidfd.

    Android's low memory killer (LMK) needs to know when a process dies
    once it is sent the kill signal. It does so by checking for the
    existence of /proc/pid, which is both racy and slow: for example, a
    PID may be reused between the time LMK sends a kill signal and the
    time it checks for the PID's existence, so the wrong PID can end up
    being checked. Using the polling support, LMK will be able to get
    notified when a process exits in a race-free and fast way, and it can
    do other things (such as polling on other fds) while awaiting the
    death of the process being killed.

    For notification to polling processes, we follow the same existing
    mechanism in the kernel used when the parent of the task group is to be
    notified of a child's death (do_notify_parent). This is precisely when the
    tasks waiting on a poll of pidfd are also awakened in this patch.

    We have decided to include the waitqueue in struct pid for the following
    reasons:
    1. The wait queue has to survive for the lifetime of the poll.
    Including it in task_struct would not be an option in this case
    because the task can be reaped and destroyed before the poll returns.

    2. Including the waitqueue in struct pid means that during
    de_thread(), the new thread group leader automatically gets the new
    waitqueue/pid even though its task_struct is different.

    Appropriate test cases are added in the second patch to provide coverage of
    all the cases the patch is handling.
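
    A hedged usage sketch (raw clone() with CLONE_PIDFD from the v5.2
    series; the argument order shown is the x86-64 one, and the fallback
    CLONE_PIDFD value of 0x1000 is an assumption for older headers):

    #define _GNU_SOURCE
    #include <poll.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef CLONE_PIDFD
    #define CLONE_PIDFD 0x00001000
    #endif

    int main(void)
    {
            int pidfd = -1;
            /* x86-64 order: flags, stack, parent_tid, child_tid, tls;
             * CLONE_PIDFD returns the fd through the parent_tid slot. */
            long pid = syscall(SYS_clone, CLONE_PIDFD | SIGCHLD, 0,
                               &pidfd, 0, 0);
            if (pid == 0) {
                    sleep(1);
                    _exit(0);
            }
            if (pid < 0 || pidfd < 0)
                    return 1;

            struct pollfd pfd = { .fd = pidfd, .events = POLLIN };
            poll(&pfd, 1, -1);      /* readable once the process exits */
            printf("process %ld exited\n", pid);
            return 0;
    }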

    Cc: Andy Lutomirski
    Cc: Steven Rostedt
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Jonathan Kowalski
    Cc: Linus Torvalds
    Cc: Al Viro
    Cc: Kees Cook
    Cc: David Howells
    Cc: Oleg Nesterov
    Cc: kernel-team@android.com
    Reviewed-by: Oleg Nesterov
    Co-developed-by: Daniel Colascione
    Signed-off-by: Daniel Colascione
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Christian Brauner

    Joel Fernandes (Google)
     

27 Jun, 2019

1 commit

  • anon_inode_getfd() should be used *ONLY* in situations when we are
    guaranteed to be past the last failure point (including copying the
    descriptor number to userland, at that). And ksys_close() should
    not be used for cleanups at all.

    anon_inode_getfile() is there for all nontrivial cases like that.
    Just use that...
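
    A hedged, kernel-context sketch of the prescribed pattern (fragment
    only; error labels and surrounding code elided):

    /* Reserve the descriptor and create the file separately ... */
    fd = get_unused_fd_flags(O_CLOEXEC);
    if (fd < 0)
            return fd;

    file = anon_inode_getfile("[pidfd]", &pidfd_fops, pid,
                              O_RDWR | O_CLOEXEC);
    if (IS_ERR(file)) {
            put_unused_fd(fd);      /* cleanup without ksys_close() */
            return PTR_ERR(file);
    }

    /* ... and publish the fd only past the last failure point. */
    fd_install(fd, file);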

    Fixes: b3e583825266 ("clone: add CLONE_PIDFD")
    Signed-off-by: Al Viro
    Reviewed-by: Jann Horn
    Signed-off-by: Christian Brauner

    Al Viro