29 Nov, 2019

11 commits

  • commit 3ef240eaff36b8119ac9e2ea17cbf41179c930ba upstream.

    Oleg provided the following test case:

    int main(void)
    {
            struct sched_param sp = {};

            sp.sched_priority = 2;
            assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);

            int lock = vfork();
            if (!lock) {
                    sp.sched_priority = 1;
                    assert(sched_setscheduler(0, SCHED_FIFO, &sp) == 0);
                    _exit(0);
            }

            syscall(__NR_futex, &lock, FUTEX_LOCK_PI, 0, 0, 0);
            return 0;
    }

    This creates an unkillable RT process spinning in futex_lock_pi() on a UP
    machine or if the process is affine to a single CPU. The reason is:

    parent                                  child

    set FIFO prio 2

    vfork()                 ->              set FIFO prio 1
      implies wait_for_child()              sched_setscheduler(...)
                                            exit()
                                            do_exit()
                                            ....
                                            mm_release()
                                              tsk->futex_state = FUTEX_STATE_EXITING;
                                              exit_futex(); (NOOP in this case)
                                              complete() --> wakes parent
    sys_futex()
      loop infinite because
      tsk->futex_state == FUTEX_STATE_EXITING

    The same problem can happen just by regular preemption as well:

    task holds futex
    ...
    do_exit()
      tsk->futex_state = FUTEX_STATE_EXITING;

    --> preemption (unrelated wakeup of some other higher prio task, e.g. timer)

    switch_to(other_task)

    return to user
    sys_futex()
      loop infinite as above

    Just for the fun of it the futex exit cleanup could trigger the wakeup
    itself before the task sets its futex state to DEAD.

    To cure this, the handling of the exiting owner is changed so:

    - A refcount is held on the task

    - The task pointer is stored in a caller visible location

    - The caller drops all locks (hash bucket, mmap_sem) and blocks
    on task::futex_exit_mutex. When the mutex is acquired then
    the exiting task has completed the cleanup and the state
    is consistent and can be reevaluated.

    This is not a pretty solution, but the only alternative would be returning
    an error code to user space, which would break the state consistency
    guarantee and open another can of problems, including regressions.
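
    Roughly, the waiter side pattern described above looks as follows
    (condensed sketch, not the literal mainline code; helper names are
    approximate):

    /*
     * Sketch: handle an owner which is in futex exit. 'exiting' was
     * stored by the attach path together with a task reference.
     */
    static void wait_for_owner_exiting(int ret, struct task_struct *exiting)
    {
            if (ret != -EBUSY)
                    return;

            /* All locks (hash bucket, mmap_sem) were dropped by the caller. */
            mutex_lock(&exiting->futex_exit_mutex);
            /*
             * Acquiring the mutex means the exiting task has completed its
             * futex exit cleanup, so the state is consistent again and the
             * caller can safely retry and reevaluate it.
             */
            mutex_unlock(&exiting->futex_exit_mutex);
            put_task_struct(exiting);
    }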

    For stable backports the preparatory commits ac31c7ff8624 .. ba31c1a48538
    are required as well, but for anything older than 5.3.y the backports are
    going to be provided when this hits mainline as the other dependencies for
    those kernels are definitely not stable material.

    Fixes: 778e9a9c3e71 ("pi-futex: fix exit races and locking problems")
    Reported-by: Oleg Nesterov
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Cc: Stable Team
    Link: https://lkml.kernel.org/r/20191106224557.041676471@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit ac31c7ff8624409ba3c4901df9237a616c187a5d upstream.

    attach_to_pi_owner() returns -EAGAIN for various cases:

    - Owner task is exiting
    - Futex value has changed

    The caller drops the held locks (hash bucket, mmap_sem) and retries the
    operation. In case the owner task is exiting this can result in a live
    lock.

    As a preparatory step for separating those cases, provide a distinct return
    value (EBUSY) for the owner exiting case.

    No functional change.
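
    Illustration of the distinction (sketch, not the literal diff):

    /* attach_to_pi_owner(), owner lookup result handling */
    if (unlikely(p->futex_state == FUTEX_STATE_EXITING))
            ret = -EBUSY;   /* owner is exiting, handled separately later */
    else
            ret = -EAGAIN;  /* transient condition, drop locks and retry  */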

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.935606117@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 3f186d974826847a07bc7964d79ec4eded475ad9 upstream.

    The mutex will be used in subsequent changes to replace the busy looping of
    a waiter when the futex owner is currently executing the exit cleanup to
    prevent a potential live lock.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.845798895@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit af8cbda2cfcaa5515d61ec500498d46e9a8247e2 upstream.

    exec() attempts to handle potentially held futexes gracefully by running
    the futex exit handling code like exit() does.

    The current implementation has no protection against concurrent incoming
    waiters. The reason is that the futex state cannot be set to
    FUTEX_STATE_DEAD after the cleanup because the task struct is still active
    and just about to execute the new binary.

    While it's arguably buggy when a task holds a futex over exec(), for
    consistency's sake the state handling can at least cover the actual futex
    exit cleanup section. This provides state consistency protection across
    the cleanup. As the futex state of the task becomes FUTEX_STATE_OK after the
    cleanup has finished, this cannot prevent subsequent attempts to
    attach to the task in case the cleanup was not successful in mopping
    up all leftovers.
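
    A condensed sketch of the resulting exec path (helper names approximate,
    not the literal code):

    void futex_exec_release(struct task_struct *tsk)
    {
            futex_exit_begin(tsk);          /* futex_state = FUTEX_STATE_EXITING */
            futex_cleanup(tsk);             /* robust list / pi state cleanup    */
            futex_cleanup_end(tsk, FUTEX_STATE_OK); /* task struct stays in use  */
    }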

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.753355618@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 4a8e991b91aca9e20705d434677ac013974e0e30 upstream.

    Instead of having a smp_mb() and an empty lock/unlock of task::pi_lock,
    move the state setting into the lock section.
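
    In effect (sketch of the before/after shape):

    /* before */
    tsk->futex_state = FUTEX_STATE_EXITING;
    smp_mb();
    raw_spin_lock_irq(&tsk->pi_lock);
    raw_spin_unlock_irq(&tsk->pi_lock);

    /* after */
    raw_spin_lock_irq(&tsk->pi_lock);
    tsk->futex_state = FUTEX_STATE_EXITING;
    raw_spin_unlock_irq(&tsk->pi_lock);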

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.645603214@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 18f694385c4fd77a09851fd301236746ca83f3cb upstream.

    Instead of relying on PF_EXITING use an explicit state for the futex exit
    and set it in the futex exit function. This moves the smp barrier and the
    lock/unlock serialization into the futex code.

    As with the DEAD state, this is restricted to the exit path as exec
    continues to use the same task struct.

    This allows that logic to be simplified in a subsequent step.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.539409004@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit f24f22435dcc11389acc87e5586239c1819d217c upstream.

    Setting task::futex_state in do_exit() is rather arbitrarily placed.
    Move it into the futex code.

    Note, this is only done for the exit cleanup as the exec cleanup cannot set
    the state to FUTEX_STATE_DEAD because the task struct is still in active
    use.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.439511191@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 150d71584b12809144b8145b817e83b81158ae5f upstream.

    To allow separate handling of the futex exit state in the futex exit code
    for exit and exec, split futex_mm_release() into two functions and invoke
    them from the corresponding exit/exec_mm_release() callsites.

    Preparatory only, no functional change.
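
    Roughly (sketch of the resulting callsites, not the literal diff):

    /* kernel/fork.c */
    void exit_mm_release(struct task_struct *tsk, struct mm_struct *mm)
    {
            futex_exit_release(tsk);
            mm_release(tsk, mm);
    }

    void exec_mm_release(struct task_struct *tsk, struct mm_struct *mm)
    {
            futex_exec_release(tsk);
            mm_release(tsk, mm);
    }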

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.332094221@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 3d4775df0a89240f671861c6ab6e8d59af8e9e41 upstream.

    The futex exit handling relies on PF_ flags. That's suboptimal as it
    requires a smp_mb() and an ugly lock/unlock of the exiting task's pi_lock in
    the middle of do_exit() to enforce the observability of PF_EXITING in the
    futex code.

    Add a futex_state member to task_struct and convert the PF_EXITPIDONE logic
    over to the new state. The PF_EXITING dependency will be cleaned up in a
    later step.

    This prepares for handling various futex exit issues later.
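
    Sketch of the new state (names as used throughout this series):

    enum {
            FUTEX_STATE_OK,
            FUTEX_STATE_EXITING,
            FUTEX_STATE_DEAD,
    };

    /* task_struct gains, under CONFIG_FUTEX: */
    unsigned int futex_state;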

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.149449274@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit ba31c1a48538992316cc71ce94fa9cd3e7b427c0 upstream.

    The futex exit handling is #ifdeffed into mm_release() which is not pretty
    to begin with. But upcoming changes to address futex exit races need to add
    more functionality to this exit code.

    Split it out into a function, move it into futex code and make the various
    futex exit functions static.

    Preparatory only and no functional change.

    Folded build fix from Borislav.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20191106224556.049705556@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit ca16d5bee59807bf04deaab0a8eccecd5061528c upstream.

    Robust futexes utilize the robust_list mechanism to allow the kernel to
    release futexes which are held when a task exits. The exit can be voluntary
    or caused by a signal or fault. This prevents waiters from blocking forever.

    The futex operations in user space store a pointer to the futex they are
    either locking or unlocking in the op_pending member of the per task robust
    list.

    After a lock operation has succeeded the futex is queued in the robust list
    linked list and the op_pending pointer is cleared.

    After an unlock operation has succeeded the futex is removed from the
    robust list linked list and the op_pending pointer is cleared.

    The robust list exit code checks for the pending operation and any futex
    which is queued in the linked list. It carefully checks whether the futex
    value is the TID of the exiting task. If so, it sets the OWNER_DIED bit and
    tries to wake up a potential waiter.

    This is race free for the lock operation but unlock has two race scenarios
    where waiters might not be woken up. These issues can be observed with
    regular robust pthread mutexes. PI aware pthread mutexes are not affected.

    (1) Unlocking task is killed after unlocking the futex value in user space
    before being able to wake a waiter.

    pthread_mutex_unlock()
            |
            V
    atomic_exchange_rel (&mutex->__data.__lock, 0)
    wakeup()
            |
            | (return to userspace)
            | (__lock = 0)
            |
            V
    oldval = mutex->__data.__lock
    atomic_compare_and_exchange_val_acq (&mutex->__data.__lock,
            id | assume_other_futex_waiters, 0)
            |
            | (enter kernel)
            |
            V
    do_exit()
            |
            V
    handle_futex_death()
            |
            | (__lock = 0)
            | (uval = 0)
            |
            V
    if ((uval & FUTEX_TID_MASK) != task_pid_vnr(curr))
            return 0;

    The sanity check which ensures that the user space futex is owned
    by the exiting task prevents the wakeup of waiters, which seems to
    be correct as the exiting task does not own the futex value, but
    the consequence is that other waiters won't be woken up and block
    infinitely.

    In both scenarios the following conditions are true:

    - task->robust_list->list_op_pending != NULL
    - user space futex value == 0
    - Regular futex (not PI)

    If these conditions are met then it is reasonably safe to wake up a
    potential waiter in order to prevent the above problems.

    As this might be a false positive it can cause spurious wakeups, but the
    waiter side has to handle other types of unrelated wakeups, e.g. signals,
    gracefully anyway. So such a spurious wakeup will not affect the
    correctness of these operations.

    This workaround must not touch the user space futex value and cannot set
    the OWNER_DIED bit because the lock value is 0, i.e. uncontended. Setting
    OWNER_DIED in this case would result in inconsistent state and subsequently
    in malfunction of the owner died handling in user space.

    The rest of the user space state is still consistent as no other task can
    observe the list_op_pending entry in the exiting task's robust list.

    The eventually woken up waiter will observe the uncontended lock value and
    take it over.
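
    Sketch of the resulting check in handle_futex_death() (approximate; the
    pending_op argument name is assumed):

    /*
     * pending_op: this entry came from ->list_op_pending rather than from
     * the robust list itself.
     */
    if (pending_op && !pi && !uval) {
            /* Uncontended value: nothing to fix up, just wake a waiter. */
            futex_wake(uaddr, 1, 1, FUTEX_BITSET_MATCH_ANY);
            return 0;
    }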

    [ tglx: Massaged changelog and comment. Made the return explicit and not
    depend on the subsequent check and added constants to hand into
    handle_futex_death() instead of plain numbers. Fixed a few coding
    style issues. ]

    Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
    Signed-off-by: Yang Tao
    Signed-off-by: Yi Wang
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/1573010582-35297-1-git-send-email-wang.yi59@zte.com.cn
    Link: https://lkml.kernel.org/r/20191106224555.943191378@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Yang Tao
     

01 Aug, 2019

2 commits

  • hrtimer_sleepers will gain a scheduling class dependent treatment on
    PREEMPT_RT. Use the new hrtimer_sleeper_start_expires() function to make
    that possible.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • hrtimer_init_sleeper() calls require prior initialisation of the hrtimer
    object which is embedded into the hrtimer_sleeper.

    Combine the initialization and spare a function call. Fixup all call sites.

    This is also a preparatory change for PREEMPT_RT to do hrtimer sleeper
    specific initializations of the embedded hrtimer without modifying any of
    the call sites.

    No functional change.
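
    For a call site this amounts to (sketch):

    /* before: two step initialization */
    hrtimer_init(&to->timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
    hrtimer_init_sleeper(to, current);

    /* after: combined; the task argument was removed separately */
    hrtimer_init_sleeper(to, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);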

    [ anna-maria: Minor cleanups ]
    [ tglx: Adapted to the removal of the task argument of
    hrtimer_init_sleeper() and trivial polishing.
    Folded a fix from Stephen Rothwell for the vsoc code ]

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20190726185752.887468908@linutronix.de

    Sebastian Andrzej Siewior
     

31 Jul, 2019

1 commit


03 Jun, 2019

1 commit


31 May, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version this program is distributed in the
    hope that it will be useful but without any warranty without even
    the implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details you
    should have received a copy of the gnu general public license along
    with this program if not write to the free software foundation inc
    59 temple place suite 330 boston ma 02111 1307 usa

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 1334 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Richard Fontana
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070033.113240726@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

29 May, 2019

1 commit

  • Add a new futex_setup_timer() helper function to consolidate all the
    hrtimer_sleeper setup code.
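
    Sketch of a converted call site (abridged):

    struct hrtimer_sleeper timeout, *to;

    to = futex_setup_timer(abs_time, &timeout, flags,
                           current->timer_slack_ns);
    ...
    if (to) {
            hrtimer_cancel(&to->timer);
            destroy_hrtimer_on_stack(&to->timer);
    }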

    Signed-off-by: Waiman Long
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Link: https://lkml.kernel.org/r/20190528160345.24017-1-longman@redhat.com

    Waiman Long
     

15 May, 2019

1 commit

  • To facilitate additional options to get_user_pages_fast() change the
    singular write parameter to be gup_flags.

    This patch does not change any functionality. New functionality will
    follow in subsequent patches.

    Some of the get_user_pages_fast() call sites were unchanged because they
    already passed FOLL_WRITE or 0 for the write parameter.
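
    For such a caller the conversion is mechanical (sketch):

    /* before: write = 1 */
    ret = get_user_pages_fast(addr, 1, 1, &page);

    /* after: pass FOLL_WRITE (or 0 for read-only access) */
    ret = get_user_pages_fast(addr, 1, FOLL_WRITE, &page);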

    NOTE: It was suggested to change the ordering of the get_user_pages_fast()
    arguments to ensure that callers were converted. This breaks the current
    GUP call site convention of having the returned pages be the final
    parameter. So the suggestion was rejected.

    Link: http://lkml.kernel.org/r/20190328084422.29911-4-ira.weiny@intel.com
    Link: http://lkml.kernel.org/r/20190317183438.2057-4-ira.weiny@intel.com
    Signed-off-by: Ira Weiny
    Reviewed-by: Mike Marshall
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Dan Williams
    Cc: "David S. Miller"
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Michal Hocko
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ira Weiny
     

26 Apr, 2019

1 commit

  • Some futex() operations, including FUTEX_WAKE_OP, require the kernel to
    perform an atomic read-modify-write of the futex word via the userspace
    mapping. These operations are implemented by each architecture in
    arch_futex_atomic_op_inuser() and futex_atomic_cmpxchg_inatomic(), which
    are called in atomic context with the relevant hash bucket locks held.

    Although these routines may return -EFAULT in response to a page fault
    generated when accessing userspace, they are expected to succeed (i.e.
    return 0) in all other cases. This poses a problem for architectures
    that do not provide bounded forward progress guarantees or fairness of
    contended atomic operations and can lead to starvation in some cases.

    In these problematic scenarios, we must return back to the core futex
    code so that we can drop the hash bucket locks and reschedule if
    necessary, much like we do in the case of a page fault.

    Allow architectures to return -EAGAIN from their implementations of
    arch_futex_atomic_op_inuser() and futex_atomic_cmpxchg_inatomic(), which
    will cause the core futex code to reschedule if necessary and return
    back to the architecture code later on.
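
    On the core side this ends up looking roughly like the following
    (simplified sketch; the lock handling differs per call site):

    ret = arch_futex_atomic_op_inuser(op, oparg, &oldval, uaddr);
    if (ret == -EAGAIN) {
            /* Drop the hash bucket locks, give the CPU away and retry. */
            double_unlock_hb(hb1, hb2);
            cond_resched();
            goto retry_private;
    }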

    Cc:
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Will Deacon

    Will Deacon
     

22 Mar, 2019

1 commit

    The futex code requires that the user space addresses of futexes are 32bit
    aligned. sys_futex() checks this in get_futex_key() but the robust list
    code has no alignment check in place.

    As a consequence the kernel crashes on architectures with strict alignment
    requirements in handle_futex_death() when trying to cmpxchg() on an
    unaligned futex address which was retrieved from the robust list.
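
    The added check boils down to (sketch):

    /* handle_futex_death(): robust list entries must be u32 aligned */
    if ((((unsigned long)uaddr) % sizeof(*uaddr)) != 0)
            return -1;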

    [ tglx: Rewrote changelog, proper sizeof() based alignment check and add
    comment ]

    Fixes: 0771dfefc9e5 ("[PATCH] lightweight robust futexes: core")
    Signed-off-by: Chen Jie
    Signed-off-by: Thomas Gleixner
    Cc:
    Cc:
    Cc:
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/1552621478-119787-1-git-send-email-chenjie6@huawei.com

    Chen Jie
     

06 Mar, 2019

2 commits

  • Pull locking updates from Ingo Molnar:
    "The biggest part of this tree is the new auto-generated atomics API
    wrappers by Mark Rutland.

    The primary motivation was to allow instrumentation without uglifying
    the primary source code.

    The linecount increase comes from adding the auto-generated files to
    the Git space as well:

    include/asm-generic/atomic-instrumented.h | 1689 ++++++++++++++++--
    include/asm-generic/atomic-long.h | 1174 ++++++++++---
    include/linux/atomic-fallback.h | 2295 +++++++++++++++++++++++++
    include/linux/atomic.h | 1241 +------------

    I preferred this approach, so that the full call stack of the (already
    complex) locking APIs is still fully visible in 'git grep'.

    But if this is excessive we could certainly hide them.

    There's a separate build-time mechanism to determine whether the
    headers are out of date (they should never be stale if we do our job
    right).

    Anyway, nothing from this should be visible to regular kernel
    developers.

    Other changes:

    - Add support for dynamic keys, which removes a source of false
    positives in the workqueue code, among other things (Bart Van
    Assche)

    - Updates to tools/memory-model (Andrea Parri, Paul E. McKenney)

    - qspinlock, wake_q and lockdep micro-optimizations (Waiman Long)

    - misc other updates and enhancements"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
    locking/lockdep: Shrink struct lock_class_key
    locking/lockdep: Add module_param to enable consistency checks
    lockdep/lib/tests: Test dynamic key registration
    lockdep/lib/tests: Fix run_tests.sh
    kernel/workqueue: Use dynamic lockdep keys for workqueues
    locking/lockdep: Add support for dynamic keys
    locking/lockdep: Verify whether lock objects are small enough to be used as class keys
    locking/lockdep: Check data structure consistency
    locking/lockdep: Reuse lock chains that have been freed
    locking/lockdep: Fix a comment in add_chain_cache()
    locking/lockdep: Introduce lockdep_next_lockchain() and lock_chain_count()
    locking/lockdep: Reuse list entries that are no longer in use
    locking/lockdep: Free lock classes that are no longer in use
    locking/lockdep: Update two outdated comments
    locking/lockdep: Make it easy to detect whether or not inside a selftest
    locking/lockdep: Split lockdep_free_key_range() and lockdep_reset_lock()
    locking/lockdep: Initialize the locks_before and locks_after lists earlier
    locking/lockdep: Make zap_class() remove all matching lock order entries
    locking/lockdep: Reorder struct lock_class members
    locking/lockdep: Avoid that add_chain_cache() adds an invalid chain to the cache
    ...

    Linus Torvalds
     
  • Pull year 2038 updates from Thomas Gleixner:
    "Another round of changes to make the kernel ready for 2038. After lots
    of preparatory work this is the first set of syscalls which are 2038
    safe:

    403 clock_gettime64
    404 clock_settime64
    405 clock_adjtime64
    406 clock_getres_time64
    407 clock_nanosleep_time64
    408 timer_gettime64
    409 timer_settime64
    410 timerfd_gettime64
    411 timerfd_settime64
    412 utimensat_time64
    413 pselect6_time64
    414 ppoll_time64
    416 io_pgetevents_time64
    417 recvmmsg_time64
    418 mq_timedsend_time64
    419 mq_timedreceiv_time64
    420 semtimedop_time64
    421 rt_sigtimedwait_time64
    422 futex_time64
    423 sched_rr_get_interval_time64

    The syscall numbers are identical all over the architectures"

    * 'timers-2038-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
    riscv: Use latest system call ABI
    checksyscalls: fix up mq_timedreceive and stat exceptions
    unicore32: Fix __ARCH_WANT_STAT64 definition
    asm-generic: Make time32 syscall numbers optional
    asm-generic: Drop getrlimit and setrlimit syscalls from default list
    32-bit userspace ABI: introduce ARCH_32BIT_OFF_T config option
    compat ABI: use non-compat openat and open_by_handle_at variants
    y2038: add 64-bit time_t syscalls to all 32-bit architectures
    y2038: rename old time and utime syscalls
    y2038: remove struct definition redirects
    y2038: use time32 syscall names on 32-bit
    syscalls: remove obsolete __IGNORE_ macros
    y2038: syscalls: rename y2038 compat syscalls
    x86/x32: use time64 versions of sigtimedwait and recvmmsg
    timex: change syscalls to use struct __kernel_timex
    timex: use __kernel_timex internally
    sparc64: add custom adjtimex/clock_adjtime functions
    time: fix sys_timer_settime prototype
    time: Add struct __kernel_timex
    time: make adjtime compat handling available for 32 bit
    ...

    Linus Torvalds
     

28 Feb, 2019

1 commit


11 Feb, 2019

2 commits

  • atomic_t variables are currently used to implement reference
    counters with the following properties:

    - counter is initialized to 1 using atomic_set()
    - a resource is freed upon counter reaching zero
    - once counter reaches zero, its further
    increments aren't allowed
    - counter schema uses basic atomic operations
    (set, inc, inc_not_zero, dec_and_test, etc.)

    Such atomic variables should be converted to a newly provided
    refcount_t type and API that prevents accidental counter overflows
    and underflows. This is important since overflows and underflows
    can lead to use-after-free situation and be exploitable.

    The variable futex_pi_state.refcount is used as pure
    reference counter. Convert it to refcount_t and fix up
    the operations.
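
    The conversion is mechanical for the basic operations (sketch):

    atomic_set(&pi_state->refcount, 1);        -> refcount_set(&pi_state->refcount, 1);
    atomic_inc(&pi_state->refcount);           -> refcount_inc(&pi_state->refcount);
    atomic_inc_not_zero(&pi_state->refcount);  -> refcount_inc_not_zero(&pi_state->refcount);
    atomic_dec_and_test(&pi_state->refcount);  -> refcount_dec_and_test(&pi_state->refcount);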

    **Important note for maintainers:

    Some functions from refcount_t API defined in lib/refcount.c
    have different memory ordering guarantees than their atomic
    counterparts. Please check Documentation/core-api/refcount-vs-atomic.rst
    for more information.

    Normally the differences should not matter since refcount_t provides
    enough guarantees to satisfy the refcounting use cases, but in
    some rare cases it might matter.
    Please double check that you don't have some undocumented
    memory guarantees for this variable usage.

    For the futex_pi_state.refcount it might make a difference
    in following places:

    - get_pi_state() and exit_pi_state_list(): increment in
    refcount_inc_not_zero() only guarantees control dependency
    on success vs. fully ordered atomic counterpart
    - put_pi_state(): decrement in refcount_dec_and_test() provides
    RELEASE ordering and ACQUIRE ordering on success
    vs. fully ordered atomic counterpart

    Suggested-by: Kees Cook
    Signed-off-by: Elena Reshetova
    Reviewed-by: David Windsor
    Reviewed-by: Hans Liljestrand
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: dvhart@infradead.org
    Link: http://lkml.kernel.org/r/1549369467-3505-1-git-send-email-elena.reshetova@intel.com
    Signed-off-by: Ingo Molnar

    Elena Reshetova
     
  • …/arnd/playground into timers/2038

    Pull y2038 - time64 system calls from Arnd Bergmann:

    This series finally gets us to the point of having system calls with 64-bit
    time_t on all architectures, after a long time of incremental preparation
    patches.

    There was actually one conversion that I missed during the summer,
    i.e. Deepa's timex series, which I now updated based the 5.0-rc1 changes
    and review comments.

    The following system calls are now added on all 32-bit architectures using
    the same system call numbers:

    403 clock_gettime64
    404 clock_settime64
    405 clock_adjtime64
    406 clock_getres_time64
    407 clock_nanosleep_time64
    408 timer_gettime64
    409 timer_settime64
    410 timerfd_gettime64
    411 timerfd_settime64
    412 utimensat_time64
    413 pselect6_time64
    414 ppoll_time64
    416 io_pgetevents_time64
    417 recvmmsg_time64
    418 mq_timedsend_time64
    419 mq_timedreceiv_time64
    420 semtimedop_time64
    421 rt_sigtimedwait_time64
    422 futex_time64
    423 sched_rr_get_interval_time64

    Each one of these corresponds directly to an existing system call that
    includes a 'struct timespec' argument, or a structure containing a timespec
    or (in case of clock_adjtime) timeval. Not included here are new versions
    of getitimer/setitimer and getrusage/waitid, which are planned for the
    future but only needed to make a consistent API rather than for correct
    operation beyond y2038. These four system calls are based on 'timeval', and
    it has not been finally decided what the replacement kernel interface will
    use instead.

    So far, I have done a lot of build testing across most architectures, which
    has found a number of bugs. Runtime testing so far included testing LTP on
    32-bit ARM with the existing system calls, to ensure we do not regress for
    existing binaries, and a test with a 32-bit x86 build of LTP against a
    modified version of the musl C library that has been adapted to the new
    system call interface [3]. This library can be used for testing on all
    architectures supported by musl-1.1.21, but it is not how the support is
    getting integrated into the official musl release. Official musl support is
    planned but will require more invasive changes to the library.

    Link: https://lore.kernel.org/lkml/20190110162435.309262-1-arnd@arndb.de/T/
    Link: https://lore.kernel.org/lkml/20190118161835.2259170-1-arnd@arndb.de/
    Link: https://git.linaro.org/people/arnd/musl-y2038.git/ [2]

    Thomas Gleixner
     

08 Feb, 2019

2 commits

    commit 56222b212e8e ("futex: Drop hb->lock before enqueueing on the
    rtmutex") changed the locking rules in the futex code so that the hash
    bucket lock is no longer held while the waiter is enqueued into the
    rtmutex wait list. This made the lock and the unlock path symmetric, but
    unfortunately the possible early exit from __rt_mutex_proxy_start() due to
    a detected deadlock was not updated accordingly. That allows a concurrent
    unlocker to observe inconsistent state which triggers the warning in the
    unlock path.

    futex_lock_pi()                         futex_unlock_pi()
      lock(hb->lock)
      queue(hb_waiter)                      lock(hb->lock)
      lock(rtmutex->wait_lock)
      unlock(hb->lock)
                                            // acquired hb->lock
                                            hb_waiter = futex_top_waiter()
                                            lock(rtmutex->wait_lock)
      __rt_mutex_proxy_start()
         ---> fail
              remove(rtmutex_waiter);
         ---> returns -EDEADLOCK
      unlock(rtmutex->wait_lock)
                                            // acquired wait_lock
                                            wake_futex_pi()
                                            rt_mutex_next_owner()
                                              --> returns NULL
                                              --> WARN

      lock(hb->lock)
      unqueue(hb_waiter)

    The problem is caused by the remove(rtmutex_waiter) in the failure case of
    __rt_mutex_proxy_start() as this lets the unlocker observe a waiter in the
    hash bucket but no waiter on the rtmutex, i.e. inconsistent state.

    The original commit handles this correctly for the other early return cases
    (timeout, signal) by delaying the removal of the rtmutex waiter until the
    returning task reacquired the hash bucket lock.

    Treat the failure case of __rt_mutex_proxy_start() in the same way and let
    the existing cleanup code handle the eventual handover of the rtmutex
    gracefully. The regular rt_mutex_proxy_start() gains the rtmutex waiter
    removal for the failure case, so that the other callsites are still
    operating correctly.

    Add proper comments to the code so all these details are fully documented.

    Thanks to Peter for helping with the analysis and writing the really
    valuable code comments.

    Fixes: 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
    Reported-by: Heiko Carstens
    Co-developed-by: Peter Zijlstra
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Thomas Gleixner
    Tested-by: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: linux-s390@vger.kernel.org
    Cc: Stefan Liebler
    Cc: Sebastian Sewior
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de

    Thomas Gleixner
     
    The current comment for the barrier which guarantees that the waiter
    increment is always done before taking the hb spinlock (barrier (A)) is
    misplaced and needs to be fixed.

    This is obviously referring to hb_waiters_inc, which is a full barrier.

    Reported-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190206185602.949-1-dave@stgolabs.net

    Davidlohr Bueso
     

07 Feb, 2019

1 commit

  • A lot of system calls that pass a time_t somewhere have an implementation
    using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
    been reworked so that this implementation can now be used on 32-bit
    architectures as well.

    The missing step is to redefine them using the regular SYSCALL_DEFINEx()
    to get them out of the compat namespace and make it possible to build them
    on 32-bit architectures.

    Any system call that ends in 'time' gets a '32' suffix on its name for
    that version, while the others get a '_time32' suffix, to distinguish
    them from the normal version, which takes a 64-bit time argument in the
    future.

    In this step, only 64-bit architectures are changed, doing this rename
    first lets us avoid touching the 32-bit architectures twice.

    Acked-by: Catalin Marinas
    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

04 Feb, 2019

1 commit

    Some users, specifically futexes and rwsems, required fixes
    that allowed the callers to be safe when wakeups occur before
    they are expected by wake_up_q(). Such scenarios also play
    games and rely on reference counting, and until now were
    pivoting on wake_q doing it. With the wake_q_add() call being
    moved down, this can no longer be the case. As such we end up
    with a double task refcounting overhead; and these callers
    care enough about this (being rather core-ish).

    This patch introduces a wake_q_add_safe() call that serves
    for callers that have already done refcounting and therefore the
    task is 'safe' from wake_q's point of view (in that a reference
    is held throughout the entire queue/wakeup cycle). In the one
    case it has internal reference counting, in the other case it
    consumes the reference counting.
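
    Usage difference in a nutshell (sketch):

    /* wake_q_add(): wake_q takes and drops its own reference */
    wake_q_add(&wake_q, p);
    put_task_struct(p);             /* caller drops its own reference      */

    /* wake_q_add_safe(): consumes the reference the caller already holds */
    get_task_struct(p);
    wake_q_add_safe(&wake_q, p);    /* no put_task_struct() by the caller  */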

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: Xie Yongji
    Cc: Yongji Xie
    Cc: andrea.parri@amarulasolutions.com
    Cc: lilin24@baidu.com
    Cc: liuqi16@baidu.com
    Cc: nixun@baidu.com
    Cc: yuanlinsi01@baidu.com
    Cc: zhangyu31@baidu.com
    Link: https://lkml.kernel.org/r/20181218195352.7orq3upiwfdbrdne@linux-r8p5
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

30 Jan, 2019

1 commit

  • When calling debugfs functions, there is no need to ever check the return
    value. The function can work or not, but the code logic should never do
    something different based on this.

    Signed-off-by: Greg Kroah-Hartman
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart (VMware)
    Cc: Peter Zijlstra
    Link: https://lkml.kernel.org/r/20190122152151.16139-40-gregkh@linuxfoundation.org

    Greg Kroah-Hartman
     

21 Jan, 2019

1 commit

  • We must not rely on wake_q_add() to delay the wakeup; in particular
    commit:

    1d0dcb3ad9d3 ("futex: Implement lockless wakeups")

    moved wake_q_add() before smp_store_release(&q->lock_ptr, NULL), which
    could result in futex_wait() waking before observing ->lock_ptr ==
    NULL and going back to sleep again.
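
    The required ordering on the futex side (simplified sketch of
    mark_wake_futex()):

    __unqueue_futex(q);
    /*
     * The waiter can free *q as soon as lock_ptr is set to NULL, so the
     * wakeup must not be queued before this release store.
     */
    smp_store_release(&q->lock_ptr, NULL);
    wake_q_add(wake_q, p);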

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 1d0dcb3ad9d3 ("futex: Implement lockless wakeups")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Jan, 2019

1 commit

  • Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
    of the user address range verification function since we got rid of the
    old racy i386-only code to walk page tables by hand.

    It existed because the original 80386 would not honor the write protect
    bit when in kernel mode, so you had to do COW by hand before doing any
    user access. But we haven't supported that in a long time, and these
    days the 'type' argument is a purely historical artifact.

    A discussion about extending 'user_access_begin()' to do the range
    checking resulted in this patch, because there is no way we're going to
    move the old VERIFY_xyz interface to that model. And it's best done at
    the end of the merge window when I've done most of my merges, so let's
    just get this done once and for all.

    This patch was mostly done with a sed-script, with manual fix-ups for
    the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.
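
    i.e. the trivial cases become (sketch):

    /* before */
    if (!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))
            return -EFAULT;

    /* after */
    if (!access_ok(uaddr, sizeof(u32)))
            return -EFAULT;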

    There were a couple of notable cases:

    - csky still had the old "verify_area()" name as an alias.

    - the iter_iov code had magical hardcoded knowledge of the actual
    values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
    really used it)

    - microblaze used the type argument for a debug printout

    but other than those oddities this should be a total no-op patch.

    I tried to fix up all architectures, did fairly extensive grepping for
    access_ok() uses, and the changes are trivial, but I may have missed
    something. Any missed conversion should be trivially fixable, though.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Dec, 2018

1 commit

  • Pull y2038 updates from Arnd Bergmann:
    "More syscalls and cleanups

    This concludes the main part of the system call rework for 64-bit
    time_t, which has spread over most of year 2018, the last six system
    calls being

    - ppoll
    - pselect6
    - io_pgetevents
    - recvmmsg
    - futex
    - rt_sigtimedwait

    As before, nothing changes for 64-bit architectures, while 32-bit
    architectures gain another entry point that differs only in the layout
    of the timespec structure. Hopefully in the next release we can wire
    up all 22 of those system calls on all 32-bit architectures, which
    gives us a baseline version for glibc to start using them.

    This does not include the clock_adjtime, getrusage/waitid, and
    getitimer/setitimer system calls. I still plan to have new versions of
    those as well, but they are not required for correct operation of the
    C library since they can be emulated using the old 32-bit time_t based
    system calls.

    Aside from the system calls, there are also a few cleanups here,
    removing old kernel internal interfaces that have become unused after
    all references got removed. The arch/sh cleanups are part of this,
    there were posted several times over the past year without a reaction
    from the maintainers, while the corresponding changes made it into all
    other architectures"

    * tag 'y2038-for-4.21' of ssh://gitolite.kernel.org:/pub/scm/linux/kernel/git/arnd/playground:
    timekeeping: remove obsolete time accessors
    vfs: replace current_kernel_time64 with ktime equivalent
    timekeeping: remove timespec_add/timespec_del
    timekeeping: remove unused {read,update}_persistent_clock
    sh: remove board_time_init() callback
    sh: remove unused rtc_sh_get/set_time infrastructure
    sh: sh03: rtc: push down rtc class ops into driver
    sh: dreamcast: rtc: push down rtc class ops into driver
    y2038: signal: Add compat_sys_rt_sigtimedwait_time64
    y2038: signal: Add sys_rt_sigtimedwait_time32
    y2038: socket: Add compat_sys_recvmmsg_time64
    y2038: futex: Add support for __kernel_timespec
    y2038: futex: Move compat implementation into futex.c
    io_pgetevents: use __kernel_timespec
    pselect6: use __kernel_timespec
    ppoll: use __kernel_timespec
    signal: Add restore_user_sigmask()
    signal: Add set_user_sigmask()

    Linus Torvalds
     

19 Dec, 2018

1 commit

    Stefan reported that the glibc tst-robustpi4 test case fails
    occasionally. That case creates the following race between
    sys_exit() and sys_futex_lock_pi():

    CPU 0                               CPU 1

    sys_exit()                          sys_futex()
     do_exit()                           futex_lock_pi()
      exit_signals(tsk)                   No waiters:
       tsk->flags |= PF_EXITING;          *uaddr == 0x00000PID
      mm_release(tsk)                     Set waiter bit
       exit_robust_list(tsk) {            *uaddr = 0x80000PID;
           Set owner died                 attach_to_pi_owner() {
        *uaddr = 0xC0000000;               tsk = get_task(PID);
       }                                   if (!tsk->flags & PF_EXITING) {
      ...                                    attach();
      tsk->flags |= PF_EXITPIDONE;         } else {
                                             if (!(tsk->flags & PF_EXITPIDONE))
                                               return -EAGAIN;
                                             return -ESRCH;
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Heiko Carstens
    Cc: Darren Hart
    Cc: Ingo Molnar
    Cc: Sasha Levin
    Cc: stable@vger.kernel.org
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=200467
    Link: https://lkml.kernel.org/r/20181210152311.986181245@linutronix.de

    Thomas Gleixner
     

08 Dec, 2018

2 commits

  • This prepares sys_futex for y2038 safe calling: the native
    syscall is changed to receive a __kernel_timespec argument, which
    will be switched to 64-bit time_t in the future. All the internal
    time handling gets changed to timespec64, and the compat_sys_futex
    entry point is moved under the CONFIG_COMPAT_32BIT_TIME check
    to provide compatibility for existing 32-bit architectures.
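
    Abridged sketch of the resulting native entry point (the check for a
    command that takes a timeout is paraphrased as cmd_has_timeout()):

    SYSCALL_DEFINE6(futex, u32 __user *, uaddr, int, op, u32, val,
                    const struct __kernel_timespec __user *, utime,
                    u32 __user *, uaddr2, u32, val3)
    {
            struct timespec64 ts;
            ktime_t t, *tp = NULL;

            if (utime && cmd_has_timeout(op)) {
                    if (get_timespec64(&ts, utime))
                            return -EFAULT;
                    if (!timespec64_valid(&ts))
                            return -EINVAL;
                    t = timespec64_to_ktime(ts);
                    tp = &t;
            }
            ...
    }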

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     
  • We are going to share the compat_sys_futex() handler between 64-bit
    architectures and 32-bit architectures that need to deal with both 32-bit
    and 64-bit time_t, and this is easier if both entry points are in the
    same file.

    In fact, most other system call handlers do the same thing these days, so
    let's follow the trend here and merge all of futex_compat.c into futex.c.

    In the process, a few minor changes have to be done to make sure everything
    still makes sense: handle_futex_death() and futex_cmpxchg_enabled() become
    local symbols, and the compat version of the fetch_robust_entry() function
    gets renamed to compat_fetch_robust_entry() to avoid a symbol clash.

    This is intended as a purely cosmetic patch, no behavior should
    change.

    Signed-off-by: Arnd Bergmann

    Arnd Bergmann
     

31 Oct, 2018

1 commit

  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below and then
    semi-automated removal of duplicated '#include <linux/memblock.h>'

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

09 Oct, 2018

1 commit

  • lockdep_assert_held() is better suited for checking locking requirements,
    since it won't get confused when the lock is held by some other task. This
    is also a step towards possibly removing spin_is_locked().
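
    i.e. (sketch; the 'before' form is approximate):

    /* before */
    WARN_ON(!spin_is_locked(q->lock_ptr));

    /* after */
    lockdep_assert_held(q->lock_ptr);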

    Signed-off-by: Lance Roy
    Signed-off-by: Thomas Gleixner
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Link: https://lkml.kernel.org/r/20181003053902.6910-12-ldr709@gmail.com

    Lance Roy
     

21 Aug, 2018

1 commit


07 Feb, 2018

1 commit

  • There are several functions that do find_task_by_vpid() followed by
    get_task_struct(). We can use a helper function instead.
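
    With such a helper (find_get_task_by_vpid() upstream) the pattern
    collapses from (sketch):

    /* before */
    rcu_read_lock();
    p = find_task_by_vpid(pid);
    if (p)
            get_task_struct(p);
    rcu_read_unlock();

    /* after */
    p = find_get_task_by_vpid(pid);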

    Link: http://lkml.kernel.org/r/1509602027-11337-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport