26 Oct, 2014

2 commits

  • free_pi_state() and exit_pi_state_list() both clean up
    futex_pi_state objects. exit_pi_state_list() takes the hb lock
    first, and most callers of free_pi_state() do too. requeue_pi
    doesn't, which means free_pi_state() can free the pi_state out from
    under exit_pi_state_list(). For example:

    task A                           | task B
    ---------------------------------+------------------------------------
    exit_pi_state_list()             |
      pi_state =                     |
        curr->pi_state_list->next    |
                                     | futex_requeue(requeue_pi=1)
                                     |   // pi_state is the same as
                                     |   // the one in task A
                                     |   free_pi_state(pi_state)
                                     |     list_del_init(&pi_state->list)
                                     |     kfree(pi_state)
      list_del_init(&pi_state->list) |

    Move the free_pi_state calls in requeue_pi to before it drops the hb
    locks which it's already holding.

    [ tglx: Removed a pointless free_pi_state() call and the hb->lock held
    debugging. The latter comes via a separate patch ]

    Signed-off-by: Brian Silverman
    Cc: austin.linux@gmail.com
    Cc: darren@dvhart.com
    Cc: peterz@infradead.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1414282837-23092-1-git-send-email-bsilver16384@gmail.com
    Signed-off-by: Thomas Gleixner

    Brian Silverman
     
  • Update our documentation as of fix 76835b0ebf8 (futex: Ensure
    get_futex_key_refs() always implies a barrier). Explicitly
    state that we don't do key referencing for private futexes.

    Signed-off-by: Davidlohr Bueso
    Cc: Matteo Franchin
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Acked-by: Catalin Marinas
    Link: http://lkml.kernel.org/r/1414121220.817.0.camel@linux-t7sj.site
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

19 Oct, 2014

1 commit

  • Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
    nothing to wake up) changes the futex code to avoid taking a lock when
    there are no waiters. This code has been subsequently fixed in commit
    11d4616bd07f (futex: revert back to the explicit waiter counting code).
    Both the original commit and the fix-up rely on get_futex_key_refs() to
    always imply a barrier.

    However, for private futexes, none of the cases in the switch statement
    of get_futex_key_refs() would be hit and the function completes without
    a memory barrier as required before checking the "waiters" in
    futex_wake() -> hb_waiters_pending(). The consequence is a race with a
    thread waiting on a futex on another CPU, allowing the waker thread to
    read "waiters == 0" while the waiter thread to have read "futex_val ==
    locked" (in kernel).

    Without this fix, the problem (user space deadlocks) can be seen with
    Android bionic's mutex implementation on an arm64 multi-cluster system.

    Signed-off-by: Catalin Marinas
    Reported-by: Matteo Franchin
    Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
    Acked-by: Davidlohr Bueso
    Tested-by: Mike Galbraith
    Cc:
    Cc: Darren Hart
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

13 Sep, 2014

1 commit

  • futex_wait_requeue_pi() calls futex_wait_setup(). If
    futex_wait_setup() succeeds it returns with hb->lock held and
    preemption disabled. Now the sanity check after this does:

    if (match_futex(&q.key, &key2)) {
            ret = -EINVAL;
            goto out_put_keys;
    }

    which releases the keys but does not release hb->lock.

    So we happily return to user space with hb->lock held and therefore
    preemption disabled.

    Unlock hb->lock before taking the exit route.

    Reported-by: Dave "Trinity" Jones
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart
    Reviewed-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

22 Jun, 2014

6 commits

  • futex_lock_pi_atomic() is a maze of retry hoops and loops.

    Reduce it to simple and understandable states:

    First step is to lookup existing waiters (state) in the kernel.

    If there is an existing waiter, validate it and attach to it.

    If there is no existing waiter, check the user space value

    If the TID encoded in the user space value is 0, take over the futex
    preserving the owner died bit.

    If the TID encoded in the user space value is != 0, lookup the owner
    task, validate it and attach to it.

    Reduces text size by 128 bytes on x86_64.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: wad@chromium.org
    Cc: Darren Hart
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1406131137020.5170@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • We want to be a bit more clever in futex_lock_pi_atomic() and separate
    the possible states. Split out the code which attaches the first
    waiter to the owner into a separate function. No functional change.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: wad@chromium.org
    Link: http://lkml.kernel.org/r/20140611204237.271300614@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • We want to be a bit more clever in futex_lock_pi_atomic() and separate
    the possible states. Split out the waiter verification into a separate
    function. No functional change.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: wad@chromium.org
    Link: http://lkml.kernel.org/r/20140611204237.180458410@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • No point in open coding the same function again.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: wad@chromium.org
    Link: http://lkml.kernel.org/r/20140611204237.092947239@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The kernel tries to atomically unlock the futex without checking
    whether there is kernel state associated to the futex.

    So if user space manipulated the user space value, this will leave
    kernel internal state around associated to the owner task.

    For robustness sake, lookup first whether there are waiters on the
    futex. If there are waiters, wake the top priority waiter with all the
    proper sanity checks applied.

    If there are no waiters, do the atomic release. We do not have to
    preserve the waiters bit in this case, because a potentially incoming
    waiter is blocked on the hb->lock and will acquire the futex
    atomically. Nor do we have to preserve the owner died bit. The caller
    is the owner and it was supposed to clean up the mess.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: wad@chromium.org
    Link: http://lkml.kernel.org/r/20140611204237.016987332@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The deadlock logic is only required for futexes.

    Remove the extra arguments for the public functions and also for the
    futex specific ones which get always called with deadlock detection
    enabled.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt

    Thomas Gleixner
     

09 Jun, 2014

1 commit

  • Now that 3.15 is released, this merges the 'next' branch into 'master',
    bringing us to the normal situation where my 'master' branch is the
    merge window.

    * accumulated work in next: (6809 commits)
    ufs: sb mutex merge + mutex_destroy
    powerpc: update comments for generic idle conversion
    cris: update comments for generic idle conversion
    idle: remove cpu_idle() forward declarations
    nbd: zero from and len fields in NBD_CMD_DISCONNECT.
    mm: convert some level-less printks to pr_*
    MAINTAINERS: adi-buildroot-devel is moderated
    MAINTAINERS: add linux-api for review of API/ABI changes
    mm/kmemleak-test.c: use pr_fmt for logging
    fs/dlm/debug_fs.c: replace seq_printf by seq_puts
    fs/dlm/lockspace.c: convert simple_str to kstr
    fs/dlm/config.c: convert simple_str to kstr
    mm: mark remap_file_pages() syscall as deprecated
    mm: memcontrol: remove unnecessary memcg argument from soft limit functions
    mm: memcontrol: clean up memcg zoneinfo lookup
    mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
    mm/mempool.c: update the kmemleak stack trace for mempool allocations
    lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
    mm: introduce kmemleak_update_trace()
    mm/kmemleak.c: use %u to print ->checksum
    ...

    Linus Torvalds
     

06 Jun, 2014

4 commits

  • The current implementation of lookup_pi_state has ambiguous handling of
    the TID value 0 in the user space futex. We can get into the kernel
    even if the TID value is 0, because either there is a stale waiters bit
    or the owner died bit is set or we are called from the requeue_pi path
    or from user space just for fun.

    The current code avoids an explicit sanity check for pid = 0 in case
    that kernel internal state (waiters) are found for the user space
    address. This can lead to state leakage and worse under some
    circumstances.

    Handle the cases explicitly:

    Waiter | pi_state | pi->owner | uTID      | uODIED | ?

    [1]  NULL  | ---   | ---       | 0         | 0/1    | Valid
    [2]  NULL  | ---   | ---       | >0        | 0/1    | Valid

    [3]  Found | NULL  | --        | Any       | 0/1    | Invalid

    [4]  Found | Found | NULL      | 0         | 1      | Valid
    [5]  Found | Found | NULL      | >0        | 1      | Invalid

    [6]  Found | Found | task      | 0         | 1      | Valid

    [7]  Found | Found | NULL      | Any       | 0      | Invalid

    [8]  Found | Found | task      | ==taskTID | 0/1    | Valid
    [9]  Found | Found | task      | 0         | 0      | Invalid
    [10] Found | Found | task      | !=taskTID | 0/1    | Invalid

    [1] Indicates that the kernel can acquire the futex atomically. We
    came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.

    [2] Valid, if TID does not belong to a kernel thread. If no matching
    thread is found then it indicates that the owner TID has died.

    [3] Invalid. The waiter is queued on a non PI futex

    [4] Valid state after exit_robust_list(), which sets the user space
    value to FUTEX_WAITERS | FUTEX_OWNER_DIED.

    [5] The user space value got manipulated between exit_robust_list()
    and exit_pi_state_list()

    [6] Valid state after exit_pi_state_list() which sets the new owner in
    the pi_state but cannot access the user space value.

    [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.

    [8] Owner and user space value match

    [9] There is no transient state which sets the user space TID to 0
    except exit_robust_list(), but this is indicated by the
    FUTEX_OWNER_DIED bit. See [4]

    [10] There is no transient state which leaves owner and user space
    TID out of sync.

    Signed-off-by: Thomas Gleixner
    Cc: Kees Cook
    Cc: Will Drewry
    Cc: Darren Hart
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • If the owner died bit is set at futex_unlock_pi, we currently do not
    clean up the user space futex. So the owner TID of the current owner
    (the unlocker) persists. That's observable inconsistent state,
    especially when the ownership of the pi state got transferred.

    Clean it up unconditionally.

    Signed-off-by: Thomas Gleixner
    Cc: Kees Cook
    Cc: Will Drewry
    Cc: Darren Hart
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • We need to protect the atomic acquisition in the kernel against rogue
    user space which sets the user space futex to 0, so the kernel side
    acquisition succeeds while there is existing state in the kernel
    associated to the real owner.

    Verify whether the futex has waiters associated with kernel state. If
    it has, return -EINVAL. The state is corrupted already, so no point in
    cleaning it up. Subsequent calls will fail as well. Not our problem.

    [ tglx: Use futex_top_waiter() and explain why we do not need to try
    restoring the already corrupted user space state. ]

    Signed-off-by: Darren Hart
    Cc: Kees Cook
    Cc: Will Drewry
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • …tex_requeue(..., requeue_pi=1)

    If uaddr == uaddr2, then we have broken the rule of only requeueing from
    a non-pi futex to a pi futex with this call. If we attempt this, then
    dangling pointers may be left for rt_waiter resulting in an exploitable
    condition.

    This change brings futex_requeue() in line with futex_wait_requeue_pi()
    which performs the same check as per commit 6f7b0a2a5c0f ("futex: Forbid
    uaddr == uaddr2 in futex_wait_requeue_pi()")

    [ tglx: Compare the resulting keys as well, as uaddrs might be
    different depending on the mapping ]

    Fixes CVE-2014-3153.

    Reported-by: Pinkie Pie
    Signed-off-by: Will Drewry <wad@chromium.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Darren Hart <dvhart@linux.intel.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Thomas Gleixner
     

04 Jun, 2014

1 commit

  • …el/git/tip/tip into next

    Pull core locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - reduced/streamlined smp_mb__*() interface that allows more usecases
    and makes the existing ones less buggy, especially in rarer
    architectures

    - add rwsem implementation comments

    - bump up lockdep limits"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    rwsem: Add comments to explain the meaning of the rwsem's count field
    lockdep: Increase static allocations
    arch: Mass conversion of smp_mb__*()
    arch,doc: Convert smp_mb__*()
    arch,xtensa: Convert smp_mb__*()
    arch,x86: Convert smp_mb__*()
    arch,tile: Convert smp_mb__*()
    arch,sparc: Convert smp_mb__*()
    arch,sh: Convert smp_mb__*()
    arch,score: Convert smp_mb__*()
    arch,s390: Convert smp_mb__*()
    arch,powerpc: Convert smp_mb__*()
    arch,parisc: Convert smp_mb__*()
    arch,openrisc: Convert smp_mb__*()
    arch,mn10300: Convert smp_mb__*()
    arch,mips: Convert smp_mb__*()
    arch,metag: Convert smp_mb__*()
    arch,m68k: Convert smp_mb__*()
    arch,m32r: Convert smp_mb__*()
    arch,ia64: Convert smp_mb__*()
    ...

    Linus Torvalds
     

19 May, 2014

2 commits

  • We happily allow userspace to declare a random kernel thread to be the
    owner of a user space PI futex.

    Found while analysing the fallout of Dave Jones' syscall fuzzer.

    We also should validate the thread group for private futexes and find
    some fast way to validate whether the "alleged" owner has RW access on
    the file which backs the SHM, but that's a separate issue.

    Signed-off-by: Thomas Gleixner
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Clark Williams
    Cc: Paul McKenney
    Cc: Lai Jiangshan
    Cc: Roland McGrath
    Cc: Carlos ODonell
    Cc: Jakub Jelinek
    Cc: Michael Kerrisk
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     
  • Dave Jones' trinity syscall fuzzer exposed an issue in the deadlock
    detection code of rtmutex:
    http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com

    That underlying issue has been fixed with a patch to the rtmutex code,
    but the futex code must not call into rtmutex in that case because
    - it can detect that issue early
    - it avoids a different and more complex fixup for backing out

    If the user space variable got manipulated to 0x80000000, which means
    no lock holder but the waiters bit set, and an active pi_state is
    found in the kernel, we can figure out the recursive locking issue
    by looking at the pi_state owner. If that is the current task, then
    we can safely return -EDEADLK.

    The check should have been added in commit 59fa62451 (futex: Handle
    futex_pi OWNER_DIED take over correctly) already, but I did not see
    the above issue caused by user space manipulation back then.

    Signed-off-by: Thomas Gleixner
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Clark Williams
    Cc: Paul McKenney
    Cc: Lai Jiangshan
    Cc: Roland McGrath
    Cc: Carlos ODonell
    Cc: Jakub Jelinek
    Cc: Michael Kerrisk
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     

18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Apr, 2014

1 commit

  • Commits 11d4616bd07f ("futex: revert back to the explicit waiter
    counting code") and 69cd9eba3886 ("futex: avoid race between requeue and
    wake") changed some of the finer details of how we think about futexes.
    One was a late fix and the other a consequence of overlooking the whole
    requeuing logic.

    The first change caused our documentation to be incorrect, and the
    second made us aware that we need to explicitly add more details to it.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

09 Apr, 2014

1 commit

  • Jan Stancek reported:
    "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
    occasionally fails, because some threads fail to wake up.

    Testcase creates 5 threads, which are all waiting on same condition.
    Main thread then calls pthread_cond_broadcast() without holding mutex,
    which calls:

    futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)

    This immediately wakes up single thread A, which unlocks mutex and
    tries to wake up another thread:

    futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)

    If thread A manages to call futex_wake() before any waiters are
    requeued for uaddr2, no other thread is woken up"

    The ordering constraints for the hash bucket waiter counting are that
    the waiter counts have to be incremented _before_ getting the spinlock
    (because the spinlock acts as part of the memory barrier), but the
    "requeue" operation didn't honor those rules, and nobody had even
    thought about that case.

    This fairly simple patch just increments the waiter count for the target
    hash bucket (hb2) when requeuing a futex before taking the locks. It
    then decrements them again after releasing the lock - the code that
    actually moves the futex(es) between hash buckets will do the additional
    required waiter count housekeeping.

    Reported-and-tested-by: Jan Stancek
    Acked-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org # 3.14
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Apr, 2014

1 commit

  • Pull core locking updates from Ingo Molnar:
    "The biggest change is the MCS spinlock generalization changes from Tim
    Chen, Peter Zijlstra, Jason Low et al. There's also lockdep
    fixes/enhancements from Oleg Nesterov, in particular a false negative
    fix related to lockdep_set_novalidate_class() usage"

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
    locking/mutex: Fix debug checks
    locking/mutexes: Add extra reschedule point
    locking/mutexes: Introduce cancelable MCS lock for adaptive spinning
    locking/mutexes: Unlock the mutex without the wait_lock
    locking/mutexes: Modify the way optimistic spinners are queued
    locking/mutexes: Return false if task need_resched() in mutex_can_spin_on_owner()
    locking: Move mcs_spinlock.h into kernel/locking/
    m68k: Skip futex_atomic_cmpxchg_inatomic() test
    futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() test
    Revert "sched/wait: Suppress Sparse 'variable shadowing' warning"
    lockdep: Change lockdep_set_novalidate_class() to use _and_name
    lockdep: Change mark_held_locks() to check hlock->check instead of lockdep_no_validate
    lockdep: Don't create the wrong dependency on hlock->check == 0
    lockdep: Make held_lock->check and "int check" argument bool
    locking/mcs: Allow architecture specific asm files to be used for contended case
    locking/mcs: Order the header files in Kbuild of each architecture in alphabetical order
    sched/wait: Suppress Sparse 'variable shadowing' warning
    hung_task/Documentation: Fix hung_task_warnings description
    locking/mcs: Allow architectures to hook in to contended paths
    locking/mcs: Micro-optimize the MCS code, add extra comments
    ...

    Linus Torvalds
     

21 Mar, 2014

1 commit

  • Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid
    taking the hb->lock if there's nothing to wake up") causes java threads
    getting stuck on futexes when running specjbb on a power7 numa box.

    The cause appears to be that the powerpc spinlocks aren't using the same
    ticket lock model that we use on x86 (and other) architectures, which in
    turn result in the "spin_is_locked()" test in hb_waiters_pending()
    occasionally reporting an unlocked spinlock even when there are pending
    waiters.

    So this reinstates Davidlohr Bueso's original explicit waiter counting
    code, which I had convinced Davidlohr to drop in favor of figuring out
    the pending waiters by just using the existing state of the spinlock and
    the wait queue.

    Reported-and-tested-by: Srikar Dronamraju
    Original-code-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Mar, 2014

1 commit

  • If an architecture has futex_atomic_cmpxchg_inatomic() implemented and
    there is no runtime check necessary, allow it to skip the test within
    futex_init().

    This gets rid of some code which would always give the same result,
    and also lets the compiler optimize a couple of if statements away.

    Signed-off-by: Heiko Carstens
    Cc: Finn Thain
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris
    Signed-off-by: Thomas Gleixner

    Heiko Carstens
     

21 Jan, 2014

1 commit

  • Pull scheduler changes from Ingo Molnar:

    - Add the initial implementation of SCHED_DEADLINE support: a real-time
    scheduling policy where tasks that meet their deadlines and
    periodically execute their instances in less than their runtime quota
    see real-time scheduling and won't miss any of their deadlines.
    Tasks that go over their quota get delayed (Available to privileged
    users for now)

    - Clean up and fix preempt_enable_no_resched() abuse all around the
    tree

    - Do sched_clock() performance optimizations on x86 and elsewhere

    - Fix and improve auto-NUMA balancing

    - Fix and clean up the idle loop

    - Apply various cleanups and fixes

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    sched: Fix __sched_setscheduler() nice test
    sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
    sched: Fix up attr::sched_priority warning
    sched: Fix up scheduler syscall LTP fails
    sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
    sched/core: Fix htmldocs warnings
    sched/deadline: No need to check p if dl_se is valid
    sched/deadline: Remove unused variables
    sched/deadline: Fix sparse static warnings
    m68k: Fix build warning in mac_via.h
    sched, thermal: Clean up preempt_enable_no_resched() abuse
    sched, net: Fixup busy_loop_us_clock()
    sched, net: Clean up preempt_enable_no_resched() abuse
    sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
    sched/preempt, locking: Rework local_bh_{dis,en}able()
    sched/clock, x86: Avoid a runtime condition in native_sched_clock()
    sched/clock: Fix up clear_sched_clock_stable()
    sched/clock, x86: Use a static_key for sched_clock_stable
    sched/clock: Remove local_irq_disable() from the clocks
    sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
    ...

    Linus Torvalds
     

16 Jan, 2014

1 commit

  • "futexes: Increase hash table size for better performance"
    introduces a new alloc_large_system_hash() call.

    alloc_large_system_hash() however may allocate less memory than
    requested, e.g. limited by MAX_ORDER.

    Hence pass a pointer to alloc_large_system_hash() which will
    contain the hash shift when the function returns. Afterwards
    correctly set futex_hashsize.

    Fixes a crash on s390 where the requested allocation size was
    4MB but only 1MB was allocated.

    Signed-off-by: Heiko Carstens
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     

13 Jan, 2014

5 commits

  • Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
    and provide a proper comparison function for -deadline and
    -priority tasks.

    This is done mainly because:
    - classical prio field of the plist is just an int, which might
    not be enough for representing a deadline;
    - manipulating such a list would become O(nr_deadline_tasks),
    which might be too much, as the number of -deadline tasks increases.

    Therefore, an rb-tree is used, and tasks are queued in it according
    to the following logic:
    - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
    one with the higher (lower, actually!) prio wins;
    - among a -priority and a -deadline task, the latter always wins;
    - among two -deadline tasks, the one with the earliest deadline
    wins.

    Queueing and dequeueing functions are changed accordingly, for both
    the list of a task's pi-waiters and the list of tasks blocked on
    a pi-lock.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-again-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In futex_wake() there is clearly no point in taking the hb->lock
    if we know beforehand that there are no tasks to be woken. While
    the hash bucket's plist head is a cheap way of knowing this, we
    cannot rely 100% on it as there is a racy window between the
    futex_wait call and when the task is actually added to the
    plist. To this end, we couple it with the spinlock check as
    tasks trying to enter the critical region are most likely
    potential waiters that will be added to the plist, thus
    preventing tasks sleeping forever if wakers don't acknowledge
    all possible waiters.

    Furthermore, the futex ordering guarantees are preserved,
    ensuring that waiters either observe the changed user space
    value before blocking or are woken by a concurrent waker. For
    wakers, this is done by relying on the barriers in
    get_futex_key_refs() -- for archs that do not have implicit mb
    in atomic_inc(), we explicitly add them through a new
    futex_get_mm function. For waiters we rely on the fact that
    spin_lock calls already update the head counter, so spinners
    are visible even if the lock hasn't been acquired yet.

    For more details please refer to the updated comments in the
    code and related discussion:

    https://lkml.org/lkml/2013/11/26/556

    Special thanks to tglx for careful review and feedback.

    Suggested-by: Linus Torvalds
    Reviewed-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Cc: Paul E. McKenney
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • That's essential, if you want to hack on futexes.

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Randy Dunlap
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Currently, the futex global hash table suffers from its fixed,
    smallish (for today's standards) size of 256 entries, as well as
    its lack of NUMA awareness. Large systems, using many futexes,
    can be prone to high amounts of collisions; where these futexes
    hash to the same bucket and lead to extra contention on the same
    hb->lock. Furthermore, cacheline bouncing is a reality when we
    have multiple hb->locks residing on the same cacheline and
    different futexes hash to adjacent buckets.

    This patch keeps the current static size of 16 entries for small
    systems, or otherwise, 256 * ncpus (or larger as we need to
    round the number to a power of 2). Note that this number of CPUs
    accounts for all CPUs that can ever be available in the system,
    taking into consideration things like hotplugging. While we do
    impose extra overhead at bootup by making the hash table larger,
    this is a one time thing, and does not overshadow the benefits of
    this patch.

    Furthermore, as suggested by tglx, by cache aligning the hash
    buckets we can avoid access across cacheline boundaries and also
    avoid massive cache line bouncing if multiple cpus are hammering
    away at different hash buckets which happen to reside in the
    same cache line.

    Also, similar to other core kernel components (pid, dcache,
    tcp), by using alloc_large_system_hash() we benefit from its
    NUMA awareness and thus the table is distributed among the nodes
    instead of in a single one.

    For a custom microbenchmark that pounds on the uaddr hashing --
    making the wait path fail at futex_wait_setup() returning
    -EWOULDBLOCK for large amounts of futexes, we can see the
    following benefits on an 80-core, 8-socket 1TB server:

    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
    |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
    |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
    |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
    |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
    |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Waiman Long
    Reviewed-and-tested-by: Jason Low
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • - Remove unnecessary head variables.
    - Delete unused parameter in queue_unlock().

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Jason Low
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     

13 Dec, 2013

2 commits

  • When debugging the read-only hugepage case, I was confused by the fact
    that get_futex_key() did an access_ok() only for the non-shared futex
    case, since the user address checking really isn't in any way specific
    to the private key handling.

    Now, it turns out that the shared key handling does effectively do the
    equivalent checks inside get_user_pages_fast() (it doesn't actually
    check the address range on x86, but does check the page protections for
    being a user page). So it wasn't actually a bug, but the fact that we
    treat the address differently for private and shared futexes threw me
    for a loop.

    Just move the check up, so that it gets done for both cases. Also, use
    the 'rw' parameter for the type, even if it doesn't actually matter any
    more (it's a historical artifact of the old racy i386 "page faults from
    kernel space don't check write protections").
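    The reordering can be sketched as follows; get_futex_key_sketch,
    user_addr_ok and TASK_SIZE_SKETCH are hypothetical names standing in
    for get_futex_key(), access_ok() and the real user address-space
    limit, not the kernel's actual code.

    ```c
    #include <errno.h>
    #include <stdint.h>

    #define TASK_SIZE_SKETCH (1UL << 47)  /* assumed user address-space limit */

    /* Stand-in for access_ok(): is [uaddr, uaddr+4) inside user space? */
    static int user_addr_ok(uintptr_t uaddr)
    {
        return uaddr + sizeof(uint32_t) <= TASK_SIZE_SKETCH;
    }

    /* The change described above: validate the address once, before the
     * private/shared split, so both key types get the same checks. */
    static int get_futex_key_sketch(uintptr_t uaddr, int fshared)
    {
        if (uaddr % sizeof(uint32_t))
            return -EINVAL;        /* futex words must be 4-byte aligned */
        if (!user_addr_ok(uaddr))
            return -EFAULT;        /* now done for shared futexes too */
        (void)fshared;             /* key derivation would follow here */
        return 0;
    }
    ```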

    Cc: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The hugepage code had the exact same bug that regular pages had in
    commit 7485d0d3758e ("futexes: Remove rw parameter from
    get_futex_key()").

    The regular page case was fixed by commit 9ea71503a8ed ("futex: Fix
    regression with read only mappings"), but the transparent hugepage case
    (added in a5b338f2b0b1: "thp: update futex compound knowledge")
    remained broken.

    Found by Dave Jones and his trinity tool.

    Reported-and-tested-by: Dave Jones
    Cc: stable@kernel.org # v2.6.38+
    Acked-by: Thomas Gleixner
    Cc: Mel Gorman
    Cc: Darren Hart
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

06 Nov, 2013

1 commit


26 Jun, 2013

2 commits

  • Avoid waking up every thread sleeping in a futex_wait call during
    suspend and resume by calling a freezable blocking call. Previous
    patches modified the freezer to avoid sending wakeups to threads
    that are blocked in freezable blocking calls.

    This call was selected to be converted to a freezable call because
    it doesn't hold any locks or release any resources when interrupted
    that might be needed by another freezing task or a kernel driver
    during suspend, and is a common site where idle userspace tasks are
    blocked.

    Signed-off-by: Colin Cross
    Cc: Rafael J. Wysocki
    Cc: arve@android.com
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Darren Hart
    Cc: Randy Dunlap
    Cc: Al Viro
    Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.com
    Signed-off-by: Thomas Gleixner

    Colin Cross
     
    The futex_keys of process shared futexes are generated from the page
    offset, the mapping host and the mapping index of the futex user space
    address. This should result in a unique identifier for each futex.

    However, this is not true when futexes are located in different
    subpages of a hugepage. The reason is that the mapping index for all
    those futexes evaluates to the index of the base page of the
    hugetlbfs mapping. So a futex at offset 0 of the hugepage mapping and
    another one at offset PAGE_SIZE of the same hugepage mapping have
    identical futex_keys. This happens because the futex code blindly
    uses page->index.

    Steps to reproduce the bug:

    1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
    and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
    mapping.

    The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
    PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
    their keys solely depend on the user space address.

    2. Lock mutex1 and mutex2

    3. Create thread1 and in the thread function lock mutex1, which
    results in thread1 blocking on the locked mutex1.

    4. Create thread2 and in the thread function lock mutex2, which
    results in thread2 blocking on the locked mutex2.

    5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
    still blocks on mutex2 because the futex_key points to mutex1.

    To solve this issue we need to take the normal page index of the page
    which contains the futex into account, if the futex is in a hugetlbfs
    mapping. In other words, we calculate the normal page mapping index of
    the subpage within the hugetlbfs mapping.

    Mappings which are not based on hugetlbfs are not affected and still
    use page->index.
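    As a rough illustration of that calculation (subpage_index is a
    hypothetical helper with assumed 4K base pages and 2MB hugepages;
    the actual patch adds a basepage_index() helper to the hugetlbfs
    code):

    ```c
    #include <assert.h>

    #define PAGE_SIZE_SKETCH  4096UL       /* assumed base page size */
    #define HPAGE_SIZE_SKETCH (2UL << 20)  /* assumed 2MB hugepage   */

    /* Convert a hugepage-granular mapping index plus the futex address
     * into a base-page-granular index, so two futexes in different
     * subpages of the same hugepage no longer collide. */
    static unsigned long subpage_index(unsigned long huge_index,
                                       unsigned long addr)
    {
        unsigned long pages_per_huge = HPAGE_SIZE_SKETCH / PAGE_SIZE_SKETCH;
        unsigned long offset_pages  = (addr % HPAGE_SIZE_SKETCH) / PAGE_SIZE_SKETCH;
        return huge_index * pages_per_huge + offset_pages;
    }
    ```

    In the reproducer above, mutex1 at offset 0 and mutex2 at offset
    PAGE_SIZE of the same hugepage now map to distinct indices instead
    of both resolving to the hugepage's base index.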

    Thanks to Mel Gorman who provided a patch for adding proper evaluation
    functions to the hugetlbfs code to avoid exposing hugetlbfs specific
    details to the futex code.

    [ tglx: Massaged changelog ]

    Signed-off-by: Zhang Yi
    Reviewed-by: Jiang Biao
    Tested-by: Ma Chenggong
    Reviewed-by: Mel Gorman
    Acked-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
    Signed-off-by: Thomas Gleixner

    Zhang Yi
     

13 Mar, 2013

1 commit

  • Fix kernel-doc warning in futex.c and convert 'Returns' to the new Return:
    kernel-doc notation format.

    Warning(kernel/futex.c:2286): Excess function parameter 'clockrt' description in 'futex_wait_requeue_pi'

    Fix one spello.
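    For reference, the new notation looks like this; the function and
    comment below are a made-up example of the kernel-doc format, not
    code from futex.c.

    ```c
    /**
     * futex_op_sketch - illustrative kernel-doc comment
     * @val: value to operate on
     *
     * Demonstrates the new "Return:" section, which replaces the older
     * free-form "Returns" wording in kernel-doc comments.
     *
     * Return: the input value doubled.
     */
    static int futex_op_sketch(int val)
    {
        return 2 * val;
    }
    ```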

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

28 Feb, 2013

1 commit


23 Feb, 2013

1 commit

  • Pull core locking changes from Ingo Molnar:
    "The biggest change is the rwsem lock-steal improvements, both to the
    assembly optimized and the spinlock based variants.

    The other notable change is the clean up of the seqlock implementation
    to be based on the seqcount infrastructure.

    The rest is assorted smaller debuggability, cleanup and continued -rt
    locking changes."

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rwsem-spinlock: Implement writer lock-stealing for better scalability
    futex: Revert "futex: Mark get_robust_list as deprecated"
    generic: Use raw local irq variant for generic cmpxchg
    lockdep: Selftest: convert spinlock to raw spinlock
    seqlock: Use seqcount infrastructure
    seqlock: Remove unused functions
    ntp: Make ntp_lock raw
    intel_idle: Convert i7300_idle_lock to raw_spinlock
    locking: Various static lock initializer fixes
    lockdep: Print more info when MAX_LOCK_DEPTH is exceeded
    rwsem: Implement writer lock-stealing for better scalability
    lockdep: Silence warning if CONFIG_LOCKDEP isn't set
    watchdog: Use local_clock for get_timestamp()
    lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug()
    locking/stat: Fix a typo

    Linus Torvalds
     

19 Feb, 2013

1 commit

  • This reverts commit ec0c4274e33c0373e476b73e01995c53128f1257.

    get_robust_list() is in use and a removal would break existing user
    space. With the permission checks in place it's no longer a security
    hole. Remove the deprecation warnings.

    Signed-off-by: Thomas Gleixner
    Cc: Cyrill Gorcunov
    Cc: Richard Weinberger
    Cc: akpm@linux-foundation.org
    Cc: paul.gortmaker@windriver.com
    Cc: davej@redhat.com
    Cc: keescook@chromium.org
    Cc: stable@vger.kernel.org
    Cc: ebiederm@xmission.com

    Thomas Gleixner