06 Jun, 2014

4 commits

  • The current implementation of lookup_pi_state has ambiguous handling of
    the TID value 0 in the user space futex. We can get into the kernel
    even if the TID value is 0, because either there is a stale waiters bit
    or the owner died bit is set or we are called from the requeue_pi path
    or from user space just for fun.

    The current code avoids an explicit sanity check for pid = 0 when
    kernel internal state (waiters) is found for the user space
    address. This can lead to state leakage and worse under some
    circumstances.

    Handle the cases explicitly:

         Waiter | pi_state | pi->owner | uTID      | uODIED | ?

    [1]  NULL   | ---      | ---       | 0         | 0/1    | Valid
    [2]  NULL   | ---      | ---       | >0        | 0/1    | Valid

    [3]  Found  | NULL     | --        | Any       | 0/1    | Invalid

    [4]  Found  | Found    | NULL      | 0         | 1      | Valid
    [5]  Found  | Found    | NULL      | >0        | 1      | Invalid

    [6]  Found  | Found    | task      | 0         | 1      | Valid

    [7]  Found  | Found    | NULL      | Any       | 0      | Invalid

    [8]  Found  | Found    | task      | ==taskTID | 0/1    | Valid
    [9]  Found  | Found    | task      | 0         | 0      | Invalid
    [10] Found  | Found    | task      | !=taskTID | 0/1    | Invalid

    [1] Indicates that the kernel can acquire the futex atomically. We
    came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.

    [2] Valid, if TID does not belong to a kernel thread. If no matching
    thread is found then it indicates that the owner TID has died.

    [3] Invalid. The waiter is queued on a non-PI futex.

    [4] Valid state after exit_robust_list(), which sets the user space
    value to FUTEX_WAITERS | FUTEX_OWNER_DIED.

    [5] The user space value got manipulated between exit_robust_list()
    and exit_pi_state_list()

    [6] Valid state after exit_pi_state_list() which sets the new owner in
    the pi_state but cannot access the user space value.

    [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.

    [8] Owner and user space value match

    [9] There is no transient state which sets the user space TID to 0
    except exit_robust_list(), but this is indicated by the
    FUTEX_OWNER_DIED bit. See [4]

    [10] There is no transient state which leaves owner and user space
    TID out of sync.
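
    Roughly, once a pi_state is found, the explicit checks boil down to the
    following (a simplified sketch of the intent, not the literal kernel code;
    validate_pi_state() is an illustrative name):

    static int validate_pi_state(u32 uval, struct futex_pi_state *pi_state)
    {
            if (!pi_state->owner) {
                    /* [7]: owner may only be NULL if OWNER_DIED is set */
                    if (!(uval & FUTEX_OWNER_DIED))
                            return -EINVAL;
                    /* [5]: a non-zero TID with a NULL owner is corrupt state */
                    if (uval & FUTEX_TID_MASK)
                            return -EINVAL;
                    return 0;                               /* [4] */
            }
            if (!(uval & FUTEX_TID_MASK))
                    /* [6] is fine, [9] has no legitimate transient state */
                    return (uval & FUTEX_OWNER_DIED) ? 0 : -EINVAL;
            if ((uval & FUTEX_TID_MASK) != task_pid_vnr(pi_state->owner))
                    return -EINVAL;                         /* [10] */
            return 0;                                       /* [8] */
    }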

    Signed-off-by: Thomas Gleixner
    Cc: Kees Cook
    Cc: Will Drewry
    Cc: Darren Hart
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • If the owner died bit is set at futex_unlock_pi, we currently do not
    cleanup the user space futex. So the owner TID of the current owner
    (the unlocker) persists. That's observable inconsistent state,
    especially when the ownership of the pi state got transferred.

    Clean it up unconditionally.
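
    Conceptually the unlock fast path changes along these lines (a sketch,
    not the literal diff):

    /* before: leave the user space word alone if OWNER_DIED is set */
    if (!(uval & FUTEX_OWNER_DIED) &&
        cmpxchg_futex_value_locked(&uval, uaddr, vpid, 0))
            goto pi_faulted;

    /* after: always replace our TID with 0 so the next locker sees sane state */
    if (cmpxchg_futex_value_locked(&uval, uaddr, vpid, 0))
            goto pi_faulted;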

    Signed-off-by: Thomas Gleixner
    Cc: Kees Cook
    Cc: Will Drewry
    Cc: Darren Hart
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • We need to protect the atomic acquisition in the kernel against rogue
    user space which sets the user space futex to 0, so the kernel side
    acquisition succeeds while there is existing state in the kernel
    associated to the real owner.

    Verify whether the futex has waiters associated with kernel state. If
    it has, return -EINVAL. The state is corrupted already, so no point in
    cleaning it up. Subsequent calls will fail as well. Not our problem.

    [ tglx: Use futex_top_waiter() and explain why we do not need to try
    restoring the already corrupted user space state. ]
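
    The guard is small; in the atomic acquisition path it amounts to roughly
    the following (sketch, variable names approximate):

    /*
     * User space claims the futex is free (value 0), but if a waiter with
     * kernel state is queued on it, the state is already corrupt.
     */
    if (unlikely(!uval)) {
            struct futex_q *match = futex_top_waiter(hb, key);

            if (match)
                    return -EINVAL;
            /* otherwise the futex really is free and can be taken atomically */
    }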

    Signed-off-by: Darren Hart
    Cc: Kees Cook
    Cc: Will Drewry
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)

    If uaddr == uaddr2, then we have broken the rule of only requeueing from
    a non-pi futex to a pi futex with this call. If we attempt this, then
    dangling pointers may be left for rt_waiter resulting in an exploitable
    condition.

    This change brings futex_requeue() in line with futex_wait_requeue_pi()
    which performs the same check as per commit 6f7b0a2a5c0f ("futex: Forbid
    uaddr == uaddr2 in futex_wait_requeue_pi()")

    [ tglx: Compare the resulting keys as well, as uaddrs might be
    different depending on the mapping ]

    Fixes CVE-2014-3153.
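
    The check itself is short; roughly (sketch; match_futex() compares the
    resolved keys, so aliased mappings of the same futex are caught too):

    if (requeue_pi) {
            /* Requeue PI only makes sense between two distinct futexes */
            if (uaddr1 == uaddr2)
                    return -EINVAL;
    }

    /* ... after get_futex_key() has resolved both keys ... */
    if (requeue_pi && match_futex(&key1, &key2)) {
            ret = -EINVAL;
            goto out_put_keys;
    }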

    Reported-by: Pinkie Pie
    Signed-off-by: Will Drewry <wad@chromium.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Darren Hart <dvhart@linux.intel.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Thomas Gleixner
     

19 May, 2014

2 commits

  • We happily allow userspace to declare a random kernel thread to be the
    owner of a user space PI futex.

    Found while analysing the fallout of Dave Jones' syscall fuzzer.

    We also should validate the thread group for private futexes and find
    some fast way to validate whether the "alleged" owner has RW access on
    the file which backs the SHM, but that's a separate issue.
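
    The fix boils down to rejecting a kernel thread as the alleged owner when
    attaching to it, roughly (sketch):

    p = futex_find_get_task(pid);
    if (!p)
            return -ESRCH;

    /* a kernel thread can never legitimately own a user space PI futex */
    if (unlikely(p->flags & PF_KTHREAD)) {
            put_task_struct(p);
            return -EPERM;
    }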

    Signed-off-by: Thomas Gleixner
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Clark Williams
    Cc: Paul McKenney
    Cc: Lai Jiangshan
    Cc: Roland McGrath
    Cc: Carlos ODonell
    Cc: Jakub Jelinek
    Cc: Michael Kerrisk
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     
    Dave Jones' trinity syscall fuzzer exposed an issue in the deadlock
    detection code of rtmutex:
    http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com

    That underlying issue has been fixed with a patch to the rtmutex code,
    but the futex code must not call into rtmutex in that case because
    - it can detect that issue early
    - it avoids a different and more complex fixup for backing out

    If the user space variable got manipulated to 0x80000000, which means
    no lock holder but the waiters bit set, and an active pi_state is
    found in the kernel, we can figure out the recursive locking issue by
    looking at the pi_state owner. If that is the current task, then we
    can safely return -EDEADLK.

    The check should have been added in commit 59fa62451 (futex: Handle
    futex_pi OWNER_DIED take over correctly) already, but I did not see
    the above issue caused by user space manipulation back then.
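
    The added check is essentially (sketch):

    /*
     * uval says "no owner" but the waiters bit is set and a pi_state
     * exists. If we already own that pi_state this is a recursive lock
     * attempt: report it here instead of calling into rtmutex.
     */
    if (pi_state->owner == current)
            return -EDEADLK;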

    Signed-off-by: Thomas Gleixner
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Clark Williams
    Cc: Paul McKenney
    Cc: Lai Jiangshan
    Cc: Roland McGrath
    Cc: Carlos ODonell
    Cc: Jakub Jelinek
    Cc: Michael Kerrisk
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     

13 Apr, 2014

1 commit

  • Commits 11d4616bd07f ("futex: revert back to the explicit waiter
    counting code") and 69cd9eba3886 ("futex: avoid race between requeue and
    wake") changed some of the finer details of how we think about futexes.
    One was a late fix and the other a consequence of overlooking the whole
    requeuing logic.

    The first change caused our documentation to be incorrect, and the
    second made us aware that we need to explicitly add more details to it.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

09 Apr, 2014

1 commit

  • Jan Stancek reported:
    "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
    occasionally fails, because some threads fail to wake up.

    Testcase creates 5 threads, which are all waiting on same condition.
    Main thread then calls pthread_cond_broadcast() without holding mutex,
    which calls:

    futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)

    This immediately wakes up single thread A, which unlocks mutex and
    tries to wake up another thread:

    futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)

    If thread A manages to call futex_wake() before any waiters are
    requeued for uaddr2, no other thread is woken up"

    The ordering constraints for the hash bucket waiter counting are that
    the waiter counts have to be incremented _before_ getting the spinlock
    (because the spinlock acts as part of the memory barrier), but the
    "requeue" operation didn't honor those rules, and nobody had even
    thought about that case.

    This fairly simple patch just increments the waiter count for the target
    hash bucket (hb2) when requeueing a futex before taking the locks. It
    then decrements them again after releasing the lock - the code that
    actually moves the futex(es) between hash buckets will do the additional
    required waiter count housekeeping.
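
    In futex_requeue() the bracketing looks roughly like this (sketch; the
    hb_waiters_* helpers come from the explicit waiter counting code):

    hb_waiters_inc(hb2);            /* make hb2 look busy before we lock it */
    double_lock_hb(hb1, hb2);

    /* ... move futex_q entries from hb1 to hb2 under both locks ... */

    double_unlock_hb(hb1, hb2);
    hb_waiters_dec(hb2);            /* moved waiters now account for themselves */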

    Reported-and-tested-by: Jan Stancek
    Acked-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org # 3.14
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Apr, 2014

1 commit

  • Pull core locking updates from Ingo Molnar:
    "The biggest change is the MCS spinlock generalization changes from Tim
    Chen, Peter Zijlstra, Jason Low et al. There's also lockdep
    fixes/enhancements from Oleg Nesterov, in particular a false negative
    fix related to lockdep_set_novalidate_class() usage"

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
    locking/mutex: Fix debug checks
    locking/mutexes: Add extra reschedule point
    locking/mutexes: Introduce cancelable MCS lock for adaptive spinning
    locking/mutexes: Unlock the mutex without the wait_lock
    locking/mutexes: Modify the way optimistic spinners are queued
    locking/mutexes: Return false if task need_resched() in mutex_can_spin_on_owner()
    locking: Move mcs_spinlock.h into kernel/locking/
    m68k: Skip futex_atomic_cmpxchg_inatomic() test
    futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() test
    Revert "sched/wait: Suppress Sparse 'variable shadowing' warning"
    lockdep: Change lockdep_set_novalidate_class() to use _and_name
    lockdep: Change mark_held_locks() to check hlock->check instead of lockdep_no_validate
    lockdep: Don't create the wrong dependency on hlock->check == 0
    lockdep: Make held_lock->check and "int check" argument bool
    locking/mcs: Allow architecture specific asm files to be used for contended case
    locking/mcs: Order the header files in Kbuild of each architecture in alphabetical order
    sched/wait: Suppress Sparse 'variable shadowing' warning
    hung_task/Documentation: Fix hung_task_warnings description
    locking/mcs: Allow architectures to hook in to contended paths
    locking/mcs: Micro-optimize the MCS code, add extra comments
    ...

    Linus Torvalds
     

21 Mar, 2014

1 commit

  • Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid
    taking the hb->lock if there's nothing to wake up") causes java threads
    getting stuck on futexes when running specjbb on a power7 numa box.

    The cause appears to be that the powerpc spinlocks aren't using the same
    ticket lock model that we use on x86 (and other) architectures, which in
    turn results in the "spin_is_locked()" test in hb_waiters_pending()
    occasionally reporting an unlocked spinlock even when there are pending
    waiters.

    So this reinstates Davidlohr Bueso's original explicit waiter counting
    code, which I had convinced Davidlohr to drop in favor of figuring out
    the pending waiters by just using the existing state of the spinlock and
    the wait queue.
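
    The explicit counting keeps a per-bucket atomic counter that wakers test
    instead of peeking at the spinlock, roughly (sketch; field and helper
    names approximate):

    static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
    {
            atomic_inc(&hb->waiters);
            /* full barrier so a waker sees the count before reading the futex word */
            smp_mb__after_atomic_inc();
    }

    static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
    {
            return atomic_read(&hb->waiters);
    }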

    Reported-and-tested-by: Srikar Dronamraju
    Original-code-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Mar, 2014

1 commit

    If an architecture has futex_atomic_cmpxchg_inatomic() implemented and no
    runtime check is necessary, allow it to skip the test within futex_init().

    This makes it possible to get rid of some code which would always give the
    same result, and also allows the compiler to optimize a couple of if
    statements away.
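
    The shape of the change is roughly (sketch; the architecture selects a
    config symbol and the runtime probe compiles away):

    static void __init futex_detect_cmpxchg(void)
    {
    #ifndef CONFIG_HAVE_FUTEX_CMPXCHG
            u32 curval;

            /* probing a NULL user address: -EFAULT means the operation itself works */
            if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
                    futex_cmpxchg_enabled = 1;
    #endif
    }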

    Signed-off-by: Heiko Carstens
    Cc: Finn Thain
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris
    Signed-off-by: Thomas Gleixner

    Heiko Carstens
     

21 Jan, 2014

1 commit

  • Pull scheduler changes from Ingo Molnar:

    - Add the initial implementation of SCHED_DEADLINE support: a real-time
    scheduling policy where tasks that meet their deadlines and
    periodically execute their instances in less than their runtime quota
    see real-time scheduling and won't miss any of their deadlines.
    Tasks that go over their quota get delayed (Available to privileged
    users for now)

    - Clean up and fix preempt_enable_no_resched() abuse all around the
    tree

    - Do sched_clock() performance optimizations on x86 and elsewhere

    - Fix and improve auto-NUMA balancing

    - Fix and clean up the idle loop

    - Apply various cleanups and fixes

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    sched: Fix __sched_setscheduler() nice test
    sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
    sched: Fix up attr::sched_priority warning
    sched: Fix up scheduler syscall LTP fails
    sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
    sched/core: Fix htmldocs warnings
    sched/deadline: No need to check p if dl_se is valid
    sched/deadline: Remove unused variables
    sched/deadline: Fix sparse static warnings
    m68k: Fix build warning in mac_via.h
    sched, thermal: Clean up preempt_enable_no_resched() abuse
    sched, net: Fixup busy_loop_us_clock()
    sched, net: Clean up preempt_enable_no_resched() abuse
    sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
    sched/preempt, locking: Rework local_bh_{dis,en}able()
    sched/clock, x86: Avoid a runtime condition in native_sched_clock()
    sched/clock: Fix up clear_sched_clock_stable()
    sched/clock, x86: Use a static_key for sched_clock_stable
    sched/clock: Remove local_irq_disable() from the clocks
    sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
    ...

    Linus Torvalds
     

16 Jan, 2014

1 commit

  • "futexes: Increase hash table size for better performance"
    introduces a new alloc_large_system_hash() call.

    alloc_large_system_hash() however may allocate less memory than
    requested, e.g. limited by MAX_ORDER.

    Hence pass a pointer to alloc_large_system_hash() which will
    contain the hash shift when the function returns. Afterwards
    correctly set futex_hashsize.

    Fixes a crash on s390 where the requested allocation size was
    4MB but only 1MB was allocated.
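
    The fix lets the allocator report the shift it actually used and derives
    the table size from that, roughly (sketch, arguments simplified):

    unsigned int futex_shift;

    futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
                                           futex_hashsize, 0, 0,
                                           &futex_shift, NULL,
                                           futex_hashsize, futex_hashsize);
    futex_hashsize = 1UL << futex_shift;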

    Signed-off-by: Heiko Carstens
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     

13 Jan, 2014

5 commits

  • Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
    and provide a proper comparison function for -deadline and
    -priority tasks.

    This is done mainly because:
    - classical prio field of the plist is just an int, which might
    not be enough for representing a deadline;
    - manipulating such a list would become O(nr_deadline_tasks),
    which might be too much, as the number of -deadline tasks increases.

    Therefore, an rb-tree is used, and tasks are queued in it according
    to the following logic:
    - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
    one with the higher (lower, actually!) prio wins;
    - among a -priority and a -deadline task, the latter always wins;
    - among two -deadline tasks, the one with the earliest deadline
    wins.

    Queueing and dequeueing functions are changed accordingly, for both
    the list of a task's pi-waiters and the list of tasks blocked on
    a pi-lock.
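
    The ordering rule translates into a comparison function along these lines
    (sketch; field names approximate):

    static inline int
    rt_mutex_waiter_less(struct rt_mutex_waiter *left,
                         struct rt_mutex_waiter *right)
    {
            /* lower prio value wins; -deadline tasks have lower values than all others */
            if (left->task->prio < right->task->prio)
                    return 1;

            /* equal prio and both -deadline: the earlier deadline wins */
            if (dl_prio(left->task->prio))
                    return left->task->dl.deadline < right->task->dl.deadline;

            return 0;
    }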

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-again-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In futex_wake() there is clearly no point in taking the hb->lock
    if we know beforehand that there are no tasks to be woken. While
    the hash bucket's plist head is a cheap way of knowing this, we
    cannot rely 100% on it as there is a racy window between the
    futex_wait call and when the task is actually added to the
    plist. To this end, we couple it with the spinlock check as
    tasks trying to enter the critical region are most likely
    potential waiters that will be added to the plist, thus
    preventing tasks sleeping forever if wakers don't acknowledge
    all possible waiters.

    Furthermore, the futex ordering guarantees are preserved,
    ensuring that waiters either observe the changed user space
    value before blocking or are woken by a concurrent waker. For
    wakers, this is done by relying on the barriers in
    get_futex_key_refs() -- for archs that do not have implicit mb
    in atomic_inc(), we explicitly add them through a new
    futex_get_mm function. For waiters we rely on the fact that
    spin_lock calls already update the head counter, so spinners
    are visible even if the lock hasn't been acquired yet.

    For more details please refer to the updated comments in the
    code and related discussion:

    https://lkml.org/lkml/2013/11/26/556

    Special thanks to tglx for careful review and feedback.
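
    For illustration, the waker-side barrier for private futexes looks roughly
    like this (sketch):

    static inline void futex_get_mm(union futex_key *key)
    {
            atomic_inc(&key->private.mm->mm_count);
            /*
             * Full barrier after the increment, pairing with the barrier
             * implied by spin_lock() on the waiter side.
             */
            smp_mb__after_atomic_inc();
    }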

    Suggested-by: Linus Torvalds
    Reviewed-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Cc: Paul E. McKenney
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • That's essential, if you want to hack on futexes.

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Randy Dunlap
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Currently, the futex global hash table suffers from its fixed,
    smallish (for today's standards) size of 256 entries, as well as
    its lack of NUMA awareness. Large systems, using many futexes,
    can be prone to high amounts of collisions; where these futexes
    hash to the same bucket and lead to extra contention on the same
    hb->lock. Furthermore, cacheline bouncing is a reality when we
    have multiple hb->locks residing on the same cacheline and
    different futexes hash to adjacent buckets.

    This patch keeps the current static size of 16 entries for small
    systems, or otherwise, 256 * ncpus (or larger as we need to
    round the number to a power of 2). Note that this number of CPUs
    accounts for all CPUs that can ever be available in the system,
    taking into consideration things like hotplugging. While we do
    impose extra overhead at bootup by making the hash table larger,
    this is a one-time thing, and does not overshadow the benefits of
    this patch.

    Furthermore, as suggested by tglx, by cache aligning the hash
    buckets we can avoid access across cacheline boundaries and also
    avoid massive cache line bouncing if multiple cpus are hammering
    away at different hash buckets which happen to reside in the
    same cache line.

    Also, similar to other core kernel components (pid, dcache,
    tcp), by using alloc_large_system_hash() we benefit from its
    NUMA awareness and thus the table is distributed among the nodes
    instead of in a single one.
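
    The sizing itself is a one-liner at boot, roughly (sketch):

    /* 256 buckets per possible CPU, rounded up to a power of two */
    futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());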

    For a custom microbenchmark that pounds on the uaddr hashing --
    making the wait path fail at futex_wait_setup() returning
    -EWOULDBLOCK for large amounts of futexes, we can see the
    following benefits on an 80-core, 8-socket 1TB server:

    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
    |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
    |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
    |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
    |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
    |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Waiman Long
    Reviewed-and-tested-by: Jason Low
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • - Remove unnecessary head variables.
    - Delete unused parameter in queue_unlock().

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Jason Low
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     

13 Dec, 2013

2 commits

  • When debugging the read-only hugepage case, I was confused by the fact
    that get_futex_key() did an access_ok() only for the non-shared futex
    case, since the user address checking really isn't in any way specific
    to the private key handling.

    Now, it turns out that the shared key handling does effectively do the
    equivalent checks inside get_user_pages_fast() (it doesn't actually
    check the address range on x86, but does check the page protections for
    being a user page). So it wasn't actually a bug, but the fact that we
    treat the address differently for private and shared futexes threw me
    for a loop.

    Just move the check up, so that it gets done for both cases. Also, use
    the 'rw' parameter for the type, even if it doesn't actually matter any
    more (it's a historical artifact of the old racy i386 "page faults from
    kernel space don't check write protections").

    Cc: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The hugepage code had the exact same bug that regular pages had in
    commit 7485d0d3758e ("futexes: Remove rw parameter from
    get_futex_key()").

    The regular page case was fixed by commit 9ea71503a8ed ("futex: Fix
    regression with read only mappings"), but the transparent hugepage case
    (added in a5b338f2b0b1: "thp: update futex compound knowledge") case
    remained broken.

    Found by Dave Jones and his trinity tool.

    Reported-and-tested-by: Dave Jones
    Cc: stable@kernel.org # v2.6.38+
    Acked-by: Thomas Gleixner
    Cc: Mel Gorman
    Cc: Darren Hart
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Jun, 2013

2 commits

  • Avoid waking up every thread sleeping in a futex_wait call during
    suspend and resume by calling a freezable blocking call. Previous
    patches modified the freezer to avoid sending wakeups to threads
    that are blocked in freezable blocking calls.

    This call was selected to be converted to a freezable call because
    it doesn't hold any locks or release any resources when interrupted
    that might be needed by another freezing task or a kernel driver
    during suspend, and is a common site where idle userspace tasks are
    blocked.
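
    The conversion itself is tiny; in the wait path the plain schedule()
    becomes a freezer-aware sleep, roughly (sketch):

    if (likely(!plist_node_empty(&q->list))) {
            if (!timeout || timeout->task)
                    freezable_schedule();   /* was: schedule() */
    }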

    Signed-off-by: Colin Cross
    Cc: Rafael J. Wysocki
    Cc: arve@android.com
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Darren Hart
    Cc: Randy Dunlap
    Cc: Al Viro
    Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.com
    Signed-off-by: Thomas Gleixner

    Colin Cross
     
  • The futex_keys of process shared futexes are generated from the page
    offset, the mapping host and the mapping index of the futex user space
    address. This should result in an unique identifier for each futex.

    Though this is not true when futexes are located in different subpages
    of a hugepage. The reason is that the mapping index for all those
    futexes evaluates to the index of the base page of the hugetlbfs
    mapping. So a futex at offset 0 of the hugepage mapping and another
    one at offset PAGE_SIZE of the same hugepage mapping have identical
    futex_keys. This happens because the futex code blindly uses
    page->index.

    Steps to reproduce the bug:

    1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
    and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
    mapping.

    The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
    PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
    their keys solely depend on the user space address.

    2. Lock mutex1 and mutex2

    3. Create thread1 and in the thread function lock mutex1, which
    results in thread1 blocking on the locked mutex1.

    4. Create thread2 and in the thread function lock mutex2, which
    results in thread2 blocking on the locked mutex2.

    5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
    still blocks on mutex2 because the futex_key points to mutex1.

    To solve this issue we need to take the normal page index of the page
    which contains the futex into account, if the futex is in a hugetlbfs
    mapping. In other words, we calculate the normal page mapping index of
    the subpage in the hugetlbfs mapping.

    Mappings which are not based on hugetlbfs are not affected and still
    use page->index.
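
    With the helper from Mel's patch the key generation becomes roughly
    (sketch; field and helper names approximate):

    /* use the subpage's normal-page index, not the hugepage's base index */
    key->shared.pgoff = basepage_index(page);   /* plain page->index for non-huge pages */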

    Thanks to Mel Gorman who provided a patch for adding proper evaluation
    functions to the hugetlbfs code to avoid exposing hugetlbfs specific
    details to the futex code.

    [ tglx: Massaged changelog ]

    Signed-off-by: Zhang Yi
    Reviewed-by: Jiang Biao
    Tested-by: Ma Chenggong
    Reviewed-by: 'Mel Gorman'
    Acked-by: 'Darren Hart'
    Cc: 'Peter Zijlstra'
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
    Signed-off-by: Thomas Gleixner

    Zhang Yi
     

13 Mar, 2013

1 commit

  • Fix kernel-doc warning in futex.c and convert 'Returns' to the new Return:
    kernel-doc notation format.

    Warning(kernel/futex.c:2286): Excess function parameter 'clockrt' description in 'futex_wait_requeue_pi'

    Fix one spello.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

23 Feb, 2013

1 commit

  • Pull core locking changes from Ingo Molnar:
    "The biggest change is the rwsem lock-steal improvements, both to the
    assembly optimized and the spinlock based variants.

    The other notable change is the clean up of the seqlock implementation
    to be based on the seqcount infrastructure.

    The rest is assorted smaller debuggability, cleanup and continued -rt
    locking changes."

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rwsem-spinlock: Implement writer lock-stealing for better scalability
    futex: Revert "futex: Mark get_robust_list as deprecated"
    generic: Use raw local irq variant for generic cmpxchg
    lockdep: Selftest: convert spinlock to raw spinlock
    seqlock: Use seqcount infrastructure
    seqlock: Remove unused functions
    ntp: Make ntp_lock raw
    intel_idle: Convert i7300_idle_lock to raw_spinlock
    locking: Various static lock initializer fixes
    lockdep: Print more info when MAX_LOCK_DEPTH is exceeded
    rwsem: Implement writer lock-stealing for better scalability
    lockdep: Silence warning if CONFIG_LOCKDEP isn't set
    watchdog: Use local_clock for get_timestamp()
    lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug()
    locking/stat: Fix a typo

    Linus Torvalds
     

19 Feb, 2013

1 commit

  • This reverts commit ec0c4274e33c0373e476b73e01995c53128f1257.

    get_robust_list() is in use and a removal would break existing user
    space. With the permission checks in place it's not longer a security
    hole. Remove the deprecation warnings.

    Signed-off-by: Thomas Gleixner
    Cc: Cyrill Gorcunov
    Cc: Richard Weinberger
    Cc: akpm@linux-foundation.org
    Cc: paul.gortmaker@windriver.com
    Cc: davej@redhat.com
    Cc: keescook@chromium.org
    Cc: stable@vger.kernel.org
    Cc: ebiederm@xmission.com

    Thomas Gleixner
     

27 Nov, 2012

1 commit

  • Dave Jones reported a bug with futex_lock_pi() that his trinity test
    exposed. Sometime between queue_me() and taking the q.lock_ptr, the
    lock_ptr became NULL, resulting in a crash.

    While futex_wake() is careful to not call wake_futex() on futex_q's with
    a pi_state or an rt_waiter (which are either waiting for a
    futex_unlock_pi() or a PI futex_requeue()), futex_wake_op() and
    futex_requeue() do not perform the same test.

    Update futex_wake_op() and futex_requeue() to test for q.pi_state and
    q.rt_waiter and abort with -EINVAL if detected. To ensure any future
    breakage is caught, add a WARN() to wake_futex() if the same condition
    is true.
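
    The guard added to the wake/requeue loops and the backstop in wake_futex()
    look roughly like this (sketch):

    /* in the futex_wake_op()/futex_requeue() loops */
    if (this->pi_state || this->rt_waiter) {
            ret = -EINVAL;
            break;          /* PI waiters must not be woken or requeued here */
    }

    /* backstop in wake_futex() */
    if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
            return;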

    This fix has seen 3 hours of testing with "trinity -c futex" on an
    x86_64 VM with 4 CPUs.

    [akpm@linux-foundation.org: tidy up the WARN()]
    Signed-off-by: Darren Hart
    Reported-by: Dave Jones
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: John Kacur
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darren Hart
     

01 Nov, 2012

1 commit

  • Siddhesh analyzed a failure in the take over of pi futexes in case the
    owner died and provided a workaround.
    See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076

    The detailed problem analysis shows:

    Futex F is initialized with PTHREAD_PRIO_INHERIT and
    PTHREAD_MUTEX_ROBUST_NP attributes.

    T1 lock_futex_pi(F);

    T2 lock_futex_pi(F);
    --> T2 blocks on the futex and creates pi_state which is associated
    to T1.

    T1 exits
    --> exit_robust_list() runs
    --> Futex F userspace value TID field is set to 0 and
    FUTEX_OWNER_DIED bit is set.

    T3 lock_futex_pi(F);
    --> Succeeds due to the check for F's userspace TID field == 0
    --> Claims ownership of the futex and sets its own TID into the
    userspace TID field of futex F
    --> returns to user space

    T1 --> exit_pi_state_list()
    --> Transfers pi_state to waiter T2 and wakes T2 via
    rt_mutex_unlock(&pi_state->mutex)

    T2 --> acquires pi_state->mutex and gains real ownership of the
    pi_state
    --> Claims ownership of the futex and sets its own TID into the
    userspace TID field of futex F
    --> returns to user space

    T3 --> observes inconsistent state

    This problem is independent of UP/SMP, preemptible/non preemptible
    kernels, or process shared vs. private. The only difference is that
    certain configurations are more likely to expose it.

    So as Siddhesh correctly analyzed the following check in
    futex_lock_pi_atomic() is the culprit:

    if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {

    We check the userspace value for a TID value of 0 and take over the
    futex unconditionally if that's true.

    AFAICT this check is there as it is correct for a different corner
    case of futexes: the WAITERS bit became stale.

    Now the proposed change

    - if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
    + if (unlikely(ownerdied ||
    + !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {

    solves the problem, but it's not obvious why and it wrecks the
    "stale WAITERS bit" case.

    What happens is that, due to the WAITERS bit being set (T2 is blocked
    on that futex), T3 is forced to go through lookup_pi_state(), which
    in the above case returns an existing pi_state and therefore forces T3
    to legitimately fight with T2 over the ownership of the pi_state (via
    pi_state->mutex). Problem solved!

    Though that does not work for the "WAITERS bit is stale" problem
    because if lookup_pi_state() does not find existing pi_state it
    returns -ESRCH (due to TID == 0) which causes futex_lock_pi() to
    return -ESRCH to user space because the OWNER_DIED bit is not set.

    Now there is a different solution to that problem. Do not look at the
    user space value at all and enforce a lookup of possibly available
    pi_state. If pi_state can be found, then the new incoming locker T3
    blocks on that pi_state and legitimately races with T2 to acquire the
    rt_mutex and the pi_state and therefore the proper ownership of the
    user space futex.

    lookup_pi_state() has the correct order of checks. It first tries to
    find a pi_state associated with the user space futex and only if that
    fails it checks for futex TID value = 0. If no pi_state is available
    nothing can create new state at that point because this happens with
    the hash bucket lock held.

    So the above scenario changes to:

    T1 lock_futex_pi(F);

    T2 lock_futex_pi(F);
    --> T2 blocks on the futex and creates pi_state which is associated
    to T1.

    T1 exits
    --> exit_robust_list() runs
    --> Futex F userspace value TID field is set to 0 and
    FUTEX_OWNER_DIED bit is set.

    T3 lock_futex_pi(F);
    --> Finds pi_state and blocks on pi_state->rt_mutex

    T1 --> exit_pi_state_list()
    --> Transfers pi_state to waiter T2 and wakes it via
    rt_mutex_unlock(&pi_state->mutex)

    T2 --> acquires pi_state->mutex and gains ownership of the pi_state
    --> Claims ownership of the futex and sets its own TID into the
    userspace TID field of futex F
    --> returns to user space

    This covers all gazillion points on which T3 might come in between
    T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
    also solves the "WAITERS bit stale" problem by forcing the take over.

    Another benefit of changing the code this way is that it makes it less
    dependent on untrusted user space values and therefore minimizes the
    possible wreckage which might be inflicted.
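
    In rough pseudo-code the reordered take-over logic reads (sketch; helper
    usage is illustrative, not the literal code):

    top_waiter = futex_top_waiter(hb, key);
    if (top_waiter) {
            /*
             * Kernel state exists: attach to its pi_state and race for the
             * rt_mutex like any other waiter, regardless of the user space
             * TID / OWNER_DIED bits.
             */
            *ps = top_waiter->pi_state;
            return 0;
    }

    /*
     * No pi_state exists and none can be created behind our back, since
     * we hold the hash bucket lock. Only now are the TID == 0 and
     * OWNER_DIED checks on the user space value meaningful.
     */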

    As usual after staring for too long at the futex code my brain hurts
    so much that I really want to ditch that whole optimization of
    avoiding the syscall for the non contended case for PI futexes and rip
    out the maze of corner case handling code. Unfortunately we can't as
    user space relies on that existing behaviour, but at least thinking
    about it helps me to preserve my mental sanity. Maybe we should
    nevertheless :)

    Reported-and-tested-by: Siddhesh Poyarekar
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionos
    Acked-by: Darren Hart
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

24 Jul, 2012

3 commits

  • If uaddr == uaddr2, then we have broken the rule of only requeueing
    from a non-pi futex to a pi futex with this call. If we attempt this,
    as the trinity test suite manages to do, we miss early wakeups as
    q.key is equal to key2 (because they are the same uaddr). We will then
    attempt to dereference the pi_mutex (which would exist had the futex_q
    been properly requeued to a pi futex) and trigger a NULL pointer
    dereference.

    Signed-off-by: Darren Hart
    Cc: Dave Jones
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/ad82bfe7f7d130247fbe2b5b4275654807774227.1342809673.git.dvhart@linux.intel.com
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • The WARN_ON in futex_wait_requeue_pi() for a NULL q.pi_state was testing
    the address (&q.pi_state) of the pointer instead of the value
    (q.pi_state) of the pointer. Correct it accordingly.
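
    In other words (sketch):

    WARN_ON(!&q.pi_state);      /* before: address of a local, never NULL */
    WARN_ON(!q.pi_state);       /* after: actually tests the pointer value */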

    Signed-off-by: Darren Hart
    Cc: Dave Jones
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1c85d97f6e5f79ec389a4ead3e367363c74bd09a.1342809673.git.dvhart@linux.intel.com
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • If fixup_pi_state_owner() faults, pi_mutex may be NULL. Test
    for pi_mutex != NULL before testing the owner against current
    and possibly unlocking it.
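
    The guarded unlock then reads roughly (sketch):

    if (pi_mutex && rt_mutex_owner(pi_mutex) == current)
            rt_mutex_unlock(pi_mutex);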

    Signed-off-by: Darren Hart
    Cc: Dave Jones
    Cc: Dan Carpenter
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/dc59890338fc413606f04e5c5b131530734dae3d.1342809673.git.dvhart@linux.intel.com
    Signed-off-by: Thomas Gleixner

    Darren Hart
     

29 Mar, 2012

2 commits

  • Notify get_robust_list users that the syscall is going away.

    Suggested-by: Thomas Gleixner
    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Jiri Kosina
    Cc: Eric W. Biederman
    Cc: David Howells
    Cc: Serge E. Hallyn
    Cc: kernel-hardening@lists.openwall.com
    Cc: spender@grsecurity.net
    Link: http://lkml.kernel.org/r/20120323190855.GA27213@www.outflux.net
    Signed-off-by: Thomas Gleixner

    Kees Cook
     
  • It was possible to extract the robust list head address from a setuid
    process if it had used set_robust_list(), allowing an ASLR info leak. This
    changes the permission checks to be the same as those used for similar
    info that comes out of /proc.

    Running a setuid program that uses robust futexes would have had:
    cred->euid != pcred->euid
    cred->euid == pcred->uid
    so the old permissions check would allow it. I'm not aware of any setuid
    programs that use robust futexes, so this is just a preventative measure.

    (This patch is based on changes from grsecurity.)
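
    The permission check is the same one used for sensitive per-task data in
    /proc, roughly (sketch):

    ret = -EPERM;
    if (!ptrace_may_access(p, PTRACE_MODE_READ))
            goto err_unlock;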

    Signed-off-by: Kees Cook
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Jiri Kosina
    Cc: Eric W. Biederman
    Cc: David Howells
    Cc: Serge E. Hallyn
    Cc: kernel-hardening@lists.openwall.com
    Cc: spender@grsecurity.net
    Link: http://lkml.kernel.org/r/20120319231253.GA20893@www.outflux.net
    Signed-off-by: Thomas Gleixner

    Kees Cook
     

01 Jan, 2012

1 commit

  • It was found (by Sasha) that if you use a futex located in the gate
    area we get stuck in an uninterruptible infinite loop, much like the
    ZERO_PAGE issue.

    While looking at this problem, PeterZ realized you'll get into similar
    trouble when hitting any install_special_pages() mapping. And are there
    still drivers setting up their own special mmaps without page->mapping,
    and without special VM or pte flags to make get_user_pages fail?

    In most cases, if page->mapping is NULL, we do not need to retry at all:
    Linus points out that even /proc/sys/vm/drop_caches poses no problem,
    because it ends up using remove_mapping(), which takes care not to
    interfere when the page reference count is raised.

    But there is still one case which does need a retry: if memory pressure
    called shmem_writepage in between get_user_pages_fast dropping page
    table lock and our acquiring page lock, then the page gets switched from
    filecache to swapcache (and ->mapping set to NULL) whatever the refcount.
    Fault it back in to get the page->mapping needed for key->shared.inode.

    Reported-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include <linux/module.h>
    +#include <linux/export.h>

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker