06 Jun, 2014

4 commits

  • The current implementation of lookup_pi_state has ambiguous handling of
    the TID value 0 in the user space futex. We can get into the kernel
    even if the TID value is 0, because either there is a stale waiters bit
    or the owner died bit is set or we are called from the requeue_pi path
    or from user space just for fun.

    The current code avoids an explicit sanity check for pid = 0 when
    kernel internal state (waiters) is found for the user space
    address. This can lead to state leakage and worse under some
    circumstances.

    Handle the cases explicitly:

         Waiter | pi_state | pi->owner | uTID      | uODIED | ?

    [1]  NULL   | ---      | ---       | 0         | 0/1    | Valid
    [2]  NULL   | ---      | ---       | >0        | 0/1    | Valid

    [3]  Found  | NULL     | --        | Any       | 0/1    | Invalid

    [4]  Found  | Found    | NULL      | 0         | 1      | Valid
    [5]  Found  | Found    | NULL      | >0        | 1      | Invalid

    [6]  Found  | Found    | task      | 0         | 1      | Valid

    [7]  Found  | Found    | NULL      | Any       | 0      | Invalid

    [8]  Found  | Found    | task      | ==taskTID | 0/1    | Valid
    [9]  Found  | Found    | task      | 0         | 0      | Invalid
    [10] Found  | Found    | task      | !=taskTID | 0/1    | Invalid

    [1] Indicates that the kernel can acquire the futex atomically. We
    came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.

    [2] Valid, if TID does not belong to a kernel thread. If no matching
    thread is found then it indicates that the owner TID has died.

    [3] Invalid. The waiter is queued on a non-PI futex.

    [4] Valid state after exit_robust_list(), which sets the user space
    value to FUTEX_WAITERS | FUTEX_OWNER_DIED.

    [5] The user space value got manipulated between exit_robust_list()
    and exit_pi_state_list()

    [6] Valid state after exit_pi_state_list() which sets the new owner in
    the pi_state but cannot access the user space value.

    [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.

    [8] Owner and user space value match

    [9] There is no transient state which sets the user space TID to 0
    except exit_robust_list(), but this is indicated by the
    FUTEX_OWNER_DIED bit. See [4]

    [10] There is no transient state which leaves owner and user space
    TID out of sync.
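
    Roughly, once a pi_state is found, the explicit checks boil down to the
    following (a simplified sketch of the intent, not the literal kernel code;
    validate_pi_state() is an illustrative name):

    static int validate_pi_state(u32 uval, struct futex_pi_state *pi_state)
    {
            if (!pi_state->owner) {
                    /* [7]: owner may only be NULL if OWNER_DIED is set */
                    if (!(uval & FUTEX_OWNER_DIED))
                            return -EINVAL;
                    /* [5]: a non-zero TID with a NULL owner is corrupt state */
                    if (uval & FUTEX_TID_MASK)
                            return -EINVAL;
                    return 0;                               /* [4] */
            }
            if (!(uval & FUTEX_TID_MASK))
                    /* [6] is fine, [9] has no legitimate transient state */
                    return (uval & FUTEX_OWNER_DIED) ? 0 : -EINVAL;
            if ((uval & FUTEX_TID_MASK) != task_pid_vnr(pi_state->owner))
                    return -EINVAL;                         /* [10] */
            return 0;                                       /* [8] */
    }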

    Signed-off-by: Thomas Gleixner
    Cc: Kees Cook
    Cc: Will Drewry
    Cc: Darren Hart
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • If the owner died bit is set at futex_unlock_pi, we currently do not
    cleanup the user space futex. So the owner TID of the current owner
    (the unlocker) persists. That's observable inconsistent state,
    especially when the ownership of the pi state got transferred.

    Clean it up unconditionally.
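
    Conceptually the unlock fast path changes along these lines (a sketch,
    not the literal diff):

    /* before: leave the user space word alone if OWNER_DIED is set */
    if (!(uval & FUTEX_OWNER_DIED) &&
        cmpxchg_futex_value_locked(&uval, uaddr, vpid, 0))
            goto pi_faulted;

    /* after: always replace our TID with 0 so the next locker sees sane state */
    if (cmpxchg_futex_value_locked(&uval, uaddr, vpid, 0))
            goto pi_faulted;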

    Signed-off-by: Thomas Gleixner
    Cc: Kees Cook
    Cc: Will Drewry
    Cc: Darren Hart
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • We need to protect the atomic acquisition in the kernel against rogue
    user space which sets the user space futex to 0, so the kernel side
    acquisition succeeds while there is existing state in the kernel
    associated to the real owner.

    Verify whether the futex has waiters associated with kernel state. If
    it has, return -EINVAL. The state is corrupted already, so no point in
    cleaning it up. Subsequent calls will fail as well. Not our problem.

    [ tglx: Use futex_top_waiter() and explain why we do not need to try
    restoring the already corrupted user space state. ]
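
    The guard is small; in the atomic acquisition path it amounts to roughly
    the following (sketch, variable names approximate):

    /*
     * User space claims the futex is free (value 0), but if a waiter with
     * kernel state is queued on it, the state is already corrupt.
     */
    if (unlikely(!uval)) {
            struct futex_q *match = futex_top_waiter(hb, key);

            if (match)
                    return -EINVAL;
            /* otherwise the futex really is free and can be taken atomically */
    }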

    Signed-off-by: Darren Hart
    Cc: Kees Cook
    Cc: Will Drewry
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)

    If uaddr == uaddr2, then we have broken the rule of only requeueing from
    a non-pi futex to a pi futex with this call. If we attempt this, then
    dangling pointers may be left for rt_waiter resulting in an exploitable
    condition.

    This change brings futex_requeue() in line with futex_wait_requeue_pi()
    which performs the same check as per commit 6f7b0a2a5c0f ("futex: Forbid
    uaddr == uaddr2 in futex_wait_requeue_pi()")

    [ tglx: Compare the resulting keys as well, as uaddrs might be
    different depending on the mapping ]

    Fixes CVE-2014-3153.
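
    The check itself is short; roughly (sketch; match_futex() compares the
    resolved keys, so aliased mappings of the same futex are caught too):

    if (requeue_pi) {
            /* Requeue PI only makes sense between two distinct futexes */
            if (uaddr1 == uaddr2)
                    return -EINVAL;
    }

    /* ... after get_futex_key() has resolved both keys ... */
    if (requeue_pi && match_futex(&key1, &key2)) {
            ret = -EINVAL;
            goto out_put_keys;
    }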

    Reported-by: Pinkie Pie
    Signed-off-by: Will Drewry <wad@chromium.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Darren Hart <dvhart@linux.intel.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Thomas Gleixner
     

19 May, 2014

2 commits

  • We happily allow userspace to declare a random kernel thread to be the
    owner of a user space PI futex.

    Found while analysing the fallout of Dave Jones' syscall fuzzer.

    We also should validate the thread group for private futexes and find
    some fast way to validate whether the "alleged" owner has RW access on
    the file which backs the SHM, but that's a separate issue.
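
    The fix boils down to rejecting a kernel thread as the alleged owner when
    attaching to it, roughly (sketch):

    p = futex_find_get_task(pid);
    if (!p)
            return -ESRCH;

    /* a kernel thread can never legitimately own a user space PI futex */
    if (unlikely(p->flags & PF_KTHREAD)) {
            put_task_struct(p);
            return -EPERM;
    }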

    Signed-off-by: Thomas Gleixner
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Clark Williams
    Cc: Paul McKenney
    Cc: Lai Jiangshan
    Cc: Roland McGrath
    Cc: Carlos ODonell
    Cc: Jakub Jelinek
    Cc: Michael Kerrisk
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     
    Dave Jones' trinity syscall fuzzer exposed an issue in the deadlock
    detection code of rtmutex:
    http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com

    That underlying issue has been fixed with a patch to the rtmutex code,
    but the futex code must not call into rtmutex in that case because
    - it can detect that issue early
    - it avoids a different and more complex fixup for backing out

    If the user space variable got manipulated to 0x80000000, which means
    no lock holder but the waiters bit set, and an active pi_state is
    found in the kernel, we can figure out the recursive locking issue by
    looking at the pi_state owner. If that is the current task, then we
    can safely return -EDEADLK.

    The check should have been added in commit 59fa62451 (futex: Handle
    futex_pi OWNER_DIED take over correctly) already, but I did not see
    the above issue caused by user space manipulation back then.
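
    The added check is essentially (sketch):

    /*
     * uval says "no owner" but the waiters bit is set and a pi_state
     * exists. If we already own that pi_state this is a recursive lock
     * attempt: report it here instead of calling into rtmutex.
     */
    if (pi_state->owner == current)
            return -EDEADLK;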

    Signed-off-by: Thomas Gleixner
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Clark Williams
    Cc: Paul McKenney
    Cc: Lai Jiangshan
    Cc: Roland McGrath
    Cc: Carlos ODonell
    Cc: Jakub Jelinek
    Cc: Michael Kerrisk
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     

13 Apr, 2014

1 commit

  • Commits 11d4616bd07f ("futex: revert back to the explicit waiter
    counting code") and 69cd9eba3886 ("futex: avoid race between requeue and
    wake") changed some of the finer details of how we think about futexes.
    One was a late fix and the other a consequence of overlooking the whole
    requeuing logic.

    The first change caused our documentation to be incorrect, and the
    second made us aware that we need to explicitly add more details to it.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

09 Apr, 2014

1 commit

  • Jan Stancek reported:
    "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
    occasionally fails, because some threads fail to wake up.

    Testcase creates 5 threads, which are all waiting on same condition.
    Main thread then calls pthread_cond_broadcast() without holding mutex,
    which calls:

    futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)

    This immediately wakes up single thread A, which unlocks mutex and
    tries to wake up another thread:

    futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)

    If thread A manages to call futex_wake() before any waiters are
    requeued for uaddr2, no other thread is woken up"

    The ordering constraints for the hash bucket waiter counting are that
    the waiter counts have to be incremented _before_ getting the spinlock
    (because the spinlock acts as part of the memory barrier), but the
    "requeue" operation didn't honor those rules, and nobody had even
    thought about that case.

    This fairly simple patch just increments the waiter count for the target
    hash bucket (hb2) when requeueing a futex before taking the locks. It
    then decrements them again after releasing the lock - the code that
    actually moves the futex(es) between hash buckets will do the additional
    required waiter count housekeeping.
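
    In futex_requeue() the bracketing looks roughly like this (sketch; the
    hb_waiters_* helpers come from the explicit waiter counting code):

    hb_waiters_inc(hb2);            /* make hb2 look busy before we lock it */
    double_lock_hb(hb1, hb2);

    /* ... move futex_q entries from hb1 to hb2 under both locks ... */

    double_unlock_hb(hb1, hb2);
    hb_waiters_dec(hb2);            /* moved waiters now account for themselves */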

    Reported-and-tested-by: Jan Stancek
    Acked-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org # 3.14
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Apr, 2014

1 commit

  • Pull core locking updates from Ingo Molnar:
    "The biggest change is the MCS spinlock generalization changes from Tim
    Chen, Peter Zijlstra, Jason Low et al. There's also lockdep
    fixes/enhancements from Oleg Nesterov, in particular a false negative
    fix related to lockdep_set_novalidate_class() usage"

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
    locking/mutex: Fix debug checks
    locking/mutexes: Add extra reschedule point
    locking/mutexes: Introduce cancelable MCS lock for adaptive spinning
    locking/mutexes: Unlock the mutex without the wait_lock
    locking/mutexes: Modify the way optimistic spinners are queued
    locking/mutexes: Return false if task need_resched() in mutex_can_spin_on_owner()
    locking: Move mcs_spinlock.h into kernel/locking/
    m68k: Skip futex_atomic_cmpxchg_inatomic() test
    futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() test
    Revert "sched/wait: Suppress Sparse 'variable shadowing' warning"
    lockdep: Change lockdep_set_novalidate_class() to use _and_name
    lockdep: Change mark_held_locks() to check hlock->check instead of lockdep_no_validate
    lockdep: Don't create the wrong dependency on hlock->check == 0
    lockdep: Make held_lock->check and "int check" argument bool
    locking/mcs: Allow architecture specific asm files to be used for contended case
    locking/mcs: Order the header files in Kbuild of each architecture in alphabetical order
    sched/wait: Suppress Sparse 'variable shadowing' warning
    hung_task/Documentation: Fix hung_task_warnings description
    locking/mcs: Allow architectures to hook in to contended paths
    locking/mcs: Micro-optimize the MCS code, add extra comments
    ...

    Linus Torvalds
     

21 Mar, 2014

1 commit

  • Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid
    taking the hb->lock if there's nothing to wake up") causes java threads
    getting stuck on futexes when running specjbb on a power7 numa box.

    The cause appears to be that the powerpc spinlocks aren't using the same
    ticket lock model that we use on x86 (and other) architectures, which in
    turn results in the "spin_is_locked()" test in hb_waiters_pending()
    occasionally reporting an unlocked spinlock even when there are pending
    waiters.

    So this reinstates Davidlohr Bueso's original explicit waiter counting
    code, which I had convinced Davidlohr to drop in favor of figuring out
    the pending waiters by just using the existing state of the spinlock and
    the wait queue.
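
    The explicit counting keeps a per-bucket atomic counter that wakers test
    instead of peeking at the spinlock, roughly (sketch; field and helper
    names approximate):

    static inline void hb_waiters_inc(struct futex_hash_bucket *hb)
    {
            atomic_inc(&hb->waiters);
            /* full barrier so a waker sees the count before reading the futex word */
            smp_mb__after_atomic_inc();
    }

    static inline int hb_waiters_pending(struct futex_hash_bucket *hb)
    {
            return atomic_read(&hb->waiters);
    }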

    Reported-and-tested-by: Srikar Dronamraju
    Original-code-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Mar, 2014

1 commit

    If an architecture has futex_atomic_cmpxchg_inatomic() implemented and no
    runtime check is necessary, allow it to skip the test within futex_init().

    This makes it possible to get rid of some code which would always give the
    same result, and also allows the compiler to optimize a couple of if
    statements away.
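
    The shape of the change is roughly (sketch; the architecture selects a
    config symbol and the runtime probe compiles away):

    static void __init futex_detect_cmpxchg(void)
    {
    #ifndef CONFIG_HAVE_FUTEX_CMPXCHG
            u32 curval;

            /* probing a NULL user address: -EFAULT means the operation itself works */
            if (cmpxchg_futex_value_locked(&curval, NULL, 0, 0) == -EFAULT)
                    futex_cmpxchg_enabled = 1;
    #endif
    }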

    Signed-off-by: Heiko Carstens
    Cc: Finn Thain
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris
    Signed-off-by: Thomas Gleixner

    Heiko Carstens
     

21 Jan, 2014

1 commit

  • Pull scheduler changes from Ingo Molnar:

    - Add the initial implementation of SCHED_DEADLINE support: a real-time
    scheduling policy where tasks that meet their deadlines and
    periodically execute their instances in less than their runtime quota
    see real-time scheduling and won't miss any of their deadlines.
    Tasks that go over their quota get delayed (Available to privileged
    users for now)

    - Clean up and fix preempt_enable_no_resched() abuse all around the
    tree

    - Do sched_clock() performance optimizations on x86 and elsewhere

    - Fix and improve auto-NUMA balancing

    - Fix and clean up the idle loop

    - Apply various cleanups and fixes

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    sched: Fix __sched_setscheduler() nice test
    sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
    sched: Fix up attr::sched_priority warning
    sched: Fix up scheduler syscall LTP fails
    sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
    sched/core: Fix htmldocs warnings
    sched/deadline: No need to check p if dl_se is valid
    sched/deadline: Remove unused variables
    sched/deadline: Fix sparse static warnings
    m68k: Fix build warning in mac_via.h
    sched, thermal: Clean up preempt_enable_no_resched() abuse
    sched, net: Fixup busy_loop_us_clock()
    sched, net: Clean up preempt_enable_no_resched() abuse
    sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
    sched/preempt, locking: Rework local_bh_{dis,en}able()
    sched/clock, x86: Avoid a runtime condition in native_sched_clock()
    sched/clock: Fix up clear_sched_clock_stable()
    sched/clock, x86: Use a static_key for sched_clock_stable
    sched/clock: Remove local_irq_disable() from the clocks
    sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
    ...

    Linus Torvalds
     

16 Jan, 2014

1 commit

  • "futexes: Increase hash table size for better performance"
    introduces a new alloc_large_system_hash() call.

    alloc_large_system_hash() however may allocate less memory than
    requested, e.g. limited by MAX_ORDER.

    Hence pass a pointer to alloc_large_system_hash() which will
    contain the hash shift when the function returns. Afterwards
    correctly set futex_hashsize.

    Fixes a crash on s390 where the requested allocation size was
    4MB but only 1MB was allocated.
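
    The fix lets the allocator report the shift it actually used and derives
    the table size from that, roughly (sketch, arguments simplified):

    unsigned int futex_shift;

    futex_queues = alloc_large_system_hash("futex", sizeof(*futex_queues),
                                           futex_hashsize, 0, 0,
                                           &futex_shift, NULL,
                                           futex_hashsize, futex_hashsize);
    futex_hashsize = 1UL << futex_shift;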

    Signed-off-by: Heiko Carstens
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     

13 Jan, 2014

5 commits

  • Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
    and provide a proper comparison function for -deadline and
    -priority tasks.

    This is done mainly because:
    - classical prio field of the plist is just an int, which might
    not be enough for representing a deadline;
    - manipulating such a list would become O(nr_deadline_tasks),
    which might be too much, as the number of -deadline tasks increases.

    Therefore, an rb-tree is used, and tasks are queued in it according
    to the following logic:
    - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
    one with the higher (lower, actually!) prio wins;
    - among a -priority and a -deadline task, the latter always wins;
    - among two -deadline tasks, the one with the earliest deadline
    wins.

    Queueing and dequeueing functions are changed accordingly, for both
    the list of a task's pi-waiters and the list of tasks blocked on
    a pi-lock.
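
    The ordering rule translates into a comparison function along these lines
    (sketch; field names approximate):

    static inline int
    rt_mutex_waiter_less(struct rt_mutex_waiter *left,
                         struct rt_mutex_waiter *right)
    {
            /* lower prio value wins; -deadline tasks have lower values than all others */
            if (left->task->prio < right->task->prio)
                    return 1;

            /* equal prio and both -deadline: the earlier deadline wins */
            if (dl_prio(left->task->prio))
                    return left->task->dl.deadline < right->task->dl.deadline;

            return 0;
    }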

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-again-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In futex_wake() there is clearly no point in taking the hb->lock
    if we know beforehand that there are no tasks to be woken. While
    the hash bucket's plist head is a cheap way of knowing this, we
    cannot rely 100% on it as there is a racy window between the
    futex_wait call and when the task is actually added to the
    plist. To this end, we couple it with the spinlock check as
    tasks trying to enter the critical region are most likely
    potential waiters that will be added to the plist, thus
    preventing tasks sleeping forever if wakers don't acknowledge
    all possible waiters.

    Furthermore, the futex ordering guarantees are preserved,
    ensuring that waiters either observe the changed user space
    value before blocking or are woken by a concurrent waker. For
    wakers, this is done by relying on the barriers in
    get_futex_key_refs() -- for archs that do not have implicit mb
    in atomic_inc(), we explicitly add them through a new
    futex_get_mm function. For waiters we rely on the fact that
    spin_lock calls already update the head counter, so spinners
    are visible even if the lock hasn't been acquired yet.

    For more details please refer to the updated comments in the
    code and related discussion:

    https://lkml.org/lkml/2013/11/26/556

    Special thanks to tglx for careful review and feedback.
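
    For illustration, the waker-side barrier for private futexes looks roughly
    like this (sketch):

    static inline void futex_get_mm(union futex_key *key)
    {
            atomic_inc(&key->private.mm->mm_count);
            /*
             * Full barrier after the increment, pairing with the barrier
             * implied by spin_lock() on the waiter side.
             */
            smp_mb__after_atomic_inc();
    }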

    Suggested-by: Linus Torvalds
    Reviewed-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Cc: Paul E. McKenney
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • That's essential, if you want to hack on futexes.

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Randy Dunlap
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Currently, the futex global hash table suffers from its fixed,
    smallish (for today's standards) size of 256 entries, as well as
    its lack of NUMA awareness. Large systems, using many futexes,
    can be prone to high amounts of collisions; where these futexes
    hash to the same bucket and lead to extra contention on the same
    hb->lock. Furthermore, cacheline bouncing is a reality when we
    have multiple hb->locks residing on the same cacheline and
    different futexes hash to adjacent buckets.

    This patch keeps the current static size of 16 entries for small
    systems, or otherwise, 256 * ncpus (or larger as we need to
    round the number to a power of 2). Note that this number of CPUs
    accounts for all CPUs that can ever be available in the system,
    taking into consideration things like hotplugging. While we do
    impose extra overhead at bootup by making the hash table larger,
    this is a one-time thing, and does not overshadow the benefits of
    this patch.

    Furthermore, as suggested by tglx, by cache aligning the hash
    buckets we can avoid access across cacheline boundaries and also
    avoid massive cache line bouncing if multiple cpus are hammering
    away at different hash buckets which happen to reside in the
    same cache line.

    Also, similar to other core kernel components (pid, dcache,
    tcp), by using alloc_large_system_hash() we benefit from its
    NUMA awareness and thus the table is distributed among the nodes
    instead of in a single one.
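
    The sizing itself is a one-liner at boot, roughly (sketch):

    /* 256 buckets per possible CPU, rounded up to a power of two */
    futex_hashsize = roundup_pow_of_two(256 * num_possible_cpus());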

    For a custom microbenchmark that pounds on the uaddr hashing --
    making the wait path fail at futex_wait_setup() returning
    -EWOULDBLOCK for large amounts of futexes, we can see the
    following benefits on an 80-core, 8-socket 1TB server:

    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
    |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
    |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
    |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
    |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
    |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Waiman Long
    Reviewed-and-tested-by: Jason Low
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • - Remove unnecessary head variables.
    - Delete unused parameter in queue_unlock().

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Jason Low
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     

13 Dec, 2013

2 commits

  • When debugging the read-only hugepage case, I was confused by the fact
    that get_futex_key() did an access_ok() only for the non-shared futex
    case, since the user address checking really isn't in any way specific
    to the private key handling.

    Now, it turns out that the shared key handling does effectively do the
    equivalent checks inside get_user_pages_fast() (it doesn't actually
    check the address range on x86, but does check the page protections for
    being a user page). So it wasn't actually a bug, but the fact that we
    treat the address differently for private and shared futexes threw me
    for a loop.

    Just move the check up, so that it gets done for both cases. Also, use
    the 'rw' parameter for the type, even if it doesn't actually matter any
    more (it's a historical artifact of the old racy i386 "page faults from
    kernel space don't check write protections").

    Cc: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The hugepage code had the exact same bug that regular pages had in
    commit 7485d0d3758e ("futexes: Remove rw parameter from
    get_futex_key()").

    The regular page case was fixed by commit 9ea71503a8ed ("futex: Fix
    regression with read only mappings"), but the transparent hugepage case
    (added in a5b338f2b0b1: "thp: update futex compound knowledge") case
    remained broken.

    Found by Dave Jones and his trinity tool.

    Reported-and-tested-by: Dave Jones
    Cc: stable@kernel.org # v2.6.38+
    Acked-by: Thomas Gleixner
    Cc: Mel Gorman
    Cc: Darren Hart
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Jun, 2013

2 commits

  • Avoid waking up every thread sleeping in a futex_wait call during
    suspend and resume by calling a freezable blocking call. Previous
    patches modified the freezer to avoid sending wakeups to threads
    that are blocked in freezable blocking calls.

    This call was selected to be converted to a freezable call because
    it doesn't hold any locks or release any resources when interrupted
    that might be needed by another freezing task or a kernel driver
    during suspend, and is a common site where idle userspace tasks are
    blocked.
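
    The conversion itself is tiny; in the wait path the plain schedule()
    becomes a freezer-aware sleep, roughly (sketch):

    if (likely(!plist_node_empty(&q->list))) {
            if (!timeout || timeout->task)
                    freezable_schedule();   /* was: schedule() */
    }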

    Signed-off-by: Colin Cross
    Cc: Rafael J. Wysocki
    Cc: arve@android.com
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Darren Hart
    Cc: Randy Dunlap
    Cc: Al Viro
    Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.com
    Signed-off-by: Thomas Gleixner

    Colin Cross
     
  • The futex_keys of process shared futexes are generated from the page
    offset, the mapping host and the mapping index of the futex user space
    address. This should result in an unique identifier for each futex.

    Though this is not true when futexes are located in different subpages
    of a hugepage. The reason is that the mapping index for all those
    futexes evaluates to the index of the base page of the hugetlbfs
    mapping. So a futex at offset 0 of the hugepage mapping and another
    one at offset PAGE_SIZE of the same hugepage mapping have identical
    futex_keys. This happens because the futex code blindly uses
    page->index.

    Steps to reproduce the bug:

    1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
    and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
    mapping.

    The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
    PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
    their keys solely depend on the user space address.

    2. Lock mutex1 and mutex2

    3. Create thread1 and in the thread function lock mutex1, which
    results in thread1 blocking on the locked mutex1.

    4. Create thread2 and in the thread function lock mutex2, which
    results in thread2 blocking on the locked mutex2.

    5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
    still blocks on mutex2 because the futex_key points to mutex1.

    To solve this issue we need to take the normal page index of the page
    which contains the futex into account, if the futex is in a hugetlbfs
    mapping. In other words, we calculate the normal page mapping index of
    the subpage in the hugetlbfs mapping.

    Mappings which are not based on hugetlbfs are not affected and still
    use page->index.
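
    With the helper from Mel's patch the key generation becomes roughly
    (sketch; field and helper names approximate):

    /* use the subpage's normal-page index, not the hugepage's base index */
    key->shared.pgoff = basepage_index(page);   /* plain page->index for non-huge pages */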

    Thanks to Mel Gorman who provided a patch for adding proper evaluation
    functions to the hugetlbfs code to avoid exposing hugetlbfs specific
    details to the futex code.

    [ tglx: Massaged changelog ]

    Signed-off-by: Zhang Yi
    Reviewed-by: Jiang Biao
    Tested-by: Ma Chenggong
    Reviewed-by: 'Mel Gorman'
    Acked-by: 'Darren Hart'
    Cc: 'Peter Zijlstra'
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
    Signed-off-by: Thomas Gleixner

    Zhang Yi
     

13 Mar, 2013

1 commit

  • Fix kernel-doc warning in futex.c and convert 'Returns' to the new Return:
    kernel-doc notation format.

    Warning(kernel/futex.c:2286): Excess function parameter 'clockrt' description in 'futex_wait_requeue_pi'

    Fix one spello.

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

23 Feb, 2013

1 commit

  • Pull core locking changes from Ingo Molnar:
    "The biggest change is the rwsem lock-steal improvements, both to the
    assembly optimized and the spinlock based variants.

    The other notable change is the clean up of the seqlock implementation
    to be based on the seqcount infrastructure.

    The rest is assorted smaller debuggability, cleanup and continued -rt
    locking changes."

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rwsem-spinlock: Implement writer lock-stealing for better scalability
    futex: Revert "futex: Mark get_robust_list as deprecated"
    generic: Use raw local irq variant for generic cmpxchg
    lockdep: Selftest: convert spinlock to raw spinlock
    seqlock: Use seqcount infrastructure
    seqlock: Remove unused functions
    ntp: Make ntp_lock raw
    intel_idle: Convert i7300_idle_lock to raw_spinlock
    locking: Various static lock initializer fixes
    lockdep: Print more info when MAX_LOCK_DEPTH is exceeded
    rwsem: Implement writer lock-stealing for better scalability
    lockdep: Silence warning if CONFIG_LOCKDEP isn't set
    watchdog: Use local_clock for get_timestamp()
    lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug()
    locking/stat: Fix a typo

    Linus Torvalds
     

19 Feb, 2013

1 commit

  • This reverts commit ec0c4274e33c0373e476b73e01995c53128f1257.

    get_robust_list() is in use and a removal would break existing user
    space. With the permission checks in place it's not longer a security
    hole. Remove the deprecation warnings.

    Signed-off-by: Thomas Gleixner
    Cc: Cyrill Gorcunov
    Cc: Richard Weinberger
    Cc: akpm@linux-foundation.org
    Cc: paul.gortmaker@windriver.com
    Cc: davej@redhat.com
    Cc: keescook@chromium.org
    Cc: stable@vger.kernel.org
    Cc: ebiederm@xmission.com

    Thomas Gleixner
     

27 Nov, 2012

1 commit

  • Dave Jones reported a bug with futex_lock_pi() that his trinity test
    exposed. Sometime between queue_me() and taking the q.lock_ptr, the
    lock_ptr became NULL, resulting in a crash.

    While futex_wake() is careful to not call wake_futex() on futex_q's with
    a pi_state or an rt_waiter (which are either waiting for a
    futex_unlock_pi() or a PI futex_requeue()), futex_wake_op() and
    futex_requeue() do not perform the same test.

    Update futex_wake_op() and futex_requeue() to test for q.pi_state and
    q.rt_waiter and abort with -EINVAL if detected. To ensure any future
    breakage is caught, add a WARN() to wake_futex() if the same condition
    is true.
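
    The guard added to the wake/requeue loops and the backstop in wake_futex()
    look roughly like this (sketch):

    /* in the futex_wake_op()/futex_requeue() loops */
    if (this->pi_state || this->rt_waiter) {
            ret = -EINVAL;
            break;          /* PI waiters must not be woken or requeued here */
    }

    /* backstop in wake_futex() */
    if (WARN(q->pi_state || q->rt_waiter, "refusing to wake PI futex\n"))
            return;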

    This fix has seen 3 hours of testing with "trinity -c futex" on an
    x86_64 VM with 4 CPUs.

    [akpm@linux-foundation.org: tidy up the WARN()]
    Signed-off-by: Darren Hart
    Reported-by: Dave Jones
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: John Kacur
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darren Hart
     

01 Nov, 2012

1 commit

  • Siddhesh analyzed a failure in the take over of pi futexes in case the
    owner died and provided a workaround.
    See: http://sourceware.org/bugzilla/show_bug.cgi?id=14076

    The detailed problem analysis shows:

    Futex F is initialized with PTHREAD_PRIO_INHERIT and
    PTHREAD_MUTEX_ROBUST_NP attributes.

    T1 lock_futex_pi(F);

    T2 lock_futex_pi(F);
    --> T2 blocks on the futex and creates pi_state which is associated
    to T1.

    T1 exits
    --> exit_robust_list() runs
    --> Futex F userspace value TID field is set to 0 and
    FUTEX_OWNER_DIED bit is set.

    T3 lock_futex_pi(F);
    --> Succeeds due to the check for F's userspace TID field == 0
    --> Claims ownership of the futex and sets its own TID into the
    userspace TID field of futex F
    --> returns to user space

    T1 --> exit_pi_state_list()
    --> Transfers pi_state to waiter T2 and wakes T2 via
    rt_mutex_unlock(&pi_state->mutex)

    T2 --> acquires pi_state->mutex and gains real ownership of the
    pi_state
    --> Claims ownership of the futex and sets its own TID into the
    userspace TID field of futex F
    --> returns to user space

    T3 --> observes inconsistent state

    This problem is independent of UP/SMP, preemptible/non preemptible
    kernels, or process shared vs. private. The only difference is that
    certain configurations are more likely to expose it.

    So as Siddhesh correctly analyzed the following check in
    futex_lock_pi_atomic() is the culprit:

    if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {

    We check the userspace value for a TID value of 0 and take over the
    futex unconditionally if that's true.

    AFAICT this check is there as it is correct for a different corner
    case of futexes: the WAITERS bit became stale.

    Now the proposed change

    - if (unlikely(ownerdied || !(curval & FUTEX_TID_MASK))) {
    + if (unlikely(ownerdied ||
    + !(curval & (FUTEX_TID_MASK | FUTEX_WAITERS)))) {

    solves the problem, but it's not obvious why and it wrecks the
    "stale WAITERS bit" case.

    What happens is that, due to the WAITERS bit being set (T2 is blocked
    on that futex), T3 is forced to go through lookup_pi_state(), which
    in the above case returns an existing pi_state and therefore forces T3
    to legitimately fight with T2 over the ownership of the pi_state (via
    pi_state->mutex). Problem solved!

    Though that does not work for the "WAITERS bit is stale" problem
    because if lookup_pi_state() does not find existing pi_state it
    returns -ESRCH (due to TID == 0) which causes futex_lock_pi() to
    return -ESRCH to user space because the OWNER_DIED bit is not set.

    Now there is a different solution to that problem. Do not look at the
    user space value at all and enforce a lookup of possibly available
    pi_state. If pi_state can be found, then the new incoming locker T3
    blocks on that pi_state and legitimately races with T2 to acquire the
    rt_mutex and the pi_state and therefore the proper ownership of the
    user space futex.

    lookup_pi_state() has the correct order of checks. It first tries to
    find a pi_state associated with the user space futex and only if that
    fails it checks for futex TID value = 0. If no pi_state is available
    nothing can create new state at that point because this happens with
    the hash bucket lock held.

    So the above scenario changes to:

    T1 lock_futex_pi(F);

    T2 lock_futex_pi(F);
    --> T2 blocks on the futex and creates pi_state which is associated
    to T1.

    T1 exits
    --> exit_robust_list() runs
    --> Futex F userspace value TID field is set to 0 and
    FUTEX_OWNER_DIED bit is set.

    T3 lock_futex_pi(F);
    --> Finds pi_state and blocks on pi_state->rt_mutex

    T1 --> exit_pi_state_list()
    --> Transfers pi_state to waiter T2 and wakes it via
    rt_mutex_unlock(&pi_state->mutex)

    T2 --> acquires pi_state->mutex and gains ownership of the pi_state
    --> Claims ownership of the futex and sets its own TID into the
    userspace TID field of futex F
    --> returns to user space

    This covers all gazillion points on which T3 might come in between
    T1's exit_robust_list() clearing the TID field and T2 fixing it up. It
    also solves the "WAITERS bit stale" problem by forcing the take over.

    Another benefit of changing the code this way is that it makes it less
    dependent on untrusted user space values and therefore minimizes the
    possible wreckage which might be inflicted.
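
    In rough pseudo-code the reordered take-over logic reads (sketch; helper
    usage is illustrative, not the literal code):

    top_waiter = futex_top_waiter(hb, key);
    if (top_waiter) {
            /*
             * Kernel state exists: attach to its pi_state and race for the
             * rt_mutex like any other waiter, regardless of the user space
             * TID / OWNER_DIED bits.
             */
            *ps = top_waiter->pi_state;
            return 0;
    }

    /*
     * No pi_state exists and none can be created behind our back, since
     * we hold the hash bucket lock. Only now are the TID == 0 and
     * OWNER_DIED checks on the user space value meaningful.
     */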

    As usual after staring for too long at the futex code my brain hurts
    so much that I really want to ditch that whole optimization of
    avoiding the syscall for the non contended case for PI futexes and rip
    out the maze of corner case handling code. Unfortunately we can't as
    user space relies on that existing behaviour, but at least thinking
    about it helps me to preserve my mental sanity. Maybe we should
    nevertheless :)

    Reported-and-tested-by: Siddhesh Poyarekar
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1210232138540.2756@ionos
    Acked-by: Darren Hart
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

24 Jul, 2012

3 commits

  • If uaddr == uaddr2, then we have broken the rule of only requeueing
    from a non-pi futex to a pi futex with this call. If we attempt this,
    as the trinity test suite manages to do, we miss early wakeups as
    q.key is equal to key2 (because they are the same uaddr). We will then
    attempt to dereference the pi_mutex (which would exist had the futex_q
    been properly requeued to a pi futex) and trigger a NULL pointer
    dereference.

    Signed-off-by: Darren Hart
    Cc: Dave Jones
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/ad82bfe7f7d130247fbe2b5b4275654807774227.1342809673.git.dvhart@linux.intel.com
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • The WARN_ON in futex_wait_requeue_pi() for a NULL q.pi_state was testing
    the address (&q.pi_state) of the pointer instead of the value
    (q.pi_state) of the pointer. Correct it accordingly.
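
    In other words (sketch):

    WARN_ON(!&q.pi_state);      /* before: address of a local, never NULL */
    WARN_ON(!q.pi_state);       /* after: actually tests the pointer value */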

    Signed-off-by: Darren Hart
    Cc: Dave Jones
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1c85d97f6e5f79ec389a4ead3e367363c74bd09a.1342809673.git.dvhart@linux.intel.com
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • If fixup_pi_state_owner() faults, pi_mutex may be NULL. Test
    for pi_mutex != NULL before testing the owner against current
    and possibly unlocking it.
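
    The guarded unlock then reads roughly (sketch):

    if (pi_mutex && rt_mutex_owner(pi_mutex) == current)
            rt_mutex_unlock(pi_mutex);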

    Signed-off-by: Darren Hart
    Cc: Dave Jones
    Cc: Dan Carpenter
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/dc59890338fc413606f04e5c5b131530734dae3d.1342809673.git.dvhart@linux.intel.com
    Signed-off-by: Thomas Gleixner

    Darren Hart
     

29 Mar, 2012

2 commits

  • Notify get_robust_list users that the syscall is going away.

    Suggested-by: Thomas Gleixner
    Signed-off-by: Kees Cook
    Cc: Randy Dunlap
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Jiri Kosina
    Cc: Eric W. Biederman
    Cc: David Howells
    Cc: Serge E. Hallyn
    Cc: kernel-hardening@lists.openwall.com
    Cc: spender@grsecurity.net
    Link: http://lkml.kernel.org/r/20120323190855.GA27213@www.outflux.net
    Signed-off-by: Thomas Gleixner

    Kees Cook
     
  • It was possible to extract the robust list head address from a setuid
    process if it had used set_robust_list(), allowing an ASLR info leak. This
    changes the permission checks to be the same as those used for similar
    info that comes out of /proc.

    Running a setuid program that uses robust futexes would have had:
    cred->euid != pcred->euid
    cred->euid == pcred->uid
    so the old permissions check would allow it. I'm not aware of any setuid
    programs that use robust futexes, so this is just a preventative measure.

    (This patch is based on changes from grsecurity.)
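
    The permission check is the same one used for sensitive per-task data in
    /proc, roughly (sketch):

    ret = -EPERM;
    if (!ptrace_may_access(p, PTRACE_MODE_READ))
            goto err_unlock;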

    Signed-off-by: Kees Cook
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Jiri Kosina
    Cc: Eric W. Biederman
    Cc: David Howells
    Cc: Serge E. Hallyn
    Cc: kernel-hardening@lists.openwall.com
    Cc: spender@grsecurity.net
    Link: http://lkml.kernel.org/r/20120319231253.GA20893@www.outflux.net
    Signed-off-by: Thomas Gleixner

    Kees Cook
     

01 Jan, 2012

1 commit

  • It was found (by Sasha) that if you use a futex located in the gate
    area we get stuck in an uninterruptible infinite loop, much like the
    ZERO_PAGE issue.

    While looking at this problem, PeterZ realized you'll get into similar
    trouble when hitting any install_special_pages() mapping. And are there
    still drivers setting up their own special mmaps without page->mapping,
    and without special VM or pte flags to make get_user_pages fail?

    In most cases, if page->mapping is NULL, we do not need to retry at all:
    Linus points out that even /proc/sys/vm/drop_caches poses no problem,
    because it ends up using remove_mapping(), which takes care not to
    interfere when the page reference count is raised.

    But there is still one case which does need a retry: if memory pressure
    called shmem_writepage in between get_user_pages_fast dropping page
    table lock and our acquiring page lock, then the page gets switched from
    filecache to swapcache (and ->mapping set to NULL) whatever the refcount.
    Fault it back in to get the page->mapping needed for key->shared.inode.

    Reported-by: Sasha Levin
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include <linux/module.h>
    +#include <linux/export.h>

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker