26 Oct, 2014

2 commits

  • free_pi_state() and exit_pi_state_list() both clean up
    futex_pi_state objects. exit_pi_state_list() takes the hb lock
    first, and most callers of free_pi_state() do too. requeue_pi
    doesn't, which means free_pi_state() can free the pi_state out from
    under exit_pi_state_list(). For example:

    task A                           | task B
    ---------------------------------+------------------------------------
    exit_pi_state_list()             |
      pi_state =                     |
        curr->pi_state_list->next    |
                                     | futex_requeue(requeue_pi=1)
                                     |   // pi_state is the same as
                                     |   // the one in task A
                                     |   free_pi_state(pi_state)
                                     |     list_del_init(&pi_state->list)
                                     |     kfree(pi_state)
      list_del_init(&pi_state->list) |

    Move the free_pi_state calls in requeue_pi to before it drops the hb
    locks which it's already holding.

    [ tglx: Removed a pointless free_pi_state() call and the hb->lock held
    debugging. The latter comes via a separate patch ]

    Signed-off-by: Brian Silverman
    Cc: austin.linux@gmail.com
    Cc: darren@dvhart.com
    Cc: peterz@infradead.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1414282837-23092-1-git-send-email-bsilver16384@gmail.com
    Signed-off-by: Thomas Gleixner

    Brian Silverman
     
  • Update our documentation as of fix 76835b0ebf8 (futex: Ensure
    get_futex_key_refs() always implies a barrier). Explicitly
    state that we don't do key referencing for private futexes.

    Signed-off-by: Davidlohr Bueso
    Cc: Matteo Franchin
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Acked-by: Catalin Marinas
    Link: http://lkml.kernel.org/r/1414121220.817.0.camel@linux-t7sj.site
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

19 Oct, 2014

1 commit

  • Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
    nothing to wake up) changes the futex code to avoid taking a lock when
    there are no waiters. This code has been subsequently fixed in commit
    11d4616bd07f (futex: revert back to the explicit waiter counting code).
    Both the original commit and the fix-up rely on get_futex_key_refs() to
    always imply a barrier.

    However, for private futexes, none of the cases in the switch statement
    of get_futex_key_refs() would be hit and the function completes without
    a memory barrier as required before checking the "waiters" in
    futex_wake() -> hb_waiters_pending(). The consequence is a race with a
    thread waiting on a futex on another CPU, allowing the waker thread to
    read "waiters == 0" while the waiter thread to have read "futex_val ==
    locked" (in kernel).

    Without this fix, the problem (user space deadlocks) can be seen with
    Android bionic's mutex implementation on an arm64 multi-cluster system.

    Signed-off-by: Catalin Marinas
    Reported-by: Matteo Franchin
    Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
    Acked-by: Davidlohr Bueso
    Tested-by: Mike Galbraith
    Cc:
    Cc: Darren Hart
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

13 Sep, 2014

1 commit

  • futex_wait_requeue_pi() calls futex_wait_setup(). If
    futex_wait_setup() succeeds it returns with hb->lock held and
    preemption disabled. Now the sanity check after this does:

    if (match_futex(&q.key, &key2)) {
            ret = -EINVAL;
            goto out_put_keys;
    }

    which releases the keys but does not release hb->lock.

    So we happily return to user space with hb->lock held and therefore
    preemption disabled.

    Unlock hb->lock before taking the exit route.

    Reported-by: Dave "Trinity" Jones
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart
    Reviewed-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

22 Jun, 2014

6 commits

  • futex_lock_pi_atomic() is a maze of retry hoops and loops.

    Reduce it to simple and understandable states:

    First step is to lookup existing waiters (state) in the kernel.

    If there is an existing waiter, validate it and attach to it.

    If there is no existing waiter, check the user space value

    If the TID encoded in the user space value is 0, take over the futex
    preserving the owner died bit.

    If the TID encoded in the user space value is != 0, lookup the owner
    task, validate it and attach to it.

    Reduces text size by 128 bytes on x86_64.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: wad@chromium.org
    Cc: Darren Hart
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1406131137020.5170@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • We want to be a bit more clever in futex_lock_pi_atomic() and separate
    the possible states. Split out the code which attaches the first
    waiter to the owner into a separate function. No functional change.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: wad@chromium.org
    Link: http://lkml.kernel.org/r/20140611204237.271300614@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • We want to be a bit more clever in futex_lock_pi_atomic() and separate
    the possible states. Split out the waiter verification into a separate
    function. No functional change.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: wad@chromium.org
    Link: http://lkml.kernel.org/r/20140611204237.180458410@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • No point in open coding the same function again.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: wad@chromium.org
    Link: http://lkml.kernel.org/r/20140611204237.092947239@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The kernel tries to atomically unlock the futex without checking
    whether there is kernel state associated to the futex.

    So if user space manipulated the user space value, this will leave
    kernel internal state around associated to the owner task.

    For robustness sake, lookup first whether there are waiters on the
    futex. If there are waiters, wake the top priority waiter with all the
    proper sanity checks applied.

    If there are no waiters, do the atomic release. We do not have to
    preserve the waiters bit in this case, because a potentially incoming
    waiter is blocked on the hb->lock and will acquire the futex
    atomically. Nor do we have to preserve the owner died bit. The caller
    is the owner and it was supposed to clean up the mess.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Kees Cook
    Cc: wad@chromium.org
    Link: http://lkml.kernel.org/r/20140611204237.016987332@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The deadlock logic is only required for futexes.

    Remove the extra arguments for the public functions and also for the
    futex specific ones which get always called with deadlock detection
    enabled.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt

    Thomas Gleixner
     

09 Jun, 2014

1 commit

  • Now that 3.15 is released, this merges the 'next' branch into 'master',
    bringing us to the normal situation where my 'master' branch is the
    merge window.

    * accumulated work in next: (6809 commits)
    ufs: sb mutex merge + mutex_destroy
    powerpc: update comments for generic idle conversion
    cris: update comments for generic idle conversion
    idle: remove cpu_idle() forward declarations
    nbd: zero from and len fields in NBD_CMD_DISCONNECT.
    mm: convert some level-less printks to pr_*
    MAINTAINERS: adi-buildroot-devel is moderated
    MAINTAINERS: add linux-api for review of API/ABI changes
    mm/kmemleak-test.c: use pr_fmt for logging
    fs/dlm/debug_fs.c: replace seq_printf by seq_puts
    fs/dlm/lockspace.c: convert simple_str to kstr
    fs/dlm/config.c: convert simple_str to kstr
    mm: mark remap_file_pages() syscall as deprecated
    mm: memcontrol: remove unnecessary memcg argument from soft limit functions
    mm: memcontrol: clean up memcg zoneinfo lookup
    mm/memblock.c: call kmemleak directly from memblock_(alloc|free)
    mm/mempool.c: update the kmemleak stack trace for mempool allocations
    lib/radix-tree.c: update the kmemleak stack trace for radix tree allocations
    mm: introduce kmemleak_update_trace()
    mm/kmemleak.c: use %u to print ->checksum
    ...

    Linus Torvalds
     

06 Jun, 2014

4 commits

  • The current implementation of lookup_pi_state has ambiguous handling of
    the TID value 0 in the user space futex. We can get into the kernel
    even if the TID value is 0, because either there is a stale waiters bit
    or the owner died bit is set or we are called from the requeue_pi path
    or from user space just for fun.

    The current code avoids an explicit sanity check for pid = 0 in case
    that kernel internal state (waiters) are found for the user space
    address. This can lead to state leakage and worse under some
    circumstances.

    Handle the cases explicitly:

    Waiter | pi_state | pi->owner | uTID      | uODIED | ?

    [1]  NULL  | ---   | ---       | 0         | 0/1    | Valid
    [2]  NULL  | ---   | ---       | >0        | 0/1    | Valid

    [3]  Found | NULL  | --        | Any       | 0/1    | Invalid

    [4]  Found | Found | NULL      | 0         | 1      | Valid
    [5]  Found | Found | NULL      | >0        | 1      | Invalid

    [6]  Found | Found | task      | 0         | 1      | Valid

    [7]  Found | Found | NULL      | Any       | 0      | Invalid

    [8]  Found | Found | task      | ==taskTID | 0/1    | Valid
    [9]  Found | Found | task      | 0         | 0      | Invalid
    [10] Found | Found | task      | !=taskTID | 0/1    | Invalid

    [1] Indicates that the kernel can acquire the futex atomically. We
    came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.

    [2] Valid, if TID does not belong to a kernel thread. If no matching
    thread is found then it indicates that the owner TID has died.

    [3] Invalid. The waiter is queued on a non PI futex

    [4] Valid state after exit_robust_list(), which sets the user space
    value to FUTEX_WAITERS | FUTEX_OWNER_DIED.

    [5] The user space value got manipulated between exit_robust_list()
    and exit_pi_state_list()

    [6] Valid state after exit_pi_state_list() which sets the new owner in
    the pi_state but cannot access the user space value.

    [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.

    [8] Owner and user space value match

    [9] There is no transient state which sets the user space TID to 0
    except exit_robust_list(), but this is indicated by the
    FUTEX_OWNER_DIED bit. See [4]

    [10] There is no transient state which leaves owner and user space
    TID out of sync.

    Signed-off-by: Thomas Gleixner
    Cc: Kees Cook
    Cc: Will Drewry
    Cc: Darren Hart
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • If the owner died bit is set at futex_unlock_pi, we currently do not
    clean up the user space futex. So the owner TID of the current owner
    (the unlocker) persists. That's observable inconsistent state,
    especially when the ownership of the pi state got transferred.

    Clean it up unconditionally.

    Signed-off-by: Thomas Gleixner
    Cc: Kees Cook
    Cc: Will Drewry
    Cc: Darren Hart
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • We need to protect the atomic acquisition in the kernel against rogue
    user space which sets the user space futex to 0, so the kernel side
    acquisition succeeds while there is existing state in the kernel
    associated to the real owner.

    Verify whether the futex has waiters associated with kernel state. If
    it has, return -EINVAL. The state is corrupted already, so no point in
    cleaning it up. Subsequent calls will fail as well. Not our problem.

    [ tglx: Use futex_top_waiter() and explain why we do not need to try
    restoring the already corrupted user space state. ]

    Signed-off-by: Darren Hart
    Cc: Kees Cook
    Cc: Will Drewry
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • …tex_requeue(..., requeue_pi=1)

    If uaddr == uaddr2, then we have broken the rule of only requeueing from
    a non-pi futex to a pi futex with this call. If we attempt this, then
    dangling pointers may be left for rt_waiter resulting in an exploitable
    condition.

    This change brings futex_requeue() in line with futex_wait_requeue_pi()
    which performs the same check as per commit 6f7b0a2a5c0f ("futex: Forbid
    uaddr == uaddr2 in futex_wait_requeue_pi()")

    [ tglx: Compare the resulting keys as well, as uaddrs might be
    different depending on the mapping ]

    Fixes CVE-2014-3153.

    Reported-by: Pinkie Pie
    Signed-off-by: Will Drewry <wad@chromium.org>
    Signed-off-by: Kees Cook <keescook@chromium.org>
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Reviewed-by: Darren Hart <dvhart@linux.intel.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Thomas Gleixner
     

04 Jun, 2014

1 commit

  • …el/git/tip/tip into next

    Pull core locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - reduced/streamlined smp_mb__*() interface that allows more usecases
    and makes the existing ones less buggy, especially in rarer
    architectures

    - add rwsem implementation comments

    - bump up lockdep limits"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    rwsem: Add comments to explain the meaning of the rwsem's count field
    lockdep: Increase static allocations
    arch: Mass conversion of smp_mb__*()
    arch,doc: Convert smp_mb__*()
    arch,xtensa: Convert smp_mb__*()
    arch,x86: Convert smp_mb__*()
    arch,tile: Convert smp_mb__*()
    arch,sparc: Convert smp_mb__*()
    arch,sh: Convert smp_mb__*()
    arch,score: Convert smp_mb__*()
    arch,s390: Convert smp_mb__*()
    arch,powerpc: Convert smp_mb__*()
    arch,parisc: Convert smp_mb__*()
    arch,openrisc: Convert smp_mb__*()
    arch,mn10300: Convert smp_mb__*()
    arch,mips: Convert smp_mb__*()
    arch,metag: Convert smp_mb__*()
    arch,m68k: Convert smp_mb__*()
    arch,m32r: Convert smp_mb__*()
    arch,ia64: Convert smp_mb__*()
    ...

    Linus Torvalds
     

19 May, 2014

2 commits

  • We happily allow userspace to declare a random kernel thread to be the
    owner of a user space PI futex.

    Found while analysing the fallout of Dave Jones' syscall fuzzer.

    We also should validate the thread group for private futexes and find
    some fast way to validate whether the "alleged" owner has RW access on
    the file which backs the SHM, but that's a separate issue.

    Signed-off-by: Thomas Gleixner
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Clark Williams
    Cc: Paul McKenney
    Cc: Lai Jiangshan
    Cc: Roland McGrath
    Cc: Carlos ODonell
    Cc: Jakub Jelinek
    Cc: Michael Kerrisk
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     
  • Dave Jones' trinity syscall fuzzer exposed an issue in the deadlock
    detection code of rtmutex:
    http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com

    That underlying issue has been fixed with a patch to the rtmutex code,
    but the futex code must not call into rtmutex in that case because
    - it can detect that issue early
    - it avoids a different and more complex fixup for backing out

    If the user space variable got manipulated to 0x80000000, which means
    no lock holder but the waiters bit set, and an active pi_state is
    found in the kernel, we can figure out the recursive locking issue
    by looking at the pi_state owner. If that is the current task, then
    we can safely return -EDEADLK.

    The check should have been added in commit 59fa62451 (futex: Handle
    futex_pi OWNER_DIED take over correctly) already, but I did not see
    the above issue caused by user space manipulation back then.

    Signed-off-by: Thomas Gleixner
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Clark Williams
    Cc: Paul McKenney
    Cc: Lai Jiangshan
    Cc: Roland McGrath
    Cc: Carlos ODonell
    Cc: Jakub Jelinek
    Cc: Michael Kerrisk
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     

18 Apr, 2014

1 commit

  • Mostly scripted conversion of the smp_mb__* barriers.

    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/n/tip-55dhyhocezdw1dg7u19hmh1u@git.kernel.org
    Cc: Linus Torvalds
    Cc: linux-arch@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Apr, 2014

1 commit

  • Commits 11d4616bd07f ("futex: revert back to the explicit waiter
    counting code") and 69cd9eba3886 ("futex: avoid race between requeue and
    wake") changed some of the finer details of how we think about futexes.
    One was a late fix and the other a consequence of overlooking the whole
    requeuing logic.

    The first change caused our documentation to be incorrect, and the
    second made us aware that we need to explicitly add more details to it.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

09 Apr, 2014

1 commit

  • Jan Stancek reported:
    "pthread_cond_broadcast/4-1.c testcase from openposix testsuite (LTP)
    occasionally fails, because some threads fail to wake up.

    Testcase creates 5 threads, which are all waiting on same condition.
    Main thread then calls pthread_cond_broadcast() without holding mutex,
    which calls:

    futex(uaddr1, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, uaddr2, ..)

    This immediately wakes up single thread A, which unlocks mutex and
    tries to wake up another thread:

    futex(uaddr2, FUTEX_WAKE_PRIVATE, 1)

    If thread A manages to call futex_wake() before any waiters are
    requeued for uaddr2, no other thread is woken up"

    The ordering constraints for the hash bucket waiter counting are that
    the waiter counts have to be incremented _before_ getting the spinlock
    (because the spinlock acts as part of the memory barrier), but the
    "requeue" operation didn't honor those rules, and nobody had even
    thought about that case.

    This fairly simple patch just increments the waiter count for the target
    hash bucket (hb2) when requeuing a futex before taking the locks. It
    then decrements them again after releasing the lock - the code that
    actually moves the futex(es) between hash buckets will do the additional
    required waiter count housekeeping.

    Reported-and-tested-by: Jan Stancek
    Acked-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org # 3.14
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

01 Apr, 2014

1 commit

  • Pull core locking updates from Ingo Molnar:
    "The biggest change is the MCS spinlock generalization changes from Tim
    Chen, Peter Zijlstra, Jason Low et al. There's also lockdep
    fixes/enhancements from Oleg Nesterov, in particular a false negative
    fix related to lockdep_set_novalidate_class() usage"

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (22 commits)
    locking/mutex: Fix debug checks
    locking/mutexes: Add extra reschedule point
    locking/mutexes: Introduce cancelable MCS lock for adaptive spinning
    locking/mutexes: Unlock the mutex without the wait_lock
    locking/mutexes: Modify the way optimistic spinners are queued
    locking/mutexes: Return false if task need_resched() in mutex_can_spin_on_owner()
    locking: Move mcs_spinlock.h into kernel/locking/
    m68k: Skip futex_atomic_cmpxchg_inatomic() test
    futex: Allow architectures to skip futex_atomic_cmpxchg_inatomic() test
    Revert "sched/wait: Suppress Sparse 'variable shadowing' warning"
    lockdep: Change lockdep_set_novalidate_class() to use _and_name
    lockdep: Change mark_held_locks() to check hlock->check instead of lockdep_no_validate
    lockdep: Don't create the wrong dependency on hlock->check == 0
    lockdep: Make held_lock->check and "int check" argument bool
    locking/mcs: Allow architecture specific asm files to be used for contended case
    locking/mcs: Order the header files in Kbuild of each architecture in alphabetical order
    sched/wait: Suppress Sparse 'variable shadowing' warning
    hung_task/Documentation: Fix hung_task_warnings description
    locking/mcs: Allow architectures to hook in to contended paths
    locking/mcs: Micro-optimize the MCS code, add extra comments
    ...

    Linus Torvalds
     

21 Mar, 2014

1 commit

  • Srikar Dronamraju reports that commit b0c29f79ecea ("futexes: Avoid
    taking the hb->lock if there's nothing to wake up") causes java threads
    getting stuck on futexes when running specjbb on a power7 numa box.

    The cause appears to be that the powerpc spinlocks aren't using the same
    ticket lock model that we use on x86 (and other) architectures, which in
    turn result in the "spin_is_locked()" test in hb_waiters_pending()
    occasionally reporting an unlocked spinlock even when there are pending
    waiters.

    So this reinstates Davidlohr Bueso's original explicit waiter counting
    code, which I had convinced Davidlohr to drop in favor of figuring out
    the pending waiters by just using the existing state of the spinlock and
    the wait queue.

    Reported-and-tested-by: Srikar Dronamraju
    Original-code-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

03 Mar, 2014

1 commit

  • If an architecture has futex_atomic_cmpxchg_inatomic() implemented and
    there is no runtime check necessary, allow it to skip the test within
    futex_init().

    This gets rid of some code which would always give the same result,
    and also lets the compiler optimize a couple of if statements away.

    Signed-off-by: Heiko Carstens
    Cc: Finn Thain
    Cc: Geert Uytterhoeven
    Link: http://lkml.kernel.org/r/20140302120947.GA3641@osiris
    Signed-off-by: Thomas Gleixner

    Heiko Carstens
     

21 Jan, 2014

1 commit

  • Pull scheduler changes from Ingo Molnar:

    - Add the initial implementation of SCHED_DEADLINE support: a real-time
    scheduling policy where tasks that meet their deadlines and
    periodically execute their instances in less than their runtime quota
    see real-time scheduling and won't miss any of their deadlines.
    Tasks that go over their quota get delayed (Available to privileged
    users for now)

    - Clean up and fix preempt_enable_no_resched() abuse all around the
    tree

    - Do sched_clock() performance optimizations on x86 and elsewhere

    - Fix and improve auto-NUMA balancing

    - Fix and clean up the idle loop

    - Apply various cleanups and fixes

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    sched: Fix __sched_setscheduler() nice test
    sched: Move SCHED_RESET_ON_FORK into attr::sched_flags
    sched: Fix up attr::sched_priority warning
    sched: Fix up scheduler syscall LTP fails
    sched: Preserve the nice level over sched_setscheduler() and sched_setparam() calls
    sched/core: Fix htmldocs warnings
    sched/deadline: No need to check p if dl_se is valid
    sched/deadline: Remove unused variables
    sched/deadline: Fix sparse static warnings
    m68k: Fix build warning in mac_via.h
    sched, thermal: Clean up preempt_enable_no_resched() abuse
    sched, net: Fixup busy_loop_us_clock()
    sched, net: Clean up preempt_enable_no_resched() abuse
    sched/preempt: Fix up missed PREEMPT_NEED_RESCHED folding
    sched/preempt, locking: Rework local_bh_{dis,en}able()
    sched/clock, x86: Avoid a runtime condition in native_sched_clock()
    sched/clock: Fix up clear_sched_clock_stable()
    sched/clock, x86: Use a static_key for sched_clock_stable
    sched/clock: Remove local_irq_disable() from the clocks
    sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
    ...

    Linus Torvalds
     

16 Jan, 2014

1 commit

  • "futexes: Increase hash table size for better performance"
    introduces a new alloc_large_system_hash() call.

    alloc_large_system_hash() however may allocate less memory than
    requested, e.g. limited by MAX_ORDER.

    Hence pass a pointer to alloc_large_system_hash() which will
    contain the hash shift when the function returns. Afterwards
    correctly set futex_hashsize.

    Fixes a crash on s390 where the requested allocation size was
    4MB but only 1MB was allocated.

    Signed-off-by: Heiko Carstens
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20140116135450.GA4345@osiris
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     

13 Jan, 2014

5 commits

  • Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
    and provide a proper comparison function for -deadline and
    -priority tasks.

    This is done mainly because:
    - classical prio field of the plist is just an int, which might
    not be enough for representing a deadline;
    - manipulating such a list would become O(nr_deadline_tasks),
    which might be too much, as the number of -deadline tasks increases.

    Therefore, an rb-tree is used, and tasks are queued in it according
    to the following logic:
    - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
    one with the higher (lower, actually!) prio wins;
    - among a -priority and a -deadline task, the latter always wins;
    - among two -deadline tasks, the one with the earliest deadline
    wins.

    Queueing and dequeueing functions are changed accordingly, for both
    the list of a task's pi-waiters and the list of tasks blocked on
    a pi-lock.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-again-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In futex_wake() there is clearly no point in taking the hb->lock
    if we know beforehand that there are no tasks to be woken. While
    the hash bucket's plist head is a cheap way of knowing this, we
    cannot rely 100% on it as there is a racy window between the
    futex_wait call and when the task is actually added to the
    plist. To this end, we couple it with the spinlock check as
    tasks trying to enter the critical region are most likely
    potential waiters that will be added to the plist, thus
    preventing tasks sleeping forever if wakers don't acknowledge
    all possible waiters.

    Furthermore, the futex ordering guarantees are preserved,
    ensuring that waiters either observe the changed user space
    value before blocking or are woken by a concurrent waker. For
    wakers, this is done by relying on the barriers in
    get_futex_key_refs() -- for archs that do not have implicit mb
    in atomic_inc(), we explicitly add them through a new
    futex_get_mm function. For waiters we rely on the fact that
    spin_lock calls already update the head counter, so spinners
    are visible even if the lock hasn't been acquired yet.

    For more details please refer to the updated comments in the
    code and related discussion:

    https://lkml.org/lkml/2013/11/26/556

    Special thanks to tglx for careful review and feedback.

    Suggested-by: Linus Torvalds
    Reviewed-by: Darren Hart
    Reviewed-by: Thomas Gleixner
    Reviewed-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Cc: Paul E. McKenney
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-5-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • That's essential, if you want to hack on futexes.

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Randy Dunlap
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Jason Low
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-4-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Currently, the futex global hash table suffers from its fixed,
    smallish (for today's standards) size of 256 entries, as well as
    its lack of NUMA awareness. Large systems, using many futexes,
    can be prone to high amounts of collisions; where these futexes
    hash to the same bucket and lead to extra contention on the same
    hb->lock. Furthermore, cacheline bouncing is a reality when we
    have multiple hb->locks residing on the same cacheline and
    different futexes hash to adjacent buckets.

    This patch keeps the current static size of 16 entries for small
    systems, or otherwise, 256 * ncpus (or larger as we need to
    round the number to a power of 2). Note that this number of CPUs
    accounts for all CPUs that can ever be available in the system,
    taking into consideration things like hotplugging. While we do
    impose extra overhead at bootup by making the hash table larger,
    this is a one time thing, and does not overshadow the benefits of
    this patch.

    Furthermore, as suggested by tglx, by cache aligning the hash
    buckets we can avoid access across cacheline boundaries and also
    avoid massive cache line bouncing if multiple cpus are hammering
    away at different hash buckets which happen to reside in the
    same cache line.

    Also, similar to other core kernel components (pid, dcache,
    tcp), by using alloc_large_system_hash() we benefit from its
    NUMA awareness and thus the table is distributed among the nodes
    instead of in a single one.

    For a custom microbenchmark that pounds on the uaddr hashing --
    making the wait path fail at futex_wait_setup() returning
    -EWOULDBLOCK for large amounts of futexes, we can see the
    following benefits on an 80-core, 8-socket 1TB server:

    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    | threads | baseline (ops/sec) | aligned-only (ops/sec) | large table (ops/sec) | large table+aligned (ops/sec) |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+
    |     512 |              32426 | 50531  (+55.8%)        | 255274  (+687.2%)     | 292553  (+802.2%)             |
    |     256 |              65360 | 99588  (+52.3%)        | 443563  (+578.6%)     | 508088  (+677.3%)             |
    |     128 |             125635 | 200075 (+59.2%)        | 742613  (+491.1%)     | 835452  (+564.9%)             |
    |      80 |             193559 | 323425 (+67.1%)        | 1028147 (+431.1%)     | 1130304 (+483.9%)             |
    |      64 |             247667 | 443740 (+79.1%)        | 997300  (+302.6%)     | 1145494 (+362.5%)             |
    |      32 |             628412 | 721401 (+14.7%)        | 965996  (+53.7%)      | 1122115 (+78.5%)              |
    +---------+--------------------+------------------------+-----------------------+-------------------------------+

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Waiman Long
    Reviewed-and-tested-by: Jason Low
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Link: http://lkml.kernel.org/r/1389569486-25487-3-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • - Remove unnecessary head variables.
    - Delete unused parameter in queue_unlock().

    Reviewed-by: Darren Hart
    Reviewed-by: Peter Zijlstra
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Jason Low
    Signed-off-by: Davidlohr Bueso
    Cc: Mike Galbraith
    Cc: Jeff Mahoney
    Cc: Linus Torvalds
    Cc: Scott Norton
    Cc: Tom Vaden
    Cc: Aswin Chandramouleeswaran
    Cc: Waiman Long
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1389569486-25487-2-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     

13 Dec, 2013

2 commits

  • When debugging the read-only hugepage case, I was confused by the fact
    that get_futex_key() did an access_ok() only for the non-shared futex
    case, since the user address checking really isn't in any way specific
    to the private key handling.

    Now, it turns out that the shared key handling does effectively do the
    equivalent checks inside get_user_pages_fast() (it doesn't actually
    check the address range on x86, but does check the page protections for
    being a user page). So it wasn't actually a bug, but the fact that we
    treat the address differently for private and shared futexes threw me
    for a loop.

    Just move the check up, so that it gets done for both cases. Also, use
    the 'rw' parameter for the type, even if it doesn't actually matter any
    more (it's a historical artifact of the old racy i386 "page faults from
    kernel space don't check write protections").
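    The reordering can be sketched as follows; get_futex_key_sketch,
    user_addr_ok and TASK_SIZE_SKETCH are hypothetical names standing in
    for get_futex_key(), access_ok() and the real user address-space
    limit, not the kernel's actual code.

    ```c
    #include <errno.h>
    #include <stdint.h>

    #define TASK_SIZE_SKETCH (1UL << 47)  /* assumed user address-space limit */

    /* Stand-in for access_ok(): is [uaddr, uaddr+4) inside user space? */
    static int user_addr_ok(uintptr_t uaddr)
    {
        return uaddr + sizeof(uint32_t) <= TASK_SIZE_SKETCH;
    }

    /* The change described above: validate the address once, before the
     * private/shared split, so both key types get the same checks. */
    static int get_futex_key_sketch(uintptr_t uaddr, int fshared)
    {
        if (uaddr % sizeof(uint32_t))
            return -EINVAL;        /* futex words must be 4-byte aligned */
        if (!user_addr_ok(uaddr))
            return -EFAULT;        /* now done for shared futexes too */
        (void)fshared;             /* key derivation would follow here */
        return 0;
    }
    ```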

    Cc: Thomas Gleixner
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • The hugepage code had the exact same bug that regular pages had in
    commit 7485d0d3758e ("futexes: Remove rw parameter from
    get_futex_key()").

    The regular page case was fixed by commit 9ea71503a8ed ("futex: Fix
    regression with read only mappings"), but the transparent hugepage case
    (added in a5b338f2b0b1: "thp: update futex compound knowledge")
    remained broken.

    Found by Dave Jones and his trinity tool.

    Reported-and-tested-by: Dave Jones
    Cc: stable@kernel.org # v2.6.38+
    Acked-by: Thomas Gleixner
    Cc: Mel Gorman
    Cc: Darren Hart
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

06 Nov, 2013

1 commit


26 Jun, 2013

2 commits

  • Avoid waking up every thread sleeping in a futex_wait call during
    suspend and resume by calling a freezable blocking call. Previous
    patches modified the freezer to avoid sending wakeups to threads
    that are blocked in freezable blocking calls.

    This call was selected to be converted to a freezable call because
    it doesn't hold any locks or release any resources when interrupted
    that might be needed by another freezing task or a kernel driver
    during suspend, and is a common site where idle userspace tasks are
    blocked.

    Signed-off-by: Colin Cross
    Cc: Rafael J. Wysocki
    Cc: arve@android.com
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Darren Hart
    Cc: Randy Dunlap
    Cc: Al Viro
    Link: http://lkml.kernel.org/r/1367458508-9133-8-git-send-email-ccross@android.com
    Signed-off-by: Thomas Gleixner

    Colin Cross
     
    The futex_keys of process shared futexes are generated from the page
    offset, the mapping host and the mapping index of the futex user space
    address. This should result in a unique identifier for each futex.

    However, this is not true when futexes are located in different
    subpages of a hugepage. The reason is that the mapping index for all
    those futexes evaluates to the index of the base page of the
    hugetlbfs mapping. So a futex at offset 0 of the hugepage mapping and
    another one at offset PAGE_SIZE of the same hugepage mapping have
    identical futex_keys. This happens because the futex code blindly
    uses page->index.

    Steps to reproduce the bug:

    1. Map a file from hugetlbfs. Initialize pthread_mutex1 at offset 0
    and pthread_mutex2 at offset PAGE_SIZE of the hugetlbfs
    mapping.

    The mutexes must be initialized as PTHREAD_PROCESS_SHARED because
    PTHREAD_PROCESS_PRIVATE mutexes are not affected by this issue as
    their keys solely depend on the user space address.

    2. Lock mutex1 and mutex2

    3. Create thread1 and in the thread function lock mutex1, which
    results in thread1 blocking on the locked mutex1.

    4. Create thread2 and in the thread function lock mutex2, which
    results in thread2 blocking on the locked mutex2.

    5. Unlock mutex2. Despite the fact that mutex2 got unlocked, thread2
    still blocks on mutex2 because the futex_key points to mutex1.

    To solve this issue we need to take the normal page index of the page
    which contains the futex into account, if the futex is in a hugetlbfs
    mapping. In other words, we calculate the normal page mapping index of
    the subpage within the hugetlbfs mapping.

    Mappings which are not based on hugetlbfs are not affected and still
    use page->index.
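    As a rough illustration of that calculation (subpage_index is a
    hypothetical helper with assumed 4K base pages and 2MB hugepages;
    the actual patch adds a basepage_index() helper to the hugetlbfs
    code):

    ```c
    #include <assert.h>

    #define PAGE_SIZE_SKETCH  4096UL       /* assumed base page size */
    #define HPAGE_SIZE_SKETCH (2UL << 20)  /* assumed 2MB hugepage   */

    /* Convert a hugepage-granular mapping index plus the futex address
     * into a base-page-granular index, so two futexes in different
     * subpages of the same hugepage no longer collide. */
    static unsigned long subpage_index(unsigned long huge_index,
                                       unsigned long addr)
    {
        unsigned long pages_per_huge = HPAGE_SIZE_SKETCH / PAGE_SIZE_SKETCH;
        unsigned long offset_pages  = (addr % HPAGE_SIZE_SKETCH) / PAGE_SIZE_SKETCH;
        return huge_index * pages_per_huge + offset_pages;
    }
    ```

    In the reproducer above, mutex1 at offset 0 and mutex2 at offset
    PAGE_SIZE of the same hugepage now map to distinct indices instead
    of both resolving to the hugepage's base index.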

    Thanks to Mel Gorman who provided a patch for adding proper evaluation
    functions to the hugetlbfs code to avoid exposing hugetlbfs specific
    details to the futex code.

    [ tglx: Massaged changelog ]

    Signed-off-by: Zhang Yi
    Reviewed-by: Jiang Biao
    Tested-by: Ma Chenggong
    Reviewed-by: Mel Gorman
    Acked-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/000101ce71a6%24a83c5880%24f8b50980%24@com
    Signed-off-by: Thomas Gleixner

    Zhang Yi
     

13 Mar, 2013

1 commit

  • Fix kernel-doc warning in futex.c and convert 'Returns' to the new Return:
    kernel-doc notation format.

    Warning(kernel/futex.c:2286): Excess function parameter 'clockrt' description in 'futex_wait_requeue_pi'

    Fix one spello.
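    For reference, the new notation looks like this; the function and
    comment below are a made-up example of the kernel-doc format, not
    code from futex.c.

    ```c
    /**
     * futex_op_sketch - illustrative kernel-doc comment
     * @val: value to operate on
     *
     * Demonstrates the new "Return:" section, which replaces the older
     * free-form "Returns" wording in kernel-doc comments.
     *
     * Return: the input value doubled.
     */
    static int futex_op_sketch(int val)
    {
        return 2 * val;
    }
    ```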

    Signed-off-by: Randy Dunlap
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

28 Feb, 2013

1 commit


23 Feb, 2013

1 commit

  • Pull core locking changes from Ingo Molnar:
    "The biggest change is the rwsem lock-steal improvements, both to the
    assembly optimized and the spinlock based variants.

    The other notable change is the clean up of the seqlock implementation
    to be based on the seqcount infrastructure.

    The rest is assorted smaller debuggability, cleanup and continued -rt
    locking changes."

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rwsem-spinlock: Implement writer lock-stealing for better scalability
    futex: Revert "futex: Mark get_robust_list as deprecated"
    generic: Use raw local irq variant for generic cmpxchg
    lockdep: Selftest: convert spinlock to raw spinlock
    seqlock: Use seqcount infrastructure
    seqlock: Remove unused functions
    ntp: Make ntp_lock raw
    intel_idle: Convert i7300_idle_lock to raw_spinlock
    locking: Various static lock initializer fixes
    lockdep: Print more info when MAX_LOCK_DEPTH is exceeded
    rwsem: Implement writer lock-stealing for better scalability
    lockdep: Silence warning if CONFIG_LOCKDEP isn't set
    watchdog: Use local_clock for get_timestamp()
    lockdep: Rename print_unlock_inbalance_bug() to print_unlock_imbalance_bug()
    locking/stat: Fix a typo

    Linus Torvalds
     

19 Feb, 2013

1 commit

  • This reverts commit ec0c4274e33c0373e476b73e01995c53128f1257.

    get_robust_list() is in use and a removal would break existing user
    space. With the permission checks in place it's no longer a security
    hole. Remove the deprecation warnings.

    Signed-off-by: Thomas Gleixner
    Cc: Cyrill Gorcunov
    Cc: Richard Weinberger
    Cc: akpm@linux-foundation.org
    Cc: paul.gortmaker@windriver.com
    Cc: davej@redhat.com
    Cc: keescook@chromium.org
    Cc: stable@vger.kernel.org
    Cc: ebiederm@xmission.com

    Thomas Gleixner