05 Sep, 2016

1 commit

  • Add some more comments and reformat existing ones to kernel doc style.
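
    As a reminder, kernel-doc style comments have this shape (a generic
    sketch for illustration, not the actual comments added by the patch):

        /**
         * queue_me() - Enqueue the futex_q on its hash bucket
         * @q:   The futex_q to enqueue
         * @hb:  The destination hash bucket
         *
         * Must be called with hb->lock held by the caller.
         */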

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Darren Hart
    Link: http://lkml.kernel.org/r/1464770609-30168-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

30 Jul, 2016

1 commit

  • To quote Rich Felker on why there is no need for shared mapping on !MMU systems:

    |With MMU, shared futex keys need to identify the physical backing for
    |a memory address because it may be mapped at different addresses in
    |different processes (or even multiple times in the same process).
    |Without MMU this cannot happen. You only have physical addresses. So
    |the "private futex" behavior of using the virtual address as the key
    |is always correct (for both shared and private cases) on nommu
    |systems.

    This patch disables FLAGS_SHARED in a way that allows the compiler to
    remove that code.
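
    A minimal sketch of how that is achieved (shape per the mainline change):

        #ifdef CONFIG_MMU
        # define FLAGS_SHARED  0x01
        #else
        /*
         * NOMMU has no per-process address space, so the compiler can
         * optimize the shared-futex paths away entirely.
         */
        # define FLAGS_SHARED  0x00
        #endif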

    [bigeasy: Added changelog ]
    Reported-by: Rich Felker
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20160729143230.GA21715@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

09 Jun, 2016

1 commit

  • Mike Galbraith reported that the LTP test case futex_wake04 was broken
    by commit 65d8fc777f6d ("futex: Remove requirement for lock_page()
    in get_futex_key()").

    This test case uses futexes backed by hugetlbfs pages and so there is an
    associated inode with a futex stored on such pages. The problem is that
    the key is being calculated based on the head page index of the hugetlbfs
    page and not the tail page.

    Prior to the optimisation, the page lock was used to stabilise mappings and
    to pin the inode if the page was file-backed, which is overkill. If the page
    was a compound page, the head page was automatically looked up as part of
    the page lock operation, but the tail page index was used to calculate the
    futex key.

    After the optimisation, the compound head is looked up early and the page
    lock is only relied upon to identify truncated pages, special pages or a
    shmem page moving to swapcache. The head page is looked up because without
    the page lock, special care has to be taken to pin the inode correctly.
    However, the tail page is still required to calculate the futex key so
    this patch records the tail page.
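
    A sketch of the resulting logic in get_futex_key() (simplified):

        struct page *page, *tail;

        ...
        tail = page;                              /* remember the tail page */
        page = compound_head(page);               /* head stabilises the mapping */
        ...
        key->shared.pgoff = basepage_index(tail); /* key from the tail index */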

    On vanilla 4.6, the output of the test case is:

    futex_wake04 0 TINFO : Hugepagesize 2097152
    futex_wake04 1 TFAIL : futex_wake04.c:126: Bug: wait_thread2 did not wake after 30 secs.

    With the patch applied:

    futex_wake04 0 TINFO : Hugepagesize 2097152
    futex_wake04 1 TPASS : Hi hydra, thread2 awake!

    Fixes: 65d8fc777f6d "futex: Remove requirement for lock_page() in get_futex_key()"
    Reported-and-tested-by: Mike Galbraith
    Signed-off-by: Mel Gorman
    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Davidlohr Bueso
    Cc: Sebastian Andrzej Siewior
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160608132522.GM2469@suse.de
    Signed-off-by: Thomas Gleixner

    Mel Gorman
     

23 May, 2016

1 commit

  • I'm looking at trying to possibly merge the 32-bit and 64-bit versions
    of the x86 uaccess.h implementation, but first this needs to be cleaned
    up.

    For example, the 32-bit version of "__copy_from_user_inatomic()" is
    mostly special cases for constant sizes, which are almost never
    relevant. Most users aren't actually using a constant size
    anyway, and the few cases that do small constant copies are better off
    just using __get_user() instead.

    So get rid of the unnecessary complexity.
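
    For a small constant-size copy, the __get_user() form referred to above
    looks like this (illustrative):

        u32 val;

        if (__get_user(val, (u32 __user *)uaddr))
                return -EFAULT;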

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Apr, 2016

1 commit

  • Otherwise an incoming waker on the destination hash bucket can miss
    the waiter adding itself to the plist during the lockless
    check optimization (a small window, but this is still the correct way
    of doing it); similarly to the decrement counterpart.
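
    A sketch of the fixed ordering in requeue_futex(), matching the shape of
    the mainline change:

        hb_waiters_inc(hb2);    /* announce the waiter on the destination
                                 * bucket before touching the plists */
        plist_del(&q->list, &hb1->chain);
        hb_waiters_dec(hb1);
        plist_add(&q->list, &hb2->chain);
        q->lock_ptr = &hb2->lock;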

    Suggested-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: bigeasy@linutronix.de
    Cc: dvhart@infradead.org
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1461208164-29150-1-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

20 Apr, 2016

1 commit

  • If userspace calls UNLOCK_PI unconditionally without trying the TID -> 0
    transition in user space first then the user space value might not have the
    waiters bit set. This opens the following race:

    CPU0                              CPU1
    uval = get_user(futex)
                                      lock(hb)
    lock(hb)
                                      futex |= FUTEX_WAITERS
                                      ....
                                      unlock(hb)

    cmpxchg(futex, uval, newval)

    So the cmpxchg fails and returns -EINVAL to user space, which is wrong because
    the futex value is valid.

    To handle this (yes, yet another) corner case gracefully, check for a flag
    change and retry.
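
    A sketch of the retry check (shape per the mainline fix):

        if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
                ret = -EFAULT;
        } else if (curval != uval) {
                /*
                 * If only the FUTEX_WAITERS bit was set between get_user()
                 * and locking the hash bucket, retry instead of failing.
                 */
                if ((FUTEX_TID_MASK & curval) == uval)
                        ret = -EAGAIN;
                else
                        ret = -EINVAL;
        }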

    [ tglx: Massaged changelog and slightly reworked implementation ]

    Fixes: ccf9e6a80d9e ("futex: Make unlock_pi more robust")
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: stable@vger.kernel.org
    Cc: Davidlohr Bueso
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1460723739-5195-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

09 Mar, 2016

1 commit

  • Commit e91467ecd1ef ("bug in futex unqueue_me") introduced a barrier() in
    unqueue_me() to prevent the compiler from rereading the lock pointer which
    might change after a check for NULL.

    Replace the barrier() with a READ_ONCE() for the following reasons:

    1) READ_ONCE() is a weaker form of barrier() that affects only the specific
    load operation, while barrier() is a general compiler-level memory barrier.
    READ_ONCE() was not available at the time the barrier was added.

    2) Aside from that, READ_ONCE() is descriptive and self-explanatory, while
    a barrier without a comment is not clear to the casual reader.
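
    The change in unqueue_me() thus boils down to (sketch):

        /* before */
        lock_ptr = q->lock_ptr;
        barrier();

        /* after */
        lock_ptr = READ_ONCE(q->lock_ptr);
        if (lock_ptr != NULL) {
                spin_lock(lock_ptr);
                ...
        }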

    No functional change.

    [ tglx: Massaged changelog ]

    Signed-off-by: Jianyu Zhan
    Acked-by: Christian Borntraeger
    Acked-by: Darren Hart
    Cc: dave@stgolabs.net
    Cc: peterz@infradead.org
    Cc: linux@rasmusvillemoes.dk
    Cc: akpm@linux-foundation.org
    Cc: fengguang.wu@intel.com
    Cc: bigeasy@linutronix.de
    Link: http://lkml.kernel.org/r/1457314344-5685-1-git-send-email-nasa4836@gmail.com
    Signed-off-by: Thomas Gleixner

    Jianyu Zhan
     

17 Feb, 2016

2 commits

  • When dealing with key handling for shared futexes, we can drastically reduce
    the usage/need of the page lock. 1) For anonymous pages, the associated futex
    object is the mm_struct, which does not require the page lock. 2) For
    inode-based keys, we can check under the RCU read lock whether the page
    mapping is still valid and take a reference to the inode. This just leaves
    one rare race that requires the page lock in the slow path when examining
    the swapcache.

    Additionally realtime users currently have a problem with the page lock being
    contended for unbounded periods of time during futex operations.

    Task A
    get_futex_key()
    lock_page()
    ---> preempted

    Now any other task trying to lock that page will have to wait until
    task A gets scheduled back in, which is an unbound time.

    With this patch, we pretty much have a lockless get_futex_key().

    Experiments show that this patch can boost/speedup the hashing of shared
    futexes with the perf futex benchmarks (which are good for measuring such
    changes) by up to 45% when there are high (> 100) thread counts on a 60 core
    Westmere. Lower counts are pretty much in the noise range or less than 10%,
    but the mid range can be seen at over 30% overall throughput (hash ops/sec).
    This makes anon-mem shared futexes much closer to their private counterpart.
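
    A simplified sketch of the lockless inode path described above (error
    handling and the swapcache slow path omitted):

        rcu_read_lock();
        if (READ_ONCE(page->mapping) != mapping) {
                rcu_read_unlock();
                goto again;                     /* raced with truncation */
        }
        inode = READ_ONCE(mapping->host);
        if (!inode || !atomic_inc_not_zero(&inode->i_count)) {
                rcu_read_unlock();
                goto again;                     /* inode is being freed */
        }
        rcu_read_unlock();                      /* inode pinned, no lock_page() */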

    Signed-off-by: Mel Gorman
    [ Ported on top of thp refcount rework, changelog, comments, fixes. ]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Thomas Gleixner
    Cc: Chris Mason
    Cc: Darren Hart
    Cc: Hugh Dickins
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: dave@stgolabs.net
    Link: http://lkml.kernel.org/r/1455045314-8305-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • Ingo suggested we rename how we reference barriers A and B
    regarding futex ordering guarantees. This patch replaces,
    for both barriers, MB (A) with smp_mb(); (A), such that:

    - We explicitly state that the barriers are SMP, and

    - We standardize how we reference these across futex.c,
    helping readers follow which barrier does what and where (see the
    example below).
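
    In code, a site previously annotated with a bare 'MB (A)' comment now
    reads, for example:

        smp_mb(); /* (A) <-- paired with (B) */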

    Suggested-by: Ingo Molnar
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Thomas Gleixner
    Cc: Chris Mason
    Cc: Darren Hart
    Cc: Hugh Dickins
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: dave@stgolabs.net
    Link: http://lkml.kernel.org/r/1455045314-8305-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

26 Jan, 2016

1 commit

  • Sasha reported a lockdep splat about a potential deadlock between RCU boosting
    rtmutex and the posix timer it_lock.

    CPU0                                    CPU1

    rtmutex_lock(&rcu->rt_mutex)
      spin_lock(&rcu->rt_mutex.wait_lock)
                                            local_irq_disable()
                                            spin_lock(&timer->it_lock)
                                            spin_lock(&rcu->mutex.wait_lock)
    --> Interrupt
        spin_lock(&timer->it_lock)

    This is caused by the following code sequence on CPU1

    rcu_read_lock()
    x = lookup();
    if (x)
            spin_lock_irqsave(&x->it_lock);
    rcu_read_unlock();
    return x;

    We could fix that in the posix timer code by keeping rcu read locked across
    the spinlocked and irq disabled section, but the above sequence is common and
    there is no reason not to support it.

    Taking rt_mutex.wait_lock irq safe prevents the deadlock.
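
    Concretely, 'irq safe' means the wait_lock is now taken with interrupts
    disabled (a sketch, names per the mainline rtmutex code):

        unsigned long flags;

        raw_spin_lock_irqsave(&lock->wait_lock, flags);
        /* ... inspect/modify the waiter state ... */
        raw_spin_unlock_irqrestore(&lock->wait_lock, flags);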

    Reported-by: Sasha Levin
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney

    Thomas Gleixner
     

21 Jan, 2016

1 commit

  • By checking the effective credentials instead of the real UID / permitted
    capabilities, ensure that the calling process actually intended to use its
    credentials.

    To ensure that all ptrace checks use the correct caller credentials (e.g.
    in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
    flag), use two new flags and require one of them to be set.
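
    A sketch of a converted caller (in mainline the two new flags are
    PTRACE_MODE_REALCREDS and PTRACE_MODE_FSCREDS; procfs-style file system
    accesses use the latter, via the combined PTRACE_MODE_READ_FSCREDS):

        if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
                return -EACCES;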

    The problem was that when a privileged task had temporarily dropped its
    privileges, e.g. by calling setreuid(0, user_uid), with the intent to
    perform following syscalls with the credentials of a user, it still passed
    ptrace access checks that the user would not be able to pass.

    While an attacker should not be able to convince the privileged task to
    perform a ptrace() syscall, this is a problem because the ptrace access
    check is reused for things in procfs.

    In particular, the following somewhat interesting procfs entries only rely
    on ptrace access checks:

    /proc/$pid/stat - uses the check for determining whether pointers
        should be visible, useful for bypassing ASLR
    /proc/$pid/maps - also useful for bypassing ASLR
    /proc/$pid/cwd - useful for gaining access to restricted
        directories that contain files with lax permissions, e.g. in
        this scenario:

        lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
        drwx------ root root /root
        drwxr-xr-x root root /root/foobar
        -rw-r--r-- root root /root/foobar/secret

    Therefore, on a system where a root-owned mode 6755 binary changes its
    effective credentials as described and then dumps a user-specified file,
    this could be used by an attacker to reveal the memory layout of root's
    processes or reveal the contents of files he is not allowed to access
    (through /proc/$pid/cwd).

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Casey Schaufler
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Andy Shevchenko
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

16 Jan, 2016

2 commits

  • During Jason's work with postcopy migration support for s390 a problem
    regarding gmap faults was discovered.

    The gmap code will call fixup_user_fault(), which will always end up in
    handle_mm_fault(). Until now we never cared about retries, but since the
    userfaultfd code kind of relies on them, this needs a fix.

    This patchset does not take care of the futex code. I will now look
    closer at this.

    This patch (of 2):

    With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
    to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
    faulting we ever unlocked mmap_sem.

    This patch brings in the logic to handle retries as well as cleaning up
    the current documentation. fixup_user_fault() did not have the same
    semantics as filemap_fault(). It never indicated whether a retry happened,
    so a caller wasn't able to handle that case. We therefore changed the
    behaviour to always retry on a locked mmap_sem.
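
    With the change, a caller looks roughly like this (signature as introduced
    by this series; 'unlocked' reports whether mmap_sem was dropped):

        bool unlocked = false;
        int ret;

        down_read(&mm->mmap_sem);
        ret = fixup_user_fault(current, mm, addr, FAULT_FLAG_WRITE,
                               &unlocked);
        if (unlocked) {
                /* mmap_sem was dropped and re-taken: a retry happened */
        }
        up_read(&mm->mmap_sem);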

    Signed-off-by: Dominik Dingel
    Reviewed-by: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Christian Borntraeger
    Cc: "Jason J. Herne"
    Cc: David Rientjes
    Cc: Eric B Munson
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Heiko Carstens
    Cc: Dominik Dingel
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dominik Dingel
     
  • With the new THP refcounting, we don't need tricks to stabilize a huge
    page. If we've got a reference to a tail page, it can't split under us.

    This patch effectively reverts a5b338f2b0b1 ("thp: update futex compound
    knowledge").

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Tested-by: Artem Savkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

20 Dec, 2015

6 commits

  • While reviewing Michael Kerrisk's recent futex manpage update, I noticed
    that we allow the FUTEX_CLOCK_REALTIME flag for FUTEX_WAIT_BITSET but
    not for FUTEX_WAIT.

    FUTEX_WAIT is treated internally as a simple version of FUTEX_WAIT_BITSET
    (with a bitmask of FUTEX_BITSET_MATCH_ANY). As such, I cannot
    come up with a reason for this exclusion for FUTEX_WAIT.

    This change does modify the behavior of the futex syscall, changing a
    call with FUTEX_WAIT | FUTEX_CLOCK_REALTIME from returning -ENOSYS, to be
    equivalent to FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME with a bitset of
    FUTEX_BITSET_MATCH_ANY.
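
    In user space terms (illustrative, error handling omitted), this call used
    to fail with -ENOSYS and is now accepted:

        syscall(SYS_futex, uaddr, FUTEX_WAIT | FUTEX_CLOCK_REALTIME,
                expected, &timeout, NULL, 0);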

    Reported-by: Michael Kerrisk
    Signed-off-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/9f3bdc116d79d23f5ee72ceb9a2a857f5ff8fa29.1450474525.git.dvhart@linux.intel.com
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • The out_unlock: label not only drops the locks, it also drops the refcount
    on the pi_state. Really intuitive.

    Move the label after the put_pi_state() call and use 'break' in the
    error handling path of the requeue loop.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Bhuvanesh_Surachari@mentor.com
    Cc: Andy Lowe
    Link: http://lkml.kernel.org/r/20151219200607.526665141@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • In the error handling cases we neither have pi_state nor a reference
    to it. Remove the pointless code.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Bhuvanesh_Surachari@mentor.com
    Cc: Andy Lowe
    Link: http://lkml.kernel.org/r/20151219200607.432780944@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Documentation of the pi_state refcounting in the requeue code is
    non-existent. Add it.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Bhuvanesh_Surachari@mentor.com
    Cc: Andy Lowe
    Link: http://lkml.kernel.org/r/20151219200607.335938312@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • free_pi_state() is confusing as it is in fact only freeing/caching the
    pi state when the last reference is gone. Rename it to put_pi_state()
    which reflects better what it is doing.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Bhuvanesh_Surachari@mentor.com
    Cc: Andy Lowe
    Link: http://lkml.kernel.org/r/20151219200607.259636467@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • If the proxy lock in the requeue loop acquires the rtmutex for a
    waiter, then it also acquired a refcount on the pi_state related to the
    futex, but the waiter side does not drop the reference count.

    Add the missing free_pi_state() call.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Bhuvanesh_Surachari@mentor.com
    Cc: Andy Lowe
    Link: http://lkml.kernel.org/r/20151219200607.178132067@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     

05 Nov, 2015

1 commit

  • Pull driver core updates from Greg KH:
    "Here's the "big" driver core updates for 4.4-rc1. Primarily a bunch
    of debugfs updates, with a smattering of minor driver core fixes and
    updates as well.

    All have been in linux-next for a long time"

    * tag 'driver-core-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    debugfs: Add debugfs_create_ulong()
    of: to support binding numa node to specified device in devicetree
    debugfs: Add read-only/write-only bool file ops
    debugfs: Add read-only/write-only size_t file ops
    debugfs: Add read-only/write-only x64 file ops
    debugfs: Consolidate file mode checks in debugfs_create_*()
    Revert "mm: Check if section present during memory block (un)registering"
    driver-core: platform: Provide helpers for multi-driver modules
    mm: Check if section present during memory block (un)registering
    devres: fix a for loop bounds check
    CMA: fix CONFIG_CMA_SIZE_MBYTES overflow in 64bit
    base/platform: assert that dev_pm_domain callbacks are called unconditionally
    sysfs: correctly handle short reads on PREALLOC attrs.
    base: soc: siplify ida usage
    kobject: move EXPORT_SYMBOL() macros next to corresponding definitions
    kobject: explain what kobject's sd field is
    debugfs: document that debugfs_remove*() accepts NULL and error values
    debugfs: Pass bool pointer to debugfs_create_bool()
    ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

    Linus Torvalds
     

04 Oct, 2015

1 commit

  • It's a bit odd that debugfs_create_bool() takes 'u32 *' as an argument,
    when all it needs is a boolean pointer.

    It would be better to update this API to make it accept 'bool *'
    instead, as that will make it more consistent and often more convenient.
    On top of that, a bool takes just a byte.

    That required updating all user sites as well, in the same commit as the
    API change. The regmap core was also using
    debugfs_{read|write}_file_bool() directly, and its variable types were
    updated to bool as well.
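
    After the change the prototype reads:

        struct dentry *debugfs_create_bool(const char *name, umode_t mode,
                                           struct dentry *parent,
                                           bool *value);   /* was: u32 *value */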

    Signed-off-by: Viresh Kumar
    Acked-by: Mark Brown
    Acked-by: Charles Keepax
    Signed-off-by: Greg Kroah-Hartman

    Viresh Kumar
     

22 Sep, 2015

1 commit

  • futex_hash() references two global variables: the base pointer
    futex_queues and the size of the array futex_hashsize. The latter is
    marked __read_mostly, while the former is not, so they are likely to
    end up very far from each other. This means that futex_hash() is
    likely to encounter two cache misses.

    We could mark futex_queues as __read_mostly as well, but that doesn't
    guarantee they'll end up next to each other (and even if they do, they
    may still end up in different cache lines). So put the two variables
    in a small singleton struct with sufficient alignment and mark that as
    __read_mostly.
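
    The shape of that singleton (close to the mainline change):

        static struct {
                struct futex_hash_bucket *queues;
                unsigned long             hashsize;
        } __futex_data __read_mostly __aligned(2*sizeof(long));

        #define futex_queues   (__futex_data.queues)
        #define futex_hashsize (__futex_data.hashsize)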

    Signed-off-by: Rasmus Villemoes
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: kbuild test robot
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/1441834601-13633-1-git-send-email-linux@rasmusvillemoes.dk
    Signed-off-by: Thomas Gleixner

    Rasmus Villemoes
     

20 Jul, 2015

2 commits

  • Although futexes are well known for being a royal pita,
    we really have very few debugging capabilities - except
    for relying on tglx's eye half the time.

    By simply making use of the existing fault-injection machinery,
    we can improve this situation, allowing us to generate artificial
    uaddr faults and deadlock scenarios. Of course, when this is
    disabled in production systems, the overhead for the failure checks
    is practically zero -- so this is very cheap at the same time. As
    future work, it would be nice to enhance trinity to make use of
    this.

    There is a special tunable 'ignore-private', which can filter
    out private futexes. Given the tsk->make_it_fail filter and
    this option, pi futexes can be narrowed down pretty closely.
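
    A sketch of the failure hook with that filter (shape per this changelog):

        static bool should_fail_futex(bool fshared)
        {
                if (fail_futex.ignore_private && !fshared)
                        return false;   /* tunable filters private futexes */

                return should_fail(&fail_futex.attr, 1);
        }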

    Signed-off-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/1435645562-975-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     
  • ... serves a bit better to clarify between blocking
    and non-blocking code paths.

    Signed-off-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/1435645562-975-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

25 Jun, 2015

1 commit

  • Pull locking updates from Thomas Gleixner:
    "These locking updates depend on the alreay merged sched/core branch:

    - Lockless top waiter wakeup for rtmutex (Davidlohr)

    - Reduce hash bucket lock contention for PI futexes (Sebastian)

    - Documentation update (Davidlohr)"

    * 'sched-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/rtmutex: Update stale plist comments
    futex: Lower the lock contention on the HB lock during wake up
    locking/rtmutex: Implement lockless top-waiter wakeup

    Linus Torvalds
     

23 Jun, 2015

2 commits

  • Pull timer updates from Thomas Gleixner:
    "A rather largish update for everything time and timer related:

    - Cache footprint optimizations for both hrtimers and timer wheel

    - Lower the NOHZ impact on systems which have NOHZ or timer migration
    disabled at runtime.

    - Optimize run time overhead of hrtimer interrupt by making the clock
    offset updates smarter

    - hrtimer cleanups and removal of restrictions to tackle some
    problems in sched/perf

    - Some more leap second tweaks

    - Another round of changes addressing the 2038 problem

    - First step to change the internals of clock event devices by
    introducing the necessary infrastructure

    - Allow constant folding for usecs/msecs_to_jiffies()

    - The usual pile of clockevent/clocksource driver updates

    The hrtimer changes contain updates to sched, perf and x86 as they
    depend on them plus changes all over the tree to cleanup API changes
    and redundant code, which got copied all over the place. The y2038
    changes touch s390 to remove the last non 2038 safe code related to
    boot/persistent clock"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    clocksource: Increase dependencies of timer-stm32 to limit build wreckage
    timer: Minimize nohz off overhead
    timer: Reduce timer migration overhead if disabled
    timer: Stats: Simplify the flags handling
    timer: Replace timer base by a cpu index
    timer: Use hlist for the timer wheel hash buckets
    timer: Remove FIFO "guarantee"
    timers: Sanitize catchup_timer_jiffies() usage
    hrtimer: Allow hrtimer::function() to free the timer
    seqcount: Introduce raw_write_seqcount_barrier()
    seqcount: Rename write_seqcount_barrier()
    hrtimer: Fix hrtimer_is_queued() hole
    hrtimer: Remove HRTIMER_STATE_MIGRATE
    selftest: Timers: Avoid signal deadlock in leap-a-day
    timekeeping: Copy the shadow-timekeeper over the real timekeeper last
    clockevents: Check state instead of mode in suspend/resume path
    selftests: timers: Add leap-second timer edge testing to leap-a-day.c
    ntp: Do leapsecond adjustment in adjtimex read path
    time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
    ntp: Introduce and use SECS_PER_DAY macro instead of 86400
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes are:

    - lockless wakeup support for futexes and IPC message queues
    (Davidlohr Bueso, Peter Zijlstra)

    - Replace spinlocks with atomics in thread_group_cputimer(), to
    improve scalability (Jason Low)

    - NUMA balancing improvements (Rik van Riel)

    - SCHED_DEADLINE improvements (Wanpeng Li)

    - clean up and reorganize preemption helpers (Frederic Weisbecker)

    - decouple page fault disabling machinery from the preemption
    counter, to improve debuggability and robustness (David
    Hildenbrand)

    - SCHED_DEADLINE documentation updates (Luca Abeni)

    - topology CPU masks cleanups (Bartosz Golaszewski)

    - /proc/sched_debug improvements (Srikar Dronamraju)"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
    sched/deadline: Remove needless parameter in dl_runtime_exceeded()
    sched: Remove superfluous resetting of the p->dl_throttled flag
    sched/deadline: Drop duplicate init_sched_dl_class() declaration
    sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
    sched/deadline: Make init_sched_dl_class() __init
    sched/deadline: Optimize pull_dl_task()
    sched/preempt: Add static_key() to preempt_notifiers
    sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
    sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
    sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
    sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
    sched/debug: Properly format runnable tasks in /proc/sched_debug
    sched/numa: Only consider less busy nodes as numa balancing destinations
    Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")
    sched/fair: Prevent throttling in early pick_next_task_fair()
    preempt: Reorganize the notrace definitions a bit
    preempt: Use preempt_schedule_context() as the official tracing preemption point
    sched: Make preempt_schedule_context() function-tracing safe
    x86: Remove cpu_sibling_mask() and cpu_core_mask()
    x86: Replace cpu_**_mask() with topology_**_cpumask()
    ...

    Linus Torvalds
     

20 Jun, 2015

1 commit

  • wake_futex_pi() wakes the task before releasing the hash bucket lock
    (HB). The first thing the woken up task usually does is to acquire the
    lock, which requires the HB lock. On SMP systems this leads to blocking
    on the HB lock, which is released by the owner shortly after.
    This patch rearranges the unlock path by first releasing the HB lock and
    then waking up the task.
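
    Schematically, the unlock path becomes (simplified; wake_q names assumed
    from the then-current wake-queue infrastructure):

        spin_unlock(&hb->lock);   /* drop the HB lock first ...       */
        wake_up_q(&wake_q);       /* ... then wake the new lock owner */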

    [ tglx: Fixed up the rtmutex unlock path ]

    Originally-from: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20150617083350.GA2433@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

19 May, 2015

1 commit

  • Since set_mb() is really about an smp_mb() -- not an IO/DMA barrier
    like mb() -- rename it to match the recent smp_load_acquire() and
    smp_store_release().
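
    Call sites change accordingly (the new name in mainline is smp_store_mb()):

        set_mb(var, value);         /* before */
        smp_store_mb(var, value);   /* after  */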

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

08 May, 2015

1 commit

  • Given the overall futex architecture, any chance of reducing
    hb->lock contention is welcome. In this particular case, using
    wake-queues to enable lockless wakeups addresses very real
    world performance concerns, even cases of soft-lockups with large
    numbers of blocked tasks (which are not hard to find in
    large boxes using just a handful of futexes).

    At the lowest level, this patch can reduce the latency of a single
    thread attempting to acquire hb->lock in highly contended scenarios
    by up to 2x. At lower counts of nr_wake there are no regressions,
    confirming, of course, that the wake_q handling overhead is practically
    non-existent. For instance, while there is a fair amount of variation,
    the extended perf-bench wakeup benchmark shows for a 20 core machine
    the following avg per-thread time to wake up its share of tasks:

    nr_thr   ms-before   ms-after
      16      0.0590      0.0215
      32      0.0396      0.0220
      48      0.0417      0.0182
      64      0.0536      0.0236
      80      0.0414      0.0097
      96      0.0672      0.0152

    Naturally, this can cause spurious wakeups. However there is no core code
    that cannot handle them afaict, and furthermore tglx does have the point
    that other events can already trigger them anyway.
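
    A sketch of the wake-queue pattern this relies on (WAKE_Q() was the macro
    name at the time; it was later renamed DEFINE_WAKE_Q()):

        WAKE_Q(wake_q);

        spin_lock(&hb->lock);
        wake_q_add(&wake_q, this->task);  /* only marks the task for wakeup */
        spin_unlock(&hb->lock);

        wake_up_q(&wake_q);               /* actual wakeups, hb->lock not held */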

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Chris Mason
    Cc: Davidlohr Bueso
    Cc: George Spelvin
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Manfred Spraul
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1430494072-30283-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

22 Apr, 2015

1 commit

  • The check for hrtimer_active() after starting the timer is
    pointless. If the timer is inactive it has expired already and
    therefore the task pointer is already NULL.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20150414203502.985825453@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

18 Feb, 2015

1 commit

  • attach_to_pi_owner() checks p->mm to prevent attaching to kthreads, and
    this looks doubly wrong:

    1. It should actually check PF_KTHREAD; a kthread can do use_mm().

    2. If this task is not a kthread and is actually the lock owner, we can
    wrongly return -EPERM instead of -ESRCH or retry-if-EAGAIN.

    And note that this wrong -EPERM is the likely case unless the exiting
    task is (auto)reaped quickly; we check ->mm before PF_EXITING.
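
    A sketch of the corrected check (per point 1 above):

        if (unlikely(p->flags & PF_KTHREAD)) {
                /* was: if (!p->mm), which also caught exiting user tasks */
                put_task_struct(p);
                return -EPERM;
        }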

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Catalin Marinas
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Jerome Marchand
    Cc: Larry Woodman
    Cc: Linus Torvalds
    Cc: Mateusz Guzik
    Link: http://lkml.kernel.org/r/20150202140536.GA26406@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

13 Feb, 2015

1 commit

  • If an attacker can cause a controlled kernel stack overflow, overwriting
    the restart block is a very juicy exploit target. This is because the
    restart_block is held in the same memory allocation as the kernel stack.

    Moving the restart block to struct task_struct prevents this exploit by
    making the restart_block harder to locate.

    Note that there are other fields in thread_info that are also easy
    targets, at least on some architectures.

    It's also a decent simplification, since the restart code is more or less
    identical on all architectures.
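
    After the move, the generic access pattern changes like this (illustrative):

        /* before */
        current_thread_info()->restart_block.fn = do_no_restart_syscall;

        /* after */
        current->restart_block.fn = do_no_restart_syscall;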

    [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
    Signed-off-by: Andy Lutomirski
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Kees Cook
    Cc: David Miller
    Acked-by: Richard Weinberger
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Vineet Gupta
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Steven Miao
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: David Howells
    Cc: Richard Kuo
    Cc: "Luck, Tony"
    Cc: Geert Uytterhoeven
    Cc: Michal Simek
    Cc: Ralf Baechle
    Cc: Jonas Bonn
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Acked-by: Michael Ellerman (powerpc)
    Tested-by: Michael Ellerman (powerpc)
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chen Liqin
    Cc: Lennox Wu
    Cc: Chris Metcalf
    Cc: Guan Xuetao
    Cc: Chris Zankel
    Cc: Max Filippov
    Cc: Oleg Nesterov
    Cc: Guenter Roeck
    Signed-off-by: James Hogan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

19 Jan, 2015

1 commit

  • This patch fixes two separate buglets in calls to futex_lock_pi():

    * Eliminate unused 'detect' argument
    * Change unused 'timeout' argument of FUTEX_TRYLOCK_PI to NULL

    The 'detect' argument of futex_lock_pi() seems never to have been
    used (when it was included with the initial PI mutex implementation
    in Linux 2.6.18, all checks against its value were disabled by
    ANDing against 0 (i.e., if (detect... && 0)), and with
    commit 778e9a9c3e7193ea9f434f382947155ffb59c755, any mention of
    this argument in futex_lock_pi() went away altogether. Its presence
    now serves only to confuse readers of the code, by giving the
    impression that the futex() FUTEX_LOCK_PI operation actually does
    use the 'val' argument. This patch removes the argument.

    The futex_lock_pi() call that corresponds to FUTEX_TRYLOCK_PI includes
    'timeout' as one of its arguments. This misleads the reader into thinking
    that the FUTEX_TRYLOCK_PI operation does employ timeouts for some sensible
    purpose; but it does not. Indeed, it cannot, because the checks at the
    start of sys_futex() exclude FUTEX_TRYLOCK_PI from the set of operations
    that do copy_from_user() on the timeout argument. So, in the
    FUTEX_TRYLOCK_PI futex_lock_pi() call it would be simplest to change
    'timeout' to 'NULL'. This patch does that.
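
    After both changes, the dispatch in do_futex() reduces to (sketch):

        case FUTEX_LOCK_PI:
                return futex_lock_pi(uaddr, flags, timeout, 0);
        case FUTEX_TRYLOCK_PI:
                return futex_lock_pi(uaddr, flags, NULL, 1);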

    Signed-off-by: Michael Kerrisk
    Reviewed-by: Darren Hart
    Link: http://lkml.kernel.org/r/54B96646.8010200@gmail.com
    Signed-off-by: Thomas Gleixner

    Michael Kerrisk
     

26 Oct, 2014

2 commits

  • free_pi_state and exit_pi_state_list both clean up futex_pi_state's.
    exit_pi_state_list takes the hb lock first, and most callers of
    free_pi_state do too. requeue_pi doesn't, which means free_pi_state
    can free the pi_state out from under exit_pi_state_list. For example:

    task A                             |  task B
    exit_pi_state_list                 |
      pi_state =                       |
        curr->pi_state_list->next      |
                                       |  futex_requeue(requeue_pi=1)
                                       |  // pi_state is the same as
                                       |  // the one in task A
                                       |  free_pi_state(pi_state)
                                       |    list_del_init(&pi_state->list)
                                       |    kfree(pi_state)
      list_del_init(&pi_state->list)   |

    Move the free_pi_state calls in requeue_pi to before it drops the hb
    locks which it's already holding.

    [ tglx: Removed a pointless free_pi_state() call and the hb->lock held
    debugging. The latter comes via a separate patch ]

    Signed-off-by: Brian Silverman
    Cc: austin.linux@gmail.com
    Cc: darren@dvhart.com
    Cc: peterz@infradead.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1414282837-23092-1-git-send-email-bsilver16384@gmail.com
    Signed-off-by: Thomas Gleixner

    Brian Silverman
     
  • Update our documentation as of fix 76835b0ebf8 (futex: Ensure
    get_futex_key_refs() always implies a barrier). Explicitly
    state that we don't do key referencing for private futexes.

    Signed-off-by: Davidlohr Bueso
    Cc: Matteo Franchin
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Acked-by: Catalin Marinas
    Link: http://lkml.kernel.org/r/1414121220.817.0.camel@linux-t7sj.site
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

19 Oct, 2014

1 commit

  • Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
    nothing to wake up) changes the futex code to avoid taking a lock when
    there are no waiters. This code has been subsequently fixed in commit
    11d4616bd07f (futex: revert back to the explicit waiter counting code).
    Both the original commit and the fix-up rely on get_futex_key_refs() to
    always imply a barrier.

    However, for private futexes, none of the cases in the switch statement
    of get_futex_key_refs() would be hit and the function completes without
    a memory barrier as required before checking "waiters" in
    futex_wake() -> hb_waiters_pending(). The consequence is a race with a
    thread waiting on a futex on another CPU, allowing the waker thread to
    read "waiters == 0" while the waiter thread has read "futex_val ==
    locked" (in kernel).

    Without this fix, the problem (user space deadlocks) can be seen with
    Android bionic's mutex implementation on an arm64 multi-cluster system.
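
    The fix gives the private case an explicit barrier (sketch of the change
    to get_futex_key_refs()):

        switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
        case FUT_OFF_INODE:
                ihold(key->shared.inode);       /* implies smp_mb(); (B) */
                break;
        case FUT_OFF_MMSHARED:
                futex_get_mm(key);              /* implies smp_mb(); (B) */
                break;
        default:
                smp_mb();                       /* explicit smp_mb(); (B) */
        }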

    Signed-off-by: Catalin Marinas
    Reported-by: Matteo Franchin
    Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
    Acked-by: Davidlohr Bueso
    Tested-by: Mike Galbraith
    Cc:
    Cc: Darren Hart
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

13 Sep, 2014

1 commit

  • futex_wait_requeue_pi() calls futex_wait_setup(). If
    futex_wait_setup() succeeds it returns with hb->lock held and
    preemption disabled. Now the sanity check after this does:

    if (match_futex(&q.key, &key2)) {
            ret = -EINVAL;
            goto out_put_keys;
    }

    which releases the keys but does not release hb->lock.

    So we happily return to user space with hb->lock held and therefore
    preemption disabled.

    Unlock hb->lock before taking the exit route.
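
    The fix adds the missing unlock on that error path (sketch):

        if (match_futex(&q.key, &key2)) {
                queue_unlock(hb);
                ret = -EINVAL;
                goto out_put_keys;
        }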

    Reported-by: Dave "Trinity" Jones
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart
    Reviewed-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner