26 Aug, 2017

1 commit

  • There is code duplicated over all architecture's headers for
    futex_atomic_op_inuser. Namely op decoding, access_ok check for uaddr,
    and comparison of the result.

    Remove this duplication and leave up to the arches only the needed
    assembly which is now in arch_futex_atomic_op_inuser.

    This effectively distributes the Will Deacon's arm64 fix for undefined
    behaviour reported by UBSAN to all architectures. The fix was done in
    commit 5f16a046f8e1 (arm64: futex: Fix undefined behaviour with
    FUTEX_OP_OPARG_SHIFT usage). Look there for an example dump.

    And as suggested by Thomas, check for negative oparg too, because it was
    also reported to cause undefined behaviour report.

    Note that s390 removed access_ok check in d12a29703 ("s390/uaccess:
    remove pointless access_ok() checks") as access_ok there returns true.
    We introduce it back to the helper for the sake of simplicity (it gets
    optimized away anyway).

    Signed-off-by: Jiri Slaby
    Signed-off-by: Thomas Gleixner
    Acked-by: Russell King
    Acked-by: Michael Ellerman (powerpc)
    Acked-by: Heiko Carstens [s390]
    Acked-by: Chris Metcalf [for tile]
    Reviewed-by: Darren Hart (VMware)
    Reviewed-by: Will Deacon [core/arm64]
    Cc: linux-mips@linux-mips.org
    Cc: Rich Felker
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-sh@vger.kernel.org
    Cc: peterz@infradead.org
    Cc: Benjamin Herrenschmidt
    Cc: Max Filippov
    Cc: Paul Mackerras
    Cc: sparclinux@vger.kernel.org
    Cc: Jonas Bonn
    Cc: linux-s390@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: Yoshinori Sato
    Cc: linux-hexagon@vger.kernel.org
    Cc: Helge Deller
    Cc: "James E.J. Bottomley"
    Cc: Catalin Marinas
    Cc: Matt Turner
    Cc: linux-snps-arc@lists.infradead.org
    Cc: Fenghua Yu
    Cc: Arnd Bergmann
    Cc: linux-xtensa@linux-xtensa.org
    Cc: Stefan Kristiansson
    Cc: openrisc@lists.librecores.org
    Cc: Ivan Kokshaysky
    Cc: Stafford Horne
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: Richard Henderson
    Cc: Chris Zankel
    Cc: Michal Simek
    Cc: Tony Luck
    Cc: linux-parisc@vger.kernel.org
    Cc: Vineet Gupta
    Cc: Ralf Baechle
    Cc: Richard Kuo
    Cc: linux-alpha@vger.kernel.org
    Cc: Martin Schwidefsky
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: "David S. Miller"
    Link: http://lkml.kernel.org/r/20170824073105.3901-1-jslaby@suse.cz

    Jiri Slaby
     

10 Aug, 2017

2 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Commit 65d8fc777f6d ("futex: Remove requirement for lock_page() in
    get_futex_key()") removed an unnecessary lock_page() with the
    side-effect that page->mapping needed to be treated very carefully.

    Two defensive warnings were added in case any assumption was missed and
    the first warning assumed a correct application would not alter a
    mapping backing a futex key. Since merging, it has not triggered for
    any unexpected case but Mark Rutland reported the following bug
    triggering due to the first warning.

    kernel BUG at kernel/futex.c:679!
    Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
    Hardware name: linux,dummy-virt (DT)
    task: ffff80001e271780 task.stack: ffff000010908000
    PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
    LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
    pc : [] lr : [] pstate: 80000145

    The fact that it's a bug instead of a warning was due to an unrelated
    arm64 problem, but the warning itself triggered because the underlying
    mapping changed.

    This is an application issue but from a kernel perspective it's a
    recoverable situation and the warning is unnecessary so this patch
    removes the warning. The warning may potentially be triggered with the
    following test program from Mark although it may be necessary to adjust
    NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
    system.

    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #define NR_FUTEX_THREADS 16
    pthread_t threads[NR_FUTEX_THREADS];

    void *mem;

    #define MEM_PROT (PROT_READ | PROT_WRITE)
    #define MEM_SIZE 65536

    static int futex_wrapper(int *uaddr, int op, int val,
    const struct timespec *timeout,
    int *uaddr2, int val3)
    {
    syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
    }

    void *poll_futex(void *unused)
    {
    for (;;) {
    futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
    }
    }

    int main(int argc, char *argv[])
    {
    int i;

    mem = mmap(NULL, MEM_SIZE, MEM_PROT,
    MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    printf("Mapping @ %p\n", mem);

    printf("Creating futex threads...\n");

    for (i = 0; i < NR_FUTEX_THREADS; i++)
    pthread_create(&threads[i], NULL, poll_futex, NULL);

    printf("Flipping mapping...\n");
    for (;;) {
    mmap(mem, MEM_SIZE, MEM_PROT,
    MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    }

    return 0;
    }

    Reported-and-tested-by: Mark Rutland
    Signed-off-by: Mel Gorman
    Acked-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org # 4.7+
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

01 Aug, 2017

1 commit

  • This makes it possible to preserve basic futex support and compile out the
    PI support when RT mutexes are not available.

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Link: http://lkml.kernel.org/r/alpine.LFD.2.20.1708010024190.5981@knanqh.ubzr

    Nicolas Pitre
     

19 Jul, 2017

1 commit

  • Pull structure randomization updates from Kees Cook:
    "Now that IPC and other changes have landed, enable manual markings for
    randstruct plugin, including the task_struct.

    This is the rest of what was staged in -next for the gcc-plugins, and
    comes in three patches, largest first:

    - mark "easy" structs with __randomize_layout

    - mark task_struct with an optional anonymous struct to isolate the
    __randomize_layout section

    - mark structs to opt _out_ of automated marking (which will come
    later)

    And, FWIW, this continues to pass allmodconfig (normal and patched to
    enable gcc-plugins) builds of x86_64, i386, arm64, arm, powerpc, and
    s390 for me"

    * tag 'gcc-plugins-v4.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    randstruct: opt-out externally exposed function pointer structs
    task_struct: Allow randomized layout
    randstruct: Mark various structs for randomization

    Linus Torvalds
     

04 Jul, 2017

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "There has been a fair amount of activity in the docs tree this time
    around. Highlights include:

    - Conversion of a bunch of security documentation into RST

    - The conversion of the remaining DocBook templates by The Amazing
    Mauro Machine. We can now drop the entire DocBook build chain.

    - The usual collection of fixes and minor updates"

    * tag 'docs-4.13' of git://git.lwn.net/linux: (90 commits)
    scripts/kernel-doc: handle DECLARE_HASHTABLE
    Documentation: atomic_ops.txt is core-api/atomic_ops.rst
    Docs: clean up some DocBook loose ends
    Make the main documentation title less Geocities
    Docs: Use kernel-figure in vidioc-g-selection.rst
    Docs: fix table problems in ras.rst
    Docs: Fix breakage with Sphinx 1.5 and upper
    Docs: Include the Latex "ifthen" package
    doc/kokr/howto: Only send regression fixes after -rc1
    docs-rst: fix broken links to dynamic-debug-howto in kernel-parameters
    doc: Document suitability of IBM Verse for kernel development
    Doc: fix a markup error in coding-style.rst
    docs: driver-api: i2c: remove some outdated information
    Documentation: DMA API: fix a typo in a function name
    Docs: Insert missing space to separate link from text
    doc/ko_KR/memory-barriers: Update control-dependencies example
    Documentation, kbuild: fix typo "minimun" -> "minimum"
    docs: Fix some formatting issues in request-key.rst
    doc: ReSTify keys-trusted-encrypted.txt
    doc: ReSTify keys-request-key.txt
    ...

    Linus Torvalds
     

01 Jul, 2017

1 commit

  • This marks many critical kernel structures for randomization. These are
    structures that have been targeted in the past in security exploits, or
    contain functions pointers, pointers to function pointer tables, lists,
    workqueues, ref-counters, credentials, permissions, or are otherwise
    sensitive. This initial list was extracted from Brad Spengler/PaX Team's
    code in the last public patch of grsecurity/PaX based on my understanding
    of the code. Changes or omissions from the original code are mine and
    don't reflect the original grsecurity/PaX code.

    Left out of this list is task_struct, which requires special handling
    and will be covered in a subsequent patch.

    Signed-off-by: Kees Cook

    Kees Cook
     

20 Jun, 2017

1 commit

  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

16 May, 2017

1 commit

  • There are a few issues on some kernel-doc markups that was
    causing troubles with kernel-doc output on ReST format:

    ./kernel/futex.c:492: WARNING: Inline emphasis start-string without end-string.
    ./kernel/futex.c:1264: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./kernel/futex.c:1721: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./kernel/futex.c:2338: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./kernel/futex.c:2426: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./kernel/futex.c:2899: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./kernel/futex.c:2972: WARNING: Block quote ends without a blank line; unexpected unindent.

    Fix them.

    No functional changes.

    Acked-by: Darren Hart (VMware)
    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

15 Apr, 2017

1 commit

  • Clarify the scenario described in mark_wake_futex requiring the
    smp_store_release(). Update the comment to explicitly refer to the
    plist_del now under __unqueue_futex() (previously plist_del was in the
    same function as the comment).

    Signed-off-by: Darren Hart (VMware)
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20170414223138.GA4222@fury
    Signed-off-by: Thomas Gleixner

    Darren Hart (VMware)
     

14 Apr, 2017

2 commits

  • During (post-commit) review Darren spotted a few minor things. One
    (harmless AFAICT) type inconsistency and a comment that wasn't as
    clear as hoped.

    Reported-by: Darren Hart (VMWare)
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Darren Hart (VMware)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Alexander reported a hrtimer debug_object splat:

    ODEBUG: free active (active state 0) object type: hrtimer hint: hrtimer_wakeup (kernel/time/hrtimer.c:1423)

    debug_object_free (lib/debugobjects.c:603)
    destroy_hrtimer_on_stack (kernel/time/hrtimer.c:427)
    futex_lock_pi (kernel/futex.c:2740)
    do_futex (kernel/futex.c:3399)
    SyS_futex (kernel/futex.c:3447 kernel/futex.c:3415)
    do_syscall_64 (arch/x86/entry/common.c:284)
    entry_SYSCALL64_slow_path (arch/x86/entry/entry_64.S:249)

    Which was caused by commit:

    cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")

    ... losing the hrtimer_cancel() in the shuffle. Where previously the
    hrtimer_cancel() was done by rt_mutex_slowlock() we now need to do it
    manually.

    Reported-by: Alexander Levin
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Fixes: cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1704101802370.2906@nanos
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

04 Apr, 2017

2 commits

  • Previous patches changed the meaning of the return value of
    rt_mutex_slowunlock(); update comments and code to reflect this.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.255058238@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • We should deboost before waking the high-priority task, such that we
    don't run two tasks with the same "state" (priority, deadline,
    sched_class, etc).

    In order to make sure the boosting task doesn't start running between
    unlock and deboost (due to 'spurious' wakeup), we move the deboost
    under the wait_lock, that way its serialized against the wait loop in
    __rt_mutex_slowlock().

    Doing the deboost early can however lead to priority-inversion if
    current would get preempted after the deboost but before waking our
    high-prio task, hence we disable preemption before doing deboost, and
    enabling it after the wake up is over.

    This gets us the right semantic order, but most importantly however;
    this change ensures pointer stability for the next patch, where we
    have rt_mutex_setprio() cache a pointer to the top-most waiter task.
    If we, as before this change, do the wakeup first and then deboost,
    this pointer might point into thin air.

    [peterz: Changelog + patch munging]
    Suggested-by: Peter Zijlstra
    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org
    Signed-off-by: Thomas Gleixner

    Xunlei Pang
     

24 Mar, 2017

12 commits

  • When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
    chain code will (falsely) report a deadlock and BUG.

    The problem is that it hold hb->lock (now an rt_mutex) while doing
    task_blocks_on_rt_mutex on the futex's pi_state::rtmutex. This, when
    interleaved just right with futex_unlock_pi() leads it to believe to see an
    AB-BA deadlock.

    Task1 (holds rt_mutex, Task2 (does FUTEX_LOCK_PI)
    does FUTEX_UNLOCK_PI)

    lock hb->lock
    lock rt_mutex (as per start_proxy)
    lock hb->lock

    Which is a trivial AB-BA.

    It is not an actual deadlock, because it won't be holding hb->lock by the
    time it actually blocks on the rt_mutex, but the chainwalk code doesn't
    know that and it would be a nightmare to handle this gracefully.

    To avoid this problem, do the same as in futex_unlock_pi() and drop
    hb->lock after acquiring wait_lock. This still fully serializes against
    futex_unlock_pi(), since adding to the wait_list does the very same lock
    dance, and removing it holds both locks.

    Aside of solving the RT problem this makes the lock and unlock mechanism
    symetric and reduces the hb->lock held time.

    Reported-and-tested-by: Sebastian Andrzej Siewior
    Suggested-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.161341537@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • The problem with returning -EAGAIN when the waiter state mismatches is that
    it becomes very hard to proof a bounded execution time on the
    operation. And seeing that this is a RT operation, this is somewhat
    important.

    While in practise; given the previous patch; it will be very unlikely to
    ever really take more than one or two rounds, proving so becomes rather
    hard.

    However, now that modifying wait_list is done while holding both hb->lock
    and wait_lock, the scenario can be avoided entirely by acquiring wait_lock
    while still holding hb-lock. Doing a hand-over, without leaving a hole.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.112378812@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list
    modifications are done under both hb->lock and wait_lock.

    This closes the obvious interleave pattern between futex_lock_pi() and
    futex_unlock_pi(), but not entirely so. See below:

    Before:

    futex_lock_pi() futex_unlock_pi()
    unlock hb->lock

    lock hb->lock
    unlock hb->lock

    lock rt_mutex->wait_lock
    unlock rt_mutex_wait_lock
    -EAGAIN

    lock rt_mutex->wait_lock
    list_add
    unlock rt_mutex->wait_lock

    schedule()

    lock rt_mutex->wait_lock
    list_del
    unlock rt_mutex->wait_lock


    -EAGAIN

    lock hb->lock

    After:

    futex_lock_pi() futex_unlock_pi()

    lock hb->lock
    lock rt_mutex->wait_lock
    list_add
    unlock rt_mutex->wait_lock
    unlock hb->lock

    schedule()
    lock hb->lock
    unlock hb->lock
    lock hb->lock
    lock rt_mutex->wait_lock
    list_del
    unlock rt_mutex->wait_lock

    lock rt_mutex->wait_lock
    unlock rt_mutex_wait_lock
    -EAGAIN

    unlock hb->lock

    It does however solve the earlier starvation/live-lock scenario which got
    introduced with the -EAGAIN since unlike the before scenario; where the
    -EAGAIN happens while futex_unlock_pi() doesn't hold any locks; in the
    after scenario it happens while futex_unlock_pi() actually holds a lock,
    and then it is serialized on that lock.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.062785528@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters
    consistent it's necessary to split 'rt_mutex_futex_lock()' into finer
    parts, such that only the actual blocking can be done without hb->lock
    held.

    Split split_mutex_finish_proxy_lock() into two parts, one that does the
    blocking and one that does remove_waiter() when the lock acquire failed.

    When the rtmutex was acquired successfully the waiter can be removed in the
    acquisiton path safely, since there is no concurrency on the lock owner.

    This means that, except for futex_lock_pi(), all wait_list modifications
    are done with both hb->lock and wait_lock held.

    [bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.001659630@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Since there's already two copies of this code, introduce a helper now
    before adding a third one.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.950039479@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • There's a number of 'interesting' problems, all caused by holding
    hb->lock while doing the rt_mutex_unlock() equivalient.

    Notably:

    - a PI inversion on hb->lock; and,

    - a SCHED_DEADLINE crash because of pointer instability.

    The previous changes:

    - changed the locking rules to cover {uval,pi_state} with wait_lock.

    - allow to do rt_mutex_futex_unlock() without dropping wait_lock; which in
    turn allows to rely on wait_lock atomicity completely.

    - simplified the waiter conundrum.

    It's now sufficient to hold rtmutex::wait_lock and a reference on the
    pi_state to protect the state consistency, so hb->lock can be dropped
    before calling rt_mutex_futex_unlock().

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.900002056@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • There is a weird state in the futex_unlock_pi() path when it interleaves
    with a concurrent futex_lock_pi() at the point where it drops hb->lock.

    In this case, it can happen that the rt_mutex wait_list and the futex_q
    disagree on pending waiters, in particular rt_mutex will find no pending
    waiters where futex_q thinks there are. In this case the rt_mutex unlock
    code cannot assign an owner.

    The futex side fixup code has to cleanup the inconsistencies with quite a
    bunch of interesting corner cases.

    Simplify all this by changing wake_futex_pi() to return -EAGAIN when this
    situation occurs. This then gives the futex_lock_pi() code the opportunity
    to continue and the retried futex_unlock_pi() will now observe a coherent
    state.

    The only problem is that this breaks RT timeliness guarantees. That
    is, consider the following scenario:

    T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)

    CPU0

    T1
    lock_pi()
    queue_me()
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.850383690@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Add a put_pit_state() as counterpart for get_pi_state() so the refcounting
    becomes consistent.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.801778516@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Currently futex-pi relies on hb->lock to serialize everything. But hb->lock
    creates another set of problems, especially priority inversions on RT where
    hb->lock becomes a rt_mutex itself.

    The rt_mutex::wait_lock is the most obvious protection for keeping the
    futex user space value and the kernel internal pi_state in sync.

    Rework and document the locking so rt_mutex::wait_lock is held accross all
    operations which modify the user space value and the pi state.

    This allows to invoke rt_mutex_unlock() (including deboost) without holding
    hb->lock as a next step.

    Nothing yet relies on the new locking rules.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Part of what makes futex_unlock_pi() intricate is that
    rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
    rt_mutex::wait_lock.

    This means it cannot rely on the atomicy of wait_lock, which would be
    preferred in order to not rely on hb->lock so much.

    The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can
    race with the rt_mutex fastpath, however futexes have their own fast path.

    Since futexes already have a bunch of separate rt_mutex accessors, complete
    that set and implement a rt_mutex variant without fastpath for them.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Since the futex_q can dissapear the instruction after assigning NULL,
    this really should be a RELEASE barrier. That stops loads from hitting
    dead memory too.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.604296452@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • futex_top_waiter() returns the top-waiter on the pi_mutex. Assinging
    this to a variable 'match' totally obscures the code.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.554710645@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

15 Mar, 2017

2 commits

  • Thomas spotted that fixup_pi_state_owner() can return errors and we
    fail to unlock the rt_mutex in that case.

    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Darren Hart
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170304093558.867401760@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • While working on the futex code, I stumbled over this potential
    use-after-free scenario. Dmitry triggered it later with syzkaller.

    pi_mutex is a pointer into pi_state, which we drop the reference on in
    unqueue_me_pi(). So any access to that pointer after that is bad.

    Since other sites already do rt_mutex_unlock() with hb->lock held, see
    for example futex_lock_pi(), simply move the unlock before
    unqueue_me_pi().

    Reported-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Darren Hart
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

02 Mar, 2017

2 commits

  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split out of , which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just
    maps to to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)

    Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

13 Feb, 2017

1 commit

  • The UEVENT user mode helper is enabled before the initcalls are executed
    and is available when the root filesystem has been mounted.

    The user mode helper is triggered by device init calls and the executable
    might use the futex syscall.

    futex_init() is marked __initcall which maps to device_initcall, but there
    is no guarantee that futex_init() is invoked _before_ the first device init
    call which triggers the UEVENT user mode helper.

    If the user mode helper uses the futex syscall before futex_init() then the
    syscall crashes with a NULL pointer dereference because the futex subsystem
    has not been initialized yet.

    Move futex_init() to core_initcall so futexes are initialized before the
    root filesystem is mounted and the usermode helper becomes available.

    [ tglx: Rewrote changelog ]

    Signed-off-by: Yang Yang
    Cc: jiang.biao2@zte.com.cn
    Cc: jiang.zhengxiong@zte.com.cn
    Cc: zhong.weidong@zte.com.cn
    Cc: deng.huali@zte.com.cn
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1483085875-6130-1-git-send-email-yang.yang29@zte.com.cn
    Signed-off-by: Thomas Gleixner

    Yang Yang
     

26 Dec, 2016

1 commit

  • ktime is a union because the initial implementation stored the time in
    scalar nanoseconds on 64 bit machine and in a endianess optimized timespec
    variant for 32bit machines. The Y2038 cleanup removed the timespec variant
    and switched everything to scalar nanoseconds. The union remained, but
    become completely pointless.

    Get rid of the union and just keep ktime_t as simple typedef of type s64.

    The conversion was done with coccinelle and some manual mopping up.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     

21 Nov, 2016

1 commit

  • Currently the wake_q data structure is defined by the WAKE_Q() macro.
    This macro, however, looks like a function doing something as "wake" is
    a verb. Even checkpatch.pl was confused as it reported warnings like

    WARNING: Missing a blank line after declarations
    #548: FILE: kernel/futex.c:3665:
    + int ret;
    + WAKE_Q(wake_q);

    This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies
    what the macro is doing and eliminates the checkpatch.pl warnings.

    Signed-off-by: Waiman Long
    Acked-by: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com
    [ Resolved conflict and added missing rename. ]
    Signed-off-by: Ingo Molnar

    Waiman Long
     

05 Sep, 2016

1 commit

  • Add some more comments and reformat existing ones to kernel doc style.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Darren Hart
    Link: http://lkml.kernel.org/r/1464770609-30168-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

30 Jul, 2016

1 commit

  • To quote Rick why there is no need for shared mapping on !MMU systems:

    |With MMU, shared futex keys need to identify the physical backing for
    |a memory address because it may be mapped at different addresses in
    |different processes (or even multiple times in the same process).
    |Without MMU this cannot happen. You only have physical addresses. So
    |the "private futex" behavior of using the virtual address as the key
    |is always correct (for both shared and private cases) on nommu
    |systems.

    This patch disables the FLAGS_SHARED in a way that allows the compiler to
    remove that code.

    [bigeasy: Added changelog ]
    Reported-by: Rich Felker
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20160729143230.GA21715@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

09 Jun, 2016

1 commit

  • Mike Galbraith reported that the LTP test case futex_wake04 was broken
    by commit 65d8fc777f6d ("futex: Remove requirement for lock_page()
    in get_futex_key()").

    This test case uses futexes backed by hugetlbfs pages and so there is an
    associated inode with a futex stored on such pages. The problem is that
    the key is being calculated based on the head page index of the hugetlbfs
    page and not the tail page.

    Prior to the optimisation, the page lock was used to stabilise mappings and
    pin the inode is file-backed which is overkill. If the page was a compound
    page, the head page was automatically looked up as part of the page lock
    operation but the tail page index was used to calculate the futex key.

    After the optimisation, the compound head is looked up early and the page
    lock is only relied upon to identify truncated pages, special pages or a
    shmem page moving to swapcache. The head page is looked up because without
    the page lock, special care has to be taken to pin the inode correctly.
    However, the tail page is still required to calculate the futex key so
    this patch records the tail page.

    On vanilla 4.6, the output of the test case is;

    futex_wake04 0 TINFO : Hugepagesize 2097152
    futex_wake04 1 TFAIL : futex_wake04.c:126: Bug: wait_thread2 did not wake after 30 secs.

    With the patch applied

    futex_wake04 0 TINFO : Hugepagesize 2097152
    futex_wake04 1 TPASS : Hi hydra, thread2 awake!

    Fixes: 65d8fc777f6d "futex: Remove requirement for lock_page() in get_futex_key()"
    Reported-and-tested-by: Mike Galbraith
    Signed-off-by: Mel Gorman
    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Davidlohr Bueso
    Cc: Sebastian Andrzej Siewior
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160608132522.GM2469@suse.de
    Signed-off-by: Thomas Gleixner

    Mel Gorman
     

23 May, 2016

1 commit

  • I'm looking at trying to possibly merge the 32-bit and 64-bit versions
    of the x86 uaccess.h implementation, but first this needs to be cleaned
    up.

    For example, the 32-bit version of "__copy_from_user_inatomic()" is
    mostly the special cases for the constant size, and it's actually almost
    never relevant. Most users aren't actually using a constant size
    anyway, and the few cases that do small constant copies are better off
    just using __get_user() instead.

    So get rid of the unnecessary complexity.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Apr, 2016

1 commit

  • Otherwise an incoming waker on the dest hash bucket can miss
    the waiter adding itself to the plist during the lockless
    check optimization (small window but still the correct way
    of doing this); similarly to the decrement counterpart.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: bigeasy@linutronix.de
    Cc: dvhart@infradead.org
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1461208164-29150-1-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

20 Apr, 2016

1 commit

  • If userspace calls UNLOCK_PI unconditionally without trying the TID -> 0
    transition in user space first then the user space value might not have the
    waiters bit set. This opens the following race:

    CPU0 CPU1
    uval = get_user(futex)
    lock(hb)
    lock(hb)
    futex |= FUTEX_WAITERS
    ....
    unlock(hb)

    cmpxchg(futex, uval, newval)

    So the cmpxchg fails and returns -EINVAL to user space, which is wrong because
    the futex value is valid.

    To handle this (yes, yet another) corner case gracefully, check for a flag
    change and retry.

    [ tglx: Massaged changelog and slightly reworked implementation ]

    Fixes: ccf9e6a80d9e ("futex: Make unlock_pi more robust")
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: stable@vger.kernel.org
    Cc: Davidlohr Bueso
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1460723739-5195-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior