04 Jul, 2017

1 commit

  • Pull documentation updates from Jonathan Corbet:
    "There has been a fair amount of activity in the docs tree this time
    around. Highlights include:

    - Conversion of a bunch of security documentation into RST

    - The conversion of the remaining DocBook templates by The Amazing
    Mauro Machine. We can now drop the entire DocBook build chain.

    - The usual collection of fixes and minor updates"

    * tag 'docs-4.13' of git://git.lwn.net/linux: (90 commits)
    scripts/kernel-doc: handle DECLARE_HASHTABLE
    Documentation: atomic_ops.txt is core-api/atomic_ops.rst
    Docs: clean up some DocBook loose ends
    Make the main documentation title less Geocities
    Docs: Use kernel-figure in vidioc-g-selection.rst
    Docs: fix table problems in ras.rst
    Docs: Fix breakage with Sphinx 1.5 and upper
    Docs: Include the Latex "ifthen" package
    doc/kokr/howto: Only send regression fixes after -rc1
    docs-rst: fix broken links to dynamic-debug-howto in kernel-parameters
    doc: Document suitability of IBM Verse for kernel development
    Doc: fix a markup error in coding-style.rst
    docs: driver-api: i2c: remove some outdated information
    Documentation: DMA API: fix a typo in a function name
    Docs: Insert missing space to separate link from text
    doc/ko_KR/memory-barriers: Update control-dependencies example
    Documentation, kbuild: fix typo "minimun" -> "minimum"
    docs: Fix some formatting issues in request-key.rst
    doc: ReSTify keys-trusted-encrypted.txt
    doc: ReSTify keys-request-key.txt
    ...

    Linus Torvalds
     

20 Jun, 2017

1 commit

  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.
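
    In <linux/wait.h> terms, the commit amounts to this rename (a minimal
    sketch; the struct contents are unchanged):

        /* before */
        typedef struct __wait_queue wait_queue_t;

        /* after */
        typedef struct wait_queue_entry wait_queue_entry_t;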

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

16 May, 2017

1 commit

  • There are a few issues in some kernel-doc markups that were causing
    trouble with the kernel-doc output in ReST format:

    ./kernel/futex.c:492: WARNING: Inline emphasis start-string without end-string.
    ./kernel/futex.c:1264: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./kernel/futex.c:1721: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./kernel/futex.c:2338: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./kernel/futex.c:2426: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./kernel/futex.c:2899: WARNING: Block quote ends without a blank line; unexpected unindent.
    ./kernel/futex.c:2972: WARNING: Block quote ends without a blank line; unexpected unindent.

    Fix them.

    No functional changes.

    Acked-by: Darren Hart (VMware)
    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     

15 Apr, 2017

1 commit

  • Clarify the scenario described in mark_wake_futex requiring the
    smp_store_release(). Update the comment to explicitly refer to the
    plist_del now under __unqueue_futex() (previously plist_del was in the
    same function as the comment).

    Signed-off-by: Darren Hart (VMware)
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20170414223138.GA4222@fury
    Signed-off-by: Thomas Gleixner

    Darren Hart (VMware)
     

14 Apr, 2017

2 commits

  • During (post-commit) review Darren spotted a few minor things. One
    (harmless AFAICT) type inconsistency and a comment that wasn't as
    clear as hoped.

    Reported-by: Darren Hart (VMWare)
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Darren Hart (VMware)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Alexander reported a hrtimer debug_object splat:

    ODEBUG: free active (active state 0) object type: hrtimer hint: hrtimer_wakeup (kernel/time/hrtimer.c:1423)

    debug_object_free (lib/debugobjects.c:603)
    destroy_hrtimer_on_stack (kernel/time/hrtimer.c:427)
    futex_lock_pi (kernel/futex.c:2740)
    do_futex (kernel/futex.c:3399)
    SyS_futex (kernel/futex.c:3447 kernel/futex.c:3415)
    do_syscall_64 (arch/x86/entry/common.c:284)
    entry_SYSCALL64_slow_path (arch/x86/entry/entry_64.S:249)

    Which was caused by commit:

    cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")

    ... losing the hrtimer_cancel() in the shuffle. Where previously the
    hrtimer_cancel() was done by rt_mutex_slowlock() we now need to do it
    manually.
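
    The fix restores the cancellation before the on-stack timer is
    destroyed; a minimal sketch of the futex_lock_pi() exit path:

        if (to) {
                hrtimer_cancel(&to->timer);
                destroy_hrtimer_on_stack(&to->timer);
        }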

    Reported-by: Alexander Levin
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Fixes: cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1704101802370.2906@nanos
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

04 Apr, 2017

2 commits

  • Previous patches changed the meaning of the return value of
    rt_mutex_slowunlock(); update comments and code to reflect this.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.255058238@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • We should deboost before waking the high-priority task, such that we
    don't run two tasks with the same "state" (priority, deadline,
    sched_class, etc).

    In order to make sure the boosting task doesn't start running between
    unlock and deboost (due to 'spurious' wakeup), we move the deboost
    under the wait_lock; that way it is serialized against the wait loop
    in __rt_mutex_slowlock().

    Doing the deboost early can, however, lead to priority inversion if
    current were preempted after the deboost but before waking our
    high-prio task; hence we disable preemption before doing the deboost
    and enable it after the wakeup is over.

    This gets us the right semantic order, but most importantly, this
    change ensures pointer stability for the next patch, where we have
    rt_mutex_setprio() cache a pointer to the top-most waiter task. If we,
    as before this change, did the wakeup first and then deboosted, this
    pointer might point into thin air.
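
    Schematically, the resulting unlock path looks like this (a sketch,
    not the exact code):

        raw_spin_lock_irqsave(&lock->wait_lock, flags);
        /* pick the top waiter and deboost, still under wait_lock */
        preempt_disable();
        raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
        wake_up_q(&wake_q);
        preempt_enable();       /* wakeup done, safe to preempt again */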

    [peterz: Changelog + patch munging]
    Suggested-by: Peter Zijlstra
    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org
    Signed-off-by: Thomas Gleixner

    Xunlei Pang
     

24 Mar, 2017

12 commits

  • When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
    chain code will (falsely) report a deadlock and BUG.

    The problem is that it holds hb->lock (now an rt_mutex) while doing
    task_blocks_on_rt_mutex() on the futex's pi_state::rtmutex. When this
    interleaves just right with futex_unlock_pi(), the chain walk is led
    to believe it sees an AB-BA deadlock.

    Task1 (holds rt_mutex,          Task2 (does FUTEX_LOCK_PI)
           does FUTEX_UNLOCK_PI)

                                    lock hb->lock
                                    lock rt_mutex (as per start_proxy)
    lock hb->lock

    Which is a trivial AB-BA.

    It is not an actual deadlock, because it won't be holding hb->lock by the
    time it actually blocks on the rt_mutex, but the chainwalk code doesn't
    know that and it would be a nightmare to handle this gracefully.

    To avoid this problem, do the same as in futex_unlock_pi() and drop
    hb->lock after acquiring wait_lock. This still fully serializes against
    futex_unlock_pi(), since adding to the wait_list does the very same lock
    dance, and removing it holds both locks.

    Aside from solving the RT problem, this makes the lock and unlock
    mechanism symmetric and reduces the hb->lock hold time.

    Reported-and-tested-by: Sebastian Andrzej Siewior
    Suggested-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.161341537@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • The problem with returning -EAGAIN when the waiter state mismatches is
    that it becomes very hard to prove a bounded execution time for the
    operation. And seeing that this is an RT operation, that is somewhat
    important.

    In practice, given the previous patch, it is very unlikely to ever
    really take more than one or two rounds, but proving so becomes rather
    hard.

    However, now that modifying wait_list is done while holding both
    hb->lock and wait_lock, the scenario can be avoided entirely by
    acquiring wait_lock while still holding hb->lock, doing a hand-over
    without leaving a hole.
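
    A minimal sketch of the hand-over (names as used above):

        /* taken while still holding hb->lock, so there is no hole */
        raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
        spin_unlock(&hb->lock);
        /* ... operate on the wait_list under wait_lock ... */
        raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);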

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.112378812@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list
    modifications are done under both hb->lock and wait_lock.

    This closes the obvious interleave pattern between futex_lock_pi() and
    futex_unlock_pi(), but not entirely so. See below:

    Before:

    futex_lock_pi()                 futex_unlock_pi()
      unlock hb->lock

                                    lock hb->lock
                                    unlock hb->lock

                                    lock rt_mutex->wait_lock
                                    unlock rt_mutex->wait_lock
                                      -EAGAIN

      lock rt_mutex->wait_lock
      list_add
      unlock rt_mutex->wait_lock

      schedule()

      lock rt_mutex->wait_lock
      list_del
      unlock rt_mutex->wait_lock

                                      -EAGAIN

      lock hb->lock

    After:

    futex_lock_pi()                 futex_unlock_pi()

      lock hb->lock
      lock rt_mutex->wait_lock
      list_add
      unlock rt_mutex->wait_lock
      unlock hb->lock

      schedule()
                                    lock hb->lock
                                    unlock hb->lock
      lock hb->lock
      lock rt_mutex->wait_lock
      list_del
      unlock rt_mutex->wait_lock

                                    lock rt_mutex->wait_lock
                                    unlock rt_mutex->wait_lock
                                      -EAGAIN

      unlock hb->lock

    It does, however, solve the earlier starvation/live-lock scenario
    introduced with the -EAGAIN: unlike the before scenario, where the
    -EAGAIN happens while futex_unlock_pi() doesn't hold any locks, in the
    after scenario it happens while futex_unlock_pi() actually holds a
    lock, and it is then serialized on that lock.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.062785528@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters
    consistent it's necessary to split 'rt_mutex_futex_lock()' into finer
    parts, such that only the actual blocking can be done without hb->lock
    held.

    Split rt_mutex_finish_proxy_lock() into two parts, one that does the
    blocking and one that does remove_waiter() when the lock acquisition
    failed.

    When the rtmutex was acquired successfully, the waiter can be removed
    in the acquisition path safely, since there is no concurrency on the
    lock owner.

    This means that, except for futex_lock_pi(), all wait_list modifications
    are done with both hb->lock and wait_lock held.
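
    The two halves ended up as roughly the following helpers (a sketch of
    the signatures):

        /* the blocking part; called without hb->lock held */
        int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
                                     struct hrtimer_sleeper *to,
                                     struct rt_mutex_waiter *waiter);

        /* cleanup on failure; removes the waiter if we did not acquire */
        bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
                                         struct rt_mutex_waiter *waiter);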

    [bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.001659630@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Since there's already two copies of this code, introduce a helper now
    before adding a third one.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.950039479@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • There are a number of 'interesting' problems, all caused by holding
    hb->lock while doing the rt_mutex_unlock() equivalent.

    Notably:

    - a PI inversion on hb->lock; and,

    - a SCHED_DEADLINE crash because of pointer instability.

    The previous changes:

    - changed the locking rules to cover {uval,pi_state} with wait_lock.

    - allowed rt_mutex_futex_unlock() to be done without dropping
    wait_lock, which in turn allows relying on wait_lock atomicity
    completely.

    - simplified the waiter conundrum.

    It's now sufficient to hold rtmutex::wait_lock and a reference on the
    pi_state to protect the state consistency, so hb->lock can be dropped
    before calling rt_mutex_futex_unlock().
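
    The futex_unlock_pi() tail then becomes, roughly (a sketch):

        get_pi_state(pi_state);
        spin_unlock(&hb->lock);

        /* the unlock (and deboost) now happens without hb->lock held */
        rt_mutex_futex_unlock(&pi_state->pi_mutex);

        put_pi_state(pi_state);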

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.900002056@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • There is a weird state in the futex_unlock_pi() path when it interleaves
    with a concurrent futex_lock_pi() at the point where it drops hb->lock.

    In this case, it can happen that the rt_mutex wait_list and the futex_q
    disagree on pending waiters, in particular rt_mutex will find no pending
    waiters where futex_q thinks there are. In this case the rt_mutex unlock
    code cannot assign an owner.

    The futex side fixup code has to clean up the inconsistencies with
    quite a bunch of interesting corner cases.

    Simplify all this by changing wake_futex_pi() to return -EAGAIN when this
    situation occurs. This then gives the futex_lock_pi() code the opportunity
    to continue and the retried futex_unlock_pi() will now observe a coherent
    state.

    The only problem is that this breaks RT timeliness guarantees. That
    is, consider the following scenario:

    T1 and T2 are both pinned to CPU0. prio(T2) > prio(T1)

    CPU0

    T1
      lock_pi()
      queue_me()  <- Waiter is visible

    preemption

    T2
      unlock_pi()
        loops with -EAGAIN forever

    Which is undesirable for PI primitives. Future patches will rectify
    this.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.850383690@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Add a put_pi_state() as counterpart for get_pi_state() so the
    refcounting becomes consistent.
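
    A sketch of the resulting pair (the free path is elided):

        static void get_pi_state(struct futex_pi_state *pi_state)
        {
                WARN_ON_ONCE(!atomic_inc_not_zero(&pi_state->refcount));
        }

        static void put_pi_state(struct futex_pi_state *pi_state)
        {
                if (!atomic_dec_and_test(&pi_state->refcount))
                        return;
                /* ... release pi_state ... */
        }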

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.801778516@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Currently futex-pi relies on hb->lock to serialize everything. But hb->lock
    creates another set of problems, especially priority inversions on RT where
    hb->lock becomes a rt_mutex itself.

    The rt_mutex::wait_lock is the most obvious protection for keeping the
    futex user space value and the kernel internal pi_state in sync.

    Rework and document the locking so rt_mutex::wait_lock is held across
    all operations which modify the user space value and the pi state.

    This allows invoking rt_mutex_unlock() (including deboost) without
    holding hb->lock as a next step.

    Nothing yet relies on the new locking rules.
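
    The rework establishes a lock nesting order along these lines (a
    sketch of the comment block added to kernel/futex.c):

        hb->lock
          pi_mutex->wait_lock
            p->pi_lock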

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.751993333@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Part of what makes futex_unlock_pi() intricate is that
    rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
    rt_mutex::wait_lock.

    This means it cannot rely on the atomicity of wait_lock, which would
    be preferred in order to not rely on hb->lock so much.

    The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can
    race with the rt_mutex fastpath, however futexes have their own fast path.

    Since futexes already have a bunch of separate rt_mutex accessors,
    complete that set and implement an rt_mutex variant without a fastpath
    for them.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Since the futex_q can disappear the instruction after assigning NULL,
    this really should be a RELEASE barrier. That stops loads from hitting
    dead memory too.
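
    Concretely, the plain store in mark_wake_futex() becomes a release
    store (sketch):

        /* was: q->lock_ptr = NULL; */
        smp_store_release(&q->lock_ptr, NULL);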

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.604296452@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • futex_top_waiter() returns the top-waiter on the pi_mutex. Assigning
    this to a variable 'match' totally obscures the code.
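
    I.e. the rename is simply (sketch):

        /* was: struct futex_q *match = futex_top_waiter(hb, &key); */
        struct futex_q *top_waiter = futex_top_waiter(hb, &key);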

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.554710645@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

15 Mar, 2017

2 commits

  • Thomas spotted that fixup_pi_state_owner() can return errors and we
    fail to unlock the rt_mutex in that case.

    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Darren Hart
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170304093558.867401760@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • While working on the futex code, I stumbled over this potential
    use-after-free scenario. Dmitry triggered it later with syzkaller.

    pi_mutex is a pointer into pi_state, which we drop the reference on in
    unqueue_me_pi(). So any access to that pointer after that is bad.

    Since other sites already do rt_mutex_unlock() with hb->lock held, see
    for example futex_lock_pi(), simply move the unlock before
    unqueue_me_pi().
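
    A sketch of the reordering:

        /* unlock before unqueue_me_pi() drops our pi_state reference */
        rt_mutex_unlock(&q.pi_state->pi_mutex);
        unqueue_me_pi(&q);      /* pi_mutex must not be touched after this */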

    Reported-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Darren Hart
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

02 Mar, 2017

2 commits

  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>,
    which will have to be picked up from other headers and a couple of .c
    files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.
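
    The placeholder is as trivial as described (a sketch, assuming the
    standard include-guard naming):

        /* include/linux/sched/mm.h -- placeholder until the code moves */
        #ifndef _LINUX_SCHED_MM_H
        #define _LINUX_SCHED_MM_H

        #include <linux/sched.h>

        #endif /* _LINUX_SCHED_MM_H */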

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • We are going to split <linux/sched/wake_q.h> out of <linux/sched.h>,
    which will have to be picked up from other headers and a couple of .c
    files.

    Create a trivial placeholder <linux/sched/wake_q.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)
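
    The helper is a one-line wrapper around the reference count:

        static inline void mmgrab(struct mm_struct *mm)
        {
                atomic_inc(&mm->mm_count);
        }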

    Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

13 Feb, 2017

1 commit

  • The UEVENT user mode helper is enabled before the initcalls are executed
    and is available when the root filesystem has been mounted.

    The user mode helper is triggered by device init calls and the executable
    might use the futex syscall.

    futex_init() is marked __initcall which maps to device_initcall, but there
    is no guarantee that futex_init() is invoked _before_ the first device init
    call which triggers the UEVENT user mode helper.

    If the user mode helper uses the futex syscall before futex_init() then the
    syscall crashes with a NULL pointer dereference because the futex subsystem
    has not been initialized yet.

    Move futex_init() to core_initcall so futexes are initialized before the
    root filesystem is mounted and the usermode helper becomes available.
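
    The change itself is a one-liner at the bottom of kernel/futex.c:

        /* was: __initcall(futex_init); */
        core_initcall(futex_init);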

    [ tglx: Rewrote changelog ]

    Signed-off-by: Yang Yang
    Cc: jiang.biao2@zte.com.cn
    Cc: jiang.zhengxiong@zte.com.cn
    Cc: zhong.weidong@zte.com.cn
    Cc: deng.huali@zte.com.cn
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1483085875-6130-1-git-send-email-yang.yang29@zte.com.cn
    Signed-off-by: Thomas Gleixner

    Yang Yang
     

26 Dec, 2016

1 commit

  • ktime is a union because the initial implementation stored the time in
    scalar nanoseconds on 64-bit machines and in an endianness-optimized
    timespec variant on 32-bit machines. The Y2038 cleanup removed the
    timespec variant and switched everything to scalar nanoseconds. The
    union remained, but became completely pointless.

    Get rid of the union and just keep ktime_t as a simple typedef of type
    s64.

    The conversion was done with coccinelle and some manual mopping up.
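
    Before and after (sketch of include/linux/ktime.h):

        /* before: a union with a single scalar member */
        union ktime {
                s64     tv64;
        };
        typedef union ktime ktime_t;

        /* after */
        typedef s64 ktime_t;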

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra

    Thomas Gleixner
     

21 Nov, 2016

1 commit

  • Currently the wake_q data structure is defined by the WAKE_Q() macro.
    This macro, however, looks like a function doing something, as "wake"
    is a verb. Even checkpatch.pl was confused, as it reported warnings
    like:

    WARNING: Missing a blank line after declarations
    #548: FILE: kernel/futex.c:3665:
    + int ret;
    + WAKE_Q(wake_q);

    This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies
    what the macro is doing and eliminates the checkpatch.pl warnings.
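
    Call sites change mechanically:

        /* was: WAKE_Q(wake_q); */
        DEFINE_WAKE_Q(wake_q);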

    Signed-off-by: Waiman Long
    Acked-by: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com
    [ Resolved conflict and added missing rename. ]
    Signed-off-by: Ingo Molnar

    Waiman Long
     

05 Sep, 2016

1 commit

  • Add some more comments and reformat existing ones to kernel-doc style.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Darren Hart
    Link: http://lkml.kernel.org/r/1464770609-30168-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

30 Jul, 2016

1 commit

  • To quote Rich on why there is no need for shared mapping on !MMU systems:

    |With MMU, shared futex keys need to identify the physical backing for
    |a memory address because it may be mapped at different addresses in
    |different processes (or even multiple times in the same process).
    |Without MMU this cannot happen. You only have physical addresses. So
    |the "private futex" behavior of using the virtual address as the key
    |is always correct (for both shared and private cases) on nommu
    |systems.

    This patch disables FLAGS_SHARED in a way that allows the compiler to
    remove that code.
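
    A sketch of how the flag definition achieves that (matching the
    description; exact placement may differ):

        #ifdef CONFIG_MMU
        # define FLAGS_SHARED           0x01
        #else
        /*
         * !MMU: virtual == physical address, so the private-futex key is
         * always correct; the compiler can elide the shared-futex code.
         */
        # define FLAGS_SHARED           0x00
        #endif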

    [bigeasy: Added changelog ]
    Reported-by: Rich Felker
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20160729143230.GA21715@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

09 Jun, 2016

1 commit

  • Mike Galbraith reported that the LTP test case futex_wake04 was broken
    by commit 65d8fc777f6d ("futex: Remove requirement for lock_page()
    in get_futex_key()").

    This test case uses futexes backed by hugetlbfs pages and so there is an
    associated inode with a futex stored on such pages. The problem is that
    the key is being calculated based on the head page index of the hugetlbfs
    page and not the tail page.

    Prior to the optimisation, the page lock was used to stabilise
    mappings and to pin the inode if the page is file-backed, which is
    overkill. If the page was a compound page, the head page was
    automatically looked up as part of the page lock operation, but the
    tail page index was used to calculate the futex key.

    After the optimisation, the compound head is looked up early and the page
    lock is only relied upon to identify truncated pages, special pages or a
    shmem page moving to swapcache. The head page is looked up because without
    the page lock, special care has to be taken to pin the inode correctly.
    However, the tail page is still required to calculate the futex key so
    this patch records the tail page.
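
    A sketch of the key calculation after the fix:

        struct page *tail = page;       /* remember the tail page */
        page = compound_head(page);
        /* ... stabilise the mapping via the page/inode ... */
        key->shared.pgoff = basepage_index(tail);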

    On vanilla 4.6, the output of the test case is:

    futex_wake04 0 TINFO : Hugepagesize 2097152
    futex_wake04 1 TFAIL : futex_wake04.c:126: Bug: wait_thread2 did not wake after 30 secs.

    With the patch applied:

    futex_wake04 0 TINFO : Hugepagesize 2097152
    futex_wake04 1 TPASS : Hi hydra, thread2 awake!

    Fixes: 65d8fc777f6d ("futex: Remove requirement for lock_page() in get_futex_key()")
    Reported-and-tested-by: Mike Galbraith
    Signed-off-by: Mel Gorman
    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Davidlohr Bueso
    Cc: Sebastian Andrzej Siewior
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160608132522.GM2469@suse.de
    Signed-off-by: Thomas Gleixner

    Mel Gorman
     

23 May, 2016

1 commit

  • I'm looking at trying to possibly merge the 32-bit and 64-bit versions
    of the x86 uaccess.h implementation, but first this needs to be cleaned
    up.

    For example, the 32-bit version of "__copy_from_user_inatomic()" is
    mostly the special cases for the constant size, and it's actually almost
    never relevant. Most users aren't actually using a constant size
    anyway, and the few cases that do small constant copies are better off
    just using __get_user() instead.

    So get rid of the unnecessary complexity.

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Apr, 2016

1 commit

  • Otherwise an incoming waker on the dest hash bucket can miss the
    waiter adding itself to the plist during the lockless check
    optimization (a small window, but still the correct way of doing
    this); similar to the decrement counterpart.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: bigeasy@linutronix.de
    Cc: dvhart@infradead.org
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1461208164-29150-1-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

20 Apr, 2016

1 commit

  • If userspace calls UNLOCK_PI unconditionally without trying the TID -> 0
    transition in user space first then the user space value might not have the
    waiters bit set. This opens the following race:

    CPU0                            CPU1

    uval = get_user(futex)

                                    lock(hb)
    lock(hb)
                                    futex |= FUTEX_WAITERS
                                    ....
                                    unlock(hb)

    cmpxchg(futex, uval, newval)

    So the cmpxchg fails and returns -EINVAL to user space, which is wrong because
    the futex value is valid.

    To handle this (yes, yet another) corner case gracefully, check for a flag
    change and retry.

    [ tglx: Massaged changelog and slightly reworked implementation ]

    Fixes: ccf9e6a80d9e ("futex: Make unlock_pi more robust")
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: stable@vger.kernel.org
    Cc: Davidlohr Bueso
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1460723739-5195-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

09 Mar, 2016

1 commit

  • Commit e91467ecd1ef ("bug in futex unqueue_me") introduced a barrier() in
    unqueue_me() to prevent the compiler from rereading the lock pointer which
    might change after a check for NULL.

    Replace the barrier() with a READ_ONCE() for the following reasons:

    1) READ_ONCE() is a weaker form of barrier() that affects only the
    specific load operation, while barrier() is a general compiler-level
    memory barrier. READ_ONCE() was not available at the time the barrier
    was added.

    2) Aside from that, READ_ONCE() is descriptive and self-explanatory
    while a barrier without a comment is not clear to the casual reader.

    No functional change.
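
    In unqueue_me(), the change looks like this (sketch):

        retry:
                /* was: lock_ptr = q->lock_ptr; barrier(); */
                lock_ptr = READ_ONCE(q->lock_ptr);
                if (lock_ptr != NULL) {
                        spin_lock(lock_ptr);
                        /* ... revalidate q->lock_ptr, retry if changed ... */
                }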

    [ tglx: Massaged changelog ]

    Signed-off-by: Jianyu Zhan
    Acked-by: Christian Borntraeger
    Acked-by: Darren Hart
    Cc: dave@stgolabs.net
    Cc: peterz@infradead.org
    Cc: linux@rasmusvillemoes.dk
    Cc: akpm@linux-foundation.org
    Cc: fengguang.wu@intel.com
    Cc: bigeasy@linutronix.de
    Link: http://lkml.kernel.org/r/1457314344-5685-1-git-send-email-nasa4836@gmail.com
    Signed-off-by: Thomas Gleixner

    Jianyu Zhan
     

17 Feb, 2016

2 commits

  • When dealing with key handling for shared futexes, we can drastically
    reduce the usage/need of the page lock. 1) For anonymous pages, the
    associated futex object is the mm_struct, which does not require the
    page lock. 2) For inode-based keys, we can check under the RCU read
    lock if the page mapping is still valid and take a reference to the
    inode. This just leaves one rare race that requires the page lock in
    the slow path when examining the swapcache.

    Additionally realtime users currently have a problem with the page lock being
    contended for unbounded periods of time during futex operations.

    Task A
    get_futex_key()
    lock_page()
    ---> preempted

    Now any other task trying to lock that page will have to wait until
    task A gets scheduled back in, which is an unbounded amount of time.

    With this patch, we pretty much have a lockless get_futex_key().

    Experiments show that this patch can boost/speedup the hashing of shared
    futexes with the perf futex benchmarks (which is good for measuring such
    change) by up to 45% when there are high (> 100) thread counts on a 60 core
    Westmere. Lower counts are pretty much in the noise range or less than 10%,
    but mid range can be seen at over 30% overall throughput (hash ops/sec).
    This makes anon-mem shared futexes much closer to their private
    counterpart.

    Signed-off-by: Mel Gorman
    [ Ported on top of thp refcount rework, changelog, comments, fixes. ]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Thomas Gleixner
    Cc: Chris Mason
    Cc: Darren Hart
    Cc: Hugh Dickins
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: dave@stgolabs.net
    Link: http://lkml.kernel.org/r/1455045314-8305-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • Ingo suggested we rename how we reference barriers A and B
    regarding futex ordering guarantees. This patch replaces,
    for both barriers, MB (A) with smp_mb(); (A), such that:

    - We explicitly state that the barriers are SMP, and

    - We standardize how we reference these across futex.c, helping
    readers follow which barrier does what and where.

    Suggested-by: Ingo Molnar
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Thomas Gleixner
    Cc: Chris Mason
    Cc: Darren Hart
    Cc: Hugh Dickins
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: dave@stgolabs.net
    Link: http://lkml.kernel.org/r/1455045314-8305-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

26 Jan, 2016

1 commit

  • Sasha reported a lockdep splat about a potential deadlock between RCU boosting
    rtmutex and the posix timer it_lock.

    CPU0                                    CPU1

    rtmutex_lock(&rcu->rt_mutex)
      spin_lock(&rcu->rt_mutex.wait_lock)
                                            local_irq_disable()
                                            spin_lock(&timer->it_lock)
                                            spin_lock(&rcu->mutex.wait_lock)
    --> Interrupt
        spin_lock(&timer->it_lock)

    This is caused by the following code sequence on CPU1:

        rcu_read_lock();
        x = lookup();
        if (x)
                spin_lock_irqsave(&x->it_lock);
        rcu_read_unlock();
        return x;

    We could fix that in the posix timer code by keeping rcu read locked across
    the spinlocked and irq disabled section, but the above sequence is common and
    there is no reason not to support it.

    Taking rt_mutex.wait_lock irq-safe prevents the deadlock.
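
    The conversion is mechanical; every wait_lock section becomes (sketch):

        unsigned long flags;

        /* was: raw_spin_lock(&lock->wait_lock); */
        raw_spin_lock_irqsave(&lock->wait_lock, flags);
        /* ... critical section ... */
        raw_spin_unlock_irqrestore(&lock->wait_lock, flags);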

    Reported-by: Sasha Levin
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney

    Thomas Gleixner
     

21 Jan, 2016

1 commit

  • By checking the effective credentials instead of the real UID / permitted
    capabilities, ensure that the calling process actually intended to use its
    credentials.

    To ensure that all ptrace checks use the correct caller credentials (e.g.
    in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
    flag), use two new flags and require one of them to be set.

    The problem was that when a privileged task had temporarily dropped its
    privileges, e.g. by calling setreuid(0, user_uid), with the intent to
    perform following syscalls with the credentials of a user, it still passed
    ptrace access checks that the user would not be able to pass.

    While an attacker should not be able to convince the privileged task to
    perform a ptrace() syscall, this is a problem because the ptrace access
    check is reused for things in procfs.

    In particular, the following somewhat interesting procfs entries only rely
    on ptrace access checks:

    /proc/$pid/stat - uses the check for determining whether pointers
        should be visible, useful for bypassing ASLR
    /proc/$pid/maps - also useful for bypassing ASLR
    /proc/$pid/cwd  - useful for gaining access to restricted
        directories that contain files with lax permissions, e.g. in
        this scenario:

        lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
        drwx------ root root /root
        drwxr-xr-x root root /root/foobar
        -rw-r--r-- root root /root/foobar/secret

    Therefore, on a system where a root-owned mode 6755 binary changes its
    effective credentials as described and then dumps a user-specified file,
    this could be used by an attacker to reveal the memory layout of root's
    processes or reveal the contents of files he is not allowed to access
    (through /proc/$pid/cwd).
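
    The two new flags and their composed forms, as a sketch of the
    <linux/ptrace.h> additions (exact values illustrative):

        #define PTRACE_MODE_FSCREDS     0x08
        #define PTRACE_MODE_REALCREDS   0x10

        #define PTRACE_MODE_READ_FSCREDS   (PTRACE_MODE_READ | PTRACE_MODE_FSCREDS)
        #define PTRACE_MODE_READ_REALCREDS (PTRACE_MODE_READ | PTRACE_MODE_REALCREDS)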

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Casey Schaufler
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Andy Shevchenko
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

16 Jan, 2016

1 commit

  • During Jason's work with postcopy migration support for s390 a problem
    regarding gmap faults was discovered.

    The gmap code will call fixup_user_fault, which will always end up in
    handle_mm_fault. Until now we never cared about retries, but since the
    userfaultfd code kind of relies on them, this needs a fix.

    This patchset does not take care of the futex code; I will now look
    closer at that.

    This patch (of 2):

    With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
    to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
    faulting we ever unlocked mmap_sem.

    This patch brings in the logic to handle retries and cleans up the
    current documentation. fixup_user_fault did not have the same
    semantics as filemap_fault: it never indicated whether a retry
    happened, so a caller wasn't able to handle that case. We now change
    the behaviour to always retry a locked mmap_sem.
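
    The prototype grows a fault-flags argument and an 'unlocked'
    out-parameter (sketch):

        int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
                             unsigned long address, unsigned int fault_flags,
                             bool *unlocked);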

    Signed-off-by: Dominik Dingel
    Reviewed-by: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Christian Borntraeger
    Cc: "Jason J. Herne"
    Cc: David Rientjes
    Cc: Eric B Munson
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Heiko Carstens
    Cc: Dominik Dingel
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dominik Dingel