05 Sep, 2016

1 commit

  • Add some more comments and reformat existing ones to kernel doc style.
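
    As a reminder, kernel-doc style comments have this shape (a generic
    sketch for illustration, not the actual comments added by the patch):

        /**
         * queue_me() - Enqueue the futex_q on its hash bucket
         * @q:   The futex_q to enqueue
         * @hb:  The destination hash bucket
         *
         * Must be called with hb->lock held by the caller.
         */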

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Reviewed-by: Darren Hart
    Link: http://lkml.kernel.org/r/1464770609-30168-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

30 Jul, 2016

1 commit

  • To quote Rich Felker on why there is no need for shared mapping on !MMU systems:

    |With MMU, shared futex keys need to identify the physical backing for
    |a memory address because it may be mapped at different addresses in
    |different processes (or even multiple times in the same process).
    |Without MMU this cannot happen. You only have physical addresses. So
    |the "private futex" behavior of using the virtual address as the key
    |is always correct (for both shared and private cases) on nommu
    |systems.

    This patch disables FLAGS_SHARED in a way that allows the compiler to
    remove that code.
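
    A minimal sketch of how that is achieved (shape per the mainline change):

        #ifdef CONFIG_MMU
        # define FLAGS_SHARED  0x01
        #else
        /*
         * NOMMU has no per-process address space, so the compiler can
         * optimize the shared-futex paths away entirely.
         */
        # define FLAGS_SHARED  0x00
        #endif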

    [bigeasy: Added changelog ]
    Reported-by: Rich Felker
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20160729143230.GA21715@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

09 Jun, 2016

1 commit

  • Mike Galbraith reported that the LTP test case futex_wake04 was broken
    by commit 65d8fc777f6d ("futex: Remove requirement for lock_page()
    in get_futex_key()").

    This test case uses futexes backed by hugetlbfs pages and so there is an
    associated inode with a futex stored on such pages. The problem is that
    the key is being calculated based on the head page index of the hugetlbfs
    page and not the tail page.

    Prior to the optimisation, the page lock was used to stabilise mappings and
    to pin the inode if the page was file-backed, which is overkill. If the page
    was a compound page, the head page was automatically looked up as part of
    the page lock operation, but the tail page index was used to calculate the
    futex key.

    After the optimisation, the compound head is looked up early and the page
    lock is only relied upon to identify truncated pages, special pages or a
    shmem page moving to swapcache. The head page is looked up because without
    the page lock, special care has to be taken to pin the inode correctly.
    However, the tail page is still required to calculate the futex key so
    this patch records the tail page.
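
    A sketch of the resulting logic in get_futex_key() (simplified):

        struct page *page, *tail;

        ...
        tail = page;                              /* remember the tail page */
        page = compound_head(page);               /* head stabilises the mapping */
        ...
        key->shared.pgoff = basepage_index(tail); /* key from the tail index */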

    On vanilla 4.6, the output of the test case is:

    futex_wake04 0 TINFO : Hugepagesize 2097152
    futex_wake04 1 TFAIL : futex_wake04.c:126: Bug: wait_thread2 did not wake after 30 secs.

    With the patch applied:

    futex_wake04 0 TINFO : Hugepagesize 2097152
    futex_wake04 1 TPASS : Hi hydra, thread2 awake!

    Fixes: 65d8fc777f6d "futex: Remove requirement for lock_page() in get_futex_key()"
    Reported-and-tested-by: Mike Galbraith
    Signed-off-by: Mel Gorman
    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Davidlohr Bueso
    Cc: Sebastian Andrzej Siewior
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160608132522.GM2469@suse.de
    Signed-off-by: Thomas Gleixner

    Mel Gorman
     

23 May, 2016

1 commit

  • I'm looking at trying to possibly merge the 32-bit and 64-bit versions
    of the x86 uaccess.h implementation, but first this needs to be cleaned
    up.

    For example, the 32-bit version of "__copy_from_user_inatomic()" is
    mostly special cases for constant sizes, which are almost never
    relevant. Most users aren't actually using a constant size
    anyway, and the few cases that do small constant copies are better off
    just using __get_user() instead.

    So get rid of the unnecessary complexity.
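
    For a small constant-size copy, the __get_user() form referred to above
    looks like this (illustrative):

        u32 val;

        if (__get_user(val, (u32 __user *)uaddr))
                return -EFAULT;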

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

21 Apr, 2016

1 commit

  • Otherwise an incoming waker on the destination hash bucket can miss
    the waiter adding itself to the plist during the lockless
    check optimization (a small window, but this is still the correct way
    of doing it); similarly to the decrement counterpart.
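
    A sketch of the fixed ordering in requeue_futex(), matching the shape of
    the mainline change:

        hb_waiters_inc(hb2);    /* announce the waiter on the destination
                                 * bucket before touching the plists */
        plist_del(&q->list, &hb1->chain);
        hb_waiters_dec(hb1);
        plist_add(&q->list, &hb2->chain);
        q->lock_ptr = &hb2->lock;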

    Suggested-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Cc: Davidlohr Bueso
    Cc: bigeasy@linutronix.de
    Cc: dvhart@infradead.org
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1461208164-29150-1-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

20 Apr, 2016

1 commit

  • If userspace calls UNLOCK_PI unconditionally without trying the TID -> 0
    transition in user space first then the user space value might not have the
    waiters bit set. This opens the following race:

    CPU0                              CPU1
    uval = get_user(futex)
                                      lock(hb)
    lock(hb)
                                      futex |= FUTEX_WAITERS
                                      ....
                                      unlock(hb)

    cmpxchg(futex, uval, newval)

    So the cmpxchg fails and returns -EINVAL to user space, which is wrong because
    the futex value is valid.

    To handle this (yes, yet another) corner case gracefully, check for a flag
    change and retry.
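
    A sketch of the retry check (shape per the mainline fix):

        if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval)) {
                ret = -EFAULT;
        } else if (curval != uval) {
                /*
                 * If only the FUTEX_WAITERS bit was set between get_user()
                 * and locking the hash bucket, retry instead of failing.
                 */
                if ((FUTEX_TID_MASK & curval) == uval)
                        ret = -EAGAIN;
                else
                        ret = -EINVAL;
        }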

    [ tglx: Massaged changelog and slightly reworked implementation ]

    Fixes: ccf9e6a80d9e ("futex: Make unlock_pi more robust")
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: stable@vger.kernel.org
    Cc: Davidlohr Bueso
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1460723739-5195-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

09 Mar, 2016

1 commit

  • Commit e91467ecd1ef ("bug in futex unqueue_me") introduced a barrier() in
    unqueue_me() to prevent the compiler from rereading the lock pointer which
    might change after a check for NULL.

    Replace the barrier() with a READ_ONCE() for the following reasons:

    1) READ_ONCE() is a weaker form of barrier() that affects only the specific
    load operation, while barrier() is a general compiler-level memory barrier.
    READ_ONCE() was not available at the time the barrier was added.

    2) Aside from that, READ_ONCE() is descriptive and self-explanatory, while
    a barrier without a comment is not clear to the casual reader.
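
    The change in unqueue_me() thus boils down to (sketch):

        /* before */
        lock_ptr = q->lock_ptr;
        barrier();

        /* after */
        lock_ptr = READ_ONCE(q->lock_ptr);
        if (lock_ptr != NULL) {
                spin_lock(lock_ptr);
                ...
        }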

    No functional change.

    [ tglx: Massaged changelog ]

    Signed-off-by: Jianyu Zhan
    Acked-by: Christian Borntraeger
    Acked-by: Darren Hart
    Cc: dave@stgolabs.net
    Cc: peterz@infradead.org
    Cc: linux@rasmusvillemoes.dk
    Cc: akpm@linux-foundation.org
    Cc: fengguang.wu@intel.com
    Cc: bigeasy@linutronix.de
    Link: http://lkml.kernel.org/r/1457314344-5685-1-git-send-email-nasa4836@gmail.com
    Signed-off-by: Thomas Gleixner

    Jianyu Zhan
     

17 Feb, 2016

2 commits

  • When dealing with key handling for shared futexes, we can drastically reduce
    the usage/need of the page lock. 1) For anonymous pages, the associated futex
    object is the mm_struct, which does not require the page lock. 2) For
    inode-based keys, we can check under the RCU read lock whether the page
    mapping is still valid and take a reference to the inode. This just leaves
    one rare race that requires the page lock in the slow path when examining
    the swapcache.

    Additionally realtime users currently have a problem with the page lock being
    contended for unbounded periods of time during futex operations.

    Task A
    get_futex_key()
    lock_page()
    ---> preempted

    Now any other task trying to lock that page will have to wait until
    task A gets scheduled back in, which is an unbound time.

    With this patch, we pretty much have a lockless get_futex_key().

    Experiments show that this patch can boost/speedup the hashing of shared
    futexes with the perf futex benchmarks (which are good for measuring such
    changes) by up to 45% when there are high (> 100) thread counts on a 60 core
    Westmere. Lower counts are pretty much in the noise range or less than 10%,
    but the mid range can be seen at over 30% overall throughput (hash ops/sec).
    This makes anon-mem shared futexes much closer to their private counterpart.
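
    A simplified sketch of the lockless inode path described above (error
    handling and the swapcache slow path omitted):

        rcu_read_lock();
        if (READ_ONCE(page->mapping) != mapping) {
                rcu_read_unlock();
                goto again;                     /* raced with truncation */
        }
        inode = READ_ONCE(mapping->host);
        if (!inode || !atomic_inc_not_zero(&inode->i_count)) {
                rcu_read_unlock();
                goto again;                     /* inode is being freed */
        }
        rcu_read_unlock();                      /* inode pinned, no lock_page() */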

    Signed-off-by: Mel Gorman
    [ Ported on top of thp refcount rework, changelog, comments, fixes. ]
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Thomas Gleixner
    Cc: Chris Mason
    Cc: Darren Hart
    Cc: Hugh Dickins
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: dave@stgolabs.net
    Link: http://lkml.kernel.org/r/1455045314-8305-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • Ingo suggested we rename how we reference barriers A and B
    regarding futex ordering guarantees. This patch replaces,
    for both barriers, MB (A) with smp_mb(); (A), such that:

    - We explicitly state that the barriers are SMP, and

    - We standardize how we reference these across futex.c,
    helping readers follow which barrier does what and where (see the
    example below).
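
    In code, a site previously annotated with a bare 'MB (A)' comment now
    reads, for example:

        smp_mb(); /* (A) <-- paired with (B) */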

    Suggested-by: Ingo Molnar
    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Thomas Gleixner
    Cc: Chris Mason
    Cc: Darren Hart
    Cc: Hugh Dickins
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: dave@stgolabs.net
    Link: http://lkml.kernel.org/r/1455045314-8305-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

26 Jan, 2016

1 commit

  • Sasha reported a lockdep splat about a potential deadlock between RCU boosting
    rtmutex and the posix timer it_lock.

    CPU0                                    CPU1

    rtmutex_lock(&rcu->rt_mutex)
      spin_lock(&rcu->rt_mutex.wait_lock)
                                            local_irq_disable()
                                            spin_lock(&timer->it_lock)
                                            spin_lock(&rcu->mutex.wait_lock)
    --> Interrupt
        spin_lock(&timer->it_lock)

    This is caused by the following code sequence on CPU1

    rcu_read_lock()
    x = lookup();
    if (x)
            spin_lock_irqsave(&x->it_lock);
    rcu_read_unlock();
    return x;

    We could fix that in the posix timer code by keeping rcu read locked across
    the spinlocked and irq disabled section, but the above sequence is common and
    there is no reason not to support it.

    Taking rt_mutex.wait_lock irq safe prevents the deadlock.
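
    Concretely, 'irq safe' means the wait_lock is now taken with interrupts
    disabled (a sketch, names per the mainline rtmutex code):

        unsigned long flags;

        raw_spin_lock_irqsave(&lock->wait_lock, flags);
        /* ... inspect/modify the waiter state ... */
        raw_spin_unlock_irqrestore(&lock->wait_lock, flags);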

    Reported-by: Sasha Levin
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney

    Thomas Gleixner
     

21 Jan, 2016

1 commit

  • By checking the effective credentials instead of the real UID / permitted
    capabilities, ensure that the calling process actually intended to use its
    credentials.

    To ensure that all ptrace checks use the correct caller credentials (e.g.
    in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
    flag), use two new flags and require one of them to be set.
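
    A sketch of a converted caller (in mainline the two new flags are
    PTRACE_MODE_REALCREDS and PTRACE_MODE_FSCREDS; procfs-style file system
    accesses use the latter, via the combined PTRACE_MODE_READ_FSCREDS):

        if (!ptrace_may_access(task, PTRACE_MODE_READ_FSCREDS))
                return -EACCES;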

    The problem was that when a privileged task had temporarily dropped its
    privileges, e.g. by calling setreuid(0, user_uid), with the intent to
    perform following syscalls with the credentials of a user, it still passed
    ptrace access checks that the user would not be able to pass.

    While an attacker should not be able to convince the privileged task to
    perform a ptrace() syscall, this is a problem because the ptrace access
    check is reused for things in procfs.

    In particular, the following somewhat interesting procfs entries only rely
    on ptrace access checks:

    /proc/$pid/stat - uses the check for determining whether pointers
        should be visible, useful for bypassing ASLR
    /proc/$pid/maps - also useful for bypassing ASLR
    /proc/$pid/cwd - useful for gaining access to restricted
        directories that contain files with lax permissions, e.g. in
        this scenario:

        lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
        drwx------ root root /root
        drwxr-xr-x root root /root/foobar
        -rw-r--r-- root root /root/foobar/secret

    Therefore, on a system where a root-owned mode 6755 binary changes its
    effective credentials as described and then dumps a user-specified file,
    this could be used by an attacker to reveal the memory layout of root's
    processes or reveal the contents of files he is not allowed to access
    (through /proc/$pid/cwd).

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: Jann Horn
    Acked-by: Kees Cook
    Cc: Casey Schaufler
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Andy Shevchenko
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: "Eric W. Biederman"
    Cc: Willy Tarreau
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jann Horn
     

16 Jan, 2016

2 commits

  • During Jason's work with postcopy migration support for s390 a problem
    regarding gmap faults was discovered.

    The gmap code will call fixup_user_fault(), which will always end up in
    handle_mm_fault(). Until now we never cared about retries, but since the
    userfaultfd code kind of relies on them, this needs a fix.

    This patchset does not take care of the futex code. I will now look
    closer at this.

    This patch (of 2):

    With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
    to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
    faulting we ever unlocked mmap_sem.

    This patch brings in the logic to handle retries as well as cleaning up
    the current documentation. fixup_user_fault() did not have the same
    semantics as filemap_fault(). It never indicated whether a retry happened,
    so a caller wasn't able to handle that case. We therefore changed the
    behaviour to always retry on a locked mmap_sem.
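
    With the change, a caller looks roughly like this (signature as introduced
    by this series; 'unlocked' reports whether mmap_sem was dropped):

        bool unlocked = false;
        int ret;

        down_read(&mm->mmap_sem);
        ret = fixup_user_fault(current, mm, addr, FAULT_FLAG_WRITE,
                               &unlocked);
        if (unlocked) {
                /* mmap_sem was dropped and re-taken: a retry happened */
        }
        up_read(&mm->mmap_sem);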

    Signed-off-by: Dominik Dingel
    Reviewed-by: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Christian Borntraeger
    Cc: "Jason J. Herne"
    Cc: David Rientjes
    Cc: Eric B Munson
    Cc: Naoya Horiguchi
    Cc: Mel Gorman
    Cc: Heiko Carstens
    Cc: Dominik Dingel
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dominik Dingel
     
  • With the new THP refcounting, we don't need tricks to stabilize a huge
    page. If we've got a reference to a tail page, it can't split under us.

    This patch effectively reverts a5b338f2b0b1 ("thp: update futex compound
    knowledge").

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Sasha Levin
    Tested-by: Aneesh Kumar K.V
    Acked-by: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Steve Capper
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Christoph Lameter
    Cc: David Rientjes
    Tested-by: Artem Savkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

20 Dec, 2015

6 commits

  • While reviewing Michael Kerrisk's recent futex manpage update, I noticed
    that we allow the FUTEX_CLOCK_REALTIME flag for FUTEX_WAIT_BITSET but
    not for FUTEX_WAIT.

    FUTEX_WAIT is treated internally as a simple version of FUTEX_WAIT_BITSET
    (with a bitmask of FUTEX_BITSET_MATCH_ANY). As such, I cannot
    come up with a reason for this exclusion for FUTEX_WAIT.

    This change does modify the behavior of the futex syscall, changing a
    call with FUTEX_WAIT | FUTEX_CLOCK_REALTIME from returning -ENOSYS, to be
    equivalent to FUTEX_WAIT_BITSET | FUTEX_CLOCK_REALTIME with a bitset of
    FUTEX_BITSET_MATCH_ANY.
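
    In user space terms (illustrative, error handling omitted), this call used
    to fail with -ENOSYS and is now accepted:

        syscall(SYS_futex, uaddr, FUTEX_WAIT | FUTEX_CLOCK_REALTIME,
                expected, &timeout, NULL, 0);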

    Reported-by: Michael Kerrisk
    Signed-off-by: Darren Hart
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/9f3bdc116d79d23f5ee72ceb9a2a857f5ff8fa29.1450474525.git.dvhart@linux.intel.com
    Signed-off-by: Thomas Gleixner

    Darren Hart
     
  • The out_unlock: label not only drops the locks, it also drops the refcount
    on the pi_state. Really intuitive.

    Move the label after the put_pi_state() call and use 'break' in the
    error handling path of the requeue loop.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Bhuvanesh_Surachari@mentor.com
    Cc: Andy Lowe
    Link: http://lkml.kernel.org/r/20151219200607.526665141@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • In the error handling cases we neither have pi_state nor a reference
    to it. Remove the pointless code.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Bhuvanesh_Surachari@mentor.com
    Cc: Andy Lowe
    Link: http://lkml.kernel.org/r/20151219200607.432780944@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Documentation of the pi_state refcounting in the requeue code is
    non-existent. Add it.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Bhuvanesh_Surachari@mentor.com
    Cc: Andy Lowe
    Link: http://lkml.kernel.org/r/20151219200607.335938312@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • free_pi_state() is confusing as it is in fact only freeing/caching the
    pi state when the last reference is gone. Rename it to put_pi_state()
    which reflects better what it is doing.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Bhuvanesh_Surachari@mentor.com
    Cc: Andy Lowe
    Link: http://lkml.kernel.org/r/20151219200607.259636467@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • If the proxy lock in the requeue loop acquires the rtmutex for a
    waiter, then it also acquired a refcount on the pi_state related to the
    futex, but the waiter side does not drop the reference count.

    Add the missing free_pi_state() call.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Bhuvanesh_Surachari@mentor.com
    Cc: Andy Lowe
    Link: http://lkml.kernel.org/r/20151219200607.178132067@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     

05 Nov, 2015

1 commit

  • Pull driver core updates from Greg KH:
    "Here's the "big" driver core updates for 4.4-rc1. Primarily a bunch
    of debugfs updates, with a smattering of minor driver core fixes and
    updates as well.

    All have been in linux-next for a long time"

    * tag 'driver-core-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    debugfs: Add debugfs_create_ulong()
    of: to support binding numa node to specified device in devicetree
    debugfs: Add read-only/write-only bool file ops
    debugfs: Add read-only/write-only size_t file ops
    debugfs: Add read-only/write-only x64 file ops
    debugfs: Consolidate file mode checks in debugfs_create_*()
    Revert "mm: Check if section present during memory block (un)registering"
    driver-core: platform: Provide helpers for multi-driver modules
    mm: Check if section present during memory block (un)registering
    devres: fix a for loop bounds check
    CMA: fix CONFIG_CMA_SIZE_MBYTES overflow in 64bit
    base/platform: assert that dev_pm_domain callbacks are called unconditionally
    sysfs: correctly handle short reads on PREALLOC attrs.
    base: soc: siplify ida usage
    kobject: move EXPORT_SYMBOL() macros next to corresponding definitions
    kobject: explain what kobject's sd field is
    debugfs: document that debugfs_remove*() accepts NULL and error values
    debugfs: Pass bool pointer to debugfs_create_bool()
    ACPI / EC: Fix broken 64bit big-endian users of 'global_lock'

    Linus Torvalds
     

04 Oct, 2015

1 commit

  • It's a bit odd that debugfs_create_bool() takes 'u32 *' as an argument,
    when all it needs is a boolean pointer.

    It would be better to update this API to make it accept 'bool *'
    instead, as that will make it more consistent and often more convenient.
    On top of that, a bool takes just a byte.

    That required updating all user sites as well, in the same commit as the
    API change. The regmap core was also using
    debugfs_{read|write}_file_bool() directly, and its variable types were
    updated to bool as well.
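
    After the change the prototype reads:

        struct dentry *debugfs_create_bool(const char *name, umode_t mode,
                                           struct dentry *parent,
                                           bool *value);   /* was: u32 *value */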

    Signed-off-by: Viresh Kumar
    Acked-by: Mark Brown
    Acked-by: Charles Keepax
    Signed-off-by: Greg Kroah-Hartman

    Viresh Kumar
     

22 Sep, 2015

1 commit

  • futex_hash() references two global variables: the base pointer
    futex_queues and the size of the array futex_hashsize. The latter is
    marked __read_mostly, while the former is not, so they are likely to
    end up very far from each other. This means that futex_hash() is
    likely to encounter two cache misses.

    We could mark futex_queues as __read_mostly as well, but that doesn't
    guarantee they'll end up next to each other (and even if they do, they
    may still end up in different cache lines). So put the two variables
    in a small singleton struct with sufficient alignment and mark that as
    __read_mostly.
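
    The shape of that singleton (close to the mainline change):

        static struct {
                struct futex_hash_bucket *queues;
                unsigned long             hashsize;
        } __futex_data __read_mostly __aligned(2*sizeof(long));

        #define futex_queues   (__futex_data.queues)
        #define futex_hashsize (__futex_data.hashsize)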

    Signed-off-by: Rasmus Villemoes
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Cc: kbuild test robot
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/1441834601-13633-1-git-send-email-linux@rasmusvillemoes.dk
    Signed-off-by: Thomas Gleixner

    Rasmus Villemoes
     

20 Jul, 2015

2 commits

  • Although futexes are well known for being a royal pita,
    we really have very few debugging capabilities - except
    for relying on tglx's eye half the time.

    By simply making use of the existing fault-injection machinery,
    we can improve this situation, allowing us to generate artificial
    uaddr faults and deadlock scenarios. Of course, when this is
    disabled in production systems, the overhead for the failure checks
    is practically zero -- so this is very cheap at the same time. As
    future work, it would be nice to enhance trinity to make use of
    this.

    There is a special tunable 'ignore-private', which can filter
    out private futexes. Given the tsk->make_it_fail filter and
    this option, pi futexes can be narrowed down pretty closely.
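
    A sketch of the failure hook with that filter (shape per this changelog):

        static bool should_fail_futex(bool fshared)
        {
                if (fail_futex.ignore_private && !fshared)
                        return false;   /* tunable filters private futexes */

                return should_fail(&fail_futex.attr, 1);
        }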

    Signed-off-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/1435645562-975-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     
  • ... serves a bit better to clarify between blocking
    and non-blocking code paths.

    Signed-off-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/1435645562-975-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

25 Jun, 2015

1 commit

  • Pull locking updates from Thomas Gleixner:
    "These locking updates depend on the alreay merged sched/core branch:

    - Lockless top waiter wakeup for rtmutex (Davidlohr)

    - Reduce hash bucket lock contention for PI futexes (Sebastian)

    - Documentation update (Davidlohr)"

    * 'sched-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/rtmutex: Update stale plist comments
    futex: Lower the lock contention on the HB lock during wake up
    locking/rtmutex: Implement lockless top-waiter wakeup

    Linus Torvalds
     

23 Jun, 2015

2 commits

  • Pull timer updates from Thomas Gleixner:
    "A rather largish update for everything time and timer related:

    - Cache footprint optimizations for both hrtimers and timer wheel

    - Lower the NOHZ impact on systems which have NOHZ or timer migration
    disabled at runtime.

    - Optimize run time overhead of hrtimer interrupt by making the clock
    offset updates smarter

    - hrtimer cleanups and removal of restrictions to tackle some
    problems in sched/perf

    - Some more leap second tweaks

    - Another round of changes addressing the 2038 problem

    - First step to change the internals of clock event devices by
    introducing the necessary infrastructure

    - Allow constant folding for usecs/msecs_to_jiffies()

    - The usual pile of clockevent/clocksource driver updates

    The hrtimer changes contain updates to sched, perf and x86 as they
    depend on them plus changes all over the tree to cleanup API changes
    and redundant code, which got copied all over the place. The y2038
    changes touch s390 to remove the last non 2038 safe code related to
    boot/persistent clock"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    clocksource: Increase dependencies of timer-stm32 to limit build wreckage
    timer: Minimize nohz off overhead
    timer: Reduce timer migration overhead if disabled
    timer: Stats: Simplify the flags handling
    timer: Replace timer base by a cpu index
    timer: Use hlist for the timer wheel hash buckets
    timer: Remove FIFO "guarantee"
    timers: Sanitize catchup_timer_jiffies() usage
    hrtimer: Allow hrtimer::function() to free the timer
    seqcount: Introduce raw_write_seqcount_barrier()
    seqcount: Rename write_seqcount_barrier()
    hrtimer: Fix hrtimer_is_queued() hole
    hrtimer: Remove HRTIMER_STATE_MIGRATE
    selftest: Timers: Avoid signal deadlock in leap-a-day
    timekeeping: Copy the shadow-timekeeper over the real timekeeper last
    clockevents: Check state instead of mode in suspend/resume path
    selftests: timers: Add leap-second timer edge testing to leap-a-day.c
    ntp: Do leapsecond adjustment in adjtimex read path
    time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
    ntp: Introduce and use SECS_PER_DAY macro instead of 86400
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes are:

    - lockless wakeup support for futexes and IPC message queues
    (Davidlohr Bueso, Peter Zijlstra)

    - Replace spinlocks with atomics in thread_group_cputimer(), to
    improve scalability (Jason Low)

    - NUMA balancing improvements (Rik van Riel)

    - SCHED_DEADLINE improvements (Wanpeng Li)

    - clean up and reorganize preemption helpers (Frederic Weisbecker)

    - decouple page fault disabling machinery from the preemption
    counter, to improve debuggability and robustness (David
    Hildenbrand)

    - SCHED_DEADLINE documentation updates (Luca Abeni)

    - topology CPU masks cleanups (Bartosz Golaszewski)

    - /proc/sched_debug improvements (Srikar Dronamraju)"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
    sched/deadline: Remove needless parameter in dl_runtime_exceeded()
    sched: Remove superfluous resetting of the p->dl_throttled flag
    sched/deadline: Drop duplicate init_sched_dl_class() declaration
    sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
    sched/deadline: Make init_sched_dl_class() __init
    sched/deadline: Optimize pull_dl_task()
    sched/preempt: Add static_key() to preempt_notifiers
    sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
    sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
    sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
    sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
    sched/debug: Properly format runnable tasks in /proc/sched_debug
    sched/numa: Only consider less busy nodes as numa balancing destinations
    Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")
    sched/fair: Prevent throttling in early pick_next_task_fair()
    preempt: Reorganize the notrace definitions a bit
    preempt: Use preempt_schedule_context() as the official tracing preemption point
    sched: Make preempt_schedule_context() function-tracing safe
    x86: Remove cpu_sibling_mask() and cpu_core_mask()
    x86: Replace cpu_**_mask() with topology_**_cpumask()
    ...

    Linus Torvalds
     

20 Jun, 2015

1 commit

  • wake_futex_pi() wakes the task before releasing the hash bucket lock
    (HB). The first thing the woken up task usually does is to acquire the
    lock, which requires the HB lock. On SMP systems this leads to blocking
    on the HB lock, which is released by the owner shortly after.
    This patch rearranges the unlock path by first releasing the HB lock and
    then waking up the task.
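
    Schematically, the unlock path becomes (simplified; wake_q names assumed
    from the then-current wake-queue infrastructure):

        spin_unlock(&hb->lock);   /* drop the HB lock first ...       */
        wake_up_q(&wake_q);       /* ... then wake the new lock owner */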

    [ tglx: Fixed up the rtmutex unlock path ]

    Originally-from: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20150617083350.GA2433@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

19 May, 2015

1 commit

  • Since set_mb() is really about an smp_mb() -- not an IO/DMA barrier
    like mb() -- rename it to match the recent smp_load_acquire() and
    smp_store_release().
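
    Call sites change accordingly (the new name in mainline is smp_store_mb()):

        set_mb(var, value);         /* before */
        smp_store_mb(var, value);   /* after  */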

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

08 May, 2015

1 commit

  • Given the overall futex architecture, any chance of reducing
    hb->lock contention is welcome. In this particular case, using
    wake-queues to enable lockless wakeups addresses very real
    world performance concerns, even cases of soft-lockups with large
    numbers of blocked tasks (which are not hard to find in
    large boxes using just a handful of futexes).

    At the lowest level, this patch can reduce the latency of a single
    thread attempting to acquire hb->lock in highly contended scenarios
    by up to 2x. At lower counts of nr_wake there are no regressions,
    confirming, of course, that the wake_q handling overhead is practically
    non-existent. For instance, while there is a fair amount of variation,
    the extended perf-bench wakeup benchmark shows for a 20 core machine
    the following avg per-thread time to wake up its share of tasks:

    nr_thr   ms-before   ms-after
      16      0.0590      0.0215
      32      0.0396      0.0220
      48      0.0417      0.0182
      64      0.0536      0.0236
      80      0.0414      0.0097
      96      0.0672      0.0152

    Naturally, this can cause spurious wakeups. However there is no core code
    that cannot handle them afaict, and furthermore tglx does have the point
    that other events can already trigger them anyway.
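
    A sketch of the wake-queue pattern this relies on (WAKE_Q() was the macro
    name at the time; it was later renamed DEFINE_WAKE_Q()):

        WAKE_Q(wake_q);

        spin_lock(&hb->lock);
        wake_q_add(&wake_q, this->task);  /* only marks the task for wakeup */
        spin_unlock(&hb->lock);

        wake_up_q(&wake_q);               /* actual wakeups, hb->lock not held */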

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Chris Mason
    Cc: Davidlohr Bueso
    Cc: George Spelvin
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Manfred Spraul
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1430494072-30283-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

22 Apr, 2015

1 commit

  • The check for hrtimer_active() after starting the timer is
    pointless. If the timer is inactive it has expired already and
    therefore the task pointer is already NULL.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20150414203502.985825453@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

18 Feb, 2015

1 commit

  • attach_to_pi_owner() checks p->mm to prevent attaching to kthreads, and
    this looks doubly wrong:

    1. It should actually check PF_KTHREAD; a kthread can do use_mm().

    2. If this task is not a kthread and is actually the lock owner, we can
    wrongly return -EPERM instead of -ESRCH or retry-if-EAGAIN.

    And note that this wrong -EPERM is the likely case unless the exiting
    task is (auto)reaped quickly; we check ->mm before PF_EXITING.
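
    A sketch of the corrected check (per point 1 above):

        if (unlikely(p->flags & PF_KTHREAD)) {
                /* was: if (!p->mm), which also caught exiting user tasks */
                put_task_struct(p);
                return -EPERM;
        }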

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Catalin Marinas
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Jerome Marchand
    Cc: Larry Woodman
    Cc: Linus Torvalds
    Cc: Mateusz Guzik
    Link: http://lkml.kernel.org/r/20150202140536.GA26406@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

13 Feb, 2015

1 commit

  • If an attacker can cause a controlled kernel stack overflow, overwriting
    the restart block is a very juicy exploit target. This is because the
    restart_block is held in the same memory allocation as the kernel stack.

    Moving the restart block to struct task_struct prevents this exploit by
    making the restart_block harder to locate.

    Note that there are other fields in thread_info that are also easy
    targets, at least on some architectures.

    It's also a decent simplification, since the restart code is more or less
    identical on all architectures.
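
    After the move, the generic access pattern changes like this (illustrative):

        /* before */
        current_thread_info()->restart_block.fn = do_no_restart_syscall;

        /* after */
        current->restart_block.fn = do_no_restart_syscall;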

    [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
    Signed-off-by: Andy Lutomirski
    Cc: Thomas Gleixner
    Cc: Al Viro
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Kees Cook
    Cc: David Miller
    Acked-by: Richard Weinberger
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Vineet Gupta
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Steven Miao
    Cc: Mark Salter
    Cc: Aurelien Jacquiot
    Cc: Mikael Starvik
    Cc: Jesper Nilsson
    Cc: David Howells
    Cc: Richard Kuo
    Cc: "Luck, Tony"
    Cc: Geert Uytterhoeven
    Cc: Michal Simek
    Cc: Ralf Baechle
    Cc: Jonas Bonn
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Acked-by: Michael Ellerman (powerpc)
    Tested-by: Michael Ellerman (powerpc)
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Chen Liqin
    Cc: Lennox Wu
    Cc: Chris Metcalf
    Cc: Guan Xuetao
    Cc: Chris Zankel
    Cc: Max Filippov
    Cc: Oleg Nesterov
    Cc: Guenter Roeck
    Signed-off-by: James Hogan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Lutomirski
     

19 Jan, 2015

1 commit

  • This patch fixes two separate buglets in calls to futex_lock_pi():

    * Eliminate unused 'detect' argument
    * Change unused 'timeout' argument of FUTEX_TRYLOCK_PI to NULL

    The 'detect' argument of futex_lock_pi() seems never to have been
    used (when it was included with the initial PI mutex implementation
    in Linux 2.6.18, all checks against its value were disabled by
    ANDing against 0 (i.e., if (detect... && 0)), and with
    commit 778e9a9c3e7193ea9f434f382947155ffb59c755, any mention of
    this argument in futex_lock_pi() went away altogether. Its presence
    now serves only to confuse readers of the code, by giving the
    impression that the futex() FUTEX_LOCK_PI operation actually does
    use the 'val' argument. This patch removes the argument.

    The futex_lock_pi() call that corresponds to FUTEX_TRYLOCK_PI includes
    'timeout' as one of its arguments. This misleads the reader into thinking
    that the FUTEX_TRYLOCK_PI operation does employ timeouts for some sensible
    purpose; but it does not. Indeed, it cannot, because the checks at the
    start of sys_futex() exclude FUTEX_TRYLOCK_PI from the set of operations
    that do copy_from_user() on the timeout argument. So, in the
    FUTEX_TRYLOCK_PI futex_lock_pi() call it would be simplest to change
    'timeout' to 'NULL'. This patch does that.
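
    After both changes, the dispatch in do_futex() reduces to (sketch):

        case FUTEX_LOCK_PI:
                return futex_lock_pi(uaddr, flags, timeout, 0);
        case FUTEX_TRYLOCK_PI:
                return futex_lock_pi(uaddr, flags, NULL, 1);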

    Signed-off-by: Michael Kerrisk
    Reviewed-by: Darren Hart
    Link: http://lkml.kernel.org/r/54B96646.8010200@gmail.com
    Signed-off-by: Thomas Gleixner

    Michael Kerrisk
     

26 Oct, 2014

2 commits

  • free_pi_state and exit_pi_state_list both clean up futex_pi_state's.
    exit_pi_state_list takes the hb lock first, and most callers of
    free_pi_state do too. requeue_pi doesn't, which means free_pi_state
    can free the pi_state out from under exit_pi_state_list. For example:

    task A                             |  task B
    exit_pi_state_list                 |
      pi_state =                       |
        curr->pi_state_list->next      |
                                       |  futex_requeue(requeue_pi=1)
                                       |  // pi_state is the same as
                                       |  // the one in task A
                                       |  free_pi_state(pi_state)
                                       |    list_del_init(&pi_state->list)
                                       |    kfree(pi_state)
      list_del_init(&pi_state->list)   |

    Move the free_pi_state calls in requeue_pi to before it drops the hb
    locks which it's already holding.

    [ tglx: Removed a pointless free_pi_state() call and the hb->lock held
    debugging. The latter comes via a separate patch ]

    Signed-off-by: Brian Silverman
    Cc: austin.linux@gmail.com
    Cc: darren@dvhart.com
    Cc: peterz@infradead.org
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1414282837-23092-1-git-send-email-bsilver16384@gmail.com
    Signed-off-by: Thomas Gleixner

    Brian Silverman
     
  • Update our documentation as of fix 76835b0ebf8 (futex: Ensure
    get_futex_key_refs() always implies a barrier). Explicitly
    state that we don't do key referencing for private futexes.

    Signed-off-by: Davidlohr Bueso
    Cc: Matteo Franchin
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Darren Hart
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Acked-by: Catalin Marinas
    Link: http://lkml.kernel.org/r/1414121220.817.0.camel@linux-t7sj.site
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

19 Oct, 2014

1 commit

  • Commit b0c29f79ecea (futexes: Avoid taking the hb->lock if there's
    nothing to wake up) changes the futex code to avoid taking a lock when
    there are no waiters. This code has been subsequently fixed in commit
    11d4616bd07f (futex: revert back to the explicit waiter counting code).
    Both the original commit and the fix-up rely on get_futex_key_refs() to
    always imply a barrier.

    However, for private futexes, none of the cases in the switch statement
    of get_futex_key_refs() would be hit and the function completes without
    a memory barrier as required before checking "waiters" in
    futex_wake() -> hb_waiters_pending(). The consequence is a race with a
    thread waiting on a futex on another CPU, allowing the waker thread to
    read "waiters == 0" while the waiter thread has read "futex_val ==
    locked" (in kernel).

    Without this fix, the problem (user space deadlocks) can be seen with
    Android bionic's mutex implementation on an arm64 multi-cluster system.
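
    The fix gives the private case an explicit barrier (sketch of the change
    to get_futex_key_refs()):

        switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
        case FUT_OFF_INODE:
                ihold(key->shared.inode);       /* implies smp_mb(); (B) */
                break;
        case FUT_OFF_MMSHARED:
                futex_get_mm(key);              /* implies smp_mb(); (B) */
                break;
        default:
                smp_mb();                       /* explicit smp_mb(); (B) */
        }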

    Signed-off-by: Catalin Marinas
    Reported-by: Matteo Franchin
    Fixes: b0c29f79ecea (futexes: Avoid taking the hb->lock if there's nothing to wake up)
    Acked-by: Davidlohr Bueso
    Tested-by: Mike Galbraith
    Cc:
    Cc: Darren Hart
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     

13 Sep, 2014

1 commit

  • futex_wait_requeue_pi() calls futex_wait_setup(). If
    futex_wait_setup() succeeds it returns with hb->lock held and
    preemption disabled. Now the sanity check after this does:

    if (match_futex(&q.key, &key2)) {
            ret = -EINVAL;
            goto out_put_keys;
    }

    which releases the keys but does not release hb->lock.

    So we happily return to user space with hb->lock held and therefore
    preemption disabled.

    Unlock hb->lock before taking the exit route.
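
    The fix adds the missing unlock on that error path (sketch):

        if (match_futex(&q.key, &key2)) {
                queue_unlock(hb);
                ret = -EINVAL;
                goto out_put_keys;
        }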

    Reported-by: Dave "Trinity" Jones
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Darren Hart
    Reviewed-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1409112318500.4178@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner