12 Feb, 2019

2 commits

  • When CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n, the call path
    hrtimer_reprogram -> clockevents_program_event ->
    clockevents_program_min_delta will not retry if the clock event driver
    returns -ETIME.

    If the driver could not satisfy the program_min_delta for any reason, the
    lack of a retry means the CPU may not receive a tick interrupt, potentially
    until the counter does a full period. This leads to rcu_sched timeout
    messages as the stalled CPU is detected by other CPUs, and other issues if
    the CPU is holding locks or other resources at the point at which it
    stalls.

    There have been a couple of observed mechanisms through which a clock event
    driver could not satisfy the requested min_delta and return -ETIME.

    With the MIPS GIC driver, the shared execution resource within MT cores
    means inconvenient latency due to execution of instructions from other
    hardware threads in the core, within gic_next_event, can result in an event
    being set in the past.

    Additionally under virtualisation it is possible to get unexpected latency
    during a clockevent device's set_next_event() callback which can make it
    return -ETIME even for a delta based on min_delta_ns.

    It isn't appropriate to use MIN_ADJUST in the virtualisation case as
    occasional hypervisor induced high latency will cause min_delta_ns to
    quickly increase to the maximum.

    Instead, borrow the retry pattern from the MIN_ADJUST case, but without
    making adjustments. Retry up to 10 times, each time increasing the
    attempted delta by min_delta, before giving up.
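
    A rough sketch of the resulting retry loop, assuming the clockevents
    internals of the time (field names as in kernel/time/clockevents.c);
    this is an illustration of the described behaviour, not the exact
    upstream hunk:

    static int clockevents_program_min_delta(struct clock_event_device *dev)
    {
        unsigned long long clc;
        int64_t delta = 0;
        int i;

        for (i = 0; i < 10; i++) {
            /* each retry pushes the event further out by min_delta */
            delta += dev->min_delta_ns;
            dev->next_event = ktime_add_ns(ktime_get(), delta);

            if (clockevent_state_shutdown(dev))
                return 0;

            dev->retries++;
            clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
            if (dev->set_next_event((unsigned long) clc, dev) == 0)
                return 0;
        }
        return -ETIME;
    }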

    [ Matt: Reworked the loop and made retry increase the delta. ]

    Signed-off-by: James Hogan
    Signed-off-by: Matt Redfearn
    Signed-off-by: Thomas Gleixner
    Cc: linux-mips@linux-mips.org
    Cc: Daniel Lezcano
    Cc: "Martin Schwidefsky"
    Cc: James Hogan
    Link: https://lkml.kernel.org/r/1508422643-6075-1-git-send-email-matt.redfearn@mips.com

    James Hogan
     
  • These macros can be reused by governors which don't use the common
    governor code present in cpufreq_governor.c and should be moved to the
    relevant header.

    Now that they are getting moved to the right header file, reuse them in
    schedutil governor as well (that required rename of show/store
    routines).

    Also create gov_attr_wo() macro for write-only sysfs files, this will be
    used by Interactive governor in a later patch.
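
    For illustration, the write-only macro follows the shape of the existing
    gov_attr_ro()/gov_attr_rw() helpers (a sketch assuming the
    governor_attr/__ATTR pattern; mode 0200 makes the sysfs file write-only):

    #define gov_attr_wo(_name)                                  \
    static struct governor_attr _name =                         \
    __ATTR(_name, 0200, NULL, store_##_name)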

    Signed-off-by: Viresh Kumar

    Viresh Kumar
     

07 Feb, 2019

1 commit

  • commit 8fb335e078378c8426fabeed1ebee1fbf915690c upstream.

    Currently, exit_ptrace() adds all ptraced tasks in a dead list, then
    zap_pid_ns_processes() waits on all tasks in a current pidns, and only
    then are tasks from the dead list released.

    zap_pid_ns_processes() can get stuck on waiting tasks from the dead
    list. In this case, we will have one unkillable process with one or
    more dead children.

    Thanks to Oleg for the advice to release tasks in find_child_reaper().

    Link: http://lkml.kernel.org/r/20190110175200.12442-1-avagin@gmail.com
    Fixes: 7c8bd2322c7f ("exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent()")
    Signed-off-by: Andrei Vagin
    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrei Vagin
     

31 Jan, 2019

1 commit

  • commit 93ad0fc088c5b4631f796c995bdd27a082ef33a6 upstream.

    The recent commit which prevented a division by 0 issue in the alarm timer
    code broke posix CPU timers as an unwanted side effect.

    The reason is that the common rearm code checks for timer->it_interval
    being 0 now. What went unnoticed is that the posix cpu timer setup does not
    initialize timer->it_interval as it stores the interval in CPU timer
    specific storage. The reason for the separate storage is historical as the
    posix CPU timers always had a 64bit nanoseconds representation internally
    while timer->it_interval is type ktime_t which used to be a modified
    timespec representation on 32bit machines.

    Instead of reverting the offending commit and fixing the alarmtimer issue
    in the alarmtimer code, store the interval in timer->it_interval at CPU
    timer setup time so the common code check works. This also repairs the
    existing inconsistency of the posix CPU timer code, which kept a single
    shot timer armed despite the interval being 0.
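
    A sketch of that setup-time change, assuming the 4.14-era
    posix_cpu_timer_set() in kernel/time/posix-cpu-timers.c (field names
    may differ slightly):

    /* keep the CPU timer specific copy of the interval ... */
    timer->it.cpu.incr = timespec64_to_ns(&new->it_interval);
    /* ... and mirror it into the common field so the generic
     * it_interval check in the rearm path works */
    timer->it_interval = ns_to_ktime(timer->it.cpu.incr);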

    The separate storage can be removed in mainline, but that needs to be a
    separate commit as the current one has to be backported to stable kernels.

    Fixes: 0e334db6bb4b ("posix-timers: Fix division by zero bug")
    Reported-by: H.J. Lu
    Signed-off-by: Thomas Gleixner
    Cc: John Stultz
    Cc: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190111133500.840117406@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

23 Jan, 2019

1 commit

  • commit 512ac999d2755d2b7109e996a76b6fb8b888631d upstream.

    I noticed that cgroup task groups constantly get throttled even
    if they have low CPU usage; this causes some jitter in the response
    time of some of our business containers when enabling CPU quotas.

    It's very simple to reproduce:

    mkdir /sys/fs/cgroup/cpu/test
    cd /sys/fs/cgroup/cpu/test
    echo 100000 > cpu.cfs_quota_us
    echo $$ > tasks

    then repeat:

    cat cpu.stat | grep nr_throttled # nr_throttled will increase steadily

    After some analysis, we found that cfs_rq::runtime_remaining will
    be cleared by expire_cfs_rq_runtime() due to two equal but stale
    "cfs_{b|q}->runtime_expires" after period timer is re-armed.

    The current condition to judge clock drift in expire_cfs_rq_runtime()
    is wrong: the two runtime_expires are actually the same when clock
    drift happens, so this condition can never hit. The original design was
    correctly done by this commit:

    a9cf55b28610 ("sched: Expire invalid runtime")

    ... but was changed to be the current implementation due to its locking bug.

    This patch introduces another way, it adds a new field in both structures
    cfs_rq and cfs_bandwidth to record the expiration update sequence, and
    uses them to figure out if clock drift happens (true if they are equal).
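
    A sketch of the resulting check, assuming field names along the lines
    of the patch (cfs_b->expires_seq bumped when a new period/expiry is
    set and copied into cfs_rq->expires_seq when runtime is handed out):

    /* in expire_cfs_rq_runtime(), kernel/sched/fair.c, simplified */
    if (cfs_rq->expires_seq == cfs_b->expires_seq) {
        /* same period: the local deadline merely drifted, extend it */
        cfs_rq->runtime_expires += TICK_NSEC;
    } else {
        /* the period timer was re-armed: the runtime really expired */
        cfs_rq->runtime_remaining = 0;
    }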

    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    [alakeshh: backport: Fixed merge conflicts:
    - sched.h: Fix the indentation and order in which the variables are
    declared to match with coding style of the existing code in 4.14
    Struct members of same type were declared in separate lines in
    upstream patch which has been changed back to having multiple
    members of same type in the same line.
    e.g. int a; int b; -> int a, b; ]
    Signed-off-by: Alakesh Haloi
    Reviewed-by: Ben Segall
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: # 4.14.x
    Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
    Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Xunlei Pang
     

13 Jan, 2019

5 commits

  • commit c40f7d74c741a907cfaeb73a7697081881c497d0 upstream.

    Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
    scheduler under high loads, starting at around the v4.18 time frame,
    and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
    manipulation.

    Do a (manual) revert of:

    a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")

    It turns out that the list_del_leaf_cfs_rq() introduced by this commit
    has a surprising property that was not considered in followup commits
    such as:

    9c2791f936ef ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")

    As Vincent Guittot explains:

    "I think that there is a bigger problem with commit a9e7f6544b9c and
    cfs_rq throttling:

    Let take the example of the following topology TG2 --> TG1 --> root:

    1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
    cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
    one path because it has never been used and can't be throttled so
    tmp_alone_branch will point to leaf_cfs_rq_list at the end.

    2) Then TG1 is throttled

    3) and we add TG3 as a new child of TG1.

    4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
    cfs_rq and tmp_alone_branch will stay on rq->leaf_cfs_rq_list.

    With commit a9e7f6544b9c, we can del a cfs_rq from rq->leaf_cfs_rq_list.
    So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
    cfs_rq is removed from the list.
    Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
    but tmp_alone_branch still points to TG3 cfs_rq because its throttled
    parent can't be enqueued when the lock is released.
    tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.

    So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
    points on another TG cfs_rq, the next TG cfs_rq that will be added,
    will be linked outside rq->leaf_cfs_rq_list - which is bad.

    In addition, we can break the ordering of the cfs_rq in
    rq->leaf_cfs_rq_list but this ordering is used to update and
    propagate the update from leaf down to root."

    Instead of trying to work through all these cases and trying to reproduce
    the very high loads that produced the lockup to begin with, simplify
    the code temporarily by reverting a9e7f6544b9c - which change was clearly
    not thought through completely.

    This (hopefully) gives us a kernel that doesn't lock up so people
    can continue to enjoy their holidays without worrying about regressions. ;-)

    [ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]

    Analyzed-by: Xie XiuQi
    Analyzed-by: Vincent Guittot
    Reported-by: Zhipeng Xie
    Reported-by: Sargun Dhillon
    Reported-by: Xie XiuQi
    Tested-by: Zhipeng Xie
    Tested-by: Sargun Dhillon
    Signed-off-by: Linus Torvalds
    Acked-by: Vincent Guittot
    Cc: # v4.13+
    Cc: Bin Li
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Fixes: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
    Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit 06489cfbd915ff36c8e36df27f1c2dc60f97ca56 upstream.

    Given the fact that devm_memremap_pages() requires a percpu_ref that is
    torn down by devm_memremap_pages_release() the current support for mapping
    RAM is broken.

    Support for remapping "System RAM" has been broken since the beginning and
    there is no existing user of this code path, so just kill the support
    and make it an explicit error.
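
    A sketch of the explicit error described above (modelled on the
    region_intersects() check in kernel/memremap.c; the exact hunk is an
    assumption):

    is_ram = region_intersects(align_start, align_size,
                               IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
    if (is_ram != REGION_DISJOINT) {
        /* "System RAM" (or mixed) ranges are no longer remapped */
        WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__,
                  is_ram == REGION_MIXED ? "mixed" : "ram", res);
        return ERR_PTR(-ENXIO);
    }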

    This cleanup also simplifies a follow-on patch to fix the error path when
    setting a devm release action for devm_memremap_pages_release() fails.

    Link: http://lkml.kernel.org/r/154275557997.76910.14689813630968180480.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: "Jérôme Glisse"
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Logan Gunthorpe
    Cc: Balbir Singh
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 808153e1187fa77ac7d7dad261ff476888dcf398 upstream.

    devm_memremap_pages() is a facility that can create struct page entries
    for any arbitrary range and give drivers the ability to subvert core
    aspects of page management.

    Specifically the facility is tightly integrated with the kernel's memory
    hotplug functionality. It injects an altmap argument deep into the
    architecture specific vmemmap implementation to allow allocating from
    specific reserved pages, and it has Linux specific assumptions about page
    structure reference counting relative to get_user_pages() and
    get_user_pages_fast(). It was an oversight and a mistake that this was
    not marked EXPORT_SYMBOL_GPL from the outset.

    Again, devm_memremap_pages() exposes and relies upon core kernel internal
    assumptions and will continue to evolve along with 'struct page', memory
    hotplug, and support for new memory types / topologies. Only an in-kernel
    GPL-only driver is expected to keep up with this ongoing evolution. This
    interface, and functionality derived from this interface, is not suitable
    for kernel-external drivers.

    Link: http://lkml.kernel.org/r/154275557457.76910.16923571232582744134.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Christoph Hellwig
    Acked-by: Michal Hocko
    Cc: "Jérôme Glisse"
    Cc: Balbir Singh
    Cc: Logan Gunthorpe
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     
  • commit 7b55851367136b1efd84d98fea81ba57a98304cf upstream.

    This changes the fork(2) syscall to record the process start_time after
    initializing the basic task structure but still before making the new
    process visible to user-space.

    Technically, we could record the start_time anytime during fork(2). But
    this might lead to scenarios where a start_time is recorded long before
    a process becomes visible to user-space. For instance, with
    userfaultfd(2) and TLS, user-space can delay the execution of fork(2)
    for an indefinite amount of time (and will, if this causes network
    access, or similar).

    By recording the start_time late, it more closely reflects the point in
    time where the process becomes live and can be observed by other
    processes.

    Lastly, this makes it much harder for user-space to predict and control
    the start_time they get assigned. Previously, user-space could fork a
    process and stall it in copy_thread_tls() before its pid is allocated,
    but after its start_time is recorded. This can be misused to later-on
    cycle through PIDs and resume the stalled fork(2) yielding a process
    that has the same pid and start_time as a process that existed before.
    This can be used to circumvent security systems that identify processes
    by their pid+start_time combination.

    Even though user-space was always aware that start_time recording is
    flaky (but several projects are known to still rely on start_time-based
    identification), changing the start_time to be recorded late will help
    mitigate existing attacks and make it much harder for user-space to
    control the start_time a process gets assigned.

    Reported-by: Jann Horn
    Signed-off-by: Tom Gundersen
    Signed-off-by: David Herrmann
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Herrmann
     
  • commit 0211e12dd0a5385ecffd3557bc570dbad7fcf245 upstream.

    When the allocation of node_to_possible_cpumask fails, then
    irq_create_affinity_masks() returns with a pointer to the empty affinity
    masks array, which will cause malfunction.

    Reorder the allocations so the masks array allocation comes last and every
    failure path returns NULL.
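
    A sketch of the reordering, with illustrative error labels (the real
    function is irq_create_affinity_masks() in kernel/irq/affinity.c;
    details simplified):

    /* helpers first: their failure paths can return NULL without a
     * masks array ever having been produced */
    node_to_possible_cpumask = alloc_node_to_possible_cpumask();
    if (!node_to_possible_cpumask)
        return NULL;

    /* the masks array is allocated last, so every earlier failure
     * path returns NULL instead of an empty affinity masks array */
    masks = kcalloc(nvecs, sizeof(*masks), GFP_KERNEL);
    if (!masks)
        goto out_free_node_cpumask;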

    Fixes: 9a0ef98e186d ("genirq/affinity: Assign vectors to all present CPUs")
    Signed-off-by: Thomas Gleixner
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Mihai Carabas
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

10 Jan, 2019

1 commit

  • commit e9d81a1bc2c48ea9782e3e8b53875f419766ef47 upstream.

    CSS_TASK_ITER_PROCS implements process-only iteration by making
    css_task_iter_advance() skip tasks which aren't threadgroup leaders;
    however, when an iteration is started css_task_iter_start() calls the
    inner helper function css_task_iter_advance_css_set() instead of
    css_task_iter_advance(). As the helper doesn't have the skip logic,
    when the first task to visit is a non-leader thread, it doesn't get
    skipped correctly as shown in the following example.

    # ps -L 2030
    PID LWP TTY STAT TIME COMMAND
    2030 2030 pts/0 Sl+ 0:00 ./test-thread
    2030 2031 pts/0 Sl+ 0:00 ./test-thread
    # mkdir -p /sys/fs/cgroup/x/a/b
    # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
    # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
    # echo 2030 > /sys/fs/cgroup/x/a/cgroup.procs
    # cat /sys/fs/cgroup/x/a/cgroup.threads
    2030
    2031
    # cat /sys/fs/cgroup/x/cgroup.procs
    2030
    # echo 2030 > /sys/fs/cgroup/x/a/b/cgroup.threads
    # cat /sys/fs/cgroup/x/cgroup.procs
    2031
    2030

    The last read of cgroup.procs is incorrectly showing non-leader 2031
    in cgroup.procs output.

    This can be fixed by updating css_task_iter_advance() to handle the
    first advance and css_task_iter_start() to call
    css_task_iter_advance() instead of the inner helper (see the sketch
    after the transcript below). After the fix, the same commands produce
    the following (correct) output:

    # ps -L 2062
    PID LWP TTY STAT TIME COMMAND
    2062 2062 pts/0 Sl+ 0:00 ./test-thread
    2062 2063 pts/0 Sl+ 0:00 ./test-thread
    # mkdir -p /sys/fs/cgroup/x/a/b
    # echo threaded > /sys/fs/cgroup/x/a/cgroup.type
    # echo threaded > /sys/fs/cgroup/x/a/b/cgroup.type
    # echo 2062 > /sys/fs/cgroup/x/a/cgroup.procs
    # cat /sys/fs/cgroup/x/a/cgroup.threads
    2062
    2063
    # cat /sys/fs/cgroup/x/cgroup.procs
    2062
    # echo 2062 > /sys/fs/cgroup/x/a/b/cgroup.threads
    # cat /sys/fs/cgroup/x/cgroup.procs
    2062
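
    A sketch of the fix described above (kernel/cgroup/cgroup.c,
    simplified; the advance logic is shown schematically):

    static void css_task_iter_advance(struct css_task_iter *it)
    {
    repeat:
        /* ... advance it->task_pos, moving to the next cset if needed ... */

        /* for PROCS iteration, skip tasks that are not thread-group
         * leaders - now applied to the very first task as well */
        if ((it->flags & CSS_TASK_ITER_PROCS) && it->task_pos &&
            !thread_group_leader(list_entry(it->task_pos,
                                            struct task_struct, cg_list)))
            goto repeat;
    }

    void css_task_iter_start(struct cgroup_subsys_state *css,
                             unsigned int flags, struct css_task_iter *it)
    {
        /* ... setup ... */

        /* use the full advance path instead of the inner helper */
        css_task_iter_advance(it);
    }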

    Signed-off-by: Tejun Heo
    Reported-by: "Michael Kerrisk (man-pages)"
    Fixes: 8cfd8147df67 ("cgroup: implement cgroup v2 thread support")
    Cc: stable@vger.kernel.org # v4.14+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

29 Dec, 2018

2 commits

  • commit c7c3f05e341a9a2bd1a92993d4f996cfd6e7348e upstream.

    From the printk()/serial console point of view, panic() is special
    because it may force the CPU to re-enter printk() and/or the serial
    console driver. Therefore, some serial console drivers are re-entrant.
    E.g. 8250:

    serial8250_console_write()
    {
        if (port->sysrq)
            locked = 0;
        else if (oops_in_progress)
            locked = spin_trylock_irqsave(&port->lock, flags);
        else
            spin_lock_irqsave(&port->lock, flags);
        ...
    }

    panic() does set oops_in_progress via bust_spinlocks(1), so in theory
    we should be able to re-enter serial console driver from panic():

    CPU0

      uart_console_write()
      serial8250_console_write()              // if (oops_in_progress)
                                              //    spin_trylock_irqsave()
      call_console_drivers()
      console_unlock()
      console_flush_on_panic()
      bust_spinlocks(1)                       // oops_in_progress++
      panic()

      spin_lock_irqsave(&port->lock, flags)   // spin_lock_irqsave()
      serial8250_console_write()
      call_console_drivers()
      console_unlock()
      printk()
      ...

    However, this does not happen and we deadlock in serial console on
    port->lock spinlock. And the problem is that console_flush_on_panic() is
    called after bust_spinlocks(0):

    void panic(const char *fmt, ...)
    {
        bust_spinlocks(1);
        ...
        bust_spinlocks(0);
        console_flush_on_panic();
        ...
    }

    bust_spinlocks(0) decrements oops_in_progress, so oops_in_progress
    can go back to zero. Thus even re-entrant console drivers will simply
    spin on port->lock spinlock. Given that port->lock may already be
    locked either by a stopped CPU, or by the very same CPU we execute
    panic() on (for instance, NMI panic() on printing CPU) the system
    deadlocks and does not reboot.

    Fix this by removing bust_spinlocks(0), so oops_in_progress is always
    set in panic() now and, thus, re-entrant console drivers will trylock
    the port->lock instead of spinning on it forever, when we call them
    from console_flush_on_panic().
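
    The resulting shape, as described (kernel/panic.c, reduced to the
    calls relevant here):

    void panic(const char *fmt, ...)
    {
        bust_spinlocks(1);
        ...
        /* no bust_spinlocks(0) anymore: oops_in_progress stays elevated,
         * so re-entrant drivers trylock port->lock instead of spinning */
        console_flush_on_panic();
        ...
    }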

    Link: http://lkml.kernel.org/r/20181025101036.6823-1-sergey.senozhatsky@gmail.com
    Cc: Steven Rostedt
    Cc: Daniel Wang
    Cc: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Greg Kroah-Hartman
    Cc: Alan Cox
    Cc: Jiri Slaby
    Cc: Peter Feiner
    Cc: linux-serial@vger.kernel.org
    Cc: Sergey Senozhatsky
    Cc: stable@vger.kernel.org
    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Petr Mladek
    Signed-off-by: Greg Kroah-Hartman

    Sergey Senozhatsky
     
  • commit 0e334db6bb4b1fd1e2d72c1f3d8f004313cd9f94 upstream.

    The signal delivery path of posix-timers can try to rearm the timer even if
    the interval is zero. That's handled for the common case (hrtimer) but not
    for alarm timers. In that case the forwarding function raises a division by
    zero exception.

    The handling for hrtimer based posix timers is wrong because it marks the
    timer as active despite the fact that it is stopped.

    Move the check from common_hrtimer_rearm() to posixtimer_rearm() to cure
    both issues.
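
    A sketch of the moved check, assuming the posixtimer_rearm() structure
    in kernel/time/posix-timers.c (simplified):

    /* in posixtimer_rearm(): only rearm genuine interval timers, for
     * hrtimer and alarmtimer based clocks alike */
    if (timr->it_interval &&
        timr->it_requeue_pending == info->si_sys_private) {
        timr->kclock->timer_rearm(timr);
        ...
    }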

    Reported-by: syzbot+9d38bedac9cc77b8ad5e@syzkaller.appspotmail.com
    Signed-off-by: Thomas Gleixner
    Cc: John Stultz
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: sboyd@kernel.org
    Cc: stable@vger.kernel.org
    Cc: syzkaller-bugs@googlegroups.com
    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1812171328050.1880@nanos.tec.linutronix.de
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

21 Dec, 2018

12 commits

  • commit 7aa54be2976550f17c11a1c3e3630002dea39303 upstream.

    On x86 we cannot do fetch_or() with a single instruction and thus end up
    using a cmpxchg loop, which reduces determinism. Replace the fetch_or()
    with a composite operation: tas-pending + load.

    Using two instructions of course opens a window we previously did not
    have. Consider the scenario:

    CPU0            CPU1            CPU2

    1) lock
         trylock -> (0,0,1)

    2)              lock
                      trylock /* fail */

    3) unlock -> (0,0,0)

    4)                              lock
                                      trylock -> (0,0,1)

    5)                tas-pending -> (0,1,1)
                      load-val <- (0,0,1)

    FAIL: _2_ owners

    where 5) is our new composite operation. When we consider each part of
    the qspinlock state as a separate variable (as we can when
    _Q_PENDING_BITS == 8) then the above is entirely possible, because
    tas-pending will only RmW the pending byte, so the later load is able
    to observe prior tail and lock state (but not earlier than its own
    trylock, which operates on the whole word, due to coherence).

    To avoid this we need 2 things:

    - the load must come after the tas-pending (obviously, otherwise it
    can trivially observe prior state).

    - the tas-pending must be a full-word RmW instruction (it cannot be an
    XCHGB, for example), such that we cannot observe other state prior to
    setting pending.

    On x86 we can realize this by using "LOCK BTS m32, r32" for
    tas-pending followed by a regular load.

    Note that observing later state is not a problem:

    - if we fail to observe a later unlock, we'll simply spin-wait for
    that store to become visible.

    - if we observe a later xchg_tail(), there is no difference from that
    xchg_tail() having taken place before the tas-pending.

    Suggested-by: Will Deacon
    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: andrea.parri@amarulasolutions.com
    Cc: longman@redhat.com
    Fixes: 59fb586b4a07 ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath")
    Link: https://lkml.kernel.org/r/20181003130957.183726335@infradead.org
    Signed-off-by: Ingo Molnar
    [bigeasy: GEN_BINARY_RMWcc macro redo]
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     
  • commit 53bf57fab7321fb42b703056a4c80fc9d986d170 upstream.

    Flip the branch condition after atomic_fetch_or_acquire(_Q_PENDING_VAL)
    such that we lose the indent. This also results in a more natural code
    flow IMO.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: andrea.parri@amarulasolutions.com
    Cc: longman@redhat.com
    Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     
  • commit c61da58d8a9ba9238250a548f00826eaf44af0f7 upstream.

    When a queued locker reaches the head of the queue, it claims the lock
    by setting _Q_LOCKED_VAL in the lockword. If there isn't contention, it
    must also clear the tail as part of this operation so that subsequent
    lockers can avoid taking the slowpath altogether.

    Currently this is expressed as a cmpxchg() loop that practically only
    runs up to two iterations. This is confusing to the reader and unhelpful
    to the compiler. Rewrite the cmpxchg() loop without the loop, so that a
    failed cmpxchg() implies that there is contention and we just need to
    write to _Q_LOCKED_VAL without considering the rest of the lockword.
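
    A sketch of the rewritten claim path (kernel/locking/qspinlock.c,
    simplified; the use of atomic_try_cmpxchg_relaxed() here is an
    assumption about the exact primitive):

    /* single attempt: grab the lock and clear the tail in one go */
    if (((val & _Q_TAIL_MASK) == tail) &&
        atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
        goto release;   /* no contention */

    /* either somebody is queued behind us or pending is set: just set
     * the locked byte and leave the rest of the lockword alone */
    set_locked(lock);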

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-7-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Sasha Levin

    Will Deacon
     
  • commit 3bea9adc96842b8a7345c7fb202c16ae9c8d5b25 upstream.

    The native clear_pending() function is identical to the PV version, so the
    latter can simply be removed.

    This fixes the build for systems with >= 16K CPUs using the PV lock implementation.

    Reported-by: Waiman Long
    Signed-off-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/20180427101619.GB21705@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Sasha Levin

    Will Deacon
     
  • commit 59fb586b4a07b4e1a0ee577140ab4842ba451acd upstream.

    The qspinlock locking slowpath utilises a "pending" bit as a simple form
    of an embedded test-and-set lock that can avoid the overhead of explicit
    queuing in cases where the lock is held but uncontended. This bit is
    managed using a cmpxchg() loop which tries to transition the uncontended
    lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).

    Unfortunately, the cmpxchg() loop is unbounded and lockers can be starved
    indefinitely if the lock word is seen to oscillate between unlocked
    (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
    able to take the lock in the cmpxchg() loop without queuing and pass it
    around amongst themselves.

    This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
    using atomic_fetch_or, and then inspecting the old value to see whether
    we need to spin on the current lock owner, or whether we now effectively
    hold the lock. The tricky scenario is when concurrent lockers end up
    queuing on the lock and the lock becomes available, causing us to see
    a lockword of (n,0,0). With pending now set, simply queuing could lead
    to deadlock as the head of the queue may not have observed the pending
    flag being cleared. Conversely, if the head of the queue did observe
    pending being cleared, then it could transition the lock from (n,0,0) ->
    (0,0,1) meaning that any attempt to "undo" our setting of the pending
    bit could race with a concurrent locker trying to set it.

    We handle this race by preserving the pending bit when taking the lock
    after reaching the head of the queue and leaving the tail entry intact
    if we saw pending set, because we know that the tail is going to be
    updated shortly.
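
    A sketch of the resulting pending fast path (simplified; helper names
    are taken from the qspinlock code but the exact shape is an
    illustration):

    /*
     * trylock || pending
     *
     * (0,0,0) -> (0,0,1) ; trylock
     * (0,0,1) -> (0,1,1) ; pending
     */
    val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);

    /* any tail or pending bit in the old value means real contention */
    if (val & ~_Q_LOCKED_MASK) {
        /* undo our pending bit only if nobody else had set it */
        if (!(val & _Q_PENDING_MASK))
            clear_pending(lock);
        goto queue;
    }

    /* we own pending: wait for the current owner, then take the lock */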

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-6-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Sasha Levin

    Will Deacon
     
  • commit 625e88be1f41b53cec55827c984e4a89ea8ee9f9 upstream.

    'struct __qspinlock' provides a handy union of fields so that
    subcomponents of the lockword can be accessed by name, without having to
    manage shifts and masks explicitly and take endianness into account.

    This is useful in qspinlock.h and also potentially in arch headers, so
    move the 'struct __qspinlock' into 'struct qspinlock' and kill the extra
    definition.
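
    The merged layout roughly looks as follows (little-endian variant
    only; a sketch of include/asm-generic/qspinlock_types.h after the
    change):

    typedef struct qspinlock {
        union {
            atomic_t val;
            struct {            /* byte access to the subcomponents */
                u8  locked;
                u8  pending;
            };
            struct {
                u16 locked_pending;
                u16 tail;
            };
        };
    } arch_spinlock_t;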

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Acked-by: Boqun Feng
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-3-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Sasha Levin

    Will Deacon
     
  • commit 6512276d97b160d90b53285bd06f7f201459a7e3 upstream.

    If a locker taking the qspinlock slowpath reads a lock value indicating
    that only the pending bit is set, then it will spin whilst the
    concurrent pending->locked transition takes effect.

    Unfortunately, there is no guarantee that such a transition will ever be
    observed since concurrent lockers could continuously set pending and
    hand over the lock amongst themselves, leading to starvation. Whilst
    this would probably resolve in practice, it means that it is not
    possible to prove liveness properties about the lock and means that lock
    acquisition time is unbounded.

    Rather than removing the pending->locked spinning from the slowpath
    altogether (which has been shown to heavily penalise a 2-threaded
    locking stress test on x86), this patch replaces the explicit spinning
    with a call to atomic_cond_read_relaxed and allows the architecture to
    provide a bound on the number of spins. For architectures that can
    respond to changes in cacheline state in their smp_cond_load implementation,
    it should be sufficient to use the default bound of 1.
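
    A sketch of the bounded wait (kernel/locking/qspinlock.c;
    _Q_PENDING_LOOPS is the per-architecture bound, defaulting to 1):

    /* wait for an in-progress pending->locked hand-over, bounded */
    if (val == _Q_PENDING_VAL) {
        int cnt = _Q_PENDING_LOOPS;

        val = atomic_cond_read_relaxed(&lock->val,
                                       (VAL != _Q_PENDING_VAL) || !cnt--);
    }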

    Suggested-by: Waiman Long
    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-4-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Sasha Levin

    Will Deacon
     
  • commit 95bcade33a8af38755c9b0636e36a36ad3789fe6 upstream.

    When a locker ends up queuing on the qspinlock locking slowpath, we
    initialise the relevant mcs node and publish it indirectly by updating
    the tail portion of the lock word using xchg_tail. If we find that there
    was a pre-existing locker in the queue, we subsequently update their
    ->next field to point at our node so that we are notified when it's our
    turn to take the lock.

    This can be roughly illustrated as follows:

    /* Initialise the fields in node and encode a pointer to node in tail */
    tail = initialise_node(node);

    /*
     * Exchange tail into the lockword using an atomic read-modify-write
     * operation with release semantics
     */
    old = xchg_tail(lock, tail);

    /* If there was a pre-existing waiter ... */
    if (old & _Q_TAIL_MASK) {
        prev = decode_tail(old);
        smp_read_barrier_depends();

        /* ... then update their ->next field to point to node. */
        WRITE_ONCE(prev->next, node);
    }

    The conditional update of prev->next therefore relies on the address
    dependency from the result of xchg_tail ensuring order against the
    prior initialisation of node. However, since the release semantics of
    the xchg_tail operation apply only to the write portion of the RmW,
    then this ordering is not guaranteed and it is possible for the CPU
    to return old before the writes to node have been published, consequently
    allowing us to point prev->next to an uninitialised node.

    This patch fixes the problem by making the update of prev->next a RELEASE
    operation, which also removes the reliance on dependency ordering.
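
    In terms of the illustration above, the plain WRITE_ONCE() becomes a
    release store (sketch):

        /* ... then publish node with release semantics, ordering its
         * initialisation before it becomes reachable via prev->next */
        smp_store_release(&prev->next, node);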

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1518528177-19169-2-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Sasha Levin

    Will Deacon
     
  • commit 548095dea63ffc016d39c35b32c628d033638aca upstream.

    Queued spinlocks are not used by DEC Alpha, and furthermore operations
    such as READ_ONCE() and release/relaxed RMW atomics are being changed
    to imply smp_read_barrier_depends(). This commit therefore removes the
    now-redundant smp_read_barrier_depends() from queued_spin_lock_slowpath(),
    and adjusts the comments accordingly.

    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Sasha Levin

    Paul E. McKenney
     
  • commit 2840f84f74035e5a535959d5f17269c69fa6edc5 upstream.

    The following commands will cause a memory leak:

    # cd /sys/kernel/tracing
    # mkdir instances/foo
    # echo schedule > instances/foo/set_ftrace_filter
    # rmdir instances/foo

    The reason is that the hashes that hold the filters to set_ftrace_filter and
    set_ftrace_notrace are not freed if they contain any data on the instance
    and the instance is removed.

    Found by kmemleak detector.

    Cc: stable@vger.kernel.org
    Fixes: 591dffdade9f ("ftrace: Allow for function tracing instance to filter functions")
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit 3cec638b3d793b7cacdec5b8072364b41caeb0e1 upstream.

    When create_event_filter() fails in set_trigger_filter(), the filter may
    still be allocated and needs to be freed. The caller expects the
    data->filter to be updated with the new filter, even if the new filter
    failed (we could add an error message by setting set_str parameter of
    create_event_filter(), but that's another update).

    But because the error would just exit, filter was left hanging and
    nothing could free it.

    Found by kmemleak detector.

    Cc: stable@vger.kernel.org
    Fixes: bac5fb97a173a ("tracing: Add and use generic set_trigger_filter() implementation")
    Reviewed-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • [ Upstream commit 8e7df2b5b7f245c9bd11064712db5cb69044a362 ]

    While it uses %pK, there are still few reasons to allow reading this
    file as non-root.

    Suggested-by: Linus Torvalds
    Acked-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Ingo Molnar
     

17 Dec, 2018

6 commits

  • [ Upstream commit c14376de3a1befa70d9811ca2872d47367b48767 ]

    wake_klogd is a local variable in console_unlock(). The information
    is lost when the console_lock owner hands over printing via the busy
    wait added by the commit dbdda842fe96f8932 ("printk: Add console owner
    and waiter logic to load balance console writes"). The following race
    is possible:

    CPU0                                    CPU1
    console_unlock()

      for (;;)
         /* calling console for last message */

                                            printk()
                                              log_store()
                                                log_next_seq++;

         /* see new message */
         if (seen_seq != log_next_seq) {
            wake_klogd = true;
            seen_seq = log_next_seq;
         }

         console_lock_spinning_enable();

                                            if (console_trylock_spinning())
                                               /* spinning */

         if (console_lock_spinning_disable_and_check()) {
            printk_safe_exit_irqrestore(flags);
            return;

                                            console_unlock()
                                              if (seen_seq != log_next_seq) {
                                              /* already seen */
                                              /* nothing to do */

    Result: Nobody would wakeup klogd.

    One solution would be to make a global variable from wake_klogd.
    But then we would need to manipulate it under a lock or so.

    This patch wakes klogd also when console_lock is passed to the
    spinning waiter. It looks like the right way to go. Also userspace
    should have a chance to see and store any "flood" of messages.

    Note that the very late klogd wake up was a historic solution.
    It made sense on single CPU systems or when sys_syslog() operations
    were synchronized using the big kernel lock like in v2.1.113.
    But it is questionable these days.

    Fixes: dbdda842fe96f8932 ("printk: Add console owner and waiter logic to load balance console writes")
    Link: http://lkml.kernel.org/r/20180226155734.dzwg3aovqnwtvkoy@pathway.suse.cz
    Cc: Steven Rostedt
    Cc: linux-kernel@vger.kernel.org
    Cc: Tejun Heo
    Suggested-by: Sergey Senozhatsky
    Reviewed-by: Sergey Senozhatsky
    Signed-off-by: Petr Mladek
    Signed-off-by: Sasha Levin

    Petr Mladek
     
  • [ Upstream commit fd5f7cde1b85d4c8e09ca46ce948e008a2377f64 ]

    This patch, basically, reverts commit 6b97a20d3a79 ("printk:
    set may_schedule for some of console_trylock() callers").
    That commit was a mistake, it introduced a big dependency
    on the scheduler, by enabling preemption under console_sem
    in printk()->console_unlock() path, which is rather too
    critical. The patch did not significantly reduce the
    possibilities of printk() lockups, but made it possible to
    stall printk(), as has been reported by Tetsuo Handa [1].

    Another issue is that preemption under console_sem also
    messes up with Steven Rostedt's hand off scheme, by making
    it possible to sleep with console_sem both in console_unlock()
    and in vprintk_emit(), after acquiring the console_sem
    ownership (anywhere between printk_safe_exit_irqrestore() in
    console_trylock_spinning() and printk_safe_enter_irqsave()
    in console_unlock()). This makes hand off less likely and,
    at the same time, may result in a significant amount of
    pending logbuf messages. Preempted console_sem owner makes
    it impossible for other CPUs to emit logbuf messages, but
    does not make it impossible for other CPUs to append new
    messages to the logbuf.

    Reinstate the old behavior and make printk() non-preemptible.
    Should any printk() lockup reports arrive they must be handled
    in a different way.

    [1] http://lkml.kernel.org/r/201603022101.CAH73907.OVOOMFHFFtQJSL%20()%20I-love%20!%20SAKURA%20!%20ne%20!%20jp
    Fixes: 6b97a20d3a79 ("printk: set may_schedule for some of console_trylock() callers")
    Link: http://lkml.kernel.org/r/20180116044716.GE6607@jagdpanzerIV
    To: Tetsuo Handa
    Cc: Sergey Senozhatsky
    Cc: Tejun Heo
    Cc: akpm@linux-foundation.org
    Cc: linux-mm@kvack.org
    Cc: Cong Wang
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Jan Kara
    Cc: Mathieu Desnoyers
    Cc: Byungchul Park
    Cc: Pavel Machek
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Sergey Senozhatsky
    Reported-by: Tetsuo Handa
    Reviewed-by: Steven Rostedt (VMware)
    Signed-off-by: Petr Mladek
    Signed-off-by: Sasha Levin

    Sergey Senozhatsky
     
  • [ Upstream commit c162d5b4338d72deed61aa65ed0f2f4ba2bbc8ab ]

    The commit ("printk: Add console owner and waiter logic to load balance
    console writes") made vprintk_emit() and console_unlock() even more
    complicated.

    This patch extracts the new code into 3 helper functions. They should
    help to keep it rather self-contained. It will be easier to use and
    maintain.

    This patch just shuffles the existing code. It does not change
    the functionality.

    Link: http://lkml.kernel.org/r/20180112160837.GD24497@linux.suse
    Cc: akpm@linux-foundation.org
    Cc: linux-mm@kvack.org
    Cc: Cong Wang
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Jan Kara
    Cc: Mathieu Desnoyers
    Cc: Tetsuo Handa
    Cc: rostedt@home.goodmis.org
    Cc: Byungchul Park
    Cc: Tejun Heo
    Cc: Pavel Machek
    Cc: linux-kernel@vger.kernel.org
    Reviewed-by: Steven Rostedt (VMware)
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Petr Mladek
    Signed-off-by: Sasha Levin

    Petr Mladek
     
  • [ Upstream commit dbdda842fe96f8932bae554f0adf463c27c42bc7 ]

    This patch implements what I discussed in Kernel Summit. I added
    lockdep annotation (hopefully correctly), and it hasn't had any splats
    (since I fixed some bugs in the first iterations). It did catch
    problems when I had the owner covering too much. But now that the owner
    is only set when actively calling the consoles, lockdep has stayed
    quiet.

    Here's the design again:

    I added a "console_owner" which is set to a task that is actively
    writing to the consoles. It is *not* the same as the owner of the
    console_lock. It is only set when doing the calls to the console
    functions. It is protected by a console_owner_lock which is a raw spin
    lock.

    There is a console_waiter. This is set when there is an active console
    owner that is not current, and waiter is not set. This too is protected
    by console_owner_lock.

    In printk() when it tries to write to the consoles, we have:

    if (console_trylock())
    console_unlock();

    Now I added an else, which will check if there is an active owner, and
    no current waiter. If that is the case, then console_waiter is set, and
    the task goes into a spin until it is no longer set.

    When the active console owner finishes writing the current message to
    the consoles, it grabs the console_owner_lock and sees if there is a
    waiter, and clears console_owner.

    If there is a waiter, then it breaks out of the loop, clears the waiter
    flag (because that will release the waiter from its spin), and exits.
    Note, it does *not* release the console semaphore. Because it is a
    semaphore, there is no owner. Another task may release it. This means
    that the waiter is guaranteed to be the new console owner! Which it
    becomes.

    Then the waiter calls console_unlock() and continues to write to the
    consoles.

    If another task comes along and does a printk() it too can become the
    new waiter, and we wash rinse and repeat!

    By Petr Mladek about possible new deadlocks:

    The thing is that we move console_sem only to printk() call
    that normally calls console_unlock() as well. It means that
    the transferred owner should not bring new type of dependencies.
    As Steven said somewhere: "If there is a deadlock, it was
    there even before."

    We could look at it from this side. The possible deadlock would
    look like:

    CPU0                                CPU1

    console_unlock()

      console_owner = current;

                                        spin_lockA()
                                          printk()
                                            spin = true;
                                            while (...)

      call_console_drivers()
        spin_lockA()

    This would be a deadlock: CPU0 would wait for the lock A,
    while CPU1 would own the lock A and would wait for CPU0
    to finish calling the console drivers and pass on the console_sem
    ownership.

    But if the above is true than the following scenario was
    already possible before:

    CPU0

    spin_lockA()
      printk()
        console_unlock()
          call_console_drivers()
            spin_lockA()

    In other words, this deadlock was there even before. Such
    deadlocks are prevented by using printk_deferred() in
    the sections guarded by the lock A.

    By Steven Rostedt:

    To demonstrate the issue, this module has been shown to lock up a
    system with 4 CPUs and a slow console (like a serial console). It is
    also able to lock up a 8 CPU system with only a fast (VGA) console, by
    passing in "loops=100". The changes in this commit prevent this module
    from locking up the system.

    #include <linux/module.h>
    #include <linux/kernel.h>
    #include <linux/delay.h>
    #include <linux/workqueue.h>
    #include <linux/percpu.h>
    #include <linux/preempt.h>

    static bool stop_testing;
    static unsigned int loops = 1;

    static void preempt_printk_workfn(struct work_struct *work)
    {
        int i;

        while (!READ_ONCE(stop_testing)) {
            for (i = 0; i < loops && !READ_ONCE(stop_testing); i++) {
                preempt_disable();
                pr_emerg("%5d%-75s\n", smp_processor_id(),
                         " XXX NOPREEMPT");
                preempt_enable();
            }
            msleep(1);
        }
    }

    static struct work_struct __percpu *works;

    static void finish(void)
    {
        int cpu;

        WRITE_ONCE(stop_testing, true);
        for_each_online_cpu(cpu)
            flush_work(per_cpu_ptr(works, cpu));
        free_percpu(works);
    }

    static int __init test_init(void)
    {
        int cpu;

        works = alloc_percpu(struct work_struct);
        if (!works)
            return -ENOMEM;

        /*
         * This is just a test module. This will break if you
         * do any CPU hot plugging between loading and
         * unloading the module.
         */

        for_each_online_cpu(cpu) {
            struct work_struct *work = per_cpu_ptr(works, cpu);

            INIT_WORK(work, &preempt_printk_workfn);
            schedule_work_on(cpu, work);
        }

        return 0;
    }

    static void __exit test_exit(void)
    {
        finish();
    }

    module_param(loops, uint, 0);
    module_init(test_init);
    module_exit(test_exit);
    MODULE_LICENSE("GPL");

    Link: http://lkml.kernel.org/r/20180110132418.7080-2-pmladek@suse.com
    Cc: akpm@linux-foundation.org
    Cc: linux-mm@kvack.org
    Cc: Cong Wang
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Jan Kara
    Cc: Mathieu Desnoyers
    Cc: Tetsuo Handa
    Cc: Byungchul Park
    Cc: Tejun Heo
    Cc: Pavel Machek
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Steven Rostedt (VMware)
    [pmladek@suse.com: Commit message about possible deadlocks]
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Petr Mladek
    Signed-off-by: Sasha Levin

    Steven Rostedt (VMware)
     
  • This reverts commit c9b8d580b3fb0ab65d37c372aef19a318fda3199.

    This is just a technical revert to make the printk fix apply cleanly;
    this patch will be re-picked in about 3 commits.

    Sasha Levin
     
  • [ Upstream commit 1efb6ee3edea57f57f9fb05dba8dcb3f7333f61f ]

    A format string consisting of "%p" or "%s" followed by an invalid
    specifier (e.g. "%p%\n" or "%s%") could pass the check, which would
    make format_decode() (lib/vsprintf.c) warn.

    Fixes: 9c959c863f82 ("tracing: Allow BPF programs to call bpf_trace_printk()")
    Reported-by: syzbot+1ec5c5ec949c4adaa0c4@syzkaller.appspotmail.com
    Signed-off-by: Martynas Pumputis
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    Martynas Pumputis
     

08 Dec, 2018

2 commits

  • commit 09d3f015d1e1b4fee7e9bbdcf54201d239393391 upstream.

    Commit:

    142b18ddc8143 ("uprobes: Fix handle_swbp() vs unregister() + register() race")

    added the UPROBE_COPY_INSN flag, and corresponding smp_wmb() and smp_rmb()
    memory barriers, to ensure that handle_swbp() uses fully-initialized
    uprobes only.

    However, the smp_rmb() is mis-placed: this barrier should be placed
    after handle_swbp() has tested for the flag, thus guaranteeing that
    (program-order) subsequent loads from the uprobe can see the initial
    stores performed by prepare_uprobe().

    Move the smp_rmb() accordingly. Also amend the comments associated
    to the two memory barriers to indicate their actual locations.

    Signed-off-by: Andrea Parri
    Acked-by: Oleg Nesterov
    Cc: Alexander Shishkin
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: stable@kernel.org
    Fixes: 142b18ddc8143 ("uprobes: Fix handle_swbp() vs unregister() + register() race")
    Link: http://lkml.kernel.org/r/20181122161031.15179-1-andrea.parri@amarulasolutions.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Andrea Parri
     
  • commit 2cf2f0d5b91fd1b06a6ae260462fc7945ea84add upstream.

    gcc discovered that the memcpy() arguments in kdbnearsym() overlap, so
    we should really use memmove(), which is defined to handle that correctly:

    In function 'memcpy',
    inlined from 'kdbnearsym' at /git/arm-soc/kernel/debug/kdb/kdb_support.c:132:4:
    /git/arm-soc/include/linux/string.h:353:9: error: '__builtin_memcpy' accessing 792 bytes at offsets 0 and 8 overlaps 784 bytes at offset 8 [-Werror=restrict]
    return __builtin_memcpy(p, q, size);

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jason Wessel
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     

06 Dec, 2018

7 commits

  • commit 46f7ecb1e7359f183f5bbd1e08b90e10e52164f9 upstream

    The IBPB control code in x86 removed the usage. Remove the functionality
    which was introduced for this.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185005.559149393@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit a74cfffb03b73d41e08f84c2e5c87dec0ce3db9f upstream

    arch_smt_update() is only called when the sysfs SMT control knob is
    changed. This means that when SMT is enabled in the sysfs control knob the
    system is considered to have SMT active even if all siblings are offline.

    To allow finegrained control of the speculation mitigations, the actual SMT
    state is more interesting than the fact that siblings could be enabled.

    Rework the code, so arch_smt_update() is invoked from each individual CPU
    hotplug function, and simplify the update function while at it.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185004.521974984@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 321a874a7ef85655e93b3206d0f36b4a6097f948 upstream

    Make the scheduler's 'sched_smt_present' static key globally available, so
    it can be used in the x86 speculation control code.

    Provide a query function and a stub for the CONFIG_SMP=n case.
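
    A sketch of the query function and its stub (the patch adds them in a
    scheduler header; the exact header and config symbol used for the stub
    are assumptions here):

    #ifdef CONFIG_SMP
    extern struct static_key_false sched_smt_present;

    static __always_inline bool sched_smt_active(void)
    {
        return static_branch_likely(&sched_smt_present);
    }
    #else
    static inline bool sched_smt_active(void) { return false; }
    #endif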

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit c5511d03ec090980732e929c318a7a6374b5550e upstream

    Currently the 'sched_smt_present' static key is enabled when at CPU bringup
    SMT topology is observed, but it is never disabled. However there is demand
    to also disable the key when the topology changes such that there is no SMT
    present anymore.

    Implement this by making the key count the number of cores that have SMT
    enabled.
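
    A sketch of the counting, assuming the cpumask helpers and the
    _cpuslocked static-branch variants of the time (the real hunks live in
    the CPU activate/deactivate paths of kernel/sched/core.c):

    /* CPU coming up: the core becomes SMT once it has a second thread */
    if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
        static_branch_inc_cpuslocked(&sched_smt_present);

    /* CPU going down: the core stops being SMT when dropping to one */
    if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
        static_branch_dec_cpuslocked(&sched_smt_present);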

    In particular, the SMT topology bits are set before interrupts are enabled
    and similarly, are cleared after interrupts are disabled for the last time
    and the CPU dies.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185004.246110444@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra (Intel)
     
  • commit dbfe2953f63c640463c630746cd5d9de8b2f63ae upstream

    Currently, IBPB is only issued in cases when switching into a non-dumpable
    process, the rationale being to protect such 'important and security
    sensitive' processes (such as GPG) from data leaking into a different
    userspace process via spectre v2.

    This is however completely insufficient to provide proper userspace-to-userspace
    spectrev2 protection, as any process can poison branch buffers before being
    scheduled out, and the newly scheduled process immediately becomes a spectrev2
    victim.

    In order to minimize the performance impact (for usecases that do require
    spectrev2 protection), issue the barrier only in cases when switching between
    processes where the victim can't be ptraced by the potential attacker (as in
    such cases, the attacker doesn't have to bother with branch buffers at all).

    [ tglx: Split up PTRACE_MODE_NOACCESS_CHK into PTRACE_MODE_SCHED and
    PTRACE_MODE_IBPB to be able to do ptrace() context tracking reasonably
    fine-grained ]

    Fixes: 18bf3c3ea8 ("x86/speculation: Use Indirect Branch Prediction Barrier in context switch")
    Originally-by: Tim Chen
    Signed-off-by: Jiri Kosina
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: "WoodhouseDavid"
    Cc: Andi Kleen
    Cc: "SchauflerCasey"
    Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251437340.15880@cbobk.fhfr.pm
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
     
  • commit 53c613fe6349994f023245519265999eed75957f upstream

    STIBP is a feature provided by certain Intel ucodes / CPUs. This feature
    (once enabled) prevents cross-hyperthread control of decisions made by
    indirect branch predictors.

    Enable this feature if

    - the CPU is vulnerable to spectre v2
    - the CPU supports SMT and has SMT siblings online
    - spectre_v2 mitigation autoselection is enabled (default)

    After some previous discussion, this leaves STIBP on all the time, as wrmsr
    on crossing kernel boundary is a no-no. This could perhaps later be a bit
    more optimized (like disabling it in NOHZ, experiment with disabling it in
    idle, etc) if needed.

    Note that the synchronization of the mask manipulation via newly added
    spec_ctrl_mutex is currently not strictly needed, as the only updater is
    already being serialized by cpu_add_remove_lock, but let's make this a
    little bit more future-proof.

    Signed-off-by: Jiri Kosina
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: "WoodhouseDavid"
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: "SchauflerCasey"
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251438240.15880@cbobk.fhfr.pm
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
     
  • commit ce48c146495a1a50e48cdbfbfaba3e708be7c07c upstream

    Tejun reported the following cpu-hotplug lock (percpu-rwsem) read recursion:

    tg_set_cfs_bandwidth()
      get_online_cpus()
        cpus_read_lock()

      cfs_bandwidth_usage_inc()
        static_key_slow_inc()
          cpus_read_lock()

    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180122215328.GP3397@worktop
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra