25 Apr, 2018

1 commit

  • When CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n, the call path
    hrtimer_reprogram -> clockevents_program_event ->
    clockevents_program_min_delta will not retry if the clock event driver
    returns -ETIME.

    If the driver could not satisfy the program_min_delta for any reason, the
    lack of a retry means the CPU may not receive a tick interrupt, potentially
    until the counter does a full period. This leads to rcu_sched timeout
    messages as the stalled CPU is detected by other CPUs, and other issues if
    the CPU is holding locks or other resources at the point at which it
    stalls.

    There have been a couple of observed mechanisms through which a clock event
    driver could not satisfy the requested min_delta and return -ETIME.

    With the MIPS GIC driver, the shared execution resource within MT cores
    means that latency due to execution of instructions from other hardware
    threads in the core, within gic_next_event, can result in an event being
    set in the past.

    Additionally under virtualisation it is possible to get unexpected latency
    during a clockevent device's set_next_event() callback which can make it
    return -ETIME even for a delta based on min_delta_ns.

    It isn't appropriate to use MIN_ADJUST in the virtualisation case as
    occasional hypervisor induced high latency will cause min_delta_ns to
    quickly increase to the maximum.

    Instead, borrow the retry pattern from the MIN_ADJUST case, but without
    making adjustments. Retry up to 10 times, each time increasing the
    attempted delta by min_delta, before giving up.
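
    A minimal sketch of the resulting retry loop (field and helper names as
    used by the clockevents code; an illustration of the pattern rather than
    the exact patch):

    static int clockevents_program_min_delta(struct clock_event_device *dev)
    {
            unsigned long long clc;
            int64_t delta = 0;
            int i;

            for (i = 0; i < 10; i++) {
                    /* each attempt pushes the event further out by min_delta */
                    delta += dev->min_delta_ns;
                    dev->next_event = ktime_add_ns(ktime_get(), delta);

                    if (clockevent_state_shutdown(dev))
                            return 0;

                    dev->retries++;
                    clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
                    if (dev->set_next_event((unsigned long) clc, dev) == 0)
                            return 0;
            }
            return -ETIME;
    }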

    [ Matt: Reworked the loop and made retry increase the delta. ]

    Signed-off-by: James Hogan
    Signed-off-by: Matt Redfearn
    Signed-off-by: Thomas Gleixner
    Cc: linux-mips@linux-mips.org
    Cc: Daniel Lezcano
    Cc: "Martin Schwidefsky"
    Cc: James Hogan
    Link: https://lkml.kernel.org/r/1508422643-6075-1-git-send-email-matt.redfearn@mips.com

    James Hogan
     

12 Apr, 2018

1 commit

  • commit 39290b389ea upstream.

    The current "rodata=off" parameter disables read-only kernel mappings
    under CONFIG_DEBUG_RODATA:
    commit d2aa1acad22f ("mm/init: Add 'rodata=off' boot cmdline parameter
    to disable read-only kernel mappings")

    This patch is a logical extension to module mappings, i.e. read-only
    mappings at module loading can be disabled even if
    CONFIG_DEBUG_SET_MODULE_RONX is enabled (mainly for debug use). Please
    note, however, that it only affects RO/RW permissions, keeping NX set.

    This is the first step to make CONFIG_DEBUG_SET_MODULE_RONX mandatory
    (always-on) in the future, as CONFIG_DEBUG_RODATA already is on x86 and
    arm64.

    Suggested-by: and Acked-by: Mark Rutland
    Signed-off-by: AKASHI Takahiro
    Reviewed-by: Kees Cook
    Acked-by: Rusty Russell
    Link: http://lkml.kernel.org/r/20161114061505.15238-1-takahiro.akashi@linaro.org
    Signed-off-by: Jessica Yu
    Signed-off-by: Alex Shi

    Conflicts:
    keeping kaiser.h in init/main.c

    AKASHI Takahiro
     

20 Mar, 2018

5 commits

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later) and thus cleans up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit ef295ecf090d3e86e5b742fc6ab34f1122a43773)

    Conflicts:
    block/blk-mq.c
    include/linux/blk_types.h
    include/linux/blkdev.h

    Christoph Hellwig
     
  • There is no need to always call the blocking console_lock() in
    console_cpu_notify(); it's quite possible that console_sem is already
    locked by another CPU on the system, either already printing or soon to
    begin printing the messages. console_lock() in this case can simply block
    CPU hotplug for an unknown period of time (console_unlock() is not time
    bound). Not that hotplug is very fast, but still, with other CPUs online
    and doing printk(), console_cpu_notify() can get stuck.

    Use console_trylock() instead and opt out if console_sem is already
    acquired by another CPU, since that CPU will do the printing for us.
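
    A minimal sketch of that pattern (notifier signature simplified here; not
    the exact hunk):

    static int console_cpu_notify(unsigned int cpu)
    {
            /*
             * Only flush if nobody else holds console_sem; the current
             * owner will print the pending messages on console_unlock().
             */
            if (console_trylock())
                    console_unlock();
            return 0;
    }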

    Link: http://lkml.kernel.org/r/20170121104729.8585-1-sergey.senozhatsky@gmail.com
    Cc: Steven Rostedt
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Sebastian Andrzej Siewior
    Cc: Ingo Molnar
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Petr Mladek

    This patch also fixes a deadlock that happens if while holding the
    console lock someone issues a call that eventually takes the cpu
    hotplug lock, like in the case below, where the following happens:

    * task Xorg issues an ioctl to the fb layer which takes the console
    lock and calls the driver's ioctl routine

    * at the same time, task bat-cpuhotplug issues a hotplug cpu enable
    operation which takes the cpu hotplug lock and waits for the cpu
    bringup operation to complete

    * the fb driver calls dma_alloc_coherent which, on this platform,
    eventually tries to take the cpu hotplug lock

    * task cpuhp/2 tries to flush the console

    * at this point task Xorg waits for task bat-cpuhotplug to release
    the cpu hotplug lock, which waits for task cpuhp/2 to signal that
    the CPU is up, which waits for Xorg to release the console
    lock

    Linux version 4.9.11-02771-gd85d61b-dirty
    CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), cr=10c53c7d
    CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
    OF: fdt:Machine model: Freescale i.MX6 Quad SABRE Smart Device Board

    sysrq: SysRq : Show Blocked State

    task PC stack pid father
    cpuhp/2 D 0 18 2 0x00000000
    [] (__schedule) from [] (schedule+0x4c/0xac)
    [] (schedule) from [] (schedule_timeout+0x1e8/0x2fc)
    [] (schedule_timeout) from [] (__down+0x64/0x9c)
    [] (__down) from [] (down+0x44/0x58)
    [] (down) from [] (console_lock+0x2c/0x74)
    [] (console_lock) from [] (console_cpu_notify+0x28/0x34)
    [] (console_cpu_notify) from [] (notifier_call_chain+0x44/0x84)
    [] (notifier_call_chain) from [] (__cpu_notify+0x38/0x50)
    [] (__cpu_notify) from [] (notify_online+0x18/0x20)
    [] (notify_online) from [] (cpuhp_up_callbacks+0x24/0xd4)
    [] (cpuhp_up_callbacks) from [] (cpuhp_thread_fun+0x13c/0x148)
    [] (cpuhp_thread_fun) from [] (smpboot_thread_fn+0x17c/0x2dc)
    [] (smpboot_thread_fn) from [] (kthread+0xf0/0x108)
    [] (kthread) from [] (ret_from_fork+0x14/0x24)

    bat-cpuhotplug. D 0 841 718 0x00000000
    [] (__schedule) from [] (schedule+0x4c/0xac)
    [] (schedule) from [] (schedule_timeout+0x1e8/0x2fc)
    [] (schedule_timeout) from [] (wait_for_common+0xb0/0x160)
    [] (wait_for_common) from [] (bringup_cpu+0x50/0xa8)
    [] (bringup_cpu) from [] (cpuhp_up_callbacks+0x24/0xd4)
    [] (cpuhp_up_callbacks) from [] (_cpu_up+0xa8/0xec)
    [] (_cpu_up) from [] (do_cpu_up+0x74/0x9c)
    [] (do_cpu_up) from [] (device_online+0x68/0x8c)
    [] (device_online) from [] (online_store+0x68/0x74)
    [] (online_store) from [] (kernfs_fop_write+0xf4/0x1f8)
    [] (kernfs_fop_write) from [] (__vfs_write+0x1c/0x114)
    [] (__vfs_write) from [] (vfs_write+0xa4/0x168)
    [] (vfs_write) from [] (SyS_write+0x3c/0x90)
    [] (SyS_write) from [] (ret_fast_syscall+0x0/0x1c)

    Xorg D 0 860 832 0x00000000
    [] (__schedule) from [] (schedule+0x4c/0xac)
    [] (schedule) from [] (schedule_preempt_disabled+0x14/0x20)
    [] (schedule_preempt_disabled) from [] (mutex_lock_nested+0x1f8/0x4a4)
    [] (mutex_lock_nested) from [] (get_online_cpus+0x78/0xbc)
    [] (get_online_cpus) from [] (lru_add_drain_all+0x48/0x1b4)
    [] (lru_add_drain_all) from [] (migrate_prep+0x8/0x10)
    [] (migrate_prep) from [] (alloc_contig_range+0xd0/0x320)
    [] (alloc_contig_range) from [] (cma_alloc+0xb8/0x1a8)
    [] (cma_alloc) from [] (__alloc_from_contiguous+0x38/0xd8)
    [] (__alloc_from_contiguous) from [] (cma_allocator_alloc+0x34/0x3c)
    [] (cma_allocator_alloc) from [] (__dma_alloc+0x174/0x338)
    [] (__dma_alloc) from [] (arm_dma_alloc+0x40/0x48)
    [] (arm_dma_alloc) from [] (mxcfb_set_par+0x8ec/0xd7c)
    [] (mxcfb_set_par) from [] (fb_set_var+0x1d4/0x358)
    [] (fb_set_var) from [] (do_fb_ioctl+0x4e4/0x704)
    [] (do_fb_ioctl) from [] (do_vfs_ioctl+0xa0/0xa10)
    [] (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
    [] (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x1c)

    Signed-off-by: Octavian Purdila
    Reviewed-by: Leonard Crestez

    Sergey Senozhatsky
     
  • When the tick is stopped and we reach the dynticks evaluation code on
    IRQ exit, we perform a soft tick restart if we observe an expired timer
    from there. It means we program the nearest possible tick but we stay in
    dynticks mode (ts->tick_stopped = 1) because we may need to stop the tick
    again after that expired timer is handled.

    Now this solution works most of the time but if we suffer an IRQ storm
    and those interrupts trigger faster than the hardware clockevents min
    delay, our tick won't fire until that IRQ storm is finished.

    Here is the problem: on IRQ exit we reprog the timer to at least
    NOW() + min_clockevents_delay. Another IRQ fires before the tick so we
    reschedule again to NOW() + min_clockevents_delay, etc... The tick
    is eternally rescheduled min_clockevents_delay ahead.

    A solution is to simply remove this soft tick restart. After all, the
    normal dynticks evaluation path can handle a 0 delay just fine. And by
    doing that we benefit from the optimization branch which avoids clock
    reprogramming if the clockevents deadline hasn't changed since the last
    reprogramming. This fixes our issue because we no longer do repetitive
    clock reprogramming that always adds the hardware min delay.

    As a side effect it should even optimize the 0 delay path in general.

    Reported-and-tested-by: Octavian Purdila
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • Move the x86_64 idle notifiers, originally added by Andi Kleen and
    Venkatesh Pallipadi, to generic code.

    Change-Id: Idf29cda15be151f494ff245933c12462643388d5
    Acked-by: Nicolas Pitre
    Signed-off-by: Todd Poynor

    Todd Poynor
     
  • These macros can be reused by governors which don't use the common
    governor code present in cpufreq_governor.c and should be moved to the
    relevant header.

    Now that they are getting moved to the right header file, reuse them in
    schedutil governor as well (that required rename of show/store
    routines).

    Also create gov_attr_wo() macro for write-only sysfs files, this will be
    used by Interactive governor in a later patch.

    Signed-off-by: Viresh Kumar

    Viresh Kumar
     

18 Mar, 2018

1 commit

  • commit 27d4ee03078aba88c5e07dcc4917e8d01d046f38 upstream.

    Introduce a helper to retrieve the current task's work struct if it is
    a workqueue worker.

    This allows us to fix a long-standing deadlock in several DRM drivers
    wherein the ->runtime_suspend callback waits for a specific worker to
    finish and that worker in turn calls a function which waits for runtime
    suspend to finish. That function is invoked from multiple call sites
    and waiting for runtime suspend to finish is the correct thing to do
    except if it's executing in the context of the worker.
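
    A sketch of the helper (assuming the workqueue-internal current_wq_worker()
    accessor and the worker's current_work field):

    struct work_struct *current_work(void)
    {
            struct worker *worker = current_wq_worker();

            /* NULL if the current task is not a workqueue worker */
            return worker ? worker->current_work : NULL;
    }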

    Cc: Lai Jiangshan
    Cc: Dave Airlie
    Cc: Ben Skeggs
    Cc: Alex Deucher
    Acked-by: Tejun Heo
    Reviewed-by: Lyude Paul
    Signed-off-by: Lukas Wunner
    Link: https://patchwork.freedesktop.org/patch/msgid/2d8f603074131eb87e588d2b803a71765bd3a2fd.1518338788.git.lukas@wunner.de
    Signed-off-by: Greg Kroah-Hartman

    Lukas Wunner
     

11 Mar, 2018

4 commits

  • [ upstream commit 32fff239de37ef226d5b66329dd133f64d63b22d ]

    syzbot managed to trigger RCU-detected stalls in
    bpf_array_free_percpu().

    It takes time to allocate a huge percpu map, but even more time to free
    it.

    Since we run in process context, use cond_resched() to yield cpu if
    needed.
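
    A sketch of the freeing loop with the added cond_resched() (field names as
    in the bpf array map code):

    static void bpf_array_free_percpu(struct bpf_array *array)
    {
            int i;

            for (i = 0; i < array->map.max_entries; i++) {
                    free_percpu(array->pptrs[i]);
                    cond_resched();  /* yield; freeing a huge map takes a while */
            }
    }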

    Fixes: a10423b87a7e ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ upstream commit 9c2d63b843a5c8a8d0559cc067b5398aa5ec3ffc ]

    syzkaller recently triggered an OOM during percpu map allocation;
    while there is work in progress by Dennis Zhou to add __GFP_NORETRY
    semantics for the percpu allocator under pressure, there also seems to
    be a missing bpf_map_precharge_memlock() check in array map allocation.

    Given that today the actual bpf_map_charge_memlock() happens after
    find_and_alloc_map() in the syscall path, bpf_map_precharge_memlock()
    is there to bail out early, before we go and do the map setup work,
    when we find that we hit the limits anyway. Therefore add this for the
    array map as well.
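
    Roughly, the early bail-out in array_map_alloc() looks like this (cost
    computation abbreviated; a sketch, not the exact hunk):

    cost = array_size;
    if (percpu)
            cost += (u64) attr->max_entries * elem_size * num_possible_cpus();
    if (cost >= U32_MAX - PAGE_SIZE)
            return ERR_PTR(-ENOMEM);

    /* charge against the memlock limit before doing the real setup work */
    ret = bpf_map_precharge_memlock(round_up(cost, PAGE_SIZE) >> PAGE_SHIFT);
    if (ret < 0)
            return ERR_PTR(ret);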

    Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
    Fixes: a10423b87a7e ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
    Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann
    Cc: Dennis Zhou
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ upstream commit a316338cb71a3260201490e615f2f6d5c0d8fb2c ]

    trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
    attr->map_flags, since it does not support preallocation yet. We
    check the flag, but we never copy the flag into trie->map.map_flags,
    which is later on exposed into fdinfo and used by loaders such as
    iproute2. The latter uses this in bpf_map_selfcheck_pinned() to test
    whether a pinned map has the same spec as the one from the BPF obj
    file and if not, bails out, which is currently the case for lpm
    since it always exposes 0 as flags.

    Also copy over flags in array_map_alloc() and stack_map_alloc().
    They always have to be 0 right now, but we should make sure not to
    miss copying them over at a later point in time when we add actual
    flags for them to use.
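
    The missing piece is essentially a one-line copy per map type, along the
    lines of (a sketch):

    /* in trie_alloc(), after validating attr->map_flags */
    trie->map.map_flags = attr->map_flags;

    /* and likewise in array_map_alloc() / stack_map_alloc() */
    array->map.map_flags = attr->map_flags;
    smap->map.map_flags = attr->map_flags;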

    Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
    Reported-by: Jarno Rajahalme
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • commit c52232a49e203a65a6e1a670cd5262f59e9364a0 upstream.

    On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a
    live CPU. This happens from the control thread which initiated the unplug.

    If the CPU on which the control thread runs came out from a longer idle
    period then the base clock of that CPU might be stale because the control
    thread runs prior to any event which forwards the clock.

    In such a case the timers from the unplugged CPU are queued on the live CPU
    based on the stale clock which can cause large delays due to increased
    granularity of the outer timer wheels which are far away from base::clk.

    But there is a worse problem than that. The following sequence of events
    illustrates it:

    - CPU0 timer1 is queued expires = 59969 and base->clk = 59131.

    The timer is queued at wheel level 2, with resulting expiry time = 60032
    (due to level granularity).

    - CPU1 enters idle @60007, with next timer expiry @60020.

    - CPU0 is hotplugged at @60009

    - CPU1 exits idle and runs the control thread which migrates the
    timers from CPU0

    timer1 is now queued in level 0 for immediate handling in the next
    softirq because the requested expiry time 59969 is before CPU1 base->clk
    60007

    - CPU1 runs code which forwards the base clock, which succeeds because
    the next expiring timer, which was collected at idle entry time, is still
    set to 60020.

    So it forwards beyond 60007 and therefore fails to expire the migrated
    timer1. That timer gets expired when the wheel wraps around again, which
    takes between 63 and 630 ms depending on the HZ setting.

    Address both problems by invoking forward_timer_base() for the control
    CPU's timer base. All other places which might run into a similar problem
    (mod_timer()/add_timer_on()) already invoke forward_timer_base() to avoid
    that.

    [ tglx: Massaged comment and changelog ]

    Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
    Co-developed-by: Neeraj Upadhyay
    Signed-off-by: Neeraj Upadhyay
    Signed-off-by: Lingutla Chandrasekhar
    Signed-off-by: Thomas Gleixner
    Cc: Anna-Maria Gleixner
    Cc: linux-arm-msm@vger.kernel.org
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingutla@codeaurora.org
    Signed-off-by: Greg Kroah-Hartman

    Lingutla Chandrasekhar
     

03 Mar, 2018

2 commits

  • [ Upstream commit 11bca0a83f83f6093d816295668e74ef24595944 ]

    An interrupt storm on a bad interrupt will cause the kernel
    log to be clogged.

    [ 60.089234] ->handle_irq(): ffffffffbe2f803f,
    [ 60.090455] 0xffffffffbf2af380
    [ 60.090510] handle_bad_irq+0x0/0x2e5
    [ 60.090522] ->irq_data.chip(): ffffffffbf2af380,
    [ 60.090553] IRQ_NOPROBE set
    [ 60.090584] ->handle_irq(): ffffffffbe2f803f,
    [ 60.090590] handle_bad_irq+0x0/0x2e5
    [ 60.090596] ->irq_data.chip(): ffffffffbf2af380,
    [ 60.090602] 0xffffffffbf2af380
    [ 60.090608] ->action(): (null)
    [ 60.090779] handle_bad_irq+0x0/0x2e5

    This was seen when running an upstream kernel on an Acer Chromebook R11. The
    system was unstable as a result.

    Guard the log message with __printk_ratelimit to reduce the impact. This
    won't prevent the interrupt storm from happening, but at least the system
    remains stable.

    Signed-off-by: Guenter Roeck
    Signed-off-by: Thomas Gleixner
    Cc: Dmitry Torokhov
    Cc: Joe Perches
    Cc: Andy Shevchenko
    Cc: Mika Westerberg
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=197953
    Link: https://lkml.kernel.org/r/1512234784-21038-1-git-send-email-linux@roeck-us.net
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Guenter Roeck
     
  • commit 48d0c9becc7f3c66874c100c126459a9da0fdced upstream.

    The POSIX specification defines that relative CLOCK_REALTIME timers are not
    affected by clock modifications. Those timers have to use CLOCK_MONOTONIC
    to ensure POSIX compliance.

    The introduction of the additional HRTIMER_MODE_PINNED mode broke this
    requirement for pinned timers.

    There is no user space visible impact because user space timers are not
    using pinned mode, but for consistency reasons this needs to be fixed.

    Check whether the mode has the HRTIMER_MODE_REL bit set instead of
    comparing with HRTIMER_MODE_ABS.
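
    In the hrtimer init path this amounts to something like the following
    check (a sketch of the test described above):

    /* relative CLOCK_REALTIME timers must run on CLOCK_MONOTONIC;
     * testing the REL bit also covers the pinned variants */
    if (clock_id == CLOCK_REALTIME && (mode & HRTIMER_MODE_REL))
            clock_id = CLOCK_MONOTONIC;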

    Signed-off-by: Anna-Maria Gleixner
    Cc: Christoph Hellwig
    Cc: John Stultz
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: keescook@chromium.org
    Fixes: 597d0275736d ("timers: Framework for identifying pinned timers")
    Link: http://lkml.kernel.org/r/20171221104205.7269-7-anna-maria@linutronix.de
    Signed-off-by: Ingo Molnar
    Cc: Mike Galbraith
    Signed-off-by: Greg Kroah-Hartman

    Anna-Maria Gleixner
     

28 Feb, 2018

1 commit

  • commit 77dd66a3c67c93ab401ccc15efff25578be281fd upstream.

    If devm_memremap_pages() detects a collision while adding entries
    to the radix-tree, we call pgmap_radix_release(). Unfortunately,
    the function removes *all* entries for the range -- including the
    entries that caused the collision in the first place.

    Modify pgmap_radix_release() to take an additional argument to
    indicate where to stop, so that only newly added entries are removed
    from the tree.

    Cc:
    Fixes: 9476df7d80df ("mm: introduce find_dev_pagemap()")
    Signed-off-by: Jan H. Schönherr
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Jan H. Schönherr
     

25 Feb, 2018

2 commits

  • commit a77660d231f8b3d84fd23ed482e0964f7aa546d6 upstream.

    Currently KCOV_ENABLE does not check if the current task is already
    associated with another kcov descriptor. As a result it is possible
    to associate a single task with more than one kcov descriptor, which
    later leads to a memory leak of the old descriptor. This relation is
    really meant to be one-to-one (a task has only one back link).

    Extend validation to detect such misuse.
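
    A sketch of the extended KCOV_ENABLE validation, as assumed from the
    description (kcov->t is the owning task, t->kcov the task's back link):

    t = current;
    if (kcov->t != NULL || t->kcov != NULL)
            return -EBUSY;  /* task or descriptor already associated */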

    Link: http://lkml.kernel.org/r/20180122082520.15716-1-dvyukov@google.com
    Fixes: 5c9a8750a640 ("kernel: add kcov code coverage")
    Signed-off-by: Dmitry Vyukov
    Reported-by: Shankara Pailoor
    Cc: Dmitry Vyukov
    Cc: syzbot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Vyukov
     
  • commit a6da0024ffc19e0d47712bb5ca4fd083f76b07df upstream.

    We need to ensure that tracepoints are registered and unregistered
    with the users of them. The existing atomic count isn't enough for
    that. Add a lock around the tracepoints, so we serialize access
    to them.

    This fixes cases where we have multiple users setting up and
    tearing down tracepoints, like this:

    CPU: 0 PID: 2995 Comm: syzkaller857118 Not tainted
    4.14.0-rc5-next-20171018+ #36
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    panic+0x1e4/0x41c kernel/panic.c:183
    __warn+0x1c4/0x1e0 kernel/panic.c:546
    report_bug+0x211/0x2d0 lib/bug.c:183
    fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:177
    do_trap_no_signal arch/x86/kernel/traps.c:211 [inline]
    do_trap+0x260/0x390 arch/x86/kernel/traps.c:260
    do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:297
    do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310
    invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
    RIP: 0010:tracepoint_add_func kernel/tracepoint.c:210 [inline]
    RIP: 0010:tracepoint_probe_register_prio+0x397/0x9a0 kernel/tracepoint.c:283
    RSP: 0018:ffff8801d1d1f6c0 EFLAGS: 00010293
    RAX: ffff8801d22e8540 RBX: 00000000ffffffef RCX: ffffffff81710f07
    RDX: 0000000000000000 RSI: ffffffff85b679c0 RDI: ffff8801d5f19818
    RBP: ffff8801d1d1f7c8 R08: ffffffff81710c10 R09: 0000000000000004
    R10: ffff8801d1d1f6b0 R11: 0000000000000003 R12: ffffffff817597f0
    R13: 0000000000000000 R14: 00000000ffffffff R15: ffff8801d1d1f7a0
    tracepoint_probe_register+0x2a/0x40 kernel/tracepoint.c:304
    register_trace_block_rq_insert include/trace/events/block.h:191 [inline]
    blk_register_tracepoints+0x1e/0x2f0 kernel/trace/blktrace.c:1043
    do_blk_trace_setup+0xa10/0xcf0 kernel/trace/blktrace.c:542
    blk_trace_setup+0xbd/0x180 kernel/trace/blktrace.c:564
    sg_ioctl+0xc71/0x2d90 drivers/scsi/sg.c:1089
    vfs_ioctl fs/ioctl.c:45 [inline]
    do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
    SYSC_ioctl fs/ioctl.c:700 [inline]
    SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x444339
    RSP: 002b:00007ffe05bb5b18 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 00000000006d66c0 RCX: 0000000000444339
    RDX: 000000002084cf90 RSI: 00000000c0481273 RDI: 0000000000000009
    RBP: 0000000000000082 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000206 R12: ffffffffffffffff
    R13: 00000000c0481273 R14: 0000000000000000 R15: 0000000000000000

    since we can now run these in parallel. Ensure that the exported helpers
    for doing this are grabbing the queue trace mutex.

    Reported-by: Steven Rostedt
    Tested-by: Dmitry Vyukov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

22 Feb, 2018

1 commit

  • commit 10a0cd6e4932b5078215b1ec2c896597eec0eff9 upstream.

    The functions devm_memremap_pages() and devm_memremap_pages_release() use
    different ways to calculate the section-aligned amount of memory. The
    latter function may use an incorrect size if the memory region is small
    but straddles a section border.

    Use the same code for both.

    Cc:
    Fixes: 5f29a77cd957 ("mm: fix mixed zone detection in devm_memremap_pages")
    Signed-off-by: Jan H. Schönherr
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Jan H. Schönherr
     

17 Feb, 2018

6 commits

  • commit 7b6586562708d2b3a04fe49f217ddbadbbbb0546 upstream.

    __unregister_ftrace_function_probe() will incorrectly parse the glob filter
    because it resets the search variable that was setup by filter_parse_regex().

    Al Viro reported this:

    After that call of filter_parse_regex() we could have func_g.search not
    equal to glob only if glob started with '!' or '*'. In the former case
    we would've buggered off with -EINVAL (not = 1). In the latter we
    would've set func_g.search equal to glob + 1, calculated the length of
    that thing in func_g.len and proceeded to reset func_g.search back to
    glob.

    Suppose the glob is e.g. *foo*. We end up with
    func_g.type = MATCH_MIDDLE_ONLY;
    func_g.len = 3;
    func_g.search = "*foo";
    Feeding that to ftrace_match_record() will not do anything sane - we
    will be looking for names containing "*foo" (->len is ignored for that
    one).

    Link: http://lkml.kernel.org/r/20180127031706.GE13338@ZenIV.linux.org.uk

    Fixes: 3ba009297149f ("ftrace: Introduce ftrace_glob structure")
    Reviewed-by: Dmitry Safonov
    Reviewed-by: Masami Hiramatsu
    Reported-by: Al Viro
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit a1be1f3931bfe0a42b46fef77a04593c2b136e7f upstream.

    This reverts commit ba62bafe942b ("kernel/relay.c: fix potential memory leak").

    This commit introduced a double free bug, because 'chan' is already
    freed by the line:

    kref_put(&chan->kref, relay_destroy_channel);

    This bug was found by syzkaller, using the BLKTRACESETUP ioctl.

    Link: http://lkml.kernel.org/r/20180127004759.101823-1-ebiggers3@gmail.com
    Fixes: ba62bafe942b ("kernel/relay.c: fix potential memory leak")
    Signed-off-by: Eric Biggers
    Reported-by: syzbot
    Reviewed-by: Andrew Morton
    Cc: Zhouyi Zhou
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit 4f7e988e63e336827f4150de48163bed05d653bd upstream.

    This reverts commit 92266d6ef60c ("async: simplify lowest_in_progress()")
    which was simply wrong: In the case where domain is NULL, we now use the
    wrong offsetof() in the list_first_entry macro, so we don't actually
    fetch the ->cookie value, but rather the eight bytes located
    sizeof(struct list_head) further into the struct async_entry.

    On 64 bit, that's the data member, while on 32 bit, that's a u64 built
    from func and data in some order.

    I think the bug happens to be harmless in practice: It obviously only
    affects callers which pass a NULL domain, and AFAICT the only such
    caller is

    async_synchronize_full() ->
    async_synchronize_full_domain(NULL) ->
    async_synchronize_cookie_domain(ASYNC_COOKIE_MAX, NULL)

    and the ASYNC_COOKIE_MAX means that in practice we end up waiting for
    the async_global_pending list to be empty - but it would break if
    somebody happened to pass (void*)-1 as the data element to
    async_schedule, and of course also if somebody ever does a
    async_synchronize_cookie_domain(, NULL) with a "finite" cookie value.

    Maybe the "harmless in practice" means this isn't -stable material. But
    I'm not completely confident my quick git grep'ing is enough, and there
    might be affected code in one of the earlier kernels that has since been
    removed, so I'll leave the decision to the stable guys.

    Link: http://lkml.kernel.org/r/20171128104938.3921-1-linux@rasmusvillemoes.dk
    Fixes: 92266d6ef60c "async: simplify lowest_in_progress()"
    Signed-off-by: Rasmus Villemoes
    Acked-by: Tejun Heo
    Cc: Arjan van de Ven
    Cc: Adam Wallis
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rasmus Villemoes
     
  • commit 364f56653708ba8bcdefd4f0da2a42904baa8eeb upstream.

    When issuing an IPI RT push, where an IPI is sent to each CPU that has more
    than one RT task scheduled on it, it references the root domain's rto_mask,
    which contains all the CPUs within the root domain that have more than one
    RT task in the runnable state. The problem is, after the IPIs are initiated,
    the rq->lock is released. This means that the root domain that is associated
    with the run queue could be freed while the IPIs are going around.

    Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
    the root domain's ref count respectively. This way when initiating the IPIs,
    the scheduler will up the root domain's ref count before releasing the
    rq->lock, ensuring that the root domain does not go away until the IPI round
    is complete.
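
    A sketch of the two helpers (assuming the existing root_domain refcount,
    RCU head and free_rootdomain() callback):

    void sched_get_rd(struct root_domain *rd)
    {
            atomic_inc(&rd->refcount);
    }

    void sched_put_rd(struct root_domain *rd)
    {
            if (!atomic_dec_and_test(&rd->refcount))
                    return;

            call_rcu_sched(&rd->rcu, free_rootdomain);
    }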

    Reported-by: Pavan Kondeti
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit ad0f1d9d65938aec72a698116cd73a980916895e upstream.

    When the rto_push_irq_work_func() is called, it looks at the RT overloaded
    bitmask in the root domain via the runqueue (rq->rd). The problem is that
    during CPU up and down, nothing here stops rq->rd from changing between
    taking the rq->rd->rto_lock and releasing it. That means the lock that is
    released is not the same lock that was taken.

    Instead of using this_rq()->rd to get the root domain, as the irq work is
    part of the root domain, we can simply get the root domain from the irq work
    that is passed to the routine:

    container_of(work, struct root_domain, rto_push_work)

    This keeps the root domain consistent.

    Reported-by: Pavan Kondeti
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit cef31d9af908243421258f1df35a4a644604efbe upstream.

    timer_create() specifies via sigevent->sigev_notify the signal delivery for
    the new timer. The valid modes are SIGEV_NONE, SIGEV_SIGNAL, SIGEV_THREAD
    and (SIGEV_SIGNAL | SIGEV_THREAD_ID).

    The sanity check in good_sigevent() is only checking the valid combination
    for the SIGEV_THREAD_ID bit, i.e. SIGEV_SIGNAL, but if SIGEV_THREAD_ID is
    not set it accepts any random value.

    This has no real effects on the posix timer and signal delivery code, but
    it affects show_timer() which handles the output of /proc/$PID/timers. That
    function uses a string array to pretty print sigev_notify. The access to
    that array has no bound checks, so random sigev_notify cause access beyond
    the array bounds.

    Add proper checks for the valid notify modes and remove the SIGEV_THREAD_ID
    masking from various code paths as SIGEV_NONE can never be set in
    combination with SIGEV_THREAD_ID.
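
    A sketch of the tightened check in good_sigevent(), covering exactly the
    valid modes listed above (helper names assumed from the posix-timer code):

    switch (event->sigev_notify) {
    case SIGEV_SIGNAL | SIGEV_THREAD_ID:
            rtn = find_task_by_vpid(event->sigev_notify_thread_id);
            if (!rtn || !same_thread_group(rtn, current))
                    return NULL;
            /* FALLTHRU */
    case SIGEV_SIGNAL:
    case SIGEV_THREAD:
            if (event->sigev_signo <= 0 || event->sigev_signo > SIGRTMAX)
                    return NULL;
            /* FALLTHRU */
    case SIGEV_NONE:
            return rtn;
    default:
            /* any other random value is rejected now */
            return NULL;
    }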

    Reported-by: Eric Biggers
    Reported-by: Dmitry Vyukov
    Reported-by: Alexey Dobriyan
    Signed-off-by: Thomas Gleixner
    Cc: John Stultz
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

13 Feb, 2018

1 commit

  • (cherry picked from commit caf7501a1b4ec964190f31f9c3f163de252273b8)

    There's a risk that a kernel which has full retpoline mitigations becomes
    vulnerable when a module gets loaded that hasn't been compiled with the
    right compiler or the right option.

    To enable detection of that mismatch at module load time, add a module info
    string "retpoline" at build time when the module was compiled with
    retpoline support. This only covers compiled C source; assembler source
    and prebuilt object files are not checked.

    If a retpoline enabled kernel detects a non retpoline protected module at
    load time, print a warning and report it in the sysfs vulnerability file.

    [ tglx: Massaged changelog ]

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Cc: David Woodhouse
    Cc: gregkh@linuxfoundation.org
    Cc: torvalds@linux-foundation.org
    Cc: jeyu@kernel.org
    Cc: arjan@linux.intel.com
    Link: https://lkml.kernel.org/r/20180125235028.31211-1-andi@firstfloor.org
    Signed-off-by: David Woodhouse
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

31 Jan, 2018

7 commits

  • [ upstream commit f37a8cb84cce18762e8f86a70bd6a49a66ab964c ]

    Alexei found that the verifier does not reject stores into the context
    via BPF_ST instead of BPF_STX. And while looking at it, we should also
    not allow the XADD variant of BPF_STX.

    The context rewriter is only assuming either BPF_LDX_MEM- or
    BPF_STX_MEM-type operations, thus reject anything other than
    that so that assumptions in the rewriter properly hold. Add
    test cases as well for BPF selftests.

    Fixes: d691f9e8d440 ("bpf: allow programs to write to certain skb fields")
    Reported-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ upstream commit 68fda450a7df51cff9e5a4d4a4d9d0d5f2589153 ]

    Due to some JITs doing the if (src_reg == 0) check in 64-bit mode for
    div/mod operations, mask the upper 32 bits of the src register before
    doing the check.

    Fixes: 622582786c9e ("net: filter: x86: internal BPF JIT")
    Fixes: 7a12b5031c6b ("sparc64: Add eBPF JIT.")
    Reported-by: syzbot+48340bb518e88849e2e3@syzkaller.appspotmail.com
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • [ upstream commit c366287ebd698ef5e3de300d90cd62ee9ee7373e ]

    Divides by zero are not nice, let's avoid them if possible.

    Also, do_div() seems not to be needed when dealing with 32-bit operands,
    but this is a minor detail.
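
    In the interpreter the guard looks roughly like this (labels and macros as
    used by the BPF interpreter in kernel/bpf/core.c; a sketch):

    ALU64_DIV_X:
            if (unlikely(SRC == 0))
                    return 0;       /* defined result instead of a trap */
            DST = div64_u64(DST, SRC);
            CONT;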

    Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's instruction set")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ upstream commit 7891a87efc7116590eaba57acc3c422487802c6f ]

    The following snippet was throwing an 'unknown opcode cc' warning
    in BPF interpreter:

    0: (18) r0 = 0x0
    2: (7b) *(u64 *)(r10 -16) = r0
    3: (cc) (u32) r0 s>>= (u32) r0
    4: (95) exit

    Although a number of JITs do support BPF_ALU | BPF_ARSH | BPF_{K,X}
    generation, not all of them do, and the interpreter does not either. We
    can leave the existing ones and implement it later in bpf-next for the
    remaining ones, but reject this properly in the verifier for the time
    being.
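
    The verifier-side rejection is essentially the following check in
    check_alu_op() (a sketch; verifier helper signatures vary between kernel
    versions):

    if (opcode == BPF_ARSH && BPF_CLASS(insn->code) != BPF_ALU64) {
            verbose("BPF_ARSH not supported for 32 bit ALU\n");
            return -EINVAL;
    }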

    Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
    Reported-by: syzbot+93c4904c5c70348a6890@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ upstream commit 290af86629b25ffd1ed6232c4e9107da031705cb ]

    The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.

    A quote from the Google Project Zero blog:
    "At this point, it would normally be necessary to locate gadgets in
    the host kernel code that can be used to actually leak data by reading
    from an attacker-controlled location, shifting and masking the result
    appropriately and then using the result of that as offset to an
    attacker-controlled address for a load. But piecing gadgets together
    and figuring out which ones work in a speculation context seems annoying.
    So instead, we decided to use the eBPF interpreter, which is built into
    the host kernel - while there is no legitimate way to invoke it from inside
    a VM, the presence of the code in the host kernel's text section is sufficient
    to make it usable for the attack, just like with ordinary ROP gadgets."

    To make the attacker's job harder, introduce a BPF_JIT_ALWAYS_ON config
    option that removes the interpreter from the kernel in favor of JIT-only mode.
    So far eBPF JIT is supported by:
    x64, arm64, arm32, sparc64, s390, powerpc64, mips64

    The start of the JITed program is randomized and the code page is marked as
    read-only. In addition, "constant blinding" can be turned on with
    net.core.bpf_jit_harden

    v2->v3:
    - move __bpf_prog_ret0 under ifdef (Daniel)

    v1->v2:
    - fix init order, test_bpf and cBPF (Daniel's feedback)
    - fix offloaded bpf (Jakub's feedback)
    - add 'return 0' dummy in case something can invoke prog->bpf_func
    - retarget bpf tree. For bpf-next the patch would need one extra hunk.
    It will be sent when the trees are merged back to net-next

    Considered doing:
    int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
    but it seems better to land the patch as-is and in bpf-next remove
    bpf_jit_enable global variable from all JITs, consolidate in one place
    and remove this jit_init() function.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • [ upstream commit 90caccdd8cc0215705f18b92771b449b01e2474a ]

    - bpf prog_array just like all other types of bpf array accepts 32-bit index.
    Clarify that in the comment.
    - fix x64 JIT of bpf_tail_call which was incorrectly loading 8 instead of 4 bytes
    - tighten corresponding check in the interpreter to stay consistent

    The JIT bug can be triggered after the introduction of the BPF_F_NUMA_NODE
    flag in commit 96eabe7a40aa in 4.14. Before that the map_flags would stay
    zero and, even though the JIT code is wrong, it would check bounds
    correctly. Hence the two Fixes tags. All other JITs don't have this
    problem.

    Signed-off-by: Alexei Starovoitov
    Fixes: 96eabe7a40aa ("bpf: Allow selecting numa node during map creation")
    Fixes: b52f00e6a715 ("x86: bpf_jit: implement bpf_tail_call() helper")
    Acked-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • commit d5421ea43d30701e03cadc56a38854c36a8b4433 upstream.

    The hrtimer interrupt code contains a hang detection and mitigation
    mechanism, which prevents a long delayed hrtimer interrupt from causing
    continuous retriggering of interrupts that prevent the system from making
    progress. If a hang is detected then the timer hardware is programmed with
    a certain delay into the future and a flag is set in the hrtimer cpu base
    which prevents newly enqueued timers from reprogramming the timer hardware
    prior to the chosen delay. The subsequent hrtimer interrupt after the delay
    clears the flag and resumes normal operation.

    If such a hang happens in the last hrtimer interrupt before a CPU is
    unplugged then the hang_detected flag is set and stays that way when the
    CPU is plugged in again. At that point the timer hardware is not armed and
    it cannot be armed because the hang_detected flag is still active, so
    nothing clears that flag. As a consequence the CPU does not receive hrtimer
    interrupts and no timers expire on that CPU which results in RCU stalls and
    other malfunctions.

    Clear the flag along with some other less critical members of the hrtimer
    cpu base to ensure starting from a clean state when a CPU is plugged in.
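
    A sketch of the reset performed when a CPU's hrtimer base is
    (re)initialised (field names as used by struct hrtimer_cpu_base):

    cpu_base->active_bases = 0;
    cpu_base->hres_active = 0;
    cpu_base->hang_detected = 0;   /* the stale flag described above */
    cpu_base->next_timer = NULL;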

    Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
    root cause of that hard to reproduce heisenbug. Once understood it's
    trivial and certainly justifies a brown paperbag.

    Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
    Reported-by: Paul E. McKenney
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Sebastian Sewior
    Cc: Anna-Maria Gleixner
    Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

24 Jan, 2018

5 commits

  • commit 62635ea8c18f0f62df4cc58379e4f1d33afd5801 upstream.

    show_workqueue_state() can print out a lot of messages while being in
    atomic context, e.g. sysrq-t -> show_workqueue_state(). If the console
    device is slow it may end up triggering the NMI hard lockup watchdog.

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Sergey Senozhatsky
     
  • commit 1ebe1eaf2f02784921759992ae1fde1a9bec8fd0 upstream.

    Since enums do not get converted by the TRACE_EVENT macro into their values,
    the event format displays the enum name and not the value. This breaks
    tools like perf and trace-cmd that need to interpret the raw binary data. To
    solve this, an enum map was created to convert these enums into their actual
    numbers on boot up. This is done by TRACE_EVENTS() adding a
    TRACE_DEFINE_ENUM() macro.

    Some enums were not being converted. This was caused by an optimization
    that had a bug in it.

    All calls get checked against this enum map to see if it should be converted
    or not, and it compares the call's system to the system that the enum map
    was created under. If they match, then the call is processed.

    To cut down on the number of iterations needed to find the maps with a
    matching system, since calls and maps are grouped by system, when a match is
    made, the index into the map array is saved, so that the next call, if it
    belongs to the same system as the previous call, could start right at that
    array index and not have to scan all the previous arrays.

    The problem was, the saved index was used as the variable to know if this is
    a call in a new system or not. If the index was zero, it was assumed that
    the call is in a new system and would keep incrementing the saved index
    until it found a matching system. The issue arises when the first matching
    system was at index zero. The next map, if it belonged to the same system,
    would then think it was the first match and increment the index to one. If
    the next call belonged to the same system, it would begin its search of the
    maps off by one, and miss the first enum that should be converted. This left
    a single enum not converted properly.

    Also add a comment to describe exactly what that index was for. It took me a
    bit too long to figure out what I was thinking when debugging this issue.

    Link: http://lkml.kernel.org/r/717BE572-2070-4C1E-9902-9F2E0FEDA4F8@oracle.com

    Fixes: 0c564a538aa93 ("tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values")
    Reported-by: Chuck Lever
    Tested-by: Chuck Lever
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit ae83b56a56f8d9643dedbee86b457fa1c5d42f59 upstream.

    When a constrained task is throttled by dl_check_constrained_dl(),
    it may carry the remaining positive runtime, as a result of which, when
    dl_task_timer() fires and calls replenish_dl_entity(), it will
    not be replenished correctly due to the positive dl_se->runtime.

    This patch assigns its runtime to 0 if positive after throttling.
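
    The fix boils down to the following assignment after the task has been
    throttled in dl_check_constrained_dl() (a sketch):

    dl_se->dl_throttled = 1;
    if (dl_se->runtime > 0)
            dl_se->runtime = 0;   /* drop the leftover positive runtime */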

    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Daniel Bristot de Oliveira
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Fixes: df8eac8cafce ("sched/deadline: Throttle a constrained deadline task activated after the deadline")
    Link: http://lkml.kernel.org/r/1494421417-27550-1-git-send-email-xlpang@redhat.com
    Signed-off-by: Ingo Molnar
    Cc: Ben Hutchings
    Signed-off-by: Greg Kroah-Hartman

    Xunlei Pang
     
  • commit ed4bbf7910b28ce3c691aef28d245585eaabda06 upstream.

    When the timer base is checked for expired timers then the deferrable base
    must be checked as well. This was missed when making the deferrable base
    independent of base::nohz_active.

    Fixes: ced6d5c11d3e ("timers: Use deferrable base independent of base::nohz_active")
    Signed-off-by: Thomas Gleixner
    Cc: Anna-Maria Gleixner
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Paul McKenney
    Cc: rt@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit fbe0e839d1e22d88810f3ee3e2f1479be4c0aa4a upstream.

    UBSAN reports signed integer overflow in kernel/futex.c:

    UBSAN: Undefined behaviour in kernel/futex.c:2041:18
    signed integer overflow:
    0 - -2147483648 cannot be represented in type 'int'

    Add a sanity check to catch negative values of nr_wake and nr_requeue.
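
    The added sanity check amounts to something like this at the top of the
    requeue/wake paths (a sketch):

    if (nr_wake < 0 || nr_requeue < 0)
            return -EINVAL;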

    Signed-off-by: Li Jinyue
    Signed-off-by: Thomas Gleixner
    Cc: peterz@infradead.org
    Cc: dvhart@infradead.org
    Link: https://lkml.kernel.org/r/1513242294-31786-1-git-send-email-lijinyue@huawei.com
    Signed-off-by: Greg Kroah-Hartman

    Li Jinyue
     

17 Jan, 2018

3 commits

  • commit bbeb6e4323dad9b5e0ee9f60c223dd532e2403b1 upstream.

    syzkaller tried to alloc a map with 0xfffffffd entries out of a userns,
    and thus unprivileged. With the recently added logic in b2157399cc98
    ("bpf: prevent out-of-bounds speculation") we round this up to the next
    power of two value for max_entries for unprivileged such that we can
    apply proper masking into potentially zeroed out map slots.

    However, this will generate an index_mask of 0xffffffff, and therefore
    a + 1 will let this overflow into new max_entries of 0. This will pass
    allocation, etc, and later on map access we still enforce on the original
    attr->max_entries value which was 0xfffffffd, therefore triggering GPF
    all over the place. Thus bail out on overflow in such case.

    Moreover, on 32 bit archs roundup_pow_of_two() can also not be used,
    since fls_long(max_entries - 1) can result in 32 and 1UL << 32 in 32 bit
    space is undefined. Therefore, do this by hand in a 64 bit variable.

    This fixes all the issues triggered by syzkaller's reproducers.
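
    A sketch of the overflow-safe mask computation in array_map_alloc()
    (variable names follow the description above):

    u32 max_entries = attr->max_entries;
    u32 index_mask;
    u64 mask64;

    /* fls_long(max_entries - 1) can be 32 and 1UL << 32 is undefined in
     * 32-bit space, so build the mask in a 64-bit variable instead.
     */
    mask64 = fls_long(max_entries - 1);
    mask64 = 1ULL << mask64;
    mask64 -= 1;

    index_mask = mask64;
    if (unpriv) {
            /* round up array size to the nearest power of 2, since the CPU
             * will speculate within index_mask limits
             */
            max_entries = index_mask + 1;
            /* bail out on overflow (0xffffffff + 1 wraps to 0) */
            if (max_entries < attr->max_entries)
                    return ERR_PTR(-E2BIG);
    }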

    Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
    Reported-by: syzbot+b0efb8e572d01bce1ae0@syzkaller.appspotmail.com
    Reported-by: syzbot+6c15e9744f75f2364773@syzkaller.appspotmail.com
    Reported-by: syzbot+d2f5524fb46fd3b312ee@syzkaller.appspotmail.com
    Reported-by: syzbot+61d23c95395cc90dbc2b@syzkaller.appspotmail.com
    Reported-by: syzbot+0d363c942452cca68c01@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • commit b2157399cc9898260d6031c5bfe45fe137c1fbe7 upstream.

    Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
    memory accesses under a bounds check may be speculated even if the
    bounds check fails, providing a primitive for building a side channel.

    To avoid leaking kernel data, round up array-based maps and mask the index
    after the bounds check, so a speculated load with an out-of-bounds index
    will load either a valid value from the array or zero from the padded area.

    Unconditionally mask index for all array types even when max_entries
    are not rounded to power of 2 for root user.
    When map is created by unpriv user generate a sequence of bpf insns
    that includes AND operation to make sure that JITed code includes
    the same 'index & index_mask' operation.

    If prog_array map is created by unpriv user replace
    bpf_tail_call(ctx, map, index);
    with
    if (index >= max_entries) {
    index &= map->index_mask;
    bpf_tail_call(ctx, map, index);
    }
    (along with roundup to power 2) to prevent out-of-bounds speculation.
    There is secondary redundant 'if (index >= max_entries)' in the interpreter
    and in all JITs, but they can be optimized later if necessary.

    Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
    cannot be used by unpriv, so no changes there.

    That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
    all architectures with and without JIT.

    v2->v3:
    Daniel noticed that attack potentially can be crafted via syscall commands
    without loading the program, so add masking to those paths as well.

    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Cc: Jiri Slaby
    [ Backported to 4.9 - gregkh ]
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • commit 79741b3bdec01a8628368fbcfccc7d189ed606cb upstream.

    reduce indent and make it iterate over instructions similar to
    convert_ctx_accesses(). Also convert hard BUG_ON into soft verifier error.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Cc: Jiri Slaby
    [Backported to 4.9.y - gregkh]
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov