25 Apr, 2018

1 commit

  • When CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=n, the call path
    hrtimer_reprogram -> clockevents_program_event ->
    clockevents_program_min_delta will not retry if the clock event driver
    returns -ETIME.

    If the driver could not satisfy the program_min_delta for any reason, the
    lack of a retry means the CPU may not receive a tick interrupt, potentially
    until the counter does a full period. This leads to rcu_sched timeout
    messages as the stalled CPU is detected by other CPUs, and other issues if
    the CPU is holding locks or other resources at the point at which it
    stalls.

    There have been a couple of observed mechanisms through which a clock event
    driver could not satisfy the requested min_delta and return -ETIME.

    With the MIPS GIC driver, the shared execution resource within MT cores
    means that latency due to execution of instructions from other hardware
    threads in the core, within gic_next_event, can result in an event being
    set in the past.

    Additionally under virtualisation it is possible to get unexpected latency
    during a clockevent device's set_next_event() callback which can make it
    return -ETIME even for a delta based on min_delta_ns.

    It isn't appropriate to use MIN_ADJUST in the virtualisation case as
    occasional hypervisor induced high latency will cause min_delta_ns to
    quickly increase to the maximum.

    Instead, borrow the retry pattern from the MIN_ADJUST case, but without
    making adjustments. Retry up to 10 times, each time increasing the
    attempted delta by min_delta, before giving up.
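
    A minimal sketch of the resulting retry loop (field and helper names as
    used by the clockevents code; an illustration of the pattern rather than
    the exact patch):

    static int clockevents_program_min_delta(struct clock_event_device *dev)
    {
            unsigned long long clc;
            int64_t delta = 0;
            int i;

            for (i = 0; i < 10; i++) {
                    /* each attempt pushes the event further out by min_delta */
                    delta += dev->min_delta_ns;
                    dev->next_event = ktime_add_ns(ktime_get(), delta);

                    if (clockevent_state_shutdown(dev))
                            return 0;

                    dev->retries++;
                    clc = ((unsigned long long) delta * dev->mult) >> dev->shift;
                    if (dev->set_next_event((unsigned long) clc, dev) == 0)
                            return 0;
            }
            return -ETIME;
    }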

    [ Matt: Reworked the loop and made retry increase the delta. ]

    Signed-off-by: James Hogan
    Signed-off-by: Matt Redfearn
    Signed-off-by: Thomas Gleixner
    Cc: linux-mips@linux-mips.org
    Cc: Daniel Lezcano
    Cc: "Martin Schwidefsky"
    Cc: James Hogan
    Link: https://lkml.kernel.org/r/1508422643-6075-1-git-send-email-matt.redfearn@mips.com

    James Hogan
     

12 Apr, 2018

1 commit

  • commit 39290b389ea upstream.

    The current "rodata=off" parameter disables read-only kernel mappings
    under CONFIG_DEBUG_RODATA:
    commit d2aa1acad22f ("mm/init: Add 'rodata=off' boot cmdline parameter
    to disable read-only kernel mappings")

    This patch is a logical extension to module mappings, i.e. read-only
    mappings at module loading can be disabled even if
    CONFIG_DEBUG_SET_MODULE_RONX is enabled (mainly for debug use). Please
    note, however, that it only affects RO/RW permissions, keeping NX set.

    This is the first step to make CONFIG_DEBUG_SET_MODULE_RONX mandatory
    (always-on) in the future, as CONFIG_DEBUG_RODATA already is on x86 and
    arm64.

    Suggested-by: and Acked-by: Mark Rutland
    Signed-off-by: AKASHI Takahiro
    Reviewed-by: Kees Cook
    Acked-by: Rusty Russell
    Link: http://lkml.kernel.org/r/20161114061505.15238-1-takahiro.akashi@linaro.org
    Signed-off-by: Jessica Yu
    Signed-off-by: Alex Shi

    Conflicts:
    keeping kaiser.h in init/main.c

    AKASHI Takahiro
     

20 Mar, 2018

5 commits

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later) and thus cleans up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    (cherry picked from commit ef295ecf090d3e86e5b742fc6ab34f1122a43773)

    Conflicts:
    block/blk-mq.c
    include/linux/blk_types.h
    include/linux/blkdev.h

    Christoph Hellwig
     
  • There is no need to always call the blocking console_lock() in
    console_cpu_notify(); it's quite possible that console_sem is already
    locked by another CPU on the system, either already printing or soon to
    begin printing the messages. console_lock() in this case can simply block
    CPU hotplug for an unknown period of time (console_unlock() is not time
    bound). Not that hotplug is very fast, but still, with other CPUs online
    and doing printk(), console_cpu_notify() can get stuck.

    Use console_trylock() instead and opt out if console_sem is already
    acquired by another CPU, since that CPU will do the printing for us.
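
    A minimal sketch of that pattern (notifier signature simplified here; not
    the exact hunk):

    static int console_cpu_notify(unsigned int cpu)
    {
            /*
             * Only flush if nobody else holds console_sem; the current
             * owner will print the pending messages on console_unlock().
             */
            if (console_trylock())
                    console_unlock();
            return 0;
    }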

    Link: http://lkml.kernel.org/r/20170121104729.8585-1-sergey.senozhatsky@gmail.com
    Cc: Steven Rostedt
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Sebastian Andrzej Siewior
    Cc: Ingo Molnar
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Petr Mladek

    This patch also fixes a deadlock that happens if while holding the
    console lock someone issues a call that eventually takes the cpu
    hotplug lock, like in the case below, where the following happens:

    * task Xorg issues an ioctl to the fb layer which takes the console
    lock and calls the driver's ioctl routine

    * at the same time, task bat-cpuhotplug issues a hotplug cpu enable
    operation which takes the cpu hotplug lock and waits for the cpu
    bringup operation to complete

    * the fb driver calls dma_alloc_coherent which, on this platform,
    eventually tries to take the cpu hotplug lock

    * task cpuhp/2 tries to flush the console

    * at this point task Xorg waits for task bat-cpuhotplug to release
    the cpu hotplug lock, which waits for task cpuhp/2 to signal that
    the CPU is up, which waits for Xorg to release the console
    lock

    Linux version 4.9.11-02771-gd85d61b-dirty
    CPU: ARMv7 Processor [412fc09a] revision 10 (ARMv7), cr=10c53c7d
    CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
    OF: fdt:Machine model: Freescale i.MX6 Quad SABRE Smart Device Board

    sysrq: SysRq : Show Blocked State

    task PC stack pid father
    cpuhp/2 D 0 18 2 0x00000000
    [] (__schedule) from [] (schedule+0x4c/0xac)
    [] (schedule) from [] (schedule_timeout+0x1e8/0x2fc)
    [] (schedule_timeout) from [] (__down+0x64/0x9c)
    [] (__down) from [] (down+0x44/0x58)
    [] (down) from [] (console_lock+0x2c/0x74)
    [] (console_lock) from [] (console_cpu_notify+0x28/0x34)
    [] (console_cpu_notify) from [] (notifier_call_chain+0x44/0x84)
    [] (notifier_call_chain) from [] (__cpu_notify+0x38/0x50)
    [] (__cpu_notify) from [] (notify_online+0x18/0x20)
    [] (notify_online) from [] (cpuhp_up_callbacks+0x24/0xd4)
    [] (cpuhp_up_callbacks) from [] (cpuhp_thread_fun+0x13c/0x148)
    [] (cpuhp_thread_fun) from [] (smpboot_thread_fn+0x17c/0x2dc)
    [] (smpboot_thread_fn) from [] (kthread+0xf0/0x108)
    [] (kthread) from [] (ret_from_fork+0x14/0x24)

    bat-cpuhotplug. D 0 841 718 0x00000000
    [] (__schedule) from [] (schedule+0x4c/0xac)
    [] (schedule) from [] (schedule_timeout+0x1e8/0x2fc)
    [] (schedule_timeout) from [] (wait_for_common+0xb0/0x160)
    [] (wait_for_common) from [] (bringup_cpu+0x50/0xa8)
    [] (bringup_cpu) from [] (cpuhp_up_callbacks+0x24/0xd4)
    [] (cpuhp_up_callbacks) from [] (_cpu_up+0xa8/0xec)
    [] (_cpu_up) from [] (do_cpu_up+0x74/0x9c)
    [] (do_cpu_up) from [] (device_online+0x68/0x8c)
    [] (device_online) from [] (online_store+0x68/0x74)
    [] (online_store) from [] (kernfs_fop_write+0xf4/0x1f8)
    [] (kernfs_fop_write) from [] (__vfs_write+0x1c/0x114)
    [] (__vfs_write) from [] (vfs_write+0xa4/0x168)
    [] (vfs_write) from [] (SyS_write+0x3c/0x90)
    [] (SyS_write) from [] (ret_fast_syscall+0x0/0x1c)

    Xorg D 0 860 832 0x00000000
    [] (__schedule) from [] (schedule+0x4c/0xac)
    [] (schedule) from [] (schedule_preempt_disabled+0x14/0x20)
    [] (schedule_preempt_disabled) from [] (mutex_lock_nested+0x1f8/0x4a4)
    [] (mutex_lock_nested) from [] (get_online_cpus+0x78/0xbc)
    [] (get_online_cpus) from [] (lru_add_drain_all+0x48/0x1b4)
    [] (lru_add_drain_all) from [] (migrate_prep+0x8/0x10)
    [] (migrate_prep) from [] (alloc_contig_range+0xd0/0x320)
    [] (alloc_contig_range) from [] (cma_alloc+0xb8/0x1a8)
    [] (cma_alloc) from [] (__alloc_from_contiguous+0x38/0xd8)
    [] (__alloc_from_contiguous) from [] (cma_allocator_alloc+0x34/0x3c)
    [] (cma_allocator_alloc) from [] (__dma_alloc+0x174/0x338)
    [] (__dma_alloc) from [] (arm_dma_alloc+0x40/0x48)
    [] (arm_dma_alloc) from [] (mxcfb_set_par+0x8ec/0xd7c)
    [] (mxcfb_set_par) from [] (fb_set_var+0x1d4/0x358)
    [] (fb_set_var) from [] (do_fb_ioctl+0x4e4/0x704)
    [] (do_fb_ioctl) from [] (do_vfs_ioctl+0xa0/0xa10)
    [] (do_vfs_ioctl) from [] (SyS_ioctl+0x34/0x5c)
    [] (SyS_ioctl) from [] (ret_fast_syscall+0x0/0x1c)

    Signed-off-by: Octavian Purdila
    Reviewed-by: Leonard Crestez

    Sergey Senozhatsky
     
  • When the tick is stopped and we reach the dynticks evaluation code on
    IRQ exit, we perform a soft tick restart if we observe an expired timer
    from there. It means we program the nearest possible tick but we stay in
    dynticks mode (ts->tick_stopped = 1) because we may need to stop the tick
    again after that expired timer is handled.

    Now this solution works most of the time but if we suffer an IRQ storm
    and those interrupts trigger faster than the hardware clockevents min
    delay, our tick won't fire until that IRQ storm is finished.

    Here is the problem: on IRQ exit we reprog the timer to at least
    NOW() + min_clockevents_delay. Another IRQ fires before the tick so we
    reschedule again to NOW() + min_clockevents_delay, etc... The tick
    is eternally rescheduled min_clockevents_delay ahead.

    A solution is to simply remove this soft tick restart. After all, the
    normal dynticks evaluation path can handle a 0 delay just fine. And by
    doing that we benefit from the optimization branch which avoids clock
    reprogramming if the clockevents deadline hasn't changed since the last
    reprogramming. This fixes our issue because we no longer do repetitive
    clock reprogramming that always adds the hardware min delay.

    As a side effect it should even optimize the 0 delay path in general.

    Reported-and-tested-by: Octavian Purdila
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     
  • Move the x86_64 idle notifiers, originally added by Andi Kleen and
    Venkatesh Pallipadi, to generic code.

    Change-Id: Idf29cda15be151f494ff245933c12462643388d5
    Acked-by: Nicolas Pitre
    Signed-off-by: Todd Poynor

    Todd Poynor
     
  • These macros can be reused by governors which don't use the common
    governor code present in cpufreq_governor.c and should be moved to the
    relevant header.

    Now that they are getting moved to the right header file, reuse them in
    schedutil governor as well (that required rename of show/store
    routines).

    Also create gov_attr_wo() macro for write-only sysfs files, this will be
    used by Interactive governor in a later patch.

    Signed-off-by: Viresh Kumar

    Viresh Kumar
     

18 Mar, 2018

1 commit

  • commit 27d4ee03078aba88c5e07dcc4917e8d01d046f38 upstream.

    Introduce a helper to retrieve the current task's work struct if it is
    a workqueue worker.

    This allows us to fix a long-standing deadlock in several DRM drivers
    wherein the ->runtime_suspend callback waits for a specific worker to
    finish and that worker in turn calls a function which waits for runtime
    suspend to finish. That function is invoked from multiple call sites
    and waiting for runtime suspend to finish is the correct thing to do
    except if it's executing in the context of the worker.
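
    A sketch of the helper (assuming the workqueue-internal current_wq_worker()
    accessor and the worker's current_work field):

    struct work_struct *current_work(void)
    {
            struct worker *worker = current_wq_worker();

            /* NULL if the current task is not a workqueue worker */
            return worker ? worker->current_work : NULL;
    }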

    Cc: Lai Jiangshan
    Cc: Dave Airlie
    Cc: Ben Skeggs
    Cc: Alex Deucher
    Acked-by: Tejun Heo
    Reviewed-by: Lyude Paul
    Signed-off-by: Lukas Wunner
    Link: https://patchwork.freedesktop.org/patch/msgid/2d8f603074131eb87e588d2b803a71765bd3a2fd.1518338788.git.lukas@wunner.de
    Signed-off-by: Greg Kroah-Hartman

    Lukas Wunner
     

11 Mar, 2018

4 commits

  • [ upstream commit 32fff239de37ef226d5b66329dd133f64d63b22d ]

    syzbot managed to trigger RCU-detected stalls in
    bpf_array_free_percpu().

    It takes time to allocate a huge percpu map, but even more time to free
    it.

    Since we run in process context, use cond_resched() to yield cpu if
    needed.
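
    A sketch of the freeing loop with the added cond_resched() (field names as
    in the bpf array map code):

    static void bpf_array_free_percpu(struct bpf_array *array)
    {
            int i;

            for (i = 0; i < array->map.max_entries; i++) {
                    free_percpu(array->pptrs[i]);
                    cond_resched();  /* yield; freeing a huge map takes a while */
            }
    }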

    Fixes: a10423b87a7e ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ upstream commit 9c2d63b843a5c8a8d0559cc067b5398aa5ec3ffc ]

    syzkaller recently triggered an OOM during percpu map allocation;
    while there is work in progress by Dennis Zhou to add __GFP_NORETRY
    semantics for the percpu allocator under pressure, there also seems to
    be a missing bpf_map_precharge_memlock() check in array map allocation.

    Given that today the actual bpf_map_charge_memlock() happens after
    find_and_alloc_map() in the syscall path, bpf_map_precharge_memlock()
    is there to bail out early, before we go and do the map setup work,
    when we find that we hit the limits anyway. Therefore add this for the
    array map as well.
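
    Roughly, the early bail-out in array_map_alloc() looks like this (cost
    computation abbreviated; a sketch, not the exact hunk):

    cost = array_size;
    if (percpu)
            cost += (u64) attr->max_entries * elem_size * num_possible_cpus();
    if (cost >= U32_MAX - PAGE_SIZE)
            return ERR_PTR(-ENOMEM);

    /* charge against the memlock limit before doing the real setup work */
    ret = bpf_map_precharge_memlock(round_up(cost, PAGE_SIZE) >> PAGE_SHIFT);
    if (ret < 0)
            return ERR_PTR(ret);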

    Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
    Fixes: a10423b87a7e ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
    Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann
    Cc: Dennis Zhou
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ upstream commit a316338cb71a3260201490e615f2f6d5c0d8fb2c ]

    trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
    attr->map_flags, since it does not support preallocation yet. We
    check the flag, but we never copy the flag into trie->map.map_flags,
    which is later on exposed into fdinfo and used by loaders such as
    iproute2. The latter uses this in bpf_map_selfcheck_pinned() to test
    whether a pinned map has the same spec as the one from the BPF obj
    file and if not, bails out, which is currently the case for lpm
    since it always exposes 0 as flags.

    Also copy over flags in array_map_alloc() and stack_map_alloc().
    They always have to be 0 right now, but we should make sure not to
    miss copying them over at a later point in time when we add actual
    flags for them to use.
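
    The missing piece is essentially a one-line copy per map type, along the
    lines of (a sketch):

    /* in trie_alloc(), after validating attr->map_flags */
    trie->map.map_flags = attr->map_flags;

    /* and likewise in array_map_alloc() / stack_map_alloc() */
    array->map.map_flags = attr->map_flags;
    smap->map.map_flags = attr->map_flags;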

    Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
    Reported-by: Jarno Rajahalme
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • commit c52232a49e203a65a6e1a670cd5262f59e9364a0 upstream.

    On CPU hotunplug the enqueued timers of the unplugged CPU are migrated to a
    live CPU. This happens from the control thread which initiated the unplug.

    If the CPU on which the control thread runs came out from a longer idle
    period then the base clock of that CPU might be stale because the control
    thread runs prior to any event which forwards the clock.

    In such a case the timers from the unplugged CPU are queued on the live CPU
    based on the stale clock which can cause large delays due to increased
    granularity of the outer timer wheels which are far away from base::clk.

    But there is a worse problem than that. The following sequence of events
    illustrates it:

    - CPU0 timer1 is queued expires = 59969 and base->clk = 59131.

    The timer is queued at wheel level 2, with resulting expiry time = 60032
    (due to level granularity).

    - CPU1 enters idle @60007, with next timer expiry @60020.

    - CPU0 is hotplugged at @60009

    - CPU1 exits idle and runs the control thread which migrates the
    timers from CPU0

    timer1 is now queued in level 0 for immediate handling in the next
    softirq because the requested expiry time 59969 is before CPU1 base->clk
    60007

    - CPU1 runs code which forwards the base clock, which succeeds because
    the next expiring timer, which was collected at idle entry time, is still
    set to 60020.

    So it forwards beyond 60007 and therefore fails to expire the migrated
    timer1. That timer gets expired when the wheel wraps around again, which
    takes between 63 and 630 ms depending on the HZ setting.

    Address both problems by invoking forward_timer_base() for the control
    CPU's timer base. All other places which might run into a similar problem
    (mod_timer()/add_timer_on()) already invoke forward_timer_base() to avoid
    that.

    [ tglx: Massaged comment and changelog ]

    Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
    Co-developed-by: Neeraj Upadhyay
    Signed-off-by: Neeraj Upadhyay
    Signed-off-by: Lingutla Chandrasekhar
    Signed-off-by: Thomas Gleixner
    Cc: Anna-Maria Gleixner
    Cc: linux-arm-msm@vger.kernel.org
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20180118115022.6368-1-clingutla@codeaurora.org
    Signed-off-by: Greg Kroah-Hartman

    Lingutla Chandrasekhar
     

03 Mar, 2018

2 commits

  • [ Upstream commit 11bca0a83f83f6093d816295668e74ef24595944 ]

    An interrupt storm on a bad interrupt will cause the kernel
    log to be clogged.

    [ 60.089234] ->handle_irq(): ffffffffbe2f803f,
    [ 60.090455] 0xffffffffbf2af380
    [ 60.090510] handle_bad_irq+0x0/0x2e5
    [ 60.090522] ->irq_data.chip(): ffffffffbf2af380,
    [ 60.090553] IRQ_NOPROBE set
    [ 60.090584] ->handle_irq(): ffffffffbe2f803f,
    [ 60.090590] handle_bad_irq+0x0/0x2e5
    [ 60.090596] ->irq_data.chip(): ffffffffbf2af380,
    [ 60.090602] 0xffffffffbf2af380
    [ 60.090608] ->action(): (null)
    [ 60.090779] handle_bad_irq+0x0/0x2e5

    This was seen when running an upstream kernel on an Acer Chromebook R11. The
    system was unstable as a result.

    Guard the log message with __printk_ratelimit to reduce the impact. This
    won't prevent the interrupt storm from happening, but at least the system
    remains stable.

    Signed-off-by: Guenter Roeck
    Signed-off-by: Thomas Gleixner
    Cc: Dmitry Torokhov
    Cc: Joe Perches
    Cc: Andy Shevchenko
    Cc: Mika Westerberg
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=197953
    Link: https://lkml.kernel.org/r/1512234784-21038-1-git-send-email-linux@roeck-us.net
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Guenter Roeck
     
  • commit 48d0c9becc7f3c66874c100c126459a9da0fdced upstream.

    The POSIX specification defines that relative CLOCK_REALTIME timers are not
    affected by clock modifications. Those timers have to use CLOCK_MONOTONIC
    to ensure POSIX compliance.

    The introduction of the additional HRTIMER_MODE_PINNED mode broke this
    requirement for pinned timers.

    There is no user space visible impact because user space timers are not
    using pinned mode, but for consistency reasons this needs to be fixed.

    Check whether the mode has the HRTIMER_MODE_REL bit set instead of
    comparing with HRTIMER_MODE_ABS.
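
    In the hrtimer init path this amounts to something like the following
    check (a sketch of the test described above):

    /* relative CLOCK_REALTIME timers must run on CLOCK_MONOTONIC;
     * testing the REL bit also covers the pinned variants */
    if (clock_id == CLOCK_REALTIME && (mode & HRTIMER_MODE_REL))
            clock_id = CLOCK_MONOTONIC;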

    Signed-off-by: Anna-Maria Gleixner
    Cc: Christoph Hellwig
    Cc: John Stultz
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: keescook@chromium.org
    Fixes: 597d0275736d ("timers: Framework for identifying pinned timers")
    Link: http://lkml.kernel.org/r/20171221104205.7269-7-anna-maria@linutronix.de
    Signed-off-by: Ingo Molnar
    Cc: Mike Galbraith
    Signed-off-by: Greg Kroah-Hartman

    Anna-Maria Gleixner
     

28 Feb, 2018

1 commit

  • commit 77dd66a3c67c93ab401ccc15efff25578be281fd upstream.

    If devm_memremap_pages() detects a collision while adding entries
    to the radix-tree, we call pgmap_radix_release(). Unfortunately,
    the function removes *all* entries for the range -- including the
    entries that caused the collision in the first place.

    Modify pgmap_radix_release() to take an additional argument to
    indicate where to stop, so that only newly added entries are removed
    from the tree.

    Cc:
    Fixes: 9476df7d80df ("mm: introduce find_dev_pagemap()")
    Signed-off-by: Jan H. Schönherr
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Jan H. Schönherr
     

25 Feb, 2018

2 commits

  • commit a77660d231f8b3d84fd23ed482e0964f7aa546d6 upstream.

    Currently KCOV_ENABLE does not check if the current task is already
    associated with another kcov descriptor. As a result it is possible
    to associate a single task with more than one kcov descriptor, which
    later leads to a memory leak of the old descriptor. This relation is
    really meant to be one-to-one (a task has only one back link).

    Extend validation to detect such misuse.
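
    A sketch of the extended KCOV_ENABLE validation, as assumed from the
    description (kcov->t is the owning task, t->kcov the task's back link):

    t = current;
    if (kcov->t != NULL || t->kcov != NULL)
            return -EBUSY;  /* task or descriptor already associated */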

    Link: http://lkml.kernel.org/r/20180122082520.15716-1-dvyukov@google.com
    Fixes: 5c9a8750a640 ("kernel: add kcov code coverage")
    Signed-off-by: Dmitry Vyukov
    Reported-by: Shankara Pailoor
    Cc: Dmitry Vyukov
    Cc: syzbot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Vyukov
     
  • commit a6da0024ffc19e0d47712bb5ca4fd083f76b07df upstream.

    We need to ensure that tracepoints are registered and unregistered
    with the users of them. The existing atomic count isn't enough for
    that. Add a lock around the tracepoints, so we serialize access
    to them.

    This fixes cases where we have multiple users setting up and
    tearing down tracepoints, like this:

    CPU: 0 PID: 2995 Comm: syzkaller857118 Not tainted
    4.14.0-rc5-next-20171018+ #36
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:16 [inline]
    dump_stack+0x194/0x257 lib/dump_stack.c:52
    panic+0x1e4/0x41c kernel/panic.c:183
    __warn+0x1c4/0x1e0 kernel/panic.c:546
    report_bug+0x211/0x2d0 lib/bug.c:183
    fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:177
    do_trap_no_signal arch/x86/kernel/traps.c:211 [inline]
    do_trap+0x260/0x390 arch/x86/kernel/traps.c:260
    do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:297
    do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:310
    invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
    RIP: 0010:tracepoint_add_func kernel/tracepoint.c:210 [inline]
    RIP: 0010:tracepoint_probe_register_prio+0x397/0x9a0 kernel/tracepoint.c:283
    RSP: 0018:ffff8801d1d1f6c0 EFLAGS: 00010293
    RAX: ffff8801d22e8540 RBX: 00000000ffffffef RCX: ffffffff81710f07
    RDX: 0000000000000000 RSI: ffffffff85b679c0 RDI: ffff8801d5f19818
    RBP: ffff8801d1d1f7c8 R08: ffffffff81710c10 R09: 0000000000000004
    R10: ffff8801d1d1f6b0 R11: 0000000000000003 R12: ffffffff817597f0
    R13: 0000000000000000 R14: 00000000ffffffff R15: ffff8801d1d1f7a0
    tracepoint_probe_register+0x2a/0x40 kernel/tracepoint.c:304
    register_trace_block_rq_insert include/trace/events/block.h:191 [inline]
    blk_register_tracepoints+0x1e/0x2f0 kernel/trace/blktrace.c:1043
    do_blk_trace_setup+0xa10/0xcf0 kernel/trace/blktrace.c:542
    blk_trace_setup+0xbd/0x180 kernel/trace/blktrace.c:564
    sg_ioctl+0xc71/0x2d90 drivers/scsi/sg.c:1089
    vfs_ioctl fs/ioctl.c:45 [inline]
    do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:685
    SYSC_ioctl fs/ioctl.c:700 [inline]
    SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
    entry_SYSCALL_64_fastpath+0x1f/0xbe
    RIP: 0033:0x444339
    RSP: 002b:00007ffe05bb5b18 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
    RAX: ffffffffffffffda RBX: 00000000006d66c0 RCX: 0000000000444339
    RDX: 000000002084cf90 RSI: 00000000c0481273 RDI: 0000000000000009
    RBP: 0000000000000082 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000206 R12: ffffffffffffffff
    R13: 00000000c0481273 R14: 0000000000000000 R15: 0000000000000000

    since we can now run these in parallel. Ensure that the exported helpers
    for doing this are grabbing the queue trace mutex.

    Reported-by: Steven Rostedt
    Tested-by: Dmitry Vyukov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     

22 Feb, 2018

1 commit

  • commit 10a0cd6e4932b5078215b1ec2c896597eec0eff9 upstream.

    The functions devm_memremap_pages() and devm_memremap_pages_release() use
    different ways to calculate the section-aligned amount of memory. The
    latter function may use an incorrect size if the memory region is small
    but straddles a section border.

    Use the same code for both.

    Cc:
    Fixes: 5f29a77cd957 ("mm: fix mixed zone detection in devm_memremap_pages")
    Signed-off-by: Jan H. Schönherr
    Signed-off-by: Dan Williams
    Signed-off-by: Greg Kroah-Hartman

    Jan H. Schönherr
     

17 Feb, 2018

6 commits

  • commit 7b6586562708d2b3a04fe49f217ddbadbbbb0546 upstream.

    __unregister_ftrace_function_probe() will incorrectly parse the glob filter
    because it resets the search variable that was setup by filter_parse_regex().

    Al Viro reported this:

    After that call of filter_parse_regex() we could have func_g.search not
    equal to glob only if glob started with '!' or '*'. In the former case
    we would've buggered off with -EINVAL (not = 1). In the latter we
    would've set func_g.search equal to glob + 1, calculated the length of
    that thing in func_g.len and proceeded to reset func_g.search back to
    glob.

    Suppose the glob is e.g. *foo*. We end up with
    func_g.type = MATCH_MIDDLE_ONLY;
    func_g.len = 3;
    func_g.search = "*foo";
    Feeding that to ftrace_match_record() will not do anything sane - we
    will be looking for names containing "*foo" (->len is ignored for that
    one).

    Link: http://lkml.kernel.org/r/20180127031706.GE13338@ZenIV.linux.org.uk

    Fixes: 3ba009297149f ("ftrace: Introduce ftrace_glob structure")
    Reviewed-by: Dmitry Safonov
    Reviewed-by: Masami Hiramatsu
    Reported-by: Al Viro
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit a1be1f3931bfe0a42b46fef77a04593c2b136e7f upstream.

    This reverts commit ba62bafe942b ("kernel/relay.c: fix potential memory leak").

    This commit introduced a double free bug, because 'chan' is already
    freed by the line:

    kref_put(&chan->kref, relay_destroy_channel);

    This bug was found by syzkaller, using the BLKTRACESETUP ioctl.

    Link: http://lkml.kernel.org/r/20180127004759.101823-1-ebiggers3@gmail.com
    Fixes: ba62bafe942b ("kernel/relay.c: fix potential memory leak")
    Signed-off-by: Eric Biggers
    Reported-by: syzbot
    Reviewed-by: Andrew Morton
    Cc: Zhouyi Zhou
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit 4f7e988e63e336827f4150de48163bed05d653bd upstream.

    This reverts commit 92266d6ef60c ("async: simplify lowest_in_progress()")
    which was simply wrong: In the case where domain is NULL, we now use the
    wrong offsetof() in the list_first_entry macro, so we don't actually
    fetch the ->cookie value, but rather the eight bytes located
    sizeof(struct list_head) further into the struct async_entry.

    On 64 bit, that's the data member, while on 32 bit, that's a u64 built
    from func and data in some order.

    I think the bug happens to be harmless in practice: It obviously only
    affects callers which pass a NULL domain, and AFAICT the only such
    caller is

    async_synchronize_full() ->
    async_synchronize_full_domain(NULL) ->
    async_synchronize_cookie_domain(ASYNC_COOKIE_MAX, NULL)

    and the ASYNC_COOKIE_MAX means that in practice we end up waiting for
    the async_global_pending list to be empty - but it would break if
    somebody happened to pass (void*)-1 as the data element to
    async_schedule, and of course also if somebody ever does a
    async_synchronize_cookie_domain(, NULL) with a "finite" cookie value.

    Maybe the "harmless in practice" means this isn't -stable material. But
    I'm not completely confident my quick git grep'ing is enough, and there
    might be affected code in one of the earlier kernels that has since been
    removed, so I'll leave the decision to the stable guys.

    Link: http://lkml.kernel.org/r/20171128104938.3921-1-linux@rasmusvillemoes.dk
    Fixes: 92266d6ef60c "async: simplify lowest_in_progress()"
    Signed-off-by: Rasmus Villemoes
    Acked-by: Tejun Heo
    Cc: Arjan van de Ven
    Cc: Adam Wallis
    Cc: Lai Jiangshan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rasmus Villemoes
     
  • commit 364f56653708ba8bcdefd4f0da2a42904baa8eeb upstream.

    When issuing an IPI RT push, where an IPI is sent to each CPU that has more
    than one RT task scheduled on it, it references the root domain's rto_mask,
    which contains all the CPUs within the root domain that have more than one
    RT task in the runnable state. The problem is, after the IPIs are initiated,
    the rq->lock is released. This means that the root domain that is associated
    with the run queue could be freed while the IPIs are going around.

    Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
    the root domain's ref count respectively. This way when initiating the IPIs,
    the scheduler will up the root domain's ref count before releasing the
    rq->lock, ensuring that the root domain does not go away until the IPI round
    is complete.
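
    A sketch of the two helpers (assuming the existing root_domain refcount,
    RCU head and free_rootdomain() callback):

    void sched_get_rd(struct root_domain *rd)
    {
            atomic_inc(&rd->refcount);
    }

    void sched_put_rd(struct root_domain *rd)
    {
            if (!atomic_dec_and_test(&rd->refcount))
                    return;

            call_rcu_sched(&rd->rcu, free_rootdomain);
    }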

    Reported-by: Pavan Kondeti
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit ad0f1d9d65938aec72a698116cd73a980916895e upstream.

    When the rto_push_irq_work_func() is called, it looks at the RT overloaded
    bitmask in the root domain via the runqueue (rq->rd). The problem is that
    during CPU up and down, nothing here stops rq->rd from changing between
    taking the rq->rd->rto_lock and releasing it. That means the lock that is
    released is not the same lock that was taken.

    Instead of using this_rq()->rd to get the root domain, as the irq work is
    part of the root domain, we can simply get the root domain from the irq work
    that is passed to the routine:

    container_of(work, struct root_domain, rto_push_work)

    This keeps the root domain consistent.

    Reported-by: Pavan Kondeti
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit cef31d9af908243421258f1df35a4a644604efbe upstream.

    timer_create() specifies via sigevent->sigev_notify the signal delivery for
    the new timer. The valid modes are SIGEV_NONE, SIGEV_SIGNAL, SIGEV_THREAD
    and (SIGEV_SIGNAL | SIGEV_THREAD_ID).

    The sanity check in good_sigevent() is only checking the valid combination
    for the SIGEV_THREAD_ID bit, i.e. SIGEV_SIGNAL, but if SIGEV_THREAD_ID is
    not set it accepts any random value.

    This has no real effects on the posix timer and signal delivery code, but
    it affects show_timer() which handles the output of /proc/$PID/timers. That
    function uses a string array to pretty print sigev_notify. The access to
    that array has no bound checks, so random sigev_notify cause access beyond
    the array bounds.

    Add proper checks for the valid notify modes and remove the SIGEV_THREAD_ID
    masking from various code paths as SIGEV_NONE can never be set in
    combination with SIGEV_THREAD_ID.
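
    A sketch of the tightened check in good_sigevent(), covering exactly the
    valid modes listed above (helper names assumed from the posix-timer code):

    switch (event->sigev_notify) {
    case SIGEV_SIGNAL | SIGEV_THREAD_ID:
            rtn = find_task_by_vpid(event->sigev_notify_thread_id);
            if (!rtn || !same_thread_group(rtn, current))
                    return NULL;
            /* FALLTHRU */
    case SIGEV_SIGNAL:
    case SIGEV_THREAD:
            if (event->sigev_signo <= 0 || event->sigev_signo > SIGRTMAX)
                    return NULL;
            /* FALLTHRU */
    case SIGEV_NONE:
            return rtn;
    default:
            /* any other random value is rejected now */
            return NULL;
    }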

    Reported-by: Eric Biggers
    Reported-by: Dmitry Vyukov
    Reported-by: Alexey Dobriyan
    Signed-off-by: Thomas Gleixner
    Cc: John Stultz
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

13 Feb, 2018

1 commit

  • (cherry picked from commit caf7501a1b4ec964190f31f9c3f163de252273b8)

    There's a risk that a kernel which has full retpoline mitigations becomes
    vulnerable when a module gets loaded that hasn't been compiled with the
    right compiler or the right option.

    To enable detection of that mismatch at module load time, add a module info
    string "retpoline" at build time when the module was compiled with
    retpoline support. This only covers compiled C source; assembler source
    and prebuilt object files are not checked.

    If a retpoline enabled kernel detects a non retpoline protected module at
    load time, print a warning and report it in the sysfs vulnerability file.

    [ tglx: Massaged changelog ]

    Signed-off-by: Andi Kleen
    Signed-off-by: Thomas Gleixner
    Cc: David Woodhouse
    Cc: gregkh@linuxfoundation.org
    Cc: torvalds@linux-foundation.org
    Cc: jeyu@kernel.org
    Cc: arjan@linux.intel.com
    Link: https://lkml.kernel.org/r/20180125235028.31211-1-andi@firstfloor.org
    Signed-off-by: David Woodhouse
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

31 Jan, 2018

7 commits

  • [ upstream commit f37a8cb84cce18762e8f86a70bd6a49a66ab964c ]

    Alexei found that the verifier does not reject stores into the context
    via BPF_ST instead of BPF_STX. And while looking at it, we should also
    not allow the XADD variant of BPF_STX.

    The context rewriter is only assuming either BPF_LDX_MEM- or
    BPF_STX_MEM-type operations, thus reject anything other than
    that so that assumptions in the rewriter properly hold. Add
    test cases as well for BPF selftests.

    Fixes: d691f9e8d440 ("bpf: allow programs to write to certain skb fields")
    Reported-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ upstream commit 68fda450a7df51cff9e5a4d4a4d9d0d5f2589153 ]

    Due to some JITs doing the if (src_reg == 0) check in 64-bit mode for
    div/mod operations, mask the upper 32 bits of the src register before
    doing the check.

    Fixes: 622582786c9e ("net: filter: x86: internal BPF JIT")
    Fixes: 7a12b5031c6b ("sparc64: Add eBPF JIT.")
    Reported-by: syzbot+48340bb518e88849e2e3@syzkaller.appspotmail.com
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • [ upstream commit c366287ebd698ef5e3de300d90cd62ee9ee7373e ]

    Divides by zero are not nice, let's avoid them if possible.

    Also, do_div() seems not to be needed when dealing with 32-bit operands,
    but this is a minor detail.
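
    In the interpreter the guard looks roughly like this (labels and macros as
    used by the BPF interpreter in kernel/bpf/core.c; a sketch):

    ALU64_DIV_X:
            if (unlikely(SRC == 0))
                    return 0;       /* defined result instead of a trap */
            DST = div64_u64(DST, SRC);
            CONT;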

    Fixes: bd4cf0ed331a ("net: filter: rework/optimize internal BPF interpreter's instruction set")
    Signed-off-by: Eric Dumazet
    Reported-by: syzbot
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
     
  • [ upstream commit 7891a87efc7116590eaba57acc3c422487802c6f ]

    The following snippet was throwing an 'unknown opcode cc' warning
    in BPF interpreter:

    0: (18) r0 = 0x0
    2: (7b) *(u64 *)(r10 -16) = r0
    3: (cc) (u32) r0 s>>= (u32) r0
    4: (95) exit

    Although a number of JITs do support BPF_ALU | BPF_ARSH | BPF_{K,X}
    generation, not all of them do, and the interpreter does not either. We
    can leave the existing ones and implement it later in bpf-next for the
    remaining ones, but reject this properly in the verifier for the time
    being.
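
    The verifier-side rejection is essentially the following check in
    check_alu_op() (a sketch; verifier helper signatures vary between kernel
    versions):

    if (opcode == BPF_ARSH && BPF_CLASS(insn->code) != BPF_ALU64) {
            verbose("BPF_ARSH not supported for 32 bit ALU\n");
            return -EINVAL;
    }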

    Fixes: 17a5267067f3 ("bpf: verifier (add verifier core)")
    Reported-by: syzbot+93c4904c5c70348a6890@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ upstream commit 290af86629b25ffd1ed6232c4e9107da031705cb ]

    The BPF interpreter has been used as part of the spectre 2 attack CVE-2017-5715.

    A quote from the Google Project Zero blog:
    "At this point, it would normally be necessary to locate gadgets in
    the host kernel code that can be used to actually leak data by reading
    from an attacker-controlled location, shifting and masking the result
    appropriately and then using the result of that as offset to an
    attacker-controlled address for a load. But piecing gadgets together
    and figuring out which ones work in a speculation context seems annoying.
    So instead, we decided to use the eBPF interpreter, which is built into
    the host kernel - while there is no legitimate way to invoke it from inside
    a VM, the presence of the code in the host kernel's text section is sufficient
    to make it usable for the attack, just like with ordinary ROP gadgets."

    To make the attacker's job harder, introduce a BPF_JIT_ALWAYS_ON config
    option that removes the interpreter from the kernel in favor of JIT-only mode.
    So far eBPF JIT is supported by:
    x64, arm64, arm32, sparc64, s390, powerpc64, mips64

    The start of the JITed program is randomized and the code page is marked as
    read-only. In addition, "constant blinding" can be turned on with
    net.core.bpf_jit_harden

    v2->v3:
    - move __bpf_prog_ret0 under ifdef (Daniel)

    v1->v2:
    - fix init order, test_bpf and cBPF (Daniel's feedback)
    - fix offloaded bpf (Jakub's feedback)
    - add 'return 0' dummy in case something can invoke prog->bpf_func
    - retarget bpf tree. For bpf-next the patch would need one extra hunk.
    It will be sent when the trees are merged back to net-next

    Considered doing:
    int bpf_jit_enable __read_mostly = BPF_EBPF_JIT_DEFAULT;
    but it seems better to land the patch as-is and in bpf-next remove
    bpf_jit_enable global variable from all JITs, consolidate in one place
    and remove this jit_init() function.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • [ upstream commit 90caccdd8cc0215705f18b92771b449b01e2474a ]

    - bpf prog_array just like all other types of bpf array accepts 32-bit index.
    Clarify that in the comment.
    - fix x64 JIT of bpf_tail_call which was incorrectly loading 8 instead of 4 bytes
    - tighten corresponding check in the interpreter to stay consistent

    The JIT bug can be triggered after the introduction of the BPF_F_NUMA_NODE
    flag in commit 96eabe7a40aa in 4.14. Before that the map_flags would stay
    zero and, even though the JIT code is wrong, it would check bounds
    correctly. Hence the two Fixes tags. All other JITs don't have this
    problem.

    Signed-off-by: Alexei Starovoitov
    Fixes: 96eabe7a40aa ("bpf: Allow selecting numa node during map creation")
    Fixes: b52f00e6a715 ("x86: bpf_jit: implement bpf_tail_call() helper")
    Acked-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Reviewed-by: Eric Dumazet
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • commit d5421ea43d30701e03cadc56a38854c36a8b4433 upstream.

    The hrtimer interrupt code contains a hang detection and mitigation
    mechanism, which prevents a long delayed hrtimer interrupt from causing
    continuous retriggering of interrupts that prevent the system from making
    progress. If a hang is detected then the timer hardware is programmed with
    a certain delay into the future and a flag is set in the hrtimer cpu base
    which prevents newly enqueued timers from reprogramming the timer hardware
    prior to the chosen delay. The subsequent hrtimer interrupt after the delay
    clears the flag and resumes normal operation.

    If such a hang happens in the last hrtimer interrupt before a CPU is
    unplugged then the hang_detected flag is set and stays that way when the
    CPU is plugged in again. At that point the timer hardware is not armed and
    it cannot be armed because the hang_detected flag is still active, so
    nothing clears that flag. As a consequence the CPU does not receive hrtimer
    interrupts and no timers expire on that CPU which results in RCU stalls and
    other malfunctions.

    Clear the flag along with some other less critical members of the hrtimer
    cpu base to ensure starting from a clean state when a CPU is plugged in.
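
    A sketch of the reset performed when a CPU's hrtimer base is
    (re)initialised (field names as used by struct hrtimer_cpu_base):

    cpu_base->active_bases = 0;
    cpu_base->hres_active = 0;
    cpu_base->hang_detected = 0;   /* the stale flag described above */
    cpu_base->next_timer = NULL;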

    Thanks to Paul, Sebastian and Anna-Maria for their help to get down to the
    root cause of that hard to reproduce heisenbug. Once understood it's
    trivial and certainly justifies a brown paperbag.

    Fixes: 41d2e4949377 ("hrtimer: Tune hrtimer_interrupt hang logic")
    Reported-by: Paul E. McKenney
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Sebastian Sewior
    Cc: Anna-Maria Gleixner
    Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801261447590.2067@nanos
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

24 Jan, 2018

5 commits

  • commit 62635ea8c18f0f62df4cc58379e4f1d33afd5801 upstream.

    show_workqueue_state() can print out a lot of messages while being in
    atomic context, e.g. sysrq-t -> show_workqueue_state(). If the console
    device is slow it may end up triggering the NMI hard lockup watchdog.

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Tejun Heo
    Signed-off-by: Greg Kroah-Hartman

    Sergey Senozhatsky
     
  • commit 1ebe1eaf2f02784921759992ae1fde1a9bec8fd0 upstream.

    Since enums do not get converted by the TRACE_EVENT macro into their values,
    the event format displays the enum name and not the value. This breaks
    tools like perf and trace-cmd that need to interpret the raw binary data. To
    solve this, an enum map was created to convert these enums into their actual
    numbers on boot up. This is done by TRACE_EVENTS() adding a
    TRACE_DEFINE_ENUM() macro.

    Some enums were not being converted. This was caused by an optimization
    that had a bug in it.

    All calls get checked against this enum map to see if it should be converted
    or not, and it compares the call's system to the system that the enum map
    was created under. If they match, then the call is processed.

    To cut down on the number of iterations needed to find the maps with a
    matching system, since calls and maps are grouped by system, when a match is
    made, the index into the map array is saved, so that the next call, if it
    belongs to the same system as the previous call, could start right at that
    array index and not have to scan all the previous arrays.

    The problem was, the saved index was used as the variable to know if this is
    a call in a new system or not. If the index was zero, it was assumed that
    the call is in a new system and would keep incrementing the saved index
    until it found a matching system. The issue arises when the first matching
    system was at index zero. The next map, if it belonged to the same system,
    would then think it was the first match and increment the index to one. If
    the next call belonged to the same system, it would begin its search of the
    maps off by one, and miss the first enum that should be converted. This left
    a single enum not converted properly.

    Also add a comment to describe exactly what that index was for. It took me a
    bit too long to figure out what I was thinking when debugging this issue.

    Link: http://lkml.kernel.org/r/717BE572-2070-4C1E-9902-9F2E0FEDA4F8@oracle.com

    Fixes: 0c564a538aa93 ("tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values")
    Reported-by: Chuck Lever
    Tested-by: Chuck Lever
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit ae83b56a56f8d9643dedbee86b457fa1c5d42f59 upstream.

    When a constrained task is throttled by dl_check_constrained_dl(),
    it may carry the remaining positive runtime, as a result of which, when
    dl_task_timer() fires and calls replenish_dl_entity(), it will
    not be replenished correctly due to the positive dl_se->runtime.

    This patch assigns its runtime to 0 if positive after throttling.
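
    The fix boils down to the following assignment after the task has been
    throttled in dl_check_constrained_dl() (a sketch):

    dl_se->dl_throttled = 1;
    if (dl_se->runtime > 0)
            dl_se->runtime = 0;   /* drop the leftover positive runtime */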

    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Daniel Bristot de Oliveira
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Fixes: df8eac8cafce ("sched/deadline: Throttle a constrained deadline task activated after the deadline")
    Link: http://lkml.kernel.org/r/1494421417-27550-1-git-send-email-xlpang@redhat.com
    Signed-off-by: Ingo Molnar
    Cc: Ben Hutchings
    Signed-off-by: Greg Kroah-Hartman

    Xunlei Pang
     
  • commit ed4bbf7910b28ce3c691aef28d245585eaabda06 upstream.

    When the timer base is checked for expired timers then the deferrable base
    must be checked as well. This was missed when making the deferrable base
    independent of base::nohz_active.

    Fixes: ced6d5c11d3e ("timers: Use deferrable base independent of base::nohz_active")
    Signed-off-by: Thomas Gleixner
    Cc: Anna-Maria Gleixner
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Paul McKenney
    Cc: rt@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit fbe0e839d1e22d88810f3ee3e2f1479be4c0aa4a upstream.

    UBSAN reports signed integer overflow in kernel/futex.c:

    UBSAN: Undefined behaviour in kernel/futex.c:2041:18
    signed integer overflow:
    0 - -2147483648 cannot be represented in type 'int'

    Add a sanity check to catch negative values of nr_wake and nr_requeue.
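
    The added sanity check amounts to something like this at the top of the
    requeue/wake paths (a sketch):

    if (nr_wake < 0 || nr_requeue < 0)
            return -EINVAL;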

    Signed-off-by: Li Jinyue
    Signed-off-by: Thomas Gleixner
    Cc: peterz@infradead.org
    Cc: dvhart@infradead.org
    Link: https://lkml.kernel.org/r/1513242294-31786-1-git-send-email-lijinyue@huawei.com
    Signed-off-by: Greg Kroah-Hartman

    Li Jinyue
     

17 Jan, 2018

3 commits

  • commit bbeb6e4323dad9b5e0ee9f60c223dd532e2403b1 upstream.

    syzkaller tried to alloc a map with 0xfffffffd entries out of a userns,
    and thus unprivileged. With the recently added logic in b2157399cc98
    ("bpf: prevent out-of-bounds speculation") we round this up to the next
    power of two value for max_entries for unprivileged such that we can
    apply proper masking into potentially zeroed out map slots.

    However, this will generate an index_mask of 0xffffffff, and therefore
    a + 1 will let this overflow into new max_entries of 0. This will pass
    allocation, etc, and later on map access we still enforce on the original
    attr->max_entries value which was 0xfffffffd, therefore triggering GPF
    all over the place. Thus bail out on overflow in such case.

    Moreover, on 32 bit archs roundup_pow_of_two() can also not be used,
    since fls_long(max_entries - 1) can result in 32 and 1UL << 32 in 32 bit
    space is undefined. Therefore, do this by hand in a 64 bit variable.

    This fixes all the issues triggered by syzkaller's reproducers.
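
    A sketch of the overflow-safe mask computation in array_map_alloc()
    (variable names follow the description above):

    u32 max_entries = attr->max_entries;
    u32 index_mask;
    u64 mask64;

    /* fls_long(max_entries - 1) can be 32 and 1UL << 32 is undefined in
     * 32-bit space, so build the mask in a 64-bit variable instead.
     */
    mask64 = fls_long(max_entries - 1);
    mask64 = 1ULL << mask64;
    mask64 -= 1;

    index_mask = mask64;
    if (unpriv) {
            /* round up array size to the nearest power of 2, since the CPU
             * will speculate within index_mask limits
             */
            max_entries = index_mask + 1;
            /* bail out on overflow (0xffffffff + 1 wraps to 0) */
            if (max_entries < attr->max_entries)
                    return ERR_PTR(-E2BIG);
    }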

    Fixes: b2157399cc98 ("bpf: prevent out-of-bounds speculation")
    Reported-by: syzbot+b0efb8e572d01bce1ae0@syzkaller.appspotmail.com
    Reported-by: syzbot+6c15e9744f75f2364773@syzkaller.appspotmail.com
    Reported-by: syzbot+d2f5524fb46fd3b312ee@syzkaller.appspotmail.com
    Reported-by: syzbot+61d23c95395cc90dbc2b@syzkaller.appspotmail.com
    Reported-by: syzbot+0d363c942452cca68c01@syzkaller.appspotmail.com
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • commit b2157399cc9898260d6031c5bfe45fe137c1fbe7 upstream.

    Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
    memory accesses under a bounds check may be speculated even if the
    bounds check fails, providing a primitive for building a side channel.

    To avoid leaking kernel data, round up array-based maps and mask the index
    after the bounds check, so a speculated load with an out-of-bounds index
    will load either a valid value from the array or zero from the padded area.

    Unconditionally mask index for all array types even when max_entries
    are not rounded to power of 2 for root user.
    When map is created by unpriv user generate a sequence of bpf insns
    that includes AND operation to make sure that JITed code includes
    the same 'index & index_mask' operation.

    If prog_array map is created by unpriv user replace
    bpf_tail_call(ctx, map, index);
    with
    if (index >= max_entries) {
    index &= map->index_mask;
    bpf_tail_call(ctx, map, index);
    }
    (along with roundup to power 2) to prevent out-of-bounds speculation.
    There is secondary redundant 'if (index >= max_entries)' in the interpreter
    and in all JITs, but they can be optimized later if necessary.

    Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
    cannot be used by unpriv, so no changes there.

    That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
    all architectures with and without JIT.

    v2->v3:
    Daniel noticed that attack potentially can be crafted via syscall commands
    without loading the program, so add masking to those paths as well.

    Signed-off-by: Alexei Starovoitov
    Acked-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Cc: Jiri Slaby
    [ Backported to 4.9 - gregkh ]
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • commit 79741b3bdec01a8628368fbcfccc7d189ed606cb upstream.

    reduce indent and make it iterate over instructions similar to
    convert_ctx_accesses(). Also convert hard BUG_ON into soft verifier error.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller
    Cc: Jiri Slaby
    [Backported to 4.9.y - gregkh]
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov