09 Jun, 2017

1 commit

  • When the tick is stopped and we reach the dynticks evaluation code on
    IRQ exit, we perform a soft tick restart if we observe an expired timer
    from there. It means we program the nearest possible tick but we stay in
    dynticks mode (ts->tick_stopped = 1) because we may need to stop the tick
    again after that expired timer is handled.

    Now this solution works most of the time but if we suffer an IRQ storm
    and those interrupts trigger faster than the hardware clockevents min
    delay, our tick won't fire until that IRQ storm is finished.

    Here is the problem: on IRQ exit we reprogram the timer to at least
    NOW() + min_clockevents_delay. Another IRQ fires before the tick so we
    reschedule again to NOW() + min_clockevents_delay, and so on. The tick
    is eternally rescheduled min_clockevents_delay ahead.

    A solution is to simply remove this soft tick restart. After all,
    the normal dynticks evaluation path can handle a 0 delay just fine. And
    by doing that we benefit from the optimization branch which avoids
    clock reprogramming if the clockevents deadline hasn't changed since
    the last reprogramming. This fixes our issue because we no longer do
    repetitive clock reprogramming that always adds the hardware min delay
    (a sketch of that skip follows this entry).

    As a side effect it should even optimize the 0 delay path in general.

    Reported-and-tested-by: Octavian Purdila
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
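
    A minimal sketch of the "skip the reprogram when the deadline is
    unchanged" idea described in this entry; the helper and the cached
    deadline variable are made up for illustration and are not the actual
    kernel/time/tick-sched.c code:

        #include <linux/types.h>

        static u64 cached_deadline;                  /* last value handed to the HW */

        static void program_hw_timer(u64 deadline);  /* hypothetical clockevent write */

        static void tick_reprogram_sketch(u64 next_deadline)
        {
                if (next_deadline == cached_deadline)
                        return;                      /* deadline unchanged: no HW reprogram */

                cached_deadline = next_deadline;
                program_hw_timer(next_deadline);
        }

    With such a branch in place, an IRQ storm that keeps re-evaluating an
    already expired timer no longer pushes the deadline forward by the
    hardware minimum delay on every interrupt.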
     

23 Feb, 2017

2 commits

    Move the x86_64 idle notifiers, originally added by Andi Kleen and
    Venkatesh Pallipadi, to generic code.

    Change-Id: Idf29cda15be151f494ff245933c12462643388d5
    Acked-by: Nicolas Pitre
    Signed-off-by: Todd Poynor

    Todd Poynor
     
  • These macros can be reused by governors which don't use the common
    governor code present in cpufreq_governor.c and should be moved to the
    relevant header.

    Now that they are getting moved to the right header file, reuse them in
    the schedutil governor as well (that required renaming the show/store
    routines).

    Also create a gov_attr_wo() macro for write-only sysfs files; this will be
    used by the Interactive governor in a later patch (a sketch of the pattern
    follows this entry).

    Signed-off-by: Viresh Kumar

    Viresh Kumar
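
    A rough sketch of what such show/store attribute macros typically look
    like, including the new write-only variant; the exact macro bodies and
    naming convention here are illustrative, not copied from the cpufreq
    headers:

        /* Assumes the kernel's struct governor_attr and the __ATTR() helper;
         * <name>_show / <name>_store are the per-tunable callbacks. */
        #define gov_attr_ro(_name)                                      \
                static struct governor_attr _name =                     \
                        __ATTR(_name, 0444, _name##_show, NULL)

        #define gov_attr_rw(_name)                                      \
                static struct governor_attr _name =                     \
                        __ATTR(_name, 0644, _name##_show, _name##_store)

        #define gov_attr_wo(_name)                                      \
                static struct governor_attr _name =                     \
                        __ATTR(_name, 0200, NULL, _name##_store)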
     

15 Feb, 2017

2 commits

  • commit 451d24d1e5f40bad000fa9abe36ddb16fc9928cb upstream.

    Alexei had his box explode because doing read() on a package
    (rapl/uncore) event that isn't currently scheduled in ends up doing an
    out-of-bounds load.

    Rework the code to more explicitly deal with event->oncpu being -1.

    Reported-by: Alexei Starovoitov
    Tested-by: Alexei Starovoitov
    Tested-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: eranian@google.com
    Fixes: d6a2f9035bfc ("perf/core: Introduce PMU_EV_CAP_READ_ACTIVE_PKG")
    Link: http://lkml.kernel.org/r/20170131102710.GL6515@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
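
    A hedged sketch of the "deal explicitly with event->oncpu being -1"
    idea: never use -1 as a CPU index, bail out (or fall back) instead.
    The callback and function names are illustrative, not the actual
    perf code:

        #include <linux/errno.h>
        #include <linux/smp.h>

        static void read_event_on_cpu(void *info);   /* hypothetical IPI callback */

        static int read_active_event_sketch(int oncpu, void *event)
        {
                if (oncpu < 0)
                        return -EINVAL;              /* not scheduled on any CPU right now */

                return smp_call_function_single(oncpu, read_event_on_cpu, event, 1);
        }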
     
  • commit bfeda41d06d85ad9d52f2413cfc2b77be5022f75 upstream.

    Since KERN_CONT became meaningful again, lockdep stack traces have had
    annoying extra newlines, like this:

    [ 5.561122] -> #1 (B){+.+...}:
    [ 5.561528]
    [ 5.561532] [] lock_acquire+0xc3/0x210
    [ 5.562178]
    [ 5.562181] [] mutex_lock_nested+0x74/0x6d0
    [ 5.562861]
    [ 5.562880] [] init_btrfs_fs+0x21/0x196 [btrfs]
    [ 5.563717]
    [ 5.563721] [] do_one_initcall+0x52/0x1b0
    [ 5.564554]
    [ 5.564559] [] do_init_module+0x5f/0x209
    [ 5.565357]
    [ 5.565361] [] load_module+0x218d/0x2b80
    [ 5.566020]
    [ 5.566021] [] SyS_finit_module+0xeb/0x120
    [ 5.566694]
    [ 5.566696] [] entry_SYSCALL_64_fastpath+0x1f/0xc2

    That's happening because each printk() call now gets printed on its own
    line, and we do a separate call to print the spaces before the symbol.
    Fix it by doing the printk() directly instead of using the
    print_ip_sym() helper.

    Additionally, the symbol address isn't very helpful, so let's get rid of
    that, too. The final result looks like this:

    [ 5.194518] -> #1 (B){+.+...}:
    [ 5.195002] lock_acquire+0xc3/0x210
    [ 5.195439] mutex_lock_nested+0x74/0x6d0
    [ 5.196491] do_one_initcall+0x52/0x1b0
    [ 5.196939] do_init_module+0x5f/0x209
    [ 5.197355] load_module+0x218d/0x2b80
    [ 5.197792] SyS_finit_module+0xeb/0x120
    [ 5.198251] entry_SYSCALL_64_fastpath+0x1f/0xc2

    Suggested-by: Linus Torvalds
    Signed-off-by: Omar Sandoval
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Fixes: 4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines")
    Link: http://lkml.kernel.org/r/43b4e114724b2bdb0308fa86cb33aa07d3d67fad.1486510315.git.osandov@fb.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
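
    A small sketch of the resulting printing scheme: one printk() per trace
    entry, indentation and symbol together, and no address. Names are
    illustrative, not the lockdep code itself:

        #include <linux/printk.h>

        static void print_trace_entries_sketch(unsigned long *entries, unsigned int nr)
        {
                unsigned int i;

                /* "%pS" prints symbol+offset; one printk() per line means a
                 * KERN_CONT-aware printk cannot split indentation and symbol. */
                for (i = 0; i < nr; i++)
                        printk("        %pS\n", (void *)entries[i]);
        }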
     

09 Feb, 2017

5 commits

  • commit 08d85f3ea99f1eeafc4e8507936190e86a16ee8c upstream.

    Since commit f3b0946d629c ("genirq/msi: Make sure PCI MSIs are
    activated early"), we can end-up activating a PCI/MSI twice (once
    at allocation time, and once at startup time).

    This is normally of no consequences, except that there is some
    HW out there that may misbehave if activate is used more than once
    (the GICv3 ITS, for example, uses the activate callback
    to issue the MAPVI command, and the architecture spec says that
    "If there is an existing mapping for the EventID-DeviceID
    combination, behavior is UNPREDICTABLE").

    While this could be worked around in each individual driver, it may
    make more sense to tackle the issue at the core level. In order to
    avoid getting in that situation, let's have a per-interrupt flag
    to remember if we have already activated that interrupt or not.

    Fixes: f3b0946d629c ("genirq/msi: Make sure PCI MSIs are activated early")
    Reported-and-tested-by: Andre Przywara
    Signed-off-by: Marc Zyngier
    Link: http://lkml.kernel.org/r/1484668848-24361-1-git-send-email-marc.zyngier@arm.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Marc Zyngier
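
    A minimal sketch of the per-interrupt "already activated" flag idea,
    with made-up structure and helper names (the real patch tracks this in
    the irq core's per-interrupt state):

        #include <linux/types.h>

        struct irq_state_sketch {
                bool activated;          /* set once the HW mapping was programmed */
        };

        static int activate_once_sketch(struct irq_state_sketch *st,
                                        int (*do_activate)(void *), void *data)
        {
                if (st->activated)
                        return 0;        /* second activation becomes a no-op */

                st->activated = true;
                return do_activate(data);
        }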
     
  • commit 07cd12945551b63ecb1a349d50a6d69d1d6feb4a upstream.

    While refactoring cgroup creation, a5bca2152036 ("cgroup: factor out
    cgroup_create() out of cgroup_mkdir()") incorrectly onlined subsystems
    before the new cgroup is associated with its kernfs_node. This is fine
    for cgroup proper, but cgroup_name/path() depend on the associated
    kernfs_node, and if a subsystem makes the new cgroup_subsys_state
    visible, which they're allowed to do after onlining, it can lead to a NULL
    dereference.

    The current code performs cgroup creation and subsystem onlining in
    cgroup_create() and cgroup_mkdir() makes the cgroup and subsystems
    visible afterwards. There's no reason to online the subsystems early
    and we can simply drop cgroup_apply_control_enable() call from
    cgroup_create() so that the subsystems are onlined and made visible at
    the same time.

    Signed-off-by: Tejun Heo
    Reported-by: Konstantin Khlebnikov
    Fixes: a5bca2152036 ("cgroup: factor out cgroup_create() out of cgroup_mkdir()")
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     
  • commit 79c6f448c8b79c321e4a1f31f98194e4f6b6cae7 upstream.

    The hwlat tracer creates a kernel thread at start of the tracer. It is
    pinned to a single CPU and will move to the next CPU after each period of
    running. If the user modifies the migration thread's affinity, it will not
    change after that happens.

    The original code created the thread at the first instance it was called,
    but later was changed to destroy the thread after the tracer was finished,
    and would not be created until the next instance of the tracer was
    established. The code that initialized the affinity was only called on the
    initial instantiation of the tracer. After that, it was not initialized, and
    the previous affinity did not match the newly created one, making
    it appear that the user had modified the thread's affinity when they had not,
    and the thread failed to migrate again.

    Fixes: 0330f7aa8ee6 ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs")
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit 0b3589be9b98994ce3d5aeca52445d1f5627c4ba upstream.

    Andres reported that MMAP2 records for anonymous memory always have
    their protection field 0.

    Turns out, someone daft put the prot/flags generation code in the file
    branch, leaving them unset for anonymous memory.

    Reported-by: Andres Freund
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Don Zickus
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: acme@kernel.org
    Cc: anton@ozlabs.org
    Cc: namhyung@kernel.org
    Fixes: f972eb63b100 ("perf: Pass protection and flags bits through mmap2 interface")
    Link: http://lkml.kernel.org/r/20170126221508.GF6536@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
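
    A hedged sketch of computing prot/flags from the VMA before the
    file-backed/anonymous branch, so anonymous mappings report them too;
    the function name is made up and this is not the exact perf code:

        #include <linux/mm.h>
        #include <linux/mman.h>

        static void mmap2_prot_flags_sketch(struct vm_area_struct *vma,
                                            u32 *prot, u32 *flags)
        {
                *prot = 0;
                if (vma->vm_flags & VM_READ)
                        *prot |= PROT_READ;
                if (vma->vm_flags & VM_WRITE)
                        *prot |= PROT_WRITE;
                if (vma->vm_flags & VM_EXEC)
                        *prot |= PROT_EXEC;

                *flags = (vma->vm_flags & VM_MAYSHARE) ? MAP_SHARED : MAP_PRIVATE;
        }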
     
  • commit a76a82a3e38c8d3fb6499e3dfaeb0949241ab588 upstream.

    Dmitry reported a KASAN use-after-free on event->group_leader.

    It turns out there's a hole in perf_remove_from_context() due to
    event_function_call() not calling its function when the task
    associated with the event is already dead.

    In this case the event will have been detached from the task, but the
    grouping will have been retained, such that group operations might
    still work properly while there are live child events etc.

    This does however mean that we can miss a perf_group_detach() call
    when the group decomposes, this in turn can then lead to
    use-after-free.

    Fix it by explicitly doing the group detach if its still required.

    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: syzkaller
    Fixes: 63b6da39bb38 ("perf: Fix perf_event_exit_task() race")
    Link: http://lkml.kernel.org/r/20170126153955.GD6515@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

01 Feb, 2017

3 commits

  • commit 321027c1fe77f892f4ea07846aeae08cefbbb290 upstream.

    Di Shen reported a race between two concurrent sys_perf_event_open()
    calls where both try and move the same pre-existing software group
    into a hardware context.

    The problem is exactly that described in commit:

    f63a8daa5812 ("perf: Fix event->ctx locking")

    ... where, while we wait for a ctx->mutex acquisition, the event->ctx
    relation can have changed under us.

    That very same commit failed to recognise sys_perf_event_context() as an
    external access vector to the events and thereby didn't apply the
    established locking rules correctly.

    So while one sys_perf_event_open() call is stuck waiting on
    mutex_lock_double(), the other (which owns said locks) moves the group
    about. So by the time the former sys_perf_event_open() acquires the
    locks, the context we've acquired is stale (and possibly dead).

    Apply the established locking rules as per perf_event_ctx_lock_nested()
    to the mutex_lock_double() for the 'move_group' case. This obviously means
    we need to validate state after we acquire the locks.

    Reported-by: Di Shen (Keen Lab)
    Tested-by: John Dias
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Min Chong
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: f63a8daa5812 ("perf: Fix event->ctx locking")
    Link: http://lkml.kernel.org/r/20170106131444.GZ3174@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
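
    A generic sketch of the lock-then-revalidate pattern the fix applies:
    after sleeping on the mutex the object may point at a different (or
    dead) context, so the relation is checked again under the lock. Types
    and names are illustrative, not the perf implementation:

        #include <linux/compiler.h>
        #include <linux/mutex.h>

        struct ctx_sketch { struct mutex lock; };
        struct obj_sketch { struct ctx_sketch *ctx; };

        static void lock_ctx_revalidated(struct obj_sketch *o)
        {
                struct ctx_sketch *ctx;

        retry:
                ctx = READ_ONCE(o->ctx);
                mutex_lock(&ctx->lock);
                if (o->ctx != ctx) {             /* moved while we slept on the lock */
                        mutex_unlock(&ctx->lock);
                        goto retry;
                }
                /* ... operate on o; o->ctx cannot change under us here ... */
                mutex_unlock(&ctx->lock);
        }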
     
  • commit ff9f8a7cf935468a94d9927c68b00daae701667e upstream.

    We perform the conversion between kernel jiffies and ms only when
    exporting the kernel value to user space.

    We need to do the opposite operation when the value is written by the user
    (see the sketch after this entry).

    This only matters when HZ != 1000.

    Signed-off-by: Eric Dumazet
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Eric Dumazet
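
    A minimal sketch of the symmetric conversion (names are illustrative;
    the point is simply that both directions need jiffies_to_msecs() /
    msecs_to_jiffies(), which is only visible when HZ != 1000):

        #include <linux/jiffies.h>

        static unsigned long timeout_jiffies;   /* hypothetical kernel-side value */

        static unsigned int timeout_show_ms(void)
        {
                return jiffies_to_msecs(timeout_jiffies);   /* kernel -> user */
        }

        static void timeout_store_ms(unsigned int ms)
        {
                timeout_jiffies = msecs_to_jiffies(ms);     /* user -> kernel */
        }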
     
  • commit 880a38547ff08715ce4f1daf9a4bb30c87676e68 upstream.

    The ucounts_lock is being used to protect various ucounts lifecycle
    management functionalities. However, those services can also be invoked
    when a pidns is being freed in an RCU callback (e.g. softirq context).
    This can lead to deadlocks. There were already efforts trying to
    prevent similar deadlocks in add7c65ca426 ("pid: fix lockdep deadlock
    warning due to ucount_lock"), however they just moved the context
    from hardirq to softirq. Fix this issue once and for all by explicitly
    making the lock disable irqs altogether (see the sketch after this entry).

    Dmitry Vyukov reported:

    > I've got the following deadlock report while running syzkaller fuzzer
    > on eec0d3d065bfcdf9cd5f56dd2a36b94d12d32297 of linux-next (on odroid
    > device if it matters):
    >
    > =================================
    > [ INFO: inconsistent lock state ]
    > 4.10.0-rc3-next-20170112-xc2-dirty #6 Not tainted
    > ---------------------------------
    > inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    > swapper/2/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    > (ucounts_lock){+.?...}, at: [< inline >] spin_lock
    > ./include/linux/spinlock.h:302
    > (ucounts_lock){+.?...}, at: []
    > put_ucounts+0x60/0x138 kernel/ucount.c:162
    > {SOFTIRQ-ON-W} state was registered at:
    > [] mark_lock+0x220/0xb60 kernel/locking/lockdep.c:3054
    > [< inline >] mark_irqflags kernel/locking/lockdep.c:2941
    > [] __lock_acquire+0x388/0x3260 kernel/locking/lockdep.c:3295
    > [] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
    > [< inline >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
    > [] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
    > [< inline >] spin_lock ./include/linux/spinlock.h:302
    > [< inline >] get_ucounts kernel/ucount.c:131
    > [] inc_ucount+0x80/0x6c8 kernel/ucount.c:189
    > [< inline >] inc_mnt_namespaces fs/namespace.c:2818
    > [] alloc_mnt_ns+0x78/0x3a8 fs/namespace.c:2849
    > [] create_mnt_ns+0x28/0x200 fs/namespace.c:2959
    > [< inline >] init_mount_tree fs/namespace.c:3199
    > [] mnt_init+0x258/0x384 fs/namespace.c:3251
    > [] vfs_caches_init+0x6c/0x80 fs/dcache.c:3626
    > [] start_kernel+0x414/0x460 init/main.c:648
    > [] __primary_switched+0x6c/0x70 arch/arm64/kernel/head.S:456
    > irq event stamp: 2316924
    > hardirqs last enabled at (2316924): [< inline >] rcu_do_batch
    > kernel/rcu/tree.c:2911
    > hardirqs last enabled at (2316924): [< inline >]
    > invoke_rcu_callbacks kernel/rcu/tree.c:3182
    > hardirqs last enabled at (2316924): [< inline >]
    > __rcu_process_callbacks kernel/rcu/tree.c:3149
    > hardirqs last enabled at (2316924): []
    > rcu_process_callbacks+0x7a4/0xc28 kernel/rcu/tree.c:3166
    > hardirqs last disabled at (2316923): [< inline >] rcu_do_batch
    > kernel/rcu/tree.c:2900
    > hardirqs last disabled at (2316923): [< inline >]
    > invoke_rcu_callbacks kernel/rcu/tree.c:3182
    > hardirqs last disabled at (2316923): [< inline >]
    > __rcu_process_callbacks kernel/rcu/tree.c:3149
    > hardirqs last disabled at (2316923): []
    > rcu_process_callbacks+0x210/0xc28 kernel/rcu/tree.c:3166
    > softirqs last enabled at (2316912): []
    > _local_bh_enable+0x4c/0x80 kernel/softirq.c:155
    > softirqs last disabled at (2316913): [< inline >]
    > do_softirq_own_stack ./include/linux/interrupt.h:488
    > softirqs last disabled at (2316913): [< inline >]
    > invoke_softirq kernel/softirq.c:371
    > softirqs last disabled at (2316913): []
    > irq_exit+0x264/0x308 kernel/softirq.c:405
    >
    > other info that might help us debug this:
    > Possible unsafe locking scenario:
    >
    > CPU0
    > ----
    > lock(ucounts_lock);
    >
    > lock(ucounts_lock);
    >
    > *** DEADLOCK ***
    >
    > 1 lock held by swapper/2/0:
    > #0: (rcu_callback){......}, at: [< inline >] __rcu_reclaim
    > kernel/rcu/rcu.h:108
    > #0: (rcu_callback){......}, at: [< inline >] rcu_do_batch
    > kernel/rcu/tree.c:2919
    > #0: (rcu_callback){......}, at: [< inline >]
    > invoke_rcu_callbacks kernel/rcu/tree.c:3182
    > #0: (rcu_callback){......}, at: [< inline >]
    > __rcu_process_callbacks kernel/rcu/tree.c:3149
    > #0: (rcu_callback){......}, at: []
    > rcu_process_callbacks+0x720/0xc28 kernel/rcu/tree.c:3166
    >
    > stack backtrace:
    > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.10.0-rc3-next-20170112-xc2-dirty #6
    > Hardware name: Hardkernel ODROID-C2 (DT)
    > Call trace:
    > [] dump_backtrace+0x0/0x440 arch/arm64/kernel/traps.c:500
    > [] show_stack+0x20/0x30 arch/arm64/kernel/traps.c:225
    > [] dump_stack+0x110/0x168
    > [] print_usage_bug.part.27+0x49c/0x4bc
    > kernel/locking/lockdep.c:2387
    > [< inline >] print_usage_bug kernel/locking/lockdep.c:2357
    > [< inline >] valid_state kernel/locking/lockdep.c:2400
    > [< inline >] mark_lock_irq kernel/locking/lockdep.c:2617
    > [] mark_lock+0x934/0xb60 kernel/locking/lockdep.c:3065
    > [< inline >] mark_irqflags kernel/locking/lockdep.c:2923
    > [] __lock_acquire+0x640/0x3260 kernel/locking/lockdep.c:3295
    > [] lock_acquire+0xa4/0x138 kernel/locking/lockdep.c:3753
    > [< inline >] __raw_spin_lock ./include/linux/spinlock_api_smp.h:144
    > [] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
    > [< inline >] spin_lock ./include/linux/spinlock.h:302
    > [] put_ucounts+0x60/0x138 kernel/ucount.c:162
    > [] dec_ucount+0xf4/0x158 kernel/ucount.c:214
    > [< inline >] dec_pid_namespaces kernel/pid_namespace.c:89
    > [] delayed_free_pidns+0x40/0xe0 kernel/pid_namespace.c:156
    > [< inline >] __rcu_reclaim kernel/rcu/rcu.h:118
    > [< inline >] rcu_do_batch kernel/rcu/tree.c:2919
    > [< inline >] invoke_rcu_callbacks kernel/rcu/tree.c:3182
    > [< inline >] __rcu_process_callbacks kernel/rcu/tree.c:3149
    > [] rcu_process_callbacks+0x768/0xc28 kernel/rcu/tree.c:3166
    > [] __do_softirq+0x324/0x6e0 kernel/softirq.c:284
    > [< inline >] do_softirq_own_stack ./include/linux/interrupt.h:488
    > [< inline >] invoke_softirq kernel/softirq.c:371
    > [] irq_exit+0x264/0x308 kernel/softirq.c:405
    > [] __handle_domain_irq+0xc0/0x150 kernel/irq/irqdesc.c:636
    > [] gic_handle_irq+0x68/0xd8
    > Exception stack(0xffff8000648e7dd0 to 0xffff8000648e7f00)
    > 7dc0: ffff8000648d4b3c 0000000000000007
    > 7de0: 0000000000000000 1ffff0000c91a967 1ffff0000c91a967 1ffff0000c91a967
    > 7e00: ffff20000a4b6b68 0000000000000001 0000000000000007 0000000000000001
    > 7e20: 1fffe4000149ae90 ffff200009d35000 0000000000000000 0000000000000002
    > 7e40: 0000000000000000 0000000000000000 0000000002624a1a 0000000000000000
    > 7e60: 0000000000000000 ffff200009cbcd88 000060006d2ed000 0000000000000140
    > 7e80: ffff200009cff000 ffff200009cb6000 ffff200009cc2020 ffff200009d2159d
    > 7ea0: 0000000000000000 ffff8000648d4380 0000000000000000 ffff8000648e7f00
    > 7ec0: ffff20000820a478 ffff8000648e7f00 ffff20000820a47c 0000000010000145
    > 7ee0: 0000000000000140 dfff200000000000 ffffffffffffffff ffff20000820a478
    > [] el1_irq+0xb8/0x130 arch/arm64/kernel/entry.S:486
    > [< inline >] arch_local_irq_restore
    > ./arch/arm64/include/asm/irqflags.h:81
    > [] rcu_idle_exit+0x64/0xa8 kernel/rcu/tree.c:1030
    > [< inline >] cpuidle_idle_call kernel/sched/idle.c:200
    > [] do_idle+0x1dc/0x2d0 kernel/sched/idle.c:243
    > [] cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:345
    > [] secondary_start_kernel+0x2cc/0x358
    > arch/arm64/kernel/smp.c:276
    > [] 0x279f1a4

    Reported-by: Dmitry Vyukov
    Tested-by: Dmitry Vyukov
    Fixes: add7c65ca426 ("pid: fix lockdep deadlock warning due to ucount_lock")
    Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces")
    Link: https://www.spinics.net/lists/kernel/msg2426637.html
    Signed-off-by: Nikolay Borisov
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Greg Kroah-Hartman

    Nikolay Borisov
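
    A minimal sketch of the resulting locking rule: since the lock can now
    be taken from an RCU callback running in softirq context, every
    acquisition disables interrupts. The lock and counter here are
    stand-ins, not the real ucount code:

        #include <linux/spinlock.h>

        static DEFINE_SPINLOCK(ucounts_lock_sketch);
        static int ucount_refs_sketch;

        static void put_ucounts_sketch(void)
        {
                unsigned long flags;

                spin_lock_irqsave(&ucounts_lock_sketch, flags);
                ucount_refs_sketch--;            /* lifecycle bookkeeping under the lock */
                spin_unlock_irqrestore(&ucounts_lock_sketch, flags);
        }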
     

26 Jan, 2017

2 commits

  • commit 52d7e48b86fc108e45a656d8e53e4237993c481d upstream.

    The current preemptible RCU implementation goes through three phases
    during bootup. In the first phase, there is only one CPU that is running
    with preemption disabled, so that a no-op is a synchronous grace period.
    In the second mid-boot phase, the scheduler is running, but RCU has
    not yet gotten its kthreads spawned (and, for expedited grace periods,
    workqueues are not yet running). During this time, any attempt to do
    a synchronous grace period will hang the system (or complain bitterly,
    depending). In the third and final phase, RCU is fully operational and
    everything works normally.

    This has been OK for some time, but there have recently been some
    synchronous grace periods showing up during the second mid-boot phase.
    This code worked "by accident" for a while, but started failing as soon
    as expedited RCU grace periods switched over to workqueues in commit
    8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue").
    Note that the code was buggy even before this commit, as it was subject
    to failure on real-time systems that forced all expedited grace periods
    to run as normal grace periods (for example, using the rcu_normal ksysfs
    parameter). The callchain from the failure case is as follows:

    early_amd_iommu_init()
    |-> acpi_put_table(ivrs_base);
    |-> acpi_tb_put_table(table_desc);
    |-> acpi_tb_invalidate_table(table_desc);
    |-> acpi_tb_release_table(...)
    |-> acpi_os_unmap_memory
    |-> acpi_os_unmap_iomem
    |-> acpi_os_map_cleanup
    |-> synchronize_rcu_expedited

    The kernel showing this callchain was built with CONFIG_PREEMPT_RCU=y,
    which caused the code to try using workqueues before they were
    initialized, which did not go well.

    This commit therefore reworks RCU to permit synchronous grace periods
    to proceed during this mid-boot phase. This commit is therefore a
    fix to a regression introduced in v4.9, and is therefore being put
    forward post-merge-window in v4.10.

    This commit sets a flag from the existing rcu_scheduler_starting()
    function which causes all synchronous grace periods to take the expedited
    path. The expedited path now checks this flag, using the requesting task
    to drive the expedited grace period forward during the mid-boot phase.
    Finally, this flag is updated by a core_initcall() function named
    rcu_exp_runtime_mode(), which causes the runtime codepaths to be used.

    Note that this arrangement assumes that tasks are not sent POSIX signals
    (or anything similar) from the time that the first task is spawned
    through core_initcall() time.

    Fixes: 8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue")
    Reported-by: "Zheng, Lv"
    Reported-by: Borislav Petkov
    Signed-off-by: Paul E. McKenney
    Tested-by: Stan Kain
    Tested-by: Ivan
    Tested-by: Emanuel Castelo
    Tested-by: Bruno Pesavento
    Tested-by: Borislav Petkov
    Tested-by: Frederic Bezies
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     
  • commit f466ae66fa6a599f9a53b5f9bafea4b8cfffa7fb upstream.

    It is now legal to invoke synchronize_sched() at early boot, which causes
    Tiny RCU's synchronize_sched() to emit spurious splats. This commit
    therefore removes the cond_resched() from Tiny RCU's synchronize_sched().

    Fixes: 8b355e3bc140 ("rcu: Drive expedited grace periods from workqueue")
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     

20 Jan, 2017

3 commits

  • commit add7c65ca426b7a37184dd3d2172394e23d585d6 upstream.

    =========================================================
    [ INFO: possible irq lock inversion dependency detected ]
    4.10.0-rc2-00024-g4aecec9-dirty #118 Tainted: G W
    ---------------------------------------------------------
    swapper/1/0 just changed the state of lock:
    (&(&sighand->siglock)->rlock){-.....}, at: [] __lock_task_sighand+0xb6/0x2c0
    but this lock took another, HARDIRQ-unsafe lock in the past:
    (ucounts_lock){+.+...}
    and interrupts could create inverse lock ordering between them.
    other info that might help us debug this:
    Chain exists of: &(&sighand->siglock)->rlock --> &(&tty->ctrl_lock)->rlock --> ucounts_lock
    Possible interrupt unsafe locking scenario:
           CPU0                    CPU1
           ----                    ----
      lock(ucounts_lock);
                                   local_irq_disable();
                                   lock(&(&sighand->siglock)->rlock);
                                   lock(&(&tty->ctrl_lock)->rlock);
      <Interrupt>
        lock(&(&sighand->siglock)->rlock);

    *** DEADLOCK ***

    This patch removes a dependency between rlock and ucount_lock.

    Fixes: f333c700c610 ("pidns: Add a limit on the number of pid namespaces")
    Signed-off-by: Andrei Vagin
    Acked-by: Al Viro
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Greg Kroah-Hartman

    Andrei Vagin
     
  • commit b6416e61012429e0277bd15a229222fd17afc1c1 upstream.

    Modules that use static_key_deferred need a way to synchronize with
    any delayed work that is still pending when the module is unloaded.
    Introduce static_key_deferred_flush() which flushes any pending
    jump label updates.

    Signed-off-by: David Matlack
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    David Matlack
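
    A short sketch of the intended module-side usage (the module and key
    names are hypothetical; the flush call is the interface this patch
    introduces):

        #include <linux/jump_label_ratelimit.h>
        #include <linux/module.h>

        static struct static_key_deferred demo_key_sketch;

        static void __exit demo_exit_sketch(void)
        {
                /* Wait out any pending rate-limited update before the module
                 * (and the key it owns) is freed. */
                static_key_deferred_flush(&demo_key_sketch);
        }
        module_exit(demo_exit_sketch);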
     
  • commit f931ab479dd24cf7a2c6e2df19778406892591fb upstream.

    Both arch_add_memory() and arch_remove_memory() expect a single threaded
    context.

    For example, arch/x86/mm/init_64.c::kernel_physical_mapping_init() does
    not hold any locks over this check and branch:

    if (pgd_val(*pgd)) {
            pud = (pud_t *)pgd_page_vaddr(*pgd);
            paddr_last = phys_pud_init(pud, __pa(vaddr),
                                       __pa(vaddr_end),
                                       page_size_mask);
            continue;
    }

    pud = alloc_low_page();
    paddr_last = phys_pud_init(pud, __pa(vaddr), __pa(vaddr_end),
                               page_size_mask);

    The result is that two threads calling devm_memremap_pages()
    simultaneously can end up colliding on pgd initialization. This leads
    to crash signatures like the following where the loser of the race
    initializes the wrong pgd entry:

    BUG: unable to handle kernel paging request at ffff888ebfff0000
    IP: memcpy_erms+0x6/0x10
    PGD 2f8e8fc067 PUD 0 /*
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

12 Jan, 2017

3 commits

  • commit c1a9eeb938b5433947e5ea22f89baff3182e7075 upstream.

    When a dysfunctional timer, e.g. a dummy timer, is installed, the tick core
    tries to setup the broadcast timer.

    If no broadcast device is installed, the kernel crashes with a NULL pointer
    dereference in tick_broadcast_setup_oneshot() because the function has no
    sanity check.

    Reported-by: Mason
    Signed-off-by: Thomas Gleixner
    Cc: Mark Rutland
    Cc: Anna-Maria Gleixner
    Cc: Richard Cochran
    Cc: Sebastian Andrzej Siewior
    Cc: Daniel Lezcano
    Cc: Peter Zijlstra
    Cc: Sebastian Frias
    Cc: Thibaud Cornic
    Cc: Robin Murphy
    Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit c0af52437254fda8b0cdbaae5a9b6d9327f1fcd5 upstream.

    Commit 34c3d9819fda ("genirq/affinity: Provide smarter irq spreading
    infrastructure") introduced a better IRQ spreading mechanism, taking
    account of the available NUMA nodes in the machine.

    The problem is that the algorithm retrieving the nodemask iterates
    "linearly" based on the number of online nodes - some architectures
    present a non-linear node distribution among the nodemask, like PowerPC.
    If this is the case, the algorithm leads to a wrong node count and
    therefore to a bad/incomplete IRQ affinity distribution (a sketch of the
    node walk follows this entry).

    For example, this problem was found in a machine with 128 CPUs and two
    nodes, namely nodes 0 and 8 (instead of 0 and 1, if it was linearly
    distributed). This led to a wrong affinity distribution which then led to
    a bad mq allocation for the nvme driver.

    Finally, we take the opportunity to fix a comment regarding the affinity
    distribution when we have _more_ nodes than vectors.

    Fixes: 34c3d9819fda ("genirq/affinity: Provide smarter irq spreading infrastructure")
    Reported-by: Gabriel Krisman Bertazi
    Signed-off-by: Guilherme G. Piccoli
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Gabriel Krisman Bertazi
    Reviewed-by: Gavin Shan
    Cc: linux-pci@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: hch@lst.de
    Link: http://lkml.kernel.org/r/1481738472-2671-1-git-send-email-gpiccoli@linux.vnet.ibm.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Guilherme G. Piccoli
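
    A small sketch of walking only the node IDs that actually exist instead
    of assuming they are 0 .. nr_nodes - 1; the helper is illustrative, not
    the actual genirq/affinity code:

        #include <linux/nodemask.h>

        static unsigned int count_nodes_sketch(const nodemask_t *mask)
        {
                unsigned int node, nodes = 0;

                /* Visits only the set bits, e.g. nodes 0 and 8 on the
                 * PowerPC machine mentioned above. */
                for_each_node_mask(node, *mask)
                        nodes++;

                return nodes;
        }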
     
  • commit 9a29d0fbc2d9ad99fb8a981ab72548cc360e9d4c upstream.

    Smatch complains that we started using the array offset before we
    checked that it was valid.

    Fixes: 017c59c042d0 ('relay: Use per CPU constructs for the relay channel buffer pointers')
    Link: http://lkml.kernel.org/r/20161013084947.GC16198@mwanda
    Signed-off-by: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     

09 Jan, 2017

2 commits

  • commit 794de08a16cf1fc1bf785dc48f66d36218cf6d88 upstream.

    Both the wakeup and irqsoff tracers can use the function graph tracer when
    the display-graph option is set. The problem is that they ignore the notrace
    file, and record the entry of functions that would be ignored by the
    function_graph tracer. This causes the trace->depth to be recorded into the
    ring buffer. The set_graph_notrace uses a trick by adding a large negative
    number to the trace->depth when a graph function is to be ignored.

    On trace output, the graph function uses the depth to record a stack of
    functions. But since the depth is negative, it accesses the array with a
    negative number and causes an out of bounds access that can cause a kernel
    oops or corrupt data.

    Have the print functions handle cases where a tracer still records functions
    even when they are in set_graph_notrace.

    Also add warnings if the depth is below zero before accessing the array.

    Note, the function graph logic will still prevent the return of these
    functions from being recorded, which means that they will be left hanging
    without a return. For example:

    # echo '*spin*' > set_graph_notrace
    # echo 1 > options/display-graph
    # echo wakeup > current_tracer
    # cat trace
    [...]
    _raw_spin_lock() {
      preempt_count_add() {
        do_raw_spin_lock() {
          update_rq_clock();

    Where it should look like:

    _raw_spin_lock() {
      preempt_count_add();
      do_raw_spin_lock();
    }
    update_rq_clock();

    Cc: Namhyung Kim
    Fixes: 29ad23b00474 ("ftrace: Add set_graph_notrace filter")
    Signed-off-by: Steven Rostedt
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 9c1645727b8fa90d07256fdfcc45bf831242a3ab upstream.

    The clocksource delta to nanoseconds conversion is using signed math, but
    the delta is unsigned. This makes the conversion space smaller than
    necessary and in case of a multiplication overflow the conversion can
    become negative. The conversion is done with scaled math:

    s64 nsec_delta = ((s64)clkdelta * clk->mult) >> clk->shift;

    Shifting a signed integer right obviously preserves the sign, which has
    interesting consequences:

    - Time jumps backwards

    - __iter_div_u64_rem(), which is used in one of the calling code paths,
      will take forever to piecewise calculate the seconds/nanoseconds part.

    This has been reported by several people with different scenarios:

    David observed that when stopping a VM with a debugger:

    "It was essentially the stopped by debugger case. I forget exactly why,
    but the guest was being explicitly stopped from outside, it wasn't just
    scheduling lag. I think it was something in the vicinity of 10 minutes
    stopped."

    When lifting the stop the machine went dead.

    The stopped by debugger case is not really interesting, but nevertheless it
    would be a good thing not to die completely.

    But this was also observed on a live system by Liav:

    "When the OS is too overloaded, delta will get a high enough value for the
    msb of the sum delta * tkr->mult + tkr->xtime_nsec to be set, and so
    after the shift the nsec variable will gain a value similar to
    0xffffffffff000000."

    Unfortunately this has been reintroduced recently with commit 6bd58f09e1d8
    ("time: Add cycles to nanoseconds translation"). It had been fixed a year
    ago already in commit 35a4933a8959 ("time: Avoid signed overflow in
    timekeeping_get_ns()").

    Though it's not surprising that the issue has been reintroduced because the
    function itself and the whole call chain uses s64 for the result and the
    propagation of it. The change in this recent commit is subtle:

    s64 nsec;

    - nsec = (d * m + n) >> s:
    + nsec = d * m + n;
    + nsec >>= s;

    d being type of cycle_t adds another level of obfuscation.

    This wouldn't have happened if the previous change to unsigned computation
    would have made the 'nsec' variable u64 right away and a follow up patch
    had cleaned up the whole call chain.

    There have been patches submitted which basically did a revert of the above
    patch leaving everything else unchanged as signed. Back to square one. This
    spawned an admittedly pointless discussion about potential users which rely
    on the unsigned behaviour until someone pointed out that it had been fixed
    before. The changelogs of said patches added further confusion as they
    finally made false claims about the consequences for eventual users which
    expect signed results.

    Despite delta being cycle_t, aka. u64, it's very well possible to hand in
    a signed negative value and the signed computation will happily return the
    correct result. But nobody actually sat down and analyzed the code which
    was added as a user after the probably unintended signed conversion.

    Though in sensitive code like this it's better to analyze it properly and
    make sure that nothing relies on this than hunting the subtle wreckage half
    a year later. After analyzing all call chains it stands that no caller can
    hand in a negative value (which actually would work due to the s64 cast)
    and rely on the signed math to do the right thing.

    Change the conversion function to unsigned math. The conversion of all call
    chains is done in a follow up patch.

    This solves the starvation issue, which was caused by the negative result,
    but it does not solve the underlying problem. It merely postpones
    it. When the timekeeper update is deferred long enough that the unsigned
    multiplication overflows, then time going backwards is observable again.

    Nor does it solve the issue of clocksources with a small counter width
    which will wrap around possibly several times and cause random time stamps
    to be generated. But those are usually not found on systems used for
    virtualization, so this is likely a non issue.

    I took the liberty to claim authorship for this simply because
    analyzing all callsites and writing the changelog took substantially
    more time than just making the simple s/s64/u64/ change and ignore the
    rest.

    Fixes: 6bd58f09e1d8 ("time: Add cycles to nanoseconds translation")
    Reported-by: David Gibson
    Reported-by: Liav Rehana
    Signed-off-by: Thomas Gleixner
    Reviewed-by: David Gibson
    Acked-by: Peter Zijlstra (Intel)
    Cc: Parit Bhargava
    Cc: Laurent Vivier
    Cc: "Christopher S. Hall"
    Cc: Chris Metcalf
    Cc: Richard Cochran
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/20161208204228.688545601@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
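
    A minimal sketch of the unsigned scaled-math conversion; with u64 the
    right shift is a logical shift, so an overflowing multiplication can no
    longer be interpreted as a huge negative number of nanoseconds (the
    function name is illustrative):

        #include <linux/types.h>

        static inline u64 cyc_to_ns_sketch(u64 delta, u32 mult, u32 shift)
        {
                return (delta * mult) >> shift;
        }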
     

06 Jan, 2017

7 commits

  • commit 2d13bb6494c807bcf3f78af0e96c0b8615a94385 upstream.

    We've got a delay loop waiting for secondary CPUs. That loop uses
    loops_per_jiffy. However, loops_per_jiffy doesn't actually mean how
    many tight loops make up a jiffy on all architectures. It is quite
    common to see things like this in the boot log:

    Calibrating delay loop (skipped), value calculated using timer
    frequency.. 48.00 BogoMIPS (lpj=24000)

    In my case I was seeing lots of cases where other CPUs timed out
    entering the debugger only to print their stack crawls shortly after the
    kdb> prompt was written.

    Elsewhere in kgdb we already use udelay(), so that should be safe enough
    to use to implement our timeout. We'll delay 1 ms for 1000 times, which
    should give us a full second of delay (just like the old code wanted)
    but allow us to notice that we're done every 1 ms.

    [akpm@linux-foundation.org: simplifications, per Daniel]
    Link: http://lkml.kernel.org/r/1477091361-2039-1-git-send-email-dianders@chromium.org
    Signed-off-by: Douglas Anderson
    Reviewed-by: Daniel Thompson
    Cc: Jason Wessel
    Cc: Brian Norris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Douglas Anderson
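
    A minimal sketch of the udelay()-based wait described above: poll every
    millisecond, give up after roughly one second, and never rely on
    loops_per_jiffy (the predicate callback is hypothetical):

        #include <linux/delay.h>
        #include <linux/types.h>

        static bool wait_for_cpus_sketch(bool (*all_cpus_in)(void))
        {
                int ms;

                for (ms = 0; ms < 1000; ms++) {
                        if (all_cpus_in())
                                return true;
                        udelay(1000);            /* 1 ms per iteration */
                }
                return false;                    /* timed out after ~1 s */
        }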
     
  • commit 4d1f0fb096aedea7bb5489af93498a82e467c480 upstream.

    The NMI handler doesn't call set_irq_regs(); it's set only by a normal IRQ.
    Thus get_irq_regs() returns NULL or a stale registers snapshot with IP/SP
    pointing to the code interrupted by the IRQ which was interrupted by the NMI.
    NULL isn't a problem: in this case the watchdog calls dump_stack() and
    prints a full stack trace including the NMI. But if we're stuck in an IRQ
    handler then the NMI watchdog will print a stack trace without the IRQ part
    at all (see the sketch after this entry).

    This patch uses registers snapshot passed into NMI handler as arguments:
    these registers point exactly to the instruction interrupted by NMI.

    Fixes: 55537871ef66 ("kernel/watchdog.c: perform all-CPU backtrace in case of hard lockup")
    Link: http://lkml.kernel.org/r/146771764784.86724.6006627197118544150.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Cc: Jiri Kosina
    Cc: Ulrich Obergfell
    Cc: Aaron Tomlin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
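
    A minimal sketch of preferring the pt_regs passed into the NMI handler
    over get_irq_regs(); the function name is illustrative and the
    show_regs() header location is an assumption for this kernel era:

        #include <linux/printk.h>
        #include <linux/sched.h>         /* show_regs() on ~4.9 kernels (assumption) */

        static void nmi_backtrace_sketch(struct pt_regs *regs)
        {
                if (regs)
                        show_regs(regs);         /* IP/SP of the interrupted context */
                else
                        dump_stack();            /* fallback: current stack only */
        }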
     
  • commit 84d77d3f06e7e8dea057d10e8ec77ad71f721be3 upstream.

    It is the reasonable expectation that if an executable file is not
    readable there will be no way for a user without special privileges to
    read the file. This is enforced in ptrace_attach but if ptrace
    is already attached before exec there is no enforcement for read-only
    executables.

    As the only way to read such an mm is through access_process_vm,
    introduce a variant called ptrace_access_vm that will fail if the
    target process is not being ptraced by the current process, or
    the current process did not have sufficient privileges when ptracing
    began to read the target processes mm.

    In the ptrace implementations replace access_process_vm by
    ptrace_access_vm. There remain several ptrace sites that still use
    access_process_vm as they are reading the target executables
    instructions (for kernel consumption) or register stacks. As such it
    does not appear necessary to add a permission check to those calls.

    This bug has always existed in Linux.

    Fixes: v1.0
    Reported-by: Andy Lutomirski
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 64b875f7ac8a5d60a4e191479299e931ee949b67 upstream.

    When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
    overlooked. This can result in incorrect behavior when an application
    like strace traces an exec of a setuid executable.

    Further PT_PTRACE_CAP does not have enough information for making good
    security decisions as it does not report which user namespace the
    capability is in. This has already allowed one mistake through
    insufficient granularity.

    I found this issue when I was testing another corner case of exec and
    discovered that I could not get strace to set PT_PTRACE_CAP even when
    running strace as root with a full set of caps.

    This change fixes the above issue with strace allowing stracing as
    root a setuid executable without disabling setuid. More fundamentally,
    this change allows what is allowable at all times, by using the correct
    information in its decision.

    Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit bfedb589252c01fa505ac9f6f2a3d5d68d707ef4 upstream.

    During exec dumpable is cleared if the file that is being executed is
    not readable by the user executing the file. A bug in
    ptrace_may_access allows reading the file if the executable happens to
    enter into a subordinate user namespace (aka clone(CLONE_NEWUSER),
    unshare(CLONE_NEWUSER), or setns(fd, CLONE_NEWUSER)).

    This problem is fixed with only necessary userspace breakage by adding
    a user namespace owner to mm_struct, captured at the time of exec, so
    it is clear in which user namespace CAP_SYS_PTRACE must be present in
    to be able to safely give read permission to the executable.

    The function ptrace_may_access is modified to verify that the ptracer
    has CAP_SYS_ADMIN in task->mm->user_ns instead of task->cred->user_ns.
    This ensures that if the task changes its cred into a subordinate
    user namespace it does not become ptraceable.

    The function ptrace_attach is modified to only set PT_PTRACE_CAP when
    CAP_SYS_PTRACE is held over task->mm->user_ns. The intent of
    PT_PTRACE_CAP is to be a flag to note that whatever permission changes
    the task might go through the tracer has sufficient permissions for
    it not to be an issue. task->cred->user_ns is always the same
    as or descendent of mm->user_ns. Which guarantees that having
    CAP_SYS_PTRACE over mm->user_ns is the worst case for the tasks
    credentials.

    To prevent regressions mm->dumpable and mm->user_ns are not considered
    when a task has no mm. As simply failing ptrace_may_attach causes
    regressions in privileged applications attempting to read things
    such as /proc/<pid>/stat.

    Acked-by: Kees Cook
    Tested-by: Cyrill Gorcunov
    Fixes: 8409cca70561 ("userns: allow ptrace from non-init user namespaces")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit f84df2a6f268de584a201e8911384a2d244876e3 upstream.

    When the user namespace support was merged the need to prevent
    ptrace from revealing the contents of an unreadable executable
    was overlooked.

    Correct this oversight by ensuring that the executed file
    or files are in mm->user_ns, by adjusting mm->user_ns.

    Use the new function privileged_wrt_inode_uidgid to see if
    the executable is a member of the user namespace, and as such
    if having CAP_SYS_PTRACE in the user namespace should allow
    tracing the executable. If not update mm->user_ns to
    the parent user namespace until an appropriate parent is found.

    Reported-by: Jann Horn
    Fixes: 9e4a36ece652 ("userns: Fail exec for suid and sgid binaries with ids outside our user namespace.")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 777c6e0daebb3fcefbbd6f620410a946b07ef6d0 upstream.

    Yu Zhao has noticed that __unregister_cpu_notifier only unregisters its
    notifiers when HOTPLUG_CPU=y while the registration might succeed even
    when HOTPLUG_CPU=n if MODULE is enabled. This means that e.g. zswap
    might keep a stale notifier on the list on the manual clean up during
    the pool tear down and thus corrupt the list, resulting in the following:

    [ 144.964346] BUG: unable to handle kernel paging request at ffff880658a2be78
    [ 144.971337] IP: [] raw_notifier_chain_register+0x1b/0x40

    [ 145.122628] Call Trace:
    [ 145.125086] [] __register_cpu_notifier+0x18/0x20
    [ 145.131350] [] zswap_pool_create+0x273/0x400
    [ 145.137268] [] __zswap_param_set+0x1fc/0x300
    [ 145.143188] [] ? trace_hardirqs_on+0xd/0x10
    [ 145.149018] [] ? kernel_param_lock+0x28/0x30
    [ 145.154940] [] ? __might_fault+0x4f/0xa0
    [ 145.160511] [] zswap_compressor_param_set+0x17/0x20
    [ 145.167035] [] param_attr_store+0x5c/0xb0
    [ 145.172694] [] module_attr_store+0x1d/0x30
    [ 145.178443] [] sysfs_kf_write+0x4f/0x70
    [ 145.183925] [] kernfs_fop_write+0x149/0x180
    [ 145.189761] [] __vfs_write+0x18/0x40
    [ 145.194982] [] vfs_write+0xb2/0x1a0
    [ 145.200122] [] SyS_write+0x52/0xa0
    [ 145.205177] [] entry_SYSCALL_64_fastpath+0x12/0x17

    This can be even triggered manually by changing
    /sys/module/zswap/parameters/compressor multiple times.

    Fix this issue by making unregister APIs symmetric to the register so
    there are no surprises.

    Fixes: 47e627bc8c9a ("[PATCH] hotplug: Allow modules to use the cpu hotplug notifiers even if !CONFIG_HOTPLUG_CPU")
    Reported-and-tested-by: Yu Zhao
    Signed-off-by: Michal Hocko
    Cc: linux-mm@kvack.org
    Cc: Andrew Morton
    Cc: Dan Streetman
    Link: http://lkml.kernel.org/r/20161207135438.4310-1-mhocko@kernel.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

08 Dec, 2016

4 commits

  • In __sanitizer_cov_trace_pc we use task_struct and fields within it, but
    as we haven't included <linux/sched.h>, it is not guaranteed to be
    defined. While we usually happen to acquire the definition through a
    transitive include, this is fragile (and hasn't been true in the past,
    causing issues with backports).

    Include <linux/sched.h> to avoid any fragility.

    [mark.rutland@arm.com: rewrote changelog]
    Link: http://lkml.kernel.org/r/1481007384-27529-1-git-send-email-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Acked-by: Mark Rutland
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: James Morse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     
  • Pull scheduler fix from Ingo Molnar:
    "An autogroup nice level adjustment bug fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/autogroup: Fix 64-bit kernel nice level adjustment

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "A bogus warning fix, a counter width handling fix affecting certain
    machines, plus a oneliner hw-enablement patch for Knights Mill CPUs"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/core: Remove invalid warning from list_update_cgroup_event()
    perf/x86: Fix full width counter, counter overflow
    perf/x86/intel: Enable C-state residency events for Knights Mill

    Linus Torvalds
     
  • Pull locking fixes from Ingo Molnar:
    "Two rtmutex race fixes (which miraculously never triggered, that we
    know of), plus two lockdep printk formatting regression fixes"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    lockdep: Fix report formatting
    locking/rtmutex: Use READ_ONCE() in rt_mutex_owner()
    locking/rtmutex: Prevent dequeue vs. unlock race
    locking/selftest: Fix output since KERN_CONT changes

    Linus Torvalds
     

06 Dec, 2016

2 commits

  • Since commit:

    4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines")

    printk() requires KERN_CONT to continue log messages. Lots of printk()
    in lockdep.c and print_ip_sym() don't have it. As a result, lockdep
    reports are completely messed up.

    Add missing KERN_CONT and inline print_ip_sym() where necessary.

    Example of a messed up report:

    0-rc5+ #41 Not tainted
    -------------------------------------------------------
    syz-executor0/5036 is trying to acquire lock:
    (
    rtnl_mutex
    ){+.+.+.}
    , at:
    [] rtnl_lock+0x1c/0x20
    but task is already holding lock:
    (
    &net->packet.sklist_lock
    ){+.+...}
    , at:
    [] packet_diag_dump+0x1a6/0x1920
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3
    (
    &net->packet.sklist_lock
    +.+...}
    ...

    Without this patch all scripts that parse kernel bug reports are broken.

    Signed-off-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: andreyknvl@google.com
    Cc: aryabinin@virtuozzo.com
    Cc: joe@perches.com
    Cc: syzkaller@googlegroups.com
    Link: http://lkml.kernel.org/r/1480343083-48731-1-git-send-email-dvyukov@google.com
    Signed-off-by: Ingo Molnar

    Dmitry Vyukov
     
  • The warning introduced in commit:

    864c2357ca89 ("perf/core: Do not set cpuctx->cgrp for unscheduled cgroups")

    assumed that a cgroup switch always precedes list_del_event. This is
    not the case. Remove warning.

    Make sure that cpuctx->cgrp is NULL until a cgroup event is sched in
    or ctx->nr_cgroups == 0.

    Signed-off-by: David Carrillo-Cisneros
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Borislav Petkov
    Cc: Fenghua Yu
    Cc: Jiri Olsa
    Cc: Kan Liang
    Cc: Linus Torvalds
    Cc: Marcelo Tosatti
    Cc: Nilay Vaish
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Ravi V Shankar
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vegard Nossum
    Cc: Vikas Shivappa
    Cc: Vince Weaver
    Link: http://lkml.kernel.org/r/1480841177-27299-1-git-send-email-davidcc@google.com
    Signed-off-by: Ingo Molnar

    David Carrillo-Cisneros
     

03 Dec, 2016

1 commit

  • Pull networking fixes from David Miller:

    1) Lots more phydev and probe error path leaks in various drivers by
    Johan Hovold.

    2) Fix race in packet_set_ring(), from Philip Pettersson.

    3) Use after free in dccp_invalid_packet(), from Eric Dumazet.

    4) Signedness overflow in SO_{SND,RCV}BUFFORCE, also from Eric
    Dumazet.

    5) When tunneling between ipv4 and ipv6 we can be left with the wrong
    skb->protocol value as we enter the IPSEC engine and this causes all
    kinds of problems. Set it before the output path does any
    dst_output() calls, from Eli Cooper.

    6) bcmgenet uses wrong device struct pointer in DMA API calls, fix from
    Florian Fainelli.

    7) Various netfilter nat bug fixes from Florian Westphal.

    8) Fix memory leak in ipvlan_link_new(), from Gao Feng.

    9) Locking fixes, particularly wrt. socket lookups, in l2tp from
    Guillaume Nault.

    10) Avoid invoking rhash teardowns in atomic context by moving netlink
    cb->done() dump completion from a worker thread. Fix from Herbert
    Xu.

    11) Buffer refcount problems in tun and macvtap on errors, from Jason
    Wang.

    12) We don't set Kconfig symbol DEFAULT_TCP_CONG properly when the user
    selects BBR. Fix from Julian Wollrath.

    13) Fix deadlock in transmit path on altera TSE driver, from Lino
    Sanfilippo.

    14) Fix unbalanced reference counting in dsa_switch_tree, from Nikita
    Yushchenko.

    15) tc_tunnel_key needs to be properly exported to userspace via uapi,
    fix from Roi Dayan.

    16) rds_tcp_init_net() doesn't unregister notifier in error path, fix
    from Sowmini Varadhan.

    17) Stale packet header pointer access after pskb_expand_head() in
    genenve driver, fix from Sabrina Dubroca.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (103 commits)
    net: avoid signed overflows for SO_{SND|RCV}BUFFORCE
    geneve: avoid use-after-free of skb->data
    tipc: check minimum bearer MTU
    net: renesas: ravb: unintialized return value
    sh_eth: remove unchecked interrupts for RZ/A1
    net: bcmgenet: Utilize correct struct device for all DMA operations
    NET: usb: qmi_wwan: add support for Telit LE922A PID 0x1040
    cdc_ether: Fix handling connection notification
    ip6_offload: check segs for NULL in ipv6_gso_segment.
    RDS: TCP: unregister_netdevice_notifier() in error path of rds_tcp_init_net
    Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"
    ipv6: Set skb->protocol properly for local output
    ipv4: Set skb->protocol properly for local output
    packet: fix race condition in packet_set_ring
    net: ethernet: altera: TSE: do not use tx queue lock in tx completion handler
    net: ethernet: altera: TSE: Remove unneeded dma sync for tx buffers
    net: ethernet: stmmac: fix of-node and fixed-link-phydev leaks
    net: ethernet: stmmac: platform: fix outdated function header
    net: ethernet: stmmac: dwmac-meson8b: fix probe error path
    net: ethernet: stmmac: dwmac-generic: fix probe error path
    ...

    Linus Torvalds
     

02 Dec, 2016

2 commits

  • While debugging the rtmutex unlock vs. dequeue race Will suggested to use
    READ_ONCE() in rt_mutex_owner() as it might race against the
    cmpxchg_release() in unlock_rt_mutex_safe().

    Will: "It's a minor thing which will most likely not matter in practice"

    Careful search did not unearth an actual problem in today's code, but it's
    better to be safe than surprised.

    Suggested-by: Will Deacon
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Daney
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Steven Rostedt
    Cc:
    Link: http://lkml.kernel.org/r/20161130210030.431379999@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
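
    A minimal sketch of the accessor (not the exact kernel helper): read
    the owner field exactly once so the compiler cannot reload it around
    the concurrent cmpxchg_release() in the lockless unlock path; bit 0 of
    the pointer is the waiters bit:

        #include <linux/compiler.h>

        struct rtmutex_owner_sketch {
                void *owner;                     /* task pointer | waiters bit */
        };

        static inline void *rt_mutex_owner_sketch(struct rtmutex_owner_sketch *lock)
        {
                unsigned long owner = (unsigned long)READ_ONCE(lock->owner);

                return (void *)(owner & ~1UL);   /* strip the waiters bit */
        }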
     
  • David reported a futex/rtmutex state corruption. It's caused by the
    following problem:

    CPU0                 CPU1                     CPU2

     l->owner=T1
                         rt_mutex_lock(l)
                         lock(l->wait_lock)
                         l->owner = T1 | HAS_WAITERS;
                         enqueue(T2)
                         boost()
                           unlock(l->wait_lock)
                         schedule()

                                                  rt_mutex_lock(l)
                                                  lock(l->wait_lock)
                                                  l->owner = T1 | HAS_WAITERS;
                                                  enqueue(T3)
                                                  boost()
                                                    unlock(l->wait_lock)
                                                  schedule()
     signal(->T2)        signal(->T3)
                         lock(l->wait_lock)
                         dequeue(T2)
                         deboost()
                           unlock(l->wait_lock)
                                                  lock(l->wait_lock)
                                                  dequeue(T3)
                                                    ===> wait list is now empty
                                                  deboost()
                                                    unlock(l->wait_lock)
                         lock(l->wait_lock)
                         fixup_rt_mutex_waiters()
                           if (wait_list_empty(l)) {
                             owner = l->owner & ~HAS_WAITERS;
                             l->owner = owner
                               ==> l->owner = T1
                           }

                                                  lock(l->wait_lock)
     rt_mutex_unlock(l)                           fixup_rt_mutex_waiters()
                                                    if (wait_list_empty(l)) {
                                                      owner = l->owner & ~HAS_WAITERS;
     cmpxchg(l->owner, T1, NULL)
       ===> Success (l->owner = NULL)
                                                      l->owner = owner
                                                        ==> l->owner = T1
                                                    }

    That means the problem is caused by fixup_rt_mutex_waiters() which does the
    RMW to clear the waiters bit unconditionally when there are no waiters in
    the rtmutexes rbtree.

    This can be fatal: A concurrent unlock can release the rtmutex in the
    fastpath because the waiters bit is not set. If the cmpxchg() gets in the
    middle of the RMW operation then the previous owner, which just unlocked
    the rtmutex, is set as the owner again when the write takes place after the
    successful cmpxchg().

    The solution is rather trivial: verify that the owner member of the rtmutex
    has the waiters bit set before clearing it. This does not require a
    cmpxchg() or other atomic operations because the waiters bit can only be
    set and cleared with the rtmutex wait_lock held. It's also safe against the
    fast path unlock attempt. The unlock attempt via cmpxchg() will either see
    the bit set and take the slowpath or see the bit cleared and release it
    atomically in the fastpath.

    It's remarkable that the test program provided by David triggers on ARM64
    and MIPS64 really quickly, but it refuses to reproduce on x86-64, while the
    problem exists there as well. That refusal might explain why this was not
    discovered earlier despite the bug existing from day one of the rtmutex
    implementation more than 10 years ago.

    Thanks to David for meticulously instrumenting the code and providing the
    information which allowed us to decode this subtle problem.

    Reported-by: David Daney
    Tested-by: David Daney
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Will Deacon
    Cc: stable@vger.kernel.org
    Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
    Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
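
    A minimal sketch of the fixed-up helper logic, with stand-in types
    (called with the rtmutex wait_lock held): only rewrite the owner field
    when the waiters bit is actually set, so the store can never race the
    lockless cmpxchg() unlock into resurrecting a stale owner:

        #define HAS_WAITERS_SKETCH      1UL      /* bit 0 of the owner pointer */

        struct rtmutex_sketch {
                void *owner;                     /* task pointer | HAS_WAITERS_SKETCH */
                int   nr_waiters;                /* stand-in for the waiter rbtree */
        };

        static void fixup_waiters_sketch(struct rtmutex_sketch *lock)
        {
                unsigned long owner = (unsigned long)lock->owner;

                /* The bit is only ever changed under wait_lock, so no atomic
                 * RMW is needed; the fast-path unlock either sees the bit and
                 * takes the slow path, or sees it clear and releases the lock
                 * atomically via cmpxchg(). */
                if (lock->nr_waiters == 0 && (owner & HAS_WAITERS_SKETCH))
                        lock->owner = (void *)(owner & ~HAS_WAITERS_SKETCH);
        }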
     

01 Dec, 2016

1 commit

  • If we have a branch that looks something like this

    int foo = map->value;
    if (condition) {
            foo += blah;
    } else {
            foo = bar;
    }
    map->array[foo] = baz;

    We will incorrectly assume that the !condition branch is equal to the condition
    branch as the register for foo will be UNKNOWN_VALUE in both cases. We need to
    adjust this logic to only do this if we didn't do a varlen access after we
    processed the !condition branch, otherwise we have different ranges and need to
    check the other branch as well.

    Fixes: 484611357c19 ("bpf: allow access into map value arrays")
    Reported-by: Jann Horn
    Signed-off-by: Josef Bacik
    Acked-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Signed-off-by: David S. Miller

    Josef Bacik