13 Jul, 2017

3 commits


09 Jun, 2017

1 commit

  • When the tick is stopped and we reach the dynticks evaluation code on
    IRQ exit, we perform a soft tick restart if we observe an expired timer
    from there. It means we program the nearest possible tick but we stay in
    dynticks mode (ts->tick_stopped = 1) because we may need to stop the tick
    again after that expired timer is handled.

    Now this solution works most of the time but if we suffer an IRQ storm
    and those interrupts trigger faster than the hardware clockevents min
    delay, our tick won't fire until that IRQ storm is finished.

    Here is the problem: on IRQ exit we reprog the timer to at least
    NOW() + min_clockevents_delay. Another IRQ fires before the tick so we
    reschedule again to NOW() + min_clockevents_delay, etc... The tick
    is eternally rescheduled min_clockevents_delay ahead.

    A solution is to simply remove this soft tick restart. After all, the
    normal dynticks evaluation path can handle a 0 delay just fine. By doing
    that we also benefit from the optimization branch which avoids clock
    reprogramming if the clockevents deadline hasn't changed since the last
    reprog. This fixes our issue because we no longer do the repetitive clock
    reprogramming that always adds the hardware min delay.

    As a side effect it should even optimize the 0 delay path in general.
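
    As a purely illustrative aid (a stand-alone user-space simulation with
    made-up numbers, not kernel code), the loop below models the starvation
    described above: every IRQ exit re-arms the tick at now + min_delay, but
    the next storm IRQ always arrives first, so the tick never fires while
    the storm lasts.

    #include <stdio.h>

    int main(void)
    {
            unsigned long now = 0;
            const unsigned long min_delay = 100;  /* clockevents min delta  */
            const unsigned long irq_period = 60;  /* storm IRQs come faster */
            unsigned long tick_deadline;
            int tick_fired = 0;

            /* first IRQ exit: a timer already expired, soft-restart the tick */
            tick_deadline = now + min_delay;

            for (int irq = 1; irq <= 8 && !tick_fired; irq++) {
                    now += irq_period;            /* next storm IRQ arrives */
                    if (now >= tick_deadline) {
                            tick_fired = 1;       /* tick beat the IRQ      */
                    } else {
                            /* IRQ exit: timer still pending, re-arm again  */
                            tick_deadline = now + min_delay;
                            printf("irq %d: tick pushed to t=%lu\n",
                                   irq, tick_deadline);
                    }
            }
            printf("tick fired during the storm: %s\n",
                   tick_fired ? "yes" : "no");
            return 0;
    }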

    Reported-and-tested-by: Octavian Purdila
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

22 Mar, 2017

14 commits

  • am: c406096522

    Change-Id: I2ac74fa5f03a3dff0b45e91d8be4e80c9c70da98

    Niklas Cassel
     
  • am: 1522181f4b

    Change-Id: I74d73e5550ccb3a8d846f636b4f7940b124d261c

    Peter Zijlstra
     
  • am: 6244ffc5a1

    Change-Id: I973ed47c736e8deeebcea67e885abfb557fc1469

    Peter Zijlstra
     
  • am: 0e0f1d6fdb

    Change-Id: I3990cdd8bc095423723df1771ec87158dc8cb63a

    Daniel Borkmann
     
  • am: 1889d6d9b5

    Change-Id: I528df00bc5cd00aec10f56ad53907f6ca8044973

    Daniel Borkmann
     
  • am: b7f5aa1ca0

    Change-Id: I1d61f752ac33e41ebb748eba7d3f48003ee29c19

    Alexei Starovoitov
     
  • am: 1411707acb

    Change-Id: I02e316379d5a19258ea601e75b9f4cece6450acd

    Thomas Graf
     
  • commit 17fcbd590d0c3e35bd9646e2215f86586378bc42 upstream.

    We hang if SIGKILL has been sent, but the task is stuck in down_read()
    (after do_exit()), even though no task is doing down_write() on the
    rwsem in question:

    INFO: task libupnp:21868 blocked for more than 120 seconds.
    libupnp D 0 21868 1 0x08100008
    ...
    Call Trace:
    __schedule()
    schedule()
    __down_read()
    do_exit()
    do_group_exit()
    __wake_up_parent()

    This bug has already been fixed for CONFIG_RWSEM_XCHGADD_ALGORITHM=y in
    the following commit:

    04cafed7fc19 ("locking/rwsem: Fix down_write_killable()")

    ... however, this bug also exists for CONFIG_RWSEM_GENERIC_SPINLOCK=y.

    Signed-off-by: Niklas Cassel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Niklas Cassel
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: d47996082f52 ("locking/rwsem: Introduce basis for down_write_killable()")
    Link: http://lkml.kernel.org/r/1487981873-12649-1-git-send-email-niklass@axis.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Niklas Cassel
     
  • commit 9bbb25afeb182502ca4f2c4f3f88af0681b34cae upstream.

    Thomas spotted that fixup_pi_state_owner() can return errors and we
    fail to unlock the rt_mutex in that case.

    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Darren Hart
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170304093558.867401760@infradead.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit c236c8e95a3d395b0494e7108f0d41cf36ec107c upstream.

    While working on the futex code, I stumbled over this potential
    use-after-free scenario. Dmitry triggered it later with syzkaller.

    pi_mutex is a pointer into pi_state, which we drop the reference on in
    unqueue_me_pi(). So any access to that pointer after that is bad.

    Since other sites already do rt_mutex_unlock() with hb->lock held, see
    for example futex_lock_pi(), simply move the unlock before
    unqueue_me_pi().

    Reported-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Darren Hart
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.org
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • [ Upstream commit 6760bf2ddde8ad64f8205a651223a93de3a35494 ]

    Martin reported a verifier issue that hit the BUG_ON() for his
    test case in the mark_reg_unknown_value() function:

    [ 202.861380] kernel BUG at kernel/bpf/verifier.c:467!
    [...]
    [ 203.291109] Call Trace:
    [ 203.296501] [] mark_map_reg+0x45/0x50
    [ 203.308225] [] mark_map_regs+0x78/0x90
    [ 203.320140] [] do_check+0x226d/0x2c90
    [ 203.331865] [] bpf_check+0x48b/0x780
    [ 203.343403] [] bpf_prog_load+0x27e/0x440
    [ 203.355705] [] ? handle_mm_fault+0x11af/0x1230
    [ 203.369158] [] ? security_capable+0x48/0x60
    [ 203.382035] [] SyS_bpf+0x124/0x960
    [ 203.393185] [] ? __do_page_fault+0x276/0x490
    [ 203.406258] [] entry_SYSCALL_64_fastpath+0x13/0x94

    This issue got uncovered after the fix in a08dd0da5307 ("bpf: fix
    regression on verifier pruning wrt map lookups"). The reason it wasn't
    noticed before is that, as mentioned in a08dd0da5307, mark_map_regs()
    was doing the id matching incorrectly, based on the uncached
    regs[regno].id. So, in the first loop, we walked all regs and, as soon
    as we found regno == i, this reg's id was cleared by the call to
    mark_reg_unknown_value(), so that every subsequent register was probed
    against an id of 0 (which, in combination with the
    PTR_TO_MAP_VALUE_OR_NULL type, is an invalid condition that no other
    register state can hold) and therefore wasn't type-transitioned as in
    the spilled register case handled by the second loop.

    Now since that got fixed, it turned out that 57a09bf0a416 ("bpf:
    Detect identical PTR_TO_MAP_VALUE_OR_NULL registers") used
    mark_reg_unknown_value() incorrectly for the spilled regs, and thus
    hitting the BUG_ON() in some cases due to regno >= MAX_BPF_REG.

    Although spilled regs have the same type as the non-spilled regs
    for the verifier state, that is, struct bpf_reg_state, they are
    semantically different from the non-spilled regs. In other words,
    there can be up to 64 (MAX_BPF_STACK / BPF_REG_SIZE) spilled regs
    in the stack, for example, register R could have been spilled by
    the program to stack location X, Y, Z, and in mark_map_regs() we
    need to scan these stack slots of type STACK_SPILL for potential
    registers that we have to transition from PTR_TO_MAP_VALUE_OR_NULL.
    Therefore, depending on the location, the spilled_regs regno can be a
    lot higher than MAX_BPF_REG's value, since we operate on the stack
    instead. The reset in mark_reg_unknown_value() itself is just fine; it
    is only the BUG_ON() that was inappropriate for this case. Fix
    it by making a __mark_reg_unknown_value() version that can be
    called from mark_map_reg() generically; we know for the non-spilled
    case that the regno is always < MAX_BPF_REG anyway.
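
    As a rough stand-alone sketch of that wrapper/raw split (simplified,
    hypothetical types rather than the real verifier structures): the range
    check stays in the wrapper used for the numbered registers, while stack
    spill slots go through the unchecked helper.

    #include <assert.h>
    #include <stdio.h>

    #define MAX_BPF_REG 11          /* r0..r10 */

    enum reg_type { PTR_TO_MAP_VALUE_OR_NULL, UNKNOWN_VALUE };

    struct reg_state { enum reg_type type; unsigned int id; };

    static void __mark_reg_unknown_value(struct reg_state *regs,
                                         unsigned int regno)
    {
            regs[regno].type = UNKNOWN_VALUE;   /* no bounds check, so it is
                                                 * also usable for spill slots
                                                 * where regno >= MAX_BPF_REG */
            regs[regno].id = 0;
    }

    static void mark_reg_unknown_value(struct reg_state *regs,
                                       unsigned int regno)
    {
            assert(regno < MAX_BPF_REG);        /* stands in for the BUG_ON() */
            __mark_reg_unknown_value(regs, regno);
    }

    int main(void)
    {
            struct reg_state regs[64] = { { 0 } };

            mark_reg_unknown_value(regs, 3);    /* ordinary register: checked */
            __mark_reg_unknown_value(regs, 40); /* spilled slot: raw helper   */
            printf("reg3=%d slot40=%d\n", regs[3].type, regs[40].type);
            return 0;
    }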

    Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
    Reported-by: Martin KaFai Lau
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit a08dd0da5307ba01295c8383923e51e7997c3576 ]

    Commit 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL
    registers") introduced a regression where existing programs stopped
    loading due to reaching the verifier's maximum complexity limit,
    whereas prior to this commit they were loading just fine; the affected
    program has roughly 2k instructions.

    What was found is that state pruning couldn't be performed effectively
    anymore due to mismatches of the verifier's register state, in particular
    in the id tracking. It doesn't mean that 57a09bf0a416 is incorrect per
    se, but rather that the verifier needs to perform a lot more work for the
    same program with regard to the involved map lookups.

    Since commit 57a09bf0a416 is only about tracking registers with type
    PTR_TO_MAP_VALUE_OR_NULL, the id is only needed to follow registers
    until they are promoted through pattern matching with a NULL check to
    either PTR_TO_MAP_VALUE or UNKNOWN_VALUE type. After that point, the
    id becomes irrelevant for the transitioned types.

    For UNKNOWN_VALUE, id is already reset to 0 via mark_reg_unknown_value(),
    but not so for PTR_TO_MAP_VALUE where id is becoming stale. It's even
    transferred further into other types that don't make use of it. Among
    others, one example is where UNKNOWN_VALUE is set on function call
    return with RET_INTEGER return type.

    states_equal() will then fall through the memcmp() on register state;
    note that the second memcmp() uses offsetofend(), so the id is part of
    that since d2a4dd37f6b4 ("bpf: fix state equivalence"). But the bisect
    pointed already to 57a09bf0a416, where we really reach beyond complexity
    limit. What I found was that states_equal() often failed in this case
    due to id mismatches in spilled regs holding registers of type
    PTR_TO_MAP_VALUE. Unlike non-spilled regs, spilled regs just perform a
    memcmp() on their reg state and don't have any other optimizations in
    place, so the id was also relevant in this case for making a pruning
    decision.

    We can safely reset id to 0 as well when converting to PTR_TO_MAP_VALUE.
    For the affected program, it resulted in a ~17 fold reduction of
    complexity and let the program load fine again. Selftest suite also
    runs fine. The only other place where env->id_gen is used currently is
    through direct packet access, but for these cases id is long living, thus
    a different scenario.

    Also, the current logic in mark_map_regs() is not fully correct when
    marking the NULL branch with UNKNOWN_VALUE. We need to cache the
    destination reg's id in any case. Otherwise, once we mark that reg as
    UNKNOWN_VALUE, its id is reset and any subsequent registers that hold the
    original id and are of type PTR_TO_MAP_VALUE_OR_NULL won't be marked
    UNKNOWN_VALUE anymore, since mark_map_reg() reuses the uncached
    regs[regno].id that
    was just overridden. Note, we don't need to cache it outside of
    mark_map_regs(), since it's called once on this_branch and the other
    time on other_branch, which are both two independent verifier states.
    A test case for this is added here, too.

    Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
    Signed-off-by: Daniel Borkmann
    Acked-by: Thomas Graf
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit d2a4dd37f6b41fbcad76efbf63124eb3126c66fe ]

    Commits 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
    and 484611357c19 ("bpf: allow access into map value arrays") by themselves
    are correct, but in combination they make state equivalence ignore the 'id'
    field of the register state, which can lead to accepting an invalid program.

    Fixes: 57a09bf0a416 ("bpf: Detect identical PTR_TO_MAP_VALUE_OR_NULL registers")
    Fixes: 484611357c19 ("bpf: allow access into map value arrays")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Thomas Graf
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Alexei Starovoitov
     
  • [ Upstream commit 57a09bf0a416700676e77102c28f9cfcb48267e0 ]

    A BPF program is required to check the return register of a
    map_elem_lookup() call before accessing memory. The verifier keeps
    track of this by converting the type of the result register from
    PTR_TO_MAP_VALUE_OR_NULL to PTR_TO_MAP_VALUE after a conditional
    jump ensures safety. This check is currently exclusively performed
    for the result register 0.

    In the event the compiler reorders instructions, BPF_MOV64_REG
    instructions may be moved before the conditional jump which causes
    them to keep their type PTR_TO_MAP_VALUE_OR_NULL to which the
    verifier objects when the register is accessed:

    0: (b7) r1 = 10
    1: (7b) *(u64 *)(r10 -8) = r1
    2: (bf) r2 = r10
    3: (07) r2 += -8
    4: (18) r1 = 0x59c00000
    6: (85) call 1
    7: (bf) r4 = r0
    8: (15) if r0 == 0x0 goto pc+1
    R0=map_value(ks=8,vs=8) R4=map_value_or_null(ks=8,vs=8) R10=fp
    9: (7a) *(u64 *)(r4 +0) = 0
    R4 invalid mem access 'map_value_or_null'

    This commit extends the verifier to keep track of all identical
    PTR_TO_MAP_VALUE_OR_NULL registers after a map_elem_lookup() by
    assigning them an ID and then marking them all when the conditional
    jump is observed.

    Signed-off-by: Thomas Graf
    Reviewed-by: Josef Bacik
    Acked-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Thomas Graf
     

18 Mar, 2017

3 commits

  • am: ee6f7ee1e4

    Change-Id: I222bf26458f918b2481e0c14ce2b0bf5fd3c85d8

    Eric W. Biederman
     
  • commit 040757f738e13caaa9c5078bca79aa97e11dde88 upstream.

    Always increment/decrement ucount->count under the ucounts_lock. The
    increments are there already and moving the decrements there means the
    locking logic of the code is simpler. This simplification in the
    locking logic fixes a race between put_ucounts and get_ucounts that
    could result in a use-after-free because the count could go zero then
    be found by get_ucounts and then be freed by put_ucounts.

    A bug presumably this one was found by a combination of syzkaller and
    KASAN. JongWhan Kim reported the syzkaller failure and Dmitry Vyukov
    spotted the race in the code.

    Fixes: f6b2db1a3e8d ("userns: Make the count of user namespaces per user")
    Reported-by: JongHwan Kim
    Reported-by: Dmitry Vyukov
    Reviewed-by: Andrei Vagin
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • EAS uses "const struct sched_group_energy * const" fairly consistently,
    but a couple of places swap the "*" and the second "const", making the
    pointer itself mutable.

    In the case of struct sched_group, "* const" would have been an error,
    since init_sched_energy() writes to sd->groups->sge.
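
    A stand-alone illustration of the placement difference (hypothetical type,
    not the kernel struct): the first pointer below can be re-assigned, which
    is what a member like sd->groups->sge needs, while the second cannot.

    #include <stdio.h>

    struct energy { int cost; };

    int main(void)
    {
            static const struct energy table = { 42 };
            static const struct energy other = { 7 };

            /* "const struct energy *": const data, mutable pointer */
            const struct energy *sge = &table;
            sge = &other;                           /* fine              */

            /* "const struct energy * const": const data, const pointer */
            const struct energy * const fixed = &table;
            /* fixed = &other; */                   /* would not compile */

            printf("%d %d\n", sge->cost, fixed->cost);
            return 0;
    }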

    Change-Id: Ic6a8fcf99e65c0f25d9cc55c32625ef3ca5c9aca
    Signed-off-by: Greg Hackmann

    Greg Hackmann
     

15 Mar, 2017

2 commits

  • am: d3381fab77

    Change-Id: I5f0019d2c86afd0a055f823f6751db9f6378e38f

    Eric W. Biederman
     
  • commit 93faccbbfa958a9668d3ab4e30f38dd205cee8d8 upstream.

    To support unprivileged users mounting filesystems, two permission
    checks have to be performed: a test to see if the user is allowed to
    create a mount in the mount namespace, and a test to see if the user is
    allowed to access the specified filesystem.

    The automount case is special in that mounting the original filesystem
    grants permission to mount the sub-filesystems to any user who happens
    to stumble across their mountpoint and satisfies the ordinary filesystem
    permission checks.

    Attempting to handle the automount case by using override_creds
    almost works. It preserves the idea that permission to mount
    the original filesystem is permission to mount the sub-filesystem.
    Unfortunately, using override_creds messes up the filesystem's
    ordinary permission checks.

    Solve this by being explicit that a mount is a submount by introducing
    vfs_submount, and using it where appropriate.

    vfs_submount uses a new internal mount flag, MS_SUBMOUNT, to let sget
    and friends know that a mount is a submount so they can take the
    appropriate action.

    sget and sget_userns are modified to not perform any permission checks
    on submounts.

    follow_automount is modified to stop using override_creds as that
    has proven problematic.

    do_mount is modified to always remove the new MS_SUBMOUNT flag so
    that we know userspace will never be able to specify it.

    autofs4 is modified to stop using current_real_cred that was put in
    there to handle the previous version of submount permission checking.

    cifs is modified to pass the mountpoint all of the way down to vfs_submount.

    debugfs is modified to pass the mountpoint all of the way down to
    trace_automount by adding a new parameter. To make this change easier
    a new typedef debugfs_automount_t is introduced to capture the type of
    the debugfs automount function.

    Fixes: 069d5ac9ae0d ("autofs: Fix automounts by using current_real_cred()->uid")
    Fixes: aeaa4a79ff6a ("fs: Call d_automount with the filesystems creds")
    Reviewed-by: Trond Myklebust
    Reviewed-by: Seth Forshee
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     

14 Mar, 2017

1 commit


12 Mar, 2017

6 commits

  • am: 3de5a92847

    Change-Id: Ib7d8806d40c785e0128af2d846bda76727aa1255

    Mathieu Desnoyers
     
  • am: 6d94a6b32e

    Change-Id: I972940b00813db773e278ce8b6bc46f3cba15988

    Stas Sergeev
     
  • am: f1faaec484

    Change-Id: I7e37ea31b7376f63e9ff38b5d40020e793a64c1a

    Dan Williams
     
  • commit 907565337ebf998a68cb5c5b2174ce5e5da065eb upstream.

    Userspace applications should be allowed to expect the membarrier system
    call with the MEMBARRIER_CMD_SHARED command to issue memory barriers on
    nohz_full CPUs, but synchronize_sched() does not take those CPUs into
    account.

    Given that we do not want unrelated processes to be able to affect
    real-time-sensitive nohz_full CPUs, simply return ENOSYS when membarrier
    is invoked on a kernel with nohz_full CPUs enabled.
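
    A user-space sketch of the resulting behaviour (assuming headers recent
    enough to provide __NR_membarrier and MEMBARRIER_CMD_SHARED): the call
    either succeeds, or fails with ENOSYS on an unsupported configuration
    such as a nohz_full kernel.

    #include <errno.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/membarrier.h>

    int main(void)
    {
            long ret = syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0);

            if (ret == 0)
                    puts("memory barrier issued on all running threads");
            else if (errno == ENOSYS)
                    puts("membarrier unsupported here (e.g. nohz_full kernel)");
            else
                    perror("membarrier");
            return ret == 0 ? 0 : 1;
    }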

    Signed-off-by: Mathieu Desnoyers
    CC: Josh Triplett
    CC: Steven Rostedt
    Signed-off-by: Paul E. McKenney
    Cc: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Rik van Riel
    Acked-by: Lai Jiangshan
    Reviewed-by: Josh Triplett
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Desnoyers
     
  • commit 441398d378f29a5ad6d0fcda07918e54e4961800 upstream.

    Currently SS_AUTODISARM is not supported in compatibility mode, but it
    does not return -EINVAL either. This makes dosemu built with -m32 on
    x86_64 crash. Also, the kernel's sigaltstack selftest fails if compiled
    with -m32.

    This patch adds the needed support.
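
    A minimal user-space sketch of the SS_AUTODISARM usage that this makes
    work for 32-bit binaries (the fallback define and the stack size are
    illustrative choices; the flag is guarded in case older headers lack it).

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>

    #ifndef SS_AUTODISARM
    #define SS_AUTODISARM (1U << 31)
    #endif

    #define ALT_STACK_SIZE (64 * 1024)

    static char stack_mem[ALT_STACK_SIZE];

    int main(void)
    {
            stack_t ss = {
                    .ss_sp    = stack_mem,
                    .ss_size  = sizeof(stack_mem),
                    .ss_flags = (int)SS_AUTODISARM, /* disarm on signal entry */
            };

            if (sigaltstack(&ss, NULL) == -1) {
                    perror("sigaltstack");          /* EINVAL if unsupported  */
                    return EXIT_FAILURE;
            }
            puts("alternate signal stack registered with SS_AUTODISARM");
            return 0;
    }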

    Link: http://lkml.kernel.org/r/20170205101213.8163-2-stsp@list.ru
    Signed-off-by: Stas Sergeev
    Cc: Milosz Tanski
    Cc: Andy Lutomirski
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Oleg Nesterov
    Cc: Nicolas Pitre
    Cc: Waiman Long
    Cc: Dave Hansen
    Cc: Dmitry Safonov
    Cc: Wang Xiaoqiang
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Stas Sergeev
     
  • commit b5d24fda9c3dce51fcb4eee459550a458eaaf1e2 upstream.

    The mem_hotplug_{begin,done} lock coordinates with {get,put}_online_mems()
    to hold off "readers" of the current state of memory from new hotplug
    actions. mem_hotplug_begin() expects exclusive access, via the
    device_hotplug lock, to set mem_hotplug.active_writer. Calling
    mem_hotplug_begin() without locking device_hotplug can lead to
    corrupting mem_hotplug.refcount and missed wakeups / soft lockups.

    [dan.j.williams@intel.com: v2]
    Link: http://lkml.kernel.org/r/148728203365.38457.17804568297887708345.stgit@dwillia2-desk3.amr.corp.intel.com
    Link: http://lkml.kernel.org/r/148693885680.16345.17802627926777862337.stgit@dwillia2-desk3.amr.corp.intel.com
    Fixes: f931ab479dd2 ("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}")
    Signed-off-by: Dan Williams
    Reported-by: Ben Hutchings
    Cc: Michal Hocko
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Masayoshi Mizuma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Dan Williams
     

24 Feb, 2017

4 commits

  • Todd Kjos
     
  • commit f222449c9dfad7c9bb8cb53e64c5c407b172ebbc upstream.

    We cannot do printk() from tk_debug_account_sleep_time(), because
    tk_debug_account_sleep_time() is called under tk_core seq lock.
    The reason printk() is unsafe there is that console_sem may invoke the
    scheduler (up()->wake_up_process()->activate_task()), which, in turn, can
    call back into the timekeeping code, for instance via
    get_time()->ktime_get(), deadlocking the system on the tk_core seq lock.

    [ 48.950592] ======================================================
    [ 48.950622] [ INFO: possible circular locking dependency detected ]
    [ 48.950622] 4.10.0-rc7-next-20170213+ #101 Not tainted
    [ 48.950622] -------------------------------------------------------
    [ 48.950622] kworker/0:0/3 is trying to acquire lock:
    [ 48.950653] (tk_core){----..}, at: [] retrigger_next_event+0x4c/0x90
    [ 48.950683]
    but task is already holding lock:
    [ 48.950683] (hrtimer_bases.lock){-.-...}, at: [] retrigger_next_event+0x38/0x90
    [ 48.950714]
    which lock already depends on the new lock.

    [ 48.950714]
    the existing dependency chain (in reverse order) is:
    [ 48.950714]
    -> #5 (hrtimer_bases.lock){-.-...}:
    [ 48.950744] _raw_spin_lock_irqsave+0x50/0x64
    [ 48.950775] lock_hrtimer_base+0x28/0x58
    [ 48.950775] hrtimer_start_range_ns+0x20/0x5c8
    [ 48.950775] __enqueue_rt_entity+0x320/0x360
    [ 48.950805] enqueue_rt_entity+0x2c/0x44
    [ 48.950805] enqueue_task_rt+0x24/0x94
    [ 48.950836] ttwu_do_activate+0x54/0xc0
    [ 48.950836] try_to_wake_up+0x248/0x5c8
    [ 48.950836] __setup_irq+0x420/0x5f0
    [ 48.950836] request_threaded_irq+0xdc/0x184
    [ 48.950866] devm_request_threaded_irq+0x58/0xa4
    [ 48.950866] omap_i2c_probe+0x530/0x6a0
    [ 48.950897] platform_drv_probe+0x50/0xb0
    [ 48.950897] driver_probe_device+0x1f8/0x2cc
    [ 48.950897] __driver_attach+0xc0/0xc4
    [ 48.950927] bus_for_each_dev+0x6c/0xa0
    [ 48.950927] bus_add_driver+0x100/0x210
    [ 48.950927] driver_register+0x78/0xf4
    [ 48.950958] do_one_initcall+0x3c/0x16c
    [ 48.950958] kernel_init_freeable+0x20c/0x2d8
    [ 48.950958] kernel_init+0x8/0x110
    [ 48.950988] ret_from_fork+0x14/0x24
    [ 48.950988]
    -> #4 (&rt_b->rt_runtime_lock){-.-...}:
    [ 48.951019] _raw_spin_lock+0x40/0x50
    [ 48.951019] rq_offline_rt+0x9c/0x2bc
    [ 48.951019] set_rq_offline.part.2+0x2c/0x58
    [ 48.951049] rq_attach_root+0x134/0x144
    [ 48.951049] cpu_attach_domain+0x18c/0x6f4
    [ 48.951049] build_sched_domains+0xba4/0xd80
    [ 48.951080] sched_init_smp+0x68/0x10c
    [ 48.951080] kernel_init_freeable+0x160/0x2d8
    [ 48.951080] kernel_init+0x8/0x110
    [ 48.951080] ret_from_fork+0x14/0x24
    [ 48.951110]
    -> #3 (&rq->lock){-.-.-.}:
    [ 48.951110] _raw_spin_lock+0x40/0x50
    [ 48.951141] task_fork_fair+0x30/0x124
    [ 48.951141] sched_fork+0x194/0x2e0
    [ 48.951141] copy_process.part.5+0x448/0x1a20
    [ 48.951171] _do_fork+0x98/0x7e8
    [ 48.951171] kernel_thread+0x2c/0x34
    [ 48.951171] rest_init+0x1c/0x18c
    [ 48.951202] start_kernel+0x35c/0x3d4
    [ 48.951202] 0x8000807c
    [ 48.951202]
    -> #2 (&p->pi_lock){-.-.-.}:
    [ 48.951232] _raw_spin_lock_irqsave+0x50/0x64
    [ 48.951232] try_to_wake_up+0x30/0x5c8
    [ 48.951232] up+0x4c/0x60
    [ 48.951263] __up_console_sem+0x2c/0x58
    [ 48.951263] console_unlock+0x3b4/0x650
    [ 48.951263] vprintk_emit+0x270/0x474
    [ 48.951293] vprintk_default+0x20/0x28
    [ 48.951293] printk+0x20/0x30
    [ 48.951324] kauditd_hold_skb+0x94/0xb8
    [ 48.951324] kauditd_thread+0x1a4/0x56c
    [ 48.951324] kthread+0x104/0x148
    [ 48.951354] ret_from_fork+0x14/0x24
    [ 48.951354]
    -> #1 ((console_sem).lock){-.....}:
    [ 48.951385] _raw_spin_lock_irqsave+0x50/0x64
    [ 48.951385] down_trylock+0xc/0x2c
    [ 48.951385] __down_trylock_console_sem+0x24/0x80
    [ 48.951385] console_trylock+0x10/0x8c
    [ 48.951416] vprintk_emit+0x264/0x474
    [ 48.951416] vprintk_default+0x20/0x28
    [ 48.951416] printk+0x20/0x30
    [ 48.951446] tk_debug_account_sleep_time+0x5c/0x70
    [ 48.951446] __timekeeping_inject_sleeptime.constprop.3+0x170/0x1a0
    [ 48.951446] timekeeping_resume+0x218/0x23c
    [ 48.951477] syscore_resume+0x94/0x42c
    [ 48.951477] suspend_enter+0x554/0x9b4
    [ 48.951477] suspend_devices_and_enter+0xd8/0x4b4
    [ 48.951507] enter_state+0x934/0xbd4
    [ 48.951507] pm_suspend+0x14/0x70
    [ 48.951507] state_store+0x68/0xc8
    [ 48.951538] kernfs_fop_write+0xf4/0x1f8
    [ 48.951538] __vfs_write+0x1c/0x114
    [ 48.951538] vfs_write+0xa0/0x168
    [ 48.951568] SyS_write+0x3c/0x90
    [ 48.951568] __sys_trace_return+0x0/0x10
    [ 48.951568]
    -> #0 (tk_core){----..}:
    [ 48.951599] lock_acquire+0xe0/0x294
    [ 48.951599] ktime_get_update_offsets_now+0x5c/0x1d4
    [ 48.951629] retrigger_next_event+0x4c/0x90
    [ 48.951629] on_each_cpu+0x40/0x7c
    [ 48.951629] clock_was_set_work+0x14/0x20
    [ 48.951660] process_one_work+0x2b4/0x808
    [ 48.951660] worker_thread+0x3c/0x550
    [ 48.951660] kthread+0x104/0x148
    [ 48.951690] ret_from_fork+0x14/0x24
    [ 48.951690]
    other info that might help us debug this:

    [ 48.951690] Chain exists of:
    tk_core --> &rt_b->rt_runtime_lock --> hrtimer_bases.lock

    [ 48.951721] Possible unsafe locking scenario:

    [ 48.951721]        CPU0                    CPU1
    [ 48.951721]        ----                    ----
    [ 48.951721]   lock(hrtimer_bases.lock);
    [ 48.951751]                               lock(&rt_b->rt_runtime_lock);
    [ 48.951751]                               lock(hrtimer_bases.lock);
    [ 48.951751]   lock(tk_core);
    [ 48.951782]
    *** DEADLOCK ***

    [ 48.951782] 3 locks held by kworker/0:0/3:
    [ 48.951782] #0: ("events"){.+.+.+}, at: [] process_one_work+0x1f8/0x808
    [ 48.951812] #1: (hrtimer_work){+.+...}, at: [] process_one_work+0x1f8/0x808
    [ 48.951843] #2: (hrtimer_bases.lock){-.-...}, at: [] retrigger_next_event+0x38/0x90
    [ 48.951843] stack backtrace:
    [ 48.951873] CPU: 0 PID: 3 Comm: kworker/0:0 Not tainted 4.10.0-rc7-next-20170213+
    [ 48.951904] Workqueue: events clock_was_set_work
    [ 48.951904] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
    [ 48.951934] [] (show_stack) from [] (dump_stack+0xac/0xe0)
    [ 48.951934] [] (dump_stack) from [] (print_circular_bug+0x1d0/0x308)
    [ 48.951965] [] (print_circular_bug) from [] (validate_chain+0xf50/0x1324)
    [ 48.951965] [] (validate_chain) from [] (__lock_acquire+0x468/0x7e8)
    [ 48.951995] [] (__lock_acquire) from [] (lock_acquire+0xe0/0x294)
    [ 48.951995] [] (lock_acquire) from [] (ktime_get_update_offsets_now+0x5c/0x1d4)
    [ 48.952026] [] (ktime_get_update_offsets_now) from [] (retrigger_next_event+0x4c/0x90)
    [ 48.952026] [] (retrigger_next_event) from [] (on_each_cpu+0x40/0x7c)
    [ 48.952056] [] (on_each_cpu) from [] (clock_was_set_work+0x14/0x20)
    [ 48.952056] [] (clock_was_set_work) from [] (process_one_work+0x2b4/0x808)
    [ 48.952087] [] (process_one_work) from [] (worker_thread+0x3c/0x550)
    [ 48.952087] [] (worker_thread) from [] (kthread+0x104/0x148)
    [ 48.952087] [] (kthread) from [] (ret_from_fork+0x14/0x24)

    Replace printk() with printk_deferred(), which does not call into
    the scheduler.

    Fixes: 0bf43f15db85 ("timekeeping: Prints the amounts of time spent during suspend")
    Reported-and-tested-by: Tony Lindgren
    Signed-off-by: Sergey Senozhatsky
    Cc: Petr Mladek
    Cc: Sergey Senozhatsky
    Cc: Peter Zijlstra
    Cc: "Rafael J . Wysocki"
    Cc: Steven Rostedt
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/20170215044332.30449-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Sergey Senozhatsky
     
  • commit fc98c3c8c9dcafd67adcce69e6ce3191d5306c9c upstream.

    Use rcuidle console tracepoint because, apparently, it may be issued
    from an idle CPU:

    hw-breakpoint: Failed to enable monitor mode on CPU 0.
    hw-breakpoint: CPU 0 failed to disable vector catch

    ===============================
    [ ERR: suspicious RCU usage. ]
    4.10.0-rc8-next-20170215+ #119 Not tainted
    -------------------------------
    ./include/trace/events/printk.h:32 suspicious rcu_dereference_check() usage!

    other info that might help us debug this:

    RCU used illegally from idle CPU!
    rcu_scheduler_active = 2, debug_locks = 0
    RCU used illegally from extended quiescent state!
    2 locks held by swapper/0/0:
    #0: (cpu_pm_notifier_lock){......}, at: [] cpu_pm_exit+0x10/0x54
    #1: (console_lock){+.+.+.}, at: [] vprintk_emit+0x264/0x474

    stack backtrace:
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.10.0-rc8-next-20170215+ #119
    Hardware name: Generic OMAP4 (Flattened Device Tree)
    console_unlock
    vprintk_emit
    vprintk_default
    printk
    reset_ctrl_regs
    dbg_cpu_pm_notify
    notifier_call_chain
    cpu_pm_exit
    omap_enter_idle_coupled
    cpuidle_enter_state
    cpuidle_enter_state_coupled
    do_idle
    cpu_startup_entry
    start_kernel

    This RCU warning, however, is suppressed by lockdep_off() in printk().
    lockdep_off() increments the ->lockdep_recursion counter and thus
    disables RCU_LOCKDEP_WARN() and debug_lockdep_rcu_enabled(), which expect
    lockdep to be enabled (i.e. current->lockdep_recursion == 0).

    Link: http://lkml.kernel.org/r/20170217015932.11898-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Reported-by: Tony Lindgren
    Tested-by: Tony Lindgren
    Acked-by: Paul E. McKenney
    Acked-by: Steven Rostedt (VMware)
    Cc: Petr Mladek
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tony Lindgren
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sergey Senozhatsky
     
  • commit 25f71d1c3e98ef0e52371746220d66458eac75bc upstream.

    The UEVENT user mode helper is enabled before the initcalls are executed
    and is available when the root filesystem has been mounted.

    The user mode helper is triggered by device init calls and the executable
    might use the futex syscall.

    futex_init() is marked __initcall, which maps to device_initcall, but there
    is no guarantee that futex_init() is invoked _before_ the first device init
    call that triggers the UEVENT user mode helper.

    If the user mode helper uses the futex syscall before futex_init() then the
    syscall crashes with a NULL pointer dereference because the futex subsystem
    has not been initialized yet.

    Move futex_init() to core_initcall so futexes are initialized before the
    root filesystem is mounted and the usermode helper becomes available.

    [ tglx: Rewrote changelog ]

    Signed-off-by: Yang Yang
    Cc: jiang.biao2@zte.com.cn
    Cc: jiang.zhengxiong@zte.com.cn
    Cc: zhong.weidong@zte.com.cn
    Cc: deng.huali@zte.com.cn
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1483085875-6130-1-git-send-email-yang.yang29@zte.com.cn
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Yang Yang
     

23 Feb, 2017

2 commits

  • Move the x86_64 idle notifiers, originally added by Andi Kleen and
    Venkatesh Pallipadi, to generic code.

    Change-Id: Idf29cda15be151f494ff245933c12462643388d5
    Acked-by: Nicolas Pitre
    Signed-off-by: Todd Poynor

    Todd Poynor
     
  • These macros can be reused by governors which don't use the common
    governor code present in cpufreq_governor.c and should be moved to the
    relevant header.

    Now that they are getting moved to the right header file, reuse them in
    the schedutil governor as well (that required renaming the show/store
    routines).

    Also create a gov_attr_wo() macro for write-only sysfs files; this will be
    used by the Interactive governor in a later patch.

    Signed-off-by: Viresh Kumar

    Viresh Kumar
     

17 Feb, 2017

4 commits

  • Modify the schedutil cpufreq governor to boost the CPU
    frequency if the SCHED_CPUFREQ_IOWAIT flag is passed to
    it via cpufreq_update_util().

    If that happens, the frequency is set to the maximum during
    the first update after receiving the SCHED_CPUFREQ_IOWAIT flag
    and then the boost is reduced by half during each following update.
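
    A minimal stand-alone model of that policy (hypothetical helper, not the
    schedutil code itself): jump to the maximum on an IOWAIT update, then
    halve the boost on each following update until it drops back below the
    plain utilization.

    #include <stdbool.h>
    #include <stdio.h>

    static unsigned long apply_iowait_boost(unsigned long util,
                                            unsigned long max,
                                            bool iowait,
                                            unsigned long *boost)
    {
            if (iowait)
                    *boost = max;        /* first update after the flag: max */
            else if (*boost)
                    *boost /= 2;         /* decay by half on later updates   */

            if (*boost < util)
                    *boost = 0;          /* boost no longer matters          */

            return *boost > util ? *boost : util;
    }

    int main(void)
    {
            unsigned long boost = 0, max = 1024, util = 200;

            for (int i = 0; i < 6; i++)
                    printf("update %d: effective util = %lu\n", i,
                           apply_iowait_boost(util, max, i == 0, &boost));
            return 0;
    }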

    Signed-off-by: Rafael J. Wysocki
    Looks-good-to: Steve Muckle
    Acked-by: Peter Zijlstra (Intel)
    (cherry picked from commit 21ca6d2c52f8ca8638129c1dfc489d0b0ae68532)

    Rafael J. Wysocki
     
  • PELT does not consider SMT when scaling its utilization values via
    arch_scale_cpu_capacity(). The value in rq->cpu_capacity_orig does
    take SMT into consideration though and therefore may be smaller than
    the utilization reported by PELT.

    On an Intel i7-3630QM, for example, rq->cpu_capacity_orig is 589 but
    util_avg scales up to 1024. This means that a 50% utilized CPU will show
    up in schedutil as ~86% busy.
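
    A quick check of those numbers (illustrative arithmetic only): a CPU at a
    50% duty cycle accumulates util_avg of about 512 on the 0..1024 PELT
    scale, and 512 / 589 is roughly 0.87, i.e. ~86-87% apparent busyness.

    #include <stdio.h>

    int main(void)
    {
            double util_avg = 0.5 * 1024;   /* 50% utilization, PELT scale */
            double capacity_orig = 589;     /* SMT-scaled CPU capacity     */

            printf("apparent busy%%: %.1f\n",
                   100.0 * util_avg / capacity_orig);
            return 0;
    }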

    Fix this by using the same CPU scaling value in schedutil as that which
    is used by PELT.

    Signed-off-by: Steve Muckle
    Signed-off-by: Rafael J. Wysocki
    (cherry picked from commit 8314bc83f6a33958a033955e9bdc48e8dd4d5fb0)

    Steve Muckle
     
  • It is useful to know the reason why cpufreq_update_util() has just
    been called and that can be passed as flags to cpufreq_update_util()
    and to the ->func() callback in struct update_util_data. However,
    doing that in addition to passing the util and max arguments they
    already take would be clumsy, so avoid it.

    Instead, use the observation that the schedutil governor is part
    of the scheduler proper, so it can access scheduler data directly.
    This allows the util and max arguments of cpufreq_update_util()
    and the ->func() callback in struct update_util_data to be replaced
    with a flags one, but schedutil has to be modified to follow.

    Thus make the schedutil governor obtain the CFS utilization
    information from the scheduler and use the "RT" and "DL" flags
    instead of the special utilization value of ULONG_MAX to track
    updates from the RT and DL sched classes. Make it non-modular
    too to avoid having to export scheduler variables to modules at
    large.

    Next, update all of the other users of cpufreq_update_util()
    and the ->func() callback in struct update_util_data accordingly.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar
    (cherry picked from commit 58919e83c85c3a3c5fb34025dc0e95ddd998c478)

    Rafael J. Wysocki
     
  • The slow-path frequency transition path is relatively expensive as it
    requires waking up a thread to do work. Should support be added for
    remote CPU cpufreq updates, those will also be expensive since they
    require an IPI. These activities should be avoided if they are not
    necessary.

    To that end, calculate the actual driver-supported frequency required by
    the new utilization value in schedutil by using the recently added
    cpufreq_driver_resolve_freq API. If it is the same as the previously
    requested driver frequency, then there is no need to continue with the
    update, assuming the CPU frequency limits have not changed. This will
    have additional benefits should the semantics of the rate limit be
    changed to apply solely to frequency transitions rather than to
    frequency calculations in schedutil.

    The last raw required frequency is cached. This allows the driver
    frequency lookup to be skipped in the event that the new raw required
    frequency matches the last one, assuming a frequency update has not been
    forced due to limits changing (indicated by a next_freq value of
    UINT_MAX, see sugov_should_update_freq).
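
    A stand-alone sketch of those two early-exit checks (hypothetical field
    and helper names; resolve_freq() stands in for the
    cpufreq_driver_resolve_freq API mentioned above, with made-up 200 MHz
    frequency steps).

    #include <limits.h>
    #include <stdbool.h>
    #include <stdio.h>

    struct sg_policy {
            unsigned int cached_raw_freq;   /* last freq computed from util  */
            unsigned int next_freq;         /* last freq asked of the driver */
    };

    /* stand-in for cpufreq_driver_resolve_freq(): snap to a supported step */
    static unsigned int resolve_freq(unsigned int raw_khz)
    {
            return (raw_khz / 200000) * 200000;
    }

    static bool freq_update_needed(struct sg_policy *sg, unsigned int raw_khz)
    {
            /* skip the driver lookup when the raw value is unchanged, unless
             * a limits change forced next_freq to UINT_MAX */
            if (raw_khz == sg->cached_raw_freq && sg->next_freq != UINT_MAX)
                    return false;

            sg->cached_raw_freq = raw_khz;
            raw_khz = resolve_freq(raw_khz);

            if (raw_khz == sg->next_freq)   /* driver already runs there */
                    return false;

            sg->next_freq = raw_khz;
            return true;                    /* kick the slow path / IPI  */
    }

    int main(void)
    {
            struct sg_policy sg = { .cached_raw_freq = 0, .next_freq = UINT_MAX };

            printf("%d\n", freq_update_needed(&sg, 1130000)); /* 1: first request */
            printf("%d\n", freq_update_needed(&sg, 1130000)); /* 0: raw unchanged */
            printf("%d\n", freq_update_needed(&sg, 1190000)); /* 0: resolves same */
            printf("%d\n", freq_update_needed(&sg, 1430000)); /* 1: new target    */
            return 0;
    }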

    Signed-off-by: Steve Muckle
    Reviewed-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki
    (cherry picked from commit 5cbea46984d67f614c74c4401b54b9d681861e80)

    Steve Muckle