18 Apr, 2019

1 commit

  • These macros can be reused by governors which don't use the common
    governor code present in cpufreq_governor.c and should be moved to the
    relevant header.

    Now that they are getting moved to the right header file, reuse them in
    the schedutil governor as well (that required renaming the show/store
    routines).

    Also create a gov_attr_wo() macro for write-only sysfs files; this will
    be used by the Interactive governor in a later patch.
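
    As a rough sketch (shapes reconstructed from the description; the exact
    kernel definitions may differ), the macro family looks like:

        #define gov_attr_ro(_name)                              \
        static struct governor_attr _name =                     \
        __ATTR(_name, 0444, show_##_name, NULL)

        #define gov_attr_rw(_name)                              \
        static struct governor_attr _name =                     \
        __ATTR(_name, 0644, show_##_name, store_##_name)

        /* new: write-only, for the later Interactive governor */
        #define gov_attr_wo(_name)                              \
        static struct governor_attr _name =                     \
        __ATTR(_name, 0200, NULL, store_##_name)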

    Signed-off-by: Viresh Kumar
    (Vipul: Fixed merge conflicts)
    Signed-off-by: Vipul Kumar

    Viresh Kumar
     

17 Apr, 2019

1 commit

  • commit 0e9f02450da07fc7b1346c8c32c771555173e397 upstream.

    A NULL pointer dereference bug was reported on a distribution kernel but
    the same issue should be present on the mainline kernel. It occurred on
    s390 but should not be arch-specific. A partial oops looks like:

    Unable to handle kernel pointer dereference in virtual kernel address space
    ...
    Call Trace:
    ...
    try_to_wake_up+0xfc/0x450
    vhost_poll_wakeup+0x3a/0x50 [vhost]
    __wake_up_common+0xbc/0x178
    __wake_up_common_lock+0x9e/0x160
    __wake_up_sync_key+0x4e/0x60
    sock_def_readable+0x5e/0x98

    The bug hits any time between 1 hour and 3 days. The dereference occurs
    in update_cfs_rq_h_load() when accumulating h_load. The problem is that
    cfs_rq->h_load_next is not protected by any locking and can be updated
    by parallel calls to task_h_load(). Depending on the compiler, code may
    be generated that re-reads cfs_rq->h_load_next after the check for NULL
    and then oopses when reading se->avg.load_avg. The disassembly showed
    that it was possible to reread h_load_next after the check for NULL.

    While this does not appear to be an issue for later compilers, it is
    still only by accident that the correct code is generated. Full locking
    in this path
    would have high overhead so this patch uses READ_ONCE to read h_load_next
    only once and check for NULL before dereferencing. It was confirmed that
    there were no further oops after 10 days of testing.

    As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any
    potential problems with store tearing.
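
    A minimal userspace sketch of the pattern (READ_ONCE()/WRITE_ONCE() here
    are simplified analogues of the kernel macros; the list walk is
    illustrative, not the actual update_cfs_rq_h_load() code):

        /* simplified one-shot-access idiom, cf. <linux/compiler.h> */
        #define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
        #define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

        struct node {
            struct node *next; /* plays the role of cfs_rq->h_load_next */
            long load;
        };

        /* Buggy: the compiler may reload n->next after the NULL check,
         * so a concurrent writer can make the dereference fault. */
        long sum_racy(struct node *n)
        {
            long total = 0;
            while (n->next != NULL) {       /* first read ... */
                total += n->next->load;     /* ... second, independent read */
                n = n->next;
            }
            return total;
        }

        /* Fixed: read the pointer exactly once per iteration. */
        long sum_fixed(struct node *n)
        {
            struct node *next;
            long total = 0;
            while ((next = READ_ONCE(n->next)) != NULL) {
                total += next->load;
                n = next;
            }
            return total;
        }

        /* Writer side: pair the stores with WRITE_ONCE() so they
         * cannot be torn. */
        void unlink_next(struct node *n)
        {
            WRITE_ONCE(n->next, NULL);
        }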

    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc:
    Fixes: 685207963be9 ("sched: Move h_load calculation to task_h_load()")
    Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
     

06 Apr, 2019

3 commits

  • [ Upstream commit c546951d9c9300065bad253ecdf1ac59ce9d06c8 ]

    move_queued_task() synchronizes with task_rq_lock() as follows:

    move_queued_task()                      task_rq_lock()

    [S] ->on_rq = MIGRATING                 [L] rq = task_rq()
    WMB (__set_task_cpu())                  ACQUIRE (rq->lock);
    [S] ->cpu = new_cpu                     [L] ->on_rq

    where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an
    address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before
    "[L] ->on_rq" by the ACQUIRE itself.

    Use READ_ONCE() to load ->cpu in task_rq() (cf. task_cpu()) to honor
    this address dependency. Also, mark the accesses to ->cpu and ->on_rq
    with READ_ONCE()/WRITE_ONCE() to comply with the LKMM.
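
    A hedged sketch of the resulting access pattern (field and helper names
    follow the changelog; the exact kernel diff may differ):

        /* move_queued_task(): the stores */
        WRITE_ONCE(p->on_rq, TASK_ON_RQ_MIGRATING); /* [S] ->on_rq */
        __set_task_cpu(p, new_cpu);                 /* WMB + [S] ->cpu */

        /* task_rq_lock(): the loads */
        rq = cpu_rq(READ_ONCE(p->cpu)); /* [L] rq, address dependency */
        raw_spin_lock(&rq->lock);       /* ACQUIRE (rq->lock) */
        on_rq = READ_ONCE(p->on_rq);    /* [L] ->on_rq, after the ACQUIRE */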

    Signed-off-by: Andrea Parri
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alan Stern
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Andrea Parri
     
  • [ Upstream commit 1ca4fa3ab604734e38e2a3000c9abf788512ffa7 ]

    register_sched_domain_sysctl() copies the cpu_possible_mask into
    sd_sysctl_cpus, but only if sd_sysctl_cpus hasn't already been
    allocated (ie, CONFIG_CPUMASK_OFFSTACK is set). However, when
    CONFIG_CPUMASK_OFFSTACK is not set, sd_sysctl_cpus is left
    uninitialized (all zeroes) and the kernel may fail to initialize
    sched_domain sysctl entries for all possible CPUs.

    This is visible to the user if the kernel is booted with maxcpus=n, or
    if ACPI tables have been modified to leave CPUs offline, and then
    checking for missing /proc/sys/kernel/sched_domain/cpu* entries.

    Fix this by separating the allocation and initialization, and adding a
    flag so that the possible-CPU entries are initialized only once, during
    boot.
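
    A sketch of the fix shape in register_sched_domain_sysctl() (the
    'init_done' flag name is assumed from the description):

        #ifdef CONFIG_CPUMASK_OFFSTACK
            if (!sd_sysctl_cpus) {
                if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
                    return;
            }
        #endif
            if (!init_done) {
                init_done = true;
                /* init to possible so the cpu* entries have no holes */
                cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
            }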

    Tested-by: Syuuichirou Ishii
    Tested-by: Tarumizu, Kohei
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Masayoshi Mizuma
    Acked-by: Joe Lawrence
    Cc: Linus Torvalds
    Cc: Masayoshi Mizuma
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190129151245.5073-1-msys.mizuma@gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Hidetoshi Seto
     
  • [ Upstream commit 99687cdbb3f6c8e32bcc7f37496e811f30460e48 ]

    The percpu members of struct sd_data and s_data are declared as:

    struct ... ** __percpu member;

    So their type is:

    __percpu pointer to pointer to struct ...

    But looking at how they're used, their type should be:

    pointer to __percpu pointer to struct ...

    and they should thus be declared as:

    struct ... * __percpu *member;

    So fix the placement of '__percpu' in the definition of these
    structures.

    This addresses a bunch of Sparse warnings like:

    warning: incorrect type in initializer (different address spaces)
    expected void const [noderef] *__vpp_verify
    got struct sched_domain **
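
    Concretely, the declaration change amounts to (sketch):

        /* before: '__percpu pointer to pointer' */
        struct sched_domain ** __percpu sd;

        /* after: 'pointer to __percpu pointer', matching actual usage */
        struct sched_domain * __percpu *sd;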

    Signed-off-by: Luc Van Oostenryck
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190118144936.79158-1-luc.vanoostenryck@gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Luc Van Oostenryck
     

06 Mar, 2019

1 commit

  • [ Upstream commit 4c4e3731564c8945ac5ac90fc2a1e1f21cb79c92 ]

    Notably, cmpxchg() does not provide ordering when it fails; however,
    wake_q_add() requires ordering in this specific case too. Without this
    it would be possible for the concurrent wakeup to not observe our
    prior state.

    Andrea Parri provided:

    C wake_up_q-wake_q_add

    {
        int next = 0;
        int y = 0;
    }

    P0(int *next, int *y)
    {
        int r0;

        /* in wake_up_q() */

        WRITE_ONCE(*next, 1); /* node->next = NULL */
        smp_mb(); /* implied by wake_up_process() */
        r0 = READ_ONCE(*y);
    }

    P1(int *next, int *y)
    {
        int r1;

        /* in wake_q_add() */

        WRITE_ONCE(*y, 1); /* wake_cond = true */
        smp_mb__before_atomic();
        r1 = cmpxchg_relaxed(next, 1, 2);
    }

    exists (0:r0=0 /\ 1:r1=0)

    This "exists" clause cannot be satisfied according to the LKMM:

    Test wake_up_q-wake_q_add Allowed
    States 3
    0:r0=0; 1:r1=1;
    0:r0=1; 1:r1=0;
    0:r0=1; 1:r1=1;
    No
    Witnesses
    Positive: 0 Negative: 3
    Condition exists (0:r0=0 /\ 1:r1=0)
    Observation wake_up_q-wake_q_add Never 0 3
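
    A sketch of the resulting wake_q_add() ordering (the cmpxchg shape is
    reconstructed from the description, not quoted from the kernel source):

        /*
         * Ensure that, even when the cmpxchg() fails, the concurrent
         * wake_up_q() observes our prior state.
         */
        smp_mb__before_atomic();
        if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
            return; /* already queued by someone else */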

    Reported-by: Yongji Xie
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     

13 Feb, 2019

1 commit

  • commit b284909abad48b07d3071a9fc9b5692b3e64914b upstream.

    With the following commit:

    73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")

    ... the hotplug code attempted to detect when SMT was disabled by BIOS,
    in which case it reported SMT as permanently disabled. However, that
    code broke a virt hotplug scenario, where the guest is booted with only
    primary CPU threads, and a sibling is brought online later.

    The problem is that there doesn't seem to be a way to reliably
    distinguish between the HW "SMT disabled by BIOS" case and the virt
    "sibling not yet brought online" case. So the above-mentioned commit
    was a bit misguided, as it permanently disabled SMT for both cases,
    preventing future virt sibling hotplugs.

    Going back and reviewing the original problems that commit attempted to
    solve when SMT was disabled in the BIOS:

    1) /sys/devices/system/cpu/smt/control showed "on" instead of
    "notsupported"; and

    2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning.

    I'd propose that we instead consider #1 above to not actually be a
    problem. Because, at least in the virt case, it's possible that SMT
    wasn't disabled by BIOS and a sibling thread could be brought online
    later. So it makes sense to just always default the smt control to "on"
    to allow for that possibility (assuming cpuid indicates that the CPU
    supports SMT).

    The real problem is #2, which has a simple fix: change vmx_vm_init() to
    query the actual current SMT state -- i.e., whether any siblings are
    currently online -- instead of looking at the SMT "control" sysfs value.

    So fix it by:

    a) reverting the original "fix" and its followup fix:

    73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
    bc2d8d262cba ("cpu/hotplug: Fix SMT supported evaluation")

    and

    b) changing vmx_vm_init() to query the actual current SMT state --
    instead of the sysfs control value -- to determine whether the L1TF
    warning is needed. This also requires the 'sched_smt_present'
    variable to be exported, instead of 'cpu_smt_control'.
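
    A sketch of the resulting check in vmx_vm_init() (shape assumed from
    the description):

        if (sched_smt_active())
            pr_warn_once(L1TF_MSG_SMT);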

    Fixes: 73d5e2b47264 ("cpu/hotplug: detect SMT disabled by BIOS")
    Reported-by: Igor Mammedov
    Signed-off-by: Josh Poimboeuf
    Signed-off-by: Thomas Gleixner
    Cc: Joe Mario
    Cc: Jiri Kosina
    Cc: Peter Zijlstra
    Cc: kvm@vger.kernel.org
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Josh Poimboeuf
     

13 Jan, 2019

1 commit

  • commit c40f7d74c741a907cfaeb73a7697081881c497d0 upstream.

    Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
    scheduler under high loads, starting at around the v4.18 time frame,
    and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
    manipulation.

    Do a (manual) revert of:

    a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")

    It turns out that the list_del_leaf_cfs_rq() introduced by this commit
    has a surprising property that was not considered in followup commits
    such as:

    9c2791f936ef ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")

    As Vincent Guittot explains:

    "I think that there is a bigger problem with commit a9e7f6544b9c and
    cfs_rq throttling:

    Let's take the example of the following topology, TG2 --> TG1 --> root:

    1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
    cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
    one path because it has never been used and can't be throttled so
    tmp_alone_branch will point to leaf_cfs_rq_list at the end.

    2) Then TG1 is throttled

    3) and we add TG3 as a new child of TG1.

    4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
    cfs_rq and tmp_alone_branch will stay on rq->leaf_cfs_rq_list.

    With commit a9e7f6544b9c, we can delete a cfs_rq from rq->leaf_cfs_rq_list.
    So if the load of the TG1 cfs_rq becomes zero before step 2) above, the
    TG1 cfs_rq is removed from the list.
    Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
    but tmp_alone_branch still points to TG3 cfs_rq because its throttled
    parent can't be enqueued when the lock is released.
    tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.

    So if the TG3 cfs_rq is removed or destroyed before tmp_alone_branch
    points to another TG cfs_rq, the next TG cfs_rq that is added will be
    linked outside rq->leaf_cfs_rq_list - which is bad.

    In addition, we can break the ordering of the cfs_rq in
    rq->leaf_cfs_rq_list but this ordering is used to update and
    propagate the update from leaf down to root."

    Instead of trying to work through all these cases and trying to reproduce
    the very high loads that produced the lockup to begin with, simplify
    the code temporarily by reverting a9e7f6544b9c - a change that was clearly
    not thought through completely.

    This (hopefully) gives us a kernel that doesn't lock up so people
    can continue to enjoy their holidays without worrying about regressions. ;-)

    [ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]

    Analyzed-by: Xie XiuQi
    Analyzed-by: Vincent Guittot
    Reported-by: Zhipeng Xie
    Reported-by: Sargun Dhillon
    Reported-by: Xie XiuQi
    Tested-by: Zhipeng Xie
    Tested-by: Sargun Dhillon
    Signed-off-by: Linus Torvalds
    Acked-by: Vincent Guittot
    Cc: # v4.13+
    Cc: Bin Li
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Fixes: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
    Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

20 Dec, 2018

1 commit

  • Commit 11d4afd4ff667f9b6178ee8c142c36cb78bd84db upstream.

    Create a config option for enabling IRQ load tracking in the scheduler.
    IRQ load tracking is useful only when IRQ or paravirtual time is
    accounted, but for now that is only possible with SMP.

    Also use __maybe_unused to remove the compilation warning in
    update_rq_clock_task() that has been introduced by:

    2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
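
    A sketch of the warning fix in update_rq_clock_task(): the variables are
    only consumed when IRQ/paravirt time accounting is configured in, so they
    are tagged rather than #ifdef'ed (variable names assumed from the
    referenced commit):

        s64 __maybe_unused steal = 0, irq_delta = 0;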

    Suggested-by: Ingo Molnar
    Reported-by: Dou Liyang
    Reported-by: Miguel Ojeda
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bp@alien8.de
    Cc: dou_liyang@163.com
    Fixes: 2e62c4743adc ("sched/fair: Remove #ifdefs from scale_rt_capacity()")
    Link: http://lkml.kernel.org/r/1537867062-27285-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Vincent Guittot
     

06 Dec, 2018

2 commits

  • commit 321a874a7ef85655e93b3206d0f36b4a6097f948 upstream

    Make the scheduler's 'sched_smt_present' static key globally available, so
    it can be used in the x86 speculation control code.

    Provide a query function and a stub for the CONFIG_SMP=n case.
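
    A sketch of the query function and its stub (this matches the shape of
    sched_smt_active() in mainline, but is reconstructed from the description
    rather than quoted):

        #ifdef CONFIG_SMP
        extern struct static_key_false sched_smt_present;

        static __always_inline bool sched_smt_active(void)
        {
            return static_branch_likely(&sched_smt_present);
        }
        #else
        static inline bool sched_smt_active(void) { return false; }
        #endif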

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit c5511d03ec090980732e929c318a7a6374b5550e upstream

    Currently the 'sched_smt_present' static key is enabled when SMT topology
    is observed at CPU bringup, but it is never disabled. However, there is
    demand to also disable the key when the topology changes such that no SMT
    is present anymore.

    Implement this by making the key count the number of cores that have SMT
    enabled.

    In particular, the SMT topology bits are set before interrupts are enabled
    and, similarly, are cleared after interrupts are disabled for the last
    time and the CPU dies.
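
    A sketch of the counting at CPU bringup/teardown (reconstructed from the
    description):

        /* sched_cpu_activate(): going up, count SMT-enabled cores */
        if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
            static_branch_inc_cpuslocked(&sched_smt_present);

        /* sched_cpu_deactivate(): going down, uncount them */
        if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
            static_branch_dec_cpuslocked(&sched_smt_present);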

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185004.246110444@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra (Intel)
     

01 Dec, 2018

1 commit

  • [ Upstream commit c469933e772132aad040bd6a2adc8edf9ad6f825 ]

    A ~10% regression has been reported for UnixBench's execl throughput
    test by Aaron Lu and Ye Xiaolong:

    https://lkml.org/lkml/2018/10/30/765

    That test is pretty simple: it does a "recursive" execve() syscall on the
    same binary. Starting from the syscall, this sequence is possible:

    do_execve()
      do_execveat_common()
        __do_execve_file()
          sched_exec()
            select_task_rq_fair()

    Reported-by: Ye Xiaolong
    Tested-by: Aaron Lu
    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dietmar Eggemann
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Fixes: f9be3e5961c5 ("sched/fair: Use util_est in LB and WU paths")
    Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Patrick Bellasi
     

27 Nov, 2018

1 commit

  • [ Upstream commit 40fa3780bac2b654edf23f6b13f4e2dd550aea10 ]

    When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
    for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
    Juno or HiKey960), we get the following report:

    [ 0.748225] Call trace:
    [ 0.750685] lockdep_assert_cpus_held+0x30/0x40
    [ 0.755236] static_key_enable_cpuslocked+0x20/0xc8
    [ 0.760137] build_sched_domains+0x1034/0x1108
    [ 0.764601] sched_init_domains+0x68/0x90
    [ 0.768628] sched_init_smp+0x30/0x80
    [ 0.772309] kernel_init_freeable+0x278/0x51c
    [ 0.776685] kernel_init+0x10/0x108
    [ 0.780190] ret_from_fork+0x10/0x18

    The static_key in question is 'sched_asym_cpucapacity' introduced by
    commit:

    df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")

    In this particular case, we enable it because smp_prepare_cpus() will
    end up fetching the capacity-dmips-mhz entry from the devicetree,
    so we already have some asymmetry detected when entering sched_init_smp().

    This didn't get detected in tip/sched/core because we were missing:

    commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")

    Calls to build_sched_domains() after sched_init_smp() will hold the
    hotplug lock; it just so happens that this very first call is a
    special case. As stated by a comment in sched_init_smp(), "There's no
    userspace yet to cause hotplug operations", so this is a harmless
    warning.

    However, to both respect the semantics of underlying
    callees and make lockdep happy, take the hotplug lock in
    sched_init_smp(). This also satisfies the comment atop
    sched_init_domains() that says "Callers must hold the hotplug lock".
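
    A sketch of the fix shape in sched_init_smp() (lock ordering
    reconstructed from the description):

        cpus_read_lock();
        mutex_lock(&sched_domains_mutex);
        sched_init_domains(cpu_active_mask);
        mutex_unlock(&sched_domains_mutex);
        cpus_read_unlock();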

    Reported-by: Sudeep Holla
    Tested-by: Sudeep Holla
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dietmar.Eggemann@arm.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: morten.rasmussen@arm.com
    Cc: quentin.perret@arm.com
    Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Valentin Schneider
     

16 Oct, 2018

1 commit

  • The comment and the code around the update_min_vruntime() call in
    dequeue_entity() are not in agreement.

    From commit:

    b60205c7c558 ("sched/fair: Fix min_vruntime tracking")

    I think that we want to update min_vruntime when a task is sleeping/migrating.
    So, the check is inverted there - fix it.
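
    In mainline the fix replaced the flag tested before the
    update_min_vruntime() call in dequeue_entity(), roughly (reconstructed
    from the description):

        /* update min_vruntime when the entity is actually leaving, i.e.
         * sleeping or migrating, not when it is merely re-queued */
        if (!(flags & DEQUEUE_SAVE))
            update_min_vruntime(cfs_rq);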

    Signed-off-by: Song Muchun
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: b60205c7c558 ("sched/fair: Fix min_vruntime tracking")
    Link: http://lkml.kernel.org/r/20181014112612.2614-1-smuchun@gmail.com
    Signed-off-by: Ingo Molnar

    Song Muchun
     

11 Oct, 2018

1 commit

  • With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
    distribute_cfs_runtime may not empty the throttled_list before it runs
    out of runtime to distribute. In that case, due to the change from
    c06f04c70489 to put throttled entries at the head of the list, later entries
    on the list will starve. Essentially, the same X processes will get pulled
    off the list, given CPU time and then, when expired, get put back on the
    head of the list where distribute_cfs_runtime will give runtime to the same
    set of processes leaving the rest.

    Fix the issue by setting a bit in struct cfs_bandwidth when
    distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
    decide to put the throttled entry on the tail or the head of the list. The
    bit is set/cleared by the callers of distribute_cfs_runtime while they hold
    cfs_bandwidth->lock.
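
    A sketch of the resulting placement decision in throttle_cfs_rq() (the
    'distribute_running' field name is assumed from the description and may
    differ in the actual patch):

        /*
         * Add to the head so an in-flight distribute_cfs_runtime() will
         * not see us; otherwise add to the tail so that later entries on
         * the list are not starved.
         */
        if (cfs_b->distribute_running)
            list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
        else
            list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);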

    This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
    the live system. In some cases you can simply look at the throttled list and
    see the later entries are not changing:

    crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
    1 ffff90b56cb2d200 -976050
    2 ffff90b56cb2cc00 -484925
    3 ffff90b56cb2bc00 -658814
    4 ffff90b56cb2ba00 -275365
    5 ffff90b166a45600 -135138
    6 ffff90b56cb2da00 -282505
    7 ffff90b56cb2e000 -148065
    8 ffff90b56cb2fa00 -872591
    9 ffff90b56cb2c000 -84687
    10 ffff90b56cb2f000 -87237
    11 ffff90b166a40a00 -164582

    crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
    1 ffff90b56cb2d200 -994147
    2 ffff90b56cb2cc00 -306051
    3 ffff90b56cb2bc00 -961321
    4 ffff90b56cb2ba00 -24490
    5 ffff90b166a45600 -135138
    6 ffff90b56cb2da00 -282505
    7 ffff90b56cb2e000 -148065
    8 ffff90b56cb2fa00 -872591
    9 ffff90b56cb2c000 -84687
    10 ffff90b56cb2f000 -87237
    11 ffff90b166a40a00 -164582

    Sometimes it is easier to see by finding a process getting starved and looking
    at the sched_info:

    crash> task ffff8eb765994500 sched_info
    PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
    sched_info = {
    pcount = 8,
    run_delay = 697094208,
    last_arrival = 240260125039,
    last_queued = 240260327513
    },
    crash> task ffff8eb765994500 sched_info
    PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
    sched_info = {
    pcount = 8,
    run_delay = 697094208,
    last_arrival = 240260125039,
    last_queued = 240260327513
    },

    Signed-off-by: Phil Auld
    Reviewed-by: Ben Segall
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
    Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
    Signed-off-by: Ingo Molnar

    Phil Auld
     

02 Oct, 2018

6 commits

  • Automatic NUMA Balancing uses a multi-stage pass to decide whether a page
    should migrate to a local node. This filter avoids excessive ping-ponging
    if a page is shared or used by threads that migrate cross-node frequently.

    Threads inherit both page tables and the preferred node ID from the
    parent. This means that threads can trigger hinting faults earlier than
    a new task, which delays scanning for a number of seconds. As a thread
    can be load balanced very early in its lifetime, there can be an
    unnecessary delay before it starts migrating thread-local data. This
    patch migrates private pages faster early in the lifetime of a thread,
    using the sequence counter as an identifier of new tasks.
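
    A sketch of the resulting check in should_numa_migrate_memory()
    (reconstructed; the scan-sequence threshold is illustrative):

        /* Allow private faults to migrate immediately early in a task's
         * lifetime, identified via the scan sequence counter. */
        if (p->numa_scan_seq <= 4 &&
            (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
            return true;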

    With this patch applied, STREAM performance is the same as 4.17 even though
    processes are not spread cross-node prematurely. Other workloads showed
    a mix of minor gains and losses. This is somewhat expected, as most
    workloads are not very sensitive to the starting conditions of a process.

                            4.19.0-rc5           4.19.0-rc5               4.17.0
                            numab-v1r1     fastmigrate-v1r1              vanilla
    MB/sec copy     43298.52 ( 0.00%)    47335.46 ( 9.32%)    47219.24 ( 9.06%)
    MB/sec scale    30115.06 ( 0.00%)    32568.12 ( 8.15%)    32527.56 ( 8.01%)
    MB/sec add      32825.12 ( 0.00%)    36078.94 ( 9.91%)    35928.02 ( 9.45%)
    MB/sec triad    32549.52 ( 0.00%)    35935.94 (10.40%)    35969.88 (10.51%)

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Cc: Jirka Hladky
    Cc: Linus Torvalds
    Cc: Linux-MM
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20181001100525.29789-3-mgorman@techsingularity.net
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
    If the NUMA improvement from a task migration is going to be very
    minimal, then avoid the task migration.
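
    A sketch of the resulting cutoff in task_numa_compare() (the SMALLIMP
    threshold name and value follow the mainline patch but are reconstructed
    here, not quoted):

        #define SMALLIMP 30

        /*
         * If the improvement is marginal, the migration would likely just
         * ping-pong tasks and cost cache misses.
         */
        if (imp < SMALLIMP || imp <= env->best_imp + SMALLIMP / 2)
            goto unlock;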

    Specjbb2005 results (8 warehouses)
    Higher bops are better

    2 Socket - 2 Node Haswell - X86
    JVMS Prev Current %Change
    4 198512 205910 3.72673
    1 313559 318491 1.57291

    2 Socket - 4 Node Power8 - PowerNV
    JVMS Prev Current %Change
    8 74761.9 74935.9 0.232739
    1 214874 226796 5.54837

    2 Socket - 2 Node Power9 - PowerNV
    JVMS Prev Current %Change
    4 180536 189780 5.12031
    1 210281 205695 -2.18089

    4 Socket - 4 Node Power7 - PowerVM
    JVMS Prev Current %Change
    8 56511.4 60370 6.828
    1 104899 108100 3.05151

    1/7 cases is regressing; if we look at the events, migrate_pages seems
    to vary the most, especially in the regressing case. Also, some
    amount of variance is expected between different runs of
    Specjbb2005.

    Some events stats before and after applying the patch.

    perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    cs 13,818,546 13,801,554
    migrations 1,149,960 1,151,541
    faults 385,583 433,246
    cache-misses 55,259,546,768 55,168,691,835
    sched:sched_move_numa 2,257 2,551
    sched:sched_stick_numa 9 24
    sched:sched_swap_numa 512 904
    migrate:mm_migrate_pages 2,225 1,571

    vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    numa_hint_faults 72692 113682
    numa_hint_faults_local 62270 102163
    numa_hit 238762 240181
    numa_huge_pte_updates 48 36
    numa_interleave 75 64
    numa_local 238676 240103
    numa_other 86 78
    numa_pages_migrated 2225 1564
    numa_pte_updates 98557 134080

    perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    cs 3,173,490 3,079,150
    migrations 36,966 31,455
    faults 108,776 99,081
    cache-misses 12,200,075,320 11,588,126,740
    sched:sched_move_numa 1,264 1
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 0 0
    migrate:mm_migrate_pages 899 36

    vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    numa_hint_faults 21109 430
    numa_hint_faults_local 17120 77
    numa_hit 72934 71277
    numa_huge_pte_updates 42 0
    numa_interleave 33 22
    numa_local 72866 71218
    numa_other 68 59
    numa_pages_migrated 915 23
    numa_pte_updates 42326 0

    perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    cs 8,312,022 8,707,565
    migrations 231,705 171,342
    faults 310,242 310,820
    cache-misses 402,324,573 136,115,400
    sched:sched_move_numa 193 215
    sched:sched_stick_numa 0 6
    sched:sched_swap_numa 3 24
    migrate:mm_migrate_pages 93 162

    vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    numa_hint_faults 11838 8985
    numa_hint_faults_local 11216 8154
    numa_hit 90689 93819
    numa_huge_pte_updates 0 0
    numa_interleave 1579 882
    numa_local 89634 93496
    numa_other 1055 323
    numa_pages_migrated 92 169
    numa_pte_updates 12109 9217

    perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    cs 2,170,481 2,152,072
    migrations 10,126 10,704
    faults 160,962 164,376
    cache-misses 10,834,845 3,818,437
    sched:sched_move_numa 10 16
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 0 7
    migrate:mm_migrate_pages 2 199

    vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    numa_hint_faults 403 2248
    numa_hint_faults_local 358 1666
    numa_hit 25898 25704
    numa_huge_pte_updates 0 0
    numa_interleave 207 200
    numa_local 25860 25679
    numa_other 38 25
    numa_pages_migrated 2 197
    numa_pte_updates 400 2234

    perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    cs 110,339,633 93,330,595
    migrations 4,139,812 4,122,061
    faults 863,622 865,979
    cache-misses 231,838,045,660 225,395,083,479
    sched:sched_move_numa 2,196 2,372
    sched:sched_stick_numa 33 24
    sched:sched_swap_numa 544 769
    migrate:mm_migrate_pages 2,469 1,677

    vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    numa_hint_faults 85748 91638
    numa_hint_faults_local 66831 78096
    numa_hit 242213 242225
    numa_huge_pte_updates 0 0
    numa_interleave 0 2
    numa_local 242211 242219
    numa_other 2 6
    numa_pages_migrated 2376 1515
    numa_pte_updates 86233 92274

    perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    cs 59,331,057 51,487,271
    migrations 552,019 537,170
    faults 266,586 256,921
    cache-misses 73,796,312,990 70,073,831,187
    sched:sched_move_numa 981 576
    sched:sched_stick_numa 54 24
    sched:sched_swap_numa 286 327
    migrate:mm_migrate_pages 713 726

    vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    numa_hint_faults 14807 12000
    numa_hint_faults_local 5738 5024
    numa_hit 36230 36470
    numa_huge_pte_updates 0 0
    numa_interleave 0 0
    numa_local 36228 36465
    numa_other 2 5
    numa_pages_migrated 703 726
    numa_pte_updates 14742 11930

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Jirka Hladky
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1537552141-27815-7-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     
  • migrate_task_rq_fair() resets the scan rate for NUMA balancing on every
    cross-node migration. In the event of excessive load balancing due to
    saturation, this may result in the scan rate being pegged at maximum and
    further overloading the machine.

    This patch only resets the scan if NUMA balancing is active, a preferred
    node has been selected, and the task is being migrated from the preferred
    node, as these are the most harmful cases. For example, a migration to the preferred
    node does not justify a faster scan rate. Similarly, a migration between two
    nodes that are not preferred is probably bouncing due to over-saturation of
    the machine. In that case, scanning faster and trapping more NUMA faults
    will further overload the machine.
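
    A sketch of the resulting condition (helper and field names
    reconstructed from the description):

        /* Skip the scan-rate reset unless NUMA balancing moved the task
         * off its already-selected preferred node. */
        if (dst_nid == p->numa_preferred_nid ||
            (p->numa_preferred_nid != -1 && src_nid != p->numa_preferred_nid))
            return;

        p->numa_scan_period = task_scan_start(p);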

    Specjbb2005 results (8 warehouses)
    Higher bops are better

    2 Socket - 2 Node Haswell - X86
    JVMS Prev Current %Change
    4 203370 205332 0.964744
    1 328431 319785 -2.63252

    2 Socket - 4 Node Power8 - PowerNV
    JVMS Prev Current %Change
    1 206070 206585 0.249915

    2 Socket - 2 Node Power9 - PowerNV
    JVMS Prev Current %Change
    4 188386 189162 0.41192
    1 201566 213760 6.04963

    4 Socket - 4 Node Power7 - PowerVM
    JVMS Prev Current %Change
    8 59157.4 58736.8 -0.710985
    1 105495 105419 -0.0720413

    Some events stats before and after applying the patch.

    perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    cs 13,825,492 14,285,708
    migrations 1,152,509 1,180,621
    faults 371,948 339,114
    cache-misses 55,654,206,041 55,205,631,894
    sched:sched_move_numa 1,856 843
    sched:sched_stick_numa 4 6
    sched:sched_swap_numa 428 219
    migrate:mm_migrate_pages 898 365

    vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    numa_hint_faults 57146 26907
    numa_hint_faults_local 51612 24279
    numa_hit 238164 239771
    numa_huge_pte_updates 16 0
    numa_interleave 63 68
    numa_local 238085 239688
    numa_other 79 83
    numa_pages_migrated 883 363
    numa_pte_updates 67540 27415

    perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    cs 3,288,525 3,202,779
    migrations 38,652 37,186
    faults 111,678 106,076
    cache-misses 12,111,197,376 12,024,873,744
    sched:sched_move_numa 900 931
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 5 1
    migrate:mm_migrate_pages 714 637

    vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    numa_hint_faults 18572 17409
    numa_hint_faults_local 14850 14367
    numa_hit 73197 73953
    numa_huge_pte_updates 11 20
    numa_interleave 25 25
    numa_local 73138 73892
    numa_other 59 61
    numa_pages_migrated 712 668
    numa_pte_updates 24021 27276

    perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    cs 8,451,543 8,474,013
    migrations 202,804 254,934
    faults 310,024 320,506
    cache-misses 253,522,507 110,580,458
    sched:sched_move_numa 213 725
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 2 7
    migrate:mm_migrate_pages 88 145

    vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    numa_hint_faults 11830 22797
    numa_hint_faults_local 11301 21539
    numa_hit 90038 89308
    numa_huge_pte_updates 0 0
    numa_interleave 855 865
    numa_local 89796 88955
    numa_other 242 353
    numa_pages_migrated 88 149
    numa_pte_updates 12039 22930

    perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    cs 2,049,153 2,195,628
    migrations 11,405 11,179
    faults 162,309 149,656
    cache-misses 7,203,343 8,117,515
    sched:sched_move_numa 22 49
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 0 0
    migrate:mm_migrate_pages 1 5

    vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    numa_hint_faults 1693 3577
    numa_hint_faults_local 1669 3476
    numa_hit 25177 26142
    numa_huge_pte_updates 0 0
    numa_interleave 194 358
    numa_local 24993 26042
    numa_other 184 100
    numa_pages_migrated 1 5
    numa_pte_updates 1577 3587

    perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    cs 94,515,937 100,602,296
    migrations 4,203,554 4,135,630
    faults 832,697 789,256
    cache-misses 226,248,698,331 226,160,621,058
    sched:sched_move_numa 1,730 1,366
    sched:sched_stick_numa 14 16
    sched:sched_swap_numa 432 374
    migrate:mm_migrate_pages 1,398 1,350

    vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    numa_hint_faults 80079 47857
    numa_hint_faults_local 68620 39768
    numa_hit 241187 240165
    numa_huge_pte_updates 0 0
    numa_interleave 0 0
    numa_local 241186 240165
    numa_other 1 0
    numa_pages_migrated 1347 1224
    numa_pte_updates 80729 48354

    perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    cs 63,704,961 58,515,496
    migrations 573,404 564,845
    faults 230,878 245,807
    cache-misses 76,568,222,781 73,603,757,976
    sched:sched_move_numa 509 996
    sched:sched_stick_numa 31 10
    sched:sched_swap_numa 182 193
    migrate:mm_migrate_pages 541 646

    vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    numa_hint_faults 8501 13422
    numa_hint_faults_local 2960 5619
    numa_hit 35526 36118
    numa_huge_pte_updates 0 0
    numa_interleave 0 0
    numa_local 35526 36116
    numa_other 0 2
    numa_pages_migrated 539 616
    numa_pte_updates 8433 13374

    Signed-off-by: Mel Gorman
    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Jirka Hladky
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1537552141-27815-5-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
    Currently the task scan rate is reset when the NUMA balancer migrates
    the task to a different node. If the NUMA balancer initiates a swap, the
    reset is only applied to the task that initiates the swap. Similarly, no
    scan rate reset is done if the task is migrated across nodes by the
    traditional load balancer.

    Instead, move the scan reset to migrate_task_rq(). This ensures that a
    task moved out of its preferred node either gets back to its preferred
    node quickly or finds a new preferred node. Doing so is fair to all
    tasks migrating across nodes.
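
    A sketch of the relocation (helper name reconstructed from the
    description):

        static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
        {
            /* ... existing migration handling ... */

            /* reset the scan period here, so *any* cross-node migration
             * (swap, load balance, NUMA placement) is treated alike */
            update_scan_period(p, new_cpu);
        }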

    Specjbb2005 results (8 warehouses)
    Higher bops are better

    2 Socket - 2 Node Haswell - X86
    JVMS Prev Current %Change
    4 200668 203370 1.3465
    1 321791 328431 2.06345

    2 Socket - 4 Node Power8 - PowerNV
    JVMS Prev Current %Change
    1 204848 206070 0.59654

    2 Socket - 2 Node Power9 - PowerNV
    JVMS Prev Current %Change
    4 188098 188386 0.153112
    1 200351 201566 0.606436

    4 Socket - 4 Node Power7 - PowerVM
    JVMS Prev Current %Change
    8 58145.9 59157.4 1.73959
    1 103798 105495 1.63491

    Some events stats before and after applying the patch.

    perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    cs 13,912,183 13,825,492
    migrations 1,155,931 1,152,509
    faults 367,139 371,948
    cache-misses 54,240,196,814 55,654,206,041
    sched:sched_move_numa 1,571 1,856
    sched:sched_stick_numa 9 4
    sched:sched_swap_numa 463 428
    migrate:mm_migrate_pages 703 898

    vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    numa_hint_faults 50155 57146
    numa_hint_faults_local 45264 51612
    numa_hit 239652 238164
    numa_huge_pte_updates 36 16
    numa_interleave 68 63
    numa_local 239576 238085
    numa_other 76 79
    numa_pages_migrated 680 883
    numa_pte_updates 71146 67540

    perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    cs 3,156,720 3,288,525
    migrations 30,354 38,652
    faults 97,261 111,678
    cache-misses 12,400,026,826 12,111,197,376
    sched:sched_move_numa 4 900
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 1 5
    migrate:mm_migrate_pages 20 714

    vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    numa_hint_faults 272 18572
    numa_hint_faults_local 186 14850
    numa_hit 71362 73197
    numa_huge_pte_updates 0 11
    numa_interleave 23 25
    numa_local 71299 73138
    numa_other 63 59
    numa_pages_migrated 2 712
    numa_pte_updates 0 24021

    perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    cs 8,606,824 8,451,543
    migrations 155,352 202,804
    faults 301,409 310,024
    cache-misses 157,759,224 253,522,507
    sched:sched_move_numa 168 213
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 3 2
    migrate:mm_migrate_pages 125 88

    vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    numa_hint_faults 4650 11830
    numa_hint_faults_local 3946 11301
    numa_hit 90489 90038
    numa_huge_pte_updates 0 0
    numa_interleave 892 855
    numa_local 90034 89796
    numa_other 455 242
    numa_pages_migrated 124 88
    numa_pte_updates 4818 12039

    perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    cs 2,113,167 2,049,153
    migrations 10,533 11,405
    faults 142,727 162,309
    cache-misses 5,594,192 7,203,343
    sched:sched_move_numa 10 22
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 0 0
    migrate:mm_migrate_pages 6 1

    vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    numa_hint_faults 744 1693
    numa_hint_faults_local 584 1669
    numa_hit 25551 25177
    numa_huge_pte_updates 0 0
    numa_interleave 263 194
    numa_local 25302 24993
    numa_other 249 184
    numa_pages_migrated 6 1
    numa_pte_updates 744 1577

    perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    cs 101,227,352 94,515,937
    migrations 4,151,829 4,203,554
    faults 745,233 832,697
    cache-misses 224,669,561,766 226,248,698,331
    sched:sched_move_numa 617 1,730
    sched:sched_stick_numa 2 14
    sched:sched_swap_numa 187 432
    migrate:mm_migrate_pages 316 1,398

    vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    numa_hint_faults 24195 80079
    numa_hint_faults_local 21639 68620
    numa_hit 238331 241187
    numa_huge_pte_updates 0 0
    numa_interleave 0 0
    numa_local 238331 241186
    numa_other 0 1
    numa_pages_migrated 204 1347
    numa_pte_updates 24561 80729

    perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    cs 62,738,978 63,704,961
    migrations 562,702 573,404
    faults 228,465 230,878
    cache-misses 75,778,067,952 76,568,222,781
    sched:sched_move_numa 648 509
    sched:sched_stick_numa 13 31
    sched:sched_swap_numa 137 182
    migrate:mm_migrate_pages 733 541

    vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    numa_hint_faults 10281 8501
    numa_hint_faults_local 3242 2960
    numa_hit 36338 35526
    numa_huge_pte_updates 0 0
    numa_interleave 0 0
    numa_local 36338 35526
    numa_other 0 0
    numa_pages_migrated 706 539
    numa_pte_updates 10176 8433

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Jirka Hladky
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1537552141-27815-4-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     
    This additional parameter (new_cpu) is used later for identifying whether
    the task migration is across nodes.

    No functional change.
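
    The change just threads the destination CPU through the callback
    (sketch):

        /* before */
        static void migrate_task_rq_fair(struct task_struct *p);

        /* after */
        static void migrate_task_rq_fair(struct task_struct *p, int new_cpu);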

    Specjbb2005 results (8 warehouses)
    Higher bops are better

    2 Socket - 2 Node Haswell - X86
    JVMS Prev Current %Change
    4 203353 200668 -1.32036
    1 328205 321791 -1.95427

    2 Socket - 4 Node Power8 - PowerNV
    JVMS Prev Current %Change
    1 214384 204848 -4.44809

    2 Socket - 2 Node Power9 - PowerNV
    JVMS Prev Current %Change
    4 188553 188098 -0.241311
    1 196273 200351 2.07772

    4 Socket - 4 Node Power7 - PowerVM
    JVMS Prev Current %Change
    8 57581.2 58145.9 0.980702
    1 103468 103798 0.318939

    This brings out the variance between different specjbb2005 runs.

    Some events stats before and after applying the patch.

    perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    cs 13,941,377 13,912,183
    migrations 1,157,323 1,155,931
    faults 382,175 367,139
    cache-misses 54,993,823,500 54,240,196,814
    sched:sched_move_numa 2,005 1,571
    sched:sched_stick_numa 14 9
    sched:sched_swap_numa 529 463
    migrate:mm_migrate_pages 1,573 703

    vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    numa_hint_faults 67099 50155
    numa_hint_faults_local 58456 45264
    numa_hit 240416 239652
    numa_huge_pte_updates 18 36
    numa_interleave 65 68
    numa_local 240339 239576
    numa_other 77 76
    numa_pages_migrated 1574 680
    numa_pte_updates 77182 71146

    perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    cs 3,176,453 3,156,720
    migrations 30,238 30,354
    faults 87,869 97,261
    cache-misses 12,544,479,391 12,400,026,826
    sched:sched_move_numa 23 4
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 6 1
    migrate:mm_migrate_pages 10 20

    vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    numa_hint_faults 236 272
    numa_hint_faults_local 201 186
    numa_hit 72293 71362
    numa_huge_pte_updates 0 0
    numa_interleave 26 23
    numa_local 72233 71299
    numa_other 60 63
    numa_pages_migrated 8 2
    numa_pte_updates 0 0

    perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    cs 8,478,820 8,606,824
    migrations 171,323 155,352
    faults 307,499 301,409
    cache-misses 240,353,599 157,759,224
    sched:sched_move_numa 214 168
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 4 3
    migrate:mm_migrate_pages 89 125

    vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    numa_hint_faults 5301 4650
    numa_hint_faults_local 4745 3946
    numa_hit 92943 90489
    numa_huge_pte_updates 0 0
    numa_interleave 899 892
    numa_local 92345 90034
    numa_other 598 455
    numa_pages_migrated 88 124
    numa_pte_updates 5505 4818

    perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    cs 2,066,172 2,113,167
    migrations 11,076 10,533
    faults 149,544 142,727
    cache-misses 10,398,067 5,594,192
    sched:sched_move_numa 43 10
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 0 0
    migrate:mm_migrate_pages 6 6

    vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    numa_hint_faults 3552 744
    numa_hint_faults_local 3347 584
    numa_hit 25611 25551
    numa_huge_pte_updates 0 0
    numa_interleave 213 263
    numa_local 25583 25302
    numa_other 28 249
    numa_pages_migrated 6 6
    numa_pte_updates 3535 744

    perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    cs 99,358,136 101,227,352
    migrations 4,041,607 4,151,829
    faults 749,653 745,233
    cache-misses 225,562,543,251 224,669,561,766
    sched:sched_move_numa 771 617
    sched:sched_stick_numa 14 2
    sched:sched_swap_numa 204 187
    migrate:mm_migrate_pages 1,180 316

    vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    numa_hint_faults 27409 24195
    numa_hint_faults_local 20677 21639
    numa_hit 239988 238331
    numa_huge_pte_updates 0 0
    numa_interleave 0 0
    numa_local 239983 238331
    numa_other 5 0
    numa_pages_migrated 1016 204
    numa_pte_updates 27916 24561

    perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    cs 60,899,307 62,738,978
    migrations 544,668 562,702
    faults 270,834 228,465
    cache-misses 74,543,455,635 75,778,067,952
    sched:sched_move_numa 735 648
    sched:sched_stick_numa 25 13
    sched:sched_swap_numa 174 137
    migrate:mm_migrate_pages 816 733

    vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    numa_hint_faults 11059 10281
    numa_hint_faults_local 4733 3242
    numa_hit 41384 36338
    numa_huge_pte_updates 0 0
    numa_interleave 0 0
    numa_local 41383 36338
    numa_other 1 0
    numa_pages_migrated 815 706
    numa_pte_updates 11323 10176

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Jirka Hladky
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1537552141-27815-3-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     
  • Task migration under NUMA balancing can happen in parallel. More than
    one task might choose to migrate to the same CPU at the same time. This
    can result in:

    - During task swap, choosing a task that was not part of the evaluation.
    - During task swap, task which just got moved into its preferred node,
    moving to a completely different node.
    - During task swap, task failing to move to the preferred node, will have
    to wait an extra interval for the next migrate opportunity.
    - During task movement, multiple task movements can cause load imbalance.

    This problem is more likely if there are more cores per node or more
    nodes in the system.

    Use a per-run-queue variable to check whether NUMA balancing is active
    on the run-queue.
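
    A sketch of the per-rq flag usage (the field name 'numa_migrate_on'
    follows mainline but is reconstructed here, not quoted):

        /* task_numa_compare(): skip CPUs with a NUMA migration in flight */
        if (READ_ONCE(dst_rq->numa_migrate_on))
            return;

        /* task_numa_assign(): claim the destination under its rq lock */
        WRITE_ONCE(rq->numa_migrate_on, 1);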

    Specjbb2005 results (8 warehouses)
    Higher bops are better

    2 Socket - 2 Node Haswell - X86
    JVMS Prev Current %Change
    4 200194 203353 1.57797
    1 311331 328205 5.41995

    2 Socket - 4 Node Power8 - PowerNV
    JVMS Prev Current %Change
    1 197654 214384 8.46429

    2 Socket - 2 Node Power9 - PowerNV
    JVMS Prev Current %Change
    4 192605 188553 -2.10379
    1 213402 196273 -8.02664

    4 Socket - 4 Node Power7 - PowerVM
    JVMS Prev Current %Change
    8 52227.1 57581.2 10.2516
    1 102529 103468 0.915838

    There is a regression on the Power 9 box. If we look at the details,
    that box shows a sudden jump in cache misses with this patch.
    All other parameters seem to point towards NUMA
    consolidation.

    perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    cs 13,345,784 13,941,377
    migrations 1,127,820 1,157,323
    faults 374,736 382,175
    cache-misses 55,132,054,603 54,993,823,500
    sched:sched_move_numa 1,923 2,005
    sched:sched_stick_numa 52 14
    sched:sched_swap_numa 595 529
    migrate:mm_migrate_pages 1,932 1,573

    vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    numa_hint_faults 60605 67099
    numa_hint_faults_local 51804 58456
    numa_hit 239945 240416
    numa_huge_pte_updates 14 18
    numa_interleave 60 65
    numa_local 239865 240339
    numa_other 80 77
    numa_pages_migrated 1931 1574
    numa_pte_updates 67823 77182

    perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    cs 3,016,467 3,176,453
    migrations 37,326 30,238
    faults 115,342 87,869
    cache-misses 11,692,155,554 12,544,479,391
    sched:sched_move_numa 965 23
    sched:sched_stick_numa 8 0
    sched:sched_swap_numa 35 6
    migrate:mm_migrate_pages 1,168 10

    vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86
    Event Before After
    numa_hint_faults 16286 236
    numa_hint_faults_local 11863 201
    numa_hit 112482 72293
    numa_huge_pte_updates 33 0
    numa_interleave 20 26
    numa_local 112419 72233
    numa_other 63 60
    numa_pages_migrated 1144 8
    numa_pte_updates 32859 0

    perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    cs 8,629,724 8,478,820
    migrations 221,052 171,323
    faults 308,661 307,499
    cache-misses 135,574,913 240,353,599
    sched:sched_move_numa 147 214
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 2 4
    migrate:mm_migrate_pages 64 89

    vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    numa_hint_faults 11481 5301
    numa_hint_faults_local 10968 4745
    numa_hit 89773 92943
    numa_huge_pte_updates 0 0
    numa_interleave 1116 899
    numa_local 89220 92345
    numa_other 553 598
    numa_pages_migrated 62 88
    numa_pte_updates 11694 5505

    perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    cs 2,272,887 2,066,172
    migrations 12,206 11,076
    faults 163,704 149,544
    cache-misses 4,801,186 10,398,067
    sched:sched_move_numa 44 43
    sched:sched_stick_numa 0 0
    sched:sched_swap_numa 0 0
    migrate:mm_migrate_pages 17 6

    vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV
    Event Before After
    numa_hint_faults 2261 3552
    numa_hint_faults_local 1993 3347
    numa_hit 25726 25611
    numa_huge_pte_updates 0 0
    numa_interleave 239 213
    numa_local 25498 25583
    numa_other 228 28
    numa_pages_migrated 17 6
    numa_pte_updates 2266 3535

    perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    cs 117,980,962 99,358,136
    migrations 3,950,220 4,041,607
    faults 736,979 749,653
    cache-misses 224,976,072,879 225,562,543,251
    sched:sched_move_numa 504 771
    sched:sched_stick_numa 50 14
    sched:sched_swap_numa 239 204
    migrate:mm_migrate_pages 1,260 1,180

    vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    numa_hint_faults 18293 27409
    numa_hint_faults_local 11969 20677
    numa_hit 240854 239988
    numa_huge_pte_updates 0 0
    numa_interleave 0 0
    numa_local 240851 239983
    numa_other 3 5
    numa_pages_migrated 1190 1016
    numa_pte_updates 18106 27916

    perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    cs 61,053,158 60,899,307
    migrations 551,586 544,668
    faults 244,174 270,834
    cache-misses 74,326,766,973 74,543,455,635
    sched:sched_move_numa 344 735
    sched:sched_stick_numa 24 25
    sched:sched_swap_numa 140 174
    migrate:mm_migrate_pages 568 816

    vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM
    Event Before After
    numa_hint_faults 6461 11059
    numa_hint_faults_local 2283 4733
    numa_hit 35661 41384
    numa_huge_pte_updates 0 0
    numa_interleave 0 0
    numa_local 35661 41383
    numa_other 0 1
    numa_pages_migrated 568 815
    numa_pte_updates 6518 11323

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Jirka Hladky
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1537552141-27815-2-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     

10 Sep, 2018

7 commits

  • Fix kernel-doc warning for missing 'flags' parameter description:

    ../kernel/sched/fair.c:3371: warning: Function parameter or member 'flags' not described in 'attach_entity_load_avg'
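
    The fix is simply documenting the parameter in the kernel-doc block,
    e.g. (wording illustrative):

        /**
         * attach_entity_load_avg - attach this entity to its cfs_rq load avg
         * @cfs_rq: cfs_rq to attach to
         * @se: sched_entity to attach
         * @flags: migration hints
         */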

    Signed-off-by: Randy Dunlap
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: ea14b57e8a18 ("sched/cpufreq: Provide migration hint")
    Link: http://lkml.kernel.org/r/cdda0d42-880d-4229-a9f7-5899c977a063@infradead.org
    Signed-off-by: Ingo Molnar

    Randy Dunlap
     
    It can happen that load_balance() finds a busiest group and then a
    busiest rq but the calculated imbalance is in fact 0.

    In such a situation, detach_tasks() returns immediately and leaves the
    LBF_ALL_PINNED flag set. The busiest CPU is then wrongly assumed to
    have pinned tasks and is removed from the load balance mask. Then we
    redo the load balance without the busiest CPU. This creates a wrong
    load balance situation and generates wrong task migrations.

    If the calculated imbalance is 0, it's useless to try to find a
    busiest rq as no task will be migrated and we can return immediately.

    This situation can happen with a heterogeneous system or an SMP system
    where RT tasks are reducing the capacity of some CPUs.
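
    A sketch of the resulting early exit in find_busiest_group()
    (reconstructed from the description):

        force_balance:
            /* Looks like there is an imbalance. Compute it. */
            calculate_imbalance(env, &sds);
            return env->imbalance ? sds.busiest : NULL;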

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dietmar.eggemann@arm.com
    Cc: jhugo@codeaurora.org
    Link: http://lkml.kernel.org/r/1536306664-29827-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • Since commit:

    523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")

    scale_rt_capacity() returns the remaining capacity and not a scale
    factor to apply to cpu_capacity_orig. arch_scale_cpu() is called
    directly by scale_rt_capacity(), so scale_rt_capacity() must take the
    sched_domain argument.

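    A minimal sketch of the resulting shape, assuming the 4.19-era
    helpers (cpu_rq(), arch_scale_cpu_capacity(), the rt/dl PELT
    averages); the condensed body is illustrative, not the verbatim fix:

        static unsigned long scale_rt_capacity(struct sched_domain *sd, int cpu)
        {
                struct rq *rq = cpu_rq(cpu);
                /* Needs the sched_domain to call arch_scale_cpu_capacity(): */
                unsigned long max = arch_scale_cpu_capacity(sd, cpu);
                unsigned long used;

                used  = READ_ONCE(rq->avg_rt.util_avg);
                used += READ_ONCE(rq->avg_dl.util_avg);

                if (unlikely(used >= max))
                        return 1;

                /* Remaining capacity, not a scale factor: */
                return max - used;
        }
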
    Reported-by: Srikar Dronamraju
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 523e979d3164 ("sched/core: Use PELT for scale_rt_capacity()")
    Link: http://lkml.kernel.org/r/20180904093626.GA23936@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • When a task which previously ran on a given CPU is remotely queued to
    wake up on that same CPU, there is a period where the task's state is
    TASK_WAKING and its vruntime is not normalized. This is not accounted
    for in vruntime_normalized() which will cause an error in the task's
    vruntime if it is switched from the fair class during this time.

    For example if it is boosted to RT priority via rt_mutex_setprio(),
    rq->min_vruntime will not be subtracted from the task's vruntime but
    it will be added again when the task returns to the fair class. The
    task's vruntime will have been erroneously doubled and the effective
    priority of the task will be reduced.

    Note this will also lead to inflation of all vruntimes since the doubled
    vruntime value will become the rq's min_vruntime when other tasks leave
    the rq. This leads to repeated doubling of the vruntime and priority
    penalty.

    Fix this by recognizing a WAKING task's vruntime as normalized only if
    sched_remote_wakeup is true. This indicates a migration, in which case
    the vruntime would have been normalized in migrate_task_rq_fair().

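    A sketch of the resulting check, assuming the 4.19-era
    vruntime_normalized() in kernel/sched/fair.c (surrounding cases
    elided):

        static inline bool vruntime_normalized(struct task_struct *p)
        {
                struct sched_entity *se = &p->se;

                /*
                 * A WAKING task is only normalized if it was migrated
                 * (sched_remote_wakeup), since migrate_task_rq_fair()
                 * normalized its vruntime:
                 */
                if (!se->sum_exec_runtime ||
                    (p->state == TASK_WAKING && p->sched_remote_wakeup))
                        return true;

                /* ... remaining cases elided ... */
                return false;
        }
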
    Based on a similar patch from John Dias.

    Suggested-by: Peter Zijlstra
    Tested-by: Dietmar Eggemann
    Signed-off-by: Steve Muckle
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Redpath
    Cc: John Dias
    Cc: Linus Torvalds
    Cc: Miguel de Dios
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Paul Turner
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: kernel-team@android.com
    Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")
    Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com
    Signed-off-by: Ingo Molnar

    Steve Muckle
     
  • update_blocked_averages() is called to periodically decay the stalled
    load of idle CPUs and to sync all loads before running load balance.

    When a cfs rq is idle, it triggers a load balance during
    pick_next_task_fair() in order to potentially pull tasks onto this
    newly idle CPU. This load balance happens while the prev task from
    another class has not been put yet and its utilization has not been
    updated, which may lead to wrongly accounting running time as idle
    time for the RT or DL classes.

    Test that no RT or DL task is running when updating their utilization
    in update_blocked_averages(), as sketched below.

    We still update the RT and DL utilization instead of simply skipping
    them, to make sure that all metrics are synced when used during load
    balance.

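    One plausible shape of the check, assuming the rt/dl PELT update
    helpers take a 'running' flag as their last argument (names and
    signatures are assumptions, not quoted from the tree):

        /* In update_blocked_averages(): */
        const struct sched_class *curr_class = rq->curr->sched_class;

        /*
         * Only account time as 'running' for RT/DL if a task of that
         * class is really the current one; still update, so the
         * metrics stay in sync for load balance:
         */
        update_rt_rq_load_avg(rq_clock_task(rq), rq,
                              curr_class == &rt_sched_class);
        update_dl_rq_load_avg(rq_clock_task(rq), rq,
                              curr_class == &dl_sched_class);
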
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 371bf4273269 ("sched/rt: Add rt_rq utilization tracking")
    Fixes: 3727e0e16340 ("sched/dl: Add dl_rq utilization tracking")
    Link: http://lkml.kernel.org/r/1535728975-22799-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • With the following commit:

    051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")

    the scheduler introduced a new NUMA level. However, this causes the
    NUMA topology on 2-node systems to no longer be marked as NUMA_DIRECT.

    After this commit, it gets reported as NUMA_BACKPLANE, because
    sched_domains_numa_level is now 2 on 2-node systems.

    Fix this by allowing systems with up to 2 NUMA levels to be set as
    NUMA_DIRECT, as sketched below.

    While at it, remove code that assumes that the level can be 0.

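    A minimal sketch of the intended check, assuming it lives in the
    NUMA topology-type initialization (the function name is an
    assumption):

        /* In init_numa_topology_type(): */
        if (sched_domains_numa_levels <= 2) {
                /* Up to 2 NUMA levels still counts as direct: */
                sched_numa_topology_type = NUMA_DIRECT;
                return;
        }
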
    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andre Wild
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Suravee Suthikulpanit
    Cc: Thomas Gleixner
    Cc: linuxppc-dev
    Fixes: 051f3ca02e46 ("sched/topology: Introduce NUMA identity node sched domain")
    Link: http://lkml.kernel.org/r/1533920419-17410-1-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     
  • The following lockdep report can be triggered by writing to /sys/kernel/debug/sched_features:

    ======================================================
    WARNING: possible circular locking dependency detected
    4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18 Not tainted
    ------------------------------------------------------
    sh/3358 is trying to acquire lock:
    000000004ad3989d (cpu_hotplug_lock.rw_sem){++++}, at: static_key_enable+0x14/0x30
    but task is already holding lock:
    00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3 (&sb->s_type->i_mutex_key#3){+.+.}:
    lock_acquire+0xb8/0x148
    down_write+0xac/0x140
    start_creating+0x5c/0x168
    debugfs_create_dir+0x18/0x220
    opp_debug_register+0x8c/0x120
    _add_opp_dev+0x104/0x1f8
    dev_pm_opp_get_opp_table+0x174/0x340
    _of_add_opp_table_v2+0x110/0x760
    dev_pm_opp_of_add_table+0x5c/0x240
    dev_pm_opp_of_cpumask_add_table+0x5c/0x100
    cpufreq_init+0x160/0x430
    cpufreq_online+0x1cc/0xe30
    cpufreq_add_dev+0x78/0x198
    subsys_interface_register+0x168/0x270
    cpufreq_register_driver+0x1c8/0x278
    dt_cpufreq_probe+0xdc/0x1b8
    platform_drv_probe+0xb4/0x168
    driver_probe_device+0x318/0x4b0
    __device_attach_driver+0xfc/0x1f0
    bus_for_each_drv+0xf8/0x180
    __device_attach+0x164/0x200
    device_initial_probe+0x10/0x18
    bus_probe_device+0x110/0x178
    device_add+0x6d8/0x908
    platform_device_add+0x138/0x3d8
    platform_device_register_full+0x1cc/0x1f8
    cpufreq_dt_platdev_init+0x174/0x1bc
    do_one_initcall+0xb8/0x310
    kernel_init_freeable+0x4b8/0x56c
    kernel_init+0x10/0x138
    ret_from_fork+0x10/0x18
    -> #2 (opp_table_lock){+.+.}:
    lock_acquire+0xb8/0x148
    __mutex_lock+0x104/0xf50
    mutex_lock_nested+0x1c/0x28
    _of_add_opp_table_v2+0xb4/0x760
    dev_pm_opp_of_add_table+0x5c/0x240
    dev_pm_opp_of_cpumask_add_table+0x5c/0x100
    cpufreq_init+0x160/0x430
    cpufreq_online+0x1cc/0xe30
    cpufreq_add_dev+0x78/0x198
    subsys_interface_register+0x168/0x270
    cpufreq_register_driver+0x1c8/0x278
    dt_cpufreq_probe+0xdc/0x1b8
    platform_drv_probe+0xb4/0x168
    driver_probe_device+0x318/0x4b0
    __device_attach_driver+0xfc/0x1f0
    bus_for_each_drv+0xf8/0x180
    __device_attach+0x164/0x200
    device_initial_probe+0x10/0x18
    bus_probe_device+0x110/0x178
    device_add+0x6d8/0x908
    platform_device_add+0x138/0x3d8
    platform_device_register_full+0x1cc/0x1f8
    cpufreq_dt_platdev_init+0x174/0x1bc
    do_one_initcall+0xb8/0x310
    kernel_init_freeable+0x4b8/0x56c
    kernel_init+0x10/0x138
    ret_from_fork+0x10/0x18
    -> #1 (subsys mutex#6){+.+.}:
    lock_acquire+0xb8/0x148
    __mutex_lock+0x104/0xf50
    mutex_lock_nested+0x1c/0x28
    subsys_interface_register+0xd8/0x270
    cpufreq_register_driver+0x1c8/0x278
    dt_cpufreq_probe+0xdc/0x1b8
    platform_drv_probe+0xb4/0x168
    driver_probe_device+0x318/0x4b0
    __device_attach_driver+0xfc/0x1f0
    bus_for_each_drv+0xf8/0x180
    __device_attach+0x164/0x200
    device_initial_probe+0x10/0x18
    bus_probe_device+0x110/0x178
    device_add+0x6d8/0x908
    platform_device_add+0x138/0x3d8
    platform_device_register_full+0x1cc/0x1f8
    cpufreq_dt_platdev_init+0x174/0x1bc
    do_one_initcall+0xb8/0x310
    kernel_init_freeable+0x4b8/0x56c
    kernel_init+0x10/0x138
    ret_from_fork+0x10/0x18
    -> #0 (cpu_hotplug_lock.rw_sem){++++}:
    __lock_acquire+0x203c/0x21d0
    lock_acquire+0xb8/0x148
    cpus_read_lock+0x58/0x1c8
    static_key_enable+0x14/0x30
    sched_feat_write+0x314/0x428
    full_proxy_write+0xa0/0x138
    __vfs_write+0xd8/0x388
    vfs_write+0xdc/0x318
    ksys_write+0xb4/0x138
    sys_write+0xc/0x18
    __sys_trace_return+0x0/0x4
    other info that might help us debug this:
    Chain exists of:
    cpu_hotplug_lock.rw_sem --> opp_table_lock --> &sb->s_type->i_mutex_key#3
    Possible unsafe locking scenario:
    CPU0 CPU1
    ---- ----
    lock(&sb->s_type->i_mutex_key#3);
    lock(opp_table_lock);
    lock(&sb->s_type->i_mutex_key#3);
    lock(cpu_hotplug_lock.rw_sem);
    *** DEADLOCK ***
    2 locks held by sh/3358:
    #0: 00000000a8c4b363 (sb_writers#10){.+.+}, at: vfs_write+0x238/0x318
    #1: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
    stack backtrace:
    CPU: 5 PID: 3358 Comm: sh Not tainted 4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18
    Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT)
    Call trace:
    dump_backtrace+0x0/0x288
    show_stack+0x14/0x20
    dump_stack+0x13c/0x1ac
    print_circular_bug.isra.10+0x270/0x438
    check_prev_add.constprop.16+0x4dc/0xb98
    __lock_acquire+0x203c/0x21d0
    lock_acquire+0xb8/0x148
    cpus_read_lock+0x58/0x1c8
    static_key_enable+0x14/0x30
    sched_feat_write+0x314/0x428
    full_proxy_write+0xa0/0x138
    __vfs_write+0xd8/0x388
    vfs_write+0xdc/0x318
    ksys_write+0xb4/0x138
    sys_write+0xc/0x18
    __sys_trace_return+0x0/0x4

    This is because when loading the cpufreq_dt module we first acquire
    the cpu_hotplug_lock.rw_sem lock, and then, in cpufreq_init(), we
    take the &sb->s_type->i_mutex_key lock.

    But when writing to /sys/kernel/debug/sched_features, the
    cpu_hotplug_lock.rw_sem lock depends on the &sb->s_type->i_mutex_key
    lock.

    To fix this bug, reverse the lock acquisition order when writing to
    sched_features; this way, cpu_hotplug_lock.rw_sem no longer depends
    on &sb->s_type->i_mutex_key, as sketched below.

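    A hedged sketch of the reordering in sched_feat_write() (the local
    variables and the sched_feat_set() call are assumptions for
    illustration):

        inode = file_inode(filp);
        /* Take cpu_hotplug_lock before the inode lock, not after: */
        cpus_read_lock();
        inode_lock(inode);
        ret = sched_feat_set(cmp);
        inode_unlock(inode);
        cpus_read_unlock();
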
    Tested-by: Dietmar Eggemann
    Signed-off-by: Jiada Wang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Eugeniu Rosca
    Cc: George G. Davis
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180731121222.26195-1-jiada_wang@mentor.com
    Signed-off-by: Ingo Molnar

    Jiada Wang
     

23 Aug, 2018

2 commits

  • Merge more updates from Andrew Morton:

    - the rest of MM

    - procfs updates

    - various misc things

    - more y2038 fixes

    - get_maintainer updates

    - lib/ updates

    - checkpatch updates

    - various epoll updates

    - autofs updates

    - hfsplus

    - some reiserfs work

    - fatfs updates

    - signal.c cleanups

    - ipc/ updates

    * emailed patches from Andrew Morton: (166 commits)
    ipc/util.c: update return value of ipc_getref from int to bool
    ipc/util.c: further variable name cleanups
    ipc: simplify ipc initialization
    ipc: get rid of ids->tables_initialized hack
    lib/rhashtable: guarantee initial hashtable allocation
    lib/rhashtable: simplify bucket_table_alloc()
    ipc: drop ipc_lock()
    ipc/util.c: correct comment in ipc_obtain_object_check
    ipc: rename ipcctl_pre_down_nolock()
    ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
    ipc: reorganize initialization of kern_ipc_perm.seq
    ipc: compute kern_ipc_perm.id under the ipc lock
    init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
    fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
    adfs: use timespec64 for time conversion
    kernel/sysctl.c: fix typos in comments
    drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
    fork: don't copy inconsistent signal handler state to child
    signal: make get_signal() return bool
    signal: make sigkill_pending() return bool
    ...

    Linus Torvalds
     
  • Better to ensure we actually hold the lock using lockdep than to
    just comment on it, as illustrated below. Due to the various exported
    _locked interfaces it is far too easy to get the locking wrong.

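    An illustrative use of lockdep_assert_held(); the epoll function and
    lock names here are assumptions, not the exact sites of the patch:

        static void ep_remove(struct eventpoll *ep, struct epitem *epi)
        {
                /* Complains (under lockdep) if ep->mtx is not held: */
                lockdep_assert_held(&ep->mtx);

                /* ... body elided ... */
        }
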
    Link: http://lkml.kernel.org/r/20171214152344.6880-4-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Acked-by: Ingo Molnar
    Cc: Al Viro
    Cc: Andrea Arcangeli
    Cc: Ingo Molnar
    Cc: Jason Baron
    Cc: Matthew Wilcox
    Cc: Mike Rapoport
    Cc: Peter Zijlstra
    Cc: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

22 Aug, 2018

2 commits

  • Pull more power management updates from Rafael Wysocki:
    "These fix the main idle loop and the menu cpuidle governor, clean up
    the latter, fix a mistake in the PCI bus type's support for system
    suspend and resume, fix the ondemand and conservative cpufreq
    governors, address a build issue in the system wakeup framework and
    make the ACPI C-state descriptions less confusing.

    Specifics:

    - Make the idle loop handle stopped scheduler tick correctly (Rafael
    Wysocki).

    - Prevent the menu cpuidle governor from letting CPUs spend too much
    time in shallow idle states when it is invoked with scheduler tick
    stopped and clean it up somewhat (Rafael Wysocki).

    - Avoid invoking the platform firmware to make the platform enter the
    ACPI S3 sleep state with suspended PCIe root ports which may
    confuse the firmware and cause it to crash (Rafael Wysocki).

    - Fix sysfs-related race in the ondemand and conservative cpufreq
    governors which may cause the system to crash if the governor
    module is removed during an update of CPU frequency limits (Henry
    Willard).

    - Select SRCU when building the system wakeup framework to avoid a
    build issue in it (zhangyi).

    - Make the descriptions of ACPI C-states vendor-neutral to avoid
    confusion (Prarit Bhargava)"

    * tag 'pm-4.19-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpuidle: menu: Handle stopped tick more aggressively
    sched: idle: Avoid retaining the tick when it has been stopped
    PCI / ACPI / PM: Resume all bridges on suspend-to-RAM
    cpuidle: menu: Update stale polling override comment
    cpufreq: governor: Avoid accessing invalid governor_data
    x86/ACPI/cstate: Make APCI C1 FFH MWAIT C-state description vendor-neutral
    cpuidle: menu: Fix white space
    PM / sleep: wakeup: Fix build error caused by missing SRCU support

    Linus Torvalds
     
  • Merge branch 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

    Pull core signal handling updates from Eric Biederman:
    "It was observed that a periodic timer in combination with a
    sufficiently expensive fork could prevent fork from ever completing.
    This contains the changes to remove the need for that restart.

    This set of changes is split into several parts:

    - The first part makes PIDTYPE_TGID a proper pid type instead of
    something only for very special cases. This part starts using
    PIDTYPE_TGID enough so that in __send_signal, where signals are
    actually delivered, we know if the signal is being sent to a group
    of processes or just a single process.

    - With that prep work out of the way the logic in fork is modified so
    that fork logically makes signals received while it is running
    appear to be received after the fork completes"

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (22 commits)
    signal: Don't send signals to tasks that don't exist
    signal: Don't restart fork when signals come in.
    fork: Have new threads join on-going signal group stops
    fork: Skip setting TIF_SIGPENDING in ptrace_init_task
    signal: Add calculate_sigpending()
    fork: Unconditionally exit if a fatal signal is pending
    fork: Move and describe why the code examines PIDNS_ADDING
    signal: Push pid type down into complete_signal.
    signal: Push pid type down into __send_signal
    signal: Push pid type down into send_signal
    signal: Pass pid type into do_send_sig_info
    signal: Pass pid type into send_sigio_to_task & send_sigurg_to_task
    signal: Pass pid type into group_send_sig_info
    signal: Pass pid and pid type into send_sigqueue
    posix-timers: Noralize good_sigevent
    signal: Use PIDTYPE_TGID to clearly store where file signals will be sent
    pid: Implement PIDTYPE_TGID
    pids: Move the pgrp and session pid pointers from task_struct to signal_struct
    kvm: Don't open code task_pid in kvm_vcpu_ioctl
    pids: Compute task_tgid using signal->leader_pid
    ...

    Linus Torvalds
     

21 Aug, 2018

1 commit

  • Pull tracing updates from Steven Rostedt:

    - Restructure of lockdep and latency tracers

    This is the biggest change. Joel Fernandes restructured the hooks
    from irqs and preemption disabling and enabling. He got rid of a lot
    of the preprocessor #ifdef mess that they caused.

    He turned both lockdep and the latency tracers to use trace events
    inserted in the preempt/irqs disabling paths. But unfortunately,
    these started to cause issues in corner cases. Thus, parts of the
    code were reverted back to where lockdep and the latency tracers just
    get called directly (without using the trace events). But because the
    original change cleaned up the code very nicely we kept that, as well
    as the trace events for preempt and irqs disabling, but they are
    limited to not being called in NMIs.

    - Have trace events use SRCU for "rcu idle" calls. This was required
    for the preempt/irqs off trace events, but they also must not be
    called in NMI context until Paul makes an NMI-safe SRCU API.

    - New notrace SRCU API to allow trace events to use SRCU.

    - Addition of mcount-nop option support

    - SPDX headers replacing GPL templates.

    - Various other fixes and clean ups.

    - Some fixes are marked for stable, but were not fully tested before
    the merge window opened.

    * tag 'trace-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (44 commits)
    tracing: Fix SPDX format headers to use C++ style comments
    tracing: Add SPDX License format tags to tracing files
    tracing: Add SPDX License format to bpf_trace.c
    blktrace: Add SPDX License format header
    s390/ftrace: Add -mfentry and -mnop-mcount support
    tracing: Add -mcount-nop option support
    tracing: Avoid calling cc-option -mrecord-mcount for every Makefile
    tracing: Handle CC_FLAGS_FTRACE more accurately
    Uprobe: Additional argument arch_uprobe to uprobe_write_opcode()
    Uprobes: Simplify uprobe_register() body
    tracepoints: Free early tracepoints after RCU is initialized
    uprobes: Use synchronize_rcu() not synchronize_sched()
    tracing: Fix synchronizing to event changes with tracepoint_synchronize_unregister()
    ftrace: Remove unused pointer ftrace_swapper_pid
    tracing: More reverting of "tracing: Centralize preemptirq tracepoints and unify their usage"
    tracing/irqsoff: Handle preempt_count for different configs
    tracing: Partial revert of "tracing: Centralize preemptirq tracepoints and unify their usage"
    tracing: irqsoff: Account for additional preempt_disable
    trace: Use rcu_dereference_raw for hooks from trace-event subsystem
    tracing/kprobes: Fix within_notrace_func() to check only notrace functions
    ...

    Linus Torvalds
     

20 Aug, 2018

1 commit

  • If the tick has been stopped already, but the governor has not asked to
    stop it (which it can do sometimes), the idle loop should invoke
    tick_nohz_idle_stop_tick(), to let tick_nohz_stop_tick() take care
    of this case properly.

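    A sketch of the resulting logic, assuming the 4.18-era
    cpuidle_idle_call() in kernel/sched/idle.c:

        next_state = cpuidle_select(drv, dev, &stop_tick);

        /*
         * Also stop the tick if it is already stopped, so that
         * tick_nohz_stop_tick() can handle this case properly:
         */
        if (stop_tick || tick_nohz_tick_stopped())
                tick_nohz_idle_stop_tick();
        else
                tick_nohz_idle_retain_tick();
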
    Fixes: 554c8aa8ecad ("sched: idle: Select idle state before stopping the tick")
    Cc: 4.17+ # 4.17+
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

15 Aug, 2018

1 commit

  • Merge L1 Terminal Fault fixes from Thomas Gleixner:
    "L1TF, aka L1 Terminal Fault, is yet another speculative hardware
    engineering trainwreck. It's a hardware vulnerability which allows
    unprivileged speculative access to data which is available in the
    Level 1 Data Cache when the page table entry controlling the virtual
    address, which is used for the access, has the Present bit cleared or
    other reserved bits set.

    If an instruction accesses a virtual address for which the relevant
    page table entry (PTE) has the Present bit cleared or other reserved
    bits set, then speculative execution ignores the invalid PTE and loads
    the referenced data if it is present in the Level 1 Data Cache, as if
    the page referenced by the address bits in the PTE was still present
    and accessible.

    While this is a purely speculative mechanism and the instruction will
    raise a page fault when it is retired eventually, the pure act of
    loading the data and making it available to other speculative
    instructions opens up the opportunity for side channel attacks to
    unprivileged malicious code, similar to the Meltdown attack.

    While Meltdown breaks the user space to kernel space protection, L1TF
    allows attacking any physical memory address in the system, and the
    attack works across all protection domains. It allows an attack on
    SGX and also works from inside virtual machines, because the
    speculation bypasses the extended page table (EPT) protection
    mechanism.

    The associated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646

    The mitigations provided by this pull request include:

    - Host side protection by inverting the upper address bits of a non
    present page table entry so the entry points to uncacheable memory.

    - Hypervisor protection by flushing L1 Data Cache on VMENTER.

    - SMT (HyperThreading) control knobs, which allow 'turning off' SMT
    by offlining the sibling CPU threads. The knobs are available on
    the kernel command line and at runtime via sysfs

    - Control knobs for the hypervisor mitigation, related to L1D flush
    and SMT control. The knobs are available on the kernel command line
    and at runtime via sysfs

    - Extensive documentation about L1TF including various degrees of
    mitigations.

    Thanks to all people who have contributed to this in various ways -
    patches, review, testing, backporting - and the fruitful, sometimes
    heated, but at the end constructive discussions.

    There is work in progress to provide other forms of mitigations,
    which might be less horrible performance-wise for particular kinds
    of workloads, but these are not yet ready for consumption due to
    their complexity and limitations"

    * 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
    x86/microcode: Allow late microcode loading with SMT disabled
    tools headers: Synchronise x86 cpufeatures.h for L1TF additions
    x86/mm/kmmio: Make the tracer robust against L1TF
    x86/mm/pat: Make set_memory_np() L1TF safe
    x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
    x86/speculation/l1tf: Invert all not present mappings
    cpu/hotplug: Fix SMT supported evaluation
    KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
    x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
    x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
    Documentation/l1tf: Remove Yonah processors from not vulnerable list
    x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
    x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
    x86: Don't include linux/irq.h from asm/hardirq.h
    x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
    x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
    x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
    x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
    x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
    cpu/hotplug: detect SMT disabled by BIOS
    ...

    Linus Torvalds
     

14 Aug, 2018

2 commits

  • Pull x86 timer updates from Thomas Gleixner:
    "Early TSC based time stamping to allow better boot time analysis.

    This comes with a general cleanup of the TSC calibration code which
    grew warts and duct taping over the years and removes 250 lines of
    code. Initiated and mostly implemented by Pavel with help from various
    folks"

    * 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    x86/kvmclock: Mark kvm_get_preset_lpj() as __init
    x86/tsc: Consolidate init code
    sched/clock: Disable interrupts when calling generic_sched_clock_init()
    timekeeping: Prevent false warning when persistent clock is not available
    sched/clock: Close a hole in sched_clock_init()
    x86/tsc: Make use of tsc_calibrate_cpu_early()
    x86/tsc: Split native_calibrate_cpu() into early and late parts
    sched/clock: Use static key for sched_clock_running
    sched/clock: Enable sched clock early
    sched/clock: Move sched clock initialization and merge with generic clock
    x86/tsc: Use TSC as sched clock early
    x86/tsc: Initialize cyc2ns when tsc frequency is determined
    x86/tsc: Calibrate tsc only once
    ARM/time: Remove read_boot_clock64()
    s390/time: Remove read_boot_clock64()
    timekeeping: Default boot time offset to local_clock()
    timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
    s390/time: Add read_persistent_wall_and_boot_offset()
    x86/xen/time: Output xen sched_clock time from 0
    x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
    ...

    Linus Torvalds
     
  • Pull locking/atomics update from Thomas Gleixner:
    "The locking, atomics and memory model brains delivered:

    - A larger update to the atomics code which reworks the ordering
    barriers, consolidates the atomic primitives, provides the new
    atomic64_fetch_add_unless() primitive and cleans up the include
    hell.

    - Simplify cmpxchg() instrumentation and add instrumentation for
    xchg() and cmpxchg_double().

    - Updates to the memory model and documentation"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
    locking/atomics: Rework ordering barriers
    locking/atomics: Instrument cmpxchg_double*()
    locking/atomics: Instrument xchg()
    locking/atomics: Simplify cmpxchg() instrumentation
    locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
    tools/memory-model: Rename litmus tests to comply to norm7
    tools/memory-model/Documentation: Fix typo, smb->smp
    sched/Documentation: Update wake_up() & co. memory-barrier guarantees
    locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
    sched/core: Use smp_mb() in wake_woken_function()
    tools/memory-model: Add informal LKMM documentation to MAINTAINERS
    locking/atomics/Documentation: Describe atomic_set() as a write operation
    tools/memory-model: Make scripts executable
    tools/memory-model: Remove ACCESS_ONCE() from model
    tools/memory-model: Remove ACCESS_ONCE() from recipes
    locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
    MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
    tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
    tools/memory-model: Add litmus test for full multicopy atomicity
    locking/refcount: Always allow checked forms
    ...

    Linus Torvalds
     

05 Aug, 2018

1 commit


04 Aug, 2018

1 commit

  • Add a function, calculate_sigpending(), to test whether any signals
    are pending for a new task immediately following fork. Signals have
    to happen either before or after fork. Today our practice is to push
    all of the signals to before the fork, but that has the downside that
    frequent or periodic signals can make fork take much, much longer
    than normal, or prevent fork from completing entirely.

    So we need to move the signals that we can to after the fork, to
    prevent that.

    This updates the code to set TIF_SIGPENDING on a new task if there
    are signals or other activities that have moved so that they appear
    to happen after the fork.

    As the code today restarts if it sees any such activity this won't
    immediately have an effect, as there will be no reason for it
    to set TIF_SIGPENDING immediately after the fork.

    Adding calculate_sigpending means the code in fork can safely be
    changed to not always restart if a signal is pending.

    The new calculate_sigpending function sets sigpending if there are
    pending bits in jobctl, pending signals, the freezer needs to freeze
    the new task, or the live kernel patching framework needs the new
    thread to take the slow path to userspace.

    I have verified that setting TIF_SIGPENDING does make a new process
    take the slow path to userspace before it executes its first
    userspace instruction.

    I have looked at the callers of signal_wake_up and the code paths
    setting TIF_SIGPENDING and I don't see anything else that needs to be
    handled. The code probably doesn't need to set TIF_SIGPENDING for
    kernel live patching, as it uses a separate thread flag as well. But
    at this point it seems safer to reuse the recalc_sigpending logic and
    get the kernel live patching folks to sort out their story later.

    V2: I have moved the test into schedule_tail, where siglock can
    be grabbed and recalc_sigpending can be reused directly.
    Further, as the last action of setting up a new task, this
    guarantees that TIF_SIGPENDING will be properly set in the
    new process.

    The helper calculate_sigpending takes the siglock and
    unconditionally sets TIF_SIGPENDING, letting recalc_sigpending
    clear TIF_SIGPENDING if it is unnecessary, as sketched below. This
    allows reusing the existing code and keeps maintenance of the
    conditions simple.

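    A hedged sketch of the helper as described above (the details are
    assumed from the description, not quoted from kernel/signal.c):

        void calculate_sigpending(void)
        {
                /*
                 * Have any signals or other activities been delayed
                 * until after fork? Set TIF_SIGPENDING unconditionally
                 * and let recalc_sigpending() clear it if nothing is
                 * actually pending.
                 */
                spin_lock_irq(&current->sighand->siglock);
                set_tsk_thread_flag(current, TIF_SIGPENDING);
                recalc_sigpending();
                spin_unlock_irq(&current->sighand->siglock);
        }
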
    Oleg Nesterov suggested the movement
    and pointed out the need to take siglock if this code
    was going to be called while the new task is discoverable.

    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

31 Jul, 2018

1 commit

  • This patch detaches the preemptirq tracepoints from the tracers and
    keeps them separate.

    Advantages:
    * Lockdep and irqsoff event can now run in parallel since they no longer
    have their own calls.

    * This unifies the usecase of adding hooks to an irqsoff and irqson
    event, and a preemptoff and preempton event.
    3 users of the events exist:
    - Lockdep
    - irqsoff and preemptoff tracers
    - irqs and preempt trace events

    The unification cleans up several ifdefs and makes the code in the
    preempt and irqsoff tracers simpler. It gets rid of all the horrific
    ifdeferry around PROVE_LOCKING and makes configuration of the
    different users of the tracepoints easier and more understandable.
    It also gets rid of the time_* function calls from the lockdep hooks
    used to call into the preemptirq tracer, which are not needed
    anymore. The negative delta in lines of code in this patch is quite
    large too.

    In the patch we introduce a new CONFIG option, PREEMPTIRQ_TRACEPOINTS,
    as a single point for registering probes onto the tracepoints; a
    registration sketch follows the diagram below. With this, the web of
    config options for preempt/irq toggle tracepoints and their users
    becomes:

        PREEMPT_TRACER  PREEMPTIRQ_EVENTS  IRQSOFF_TRACER  PROVE_LOCKING
              |               |     \           |               |
              \   (selects)   /      \          \   (selects)   /
             TRACE_PREEMPT_TOGGLE     ---->  TRACE_IRQFLAGS
                           \                    /
                            \   (depends on)   /
                             PREEMPTIRQ_TRACEPOINTS

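    A hedged sketch of what registering a probe onto one of these
    tracepoints could look like (the tracepoint and probe names are
    assumptions based on the preempt/irq events' (ip, parent_ip)
    prototype):

        #include <trace/events/preemptirq.h>

        /* Probe signature follows the tracepoint's TP_PROTO: */
        static void probe_irq_disable(void *data, unsigned long ip,
                                      unsigned long parent_ip)
        {
                /* ... react to IRQs being turned off ... */
        }

        static int __init my_probes_init(void)
        {
                return register_trace_irq_disable(probe_irq_disable, NULL);
        }
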
    Other than the performance tests mentioned in the previous patch, I also
    ran the locking API test suite. I verified that all test cases are
    passing.

    I also injected issues by not registering lockdep probes onto the
    tracepoints, and saw the resulting failures, confirming that the
    probes are indeed working.

    This series + lockdep probes not registered (just to inject errors):
    [ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
    [ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
    [ 0.000000] sirq-safe-A => hirqs-on/12:FAILED|FAILED| ok |
    [ 0.000000] sirq-safe-A => hirqs-on/21:FAILED|FAILED| ok |
    [ 0.000000] hard-safe-A + irqs-on/12:FAILED|FAILED| ok |
    [ 0.000000] soft-safe-A + irqs-on/12:FAILED|FAILED| ok |
    [ 0.000000] hard-safe-A + irqs-on/21:FAILED|FAILED| ok |
    [ 0.000000] soft-safe-A + irqs-on/21:FAILED|FAILED| ok |
    [ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
    [ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |

    With this series + lockdep probes registered, all locking tests pass:

    [ 0.000000] hard-irqs-on + irq-safe-A/21: ok | ok | ok |
    [ 0.000000] soft-irqs-on + irq-safe-A/21: ok | ok | ok |
    [ 0.000000] sirq-safe-A => hirqs-on/12: ok | ok | ok |
    [ 0.000000] sirq-safe-A => hirqs-on/21: ok | ok | ok |
    [ 0.000000] hard-safe-A + irqs-on/12: ok | ok | ok |
    [ 0.000000] soft-safe-A + irqs-on/12: ok | ok | ok |
    [ 0.000000] hard-safe-A + irqs-on/21: ok | ok | ok |
    [ 0.000000] soft-safe-A + irqs-on/21: ok | ok | ok |
    [ 0.000000] hard-safe-A + unsafe-B #1/123: ok | ok | ok |
    [ 0.000000] soft-safe-A + unsafe-B #1/123: ok | ok | ok |

    Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.org

    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Namhyung Kim
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Steven Rostedt (VMware)

    Joel Fernandes (Google)