30 Dec, 2020

2 commits

  • [ Upstream commit 345a957fcc95630bf5535d7668a59ed983eb49a7 ]

    do_sched_yield() invokes schedule() with interrupts disabled which is
    not allowed. This goes back to the pre git era to commit a6efb709806c
    ("[PATCH] irqlock patch 2.5.27-H6") in the history tree.

    Reenable interrupts and remove the misleading comment which "explains" it.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
    Signed-off-by: Sasha Levin

    Thomas Gleixner
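
    A hedged sketch of the resulting flow (simplified; the real function also
    bumps schedstats), showing interrupts being re-enabled together with the
    runqueue lock release before schedule() runs:

        static void do_sched_yield(void)
        {
                struct rq_flags rf;
                struct rq *rq;

                rq = this_rq_lock_irq(&rf);     /* takes rq->lock, IRQs off */

                current->sched_class->yield_task(rq);

                preempt_disable();
                rq_unlock_irq(rq, &rf);         /* drop the lock AND re-enable IRQs */
                sched_preempt_enable_no_resched();

                schedule();                     /* now called with IRQs enabled */
        }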
     
  • [ Upstream commit a57415f5d1e43c3a5c5d412cd85e2792d7ed9b11 ]

    When changing sched_rt_{runtime, period}_us, we validate that the new
    settings should at least accommodate the currently allocated -dl
    bandwidth:

    sched_rt_handler()
      --> sched_dl_bandwidth_validate()
          {
              new_bw = global_rt_runtime()/global_rt_period();

              for_each_possible_cpu(cpu) {
                  dl_b = dl_bw_of(cpu);
                  if (new_bw < dl_b->total_bw)
                      ret = -EBUSY;
              }
          }

    But new_bw is a single CPU's bandwidth, while dl_b->total_bw is the
    allocated bandwidth of the whole root domain. Instead, we should compare
    dl_b->total_bw against "cpus * new_bw", where 'cpus' is the number of
    CPUs of the root domain.

    Also, the annotation below (in kernel/sched/sched.h) describes an
    implementation that only existed in SCHED_DEADLINE v2 [1]; the deadline
    scheduler kept evolving until it was merged (v9), but the annotation was
    never updated, so it is now meaningless and misleading. Update it:

    * With respect to SMP, the bandwidth is given on a per-CPU basis,
    * meaning that:
    * - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
    * - dl_total_bw array contains, in the i-eth element, the currently
    * allocated bandwidth on the i-eth CPU.

    [1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/

    Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks")
    Signed-off-by: Peng Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Bristot de Oliveira
    Acked-by: Juri Lelli
    Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.1602171061.git.iwtbavbm@gmail.com
    Signed-off-by: Sasha Levin

    Peng Liu
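
    A hedged sketch of the corrected check (helper names taken from
    kernel/sched; locking and the caching of already-visited root domains are
    omitted): scale the new per-CPU bandwidth by the number of CPUs in the
    root domain before comparing it with the root domain's total_bw.

        /* sketch only, not the literal patch */
        static int sched_dl_validate_sketch(u64 new_bw)
        {
                int cpu, ret = 0;

                for_each_possible_cpu(cpu) {
                        struct dl_bw *dl_b = dl_bw_of(cpu);
                        int cpus = dl_bw_cpus(cpu);  /* CPUs in this root domain */

                        /* new_bw is per CPU, total_bw is per root domain */
                        if (new_bw * cpus < dl_b->total_bw)
                                ret = -EBUSY;
                }

                return ret;
        }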
     

09 Dec, 2020

3 commits

  • membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented as
    syncing the core on all sibling threads but not necessarily the calling
    thread. This behavior is fundamentally buggy and cannot be used safely.

    Suppose a user program has two threads. Thread A is on CPU 0 and thread B
    is on CPU 1. Thread A modifies some text and calls
    membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).

    Then thread B executes the modified code. At any point after membarrier()
    decides which CPUs to target, thread A could be preempted and replaced by
    thread B on CPU 0. This could even happen on exit from the membarrier()
    syscall. If this happens, thread B will end up running on CPU 0 without
    having synced.

    In principle, this could be fixed by arranging for the scheduler to issue
    sync_core_before_usermode() whenever switching between two threads in the
    same mm if there is any possibility of a concurrent membarrier() call, but
    this would have considerable overhead. Instead, make membarrier() sync the
    calling CPU as well.

    As an optimization, this avoids an extra smp_mb() in the default
    barrier-only mode and an extra rseq preempt on the caller.

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Link: https://lore.kernel.org/r/250ded637696d490c69bef1877148db86066881c.1607058304.git.luto@kernel.org

    Andy Lutomirski
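
    For reference, a minimal user-space sketch of how the command above is
    meant to be used for cross-modifying code (patch_code() is a hypothetical
    stand-in for whatever rewrites the text):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        extern void patch_code(void *text);  /* hypothetical helper */

        static int membarrier(int cmd, unsigned int flags)
        {
                return syscall(__NR_membarrier, cmd, flags);
        }

        /* Once per process, before the expedited command may be used. */
        void register_sync_core(void)
        {
                membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
        }

        /* Thread A: after modifying code, before any thread may execute it. */
        void publish_patched_code(void *text)
        {
                patch_code(text);
                membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
                /* with the fix above, the calling CPU is core-serialized too */
        }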
     
  • membarrier() does not explicitly sync_core() remote CPUs; instead, it
    relies on the assumption that an IPI will result in a core sync. On x86,
    this may be true in practice, but it's not architecturally reliable. In
    particular, the SDM and APM do not appear to guarantee that interrupt
    delivery is serializing. While IRET does serialize, IPI return can
    schedule, thereby switching to another task in the same mm that was
    sleeping in a syscall. The new task could then SYSRET back to usermode
    without ever executing IRET.

    Make this more robust by explicitly calling sync_core_before_usermode()
    on remote cores. (This also helps people who search the kernel tree for
    instances of sync_core() and sync_core_before_usermode() -- one might be
    surprised that the core membarrier code doesn't currently show up in a
    such a search.)

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/776b448d5f7bd6b12690707f5ed67bcda7f1d427.1607058304.git.luto@kernel.org

    Andy Lutomirski
     
  • It seems that most RSEQ membarrier users will expect any stores done before
    the membarrier() syscall to be visible to the target task(s). While this
    is extremely likely to be true in practice, nothing actually guarantees it
    by a strict reading of the x86 manuals. Rather than providing this
    guarantee by accident and potentially causing a problem down the road, just
    add an explicit barrier.

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/d3e7197e034fa4852afcf370ca49c30496e58e40.1607058304.git.luto@kernel.org

    Andy Lutomirski
     

30 Nov, 2020

1 commit


24 Nov, 2020

1 commit

  • We call arch_cpu_idle() with RCU disabled, but then use
    local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.

    Switch all arch_cpu_idle() implementations to use
    raw_local_irq_{en,dis}able() and carefully manage the
    lockdep,rcu,tracing state like we do in entry.

    (XXX: we really should change arch_cpu_idle() to not return with
    interrupts enabled)

    Reported-by: Sven Schnelle
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Mark Rutland
    Tested-by: Mark Rutland
    Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org

    Peter Zijlstra
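
    A hedged sketch of the pattern the series converts the architectures to
    (generic, not any particular arch's actual code; cpu_do_idle() stands in
    for the low-level wait-for-interrupt hook):

        void arch_cpu_idle(void)
        {
                /*
                 * Entered with IRQs disabled and RCU not watching, so no
                 * tracing may run here: use the raw helper, which skips
                 * the irq-off/irq-on tracepoints.
                 */
                cpu_do_idle();
                raw_local_irq_enable();
        }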
     

23 Nov, 2020

1 commit

  • Pull scheduler fixes from Thomas Gleixner:
    "A couple of scheduler fixes:

    - Make the conditional update of the overutilized state work
    correctly by caching the relevant flags state before overwriting
    them and checking them afterwards.

    - Fix a data race in the wakeup path which caused loadavg on ARM64
    platforms to become a random number generator.

    - Fix the ordering of the iowaiter accounting operations so it can't
    be decremented before it is incremented.

    - Fix a bug in the deadline scheduler vs. priority inheritance when a
    non-deadline task A has inherited the parameters of a deadline task
    B and then blocks on a non-deadline task C.

    The second inheritance step used the static deadline parameters of
    task A, which are usually 0, instead of further propagating task
    B's parameters. The zero initialized parameters trigger a bug in
    the deadline scheduler"

    * tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/deadline: Fix priority inheritance with multiple scheduling classes
    sched: Fix rq->nr_iowait ordering
    sched: Fix data-race in wakeup
    sched/fair: Fix overutilized update in enqueue_task_fair()

    Linus Torvalds
     

17 Nov, 2020

3 commits

  • Glenn reported that "an application [he developed produces] a BUG in
    deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
    PTHREAD_PRIO_INHERIT mutexes. I believe the bug is triggered when a CFS
    task that was boosted by a SCHED_DEADLINE task boosts another CFS task
    (nested priority inheritance)."

    ------------[ cut here ]------------
    kernel BUG at kernel/sched/deadline.c:1462!
    invalid opcode: 0000 [#1] PREEMPT SMP
    CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
    Hardware name: ...
    RIP: 0010:enqueue_task_dl+0x335/0x910
    Code: ...
    RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
    RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
    RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
    RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
    R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
    R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
    FS: 00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:
    ? intel_pstate_update_util_hwp+0x13/0x170
    rt_mutex_setprio+0x1cc/0x4b0
    task_blocks_on_rt_mutex+0x225/0x260
    rt_spin_lock_slowlock_locked+0xab/0x2d0
    rt_spin_lock_slowlock+0x50/0x80
    hrtimer_grab_expiry_lock+0x20/0x30
    hrtimer_cancel+0x13/0x30
    do_nanosleep+0xa0/0x150
    hrtimer_nanosleep+0xe1/0x230
    ? __hrtimer_init_sleeper+0x60/0x60
    __x64_sys_nanosleep+0x8d/0xa0
    do_syscall_64+0x4a/0x100
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fa58b52330d
    ...
    ---[ end trace 0000000000000002 ]---

    He also provided a simple reproducer creating the situation below:

    So the execution order of locking steps is the following
    (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2
    are mutexes that are enabled with priority inheritance.)

    Time moves forward as this timeline goes down:

    N1                     N2                      D1
    |                      |                       |
    |                      |                       |
    Lock(M1)               |                       |
    |                      |                       |
    |                    Lock(M2)                  |
    |                      |                       |
    |                      |                     Lock(M2)
    |                      |                       |
    |                    Lock(M1)                  |
    |                    (!!bug triggered!)        |

    Daniel reported a similar situation as well, by just letting ksoftirqd
    run with DEADLINE (and eventually block on a mutex).

    The problem is that boosted entities (Priority Inheritance) use the static
    DEADLINE parameters of the top priority waiter. However, there might be
    cases where the top waiter is a non-DEADLINE entity that is currently
    boosted by a DEADLINE entity from a different lock chain (i.e., nested
    priority chains involving entities of non-DEADLINE classes). In this
    case, the top waiter's static DEADLINE parameters could be null
    (initialized to 0 at fork()) and replenish_dl_entity() would hit a BUG().

    Fix this by keeping track of the original donor and using its parameters
    when a task is boosted.

    Reported-by: Glenn Elliott
    Reported-by: Daniel Bristot de Oliveira
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Bristot de Oliveira
    Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com

    Juri Lelli
     
    schedule()                              ttwu()
      deactivate_task();                      if (p->on_rq && ...) // false
                                                atomic_dec(&task_rq(p)->nr_iowait);
      if (prev->in_iowait)
        atomic_inc(&rq->nr_iowait);

    Allows nr_iowait to be decremented before it gets incremented,
    resulting in more dodgy IO-wait numbers than usual.

    Note that because we can now do ttwu_queue_wakelist() before
    p->on_cpu==0, we lose the natural ordering and have to further delay
    the decrement.

    Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
    Reported-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mel Gorman
    Link: https://lkml.kernel.org/r/20201117093829.GD3121429@hirez.programming.kicks-ass.net

    Peter Zijlstra
     
  • enqueue_task_fair() attempts to skip the overutilized update for new
    tasks as their util_avg is not accurate yet. However, the flag we check
    to do so is overwritten earlier on in the function, which makes the
    condition pretty much a nop.

    Fix this by saving the flag early on.

    Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
    Reported-by: Rick Yiu
    Signed-off-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20201112111201.2081902-1-qperret@google.com

    Quentin Perret
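
    A hedged sketch of the idea (field and helper names simplified relative
    to the actual enqueue_task_fair()):

        static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
        {
                /* cache the wakeup bit before later code rewrites 'flags' */
                int task_new = !(flags & ENQUEUE_WAKEUP);

                /* ... enqueue path, which may modify 'flags' ... */

                /* only skip the overutilized update for genuinely new tasks */
                if (!task_new)
                        update_overutilized_status(rq);
        }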
     

16 Nov, 2020

1 commit

  • Pull scheduler fixes from Thomas Gleixner:
    "A set of scheduler fixes:

    - Address a load balancer regression by making the load balancer use
    the same logic as the wakeup path to spread tasks in the LLC domain

    - Prefer the CPU on which a task run last over the local CPU in the
    fast wakeup path for asymmetric CPU capacity systems to align with
    the symmetric case. This ensures more locality and prevents massive
    migration overhead on those asymmetric systems

    - Fix a memory corruption bug in the scheduler debug code caused by
    handing a modified buffer pointer to kfree()"

    * tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/debug: Fix memory corruption caused by multiple small reads of flags
    sched/fair: Prefer prev cpu in asymmetric wakeup path
    sched/fair: Ensure tasks spreading in LLC during LB

    Linus Torvalds
     

11 Nov, 2020

4 commits

    Reading /proc/sys/kernel/sched_domain/cpu*/domain0/flags multiple times
    with small reads causes oopses with slub corruption issues because the
    kfree is freeing an offset from a previous allocation. Fix this by adding
    a new pointer 'buf' for the allocation and kfree, and using the temporary
    pointer tmp to handle memory copies of the buf offsets.

    Fixes: 5b9f8ff7b320 ("sched/debug: Output SD flag names rather than their values")
    Reported-by: Jeff Bastian
    Signed-off-by: Colin Ian King
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20201029151103.373410-1-colin.king@canonical.com

    Colin Ian King
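
    The general shape of the bug and its fix, as a hedged sketch (not the
    literal sched/debug code): kfree() must be given the pointer returned by
    the allocator, so keep it separate from the cursor that walks the buffer.

        /* buggy: the allocation pointer itself is advanced, then freed */
        char *tmp = kmalloc(len, GFP_KERNEL);
        /* ... tmp += copied;  on each partial read ... */
        kfree(tmp);             /* frees an offset into the allocation */

        /* fixed: keep the original pointer for kfree() */
        char *buf = kmalloc(len, GFP_KERNEL);
        char *tmp = buf;
        /* ... tmp += copied;  on each partial read ... */
        kfree(buf);             /* always frees what kmalloc() returned */
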
     
    During the fast wakeup path, the scheduler always checks whether the local
    or prev CPUs are good candidates for the task before looking for other
    CPUs in the domain. With commit b7a331615d25 ("sched/fair: Add asymmetric
    CPU capacity wakeup scan") heterogeneous systems gained a dedicated path,
    but it doesn't try to reuse the prev CPU whenever possible. If the
    previous CPU is idle and belongs to the LLC domain, we should check it
    first before looking for another CPU, because it remains one of the best
    candidates and this also stabilizes task placement on the system.

    This change aligns the asymmetric path behavior with the symmetric one and
    reduces cases where the task migrates across all CPUs of the
    sd_asym_cpucapacity domains at wakeup.

    This change does not impact the normal EAS mode, but only the overloaded
    case or when EAS is not used.

    - On hikey960 with performance governor (EAS disabled)

      ./perf bench sched pipe -T -l 50000
                     mainline            w/ patch
      # migrations     999364                   0
      ops/sec          149313 (+/-0.28%)   182587 (+/-0.40%)   +22%

    - On hikey with performance governor

      ./perf bench sched pipe -T -l 50000
                     mainline            w/ patch
      # migrations          0                   0
      ops/sec           47721 (+/-0.76%)    47899 (+/-0.56%)   +0.4%

    According to tests on hikey, the patch doesn't impact symmetric systems
    compared to the current implementation (only tested on arm64).

    Also read the uclamped value of the task's utilization at most twice,
    instead of each time we compare the task's utilization with the CPU's
    capacity.

    Fixes: b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan")
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Dietmar Eggemann
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20201029161824.26389-1-vincent.guittot@linaro.org

    Vincent Guittot
     
    schbench shows a latency increase for the 95th percentile and above since
    commit 0b0695f2b34a ("sched/fair: Rework load_balance()").

    Align the behavior of the load balancer with the wakeup path, which tries
    to select an idle CPU which belongs to the LLC for a waking task.

    calculate_imbalance() will use nr_running instead of the spare capacity
    when CPUs share resources (i.e. cache) at the domain level. This will
    ensure a better spread of tasks on idle CPUs.

    Running schbench on a hikey (8cores arm64) shows the problem:

    tip/sched/core :
    schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10
    Latency percentiles (usec)
    50.0th: 33
    75.0th: 45
    90.0th: 51
    95.0th: 4152
    *99.0th: 14288
    99.5th: 14288
    99.9th: 14288
    min=0, max=14276

    tip/sched/core + patch :
    schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10
    Latency percentiles (usec)
    50.0th: 34
    75.0th: 47
    90.0th: 52
    95.0th: 78
    *99.0th: 94
    99.5th: 94
    99.9th: 94
    min=0, max=94

    Fixes: 0b0695f2b34a ("sched/fair: Rework load_balance()")
    Reported-by: Chris Mason
    Suggested-by: Rik van Riel
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Rik van Riel
    Tested-by: Rik van Riel
    Link: https://lkml.kernel.org/r/20201102102457.28808-1-vincent.guittot@linaro.org

    Vincent Guittot
     
    A new cpufreq governor flag will be added subsequently, so replace the
    bool dynamic_switching field in struct cpufreq_governor with a flags
    field and introduce CPUFREQ_GOV_DYNAMIC_SWITCHING for the "dynamic
    switching" governors to set instead of it.

    No intentional functional impact.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar

    Rafael J. Wysocki
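
    Roughly, the change replaces the boolean with a bit in a flags word,
    along the lines of this hedged sketch of struct cpufreq_governor:

        /* before */
        struct cpufreq_governor {
                /* ... */
                bool dynamic_switching;
        };

        /* after */
        #define CPUFREQ_GOV_DYNAMIC_SWITCHING   BIT(0)

        struct cpufreq_governor {
                /* ... */
                u8 flags;
        };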
     

03 Nov, 2020

1 commit

  • The cpufreq policy's frequency limits (min/max) can get changed at any
    point of time, while schedutil is trying to update the next frequency.
    Though the schedutil governor has necessary locking and support in place
    to make sure we don't miss any of those updates, there is a corner case
    where the governor will find that the CPU is already running at the
    desired frequency and so may skip an update.

    For example, consider that the CPU can run at 1 GHz, 1.2 GHz and 1.4 GHz
    and is running at 1 GHz currently. Schedutil tries to update the
    frequency to 1.2 GHz, during this time the policy limits get changed as
    policy->min = 1.4 GHz. As schedutil (and cpufreq core) does clamp the
    frequency at various instances, we will eventually set the frequency to
    1.4 GHz, while we will save 1.2 GHz in sg_policy->next_freq.

    Now let's say the policy limits get changed back at this time with
    policy->min as 1 GHz. The next time schedutil is invoked by the
    scheduler, we will reevaluate the next frequency (because
    need_freq_update will get set due to the limits change event) and let's
    say we want to set the frequency to 1.2 GHz again. At this point
    sugov_update_next_freq() will find next_freq == current_freq and will
    abort the update, while the CPU actually runs at 1.4 GHz.

    Until now need_freq_update was used as a flag to indicate that the
    policy's frequency limits have changed, and that we should consider the
    new limits while reevaluating the next frequency.

    This patch fixes the above mentioned issue by extending the purpose of
    the need_freq_update flag. If this flag is set now, the schedutil
    governor will not try to abort a frequency change even if next_freq ==
    current_freq.

    As similar behavior is required in the case of
    CPUFREQ_NEED_UPDATE_LIMITS flag as well, need_freq_update will never be
    set to false if that flag is set for the driver.

    We also don't need to consider the need_freq_update flag in
    sugov_update_single() anymore to handle the special case of busy CPU, as
    we won't abort a frequency update anymore.

    Reported-by: zhuguangqing
    Suggested-by: Rafael J. Wysocki
    Signed-off-by: Viresh Kumar
    [ rjw: Rearrange code to avoid a branch ]
    Signed-off-by: Rafael J. Wysocki

    Viresh Kumar
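
    A hedged sketch of the resulting logic in sugov_update_next_freq()
    (simplified): a pending limits change always forces the update through,
    and need_freq_update stays set while the driver has
    CPUFREQ_NEED_UPDATE_LIMITS.

        static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
                                           unsigned int next_freq)
        {
                if (sg_policy->need_freq_update)
                        sg_policy->need_freq_update =
                                cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
                else if (sg_policy->next_freq == next_freq)
                        return false;   /* only skip when no limits update is pending */

                sg_policy->next_freq = next_freq;
                sg_policy->last_freq_update_time = time;

                return true;
        }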
     

29 Oct, 2020

1 commit

  • Because sugov_update_next_freq() may skip a frequency update even if
    the need_freq_update flag has been set for the policy at hand, policy
    limits updates may not take effect as expected.

    For example, if the intel_pstate driver operates in the passive mode
    with HWP enabled, it needs to update the HWP min and max limits when
    the policy min and max limits change, respectively, but that may not
    happen if the target frequency does not change along with the limit
    at hand. In particular, if the policy min is changed first, causing
    the target frequency to be adjusted to it, and the policy max limit
    is changed later to the same value, the HWP max limit will not be
    updated to follow it as expected, because the target frequency is
    still equal to the policy min limit and it will not change until
    that limit is updated.

    To address this issue, modify get_next_freq() to let the driver
    callback run if the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag
    is set regardless of whether or not the new frequency to set is
    equal to the previous one.

    Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
    Reported-by: Zhang Rui
    Tested-by: Zhang Rui
    Cc: 5.9+ # 5.9+: 1c534352f47f cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ...
    Cc: 5.9+ # 5.9+: a62f68f5ca53 cpufreq: Introduce cpufreq_driver_test_flags()
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

26 Oct, 2020

2 commits

  • Use a more generic form for __section that requires quotes to avoid
    complications with clang and gcc differences.

    Remove the quote operator # from compiler_attributes.h __section macro.

    Convert all unquoted __section(foo) uses to quoted __section("foo").
    Also convert __attribute__((section("foo"))) uses to __section("foo")
    even if the __attribute__ has multiple list entry forms.

    Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

    Signed-off-by: Joe Perches
    Reviewed-by: Nick Desaulniers
    Reviewed-by: Miguel Ojeda
    Signed-off-by: Linus Torvalds

    Joe Perches
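
    The conversion amounts to the following (".data.example" is an
    illustrative section name):

        /* before: unquoted __section and open-coded attribute forms */
        static int foo __section(.data.example);
        static int bar __attribute__((section(".data.example")));

        /* after: one quoted __section form everywhere */
        static int foo __section(".data.example");
        static int bar __section(".data.example");
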
     
  • Pull scheduler fixes from Thomas Gleixner:
    "Two scheduler fixes:

    - A trivial build fix for sched_feat() to compile correctly with
    CONFIG_JUMP_LABEL=n

    - Replace a zero length array with a flexible array"

    * tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/features: Fix !CONFIG_JUMP_LABEL case
    sched: Replace zero-length array with flexible-array

    Linus Torvalds
     

24 Oct, 2020

1 commit

  • Pull more power management updates from Rafael Wysocki:
    "First of all, the adaptive voltage scaling (AVS) drivers go to new
    platform-specific locations as planned (this part was reported to have
    merge conflicts against the new arm-soc updates in linux-next).

    In addition to that, there are some fixes (intel_idle, intel_pstate,
    RAPL, acpi_cpufreq), the addition of on/off notifiers and idle state
    accounting support to the generic power domains (genpd) code and some
    janitorial changes all over.

    Specifics:

    - Move the AVS drivers to new platform-specific locations and get rid
    of the drivers/power/avs directory (Ulf Hansson).

    - Add on/off notifiers and idle state accounting support to the
    generic power domains (genpd) framework (Ulf Hansson, Lina Iyer).

    - Ulf will maintain the PM domain part of cpuidle-psci (Ulf Hansson).

    - Make intel_idle disregard ACPI _CST if it cannot use the data
    returned by that method (Mel Gorman).

    - Modify intel_pstate to avoid leaving useless sysfs directory
    structure behind if it cannot be registered (Chen Yu).

    - Fix domain detection in the RAPL power capping driver and prevent
    it from failing to enumerate the Psys RAPL domain (Zhang Rui).

    - Allow acpi-cpufreq to use ACPI _PSD information with Family 19 and
    later AMD chips (Wei Huang).

    - Update the driver assumptions comment in intel_idle and fix a
    kerneldoc comment in the runtime PM framework (Alexander Monakov,
    Bean Huo).

    - Avoid unnecessary resets of the cached frequency in the schedutil
    cpufreq governor to reduce overhead (Wei Wang).

    - Clean up the cpufreq core a bit (Viresh Kumar).

    - Make assorted minor janitorial changes (Daniel Lezcano, Geert
    Uytterhoeven, Hubert Jasudowicz, Tom Rix).

    - Clean up and optimize the cpupower utility somewhat (Colin Ian
    King, Martin Kaistra)"

    * tag 'pm-5.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
    PM: sleep: remove unreachable break
    PM: AVS: Drop the avs directory and the corresponding Kconfig
    PM: AVS: qcom-cpr: Move the driver to the qcom specific drivers
    PM: runtime: Fix typo in pm_runtime_set_active() helper comment
    PM: domains: Fix build error for genpd notifiers
    powercap: Fix typo in Kconfig "Plance" -> "Plane"
    cpufreq: schedutil: restore cached freq when next_f is not changed
    acpi-cpufreq: Honor _PSD table setting on new AMD CPUs
    PM: AVS: smartreflex Move driver to soc specific drivers
    PM: AVS: rockchip-io: Move the driver to the rockchip specific drivers
    PM: domains: enable domain idle state accounting
    PM: domains: Add curly braces to delimit comment + statement block
    PM: domains: Add support for PM domain on/off notifiers for genpd
    powercap/intel_rapl: enumerate Psys RAPL domain together with package RAPL domain
    powercap/intel_rapl: Fix domain detection
    intel_idle: Ignore _CST if control cannot be taken from the platform
    cpuidle: Remove pointless stub
    intel_idle: mention assumption that WBINVD is not needed
    MAINTAINERS: Add section for cpuidle-psci PM domain
    cpufreq: intel_pstate: Delete intel_pstate sysfs if failed to register the driver
    ...

    Linus Torvalds
     

19 Oct, 2020

1 commit

    We have the raw cached freq to reduce the chance of calling the cpufreq
    driver, which can be costly on some archs/SoCs.

    Currently, the raw cached freq is reset in sugov_update_single() when it
    avoids a frequency reduction (which is sometimes not desirable), but it
    is better to restore its previous value in that case, because it may not
    change in the next cycle and it is then not necessary to change the CPU
    frequency.

    Adapted from https://android-review.googlesource.com/1352810/

    Signed-off-by: Wei Wang
    Acked-by: Viresh Kumar
    [ rjw: Subject edit and changelog rewrite ]
    Signed-off-by: Rafael J. Wysocki

    Wei Wang
     

18 Oct, 2020

1 commit

  • A previous commit changed the notification mode from true/false to an
    int, allowing notify-no, notify-yes, or signal-notify. This was
    backwards compatible in the sense that any existing true/false user
    would translate to either 0 (no notification sent) or 1, the latter of
    which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.

    Clean this up properly, and define a proper enum for the notification
    mode. Now we have:

    - TWA_NONE. This is 0, same as before the original change, meaning no
    notification requested.
    - TWA_RESUME. This is 1, same as before the original change, meaning
    that we use TIF_NOTIFY_RESUME.
    - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
    notification.

    Clean up all the callers, switching their 0/1/false/true to using the
    appropriate TWA_* mode for notifications.

    Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()")
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Jens Axboe

    Jens Axboe
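
    The resulting modes, as a sketch of the enum (matching the values
    described above):

        enum task_work_notify_mode {
                TWA_NONE,       /* 0: no notification */
                TWA_RESUME,     /* 1: notify via TIF_NOTIFY_RESUME */
                TWA_SIGNAL,     /* 2: notify via TIF_SIGPENDING/JOBCTL_TASK_WORK */
        };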
     

15 Oct, 2020

3 commits

  • Commit:

    765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")

    made sched features static for !CONFIG_SCHED_DEBUG configurations, but
    overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL cases.

    For the latter, echoing changes to /sys/kernel/debug/sched_features has
    the nasty effect of effectively changing what sched_features reports,
    but without actually changing the scheduler behaviour (since different
    translation units get different sysctl_sched_features).

    Fix CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configurations by properly
    restructuring ifdefs.

    Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")
    Co-developed-by: Daniel Bristot de Oliveira
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Juri Lelli
    Signed-off-by: Ingo Molnar
    Acked-by: Patrick Bellasi
    Reviewed-by: Valentin Schneider
    Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com

    Juri Lelli
     
  • In the following commit:

    04f5c362ec6d: ("sched/fair: Replace zero-length array with flexible-array")

    a zero-length array cpumask[0] has been replaced with cpumask[].
    But there is still a cpumask[0] in 'struct sched_group_capacity'
    which was missed.

    The point of using [] instead of [0] is that with [] the compiler will
    generate a build warning if it isn't the last member of a struct.

    [ mingo: Rewrote the changelog. ]

    Signed-off-by: zhuguangqing
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20201014140220.11384-1-zhuguangqing83@gmail.com

    zhuguangqing
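
    The change itself is small; the pattern is:

        /* before: zero-length array, no diagnostics if it is not last */
        struct sched_group_capacity {
                /* ... */
                unsigned long cpumask[0];
        };

        /* after: flexible array member, must be the last member */
        struct sched_group_capacity {
                /* ... */
                unsigned long cpumask[];
        };
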
     
  • Pull power management updates from Rafael Wysocki:
    "These rework the collection of cpufreq statistics to allow it to take
    place if fast frequency switching is enabled in the governor, rework
    the frequency invariance handling in the cpufreq core and drivers, add
    new hardware support to a couple of cpufreq drivers, fix a number of
    assorted issues and clean up the code all over.

    Specifics:

    - Rework cpufreq statistics collection to allow it to take place when
    fast frequency switching is enabled in the governor (Viresh Kumar).

    - Make the cpufreq core set the frequency scale on behalf of the
    driver and update several cpufreq drivers accordingly (Ionela
    Voinescu, Valentin Schneider).

    - Add new hardware support to the STI and qcom cpufreq drivers and
    improve them (Alain Volmat, Manivannan Sadhasivam).

    - Fix multiple assorted issues in cpufreq drivers (Jon Hunter,
    Krzysztof Kozlowski, Matthias Kaehlcke, Pali Rohár, Stephan
    Gerhold, Viresh Kumar).

    - Fix several assorted issues in the operating performance points
    (OPP) framework (Stephan Gerhold, Viresh Kumar).

    - Allow devfreq drivers to fetch devfreq instances by DT enumeration
    instead of using explicit phandles and modify the devfreq core code
    to support driver-specific devfreq DT bindings (Leonard Crestez,
    Chanwoo Choi).

    - Improve initial hardware resetting in the tegra30 devfreq driver
    and clean up the tegra cpuidle driver (Dmitry Osipenko).

    - Update the cpuidle core to collect state entry rejection statistics
    and expose them via sysfs (Lina Iyer).

    - Improve the ACPI _CST code handling diagnostics (Chen Yu).

    - Update the PSCI cpuidle driver to allow the PM domain
    initialization to occur in the OSI mode as well as in the PC mode
    (Ulf Hansson).

    - Rework the generic power domains (genpd) core code to allow domain
    power off transition to be aborted in the absence of the "power
    off" domain callback (Ulf Hansson).

    - Fix two suspend-to-idle issues in the ACPI EC driver (Rafael
    Wysocki).

    - Fix the handling of timer_expires in the PM-runtime framework on
    32-bit systems and the handling of device links in it (Grygorii
    Strashko, Xiang Chen).

    - Add IO requests batching support to the hibernate image saving and
    reading code and drop a bogus get_gendisk() from there (Xiaoyi
    Chen, Christoph Hellwig).

    - Allow PCIe ports to be put into the D3cold power state if they are
    power-manageable via ACPI (Lukas Wunner).

    - Add missing header file include to a power capping driver (Pujin
    Shi).

    - Clean up the qcom-cpr AVS driver a bit (Liu Shixin).

    - Kevin Hilman steps down as designated reviewer of adaptive voltage
    scaling (AVS) drivers (Kevin Hilman)"

    * tag 'pm-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
    cpufreq: stats: Fix string format specifier mismatch
    arm: disable frequency invariance for CONFIG_BL_SWITCHER
    cpufreq,arm,arm64: restructure definitions of arch_set_freq_scale()
    cpufreq: stats: Add memory barrier to store_reset()
    cpufreq: schedutil: Simplify sugov_fast_switch()
    ACPI: EC: PM: Drop ec_no_wakeup check from acpi_ec_dispatch_gpe()
    ACPI: EC: PM: Flush EC work unconditionally after wakeup
    PCI/ACPI: Whitelist hotplug ports for D3 if power managed by ACPI
    PM: hibernate: remove the bogus call to get_gendisk() in software_resume()
    cpufreq: Move traces and update to policy->cur to cpufreq core
    cpufreq: stats: Enable stats for fast-switch as well
    cpufreq: stats: Mark few conditionals with unlikely()
    cpufreq: stats: Remove locking
    cpufreq: stats: Defer stats update to cpufreq_stats_record_transition()
    PM: domains: Allow to abort power off when no ->power_off() callback
    PM: domains: Rename power state enums for genpd
    PM / devfreq: tegra30: Improve initial hardware resetting
    PM / devfreq: event: Change prototype of devfreq_event_get_edev_by_phandle function
    PM / devfreq: Change prototype of devfreq_get_devfreq_by_phandle function
    PM / devfreq: Add devfreq_get_devfreq_by_node function
    ...

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - reorganize & clean up the SD* flags definitions and add a bunch of
    sanity checks. These new checks caught quite a few bugs or at least
    inconsistencies, resulting in another set of patches.

    - rseq updates, add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ

    - add a new tracepoint to improve CPU capacity tracking

    - improve overloaded SMP system load-balancing behavior

    - tweak SMT balancing

    - energy-aware scheduling updates

    - NUMA balancing improvements

    - deadline scheduler fixes and improvements

    - CPU isolation fixes

    - misc cleanups, simplifications and smaller optimizations

    * tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
    sched/deadline: Unthrottle PI boosted threads while enqueuing
    sched/debug: Add new tracepoint to track cpu_capacity
    sched/fair: Tweak pick_next_entity()
    rseq/selftests: Test MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
    rseq/selftests,x86_64: Add rseq_offset_deref_addv()
    rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
    sched/fair: Use dst group while checking imbalance for NUMA balancer
    sched/fair: Reduce busy load balance interval
    sched/fair: Minimize concurrent LBs between domain level
    sched/fair: Reduce minimal imbalance threshold
    sched/fair: Relax constraint on task's load during load balance
    sched/fair: Remove the force parameter of update_tg_load_avg()
    sched/fair: Fix wrong cpu selecting from isolated domain
    sched: Remove unused inline function uclamp_bucket_base_value()
    sched/rt: Disable RT_RUNTIME_SHARE by default
    sched/deadline: Fix stale throttling on de-/boosted tasks
    sched/numa: Use runnable_avg to classify node
    sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL
    MAINTAINERS: Add myself as SCHED_DEADLINE reviewer
    sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h
    ...

    Linus Torvalds
     

07 Oct, 2020

1 commit


05 Oct, 2020

1 commit


03 Oct, 2020

3 commits

  • stress-ng has a test (stress-ng --cyclic) that creates a set of threads
    under SCHED_DEADLINE with the following parameters:

    dl_runtime = 10000 (10 us)
    dl_deadline = 100000 (100 us)
    dl_period = 100000 (100 us)

    These parameters are very aggressive. When using a system without HRTICK
    set, these threads can easily execute longer than the dl_runtime because
    the throttling happens with 1/HZ resolution.

    During the main part of the test, the system works just fine because
    the workload does not try to run over the 10 us. The problem happens at
    the end of the test, on the exit() path. During exit(), the threads need
    to do some cleanups that require real-time mutex locks, mainly those
    related to memory management, resulting in this scenario:

    Note: locks are rt_mutexes...
    ------------------------------------------------------------------------
       TASK A:              TASK B:                 TASK C:
       activation
                            activation
                                                    activation

       lock(a): OK!         lock(b): OK!
                            lock(a)
                            -> block (task A owns it)
                              -> self notice/set throttled
     +--<                     -> arm replenished timer
     |                      switch-out
     |                                              lock(b)
     |                                              -> (C prio > B prio)
     |                                              -> boost TASK B
     |  unlock(a)                                   switch-out
     |  -> handle lock a to B
     |    -> wakeup(B)
     |      -> B is throttled:
     |        -> do not enqueue
     |  switch-out
     |
     |
     +---------------------> replenishment timer
                            -> TASK B is boosted:
                              -> do not enqueue
    ------------------------------------------------------------------------

    BOOM: TASK B is runnable but !enqueued, holding TASK C: the system
    crashes with hung task C.

    This problem is avoided by removing the throttle state from the boosted
    thread while boosting it (by TASK A in the example above), allowing it to
    be queued and run boosted.

    The next replenishment will take care of the runtime overrun, pushing
    the deadline further away. See the "while (dl_se->runtime <= 0)" loop in
    replenish_dl_entity().
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Juri Lelli
    Tested-by: Mark Simmons
    Link: https://lkml.kernel.org/r/5076e003450835ec74e6fa5917d02c4fa41687e6.1600170294.git.bristot@redhat.com

    Daniel Bristot de Oliveira
     
  • rq->cpu_capacity is a key element in several scheduler parts, such as EAS
    task placement and load balancing. Tracking this value enables testing
    and/or debugging by a toolkit.

    Signed-off-by: Vincent Donnefort
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com

    Vincent Donnefort
     
  • Currently, pick_next_entity(...) has the following structure
    (simplified):

    [...]
    if (last_buddy_ok())
            result = last_buddy;
    if (next_buddy_ok())
            result = next_buddy;
    [...]

    The intended behavior is to prefer next buddy over last buddy;
    the current code somewhat obfuscates this, and also wastes
    cycles checking the last buddy when eventually the next buddy is
    picked up.

    So this patch refactors two 'ifs' above into

    [...]
    if (next_buddy_ok())
            result = next_buddy;
    else if (last_buddy_ok())
            result = last_buddy;
    [...]

    Signed-off-by: Peter Oskolkov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/20200930173532.1069092-1-posk@google.com

    Peter Oskolkov
     

25 Sep, 2020

8 commits

  • This patchset is based on Google-internal RSEQ work done by Paul
    Turner and Andrew Hunter.

    When working with per-CPU RSEQ-based memory allocations, it is
    sometimes important to make sure that a global memory location is no
    longer accessed from RSEQ critical sections. For example, there can be
    two per-CPU lists, one is "active" and accessed per-CPU, while another
    one is inactive and worked on asynchronously "off CPU" (e.g. garbage
    collection is performed). Then at some point the two lists are
    swapped, and a fast RCU-like mechanism is required to make sure that
    the previously active list is no longer accessed.

    This patch introduces such a mechanism: in short, membarrier() syscall
    issues an IPI to a CPU, restarting a potentially active RSEQ critical
    section on the CPU.

    Signed-off-by: Peter Oskolkov
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mathieu Desnoyers
    Link: https://lkml.kernel.org/r/20200923233618.2572849-1-posk@google.com

    Peter Oskolkov
     
  • Barry Song noted the following

    Something is wrong. In find_busiest_group(), we are checking if
    src has higher load, however, in task_numa_find_cpu(), we are
    checking if dst will have higher load after balancing. It seems
    it is not sensible to check src.

    It maybe cause wrong imbalance value, for example,

    if dst_running = env->dst_stats.nr_running + 1 results in 3 or
    above, and src_running = env->src_stats.nr_running - 1 results
    in 1;

    The current code is thinking imbalance as 0 since src_running is
    smaller than 2. This is inconsistent with load balancer.

    Basically, in find_busiest_group(), the NUMA imbalance is ignored if moving
    a task "from an almost idle domain" to a "domain with spare capacity". This
    patch forbids movement "from a misplaced domain" to "an almost idle domain"
    as that is closer to what the CPU load balancer expects.

    This patch is not a universal win. The old behaviour was intended to allow
    a task from an almost idle NUMA node to migrate to its preferred node if
    the destination had capacity but there are corner cases. For example,
    a NAS compute load could be parallelised to use 1/3rd of available CPUs
    but not all those potential tasks are active at all times allowing this
    logic to trigger. An obvious example is specjbb 2005 running various
    numbers of warehouses on a 2 socket box with 80 cpus.

    specjbb
    5.9.0-rc4 5.9.0-rc4
    vanilla dstbalance-v1r1
    Hmean tput-1 46425.00 ( 0.00%) 43394.00 * -6.53%*
    Hmean tput-2 98416.00 ( 0.00%) 96031.00 * -2.42%*
    Hmean tput-3 150184.00 ( 0.00%) 148783.00 * -0.93%*
    Hmean tput-4 200683.00 ( 0.00%) 197906.00 * -1.38%*
    Hmean tput-5 236305.00 ( 0.00%) 245549.00 * 3.91%*
    Hmean tput-6 281559.00 ( 0.00%) 285692.00 * 1.47%*
    Hmean tput-7 338558.00 ( 0.00%) 334467.00 * -1.21%*
    Hmean tput-8 340745.00 ( 0.00%) 372501.00 * 9.32%*
    Hmean tput-9 424343.00 ( 0.00%) 413006.00 * -2.67%*
    Hmean tput-10 421854.00 ( 0.00%) 434261.00 * 2.94%*
    Hmean tput-11 493256.00 ( 0.00%) 485330.00 * -1.61%*
    Hmean tput-12 549573.00 ( 0.00%) 529959.00 * -3.57%*
    Hmean tput-13 593183.00 ( 0.00%) 555010.00 * -6.44%*
    Hmean tput-14 588252.00 ( 0.00%) 599166.00 * 1.86%*
    Hmean tput-15 623065.00 ( 0.00%) 642713.00 * 3.15%*
    Hmean tput-16 703924.00 ( 0.00%) 660758.00 * -6.13%*
    Hmean tput-17 666023.00 ( 0.00%) 697675.00 * 4.75%*
    Hmean tput-18 761502.00 ( 0.00%) 758360.00 * -0.41%*
    Hmean tput-19 796088.00 ( 0.00%) 798368.00 * 0.29%*
    Hmean tput-20 733564.00 ( 0.00%) 823086.00 * 12.20%*
    Hmean tput-21 840980.00 ( 0.00%) 856711.00 * 1.87%*
    Hmean tput-22 804285.00 ( 0.00%) 872238.00 * 8.45%*
    Hmean tput-23 795208.00 ( 0.00%) 889374.00 * 11.84%*
    Hmean tput-24 848619.00 ( 0.00%) 966783.00 * 13.92%*
    Hmean tput-25 750848.00 ( 0.00%) 903790.00 * 20.37%*
    Hmean tput-26 780523.00 ( 0.00%) 962254.00 * 23.28%*
    Hmean tput-27 1042245.00 ( 0.00%) 991544.00 * -4.86%*
    Hmean tput-28 1090580.00 ( 0.00%) 1035926.00 * -5.01%*
    Hmean tput-29 999483.00 ( 0.00%) 1082948.00 * 8.35%*
    Hmean tput-30 1098663.00 ( 0.00%) 1113427.00 * 1.34%*
    Hmean tput-31 1125671.00 ( 0.00%) 1134175.00 * 0.76%*
    Hmean tput-32 968167.00 ( 0.00%) 1250286.00 * 29.14%*
    Hmean tput-33 1077676.00 ( 0.00%) 1060893.00 * -1.56%*
    Hmean tput-34 1090538.00 ( 0.00%) 1090933.00 * 0.04%*
    Hmean tput-35 967058.00 ( 0.00%) 1107421.00 * 14.51%*
    Hmean tput-36 1051745.00 ( 0.00%) 1210663.00 * 15.11%*
    Hmean tput-37 1019465.00 ( 0.00%) 1351446.00 * 32.56%*
    Hmean tput-38 1083102.00 ( 0.00%) 1064541.00 * -1.71%*
    Hmean tput-39 1232990.00 ( 0.00%) 1303623.00 * 5.73%*
    Hmean tput-40 1175542.00 ( 0.00%) 1340943.00 * 14.07%*
    Hmean tput-41 1127826.00 ( 0.00%) 1339492.00 * 18.77%*
    Hmean tput-42 1198313.00 ( 0.00%) 1411023.00 * 17.75%*
    Hmean tput-43 1163733.00 ( 0.00%) 1228253.00 * 5.54%*
    Hmean tput-44 1305562.00 ( 0.00%) 1357886.00 * 4.01%*
    Hmean tput-45 1326752.00 ( 0.00%) 1406061.00 * 5.98%*
    Hmean tput-46 1339424.00 ( 0.00%) 1418451.00 * 5.90%*
    Hmean tput-47 1415057.00 ( 0.00%) 1381570.00 * -2.37%*
    Hmean tput-48 1392003.00 ( 0.00%) 1421167.00 * 2.10%*
    Hmean tput-49 1408374.00 ( 0.00%) 1418659.00 * 0.73%*
    Hmean tput-50 1359822.00 ( 0.00%) 1391070.00 * 2.30%*
    Hmean tput-51 1414246.00 ( 0.00%) 1392679.00 * -1.52%*
    Hmean tput-52 1432352.00 ( 0.00%) 1354020.00 * -5.47%*
    Hmean tput-53 1387563.00 ( 0.00%) 1409563.00 * 1.59%*
    Hmean tput-54 1406420.00 ( 0.00%) 1388711.00 * -1.26%*
    Hmean tput-55 1438804.00 ( 0.00%) 1387472.00 * -3.57%*
    Hmean tput-56 1399465.00 ( 0.00%) 1400296.00 * 0.06%*
    Hmean tput-57 1428132.00 ( 0.00%) 1396399.00 * -2.22%*
    Hmean tput-58 1432385.00 ( 0.00%) 1386253.00 * -3.22%*
    Hmean tput-59 1421612.00 ( 0.00%) 1371416.00 * -3.53%*
    Hmean tput-60 1429423.00 ( 0.00%) 1389412.00 * -2.80%*
    Hmean tput-61 1396230.00 ( 0.00%) 1351122.00 * -3.23%*
    Hmean tput-62 1418396.00 ( 0.00%) 1383098.00 * -2.49%*
    Hmean tput-63 1409918.00 ( 0.00%) 1374662.00 * -2.50%*
    Hmean tput-64 1410236.00 ( 0.00%) 1376216.00 * -2.41%*
    Hmean tput-65 1396405.00 ( 0.00%) 1364418.00 * -2.29%*
    Hmean tput-66 1395975.00 ( 0.00%) 1357326.00 * -2.77%*
    Hmean tput-67 1392986.00 ( 0.00%) 1349642.00 * -3.11%*
    Hmean tput-68 1386541.00 ( 0.00%) 1343261.00 * -3.12%*
    Hmean tput-69 1374407.00 ( 0.00%) 1342588.00 * -2.32%*
    Hmean tput-70 1377513.00 ( 0.00%) 1334654.00 * -3.11%*
    Hmean tput-71 1369319.00 ( 0.00%) 1334952.00 * -2.51%*
    Hmean tput-72 1354635.00 ( 0.00%) 1329005.00 * -1.89%*
    Hmean tput-73 1350933.00 ( 0.00%) 1318942.00 * -2.37%*
    Hmean tput-74 1351714.00 ( 0.00%) 1316347.00 * -2.62%*
    Hmean tput-75 1352198.00 ( 0.00%) 1309974.00 * -3.12%*
    Hmean tput-76 1349490.00 ( 0.00%) 1286064.00 * -4.70%*
    Hmean tput-77 1336131.00 ( 0.00%) 1303684.00 * -2.43%*
    Hmean tput-78 1308896.00 ( 0.00%) 1271024.00 * -2.89%*
    Hmean tput-79 1326703.00 ( 0.00%) 1290862.00 * -2.70%*
    Hmean tput-80 1336199.00 ( 0.00%) 1291629.00 * -3.34%*

    The performance at the mid-point is better but not universally better. The
    patch is a mixed bag depending on the workload, machine and overall
    levels of utilisation. Sometimes it's better (sometimes much better),
    other times it is worse (sometimes much worse). Given that there isn't a
    universally good decision in this section and more people seem to prefer
    the patch, it may be best to keep the LB decisions consistent and
    revisit imbalance handling when the load balancer code changes settle down.

    Jirka Hladky added the following observation.

    Our results are mostly in line with what you see. We observe
    big gains (20-50%) when the system is loaded to 1/3 of the
    maximum capacity and mixed results at the full load - some
    workloads benefit from the patch at the full load, others not,
    but performance changes at the full load are mostly within the
    noise of results (+/-5%). Overall, we think this patch is helpful.

    [mgorman@techsingularity.net: Rewrote changelog]
    Fixes: fb86f5b211 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity")
    Signed-off-by: Barry Song
    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200921221849.GI3179@techsingularity.net

    Barry Song
     
    The busy_factor, which increases the load balance interval when a CPU is
    busy, is set to 32 by default. This value generates some huge LB intervals
    on a large system like the THX2, made of 2 nodes x 28 cores x 4 threads.
    For such a system, the interval increases from 112ms to 3584ms at the MC
    level, and from 228ms to 7168ms at the NUMA level.

    Even on smaller systems, a lower busy factor has shown improvement in the
    fair distribution of the running time, so reduce it for all.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org

    Vincent Guittot
     
    Sched domains tend to trigger the load balance loop simultaneously, but
    the larger domains often need more time to collect statistics. This
    slowness makes the larger domain try to detach tasks from a rq whereas
    tasks have already migrated somewhere else at a sub-domain level. This is
    not a real problem for idle LB because the period of smaller domains will
    increase while their CPUs are busy, which leaves time for the higher ones
    to pull tasks. But this becomes a problem when all CPUs are already busy,
    because all domains stay synced when they trigger their LB.

    A simple way to minimize simultaneous LB of all domains is to decrement
    the busy interval by 1 jiffy. Because of the busy_factor, the interval of
    a larger domain will no longer be a multiple of the smaller ones.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Link: https://lkml.kernel.org/r/20200921072424.14813-4-vincent.guittot@linaro.org

    Vincent Guittot
     
    The 25% default imbalance threshold for the DIE and NUMA domains is large
    enough to generate significant unfairness between threads. A typical
    example is the case of 11 threads running on 2x4 CPUs. The imbalance of
    20% between the 2 groups of 4 cores is just low enough to not trigger
    the load balance between the 2 groups. We will always have the same 6
    threads on one group of 4 CPUs and the other 5 threads on the other
    group of CPUs. With fair time sharing in each group, we end up with
    +20% running time for the group of 5 threads.

    Consider decreasing the imbalance threshold for overloaded case where we
    use the load to balance task and to ensure fair time sharing.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Acked-by: Hillf Danton
    Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org

    Vincent Guittot
     
    Some use cases, like 9 always-running tasks on 8 CPUs, can't be balanced,
    and the load balancer currently migrates the waiting task between the
    CPUs in an almost random manner. The success of a rq pulling a task
    depends on the value of nr_balance_failed of its domains and on its
    ability to be faster than others to detach it. This behavior results in
    an unfair distribution of the running time between tasks because some
    CPUs will run the same task most of the time, if not always, whereas
    others will share their time between several tasks.

    Instead of using nr_balance_failed as a boolean to relax the condition
    for detaching a task, the LB will use nr_balance_failed to relax the
    threshold between the tasks' load and the imbalance. This mechanism
    prevents the same rq or domain from always winning the load balance
    fight.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Link: https://lkml.kernel.org/r/20200921072424.14813-2-vincent.guittot@linaro.org

    Vincent Guittot
     
    In fair.c, sometimes update_tg_load_avg(cfs_rq, 0) is used and sometimes
    update_tg_load_avg(cfs_rq, false). update_tg_load_avg() has a 'force'
    parameter, but the current code never passes 1 or true for it, so remove
    the force parameter.

    Signed-off-by: Xianting Tian
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200924014755.36253-1-tian.xianting@h3c.com

    Xianting Tian
     
    We've met problems where tasks with a full cpumask (e.g. by putting them
    into a cpuset or setting full affinity) were occasionally migrated to our
    isolated CPUs in a production environment.

    After some analysis, we found that it is due to the current
    select_idle_smt() not considering the sched_domain mask.

    Steps to reproduce on my 31-CPU hyperthreads machine:
    1. with boot parameter: "isolcpus=domain,2-31"
    (thread lists: 0,16 and 1,17)
    2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
    3. some threads will be migrated to the isolated cpu16~17.

    Fix it by checking the valid domain mask in select_idle_smt().

    Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()")
    Reported-by: Wetp Zhang
    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Jiang Biao
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com

    Xunlei Pang
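
    A hedged sketch of the fix's idea (simplified relative to the actual
    select_idle_smt()): candidate SMT siblings must also lie inside the
    sched_domain being searched, which excludes the isolated CPUs.

        static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
        {
                int cpu;

                for_each_cpu(cpu, cpu_smt_mask(target)) {
                        if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
                            !cpumask_test_cpu(cpu, sched_domain_span(sd)))  /* the added check */
                                continue;
                        if (available_idle_cpu(cpu))
                                return cpu;
                }

                return -1;
        }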