11 Nov, 2020

3 commits

  • Defer softirq processing to ksoftirqd if an RT task is running or
    queued on the current CPU. This complements the RT task placement
    algorithm, which tries to find a CPU that is not currently busy with
    softirqs.

    Currently only the NET_TX, NET_RX, BLOCK and TASKLET softirqs are
    deferred, as they can potentially run for a long time (see the sketch
    at the end of this entry).

    Bug: 168521633
    Change-Id: Id7665244af6bbd5a96d9e591cf26154e9eaa860c
    Signed-off-by: Pavankumar Kondeti
    [satyap@codeaurora.org: trivial merge conflict resolution.]
    Signed-off-by: Satya Durga Srinivasu Prabhala
    [elavila: Port to mainline, squash with bugfix]
    Signed-off-by: J. Avila

    Pavankumar Kondeti
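
    A minimal standalone sketch of the deferral decision described above;
    the types, bit values and helper name are assumptions for illustration,
    not the kernel's definitions, and the real patch works on per-CPU state
    inside the softirq code.

        #include <stdbool.h>

        /* Softirq classes named in the commit text; the bit positions here
         * are illustrative only. */
        #define NET_TX_SOFTIRQ   1
        #define NET_RX_SOFTIRQ   2
        #define BLOCK_SOFTIRQ    3
        #define TASKLET_SOFTIRQ  4

        #define LONG_SOFTIRQ_MASK ((1u << NET_TX_SOFTIRQ) | \
                                   (1u << NET_RX_SOFTIRQ) | \
                                   (1u << BLOCK_SOFTIRQ)  | \
                                   (1u << TASKLET_SOFTIRQ))

        /* Defer to ksoftirqd only when a potentially long-running softirq is
         * pending and an RT task is running or queued on this CPU. */
        static bool defer_to_ksoftirqd(unsigned int pending, bool rt_task_on_cpu)
        {
                return rt_task_on_cpu && (pending & LONG_SOFTIRQ_MASK);
        }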
     
  • The scheduling change to avoid putting RT threads on cores that are
    handling softints was catching cases where there was no reason to
    believe the softint would take a long time, resulting in unnecessary
    migration overhead. This patch restricts the migration to cases where
    the core has a softint that is actually likely to take a long time,
    as opposed to the RCU, SCHED, and TIMER softints, which are rather
    quick (see the sketch at the end of this entry).

    Bug: 31752786
    Bug: 168521633
    Change-Id: Ib4e179f1e15c736b2fdba31070494e357e9fbbe2
    Signed-off-by: John Dias
    [elavila: Amend commit text for AOSP, port to mainline]
    Signed-off-by: J. Avila

    John Dias
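
    A companion sketch of the placement side, under the same assumptions
    (mocked per-CPU state, hypothetical helper name): skip CPUs whose
    pending softirqs include one of the long-running classes, and fall back
    to the normal RT placement path otherwise.

        #include <stdbool.h>

        #define NR_CPUS_SKETCH    8
        #define LONG_SOFTIRQ_MASK 0x1eu   /* illustrative mask, as above */

        /* Mocked view of each CPU's pending softirq bitmask. */
        static unsigned int pending_softirqs[NR_CPUS_SKETCH];

        /* Prefer a CPU with no long-running softirq pending; quick softirqs
         * such as RCU, SCHED and TIMER do not disqualify a CPU. */
        static int pick_cpu_for_rt(void)
        {
                for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
                        if (!(pending_softirqs[cpu] & LONG_SOFTIRQ_MASK))
                                return cpu;
                return -1;      /* no preference; use the regular path */
        }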
     
  • In certain audio use cases, scheduling RT threads on cores that are
    handling softirqs can lead to glitches. Prevent this behavior.

    Bug: 31501544
    Bug: 168521633
    Change-Id: I99dd7aaa12c11270b28dbabea484bcc8fb8ba0c1
    Signed-off-by: John Dias
    [elavila: Port to mainline, amend commit text]
    Signed-off-by: J. Avila

    John Dias
     

06 Nov, 2020

2 commits


03 Nov, 2020

1 commit

  • The cpufreq policy's frequency limits (min/max) can get changed at any
    point in time, while schedutil is trying to update the next frequency.
    Though the schedutil governor has the necessary locking and support in
    place to make sure we don't miss any of those updates, there is a corner
    case where the governor will find that the CPU is already running at the
    desired frequency and so may skip an update.

    For example, consider that the CPU can run at 1 GHz, 1.2 GHz and 1.4 GHz
    and is currently running at 1 GHz. Schedutil tries to update the
    frequency to 1.2 GHz, and during this time the policy limits get changed
    to policy->min = 1.4 GHz. As schedutil (and the cpufreq core) clamps the
    frequency at various points, we will eventually set the frequency to
    1.4 GHz, while saving 1.2 GHz in sg_policy->next_freq.

    Now let's say the policy limits get changed back at this time, with
    policy->min as 1 GHz. The next time schedutil is invoked by the
    scheduler, we will reevaluate the next frequency (because
    need_freq_update will get set due to the limits change event) and let's
    say we want to set the frequency to 1.2 GHz again. At this point
    sugov_update_next_freq() will find next_freq == current_freq and
    will abort the update, while the CPU actually runs at 1.4 GHz.

    Until now need_freq_update was used as a flag to indicate that the
    policy's frequency limits have changed, and that we should consider the
    new limits while reevaluating the next frequency.

    This patch fixes the above-mentioned issue by extending the purpose of
    the need_freq_update flag. If this flag is set now, the schedutil
    governor will not try to abort a frequency change even if next_freq ==
    current_freq (see the sketch below).

    As similar behavior is required in the case of the
    CPUFREQ_NEED_UPDATE_LIMITS flag as well, need_freq_update will never be
    set to false if that flag is set for the driver.

    We also don't need to consider the need_freq_update flag in
    sugov_update_single() anymore to handle the special case of a busy CPU,
    as we won't abort a frequency update anymore.

    Reported-by: zhuguangqing
    Suggested-by: Rafael J. Wysocki
    Signed-off-by: Viresh Kumar
    [ rjw: Rearrange code to avoid a branch ]
    Signed-off-by: Rafael J. Wysocki

    Viresh Kumar
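
    A standalone mock of the fixed decision, assuming nothing beyond the
    commit text (the struct below is not the kernel's struct sugov_policy):
    with need_freq_update set, an update is applied even when the requested
    frequency equals the last requested one, because the hardware may have
    been clamped elsewhere by a limits change.

        #include <stdbool.h>

        struct sg_policy_mock {
                unsigned int next_freq;   /* last frequency schedutil requested */
                bool need_freq_update;    /* set when policy limits changed */
        };

        static bool should_apply_freq(struct sg_policy_mock *sg,
                                      unsigned int next_freq)
        {
                /* Only skip when the frequency is unchanged AND no limits
                 * update is pending. */
                if (sg->next_freq == next_freq && !sg->need_freq_update)
                        return false;

                sg->need_freq_update = false;
                sg->next_freq = next_freq;
                return true;
        }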
     

02 Nov, 2020

1 commit


30 Oct, 2020

1 commit


29 Oct, 2020

2 commits

  • Because sugov_update_next_freq() may skip a frequency update even if
    the need_freq_update flag has been set for the policy at hand, policy
    limits updates may not take effect as expected.

    For example, if the intel_pstate driver operates in the passive mode
    with HWP enabled, it needs to update the HWP min and max limits when
    the policy min and max limits change, respectively, but that may not
    happen if the target frequency does not change along with the limit
    at hand. In particular, if the policy min is changed first, causing
    the target frequency to be adjusted to it, and the policy max limit
    is changed later to the same value, the HWP max limit will not be
    updated to follow it as expected, because the target frequency is
    still equal to the policy min limit and it will not change until
    that limit is updated.

    To address this issue, modify get_next_freq() to let the driver
    callback run if the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag
    is set regardless of whether or not the new frequency to set is
    equal to the previous one.

    Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
    Reported-by: Zhang Rui
    Tested-by: Zhang Rui
    Cc: 5.9+ # 5.9+: 1c534352f47f cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ...
    Cc: 5.9+ # 5.9+: a62f68f5ca53 cpufreq: Introduce cpufreq_driver_test_flags()
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
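
    A follow-on mock for the driver-flag case described above; the flag bit
    and struct are illustrative, not the cpufreq core's definitions. A
    driver that must observe every limits change (such as intel_pstate
    reprogramming HWP min/max) opts out of the equal-frequency shortcut.

        #include <stdbool.h>

        #define NEED_UPDATE_LIMITS_MOCK (1u << 0)   /* illustrative bit */

        struct cpufreq_driver_mock {
                unsigned int flags;
        };

        static bool skip_unchanged_freq(const struct cpufreq_driver_mock *drv,
                                        unsigned int cur, unsigned int next)
        {
                if (drv->flags & NEED_UPDATE_LIMITS_MOCK)
                        return false;      /* always invoke the driver */
                return cur == next;        /* otherwise skip a no-op switch */
        }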
     
  • Linux 5.10-rc1

    Signed-off-by: Greg Kroah-Hartman
    Change-Id: Iace3fc84a00d3023c75caa086a266de17dc1847c

    Greg Kroah-Hartman
     

27 Oct, 2020

1 commit


26 Oct, 2020

2 commits

  • Use a more generic form for __section that requires quotes to avoid
    complications with clang and gcc differences.

    Remove the quote operator # from compiler_attributes.h __section macro.

    Convert all unquoted __section(foo) uses to quoted __section("foo").
    Also convert __attribute__((section("foo"))) uses to __section("foo")
    even if the __attribute__ has multiple list entry forms.

    Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

    Signed-off-by: Joe Perches
    Reviewed-by: Nick Desaulniers
    Reviewed-by: Miguel Ojeda
    Signed-off-by: Linus Torvalds

    Joe Perches
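
    A small compilable illustration of the quoted form the conversion
    produces; the macro mirrors the post-change definition in
    compiler_attributes.h, while the section and variable names are made up.

        #include <stdio.h>

        /* Post-conversion form: the argument is already a quoted string, so
         * the macro no longer stringizes it with '#'. */
        #define __section(section) __attribute__((__section__(section)))

        /* Before: __section(.mydata)   After: __section(".mydata") */
        static const char banner[] __section(".mydata") = "placed in .mydata";

        int main(void)
        {
                printf("%s\n", banner);
                return 0;
        }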
     
  • Pull scheduler fixes from Thomas Gleixner:
    "Two scheduler fixes:

    - A trivial build fix for sched_feat() to compile correctly with
    CONFIG_JUMP_LABEL=n

    - Replace a zero length array with a flexible array"

    * tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/features: Fix !CONFIG_JUMP_LABEL case
    sched: Replace zero-length array with flexible-array

    Linus Torvalds
     

25 Oct, 2020

1 commit


24 Oct, 2020

2 commits

  • Pull more power management updates from Rafael Wysocki:
    "First of all, the adaptive voltage scaling (AVS) drivers go to new
    platform-specific locations as planned (this part was reported to have
    merge conflicts against the new arm-soc updates in linux-next).

    In addition to that, there are some fixes (intel_idle, intel_pstate,
    RAPL, acpi_cpufreq), the addition of on/off notifiers and idle state
    accounting support to the generic power domains (genpd) code and some
    janitorial changes all over.

    Specifics:

    - Move the AVS drivers to new platform-specific locations and get rid
    of the drivers/power/avs directory (Ulf Hansson).

    - Add on/off notifiers and idle state accounting support to the
    generic power domains (genpd) framework (Ulf Hansson, Lina Iyer).

    - Ulf will maintain the PM domain part of cpuidle-psci (Ulf Hansson).

    - Make intel_idle disregard ACPI _CST if it cannot use the data
    returned by that method (Mel Gorman).

    - Modify intel_pstate to avoid leaving useless sysfs directory
    structure behind if it cannot be registered (Chen Yu).

    - Fix domain detection in the RAPL power capping driver and prevent
    it from failing to enumerate the Psys RAPL domain (Zhang Rui).

    - Allow acpi-cpufreq to use ACPI _PSD information with Family 19 and
    later AMD chips (Wei Huang).

    - Update the driver assumptions comment in intel_idle and fix a
    kerneldoc comment in the runtime PM framework (Alexander Monakov,
    Bean Huo).

    - Avoid unnecessary resets of the cached frequency in the schedutil
    cpufreq governor to reduce overhead (Wei Wang).

    - Clean up the cpufreq core a bit (Viresh Kumar).

    - Make assorted minor janitorial changes (Daniel Lezcano, Geert
    Uytterhoeven, Hubert Jasudowicz, Tom Rix).

    - Clean up and optimize the cpupower utility somewhat (Colin Ian
    King, Martin Kaistra)"

    * tag 'pm-5.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
    PM: sleep: remove unreachable break
    PM: AVS: Drop the avs directory and the corresponding Kconfig
    PM: AVS: qcom-cpr: Move the driver to the qcom specific drivers
    PM: runtime: Fix typo in pm_runtime_set_active() helper comment
    PM: domains: Fix build error for genpd notifiers
    powercap: Fix typo in Kconfig "Plance" -> "Plane"
    cpufreq: schedutil: restore cached freq when next_f is not changed
    acpi-cpufreq: Honor _PSD table setting on new AMD CPUs
    PM: AVS: smartreflex Move driver to soc specific drivers
    PM: AVS: rockchip-io: Move the driver to the rockchip specific drivers
    PM: domains: enable domain idle state accounting
    PM: domains: Add curly braces to delimit comment + statement block
    PM: domains: Add support for PM domain on/off notifiers for genpd
    powercap/intel_rapl: enumerate Psys RAPL domain together with package RAPL domain
    powercap/intel_rapl: Fix domain detection
    intel_idle: Ignore _CST if control cannot be taken from the platform
    cpuidle: Remove pointless stub
    intel_idle: mention assumption that WBINVD is not needed
    MAINTAINERS: Add section for cpuidle-psci PM domain
    cpufreq: intel_pstate: Delete intel_pstate sysfs if failed to register the driver
    ...

    Linus Torvalds
     
  • Enable idle drivers to wake up idle CPUs by exporting wake_up_if_idle().

    Bug: 169136276
    Signed-off-by: Lina Iyer
    Change-Id: If1529ad4b883f36de1692cd3ac1853ff722e3522

    Lina Iyer
     

21 Oct, 2020

1 commit


19 Oct, 2020

1 commit

  • We keep the raw cached freq to reduce the number of calls into the
    cpufreq driver, which can be costly on some architectures/SoCs.

    Currently, the raw cached freq is reset in sugov_update_single() when
    it avoids a frequency reduction (which is sometimes not desirable), but
    it is better to restore its previous value in that case, because it may
    not change in the next cycle and then it is not necessary to change the
    CPU frequency at all (see the sketch below).

    Adapted from https://android-review.googlesource.com/1352810/

    Signed-off-by: Wei Wang
    Acked-by: Viresh Kumar
    [ rjw: Subject edit and changelog rewrite ]
    Signed-off-by: Rafael J. Wysocki

    Wei Wang
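
    A mocked slice of the logic described above (not the kernel code): save
    the raw cached value before recomputing, and if the reduction is
    rejected for a busy CPU, put the saved value back instead of leaving it
    reset.

        struct sg_state_mock {
                unsigned int next_freq;
                unsigned int cached_raw_freq;
        };

        /* Stand-in for the real frequency computation, which also refreshes
         * the raw cached value as a side effect. */
        static unsigned int compute_next_freq(struct sg_state_mock *sg,
                                              unsigned int util)
        {
                sg->cached_raw_freq = util;
                return util;
        }

        static unsigned int update_single_sketch(struct sg_state_mock *sg,
                                                 unsigned int util, int busy)
        {
                unsigned int cached = sg->cached_raw_freq;
                unsigned int next_f = compute_next_freq(sg, util);

                if (busy && next_f < sg->next_freq) {
                        next_f = sg->next_freq;        /* don't slow a busy CPU */
                        sg->cached_raw_freq = cached;  /* restore, don't reset */
                }
                return next_f;
        }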
     

18 Oct, 2020

1 commit

  • A previous commit changed the notification mode from true/false to an
    int, allowing notify-no, notify-yes, or signal-notify. This was
    backwards compatible in the sense that any existing true/false user
    would translate to either 0 (no notification sent) or 1, the latter of
    which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.

    Clean this up properly, and define a proper enum for the notification
    mode. Now we have:

    - TWA_NONE. This is 0, same as before the original change, meaning no
    notification requested.
    - TWA_RESUME. This is 1, same as before the original change, meaning
    that we use TIF_NOTIFY_RESUME.
    - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
    notification.

    Clean up all the callers, switching their 0/1/false/true to the
    appropriate TWA_* mode for notifications.

    Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()")
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Jens Axboe

    Jens Axboe
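
    A standalone mock of the enum shape described above; the values mirror
    the commit text (0, 1, 2), but this is an illustration, not the kernel
    header.

        #include <stdio.h>

        enum twa_mode_mock {
                TWA_NONE,     /* 0: queue the work, send no notification */
                TWA_RESUME,   /* 1: notify via TIF_NOTIFY_RESUME */
                TWA_SIGNAL,   /* 2: notify via TIF_SIGPENDING/JOBCTL_TASK_WORK */
        };

        int main(void)
        {
                /* Callers now name the mode instead of passing 0/1/true/false. */
                enum twa_mode_mock notify = TWA_RESUME;

                printf("notification mode = %d\n", notify);
                return 0;
        }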
     

16 Oct, 2020

1 commit


15 Oct, 2020

3 commits

  • Commit:

    765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")

    made sched features static for !CONFIG_SCHED_DEBUG configurations, but
    overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL cases.

    For the latter, echoing changes to /sys/kernel/debug/sched_features has
    the nasty effect of effectively changing what sched_features reports,
    but without actually changing the scheduler behaviour (since different
    translation units get different copies of sysctl_sched_features).

    Fix CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configurations by properly
    restructuring ifdefs.

    Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")
    Co-developed-by: Daniel Bristot de Oliveira
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Juri Lelli
    Signed-off-by: Ingo Molnar
    Acked-by: Patrick Bellasi
    Reviewed-by: Valentin Schneider
    Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com

    Juri Lelli
     
  • In the following commit:

    04f5c362ec6d ("sched/fair: Replace zero-length array with flexible-array")

    a zero-length array cpumask[0] has been replaced with cpumask[].
    But there is still a cpumask[0] in 'struct sched_group_capacity'
    which was missed.

    The point of using [] instead of [0] is that with [] the compiler will
    generate a build warning if it isn't the last member of a struct.

    [ mingo: Rewrote the changelog. ]

    Signed-off-by: zhuguangqing
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20201014140220.11384-1-zhuguangqing83@gmail.com

    zhuguangqing
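
    A compilable illustration of the difference; the struct is made up. A
    flexible array member ([]) must be the last member, and unlike the old
    zero-length ([0]) GNU extension the compiler will warn if it is not.

        #include <stdlib.h>
        #include <string.h>

        struct mask_box {
                unsigned int nr_words;
                unsigned long mask[];   /* was: unsigned long mask[0]; */
        };

        int main(void)
        {
                unsigned int words = 4;
                struct mask_box *box =
                        malloc(sizeof(*box) + words * sizeof(unsigned long));

                if (!box)
                        return 1;
                box->nr_words = words;
                memset(box->mask, 0, words * sizeof(unsigned long));
                free(box);
                return 0;
        }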
     
  • Pull power management updates from Rafael Wysocki:
    "These rework the collection of cpufreq statistics to allow it to take
    place if fast frequency switching is enabled in the governor, rework
    the frequency invariance handling in the cpufreq core and drivers, add
    new hardware support to a couple of cpufreq drivers, fix a number of
    assorted issues and clean up the code all over.

    Specifics:

    - Rework cpufreq statistics collection to allow it to take place when
    fast frequency switching is enabled in the governor (Viresh Kumar).

    - Make the cpufreq core set the frequency scale on behalf of the
    driver and update several cpufreq drivers accordingly (Ionela
    Voinescu, Valentin Schneider).

    - Add new hardware support to the STI and qcom cpufreq drivers and
    improve them (Alain Volmat, Manivannan Sadhasivam).

    - Fix multiple assorted issues in cpufreq drivers (Jon Hunter,
    Krzysztof Kozlowski, Matthias Kaehlcke, Pali Rohár, Stephan
    Gerhold, Viresh Kumar).

    - Fix several assorted issues in the operating performance points
    (OPP) framework (Stephan Gerhold, Viresh Kumar).

    - Allow devfreq drivers to fetch devfreq instances by DT enumeration
    instead of using explicit phandles and modify the devfreq core code
    to support driver-specific devfreq DT bindings (Leonard Crestez,
    Chanwoo Choi).

    - Improve initial hardware resetting in the tegra30 devfreq driver
    and clean up the tegra cpuidle driver (Dmitry Osipenko).

    - Update the cpuidle core to collect state entry rejection statistics
    and expose them via sysfs (Lina Iyer).

    - Improve the ACPI _CST code handling diagnostics (Chen Yu).

    - Update the PSCI cpuidle driver to allow the PM domain
    initialization to occur in the OSI mode as well as in the PC mode
    (Ulf Hansson).

    - Rework the generic power domains (genpd) core code to allow domain
    power off transition to be aborted in the absence of the "power
    off" domain callback (Ulf Hansson).

    - Fix two suspend-to-idle issues in the ACPI EC driver (Rafael
    Wysocki).

    - Fix the handling of timer_expires in the PM-runtime framework on
    32-bit systems and the handling of device links in it (Grygorii
    Strashko, Xiang Chen).

    - Add IO requests batching support to the hibernate image saving and
    reading code and drop a bogus get_gendisk() from there (Xiaoyi
    Chen, Christoph Hellwig).

    - Allow PCIe ports to be put into the D3cold power state if they are
    power-manageable via ACPI (Lukas Wunner).

    - Add missing header file include to a power capping driver (Pujin
    Shi).

    - Clean up the qcom-cpr AVS driver a bit (Liu Shixin).

    - Kevin Hilman steps down as designated reviwer of adaptive voltage
    scaling (AVS) drivers (Kevin Hilman)"

    * tag 'pm-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
    cpufreq: stats: Fix string format specifier mismatch
    arm: disable frequency invariance for CONFIG_BL_SWITCHER
    cpufreq,arm,arm64: restructure definitions of arch_set_freq_scale()
    cpufreq: stats: Add memory barrier to store_reset()
    cpufreq: schedutil: Simplify sugov_fast_switch()
    ACPI: EC: PM: Drop ec_no_wakeup check from acpi_ec_dispatch_gpe()
    ACPI: EC: PM: Flush EC work unconditionally after wakeup
    PCI/ACPI: Whitelist hotplug ports for D3 if power managed by ACPI
    PM: hibernate: remove the bogus call to get_gendisk() in software_resume()
    cpufreq: Move traces and update to policy->cur to cpufreq core
    cpufreq: stats: Enable stats for fast-switch as well
    cpufreq: stats: Mark few conditionals with unlikely()
    cpufreq: stats: Remove locking
    cpufreq: stats: Defer stats update to cpufreq_stats_record_transition()
    PM: domains: Allow to abort power off when no ->power_off() callback
    PM: domains: Rename power state enums for genpd
    PM / devfreq: tegra30: Improve initial hardware resetting
    PM / devfreq: event: Change prototype of devfreq_event_get_edev_by_phandle function
    PM / devfreq: Change prototype of devfreq_get_devfreq_by_phandle function
    PM / devfreq: Add devfreq_get_devfreq_by_node function
    ...

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - reorganize & clean up the SD* flags definitions and add a bunch of
    sanity checks. These new checks caught quite a few bugs or at least
    inconsistencies, resulting in another set of patches.

    - rseq updates, add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ

    - add a new tracepoint to improve CPU capacity tracking

    - improve overloaded SMP system load-balancing behavior

    - tweak SMT balancing

    - energy-aware scheduling updates

    - NUMA balancing improvements

    - deadline scheduler fixes and improvements

    - CPU isolation fixes

    - misc cleanups, simplifications and smaller optimizations

    * tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
    sched/deadline: Unthrottle PI boosted threads while enqueuing
    sched/debug: Add new tracepoint to track cpu_capacity
    sched/fair: Tweak pick_next_entity()
    rseq/selftests: Test MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
    rseq/selftests,x86_64: Add rseq_offset_deref_addv()
    rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
    sched/fair: Use dst group while checking imbalance for NUMA balancer
    sched/fair: Reduce busy load balance interval
    sched/fair: Minimize concurrent LBs between domain level
    sched/fair: Reduce minimal imbalance threshold
    sched/fair: Relax constraint on task's load during load balance
    sched/fair: Remove the force parameter of update_tg_load_avg()
    sched/fair: Fix wrong cpu selecting from isolated domain
    sched: Remove unused inline function uclamp_bucket_base_value()
    sched/rt: Disable RT_RUNTIME_SHARE by default
    sched/deadline: Fix stale throttling on de-/boosted tasks
    sched/numa: Use runnable_avg to classify node
    sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL
    MAINTAINERS: Add myself as SCHED_DEADLINE reviewer
    sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h
    ...

    Linus Torvalds
     

07 Oct, 2020

1 commit


05 Oct, 2020

1 commit


03 Oct, 2020

3 commits

  • stress-ng has a test (stress-ng --cyclic) that creates a set of threads
    under SCHED_DEADLINE with the following parameters:

    dl_runtime = 10000 (10 us)
    dl_deadline = 100000 (100 us)
    dl_period = 100000 (100 us)

    These parameters are very aggressive. When using a system without HRTICK
    set, these threads can easily execute longer than the dl_runtime because
    the throttling happens with 1/HZ resolution.

    During the main part of the test, the system works just fine because
    the workload does not try to run over the 10 us. The problem happens at
    the end of the test, on the exit() path. During exit(), the threads need
    to do some cleanups that require real-time mutex locks, mainly those
    related to memory management, resulting in this scenario:

    Note: locks are rt_mutexes...
    ------------------------------------------------------------------------
    TASK A:                  TASK B:                     TASK C:
    activation
                             activation
                                                         activation

    lock(a): OK!             lock(b): OK!
                             lock(a)
                               -> block (task A owns it)
                               -> self notice/set throttled
    +--<                       -> arm replenished timer
    |                        switch-out
    |                                                    lock(b)
    |                                                      -> <C prio > B prio>
    |                                                      -> boost TASK B
    |  unlock(a)                                             switch-out
    |    -> handle lock a to B
    |    -> wakeup(B)
    |      -> B is throttled:
    |        -> do not enqueue
    |     switch-out
    |
    |
    +---------------------> replenishment timer
                             -> TASK B is boosted:
                               -> do not enqueue
    ------------------------------------------------------------------------

    BOOM: TASK B is runnable but !enqueued, holding TASK C: the system
    crashes with hung task C.

    This problem is avoided by removing the throttle state from the boosted
    thread while boosting it (by TASK A in the example above), allowing it to
    be queued and run boosted.

    The next replenishment will take care of the runtime overrun, pushing
    the deadline further away. See the "while (dl_se->runtime <= 0)" loop
    in replenish_dl_entity(). A minimal sketch of the fix appears below.

    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Juri Lelli
    Tested-by: Mark Simmons
    Link: https://lkml.kernel.org/r/5076e003450835ec74e6fa5917d02c4fa41687e6.1600170294.git.bristot@redhat.com

    Daniel Bristot de Oliveira
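
    A minimal sketch of the fix under mocked types (not the kernel's
    enqueue path): a boosted task that is still marked throttled gets its
    throttle cleared so it can be enqueued and run boosted; the runtime
    overrun is left for the next replenishment.

        #include <stdbool.h>

        struct dl_mock {
                bool dl_boosted;
                bool dl_throttled;
        };

        static void enqueue_boosted_sketch(struct dl_mock *dl)
        {
                if (dl->dl_boosted && dl->dl_throttled) {
                        /* the real code also cancels the armed replenishment
                         * timer at this point */
                        dl->dl_throttled = false;
                }
                /* ...continue with the normal enqueue... */
        }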
     
  • rq->cpu_capacity is a key element in several scheduler parts, such as EAS
    task placement and load balancing. Tracking this value enables testing
    and/or debugging by a toolkit.

    Signed-off-by: Vincent Donnefort
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com

    Vincent Donnefort
     
  • Currently, pick_next_entity(...) has the following structure
    (simplified):

    [...]
    if (last_buddy_ok())
            result = last_buddy;
    if (next_buddy_ok())
            result = next_buddy;
    [...]

    The intended behavior is to prefer next buddy over last buddy;
    the current code somewhat obfuscates this, and also wastes
    cycles checking the last buddy when eventually the next buddy is
    picked up.

    So this patch refactors the two 'ifs' above into

    [...]
    if (next_buddy_ok())
            result = next_buddy;
    else if (last_buddy_ok())
            result = last_buddy;
    [...]

    Signed-off-by: Peter Oskolkov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/20200930173532.1069092-1-posk@google.com

    Peter Oskolkov
     

25 Sep, 2020

11 commits

  • This patchset is based on Google-internal RSEQ work done by Paul
    Turner and Andrew Hunter.

    When working with per-CPU RSEQ-based memory allocations, it is
    sometimes important to make sure that a global memory location is no
    longer accessed from RSEQ critical sections. For example, there can be
    two per-CPU lists, one is "active" and accessed per-CPU, while another
    one is inactive and worked on asynchronously "off CPU" (e.g. garbage
    collection is performed). Then at some point the two lists are
    swapped, and a fast RCU-like mechanism is required to make sure that
    the previously active list is no longer accessed.

    This patch introduces such a mechanism: in short, membarrier() syscall
    issues an IPI to a CPU, restarting a potentially active RSEQ critical
    section on the CPU.

    Signed-off-by: Peter Oskolkov
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mathieu Desnoyers
    Link: https://lkml.kernel.org/r/20200923233618.2572849-1-posk@google.com

    Peter Oskolkov
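
    A userspace usage sketch, assuming uapi headers and a kernel that ship
    the RSEQ membarrier commands (v5.10+); error handling is minimal.

        #define _GNU_SOURCE
        #include <linux/membarrier.h>
        #include <stdio.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        /* glibc provides no wrapper for membarrier(2). */
        static long membarrier(int cmd, unsigned int flags, int cpu_id)
        {
                return syscall(__NR_membarrier, cmd, flags, cpu_id);
        }

        int main(void)
        {
                /* Register once, then restart any RSEQ critical section that
                 * is active on CPU 0 at the time of the call. */
                if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ,
                               0, 0))
                        perror("register");
                else if (membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ,
                                    MEMBARRIER_CMD_FLAG_CPU, 0))
                        perror("barrier");
                else
                        puts("rseq critical sections on CPU 0 restarted");
                return 0;
        }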
     
  • Barry Song noted the following

    Something is wrong. In find_busiest_group(), we are checking if
    src has higher load, however, in task_numa_find_cpu(), we are
    checking if dst will have higher load after balancing. It seems
    it is not sensible to check src.

    It may cause a wrong imbalance value. For example, if
    dst_running = env->dst_stats.nr_running + 1 results in 3 or
    above, and src_running = env->src_stats.nr_running - 1 results
    in 1, the current code treats the imbalance as 0 since src_running
    is smaller than 2. This is inconsistent with the load balancer.

    Basically, in find_busiest_group(), the NUMA imbalance is ignored if moving
    a task "from an almost idle domain" to a "domain with spare capacity". This
    patch forbids movement "from a misplaced domain" to "an almost idle domain"
    as that is closer to what the CPU load balancer expects.

    This patch is not a universal win. The old behaviour was intended to allow
    a task from an almost idle NUMA node to migrate to its preferred node if
    the destination had capacity but there are corner cases. For example,
    a NAS compute load could be parallelised to use 1/3rd of available CPUs
    but not all those potential tasks are active at all times allowing this
    logic to trigger. An obvious example is specjbb 2005 running various
    numbers of warehouses on a 2 socket box with 80 cpus.

    specjbb
    5.9.0-rc4 5.9.0-rc4
    vanilla dstbalance-v1r1
    Hmean tput-1 46425.00 ( 0.00%) 43394.00 * -6.53%*
    Hmean tput-2 98416.00 ( 0.00%) 96031.00 * -2.42%*
    Hmean tput-3 150184.00 ( 0.00%) 148783.00 * -0.93%*
    Hmean tput-4 200683.00 ( 0.00%) 197906.00 * -1.38%*
    Hmean tput-5 236305.00 ( 0.00%) 245549.00 * 3.91%*
    Hmean tput-6 281559.00 ( 0.00%) 285692.00 * 1.47%*
    Hmean tput-7 338558.00 ( 0.00%) 334467.00 * -1.21%*
    Hmean tput-8 340745.00 ( 0.00%) 372501.00 * 9.32%*
    Hmean tput-9 424343.00 ( 0.00%) 413006.00 * -2.67%*
    Hmean tput-10 421854.00 ( 0.00%) 434261.00 * 2.94%*
    Hmean tput-11 493256.00 ( 0.00%) 485330.00 * -1.61%*
    Hmean tput-12 549573.00 ( 0.00%) 529959.00 * -3.57%*
    Hmean tput-13 593183.00 ( 0.00%) 555010.00 * -6.44%*
    Hmean tput-14 588252.00 ( 0.00%) 599166.00 * 1.86%*
    Hmean tput-15 623065.00 ( 0.00%) 642713.00 * 3.15%*
    Hmean tput-16 703924.00 ( 0.00%) 660758.00 * -6.13%*
    Hmean tput-17 666023.00 ( 0.00%) 697675.00 * 4.75%*
    Hmean tput-18 761502.00 ( 0.00%) 758360.00 * -0.41%*
    Hmean tput-19 796088.00 ( 0.00%) 798368.00 * 0.29%*
    Hmean tput-20 733564.00 ( 0.00%) 823086.00 * 12.20%*
    Hmean tput-21 840980.00 ( 0.00%) 856711.00 * 1.87%*
    Hmean tput-22 804285.00 ( 0.00%) 872238.00 * 8.45%*
    Hmean tput-23 795208.00 ( 0.00%) 889374.00 * 11.84%*
    Hmean tput-24 848619.00 ( 0.00%) 966783.00 * 13.92%*
    Hmean tput-25 750848.00 ( 0.00%) 903790.00 * 20.37%*
    Hmean tput-26 780523.00 ( 0.00%) 962254.00 * 23.28%*
    Hmean tput-27 1042245.00 ( 0.00%) 991544.00 * -4.86%*
    Hmean tput-28 1090580.00 ( 0.00%) 1035926.00 * -5.01%*
    Hmean tput-29 999483.00 ( 0.00%) 1082948.00 * 8.35%*
    Hmean tput-30 1098663.00 ( 0.00%) 1113427.00 * 1.34%*
    Hmean tput-31 1125671.00 ( 0.00%) 1134175.00 * 0.76%*
    Hmean tput-32 968167.00 ( 0.00%) 1250286.00 * 29.14%*
    Hmean tput-33 1077676.00 ( 0.00%) 1060893.00 * -1.56%*
    Hmean tput-34 1090538.00 ( 0.00%) 1090933.00 * 0.04%*
    Hmean tput-35 967058.00 ( 0.00%) 1107421.00 * 14.51%*
    Hmean tput-36 1051745.00 ( 0.00%) 1210663.00 * 15.11%*
    Hmean tput-37 1019465.00 ( 0.00%) 1351446.00 * 32.56%*
    Hmean tput-38 1083102.00 ( 0.00%) 1064541.00 * -1.71%*
    Hmean tput-39 1232990.00 ( 0.00%) 1303623.00 * 5.73%*
    Hmean tput-40 1175542.00 ( 0.00%) 1340943.00 * 14.07%*
    Hmean tput-41 1127826.00 ( 0.00%) 1339492.00 * 18.77%*
    Hmean tput-42 1198313.00 ( 0.00%) 1411023.00 * 17.75%*
    Hmean tput-43 1163733.00 ( 0.00%) 1228253.00 * 5.54%*
    Hmean tput-44 1305562.00 ( 0.00%) 1357886.00 * 4.01%*
    Hmean tput-45 1326752.00 ( 0.00%) 1406061.00 * 5.98%*
    Hmean tput-46 1339424.00 ( 0.00%) 1418451.00 * 5.90%*
    Hmean tput-47 1415057.00 ( 0.00%) 1381570.00 * -2.37%*
    Hmean tput-48 1392003.00 ( 0.00%) 1421167.00 * 2.10%*
    Hmean tput-49 1408374.00 ( 0.00%) 1418659.00 * 0.73%*
    Hmean tput-50 1359822.00 ( 0.00%) 1391070.00 * 2.30%*
    Hmean tput-51 1414246.00 ( 0.00%) 1392679.00 * -1.52%*
    Hmean tput-52 1432352.00 ( 0.00%) 1354020.00 * -5.47%*
    Hmean tput-53 1387563.00 ( 0.00%) 1409563.00 * 1.59%*
    Hmean tput-54 1406420.00 ( 0.00%) 1388711.00 * -1.26%*
    Hmean tput-55 1438804.00 ( 0.00%) 1387472.00 * -3.57%*
    Hmean tput-56 1399465.00 ( 0.00%) 1400296.00 * 0.06%*
    Hmean tput-57 1428132.00 ( 0.00%) 1396399.00 * -2.22%*
    Hmean tput-58 1432385.00 ( 0.00%) 1386253.00 * -3.22%*
    Hmean tput-59 1421612.00 ( 0.00%) 1371416.00 * -3.53%*
    Hmean tput-60 1429423.00 ( 0.00%) 1389412.00 * -2.80%*
    Hmean tput-61 1396230.00 ( 0.00%) 1351122.00 * -3.23%*
    Hmean tput-62 1418396.00 ( 0.00%) 1383098.00 * -2.49%*
    Hmean tput-63 1409918.00 ( 0.00%) 1374662.00 * -2.50%*
    Hmean tput-64 1410236.00 ( 0.00%) 1376216.00 * -2.41%*
    Hmean tput-65 1396405.00 ( 0.00%) 1364418.00 * -2.29%*
    Hmean tput-66 1395975.00 ( 0.00%) 1357326.00 * -2.77%*
    Hmean tput-67 1392986.00 ( 0.00%) 1349642.00 * -3.11%*
    Hmean tput-68 1386541.00 ( 0.00%) 1343261.00 * -3.12%*
    Hmean tput-69 1374407.00 ( 0.00%) 1342588.00 * -2.32%*
    Hmean tput-70 1377513.00 ( 0.00%) 1334654.00 * -3.11%*
    Hmean tput-71 1369319.00 ( 0.00%) 1334952.00 * -2.51%*
    Hmean tput-72 1354635.00 ( 0.00%) 1329005.00 * -1.89%*
    Hmean tput-73 1350933.00 ( 0.00%) 1318942.00 * -2.37%*
    Hmean tput-74 1351714.00 ( 0.00%) 1316347.00 * -2.62%*
    Hmean tput-75 1352198.00 ( 0.00%) 1309974.00 * -3.12%*
    Hmean tput-76 1349490.00 ( 0.00%) 1286064.00 * -4.70%*
    Hmean tput-77 1336131.00 ( 0.00%) 1303684.00 * -2.43%*
    Hmean tput-78 1308896.00 ( 0.00%) 1271024.00 * -2.89%*
    Hmean tput-79 1326703.00 ( 0.00%) 1290862.00 * -2.70%*
    Hmean tput-80 1336199.00 ( 0.00%) 1291629.00 * -3.34%*

    The performance at the mid-point is better but not universally better. The
    patch is a mixed bag depending on the workload, machine and overall
    levels of utilisation. Sometimes it's better (sometimes much better),
    other times it is worse (sometimes much worse). Given that there isn't a
    universally good decision in this area and more people seem to prefer
    the patch, it may be best to keep the LB decisions consistent and
    revisit imbalance handling when the load balancer code changes settle
    down.

    Jirka Hladky added the following observation.

    Our results are mostly in line with what you see. We observe
    big gains (20-50%) when the system is loaded to 1/3 of the
    maximum capacity and mixed results at the full load - some
    workloads benefit from the patch at the full load, others not,
    but performance changes at the full load are mostly within the
    noise of results (+/-5%). Overall, we think this patch is helpful.

    [mgorman@techsingularity.net: Rewrote changelog]
    Fixes: fb86f5b211 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity")
    Signed-off-by: Barry Song
    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200921221849.GI3179@techsingularity.net

    Barry Song
     
  • The busy_factor, which increases the load balance interval when a cpu
    is busy, is set to 32 by default. This value generates some huge LB
    intervals on a large system like the THX2, made of 2 nodes x 28 cores x
    4 threads. For such a system, the interval increases from 112ms to
    3584ms at the MC level, and from 228ms to 7168ms at the NUMA level.

    Even on smaller systems, a lower busy factor has shown improvement on
    the fair distribution of the running time, so let's reduce it for all.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org

    Vincent Guittot
     
  • Sched domains tend to trigger the load balance loop simultaneously, but
    the larger domains often need more time to collect statistics. This
    slowness makes the larger domain try to detach tasks from a rq whose
    tasks have already migrated somewhere else at a sub-domain level. This
    is not a real problem for idle LB, because the period of smaller
    domains will increase while their CPUs are busy, which leaves time for
    the higher domains to pull tasks. But it becomes a problem when all
    CPUs are already busy, because then all domains stay synced when they
    trigger their LB.

    A simple way to minimize simultaneous LB of all domains is to decrement
    the busy interval by 1 jiffy. Because of the busy_factor, the interval
    of a larger domain will no longer be a multiple of the smaller ones
    (see the sketch below).

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Link: https://lkml.kernel.org/r/20200921072424.14813-4-vincent.guittot@linaro.org

    Vincent Guittot
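
    A tiny sketch of the arithmetic with made-up numbers (the real
    computation lives in the load balancer's interval handling; treat the
    helper as illustrative): with base intervals of 4 and 8 jiffies and a
    busy_factor of 16, two nested domains would fire together every 64
    jiffies, while 63 and 127 no longer line up on every period of the
    larger domain.

        static unsigned long busy_lb_interval(unsigned long base_jiffies,
                                              unsigned int busy_factor)
        {
                /* subtract one jiffy so nested intervals stop being exact
                 * multiples of each other */
                return base_jiffies * busy_factor - 1;
        }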
     
  • The 25% default imbalance threshold for the DIE and NUMA domains is
    large enough to generate significant unfairness between threads. A
    typical example is the case of 11 threads running on 2x4 CPUs. The
    imbalance of 20% between the 2 groups of 4 cores is just low enough not
    to trigger the load balance between the 2 groups. We will always have
    the same 6 threads on one group of 4 CPUs and the other 5 threads on
    the other group of CPUs. With fair time sharing in each group, we end
    up with +20% running time for the group of 5 threads.

    Consider decreasing the imbalance threshold for the overloaded case,
    where we use the load to balance tasks and to ensure fair time sharing.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Acked-by: Hillf Danton
    Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org

    Vincent Guittot
     
  • Some use cases, like 9 always-running tasks on 8 CPUs, can't be
    balanced, and the load balancer currently migrates the waiting task
    between the CPUs in an almost random manner. The success of a rq
    pulling a task depends on the value of nr_balance_failed of its domains
    and on its ability to be faster than others to detach it. This behavior
    results in an unfair distribution of the running time between tasks,
    because some CPUs will run most of the time, if not always, the same
    task, whereas others will share their time between several tasks.

    Instead of using nr_balance_failed as a boolean to relax the condition
    for detaching a task, the LB will use nr_balance_failed to relax the
    threshold between the task's load and the imbalance. This mechanism
    prevents the same rq or domain from always winning the load balance
    fight.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Link: https://lkml.kernel.org/r/20200921072424.14813-2-vincent.guittot@linaro.org

    Vincent Guittot
     
  • In the file fair.c, sometimes update_tg_load_avg(cfs_rq, 0) is used,
    and sometimes update_tg_load_avg(cfs_rq, false) is used.
    update_tg_load_avg() has the parameter 'force', but in the current code
    it is never set to 1 or true, so remove the force parameter.

    Signed-off-by: Xianting Tian
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200924014755.36253-1-tian.xianting@h3c.com

    Xianting Tian
     
  • We've hit problems where tasks with a full cpumask (e.g. after being
    put into a cpuset or set to full affinity) were occasionally migrated
    to our isolated cpus in the production environment.

    After some analysis, we found that this is because the current
    select_idle_smt() does not consider the sched_domain mask.

    Steps to reproduce on my 31-CPU hyperthreaded machine:
    1. boot with parameter: "isolcpus=domain,2-31"
    (thread lists: 0,16 and 1,17)
    2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
    3. some threads will be migrated to the isolated cpu16~17.

    Fix it by checking the valid domain mask in select_idle_smt()
    (see the sketch below).

    Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()")
    Reported-by: Wetp Zhang
    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Jiang Biao
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com

    Xunlei Pang
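
    A sketch of the added check with cpumasks mocked as plain bitmasks (the
    kernel uses struct cpumask and its helpers): an SMT sibling is a valid
    pick only if it is both allowed for the task and inside the domain
    being searched, so isolated CPUs outside every domain are never
    returned.

        #include <stdbool.h>

        static bool cpu_in(unsigned long mask, int cpu)
        {
                return mask & (1UL << cpu);
        }

        static int select_idle_smt_sketch(unsigned long smt_siblings,
                                          unsigned long task_allowed,
                                          unsigned long domain_span,
                                          unsigned long idle_cpus, int nr_cpus)
        {
                for (int cpu = 0; cpu < nr_cpus; cpu++) {
                        if (!cpu_in(smt_siblings, cpu))
                                continue;
                        if (!cpu_in(task_allowed, cpu) ||
                            !cpu_in(domain_span, cpu))
                                continue;       /* the added domain check */
                        if (cpu_in(idle_cpus, cpu))
                                return cpu;
                }
                return -1;
        }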
     
  • There is no caller in the tree, so we can remove it.

    Signed-off-by: YueHaibing
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Link: https://lkml.kernel.org/r/20200922132410.48440-1-yuehaibing@huawei.com

    YueHaibing
     
  • The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime
    between CPUs, allowing a CPU to run a real-time task up to 100% of the
    time while leaving more space for non-real-time tasks to run on the
    CPUs that lend rt_runtime.

    The problem is that a CPU can easily borrow enough rt_runtime to allow
    a spinning rt-task to run forever, starving per-cpu tasks like
    kworkers, which are non-real-time by design.

    This patch disables RT_RUNTIME_SHARE by default, avoiding this problem
    (see the note below). The feature will still be present for users that
    want to enable it.

    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Wei Wang
    Link: https://lkml.kernel.org/r/b776ab46817e3db5d8ef79175fa0d71073c051c7.1600697903.git.bristot@redhat.com

    Daniel Bristot de Oliveira
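
    The shape of the change, assuming the flag lives in the usual
    SCHED_FEAT() table (kernel/sched/features.h); the exact hunk is not
    reproduced here.

        SCHED_FEAT(RT_RUNTIME_SHARE, false)     /* previously: true */

    With CONFIG_SCHED_DEBUG it can still be re-enabled at runtime, e.g.
    "echo RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features".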
     
  • When a boosted task gets throttled, what normally happens is that it's
    immediately enqueued again with ENQUEUE_REPLENISH, which replenishes
    the runtime and clears the dl_throttled flag. There is a special case,
    however: if the throttling happened on sched-out and the task has been
    deboosted in the meantime, the replenish is skipped as the task will
    return to its normal scheduling class. This leaves the task with the
    dl_throttled flag set.

    Now if the task gets boosted up to the deadline scheduling class again
    while it is sleeping, it's still in the throttled state. The normal
    wakeup however will enqueue the task with ENQUEUE_REPLENISH not set, so
    we don't actually place it on the rq. Thus we end up with a task that
    is runnable, but not actually on the rq, and neither an immediate
    replenishment happens, nor is the replenishment timer set up, so the
    task is stuck in forever-throttled limbo.

    Clear the dl_throttled flag before dropping back to the normal
    scheduling class to fix this issue (see the sketch below).

    Signed-off-by: Lucas Stach
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Link: https://lkml.kernel.org/r/20200831110719.2126930-1-l.stach@pengutronix.de

    Lucas Stach
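
    A sketch of the fix under mocked state (not the kernel's enqueue path):
    when the replenish is skipped because the deboosted task is returning
    to its normal class, the throttled flag is dropped as well, so a later
    re-boost does not inherit stale throttling state.

        #include <stdbool.h>

        struct dl_throttle_mock {
                bool dl_throttled;
                bool leaving_dl_class;   /* deboosted while throttled */
        };

        static void skip_replenish_sketch(struct dl_throttle_mock *dl)
        {
                if (dl->leaving_dl_class) {
                        dl->dl_throttled = false;   /* the added clearing */
                        return;                     /* no replenish needed */
                }
                /* ...otherwise replenish runtime and clear the throttle
                 * as before... */
        }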