11 Nov, 2020
3 commits
-
Defer the softirq processing to ksoftirqd if a RT task is running
or queued on the current CPU. This complements the RT task placement
algorithm which tries to find a CPU that is not currently busy with
softirqs.

Currently only NET_TX, NET_RX, BLOCK and TASKLET softirqs are deferred,
as they can potentially run for a long time.

Bug: 168521633
Change-Id: Id7665244af6bbd5a96d9e591cf26154e9eaa860c
Signed-off-by: Pavankumar Kondeti
[satyap@codeaurora.org: trivial merge conflict resolution.]
Signed-off-by: Satya Durga Srinivasu Prabhala
[elavila: Port to mainline, squash with bugfix]
Signed-off-by: J. Avila
-
The scheduling change to avoid putting RT threads on cores that
are handling softints was catching cases where there was no reason
to believe the softint would take a long time, resulting in unnecessary
migration overhead. This patch reduces the migration to cases where
the core has a softint that is actually likely to take a long time,
as opposed to the RCU, SCHED, and TIMER softints that are rather quick.

Bug: 31752786
Bug: 168521633
Change-Id: Ib4e179f1e15c736b2fdba31070494e357e9fbbe2
Signed-off-by: John Dias
[elavila: Amend commit text for AOSP, port to mainline]
Signed-off-by: J. Avila
-
In certain audio use cases, scheduling RT threads on cores that are
handling softirqs can lead to glitches. Prevent this behavior.

Bug: 31501544
Bug: 168521633
Change-Id: I99dd7aaa12c11270b28dbabea484bcc8fb8ba0c1
Signed-off-by: John Dias
[elavila: Port to mainline, amend commit text]
Signed-off-by: J. Avila
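
The three commits above share one idea: only some softirqs are treated as potentially long-running. A minimal userspace sketch of that classification follows. The softirq numbering follows the kernel's softirq enum, but the mask name and both helper functions are hypothetical and are not the code added by these commits.

```c
#include <stdbool.h>

/* Softirq numbers, in the order of the kernel's softirq enum. */
enum {
    HI_SOFTIRQ, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ, IRQ_POLL_SOFTIRQ, TASKLET_SOFTIRQ, SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ, RCU_SOFTIRQ
};

#define SOFTIRQ_BIT(nr) (1u << (nr))

/* Hypothetical name: the softirqs the commit text calls potentially long. */
#define LONG_SOFTIRQ_MASK_SKETCH \
    (SOFTIRQ_BIT(NET_TX_SOFTIRQ) | SOFTIRQ_BIT(NET_RX_SOFTIRQ) | \
     SOFTIRQ_BIT(BLOCK_SOFTIRQ)  | SOFTIRQ_BIT(TASKLET_SOFTIRQ))

/* Softirq side: which pending softirqs would be handed to ksoftirqd. */
static unsigned int softirqs_to_defer(unsigned int pending,
                                      bool rt_task_running_or_queued)
{
    if (!rt_task_running_or_queued)
        return 0;
    return pending & LONG_SOFTIRQ_MASK_SKETCH;
}

/* Placement side: is this CPU a poor choice for an RT task right now? */
static bool cpu_busy_with_long_softirqs(unsigned int pending_or_active)
{
    return (pending_or_active & LONG_SOFTIRQ_MASK_SKETCH) != 0;
}
```

With this split, a CPU that only has TIMER, SCHED or RCU softirqs pending is still considered a fine target for an RT task, which matches the intent described in the second commit.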
06 Nov, 2020
2 commits
-
…it.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest") into android-mainline
Steps on the way to 5.10-rc3
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I57f80255bf5d396e92a54807a516cc41cf07be61
-
Create a trace hook when RT tasks are throttled. This allows
vendors to debug long RT runs.

Bug: 172264047
Change-Id: I534959f8e8d714463aac2f9f1c5627d2e735f543
Signed-off-by: Sai Harshini Nimmala
03 Nov, 2020
1 commit
-
The cpufreq policy's frequency limits (min/max) can get changed at any
point of time, while schedutil is trying to update the next frequency.
Though the schedutil governor has necessary locking and support in place
to make sure we don't miss any of those updates, there is a corner case
where the governor will find that the CPU is already running at the
desired frequency and so may skip an update.

For example, consider that the CPU can run at 1 GHz, 1.2 GHz and 1.4 GHz
and is running at 1 GHz currently. Schedutil tries to update the
frequency to 1.2 GHz, during this time the policy limits get changed as
policy->min = 1.4 GHz. As schedutil (and cpufreq core) does clamp the
frequency at various instances, we will eventually set the frequency to
1.4 GHz, while we will save 1.2 GHz in sg_policy->next_freq.

Now let's say the policy limits get changed back at this time with
policy->min as 1 GHz. The next time schedutil is invoked by the
scheduler, we will reevaluate the next frequency (because
need_freq_update will get set due to limits change event) and let's say
we want to set the frequency to 1.2 GHz again. At this point
sugov_update_next_freq() will find the next_freq == current_freq and
will abort the update, while the CPU actually runs at 1.4 GHz.

Until now need_freq_update was used as a flag to indicate that the
policy's frequency limits have changed, and that we should consider the
new limits while reevaluating the next frequency.

This patch fixes the above mentioned issue by extending the purpose of
the need_freq_update flag. If this flag is set now, the schedutil
governor will not try to abort a frequency change even if next_freq ==
current_freq.

As similar behavior is required in the case of
CPUFREQ_NEED_UPDATE_LIMITS flag as well, need_freq_update will never be
set to false if that flag is set for the driver.

We also don't need to consider the need_freq_update flag in
sugov_update_single() anymore to handle the special case of busy CPU, as
we won't abort a frequency update anymore.

Reported-by: zhuguangqing
Suggested-by: Rafael J. Wysocki
Signed-off-by: Viresh Kumar
[ rjw: Rearrange code to avoid a branch ]
Signed-off-by: Rafael J. Wysocki
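
The extended role of need_freq_update described above can be modelled in a few lines of userspace C. This is a sketch under stated assumptions, not the kernel function: the struct, its field driver_needs_limits (standing in for the CPUFREQ_NEED_UPDATE_LIMITS driver flag) and the function name are hypothetical.

```c
#include <stdbool.h>

struct sugov_policy_sketch {
    unsigned int next_freq;     /* last frequency the governor requested (kHz) */
    bool need_freq_update;      /* set when the policy limits change */
    bool driver_needs_limits;   /* models CPUFREQ_NEED_UPDATE_LIMITS */
};

/* Returns true when the request must be passed on to the driver. */
static bool update_next_freq_sketch(struct sugov_policy_sketch *sg,
                                    unsigned int next_freq)
{
    if (sg->need_freq_update)
        /* Limits changed: never skip, and keep the flag set if the
         * driver always wants to see limit updates. */
        sg->need_freq_update = sg->driver_needs_limits;
    else if (sg->next_freq == next_freq)
        return false;           /* nothing changed, skip the update */

    sg->next_freq = next_freq;
    return true;
}
```

In the 1.2 GHz scenario from the changelog, the saved next_freq equals the newly computed one, but the pending limits change forces the request through, so the CPU no longer stays at 1.4 GHz.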
02 Nov, 2020
1 commit
-
…ub/scm/linux/kernel/git/soc/soc") into android-mainline
Steps on the way to 5.10-rc2
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: I2aae8375aa349bd63596d4bd29e50e36993c764f
30 Oct, 2020
1 commit
-
Add padding to below structs to support WALT based accounting:
1. struct cpu_topology
2. struct task_struct
3. struct sched_domain_shared
4. struct task_group
5. struct root_domain
6. struct rq

To accommodate potential future changes, reserve more memory than
WALT needs today.

Bug: 171858786
Change-Id: If6d901174fc7963be3ae44daa799cb2953669ec1
Signed-off-by: Satya Durga Srinivasu Prabhala
29 Oct, 2020
2 commits
-
Because sugov_update_next_freq() may skip a frequency update even if
the need_freq_update flag has been set for the policy at hand, policy
limits updates may not take effect as expected.

For example, if the intel_pstate driver operates in the passive mode
with HWP enabled, it needs to update the HWP min and max limits when
the policy min and max limits change, respectively, but that may not
happen if the target frequency does not change along with the limit
at hand. In particular, if the policy min is changed first, causing
the target frequency to be adjusted to it, and the policy max limit
is changed later to the same value, the HWP max limit will not be
updated to follow it as expected, because the target frequency is
still equal to the policy min limit and it will not change until
that limit is updated.

To address this issue, modify get_next_freq() to let the driver
callback run if the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag
is set regardless of whether or not the new frequency to set is
equal to the previous one.

Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
Reported-by: Zhang Rui
Tested-by: Zhang Rui
Cc: 5.9+ # 5.9+: 1c534352f47f cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ...
Cc: 5.9+ # 5.9+: a62f68f5ca53 cpufreq: Introduce cpufreq_driver_test_flags()
Signed-off-by: Rafael J. Wysocki
Acked-by: Viresh Kumar
Signed-off-by: Rafael J. Wysocki
-
Linux 5.10-rc1
Signed-off-by: Greg Kroah-Hartman
Change-Id: Iace3fc84a00d3023c75caa086a266de17dc1847c
27 Oct, 2020
1 commit
-
…nux/kernel/git/soc/soc") into android-mainline
Steps on the way to 5.10-rc1
Resolves conflicts in:
Documentation/admin-guide/sysctl/vm.rst

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic58f28718f28dae42948c935dfb0c62122fe86fc
26 Oct, 2020
2 commits
-
Use a more generic form for __section that requires quotes to avoid
complications with clang and gcc differences.

Remove the quote operator # from compiler_attributes.h __section macro.
Convert all unquoted __section(foo) uses to quoted __section("foo").
Also convert __attribute__((section("foo"))) uses to __section("foo")
even if the __attribute__ has multiple list entry forms.

Conversion done using the script at:
https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl
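
The difference between the two macro forms can be sketched in userspace C as follows. The macro and variable names are hypothetical stand-ins, not the kernel's definitions, and the sketch assumes gcc or clang on an ELF target, where arbitrary section names are accepted.

```c
/* Hypothetical stand-ins for the macro before and after the change. */
#define section_unquoted_sketch(S) __attribute__((__section__(#S)))
#define section_quoted_sketch(S)   __attribute__((__section__(S)))

/* Old style: a bare token that the macro stringifies with #. */
static int old_style_var section_unquoted_sketch(.sketch_old) = 1;

/* New style: the caller passes an ordinary string literal. */
static int new_style_var section_quoted_sketch(".sketch_new") = 2;

/* Reads both variables so neither is discarded as unused. */
static int sum_section_vars(void)
{
    return old_style_var + new_style_var;
}
```

With the quoted form the macro argument is a normal string literal, so it goes through the preprocessor like any other string instead of depending on how a bare token sequence is stringified.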
Signed-off-by: Joe Perches
Reviewed-by: Nick Desaulniers
Reviewed-by: Miguel Ojeda
Signed-off-by: Linus Torvalds
-
Pull scheduler fixes from Thomas Gleixner:
"Two scheduler fixes:

- A trivial build fix for sched_feat() to compile correctly with
  CONFIG_JUMP_LABEL=n

- Replace a zero length array with a flexible array"
* tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/features: Fix !CONFIG_JUMP_LABEL case
sched: Replace zero-length array with flexible-array
25 Oct, 2020
1 commit
-
…g/pub/scm/linux/kernel/git/konrad/swiotlb") into android-mainline
Resolves merge issues with:
drivers/cpufreq/cpufreq.c

Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
Change-Id: Ic2907cf71867dab7b8e1b5fcbd4c888fc01f8c22
24 Oct, 2020
2 commits
-
Pull more power management updates from Rafael Wysocki:
"First of all, the adaptive voltage scaling (AVS) drivers go to new
platform-specific locations as planned (this part was reported to have
merge conflicts against the new arm-soc updates in linux-next).

In addition to that, there are some fixes (intel_idle, intel_pstate,
RAPL, acpi_cpufreq), the addition of on/off notifiers and idle state
accounting support to the generic power domains (genpd) code and some
janitorial changes all over.

Specifics:

- Move the AVS drivers to new platform-specific locations and get rid
  of the drivers/power/avs directory (Ulf Hansson).

- Add on/off notifiers and idle state accounting support to the
  generic power domains (genpd) framework (Ulf Hansson, Lina Iyer).

- Ulf will maintain the PM domain part of cpuidle-psci (Ulf Hansson).

- Make intel_idle disregard ACPI _CST if it cannot use the data
  returned by that method (Mel Gorman).

- Modify intel_pstate to avoid leaving useless sysfs directory
  structure behind if it cannot be registered (Chen Yu).

- Fix domain detection in the RAPL power capping driver and prevent
  it from failing to enumerate the Psys RAPL domain (Zhang Rui).

- Allow acpi-cpufreq to use ACPI _PSD information with Family 19 and
  later AMD chips (Wei Huang).

- Update the driver assumptions comment in intel_idle and fix a
  kerneldoc comment in the runtime PM framework (Alexander Monakov,
  Bean Huo).

- Avoid unnecessary resets of the cached frequency in the schedutil
  cpufreq governor to reduce overhead (Wei Wang).

- Clean up the cpufreq core a bit (Viresh Kumar).

- Make assorted minor janitorial changes (Daniel Lezcano, Geert
  Uytterhoeven, Hubert Jasudowicz, Tom Rix).

- Clean up and optimize the cpupower utility somewhat (Colin Ian
  King, Martin Kaistra)"

* tag 'pm-5.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
PM: sleep: remove unreachable break
PM: AVS: Drop the avs directory and the corresponding Kconfig
PM: AVS: qcom-cpr: Move the driver to the qcom specific drivers
PM: runtime: Fix typo in pm_runtime_set_active() helper comment
PM: domains: Fix build error for genpd notifiers
powercap: Fix typo in Kconfig "Plance" -> "Plane"
cpufreq: schedutil: restore cached freq when next_f is not changed
acpi-cpufreq: Honor _PSD table setting on new AMD CPUs
PM: AVS: smartreflex Move driver to soc specific drivers
PM: AVS: rockchip-io: Move the driver to the rockchip specific drivers
PM: domains: enable domain idle state accounting
PM: domains: Add curly braces to delimit comment + statement block
PM: domains: Add support for PM domain on/off notifiers for genpd
powercap/intel_rapl: enumerate Psys RAPL domain together with package RAPL domain
powercap/intel_rapl: Fix domain detection
intel_idle: Ignore _CST if control cannot be taken from the platform
cpuidle: Remove pointless stub
intel_idle: mention assumption that WBINVD is not needed
MAINTAINERS: Add section for cpuidle-psci PM domain
cpufreq: intel_pstate: Delete intel_pstate sysfs if failed to register the driver
...
-
Enable idle drivers to wakeup idle CPUs by exporting wake_up_if_idle().
Bug: 169136276
Signed-off-by: Lina Iyer
Change-Id: If1529ad4b883f36de1692cd3ac1853ff722e3522
21 Oct, 2020
1 commit
-
…ub/scm/linux/kernel/git/tip/tip") into android-mainline
Steps on the way to 5.10-rc1
Change-Id: Ica8a8f76ad5735ebcf4f30f3156edf7f9f60e576
Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>
19 Oct, 2020
1 commit
-
We have the raw cached freq to reduce the chance in calling cpufreq
driver where it could be costly in some arch/SoC.

Currently, the raw cached freq is reset in sugov_update_single() when
it avoids frequency reduction (which is not desirable sometimes), but
it is better to restore the previous value of it in that case,
because it may not change in the next cycle and it is not necessary
to change the CPU frequency then.

Adapted from https://android-review.googlesource.com/1352810/
Signed-off-by: Wei Wang
Acked-by: Viresh Kumar
[ rjw: Subject edit and changelog rewrite ]
Signed-off-by: Rafael J. Wysocki
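
The effect of restoring the cached raw frequency, rather than resetting it, can be illustrated with a small userspace model. Everything here is a sketch under assumptions: the frequency table, the call counter and all function and struct names are hypothetical, and the resolve step merely stands in for a potentially costly driver call.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical frequency table (kHz) and a counter for "costly" calls. */
static const unsigned int freq_table_sketch[] = { 1000000, 1200000, 1400000 };
static int resolve_calls_sketch;

/* Stand-in for the driver resolving a raw value to a supported frequency. */
static unsigned int resolve_freq_sketch(unsigned int raw)
{
    size_t n = sizeof(freq_table_sketch) / sizeof(freq_table_sketch[0]);

    resolve_calls_sketch++;
    for (size_t i = 0; i < n; i++)
        if (freq_table_sketch[i] >= raw)
            return freq_table_sketch[i];
    return freq_table_sketch[n - 1];
}

struct sugov_sketch {
    unsigned int next_freq;        /* last resolved frequency */
    unsigned int cached_raw_freq;  /* raw value that produced next_freq */
};

static unsigned int get_next_freq_sketch(struct sugov_sketch *sg,
                                         unsigned int raw)
{
    if (raw == sg->cached_raw_freq)
        return sg->next_freq;      /* cache hit: no driver call */
    sg->cached_raw_freq = raw;
    return resolve_freq_sketch(raw);
}

static unsigned int update_single_sketch(struct sugov_sketch *sg,
                                         unsigned int raw, bool cpu_busy)
{
    unsigned int cached = sg->cached_raw_freq;
    unsigned int next_f = get_next_freq_sketch(sg, raw);

    if (cpu_busy && next_f < sg->next_freq) {
        next_f = sg->next_freq;        /* avoid the frequency reduction */
        sg->cached_raw_freq = cached;  /* restore instead of resetting */
    }
    sg->next_freq = next_f;
    return next_f;
}
```

If the raw value returns to its previous level in the next cycle, the restored cache entry still matches and the resolve step is skipped, which is the saving the changelog describes.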
18 Oct, 2020
1 commit
-
A previous commit changed the notification mode from true/false to an
int, allowing notify-no, notify-yes, or signal-notify. This was
backwards compatible in the sense that any existing true/false user
would translate to either 0 (no notification sent) or 1, the latter
of which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.

Clean this up properly, and define a proper enum for the notification
mode. Now we have:

- TWA_NONE. This is 0, same as before the original change, meaning no
  notification requested.
- TWA_RESUME. This is 1, same as before the original change, meaning
  that we use TIF_NOTIFY_RESUME.
- TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
  notification.

Clean up all the callers, switching their 0/1/false/true to using the
appropriate TWA_* mode for notifications.

Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()")
Reviewed-by: Thomas Gleixner
Signed-off-by: Jens Axboe
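
The three modes and their backwards-compatible values can be written down as a plain C enum. This is a sketch: the values follow the description above, while the enum tag and the helper function are illustrative and not taken from the commit text.

```c
#include <stdbool.h>

enum task_work_notify_mode {
    TWA_NONE,     /* 0: no notification requested */
    TWA_RESUME,   /* 1: notify via TIF_NOTIFY_RESUME */
    TWA_SIGNAL,   /* 2: notify via TIF_SIGPENDING/JOBCTL_TASK_WORK */
};

/* Hypothetical helper: what a legacy true/false argument used to mean. */
static enum task_work_notify_mode twa_from_legacy_bool(bool notify)
{
    return notify ? TWA_RESUME : TWA_NONE;
}
```

Because the first two enumerators keep the values 0 and 1, a caller that still passed false or true would behave as before, but converted callers state their intent by name.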
16 Oct, 2020
1 commit
-
Declare war on uninterruptible sleep. Add a tracepoint which
walks the kernel stack and dumps the first non-scheduler function
called before the scheduler is invoked.

Bug: 120445457
Change-Id: I19e965d5206329360a92cbfe2afcc8c30f65c229
Signed-off-by: Riley Andrews
[astrachan: deleted an unnecessary whitespace change]
Signed-off-by: Alistair Strachan
Bug: 170916884
Signed-off-by: Todd Kjos
15 Oct, 2020
3 commits
-
Commit:
765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")
made sched features static for !CONFIG_SCHED_DEBUG configurations, but
overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL cases.

For the latter, echoing changes to /sys/kernel/debug/sched_features has
the nasty effect of effectively changing what sched_features reports,
but without actually changing the scheduler behaviour (since different
translation units get different sysctl_sched_features).

Fix CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configurations by properly
restructuring ifdefs.

Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")
Co-developed-by: Daniel Bristot de Oliveira
Signed-off-by: Daniel Bristot de Oliveira
Signed-off-by: Juri Lelli
Signed-off-by: Ingo Molnar
Acked-by: Patrick Bellasi
Reviewed-by: Valentin Schneider
Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com
-
In the following commit:
04f5c362ec6d: ("sched/fair: Replace zero-length array with flexible-array")
a zero-length array cpumask[0] has been replaced with cpumask[].
But there is still a cpumask[0] in 'struct sched_group_capacity'
which was missed.

The point of using [] instead of [0] is that with [] the compiler will
generate a build warning if it isn't the last member of a struct.

[ mingo: Rewrote the changelog. ]
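
For reference, the flexible-array pattern discussed above looks like this in standalone C. The struct and function names are hypothetical and only mirror the shape of a header followed by a CPU bitmap; this is not the kernel's struct sched_group_capacity.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical struct: fixed header followed by a variable-size bitmap. */
struct group_capacity_sketch {
    unsigned long capacity;
    int id;
    unsigned long cpumask[];   /* flexible array member: must come last */
};

/* Allocate the header plus room for mask_words bitmap words, zeroed. */
static struct group_capacity_sketch *alloc_group_capacity(size_t mask_words)
{
    struct group_capacity_sketch *gc =
        calloc(1, sizeof(*gc) + mask_words * sizeof(gc->cpumask[0]));

    return gc;   /* may be NULL; the caller releases it with free() */
}
```

The array contributes no elements to sizeof(struct group_capacity_sketch), so the allocation size has to add the trailing storage explicitly, as done above.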
Signed-off-by: zhuguangqing
Signed-off-by: Ingo Molnar
Link: https://lore.kernel.org/r/20201014140220.11384-1-zhuguangqing83@gmail.com
-
Pull power management updates from Rafael Wysocki:
"These rework the collection of cpufreq statistics to allow it to take
place if fast frequency switching is enabled in the governor, rework
the frequency invariance handling in the cpufreq core and drivers, add
new hardware support to a couple of cpufreq drivers, fix a number of
assorted issues and clean up the code all over.

Specifics:

- Rework cpufreq statistics collection to allow it to take place when
  fast frequency switching is enabled in the governor (Viresh Kumar).

- Make the cpufreq core set the frequency scale on behalf of the
  driver and update several cpufreq drivers accordingly (Ionela
  Voinescu, Valentin Schneider).

- Add new hardware support to the STI and qcom cpufreq drivers and
  improve them (Alain Volmat, Manivannan Sadhasivam).

- Fix multiple assorted issues in cpufreq drivers (Jon Hunter,
  Krzysztof Kozlowski, Matthias Kaehlcke, Pali Rohár, Stephan
  Gerhold, Viresh Kumar).

- Fix several assorted issues in the operating performance points
  (OPP) framework (Stephan Gerhold, Viresh Kumar).

- Allow devfreq drivers to fetch devfreq instances by DT enumeration
  instead of using explicit phandles and modify the devfreq core code
  to support driver-specific devfreq DT bindings (Leonard Crestez,
  Chanwoo Choi).

- Improve initial hardware resetting in the tegra30 devfreq driver
  and clean up the tegra cpuidle driver (Dmitry Osipenko).

- Update the cpuidle core to collect state entry rejection statistics
  and expose them via sysfs (Lina Iyer).

- Improve the ACPI _CST code handling diagnostics (Chen Yu).

- Update the PSCI cpuidle driver to allow the PM domain
  initialization to occur in the OSI mode as well as in the PC mode
  (Ulf Hansson).

- Rework the generic power domains (genpd) core code to allow domain
  power off transition to be aborted in the absence of the "power
  off" domain callback (Ulf Hansson).

- Fix two suspend-to-idle issues in the ACPI EC driver (Rafael
  Wysocki).

- Fix the handling of timer_expires in the PM-runtime framework on
  32-bit systems and the handling of device links in it (Grygorii
  Strashko, Xiang Chen).

- Add IO requests batching support to the hibernate image saving and
  reading code and drop a bogus get_gendisk() from there (Xiaoyi
  Chen, Christoph Hellwig).

- Allow PCIe ports to be put into the D3cold power state if they are
  power-manageable via ACPI (Lukas Wunner).

- Add missing header file include to a power capping driver (Pujin
  Shi).

- Clean up the qcom-cpr AVS driver a bit (Liu Shixin).

- Kevin Hilman steps down as designated reviewer of adaptive voltage
  scaling (AVS) drivers (Kevin Hilman)"

* tag 'pm-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
cpufreq: stats: Fix string format specifier mismatch
arm: disable frequency invariance for CONFIG_BL_SWITCHER
cpufreq,arm,arm64: restructure definitions of arch_set_freq_scale()
cpufreq: stats: Add memory barrier to store_reset()
cpufreq: schedutil: Simplify sugov_fast_switch()
ACPI: EC: PM: Drop ec_no_wakeup check from acpi_ec_dispatch_gpe()
ACPI: EC: PM: Flush EC work unconditionally after wakeup
PCI/ACPI: Whitelist hotplug ports for D3 if power managed by ACPI
PM: hibernate: remove the bogus call to get_gendisk() in software_resume()
cpufreq: Move traces and update to policy->cur to cpufreq core
cpufreq: stats: Enable stats for fast-switch as well
cpufreq: stats: Mark few conditionals with unlikely()
cpufreq: stats: Remove locking
cpufreq: stats: Defer stats update to cpufreq_stats_record_transition()
PM: domains: Allow to abort power off when no ->power_off() callback
PM: domains: Rename power state enums for genpd
PM / devfreq: tegra30: Improve initial hardware resetting
PM / devfreq: event: Change prototype of devfreq_event_get_edev_by_phandle function
PM / devfreq: Change prototype of devfreq_get_devfreq_by_phandle function
PM / devfreq: Add devfreq_get_devfreq_by_node function
...
13 Oct, 2020
1 commit
-
Pull scheduler updates from Ingo Molnar:
- reorganize & clean up the SD* flags definitions and add a bunch of
  sanity checks. These new checks caught quite a few bugs or at least
  inconsistencies, resulting in another set of patches.
- rseq updates, add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
- add a new tracepoint to improve CPU capacity tracking
- improve overloaded SMP system load-balancing behavior
- tweak SMT balancing
- energy-aware scheduling updates
- NUMA balancing improvements
- deadline scheduler fixes and improvements
- CPU isolation fixes
- misc cleanups, simplifications and smaller optimizations
* tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
sched/deadline: Unthrottle PI boosted threads while enqueuing
sched/debug: Add new tracepoint to track cpu_capacity
sched/fair: Tweak pick_next_entity()
rseq/selftests: Test MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
rseq/selftests,x86_64: Add rseq_offset_deref_addv()
rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
sched/fair: Use dst group while checking imbalance for NUMA balancer
sched/fair: Reduce busy load balance interval
sched/fair: Minimize concurrent LBs between domain level
sched/fair: Reduce minimal imbalance threshold
sched/fair: Relax constraint on task's load during load balance
sched/fair: Remove the force parameter of update_tg_load_avg()
sched/fair: Fix wrong cpu selecting from isolated domain
sched: Remove unused inline function uclamp_bucket_base_value()
sched/rt: Disable RT_RUNTIME_SHARE by default
sched/deadline: Fix stale throttling on de-/boosted tasks
sched/numa: Use runnable_avg to classify node
sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL
MAINTAINERS: Add myself as SCHED_DEADLINE reviewer
sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h
...
07 Oct, 2020
1 commit
-
Drop a redundant local variable definition from sugov_fast_switch()
and rearrange the code in there to avoid the redundant logical
negation.

Signed-off-by: Rafael J. Wysocki
Acked-by: Viresh Kumar
05 Oct, 2020
1 commit
-
The cpufreq core handles the updates to policy->cur and recording of
cpufreq trace events for all the governors except schedutil's fast
switch case.

Move that as well to cpufreq core for consistency and readability.
Signed-off-by: Viresh Kumar
Signed-off-by: Rafael J. Wysocki
03 Oct, 2020
3 commits
-
stress-ng has a test (stress-ng --cyclic) that creates a set of threads
under SCHED_DEADLINE with the following parameters:

    dl_runtime   =  10000 (10 us)
    dl_deadline  = 100000 (100 us)
    dl_period    = 100000 (100 us)

These parameters are very aggressive. When using a system without HRTICK
set, these threads can easily execute longer than the dl_runtime because
the throttling happens with 1/HZ resolution.

During the main part of the test, the system works just fine because
the workload does not try to run over the 10 us. The problem happens at
the end of the test, on the exit() path. During exit(), the threads need
to do some cleanups that require real-time mutex locks, mainly those
related to memory management, resulting in this scenario:

Note: locks are rt_mutexes...
------------------------------------------------------------------------
   TASK A:              TASK B:                      TASK C:
   activation
                                                     activation
                        activation

   lock(a): OK!         lock(b): OK!
                        lock(a)
                        -> block (task A owns it)
                          -> self notice/set throttled
+--<                      -> arm replenished timer
|                       switch-out
|                                                    lock(b)
|                                                    -> B prio>
|                                                    -> boost TASK B
|  unlock(a)                                         switch-out
|  -> handle lock a to B
|    -> wakeup(B)
|      -> B is throttled:
|        -> do not enqueue
|     switch-out
|
|
+---------------------> replenishment timer
                        -> TASK B is boosted:
                          -> do not enqueue
------------------------------------------------------------------------

BOOM: TASK B is runnable but !enqueued, holding TASK C: the system
crashes with hung task C.

This problem is avoided by removing the throttle state from the boosted
thread while boosting it (by TASK A in the example above), allowing it to
be queued and run boosted.

The next replenishment will take care of the runtime overrun, pushing
the deadline further away. See the "while (dl_se->runtime
Signed-off-by: Daniel Bristot de Oliveira
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Juri Lelli
Tested-by: Mark Simmons
Link: https://lkml.kernel.org/r/5076e003450835ec74e6fa5917d02c4fa41687e6.1600170294.git.bristot@redhat.com
-
rq->cpu_capacity is a key element in several scheduler parts, such as EAS
task placement and load balancing. Tracking this value enables testing
and/or debugging by a toolkit.

Signed-off-by: Vincent Donnefort
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com
-
Currently, pick_next_entity(...) has the following structure
(simplified):

    [...]
    if (last_buddy_ok())
        result = last_buddy;
    if (next_buddy_ok())
        result = next_buddy;
    [...]

The intended behavior is to prefer next buddy over last buddy;
the current code somewhat obfuscates this, and also wastes
cycles checking the last buddy when eventually the next buddy is
picked up.

So this patch refactors two 'ifs' above into

    [...]
    if (next_buddy_ok())
        result = next_buddy;
    else if (last_buddy_ok())
        result = last_buddy;
    [...]

Signed-off-by: Peter Oskolkov
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Vincent Guittot
Link: https://lkml.kernel.org/r/20200930173532.1069092-1-posk@google.com
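
The equivalence of the two structures in the pick_next_entity() change, and the saved check, can be verified with a small standalone model. All names below are hypothetical; the boolean arguments stand in for the last_buddy_ok() and next_buddy_ok() conditions from the simplified snippet.

```c
#include <stdbool.h>

enum pick_sketch { PICK_DEFAULT, PICK_LAST_BUDDY, PICK_NEXT_BUDDY };

/* Counts how many buddy eligibility checks were evaluated. */
static int buddy_checks_sketch;

static bool check_sketch(bool ok)
{
    buddy_checks_sketch++;
    return ok;
}

/* Structure before the patch: both checks always run, last one wins. */
static enum pick_sketch pick_before(bool last_ok, bool next_ok)
{
    enum pick_sketch result = PICK_DEFAULT;

    if (check_sketch(last_ok))
        result = PICK_LAST_BUDDY;
    if (check_sketch(next_ok))
        result = PICK_NEXT_BUDDY;
    return result;
}

/* Structure after the patch: next buddy first, last buddy only if needed. */
static enum pick_sketch pick_after(bool last_ok, bool next_ok)
{
    enum pick_sketch result = PICK_DEFAULT;

    if (check_sketch(next_ok))
        result = PICK_NEXT_BUDDY;
    else if (check_sketch(last_ok))
        result = PICK_LAST_BUDDY;
    return result;
}
```

Both variants return the same result for every combination of inputs; the second one simply skips the last-buddy check whenever the next buddy is eligible.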
25 Sep, 2020
11 commits
-
This patchset is based on Google-internal RSEQ work done by Paul
Turner and Andrew Hunter.

When working with per-CPU RSEQ-based memory allocations, it is
sometimes important to make sure that a global memory location is no
longer accessed from RSEQ critical sections. For example, there can be
two per-CPU lists, one is "active" and accessed per-CPU, while another
one is inactive and worked on asynchronously "off CPU" (e.g. garbage
collection is performed). Then at some point the two lists are
swapped, and a fast RCU-like mechanism is required to make sure that
the previously active list is no longer accessed.

This patch introduces such a mechanism: in short, membarrier() syscall
issues an IPI to a CPU, restarting a potentially active RSEQ critical
section on the CPU.

Signed-off-by: Peter Oskolkov
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Mathieu Desnoyers
Link: https://lkml.kernel.org/r/20200923233618.2572849-1-posk@google.com
-
Barry Song noted the following:

  Something is wrong. In find_busiest_group(), we are checking if
  src has higher load, however, in task_numa_find_cpu(), we are
  checking if dst will have higher load after balancing. It seems
  it is not sensible to check src.

  It may cause a wrong imbalance value, for example,
  if dst_running = env->dst_stats.nr_running + 1 results in 3 or
  above, and src_running = env->src_stats.nr_running - 1 results
  in 1;

  The current code is thinking imbalance as 0 since src_running is
  smaller than 2. This is inconsistent with load balancer.

Basically, in find_busiest_group(), the NUMA imbalance is ignored if moving
a task "from an almost idle domain" to a "domain with spare capacity". This
patch forbids movement "from a misplaced domain" to "an almost idle domain"
as that is closer to what the CPU load balancer expects.

This patch is not a universal win. The old behaviour was intended to allow
a task from an almost idle NUMA node to migrate to its preferred node if
the destination had capacity but there are corner cases. For example,
a NAS compute load could be parallelised to use 1/3rd of available CPUs
but not all those potential tasks are active at all times allowing this
logic to trigger. An obvious example is specjbb 2005 running various
numbers of warehouses on a 2 socket box with 80 cpus.

specjbb
5.9.0-rc4 5.9.0-rc4
vanilla dstbalance-v1r1
Hmean tput-1 46425.00 ( 0.00%) 43394.00 * -6.53%*
Hmean tput-2 98416.00 ( 0.00%) 96031.00 * -2.42%*
Hmean tput-3 150184.00 ( 0.00%) 148783.00 * -0.93%*
Hmean tput-4 200683.00 ( 0.00%) 197906.00 * -1.38%*
Hmean tput-5 236305.00 ( 0.00%) 245549.00 * 3.91%*
Hmean tput-6 281559.00 ( 0.00%) 285692.00 * 1.47%*
Hmean tput-7 338558.00 ( 0.00%) 334467.00 * -1.21%*
Hmean tput-8 340745.00 ( 0.00%) 372501.00 * 9.32%*
Hmean tput-9 424343.00 ( 0.00%) 413006.00 * -2.67%*
Hmean tput-10 421854.00 ( 0.00%) 434261.00 * 2.94%*
Hmean tput-11 493256.00 ( 0.00%) 485330.00 * -1.61%*
Hmean tput-12 549573.00 ( 0.00%) 529959.00 * -3.57%*
Hmean tput-13 593183.00 ( 0.00%) 555010.00 * -6.44%*
Hmean tput-14 588252.00 ( 0.00%) 599166.00 * 1.86%*
Hmean tput-15 623065.00 ( 0.00%) 642713.00 * 3.15%*
Hmean tput-16 703924.00 ( 0.00%) 660758.00 * -6.13%*
Hmean tput-17 666023.00 ( 0.00%) 697675.00 * 4.75%*
Hmean tput-18 761502.00 ( 0.00%) 758360.00 * -0.41%*
Hmean tput-19 796088.00 ( 0.00%) 798368.00 * 0.29%*
Hmean tput-20 733564.00 ( 0.00%) 823086.00 * 12.20%*
Hmean tput-21 840980.00 ( 0.00%) 856711.00 * 1.87%*
Hmean tput-22 804285.00 ( 0.00%) 872238.00 * 8.45%*
Hmean tput-23 795208.00 ( 0.00%) 889374.00 * 11.84%*
Hmean tput-24 848619.00 ( 0.00%) 966783.00 * 13.92%*
Hmean tput-25 750848.00 ( 0.00%) 903790.00 * 20.37%*
Hmean tput-26 780523.00 ( 0.00%) 962254.00 * 23.28%*
Hmean tput-27 1042245.00 ( 0.00%) 991544.00 * -4.86%*
Hmean tput-28 1090580.00 ( 0.00%) 1035926.00 * -5.01%*
Hmean tput-29 999483.00 ( 0.00%) 1082948.00 * 8.35%*
Hmean tput-30 1098663.00 ( 0.00%) 1113427.00 * 1.34%*
Hmean tput-31 1125671.00 ( 0.00%) 1134175.00 * 0.76%*
Hmean tput-32 968167.00 ( 0.00%) 1250286.00 * 29.14%*
Hmean tput-33 1077676.00 ( 0.00%) 1060893.00 * -1.56%*
Hmean tput-34 1090538.00 ( 0.00%) 1090933.00 * 0.04%*
Hmean tput-35 967058.00 ( 0.00%) 1107421.00 * 14.51%*
Hmean tput-36 1051745.00 ( 0.00%) 1210663.00 * 15.11%*
Hmean tput-37 1019465.00 ( 0.00%) 1351446.00 * 32.56%*
Hmean tput-38 1083102.00 ( 0.00%) 1064541.00 * -1.71%*
Hmean tput-39 1232990.00 ( 0.00%) 1303623.00 * 5.73%*
Hmean tput-40 1175542.00 ( 0.00%) 1340943.00 * 14.07%*
Hmean tput-41 1127826.00 ( 0.00%) 1339492.00 * 18.77%*
Hmean tput-42 1198313.00 ( 0.00%) 1411023.00 * 17.75%*
Hmean tput-43 1163733.00 ( 0.00%) 1228253.00 * 5.54%*
Hmean tput-44 1305562.00 ( 0.00%) 1357886.00 * 4.01%*
Hmean tput-45 1326752.00 ( 0.00%) 1406061.00 * 5.98%*
Hmean tput-46 1339424.00 ( 0.00%) 1418451.00 * 5.90%*
Hmean tput-47 1415057.00 ( 0.00%) 1381570.00 * -2.37%*
Hmean tput-48 1392003.00 ( 0.00%) 1421167.00 * 2.10%*
Hmean tput-49 1408374.00 ( 0.00%) 1418659.00 * 0.73%*
Hmean tput-50 1359822.00 ( 0.00%) 1391070.00 * 2.30%*
Hmean tput-51 1414246.00 ( 0.00%) 1392679.00 * -1.52%*
Hmean tput-52 1432352.00 ( 0.00%) 1354020.00 * -5.47%*
Hmean tput-53 1387563.00 ( 0.00%) 1409563.00 * 1.59%*
Hmean tput-54 1406420.00 ( 0.00%) 1388711.00 * -1.26%*
Hmean tput-55 1438804.00 ( 0.00%) 1387472.00 * -3.57%*
Hmean tput-56 1399465.00 ( 0.00%) 1400296.00 * 0.06%*
Hmean tput-57 1428132.00 ( 0.00%) 1396399.00 * -2.22%*
Hmean tput-58 1432385.00 ( 0.00%) 1386253.00 * -3.22%*
Hmean tput-59 1421612.00 ( 0.00%) 1371416.00 * -3.53%*
Hmean tput-60 1429423.00 ( 0.00%) 1389412.00 * -2.80%*
Hmean tput-61 1396230.00 ( 0.00%) 1351122.00 * -3.23%*
Hmean tput-62 1418396.00 ( 0.00%) 1383098.00 * -2.49%*
Hmean tput-63 1409918.00 ( 0.00%) 1374662.00 * -2.50%*
Hmean tput-64 1410236.00 ( 0.00%) 1376216.00 * -2.41%*
Hmean tput-65 1396405.00 ( 0.00%) 1364418.00 * -2.29%*
Hmean tput-66 1395975.00 ( 0.00%) 1357326.00 * -2.77%*
Hmean tput-67 1392986.00 ( 0.00%) 1349642.00 * -3.11%*
Hmean tput-68 1386541.00 ( 0.00%) 1343261.00 * -3.12%*
Hmean tput-69 1374407.00 ( 0.00%) 1342588.00 * -2.32%*
Hmean tput-70 1377513.00 ( 0.00%) 1334654.00 * -3.11%*
Hmean tput-71 1369319.00 ( 0.00%) 1334952.00 * -2.51%*
Hmean tput-72 1354635.00 ( 0.00%) 1329005.00 * -1.89%*
Hmean tput-73 1350933.00 ( 0.00%) 1318942.00 * -2.37%*
Hmean tput-74 1351714.00 ( 0.00%) 1316347.00 * -2.62%*
Hmean tput-75 1352198.00 ( 0.00%) 1309974.00 * -3.12%*
Hmean tput-76 1349490.00 ( 0.00%) 1286064.00 * -4.70%*
Hmean tput-77 1336131.00 ( 0.00%) 1303684.00 * -2.43%*
Hmean tput-78 1308896.00 ( 0.00%) 1271024.00 * -2.89%*
Hmean tput-79 1326703.00 ( 0.00%) 1290862.00 * -2.70%*
Hmean tput-80 1336199.00 ( 0.00%) 1291629.00 * -3.34%*

The performance at the mid-point is better but not universally better. The
patch is a mixed bag depending on the workload, machine and overall
levels of utilisation. Sometimes it's better (sometimes much better),
other times it is worse (sometimes much worse). Given that there isn't a
universally good decision in this section and more people seem to prefer
the patch then it may be best to keep the LB decisions consistent and
revisit imbalance handling when the load balancer code changes settle down.

Jirka Hladky added the following observation.

  Our results are mostly in line with what you see. We observe
  big gains (20-50%) when the system is loaded to 1/3 of the
  maximum capacity and mixed results at the full load - some
  workloads benefit from the patch at the full load, others not,
  but performance changes at the full load are mostly within the
  noise of results (+/-5%). Overall, we think this patch is helpful.

[mgorman@techsingularity.net: Rewrote changelog]
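
Barry Song's example can be worked through with a small model of the two checks. This is a sketch under assumptions: the function names are hypothetical, and the threshold treats a node with at most 2 running tasks as almost idle, which is an assumption consistent with (but not spelled out by) the "smaller than 2" remark above.

```c
/* Assumed threshold: at most this many running tasks counts as almost idle. */
#define IMBALANCE_MIN_SKETCH 2

/* Ignore the imbalance when the examined node is almost idle. */
static int adjust_numa_imbalance_sketch(int imbalance, int nr_running)
{
    if (nr_running <= IMBALANCE_MIN_SKETCH)
        return 0;
    return imbalance;
}

/* Old behaviour: the decision looks at the source node after the move. */
static int imbalance_checking_src(int src_nr_running, int dst_nr_running)
{
    int src_running = src_nr_running - 1;
    int dst_running = dst_nr_running + 1;
    int imbalance = dst_running > src_running ? dst_running - src_running : 0;

    return adjust_numa_imbalance_sketch(imbalance, src_running);
}

/* New behaviour: the decision looks at the destination node after the move. */
static int imbalance_checking_dst(int src_nr_running, int dst_nr_running)
{
    int src_running = src_nr_running - 1;
    int dst_running = dst_nr_running + 1;
    int imbalance = dst_running > src_running ? dst_running - src_running : 0;

    return adjust_numa_imbalance_sketch(imbalance, dst_running);
}
```

With two tasks on each node, the move would leave 1 task on the source and 3 on the destination. Checking the source reports an imbalance of 0, while checking the destination reports 2, which is the inconsistency the quote points out.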
Fixes: fb86f5b211 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity")
Signed-off-by: Barry Song
Signed-off-by: Mel Gorman
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20200921221849.GI3179@techsingularity.net -
The busy_factor, which increases the load balance interval when a CPU is busy,
is set to 32 by default. This value generates some huge LB intervals on
large systems like the THX2, made of 2 nodes x 28 cores x 4 threads.
For such a system, the interval increases from 112ms to 3584ms at MC level.
And from 224ms to 7168ms at NUMA level.

Even on smaller systems, a lower busy factor has shown improvement on the
fair distribution of the running time, so let's reduce it for all.

Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Phil Auld
Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org -
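
To illustrate the busy_factor entry above: the busy interval is the base interval multiplied by busy_factor. The sketch below reproduces the 112ms to 3584ms figure quoted for the MC level. The function name and the alternative factor of 16 are assumptions for illustration only; the text does not state the new value.

```python
def busy_interval_ms(base_interval_ms: int, busy_factor: int) -> int:
    """Load balance interval used while the CPU is busy (simplified model)."""
    return base_interval_ms * busy_factor


mc_base_ms = 112  # MC-level base interval quoted in the commit message

default_mc = busy_interval_ms(mc_base_ms, 32)  # default busy_factor from the text
reduced_mc = busy_interval_ms(mc_base_ms, 16)  # hypothetical lower factor

print(default_mc, reduced_mc)
```

Any lower factor shrinks the busy interval proportionally, which is why the change matters most on machines with large base intervals.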
Sched domains tend to trigger the load balance loop simultaneously, but
the larger domains often need more time to collect statistics. This
slowness makes the larger domain try to detach tasks from a rq when those
tasks have already migrated somewhere else at a sub-domain level. This is not
a real problem for idle LB because the period of smaller domains will
increase as their CPUs become busy, and this will leave time for higher ones
to pull tasks. But this becomes a problem when all CPUs are already busy
because all domains stay synced when they trigger their LB.

A simple way to minimize simultaneous LB of all domains is to decrement the
busy interval by 1 jiffy. Because of the busy_factor, the interval of a
larger domain will not be a multiple of smaller ones anymore.

Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Phil Auld
Link: https://lkml.kernel.org/r/20200921072424.14813-4-vincent.guittot@linaro.org -
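
The effect of the 1-jiffy decrement described in the entry above can be checked with a small model. This is a sketch under stated assumptions: both domains start balancing at jiffy 0 and fire strictly periodically, and the base intervals of 4 and 8 jiffies are hypothetical values chosen for illustration.

```python
def coinciding_ticks(small: int, large: int, horizon: int) -> int:
    """Count jiffies in (0, horizon] where both periodic balances fire."""
    return sum(1 for t in range(large, horizon + 1, large) if t % small == 0)


busy_factor = 32                # default value quoted in the log
base_small, base_large = 4, 8   # hypothetical base intervals, in jiffies
horizon = 5120                  # observation window, in jiffies

# Without the decrement, the larger interval is a multiple of the smaller one.
synced = coinciding_ticks(base_small * busy_factor,
                          base_large * busy_factor, horizon)

# With the decrement, the intervals become 127 and 255 jiffies.
staggered = coinciding_ticks(base_small * busy_factor - 1,
                             base_large * busy_factor - 1, horizon)

print(synced, staggered)
```

In this model every balance of the larger domain collides with the smaller one before the change, and none do within the window after it.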
The 25% default imbalance threshold for DIE and NUMA domains is large
enough to generate significant unfairness between threads. A typical
example is the case of 11 threads running on 2x4 CPUs. The imbalance of
20% between the 2 groups of 4 cores is just low enough to not trigger
the load balance between the 2 groups. We will always have the same 6
threads on one group of 4 CPUs and the other 5 threads on the other
group of CPUs. With fair time sharing in each group, we end up with
+20% running time for the group of 5 threads.

Consider decreasing the imbalance threshold for the overloaded case, where we
use the load to balance tasks and to ensure fair time sharing.

Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Phil Auld
Acked-by: Hillf Danton
Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org -
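
The arithmetic behind the 11-thread example in the entry above can be worked through as follows. This is a simplified model, assuming every thread carries equal load and each group shares its CPUs perfectly fairly; the variable names are illustrative.

```python
from fractions import Fraction

cpus_per_group = 4
threads = (6, 5)  # 11 threads split across the two groups of 4 CPUs

# Relative load imbalance between the groups: 6 threads versus 5.
imbalance = Fraction(threads[0], threads[1]) - 1   # 1/5, i.e. 20%
threshold = Fraction(25, 100)                      # default threshold from the text
triggers_balance = imbalance > threshold

# CPU share each thread receives under fair sharing inside its group.
share = [Fraction(cpus_per_group, n) for n in threads]   # 2/3 and 4/5
extra_runtime = share[1] / share[0] - 1                  # advantage of the 5-thread group

print(float(imbalance), triggers_balance, float(extra_runtime))
```

Because 20% stays below the 25% threshold, the split never changes, and the threads in the smaller group keep their 20% advantage indefinitely.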
Some use cases, like 9 always-running tasks on 8 CPUs, can't be balanced and the
load balancer currently migrates the waiting task between the CPUs in an
almost random manner. The success of a rq pulling a task depends on the
value of nr_balance_failed of its domains and its ability to be faster
than others to detach it. This behavior results in an unfair distribution
of the running time between tasks because some CPUs will run most of the
time, if not always, the same task whereas others will share their time
between several tasks.

Instead of using nr_balance_failed as a boolean to relax the condition
for detaching a task, the LB will use nr_balance_failed to relax the
threshold between the task's load and the imbalance. This mechanism
prevents the same rq or domain from always winning the load balance fight.

Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Phil Auld
Link: https://lkml.kernel.org/r/20200921072424.14813-2-vincent.guittot@linaro.org -
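
One way to picture the relaxed threshold from the entry above is a check that becomes more permissive with every failed attempt. The sketch below is an assumption, not the patch itself: the log does not give the exact scaling, so a right shift (halving per failure) is used as one plausible choice, and the function name and load values are hypothetical.

```python
def can_detach(task_load: int, imbalance: int, nr_balance_failed: int) -> bool:
    """Return True if the task is light enough to be pulled.

    Assumption: each failed balance attempt halves the load that is
    compared against the imbalance. The exact scaling is not stated
    in the log.
    """
    return (task_load >> nr_balance_failed) <= imbalance


# A task that is too heavy at first becomes eligible after repeated failures.
attempts = [can_detach(task_load=1024, imbalance=200, nr_balance_failed=n)
            for n in range(4)]

print(attempts)
```

Under this model a domain that keeps failing gradually gains the ability to pull the task, rather than one rq always winning.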
In the file fair.c, sometimes update_tg_load_avg(cfs_rq, 0) is used,
sometimes update_tg_load_avg(cfs_rq, false) is used.
update_tg_load_avg() has the parameter force, but in the current code
1 or true is never passed to it, so remove the force parameter.

Signed-off-by: Xianting Tian
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20200924014755.36253-1-tian.xianting@h3c.com -
We've hit a problem where tasks with a full cpumask
(e.g. by putting them into a cpuset or setting full affinity)
were occasionally migrated to our isolated CPUs in a production environment.

After some analysis, we found that it is due to the current
select_idle_smt() not considering the sched_domain mask.

Steps to reproduce on my 31-CPU hyperthreads machine:
1. with boot parameter: "isolcpus=domain,2-31"
(thread lists: 0,16 and 1,17)
2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
3. some threads will be migrated to the isolated cpu16~17.

Fix it by checking the valid domain mask in select_idle_smt().

Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()")
Reported-by: Wetp Zhang
Signed-off-by: Xunlei Pang
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Jiang Biao
Reviewed-by: Vincent Guittot
Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com -
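
The select_idle_smt() problem from the entry above can be modelled with plain sets. This is a toy sketch, not the kernel function: the Python signature, the `domain_span` argument and the idle set are assumptions for illustration, while the sibling lists and the isolated range come from the reproduction steps.

```python
def select_idle_smt(target, siblings, allowed, idle, domain_span=None):
    """Return an idle SMT sibling of target, or -1 (toy model using sets).

    domain_span=None models the behaviour before the fix, where only
    the task's affinity mask is checked.
    """
    for cpu in sorted(siblings[target]):
        if cpu not in allowed:
            continue
        if domain_span is not None and cpu not in domain_span:
            continue  # skip CPUs outside the scheduling domain (isolated)
        if cpu in idle:
            return cpu
    return -1


siblings = {0: {0, 16}, 1: {1, 17}}  # thread lists from the reproduction steps
allowed = set(range(32))             # task with a full cpumask (CPUs 0..31)
domain_span = {0, 1}                 # CPUs left after isolcpus=domain,2-31
idle = {16, 17}                      # hypothetical: only isolated siblings are idle

before = select_idle_smt(0, siblings, allowed, idle)
after = select_idle_smt(0, siblings, allowed, idle, domain_span)

print(before, after)
```

Without the domain check the isolated sibling is picked; with it, the search reports no candidate and the isolated CPU is left alone.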
There is no caller in the tree, so it can be removed.

Signed-off-by: YueHaibing
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Dietmar Eggemann
Link: https://lkml.kernel.org/r/20200922132410.48440-1-yuehaibing@huawei.com -
The RT_RUNTIME_SHARE sched feature enables the sharing of rt_runtime
between CPUs, allowing a CPU to run a real-time task up to 100% of the
time while leaving more space for non-real-time tasks to run on the CPUs
that lend rt_runtime.

The problem is that a CPU can easily borrow enough rt_runtime to allow
a spinning rt-task to run forever, starving per-cpu tasks like kworkers,
which are non-real-time by design.

This patch disables RT_RUNTIME_SHARE by default, avoiding this problem.
The feature will still be present for users that want to enable it,
though.

Signed-off-by: Daniel Bristot de Oliveira
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Wei Wang
Link: https://lkml.kernel.org/r/b776ab46817e3db5d8ef79175fa0d71073c051c7.1600697903.git.bristot@redhat.com -
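
The borrowing problem in the RT_RUNTIME_SHARE entry above can be sketched with a toy model. The function below is hypothetical and much simpler than the kernel's balancing logic: it assumes the lenders run no RT tasks (all their runtime is spare) and uses assumed values of 950ms of rt_runtime per 1000ms period on each CPU.

```python
def borrow_rt_runtime(runtimes_ms, borrower, period_ms):
    """Move spare rt_runtime from other CPUs to the borrower (toy model).

    Assumes every lender's runtime is spare; the borrower stops once
    it owns a full period.
    """
    result = list(runtimes_ms)
    for cpu in range(len(result)):
        if cpu == borrower:
            continue
        want = period_ms - result[borrower]
        if want <= 0:
            break
        lent = min(want, result[cpu])
        result[cpu] -= lent
        result[borrower] += lent
    return result


# Four CPUs, each with an assumed 950ms of rt_runtime per 1000ms period.
shared = borrow_rt_runtime([950, 950, 950, 950], borrower=0, period_ms=1000)

print(shared)
```

Once the borrower holds a full period of runtime, a spinning RT task on that CPU is never throttled, which is the starvation scenario the commit message describes.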
When a boosted task gets throttled, what normally happens is that it's
immediately enqueued again with ENQUEUE_REPLENISH, which replenishes the
runtime and clears the dl_throttled flag. There is a special case however:
if the throttling happened on sched-out and the task has been deboosted in
the meantime, the replenish is skipped as the task will return to its
normal scheduling class. This leaves the task with the dl_throttled flag
set.

Now if the task gets boosted up to the deadline scheduling class again
while it is sleeping, it's still in the throttled state. The normal wakeup
however will enqueue the task with ENQUEUE_REPLENISH not set, so we don't
actually place it on the rq. Thus we end up with a task that is runnable
but not actually on the rq. Neither does an immediate replenishment happen,
nor is the replenishment timer set up, so the task is stuck in
forever-throttled limbo.

Clear the dl_throttled flag before dropping back to the normal scheduling
class to fix this issue.

Signed-off-by: Lucas Stach
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Juri Lelli
Link: https://lkml.kernel.org/r/20200831110719.2126930-1-l.stach@pengutronix.de
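
The stuck state described in the dl_throttled entry above can be followed step by step in a toy model. Everything here is a hypothetical sketch: the class, the function names and the `clear_throttle` switch are invented for illustration and only mirror the sequence of events in the commit message.

```python
class Task:
    def __init__(self):
        self.dl_throttled = False
        self.on_rq = False


def enqueue_dl(task, replenish):
    """Toy enqueue: a throttled task is only queued when replenished."""
    if task.dl_throttled and not replenish:
        return False  # not placed on the rq
    task.dl_throttled = False
    task.on_rq = True
    return True


def deboost(task, clear_throttle):
    """Task leaves the deadline class; the fix clears the stale flag."""
    if clear_throttle:
        task.dl_throttled = False


def wakeup_after_reboost(clear_throttle):
    task = Task()
    task.dl_throttled = True                  # throttled on sched-out
    deboost(task, clear_throttle)             # replenish is skipped here
    return enqueue_dl(task, replenish=False)  # normal wakeup after re-boost


stuck = not wakeup_after_reboost(clear_throttle=False)
fixed = wakeup_after_reboost(clear_throttle=True)

print(stuck, fixed)
```

With the flag left set, the wakeup never reaches the rq; clearing it on deboost lets the later wakeup enqueue the task normally.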