15 Jun, 2020

1 commit

  • This is a kernel enhancement that configures the cpu affinity of kernel
    threads via kernel boot option nohz_full=.

    When this option is specified, the cpumask is immediately applied upon
    kthread launch. This does not affect kernel threads that specify cpu
    and node.

    This allows CPU isolation (that is, not allowing certain threads
    to execute on certain CPUs) without using the isolcpus=domain parameter,
    making it possible to enable load balancing on such CPUs
    at runtime (see kernel-parameters.txt).
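
    The rule described above can be sketched as follows. This is an
    illustrative model only, not the kernel code: the helper names
    parse_cpulist() and kthread_affinity() are hypothetical, and CPU masks
    are modelled as plain Python sets.

```python
def parse_cpulist(text):
    """Expand a kernel-style CPU list such as "1-3,6" into a set of ints."""
    cpus = set()
    for part in text.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def kthread_affinity(possible, nohz_full, bound_cpu=None):
    """Return the set of CPUs a newly launched kernel thread may run on."""
    if bound_cpu is not None:
        # Threads created for a specific CPU keep that placement.
        return {bound_cpu}
    # Unbound threads are confined to the CPUs left out of nohz_full=.
    return possible - nohz_full


possible = set(range(8))
isolated = parse_cpulist("2-5")                          # as in nohz_full=2-5
print(sorted(kthread_affinity(possible, isolated)))      # [0, 1, 6, 7]
print(sorted(kthread_affinity(possible, isolated, 3)))   # [3]
```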

    Note-1: this is based on Wind River's patch at
    https://github.com/starlingx-staging/stx-integ/blob/master/kernel/kernel-std/centos/patches/affine-compute-kernel-threads.patch

    The difference is that this patch is limited to modifying the kernel
    thread cpumask. Behaviour of other threads can be controlled via cgroups
    or sched_setaffinity.

    Note-2: Wind River's patch was based off Christoph Lameter's patch at
    https://lwn.net/Articles/565932/ with the only difference being
    the kernel parameter changed from kthread to kthread_cpus.

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200527142909.23372-3-frederic@kernel.org

    Marcelo Tosatti
     

15 Apr, 2020

1 commit

  • The "isolcpus=" parameter allows sub-parameters before the cpulist is
    specified, and if the parser detects an unknown sub-parameter the whole
    parameter will be ignored.

    This design is incompatible with itself when new sub-parameters are added.
    An older kernel will not recognize the new sub-parameter and will
    invalidate the whole parameter so the CPU isolation will not take
    effect. It emits a warning:

    isolcpus: Error, unknown flag

    The better and compatible way is to allow "isolcpus=" to skip unknown
    sub-parameters, so that even if new sub-parameters are added an older
    kernel will still behave as usual even with the new sub-parameter
    specified on the command line.

    Ideally this should have been there when the first sub-parameter for
    "isolcpus=" was introduced.
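
    A minimal sketch of the skip-unknown behaviour, assuming that flags are
    the leading alphabetic tokens and that the cpulist starts at the first
    token beginning with a digit. parse_isolcpus() and KNOWN_FLAGS are
    hypothetical names for illustration, and falling back to "domain" when
    no known flag is given is likewise an assumption of this sketch.

```python
KNOWN_FLAGS = {"nohz", "domain", "managed_irq"}  # assumed set for this sketch


def parse_isolcpus(value):
    """Split an isolcpus= value into (flags, skipped, cpulist)."""
    flags, skipped = set(), []
    tokens = value.split(",")
    i = 0
    # Leading tokens that start with a letter are treated as sub-parameters.
    while i < len(tokens) and tokens[i][:1].isalpha():
        if tokens[i] in KNOWN_FLAGS:
            flags.add(tokens[i])
        else:
            # Unknown sub-parameter: remember it for a warning, keep parsing.
            skipped.append(tokens[i])
        i += 1
    if not flags:
        flags.add("domain")  # assumed default when no known flag was given
    return flags, skipped, ",".join(tokens[i:])


print(parse_isolcpus("nohz,future_flag,2-5"))
# ({'nohz'}, ['future_flag'], '2-5')
```

    With this shape, a command line written for a newer kernel still isolates
    CPUs 2-5 on an older one; only the unrecognised flag is dropped.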

    Suggested-by: Thomas Gleixner
    Signed-off-by: Peter Xu
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200403223517.406353-1-peterx@redhat.com

    Peter Xu
     

22 Jan, 2020

1 commit

  • The affinity of managed interrupts is completely handled in the kernel and
    cannot be changed via the /proc/irq/* interfaces from user space. As the
    kernel tries to spread out interrupts evenly across CPUs on x86 to prevent
    vector exhaustion, it can happen that a managed interrupt whose affinity
    mask contains both isolated and housekeeping CPUs is routed to an isolated
    CPU. As a consequence IO submitted on a housekeeping CPU causes interrupts
    on the isolated CPU.

    Add a new sub-parameter 'managed_irq' for 'isolcpus' and the corresponding
    logic in the interrupt affinity selection code.

    The subparameter indicates to the interrupt affinity selection logic that
    it should try to avoid the above scenario.

    This isolation is best effort and only effective if the automatically
    assigned interrupt mask of a device queue contains isolated and
    housekeeping CPUs. If housekeeping CPUs are online then such interrupts are
    directed to the housekeeping CPU so that IO submitted on the housekeeping
    CPU cannot disturb the isolated CPU.

    If a queue's affinity mask contains only isolated CPUs then this parameter
    has no effect on the interrupt routing decision, though interrupts are only
    happening when tasks running on those isolated CPUs submit IO. IO submitted
    on housekeeping CPUs has no influence on those queues.

    If the affinity mask contains both housekeeping and isolated CPUs, but none
    of the contained housekeeping CPUs is online, then the interrupt is also
    routed to an isolated CPU. Interrupts are only delivered when one of the
    isolated CPUs in the affinity mask submits IO. If one of the contained
    housekeeping CPUs comes online, the CPU hotplug logic migrates the
    interrupt automatically back to the upcoming housekeeping CPU. Depending on
    the type of interrupt controller, this can require that at least one
    interrupt is delivered to the isolated CPU in order to complete the
    migration.
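
    The three cases above can be summarised in a small model. This is a
    sketch under stated assumptions, not the interrupt core code: masks are
    Python sets and effective_irq_cpus() is a hypothetical helper name.

```python
def effective_irq_cpus(affinity, housekeeping, online):
    """Pick the CPUs a managed interrupt should be routed to (best effort)."""
    preferred = affinity & housekeeping
    if preferred & online:
        # At least one housekeeping CPU in the mask is online: use only
        # the housekeeping part so isolated CPUs are left alone.
        return preferred
    # Only isolated CPUs in the mask, or no housekeeping CPU online:
    # the original affinity mask is used unchanged.
    return affinity


housekeeping = {0, 1}
online = {0, 1, 2, 3}

print(effective_irq_cpus({1, 2}, housekeeping, online))     # {1}
print(effective_irq_cpus({2, 3}, housekeeping, online))     # {2, 3}
print(effective_irq_cpus({1, 2}, housekeeping, {0, 2, 3}))  # {1, 2}
```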

    [ tglx: Removed unused parameter, added and edited comments/documentation
    and rephrased the changelog so it contains more details. ]

    Signed-off-by: Ming Lei
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20200120091625.17912-1-ming.lei@redhat.com

    Ming Lei
     

25 Jul, 2019

1 commit

  • In real product setups there will be housekeeping CPUs in each node. It
    is preferable to do housekeeping from the local node, falling back to the
    global online cpumask if no housekeeping CPU is found on the local node.
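
    The selection order can be sketched like this. It is a simplified model
    that assumes a two-level choice (local node first, then any online
    housekeeping CPU); pick_housekeeping_cpu() and its parameters are
    hypothetical names, and masks are Python sets.

```python
def pick_housekeeping_cpu(housekeeping, online, local_node_cpus):
    """Return a housekeeping CPU, preferring the caller's NUMA node."""
    local = housekeeping & online & local_node_cpus
    if local:
        return min(local)
    # No usable housekeeping CPU on this node: fall back to any online one.
    candidates = housekeeping & online
    return min(candidates) if candidates else None


housekeeping = {0, 8}            # one housekeeping CPU per node
online = set(range(16))
node1 = set(range(8, 16))

print(pick_housekeeping_cpu(housekeeping, online, node1))        # 8
print(pick_housekeeping_cpu(housekeeping, online - {8}, node1))  # 0
```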

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Frederic Weisbecker
    Reviewed-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/1561711901-4755-2-git-send-email-wanpengli@tencent.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

20 Jul, 2019

1 commit

  • Dedicated instances are currently disturbed by unnecessary jitter due
    to the emulated lapic timers firing on the same pCPUs where the
    vCPUs reside. Unlike ARM, Intel has no hardware virtual timer for guests,
    so both programming the timer in the guest and the firing of the emulated
    timer incur vmexits. This patch tries to avoid the vmexit when the emulated
    timer fires, at least in the dedicated instance scenario when nohz_full is
    enabled.

    In that case, the emulated timers can be offloaded to the nearest busy
    housekeeping CPUs, since APICv has been available in server processors for
    several years. The guest timer interrupt can then be injected via posted
    interrupts, which are delivered by the housekeeping CPU once the emulated
    timer fires.

    The host should be tuned so that vCPUs are placed on isolated physical
    processors, with several surplus pCPUs left for busy housekeeping.
    If mwait/hlt/pause vmexits are disabled to keep the vCPUs in non-root mode,
    a ~3% redis performance benefit can be observed on a Skylake server, and
    the number of external interrupt vmexits drops substantially. Without the
    patch:

    VM-EXIT             Samples  Samples%  Time%   Min Time  Max Time  Avg time
    EXTERNAL_INTERRUPT  42916    49.43%    39.30%  0.47us    106.09us  0.71us ( +- 1.09% )

    With the patch:

    VM-EXIT             Samples  Samples%  Time%   Min Time  Max Time  Avg time
    EXTERNAL_INTERRUPT  6871     9.29%     2.96%   0.44us    57.88us   0.72us ( +- 4.02% )

    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Marcelo Tosatti
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

04 May, 2019

1 commit

  • During housekeeping mask setup, currently a possible CPU is required.
    That does not guarantee the CPU would be available at boot time, so
    check to ensure that at least one present CPU is in the mask.
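
    A sketch of the added check, with masks modelled as Python sets.
    setup_housekeeping() is a hypothetical name, and falling back to the
    boot CPU when no housekeeping CPU is present is an assumption made for
    illustration; the message above only states that the check exists.

```python
def setup_housekeeping(possible, isolated, present, boot_cpu=0):
    """Return (housekeeping, isolated) with at least one present housekeeper."""
    housekeeping = possible - isolated
    if not (housekeeping & present):
        # Every candidate is merely "possible" and may never show up at
        # boot, so keep the boot CPU as a housekeeper (assumed fallback).
        housekeeping = housekeeping | {boot_cpu}
        isolated = isolated - {boot_cpu}
    return housekeeping, isolated


possible = set(range(8))
present = set(range(4))

# CPUs 0-3 isolated leaves only non-present CPUs 4-7 for housekeeping.
hk, iso = setup_housekeeping(possible, {0, 1, 2, 3}, present)
print(sorted(hk), sorted(iso))   # [0, 4, 5, 6, 7] [1, 2, 3]
```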

    Signed-off-by: Nicholas Piggin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rafael J . Wysocki
    Cc: Thomas Gleixner
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: https://lkml.kernel.org/r/20190411033448.20842-5-npiggin@gmail.com
    Signed-off-by: Ingo Molnar

    Nicholas Piggin
     

13 Feb, 2019

1 commit

  • The cpumasks updated here are not subject to concurrency and using
    atomic bitops for them is pointless and expensive. Use the non-atomic
    variants instead.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Viresh Kumar
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Link: http://lkml.kernel.org/r/2e2a10f84b9049a81eef94ed6d5989447c21e34a.1549963617.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     

03 Dec, 2018

1 commit

  • Go over the scheduler source code and fix common typos
    in comments - and a typo in an actual variable name.

    No change in functionality intended.

    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

04 Mar, 2018

1 commit

  • Do the following cleanups and simplifications:

    - sched/sched.h already includes , so no need to
    include it in sched/core.c again.

    - order the headers alphabetically

    - add all headers to kernel/sched/sched.h

    - remove all unnecessary includes from the .c files that
    are already included in kernel/sched/sched.h.

    Finally, make all scheduler .c files use a single common header:

    #include "sched.h"

    ... which now contains a union of the relied upon headers.

    This makes the various .c files easier to read and easier to handle.

    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

03 Mar, 2018

1 commit

  • A good number of small style inconsistencies have accumulated
    in the scheduler core, so do a pass over them to harmonize
    all these details:

    - fix speling in comments,

    - use curly braces for multi-line statements,

    - remove unnecessary parentheses from integer literals,

    - capitalize consistently,

    - remove stray newlines,

    - add comments where necessary,

    - remove invalid/unnecessary comments,

    - align structure definitions and other data types vertically,

    - add missing newlines for increased readability,

    - fix vertical tabulation where it's misaligned,

    - harmonize preprocessor conditional block labeling
    and vertical alignment,

    - remove line-breaks where they uglify the code,

    - add newline after local variable definitions,

    No change in functionality:

    md5:
    1191fa0a890cfa8132156d2959d7e9e2 built-in.o.before.asm
    1191fa0a890cfa8132156d2959d7e9e2 built-in.o.after.asm

    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

21 Feb, 2018

2 commits

  • When a CPU runs in full dynticks mode, a 1Hz tick remains in order to
    keep the scheduler stats alive. However this residual tick is a burden
    for bare metal tasks that can't stand any interruption at all, or want
    to minimize them.

    The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now
    outsource these scheduler ticks to the global workqueue so that a
    housekeeping CPU handles those remotely. The sched_class::task_tick()
    implementations have been audited and look safe to be called remotely
    as the target runqueue and its current task are passed as parameters
    and don't seem to be accessed locally.

    Note that in the case of using isolcpus, it's still up to the user to
    affine the global workqueues to the housekeeping CPUs through
    /sys/devices/virtual/workqueue/cpumask or domains isolation
    "isolcpus=nohz,domain".

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • As we prepare for offloading the residual 1hz scheduler ticks to
    workqueue, let's affine those to housekeepers so that they don't
    interrupt the CPUs that don't want to be disturbed.

    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1519186649-3242-5-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

27 Oct, 2017

7 commits

  • Add flags to control NOHZ and domain isolation from "isolcpus=", in
    order to centralize the isolation features to a common interface. Domain
    isolation remains the default so as not to break the existing isolcpus
    boot parameter behaviour.

    Further flags in the future may include 0hz (1hz tick offload) and timers,
    workqueue, RCU, kthread, watchdog, likely all merged together in a
    common flag ("async"?). In any case, this will have to be modifiable by
    cpusets.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-12-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • We want to centralize the isolation features in the housekeeping
    subsystem, and scheduler domain isolation is a significant part of them.

    No intended behaviour change, we just reuse the housekeeping cpumask
    and core code.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-11-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • We want to centralize the isolation management, done by the housekeeping
    subsystem. Therefore we need to handle the nohz_full= parameter from
    there.

    Since nohz_full= so far has involved unbound timers, watchdog, RCU
    and tilegx NAPI isolation, we keep that default behaviour.

    nohz_full= will be deprecated in the future. We want to control
    the isolation features from the isolcpus= parameter.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-10-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Before we implement isolcpus under housekeeping, we need the isolation
    features to be more fine-grained. For example, some people want NOHZ_FULL
    without the full scheduler isolation, others want full scheduler
    isolation without NOHZ_FULL.

    So let's cut all these isolation features piecewise, at the risk of
    overcutting it right now. We can still merge some flags later if they
    always make sense together.
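
    One way to picture the piecewise split is a bitmask of isolation
    features that callers test individually. The sketch below is a model
    only: the flag names, their values and the is_housekeeping() helper are
    hypothetical and are not taken from the patch.

```python
from enum import IntFlag


class Isolation(IntFlag):
    """Hypothetical per-feature isolation flags."""
    TIMER = 1
    RCU = 2
    MISC = 4
    SCHED = 8
    TICK = 16
    DOMAIN = 32


def is_housekeeping(cpu, feature, isolated_cpus, isolated_features):
    """True if `cpu` still does housekeeping work for `feature`."""
    if not (feature & isolated_features):
        return True            # this feature is not isolated at all
    return cpu not in isolated_cpus


# NOHZ_FULL-style isolation without scheduler domain isolation.
features = Isolation.TICK | Isolation.TIMER | Isolation.RCU | Isolation.MISC
isolated = {2, 3}

print(is_housekeeping(2, Isolation.TICK, isolated, features))    # False
print(is_housekeeping(2, Isolation.DOMAIN, isolated, features))  # True
print(is_housekeeping(0, Isolation.TICK, isolated, features))    # True
```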

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-9-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Housekeeping code still depends on the nohz_full static key. Since we want
    to decouple housekeeping from NOHZ, let's create a housekeeping specific
    static key.

    It's mostly relevant for calls to is_housekeeping_cpu() from the scheduler.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-6-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Nobody needs to access this detail. housekeeping_cpumask() already
    takes care of it.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-5-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The housekeeping code is currently tied to the NOHZ code. As we are
    planning to make housekeeping independent from it, start with moving
    the relevant code to its own file.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Acked-by: Paul E. McKenney
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-2-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker