06 Mar, 2021

7 commits

  • The function sync_runqueues_membarrier_state() should copy the
    membarrier state from the @mm received as a parameter to each runqueue
    whose current task is using that mm.

    However, the use of smp_call_function_many() skips the current runqueue,
    which is unintended. Replace by a call to on_each_cpu_mask().
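
    As a rough sketch (not the exact upstream diff; the callback and mask
    names are only assumed), the key difference is that
    smp_call_function_many() never runs the callback on the calling CPU,
    while on_each_cpu_mask() also invokes it locally when the current CPU
    is part of the mask:

      /* before: the current runqueue is silently skipped */
      smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, true);

      /* after: the current CPU is covered as well */
      on_each_cpu_mask(tmpmask, ipi_sync_rq_state, mm, true);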

    Fixes: 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load")
    Reported-by: Nadav Amit
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Cc: stable@vger.kernel.org # 5.4.x+
    Link: https://lore.kernel.org/r/74F1E842-4A84-47BF-B6C2-5407DFDD4A4A@gmail.com

    Mathieu Desnoyers
     
  • Now that we have set_affinity_pending::stop_pending to indicate if a
    stopper is in progress, and we have the guarantee that if that stopper
    exists, it will (eventually) complete our @pending, we can simplify the
    refcount scheme by no longer counting the stopper thread.

    Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210224131355.724130207@infradead.org

    Peter Zijlstra
     
  • Consider two concurrent calls:

      sched_setaffinity(p, X);        sched_setaffinity(p, Y);

    Then the first will install p->migration_pending = &my_pending; and
    issue stop_one_cpu_nowait(pending); and the second one will read
    p->migration_pending and _also_ issue: stop_one_cpu_nowait(pending),
    the _SAME_ @pending.

    This causes stopper list corruption.

    Add set_affinity_pending::stop_pending, to indicate if a stopper is in
    progress.
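
    A minimal sketch of the idea (struct layout and code shape assumed from
    this changelog, not copied verbatim from the tree): only the caller
    that flips stop_pending queues the stopper; later callers just wait on
    pending->done.

      struct set_affinity_pending {
              refcount_t              refs;
              unsigned int            stop_pending;
              struct completion       done;
              struct cpu_stop_work    stop_work;
              struct migration_arg    arg;
      };

      /* in affine_move_task(), with the rq lock held */
      if (!pending->stop_pending) {
              pending->stop_pending = true;
              stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
                                  &pending->arg, &pending->stop_work);
      }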

    Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210224131355.649146419@infradead.org

    Peter Zijlstra
     
  • When the purpose of migration_cpu_stop() is to migrate the task to
    'any' valid CPU, don't migrate the task when it's already running on a
    valid CPU.
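
    A hedged sketch of the added early exit (the encoding is assumed from
    this series: a negative dest_cpu means "any CPU inside the task's
    cpumask will do"):

      /* in migration_cpu_stop(), with the rq lock held */
      if (dest_cpu < 0 && cpumask_test_cpu(cpu_of(rq), p->cpus_ptr))
              goto out;  /* already running somewhere valid, nothing to do */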

    Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210224131355.569238629@infradead.org

    Peter Zijlstra
     
  • The SCA_MIGRATE_ENABLE and task_running() cases are almost identical,
    collapse them to avoid further duplication.

    Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210224131355.500108964@infradead.org

    Peter Zijlstra
     
  • When affine_move_task() issues a migration_cpu_stop(), the purpose of
    that function is to complete that @pending, not any random other
    p->migration_pending that might have gotten installed since.

    This realization greatly simplifies migration_cpu_stop() and allows the
    further steps needed to fix all this, as it provides the guarantee
    that @pending's stopper will complete @pending (and not some random
    other @pending).

    Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210224131355.430014682@infradead.org

    Peter Zijlstra
     
  • When affine_move_task(p) is called on a running task @p, which is not
    otherwise already changing affinity, we'll first set
    p->migration_pending and then do:

      stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);

    This then gets us to migration_cpu_stop() running on the CPU that was
    previously running our victim task @p.

    If we find that our task is no longer on that runqueue (this can
    happen because of a concurrent migration due to load-balance etc.),
    then we'll end up at the:

      } else if (dest_cpu < 0 || pending) {

    branch. Which we'll take because we set pending earlier. Here we first
    check if the task @p has already satisfied the affinity constraints,
    if so we bail early [A]. Otherwise we'll reissue migration_cpu_stop()
    onto the CPU that is now hosting our task @p:

      stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
                          &pending->arg, &pending->stop_work);

    Except, we've never initialized pending->arg, which will be all 0s.

    This then results in running migration_cpu_stop() on the next CPU with
    arg->p == NULL, which gives the by now obvious result of fireworks.

    The cure is to change affine_move_task() to always use pending->arg.
    Furthermore, we can use the exact same pattern as the
    SCA_MIGRATE_ENABLE case, since we'll block on the pending->done
    completion anyway; there is no point in adding yet another completion
    in stop_one_cpu().

    This then gives a clear distinction between the two
    migration_cpu_stop() use cases:

    - sched_exec() / migrate_task_to() : arg->pending == NULL
    - affine_move_task() : arg->pending != NULL;

    And we can have it ignore p->migration_pending when !arg->pending. Any
    stop work from sched_exec() / migrate_task_to() is in addition to stop
    works from affine_move_task(), which will be sufficient to issue the
    completion.
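
    Sketched out (code shape assumed; the struct names come from this
    series and target_cpu is a stand-in), the two call patterns now look
    like:

      /* sched_exec() / migrate_task_to(): one-shot, no pending */
      struct migration_arg arg = { .task = p, .dest_cpu = target_cpu };
      stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);

      /* affine_move_task(): always hand out the embedded, initialized arg */
      pending->arg = (struct migration_arg) {
              .task     = p,
              .dest_cpu = dest_cpu,
              .pending  = pending,
      };
      stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
                          &pending->arg, &pending->stop_work);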

    Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210224131355.357743989@infradead.org

    Peter Zijlstra
     

27 Feb, 2021

1 commit

  • Drop repeated words in kernel/events/.
    {if, the, that, with, time}

    Drop repeated words in kernel/locking/.
    {it, no, the}

    Drop repeated words in kernel/sched/.
    {in, not}

    Link: https://lkml.kernel.org/r/20210127023412.26292-1-rdunlap@infradead.org
    Signed-off-by: Randy Dunlap
    Acked-by: Will Deacon [kernel/locking/]
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Will Deacon
    Cc: Mathieu Desnoyers
    Cc: "Paul E. McKenney"
    Cc: Juri Lelli
    Cc: Vincent Guittot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

24 Feb, 2021

1 commit

  • Pull more power management updates from Rafael Wysocki:
    "These are fixes and cleanups on top of the power management material
    for 5.12-rc1 merged previously.

    Specifics:

    - Address cpufreq regression introduced in 5.11 that causes CPU
    frequency reporting to be distorted on systems with CPPC that use
    acpi-cpufreq as the scaling driver (Rafael Wysocki).

    - Fix regression introduced during the 5.10 development cycle related
    to CPU hotplug and policy recreation in the qcom-cpufreq-hw driver
    (Shawn Guo).

    - Fix recent regression in the operating performance points (OPP)
    framework that may cause frequency updates to be skipped by mistake
    in some cases (Jonathan Marek).

    - Simplify schedutil governor code and remove a misleading comment
    from it (Yue Hu).

    - Fix kerneldoc comment typo in the cpufreq core (Yue Hu)"

    * tag 'pm-5.12-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpufreq: Fix typo in kerneldoc comment
    cpufreq: schedutil: Remove update_lock comment from struct sugov_policy definition
    cpufreq: schedutil: Remove needless sg_policy parameter from ignore_dl_rate_limit()
    cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known
    cpufreq: qcom-hw: drop devm_xxx() calls from init/exit hooks
    opp: Don't skip freq update for different frequency

    Linus Torvalds
     

22 Feb, 2021

2 commits

  • Pull KVM updates from Paolo Bonzini:
    "x86:

    - Support for userspace to emulate Xen hypercalls

    - Raise the maximum number of user memslots

    - Scalability improvements for the new MMU.

    Instead of the complex "fast page fault" logic that is used in
    mmu.c, tdp_mmu.c uses an rwlock so that page faults are concurrent,
    but the code that can run against page faults is limited. Right now
    only page faults take the lock for reading; in the future this will
    be extended to some cases of page table destruction. I hope to
    switch the default MMU around 5.12-rc3 (some testing was delayed
    due to Chinese New Year).

    - Cleanups for MAXPHYADDR checks

    - Use static calls for vendor-specific callbacks

    - On AMD, use VMLOAD/VMSAVE to save and restore host state

    - Stop using deprecated jump label APIs

    - Workaround for AMD erratum that made nested virtualization
    unreliable

    - Support for LBR emulation in the guest

    - Support for communicating bus lock vmexits to userspace

    - Add support for SEV attestation command

    - Miscellaneous cleanups

    PPC:

    - Support for second data watchpoint on POWER10

    - Remove some complex workarounds for buggy early versions of POWER9

    - Guest entry/exit fixes

    ARM64:

    - Make the nVHE EL2 object relocatable

    - Cleanups for concurrent translation faults hitting the same page

    - Support for the standard TRNG hypervisor call

    - A bunch of small PMU/Debug fixes

    - Simplification of the early init hypercall handling

    Non-KVM changes (with acks):

    - Detection of contended rwlocks (implemented only for qrwlocks,
    because KVM only needs it for x86)

    - Allow __DISABLE_EXPORTS from assembly code

    - Provide a saner follow_pfn replacements for modules"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (192 commits)
    KVM: x86/xen: Explicitly pad struct compat_vcpu_info to 64 bytes
    KVM: selftests: Don't bother mapping GVA for Xen shinfo test
    KVM: selftests: Fix hex vs. decimal snafu in Xen test
    KVM: selftests: Fix size of memslots created by Xen tests
    KVM: selftests: Ignore recently added Xen tests' build output
    KVM: selftests: Add missing header file needed by xAPIC IPI tests
    KVM: selftests: Add operand to vmsave/vmload/vmrun in svm.c
    KVM: SVM: Make symbol 'svm_gp_erratum_intercept' static
    locking/arch: Move qrwlock.h include after qspinlock.h
    KVM: PPC: Book3S HV: Fix host radix SLB optimisation with hash guests
    KVM: PPC: Book3S HV: Ensure radix guest has no SLB entries
    KVM: PPC: Don't always report hash MMU capability for P9 < DD2.2
    KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path
    KVM: PPC: remove unneeded semicolon
    KVM: PPC: Book3S HV: Use POWER9 SLBIA IH=6 variant to clear SLB
    KVM: PPC: Book3S HV: No need to clear radix host SLB before loading HPT guest
    KVM: PPC: Book3S HV: Fix radix guest SLB side channel
    KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support
    KVM: PPC: Book3S HV: Introduce new capability for 2nd DAWR
    KVM: PPC: Book3S HV: Add infrastructure to support 2nd DAWR
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "Core scheduler updates:

    - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the
    preempt=none/voluntary/full boot options (default: full), to allow
    distros to build a PREEMPT kernel but fall back to close to
    PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling behavior via
    a boot time selection.

    There's also the /debug/sched_debug switch to do this runtime.

    This feature is implemented via runtime patching (a new variant of
    static calls).

    The scope of the runtime patching can be best reviewed by looking
    at the sched_dynamic_update() function in kernel/sched/core.c.

    ( Note that the dynamic none/voluntary mode isn't 100% identical,
    for example preempt-RCU is available in all cases, plus the
    preempt count is maintained in all models, which has runtime
    overhead even with the code patching. )

    The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast
    majority of distributions, are supposed to be unaffected.

    - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
    was found via rcutorture triggering a hang. The bug is that
    rcu_idle_enter() may wake up a NOCB kthread, but this happens after
    the last generic need_resched() check. Some cpuidle drivers fix it
    by chance but many others don't.

    In true 2020 fashion the original bug fix has grown into a 5-patch
    scheduler/RCU fix series plus another 16 RCU patches to address the
    underlying issue of missed preemption events. These are the initial
    fixes that should fix current incarnations of the bug.

    - Clean up rbtree usage in the scheduler, by providing & using the
    following consistent set of rbtree APIs:

    partial-order; less() based:
    - rb_add(): add a new entry to the rbtree
    - rb_add_cached(): like rb_add(), but for a rb_root_cached

    total-order; cmp() based:
    - rb_find(): find an entry in an rbtree
    - rb_find_add(): find an entry, and add if not found

    - rb_find_first(): find the first (leftmost) matching entry
    - rb_next_match(): continue from rb_find_first()
    - rb_for_each(): iterate a sub-tree using the previous two

    - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a
    single pass. This is a 4-commit series where each commit improves
    one aspect of the idle sibling scan logic.

    - Improve the cpufreq cooling driver by getting the effective CPU
    utilization metrics from the scheduler

    - Improve the fair scheduler's active load-balancing logic by
    reducing the number of active LB attempts & lengthen the
    load-balancing interval. This improves stress-ng mmapfork
    performance.

    - Fix CFS's estimated utilization (util_est) calculation bug that can
    result in too high utilization values

    Misc updates & fixes:

    - Fix the HRTICK reprogramming & optimization feature

    - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code

    - Reduce dl_add_task_root_domain() overhead

    - Fix uprobes refcount bug

    - Process pending softirqs in flush_smp_call_function_from_idle()

    - Clean up task priority related defines, remove *USER_*PRIO and
    USER_PRIO()

    - Simplify the sched_init_numa() deduplication sort

    - Documentation updates

    - Fix EAS bug in update_misfit_status(), which degraded the quality
    of energy-balancing

    - Smaller cleanups"

    * tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
    sched,x86: Allow !PREEMPT_DYNAMIC
    entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point
    entry: Explicitly flush pending rcuog wakeup before last rescheduling point
    rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
    rcu/nocb: Perform deferred wake up before last idle's need_resched() check
    rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
    sched/features: Distinguish between NORMAL and DEADLINE hrtick
    sched/features: Fix hrtick reprogramming
    sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()
    uprobes: (Re)add missing get_uprobe() in __find_uprobe()
    smp: Process pending softirqs in flush_smp_call_function_from_idle()
    sched: Harden PREEMPT_DYNAMIC
    static_call: Allow module use without exposing static_call_key
    sched: Add /debug/sched_preempt
    preempt/dynamic: Support dynamic preempt with preempt= boot option
    preempt/dynamic: Provide irqentry_exit_cond_resched() static call
    preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
    preempt/dynamic: Provide cond_resched() and might_resched() static calls
    preempt: Introduce CONFIG_PREEMPT_DYNAMIC
    static_call: Provide DEFINE_STATIC_CALL_RET0()
    ...

    Linus Torvalds
     


17 Feb, 2021

18 commits

  • Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
    kthread (rcuog) to be serviced.

    Usually a local wake up happening while running the idle task is handled
    in one of the need_resched() checks carefully placed within the idle
    loop that can break to the scheduler.

    Unfortunately the call to rcu_idle_enter() is already beyond the last
    generic need_resched() check and we may halt the CPU with a resched
    request unhandled, leaving the task hanging.

    Fix this by splitting the rcuog wakeup handling out of rcu_idle_enter()
    and placing it before the last generic need_resched() check in the idle
    loop. It is then assumed that no call to call_rcu() will be performed
    after that in the idle loop until the CPU is put in low power mode.

    Fixes: 96d3fd0d315a ("rcu: Break call_rcu() deadlock involving scheduler and perf")
    Reported-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20210131230548.32970-3-frederic@kernel.org

    Frederic Weisbecker
     
  • The HRTICK feature has traditionally been servicing configurations that
    need precise preemption points for NORMAL tasks. More recently, the
    feature has been extended to also service DEADLINE tasks with stringent
    runtime enforcement needs (e.g., runtime < 1ms with HZ=1000).

    Enabling the HRTICK sched feature currently enables the additional timer
    and task tick for both classes, which might introduce undesired overhead
    for no additional benefit if it is only needed for one of them.

    Separate HRTICK sched feature in two (and leave the traditional case
    name unmodified) so that it can be selectively enabled when needed.

    With:

    $ echo HRTICK > /sys/kernel/debug/sched_features

    the NORMAL/fair hrtick gets enabled.

    With:

    $ echo HRTICK_DL > /sys/kernel/debug/sched_features

    the DEADLINE hrtick gets enabled.

    Signed-off-by: Juri Lelli
    Signed-off-by: Luis Claudio R. Goncalves
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210208073554.14629-3-juri.lelli@redhat.com

    Juri Lelli
     
  • Hung tasks and RCU stall cases were reported on systems which were not
    100% busy. Investigation of such unexpected cases (no sign of potential
    starvation caused by tasks hogging the system) pointed out that the
    periodic sched tick timer wasn't serviced anymore after a certain point
    and that caused all machinery that depends on it (timers, RCU, etc.) to
    stop working as well. This issue was, however, only reproducible if
    HRTICK was enabled.

    Looking at core dumps it was found that the rbtree of the hrtimer base
    also used for the hrtick was corrupted (i.e. next as seen from the base
    root and the actual leftmost obtained by traversing the tree differed).
    The same base is also used for the periodic tick hrtimer, which might get
    "lost" if the rbtree gets corrupted.

    Much like what is described in commit 1f71addd34f4c ("tick/sched: Do not
    mess with an enqueued hrtimer"), there is a race window between
    hrtimer_set_expires() in hrtick_start() and hrtimer_start_expires() in
    __hrtick_restart() in which the former might operate on an already
    queued hrtick hrtimer, which can lead to corruption of the base.

    Use hrtimer_start() (which removes the timer before enqueueing it back) to
    ensure hrtick hrtimer reprogramming is entirely guarded by the base
    lock, so that no race conditions can occur.
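
    A sketch of the fixed pattern (rq->hrtick_time as the field carrying
    the expiry is assumed here, not taken verbatim from the tree):

      static void __hrtick_restart(struct rq *rq)
      {
              struct hrtimer *timer = &rq->hrtick_timer;
              ktime_t time = rq->hrtick_time;

              /*
               * hrtimer_start() removes an already queued timer before
               * re-enqueueing it, all under the hrtimer base lock, so a
               * concurrent expiry update can no longer corrupt the rbtree.
               */
              hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
      }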

    Signed-off-by: Juri Lelli
    Signed-off-by: Luis Claudio R. Goncalves
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210208073554.14629-2-juri.lelli@redhat.com

    Juri Lelli
     
  • dl_add_task_root_domain() is called during sched domain rebuild:

      rebuild_sched_domains_locked()
        partition_and_rebuild_sched_domains()
          rebuild_root_domains()
            for all top_cpuset descendants:
              update_tasks_root_domain()
                for all tasks of cpuset:
                  dl_add_task_root_domain()

    Change it so that only the task pi lock is taken to check if the task
    has a SCHED_DEADLINE (DL) policy. If p is a DL task, take the
    rq lock as well to be able to safely de-reference the root domain's DL
    bandwidth structure.

    Most of the tasks will have another policy (namely SCHED_NORMAL) and
    can now bail without taking the rq lock.

    One thing to note here: Even in case that there aren't any DL user
    tasks, a slow frequency switching system with cpufreq gov schedutil has
    a DL task (sugov) per frequency domain running which participates in DL
    bandwidth management.
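
    Roughly, the resulting locking shape looks like this (a sketch, not
    the verbatim function):

      void dl_add_task_root_domain(struct task_struct *p)
      {
              struct rq_flags rf;
              struct rq *rq;
              struct dl_bw *dl_b;

              raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
              if (!dl_task(p)) {
                      /* common case: not DL, bail without the rq lock */
                      raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
                      return;
              }

              rq = __task_rq_lock(p, &rf);

              dl_b = &rq->rd->dl_bw;
              raw_spin_lock(&dl_b->lock);
              __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
              raw_spin_unlock(&dl_b->lock);

              task_rq_unlock(rq, p, &rf);
      }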

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Quentin Perret
    Reviewed-by: Valentin Schneider
    Reviewed-by: Daniel Bristot de Oliveira
    Acked-by: Juri Lelli
    Link: https://lkml.kernel.org/r/20210119083542.19856-1-dietmar.eggemann@arm.com

    Dietmar Eggemann
     
  • Use the new EXPORT_STATIC_CALL_TRAMP() / static_call_mod() to unexport
    the static_call_key for the PREEMPT_DYNAMIC calls such that modules
    can no longer update these calls.

    Having modules change/hi-jack the preemption calls would be horrible.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add a debugfs file to muck about with the preempt mode at runtime.
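
    Assuming debugfs is mounted at /sys/kernel/debug, usage looks roughly
    like this (the read-back format, with the current mode in parentheses,
    is sketched rather than quoted):

      $ cat /sys/kernel/debug/sched_preempt
      none voluntary (full)
      $ echo voluntary > /sys/kernel/debug/sched_preempt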

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/YAsGiUYf6NyaTplX@hirez.programming.kicks-ass.net

    Peter Zijlstra
     
  • Support the preempt= boot option and patch the static call sites
    accordingly.
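
    As spelled out in the pull-request summary above, the accepted values
    are none/voluntary/full (default: full), e.g. on the command line of a
    CONFIG_PREEMPT_DYNAMIC=y kernel:

      preempt=none        # behave close to PREEMPT_NONE
      preempt=voluntary   # behave close to PREEMPT_VOLUNTARY
      preempt=full        # full preemption (the default)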

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210118141223.123667-9-frederic@kernel.org

    Peter Zijlstra (Intel)
     
  • Provide static calls to control preempt_schedule[_notrace]()
    (called in CONFIG_PREEMPT) so that we can override their behaviour when
    preempt= is overridden.

    Since the default behaviour is full preemption, both their calls are
    initialized to the arch provided wrapper, if any.

    [fweisbec: only define static calls when PREEMPT_DYNAMIC, make it less
    dependent on x86 with __preempt_schedule_func]
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210118141223.123667-7-frederic@kernel.org

    Peter Zijlstra (Intel)
     
  • Provide static calls to control cond_resched() (called in !CONFIG_PREEMPT)
    and might_resched() (called in CONFIG_PREEMPT_VOLUNTARY) so that we
    can override their behaviour when preempt= is overridden.

    Since the default behaviour is full preemption, both their calls are
    ignored when preempt= isn't passed.
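
    A purely illustrative sketch of the static-call shape (not the exact
    upstream definitions):

      DEFINE_STATIC_CALL(cond_resched, __cond_resched);

      int _cond_resched(void)
      {
              return static_call(cond_resched)();
      }

      /*
       * Selecting preempt=full later re-points the call at a stub that
       * just returns 0, turning cond_resched() into (almost) a NOP.
       */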

    [fweisbec: branch might_resched() directly to __cond_resched(), only
    define static calls when PREEMPT_DYNAMIC]

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210118141223.123667-6-frederic@kernel.org

    Peter Zijlstra (Intel)
     
  • The description of the RT offset and the values for 'normal' tasks needs
    an update. Moreover, there are DL tasks now.

    task_prio() has to stay as it is to guarantee compatibility with the
    /proc/<pid>/stat priority field:

      # cat /proc/<pid>/stat | awk '{ print $18; }'

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210128131040.296856-4-dietmar.eggemann@arm.com

    Dietmar Eggemann
     
  • The only remaining use of MAX_USER_PRIO (and USER_PRIO) is the
    SCALE_PRIO() definition in the PowerPC Cell architecture's Synergistic
    Processor Unit (SPU) scheduler. TASK_USER_PRIO isn't used anymore.

    Commit fe443ef2ac42 ("[POWERPC] spusched: Dynamic timeslicing for
    SCHED_OTHER") copied SCALE_PRIO() from the task scheduler in v2.6.23.

    Commit a4ec24b48dde ("sched: tidy up SCHED_RR") removed it from the task
    scheduler in v2.6.24.

    Commit 3ee237dddcd8 ("sched/prio: Add 3 macros of MAX_NICE, MIN_NICE and
    NICE_WIDTH in prio.h") introduced NICE_WIDTH much later.

    With:

      MAX_USER_PRIO = USER_PRIO(MAX_PRIO)
                    = MAX_PRIO - MAX_RT_PRIO

           MAX_PRIO = MAX_RT_PRIO + NICE_WIDTH

      MAX_USER_PRIO = MAX_RT_PRIO + NICE_WIDTH - MAX_RT_PRIO

      MAX_USER_PRIO = NICE_WIDTH

    MAX_USER_PRIO can be replaced by NICE_WIDTH to be able to remove all the
    {*_}USER_PRIO defines.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210128131040.296856-3-dietmar.eggemann@arm.com

    Dietmar Eggemann
     
  • Commit d46523ea32a7 ("[PATCH] fix MAX_USER_RT_PRIO and MAX_RT_PRIO")
    was introduced due to a small time period in which the realtime patch
    set was using different values for MAX_USER_RT_PRIO and MAX_RT_PRIO.

    This is no longer true, i.e. now MAX_RT_PRIO == MAX_USER_RT_PRIO.

    Get rid of MAX_USER_RT_PRIO and make everything use MAX_RT_PRIO
    instead.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210128131040.296856-2-dietmar.eggemann@arm.com

    Dietmar Eggemann
     
  • Commit "sched/topology: Make sched_init_numa() use a set for the
    deduplicating sort" allocates 'i + nr_levels (level)' instead of
    'i + nr_levels + 1' sched_domain_topology_level.

    This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):

      sched_init_domains
        build_sched_domains()
          __free_domain_allocs()
            __sdt_free() {
              ...
              for_each_sd_topology(tl)
                ...
                sd = *per_cpu_ptr(sdd->sd, j);

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Tested-by: Vincent Guittot
    Tested-by: Barry Song
    Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a9bf@arm.com

    Dietmar Eggemann
     
  • Reduce rbtree boilerplate by using the new helpers.

    Make rb_add_cached() / rb_erase_cached() return a pointer to the
    leftmost node to aid in updating additional state.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Davidlohr Bueso

    Peter Zijlstra
     
  • Reduce rbtree boilerplate by using the new helper function.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Davidlohr Bueso

    Peter Zijlstra
     
  • Both select_idle_core() and select_idle_cpu() do a loop over the same
    cpumask. Observe that by clearing the already visited CPUs, we can
    fold the iteration and iterate a core at a time.

    All we need to do is remember any non-idle CPU we encountered while
    scanning for an idle core. This way we'll only iterate every CPU once.
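
    A toy, userspace illustration of the folded single-pass scan (plain C,
    not kernel code; NR_CPUS, SMT_WIDTH and cpu_idle[] are made up):

      #include <stdbool.h>
      #include <stdio.h>

      #define NR_CPUS   8
      #define SMT_WIDTH 2   /* two hyperthreads per core */

      static bool cpu_idle[NR_CPUS] = { false, false, true, false,
                                        false, false, true,  true };

      /* Return an idle core's first CPU if any, else any idle CPU seen. */
      static int select_idle_target(void)
      {
              int fallback = -1;

              for (int core = 0; core < NR_CPUS; core += SMT_WIDTH) {
                      bool whole_core_idle = true;

                      for (int cpu = core; cpu < core + SMT_WIDTH; cpu++) {
                              if (cpu_idle[cpu]) {
                                      if (fallback < 0)
                                              fallback = cpu; /* remember */
                              } else {
                                      whole_core_idle = false;
                              }
                      }
                      if (whole_core_idle)
                              return core;  /* idle core: done, single pass */
              }
              return fallback;              /* -1 if nothing was idle */
      }

      int main(void)
      {
              printf("picked CPU %d\n", select_idle_target());
              return 0;
      }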

    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/20210127135203.19633-5-mgorman@techsingularity.net

    Mel Gorman
     
  • In order to make the next patch more readable, and to quantify the
    actual effectiveness of this pass, start by removing it.

    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/20210125085909.4600-4-mgorman@techsingularity.net

    Mel Gorman
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     

12 Feb, 2021

1 commit

  • …ulmck/linux-rcu into core/rcu

    Pull RCU updates from Paul E. McKenney:

    - Documentation updates.

    - Miscellaneous fixes.

    - kfree_rcu() updates: Addition of mem_dump_obj() to provide allocator return
    addresses to more easily locate bugs. This has a couple of RCU-related commits,
    but is mostly MM. Was pulled in with akpm's agreement.

    - Per-callback-batch tracking of numbers of callbacks,
    which enables better debugging information and smarter
    reactions to large numbers of callbacks.

    - The first round of changes to allow CPUs to be runtime switched from and to
    callback-offloaded state.

    - CONFIG_PREEMPT_RT-related changes.

    - RCU CPU stall warning updates.
    - Addition of polling grace-period APIs for SRCU.

    - Torture-test and torture-test scripting updates, including a "torture everything"
    script that runs rcutorture, locktorture, scftorture, rcuscale, and refscale.
    Plus does an allmodconfig build.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

04 Feb, 2021

1 commit

  • Safely rescheduling while holding a spin lock is essential for keeping
    long running kernel operations running smoothly. Add the facility to
    cond_resched rwlocks.
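
    A hedged usage sketch (my_rwlock and the helpers around it are
    hypothetical; only the cond_resched_rwlock_read()/_write() calls come
    from this change):

      read_lock(&my_rwlock);
      while (more_work_to_do()) {
              process_one_chunk();
              /* may drop the rwlock, reschedule, and re-acquire it */
              cond_resched_rwlock_read(&my_rwlock);
      }
      read_unlock(&my_rwlock);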

    CC: Ingo Molnar
    CC: Will Deacon
    Acked-by: Peter Zijlstra
    Acked-by: Davidlohr Bueso
    Acked-by: Waiman Long
    Acked-by: Paolo Bonzini
    Signed-off-by: Ben Gardon
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Ben Gardon
     

28 Jan, 2021

4 commits

  • As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
    even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
    check and while we are at it, exclude the cost of initialising the CPU
    mask from the average scan cost.

    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/20210125085909.4600-3-mgorman@techsingularity.net

    Mel Gorman
     
  • SIS_AVG_CPU was introduced as a means of avoiding a search when the
    average search cost indicated that the search would likely fail. It was
    a blunt instrument and disabled by commit 4c77b18cf8b7 ("sched/fair: Make
    select_idle_cpu() more aggressive") and later replaced with a proportional
    search depth by commit 1ad3aaf3fcd2 ("sched/core: Implement new approach
    to scale select_idle_cpu()").

    While there are corner cases where SIS_AVG_CPU is better, it has now been
    disabled for almost three years. As the intent of SIS_PROP is to reduce
    the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU and focus
    on SIS_PROP as a throttling mechanism.

    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/20210125085909.4600-2-mgorman@techsingularity.net

    Mel Gorman
     
  • The deduplicating sort in sched_init_numa() assumes that the first line in
    the distance table contains all unique values in the entire table. I've
    been trying to pen what this exactly means for the topology, but it's not
    straightforward. For instance, topology.c uses this example:

      node   0   1   2   3
        0:  10  20  20  30
        1:  20  10  20  20
        2:  20  20  10  20
        3:  30  20  20  10

      0 ----- 1
      |     / |
      |    /  |
      |   /   |
      2 ----- 3

    Which works out just fine. However, if we swap nodes 0 and 1:

      1 ----- 0
      |     / |
      |    /  |
      |   /   |
      2 ----- 3

    we get this distance table:

      node   0   1   2   3
        0:  10  20  20  20
        1:  20  10  20  30
        2:  20  20  10  20
        3:  20  30  20  10

    Which breaks the deduplicating sort (non-representative first line). In
    this case this would just be a renumbering exercise, but it so happens that
    we can have a deduplicating sort that goes through the whole table in O(n²)
    at the extra cost of a temporary memory allocation (i.e. any form of set).

    The ACPI spec (SLIT) mentions distances are encoded on 8 bits. Following
    this, implement the set as a 256-bit bitmap. Should this not be
    satisfactory (i.e. we want to support 32-bit values), then we'll have to go
    for some other sparse set implementation.
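
    A toy, self-contained illustration of the approach (plain C, not the
    kernel implementation): every 8-bit distance maps onto one bit of a
    256-bit set, so deduplicating over the whole table stays cheap.

      #include <stdint.h>
      #include <stdio.h>

      #define NR_NODES 4

      static uint64_t seen[4];  /* 4 x 64 = 256 bits, one per distance value */

      static void set_bit256(unsigned int v)  { seen[v / 64] |= 1ULL << (v % 64); }
      static int  test_bit256(unsigned int v) { return (seen[v / 64] >> (v % 64)) & 1; }

      int main(void)
      {
              /* the "swapped nodes 0 and 1" table from above */
              static const unsigned int dist[NR_NODES][NR_NODES] = {
                      { 10, 20, 20, 20 },
                      { 20, 10, 20, 30 },
                      { 20, 20, 10, 20 },
                      { 20, 30, 20, 10 },
              };
              unsigned int nr_levels = 0;

              /* walk the whole table, not just the first row */
              for (int i = 0; i < NR_NODES; i++)
                      for (int j = 0; j < NR_NODES; j++)
                              if (!test_bit256(dist[i][j])) {
                                      set_bit256(dist[i][j]);
                                      nr_levels++;
                              }

              printf("unique distances: %u\n", nr_levels);  /* 3: 10, 20, 30 */
              return 0;
      }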

    This has the added benefit of letting us allocate just the right amount of
    memory for sched_domains_numa_distance[], rather than an arbitrary
    (nr_node_ids + 1).

    Note: DT binding equivalent (distance-map) decodes distances as 32-bit
    values.

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20210122123943.1217-2-valentin.schneider@arm.com

    Valentin Schneider
     
  • If the task is pinned to a cpu, setting the misfit status means that
    we'll unnecessarily continuously attempt to migrate the task but fail.

    This continuous failure will cause the balance_interval to increase to
    a high value, and eventually cause unnecessarily significant delays in
    balancing the system when a real imbalance happens.

    Caught while testing uclamp, where an rt-app calibration loop was pinned
    to cpu 0 and, shortly after, another task with a high util_clamp value
    was spawned. The task was failing to migrate after over 40ms of runtime
    because balance_interval had been unnecessarily expanded to a very high
    value by the calibration loop.

    Not done here, but it could be useful to extend the check for pinning to
    verify that the affinity of the task has a cpu that fits. We could end
    up in a similar situation otherwise.

    Fixes: 3b1baa6496e6 ("sched/fair: Add 'group_misfit_task' load-balance type")
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Quentin Perret
    Acked-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210119120755.2425264-1-qais.yousef@arm.com

    Qais Yousef
     

23 Jan, 2021

1 commit

  • …'mmdumpobj.2021.01.22a', 'nocb.2021.01.06a', 'rt.2021.01.04a', 'stall.2021.01.06a', 'torture.2021.01.12a' and 'tortureall.2021.01.06a' into HEAD

    doc.2021.01.06a: Documentation updates.
    fixes.2021.01.04b: Miscellaneous fixes.
    kfree_rcu.2021.01.04a: kfree_rcu() updates.
    mmdumpobj.2021.01.22a: Dump allocation point for memory blocks.
    nocb.2021.01.06a: RCU callback offload updates and cblist segment lengths.
    rt.2021.01.04a: Real-time updates.
    stall.2021.01.06a: RCU CPU stall warning updates.
    torture.2021.01.12a: Torture-test updates and polling SRCU grace-period API.
    tortureall.2021.01.06a: Torture-test script updates.

    Paul E. McKenney
     

22 Jan, 2021

2 commits

  • Now that we have KTHREAD_IS_PER_CPU to denote the critical per-cpu
    tasks to retain during CPU offline, we can relax the warning in
    set_cpus_allowed_ptr(). Any spurious kthread that wants to get on at
    the last minute will get pushed off before it can run.

    During CPU online there is no harm, and actual benefit, in allowing
    kthreads back on early: it simplifies the hotplug code and fixes a
    number of outstanding races.

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Lai Jiangshan
    Reviewed-by: Valentin Schneider
    Tested-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210121103507.240724591@infradead.org

    Peter Zijlstra
     
  • Prior to commit 1cf12e08bc4d ("sched/hotplug: Consolidate task
    migration on CPU unplug") we'd leave any tasks on the dying CPU, break
    their affinity and force them off at the very end.

    This scheme had to change in order to enable migrate_disable(). One
    cannot wait for migrate_disable() to complete while stuck in
    stop_machine(). Furthermore, since we need at the very least: idle,
    hotplug and stop threads at any point before stop_machine, we can't
    break affinity and/or push those away.

    Under the assumption that all per-cpu kthreads are sanely handled by
    CPU hotplug, the new code no longer breaks affinity or migrates any of
    them (which then includes the critical ones above).

    However, there's an important difference between per-cpu kthreads and
    kthreads that happen to have a single CPU affinity which is lost. The
    latter class very much relies on the forced affinity breaking and
    migration semantics previously provided.

    Use the new kthread_is_per_cpu() infrastructure to tighten
    is_per_cpu_kthread() and fix the hot-unplug problems stemming from the
    change.
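
    One plausible shape of the tightened helper (a sketch; the exact
    upstream checks may differ):

      static inline bool is_per_cpu_kthread(struct task_struct *p)
      {
              if (!(p->flags & PF_KTHREAD))
                      return false;
              if (p->nr_cpus_allowed != 1)
                      return false;
              /* only kthreads explicitly marked per-CPU count */
              return kthread_is_per_cpu(p);
      }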

    Fixes: 1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Tested-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210121103507.102416009@infradead.org

    Peter Zijlstra