01 Oct, 2020

2 commits

  • [ Upstream commit e98fa02c4f2ea4991dae422ac7e34d102d2f0599 ]

    There is a race window in which an entity begins throttling before quota
    is added to the pool, but does not finish throttling until after we have
    finished with distribute_cfs_runtime(). This entity is not observed by
    distribute_cfs_runtime() because it was not on the throttled list at the
    time that distribution was running. This race manifests as rare
    period-length stalls for such entities.

    Rather than adding heavy-weight synchronization with the progress of
    distribution, we can fix this by aborting the throttle if bandwidth has
    become available. Otherwise, we immediately add the entity to the
    throttled list so that it can be observed by a subsequent distribution.

    Additionally, we can remove the case of adding the throttled entity to
    the head of the throttled list, and simply always add to the tail.
    Thanks to 26a8b12747c97, distribute_cfs_runtime() no longer holds onto
    its own pool of runtime. This means that if we do hit the !assign and
    distribute_running case, we know that distribution is about to end.
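
    A rough sketch of the idea (not the upstream diff; the helper name
    __assign_cfs_rq_runtime() and the exact locking are assumptions based on
    kernel/sched/fair.c):

        /*
         * Sketch only: report whether the cfs_rq was actually throttled.
         * If quota showed up while we raced with distribution, consume it
         * and abort; otherwise tail-add ourselves to the throttled list so
         * the next distribution pass observes this cfs_rq.
         */
        static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
        {
            struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);

            raw_spin_lock(&cfs_b->lock);
            if (__assign_cfs_rq_runtime(cfs_b, cfs_rq, 1)) {
                /* Bandwidth became available after all: don't throttle. */
                raw_spin_unlock(&cfs_b->lock);
                return false;
            }
            /* Always add to the tail; the "add to head" case is gone. */
            list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
            raw_spin_unlock(&cfs_b->lock);

            /* ... dequeue the entities and account throttled time ... */
            return true;
        }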

    Signed-off-by: Paul Turner
    Signed-off-by: Ben Segall
    Signed-off-by: Josh Don
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Link: https://lkml.kernel.org/r/20200410225208.109717-2-joshdon@google.com
    Signed-off-by: Sasha Levin

    Paul Turner
     
  • [ Upstream commit 62849a9612924a655c67cf6962920544aa5c20db ]

    The kernel test robot triggered a warning with the following race:
           task-ctx A                              interrupt-ctx B
    worker
      -> process_one_work()
        -> work_item()
          -> schedule();
            -> sched_submit_work()
              -> wq_worker_sleeping()
                -> ->sleeping = 1
                  atomic_dec_and_test(nr_running)
            __schedule();                *interrupt*
                                         async_page_fault()
                                         -> local_irq_enable();
                                         -> schedule();
                                            -> sched_submit_work()
                                              -> wq_worker_sleeping()
                                                -> if (WARN_ON(->sleeping)) return
                                            -> __schedule()
                                              -> sched_update_worker()
                                                -> wq_worker_running()
                                                  -> atomic_inc(nr_running);
                                                  -> ->sleeping = 0;

          -> sched_update_worker()
            -> wq_worker_running()
              if (!->sleeping) return

    In this context the warning is pointless; everything is fine.
    An interrupt before wq_worker_sleeping() will perform the ->sleeping
    assignment (0 -> 1 -> 0) twice.
    An interrupt after wq_worker_sleeping() will trigger the warning and
    nr_running will be decremented (by A) and incremented once (only by B, A
    will skip it). This is the case until the ->sleeping is zeroed again in
    wq_worker_running().

    Remove the WARN statement because this condition may happen. Document
    that preemption around wq_worker_sleeping() needs to be disabled to
    protect ->sleeping and not just as an optimisation.

    Fixes: 6d25be5782e48 ("sched/core, workqueues: Distangle worker accounting from rq lock")
    Reported-by: kernel test robot
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Cc: Tejun Heo
    Link: https://lkml.kernel.org/r/20200327074308.GY11705@shao2-debian
    Signed-off-by: Sasha Levin

    Sebastian Andrzej Siewior
     

03 Sep, 2020

2 commits

  • [ Upstream commit e65855a52b479f98674998cb23b21ef5a8144b04 ]

    The following splat was caught when setting uclamp value of a task:

    BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:49

    cpus_read_lock+0x68/0x130
    static_key_enable+0x1c/0x38
    __sched_setscheduler+0x900/0xad8

    Fix by ensuring we enable the key outside of the critical section in
    __sched_setscheduler().
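
    A sketch of the shape of the fix (assuming the sched_uclamp_used static key
    from this series; the surrounding lines are paraphrased, not the literal
    diff):

        /*
         * static_key_enable() ends up taking cpus_read_lock(), which may
         * sleep, so flip the key before entering the rq-locked region
         * instead of from inside it.
         */
        if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
            static_branch_enable(&sched_uclamp_used);

        rq = task_rq_lock(p, &rf);      /* critical section starts here */
        /* ... policy/priority/uclamp changes under the lock ... */
        task_rq_unlock(rq, p, &rf);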

    Fixes: 46609ce22703 ("sched/uclamp: Protect uclamp fast path code with static key")
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200716110347.19553-4-qais.yousef@arm.com
    Signed-off-by: Qais Yousef
    Signed-off-by: Sasha Levin

    Qais Yousef
     
  • [ Upstream commit 46609ce227039fd192e0ecc7d940bed587fd2c78 ]

    There is a report that when uclamp is enabled, a netperf UDP test
    regresses compared to a kernel compiled without uclamp.

    https://lore.kernel.org/lkml/20200529100806.GA3070@suse.de/

    While investigating the root cause, there was no sign that the uclamp
    code is doing anything particularly expensive, but it could suffer from bad
    cache behavior under certain circumstances that are yet to be
    understood.

    https://lore.kernel.org/lkml/20200616110824.dgkkbyapn3io6wik@e107158-lin/

    To reduce the pressure on the fast path anyway, add a static key that
    by default will skip executing uclamp logic in the
    enqueue/dequeue_task() fast path until it's needed.

    As soon as the user starts using util clamp by:

    1. Changing uclamp value of a task with sched_setattr()
    2. Modifying the default sysctl_sched_util_clamp_{min, max}
    3. Modifying the default cpu.uclamp.{min, max} value in cgroup

    We flip the static key now that the user has opted in to util clamp,
    effectively re-introducing uclamp logic in the enqueue/dequeue_task()
    fast path. It stays on from that point forward until the next reboot.

    This should help minimize the effect of util clamp on workloads that
    don't need it but still allow distros to ship their kernels with uclamp
    compiled in by default.

    The SCHED_WARN_ON() in uclamp_rq_dec_id() was removed since we can now end
    up with an unbalanced call to uclamp_rq_dec_id() if we flip the key while
    a task is running in the rq. Since we know it is harmless we just
    quietly return if we attempt a uclamp_rq_dec_id() when
    rq->uclamp[].bucket[].tasks is 0.

    In schedutil, we introduce a new uclamp_is_enabled() helper which takes
    the static key into account to ensure RT boosting behavior is retained.
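
    A condensed sketch of the pattern (names follow the commit's description;
    details of the real patch differ):

        DEFINE_STATIC_KEY_FALSE(sched_uclamp_used);

        static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
        {
            /* Fast path: nobody uses util clamp yet, skip the bucket work. */
            if (!static_branch_unlikely(&sched_uclamp_used))
                return;
            /* ... per-clamp-id refcounting as before ... */
        }

        static inline void uclamp_rq_dec_id(struct rq *rq, struct task_struct *p,
                                            enum uclamp_id clamp_id)
        {
            struct uclamp_rq *uc_rq = &rq->uclamp[clamp_id];
            unsigned int bucket_id = p->uclamp[clamp_id].bucket_id;

            /*
             * The key may have been flipped while this task was running, so
             * an unbalanced dec is possible and harmless: return quietly
             * instead of SCHED_WARN_ON().
             */
            if (!uc_rq->bucket[bucket_id].tasks)
                return;

            uc_rq->bucket[bucket_id].tasks--;
            /* ... recompute uc_rq->value if needed ... */
        }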

    The following results demonstrate how this helps on a 2-socket,
    2x10-core Xeon E5 system.

                               nouclamp                 uclamp      uclamp-static-key
    Hmean send-64          162.43 (  0.00%)      157.84 * -2.82%*      163.39 *  0.59%*
    Hmean send-128         324.71 (  0.00%)      314.78 * -3.06%*      326.18 *  0.45%*
    Hmean send-256         641.55 (  0.00%)      628.67 * -2.01%*      648.12 *  1.02%*
    Hmean send-1024       2525.28 (  0.00%)     2448.26 * -3.05%*     2543.73 *  0.73%*
    Hmean send-2048       4836.14 (  0.00%)     4712.08 * -2.57%*     4867.69 *  0.65%*
    Hmean send-3312       7540.83 (  0.00%)     7425.45 * -1.53%*     7621.06 *  1.06%*
    Hmean send-4096       9124.53 (  0.00%)     8948.82 * -1.93%*     9276.25 *  1.66%*
    Hmean send-8192      15589.67 (  0.00%)    15486.35 * -0.66%*    15819.98 *  1.48%*
    Hmean send-16384     26386.47 (  0.00%)    25752.25 * -2.40%*    26773.74 *  1.47%*

    The perf diff between nouclamp and uclamp-static-key when uclamp is
    disabled in the fast path:

    8.73% -1.55% [kernel.kallsyms] [k] try_to_wake_up
    0.07% +0.04% [kernel.kallsyms] [k] deactivate_task
    0.13% -0.02% [kernel.kallsyms] [k] activate_task

    The diff between nouclamp and uclamp-static-key when uclamp is enabled
    in the fast path:

    8.73% -0.72% [kernel.kallsyms] [k] try_to_wake_up
    0.13% +0.39% [kernel.kallsyms] [k] activate_task
    0.07% +0.38% [kernel.kallsyms] [k] deactivate_task

    Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
    Reported-by: Mel Gorman
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Lukasz Luba
    Link: https://lkml.kernel.org/r/20200630112123.12076-3-qais.yousef@arm.com
    [ Fix minor conflict with kernel/sched.h because of function renamed
    later ]
    Signed-off-by: Qais Yousef
    Signed-off-by: Sasha Levin

    Qais Yousef
     

19 Aug, 2020

3 commits

  • [ Upstream commit d81ae8aac85ca2e307d273f6dc7863a721bf054e ]

    struct uclamp_rq was zeroed out entirely on the assumption that the first
    call to uclamp_rq_inc() would initialize it correctly in accordance with
    the default settings.

    But when the next patch introduces a static key to skip
    uclamp_rq_{inc,dec}() until userspace opts in to use uclamp, schedutil
    will fail to perform any frequency changes because
    rq->uclamp[UCLAMP_MAX].value is zeroed at init and stays as such, which
    means all rqs are capped to 0 by default.

    Fix it by making sure we do proper initialization at init without
    relying on uclamp_rq_inc() doing it later.
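
    A sketch of what that init-time initialization looks like (assumed shape of
    an init_uclamp_rq() helper, not the literal patch):

        static void init_uclamp_rq(struct rq *rq)
        {
            enum uclamp_id clamp_id;
            struct uclamp_rq *uc_rq = rq->uclamp;

            for_each_clamp_id(clamp_id) {
                uc_rq[clamp_id] = (struct uclamp_rq) {
                    /* UCLAMP_MIN -> 0, UCLAMP_MAX -> SCHED_CAPACITY_SCALE */
                    .value = uclamp_none(clamp_id)
                };
            }

            rq->uclamp_flags = 0;
        }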

    Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Tested-by: Lukasz Luba
    Link: https://lkml.kernel.org/r/20200630112123.12076-2-qais.yousef@arm.com
    Signed-off-by: Sasha Levin

    Qais Yousef
     
  • [ Upstream commit 9b1b234bb86bcdcdb142e900d39b599185465dbb ]

    During sched domain init, we check whether non-topological SD_flags are
    returned by tl->sd_flags(); if any are found, we fire a warning and are
    supposed to correct the violation, but the code failed to actually correct
    it. Correct this.
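
    A sketch of the kind of correction meant here, in sd_init() style (the
    exact warning text and context are approximated):

        sd_flags = (*tl->sd_flags)();
        if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
                      "wrong sd_flags in topology description\n"))
            sd_flags &= TOPOLOGY_SD_FLAGS;  /* actually strip the bad bits */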

    Fixes: 143e1e28cb40 ("sched: Rework sched_domain topology definition")
    Signed-off-by: Peng Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20200609150936.GA13060@iZj6chx1xj0e0buvshuecpZ
    Signed-off-by: Sasha Levin

    Peng Liu
     
  • [ Upstream commit 3ea2f097b17e13a8280f1f9386c331b326a3dbef ]

    With commit b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK"),
    rebalance_domains() of the local cfs_rq happens before other idle cpus have
    updated nohz.next_balance and its value is overwritten.

    Move the update of nohz.next_balance for other idle cpus before balancing
    and updating the next_balance of the local cfs_rq.

    Also, nohz.next_balance is now updated only if all idle cpus got a
    chance to rebalance their domains and the idle balance has not been aborted
    because of new activity on the CPU. In case of need_resched, the idle
    load balance will be kicked again on the next jiffy in order to address the
    remaining ilb.

    Fixes: b7031a02ec75 ("sched/fair: Add NOHZ_STATS_KICK")
    Reported-by: Peng Liu
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Acked-by: Mel Gorman
    Link: https://lkml.kernel.org/r/20200609123748.18636-1-vincent.guittot@linaro.org
    Signed-off-by: Sasha Levin

    Vincent Guittot
     

22 Jul, 2020

2 commits

  • commit 01cfcde9c26d8555f0e6e9aea9d6049f87683998 upstream.

    task_h_load() can return 0 in some situations like running stress-ng
    mmapfork, which forks thousands of threads, in a sched group on a 224-core
    system. The load balance doesn't handle this correctly because
    env->imbalance never decreases and it will stop pulling tasks only after
    reaching loop_max, which can be equal to the number of running tasks of
    the cfs. Make sure that imbalance will be decreased by at least 1.

    Misfit task handling is the other feature that doesn't handle such a
    situation correctly, although it's probably more difficult to hit the
    problem because of the smaller number of CPUs and running tasks on
    heterogeneous systems.

    We can't simply ensure that task_h_load() returns at least one because
    that would imply handling underflow in other places.
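
    A sketch of the guarantee in the detach_tasks() load-based path (surrounding
    checks elided; this is not the literal diff):

        /* Never account a task as weighing 0, so env->imbalance always
         * shrinks and the loop cannot spin until loop_max for nothing. */
        load = max_t(unsigned long, task_h_load(p), 1);

        /* ... existing comparisons of load against env->imbalance ... */

        env->imbalance -= load;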

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Reviewed-by: Dietmar Eggemann
    Tested-by: Dietmar Eggemann
    Cc: # v4.4+
    Link: https://lkml.kernel.org/r/20200710152426.16981-1-vincent.guittot@linaro.org
    Signed-off-by: Greg Kroah-Hartman

    Vincent Guittot
     
  • commit ce3614daabea8a2d01c1dd17ae41d1ec5e5ae7db upstream.

    While integrating rseq into glibc and replacing glibc's sched_getcpu
    implementation with rseq, glibc's tests discovered an issue with an
    incorrect __rseq_abi.cpu_id field value right after the first time
    a newly created process issues sched_setaffinity().

    For the records, it triggers after building glibc and running tests, and
    then issuing:

    for x in {1..2000} ; do posix/tst-affinity-static & done

    and shows up as:

    error: Unexpected CPU 2, expected 0
    error: Unexpected CPU 2, expected 0
    error: Unexpected CPU 2, expected 0
    error: Unexpected CPU 2, expected 0
    error: Unexpected CPU 138, expected 0
    error: Unexpected CPU 138, expected 0
    error: Unexpected CPU 138, expected 0
    error: Unexpected CPU 138, expected 0

    This is caused by the scheduler invoking __set_task_cpu() directly from
    sched_fork() and wake_up_new_task(), thus bypassing rseq_migrate() which
    is done by set_task_cpu().

    Add the missing rseq_migrate() to both functions. The only other direct
    use of __set_task_cpu() is done by init_idle(), which does not involve a
    user-space task.

    Based on my testing with the glibc test-case, just adding rseq_migrate()
    to wake_up_new_task() is sufficient to fix the observed issue. Also add
    it to sched_fork() to keep things consistent.

    The reason why this never triggered so far with the rseq/basic_test
    selftest is unclear.

    The current use of sched_getcpu(3) does not typically require it to be
    always accurate. However, use of the __rseq_abi.cpu_id field within rseq
    critical sections requires it to be accurate. If it is not accurate, it
    can cause corruption in the per-cpu data targeted by rseq critical
    sections in user-space.
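
    A sketch of the addition described above, in wake_up_new_task()/sched_fork()
    (new_cpu stands for whatever CPU the caller picked; paraphrased, not the
    literal diff):

        /* __set_task_cpu() bypasses set_task_cpu(), so tell rseq about the
         * migration explicitly; __rseq_abi.cpu_id is then refreshed on the
         * next return to user-space. */
        __set_task_cpu(p, new_cpu);
        rseq_migrate(p);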

    Reported-By: Florian Weimer
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-By: Florian Weimer
    Cc: stable@vger.kernel.org # v4.18+
    Link: https://lkml.kernel.org/r/20200707201505.2632-1-mathieu.desnoyers@efficios.com
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Desnoyers
     

16 Jul, 2020

1 commit

  • [ Upstream commit fd844ba9ae59b51e34e77105d79f8eca780b3bd6 ]

    This function is concerned with the long-term CPU mask, not the
    transitory mask the task might have while migrate disabled. Before
    this patch, if a task was migrate-disabled at the time
    __set_cpus_allowed_ptr() was called, and the new mask happened to be
    equal to the CPU that the task was running on, then the mask update
    would be lost.

    Signed-off-by: Scott Wood
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200617121742.cpxppyi7twxmpin7@linutronix.de
    Signed-off-by: Sasha Levin

    Scott Wood
     

09 Jul, 2020

1 commit

  • [ Upstream commit 9818427c6270a9ce8c52c8621026fe9cebae0f92 ]

    Writing to the sysctl of a sched_domain->flags directly updates the value of
    the field, and goes nowhere near update_top_cache_domain(). This means that
    the cached domain pointers can end up containing stale data (e.g. the
    domain pointed to doesn't have the relevant flag set anymore).

    Explicit domain walks that check for flags will be affected by
    the write, but this won't be in sync with the cached pointers which will
    still point to the domains that were cached at the last sched_domain
    build.

    In other words, writing to this interface is playing a dangerous game. It
    could be made to trigger an update of the cached sched_domain pointers when
    written to, but this does not seem to be worth the trouble. Make it
    read-only.

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200415210512.805-3-valentin.schneider@arm.com
    Signed-off-by: Sasha Levin

    Valentin Schneider
     

01 Jul, 2020

2 commits

  • [ Upstream commit 740797ce3a124b7dd22b7fb832d87bc8fba1cf6f ]

    syzbot reported the following warning:

    WARNING: CPU: 1 PID: 6351 at kernel/sched/deadline.c:628
    enqueue_task_dl+0x22da/0x38a0 kernel/sched/deadline.c:1504

    At deadline.c:628 we have:

    623 static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
    624 {
    625         struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
    626         struct rq *rq = rq_of_dl_rq(dl_rq);
    627
    628         WARN_ON(dl_se->dl_boosted);
    629         WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
        [...]
        }

    Which means that setup_new_dl_entity() has been called on a task
    currently boosted. This shouldn't happen though, as setup_new_dl_entity()
    is only called when the 'dynamic' deadline of the new entity
    is in the past w.r.t. rq_clock and boosted tasks shouldn't verify this
    condition.

    Digging through the PI code I noticed that the above might in fact happen
    if an RT task blocks on an rt_mutex held by a DEADLINE task. In the
    first branch of the boosting conditions we check only whether a pi_task's
    'dynamic' deadline is earlier than the mutex holder's, and in this case we
    set the mutex holder to be dl_boosted. However, since RT 'dynamic'
    deadlines are only initialized if such tasks get boosted at some point (or
    if they become DEADLINE of course), in general RT 'dynamic' deadlines are
    usually equal to 0 and this satisfies the aforementioned condition.

    Fix it by checking that the potential donor task is actually (even if
    temporarily, because in turn boosted) running at DEADLINE priority before
    using its 'dynamic' deadline value.

    Fixes: 2d3d891d3344 ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
    Reported-by: syzbot+119ba87189432ead09b4@syzkaller.appspotmail.com
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Daniel Bristot de Oliveira
    Tested-by: Daniel Wagner
    Link: https://lkml.kernel.org/r/20181119153201.GB2119@localhost.localdomain
    Signed-off-by: Sasha Levin

    Juri Lelli
     
  • [ Upstream commit ce9bc3b27f2a21a7969b41ffb04df8cf61bd1592 ]

    syzbot reported the following warning triggered via SYSC_sched_setattr():

    WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 setup_new_dl_entity /kernel/sched/deadline.c:594 [inline]
    WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 enqueue_dl_entity /kernel/sched/deadline.c:1370 [inline]
    WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 enqueue_task_dl+0x1c17/0x2ba0 /kernel/sched/deadline.c:1441

    This happens because the ->dl_boosted flag is currently not initialized by
    __dl_clear_params() (unlike the other flags) and setup_new_dl_entity()
    rightfully complains about it.

    Initialize dl_boosted to 0.
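
    A sketch of where the initialization lands, inside __dl_clear_params() (the
    other cleared fields are listed from memory for context):

        void __dl_clear_params(struct task_struct *p)
        {
            struct sched_dl_entity *dl_se = &p->dl;

            dl_se->dl_runtime  = 0;
            dl_se->dl_deadline = 0;
            dl_se->dl_period   = 0;
            dl_se->flags       = 0;
            dl_se->dl_bw       = 0;
            dl_se->dl_density  = 0;

            dl_se->dl_throttled      = 0;
            dl_se->dl_yielded        = 0;
            dl_se->dl_non_contending = 0;
            dl_se->dl_overrun        = 0;
            dl_se->dl_boosted        = 0;   /* previously left uninitialized */
        }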

    Fixes: 2d3d891d3344 ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
    Reported-by: syzbot+5ac8bac25f95e8b221e7@syzkaller.appspotmail.com
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Tested-by: Daniel Wagner
    Link: https://lkml.kernel.org/r/20200617072919.818409-1-juri.lelli@redhat.com
    Signed-off-by: Sasha Levin

    Juri Lelli
     

22 Jun, 2020

3 commits

  • [ Upstream commit d505b8af58912ae1e1a211fabc9995b19bd40828 ]

    When users write some huge number into cpu.cfs_quota_us or
    cpu.rt_runtime_us, overflow might happen during the to_ratio() shifts of
    the schedulability checks.

    to_ratio() could be altered to avoid unnecessary internal overflow, but
    min_cfs_quota_period is less than 1 << BW_SHIFT, so a cutoff would still
    be needed. Set a cap MAX_BW for cfs_quota_us and rt_runtime_us to
    prevent overflow.
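
    For illustration, a small stand-alone model of the to_ratio()-style
    arithmetic that overflows without the cap (only BW_SHIFT is taken from
    kernel/sched/sched.h; everything else is made up for the demo):

        #include <stdio.h>
        #include <stdint.h>

        #define BW_SHIFT 20     /* shift used for bandwidth ratios */

        /* Same shape as to_ratio(period, runtime): the left shift wraps in
         * 64 bits once runtime reaches 2^44 ns. */
        static uint64_t ratio(uint64_t period, uint64_t runtime)
        {
            return (runtime << BW_SHIFT) / period;
        }

        int main(void)
        {
            uint64_t period = 100000ULL * 1000;   /* 100ms period, in ns */
            uint64_t sane   = 50000ULL * 1000;    /* 50ms runtime */
            uint64_t huge   = 1ULL << 50;         /* absurd user input */

            printf("sane: %llu\n", (unsigned long long)ratio(period, sane));
            printf("huge: %llu  <- garbage, the shift overflowed\n",
                   (unsigned long long)ratio(period, huge));
            return 0;
        }

    Capping the written value (the MAX_BW cut-off described above) keeps the
    shifted product within 64 bits, so the ratio stays meaningful.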

    Signed-off-by: Huaixin Chang
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Ben Segall
    Link: https://lkml.kernel.org/r/20200425105248.60093-1-changhuaixin@linux.alibaba.com
    Signed-off-by: Sasha Levin

    Huaixin Chang
     
  • [ Upstream commit bf2c59fce4074e55d622089b34be3a6bc95484fb ]

    In the CPU-offline process, it calls mmdrop() after idle entry and the
    subsequent call to cpuhp_report_idle_dead(). Once execution passes the
    call to rcu_report_dead(), RCU is ignoring the CPU, which results in
    lockdep complaining when mmdrop() uses RCU from either memcg or
    debugobjects below.

    Fix it by cleaning up the active_mm state from BP instead. Every arch
    which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
    from AP. The only exception is parisc because it switches them to
    &init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
    but the patch will still work there because it calls mmgrab(&init_mm) in
    smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().

    WARNING: suspicious RCU usage
    -----------------------------
    kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!

    other info that might help us debug this:

    RCU used illegally from offline CPU!
    Call Trace:
    dump_stack+0xf4/0x164 (unreliable)
    lockdep_rcu_suspicious+0x140/0x164
    get_work_pool+0x110/0x150
    __queue_work+0x1bc/0xca0
    queue_work_on+0x114/0x120
    css_release+0x9c/0xc0
    percpu_ref_put_many+0x204/0x230
    free_pcp_prepare+0x264/0x570
    free_unref_page+0x38/0xf0
    __mmdrop+0x21c/0x2c0
    idle_task_exit+0x170/0x1b0
    pnv_smp_cpu_kill_self+0x38/0x2e0
    cpu_die+0x48/0x64
    arch_cpu_idle_dead+0x30/0x50
    do_idle+0x2f4/0x470
    cpu_startup_entry+0x38/0x40
    start_secondary+0x7a8/0xa80
    start_secondary_resume+0x10/0x14
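
    A sketch of what "cleaning up from BP" looks like as a CPU-hotplug step (an
    assumed finish_cpu()-style callback, not the literal patch):

        /* Runs on a live CPU after the dying CPU is gone, so mmdrop() may
         * safely use RCU/memcg freeing paths. */
        int finish_cpu(unsigned int cpu)
        {
            struct task_struct *idle = idle_thread_get(cpu);
            struct mm_struct *mm = idle->active_mm;

            /* idle_task_exit() on the AP already switched to &init_mm;
             * drop the reference the idle task was still holding. */
            if (mm != &init_mm)
                idle->active_mm = &init_mm;
            mmdrop(mm);

            return 0;
        }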

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Qian Cai
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Michael Ellerman (powerpc)
    Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     
  • [ Upstream commit 5a6d6a6ccb5f48ca8cf7c6d64ff83fd9c7999390 ]

    In order to prevent a possible hardlockup of the sched_cfs_period_timer()
    loop, a loop count was introduced to decide whether to scale quota and
    period or not. However, the scaling is done between forwarding the period
    timer and refilling the cfs bandwidth runtime, which means that the period
    timer is forwarded with the old "period" while the runtime is refilled with
    the scaled "quota".

    Move do_sched_cfs_period_timer() before scaling to solve this.

    Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
    Signed-off-by: Huaixin Chang
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Ben Segall
    Reviewed-by: Phil Auld
    Link: https://lkml.kernel.org/r/20200420024421.22442-3-changhuaixin@linux.alibaba.com
    Signed-off-by: Sasha Levin

    Huaixin Chang
     

17 Jun, 2020

1 commit

  • [ Upstream commit 18f855e574d9799a0e7489f8ae6fd8447d0dd74a ]

    Stefano reported a crash with using SQPOLL with io_uring:

    BUG: kernel NULL pointer dereference, address: 00000000000003b0
    CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
    RIP: 0010:task_numa_work+0x4f/0x2c0
    Call Trace:
    task_work_run+0x68/0xa0
    io_sq_thread+0x252/0x3d0
    kthread+0xf9/0x130
    ret_from_fork+0x35/0x40

    which is task_numa_work() oopsing on current->mm being NULL.

    The task work is queued by task_tick_numa(), which checks if current->mm is
    NULL at the time of the call. But this state isn't necessarily persistent,
    if the kthread is using use_mm() to temporarily adopt the mm of a task.

    Change the task_tick_numa() check to exclude kernel threads in general,
    as it doesn't make sense to attempt to balance for kthreads anyway.
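
    A sketch of the widened check in task_tick_numa() (the pre-existing
    conditions are paraphrased):

        /* ->mm being non-NULL isn't a reliable "not a kthread" test: a
         * kthread may have borrowed an mm via use_mm().  Check PF_KTHREAD. */
        if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
            work->next != work)
            return;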

    Reported-by: Stefano Garzarella
    Signed-off-by: Jens Axboe
    Signed-off-by: Ingo Molnar
    Acked-by: Peter Zijlstra
    Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk
    Signed-off-by: Sasha Levin

    Jens Axboe
     

27 May, 2020

3 commits

  • [ Upstream commit b34cb07dde7c2346dec73d053ce926aeaa087303 ]

    sched/fair: Fix enqueue_task_fair warning some more

    The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
    did not fully resolve the issues with the rq->tmp_alone_branch !=
    &rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where
    the first for_each_sched_entity loop exits due to on_rq, having incompletely
    updated the list. In this case the second for_each_sched_entity loop can
    further modify se. The later code to fix up the list management fails to do
    what is needed because se does not point to the sched_entity which broke out
    of the first loop. The list is not fixed up because the throttled parent was
    already added back to the list by a task enqueue in a parallel child hierarchy.

    Address this by calling list_add_leaf_cfs_rq if there are throttled parents
    while doing the second for_each_sched_entity loop.

    Fixes: fe61468b2cb ("sched/fair: Fix enqueue_task_fair warning")
    Suggested-by: Vincent Guittot
    Signed-off-by: Phil Auld
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/20200512135222.GC2201@lorien.usersys.redhat.com
    Signed-off-by: Sasha Levin

    Phil Auld
     
  • [ Upstream commit 5ab297bab984310267734dfbcc8104566658ebef ]

    Even when a cgroup is throttled, the group se of a child cgroup can still
    be enqueued and its gse->on_rq stays true. When a task is enqueued on such
    a child, we still have to update the load_avg and increase the
    h_nr_running of the throttled cfs rq. Nevertheless, the 1st
    for_each_sched_entity() loop is skipped because gse->on_rq == true and the
    2nd loop because the cfs rq is throttled, whereas in this case we have to
    both update load_avg with the old h_nr_running and increase h_nr_running.

    The same sequence can happen during dequeue when se moves to parent before
    breaking in the 1st loop.

    Note that the update of load_avg will effectively happen only once in order
    to sync up to the throttled time. Next call for updating load_avg will stop
    early because the clock stays unchanged.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Fixes: 6d4d22468dae ("sched/fair: Reorder enqueue/dequeue_task_fair path")
    Link: https://lkml.kernel.org/r/20200306084208.12583-1-vincent.guittot@linaro.org
    Signed-off-by: Sasha Levin

    Vincent Guittot
     
  • [ Upstream commit 6d4d22468dae3d8757af9f8b81b848a76ef4409d ]

    The walk through the cgroup hierarchy during the enqueue/dequeue of a task
    is split in 2 distinct parts for throttled cfs_rq without any added value
    but making code less readable.

    Change the code ordering such that everything related to a cfs_rq
    (throttled or not) will be done in the same loop.

    In addition, the same steps ordering is used when updating a cfs_rq:

    - update_load_avg
    - update_cfs_group
    - update *h_nr_running

    This reordering enables the use of h_nr_running in PELT algorithm.

    No functional and performance changes are expected and have been noticed
    during tests.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Mel Gorman
    Signed-off-by: Ingo Molnar
    Reviewed-by: "Dietmar Eggemann "
    Acked-by: Peter Zijlstra
    Cc: Juri Lelli
    Cc: Valentin Schneider
    Cc: Phil Auld
    Cc: Hillf Danton
    Link: https://lore.kernel.org/r/20200224095223.13361-5-mgorman@techsingularity.net
    Signed-off-by: Sasha Levin

    Vincent Guittot
     

02 May, 2020

1 commit

  • commit eaf5a92ebde5bca3bb2565616115bd6d579486cd upstream.

    uclamp_fork() resets the uclamp values to their default when the
    reset-on-fork flag is set. It also checks whether the task has an RT
    policy, and sets its uclamp.min to 1024 accordingly. However, during
    reset-on-fork, the task's policy is lowered to SCHED_NORMAL right after,
    hence leading to an erroneous uclamp.min setting for the new task if it
    was forked from RT.

    Fix this by removing the unnecessary check on rt_task() in
    uclamp_fork() as this doesn't make sense if the reset-on-fork flag is
    set.

    Fixes: 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks")
    Reported-by: Chitti Babu Theegala
    Signed-off-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Patrick Bellasi
    Reviewed-by: Dietmar Eggemann
    Link: https://lkml.kernel.org/r/20200416085956.217587-1-qperret@google.com
    Signed-off-by: Greg Kroah-Hartman

    Quentin Perret
     

17 Apr, 2020

3 commits

  • commit 82e0516ce3a147365a5dd2a9bedd5ba43a18663d upstream.

    A redundant "curr = rq->curr" was added; remove it.

    Fixes: ebc0f83c78a2 ("timers/nohz: Update NOHZ load in remote tick")
    Signed-off-by: Scott Wood
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/1580776558-12882-1-git-send-email-swood@redhat.com
    Cc: Guenter Roeck
    Signed-off-by: Greg Kroah-Hartman

    Scott Wood
     
  • commit fe61468b2cbc2b7ce5f8d3bf32ae5001d4c434e9 upstream.

    When a cfs rq is throttled, it and its children are removed from the
    leaf list, but their nr_running is not changed, which includes staying
    higher than 1. When a task is enqueued in this throttled branch, the cfs
    rqs must be added back in order to ensure correct ordering in the list,
    but this can only happen if nr_running == 1.
    When cfs bandwidth is used, we call list_add_leaf_cfs_rq() unconditionally
    when enqueuing an entity to make sure that the complete branch will be
    added.

    Similarly, unthrottle_cfs_rq() can stop adding cfs rqs to the list when a
    parent is throttled. Iterate over the remaining entities to ensure that the
    complete branch will be added to the list.

    Reported-by: Christian Borntraeger
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Tested-by: Christian Borntraeger
    Tested-by: Dietmar Eggemann
    Cc: stable@vger.kernel.org
    Cc: stable@vger.kernel.org #v5.1+
    Link: https://lkml.kernel.org/r/20200306135257.25044-1-vincent.guittot@linaro.org
    Signed-off-by: Greg Kroah-Hartman

    Vincent Guittot
     
  • [ Upstream commit 26cf52229efc87e2effa9d788f9b33c40fb3358a ]

    During our testing, we found a case where shares were no longer
    working correctly; the cgroup topology is like:

    /sys/fs/cgroup/cpu/A (shares=102400)
    /sys/fs/cgroup/cpu/A/B (shares=2)
    /sys/fs/cgroup/cpu/A/B/C (shares=1024)

    /sys/fs/cgroup/cpu/D (shares=1024)
    /sys/fs/cgroup/cpu/D/E (shares=1024)
    /sys/fs/cgroup/cpu/D/E/F (shares=1024)

    The same benchmark is running in group C & F, no other tasks are
    running, and the benchmark is capable of consuming all the CPUs.

    We suppose the group C will win more CPU resources since it could
    enjoy all the shares of group A, but it's F who wins much more.

    The reason is that we have group B with shares of 2; since
    A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus,
    A->cfs_rq.load.weight becomes very small.

    And in calc_group_shares() we calculate shares as:

    load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
    shares = (tg_shares * load) / tg_weight;

    Since 'cfs_rq->load.weight' is too small, the load becomes 0
    after scale-down; although 'tg_shares' is 102400, the shares of the se
    which stands for group A on the root cfs_rq become 2.

    Meanwhile the se of D on the root cfs_rq is far bigger than 2, so it
    wins the battle.

    Thus when scale_load_down() scales the real weight down to 0, it's no
    longer telling the real story: the caller will have the wrong
    information and the calculation will be buggy.

    This patch adds a check in scale_load_down(), so the real weight will
    be >= MIN_SHARES after scaling; with it applied, group C wins as
    expected.
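
    A sketch of that clamp in the 64-bit scale_load_down() macro (MIN_SHARES
    scales down to 2, hence the constant; approximate, not the literal diff):

        #define scale_load_down(w)                                      \
        ({                                                              \
            unsigned long __w = (w);                                    \
            if (__w)                                                    \
                __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT);          \
            __w;                                                        \
        })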

    Suggested-by: Peter Zijlstra
    Signed-off-by: Michael Wang
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com
    Signed-off-by: Sasha Levin

    Michael Wang
     

05 Mar, 2020

4 commits

  • commit 60588bfa223ff675b95f866249f90616613fbe31 upstream.

    select_idle_cpu() will scan the LLC domain for idle CPUs;
    it's always expensive, so the next commit:

    1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")

    introduces a way to limit how many CPUs we scan.

    But it consumes some of the 'nr' attempts on CPUs that are not allowed
    for the task and thus wastes them. The function then returns
    nr_cpumask_bits and we can't find a CPU on which our task is allowed
    to run.

    The cpumask may be too big; similar to select_idle_core(), use the
    per_cpu_ptr 'select_idle_mask' to prevent stack overflow.

    Fixes: 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")
    Signed-off-by: Cheng Jian
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Srikar Dronamraju
    Reviewed-by: Vincent Guittot
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com
    Signed-off-by: Greg Kroah-Hartman

    Cheng Jian
     
  • [ Upstream commit 2a4b03ffc69f2dedc6388e9a6438b5f4c133a40d ]

    When a running task is moved onto a throttled task group and there is no
    other task enqueued on the CPU, the task can keep running using 100% CPU
    whatever the allocated bandwidth for the group and although its cfs rq is
    throttled. Furthermore, the group entity of the cfs_rq and its parents are
    not enqueued but only set as curr on their respective cfs_rqs.

    We have the following sequence:

    sched_move_task
    -dequeue_task: dequeue task and group_entities.
    -put_prev_task: put task and group entities.
    -sched_change_group: move task to new group.
    -enqueue_task: enqueue only task but not group entities because cfs_rq is
    throttled.
    -set_next_task : set task and group_entities as current sched_entity of
    their cfs_rq.

    Another impact is that the root cfs_rq runnable_load_avg at root rq stays
    null because the group_entities are not enqueued. This situation will stay
    the same until an "external" event triggers a reschedule. Let's trigger it
    immediately instead.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Ben Segall
    Link: https://lkml.kernel.org/r/1579011236-31256-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Sasha Levin

    Vincent Guittot
     
  • [ Upstream commit ebc0f83c78a2d26384401ecf2d2fa48063c0ee27 ]

    The way loadavg is tracked during nohz only pays attention to the load
    upon entering nohz. This can be particularly noticeable if full nohz is
    entered while non-idle, and then the cpu goes idle and stays that way for
    a long time.

    Use the remote tick to ensure that full nohz cpus report their deltas
    within a reasonable time.

    [ swood: Added changelog and removed recheck of stopped tick. ]

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Scott Wood
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
    Signed-off-by: Sasha Levin

    Peter Zijlstra (Intel)
     
  • [ Upstream commit 488603b815a7514c7009e6fc339d74ed4a30f343 ]

    This will be used in the next patch to get a loadavg update from
    nohz cpus. The delta check is skipped because idle_sched_class
    doesn't update se.exec_start.

    Signed-off-by: Scott Wood
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/1578736419-14628-2-git-send-email-swood@redhat.com
    Signed-off-by: Sasha Levin

    Scott Wood
     

29 Feb, 2020

1 commit

  • commit 6fcca0fa48118e6d63733eb4644c6cd880c15b8f upstream.

    Issuing write() with count parameter set to 0 on any file under
    /proc/pressure/ will cause an OOB write because of the access to
    buf[buf_size-1] when NUL-termination is performed. Fix this by checking
    for buf_size to be non-zero.

    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Johannes Weiner
    Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com
    Signed-off-by: Greg Kroah-Hartman

    Suren Baghdasaryan
     

24 Feb, 2020

2 commits

  • [ Upstream commit ccf74128d66ce937876184ad55db2e0276af08d3 ]

    topology.c::get_group() relies on the assumption that non-NUMA domains do
    not partially overlap. Zeng Tao pointed out in [1] that such topology
    descriptions, while completely bogus, can end up being exposed to the
    scheduler.

    In his example (8 CPUs, 2-node system), we end up with:
    MC span for CPU3 == 3-7
    MC span for CPU4 == 4-7

    The first pass through get_group(3, sdd@MC) will result in the following
    sched_group list:

    3 -> 4 -> 5 -> 6 -> 7
    ^                  /
    `----------------'

    And a later pass through get_group(4, sdd@MC) will "corrupt" that to:

    3 -> 4 -> 5 -> 6 -> 7
         ^             /
         `-----------'

    which will completely break things like 'while (sg != sd->groups)' when
    using CPU3's base sched_domain.

    There already are some architecture-specific checks in place such as
    x86/kernel/smpboot.c::topology.sane(), but this is something we can detect
    in the core scheduler, so it seems worthwhile to do so.

    Warn and abort the construction of the sched domains if such a broken
    topology description is detected. Note that this is somewhat
    expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
    gated under SCHED_DEBUG if deemed necessary.

    Testing
    =======

    Dietmar managed to reproduce this using the following qemu incantation:

    $ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
    -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
    cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
    node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1

    alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be
    needed if '-smp cores=X, sockets=Y' would work with qemu):

            if (cpuid_topo->package_id != cpu_topo->package_id)
                    continue;

    +       if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
    +               continue;
    +
            cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
            cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
    Signed-off-by: Sasha Levin

    Valentin Schneider
     
  • [ Upstream commit dcd6dffb0a75741471297724640733fa4e958d72 ]

    rq::uclamp is an array of struct uclamp_rq, make sure we clear the
    whole thing.

    Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
    Signed-off-by: Li Guanglei
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Qais Yousef
    Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
    Signed-off-by: Sasha Levin

    Li Guanglei
     

20 Feb, 2020

1 commit

  • commit b562d140649966d4daedd0483a8fe59ad3bb465a upstream.

    The check to ensure that the newly written value for cpu.uclamp.{min,max}
    is within range, [0:100], wasn't working because of the signed
    comparison:

    7301         if (req.percent > UCLAMP_PERCENT_SCALE) {
    7302                 req.ret = -ERANGE;
    7303                 return req;
    7304         }

    # echo -1 > cpu.uclamp.min
    # cat cpu.uclamp.min
    42949671.96

    Cast req.percent into u64 to force the comparison to be unsigned and
    work as intended in capacity_from_percent().

    # echo -1 > cpu.uclamp.min
    sh: write error: Numerical result out of range
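
    A sketch of the cast, applied to the snippet quoted above:

        /* Force an unsigned comparison so a negative (sign-extended) input
         * is rejected instead of wrapping around. */
        if ((u64)req.percent > UCLAMP_PERCENT_SCALE) {
            req.ret = -ERANGE;
            return req;
        }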

    Fixes: 2480c093130f ("sched/uclamp: Extend CPU's cgroup controller")
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Qais Yousef
     

15 Feb, 2020

1 commit

  • commit 7226017ad37a888915628e59a84a2d1e57b40707 upstream.

    When a new cgroup is created, the effective uclamp value wasn't updated
    with a call to cpu_util_update_eff() that looks at the hierarchy and
    updates to the most restrictive values.

    Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
    becomes online.

    Without this change, the newly created cgroup uses the default
    root_task_group uclamp values, which are 1024 for both uclamp_{min, max},
    and will cause the rq to be clamped to max, hence causing the
    system to run at max frequency.

    The problem was observed on Ubuntu server and was reproduced on Debian
    and Buildroot rootfs.

    By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
    and add all tasks to it - which creates enough noise to keep the rq
    uclamp value at max most of the time. Imitating this behavior makes the
    problem visible in Buildroot too which otherwise looks fine since it's a
    minimal userspace.

    Fixes: 0b60ba2dd342 ("sched/uclamp: Propagate parent clamps")
    Reported-by: Doug Smythies
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Doug Smythies
    Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
    Signed-off-by: Greg Kroah-Hartman

    Qais Yousef
     

26 Jan, 2020

2 commits

  • [ Upstream commit bef69dd87828ef5d8ecdab8d857cd3a33cf98675 ]

    update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays,
    which might be inefficient when the cpufreq driver has rate limitation.

    When a task is attached on a CPU, we have this call path:

    update_load_avg()
      update_cfs_rq_load_avg()
        cfs_rq_util_change --> trig frequency update
      attach_entity_load_avg()
        cfs_rq_util_change --> trig frequency update

    The 1st frequency update will not take into account the utilization of the
    newly attached task and the 2nd one might be discarded because of rate
    limitation of the cpufreq driver.

    update_cfs_rq_load_avg() is only called by update_blocked_averages()
    and update_load_avg() so we can move the call to
    cfs_rq_util_change/cpufreq_update_util() into these two functions.

    It's also interesting to note that update_load_avg() already calls
    cfs_rq_util_change() directly for the !SMP case.

    This change will also ensure that cpufreq_update_util() is called even
    when there is no more CFS rq in the leaf_cfs_rq_list to update, but only
    IRQ, RT or DL PELT signals.

    [ mingo: Minor updates. ]

    Reported-by: Doug Smythies
    Tested-by: Doug Smythies
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Acked-by: Rafael J. Wysocki
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: juri.lelli@redhat.com
    Cc: linux-pm@vger.kernel.org
    Cc: mgorman@suse.de
    Cc: rostedt@goodmis.org
    Cc: sargun@sargun.me
    Cc: srinivas.pandruvada@linux.intel.com
    Cc: tj@kernel.org
    Cc: xiexiuqi@huawei.com
    Cc: xiezhipeng1@huawei.com
    Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
    Link: https://lkml.kernel.org/r/1574083279-799-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Vincent Guittot
     
  • commit a0e813f26ebcb25c0b5e504498fbd796cca1a4ba upstream.

    It turns out there really is something special to the first
    set_next_task() invocation. Specifically, the 'change' pattern really
    should not cause balance callbacks.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Cc: dietmar.eggemann@arm.com
    Cc: juri.lelli@redhat.com
    Cc: ktkhai@virtuozzo.com
    Cc: mgorman@suse.de
    Cc: qais.yousef@arm.com
    Cc: qperret@google.com
    Cc: rostedt@goodmis.org
    Cc: valentin.schneider@arm.com
    Cc: vincent.guittot@linaro.org
    Fixes: f95d4eaee6d0 ("sched/{rt,deadline}: Fix set_next_task vs pick_next_task")
    Link: https://lkml.kernel.org/r/20191108131909.775434698@infradead.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

12 Jan, 2020

2 commits

  • [ Upstream commit c3466952ca1514158d7c16c9cfc48c27d5c5dc0f ]

    The psi window size is a u64 and can be up to 10 seconds right now,
    which exceeds the lower 32 bits of the variable. We currently use
    div_u64 for it, which is meant only for 32-bit divisors. The result is
    garbage pressure sampling values and even potential div0 crashes.

    Use div64_u64.
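
    A small stand-alone illustration of why div_u64() is the wrong tool here
    (it takes a 32-bit divisor, so a 10s window in nanoseconds gets truncated;
    the numbers are made up):

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
            uint64_t stall  = 3000000000ULL;    /* 3s of stall time, in ns */
            uint64_t window = 10000000000ULL;   /* 10s window: needs > 32 bits */

            /* div_u64()-style: only the low 32 bits of the divisor survive. */
            uint64_t wrong = stall / (uint32_t)window;
            /* div64_u64()-style: full 64-bit division. */
            uint64_t right = stall / window;

            printf("truncated divisor: %llu, full divisor: %llu\n",
                   (unsigned long long)wrong, (unsigned long long)right);
            return 0;
        }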

    Signed-off-by: Johannes Weiner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Suren Baghdasaryan
    Cc: Jingfeng Xie
    Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
    Signed-off-by: Sasha Levin

    Johannes Weiner
     
  • [ Upstream commit 3dfbe25c27eab7c90c8a7e97b4c354a9d24dd985 ]

    Jingfeng reports rare div0 crashes in psi on systems with some uptime:

    [58914.066423] divide error: 0000 [#1] SMP
    [58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
    [58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
    [58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
    [58914.102728] Workqueue: events psi_update_work
    [58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
    [58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
    [58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
    [58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
    [58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
    [58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
    [58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
    [58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
    [58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
    [58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
    [58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [58914.200360] PKRU: 55555554
    [58914.203221] Stack:
    [58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
    [58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
    [58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    [58914.228734] Call Trace:
    [58914.231337] [] psi_update_work+0x22/0x60
    [58914.237067] [] process_one_work+0x189/0x420
    [58914.243063] [] worker_thread+0x4e/0x4b0
    [58914.248701] [] ? process_one_work+0x420/0x420
    [58914.254869] [] kthread+0xe6/0x100
    [58914.259994] [] ? kthread_park+0x60/0x60
    [58914.265640] [] ret_from_fork+0x39/0x50
    [58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc f7 f1 48 8b 13 48 89 c7 48 c1
    [58914.279691] RIP [] psi_update_stats+0x1c1/0x330

    The crashing instruction is trying to divide the observed stall time
    by the sampling period. The period, stored in R8, is not 0, but we are
    dividing by the lower 32 bits only, which are all 0 in this instance.

    We could switch to a 64-bit division, but the period shouldn't be that
    big in the first place. It's the time between the last update and the
    next scheduled one, and so should always be around 2s and comfortably
    fit into 32 bits.

    The bug is in the initialization of new cgroups: we schedule the first
    sampling event in a cgroup as an offset of sched_clock(), but fail to
    initialize the last_update timestamp, and it defaults to 0. That
    results in a bogusly large sampling period the first time we run the
    sampling code, and consequently we underreport pressure for the first
    2s of a cgroup's life. But worse, if sched_clock() is sufficiently
    advanced on the system, and the user gets unlucky, the period's lower
    32 bits can all be 0 and the sampling division will crash.

    Fix this by initializing the last update timestamp to the creation
    time of the cgroup, thus correctly marking the start of the first
    pressure sampling period in a new cgroup.

    Reported-by: Jingfeng Xie
    Signed-off-by: Johannes Weiner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Suren Baghdasaryan
    Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
    Signed-off-by: Sasha Levin

    Johannes Weiner
     

31 Dec, 2019

2 commits

  • commit 85572c2c4a45a541e880e087b5b17a48198b2416 upstream.

    The scheduler code calling cpufreq_update_util() may run during CPU
    offline on the target CPU after the IRQ work lists have been flushed
    for it, so the target CPU should be prevented from running code that
    may queue up an IRQ work item on it at that point.

    Unfortunately, that may not be the case if dvfs_possible_from_any_cpu
    is set for at least one cpufreq policy in the system, because that
    allows the CPU going offline to run the utilization update callback
    of the cpufreq governor on behalf of another (online) CPU in some
    cases.

    If that happens, the cpufreq governor callback may queue up an IRQ
    work on the CPU running it, which is going offline, and the IRQ work
    may not be flushed after that point. Moreover, that IRQ work cannot
    be flushed until the "offlining" CPU goes back online, so if any
    other CPU calls irq_work_sync() to wait for the completion of that
    IRQ work, it will have to wait until the "offlining" CPU is back
    online and that may not happen forever. In particular, a system-wide
    deadlock may occur during CPU online as a result of that.

    The failing scenario is as follows. CPU0 is the boot CPU, so it
    creates a cpufreq policy and becomes the "leader" of it
    (policy->cpu). It cannot go offline, because it is the boot CPU.
    Next, other CPUs join the cpufreq policy as they go online and they
    leave it when they go offline. The last CPU to go offline, say CPU3,
    may queue up an IRQ work while running the governor callback on
    behalf of CPU0 after leaving the cpufreq policy because of the
    dvfs_possible_from_any_cpu effect described above. Then, CPU0 is
    the only online CPU in the system and the stale IRQ work is still
    queued on CPU3. When, say, CPU1 goes back online, it will run
    irq_work_sync() to wait for that IRQ work to complete and so it
    will wait for CPU3 to go back online (which may never happen even
    in principle), but (worse yet) CPU0 is waiting for CPU1 at that
    point too and a system-wide deadlock occurs.

    To address this problem notice that CPUs which cannot run cpufreq
    utilization update code for themselves (for example, because they
    have left the cpufreq policies that they belonged to), should also
    be prevented from running that code on behalf of the other CPUs that
    belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so
    in that case the cpufreq_update_util_data pointer of the CPU running
    the code must not be NULL as well as for the CPU which is the target
    of the cpufreq utilization update in progress.

    Accordingly, change cpufreq_this_cpu_can_update() into a regular
    function in kernel/sched/cpufreq.c (instead of a static inline in a
    header file) and make it check the cpufreq_update_util_data pointer
    of the local CPU if dvfs_possible_from_any_cpu is set for the target
    cpufreq policy.

    Also update the schedutil governor to do the
    cpufreq_this_cpu_can_update() check in the non-fast-switch
    case too to avoid the stale IRQ work issues.

    Fixes: 99d14d0e16fa ("cpufreq: Process remote callbacks from any CPU if the platform permits")
    Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/
    Reported-by: Anson Huang
    Tested-by: Anson Huang
    Cc: 4.14+ # 4.14+
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar
    Tested-by: Peng Fan (i.MX8QXP-MEK)
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     
  • [ Upstream commit 7763baace1b738d65efa46d68326c9406311c6bf ]

    Some uclamp helpers had their return type changed from 'unsigned int' to
    'enum uclamp_id' by commit

    0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")

    but it happens that some do return a value in the [0, SCHED_CAPACITY_SCALE]
    range, which should really be unsigned int. The affected helpers are
    uclamp_none(), uclamp_rq_max_value() and uclamp_eff_value(). Fix those up.

    Note that this doesn't lead to any obj diff using a relatively recent
    aarch64 compiler (8.3-2019.03). The current code of e.g. uclamp_eff_value()
    properly returns an 11 bit value (bits_per(1024)) and doesn't seem to do
    anything funny. I'm still marking this as fixing the above commit to be on
    the safe side.

    Signed-off-by: Valentin Schneider
    Reviewed-by: Qais Yousef
    Acked-by: Vincent Guittot
    Cc: Dietmar.Eggemann@arm.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: patrick.bellasi@matbug.net
    Cc: qperret@google.com
    Cc: surenb@google.com
    Cc: tj@kernel.org
    Fixes: 0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
    Link: https://lkml.kernel.org/r/20191115103908.27610-1-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Valentin Schneider
     

15 Nov, 2019

1 commit

  • uclamp_update_active() should perform the update when
    p->uclamp[clamp_id].active is true. But when the logic was inverted in
    [1], the if condition wasn't correctly inverted along with it.

    [1] https://lore.kernel.org/lkml/20190902073836.GO2369@hirez.programming.kicks-ass.net/
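
    A sketch of the intended condition inside uclamp_update_active() (helper
    names as used elsewhere in the series; not the literal diff):

        /* Only refresh the rq-side refcounts when the clamp is actually
         * active on the task. */
        if (p->uclamp[clamp_id].active) {
            uclamp_rq_dec_id(rq, p, clamp_id);
            uclamp_rq_inc_id(rq, p, clamp_id);
        }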

    Reported-by: Suren Baghdasaryan
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Vincent Guittot
    Cc: Ben Segall
    Cc: Dietmar Eggemann
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Patrick Bellasi
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Fixes: babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
    Link: https://lkml.kernel.org/r/20191114211052.15116-1-qais.yousef@arm.com
    Signed-off-by: Ingo Molnar

    Qais Yousef