20 Sep, 2022

1 commit

  • [ Upstream commit 8b023accc8df70e72f7704d29fead7ca914d6837 ]

    While looking into a bug related to the compiler's handling of addresses
    of labels, I noticed some uses of _THIS_IP_ seemed unused in lockdep.
    Drive-by cleanup.

    -Wunused-parameter:
    kernel/locking/lockdep.c:1383:22: warning: unused parameter 'ip'
    kernel/locking/lockdep.c:4246:48: warning: unused parameter 'ip'
    kernel/locking/lockdep.c:4844:19: warning: unused parameter 'ip'

    Signed-off-by: Nick Desaulniers
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Link: https://lore.kernel.org/r/20220314221909.2027027-1-ndesaulniers@google.com
    Stable-dep-of: 54c3931957f6 ("tracing: hold caller_addr to hardirq_{enable,disable}_ip")
    Signed-off-by: Sasha Levin

    Nick Desaulniers
     

15 Sep, 2022

1 commit

  • commit c2e406596571659451f4b95e37ddfd5a8ef1d0dc upstream.

    Kuyo reports that the pattern of using debugfs_remove(debugfs_lookup())
    leaks a dentry and with a hotplug stress test, the machine eventually
    runs out of memory.

    Fix this up by using the newly created debugfs_lookup_and_remove() call
    instead which properly handles the dentry reference counting logic.
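
    For reference, a minimal sketch of the pattern change being described
    (the "sched_domain" name and the "sd_dentry" parent used here are
    illustrative placeholders, not taken from the actual patch):

        /* Before: debugfs_lookup() takes a dentry reference that is never dropped. */
        debugfs_remove(debugfs_lookup("sched_domain", sd_dentry));

        /* After: lookup and removal in one call; the reference is handled internally. */
        debugfs_lookup_and_remove("sched_domain", sd_dentry);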

    Cc: Major Chen
    Cc: stable
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Juri Lelli
    Cc: Vincent Guittot
    Cc: Dietmar Eggemann
    Cc: Steven Rostedt
    Cc: Ben Segall
    Cc: Mel Gorman
    Cc: Daniel Bristot de Oliveira
    Cc: Valentin Schneider
    Cc: Matthias Brugger
    Reported-by: Kuyo Chang
    Tested-by: Kuyo Chang
    Acked-by: Peter Zijlstra (Intel)
    Link: https://lore.kernel.org/r/20220902123107.109274-2-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

17 Aug, 2022

8 commits

  • [ Upstream commit 751d4cbc43879229dbc124afefe240b70fd29a85 ]

    The following warning was triggered on a large machine early in boot on
    a distribution kernel but the same problem should also affect mainline.

    WARNING: CPU: 439 PID: 10 at ../kernel/workqueue.c:2231 process_one_work+0x4d/0x440
    Call Trace:

    rescuer_thread+0x1f6/0x360
    kthread+0x156/0x180
    ret_from_fork+0x22/0x30

    Commit c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
    optimises ttwu by queueing a task that is descheduling on the wakelist,
    but does not check if the task descheduling is still allowed to run on that CPU.

    In this warning, the problematic task is a workqueue rescue thread which
    checks if the rescue is for a per-cpu workqueue and running on the wrong CPU.
    While this is early in boot and it should be possible to create workers,
    the rescue thread may still be used if the MAYDAY_INITIAL_TIMEOUT is
    reached or MAYDAY_INTERVAL elapses, and on a sufficiently large machine
    the rescue thread is used frequently.

    Tracing confirmed that the task should have migrated properly using the
    stopper thread to handle the migration. However, a parallel wakeup from udev
    running on another CPU that does not share CPU cache observes p->on_cpu and
    uses task_cpu(p), queues the task on the old CPU and triggers the warning.

    Check that the wakee task that is descheduling is still allowed to run
    on its current CPU and if not, wait for the descheduling to complete
    and select an allowed CPU.
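
    Roughly, the kind of guard described above, placed in the wakelist
    queueing decision, could look like the sketch below (illustrative, not
    the exact upstream hunk):

        static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
        {
                /*
                 * The wakee is descheduling on 'cpu', but it may no longer
                 * be allowed to run there (its affinity mask may have
                 * changed). Don't queue it on that CPU's wakelist; fall
                 * back to the slow path, which waits for ->on_cpu to clear
                 * and then picks an allowed CPU.
                 */
                if (!cpumask_test_cpu(cpu, p->cpus_ptr))
                        return false;

                /* ... remaining queueing conditions ... */
                return true;
        }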

    Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
    Signed-off-by: Mel Gorman
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20220804092119.20137-1-mgorman@techsingularity.net
    Signed-off-by: Sasha Levin

    Mel Gorman
     
  • [ Upstream commit f3dd3f674555bd9455c5ae7fafce0696bd9931b3 ]

    The wakelist can help avoid cache bouncing and offload overhead from the
    waker cpu. So far, using the wakelist within the same llc only happens on
    WF_ON_CPU, and this limitation could be removed to further improve
    wakeup performance.

    The commit 518cd6234178 ("sched: Only queue remote wakeups when
    crossing cache boundaries") disabled queueing tasks on the wakelist when
    the cpus share an llc. This is because, at that time, the scheduler had
    to send IPIs to do ttwu_queue_wakelist. Nowadays, ttwu_queue_wakelist
    also supports TIF_POLLING, so this is no longer a problem when the wakee
    cpu is idle polling.

    Benefits:
    Queuing the task on an idle cpu can help improve performance on the waker
    cpu and utilization on the wakee cpu, and further improve locality because
    the wakee cpu can handle its own rq. This patch helps improve rt on
    our real Java workloads where wakeups happen frequently.

    Consider the normal condition (CPU0 and CPU1 share same llc)
    Before this patch:

    CPU0                                    CPU1

    select_task_rq()                        idle
    rq_lock(CPU1->rq)
    enqueue_task(CPU1->rq)
    notify CPU1 (by sending IPI or CPU1 polling)

                                            resched()

    After this patch:

    CPU0                                    CPU1

    select_task_rq()                        idle
    add to wakelist of CPU1
    notify CPU1 (by sending IPI or CPU1 polling)

                                            rq_lock(CPU1->rq)
                                            enqueue_task(CPU1->rq)
                                            resched()

    We see that CPU0 can finish its work earlier: it only needs to put the
    task on the wakelist and return, while CPU1, being idle, handles its own
    runqueue data itself.

    This patch makes no difference in terms of IPIs.
    It only takes effect when the wakee cpu is:
    1) idle polling
    2) idle not polling

    For 1), there will be no IPI with or without this patch.

    For 2), there will always be an IPI before or after this patch.
    Before this patch: waker cpu will enqueue task and check preempt. Since
    "idle" will be sure to be preempted, waker cpu must send a resched IPI.
    After this patch: waker cpu will put the task to the wakelist of wakee
    cpu, and send an IPI.

    Benchmark:
    We've tested schbench, unixbench, and hackbench on both x86 and arm64.

    On x86 (Intel Xeon Platinum 8269CY):
    schbench -m 2 -t 8

    Latency percentiles (usec) before after
    50.0000th: 8 6
    75.0000th: 10 7
    90.0000th: 11 8
    95.0000th: 12 8
    *99.0000th: 13 10
    99.5000th: 15 11
    99.9000th: 18 14

    Unixbench with full threads (104)
    before after
    Dhrystone 2 using register variables 3011862938 3009935994 -0.06%
    Double-Precision Whetstone 617119.3 617298.5 0.03%
    Execl Throughput 27667.3 27627.3 -0.14%
    File Copy 1024 bufsize 2000 maxblocks 785871.4 784906.2 -0.12%
    File Copy 256 bufsize 500 maxblocks 210113.6 212635.4 1.20%
    File Copy 4096 bufsize 8000 maxblocks 2328862.2 2320529.1 -0.36%
    Pipe Throughput 145535622.8 145323033.2 -0.15%
    Pipe-based Context Switching 3221686.4 3583975.4 11.25%
    Process Creation 101347.1 103345.4 1.97%
    Shell Scripts (1 concurrent) 120193.5 123977.8 3.15%
    Shell Scripts (8 concurrent) 17233.4 17138.4 -0.55%
    System Call Overhead 5300604.8 5312213.6 0.22%

    hackbench -g 1 -l 100000
    before after
    Time 3.246 2.251

    On arm64 (Ampere Altra):
    schbench -m 2 -t 8

    Latency percentiles (usec) before after
    50.0000th: 14 10
    75.0000th: 19 14
    90.0000th: 22 16
    95.0000th: 23 16
    *99.0000th: 24 17
    99.5000th: 24 17
    99.9000th: 28 25

    Unixbench with full threads (80)
    before after
    Dhrystone 2 using register variables 3536194249 3537019613 0.02%
    Double-Precision Whetstone 629383.6 629431.6 0.01%
    Execl Throughput 65920.5 65846.2 -0.11%
    File Copy 1024 bufsize 2000 maxblocks 1063722.8 1064026.8 0.03%
    File Copy 256 bufsize 500 maxblocks 322684.5 318724.5 -1.23%
    File Copy 4096 bufsize 8000 maxblocks 2348285.3 2328804.8 -0.83%
    Pipe Throughput 133542875.3 131619389.8 -1.44%
    Pipe-based Context Switching 3215356.1 3576945.1 11.25%
    Process Creation 108520.5 120184.6 10.75%
    Shell Scripts (1 concurrent) 122636.3 121888 -0.61%
    Shell Scripts (8 concurrent) 17462.1 17381.4 -0.46%
    System Call Overhead 4429998.9 4435006.7 0.11%

    hackbench -g 1 -l 100000
    before after
    Time 4.217 2.916

    Our patch shows improvements on schbench, hackbench and the Pipe-based
    Context Switching test of unixbench when idle cpus exist, and no obvious
    regression on the other unixbench tests. This can help improve rt in
    scenarios where wakeups happen frequently.

    Signed-off-by: Tianchen Ding
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Link: https://lore.kernel.org/r/20220608233412.327341-3-dtcccc@linux.alibaba.com
    Signed-off-by: Sasha Levin

    Tianchen Ding
     
  • [ Upstream commit 28156108fecb1f808b21d216e8ea8f0d205a530c ]

    The commit 2ebb17717550 ("sched/core: Offload wakee task activation if it
    the wakee is descheduling") checked rq->nr_running <= 1 to avoid task
    stacking when WF_ON_CPU.

    Per the ordering of writes to p->on_rq and p->on_cpu, observing p->on_cpu
    (WF_ON_CPU) in ttwu_queue_cond() implies !p->on_rq, IOW p has gone through
    the deactivate_task() in __schedule(), thus p has been accounted out of
    rq->nr_running. As such, the task being the only runnable task on the rq
    implies reading rq->nr_running == 0 at that point.
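
    In code terms, the reasoning above boils down to something like the
    following in ttwu_queue_cond() (a sketch of the idea, not the literal
    diff):

        /*
         * WF_ON_CPU: the wakee has already been deactivated, so it is no
         * longer accounted in rq->nr_running. "wakee is the only runnable
         * task" therefore reads as nr_running == 0 rather than <= 1.
         */
        if ((wake_flags & WF_ON_CPU) && !cpu_rq(cpu)->nr_running)
                return true;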

    The benchmark result is in [1].

    [1] https://lore.kernel.org/all/e34de686-4e85-bde1-9f3c-9bbc86b38627@linux.alibaba.com/

    Suggested-by: Valentin Schneider
    Signed-off-by: Tianchen Ding
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Link: https://lore.kernel.org/r/20220608233412.327341-2-dtcccc@linux.alibaba.com
    Signed-off-by: Sasha Levin

    Tianchen Ding
     
  • [ Upstream commit b6e8d40d43ae4dec00c8fea2593eeea3114b8f44 ]

    With cgroup v2, the cpuset's cpus_allowed mask can be empty indicating
    that the cpuset will just use the effective CPUs of its parent. So
    cpuset_can_attach() can call task_can_attach() with an empty mask.
    This can lead to cpumask_any_and() returning nr_cpu_ids, causing the call
    to dl_bw_of() to crash due to a percpu value access of an out-of-bounds
    CPU value. For example:

    [80468.182258] BUG: unable to handle page fault for address: ffffffff8b6648b0
    :
    [80468.191019] RIP: 0010:dl_cpu_busy+0x30/0x2b0
    :
    [80468.207946] Call Trace:
    [80468.208947] cpuset_can_attach+0xa0/0x140
    [80468.209953] cgroup_migrate_execute+0x8c/0x490
    [80468.210931] cgroup_update_dfl_csses+0x254/0x270
    [80468.211898] cgroup_subtree_control_write+0x322/0x400
    [80468.212854] kernfs_fop_write_iter+0x11c/0x1b0
    [80468.213777] new_sync_write+0x11f/0x1b0
    [80468.214689] vfs_write+0x1eb/0x280
    [80468.215592] ksys_write+0x5f/0xe0
    [80468.216463] do_syscall_64+0x5c/0x80
    [80468.224287] entry_SYSCALL_64_after_hwframe+0x44/0xae

    Fix that by using effective_cpus instead. For cgroup v1, effective_cpus
    is the same as cpus_allowed. For v2, effective_cpus is the real cpumask
    to be used by tasks within the cpuset anyway.

    Also update task_can_attach()'s 2nd argument name to cs_effective_cpus to
    reflect the change. In addition, a check is added to task_can_attach()
    to guard against the possibility that cpumask_any_and() may return a
    value >= nr_cpu_ids.
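
    A minimal sketch of the added guard (the exact error value returned here
    is an assumption, not quoted from the patch):

        int cpu = cpumask_any_and(cpu_active_mask, cs_effective_cpus);

        /*
         * An empty effective cpumask makes cpumask_any_and() return
         * nr_cpu_ids; bail out instead of using it to index per-CPU data.
         */
        if (unlikely(cpu >= nr_cpu_ids))
                return -EINVAL;

        ret = dl_cpu_busy(cpu, p);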

    Fixes: 7f51412a415d ("sched/deadline: Fix bandwidth check/update when migrating tasks between exclusive cpusets")
    Signed-off-by: Waiman Long
    Signed-off-by: Ingo Molnar
    Acked-by: Juri Lelli
    Link: https://lore.kernel.org/r/20220803015451.2219567-1-longman@redhat.com
    Signed-off-by: Sasha Levin

    Waiman Long
     
  • [ Upstream commit 772b6539fdda31462cc08368e78df60b31a58bab ]

    Both functions do almost the same thing, namely checking whether admission
    control is still respected.

    With exclusive cpusets, dl_task_can_attach() checks if the destination
    cpuset (i.e. its root domain) has enough CPU capacity to accommodate the
    task.
    dl_cpu_busy() checks if there is enough CPU capacity in the cpuset in
    case the CPU is hot-plugged out.

    dl_task_can_attach() is used to check if a task can be admitted while
    dl_cpu_busy() is used to check if a CPU can be hotplugged out.

    Make dl_cpu_busy() able to deal with a task and use it instead of
    dl_task_can_attach() in task_can_attach().

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Link: https://lore.kernel.org/r/20220302183433.333029-4-dietmar.eggemann@arm.com
    Signed-off-by: Sasha Levin

    Dietmar Eggemann
     
  • [ Upstream commit 5c66d1b9b30f737fcef85a0b75bfe0590e16b62a ]

    dequeue_task_rt() only decrements 'rt_rq->rt_nr_running' after having
    called sched_update_tick_dependency() preventing it from re-enabling the
    tick on systems that no longer have pending SCHED_RT tasks but have
    multiple runnable SCHED_OTHER tasks:

    dequeue_task_rt()
      dequeue_rt_entity()
        dequeue_rt_stack()
          dequeue_top_rt_rq()
            sub_nr_running()              // decrements rq->nr_running
            sched_update_tick_dependency()
              sched_can_stop_tick()       // checks rq->rt.rt_nr_running,
              ...
          __dequeue_rt_entity()
            dec_rt_tasks()                // decrements rq->rt.rt_nr_running
            ...

    Every other scheduler class performs the operation in the opposite
    order, and sched_update_tick_dependency() expects the values to be
    updated as such. So avoid the misbehaviour by inverting the order in
    which the above operations are performed in the RT scheduler.

    Fixes: 76d92ac305f2 ("sched: Migrate sched to use new tick dependency mask model")
    Signed-off-by: Nicolas Saenz Julienne
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Reviewed-by: Phil Auld
    Link: https://lore.kernel.org/r/20220628092259.330171-1-nsaenzju@redhat.com
    Signed-off-by: Sasha Levin

    Nicolas Saenz Julienne
     
  • [ Upstream commit 401e4963bf45c800e3e9ea0d3a0289d738005fd4 ]

    With CONFIG_PREEMPT_RT, it is possible to hit a deadlock between two
    normal priority tasks (SCHED_OTHER, nice level zero):

    INFO: task kworker/u8:0:8 blocked for more than 491 seconds.
    Not tainted 5.15.49-rt46 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:kworker/u8:0 state:D stack: 0 pid: 8 ppid: 2 flags:0x00000000
    Workqueue: writeback wb_workfn (flush-7:0)
    [] (__schedule) from [] (schedule+0xdc/0x134)
    [] (schedule) from [] (rt_mutex_slowlock_block.constprop.0+0xb8/0x174)
    [] (rt_mutex_slowlock_block.constprop.0) from []
    +(rt_mutex_slowlock.constprop.0+0xac/0x174)
    [] (rt_mutex_slowlock.constprop.0) from [] (fat_write_inode+0x34/0x54)
    [] (fat_write_inode) from [] (__writeback_single_inode+0x354/0x3ec)
    [] (__writeback_single_inode) from [] (writeback_sb_inodes+0x250/0x45c)
    [] (writeback_sb_inodes) from [] (__writeback_inodes_wb+0x7c/0xb8)
    [] (__writeback_inodes_wb) from [] (wb_writeback+0x2c8/0x2e4)
    [] (wb_writeback) from [] (wb_workfn+0x1a4/0x3e4)
    [] (wb_workfn) from [] (process_one_work+0x1fc/0x32c)
    [] (process_one_work) from [] (worker_thread+0x22c/0x2d8)
    [] (worker_thread) from [] (kthread+0x16c/0x178)
    [] (kthread) from [] (ret_from_fork+0x14/0x38)
    Exception stack(0xc10e3fb0 to 0xc10e3ff8)
    3fa0: 00000000 00000000 00000000 00000000
    3fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
    3fe0: 00000000 00000000 00000000 00000000 00000013 00000000

    INFO: task tar:2083 blocked for more than 491 seconds.
    Not tainted 5.15.49-rt46 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    task:tar state:D stack: 0 pid: 2083 ppid: 2082 flags:0x00000000
    [] (__schedule) from [] (schedule+0xdc/0x134)
    [] (schedule) from [] (io_schedule+0x14/0x24)
    [] (io_schedule) from [] (bit_wait_io+0xc/0x30)
    [] (bit_wait_io) from [] (__wait_on_bit_lock+0x54/0xa8)
    [] (__wait_on_bit_lock) from [] (out_of_line_wait_on_bit_lock+0x84/0xb0)
    [] (out_of_line_wait_on_bit_lock) from [] (fat_mirror_bhs+0xa0/0x144)
    [] (fat_mirror_bhs) from [] (fat_alloc_clusters+0x138/0x2a4)
    [] (fat_alloc_clusters) from [] (fat_alloc_new_dir+0x34/0x250)
    [] (fat_alloc_new_dir) from [] (vfat_mkdir+0x58/0x148)
    [] (vfat_mkdir) from [] (vfs_mkdir+0x68/0x98)
    [] (vfs_mkdir) from [] (do_mkdirat+0xb0/0xec)
    [] (do_mkdirat) from [] (ret_fast_syscall+0x0/0x1c)
    Exception stack(0xc2e1bfa8 to 0xc2e1bff0)
    bfa0: 01ee42f0 01ee4208 01ee42f0 000041ed 00000000 00004000
    bfc0: 01ee42f0 01ee4208 00000000 00000027 01ee4302 00000004 000dcb00 01ee4190
    bfe0: 000dc368 bed11924 0006d4b0 b6ebddfc

    Here the kworker is waiting on msdos_sb_info::s_lock which is held by
    tar which is in turn waiting for a buffer which is locked waiting to be
    flushed, but this operation is plugged in the kworker.

    The lock is a normal struct mutex, so tsk_is_pi_blocked() will always
    return false on !RT and thus the behaviour changes for RT.

    It seems that the intent here is to skip blk_flush_plug() in the case
    where a non-preemptible lock (such as a spinlock) has been converted to
    a rtmutex on RT, which is the case covered by the SM_RTLOCK_WAIT
    schedule flag. But sched_submit_work() is only called from schedule()
    which is never called in this scenario, so the check can simply be
    deleted.

    Looking at the history of the -rt patchset, in fact this change was
    present from v5.9.1-rt20 until being dropped in v5.13-rt1 as it was part
    of a larger patch [1] most of which was replaced by commit b4bfa3fcfe3b
    ("sched/core: Rework the __schedule() preempt argument").

    As described in [1]:

    The schedule process must distinguish between blocking on a regular
    sleeping lock (rwsem and mutex) and a RT-only sleeping lock (spinlock
    and rwlock):
    - rwsem and mutex must flush block requests (blk_schedule_flush_plug())
    even if blocked on a lock. This can not deadlock because this also
    happens for non-RT.
    There should be a warning if the scheduling point is within a RCU read
    section.

    - spinlock and rwlock must not flush block requests. This will deadlock
    if the callback attempts to acquire a lock which is already acquired.
    Similarly to being preempted, there should be no warning if the
    scheduling point is within a RCU read section.

    and with the tsk_is_pi_blocked() in the scheduler path, we hit the first
    issue.

    [1] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0022-locking-rtmutex-Use-custom-scheduling-function-for-s.patch?h=linux-5.10.y-rt-patches

    Signed-off-by: John Keeping
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Steven Rostedt (Google)
    Link: https://lkml.kernel.org/r/20220708162702.1758865-1-john@metanate.com
    Signed-off-by: Sasha Levin

    John Keeping
     
  • [ Upstream commit 70fb5ccf2ebb09a0c8ebba775041567812d45f86 ]

    [Problem Statement]
    select_idle_cpu() might spend too much time searching for an idle CPU,
    when the system is overloaded.

    The following histogram is the time spent in select_idle_cpu(),
    when running 224 instances of netperf on a system with 112 CPUs
    per LLC domain:

    @usecs:
    [0] 533 | |
    [1] 5495 | |
    [2, 4) 12008 | |
    [4, 8) 239252 | |
    [8, 16) 4041924 |@@@@@@@@@@@@@@ |
    [16, 32) 12357398 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
    [32, 64) 14820255 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [64, 128) 13047682 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
    [128, 256) 8235013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
    [256, 512) 4507667 |@@@@@@@@@@@@@@@ |
    [512, 1K) 2600472 |@@@@@@@@@ |
    [1K, 2K) 927912 |@@@ |
    [2K, 4K) 218720 | |
    [4K, 8K) 98161 | |
    [8K, 16K) 37722 | |
    [16K, 32K) 6715 | |
    [32K, 64K) 477 | |
    [64K, 128K) 7 | |

    netperf latency usecs:
    =======
    case load Lat_99th std%
    TCP_RR thread-224 257.39 ( 0.21)

    The time spent in select_idle_cpu() is visible to netperf and might have a negative
    impact.

    [Symptom analysis]
    The patch [1] from Mel Gorman has been applied to track the efficiency
    of select_idle_sibling. Copy the indicators here:

    SIS Search Efficiency(se_eff%):
    A ratio expressed as a percentage of runqueues scanned versus
    idle CPUs found. A 100% efficiency indicates that the target,
    prev or recent CPU of a task was idle at wakeup. The lower the
    efficiency, the more runqueues were scanned before an idle CPU
    was found.

    SIS Domain Search Efficiency(dom_eff%):
    Similar, except only for the slower SIS
    path.

    SIS Fast Success Rate(fast_rate%):
    Percentage of SIS that used target, prev or
    recent CPUs.

    SIS Success rate(success_rate%):
    Percentage of scans that found an idle CPU.

    The test is based on Aubrey's schedtests tool, including netperf, hackbench,
    schbench and tbench.

    Test on vanilla kernel:
    schedstat_parse.py -f netperf_vanilla.log
    case load se_eff% dom_eff% fast_rate% success_rate%
    TCP_RR 28 threads 99.978 18.535 99.995 100.000
    TCP_RR 56 threads 99.397 5.671 99.964 100.000
    TCP_RR 84 threads 21.721 6.818 73.632 100.000
    TCP_RR 112 threads 12.500 5.533 59.000 100.000
    TCP_RR 140 threads 8.524 4.535 49.020 100.000
    TCP_RR 168 threads 6.438 3.945 40.309 99.999
    TCP_RR 196 threads 5.397 3.718 32.320 99.982
    TCP_RR 224 threads 4.874 3.661 25.775 99.767
    UDP_RR 28 threads 99.988 17.704 99.997 100.000
    UDP_RR 56 threads 99.528 5.977 99.970 100.000
    UDP_RR 84 threads 24.219 6.992 76.479 100.000
    UDP_RR 112 threads 13.907 5.706 62.538 100.000
    UDP_RR 140 threads 9.408 4.699 52.519 100.000
    UDP_RR 168 threads 7.095 4.077 44.352 100.000
    UDP_RR 196 threads 5.757 3.775 35.764 99.991
    UDP_RR 224 threads 5.124 3.704 28.748 99.860

    schedstat_parse.py -f schbench_vanilla.log
    (each group has 28 tasks)
    case load se_eff% dom_eff% fast_rate% success_rate%
    normal 1 mthread 99.152 6.400 99.941 100.000
    normal 2 mthreads 97.844 4.003 99.908 100.000
    normal 3 mthreads 96.395 2.118 99.917 99.998
    normal 4 mthreads 55.288 1.451 98.615 99.804
    normal 5 mthreads 7.004 1.870 45.597 61.036
    normal 6 mthreads 3.354 1.346 20.777 34.230
    normal 7 mthreads 2.183 1.028 11.257 21.055
    normal 8 mthreads 1.653 0.825 7.849 15.549

    schedstat_parse.py -f hackbench_vanilla.log
    (each group has 28 tasks)
    case load se_eff% dom_eff% fast_rate% success_rate%
    process-pipe 1 group 99.991 7.692 99.999 100.000
    process-pipe 2 groups 99.934 4.615 99.997 100.000
    process-pipe 3 groups 99.597 3.198 99.987 100.000
    process-pipe 4 groups 98.378 2.464 99.958 100.000
    process-pipe 5 groups 27.474 3.653 89.811 99.800
    process-pipe 6 groups 20.201 4.098 82.763 99.570
    process-pipe 7 groups 16.423 4.156 77.398 99.316
    process-pipe 8 groups 13.165 3.920 72.232 98.828
    process-sockets 1 group 99.977 5.882 99.999 100.000
    process-sockets 2 groups 99.927 5.505 99.996 100.000
    process-sockets 3 groups 99.397 3.250 99.980 100.000
    process-sockets 4 groups 79.680 4.258 98.864 99.998
    process-sockets 5 groups 7.673 2.503 63.659 92.115
    process-sockets 6 groups 4.642 1.584 58.946 88.048
    process-sockets 7 groups 3.493 1.379 49.816 81.164
    process-sockets 8 groups 3.015 1.407 40.845 75.500
    threads-pipe 1 group 99.997 0.000 100.000 100.000
    threads-pipe 2 groups 99.894 2.932 99.997 100.000
    threads-pipe 3 groups 99.611 4.117 99.983 100.000
    threads-pipe 4 groups 97.703 2.624 99.937 100.000
    threads-pipe 5 groups 22.919 3.623 87.150 99.764
    threads-pipe 6 groups 18.016 4.038 80.491 99.557
    threads-pipe 7 groups 14.663 3.991 75.239 99.247
    threads-pipe 8 groups 12.242 3.808 70.651 98.644
    threads-sockets 1 group 99.990 6.667 99.999 100.000
    threads-sockets 2 groups 99.940 5.114 99.997 100.000
    threads-sockets 3 groups 99.469 4.115 99.977 100.000
    threads-sockets 4 groups 87.528 4.038 99.400 100.000
    threads-sockets 5 groups 6.942 2.398 59.244 88.337
    threads-sockets 6 groups 4.359 1.954 49.448 87.860
    threads-sockets 7 groups 2.845 1.345 41.198 77.102
    threads-sockets 8 groups 2.871 1.404 38.512 74.312

    schedstat_parse.py -f tbench_vanilla.log
    case load se_eff% dom_eff% fast_rate% success_rate%
    loopback 28 threads 99.976 18.369 99.995 100.000
    loopback 56 threads 99.222 7.799 99.934 100.000
    loopback 84 threads 19.723 6.819 70.215 100.000
    loopback 112 threads 11.283 5.371 55.371 99.999
    loopback 140 threads 0.000 0.000 0.000 0.000
    loopback 168 threads 0.000 0.000 0.000 0.000
    loopback 196 threads 0.000 0.000 0.000 0.000
    loopback 224 threads 0.000 0.000 0.000 0.000

    According to the test above, if the system becomes busy, the
    SIS Search Efficiency(se_eff%) drops significantly. Although some
    benchmarks would finally find an idle CPU(success_rate% = 100%), it is
    doubtful whether it is worth it to search the whole LLC domain.

    [Proposal]
    It would be ideal to have a crystal ball to answer this question:
    How many CPUs must a wakeup path walk down, before it can find an idle
    CPU? Many potential metrics could be used to predict the number.
    One candidate is the sum of util_avg in this LLC domain. The benefit
    of choosing util_avg is that it is a metric of accumulated historic
    activity, which seems to be smoother than instantaneous metrics
    (such as rq->nr_running). Besides, choosing the sum of util_avg
    would help predict the load of the LLC domain more precisely, because
    SIS_PROP uses one CPU's idle time to estimate the total LLC domain idle
    time.

    In summary, the lower the util_avg is, the more select_idle_cpu()
    should scan for idle CPU, and vice versa. When the sum of util_avg
    in this LLC domain hits 85% or above, the scan stops. The reason to
    choose 85% as the threshold is that this is the imbalance_pct(117)
    when a LLC sched group is overloaded.

    Introduce the quadratic function:

    y = SCHED_CAPACITY_SCALE - p * x^2
    and y'= y / SCHED_CAPACITY_SCALE

    x is the ratio of sum_util compared to the CPU capacity:
    x = sum_util / (llc_weight * SCHED_CAPACITY_SCALE)
    y' is the ratio of CPUs to be scanned in the LLC domain,
    and the number of CPUs to scan is calculated by:

    nr_scan = llc_weight * y'

    A quadratic function is chosen because:
    [1] Compared to the linear function, it scans more aggressively when the
    sum_util is low.
    [2] Compared to the exponential function, it is easier to calculate.
    [3] It seems that there is no accurate mapping between the sum of util_avg
    and the number of CPUs to be scanned. Use heuristic scan for now.

    For a platform with 112 CPUs per LLC, the number of CPUs to scan is:
    sum_util% 0 5 15 25 35 45 55 65 75 85 86 ...
    scan_nr 112 111 108 102 93 81 65 47 25 1 0 ...

    For a platform with 16 CPUs per LLC, the number of CPUs to scan is:
    sum_util% 0 5 15 25 35 45 55 65 75 85 86 ...
    scan_nr 16 15 15 14 13 11 9 6 3 0 0 ...
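
    As a worked sketch of the mapping above (self-contained, with p chosen so
    that y reaches 0 at the 85% threshold; this mirrors the description
    rather than the exact in-kernel arithmetic):

        /*
         * Number of CPUs to scan in an LLC domain of llc_weight CPUs, given
         * the sum of util_avg over that domain (sum_util).
         */
        static long sis_util_nr_scan(long long sum_util, long llc_weight)
        {
                const long long cap = 1024;             /* SCHED_CAPACITY_SCALE */
                long long x = sum_util / llc_weight;    /* x scaled by cap */
                /* y = cap - p * x^2, with p chosen so y == 0 at 85% sum_util */
                long long y = cap - x * x * 10000 / (85 * 85) / cap;

                if (y < 0)
                        y = 0;
                /* nr_scan = llc_weight * y', where y' = y / cap */
                return llc_weight * y / cap;
        }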

    Furthermore, to minimize the overhead of calculating the metrics in
    select_idle_cpu(), borrow the statistics from periodic load balance.
    As mentioned by Abel, on a platform with 112 CPUs per LLC, the
    sum_util calculated by periodic load balance after 112 ms would
    decay to about 0.5 * 0.5 * 0.5 * 0.7 = 8.75%, thus bringing a delay
    in reflecting the latest utilization. But it is a trade-off.
    Checking the util_avg in newidle load balance would be more frequent,
    but it brings overhead - multiple CPUs write/read the per-LLC shared
    variable and introduces cache contention. Tim also mentioned that,
    it is allowed to be non-optimal in terms of scheduling for the
    short-term variations, but if there is a long-term trend in the load
    behavior, the scheduler can adjust for that.

    When SIS_UTIL is enabled, the select_idle_cpu() uses the nr_scan
    calculated by SIS_UTIL instead of the one from SIS_PROP. As Peter and
    Mel suggested, SIS_UTIL should be enabled by default.

    This patch is based on the util_avg, which is very sensitive to the
    CPU frequency invariance. There is an issue that, when the max frequency
    has been clamped, the util_avg would decay insanely fast when
    the CPU is idle. Commit addca285120b ("cpufreq: intel_pstate: Handle no_turbo
    in frequency invariance") could be used to mitigate this symptom, by adjusting
    the arch_max_freq_ratio when turbo is disabled. But this issue is still
    not thoroughly fixed, because the current code is unaware of the user-specified
    max CPU frequency.

    [Test result]

    netperf and tbench were launched with 25% 50% 75% 100% 125% 150%
    175% 200% of CPU number respectively. Hackbench and schbench were launched
    with 1, 2, 4, 8 groups. Each test lasts for 100 seconds and repeats 3 times.

    The following is the benchmark result comparison between
    baseline:vanilla v5.19-rc1 and compare:patched kernel. Positive compare%
    indicates better performance.

    Each netperf test is a:
    netperf -4 -H 127.0.1 -t TCP/UDP_RR -c -C -l 100
    netperf.throughput
    =======
    case load baseline(std%) compare%( std%)
    TCP_RR 28 threads 1.00 ( 0.34) -0.16 ( 0.40)
    TCP_RR 56 threads 1.00 ( 0.19) -0.02 ( 0.20)
    TCP_RR 84 threads 1.00 ( 0.39) -0.47 ( 0.40)
    TCP_RR 112 threads 1.00 ( 0.21) -0.66 ( 0.22)
    TCP_RR 140 threads 1.00 ( 0.19) -0.69 ( 0.19)
    TCP_RR 168 threads 1.00 ( 0.18) -0.48 ( 0.18)
    TCP_RR 196 threads 1.00 ( 0.16) +194.70 ( 16.43)
    TCP_RR 224 threads 1.00 ( 0.16) +197.30 ( 7.85)
    UDP_RR 28 threads 1.00 ( 0.37) +0.35 ( 0.33)
    UDP_RR 56 threads 1.00 ( 11.18) -0.32 ( 0.21)
    UDP_RR 84 threads 1.00 ( 1.46) -0.98 ( 0.32)
    UDP_RR 112 threads 1.00 ( 28.85) -2.48 ( 19.61)
    UDP_RR 140 threads 1.00 ( 0.70) -0.71 ( 14.04)
    UDP_RR 168 threads 1.00 ( 14.33) -0.26 ( 11.16)
    UDP_RR 196 threads 1.00 ( 12.92) +186.92 ( 20.93)
    UDP_RR 224 threads 1.00 ( 11.74) +196.79 ( 18.62)

    Take the 224 threads as an example, the SIS search metrics changes are
    illustrated below:

    vanilla patched
    4544492 +237.5% 15338634 sched_debug.cpu.sis_domain_search.avg
    38539 +39686.8% 15333634 sched_debug.cpu.sis_failed.avg
    128300000 -87.9% 15551326 sched_debug.cpu.sis_scanned.avg
    5842896 +162.7% 15347978 sched_debug.cpu.sis_search.avg

    There are 87.9% fewer CPU scans after the patch, which indicates lower
    overhead. Besides, with this patch applied, there is 13% less rq lock
    contention in
    perf-profile.calltrace.cycles-pp._raw_spin_lock.raw_spin_rq_lock_nested
    .try_to_wake_up.default_wake_function.woken_wake_function.
    This might help explain the performance improvement, because this patch
    allows the waking task to remain on the previous CPU rather than grabbing
    other CPUs' locks.

    Each hackbench test is a:
    hackbench -g $job --process/threads --pipe/sockets -l 1000000 -s 100
    hackbench.throughput
    =========
    case load baseline(std%) compare%( std%)
    process-pipe 1 group 1.00 ( 1.29) +0.57 ( 0.47)
    process-pipe 2 groups 1.00 ( 0.27) +0.77 ( 0.81)
    process-pipe 4 groups 1.00 ( 0.26) +1.17 ( 0.02)
    process-pipe 8 groups 1.00 ( 0.15) -4.79 ( 0.02)
    process-sockets 1 group 1.00 ( 0.63) -0.92 ( 0.13)
    process-sockets 2 groups 1.00 ( 0.03) -0.83 ( 0.14)
    process-sockets 4 groups 1.00 ( 0.40) +5.20 ( 0.26)
    process-sockets 8 groups 1.00 ( 0.04) +3.52 ( 0.03)
    threads-pipe 1 group 1.00 ( 1.28) +0.07 ( 0.14)
    threads-pipe 2 groups 1.00 ( 0.22) -0.49 ( 0.74)
    threads-pipe 4 groups 1.00 ( 0.05) +1.88 ( 0.13)
    threads-pipe 8 groups 1.00 ( 0.09) -4.90 ( 0.06)
    threads-sockets 1 group 1.00 ( 0.25) -0.70 ( 0.53)
    threads-sockets 2 groups 1.00 ( 0.10) -0.63 ( 0.26)
    threads-sockets 4 groups 1.00 ( 0.19) +11.92 ( 0.24)
    threads-sockets 8 groups 1.00 ( 0.08) +4.31 ( 0.11)

    Each tbench test is a:
    tbench -t 100 $job 127.0.0.1
    tbench.throughput
    ======
    case load baseline(std%) compare%( std%)
    loopback 28 threads 1.00 ( 0.06) -0.14 ( 0.09)
    loopback 56 threads 1.00 ( 0.03) -0.04 ( 0.17)
    loopback 84 threads 1.00 ( 0.05) +0.36 ( 0.13)
    loopback 112 threads 1.00 ( 0.03) +0.51 ( 0.03)
    loopback 140 threads 1.00 ( 0.02) -1.67 ( 0.19)
    loopback 168 threads 1.00 ( 0.38) +1.27 ( 0.27)
    loopback 196 threads 1.00 ( 0.11) +1.34 ( 0.17)
    loopback 224 threads 1.00 ( 0.11) +1.67 ( 0.22)

    Each schbench test is a:
    schbench -m $job -t 28 -r 100 -s 30000 -c 30000
    schbench.latency_90%_us
    ========
    case load baseline(std%) compare%( std%)
    normal 1 mthread 1.00 ( 31.22) -7.36 ( 20.25)*
    normal 2 mthreads 1.00 ( 2.45) -0.48 ( 1.79)
    normal 4 mthreads 1.00 ( 1.69) +0.45 ( 0.64)
    normal 8 mthreads 1.00 ( 5.47) +9.81 ( 14.28)

    *Considering the standard deviation, this -7.36% regression might not be valid.

    Also, an OLTP workload with a commercial RDBMS has been tested, and there
    is no significant change.

    There were concerns that unbalanced tasks among CPUs would cause problems.
    For example, suppose the LLC domain is composed of 8 CPUs, and 7 tasks are
    bound to CPU0~CPU6, while CPU7 is idle:

    CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
    util_avg 1024 1024 1024 1024 1024 1024 1024 0

    Since the util_avg ratio is 87.5% (= 7/8), which is higher than 85%,
    select_idle_cpu() will not scan, and thus CPU7 goes undetected during the
    scan. But according to Mel, it is unlikely that CPU7 will stay idle all
    the time, because it could pull some tasks via CPU_NEWLY_IDLE balancing.

    lkp (kernel test robot) has reported a regression on stress-ng.sock on a
    very busy system. According to the sched_debug statistics, it might be
    caused by SIS_UTIL terminating the scan and choosing a previous CPU
    earlier, and this might introduce more context switches, especially
    involuntary preemption, which impacts a busy stress-ng. This regression
    has shown that not all benchmarks in every scenario benefit from the idle
    CPU scan limit, and it needs further investigation.

    Besides, there is a slight regression in hackbench's 16 groups case when
    the LLC domain has 16 CPUs. Prateek mentioned that we should scan
    aggressively in an LLC domain with 16 CPUs, because the cost to search for
    an idle CPU among 16 CPUs is negligible. The current patch aims to propose
    a generic solution and only considers the util_avg. Something like the
    below could be applied on top of the current patch to fulfill the
    requirement:

        if (llc_weight <= 16)
                nr_scan = nr_scan * 32 / llc_weight;

    Suggested-by: Peter Zijlstra
    Signed-off-by: Chen Yu
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Yicong Yang
    Tested-by: Mohini Narkhede
    Tested-by: K Prateek Nayak
    Link: https://lore.kernel.org/r/20220612163428.849378-1-yu.c.chen@intel.com
    Signed-off-by: Sasha Levin

    Chen Yu
     

29 Jul, 2022

1 commit

  • commit ddfc710395cccc61247348df9eb18ea50321cbed upstream.

    Tasks that are being deboosted from SCHED_DEADLINE might enter
    enqueue_task_dl() one last time and hit an erroneous BUG_ON condition:
    since they are not boosted anymore, the if (is_dl_boosted()) branch is
    not taken, but the else if (!dl_prio) branch is, and inside it we
    BUG_ON(!is_dl_boosted), which is of course false (so the BUG_ON
    triggers); otherwise we would have entered the if branch above. Long
    story short, the current condition doesn't make sense and always leads
    to triggering of a BUG.

    Fix this by only checking enqueue flags, properly: ENQUEUE_REPLENISH has
    to be present, but additional flags are not a problem.

    Fixes: 64be6f1f5f71 ("sched/deadline: Don't replenish from a !SCHED_DEADLINE entity")
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20220714151908.533052-1-juri.lelli@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Juri Lelli
     

22 Jun, 2022

1 commit

  • [ Upstream commit 04193d590b390ec7a0592630f46d559ec6564ba1 ]

    The purpose of balance_push() is to act as a filter on task selection
    in the case of CPU hotplug, specifically when taking the CPU out.

    It does this by (ab)using the balance callback infrastructure, with
    the express purpose of keeping all the unlikely/odd cases in a single
    place.

    In order to serve its purpose, the balance_push_callback needs to be
    (exclusively) on the callback list at all times (noting that the
    callback always places itself back on the list the moment it runs,
    also noting that when the CPU goes down, regular balancing concerns
    are moot, so ignoring them is fine).

    And herein lies the problem: __sched_setscheduler()'s use of
    splice_balance_callbacks() takes the callbacks off the list across a
    lock-break, making it possible for an interleaving __schedule() to see
    an empty list and not get filtered.

    Fixes: ae7927023243 ("sched: Optimize finish_lock_switch()")
    Reported-by: Jing-Ting Wu
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Jing-Ting Wu
    Link: https://lkml.kernel.org/r/20220519134706.GH2578@worktop.programming.kicks-ass.net
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     

09 Jun, 2022

3 commits

  • [ Upstream commit 890d550d7dbac7a31ecaa78732aa22be282bb6b8 ]

    Martin found it confusing when looking at the /proc/pressure/cpu output,
    and found no hint about the CPU "full" line in the psi Documentation.

    % cat /proc/pressure/cpu
    some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489
    full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277

    The PSI_CPU_FULL state is introduced by commit e7fcd7622823
    ("psi: Add PSI_CPU_FULL state"), which is mainly for the cgroup level,
    but is also counted at the system level as a side effect.

    Naturally, the FULL state doesn't exist for the CPU resource at
    the system level. These "full" numbers can come from CPU idle
    schedule latency. For example, t1 is the time when a task wakes up
    on an idle CPU, and t2 is the time when the CPU picks it and switches
    to it. The delta (t2 - t1) will be in the CPU_FULL state.

    Another case where all processes can be stalled is when all cgroups
    have been throttled at the same time, which is unlikely to happen.

    Anyway, the CPU_FULL metric is meaningless and confusing at the
    system level. So this patch will report zeroes for CPU full
    at the system level, and update the psi Documentation accordingly.

    Fixes: e7fcd7622823 ("psi: Add PSI_CPU_FULL state")
    Reported-by: Martin Steigerwald
    Suggested-by: Johannes Weiner
    Signed-off-by: Chengming Zhou
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Johannes Weiner
    Link: https://lore.kernel.org/r/20220408121914.82855-1-zhouchengming@bytedance.com
    Signed-off-by: Sasha Levin

    Chengming Zhou
     
  • [ Upstream commit 64eaf50731ac0a8c76ce2fedd50ef6652aabc5ff ]

    Since commit 23127296889f ("sched/fair: Update scale invariance of PELT")
    changed the code to use rq_clock_pelt() instead of rq_clock_task(), we
    should also use rq_clock_pelt() for throttled_clock_task_time and
    throttled_clock_task accounting to get a correct cfs_rq_clock_pelt() of a
    throttled cfs_rq. Also rename throttled_clock_task(_time) to clock_pelt
    rather than clock_task.

    Fixes: 23127296889f ("sched/fair: Update scale invariance of PELT")
    Signed-off-by: Chengming Zhou
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Ben Segall
    Reviewed-by: Vincent Guittot
    Link: https://lore.kernel.org/r/20220408115309.81603-1-zhouchengming@bytedance.com
    Signed-off-by: Sasha Levin

    Chengming Zhou
     
  • [ Upstream commit 2679a83731d51a744657f718fc02c3b077e47562 ]

    When we use raw_spin_rq_lock() to acquire the rq lock and have to
    update the rq clock while holding the lock, the kernel may issue
    a WARN_DOUBLE_CLOCK warning.

    Since we directly use raw_spin_rq_lock() to acquire the rq lock instead
    of rq_lock(), there is no corresponding change to rq->clock_update_flags.
    In particular, when we have obtained the rq lock of another CPU, that
    rq's clock_update_flags may already be RQCF_UPDATED, and then calling
    update_rq_clock() will trigger the WARN_DOUBLE_CLOCK warning.

    So we need to clear RQCF_UPDATED of rq->clock_update_flags to avoid
    the WARN_DOUBLE_CLOCK warning.

    For the sched_rt_period_timer() and migrate_task_rq_dl() cases
    we simply replace raw_spin_rq_lock()/raw_spin_rq_unlock() with
    rq_lock()/rq_unlock().

    For the {pull,push}_{rt,dl}_task() cases, we add the
    double_rq_clock_clear_update() function to clear RQCF_UPDATED of
    rq->clock_update_flags, and call double_rq_clock_clear_update()
    before double_lock_balance()/double_rq_lock() returns to avoid the
    WARN_DOUBLE_CLOCK warning.
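
    A sketch of what such a helper can look like (simplified; the real flag
    handling may differ in detail):

        static inline void
        double_rq_clock_clear_update(struct rq *rq1, struct rq *rq2)
        {
        #ifdef CONFIG_SCHED_DEBUG
                /* Clear RQCF_UPDATED so the next update_rq_clock() doesn't warn. */
                rq1->clock_update_flags &= (RQCF_REQ_SKIP | RQCF_ACT_SKIP);
                if (rq2 != rq1)
                        rq2->clock_update_flags &= (RQCF_REQ_SKIP | RQCF_ACT_SKIP);
        #endif
        }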

    Some call trace reports:
    Call Trace 1:

    sched_rt_period_timer+0x10f/0x3a0
    ? enqueue_top_rt_rq+0x110/0x110
    __hrtimer_run_queues+0x1a9/0x490
    hrtimer_interrupt+0x10b/0x240
    __sysvec_apic_timer_interrupt+0x8a/0x250
    sysvec_apic_timer_interrupt+0x9a/0xd0


    asm_sysvec_apic_timer_interrupt+0x12/0x20

    Call Trace 2:

    activate_task+0x8b/0x110
    push_rt_task.part.108+0x241/0x2c0
    push_rt_tasks+0x15/0x30
    finish_task_switch+0xaa/0x2e0
    ? __switch_to+0x134/0x420
    __schedule+0x343/0x8e0
    ? hrtimer_start_range_ns+0x101/0x340
    schedule+0x4e/0xb0
    do_nanosleep+0x8e/0x160
    hrtimer_nanosleep+0x89/0x120
    ? hrtimer_init_sleeper+0x90/0x90
    __x64_sys_nanosleep+0x96/0xd0
    do_syscall_64+0x34/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Call Trace 3:

    deactivate_task+0x93/0xe0
    pull_rt_task+0x33e/0x400
    balance_rt+0x7e/0x90
    __schedule+0x62f/0x8e0
    do_task_dead+0x3f/0x50
    do_exit+0x7b8/0xbb0
    do_group_exit+0x2d/0x90
    get_signal+0x9df/0x9e0
    ? preempt_count_add+0x56/0xa0
    ? __remove_hrtimer+0x35/0x70
    arch_do_signal_or_restart+0x36/0x720
    ? nanosleep_copyout+0x39/0x50
    ? do_nanosleep+0x131/0x160
    ? audit_filter_inodes+0xf5/0x120
    exit_to_user_mode_prepare+0x10f/0x1e0
    syscall_exit_to_user_mode+0x17/0x30
    do_syscall_64+0x40/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Call Trace 4:
    update_rq_clock+0x128/0x1a0
    migrate_task_rq_dl+0xec/0x310
    set_task_cpu+0x84/0x1e4
    try_to_wake_up+0x1d8/0x5c0
    wake_up_process+0x1c/0x30
    hrtimer_wakeup+0x24/0x3c
    __hrtimer_run_queues+0x114/0x270
    hrtimer_interrupt+0xe8/0x244
    arch_timer_handler_phys+0x30/0x50
    handle_percpu_devid_irq+0x88/0x140
    generic_handle_domain_irq+0x40/0x60
    gic_handle_irq+0x48/0xe0
    call_on_irq_stack+0x2c/0x60
    do_interrupt_handler+0x80/0x84

    Steps to reproduce:
    1. Enable CONFIG_SCHED_DEBUG when compiling the kernel
    2. echo 1 > /sys/kernel/debug/clear_warn_once
    echo "WARN_DOUBLE_CLOCK" > /sys/kernel/debug/sched/features
    echo "NO_RT_PUSH_IPI" > /sys/kernel/debug/sched/features
    3. Run some rt/dl tasks that periodically work and sleep, e.g.
    Create 2*n rt or dl (90% running) tasks via rt-app (on a system
    with n CPUs), and Dietmar Eggemann reports Call Trace 4 when running
    on a PREEMPT_RT kernel.

    Signed-off-by: Hao Jia
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Link: https://lore.kernel.org/r/20220430085843.62939-2-jiahao.os@bytedance.com
    Signed-off-by: Sasha Levin

    Hao Jia
     

27 Apr, 2022

1 commit

  • [ Upstream commit 40f5aa4c5eaebfeaca4566217cb9c468e28ed682 ]

    The warning in cfs_rq_is_decayed() triggered:

    SCHED_WARN_ON(cfs_rq->avg.load_avg ||
    cfs_rq->avg.util_avg ||
    cfs_rq->avg.runnable_avg)

    There exists a corner case in attach_entity_load_avg() which will
    cause load_sum to be zero while load_avg will not be.

    Consider se_weight is 88761 as per the sched_prio_to_weight[] table.
    Further assume the get_pelt_divider() is 47742, this gives:
    se->avg.load_avg is 1.

    However, calculating load_sum:

    se->avg.load_sum = div_u64(se->avg.load_avg * se->avg.load_sum, se_weight(se));
    se->avg.load_sum = 1*47742/88761 = 0.

    Then enqueue_load_avg() adds this to the cfs_rq totals:

    cfs_rq->avg.load_avg += se->avg.load_avg;
    cfs_rq->avg.load_sum += se_weight(se) * se->avg.load_sum;

    Resulting in load_avg being 1 while load_sum is 0, which will trigger
    the WARN.

    Fixes: f207934fb79d ("sched/fair: Align PELT windows between cfs_rq and its se")
    Signed-off-by: kuyo chang
    [peterz: massage changelog]
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Tested-by: Dietmar Eggemann
    Link: https://lkml.kernel.org/r/20220414090229.342-1-kuyo.chang@mediatek.com
    Signed-off-by: Sasha Levin

    kuyo chang
     

14 Apr, 2022

1 commit

  • commit 386ef214c3c6ab111d05e1790e79475363abaa05 upstream.

    try_steal_cookie() looks at task_struct::cpus_mask to decide if the
    task could be moved to `this' CPU. It ignores that the task might be in
    a migration disabled section while not on the CPU. In this case the task
    must not be moved, otherwise per-CPU assumptions are broken.

    Use is_cpu_allowed(), as suggested by Peter Zijlstra, to decide if a
    task can be moved.
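
    Conceptually the change in try_steal_cookie() is of this shape (a
    before/after sketch, not the literal diff):

        /* Before: only looks at the affinity mask. */
        if (!cpumask_test_cpu(this, &p->cpus_mask))
                goto next;

        /* After: also honours migrate_disable() and friends. */
        if (!is_cpu_allowed(p, this))
                goto next;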

    Fixes: d2dfa17bc7de6 ("sched: Trivial forced-newidle balancer")
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/YjNK9El+3fzGmswf@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Sebastian Andrzej Siewior
     

08 Apr, 2022

6 commits

  • [ Upstream commit 49bef33e4b87b743495627a529029156c6e09530 ]

    John reported that push_rt_task() can end up invoking
    find_lowest_rq(rq->curr) when curr is not an RT task (in this case a CFS
    one), which causes mayhem down convert_prio().

    This can happen when current gets demoted to e.g. CFS when releasing an
    rt_mutex, and the local CPU gets hit with an rto_push_work irqwork before
    getting the chance to reschedule. Exactly who triggers this work isn't
    entirely clear to me - switched_from_rt() only invokes rt_queue_pull_task()
    if there are no RT tasks on the local RQ, which means the local CPU can't
    be in the rto_mask.

    My current suspected sequence is something along the lines of the below,
    with the demoted task being current.

    mark_wakeup_next_waiter()
      rt_mutex_adjust_prio()
        rt_mutex_setprio() // deboost originally-CFS task
          check_class_changed()
            switched_from_rt() // Only rt_queue_pull_task() if !rq->rt.rt_nr_running
            switched_to_fair() // Sets need_resched
          __balance_callbacks() // if pull_rt_task(), tell_cpu_to_push() can't select local CPU per the above
          raw_spin_rq_unlock(rq)

          // need_resched is set, so task_woken_rt() can't
          // invoke push_rt_tasks(). Best I can come up with is
          // local CPU has rt_nr_migratory >= 2 after the demotion, so stays
          // in the rto_mask, and then:

    (rto_push_work irqwork, queued by another CPU, runs on the local CPU)
      push_rt_task()
        // breakage follows here as rq->curr is CFS

    Move an existing check to check rq->curr vs the next pushable task's
    priority before getting anywhere near find_lowest_rq(). While at it, add an
    explicit sched_class of rq->curr check prior to invoking
    find_lowest_rq(rq->curr). Align the DL logic to also reschedule regardless
    of next_task's migratability.
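
    In rough terms, the guards described above amount to something like this
    early in push_rt_task() (illustrative ordering, not the exact hunk):

        /*
         * next_task may have slipped in at higher priority than current;
         * if so, just reschedule current instead of pushing.
         */
        if (unlikely(next_task->prio < rq->curr->prio)) {
                resched_curr(rq);
                return 0;
        }

        /* Never hand a non-RT rq->curr (e.g. a deboosted CFS task) to find_lowest_rq(). */
        if (rq->curr->sched_class != &rt_sched_class)
                return 0;

        cpu = find_lowest_rq(rq->curr);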

    Fixes: a7c81556ec4d ("sched: Fix migrate_disable() vs rt/dl balancing")
    Reported-by: John Keeping
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Tested-by: John Keeping
    Link: https://lore.kernel.org/r/20220127154059.974729-1-valentin.schneider@arm.com
    Signed-off-by: Sasha Levin

    Valentin Schneider
     
  • [ Upstream commit 248cc9993d1cc12b8e9ed716cc3fc09f6c3517dd ]

    The cpuacct_account_field() is always called by the current task
    itself, so it's ok to use __this_cpu_add() to charge the tick time.

    But cpuacct_charge() may be called by update_curr() in load_balance()
    on a random CPU, different from the CPU on which the task is running.
    So __this_cpu_add() will charge that cputime to a random incorrect CPU.
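
    The shape of the fix, as a sketch (indexing by the task's CPU instead of
    the caller's local CPU; simplified from the real cpuacct_charge()):

        static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
        {
                unsigned int cpu = task_cpu(tsk);   /* CPU the task runs on */
                struct cpuacct *ca;

                rcu_read_lock();
                for (ca = task_ca(tsk); ca; ca = parent_ca(ca))
                        /* charge the task's own CPU, not whoever called us */
                        *per_cpu_ptr(ca->cpuusage, cpu) += cputime;
                rcu_read_unlock();
        }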

    Fixes: 73e6aafd9ea8 ("sched/cpuacct: Simplify the cpuacct code")
    Reported-by: Minye Zhu
    Signed-off-by: Chengming Zhou
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Tejun Heo
    Link: https://lore.kernel.org/r/20220220051426.5274-1-zhouchengming@bytedance.com
    Signed-off-by: Sasha Levin

    Chengming Zhou
     
  • [ Upstream commit 2cfb7a1b031b0e816af7a6ee0c6ab83b0acdf05a ]

    There are inconsistencies when determining if a NUMA imbalance is allowed
    that should be corrected.

    o allow_numa_imbalance changes types and is not always examining
    the destination group so both the type should be corrected as
    well as the naming.
    o find_idlest_group uses the sched_domain's weight instead of the
    group weight which is different to find_busiest_group
    o find_busiest_group uses the source group instead of the destination
    which is different to task_numa_find_cpu
    o Both find_idlest_group and find_busiest_group should account
    for the number of running tasks if a move was allowed to be
    consistent with task_numa_find_cpu

    Fixes: 7d2b5dd0bcc4 ("sched/numa: Allow a floating imbalance between NUMA nodes")
    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Gautham R. Shenoy
    Link: https://lore.kernel.org/r/20220208094334.16379-2-mgorman@techsingularity.net
    Signed-off-by: Sasha Levin

    Mel Gorman
     
  • [ Upstream commit d37aee9018e68b0d356195caefbb651910e0bbfa ]

    iowait_boost signal is applied independently of util and doesn't take
    into account uclamp settings of the rq. An io heavy task that is capped
    by uclamp_max could still request higher frequency because
    sugov_iowait_apply() doesn't clamp the boost via uclamp_rq_util_with()
    like effective_cpu_util() does.

    Make sure that iowait_boost honours uclamp requests by calling
    uclamp_rq_util_with() when applying the boost.
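
    In sketch form, the boost gets clamped before it is compared with the
    CPU utilization (illustrative, not necessarily the exact upstream hunk):

        /*
         * Clamp the iowait boost with the rq's uclamp constraints, the same
         * way effective_cpu_util() clamps regular utilization.
         */
        boost = uclamp_rq_util_with(cpu_rq(sg_cpu->cpu), boost, NULL);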

    Fixes: 982d9cdc22c9 ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks")
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rafael J. Wysocki
    Link: https://lore.kernel.org/r/20211216225320.2957053-3-qais.yousef@arm.com
    Signed-off-by: Sasha Levin

    Qais Yousef
     
  • [ Upstream commit 77cf151b7bbdfa3577b3c3f3a5e267a6c60a263b ]

    We can't use this tracepoint in modules without having the symbol
    exported first, fix that.
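
    The usual way to do that is a one-line export next to the other pelt
    tracepoint exports (assuming the symbol in question is pelt_thermal_tp,
    per the Fixes tag below):

        EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_thermal_tp);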

    Fixes: 765047932f15 ("sched/pelt: Add support to track thermal pressure")
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20211028115005.873539-1-qais.yousef@arm.com
    Signed-off-by: Sasha Levin

    Qais Yousef
     
  • [ Upstream commit 28c988c3ec29db74a1dda631b18785958d57df4f ]

    The older format of /proc/pid/sched printed home node info which
    required the mempolicy and task lock around mpol_get(). However
    the format has changed since then and there is no need for
    sched_show_numa() any more to have a mempolicy argument, the associated
    mpol_get/put, and the task_lock/unlock. Remove them.

    Fixes: 397f2378f1361 ("sched/numa: Fix numa balancing stats in /proc/pid/sched")
    Signed-off-by: Bharata B Rao
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Srikar Dronamraju
    Acked-by: Mel Gorman
    Link: https://lore.kernel.org/r/20220118050515.2973-1-bharata@amd.com
    Signed-off-by: Sasha Levin

    Bharata B Rao
     

09 Mar, 2022

2 commits

  • commit b1e8206582f9d680cff7d04828708c8b6ab32957 upstream.

    Where commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an
    invalid sched_task_group") fixed a fork race vs cgroup, it opened up a
    race vs syscalls by not placing the task on the runqueue before it
    gets exposed through the pidhash.

    Commit 13765de8148f ("sched/fair: Fix fault in reweight_entity") is
    trying to fix a single instance of this, instead fix the whole class
    of issues, effectively reverting this commit.

    Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
    Reported-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Tadeusz Struk
    Tested-by: Zhang Qiao
    Tested-by: Dietmar Eggemann
    Link: https://lkml.kernel.org/r/YgoeCbwj5mbCR0qA@hirez.programming.kicks-ass.net
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • [ Upstream commit 13765de8148f71fa795e0a6607de37c49ea5915a ]

    Syzbot found a GPF in reweight_entity. This has been bisected to
    commit 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid
    sched_task_group")

    There is a race between sched_post_fork() and setpriority(PRIO_PGRP)
    within a thread group that causes a null-ptr-deref in
    reweight_entity() in CFS. The scenario is that the main process spawns
    number of new threads, which then call setpriority(PRIO_PGRP, 0, -20),
    wait, and exit. For each of the new threads the copy_process() gets
    invoked, which adds the new task_struct and calls sched_post_fork()
    for it.

    In the above scenario there is a possibility that
    setpriority(PRIO_PGRP) and set_one_prio() will be called for a thread
    in the group that is just being created by copy_process(), and for
    which the sched_post_fork() has not been executed yet. This will
    trigger a null pointer dereference in reweight_entity(), as it will
    try to access the run queue pointer, which hasn't been set.

    Before the mentioned change the cfs_rq pointer for the task has been
    set in sched_fork(), which is called much earlier in copy_process(),
    before the new task is added to the thread_group. Now it is done in
    sched_post_fork(), which is called after that. To fix the issue, remove
    the update_load parameter from the set_load_weight() function and call
    reweight_task() only if the task doesn't have the TASK_NEW flag set.
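
    A sketch of the resulting check (assuming, as described above, that the
    helper in question is set_load_weight(); the surrounding code is
    omitted):

        static void set_load_weight(struct task_struct *p)
        {
                /* A TASK_NEW task has no runqueue to reweight against yet. */
                bool update_load = !(READ_ONCE(p->__state) & TASK_NEW);
                ...
                if (update_load && p->sched_class == &fair_sched_class)
                        reweight_task(p, p->static_prio - MAX_RT_PRIO);
                ...
        }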

    Fixes: 4ef0c5c6b5ba ("kernel/sched: Fix sched_fork() access an invalid sched_task_group")
    Reported-by: syzbot+af7a719bc92395ee41b3@syzkaller.appspotmail.com
    Signed-off-by: Tadeusz Struk
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20220203161846.1160750-1-tadeusz.struk@linaro.org
    Signed-off-by: Sasha Levin

    Tadeusz Struk
     

16 Feb, 2022

1 commit

  • [ Upstream commit 7e406d1ff39b8ee574036418a5043c86723170cf ]

    For PREEMPT/DYNAMIC_PREEMPT the *_unlock() will already trigger a
    preemption, no point in then calling preempt_schedule_common()
    *again*.

    Use _cond_resched() instead, since this is a NOP for the preemptible
    configs while it provides a preemption point for the others.

    Reported-by: xuhaifeng
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/YcGnvDEYBwOiV0cR@hirez.programming.kicks-ass.net
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     

02 Feb, 2022

4 commits

  • commit 44585f7bc0cb01095bc2ad4258049c02bbad21ef upstream.

    When CONFIG_PROC_FS is disabled, the psi code generates the following
    warnings:

    kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=]
    1364 | static const struct proc_ops psi_cpu_proc_ops = {
    | ^~~~~~~~~~~~~~~~
    kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=]
    1355 | static const struct proc_ops psi_memory_proc_ops = {
    | ^~~~~~~~~~~~~~~~~~~
    kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=]
    1346 | static const struct proc_ops psi_io_proc_ops = {
    | ^~~~~~~~~~~~~~~

    Make definitions of these structures and related functions conditional
    on CONFIG_PROC_FS config.
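
    In other words, the proc-only pieces get wrapped roughly like this:

        #ifdef CONFIG_PROC_FS
        /*
         * psi_io_proc_ops, psi_memory_proc_ops, psi_cpu_proc_ops and the
         * show/open/write/release helpers they reference live here.
         */
        ...
        #endif  /* CONFIG_PROC_FS */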

    Link: https://lkml.kernel.org/r/20220119223940.787748-3-surenb@google.com
    Fixes: 0e94682b73bf ("psi: introduce psi monitor")
    Signed-off-by: Suren Baghdasaryan
    Reported-by: kernel test robot
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Suren Baghdasaryan
     
  • [ Upstream commit 98b0d890220d45418cfbc5157b3382e6da5a12ab ]

    Rick reported performance regressions in bugzilla because of cpu frequency
    being lower than before:
    https://bugzilla.kernel.org/show_bug.cgi?id=215045

    He bisected the problem to:
    commit 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")

    This commit forces util_sum to be synced with the new util_avg after
    removing the contribution of a task and before the next periodic sync. By
    doing so util_sum is rounded down to its lower bound and might lose up to
    LOAD_AVG_MAX-1 of accumulated contribution that has not yet been
    reflected in util_avg.

    Instead of always setting util_sum to the lower bound of util_avg, which
    can significantly lower the utilization of the root cfs_rq after propagating the
    change down into the hierarchy, we revert the change of util_sum and
    propagate the difference.

    In addition, we also check that the cfs_rq's util_sum always stays above
    the lower bound for a given util_avg, as it has been observed that a
    sched_entity's util_sum is sometimes above the cfs_rq's.
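
    The shape of the fix in the removed-load path, sketched below (helper
    and constant names follow the upstream series and may differ slightly):

        /* update_cfs_rq_load_avg(), removed-utilization path (sketch) */
        r = removed_util;
        sub_positive(&sa->util_avg, r);
        sub_positive(&sa->util_sum, r * divider);
        /*
         * Keep util_sum from drifting below the minimum implied by
         * util_avg instead of re-deriving it from scratch, which could
         * drop up to LOAD_AVG_MAX-1 of accumulated contribution.
         */
        sa->util_sum = max_t(u32, sa->util_sum,
                             sa->util_avg * PELT_MIN_DIVIDER);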

    Fixes: 1c35b07e6d39 ("sched/fair: Ensure _sum and _avg values stay consistent")
    Reported-by: Rick Yiu
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Tested-by: Sachin Sant
    Link: https://lkml.kernel.org/r/20220111134659.24961-2-vincent.guittot@linaro.org
    Signed-off-by: Sasha Levin

    Vincent Guittot
     
  • commit 809232619f5b15e31fb3563985e705454f32621f upstream.

    The membarrier command MEMBARRIER_CMD_QUERY allows querying the
    available membarrier commands. When the membarrier-rseq fence commands
    were added, a new MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ_BITMASK was
    introduced with the intent to expose them with the MEMBARRIER_CMD_QUERY
    command, but it was never added to MEMBARRIER_CMD_BITMASK.

    The membarrier-rseq fence commands are therefore not wired up with the
    query command.

    Rename MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ_BITMASK to
    MEMBARRIER_PRIVATE_EXPEDITED_RSEQ_BITMASK (the bitmask is not a command
    per-se), and change the erroneous
    MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ_BITMASK (which does not
    actually exist) to MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ.

    Wire up MEMBARRIER_PRIVATE_EXPEDITED_RSEQ_BITMASK in
    MEMBARRIER_CMD_BITMASK. Fixing this allows discovering the availability
    of the membarrier-rseq fence feature.
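
    The wiring itself is just an extra term in the query mask, roughly as in
    the sketch below (kernel-internal defines from kernel/sched/membarrier.c,
    abbreviated):

        #define MEMBARRIER_PRIVATE_EXPEDITED_RSEQ_BITMASK                \
                (MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ |                 \
                 MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_RSEQ)

        #define MEMBARRIER_CMD_BITMASK                                   \
                (MEMBARRIER_CMD_GLOBAL | MEMBARRIER_CMD_GLOBAL_EXPEDITED \
                | MEMBARRIER_CMD_REGISTER_GLOBAL_EXPEDITED               \
                | MEMBARRIER_CMD_PRIVATE_EXPEDITED                       \
                | MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED              \
                | MEMBARRIER_PRIVATE_EXPEDITED_SYNC_CORE_BITMASK         \
                | MEMBARRIER_PRIVATE_EXPEDITED_RSEQ_BITMASK)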

    Fixes: 2a36ab717e8f ("rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ")
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: # 5.10+
    Link: https://lkml.kernel.org/r/20220117203010.30129-1-mathieu.desnoyers@efficios.com
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Desnoyers
     
  • commit a06247c6804f1a7c86a2e5398a4c1f1db1471848 upstream.

    Since a write operation on a psi file replaces the old trigger with a new
    one, the lifetime of its waitqueue is totally arbitrary. Overwriting an
    existing trigger causes its waitqueue to be freed, and a pending poll()
    will stumble on trigger->event_wait, which was destroyed.
    Fix this by disallowing redefinition of an existing psi trigger. If a
    write operation is used on a file descriptor with an already existing psi
    trigger, the operation will fail with an EBUSY error.
    Also bypass the check for psi_disabled in psi_trigger_destroy(), as the
    flag can be flipped after the trigger is created, leading to a memory
    leak.
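
    The user-visible rule reduces to a small check in the write path, along
    the lines of the sketch below (simplified from psi_write(); parsing and
    surrounding error handling omitted):

        seq = file->private_data;

        /* Take seq->lock to protect seq->private from concurrent writes. */
        mutex_lock(&seq->lock);

        /* Allow only one trigger per file descriptor. */
        if (seq->private) {
                mutex_unlock(&seq->lock);
                return -EBUSY;
        }

        new = psi_trigger_create(&psi_system, buf, nbytes, res);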

    Fixes: 0e94682b73bf ("psi: introduce psi monitor")
    Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com
    Suggested-by: Linus Torvalds
    Analyzed-by: Eric Biggers
    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Eric Biggers
    Acked-by: Johannes Weiner
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20220111232309.1786347-1-surenb@google.com
    Signed-off-by: Greg Kroah-Hartman

    Suren Baghdasaryan
     

27 Jan, 2022

6 commits

  • commit dd02d4234c9a2214a81c57a16484304a1a51872a upstream.

    cpuacct has 2 different ways of accounting and showing user
    and system times.

    The first one uses cpuacct_account_field() to account times
    and the cpuacct.stat file to expose them. This one seems to work fine.

    The second one uses the cpuacct_charge() function for accounting and a
    set of cpuacct.usage* files to show times. Despite some attempts to fix
    it in the past, it still doesn't work. Sometimes while running a KVM
    guest, cpuacct_charge() accounts most of the guest time as system time.
    This doesn't match the user and system times shown in cpuacct.stat or
    /proc/<pid>/stat.

    Demonstration:
    # git clone https://github.com/aryabinin/kvmsample
    # make
    # mkdir /sys/fs/cgroup/cpuacct/test
    # echo $$ > /sys/fs/cgroup/cpuacct/test/tasks
    # ./kvmsample &
    # for i in {1..5}; do cat /sys/fs/cgroup/cpuacct/test/cpuacct.usage_sys; sleep 1; done
    1976535645
    2979839428
    3979832704
    4983603153
    5983604157

    Use cpustats accounted in cpuacct_account_field() as the source
    of user/sys times for cpuacct.usage* files. Make cpuacct_charge()
    account only the total execution time.
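
    Conceptually, the per-CPU user/sys split exposed through cpuacct.usage*
    is now derived from the same cpustat buckets that back cpuacct.stat,
    roughly as sketched below (the index grouping approximates the upstream
    change and may not match it exactly):

        /* cpuacct_cpuusage_read(), sketch of the user/sys derivation */
        u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;

        user = cpustat[CPUTIME_USER] + cpustat[CPUTIME_NICE];
        sys  = cpustat[CPUTIME_SYSTEM] + cpustat[CPUTIME_IRQ] +
               cpustat[CPUTIME_SOFTIRQ];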

    Fixes: d740037fac70 ("sched/cpuacct: Split usage accounting into user_usage and sys_usage")
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Jordan
    Acked-by: Tejun Heo
    Cc:
    Link: https://lore.kernel.org/r/20211115164607.23784-3-arbn@yandex-team.com
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     
  • commit 9731698ecb9c851f353ce2496292ff9fcea39dff upstream.

    cpuacct.stat in non-root cgroups shows user time without guest time
    included in it. This doesn't match the user time shown in the root
    cpuacct.stat and /proc/<pid>/stat. This also affects cgroup2's cpu.stat
    in the same way.

    Make account_guest_time() add user time to the cgroup's cpustat to
    fix this.
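
    The resulting accounting looks roughly like the sketch below (an
    approximation of account_guest_time() in kernel/sched/cputime.c after
    the change):

        /* Account guest CPU time to a process and its task group. */
        void account_guest_time(struct task_struct *p, u64 cputime)
        {
                u64 *cpustat = kcpustat_this_cpu->cpustat;

                /* Add guest time to process. */
                p->utime += cputime;
                account_group_user_time(p, cputime);
                p->gtime += cputime;

                /* Add guest time to cpustat and to the cgroup's buckets. */
                if (task_nice(p) > 0) {
                        task_group_account_field(p, CPUTIME_NICE, cputime);
                        cpustat[CPUTIME_GUEST_NICE] += cputime;
                } else {
                        task_group_account_field(p, CPUTIME_USER, cputime);
                        cpustat[CPUTIME_GUEST] += cputime;
                }
        }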

    Fixes: ef12fefabf94 ("cpuacct: add per-cgroup utime/stime statistics")
    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Jordan
    Acked-by: Tejun Heo
    Cc:
    Link: https://lore.kernel.org/r/20211115164607.23784-1-arbn@yandex-team.com
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     
  • [ Upstream commit cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf ]

    We've noticed cases where tasks in a cgroup are stalled on memory but
    there is little memory FULL pressure since tasks stay on the runqueue
    in reclaim.

    A simple example involves a single threaded program that keeps leaking
    and touching large amounts of memory. It runs in a cgroup with swap
    enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though
    there is significant CPU pressure and memory SOME, there is barely any
    memory FULL since the task enters reclaim and stays on the runqueue.
    However, this memory-bound task is effectively stalled on memory and
    we expect memory FULL to match memory SOME in this scenario.

    The code is confused about memstall && running, thinking there is a
    stalled task and a productive task when there's only one task: a
    reclaimer that's counted as both. To fix this, we redefine the
    condition for PSI_MEM_FULL to check that all running tasks are in an
    active memstall instead of checking that there are no running tasks.

    case PSI_MEM_FULL:
    -        return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
    +        return unlikely(tasks[NR_MEMSTALL] &&
    +                        tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);

    This will capture reclaimers. It will also capture tasks that called
    psi_memstall_enter() and are about to sleep, but this should be
    negligible noise.

    Signed-off-by: Brian Chen
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Johannes Weiner
    Link: https://lore.kernel.org/r/20211110213312.310243-1-brianchen118@gmail.com
    Signed-off-by: Sasha Levin

    Brian Chen
     
  • [ Upstream commit 9b58e976b3b391c0cf02e038d53dd0478ed3013c ]

    When rt_runtime is modified from -1 to a valid control value, it may
    cause the task to be throttled all the time. A sequence of operations
    like the following will trigger the bug:

    1. echo -1 > /proc/sys/kernel/sched_rt_runtime_us
    2. Run a FIFO task named A that executes while(1)
    3. echo 950000 > /proc/sys/kernel/sched_rt_runtime_us

    When rt_runtime is -1, the rt period timer will not be activated when task
    A is enqueued. The task will then be throttled after setting rt_runtime to
    950,000, and it stays throttled forever because the rt period timer is
    never started.

    Fixes: d0b27fa77854 ("sched: rt-group: synchonised bandwidth period")
    Reported-by: Hulk Robot
    Signed-off-by: Li Hua
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20211203033618.11895-1-hucool.lihua@huawei.com
    Signed-off-by: Sasha Levin

    Li Hua
     
  • [ Upstream commit 014ba44e8184e1acf93e0cbb7089ee847802f8f0 ]

    select_idle_sibling() has a special case for tasks woken up by a per-CPU
    kthread where the selected CPU is the previous one. For asymmetric CPU
    capacity systems, the assumption was that the wakee couldn't have a
    bigger utilization during task placement than it used to have during the
    last activation. That did not consider uclamp.min, which can change
    completely between two task activations and therefore mandates the
    fitness criterion asym_fits_capacity(), even for the exit path described
    above.

    Fixes: b4c9c9f15649 ("sched/fair: Prefer prev cpu in asymmetric wakeup path")
    Signed-off-by: Vincent Donnefort
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Reviewed-by: Dietmar Eggemann
    Link: https://lkml.kernel.org/r/20211129173115.4006346-1-vincent.donnefort@arm.com
    Signed-off-by: Sasha Levin

    Vincent Donnefort
     
  • [ Upstream commit 8b4e74ccb582797f6f0b0a50372ebd9fd2372a27 ]

    select_idle_sibling() has a special case for tasks woken up by a per-CPU
    kthread, where the selected CPU is the previous one. However, the current
    condition for this exit path is incomplete. A task can wake up from an
    interrupt context (e.g. an hrtimer) while a per-CPU kthread is running.
    Such a scenario would spuriously trigger the special case described above.
    Also, a recent change made the idle task behave like a regular per-CPU
    kthread, which makes that situation more likely to happen
    (is_per_cpu_kthread(swapper) is now true).

    Checking for task context makes sure select_idle_sibling() will not
    interpret a wake up from any other context as a wake up by a per-CPU
    kthread.
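
    Together with the asym_fits_capacity() check from the previous entry,
    the per-CPU kthread fast path in select_idle_sibling() ends up looking
    roughly like this (a sketch; condition order approximate):

        /*
         * Allow a per-CPU kthread to stack with the wakee if the kthread's
         * CPU and the task's previous CPU are the same, but only for
         * wakeups coming from task context.
         */
        if (is_per_cpu_kthread(current) &&
            in_task() &&
            prev == smp_processor_id() &&
            this_rq()->nr_running <= 1 &&
            asym_fits_capacity(task_util, prev))
                return prev;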

    Fixes: 52262ee567ad ("sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression")
    Signed-off-by: Vincent Donnefort
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Reviewed-by: Valentin Schneider
    Link: https://lore.kernel.org/r/20211201143450.479472-1-vincent.donnefort@arm.com
    Signed-off-by: Sasha Levin

    Vincent Donnefort
     

14 Dec, 2021

1 commit

  • commit 42288cb44c4b5fff7653bc392b583a2b8bd6a8c0 upstream.

    Several ->poll() implementations are special in that they use a
    waitqueue whose lifetime is the current task, rather than the struct
    file as is normally the case. This is okay for blocking polls, since a
    blocking poll occurs within one task; however, non-blocking polls
    require another solution. This solution is for the queue to be cleared
    before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.

    However, that has a bug: wake_up_poll() calls __wake_up() with
    nr_exclusive=1. Therefore, if there are multiple "exclusive" waiters,
    and the wakeup function for the first one returns a positive value, only
    that one will be called. That's *not* what's needed for POLLFREE;
    POLLFREE is special in that it really needs to wake up everyone.

    Considering the three non-blocking poll systems:

    - io_uring poll doesn't handle POLLFREE at all, so it is broken anyway.

    - aio poll is unaffected, since it doesn't support exclusive waits.
    However, that's fragile, as someone could add this feature later.

    - epoll doesn't appear to be broken by this, since its wakeup function
    returns 0 when it sees POLLFREE. But this is fragile.

    Although there is a workaround (see epoll), it's better to define a
    function which always sends POLLFREE to all waiters. Add such a
    function. Also make it verify that the queue really becomes empty after
    all waiters have been woken up.
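
    The new helper, roughly (a sketch of the wait.h/wait.c addition, with
    comments paraphrased):

        /* kernel/sched/wait.c */
        void __wake_up_pollfree(struct wait_queue_head *wq_head)
        {
                /* nr_exclusive == 0 wakes *all* waiters, unlike wake_up_poll() */
                __wake_up(wq_head, TASK_NORMAL, 0,
                          poll_to_key(EPOLLHUP | POLLFREE));
                /* POLLFREE must have caused every waiter to remove itself. */
                WARN_ON_ONCE(waitqueue_active(wq_head));
        }

        /* include/linux/wait.h */
        static inline void wake_up_pollfree(struct wait_queue_head *wq_head)
        {
                if (waitqueue_active(wq_head))
                        __wake_up_pollfree(wq_head);
        }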

    Reported-by: Linus Torvalds
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org
    Signed-off-by: Eric Biggers
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     

08 Dec, 2021

2 commits

  • [ Upstream commit 315c4f884800c45cb6bd8c90422fad554a8b9588 ]

    Commit d81ae8aac85c ("sched/uclamp: Fix initialization of struct
    uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to
    match the woken up task's uclamp_max when the rq is idle.

    The code was relying on rq->uclamp_max being initialized to zero, so on
    the first enqueue

    static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
                                        enum uclamp_id clamp_id)
    {
            ...

            if (uc_se->value > READ_ONCE(uc_rq->value))
                    WRITE_ONCE(uc_rq->value, uc_se->value);
    }

    was actually resetting it. But since commit d81ae8aac85c changed the
    default to 1024, this no longer works. And since rq->uclamp_flags is
    also initialized to 0, neither the above code path nor uclamp_idle_reset()
    updates rq->uclamp_max on the first wake up from idle.

    This is only visible from first wake up(s) until the first dequeue to
    idle after enabling the static key. And it only matters if the
    uclamp_max of this task is < 1024 since only then its uclamp_max will be
    effectively ignored.

    Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to
    ensure uclamp_idle_reset() is called, which will then update the rq's
    uclamp_max value as expected.
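
    The fix itself is a one-line change to the per-rq init, roughly as in
    this sketch of init_uclamp_rq():

        static void __init init_uclamp_rq(struct rq *rq)
        {
                enum uclamp_id clamp_id;
                struct uclamp_rq *uc_rq = rq->uclamp;

                for_each_clamp_id(clamp_id) {
                        uc_rq[clamp_id] = (struct uclamp_rq) {
                                .value = uclamp_none(clamp_id)
                        };
                }

                /* was: rq->uclamp_flags = 0; */
                rq->uclamp_flags = UCLAMP_FLAG_IDLE;
        }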

    Fixes: d81ae8aac85c ("sched/uclamp: Fix initialization of struct uclamp_rq")
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Tested-by: Dietmar Eggemann
    Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com
    Signed-off-by: Sasha Levin

    Qais Yousef
     
  • [ Upstream commit 9ed20bafc85806ca6c97c9128cec46c3ef80ae86 ]

    __setup() callbacks expect 1 for success and 0 for failure. Correct the
    usage here to reflect that.
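
    For example, the preempt= handler ends up following the convention
    roughly as sketched below (an approximation of setup_preempt_mode() in
    kernel/sched/core.c, not the literal diff):

        static int __init setup_preempt_mode(char *str)
        {
                int mode = sched_dynamic_mode(str);

                if (mode < 0) {
                        pr_warn("Dynamic Preempt: unsupported mode: %s\n", str);
                        return 0;   /* 0: unrecognized, keep the default */
                }

                sched_dynamic_update(mode);
                return 1;           /* 1: option consumed successfully */
        }
        __setup("preempt=", setup_preempt_mode);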

    Fixes: 826bfeb37bb4 ("preempt/dynamic: Support dynamic preempt with preempt= boot option")
    Reported-by: Mark Rutland
    Signed-off-by: Andrew Halaney
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20211203233203.133581-1-ahalaney@redhat.com
    Signed-off-by: Sasha Levin

    Andrew Halaney
     

01 Dec, 2021

1 commit

  • [ Upstream commit dce1ca0525bfdc8a69a9343bc714fbc19a2f04b3 ]

    To hot unplug a CPU, the idle task on that CPU calls a few layers of C
    code before finally leaving the kernel. When KASAN is in use, poisoned
    shadow is left around for each of the active stack frames, and when
    shadow call stacks (SCS) are in use the task's saved SCS SP is left
    pointing at an arbitrary point within the task's shadow call stack.

    When a CPU is offlined then onlined back into the kernel, this stale
    state can adversely affect execution. Stale KASAN shadow can alias new
    stackframes and result in bogus KASAN warnings. A stale SCS SP is
    effectively a memory leak, and prevents a portion of the shadow call
    stack being used. Across a number of hotplug cycles the idle task's
    entire shadow call stack can become unusable.

    We previously fixed the KASAN issue in commit:

    e1b77c92981a5222 ("sched/kasan: remove stale KASAN poison after hotplug")

    ... by removing any stale KASAN stack poison immediately prior to
    onlining a CPU.

    Subsequently in commit:

    f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled")

    ... the refactoring left the KASAN and SCS cleanup in one-time idle
    thread initialization code rather than something invoked prior to each
    CPU being onlined, breaking both as above.

    We fixed SCS (but not KASAN) in commit:

    63acd42c0d4942f7 ("sched/scs: Reset the shadow stack when idle_task_exit")

    ... but as this runs in the context of the idle task being offlined it's
    potentially fragile.

    To fix these consistently and more robustly, reset the SCS SP and KASAN
    shadow of a CPU's idle task immediately before we online that CPU in
    bringup_cpu(). This ensures the idle task always has a consistent state
    when it is running, and removes the need to do so when exiting an idle
    task.

    Whenever any thread is created, dup_task_struct() will give the task a
    stack which is free of KASAN shadow, and initialize the task's SCS SP,
    so there's no need to specially initialize either for the idle thread in
    init_idle(), as this was only necessary to handle hotplug cycles.
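
    The reset now happens immediately before onlining, roughly as in the
    sketch below (an abbreviated view of bringup_cpu() in kernel/cpu.c):

        static int bringup_cpu(unsigned int cpu)
        {
                struct task_struct *idle = idle_thread_get(cpu);

                /*
                 * Reset stale stack state from the last time this CPU was
                 * online, before the idle task runs again.
                 */
                scs_task_reset(idle);
                kasan_unpoison_task_stack(idle);

                /* ... arch bringup via __cpu_up(cpu, idle) follows ... */
        }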

    I've tested this on arm64 with:

    * gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK
    * clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK

    ... offlining and onlining CPUS with:

    | while true; do
    |   for C in /sys/devices/system/cpu/cpu*/online; do
    |     echo 0 > $C;
    |     echo 1 > $C;
    |   done
    | done

    Fixes: f1a0a376ca0c4ef1 ("sched/core: Initialize the idle task with preemption disabled")
    Reported-by: Qian Cai
    Signed-off-by: Mark Rutland
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Tested-by: Qian Cai
    Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/
    Signed-off-by: Sasha Levin

    Mark Rutland