05 Mar, 2020

4 commits

  • commit 60588bfa223ff675b95f866249f90616613fbe31 upstream.

    select_idle_cpu() will scan the LLC domain for idle CPUs, which is
    always expensive. So the following commit:

    1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")

    introduces a way to limit how many CPUs we scan.

    But it consumes some of the 'nr' budget on CPUs that are not allowed
    for the task, wasting those attempts. The function can then return
    nr_cpumask_bits without having found a CPU the task is allowed to
    run on.

    The cpumask may be too big to put on the stack, so, as in
    select_idle_core(), use the per_cpu_ptr 'select_idle_mask' to prevent
    a stack overflow.
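
    As an illustration of the idea (a standalone toy sketch using plain
    bitmasks rather than struct cpumask, with made-up helper names):
    intersect the LLC span with the task's allowed CPUs first, so none of
    the 'nr' attempts is spent on a CPU the task cannot use.

        #include <stdint.h>
        #include <stdio.h>

        /* Toy masks: bit i set => CPU i is in the set. */
        static int scan_for_idle(uint64_t llc_span, uint64_t allowed,
                                 uint64_t idle, int nr)
        {
                /* Only ever look at CPUs the task may run on. */
                uint64_t candidates = llc_span & allowed;
                int cpu;

                for (cpu = 0; cpu < 64 && nr > 0; cpu++) {
                        if (!(candidates & (1ULL << cpu)))
                                continue;
                        nr--;                  /* budget spent on a valid CPU */
                        if (idle & (1ULL << cpu))
                                return cpu;
                }
                return -1;                     /* like returning nr_cpumask_bits */
        }

        int main(void)
        {
                /* LLC = CPUs 0-7, task allowed only on CPUs 6-7, CPU 7 idle.
                 * Without the intersection, a budget of 4 would be burnt on
                 * CPUs 0-3 and the scan would fail. */
                printf("picked CPU %d\n",
                       scan_for_idle(0xffULL, 0xc0ULL, 0x80ULL, 4));
                return 0;
        }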

    Fixes: 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")
    Signed-off-by: Cheng Jian
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Srikar Dronamraju
    Reviewed-by: Vincent Guittot
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com
    Signed-off-by: Greg Kroah-Hartman

    Cheng Jian
     
  • [ Upstream commit 2a4b03ffc69f2dedc6388e9a6438b5f4c133a40d ]

    When a running task is moved to a throttled task group and there is no
    other task enqueued on the CPU, the task can keep running using 100% of
    the CPU, regardless of the bandwidth allocated to the group and even
    though its cfs_rq is throttled. Furthermore, the group entity of the
    cfs_rq and its parents are not enqueued but only set as curr on their
    respective cfs_rqs.

    We have the following sequence:

    sched_move_task
      - dequeue_task:       dequeue task and group_entities.
      - put_prev_task:      put task and group entities.
      - sched_change_group: move task to new group.
      - enqueue_task:       enqueue only task but not group entities
                            because cfs_rq is throttled.
      - set_next_task:      set task and group_entities as current
                            sched_entity of their cfs_rq.

    Another impact is that the root cfs_rq runnable_load_avg stays null
    because the group_entities are not enqueued. This situation will stay
    the same until an "external" event triggers a reschedule. Let's
    trigger one immediately instead.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Ben Segall
    Link: https://lkml.kernel.org/r/1579011236-31256-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Sasha Levin

    Vincent Guittot
     
  • [ Upstream commit ebc0f83c78a2d26384401ecf2d2fa48063c0ee27 ]

    The way loadavg is tracked during nohz only pays attention to the load
    upon entering nohz. This can be particularly noticeable if full nohz is
    entered while non-idle, and then the cpu goes idle and stays that way for
    a long time.

    Use the remote tick to ensure that full nohz cpus report their deltas
    within a reasonable time.

    [ swood: Added changelog and removed recheck of stopped tick. ]

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Scott Wood
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
    Signed-off-by: Sasha Levin

    Peter Zijlstra (Intel)
     
  • [ Upstream commit 488603b815a7514c7009e6fc339d74ed4a30f343 ]

    This will be used in the next patch to get a loadavg update from
    nohz cpus. The delta check is skipped because idle_sched_class
    doesn't update se.exec_start.

    Signed-off-by: Scott Wood
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/1578736419-14628-2-git-send-email-swood@redhat.com
    Signed-off-by: Sasha Levin

    Scott Wood
     

29 Feb, 2020

1 commit

  • commit 6fcca0fa48118e6d63733eb4644c6cd880c15b8f upstream.

    Issuing write() with the count parameter set to 0 on any file under
    /proc/pressure/ will cause an OOB write because of the access to
    buf[buf_size-1] when NUL-termination is performed. Fix this by checking
    that buf_size is non-zero.
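
    A standalone sketch of the buffer-termination pattern involved
    (simplified, with a hypothetical parse_trigger() helper rather than the
    actual psi code): with a zero byte count, buf[buf_size - 1] indexes one
    byte before the buffer, so the count must be rejected first.

        #include <errno.h>
        #include <stdio.h>
        #include <string.h>

        static int parse_trigger(const char *user_buf, size_t nbytes)
        {
                char buf[32];
                size_t buf_size;

                if (!nbytes)                    /* the missing check */
                        return -EINVAL;

                buf_size = nbytes < sizeof(buf) ? nbytes : sizeof(buf);
                memcpy(buf, user_buf, buf_size);
                buf[buf_size - 1] = '\0';       /* OOB write if buf_size == 0 */

                printf("parsed: %s\n", buf);
                return 0;
        }

        int main(void)
        {
                parse_trigger("some 150 1000000", 17);
                return parse_trigger("", 0) == -EINVAL ? 0 : 1;
        }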

    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Johannes Weiner
    Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com
    Signed-off-by: Greg Kroah-Hartman

    Suren Baghdasaryan
     

24 Feb, 2020

2 commits

  • [ Upstream commit ccf74128d66ce937876184ad55db2e0276af08d3 ]

    topology.c::get_group() relies on the assumption that non-NUMA domains do
    not partially overlap. Zeng Tao pointed out in [1] that such topology
    descriptions, while completely bogus, can end up being exposed to the
    scheduler.

    In his example (8 CPUs, 2-node system), we end up with:
    MC span for CPU3 == 3-7
    MC span for CPU4 == 4-7

    The first pass through get_group(3, sdd@MC) will result in the following
    sched_group list:

      3 -> 4 -> 5 -> 6 -> 7
      ^                   |
      `-------------------'

    And a later pass through get_group(4, sdd@MC) will "corrupt" that to:

      3 -> 4 -> 5 -> 6 -> 7
           ^              |
           `--------------'

    which will completely break things like 'while (sg != sd->groups)' when
    using CPU3's base sched_domain.

    There already are some architecture-specific checks in place such as
    x86/kernel/smpboot.c::topology_sane(), but this is something we can
    detect in the core scheduler, so it seems worthwhile to do so.

    Warn and abort the construction of the sched domains if such a broken
    topology description is detected. Note that this is somewhat
    expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
    gated under SCHED_DEBUG if deemed necessary.
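
    A standalone sketch of such a pairwise check (toy 64-bit masks,
    hypothetical helper name mirroring the idea): within one non-NUMA
    topology level, any two CPUs' spans must either be identical or not
    intersect at all.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* span[i] = mask of CPUs sharing the level with CPU i (bit i => CPU i). */
        static bool level_span_sane(const uint64_t *span, int nr_cpus)
        {
                int i, j;

                for (i = 0; i < nr_cpus; i++)
                        for (j = i + 1; j < nr_cpus; j++)
                                if (span[i] != span[j] && (span[i] & span[j]))
                                        return false;   /* partial overlap */
                return true;
        }

        int main(void)
        {
                /* Zeng Tao's example: MC span of CPU3 is 3-7, of CPUs 4-7 is 4-7. */
                uint64_t mc[8] = { 0x07, 0x07, 0x07, 0xf8,
                                   0xf0, 0xf0, 0xf0, 0xf0 };

                printf("MC level sane: %d\n", level_span_sane(mc, 8)); /* 0 */
                return 0;
        }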

    Testing
    =======

    Dietmar managed to reproduce this using the following qemu incantation:

      $ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
        -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' \
        -smp cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt \
        -numa node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1

    alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be
    needed if '-smp cores=X, sockets=Y' would work with qemu):

        if (cpuid_topo->package_id != cpu_topo->package_id)
                continue;

    +   if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
    +           continue;
    +
        cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
        cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
    Signed-off-by: Sasha Levin

    Valentin Schneider
     
  • [ Upstream commit dcd6dffb0a75741471297724640733fa4e958d72 ]

    rq::uclamp is an array of struct uclamp_rq, make sure we clear the
    whole thing.
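
    A standalone sketch of the pitfall (hypothetical struct layout): using
    the size of a single element clears only uclamp[UCLAMP_MIN] and leaves
    the other clamp ids with stale data.

        #include <stdio.h>
        #include <string.h>

        enum { UCLAMP_MIN, UCLAMP_MAX, UCLAMP_CNT };

        struct uclamp_rq { unsigned int value; };

        struct rq_like {
                struct uclamp_rq uclamp[UCLAMP_CNT];
        };

        int main(void)
        {
                struct rq_like rq = { .uclamp = { { 1 }, { 1024 } } };

                /* Buggy: only the first array element is cleared. */
                memset(&rq.uclamp, 0, sizeof(struct uclamp_rq));
                printf("partial clear, max = %u\n", rq.uclamp[UCLAMP_MAX].value);

                /* Fixed: clear the whole array. */
                memset(&rq.uclamp, 0, sizeof(rq.uclamp));
                printf("full clear,    max = %u\n", rq.uclamp[UCLAMP_MAX].value);
                return 0;
        }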

    Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
    Signed-off-by: Li Guanglei
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Qais Yousef
    Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
    Signed-off-by: Sasha Levin

    Li Guanglei
     

20 Feb, 2020

1 commit

  • commit b562d140649966d4daedd0483a8fe59ad3bb465a upstream.

    The check to ensure that the new written value into cpu.uclamp.{min,max}
    is within range, [0:100], wasn't working because of the signed
    comparison

      7301         if (req.percent > UCLAMP_PERCENT_SCALE) {
      7302                 req.ret = -ERANGE;
      7303                 return req;
      7304         }

    # echo -1 > cpu.uclamp.min
    # cat cpu.uclamp.min
    42949671.96

    Cast req.percent into u64 to force the comparison to be unsigned and
    work as intended in capacity_from_percent().

    # echo -1 > cpu.uclamp.min
    sh: write error: Numerical result out of range
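
    A standalone illustration of the signed-comparison trap: with a signed
    type, -1 is never greater than the scale limit, but the same value
    interpreted as unsigned is (UCLAMP_PERCENT_SCALE assumed to be 10000,
    i.e. 100.00% with two decimals).

        #include <stdio.h>

        #define UCLAMP_PERCENT_SCALE 10000

        int main(void)
        {
                long long percent = -1;    /* what "echo -1" ends up as */

                /* Signed comparison: -1 > 10000 is false, so -1 slips through. */
                printf("signed   rejects: %d\n",
                       percent > UCLAMP_PERCENT_SCALE);

                /* With a cast to u64 the value becomes huge and is rejected. */
                printf("unsigned rejects: %d\n",
                       (unsigned long long)percent > UCLAMP_PERCENT_SCALE);
                return 0;
        }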

    Fixes: 2480c093130f ("sched/uclamp: Extend CPU's cgroup controller")
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Qais Yousef
     

15 Feb, 2020

1 commit

  • commit 7226017ad37a888915628e59a84a2d1e57b40707 upstream.

    When a new cgroup is created, the effective uclamp value wasn't updated
    with a call to cpu_util_update_eff(), which looks at the hierarchy and
    updates to the most restrictive values.

    Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
    becomes online.

    Without this change, the newly created cgroup uses the default
    root_task_group uclamp values, which is 1024 for both uclamp_{min,max},
    which will cause the rq to be clamped to max, hence causing the
    system to run at max frequency.

    The problem was observed on Ubuntu server and was reproduced on Debian
    and Buildroot rootfs.

    By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
    and add all tasks to it - which creates enough noise to keep the rq
    uclamp value at max most of the time. Imitating this behavior makes the
    problem visible in Buildroot too which otherwise looks fine since it's a
    minimal userspace.

    Fixes: 0b60ba2dd342 ("sched/uclamp: Propagate parent clamps")
    Reported-by: Doug Smythies
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Doug Smythies
    Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
    Signed-off-by: Greg Kroah-Hartman

    Qais Yousef
     

26 Jan, 2020

2 commits

  • [ Upstream commit bef69dd87828ef5d8ecdab8d857cd3a33cf98675 ]

    update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays,
    which might be inefficient when the cpufreq driver has rate limitation.

    When a task is attached on a CPU, we have this call path:

      update_load_avg()
        update_cfs_rq_load_avg()
          cfs_rq_util_change --> trigger frequency update
        attach_entity_load_avg()
          cfs_rq_util_change --> trigger frequency update

    The 1st frequency update will not take into account the utilization of the
    newly attached task and the 2nd one might be discarded because of rate
    limitation of the cpufreq driver.

    update_cfs_rq_load_avg() is only called by update_blocked_averages()
    and update_load_avg() so we can move the call to
    cfs_rq_util_change/cpufreq_update_util() into these two functions.

    It's also interesting to note that update_load_avg() already calls
    cfs_rq_util_change() directly for the !SMP case.

    This change will also ensure that cpufreq_update_util() is called even
    when there is no more CFS rq in the leaf_cfs_rq_list to update, but only
    IRQ, RT or DL PELT signals.

    [ mingo: Minor updates. ]

    Reported-by: Doug Smythies
    Tested-by: Doug Smythies
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Acked-by: Rafael J. Wysocki
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: juri.lelli@redhat.com
    Cc: linux-pm@vger.kernel.org
    Cc: mgorman@suse.de
    Cc: rostedt@goodmis.org
    Cc: sargun@sargun.me
    Cc: srinivas.pandruvada@linux.intel.com
    Cc: tj@kernel.org
    Cc: xiexiuqi@huawei.com
    Cc: xiezhipeng1@huawei.com
    Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
    Link: https://lkml.kernel.org/r/1574083279-799-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Vincent Guittot
     
  • commit a0e813f26ebcb25c0b5e504498fbd796cca1a4ba upstream.

    It turns out there really is something special to the first
    set_next_task() invocation. Specifically, the 'change' pattern really
    should not cause balance callbacks.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Cc: dietmar.eggemann@arm.com
    Cc: juri.lelli@redhat.com
    Cc: ktkhai@virtuozzo.com
    Cc: mgorman@suse.de
    Cc: qais.yousef@arm.com
    Cc: qperret@google.com
    Cc: rostedt@goodmis.org
    Cc: valentin.schneider@arm.com
    Cc: vincent.guittot@linaro.org
    Fixes: f95d4eaee6d0 ("sched/{rt,deadline}: Fix set_next_task vs pick_next_task")
    Link: https://lkml.kernel.org/r/20191108131909.775434698@infradead.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

12 Jan, 2020

2 commits

  • [ Upstream commit c3466952ca1514158d7c16c9cfc48c27d5c5dc0f ]

    The psi window size is a u64 and can be up to 10 seconds right now,
    which exceeds the lower 32 bits of the variable. We currently use
    div_u64 for it, which is meant only for 32-bit divisors. The result is
    garbage pressure sampling values and even potential div0 crashes.

    Use div64_u64.
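
    A standalone illustration of why a 32-bit divisor helper breaks here:
    10 seconds in nanoseconds does not fit in 32 bits, so truncating the
    divisor produces a completely different quotient.

        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                uint64_t stall  = 3000000000ULL;         /* 3s of stall, in ns */
                uint64_t window = 10ULL * 1000000000ULL; /* 10s psi window, in ns */

                /* div_u64()-style: divisor silently truncated to 32 bits. */
                uint32_t truncated = (uint32_t)window;   /* 1410065408, not 10^10 */
                printf("32-bit divisor: %" PRIu64 "\n", stall / truncated);

                /* div64_u64()-style: full 64-bit divisor, the intended result. */
                printf("64-bit divisor: %" PRIu64 "\n", stall / window);
                return 0;
        }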

    Signed-off-by: Johannes Weiner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Suren Baghdasaryan
    Cc: Jingfeng Xie
    Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
    Signed-off-by: Sasha Levin

    Johannes Weiner
     
  • [ Upstream commit 3dfbe25c27eab7c90c8a7e97b4c354a9d24dd985 ]

    Jingfeng reports rare div0 crashes in psi on systems with some uptime:

    [58914.066423] divide error: 0000 [#1] SMP
    [58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
    [58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
    [58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
    [58914.102728] Workqueue: events psi_update_work
    [58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
    [58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
    [58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
    [58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
    [58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
    [58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
    [58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
    [58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
    [58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
    [58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
    [58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [58914.200360] PKRU: 55555554
    [58914.203221] Stack:
    [58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
    [58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
    [58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    [58914.228734] Call Trace:
    [58914.231337] [] psi_update_work+0x22/0x60
    [58914.237067] [] process_one_work+0x189/0x420
    [58914.243063] [] worker_thread+0x4e/0x4b0
    [58914.248701] [] ? process_one_work+0x420/0x420
    [58914.254869] [] kthread+0xe6/0x100
    [58914.259994] [] ? kthread_park+0x60/0x60
    [58914.265640] [] ret_from_fork+0x39/0x50
    [58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc f7 f1 48 8b 13 48 89 c7 48 c1
    [58914.279691] RIP [] psi_update_stats+0x1c1/0x330

    The crashing instruction is trying to divide the observed stall time
    by the sampling period. The period, stored in R8, is not 0, but we are
    dividing by the lower 32 bits only, which are all 0 in this instance.

    We could switch to a 64-bit division, but the period shouldn't be that
    big in the first place. It's the time between the last update and the
    next scheduled one, and so should always be around 2s and comfortably
    fit into 32 bits.

    The bug is in the initialization of new cgroups: we schedule the first
    sampling event in a cgroup as an offset of sched_clock(), but fail to
    initialize the last_update timestamp, and it defaults to 0. That
    results in a bogusly large sampling period the first time we run the
    sampling code, and consequently we underreport pressure for the first
    2s of a cgroup's life. But worse, if sched_clock() is sufficiently
    advanced on the system, and the user gets unlucky, the period's lower
    32 bits can all be 0 and the sampling division will crash.

    Fix this by initializing the last update timestamp to the creation
    time of the cgroup, thus correctly marking the start of the first
    pressure sampling period in a new cgroup.
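
    A standalone illustration of the failure mode (simplified, with
    hypothetical variable names): with the timestamp left at 0, the first
    period equals sched_clock() itself, and if its low 32 bits happen to be
    0 a 32-bit division faults.

        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>

        int main(void)
        {
                /* ~58914s of uptime where the low 32 bits of sched_clock()
                 * are exactly 0 (matches R8/R12 in the oops above). */
                uint64_t now         = 0x3595ULL << 32;
                uint64_t last_update = 0;            /* missing initialization */

                uint64_t period  = now - last_update;
                uint32_t divisor = (uint32_t)period; /* what a 32-bit div uses */

                printf("period = %" PRIu64 " ns, low 32 bits = %" PRIu32 "\n",
                       period, divisor);
                if (divisor == 0)
                        printf("a 32-bit division here is a divide-by-zero\n");

                /* The fix: last_update = now at cgroup creation time. */
                return 0;
        }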

    Reported-by: Jingfeng Xie
    Signed-off-by: Johannes Weiner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Suren Baghdasaryan
    Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
    Signed-off-by: Sasha Levin

    Johannes Weiner
     

31 Dec, 2019

2 commits

  • commit 85572c2c4a45a541e880e087b5b17a48198b2416 upstream.

    The scheduler code calling cpufreq_update_util() may run during CPU
    offline on the target CPU after the IRQ work lists have been flushed
    for it, so the target CPU should be prevented from running code that
    may queue up an IRQ work item on it at that point.

    Unfortunately, that may not be the case if dvfs_possible_from_any_cpu
    is set for at least one cpufreq policy in the system, because that
    allows the CPU going offline to run the utilization update callback
    of the cpufreq governor on behalf of another (online) CPU in some
    cases.

    If that happens, the cpufreq governor callback may queue up an IRQ
    work on the CPU running it, which is going offline, and the IRQ work
    may not be flushed after that point. Moreover, that IRQ work cannot
    be flushed until the "offlining" CPU goes back online, so if any
    other CPU calls irq_work_sync() to wait for the completion of that
    IRQ work, it will have to wait until the "offlining" CPU is back
    online and that may not happen forever. In particular, a system-wide
    deadlock may occur during CPU online as a result of that.

    The failing scenario is as follows. CPU0 is the boot CPU, so it
    creates a cpufreq policy and becomes the "leader" of it
    (policy->cpu). It cannot go offline, because it is the boot CPU.
    Next, other CPUs join the cpufreq policy as they go online and they
    leave it when they go offline. The last CPU to go offline, say CPU3,
    may queue up an IRQ work while running the governor callback on
    behalf of CPU0 after leaving the cpufreq policy because of the
    dvfs_possible_from_any_cpu effect described above. Then, CPU0 is
    the only online CPU in the system and the stale IRQ work is still
    queued on CPU3. When, say, CPU1 goes back online, it will run
    irq_work_sync() to wait for that IRQ work to complete and so it
    will wait for CPU3 to go back online (which may never happen even
    in principle), but (worse yet) CPU0 is waiting for CPU1 at that
    point too and a system-wide deadlock occurs.

    To address this problem notice that CPUs which cannot run cpufreq
    utilization update code for themselves (for example, because they
    have left the cpufreq policies that they belonged to), should also
    be prevented from running that code on behalf of the other CPUs that
    belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so
    in that case the cpufreq_update_util_data pointer of the CPU running
    the code must not be NULL as well as for the CPU which is the target
    of the cpufreq utilization update in progress.

    Accordingly, change cpufreq_this_cpu_can_update() into a regular
    function in kernel/sched/cpufreq.c (instead of a static inline in a
    header file) and make it check the cpufreq_update_util_data pointer
    of the local CPU if dvfs_possible_from_any_cpu is set for the target
    cpufreq policy.

    Also update the schedutil governor to do the
    cpufreq_this_cpu_can_update() check in the non-fast-switch
    case too to avoid the stale IRQ work issues.
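
    A sketch of the resulting check, consistent with the description above
    (not necessarily the exact upstream code): the local CPU may run the
    update either if it belongs to the policy, or if remote updates are
    allowed and the local CPU still has its own update hook installed
    (i.e. it is not in the middle of going offline).

        bool cpufreq_this_cpu_can_update(struct cpufreq_policy *policy)
        {
                return cpumask_test_cpu(smp_processor_id(), policy->cpus) ||
                       (policy->dvfs_possible_from_any_cpu &&
                        rcu_dereference_sched(*this_cpu_ptr(&cpufreq_update_util_data)));
        }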

    Fixes: 99d14d0e16fa ("cpufreq: Process remote callbacks from any CPU if the platform permits")
    Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/
    Reported-by: Anson Huang
    Tested-by: Anson Huang
    Cc: 4.14+ # 4.14+
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar
    Tested-by: Peng Fan (i.MX8QXP-MEK)
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     
  • [ Upstream commit 7763baace1b738d65efa46d68326c9406311c6bf ]

    Some uclamp helpers had their return type changed from 'unsigned int' to
    'enum uclamp_id' by commit

    0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")

    but it happens that some do return a value in the [0, SCHED_CAPACITY_SCALE]
    range, which should really be unsigned int. The affected helpers are
    uclamp_none(), uclamp_rq_max_value() and uclamp_eff_value(). Fix those up.

    Note that this doesn't lead to any obj diff using a relatively recent
    aarch64 compiler (8.3-2019.03). The current code of e.g. uclamp_eff_value()
    properly returns an 11 bit value (bits_per(1024)) and doesn't seem to do
    anything funny. I'm still marking this as fixing the above commit to be on
    the safe side.

    Signed-off-by: Valentin Schneider
    Reviewed-by: Qais Yousef
    Acked-by: Vincent Guittot
    Cc: Dietmar.Eggemann@arm.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: patrick.bellasi@matbug.net
    Cc: qperret@google.com
    Cc: surenb@google.com
    Cc: tj@kernel.org
    Fixes: 0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
    Link: https://lkml.kernel.org/r/20191115103908.27610-1-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Valentin Schneider
     

15 Nov, 2019

1 commit

  • uclamp_update_active() should perform the update when
    p->uclamp[clamp_id].active is true. But when the logic was inverted in
    [1], the if condition wasn't inverted correctly too.

    [1] https://lore.kernel.org/lkml/20190902073836.GO2369@hirez.programming.kicks-ass.net/

    Reported-by: Suren Baghdasaryan
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Vincent Guittot
    Cc: Ben Segall
    Cc: Dietmar Eggemann
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Patrick Bellasi
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Fixes: babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
    Link: https://lkml.kernel.org/r/20191114211052.15116-1-qais.yousef@arm.com
    Signed-off-by: Ingo Molnar

    Qais Yousef
     

13 Nov, 2019

2 commits

  • update_cfs_rq_load_avg() can call cpufreq_update_util() to trigger an
    update of the frequency. Make sure that RT, DL and IRQ PELT signals have
    been updated before calling cpufreq.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dietmar.eggemann@arm.com
    Cc: dsmythies@telus.net
    Cc: juri.lelli@redhat.com
    Cc: mgorman@suse.de
    Cc: rostedt@goodmis.org
    Fixes: 371bf4273269 ("sched/rt: Add rt_rq utilization tracking")
    Fixes: 3727e0e16340 ("sched/dl: Add dl_rq utilization tracking")
    Fixes: 91c27493e78d ("sched/irq: Add IRQ utilization tracking")
    Link: https://lkml.kernel.org/r/1572434309-32512-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
    While seemingly harmless, __sched_fork() does hrtimer_init(), which,
    when DEBUG_OBJECTS is enabled, can end up doing allocations.

    This then results in the following lock order:

      rq->lock
        zone->lock.rlock
          batched_entropy_u64.lock

    Which in turn causes deadlocks when we do wakeups while holding that
    batched_entropy lock -- as the random code does.

    Solve this by moving __sched_fork() out from under rq->lock. This is
    safe because nothing there relies on rq->lock, as also evident from the
    other __sched_fork() callsite.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Qian Cai
    Cc: Thomas Gleixner
    Cc: akpm@linux-foundation.org
    Cc: bigeasy@linutronix.de
    Cc: cl@linux.com
    Cc: keescook@chromium.org
    Cc: penberg@kernel.org
    Cc: rientjes@google.com
    Cc: thgarnie@google.com
    Cc: tytso@mit.edu
    Cc: will@kernel.org
    Fixes: b7d5dc21072c ("random: add a spinlock_t to struct batched_entropy")
    Link: https://lkml.kernel.org/r/20191001091837.GK4536@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Nov, 2019

2 commits

    Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
    inadvertently introduced a race because it changed a previously
    unexplored dependency between dropping the rq->lock and
    sched_class::put_prev_task().

    The comments about dropping rq->lock, in for example
    newidle_balance(), only mentions the task being current and ->on_cpu
    being set. But when we look at the 'change' pattern (in for example
    sched_setnuma()):

      queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
      running = task_current(rq, p); /* rq->curr == p */

      if (queued)
              dequeue_task(...);
      if (running)
              put_prev_task(...);

      /* change task properties */

      if (queued)
              enqueue_task(...);
      if (running)
              set_next_task(...);

    It becomes obvious that if we do this after put_prev_task() has
    already been called on @p, things go sideways. This is exactly what
    the commit in question allows to happen when it does:

      prev->sched_class->put_prev_task(rq, prev, rf);
      if (!rq->nr_running)
              newidle_balance(rq, rf);

    The newidle_balance() call will drop rq->lock after we've called
    put_prev_task() and that allows the above 'change' pattern to
    interleave and mess up the state.

    Furthermore, it turns out we lost the RT-pull when we put the last DL
    task.

    Fix both problems by extracting the balancing from put_prev_task() and
    doing a multi-class balance() pass before put_prev_task().

    Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
    Reported-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Quentin Perret
    Tested-by: Valentin Schneider

    Peter Zijlstra
     
    When cgroup support is disabled, the following compilation error was hit:

    kernel/sched/core.c: In function ‘uclamp_update_active_tasks’:
    kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known
    struct css_task_iter it;
    ^~
    kernel/sched/core.c:1084:2: error: implicit declaration of function ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? [-Werror=implicit-function-declaration]
    css_task_iter_start(css, 0, &it);
    ^~~~~~~~~~~~~~~~~~~
    __sg_page_iter_start
    kernel/sched/core.c:1085:14: error: implicit declaration of function ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? [-Werror=implicit-function-declaration]
    while ((p = css_task_iter_next(&it))) {
    ^~~~~~~~~~~~~~~~~~
    __sg_page_iter_next
    kernel/sched/core.c:1091:2: error: implicit declaration of function ‘css_task_iter_end’; did you mean ‘get_task_cred’? [-Werror=implicit-function-declaration]
    css_task_iter_end(&it);
    ^~~~~~~~~~~~~~~~~
    get_task_cred
    kernel/sched/core.c:1081:23: warning: unused variable ‘it’ [-Wunused-variable]
    struct css_task_iter it;
    ^~
    cc1: some warnings being treated as errors
    make[2]: *** [kernel/sched/core.o] Error 1

    Fix by protecting uclamp_update_active_tasks() with
    CONFIG_UCLAMP_TASK_GROUP.
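
    That is, the css_task_iter usage is only compiled when cgroup-backed
    uclamp support is configured, roughly (a sketch, body abbreviated):

        #ifdef CONFIG_UCLAMP_TASK_GROUP
        static inline void
        uclamp_update_active_tasks(struct cgroup_subsys_state *css,
                                   unsigned int clamps)
        {
                struct css_task_iter it;
                struct task_struct *p;

                css_task_iter_start(css, 0, &it);
                while ((p = css_task_iter_next(&it)))
                        ; /* ... per-task uclamp update ... */
                css_task_iter_end(&it);
        }
        #endif /* CONFIG_UCLAMP_TASK_GROUP */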

    Fixes: babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
    Reported-by: Randy Dunlap
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Randy Dunlap
    Cc: Steven Rostedt
    Cc: Ingo Molnar
    Cc: Vincent Guittot
    Cc: Patrick Bellasi
    Cc: Mel Gorman
    Cc: Dietmar Eggemann
    Cc: Juri Lelli
    Cc: Ben Segall
    Link: https://lkml.kernel.org/r/20191105112212.596-1-qais.yousef@arm.com

    Qais Yousef
     

29 Oct, 2019

2 commits

  • While the static key is correctly initialized as being disabled, it will
    remain forever enabled once turned on. This means that if we start with an
    asymmetric system and hotplug out enough CPUs to end up with an SMP system,
    the static key will remain set - which is obviously wrong. We should detect
    this and turn off things like misfit migration and capacity aware wakeups.

    As Quentin pointed out, having separate root domains makes this slightly
    trickier. We could have exclusive cpusets that create an SMP island - IOW,
    the domains within this root domain will not see any asymmetry. This means
    we can't just disable the key on domain destruction, we need to count how
    many asymmetric root domains we have.

    Consider the following example using Juno r0 which is 2+4 big.LITTLE, where
    two identical cpusets are created: they both span both big and LITTLE CPUs:

        asym0         asym1
      [ L  L  B ]   [ L  L  B ]

    $ cgcreate -g cpuset:asym0
    $ cgset -r cpuset.cpus=0,1,3 asym0
    $ cgset -r cpuset.mems=0 asym0
    $ cgset -r cpuset.cpu_exclusive=1 asym0

    $ cgcreate -g cpuset:asym1
    $ cgset -r cpuset.cpus=2,4,5 asym1
    $ cgset -r cpuset.mems=0 asym1
    $ cgset -r cpuset.cpu_exclusive=1 asym1

    $ cgset -r cpuset.sched_load_balance=0 .

    (the CPU numbering may look odd because on the Juno LITTLEs are CPUs 0,3-5
    and bigs are CPUs 1-2)

    If we make one of those SMP (IOW remove asymmetry) by e.g. hotplugging its
    big core, we would end up with an SMP cpuset and an asymmetric cpuset - the
    static key must remain set, because we still have one asymmetric root domain.

    With the above example, this could be done with:

    $ echo 0 > /sys/devices/system/cpu/cpu2/online

    Which would result in:

        asym0         asym1
      [ L  L  B ]   [ L  L ]

    When both SMP and asymmetric cpusets are present, all CPUs will observe
    sched_asym_cpucapacity being set (it is system-wide), but not all CPUs
    observe asymmetry in their sched domain hierarchy:

      per_cpu(sd_asym_cpucapacity, <CPU in asym0>) == <some sched_domain>
      per_cpu(sd_asym_cpucapacity, <CPU in asym1>) == NULL

    Change the simple key enablement to an increment, and decrement the key
    counter when destroying domains that cover asymmetric CPUs.
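
    A sketch of the counting pattern described (placement in the topology
    code simplified / assumed): take one key reference per asymmetric root
    domain built, and drop one when such a root domain is destroyed, so the
    key only goes back to disabled when the last asymmetric root domain is
    gone.

        /* build_sched_domains(), once per root domain: */
        if (has_asym)
                static_branch_inc_cpuslocked(&sched_asym_cpucapacity);

        /* detach_destroy_domains(), when that root domain goes away: */
        if (rcu_access_pointer(per_cpu(sd_asym_cpucapacity, cpumask_any(cpu_map))))
                static_branch_dec_cpuslocked(&sched_asym_cpucapacity);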

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Cc: Dietmar.Eggemann@arm.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hannes@cmpxchg.org
    Cc: lizefan@huawei.com
    Cc: morten.rasmussen@arm.com
    Cc: qperret@google.com
    Cc: tj@kernel.org
    Cc: vincent.guittot@linaro.org
    Fixes: df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
    Link: https://lkml.kernel.org/r/20191023153745.19515-3-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar

    Valentin Schneider
     
  • Turns out hotplugging CPUs that are in exclusive cpusets can lead to the
    cpuset code feeding empty cpumasks to the sched domain rebuild machinery.

    This leads to the following splat:

    Internal error: Oops: 96000004 [#1] PREEMPT SMP
    Modules linked in:
    CPU: 0 PID: 235 Comm: kworker/5:2 Not tainted 5.4.0-rc1-00005-g8d495477d62e #23
    Hardware name: ARM Juno development board (r0) (DT)
    Workqueue: events cpuset_hotplug_workfn
    pstate: 60000005 (nZCv daif -PAN -UAO)
    pc : build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
    lr : build_sched_domains (kernel/sched/topology.c:1966)
    Call trace:
    build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
    partition_sched_domains_locked (kernel/sched/topology.c:2250)
    rebuild_sched_domains_locked (./include/linux/bitmap.h:370 ./include/linux/cpumask.h:538 kernel/cgroup/cpuset.c:955 kernel/cgroup/cpuset.c:978 kernel/cgroup/cpuset.c:1019)
    rebuild_sched_domains (kernel/cgroup/cpuset.c:1032)
    cpuset_hotplug_workfn (kernel/cgroup/cpuset.c:3205 (discriminator 2))
    process_one_work (./arch/arm64/include/asm/jump_label.h:21 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:114 kernel/workqueue.c:2274)
    worker_thread (./include/linux/compiler.h:199 ./include/linux/list.h:268 kernel/workqueue.c:2416)
    kthread (kernel/kthread.c:255)
    ret_from_fork (arch/arm64/kernel/entry.S:1167)
    Code: f860dae2 912802d6 aa1603e1 12800000 (f8616853)

    The faulty line in question is:

    cap = arch_scale_cpu_capacity(cpumask_first(cpu_map));

    and we're not checking the return value against nr_cpu_ids (we shouldn't
    have to!), which leads to the above.

    Prevent generate_sched_domains() from returning empty cpumasks, and add
    some assertion in build_sched_domains() to scream bloody murder if it
    happens again.
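
    The assertion side of this can be as small as bailing out early when
    handed an empty mask, e.g. (a sketch, not necessarily the exact
    upstream check):

        /* At the top of build_sched_domains(): */
        if (WARN_ON(cpumask_empty(cpu_map)))
                goto error;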

    The above splat was obtained on my Juno r0 with the following reproducer:

    $ cgcreate -g cpuset:asym
    $ cgset -r cpuset.cpus=0-3 asym
    $ cgset -r cpuset.mems=0 asym
    $ cgset -r cpuset.cpu_exclusive=1 asym

    $ cgcreate -g cpuset:smp
    $ cgset -r cpuset.cpus=4-5 smp
    $ cgset -r cpuset.mems=0 smp
    $ cgset -r cpuset.cpu_exclusive=1 smp

    $ cgset -r cpuset.sched_load_balance=0 .

    $ echo 0 > /sys/devices/system/cpu/cpu4/online
    $ echo 0 > /sys/devices/system/cpu/cpu5/online

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dietmar.Eggemann@arm.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hannes@cmpxchg.org
    Cc: lizefan@huawei.com
    Cc: morten.rasmussen@arm.com
    Cc: qperret@google.com
    Cc: tj@kernel.org
    Cc: vincent.guittot@linaro.org
    Fixes: 05484e098448 ("sched/topology: Add SD_ASYM_CPUCAPACITY flag detection")
    Link: https://lkml.kernel.org/r/20191023153745.19515-2-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar

    Valentin Schneider
     

09 Oct, 2019

2 commits

  • vtime_account_system() assumes that the target task to account cputime
    to is always the current task. This is most often true indeed except on
    task switch where we call:

    vtime_common_task_switch(prev)
    vtime_account_system(prev)

    Here prev is the scheduling-out task where we account the cputime to. It
    doesn't match current that is already the scheduling-in task at this
    stage of the context switch.

    So we end up checking the wrong task flags to determine if we are
    accounting guest or system time to the previous task.

    As a result the wrong task is used to check if the target is running in
    guest mode. We may then spuriously account or leak either system or
    guest time on task switch.

    Fix this assumption and also turn vtime_guest_enter/exit() to use the
    task passed in parameter as well to avoid future similar issues.

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Fixes: 2a42eb9594a1 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
    Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The quota/period ratio is used to ensure a child task group won't get
    more bandwidth than the parent task group, and is calculated as:

    normalized_cfs_quota() = [(quota_us << 20) / period_us]

    If this ratio changes due to precision loss when quota and period are
    scaled up by sched_cfs_period_timer() (the 147/128 scaling below), it
    will cause inconsistency between parent and child task groups.

    See below example:

    A userspace container manager (kubelet) does three operations:

    1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
    2) Create a few children cgroups.
    3) Set quota to 1,000us and period to 10,000us on a child cgroup.

    These operations are expected to succeed. However, if the scaling of
    147/128 happens before step 3, quota and period of the parent cgroup
    will be changed:

    new_quota: 1148437ns, 1148us
    new_period: 11484375ns, 11484us

    And when step 3 comes in, the ratio of the child cgroup will be
    104857, which will be larger than the parent cgroup ratio (104821),
    and will fail.

    Scaling them by a factor of 2 will fix the problem.
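
    The numbers above can be reproduced with a few lines of standalone
    arithmetic, using the normalized_cfs_quota() formula quoted earlier:

        #include <inttypes.h>
        #include <stdint.h>
        #include <stdio.h>

        /* normalized_cfs_quota() = (quota_us << 20) / period_us */
        static uint64_t normalized(uint64_t quota_us, uint64_t period_us)
        {
                return (quota_us << 20) / period_us;
        }

        int main(void)
        {
                /* Parent before scaling: quota 1000us, period 10000us. */
                printf("parent before: %" PRIu64 "\n", normalized(1000, 10000));

                /* Parent after one 147/128 scaling, done in nanoseconds. */
                uint64_t new_quota_ns  = 1000000ULL  * 147 / 128;  /* 1148437 */
                uint64_t new_period_ns = 10000000ULL * 147 / 128;  /* 11484375 */
                printf("parent after:  %" PRIu64 "\n",
                       normalized(new_quota_ns / 1000, new_period_ns / 1000));

                /* Child set in step 3: 104857 > 104821, so it is rejected. */
                printf("child:         %" PRIu64 "\n", normalized(1000, 10000));
                return 0;
        }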

    Tested-by: Phil Auld
    Signed-off-by: Xuewei Zhang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Phil Auld
    Cc: Anton Blanchard
    Cc: Ben Segall
    Cc: Dietmar Eggemann
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
    Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
    Signed-off-by: Ingo Molnar

    Xuewei Zhang
     

05 Oct, 2019

1 commit

  • …/kernel/git/brauner/linux

    Pull copy_struct_from_user() helper from Christian Brauner:
    "This contains the copy_struct_from_user() helper which got split out
    from the openat2() patchset. It is a generic interface designed to
    copy a struct from userspace.

    The helper will be especially useful for structs versioned by size of
    which we have quite a few. This allows for backwards compatibility,
    i.e. an extended struct can be passed to an older kernel, or a legacy
    struct can be passed to a newer kernel. For the first case (extended
    struct, older kernel) the new fields in an extended struct can be set
    to zero and the struct safely passed to an older kernel.

    The most obvious benefit is that this helper lets us get rid of
    duplicate code present in at least sched_setattr(), perf_event_open(),
    and clone3(). More importantly it will also help to ensure that users
    implementing versioning-by-size end up with the same core semantics.

    This point is especially crucial since we have at least one case where
    versioning-by-size is used but with slightly different semantics:
    sched_setattr(), perf_event_open(), and clone3() all do similar
    checks to copy_struct_from_user() while rt_sigprocmask(2) always
    rejects differently-sized struct arguments.

    With this pull request we also switch over sched_setattr(),
    perf_event_open(), and clone3() to use the new helper"

    * tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    usercopy: Add parentheses around assignment in test_copy_struct_from_user
    perf_event_open: switch to copy_struct_from_user()
    sched_setattr: switch to copy_struct_from_user()
    clone3: switch to copy_struct_from_user()
    lib: introduce copy_struct_from_user() helper

    Linus Torvalds
     

02 Oct, 2019

1 commit

  • The following commit:

    227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load")

    got fat-fingered by me when merging it with other patches. It meant to move
    the RCU section out of the for loop but ended up doing it partially, leaving
    a superfluous rcu_read_lock() inside, causing havoc.

    Reported-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Eric W. Biederman
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Cc: linux-tip-commits@vger.kernel.org
    Fixes: 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load")
    Link: https://lkml.kernel.org/r/20191001085033.GP4519@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

01 Oct, 2019

1 commit

    Switch the sched_setattr() syscall from its own code for copying
    struct sched_attr from userspace to the new dedicated
    copy_struct_from_user() helper.

    The change is very straightforward, and helps unify the syscall
    interface for struct-from-userspace syscalls. Ideally we could also
    unify sched_getattr(2)-style syscalls as well, but unfortunately the
    correct semantics for such syscalls are much less clear (see [1] for
    more detail). In future we could come up with a more sane idea for how
    the syscall interface should look.

    [1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
    robustify sched_read_attr() ABI logic and code")
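
    For reference, the call pattern of the helper looks roughly like this
    (a sketch of the copy step only, not the full sched_setattr()
    conversion):

        struct sched_attr attr;
        int ret;

        /* Zero-fills kernel-side fields a smaller user struct doesn't
         * cover, and rejects non-zero trailing bytes of a larger user
         * struct with -E2BIG. */
        ret = copy_struct_from_user(&attr, sizeof(attr), uattr, usize);
        if (ret)
                return ret;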

    Signed-off-by: Aleksa Sarai
    Reviewed-by: Kees Cook
    Reviewed-by: Christian Brauner
    [christian.brauner@ubuntu.com: improve commit message]
    Link: https://lore.kernel.org/r/20191001011055.19283-4-cyphar@cyphar.com
    Signed-off-by: Christian Brauner

    Aleksa Sarai
     

29 Sep, 2019

1 commit

  • Pull scheduler fixes from Ingo Molnar:

    - Apply a number of membarrier related fixes and cleanups, which fixes
    a use-after-free race in the membarrier code

    - Introduce proper RCU protection for tasks on the runqueue - to get
    rid of the subtle task_rcu_dereference() interface that was easy to
    get wrong

    - Misc fixes, but also an EAS speedup

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/fair: Avoid redundant EAS calculation
    sched/core: Remove double update_max_interval() call on CPU startup
    sched/core: Fix preempt_schedule() interrupt return comment
    sched/fair: Fix -Wunused-but-set-variable warnings
    sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
    sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
    sched/membarrier: Skip IPIs when mm->mm_users == 1
    selftests, sched/membarrier: Add multi-threaded test
    sched/membarrier: Fix p->mm->membarrier_state racy load
    sched/membarrier: Call sync_core only before usermode for same mm
    sched/membarrier: Remove redundant check
    sched/membarrier: Fix private expedited registration check
    tasks, sched/core: RCUify the assignment of rq->curr
    tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
    tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
    tasks: Add a count of task RCU users
    sched/core: Convert vcpu_is_preempted() from macro to an inline function
    sched/fair: Remove unused cfs_rq_clock_task() function

    Linus Torvalds
     

25 Sep, 2019

11 commits

  • The EAS wake-up path computes the system energy for several CPU
    candidates: the CPU with maximum spare capacity in each performance
    domain, and the prev_cpu. However, if prev_cpu also happens to be the
    CPU with maximum spare capacity in its performance domain, the energy
    calculation is still done twice, unnecessarily.

    Add a condition to filter out this corner case before doing the energy
    calculation.

    Reported-by: Pavan Kondeti
    Signed-off-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dietmar.eggemann@arm.com
    Cc: juri.lelli@redhat.com
    Cc: morten.rasmussen@arm.com
    Cc: qais.yousef@arm.com
    Cc: rjw@rjwysocki.net
    Cc: tkjos@google.com
    Cc: valentin.schneider@arm.com
    Cc: vincent.guittot@linaro.org
    Fixes: eb92692b2544 ("sched/fair: Speed-up energy-aware wake-ups")
    Link: https://lkml.kernel.org/r/20190920094115.GA11503@qperret.net
    Signed-off-by: Ingo Molnar

    Quentin Perret
     
  • update_max_interval() is called in both CPUHP_AP_SCHED_STARTING's startup
    and teardown callbacks, but it turns out it's also called at the end of
    the startup callback of CPUHP_AP_ACTIVE (which is further down the
    startup sequence).

    There's no point in repeating this interval update in the startup sequence
    since the CPU will remain online until it goes down the teardown path.

    Remove the redundant call in sched_cpu_activate() (CPUHP_AP_ACTIVE).

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dietmar.eggemann@arm.com
    Cc: juri.lelli@redhat.com
    Cc: vincent.guittot@linaro.org
    Link: https://lkml.kernel.org/r/20190923093017.11755-1-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar

    Valentin Schneider
     
  • preempt_schedule_irq() is the one that should be called on return from
    interrupt, clean up the comment to avoid any ambiguity.

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: linux-m68k@lists.linux-m68k.org
    Cc: linux-riscv@lists.infradead.org
    Cc: uclinux-h8-devel@lists.sourceforge.jp
    Link: https://lkml.kernel.org/r/20190923143620.29334-2-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar

    Valentin Schneider
     
  • Commit:

    de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")

    introduced a few compilation warnings:

    kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
    kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
    kernel/sched/fair.c: In function 'start_cfs_bandwidth':
    kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]

    Also, __refill_cfs_bandwidth_runtime() does no longer update the
    expiration time, so fix the comments accordingly.

    Signed-off-by: Qian Cai
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Ben Segall
    Reviewed-by: Dave Chiluk
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: pauld@redhat.com
    Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
    Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
    Signed-off-by: Ingo Molnar

    Qian Cai
     
  • An oops can be triggered in the scheduler when running qemu on arm64:

    Unable to handle kernel paging request at virtual address ffff000008effe40
    Internal error: Oops: 96000007 [#1] SMP
    Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
    pstate: 20000085 (nzCv daIf -PAN -UAO)
    pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
    lr : move_queued_task.isra.21+0x124/0x298
    ...
    Call trace:
    __ll_sc___cmpxchg_case_acq_4+0x4/0x20
    __migrate_task+0xc8/0xe0
    migration_cpu_stop+0x170/0x180
    cpu_stopper_thread+0xec/0x178
    smpboot_thread_fn+0x1ac/0x1e8
    kthread+0x134/0x138
    ret_from_fork+0x10/0x18

    __set_cpus_allowed_ptr() will choose an active dest_cpu in the affinity
    mask to migrate the process to if the process is not currently running
    on any of the CPUs specified in the affinity mask. However, it will
    choose an invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual
    machine) if the CPUs in the affinity mask are deactivated by cpu_down()
    after the cpumask_intersects() check. The subsequent cpumask_test_cpu()
    of dest_cpu then reads past the cpumask and may pass if the
    corresponding bit is coincidentally set. As a consequence, the kernel
    will access an invalid rq address associated with the invalid CPU in
    migration_cpu_stop->__migrate_task->move_queued_task and the Oops
    occurs.

    To reproduce the crash:

    1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
    sched_setaffinity.

    2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
    and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.

    3) The Oops appears if the bit for the invalid CPU happens to be set in
    the memory just past the tested cpumask.
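
    A sketch of the kind of check that closes the hole (simplified; its
    exact placement within __set_cpus_allowed_ptr() is assumed): pick
    dest_cpu from the intersection with the currently valid CPUs and bail
    out if nothing usable remains, before any migration is attempted.

        dest_cpu = cpumask_any_and(cpu_valid_mask, new_mask);
        if (dest_cpu >= nr_cpu_ids) {
                ret = -EINVAL;
                goto out;
        }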

    Signed-off-by: KeMeng Shi
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com
    Signed-off-by: Ingo Molnar

    KeMeng Shi
     
  • Remove the IPI fallback code from membarrier to deal with very
    infrequent cpumask memory allocation failure. Use GFP_KERNEL rather
    than GFP_NOWAIT, and relax the blocking guarantees for the expedited
    membarrier system call commands, allowing it to block if waiting for
    memory to be made available.

    In addition, now -ENOMEM can be returned to user-space if the cpumask
    memory allocation fails.

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Eric W. Biederman
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190919173705.2181-8-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
     
  • If there is only a single mm_user for the mm, the private expedited
    membarrier command can skip the IPIs, because only a single thread
    is using the mm.

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Eric W. Biederman
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190919173705.2181-7-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
     
  • The membarrier_state field is located within the mm_struct, which
    is not guaranteed to exist when used from runqueue-lock-free iteration
    on runqueues by the membarrier system call.

    Copy the membarrier_state from the mm_struct into the scheduler runqueue
    when the scheduler switches between mm.

    When registering membarrier for mm, after setting the registration bit
    in the mm membarrier state, issue a synchronize_rcu() to ensure the
    scheduler observes the change. In order to take care of the case
    where a runqueue keeps executing the target mm without swapping to
    other mm, iterate over each runqueue and issue an IPI to copy the
    membarrier_state from the mm_struct into each runqueue that has the
    same mm whose state has just been modified.

    Move the mm membarrier_state field closer to pgd in mm_struct to use
    a cache line already touched by the scheduler switch_mm.

    The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
    clear the runqueue's membarrier state in addition to clear the mm
    membarrier state, so move its implementation into the scheduler
    membarrier code so it can access the runqueue structure.

    Add memory barrier in membarrier_exec_mmap() prior to clearing
    the membarrier state, ensuring memory accesses executed prior to exec
    are not reordered with the stores clearing the membarrier state.

    As suggested by Linus, move all membarrier.c RCU read-side locks outside
    of the for each cpu loops.

    Suggested-by: Linus Torvalds
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Eric W. Biederman
    Cc: Kirill Tkhai
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
     
  • Checking that the number of threads is 1 is redundant with checking
    mm_users == 1.

    No change in functionality intended.

    Suggested-by: Oleg Nesterov
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Eric W. Biederman
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190919173705.2181-3-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
     
  • Fix a logic flaw in the way membarrier_register_private_expedited()
    handles ready state checks for private expedited sync core and private
    expedited registrations.

    If a private expedited membarrier registration is first performed, and
    then a private expedited sync_core registration is performed, the ready
    state check will skip the second registration when it really should not.

    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Eric W. Biederman
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190919173705.2181-2-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
     
  • The current task on the runqueue is currently read with rcu_dereference().

    To obtain ordinary RCU semantics for an rcu_dereference() of rq->curr it needs
    to be paired with rcu_assign_pointer() of rq->curr. Which provides the
    memory barrier necessary to order assignments to the task_struct
    and the assignment to rq->curr.

    Unfortunately the assignment of rq->curr in __schedule is a hot path,
    and it has already been shown that additional barriers in that code
    will reduce the performance of the scheduler. So I will attempt to
    describe below why you can effectively have ordinary RCU semantics
    without any additional barriers.

    The assignment of rq->curr in init_idle is a slow path called once
    per cpu and that can use rcu_assign_pointer() without any concerns.

    As I write this there are effectively two users of rcu_dereference() on
    rq->curr. There is the membarrier code in kernel/sched/membarrier.c
    that only looks at "->mm" after the rcu_dereference(). Then there is
    task_numa_compare() in kernel/sched/fair.c. My best reading of the
    code shows that task_numa_compare only accesses: "->flags",
    "->cpus_ptr", "->numa_group", "->numa_faults[]",
    "->total_numa_faults", and "->se.cfs_rq".

    The code in __schedule() essentially does:

      rq_lock(...);
      smp_mb__after_spinlock();

      next = pick_next_task(...);
      rq->curr = next;

      context_switch(prev, next);

    At the start of the function the rq_lock/smp_mb__after_spinlock
    pair provides a full memory barrier. Further there is a full memory barrier
    in context_switch().

    This means that any task that has already run and modified itself (the
    common case) has already seen two memory barriers before __schedule()
    runs and begins executing. A task that modifies itself then sees a
    third full memory barrier pair with the rq_lock();

    For a brand new task that is enqueued with wake_up_new_task() there
    are the memory barriers present from the taking and releasing of the
    pi_lock and the rq_lock as the process is enqueued, as well as the
    full memory barrier at the start of __schedule() assuming __schedule()
    happens on the same cpu.

    This means that by the time we reach the assignment of rq->curr
    except for values on the task struct modified in pick_next_task
    the code has the same guarantees as if it used rcu_assign_pointer().

    Reading through all of the implementations of pick_next_task it
    appears pick_next_task is limited to modifying the task_struct fields
    "->se", "->rt", "->dl". These fields are the sched_entity structures
    of the various schedulers.

    Further "->se.cfs_rq" is only changed in cgroup attach/move operations
    initialized by userspace.

    Unless I have missed something, this means that in practice the
    users of "rcu_dereference(rq->curr)" get normal RCU semantics of
    rcu_dereference() for the fields they care about, despite the
    assignment of rq->curr in __schedule() not using rcu_assign_pointer().

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Davidlohr Bueso
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Eric W. Biederman