11 Nov, 2020

1 commit

  • Reading /proc/sys/kernel/sched_domain/cpu*/domain0/flags multiple times
    with small reads causes oopses with SLUB corruption issues because the kfree
    is freeing an offset into a previous allocation. Fix this by adding a new
    pointer 'buf' for the allocation and kfree, and using the temporary pointer
    'tmp' to handle memory copies at offsets into buf.
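
    A minimal sketch of the resulting pattern (identifiers such as 'size',
    'chunk' and 'chunk_len' are illustrative, not the handler's actual names):

      char *buf, *tmp;

      tmp = buf = kcalloc(size + 1, sizeof(*buf), GFP_KERNEL);
      if (!buf)
              return -ENOMEM;

      /* build the output by advancing only the cursor... */
      memcpy(tmp, chunk, chunk_len);
      tmp += chunk_len;

      /* ...and always free the original allocation, never the advanced cursor */
      kfree(buf);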

    Fixes: 5b9f8ff7b320 ("sched/debug: Output SD flag names rather than their values")
    Reported-by: Jeff Bastian
    Signed-off-by: Colin Ian King
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20201029151103.373410-1-colin.king@canonical.com

    Colin Ian King
     

09 Sep, 2020

1 commit

  • The last sd_flag_debug shuffle inadvertently moved its definition within
    an #ifdef CONFIG_SYSCTL region. While CONFIG_SYSCTL is indeed required to
    produce the sched domain ctl interface (which uses sd_flag_debug to output
    flag names), it isn't required to run any assertion on the sched_domain
    hierarchy itself.

    Move the definition of sd_flag_debug to a CONFIG_SCHED_DEBUG region of
    topology.c.

    Now at long last we have:

    - sd_flag_debug declared in include/linux/sched/topology.h iff
      CONFIG_SCHED_DEBUG=y
    - sd_flag_debug defined in kernel/sched/topology.c, conditioned by:
      - CONFIG_SCHED_DEBUG, with an explicit #ifdef block
      - CONFIG_SMP, as a requirement to compile topology.c

    With this change, all symbols pertaining to SD flag metadata (with the
    exception of __SD_FLAG_CNT) are now defined exclusively within topology.c.
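
    A rough sketch of the resulting layout in kernel/sched/topology.c (the
    SD_FLAG()/sd_flag_debug shapes are abridged and should be treated as
    illustrative):

      /* kernel/sched/topology.c -- only compiled with CONFIG_SMP */
      #ifdef CONFIG_SCHED_DEBUG

      #define SD_FLAG(_name, mflags) [__##_name] = { .meta_flags = mflags, .name = #_name },
      const struct sd_flag_debug sd_flag_debug[] = {
      #include <linux/sched/sd_flags.h>
      };
      #undef SD_FLAG

      #endif /* CONFIG_SCHED_DEBUG */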

    Fixes: 8fca9494d4b4 ("sched/topology: Move sd_flag_debug out of linux/sched/topology.h")
    Reported-by: Randy Dunlap
    Signed-off-by: Valentin Schneider
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20200908184956.23369-1-valentin.schneider@arm.com

    Valentin Schneider
     

26 Aug, 2020

1 commit

  • Defining an array in a header imported all over the place clearly is a daft
    idea, but that still didn't stop me from doing it.

    Leave a declaration of sd_flag_debug in topology.h and move its definition
    to sched/debug.c.

    Fixes: b6e862f38672 ("sched/topology: Define and assign sched_domain flag metadata")
    Reported-by: Andy Shevchenko
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200825133216.9163-1-valentin.schneider@arm.com

    Valentin Schneider
     

19 Aug, 2020

1 commit

  • Decoding the output of /proc/sys/kernel/sched_domain/cpu*/domain*/flags has
    always been somewhat annoying, as one needs to go fetch the bit -> name
    mapping from the source code itself. This encoding can be saved in a script
    somewhere, but that isn't safe from flags being added, removed or even
    shuffled around.

    What matters for debugging purposes is to get *which* flags are set in a
    given domain, their associated value is pretty much meaningless.

    Make the sd flags debug file output flag names.
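
    Conceptually, the read handler walks the set bits in sd->flags and emits
    names from the flag metadata instead of the raw bitmask -- a hedged sketch
    (seq_file style is used here purely for brevity; the real file is a sysctl
    entry):

      unsigned long flags = sd->flags;
      int idx;

      for_each_set_bit(idx, &flags, __SD_FLAG_CNT)
              seq_printf(m, "%s ", sd_flag_debug[idx].name);
      seq_puts(m, "\n");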

    Signed-off-by: Valentin Schneider
    Signed-off-by: Ingo Molnar
    Acked-by: Peter Zijlstra
    Link: https://lore.kernel.org/r/20200817113003.20802-7-valentin.schneider@arm.com

    Valentin Schneider
     

28 May, 2020

1 commit

  • In preparation for removing rq->wake_list, replace the
    !list_empty(rq->wake_list) check with rq->ttwu_pending. This is not fully
    equivalent, as this new variable is racy.
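
    In code terms, the substitution is essentially (context illustrative):

      -       if (!list_empty(&rq->wake_list))
      +       if (rq->ttwu_pending)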

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20200526161908.070399698@infradead.org

    Peter Zijlstra
     

20 May, 2020

2 commits

  • Peter Zijlstra
     
  • The intention of commit 96e74ebf8d59 ("sched/debug: Add task uclamp
    values to SCHED_DEBUG procfs") was to print requested and effective
    task uclamp values. The requested values printed are read from p->uclamp,
    which holds the last effective values. Fix this by printing the values
    from p->uclamp_req.
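
    A hedged sketch of the corrected output block (the __PS() helper and
    uclamp_eff_value() are as they appear upstream around this time, but treat
    the exact lines as illustrative):

      /* requested clamps come from p->uclamp_req, effective ones are computed */
      __PS("uclamp.min", p->uclamp_req[UCLAMP_MIN].value);
      __PS("uclamp.max", p->uclamp_req[UCLAMP_MAX].value);
      __PS("effective uclamp.min", uclamp_eff_value(p, UCLAMP_MIN));
      __PS("effective uclamp.max", uclamp_eff_value(p, UCLAMP_MAX));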

    Fixes: 96e74ebf8d59 ("sched/debug: Add task uclamp values to SCHED_DEBUG procfs")
    Signed-off-by: Pavankumar Kondeti
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Tested-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/1589115401-26391-1-git-send-email-pkondeti@codeaurora.org

    Pavankumar Kondeti
     

01 May, 2020

2 commits

  • Writing to the sysctl of a sched_domain->flags directly updates the value of
    the field, and goes nowhere near update_top_cache_domain(). This means that
    the cached domain pointers can end up containing stale data (e.g. the
    domain pointed to doesn't have the relevant flag set anymore).

    Explicit domain walks that check for flags will be affected by
    the write, but this won't be in sync with the cached pointers which will
    still point to the domains that were cached at the last sched_domain
    build.

    In other words, writing to this interface is playing a dangerous game. It
    could be made to trigger an update of the cached sched_domain pointers when
    written to, but this does not seem to be worth the trouble. Make it
    read-only.
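
    Mechanically, this amounts to registering the flags entry with a read-only
    mode -- a sketch with illustrative surroundings ('entry' stands in for the
    actual table slot):

      /* 0444 instead of 0644: the flags file can no longer be written */
      set_table_entry(entry, "flags", &sd->flags, sizeof(int), 0444, proc_dointvec_minmax);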

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200415210512.805-3-valentin.schneider@arm.com

    Valentin Schneider
     
  • Ensure at least one space is left between the task state and the task name.

    w/o patch:
    runnable tasks:
    S task PID tree-key switches prio wait
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200414125721.195801-1-xiexiuqi@huawei.com

    Xie XiuQi
     

08 Apr, 2020

3 commits

  • Requested and effective uclamp values can be a bit tricky to decipher when
    playing with cgroup hierarchies. Add them to a task's procfs when
    SCHED_DEBUG is enabled.

    Reviewed-by: Qais Yousef
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200226124543.31986-4-valentin.schneider@arm.com

    Valentin Schneider
     
  • The printing macros in debug.c keep redefining the same output
    format. Collect each output format in a single definition, and reuse that
    definition in the other macros. While at it, add a layer of parentheses and
    replace printf's with the newly introduced macros.
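
    The flavour of the consolidation, sketched (the format strings follow the
    upstream definitions approximately; treat the details as illustrative):

      /* one canonical output format per kind... */
      #define __PS(S, F)  SEQ_printf(m, "%-45s:%21Ld\n", S, (long long)(F))
      #define __PSN(S, F) SEQ_printf(m, "%-45s:%14Ld.%06ld\n", S, SPLIT_NS((long long)(F)))

      /* ...reused by the shorthand macros */
      #define __P(F)  __PS(#F, F)
      #define P(F)    __PS(#F, p->F)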

    Reviewed-by: Qais Yousef
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200226124543.31986-3-valentin.schneider@arm.com

    Valentin Schneider
     
  • Most printing macros for procfs are defined globally in debug.c, and they
    are re-defined (to the exact same thing) within proc_sched_show_task().

    Get rid of the duplicate defines.

    Reviewed-by: Qais Yousef
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200226124543.31986-2-valentin.schneider@arm.com

    Valentin Schneider
     

24 Feb, 2020

2 commits

  • Now that runnable_load_avg has been removed, we can replace it with a new
    signal that will highlight the runnable pressure on a cfs_rq. This signal
    tracks the waiting time of tasks on the rq and can help to better define
    the state of rqs.

    Currently, only util_avg is used to define the state of a rq:
    a rq with more than around 80% utilization and more than one task is
    considered overloaded.

    But the util_avg signal of a rq can become temporarily low after a task
    has migrated onto another rq, which can bias the classification of the rq.

    When tasks compete for the same rq, their runnable average signal will be
    higher than util_avg, as it includes the waiting time, and we can use
    this signal to better classify cfs_rqs.

    The new runnable_avg will track the runnable time of a task, which simply
    adds the waiting time to the running time. The runnable_avg of a cfs_rq
    will be the \Sum of the se's runnable_avg, and the runnable_avg of a group
    entity will follow that of the rq, similarly to util_avg.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Mel Gorman
    Signed-off-by: Ingo Molnar
    Reviewed-by: "Dietmar Eggemann "
    Acked-by: Peter Zijlstra
    Cc: Juri Lelli
    Cc: Valentin Schneider
    Cc: Phil Auld
    Cc: Hillf Danton
    Link: https://lore.kernel.org/r/20200224095223.13361-9-mgorman@techsingularity.net

    Vincent Guittot
     
  • Now that runnable_load_avg is no more used, we can remove it to make
    space for a new signal.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Mel Gorman
    Signed-off-by: Ingo Molnar
    Reviewed-by: "Dietmar Eggemann "
    Acked-by: Peter Zijlstra
    Cc: Juri Lelli
    Cc: Valentin Schneider
    Cc: Phil Auld
    Cc: Hillf Danton
    Link: https://lore.kernel.org/r/20200224095223.13361-8-mgorman@techsingularity.net

    Vincent Guittot
     

17 Jan, 2020

1 commit

  • The lengthy output of sysrq-t may take a lot of time on a slow serial
    console with lots of processes and CPUs.

    So we need to reset the NMI watchdog to avoid spurious lockup messages, and
    we also reset the softlockup watchdogs on all other CPUs, since another CPU
    might be blocked waiting for us to process an IPI or stop_machine.

    Add this to sysrq_sched_debug_show(), as was already done in
    show_state_filter().
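
    A hedged sketch of the addition (touch_nmi_watchdog() and
    touch_all_softlockup_watchdogs() are the standard watchdog helpers;
    placement inside the per-CPU loop is indicative):

      void sysrq_sched_debug_show(void)
      {
              int cpu;

              sched_debug_header(NULL);
              for_each_online_cpu(cpu) {
                      /* slow serial consoles can take long enough to trip the watchdogs */
                      touch_nmi_watchdog();
                      touch_all_softlockup_watchdogs();
                      print_cpu(NULL, cpu);
              }
      }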

    Signed-off-by: Wei Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Steven Rostedt (VMware)
    Link: https://lkml.kernel.org/r/20191226085224.48942-1-liwei391@huawei.com

    Wei Li
     

25 Jun, 2019

1 commit


19 Jun, 2019

1 commit

  • Based on 2 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation #

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 4122 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

03 Jun, 2019

4 commits

  • The sched domain per rq load index files also disappear from the
    /proc/sys/kernel/sched_domain/cpuX/domainY directories.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Valentin Schneider
    Cc: Vincent Guittot
    Link: https://lkml.kernel.org/r/20190527062116.11512-6-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     
  • The per rq load array values also disappear from the cpu#X sections in
    /proc/sched_debug.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Valentin Schneider
    Cc: Vincent Guittot
    Link: https://lkml.kernel.org/r/20190527062116.11512-5-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     
  • This reverts:

    commit 201c373e8e48 ("sched/debug: Limit sd->*_idx range on sysctl")

    Load indexes (sd->*_idx) are no longer needed without rq->cpu_load[].
    The range check for load indexes can be removed as well. Get rid of it
    before removing rq->cpu_load[], since the check uses CPU_LOAD_IDX_MAX.

    At the same time, fix the following coding style issues detected by
    scripts/checkpatch.pl:

    ERROR: space prohibited before that ','
    ERROR: space prohibited before that close parenthesis ')'

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Valentin Schneider
    Cc: Vincent Guittot
    Link: https://lkml.kernel.org/r/20190527062116.11512-4-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     
  • The CFS class is the only one maintaining and using the CPU wide load
    (rq->load(.weight)). The last use case of the CPU wide load in CFS's
    set_next_entity() can be replaced by using the load of the CFS class
    (rq->cfs.load(.weight)) instead.
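
    The remaining user, sketched as an illustrative diff of the check in
    set_next_entity():

      -       if (schedstat_enabled() &&
      -           rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
      +       if (schedstat_enabled() &&
      +           rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {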

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190424084556.604-1-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     

20 Apr, 2019

1 commit


04 Feb, 2019

1 commit

  • register_sched_domain_sysctl() copies the cpu_possible_mask into
    sd_sysctl_cpus, but only if sd_sysctl_cpus hasn't already been
    allocated (ie, CONFIG_CPUMASK_OFFSTACK is set). However, when
    CONFIG_CPUMASK_OFFSTACK is not set, sd_sysctl_cpus is left
    uninitialized (all zeroes) and the kernel may fail to initialize
    sched_domain sysctl entries for all possible CPUs.

    This is visible to the user if the kernel is booted with maxcpus=n, or
    if ACPI tables have been modified to leave CPUs offline, and then
    checking for missing /proc/sys/kernel/sched_domain/cpu* entries.

    Fix this by separating the allocation and the initialization, and adding a
    flag so that the possible CPU entries are initialized only while the system
    is booting.
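
    A hedged sketch of the separated allocation and initialization (the flag
    name and surrounding details are indicative):

      static bool init_done;

      if (!cpumask_available(sd_sysctl_cpus)) {
              if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
                      return;
      }

      if (!init_done) {
              init_done = true;
              /* init to the possible mask so the sysctl directory has no holes */
              cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
      }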

    Tested-by: Syuuichirou Ishii
    Tested-by: Tarumizu, Kohei
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Masayoshi Mizuma
    Acked-by: Joe Lawrence
    Cc: Linus Torvalds
    Cc: Masayoshi Mizuma
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190129151245.5073-1-msys.mizuma@gmail.com
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

06 Jan, 2019

1 commit

  • Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".

    The jump label is controlled by HAVE_JUMP_LABEL, which is defined
    like this:

    #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
    # define HAVE_JUMP_LABEL
    #endif

    We can improve this by testing 'asm goto' support in Kconfig, then
    make JUMP_LABEL depend on CC_HAS_ASM_GOTO.

    Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
    match to the real kernel capability.

    Signed-off-by: Masahiro Yamada
    Acked-by: Michael Ellerman (powerpc)
    Tested-by: Sedat Dilek

    Masahiro Yamada
     

12 Nov, 2018

1 commit

  • We already have task_has_rt_policy() and task_has_dl_policy() helpers,
    create task_has_idle_policy() as well and update sched core to start
    using it.

    While at it, use task_has_dl_policy() at one more place.
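
    The new helper mirrors the existing ones -- a minimal sketch (the real
    definition lives next to task_has_rt_policy()/task_has_dl_policy() in
    kernel/sched/sched.h):

      static inline int task_has_idle_policy(struct task_struct *p)
      {
              return idle_policy(p->policy);
      }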

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Daniel Lezcano
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Link: http://lkml.kernel.org/r/ce3915d5b490fc81af926a3b6bfb775e7188e005.1541416894.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     

10 Sep, 2018

1 commit

  • The following lockdep report can be triggered by writing to /sys/kernel/debug/sched_features:

    ======================================================
    WARNING: possible circular locking dependency detected
    4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18 Not tainted
    ------------------------------------------------------
    sh/3358 is trying to acquire lock:
    000000004ad3989d (cpu_hotplug_lock.rw_sem){++++}, at: static_key_enable+0x14/0x30
    but task is already holding lock:
    00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3 (&sb->s_type->i_mutex_key#3){+.+.}:
    lock_acquire+0xb8/0x148
    down_write+0xac/0x140
    start_creating+0x5c/0x168
    debugfs_create_dir+0x18/0x220
    opp_debug_register+0x8c/0x120
    _add_opp_dev+0x104/0x1f8
    dev_pm_opp_get_opp_table+0x174/0x340
    _of_add_opp_table_v2+0x110/0x760
    dev_pm_opp_of_add_table+0x5c/0x240
    dev_pm_opp_of_cpumask_add_table+0x5c/0x100
    cpufreq_init+0x160/0x430
    cpufreq_online+0x1cc/0xe30
    cpufreq_add_dev+0x78/0x198
    subsys_interface_register+0x168/0x270
    cpufreq_register_driver+0x1c8/0x278
    dt_cpufreq_probe+0xdc/0x1b8
    platform_drv_probe+0xb4/0x168
    driver_probe_device+0x318/0x4b0
    __device_attach_driver+0xfc/0x1f0
    bus_for_each_drv+0xf8/0x180
    __device_attach+0x164/0x200
    device_initial_probe+0x10/0x18
    bus_probe_device+0x110/0x178
    device_add+0x6d8/0x908
    platform_device_add+0x138/0x3d8
    platform_device_register_full+0x1cc/0x1f8
    cpufreq_dt_platdev_init+0x174/0x1bc
    do_one_initcall+0xb8/0x310
    kernel_init_freeable+0x4b8/0x56c
    kernel_init+0x10/0x138
    ret_from_fork+0x10/0x18
    -> #2 (opp_table_lock){+.+.}:
    lock_acquire+0xb8/0x148
    __mutex_lock+0x104/0xf50
    mutex_lock_nested+0x1c/0x28
    _of_add_opp_table_v2+0xb4/0x760
    dev_pm_opp_of_add_table+0x5c/0x240
    dev_pm_opp_of_cpumask_add_table+0x5c/0x100
    cpufreq_init+0x160/0x430
    cpufreq_online+0x1cc/0xe30
    cpufreq_add_dev+0x78/0x198
    subsys_interface_register+0x168/0x270
    cpufreq_register_driver+0x1c8/0x278
    dt_cpufreq_probe+0xdc/0x1b8
    platform_drv_probe+0xb4/0x168
    driver_probe_device+0x318/0x4b0
    __device_attach_driver+0xfc/0x1f0
    bus_for_each_drv+0xf8/0x180
    __device_attach+0x164/0x200
    device_initial_probe+0x10/0x18
    bus_probe_device+0x110/0x178
    device_add+0x6d8/0x908
    platform_device_add+0x138/0x3d8
    platform_device_register_full+0x1cc/0x1f8
    cpufreq_dt_platdev_init+0x174/0x1bc
    do_one_initcall+0xb8/0x310
    kernel_init_freeable+0x4b8/0x56c
    kernel_init+0x10/0x138
    ret_from_fork+0x10/0x18
    -> #1 (subsys mutex#6){+.+.}:
    lock_acquire+0xb8/0x148
    __mutex_lock+0x104/0xf50
    mutex_lock_nested+0x1c/0x28
    subsys_interface_register+0xd8/0x270
    cpufreq_register_driver+0x1c8/0x278
    dt_cpufreq_probe+0xdc/0x1b8
    platform_drv_probe+0xb4/0x168
    driver_probe_device+0x318/0x4b0
    __device_attach_driver+0xfc/0x1f0
    bus_for_each_drv+0xf8/0x180
    __device_attach+0x164/0x200
    device_initial_probe+0x10/0x18
    bus_probe_device+0x110/0x178
    device_add+0x6d8/0x908
    platform_device_add+0x138/0x3d8
    platform_device_register_full+0x1cc/0x1f8
    cpufreq_dt_platdev_init+0x174/0x1bc
    do_one_initcall+0xb8/0x310
    kernel_init_freeable+0x4b8/0x56c
    kernel_init+0x10/0x138
    ret_from_fork+0x10/0x18
    -> #0 (cpu_hotplug_lock.rw_sem){++++}:
    __lock_acquire+0x203c/0x21d0
    lock_acquire+0xb8/0x148
    cpus_read_lock+0x58/0x1c8
    static_key_enable+0x14/0x30
    sched_feat_write+0x314/0x428
    full_proxy_write+0xa0/0x138
    __vfs_write+0xd8/0x388
    vfs_write+0xdc/0x318
    ksys_write+0xb4/0x138
    sys_write+0xc/0x18
    __sys_trace_return+0x0/0x4
    other info that might help us debug this:
    Chain exists of:
    cpu_hotplug_lock.rw_sem --> opp_table_lock --> &sb->s_type->i_mutex_key#3
     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&sb->s_type->i_mutex_key#3);
                                   lock(opp_table_lock);
                                   lock(&sb->s_type->i_mutex_key#3);
      lock(cpu_hotplug_lock.rw_sem);

      *** DEADLOCK ***
    2 locks held by sh/3358:
    #0: 00000000a8c4b363 (sb_writers#10){.+.+}, at: vfs_write+0x238/0x318
    #1: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
    stack backtrace:
    CPU: 5 PID: 3358 Comm: sh Not tainted 4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18
    Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT)
    Call trace:
    dump_backtrace+0x0/0x288
    show_stack+0x14/0x20
    dump_stack+0x13c/0x1ac
    print_circular_bug.isra.10+0x270/0x438
    check_prev_add.constprop.16+0x4dc/0xb98
    __lock_acquire+0x203c/0x21d0
    lock_acquire+0xb8/0x148
    cpus_read_lock+0x58/0x1c8
    static_key_enable+0x14/0x30
    sched_feat_write+0x314/0x428
    full_proxy_write+0xa0/0x138
    __vfs_write+0xd8/0x388
    vfs_write+0xdc/0x318
    ksys_write+0xb4/0x138
    sys_write+0xc/0x18
    __sys_trace_return+0x0/0x4

    This is because when loading the cpufreq_dt module we first acquire
    cpu_hotplug_lock.rw_sem lock, then in cpufreq_init(), we are taking
    the &sb->s_type->i_mutex_key lock.

    But when writing to /sys/kernel/debug/sched_features, the
    cpu_hotplug_lock.rw_sem lock depends on the &sb->s_type->i_mutex_key lock.

     To fix this bug, reverse the lock acquisition order when writing to
     sched_features; this way cpu_hotplug_lock.rw_sem no longer depends on
     &sb->s_type->i_mutex_key.

    Tested-by: Dietmar Eggemann
    Signed-off-by: Jiada Wang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Eugeniu Rosca
    Cc: George G. Davis
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180731121222.26195-1-jiada_wang@mentor.com
    Signed-off-by: Ingo Molnar

    Jiada Wang
     

14 Aug, 2018

1 commit

  • Pull x86 timer updates from Thomas Gleixner:
    "Early TSC based time stamping to allow better boot time analysis.

    This comes with a general cleanup of the TSC calibration code which
    grew warts and duct taping over the years and removes 250 lines of
    code. Initiated and mostly implemented by Pavel with help from various
    folks"

    * 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    x86/kvmclock: Mark kvm_get_preset_lpj() as __init
    x86/tsc: Consolidate init code
    sched/clock: Disable interrupts when calling generic_sched_clock_init()
    timekeeping: Prevent false warning when persistent clock is not available
    sched/clock: Close a hole in sched_clock_init()
    x86/tsc: Make use of tsc_calibrate_cpu_early()
    x86/tsc: Split native_calibrate_cpu() into early and late parts
    sched/clock: Use static key for sched_clock_running
    sched/clock: Enable sched clock early
    sched/clock: Move sched clock initialization and merge with generic clock
    x86/tsc: Use TSC as sched clock early
    x86/tsc: Initialize cyc2ns when tsc frequency is determined
    x86/tsc: Calibrate tsc only once
    ARM/time: Remove read_boot_clock64()
    s390/time: Remove read_boot_clock64()
    timekeeping: Default boot time offset to local_clock()
    timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
    s390/time: Add read_persistent_wall_and_boot_offset()
    x86/xen/time: Output xen sched_clock time from 0
    x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
    ...

    Linus Torvalds
     

25 Jul, 2018

1 commit

  • Fix the order in which the private and shared numa faults are getting
    printed.

    No functional changes.

    Running SPECjbb2005 on a 4 node machine and comparing bops/JVM:

    JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
    16    25215.7     25375.3     0.63
    1     72107       72617       0.70

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1529514181-9842-7-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     

20 Jul, 2018

1 commit

  • sched_clock_running may be read every time sched_clock_cpu() is called.
    Yet, this variable is updated only twice during boot and never changes
    again, so it is better to make it a static key.
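
    A hedged sketch of the shape of the change, generic (stable-clock) variant
    shown; DEFINE_STATIC_KEY_FALSE() and static_branch_likely() are the
    standard static-key APIs, the surrounding logic is abridged:

      static DEFINE_STATIC_KEY_FALSE(sched_clock_running);

      u64 sched_clock_cpu(int cpu)
      {
              if (!static_branch_likely(&sched_clock_running))
                      return 0;

              return sched_clock();
      }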

    Signed-off-by: Pavel Tatashin
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: steven.sistare@oracle.com
    Cc: daniel.m.jordan@oracle.com
    Cc: linux@armlinux.org.uk
    Cc: schwidefsky@de.ibm.com
    Cc: heiko.carstens@de.ibm.com
    Cc: john.stultz@linaro.org
    Cc: sboyd@codeaurora.org
    Cc: hpa@zytor.com
    Cc: douly.fnst@cn.fujitsu.com
    Cc: prarit@redhat.com
    Cc: feng.tang@intel.com
    Cc: pmladek@suse.com
    Cc: gnomes@lxorguk.ukuu.org.uk
    Cc: linux-s390@vger.kernel.org
    Cc: boris.ostrovsky@oracle.com
    Cc: jgross@suse.com
    Cc: pbonzini@redhat.com
    Link: https://lkml.kernel.org/r/20180719205545.16512-25-pasha.tatashin@oracle.com

    Pavel Tatashin
     

21 Jun, 2018

1 commit

  • match_string() returns the index of an array for a matching string,
    which can be used instead of the open coded variant.
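
    The call itself, sketched against the sched_feat_names[] table (surrounding
    parsing omitted):

      i = match_string(sched_feat_names, __SCHED_FEAT_NR, cmp);
      if (i < 0)
              return i;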

    Signed-off-by: Yisheng Xie
    Reviewed-by: Andy Shevchenko
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lore.kernel.org/lkml/1527765086-19873-15-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Ingo Molnar

    Yisheng Xie
     

16 May, 2018

1 commit


03 Apr, 2018

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main scheduler changes in this cycle were:

    - NUMA balancing improvements (Mel Gorman)

    - Further load tracking improvements (Patrick Bellasi)

    - Various NOHZ balancing cleanups and optimizations (Peter Zijlstra)

    - Improve blocked load handling, in particular we can now reduce and
    eventually stop periodic load updates on 'very idle' CPUs. (Vincent
    Guittot)

    - On isolated CPUs offload the final 1Hz scheduler tick as well, plus
    related cleanups and reorganization. (Frederic Weisbecker)

    - Core scheduler code cleanups (Ingo Molnar)"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    sched/core: Update preempt_notifier_key to modern API
    sched/cpufreq: Rate limits for SCHED_DEADLINE
    sched/fair: Update util_est only on util_avg updates
    sched/cpufreq/schedutil: Use util_est for OPP selection
    sched/fair: Use util_est in LB and WU paths
    sched/fair: Add util_est on top of PELT
    sched/core: Remove TASK_ALL
    sched/completions: Use bool in try_wait_for_completion()
    sched/fair: Update blocked load when newly idle
    sched/fair: Move idle_balance()
    sched/nohz: Merge CONFIG_NO_HZ_COMMON blocks
    sched/fair: Move rebalance_domains()
    sched/nohz: Optimize nohz_idle_balance()
    sched/fair: Reduce the periodic update duration
    sched/nohz: Stop NOHZ stats when decayed
    sched/cpufreq: Provide migration hint
    sched/nohz: Clean up nohz enter/exit
    sched/fair: Update blocked load from NEWIDLE
    sched/fair: Add NOHZ stats balancing
    sched/fair: Restructure nohz_balance_kick()
    ...

    Linus Torvalds
     

20 Mar, 2018

3 commits

  • Scheduler debug stats include newlines that display out of alignment
    when prefixed by timestamps. For example, the dmesg utility:

    % echo t > /proc/sysrq-trigger
    % dmesg
    ...
    [ 83.124251]
    runnable tasks:
     S task PID tree-key switches prio wait-time sum-exec sum-sleep
    -----------------------------------------------------------------------------------------------------------

    At the same time, some syslog utilities (like rsyslog by default) don't
    like the additional newline control characters, saving lines like this
    to /var/log/messages:

    Mar 16 16:02:29 localhost kernel: #012runnable tasks:#012 S task PID tree-key ...
    ^^^^ ^^^^
    Clean these up by moving newline characters to their own SEQ_printf
    invocation. This leaves the /proc/sched_debug unchanged, but brings the
    entire output into alignment when prefixed:

    % echo t > /proc/sysrq-trigger
    % dmesg
    ...
    [ 62.410368] runnable tasks:
    [ 62.410368] S task PID tree-key switches prio wait-time sum-exec sum-sleep
    [ 62.410369] -----------------------------------------------------------------------------------------------------------
    [ 62.410369] I kworker/u12:0 5 1932.215593 332 120 0.000000 3.621252 0.000000 0 0 /

    and no escaped control characters from rsyslog in /var/log/messages:

    Mar 16 16:15:06 localhost kernel: runnable tasks:
    Mar 16 16:15:06 localhost kernel: S task PID tree-key ...
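
    The mechanical fix, sketched: give each newline its own SEQ_printf()
    invocation so the console path starts a fresh, timestamp-prefixed line:

      SEQ_printf(m, "\n");
      SEQ_printf(m, "runnable tasks:\n");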

    Signed-off-by: Joe Lawrence
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1521484555-8620-3-git-send-email-joe.lawrence@redhat.com
    Signed-off-by: Ingo Molnar

    Joe Lawrence
     
  • When the SEQ_printf() macro prints to the console, it runs a simple
    printk() without KERN_CONT "continued" line printing. The result of
    this is oddly wrapped task info, for example:

    % echo t > /proc/sysrq-trigger
    % dmesg
    ...
    runnable tasks:
    ...
    [ 29.608611] I
    [ 29.608613] rcu_sched 8 3252.013846 4087 120
    [ 29.608614] 0.000000 29.090111 0.000000
    [ 29.608615] 0 0
    [ 29.608616] /

    Modify SEQ_printf to use pr_cont() for expected one-line results:

    % echo t > /proc/sysrq-trigger
    % dmesg
    ...
    runnable tasks:
    ...
    [ 106.716329] S cpuhp/5 37 2006.315026 14 120 0.000000 0.496893 0.000000 0 0 /
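
    The macro after the change, approximately (the seq_file branch is
    untouched; only the console branch switches to pr_cont()):

      #define SEQ_printf(m, x...)                  \
       do {                                        \
              if (m)                               \
                      seq_printf(m, x);            \
              else                                 \
                      pr_cont(x);                  \
       } while (0)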

    Signed-off-by: Joe Lawrence
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1521484555-8620-2-git-send-email-joe.lawrence@redhat.com
    Signed-off-by: Ingo Molnar

    Joe Lawrence
     
  • The util_avg signal computed by PELT is too variable for some use-cases.
    For example, a big task waking up after a long sleep period will have its
    utilization almost completely decayed. This introduces some latency before
    schedutil will be able to pick the best frequency to run a task.

    The same issue can affect task placement. Indeed, since the task
    utilization is already decayed at wakeup, when the task is enqueued in a
    CPU, this can result in a CPU running a big task as being temporarily
    represented as being almost empty. This leads to a race condition where
    other tasks can be potentially allocated on a CPU which just started to run
    a big task which slept for a relatively long period.

    Moreover, the PELT utilization of a task can be updated every [ms], thus
    making it a continuously changing value for certain longer running
    tasks. This means that the instantaneous PELT utilization of a RUNNING
    task is not really meaningful to properly support scheduler decisions.

    For all these reasons, a more stable signal can do a better job of
    representing the expected/estimated utilization of a task/cfs_rq.
    Such a signal can be easily created on top of PELT by still using it as
    an estimator which produces values to be aggregated on meaningful
    events.

    This patch adds a simple implementation of util_est, a new signal built on
    top of PELT's util_avg where:

    util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))

    This allows us to remember how big a task has been reported to be by PELT
    in its previous activations via f(task::util_avg@dequeue), which is the new
    _task_util_est(struct task_struct*) function added by this patch.

    If a task should change its behavior and it runs longer in a new
    activation, after a certain time its util_est will just track the
    original PELT signal (i.e. task::util_avg).

    The estimated utilization of a cfs_rq is defined only for root ones.
    That's because the only sensible consumers of this signal are the
    scheduler and schedutil when looking for the overall CPU utilization
    due to FAIR tasks.

    For this reason, the estimated utilization of a root cfs_rq is simply
    defined as:

    util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)

    where:

    cfs_rq::util_est::enqueued = sum(_task_util_est(task))
    for each RUNNABLE task on that root cfs_rq

    It's worth noting that the estimated utilization is tracked only for
    objects of interests, specifically:

    - Tasks: to better support tasks placement decisions
    - root cfs_rqs: to better support both tasks placement decisions as
    well as frequencies selection
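
    A small sketch matching the task-side formula above (the struct/field
    layout is illustrative of the patch, not authoritative):

      /* f(util_avg@dequeue): the snapshot recorded when the task dequeues */
      static inline unsigned long _task_util_est(struct task_struct *p)
      {
              return READ_ONCE(p->se.avg.util_est.enqueued);
      }

      /* util_est(task) = max(util_avg, snapshot), as above */
      static inline unsigned long task_util_est(struct task_struct *p)
      {
              return max(READ_ONCE(p->se.avg.util_avg), _task_util_est(p));
      }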

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Rafael J . Wysocki
    Cc: Steve Muckle
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: http://lkml.kernel.org/r/20180309095245.11071-2-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     

04 Mar, 2018

1 commit

  • Do the following cleanups and simplifications:

    - sched/sched.h already includes , so no need to
    include it in sched/core.c again.

    - order the headers alphabetically

    - add all headers to kernel/sched/sched.h

    - remove all unnecessary includes from the .c files that
    are already included in kernel/sched/sched.h.

    Finally, make all scheduler .c files use a single common header:

    #include "sched.h"

    ... which now contains a union of the relied upon headers.

    This makes the various .c files easier to read and easier to handle.

    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

03 Mar, 2018

1 commit

  • A good number of small style inconsistencies have accumulated
    in the scheduler core, so do a pass over them to harmonize
    all these details:

    - fix spelling in comments,

    - use curly braces for multi-line statements,

    - remove unnecessary parentheses from integer literals,

    - capitalize consistently,

    - remove stray newlines,

    - add comments where necessary,

    - remove invalid/unnecessary comments,

    - align structure definitions and other data types vertically,

    - add missing newlines for increased readability,

    - fix vertical tabulation where it's misaligned,

    - harmonize preprocessor conditional block labeling
    and vertical alignment,

    - remove line-breaks where they uglify the code,

    - add newline after local variable definitions,

    No change in functionality:

    md5:
    1191fa0a890cfa8132156d2959d7e9e2 built-in.o.before.asm
    1191fa0a890cfa8132156d2959d7e9e2 built-in.o.after.asm

    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

30 Sep, 2017

3 commits

  • The load balancer uses runnable_load_avg as load indicator. For
    !cgroup this is:

    runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq

    That is, a direct sum of all runnable tasks on that runqueue. As
    opposed to load_avg, which is a sum of all tasks on the runqueue,
    which includes a blocked component.

    However, in the cgroup case, this comes apart since the group entities
    are always runnable, even if most of their constituent entities are
    blocked.

    Therefore introduce a runnable_weight which for task entities is the
    same as the regular weight, but for group entities is a fraction of
    the entity weight and represents the runnable part of the group
    runqueue.

    Then propagate this load through the PELT hierarchy to arrive at an
    effective runnable load average -- which we should not confuse with
    the canonical runnable load average.

    Suggested-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • When an entity migrates in (or out) of a runqueue, we need to add (or
    remove) its contribution from the entire PELT hierarchy, because even
    non-runnable entities are included in the load average sums.

    In order to do this we have some propagation logic that updates the
    PELT tree, however the way it 'propagates' the runnable (or load)
    change is (more or less):

                           tg->weight * grq->avg.load_avg
      ge->avg.load_avg = ----------------------------------
                                  tg->load_avg

    But that is the expression for ge->weight, and per the definition of
    load_avg:

    ge->avg.load_avg := ge->weight * ge->avg.runnable_avg

    That destroys the runnable_avg (by setting it to 1) we wanted to
    propagate.

    Instead directly propagate runnable_sum.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since on wakeup migration we don't hold the rq->lock for the old CPU
    we cannot update its state. Instead we add the removed 'load' to an
    atomic variable and have the next update on that CPU collect and
    process it.

    Currently we have 2 atomic variables, which already have the issue that
    they can be read out of sync. Also, two atomic ops on a single cacheline
    are already more expensive than an uncontended lock.

    Since we want to add more, convert the thing over to an explicit
    cacheline with a lock in.
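
    A sketch of the resulting cfs_rq member (the exact field set is
    indicative; the point is one lock-protected cacheline instead of several
    atomics):

      struct {
              raw_spinlock_t  lock ____cacheline_aligned;
              int             nr;
              unsigned long   load_avg;
              unsigned long   util_avg;
      } removed;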

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra