12 Feb, 2019

1 commit

  • These macros can be reused by governors that don't use the common
    governor code present in cpufreq_governor.c, so they should be moved to
    the relevant header.

    Now that they are getting moved to the right header file, reuse them in
    the schedutil governor as well (which required renaming its show/store
    routines).

    Also create a gov_attr_wo() macro for write-only sysfs files; it will be
    used by the Interactive governor in a later patch.
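
    For illustration, a userspace-only sketch of how such show/store attribute
    macros are typically stamped out (the govattr names and struct layout here
    are invented for this sketch, not the kernel's gov_attr definitions):

    #include <stdio.h>
    #include <stddef.h>

    /* A sysfs-like attribute: a name plus optional show/store callbacks. */
    struct govattr {
            const char *name;
            int (*show)(char *buf, size_t len);  /* NULL => not readable */
            int (*store)(const char *buf);       /* NULL => not writable */
    };

    /* Macros stamp out read-only, read-write and write-only attributes. */
    #define GOVATTR_RO(_name) \
            struct govattr _name = { #_name, _name##_show, NULL }
    #define GOVATTR_RW(_name) \
            struct govattr _name = { #_name, _name##_show, _name##_store }
    #define GOVATTR_WO(_name) \
            struct govattr _name = { #_name, NULL, _name##_store }

    static unsigned int rate_limit_val = 1000;

    static int rate_limit_us_show(char *buf, size_t len)
    {
            return snprintf(buf, len, "%u\n", rate_limit_val);
    }

    static int rate_limit_us_store(const char *buf)
    {
            return sscanf(buf, "%u", &rate_limit_val) == 1 ? 0 : -1;
    }

    static GOVATTR_RW(rate_limit_us);

    int main(void)
    {
            char buf[32];

            rate_limit_us.store("2500");
            rate_limit_us.show(buf, sizeof(buf));
            printf("%s: %s", rate_limit_us.name, buf);
            return 0;
    }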

    Signed-off-by: Viresh Kumar

    Viresh Kumar
     

23 Jan, 2019

1 commit

  • commit 512ac999d2755d2b7109e996a76b6fb8b888631d upstream.

    I noticed that cgroup task groups constantly get throttled even
    if they have low CPU usage; this causes jitter in the response
    time of some of our business containers when CPU quotas are enabled.

    It's very simple to reproduce:

    mkdir /sys/fs/cgroup/cpu/test
    cd /sys/fs/cgroup/cpu/test
    echo 100000 > cpu.cfs_quota_us
    echo $$ > tasks

    then repeat:

    cat cpu.stat | grep nr_throttled # nr_throttled will increase steadily

    After some analysis, we found that cfs_rq::runtime_remaining will
    be cleared by expire_cfs_rq_runtime() due to two equal but stale
    "cfs_{b|q}->runtime_expires" after the period timer is re-armed.

    The current condition used to judge clock drift in expire_cfs_rq_runtime()
    is wrong: the two runtime_expires are actually the same when clock
    drift happens, so this condition can never hit. The original design was
    correctly done by this commit:

    a9cf55b28610 ("sched: Expire invalid runtime")

    ... but was changed to be the current implementation due to its locking bug.

    This patch introduces another way: it adds a new field to both cfs_rq and
    cfs_bandwidth to record the expiration update sequence, and uses them to
    figure out whether clock drift has happened (true if they are equal).
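
    A minimal userspace model of the sequence-count idea (struct and function
    names are simplified stand-ins, not the kernel code):

    #include <stdio.h>

    struct bandwidth {                      /* global pool (cfs_bandwidth)   */
            unsigned long long expires;     /* when the current period ends  */
            int expires_seq;                /* bumped on every refresh       */
    };

    struct runqueue {                       /* per-rq state (cfs_rq)         */
            long long runtime_remaining;
            unsigned long long expires;     /* snapshot of bandwidth.expires */
            int expires_seq;                /* snapshot of the sequence      */
    };

    /* Period timer fired: hand out fresh runtime and bump the sequence. */
    static void refresh_period(struct bandwidth *b, unsigned long long now)
    {
            b->expires = now + 100000;      /* pretend the period is 100 us  */
            b->expires_seq++;
    }

    static void assign_runtime(struct runqueue *rq, struct bandwidth *b)
    {
            rq->runtime_remaining = 10000;
            rq->expires = b->expires;
            rq->expires_seq = b->expires_seq;
    }

    /* Called when the local clock says the snapshotted deadline has passed. */
    static void expire_runtime(struct runqueue *rq, struct bandwidth *b)
    {
            if (rq->expires_seq == b->expires_seq) {
                    /* Global pool was not refreshed: the apparent expiry is
                     * just per-CPU clock drift, so keep the local runtime. */
                    rq->expires += 1000;
            } else {
                    /* Global deadline really moved on: runtime has expired. */
                    rq->runtime_remaining = 0;
            }
    }

    int main(void)
    {
            struct bandwidth b = { 0, 0 };
            struct runqueue rq;

            refresh_period(&b, 0);
            assign_runtime(&rq, &b);

            expire_runtime(&rq, &b);        /* drift case: runtime is kept   */
            printf("after drift:   %lld\n", rq.runtime_remaining);

            refresh_period(&b, 100000);     /* a new period really started   */
            expire_runtime(&rq, &b);        /* stale case: runtime cleared   */
            printf("after refresh: %lld\n", rq.runtime_remaining);
            return 0;
    }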

    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    [alakeshh: backport: Fixed merge conflicts:
    - sched.h: Fix the indentation and order in which the variables are
    declared to match with coding style of the existing code in 4.14
    Struct members of same type were declared in separate lines in
    upstream patch which has been changed back to having multiple
    members of same type in the same line.
    e.g. int a; int b; -> int a, b; ]
    Signed-off-by: Alakesh Haloi
    Reviewed-by: Ben Segall
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: # 4.14.x
    Fixes: 51f2176d74ac ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")
    Link: http://lkml.kernel.org/r/20180620101834.24455-1-xlpang@linux.alibaba.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Xunlei Pang
     

13 Jan, 2019

1 commit

  • commit c40f7d74c741a907cfaeb73a7697081881c497d0 upstream.

    Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the
    scheduler under high loads, starting at around the v4.18 time frame,
    and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list
    manipulation.

    Do a (manual) revert of:

    a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")

    It turns out that the list_del_leaf_cfs_rq() introduced by this commit
    has a surprising property that was not considered in follow-up commits
    such as:

    9c2791f936ef ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list")

    As Vincent Guittot explains:

    "I think that there is a bigger problem with commit a9e7f6544b9c and
    cfs_rq throttling:

    Let's take the example of the following topology TG2 --> TG1 --> root:

    1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1
    cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in
    one path because it has never been used and can't be throttled so
    tmp_alone_branch will point to leaf_cfs_rq_list at the end.

    2) Then TG1 is throttled

    3) and we add TG3 as a new child of TG1.

    4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1
    cfs_rq and tmp_alone_branch will stay on rq->leaf_cfs_rq_list.

    With commit a9e7f6544b9c, we can del a cfs_rq from rq->leaf_cfs_rq_list.
    So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1
    cfs_rq is removed from the list.
    Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list
    but tmp_alone_branch still points to TG3 cfs_rq because its throttled
    parent can't be enqueued when the lock is released.
    tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should.

    So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch
    points to another TG cfs_rq, the next TG cfs_rq that will be added
    will be linked outside rq->leaf_cfs_rq_list - which is bad.

    In addition, we can break the ordering of the cfs_rq in
    rq->leaf_cfs_rq_list but this ordering is used to update and
    propagate the update from leaf down to root."
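
    As a toy illustration of that hazard - a cursor left pointing at a list
    node that has since been removed, so a later insertion lands outside the
    list - here is a self-contained doubly-linked list, not the kernel's list
    code:

    #include <stdio.h>

    struct node {
            const char *name;
            struct node *prev, *next;
    };

    /* Circular doubly-linked list with a dedicated head node. */
    static void list_init(struct node *head)
    {
            head->prev = head->next = head;
    }

    static void list_insert_before(struct node *pos, struct node *n)
    {
            n->prev = pos->prev;
            n->next = pos;
            pos->prev->next = n;
            pos->prev = n;
    }

    static void list_del(struct node *n)
    {
            n->prev->next = n->next;
            n->next->prev = n->prev;
            n->prev = n->next = n;          /* node now points only at itself */
    }

    static void dump(struct node *head)
    {
            for (struct node *n = head->next; n != head; n = n->next)
                    printf("  %s\n", n->name);
    }

    int main(void)
    {
            struct node head = { "head" }, tg1 = { "TG1" }, tg3 = { "TG3" };
            struct node *cursor;                 /* plays tmp_alone_branch    */

            list_init(&head);
            list_insert_before(&head, &tg1);     /* list: TG1                 */
            cursor = &tg1;                       /* "insert new children here" */

            list_del(&tg1);                      /* TG1's cfs_rq gets removed */

            list_insert_before(cursor, &tg3);    /* TG3 is linked to TG1,     */
                                                 /* which is outside the list */
            printf("list contents:\n");
            dump(&head);                         /* TG3 never shows up        */
            return 0;
    }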

    Instead of trying to work through all these cases and trying to reproduce
    the very high loads that produced the lockup to begin with, simplify
    the code temporarily by reverting a9e7f6544b9c - a change that was clearly
    not thought through completely.

    This (hopefully) gives us a kernel that doesn't lock up so people
    can continue to enjoy their holidays without worrying about regressions. ;-)

    [ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ]

    Analyzed-by: Xie XiuQi
    Analyzed-by: Vincent Guittot
    Reported-by: Zhipeng Xie
    Reported-by: Sargun Dhillon
    Reported-by: Xie XiuQi
    Tested-by: Zhipeng Xie
    Tested-by: Sargun Dhillon
    Signed-off-by: Linus Torvalds
    Acked-by: Vincent Guittot
    Cc: # v4.13+
    Cc: Bin Li
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Fixes: a9e7f6544b9c ("sched/fair: Fix O(nr_cgroups) in load balance path")
    Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

06 Dec, 2018

3 commits

  • commit 321a874a7ef85655e93b3206d0f36b4a6097f948 upstream

    Make the scheduler's 'sched_smt_present' static key globally available, so
    it can be used in the x86 speculation control code.

    Provide a query function and a stub for the CONFIG_SMP=n case.
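
    A rough userspace sketch of such a query plus a CONFIG_SMP=n stub (the key
    is modeled with a plain bool here, and the sched_smt_active() name is used
    for illustration):

    #include <stdbool.h>
    #include <stdio.h>

    #define CONFIG_SMP 1

    #ifdef CONFIG_SMP
    /* Stands in for the 'sched_smt_present' static key. */
    static bool sched_smt_key_enabled;

    static inline bool sched_smt_active(void)
    {
            return sched_smt_key_enabled;
    }
    #else
    /* UP build: there can be no SMT siblings, so the query compiles away. */
    static inline bool sched_smt_active(void) { return false; }
    #endif

    int main(void)
    {
    #ifdef CONFIG_SMP
            sched_smt_key_enabled = true;   /* as if SMT siblings came online */
    #endif
            printf("SMT active: %d\n", sched_smt_active());
            return 0;
    }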

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit c5511d03ec090980732e929c318a7a6374b5550e upstream

    Currently the 'sched_smt_present' static key is enabled when SMT topology
    is observed at CPU bringup, but it is never disabled. However, there is
    demand to also disable the key when the topology changes such that there
    is no SMT present anymore.

    Implement this by making the key count the number of cores that have SMT
    enabled.

    In particular, the SMT topology bits are set before interrupts are enabled
    and, similarly, are cleared after interrupts are disabled for the last time
    and the CPU dies.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Ingo Molnar
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Jiri Kosina
    Cc: Tom Lendacky
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Casey Schaufler
    Cc: Asit Mallick
    Cc: Arjan van de Ven
    Cc: Jon Masters
    Cc: Waiman Long
    Cc: Greg KH
    Cc: Dave Stewart
    Cc: Kees Cook
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20181125185004.246110444@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra (Intel)
     
  • commit ce48c146495a1a50e48cdbfbfaba3e708be7c07c upstream

    Tejun reported the following cpu-hotplug lock (percpu-rwsem) read recursion:

    tg_set_cfs_bandwidth()
      get_online_cpus()
        cpus_read_lock()

      cfs_bandwidth_usage_inc()
        static_key_slow_inc()
          cpus_read_lock()
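
    One common way to break this kind of read-lock recursion - sketched here
    with plain pthreads as the general pattern, not the kernel's actual fix -
    is to give the inner helper a variant that assumes the lock is already
    held by the caller:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_rwlock_t hotplug_lock = PTHREAD_RWLOCK_INITIALIZER;
    static int usage_count;

    /* Variant that assumes the caller already holds the (read) lock. */
    static void usage_inc_locked(void)
    {
            usage_count++;
    }

    /* Stand-alone variant: takes the lock itself. */
    static void usage_inc(void)
    {
            pthread_rwlock_rdlock(&hotplug_lock);
            usage_inc_locked();
            pthread_rwlock_unlock(&hotplug_lock);
    }

    /* A caller that already holds the lock must use the _locked variant;
     * calling usage_inc() here would be the kind of nested acquisition
     * shown in the trace above. */
    static void set_bandwidth(void)
    {
            pthread_rwlock_rdlock(&hotplug_lock);
            usage_inc_locked();
            pthread_rwlock_unlock(&hotplug_lock);
    }

    int main(void)
    {
            set_bandwidth();
            usage_inc();
            printf("usage_count = %d\n", usage_count);
            return 0;
    }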

    Reported-by: Tejun Heo
    Tested-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180122215328.GP3397@worktop
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

27 Nov, 2018

1 commit

  • [ Upstream commit 40fa3780bac2b654edf23f6b13f4e2dd550aea10 ]

    When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
    for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
    Juno or HiKey960), we get the following report:

    [ 0.748225] Call trace:
    [ 0.750685] lockdep_assert_cpus_held+0x30/0x40
    [ 0.755236] static_key_enable_cpuslocked+0x20/0xc8
    [ 0.760137] build_sched_domains+0x1034/0x1108
    [ 0.764601] sched_init_domains+0x68/0x90
    [ 0.768628] sched_init_smp+0x30/0x80
    [ 0.772309] kernel_init_freeable+0x278/0x51c
    [ 0.776685] kernel_init+0x10/0x108
    [ 0.780190] ret_from_fork+0x10/0x18

    The static_key in question is 'sched_asym_cpucapacity' introduced by
    commit:

    df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")

    In this particular case, we enable it because smp_prepare_cpus() will
    end up fetching the capacity-dmips-mhz entry from the devicetree,
    so we already have some asymmetry detected when entering sched_init_smp().

    This didn't get detected in tip/sched/core because we were missing:

    commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")

    Calls to build_sched_domains() post sched_init_smp() will hold the
    hotplug lock, it just so happens that this very first call is a
    special case. As stated by a comment in sched_init_smp(), "There's no
    userspace yet to cause hotplug operations" so this is a harmless
    warning.

    However, to both respect the semantics of underlying
    callees and make lockdep happy, take the hotplug lock in
    sched_init_smp(). This also satisfies the comment atop
    sched_init_domains() that says "Callers must hold the hotplug lock".

    Reported-by: Sudeep Holla
    Tested-by: Sudeep Holla
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dietmar.Eggemann@arm.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: morten.rasmussen@arm.com
    Cc: quentin.perret@arm.com
    Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Valentin Schneider
     

14 Nov, 2018

1 commit

  • [ Upstream commit 9845c49cc9bbb317a0bc9e9cf78d8e09d54c9af0 ]

    The comment and the code around the update_min_vruntime() call in
    dequeue_entity() are not in agreement.

    From commit:

    b60205c7c558 ("sched/fair: Fix min_vruntime tracking")

    I think that we want to update min_vruntime when a task is sleeping/migrating.
    So, the check is inverted there - fix it.

    Signed-off-by: Song Muchun
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: b60205c7c558 ("sched/fair: Fix min_vruntime tracking")
    Link: http://lkml.kernel.org/r/20181014112612.2614-1-smuchun@gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Song Muchun
     

10 Nov, 2018

1 commit

  • commit baa9be4ffb55876923dc9716abc0a448e510ba30 upstream.

    With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
    distribute_cfs_runtime may not empty the throttled_list before it runs
    out of runtime to distribute. In that case, due to the change from
    c06f04c7048 to put throttled entries at the head of the list, later entries
    on the list will starve. Essentially, the same X processes will get pulled
    off the list, given CPU time and then, when expired, get put back on the
    head of the list where distribute_cfs_runtime will give runtime to the same
    set of processes leaving the rest.

    Fix the issue by setting a bit in struct cfs_bandwidth when
    distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
    decide to put the throttled entry on the tail or the head of the list. The
    bit is set/cleared by the callers of distribute_cfs_runtime while they hold
    cfs_bandwidth->lock.
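
    A small array-based sketch of that head-vs-tail decision (the flag and the
    list below are toy stand-ins for cfs_bandwidth and throttled_list):

    #include <stdio.h>
    #include <stdbool.h>

    #define NR 4

    /* Tiny queue of throttled group IDs; index 0 is the head of the list. */
    struct throttled_list {
            int id[NR];
            int len;
            bool distribute_running;   /* models the new cfs_bandwidth flag */
    };

    static void add_head(struct throttled_list *l, int id)
    {
            for (int i = l->len; i > 0; i--)
                    l->id[i] = l->id[i - 1];
            l->id[0] = id;
            l->len++;
    }

    static void add_tail(struct throttled_list *l, int id)
    {
            l->id[l->len++] = id;
    }

    /* throttle: while runtime is being distributed, requeue at the tail so
     * the same few entries can't keep hogging the head of the list. */
    static void throttle(struct throttled_list *l, int id)
    {
            if (l->distribute_running)
                    add_tail(l, id);
            else
                    add_head(l, id);
    }

    int main(void)
    {
            struct throttled_list l = { .len = 0, .distribute_running = false };

            throttle(&l, 1);
            throttle(&l, 2);                 /* head insertion: 2, 1         */

            l.distribute_running = true;     /* distribute_cfs_runtime runs  */
            throttle(&l, 3);                 /* goes to the tail: 2, 1, 3    */
            l.distribute_running = false;

            for (int i = 0; i < l.len; i++)
                    printf("%d ", l.id[i]);
            printf("\n");
            return 0;
    }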

    This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
    the live system. In some cases you can simply look at the throttled list and
    see the later entries are not changing:

    crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
    1 ffff90b56cb2d200 -976050
    2 ffff90b56cb2cc00 -484925
    3 ffff90b56cb2bc00 -658814
    4 ffff90b56cb2ba00 -275365
    5 ffff90b166a45600 -135138
    6 ffff90b56cb2da00 -282505
    7 ffff90b56cb2e000 -148065
    8 ffff90b56cb2fa00 -872591
    9 ffff90b56cb2c000 -84687
    10 ffff90b56cb2f000 -87237
    11 ffff90b166a40a00 -164582

    crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
    1 ffff90b56cb2d200 -994147
    2 ffff90b56cb2cc00 -306051
    3 ffff90b56cb2bc00 -961321
    4 ffff90b56cb2ba00 -24490
    5 ffff90b166a45600 -135138
    6 ffff90b56cb2da00 -282505
    7 ffff90b56cb2e000 -148065
    8 ffff90b56cb2fa00 -872591
    9 ffff90b56cb2c000 -84687
    10 ffff90b56cb2f000 -87237
    11 ffff90b166a40a00 -164582

    Sometimes it is easier to see by finding a process getting starved and looking
    at the sched_info:

    crash> task ffff8eb765994500 sched_info
    PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
    sched_info = {
    pcount = 8,
    run_delay = 697094208,
    last_arrival = 240260125039,
    last_queued = 240260327513
    },
    crash> task ffff8eb765994500 sched_info
    PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
    sched_info = {
    pcount = 8,
    run_delay = 697094208,
    last_arrival = 240260125039,
    last_queued = 240260327513
    },

    Signed-off-by: Phil Auld
    Reviewed-by: Ben Segall
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
    Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Phil Auld
     

29 Sep, 2018

1 commit

  • commit d0cdb3ce8834332d918fc9c8ff74f8a169ec9abe upstream.

    When a task which previously ran on a given CPU is remotely queued to
    wake up on that same CPU, there is a period where the task's state is
    TASK_WAKING and its vruntime is not normalized. This is not accounted
    for in vruntime_normalized() which will cause an error in the task's
    vruntime if it is switched from the fair class during this time.

    For example if it is boosted to RT priority via rt_mutex_setprio(),
    rq->min_vruntime will not be subtracted from the task's vruntime but
    it will be added again when the task returns to the fair class. The
    task's vruntime will have been erroneously doubled and the effective
    priority of the task will be reduced.

    Note this will also lead to inflation of all vruntimes since the doubled
    vruntime value will become the rq's min_vruntime when other tasks leave
    the rq. This leads to repeated doubling of the vruntime and priority
    penalty.

    Fix this by recognizing a WAKING task's vruntime as normalized only if
    sched_remote_wakeup is true. This indicates a migration, in which case
    the vruntime would have been normalized in migrate_task_rq_fair().
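
    Roughly the shape of the corrected check, in self-contained form (field
    and function names are simplified stand-ins for the kernel's):

    #include <stdbool.h>
    #include <stdio.h>

    enum state { RUNNING, WAKING, SLEEPING };

    struct task {
            enum state state;
            bool sched_remote_wakeup;   /* set on the remote-wakeup migration path */
            bool on_rq;
    };

    /*
     * A task caught in the WAKING state only has a normalized vruntime if it
     * was a remote wakeup (it went through the migration path); a wakeup back
     * onto the same CPU never subtracted min_vruntime, so it must not be
     * treated as normalized.
     */
    static bool vruntime_is_normalized(const struct task *p)
    {
            if (p->state == WAKING)
                    return p->sched_remote_wakeup;
            if (!p->on_rq)
                    return true;            /* sleeping tasks are normalized */
            return false;
    }

    int main(void)
    {
            struct task local  = { WAKING, false, false };
            struct task remote = { WAKING, true,  false };

            printf("local wakeup  normalized? %d\n", vruntime_is_normalized(&local));
            printf("remote wakeup normalized? %d\n", vruntime_is_normalized(&remote));
            return 0;
    }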

    Based on a similar patch from John Dias.

    Suggested-by: Peter Zijlstra
    Tested-by: Dietmar Eggemann
    Signed-off-by: Steve Muckle
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Redpath
    Cc: John Dias
    Cc: Linus Torvalds
    Cc: Miguel de Dios
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Paul Turner
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: kernel-team@android.com
    Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")
    Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steve Muckle
     

26 Sep, 2018

2 commits

  • [ Upstream commit 8fe5c5a937d0f4e84221631833a2718afde52285 ]

    When a new task wakes-up for the first time, its initial utilization
    is set to half of the spare capacity of its CPU. The current
    implementation of post_init_entity_util_avg() uses SCHED_CAPACITY_SCALE
    directly as a capacity reference. As a result, on a big.LITTLE system, a
    new task waking up on an idle little CPU will be given ~512 of util_avg,
    even if the CPU's capacity is significantly less than that.

    Fix this by computing the spare capacity with arch_scale_cpu_capacity().
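
    The arithmetic, in back-of-the-envelope form (capacity numbers are made up
    and the real helper does additional clamping):

    #include <stdio.h>

    #define SCHED_CAPACITY_SCALE 1024   /* capacity of the biggest CPU */

    /*
     * Give a brand-new task half of the CPU's spare capacity rather than half
     * of SCHED_CAPACITY_SCALE, so an idle little CPU doesn't hand out ~512 of
     * util_avg to a freshly forked task.
     */
    static unsigned int initial_util(unsigned int cpu_capacity,
                                     unsigned int cfs_rq_util_avg)
    {
            long spare = (long)cpu_capacity - (long)cfs_rq_util_avg;

            if (spare <= 0)
                    return 0;
            return (unsigned int)(spare / 2);
    }

    int main(void)
    {
            /* idle big CPU vs. idle little CPU (capacities are illustrative) */
            printf("big:    %u\n", initial_util(SCHED_CAPACITY_SCALE, 0));
            printf("little: %u\n", initial_util(446, 0));
            return 0;
    }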

    Signed-off-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Vincent Guittot
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dietmar.eggemann@arm.com
    Cc: morten.rasmussen@arm.com
    Cc: patrick.bellasi@arm.com
    Link: http://lkml.kernel.org/r/20180612112215.25448-1-quentin.perret@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Quentin Perret
     
  • [ Upstream commit 76e079fefc8f62bd9b2cd2950814d1ee806e31a5 ]

    wake_woken_function() synchronizes with wait_woken() as follows:

    [wait_woken]                        [wake_woken_function]

    entry->flags &= ~wq_flag_woken;     condition = true;
    smp_mb();                           smp_wmb();
    if (condition)                      wq_entry->flags |= wq_flag_woken;
       break;

    This commit replaces the above smp_wmb() with an smp_mb() in order to
    guarantee that either wait_woken() sees the wait condition being true
    or the store to wq_entry->flags in woken_wake_function() follows the
    store in wait_woken() in the coherence order (so that the former can
    eventually be observed by wait_woken()).
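
    A userspace rendering of the corrected protocol using C11 fences; it
    mirrors the shape of the two paths above rather than the kernel code, and
    the waiter spins instead of sleeping so the demo always terminates:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_bool condition;
    static atomic_int  flags;           /* bit 0 plays WQ_FLAG_WOKEN */

    static void *waker(void *arg)
    {
            (void)arg;
            atomic_store_explicit(&condition, true, memory_order_relaxed);
            /* Full barrier (the smp_mb() the patch adds); a store-only
             * barrier here would not pair with the waiter's full barrier. */
            atomic_thread_fence(memory_order_seq_cst);
            atomic_fetch_or_explicit(&flags, 1, memory_order_relaxed);
            return NULL;
    }

    static void *waiter(void *arg)
    {
            (void)arg;
            atomic_fetch_and_explicit(&flags, ~1, memory_order_relaxed);
            atomic_thread_fence(memory_order_seq_cst);   /* the waiter's smp_mb() */
            if (atomic_load_explicit(&condition, memory_order_relaxed)) {
                    printf("condition seen, no sleep needed\n");
                    return NULL;
            }
            /* In the kernel the task would now sleep until WOKEN is set;
             * here we just spin so the demo finishes. */
            while (!(atomic_load_explicit(&flags, memory_order_relaxed) & 1))
                    ;
            printf("woken via the WOKEN flag\n");
            return NULL;
    }

    int main(void)
    {
            pthread_t a, b;

            pthread_create(&a, NULL, waiter, NULL);
            pthread_create(&b, NULL, waker, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            return 0;
    }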

    The commit also fixes a comment associated to set_current_state() in
    wait_woken(): the comment pairs the barrier in set_current_state() to
    the above smp_wmb(), while the actual pairing involves the barrier in
    set_current_state() and the barrier executed by the try_to_wake_up()
    in wake_woken_function().

    Signed-off-by: Andrea Parri
    Signed-off-by: Paul E. McKenney
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akiyks@gmail.com
    Cc: boqun.feng@gmail.com
    Cc: dhowells@redhat.com
    Cc: j.alglave@ucl.ac.uk
    Cc: linux-arch@vger.kernel.org
    Cc: luc.maranget@inria.fr
    Cc: npiggin@gmail.com
    Cc: parri.andrea@gmail.com
    Cc: stern@rowland.harvard.edu
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/20180716180605.16115-10-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Andrea Parri
     

15 Sep, 2018

1 commit

  • commit 295d6d5e373607729bcc8182c25afe964655714f upstream.

    Fix a bug introduced in:

    72f9f3fdc928 ("sched/deadline: Remove dl_new from struct sched_dl_entity")

    After that commit, when switching to -deadline if the scheduling
    deadline of a task is in the past then switched_to_dl() calls
    setup_new_entity() to properly initialize the scheduling deadline
    and runtime.

    The problem is that the task is enqueued _before_ having its parameters
    initialized by setup_new_entity(), and this can cause problems.
    For example, a task with its out-of-date deadline in the past will
    potentially be enqueued as the highest priority one; however, its
    adjusted deadline may not be the earliest one.

    This patch fixes the problem by initializing the task's parameters before
    enqueuing it.

    Signed-off-by: luca abeni
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Bristot de Oliveira
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mathieu Poirier
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1504778971-13573-3-git-send-email-luca.abeni@santannapisa.it
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Luca Abeni
     

05 Sep, 2018

1 commit

  • [ Upstream commit f3d133ee0a17d5694c6f21873eec9863e11fa423 ]

    The NO_RT_RUNTIME_SHARE feature is used to prevent a CPU running a
    spinning RT task from borrowing extra runtime.

    However, if the RT_RUNTIME_SHARE feature is enabled and an rt_rq has
    borrowed enough rt_runtime at the beginning, rt_runtime can't be restored
    to its initial bandwidth after we disable RT_RUNTIME_SHARE.

    E.g. on my PC with 4 cores, procedure to reproduce:
    1) Make sure RT_RUNTIME_SHARE is enabled
    cat /sys/kernel/debug/sched_features
    GENTLE_FAIR_SLEEPERS START_DEBIT NO_NEXT_BUDDY LAST_BUDDY
    CACHE_HOT_BUDDY WAKEUP_PREEMPTION NO_HRTICK NO_DOUBLE_TICK
    LB_BIAS NONTASK_CAPACITY TTWU_QUEUE NO_SIS_AVG_CPU SIS_PROP
    NO_WARN_DOUBLE_CLOCK RT_PUSH_IPI RT_RUNTIME_SHARE NO_LB_MIN
    ATTACH_AGE_LOAD WA_IDLE WA_WEIGHT WA_BIAS
    2) Start a spin-rt-task
    ./loop_rr &
    3) set affinity to the last cpu
    taskset -p 8 $pid_of_loop_rr
    4) Observe that the last cpu has borrowed enough runtime.
    cat /proc/sched_debug | grep rt_runtime
    .rt_runtime : 950.000000
    .rt_runtime : 900.000000
    .rt_runtime : 950.000000
    .rt_runtime : 1000.000000
    5) Disable RT_RUNTIME_SHARE
    echo NO_RT_RUNTIME_SHARE > /sys/kernel/debug/sched_features
    6) Observe that rt_runtime cannot be restored
    cat /proc/sched_debug | grep rt_runtime
    .rt_runtime : 950.000000
    .rt_runtime : 900.000000
    .rt_runtime : 950.000000
    .rt_runtime : 1000.000000

    This patch helps to restore rt_runtime after we disable
    RT_RUNTIME_SHARE.

    Signed-off-by: Hailong Liu
    Signed-off-by: Jiang Biao
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: zhong.weidong@zte.com.cn
    Link: http://lkml.kernel.org/r/1531874815-39357-1-git-send-email-liu.hailong6@zte.com.cn
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Hailong Liu
     

16 Aug, 2018

1 commit

  • commit ba2591a5993eabcc8e874e30f361d8ffbb10d6d4 upstream

    The static key sched_smt_present is only updated at boot time when SMT
    siblings have been detected. Booting with maxcpus=1 and bringing the
    siblings online after boot rebuilds the scheduling domains correctly but
    does not update the static key, so the SMT code is not enabled.

    Let the key be updated in the scheduler CPU hotplug code to fix this.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Konrad Rzeszutek Wilk
    Acked-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

08 Jul, 2018

2 commits

  • [ Upstream commit 7af443ee1697607541c6346c87385adab2214743 ]

    select_task_rq() is used in a few paths to select the CPU upon which a
    thread should be run - for example it is used by try_to_wake_up() & by
    fork or exec balancing. As-is it allows use of any online CPU that is
    present in the task's cpus_allowed mask.

    This presents a problem because there is a period whilst CPUs are
    brought online where a CPU is marked online, but is not yet fully
    initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <
    CPUHP_ONLINE. Usually we don't run any user tasks during this window,
    but there are corner cases where this can happen. An example observed
    is:

    - Some user task A, running on CPU X, forks to create task B.

    - sched_fork() calls __set_task_cpu() with cpu=X, setting task B's
    task_struct::cpu field to X.

    - CPU X is offlined.

    - Task A, currently somewhere between the __set_task_cpu() in
    copy_process() and the call to wake_up_new_task(), is migrated to
    CPU Y by migrate_tasks() when CPU X is offlined.

    - CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The
    scheduler is now active on CPU X, but there are no user tasks on
    the runqueue.

    - Task A runs on CPU Y & reaches wake_up_new_task(). This calls
    select_task_rq() with cpu=X, taken from task B's task_struct,
    and select_task_rq() allows CPU X to be returned.

    - Task A enqueues task B on CPU X's runqueue, via activate_task() &
    enqueue_task().

    - CPU X now has a user task on its runqueue before it has reached the
    CPUHP_ONLINE state.

    In most cases, the user tasks that schedule on the newly onlined CPU
    have no idea that anything went wrong, but one case observed to be
    problematic is if the task goes on to invoke the sched_setaffinity
    syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state
    before the CPU that brought it online calls stop_machine_unpark(). This
    means that for a portion of the window of time between
    CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct
    cpu_stopper has its enabled field set to false. If a user thread is
    executed on the CPU during this window and it invokes sched_setaffinity
    with a CPU mask that does not include the CPU it's running on, then when
    __set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke
    migration_cpu_stop() and perform the actual migration away from the CPU
    it will simply return -ENOENT rather than calling migration_cpu_stop().
    We then return from the sched_setaffinity syscall back to the user task
    that is now running on a CPU which it just asked not to run on, and
    which is not present in its cpus_allowed mask.

    This patch resolves the problem by having select_task_rq() enforce that
    user tasks run on CPUs that are active - the same requirement that
    select_fallback_rq() already enforces. This should ensure that newly
    onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to
    schedule user tasks, and also implies that bringup_wait_for_ap() will
    have called stop_machine_unpark() which resolves the sched_setaffinity
    issue above.
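
    A toy sketch of the rule being enforced - user tasks only go to CPUs that
    are both allowed and active, with a fallback otherwise (plain arrays stand
    in for cpumasks):

    #include <stdio.h>

    #define NR_CPUS 8

    /*
     * Pick a CPU for a user task: the preferred CPU is only used if it is
     * both allowed and active; otherwise fall back to any allowed + active
     * CPU, roughly what select_fallback_rq() has always insisted on.
     */
    static int pick_cpu(int preferred,
                        const int allowed[NR_CPUS],
                        const int active[NR_CPUS])
    {
            if (allowed[preferred] && active[preferred])
                    return preferred;

            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                    if (allowed[cpu] && active[cpu])
                            return cpu;
            return -1;      /* nothing usable */
    }

    int main(void)
    {
            int allowed[NR_CPUS] = { 1, 1, 1, 1, 1, 1, 1, 1 };
            int active[NR_CPUS]  = { 1, 1, 1, 0, 1, 1, 1, 1 }; /* CPU3 still booting */

            /* task_struct::cpu said 3, but CPU3 is online-but-not-active */
            printf("chosen CPU: %d\n", pick_cpu(3, allowed, active));
            return 0;
    }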

    I haven't yet investigated them, but it may be of interest to review
    whether any of the actions performed by hotplug states between
    CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended
    effects on user tasks that might schedule before they are reached, which
    might widen the scope of the problem from just affecting the behaviour
    of sched_setaffinity.

    Signed-off-by: Paul Burton
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180526154648.11635-2-paul.burton@mips.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paul Burton
     
  • [ Upstream commit 175f0e25abeaa2218d431141ce19cf1de70fa82d ]

    As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules
    for running on an online && !active CPU are stricter than just being a
    kthread: you need to be a per-CPU kthread.

    If you're not strictly per-CPU, you have better CPUs to run on and
    don't need the partially booted one to get your work done.

    The exception is to allow smpboot threads to bootstrap the CPU itself
    and get kernel 'services' initialized before we allow userspace on it.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Fixes: 955dbdf4ce87 ("sched: Allow migrating kthreads into online but inactive CPUs")
    Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

21 Jun, 2018

3 commits

  • [ Upstream commit 3febfc8a219a036633b57a34c6678e21b6a0580d ]

    Since the grub_reclaim() function can be made static, make it so.

    Silences the following GCC warning (W=1):

    kernel/sched/deadline.c:1120:5: warning: no previous prototype for ‘grub_reclaim’ [-Wmissing-prototypes]

    Signed-off-by: Mathieu Malaterre
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180516200902.959-1-malat@debian.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Malaterre
     
  • [ Upstream commit f6a3463063f42d9fb2c78f386437a822e0ad1792 ]

    In the following commit:

    6b55c9654fcc ("sched/debug: Move print_cfs_rq() declaration to kernel/sched/sched.h")

    the print_cfs_rq() prototype was added to kernel/sched/sched.h,
    right next to the prototypes for print_cfs_stats(), print_rt_stats()
    and print_dl_stats().

    Finish this previous commit and also move related prototypes for
    print_rt_rq() and print_dl_rq().

    Remove the existing extern declarations now that they are not needed anymore.

    Silences the following GCC warning, triggered by W=1:

    kernel/sched/debug.c:573:6: warning: no previous prototype for ‘print_rt_rq’ [-Wmissing-prototypes]
    kernel/sched/debug.c:603:6: warning: no previous prototype for ‘print_dl_rq’ [-Wmissing-prototypes]

    Signed-off-by: Mathieu Malaterre
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180516195348.30426-1-malat@debian.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Malaterre
     
  • [ Upstream commit b5bf9a90bbebffba888c9144c5a8a10317b04064 ]

    Gaurav reported a perceived problem with TASK_PARKED, which turned out
    to be a broken wait-loop pattern in __kthread_parkme(), but the
    reported issue can (and does) in fact happen for states that do not do
    condition based sleeps.

    When the 'current->state = TASK_RUNNING' store of a previous
    (concurrent) try_to_wake_up() collides with the setting of a 'special'
    sleep state, we can lose the sleep state.

    Normal condition based wait-loops are immune to this problem, but sleep
    states that are not condition based are subject to it.

    There already is a fix for TASK_DEAD. Abstract that and also apply it
    to TASK_STOPPED and TASK_TRACED, both of which are also without a
    condition based wait-loop.

    Reported-by: Gaurav Kohli
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Oleg Nesterov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

30 May, 2018

1 commit

  • [ Upstream commit d29a20645d5e929aa7e8616f28e5d8e1c49263ec ]

    While running rt-tests' pi_stress program I got the following splat:

    rq->clock_update_flags < RQCF_ACT_SKIP
    WARNING: CPU: 27 PID: 0 at kernel/sched/sched.h:960 assert_clock_updated.isra.38.part.39+0x13/0x20

    [...]


    enqueue_top_rt_rq+0xf4/0x150
    ? cpufreq_dbs_governor_start+0x170/0x170
    sched_rt_rq_enqueue+0x65/0x80
    sched_rt_period_timer+0x156/0x360
    ? sched_rt_rq_enqueue+0x80/0x80
    __hrtimer_run_queues+0xfa/0x260
    hrtimer_interrupt+0xcb/0x220
    smp_apic_timer_interrupt+0x62/0x120
    apic_timer_interrupt+0xf/0x20

    [...]

    do_idle+0x183/0x1e0
    cpu_startup_entry+0x5f/0x70
    start_secondary+0x192/0x1d0
    secondary_startup_64+0xa5/0xb0

    We can get rid of it by the "traditional" means of adding an
    update_rq_clock() call after acquiring the rq->lock in
    do_sched_rt_period_timer().

    The case for the RT task throttling (which this workload also hits)
    can be ignored in that the skip_update call is actually bogus and
    quite the contrary (the request bits are removed/reverted).

    By setting RQCF_UPDATED we really don't care if the skip is happening
    or not and will therefore make the assert_clock_updated() check happy.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Matt Fleming
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: linux-kernel@vger.kernel.org
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/20180402164954.16255-1-dave@stgolabs.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Davidlohr Bueso
     

16 May, 2018

2 commits

  • commit 354d7793070611b4df5a79fbb0f12752d0ed0cc5 upstream.

    > kernel/sched/autogroup.c:230 proc_sched_autogroup_set_nice() warn: potential spectre issue 'sched_prio_to_weight'

    Userspace controls @nice, sanitize the array index.
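
    A plain userspace rendering of the branchless index clamp used for this
    kind of sanitization (array_index_nospec() in mainline); the table
    contents below are made up, and the mask trick relies on an arithmetic
    right shift, as the kernel's generic helper does:

    #include <stdio.h>

    static const int prio_to_weight[40] = { [20] = 1024 };

    /* All ones when idx < size, all zeroes otherwise, with no branch. */
    static unsigned long index_mask(unsigned long idx, unsigned long size)
    {
            return (unsigned long)(~(long)(idx | (size - 1UL - idx)) >>
                                   (sizeof(long) * 8 - 1));
    }

    static int weight_of_nice(int nice)
    {
            unsigned long idx = (unsigned long)(nice + 20);  /* nice is -20..19 */

            if (idx >= 40)
                    return -1;
            /* Even if the bounds check above is mispredicted, the masked
             * index cannot reach past the end of the array. */
            idx &= index_mask(idx, 40);
            return prio_to_weight[idx];
    }

    int main(void)
    {
            printf("weight(0)   = %d\n", weight_of_nice(0));
            printf("weight(100) = %d\n", weight_of_nice(100));  /* rejected */
            return 0;
    }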

    Reported-by: Dan Carpenter
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc:
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     
  • commit 97739501f207efe33145b918817f305b822987f8 upstream.

    If the next_freq field of struct sugov_policy is set to UINT_MAX,
    it shouldn't be used for updating the CPU frequency (this is a
    special "invalid" value), but after commit b7eaf1aab9f8 (cpufreq:
    schedutil: Avoid reducing frequency of busy CPUs prematurely) it
    may be passed as the new frequency to sugov_update_commit() in
    sugov_update_single().

    Fix that by adding an extra check for the special UINT_MAX value
    of next_freq to sugov_update_single().
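
    A sketch of the guard (struct and function names here are invented; the
    point is only the check against the UINT_MAX sentinel):

    #include <limits.h>
    #include <stdio.h>

    struct policy {
            unsigned int next_freq;   /* UINT_MAX means "no valid request yet" */
            unsigned int cur;
    };

    /*
     * The busy-CPU shortcut reuses the previously requested frequency, so it
     * must first make sure that value is not the UINT_MAX sentinel.
     */
    static unsigned int choose_freq(struct policy *p, unsigned int computed,
                                    int cpu_is_busy)
    {
            if (cpu_is_busy && p->next_freq != UINT_MAX)
                    return p->next_freq;    /* don't lower a busy CPU's freq */
            p->next_freq = computed;
            return computed;
    }

    int main(void)
    {
            struct policy p = { .next_freq = UINT_MAX, .cur = 1200000 };

            /* First update after (re)start: the sentinel must not leak
             * through, even though the CPU looks busy. */
            printf("%u\n", choose_freq(&p, 800000, 1));
            printf("%u\n", choose_freq(&p, 600000, 1)); /* shortcut applies now */
            return 0;
    }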

    Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
    Reported-by: Viresh Kumar
    Cc: 4.12+ # 4.12+
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     

19 Mar, 2018

2 commits

  • [ Upstream commit a0982dfa03efca6c239c52cabebcea4afb93ea6b ]

    The rcutorture test suite occasionally provokes a splat due to invoking
    resched_cpu() on an offline CPU:

    WARNING: CPU: 2 PID: 8 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
    Modules linked in:
    CPU: 2 PID: 8 Comm: rcu_preempt Not tainted 4.14.0-rc4+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff902ede9daf00 task.stack: ffff96c50010c000
    RIP: 0010:native_smp_send_reschedule+0x37/0x40
    RSP: 0018:ffff96c50010fdb8 EFLAGS: 00010096
    RAX: 000000000000002e RBX: ffff902edaab4680 RCX: 0000000000000003
    RDX: 0000000080000003 RSI: 0000000000000000 RDI: 00000000ffffffff
    RBP: ffff96c50010fdb8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: 00000000299f36ae R12: 0000000000000001
    R13: ffffffff9de64240 R14: 0000000000000001 R15: ffffffff9de64240
    FS: 0000000000000000(0000) GS:ffff902edfc80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000f7d4c642 CR3: 000000001e0e2000 CR4: 00000000000006e0
    Call Trace:
    resched_curr+0x8f/0x1c0
    resched_cpu+0x2c/0x40
    rcu_implicit_dynticks_qs+0x152/0x220
    force_qs_rnp+0x147/0x1d0
    ? sync_rcu_exp_select_cpus+0x450/0x450
    rcu_gp_kthread+0x5a9/0x950
    kthread+0x142/0x180
    ? force_qs_rnp+0x1d0/0x1d0
    ? kthread_create_on_node+0x40/0x40
    ret_from_fork+0x27/0x40
    Code: 14 01 0f 92 c0 84 c0 74 14 48 8b 05 14 4f f4 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 38 89 ca 9d e8 e5 56 08 00 ff 5d c3 0f 1f 44 00 00 8b 05 52 9e 37 02 85 c0 75 38 55 48
    ---[ end trace 26df9e5df4bba4ac ]---

    This splat cannot be generated by expedited grace periods because they
    always invoke resched_cpu() on the current CPU, which is good because
    expedited grace periods require that resched_cpu() unconditionally
    succeed. However, other parts of RCU can tolerate resched_cpu() acting
    as a no-op, at least as long as it doesn't happen too often.

    This commit therefore makes resched_cpu() invoke resched_curr() only if
    the CPU is either online or is the current CPU.

    Signed-off-by: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     
  • [ Upstream commit 2fe2582649aa2355f79acddb86bd4d6c5363eb63 ]

    The rcutorture test suite occasionally provokes a splat due to invoking
    rt_mutex_lock() which needs to boost the priority of a task currently
    sitting on a runqueue that belongs to an offline CPU:

    WARNING: CPU: 0 PID: 12 at /home/paulmck/public_git/linux-rcu/arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x37/0x40
    Modules linked in:
    CPU: 0 PID: 12 Comm: rcub/7 Not tainted 4.14.0-rc4+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
    task: ffff9ed3de5f8cc0 task.stack: ffffbbf80012c000
    RIP: 0010:native_smp_send_reschedule+0x37/0x40
    RSP: 0018:ffffbbf80012fd10 EFLAGS: 00010082
    RAX: 000000000000002f RBX: ffff9ed3dd9cb300 RCX: 0000000000000004
    RDX: 0000000080000004 RSI: 0000000000000086 RDI: 00000000ffffffff
    RBP: ffffbbf80012fd10 R08: 000000000009da7a R09: 0000000000007b9d
    R10: 0000000000000001 R11: ffffffffbb57c2cd R12: 000000000000000d
    R13: ffff9ed3de5f8cc0 R14: 0000000000000061 R15: ffff9ed3ded59200
    FS: 0000000000000000(0000) GS:ffff9ed3dea00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00000000080686f0 CR3: 000000001b9e0000 CR4: 00000000000006f0
    Call Trace:
    resched_curr+0x61/0xd0
    switched_to_rt+0x8f/0xa0
    rt_mutex_setprio+0x25c/0x410
    task_blocks_on_rt_mutex+0x1b3/0x1f0
    rt_mutex_slowlock+0xa9/0x1e0
    rt_mutex_lock+0x29/0x30
    rcu_boost_kthread+0x127/0x3c0
    kthread+0x104/0x140
    ? rcu_report_unblock_qs_rnp+0x90/0x90
    ? kthread_create_on_node+0x40/0x40
    ret_from_fork+0x22/0x30
    Code: f0 00 0f 92 c0 84 c0 74 14 48 8b 05 34 74 c5 00 be fd 00 00 00 ff 90 a0 00 00 00 5d c3 89 fe 48 c7 c7 a0 c6 fc b9 e8 d5 b5 06 00 ff 5d c3 0f 1f 44 00 00 8b 05 a2 d1 13 02 85 c0 75 38 55 48

    But the target task's priority has already been adjusted, so the only
    purpose of switched_to_rt() invoking resched_curr() is to wake up the
    CPU running some task that needs to be preempted by the boosted task.
    But the CPU is offline, which presumably means that the task must be
    migrated to some other CPU, and that this other CPU will undertake any
    needed preemption at the time of migration. Because the runqueue lock
    is held when resched_curr() is invoked, we know that the boosted task
    cannot go anywhere, so it is not necessary to invoke resched_curr()
    in this particular case.

    This commit therefore makes switched_to_rt() refrain from invoking
    resched_curr() when the target CPU is offline.

    Signed-off-by: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     

17 Feb, 2018

3 commits

  • commit 364f56653708ba8bcdefd4f0da2a42904baa8eeb upstream.

    When issuing an IPI RT push, where an IPI is sent to each CPU that has more
    than one RT task scheduled on it, it references the root domain's rto_mask,
    which contains all the CPUs within the root domain that have more than one
    RT task in the runnable state. The problem is, after the IPIs are initiated,
    the rq->lock is released. This means that the root domain that is associated
    with the run queue could be freed while the IPIs are going around.

    Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
    the root domain's ref count respectively. This way when initiating the IPIs,
    the scheduler will up the root domain's ref count before releasing the
    rq->lock, ensuring that the root domain does not go away until the IPI round
    is complete.
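
    A compact userspace model of that get/put pattern (rd_get()/rd_put() mirror
    the described sched_get_rd()/sched_put_rd(); a plain mutex stands in for
    rq->lock):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdlib.h>

    struct root_domain {
            atomic_int refcount;
            /* ... rto_mask, rto_push_work, ... */
    };

    static void rd_get(struct root_domain *rd)      /* like sched_get_rd() */
    {
            atomic_fetch_add(&rd->refcount, 1);
    }

    static void rd_put(struct root_domain *rd)      /* like sched_put_rd() */
    {
            if (atomic_fetch_sub(&rd->refcount, 1) == 1)
                    free(rd);                       /* last user frees it  */
    }

    static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;

    static void start_ipi_round(struct root_domain *rd)
    {
            pthread_mutex_lock(&rq_lock);
            rd_get(rd);                 /* pin rd before dropping rq->lock */
            pthread_mutex_unlock(&rq_lock);

            /* ... the IPI round runs without rq_lock held; rd cannot be
             * freed underneath it ... */

            rd_put(rd);                 /* end of the IPI round */
    }

    int main(void)
    {
            struct root_domain *rd = calloc(1, sizeof(*rd));

            atomic_init(&rd->refcount, 1);   /* reference held by the rq    */
            start_ipi_round(rd);
            rd_put(rd);                      /* rq detaches from this rd    */
            return 0;
    }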

    Reported-by: Pavan Kondeti
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit ad0f1d9d65938aec72a698116cd73a980916895e upstream.

    When the rto_push_irq_work_func() is called, it looks at the RT overloaded
    bitmask in the root domain via the runqueue (rq->rd). The problem is that
    during CPU up and down, nothing here stops rq->rd from changing between
    taking the rq->rd->rto_lock and releasing it. That means the lock that is
    released is not the same lock that was taken.

    Instead of using this_rq()->rd to get the root domain, as the irq work is
    part of the root domain, we can simply get the root domain from the irq work
    that is passed to the routine:

    container_of(work, struct root_domain, rto_push_work)

    This keeps the root domain consistent.
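
    A self-contained illustration of the container_of() trick being relied on
    here (struct layouts are simplified, not the kernel's definitions):

    #include <stddef.h>
    #include <stdio.h>

    #define container_of(ptr, type, member) \
            ((type *)((char *)(ptr) - offsetof(type, member)))

    struct irq_work {
            void (*func)(struct irq_work *work);
    };

    struct root_domain {
            int rto_cpu;
            struct irq_work rto_push_work;   /* embedded member */
    };

    /* The callback only receives the embedded irq_work, but can recover the
     * enclosing root_domain from it - no need to go through this_rq()->rd,
     * which may have changed by the time the work runs. */
    static void rto_push_irq_work_func(struct irq_work *work)
    {
            struct root_domain *rd =
                    container_of(work, struct root_domain, rto_push_work);

            printf("rto_cpu = %d\n", rd->rto_cpu);
    }

    int main(void)
    {
            struct root_domain rd = { .rto_cpu = 3,
                                      .rto_push_work = { rto_push_irq_work_func } };

            rd.rto_push_work.func(&rd.rto_push_work);
            return 0;
    }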

    Reported-by: Pavan Kondeti
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 4bdced5c9a292 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit c6b9d9a33029014446bd9ed84c1688f6d3d4eab9 upstream.

    The following cleanup commit:

    50816c48997a ("sched/wait: Standardize internal naming of wait-queue entries")

    ... unintentionally changed the behavior of add_wait_queue() from
    inserting the wait entry at the head of the wait queue to the tail
    of the wait queue.

    Beyond a negative performance impact this change in behavior
    theoretically also breaks wait queues which mix exclusive and
    non-exclusive waiters, as non-exclusive waiters will not be
    woken up if they are queued behind enough exclusive waiters.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Jens Axboe
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Fixes: ("sched/wait: Standardize internal naming of wait-queue entries")
    Link: http://lkml.kernel.org/r/a16c8ccffd39bd08fdaa45a5192294c784b803a7.1512544324.git.osandov@fb.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     

24 Jan, 2018

1 commit

  • commit c96f5471ce7d2aefd0dda560cc23f08ab00bc65d upstream.

    Before commit:

    e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")

    delayacct_blkio_end() was called after context-switching into the task which
    completed I/O.

    This resulted in double counting: the task would account a delay both waiting
    for I/O and for time spent in the runqueue.

    With e33a9bba85a8, delayacct_blkio_end() is called by try_to_wake_up().
    In ttwu, we have not yet context-switched. This is more correct, in that
    the delay accounting ends when the I/O is complete.

    But delayacct_blkio_end() relies on 'get_current()', and we have not yet
    context-switched into the task whose I/O completed. This results in the
    wrong task having its delay accounting statistics updated.

    Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
    so that it can update the statistics of the correct task.

    Signed-off-by: Josh Snyder
    Acked-by: Tejun Heo
    Acked-by: Balbir Singh
    Cc: Brendan Gregg
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-block@vger.kernel.org
    Fixes: e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
    Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Josh Snyder
     

17 Jan, 2018

1 commit

  • commit 541676078b52f365f53d46ee5517d305cd1b6350 upstream.

    smp_call_function_many() requires disabling preemption around the call.

    Signed-off-by: Mathieu Desnoyers
    Cc: Andrea Parri
    Cc: Andrew Hunter
    Cc: Avi Kivity
    Cc: Benjamin Herrenschmidt
    Cc: Boqun Feng
    Cc: Dave Watson
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Maged Michael
    Cc: Michael Ellerman
    Cc: Paul E . McKenney
    Cc: Paul E. McKenney
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20171215192310.25293-1-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Desnoyers
     

03 Jan, 2018

1 commit

  • commit 466a2b42d67644447a1765276259a3ea5531ddff upstream.

    Since the recent remote cpufreq callback work, its possible that a cpufreq
    update is triggered from a remote CPU. For single policies however, the current
    code uses the local CPU when trying to determine if the remote sg_cpu entered
    idle or is busy. This is incorrect. To remedy this, compare with the nohz tick
    idle_calls counter of the remote CPU.

    Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
    Acked-by: Viresh Kumar
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Joel Fernandes
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes
     

20 Dec, 2017

1 commit

  • commit f73c52a5bcd1710994e53fbccc378c42b97a06b6 upstream.

    Daniel Wagner reported a crash on the BeagleBone Black SoC.

    This is a single-CPU architecture, and it does not have a functional
    arch_send_call_function_single_ipi() implementation; calling that
    function can crash the kernel.

    As it only has one CPU, the function shouldn't be called, but if the
    kernel is compiled for SMP, the push/pull RT scheduling logic now calls
    it for irq_work when the one CPU is overloaded; the CPU can then use that
    function to call itself and crash the kernel.

    Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
    only has a single CPU. But SCHED_FEAT is a constant if sched debugging
    is turned off. Another fix can also be used, and this should also help
    with normal SMP machines. That is, do not initiate the pull code if
    there's only one RT overloaded CPU, and that CPU happens to be the
    current CPU that is scheduling in a lower priority task.

    Even on a system with many CPUs, if there's many RT tasks waiting to
    run on a single CPU, and that CPU schedules in another RT task of lower
    priority, it will initiate the PULL logic in case there's a higher
    priority RT task on another CPU that is waiting to run. But if there is
    no other CPU with waiting RT tasks, it will initiate the RT pull logic
    on itself (as it still has RT tasks waiting to run). This is a wasted
    effort.

    Not only does this help with SMP code where the current CPU is the only
    one with RT overloaded tasks, it should also solve the issue that
    Daniel encountered, because it will prevent the PULL logic from
    executing, as there's only one CPU on the system, and the check added
    here will cause it to exit the RT pull code.

    Reported-by: Daniel Wagner
    Signed-off-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Sebastian Andrzej Siewior
    Cc: Thomas Gleixner
    Cc: linux-rt-users
    Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt
     

30 Nov, 2017

3 commits

  • commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.

    When a CPU lowers its priority (schedules out a high priority task for a
    lower priority one), a check is made to see if any other CPU has overloaded
    RT tasks (more than one). It checks the rto_mask to determine this and if so
    it will request to pull one of those tasks to itself if the non-running RT
    task is of higher priority than the new priority of the next task to run on
    the current CPU.

    When we deal with a large number of CPUs, the original pull logic suffered
    from large lock contention on a single CPU run queue, which caused a huge
    latency across all CPUs. This was caused by only having one CPU having
    overloaded RT tasks and a bunch of other CPUs lowering their priority. To
    solve this issue, commit:

    b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

    changed the way to request a pull. Instead of grabbing the lock of the
    overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

    Although the IPI logic worked very well in removing the large latency build
    up, it still could suffer from a large number of IPIs being sent to a single
    CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
    when I tested this on a 120 CPU box, with a stress test that had lots of
    RT tasks scheduling on all CPUs, it actually triggered the hard lockup
    detector! One CPU had so many IPIs sent to it, and due to the restart
    mechanism that is triggered when the source run queue has a priority status
    change, the CPU spent minutes! processing the IPIs.

    Thinking about this further, I realized there's no reason for each run queue
    to send its own IPI. Since all CPUs with overloaded tasks must be scanned
    regardless of whether one or many CPUs are lowering their priority, and
    since there's no current way to find the CPU with the highest priority
    task that can schedule to one of these CPUs, there really only needs to be
    one IPI being sent around at a time.

    This greatly simplifies the code!

    The new approach is to have each root domain have its own irq work, as the
    rto_mask is per root domain. The root domain has the following fields
    attached to it:

    rto_push_work   - the irq work to process each CPU set in rto_mask
    rto_lock        - the lock to protect some of the other rto fields
    rto_loop_start  - an atomic that keeps contention down on rto_lock;
                      the first CPU scheduling in a lower priority task
                      is the one to kick off the process.
    rto_loop_next   - an atomic that gets incremented for each CPU that
                      schedules in a lower priority task.
    rto_loop        - a variable protected by rto_lock that is used to
                      compare against rto_loop_next.
    rto_cpu         - the CPU to send the next IPI to, also protected by
                      the rto_lock.

    When a CPU schedules in a lower priority task and wants to make sure
    overloaded CPUs know about it, it increments rto_loop_next. Then it
    atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
    then it is done, as another CPU is kicking off the IPI loop. If the old
    value is "0", then it will take the rto_lock to synchronize with a possible
    IPI being sent around to the overloaded CPUs.

    If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
    IPI being sent around, or one is about to finish. Then rto_cpu is set to the
    first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
    in rto_mask, then there's nothing to be done.

    When the CPU receives the IPI, it will first try to push any RT tasks that
    are queued on the CPU but can't run because a higher priority RT task is
    currently running on that CPU.

    Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
    finds one, it simply sends an IPI to that CPU and the process continues.

    If there's no more CPUs in the rto_mask, then rto_loop is compared with
    rto_loop_next. If they match, everything is done and the process is over. If
    they do not match, then a CPU scheduled in a lower priority task as the IPI
    was being passed around, and the process needs to start again. The first CPU
    in rto_mask is sent the IPI.
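
    A very condensed, single-threaded model of the loop described above
    (locking is omitted - in the kernel rto_lock protects rto_cpu and rto_loop
    - and the IPIs are just printouts):

    #include <stdatomic.h>
    #include <stdio.h>

    #define NR_CPUS 4

    static atomic_int rto_loop_start;   /* 0/1: is an IPI round in flight?   */
    static atomic_int rto_loop_next;    /* bumped by every priority lowering */
    static int rto_loop;                /* value of rto_loop_next we service */
    static int rto_cpu = -1;            /* next CPU to IPI, -1 = none        */
    static int rto_mask[NR_CPUS] = { 0, 1, 0, 1 };   /* overloaded CPUs      */

    static int next_rto_cpu(void)
    {
            for (int cpu = rto_cpu + 1; cpu < NR_CPUS; cpu++)
                    if (rto_mask[cpu])
                            return cpu;
            return -1;
    }

    /* Called by a CPU that just scheduled in a lower priority task. */
    static void tell_overloaded_cpus(void)
    {
            int zero = 0;

            atomic_fetch_add(&rto_loop_next, 1);
            /* Only the first caller kicks off the round (rto_start_trylock). */
            if (!atomic_compare_exchange_strong(&rto_loop_start, &zero, 1))
                    return;

            rto_loop = atomic_load(&rto_loop_next);
            rto_cpu = next_rto_cpu();
            if (rto_cpu >= 0)
                    printf("send IPI to CPU %d\n", rto_cpu);
            else
                    atomic_store(&rto_loop_start, 0);   /* nothing to do */
    }

    /* Runs on the CPU that received the IPI, after it pushed its RT tasks. */
    static void ipi_handler(void)
    {
            rto_cpu = next_rto_cpu();
            if (rto_cpu >= 0) {
                    printf("forward IPI to CPU %d\n", rto_cpu);
                    return;
            }
            /* Mask exhausted: restart the sweep if someone lowered their
             * priority while the IPI was travelling, otherwise stop. */
            if (rto_loop != atomic_load(&rto_loop_next)) {
                    rto_loop = atomic_load(&rto_loop_next);
                    rto_cpu = next_rto_cpu();
                    printf("restart, IPI to CPU %d\n", rto_cpu);
            } else {
                    atomic_store(&rto_loop_start, 0);
                    printf("round complete\n");
            }
    }

    int main(void)
    {
            tell_overloaded_cpus();      /* a CPU lowers its priority         */
            tell_overloaded_cpus();      /* another lowers; round already on  */
            ipi_handler();               /* CPU1 handled, forwards to CPU3    */
            ipi_handler();               /* CPU3 handled, sweep restarts      */
            ipi_handler();               /* CPU1 again                        */
            ipi_handler();               /* CPU3 again, now the round ends    */
            return 0;
    }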

    This change removes this duplication of work in the IPI logic, and greatly
    lowers the latency caused by the IPIs. This removed the lockup happening on
    the 120 CPU machine. It also simplifies the code tremendously. What else
    could anyone ask for?

    Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
    supplying me with the rto_start_trylock() and rto_start_unlock() helper
    functions.

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Clark Williams
    Cc: Daniel Bristot de Oliveira
    Cc: John Kacur
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Scott Wood
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.

    The current implementation of synchronize_sched_expedited() incorrectly
    assumes that resched_cpu() is unconditional, which it is not. This means
    that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
    fails as follows (analysis by Neeraj Upadhyay):

    o CPU1 is waiting for expedited wait to complete:

    sync_rcu_exp_select_cpus
    rdp->exp_dynticks_snap & 0x1 // returns 1 for CPU5
    IPI sent to CPU5

    synchronize_sched_expedited_wait
    ret = swait_event_timeout(rsp->expedited_wq,
    sync_rcu_preempt_exp_done(rnp_root),
    jiffies_stall);

    expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())

    o CPU5 handles IPI and fails to acquire rq lock.

    Handles IPI
    sync_sched_exp_handler
    resched_cpu
    returns while failing to try lock acquire rq->lock
    need_resched is not set

    o CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to
    idle (schedule() is not called).

    o CPU 1 reports RCU stall.

    Given that resched_cpu() is now used only by RCU, this commit fixes the
    assumption by making resched_cpu() unconditional.

    Reported-by: Neeraj Upadhyay
    Suggested-by: Neeraj Upadhyay
    Signed-off-by: Paul E. McKenney
    Acked-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     
  • commit 07458f6a5171d97511dfbdf6ce549ed2ca0280c7 upstream.

    'cached_raw_freq' is used to get the next frequency quickly but should
    always be in sync with sg_policy->next_freq. There is a case where it is
    not and in such cases it should be reset to avoid switching to incorrect
    frequencies.

    Consider this case for example:

    - policy->cur is 1.2 GHz (Max)
    - New request comes for 780 MHz and we store that in cached_raw_freq.
    - Based on 780 MHz, we calculate the effective frequency as 800 MHz.
    - We then see the CPU wasn't idle recently and choose to keep the next
    freq as 1.2 GHz.
    - Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is
    1.2 GHz.
    - Now if the utilization doesn't change in the next request, then the
    next target frequency will still be 780 MHz and it will match
    cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.

    Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
    Signed-off-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Viresh Kumar
     

05 Nov, 2017

1 commit

  • After commit 674e75411fc2 (sched: cpufreq: Allow remote cpufreq
    callbacks) we no longer always read the utilization for the CPU we
    are running the governor on; instead we read it for the CPU
    which we've been told has updated utilization. This is stored in
    sugov_cpu->cpu.

    The value is set in sugov_register() but we clear it in sugov_start()
    which leads to always looking at the utilization of CPU0 instead of
    the correct one.

    Fix this by consolidating the initialization code into sugov_start().

    Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
    Signed-off-by: Chris Redpath
    Reviewed-by: Patrick Bellasi
    Reviewed-by: Brendan Jackman
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Chris Redpath
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5 lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

20 Oct, 2017

1 commit

  • This introduces a "register private expedited" membarrier command which
    allows eventual removal of important memory barrier constraints on the
    scheduler fast-paths. It changes how the "private expedited" membarrier
    command (new to 4.14) is used from user-space.

    This new command allows processes to register their intent to use the
    private expedited command. This affects how the expedited private
    command introduced in 4.14-rc is meant to be used, and should be merged
    before 4.14 final.

    Processes are now required to register before using
    MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

    This fixes a problem that arose when designing requested extensions to
    sys_membarrier() to allow JITs to efficiently flush old code from
    instruction caches. Several potential algorithms are much less painful
    if the user registers intent to use this functionality early on, for
    example, before the process spawns the second thread. Registering at
    this time removes the need to interrupt each and every thread in that
    process at the first expedited sys_membarrier() system call.

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

10 Oct, 2017

1 commit

  • While load_balance() masks the source CPUs against active_mask, it had
    a hole against the destination CPU. Ensure the destination CPU is also
    part of the 'domain-mask & active-mask' set.

    Reported-by: Levin, Alexander (Sasha Levin)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra