20 Mar, 2018

2 commits

  • The estimated utilization of a task is currently updated every time the
    task is dequeued. However, to keep overheads under control, PELT signals
    are effectively updated at most once every 1ms.

    Thus, for really short running tasks, it can happen that their util_avg
    value has not been updated since their last enqueue. If such tasks are
    also frequently running tasks (e.g. the kind of workload generated by
    hackbench), it can also happen that their util_avg is updated only every
    few activations.

    This means that updating util_est at every dequeue potentially introduces
    unnecessary overhead, and it is also conceptually wrong if the util_avg
    signal has never been updated during a task activation.

    Let's introduce a throttling mechanism on a task's util_est updates
    to sync them with util_avg updates. To make the solution memory
    efficient, both in terms of space and load/store operations, we encode a
    synchronization flag into the LSB of util_est.enqueued.
    This makes util_est an even-values-only metric, which is still
    considered good enough for its purpose.
    The synchronization bit is (re)set by __update_load_avg_se() once the
    PELT signal of a task has been updated during its last activation.

    Such a throttling mechanism allows util_est overheads in the wakeup hot
    path to be kept under control, making it suitable for enabling even on
    systems running high-intensity workloads. The utilization estimation
    scheduler feature is therefore now switched on by default.
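
    A minimal C sketch of the LSB encoding described above (the flag name and
    helpers are illustrative assumptions, not the kernel's exact code): the
    LSB of util_est.enqueued records whether util_avg has been refreshed
    since the last util_est update, so the metric itself is always even.

    #define UTIL_EST_SYNC 0x1 /* LSB: PELT updated during this activation */

    struct util_est_sketch { unsigned int enqueued; };

    /* The metric lives in the upper bits, hence even values only. */
    static inline unsigned int util_est_value(const struct util_est_sketch *ue)
    {
        return ue->enqueued & ~UTIL_EST_SYNC;
    }

    /* PELT update path: note that util_avg changed during this activation. */
    static inline void util_est_mark_pelt_updated(struct util_est_sketch *ue)
    {
        ue->enqueued |= UTIL_EST_SYNC;
    }

    /* Dequeue path: fold util_avg into util_est only if PELT actually ran,
     * then clear the flag until the next PELT update. */
    static inline void util_est_dequeue_sketch(struct util_est_sketch *ue,
                                               unsigned int util_avg)
    {
        if (!(ue->enqueued & UTIL_EST_SYNC))
            return; /* throttled: util_avg is unchanged since last time */
        ue->enqueued = util_avg & ~UTIL_EST_SYNC;
    }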

    Suggested-by: Chris Redpath
    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Steve Muckle
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: http://lkml.kernel.org/r/20180309095245.11071-5-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
  • The util_avg signal computed by PELT is too variable for some use-cases.
    For example, a big task waking up after a long sleep period will have its
    utilization almost completely decayed. This introduces some latency before
    schedutil will be able to pick the best frequency to run a task.

    The same issue can affect task placement. Indeed, since the task's
    utilization has already decayed at wakeup, a CPU which has just started
    running a big task can be temporarily represented as almost empty when
    the task is enqueued. This leads to a race condition where other tasks
    can potentially be allocated on a CPU which has just started to run a
    big task that slept for a relatively long period.

    Moreover, the PELT utilization of a task can be updated every [ms], thus
    making it a continuously changing value for certain longer running
    tasks. This means that the instantaneous PELT utilization of a RUNNING
    task is not really meaningful to properly support scheduler decisions.

    For all these reasons, a more stable signal can do a better job of
    representing the expected/estimated utilization of a task/cfs_rq.
    Such a signal can be easily created on top of PELT by still using it as
    an estimator which produces values to be aggregated on meaningful
    events.

    This patch adds a simple implementation of util_est, a new signal built on
    top of PELT's util_avg where:

    util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))

    This allows us to remember how big a task has been reported to be by PELT
    in its previous activations via f(task::util_avg@dequeue), which is the
    new _task_util_est(struct task_struct*) function added by this patch.

    If a task changes its behavior and runs longer in a new activation,
    after a certain time its util_est will simply track the original PELT
    signal (i.e. task::util_avg).

    The estimated utilization of a cfs_rq is defined only for root ones.
    That's because the only sensible consumers of this signal are the
    scheduler and schedutil, when looking for the overall CPU utilization
    due to FAIR tasks.

    For this reason, the estimated utilization of a root cfs_rq is simply
    defined as:

    util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)

    where:

    cfs_rq::util_est::enqueued = sum(_task_util_est(task))
    for each RUNNABLE task on that root cfs_rq

    It's worth noting that the estimated utilization is tracked only for
    objects of interest, specifically:

    - Tasks: to better support task placement decisions
    - root cfs_rqs: to better support both task placement decisions and
    frequency selection
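
    The two max() definitions above can be sketched in C as follows
    (simplified field and function names, offered as an illustration rather
    than the kernel's exact data structures):

    struct task_sketch {
        unsigned int util_avg;          /* PELT utilization of the task */
        unsigned int util_est_enqueued; /* f(util_avg@dequeue), remembered at dequeue */
    };

    struct cfs_rq_sketch {
        unsigned int util_avg;          /* PELT utilization of the root cfs_rq */
        unsigned int util_est_enqueued; /* sum of _task_util_est() over RUNNABLE tasks */
    };

    static inline unsigned int max_u(unsigned int a, unsigned int b)
    {
        return a > b ? a : b;
    }

    /* util_est(task) = max(task::util_avg, f(task::util_avg@dequeue)) */
    static inline unsigned int task_util_est(const struct task_sketch *p)
    {
        return max_u(p->util_avg, p->util_est_enqueued);
    }

    /* util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued) */
    static inline unsigned int cfs_rq_util_est(const struct cfs_rq_sketch *rq)
    {
        return max_u(rq->util_avg, rq->util_est_enqueued);
    }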

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Rafael J. Wysocki
    Cc: Steve Muckle
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: http://lkml.kernel.org/r/20180309095245.11071-2-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boilerplate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier should be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files, created by Philippe Ombredanne. Philippe prepared the
    base worksheet and did an initial spot review of a few thousand files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

10 Oct, 2017

2 commits

  • The trivial wake_affine_idle() implementation is very good for a
    number of workloads, but it comes apart the moment there are no
    idle CPUs left, IOW, the overloaded case.

    hackbench:

    NO_WA_WEIGHT WA_WEIGHT

    hackbench-20 : 7.362717561 seconds 6.450509391 seconds

    (win)

    netperf:

    NO_WA_WEIGHT WA_WEIGHT

    TCP_SENDFILE-1 : Avg: 54524.6 Avg: 52224.3
    TCP_SENDFILE-10 : Avg: 48185.2 Avg: 46504.3
    TCP_SENDFILE-20 : Avg: 29031.2 Avg: 28610.3
    TCP_SENDFILE-40 : Avg: 9819.72 Avg: 9253.12
    TCP_SENDFILE-80 : Avg: 5355.3 Avg: 4687.4

    TCP_STREAM-1 : Avg: 41448.3 Avg: 42254
    TCP_STREAM-10 : Avg: 24123.2 Avg: 25847.9
    TCP_STREAM-20 : Avg: 15834.5 Avg: 18374.4
    TCP_STREAM-40 : Avg: 5583.91 Avg: 5599.57
    TCP_STREAM-80 : Avg: 2329.66 Avg: 2726.41

    TCP_RR-1 : Avg: 80473.5 Avg: 82638.8
    TCP_RR-10 : Avg: 72660.5 Avg: 73265.1
    TCP_RR-20 : Avg: 52607.1 Avg: 52634.5
    TCP_RR-40 : Avg: 57199.2 Avg: 56302.3
    TCP_RR-80 : Avg: 25330.3 Avg: 26867.9

    UDP_RR-1 : Avg: 108266 Avg: 107844
    UDP_RR-10 : Avg: 95480 Avg: 95245.2
    UDP_RR-20 : Avg: 68770.8 Avg: 68673.7
    UDP_RR-40 : Avg: 76231 Avg: 75419.1
    UDP_RR-80 : Avg: 34578.3 Avg: 35639.1

    UDP_STREAM-1 : Avg: 64684.3 Avg: 66606
    UDP_STREAM-10 : Avg: 52701.2 Avg: 52959.5
    UDP_STREAM-20 : Avg: 30376.4 Avg: 29704
    UDP_STREAM-40 : Avg: 15685.8 Avg: 15266.5
    UDP_STREAM-80 : Avg: 8415.13 Avg: 7388.97

    (wins and losses)

    sysbench:

    NO_WA_WEIGHT WA_WEIGHT

    sysbench-mysql-2 : 2135.17 per sec. 2142.51 per sec.
    sysbench-mysql-5 : 4809.68 per sec. 4800.19 per sec.
    sysbench-mysql-10 : 9158.59 per sec. 9157.05 per sec.
    sysbench-mysql-20 : 14570.70 per sec. 14543.55 per sec.
    sysbench-mysql-40 : 22130.56 per sec. 22184.82 per sec.
    sysbench-mysql-80 : 20995.56 per sec. 21904.18 per sec.

    sysbench-psql-2 : 1679.58 per sec. 1705.06 per sec.
    sysbench-psql-5 : 3797.69 per sec. 3879.93 per sec.
    sysbench-psql-10 : 7253.22 per sec. 7258.06 per sec.
    sysbench-psql-20 : 11166.75 per sec. 11220.00 per sec.
    sysbench-psql-40 : 17277.28 per sec. 17359.78 per sec.
    sysbench-psql-80 : 17112.44 per sec. 17221.16 per sec.

    (increase on the top end)

    tbench:

    NO_WA_WEIGHT

    Throughput 685.211 MB/sec 2 clients 2 procs max_latency=0.123 ms
    Throughput 1596.64 MB/sec 5 clients 5 procs max_latency=0.119 ms
    Throughput 2985.47 MB/sec 10 clients 10 procs max_latency=0.262 ms
    Throughput 4521.15 MB/sec 20 clients 20 procs max_latency=0.506 ms
    Throughput 9438.1 MB/sec 40 clients 40 procs max_latency=2.052 ms
    Throughput 8210.5 MB/sec 80 clients 80 procs max_latency=8.310 ms

    WA_WEIGHT

    Throughput 697.292 MB/sec 2 clients 2 procs max_latency=0.127 ms
    Throughput 1596.48 MB/sec 5 clients 5 procs max_latency=0.080 ms
    Throughput 2975.22 MB/sec 10 clients 10 procs max_latency=0.254 ms
    Throughput 4575.14 MB/sec 20 clients 20 procs max_latency=0.502 ms
    Throughput 9468.65 MB/sec 40 clients 40 procs max_latency=2.069 ms
    Throughput 8631.73 MB/sec 80 clients 80 procs max_latency=8.605 ms

    (increase on the top end)

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Rik van Riel
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Eric reported a sysbench regression against commit:

    3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")

    Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
    against his v3.10 enterprise kernel.

    PRE (current tip/master):

    ivb-ep sysbench:

    2: [30 secs] transactions: 64110 (2136.94 per sec.)
    5: [30 secs] transactions: 143644 (4787.99 per sec.)
    10: [30 secs] transactions: 274298 (9142.93 per sec.)
    20: [30 secs] transactions: 418683 (13955.45 per sec.)
    40: [30 secs] transactions: 320731 (10690.15 per sec.)
    80: [30 secs] transactions: 355096 (11834.28 per sec.)

    hsw-ex NAS:

    OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01
    OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89
    OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93
    lu.C.x_threads_144_run_1.log: Time in seconds = 434.68
    lu.C.x_threads_144_run_2.log: Time in seconds = 405.36
    lu.C.x_threads_144_run_3.log: Time in seconds = 433.83

    POST (+patch):

    ivb-ep sysbench:

    2: [30 secs] transactions: 64494 (2149.75 per sec.)
    5: [30 secs] transactions: 145114 (4836.99 per sec.)
    10: [30 secs] transactions: 278311 (9276.69 per sec.)
    20: [30 secs] transactions: 437169 (14571.60 per sec.)
    40: [30 secs] transactions: 669837 (22326.73 per sec.)
    80: [30 secs] transactions: 631739 (21055.88 per sec.)

    hsw-ex NAS:

    lu.C.x_threads_144_run_1.log: Time in seconds = 23.36
    lu.C.x_threads_144_run_2.log: Time in seconds = 22.96
    lu.C.x_threads_144_run_3.log: Time in seconds = 22.52

    This patch takes out all the shiny wake_affine() stuff and goes back to
    utter basics. Between the two CPUs involved with the wakeup (the CPU
    doing the wakeup and the CPU we ran on previously) pick the CPU we can
    run on _now_.

    This recovers much of the regression against the older kernels,
    but still gives up some ground in the overloaded case. The
    default-enabled WA_WEIGHT (which will be introduced in the next patch)
    is an attempt to address the overloaded situation.
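
    A rough sketch of that basic choice (cpu_is_idle() below is a stand-in
    predicate for illustration, not a real kernel helper): between the
    waking CPU and the previous CPU, prefer whichever is idle right now,
    otherwise stay where the task ran before.

    /* Stand-in predicate; a real implementation would query the runqueue. */
    static int cpu_is_idle(int cpu) { (void)cpu; return 0; }

    static int wake_affine_sketch(int this_cpu, int prev_cpu)
    {
        if (cpu_is_idle(this_cpu))
            return this_cpu; /* the waker's CPU can run the task now */
        if (cpu_is_idle(prev_cpu))
            return prev_cpu; /* previous CPU is free and cache-warm */
        return prev_cpu;     /* neither idle: keep the previous placement */
    }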

    Reported-by: Eric Farman
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Christian Borntraeger
    Cc: Linus Torvalds
    Cc: Matthew Rosato
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: jinpuwang@gmail.com
    Cc: vcaputo@pengaru.com
    Fixes: 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

08 Jun, 2017

1 commit

  • Hackbench recently suffered a bunch of pain, first by commit:

    4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")

    and then by commit:

    c743f0a5c50f ("sched/fair, cpumask: Export for_each_cpu_wrap()")

    which fixed a bug in the initial for_each_cpu_wrap() implementation
    that made select_idle_cpu() even more expensive. The bug was that it
    would skip over CPUs when bits were consecutive in the bitmask.

    This, however, gave me an idea to fix select_idle_cpu(): where the old
    scheme was a cliff-edge throttle on idle scanning, this introduces a
    more gradual approach. Instead of stopping the scan entirely, we limit
    how many CPUs we scan, as sketched below.

    Initial benchmarks show that it mostly recovers hackbench while not
    hurting anything else, except Mason's schbench, but not as bad as the
    old thing.

    It also appears to recover the tbench high-end, which also suffered like
    hackbench.
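
    A rough sketch of the gradual limit (the proportional rule and the
    helper below are assumptions for illustration, not the exact heuristic
    used by the patch): instead of aborting the scan, bound the number of
    CPUs inspected by the expected idle time relative to the average scan
    cost.

    /* Stand-in predicate for illustration only. */
    static int cpu_is_idle(int cpu) { (void)cpu; return 0; }

    static int select_idle_cpu_sketch(int target, int nr_cpus,
                                      unsigned long long avg_idle,
                                      unsigned long long avg_scan_cost)
    {
        int nr = nr_cpus;

        /* Gradual throttle: the more a scan costs relative to the idle
         * time we expect to gain, the fewer CPUs we are willing to visit. */
        if (avg_scan_cost && avg_idle < (unsigned long long)nr_cpus * avg_scan_cost)
            nr = (int)(avg_idle / avg_scan_cost) + 1;

        for (int i = 0; i < nr_cpus && nr > 0; i++, nr--) {
            int cpu = (target + i) % nr_cpus; /* wrap, like for_each_cpu_wrap() */
            if (cpu_is_idle(cpu))
                return cpu;
        }
        return target; /* budget exhausted: fall back to the target CPU */
    }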

    Tested-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Mason
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hpa@zytor.com
    Cc: kitsunyan
    Cc: linux-kernel@vger.kernel.org
    Cc: lvenanci@redhat.com
    Cc: riel@redhat.com
    Cc: xiaolong.ye@intel.com
    Link: http://lkml.kernel.org/r/20170517105350.hk5m4h4jb6dfr65a@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

15 May, 2017

1 commit

  • It's an obsolete debug mechanism, and future code wants to rely on
    properties this undermines.

    Namely, it would be good to assume that SD_OVERLAP domains have
    children, but if we build the entire hierarchy with SD_OVERLAP this is
    obviously false.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

16 Mar, 2017

1 commit


02 Mar, 2017

1 commit

  • Kitsunyan reported desktop latency issues on his Celeron 887 because
    of commit:

    1b568f0aabf2 ("sched/core: Optimize SCHED_SMT")

    ... even though his CPU doesn't do SMT.

    The effect of running the SMT code on a !SMT part is basically a more
    aggressive select_idle_cpu(). Removing the avg condition fixed things
    for him.

    I also know FB likes this test gone, even though other workloads like
    having it.

    For now, take it out by default, until we get a better idea.

    Reported-by: kitsunyan
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Mason
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Sep, 2015

3 commits

  • Bring arch_scale_cpu_capacity() in line with the recent change of its
    arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched:
    Optimize freq invariant accounting") from weak function to #define to
    allow inlining of the function.

    While at it, remove the ARCH_CAPACITY sched_feature as well. With the
    change to #define there isn't a straightforward way to allow runtime
    switch between an arch implementation and the default implementation of
    arch_scale_cpu_capacity() using sched_feature. The default was to use
    the arch-specific implementation, but only the arm architecture provides
    one and that is essentially equivalent to the default implementation.
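
    The pattern being described looks roughly like this (simplified
    signature, shown for illustration rather than as the exact kernel code):
    an architecture may override the symbol with its own macro, otherwise an
    inlinable default applies, unlike the old __weak function which could
    not be inlined.

    #define SCHED_CAPACITY_SCALE 1024UL

    /* If the architecture defines arch_scale_cpu_capacity as a macro, that
     * definition wins; otherwise this default is used and can be inlined. */
    #ifndef arch_scale_cpu_capacity
    static inline unsigned long arch_scale_cpu_capacity(int cpu)
    {
        (void)cpu;
        return SCHED_CAPACITY_SCALE; /* default: full capacity */
    }
    #endif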

    Signed-off-by: Morten Rasmussen
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dietmar Eggemann
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: daniel.lezcano@linaro.org
    Cc: mturquette@baylibre.com
    Cc: pang.xunlei@zte.com.cn
    Cc: rjw@rjwysocki.net
    Cc: sgurrappadi@nvidia.com
    Cc: vincent.guittot@linaro.org
    Cc: yuyang.du@intel.com
    Link: http://lkml.kernel.org/r/1439569394-11974-3-git-send-email-morten.rasmussen@arm.com
    Signed-off-by: Ingo Molnar

    Morten Rasmussen
     
  • Variable sched_numa_balancing is available for both CONFIG_SCHED_DEBUG
    and !CONFIG_SCHED_DEBUG. All code paths now check for
    sched_numa_balancing. Hence remove sched_feat(NUMA).

    Suggested-by: Ingo Molnar
    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1439290813-6683-4-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     
  • In case there are problems with the aging on attach, provide a debug
    knob to turn it off.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Byungchul Park
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Cc: yuyang.du@intel.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Jul, 2015

1 commit

  • The current load balancer may not try to prevent a task from moving
    out of a preferred node to a less preferred node. The reason for this
    being:

    - Since sched features NUMA and NUMA_RESIST_LOWER are disabled by
    default, migrate_degrades_locality() always returns false.

    - Even if NUMA_RESIST_LOWER were to be enabled, if the task is cache hot,
    migrate_degrades_locality() never gets called.

    The above behaviour can mean that tasks can move out of their
    preferred node but may eventually be brought back to their
    preferred node by the NUMA balancer (due to higher NUMA faults).

    To avoid the above, this commit merges migrate_degrades_locality() and
    migrate_improves_locality(). It also replaces the three sched features
    NUMA, NUMA_FAVOUR_HIGHER and NUMA_RESIST_LOWER with a single sched
    feature NUMA.

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1434455762-30857-2-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     

23 Mar, 2015

1 commit

  • When debugging the latencies on a 40 core box, where we hit 300 to
    500 microsecond latencies, I found there was a huge contention on the
    runqueue locks.

    Investigating it further, running ftrace, I found that it was due to
    the pulling of RT tasks.

    The test that was run was the following:

    cyclictest --numa -p95 -m -d0 -i100

    This created a thread on each CPU, that would set its wakeup in iterations
    of 100 microseconds. The -d0 means that all the threads had the same
    interval (100us). Each thread sleeps for 100us and wakes up and measures
    its latencies.

    cyclictest is maintained at:
    git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git

    What happened was another RT task would be scheduled on one of the CPUs
    that was running our test, when the other CPU tests went to sleep and
    scheduled idle. This caused the "pull" operation to execute on all
    these CPUs. Each one of these saw the RT task that was overloaded on
    the CPU of the test that was still running, and each one tried
    to grab that task in a thundering herd way.

    To grab the task, each thread would do a double rq lock grab, grabbing
    its own lock as well as the rq lock of the overloaded CPU. As the sched
    domains on this box were rather flat for its size, I saw up to 12 CPUs
    block on this lock at once. This caused a ripple effect with the
    rq locks, especially since the taking was done via a double rq lock, which
    means that several of the CPUs had their own rq locks held while trying
    to take this rq lock. As these locks were blocked, any wakeups or load
    balancing on these CPUs would also block on these locks, and the wait
    time escalated.

    I've tried various methods to lessen the load, but things like an
    atomic counter to only let one CPU grab the task won't work, because
    the task may have a limited affinity, and we may pick the wrong
    CPU to take that lock and do the pull, only to find out that the
    CPU we picked isn't in the task's affinity.

    Instead of doing the PULL, I now have the CPUs that want the pull to
    send over an IPI to the overloaded CPU, and let that CPU pick what
    CPU to push the task to. No more need to grab the rq lock, and the
    push/pull algorithm still works fine.

    With this patch, the latency dropped to just 150us over a 20 hour run.
    Without the patch, the huge latencies would trigger in seconds.

    I've created a new sched feature called RT_PUSH_IPI, which is enabled
    by default.

    When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
    and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
    is enabled, the IPI is sent to the overloaded CPU to do a push.

    To enable or disable this at run time:

    # mount -t debugfs nodev /sys/kernel/debug
    # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
    or
    # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features

    Update: This original patch would send an IPI to all CPUs in the RT overload
    list. But that could theoretically cause the reverse issue. That is, there
    could be lots of overloaded RT queues and one CPU lowers its priority. It would
    then send an IPI to all the overloaded RT queues and they could then all try
    to grab the rq lock of the CPU lowering its priority, and then we have the
    same problem.

    The latest design sends out only one IPI, to the first overloaded CPU. That CPU
    tries to push any tasks that it can, and then looks for the next overloaded CPU
    that can push to the source CPU. The IPIs stop once all overloaded CPUs with
    pushable tasks of higher priority than the source CPU have been covered. In case
    the source CPU lowers its priority again, a flag is set to tell the IPI traversal
    to restart with the first RT overloaded CPU after the source CPU.
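
    The traversal can be pictured with a small standalone model (every name
    and the per-CPU arrays below are assumptions for illustration; the real
    design forwards a single IPI from one overloaded CPU to the next rather
    than looping in one place):

    #include <stdio.h>

    #define NR_CPUS 8

    /* Toy per-CPU state: current priority and how many queued RT tasks
     * could be pushed elsewhere. */
    static int cpu_prio[NR_CPUS]     = { 3, 0, 5, 0, 7, 0, 0, 4 };
    static int cpu_pushable[NR_CPUS] = { 1, 0, 2, 0, 1, 0, 0, 1 };

    /* Visit the CPUs after src_cpu in order and let each overloaded CPU
     * with higher-priority pushable tasks do the push itself, so src_cpu
     * never has to take remote rq locks to pull. */
    static void rt_push_ipi_chain(int src_cpu)
    {
        for (int i = 1; i < NR_CPUS; i++) {
            int cpu = (src_cpu + i) % NR_CPUS;

            if (cpu_pushable[cpu] && cpu_prio[cpu] > cpu_prio[src_cpu]) {
                printf("IPI -> CPU%d: push a prio-%d task towards CPU%d\n",
                       cpu, cpu_prio[cpu], src_cpu);
                cpu_pushable[cpu]--;
            }
        }
    }

    int main(void)
    {
        rt_push_ipi_chain(1); /* CPU1 just lowered its priority */
        return 0;
    }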

    Parts-suggested-by: Peter Zijlstra
    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Joern Engel
    Cc: Clark Williams
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

05 Jun, 2014

1 commit

  • It is better not to think about compute capacity as being equivalent
    to "CPU power". The upcoming "power aware" scheduler work may create
    confusion with the notion of energy consumption if "power" is used too
    liberally.

    Let's rename the following feature flags since they do relate to capacity:

    SD_SHARE_CPUPOWER -> SD_SHARE_CPUCAPACITY
    ARCH_POWER -> ARCH_CAPACITY
    NONTASK_POWER -> NONTASK_CAPACITY

    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: Vincent Guittot
    Cc: Daniel Lezcano
    Cc: Morten Rasmussen
    Cc: "Rafael J. Wysocki"
    Cc: linaro-kernel@lists.linaro.org
    Cc: Andy Fleming
    Cc: Anton Blanchard
    Cc: Benjamin Herrenschmidt
    Cc: Grant Likely
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: Preeti U Murthy
    Cc: Rob Herring
    Cc: Srivatsa S. Bhat
    Cc: Toshi Kani
    Cc: Vasant Hegde
    Cc: Vincent Guittot
    Cc: devicetree@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/n/tip-e93lpnxb87owfievqatey6b5@git.kernel.org
    Signed-off-by: Ingo Molnar

    Nicolas Pitre
     

09 Oct, 2013

3 commits

  • Just as "sched: Favour moving tasks towards the preferred node" favours
    moving tasks towards nodes with a higher number of recorded NUMA hinting
    faults, this patch resists moving tasks towards nodes with lower faults.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-24-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • This patch favours moving tasks towards the NUMA node that recorded a higher
    number of NUMA faults during active load balancing. Ideally this is
    self-reinforcing, as the longer the task runs on that node, the more faults
    it should incur, causing task_numa_placement to keep the task running on that
    node. In reality a big weakness is that the node's CPUs can be overloaded
    and it would be more efficient to queue tasks on an idle node and migrate
    to the new node. This would require additional smarts in the balancer, so
    for now the balancer will simply prefer to place the task on the preferred
    node for a number of PTE scans, which is controlled by the
    numa_balancing_settle_count sysctl. Once the settle_count number of scans
    has completed, the scheduler is free to place the task on an alternative
    node if the load is imbalanced.

    [srikar@linux.vnet.ibm.com: Fixed statistics]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    [ Tunable and use higher faults instead of preferred. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-23-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     
  • PTE scanning and NUMA hinting fault handling are expensive, so commit
    5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
    on a new node") deferred the PTE scan until a task had been scheduled on
    another node. The problem is that in the purely shared memory case
    this may never happen and no NUMA hinting fault information will be
    captured. We are not ruling out the possibility that something better
    can be done here, but for now that change needs to be reverted so that we
    depend entirely on the scan_delay to avoid punishing short-lived processes.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Cc: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

19 Apr, 2013

1 commit

  • As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler
    feature bit was really just an early hack to make with/without
    mutex-spinning testable. So it is no longer necessary.

    This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and
    moves the mutex spinning code from kernel/sched/core.c back to
    kernel/mutex.c, which is where it belongs.

    Signed-off-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Chandramouleeswaran Aswin
    Cc: Davidlohr Bueso
    Cc: Norton Scott J
    Cc: Rik van Riel
    Cc: Paul E. McKenney
    Cc: David Howells
    Cc: Dave Jones
    Cc: Clark Williams
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

17 Dec, 2012

1 commit

  • Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
    "There are three implementations for NUMA balancing, this tree
    (balancenuma), numacore which has been developed in tip/master and
    autonuma which is in aa.git.

    In almost all respects balancenuma is the dumbest of the three because
    its main impact is on the VM side with no attempt to be smart about
    scheduling. In the interest of getting the ball rolling, it would be
    desirable to see this much merged for 3.8 with the view to building
    scheduler smarts on top and adapting the VM where required for 3.9.

    The most recent set of comparisons available from different people are

    mel: https://lkml.org/lkml/2012/12/9/108
    mingo: https://lkml.org/lkml/2012/12/7/331
    tglx: https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

    The results are a mixed bag. In my own tests, balancenuma does
    reasonably well. It's dumb as rocks and does not regress against
    mainline. On the other hand, Ingo's tests show that balancenuma is
    incapable of converging for these workloads driven by perf, which is bad
    but is potentially explained by the lack of scheduler smarts. Thomas'
    results show balancenuma improves on mainline but falls far short of
    numacore or autonuma. Srikar's results indicate we all suffer on a
    large machine with imbalanced node sizes.

    My own testing showed that recent numacore results have improved
    dramatically, particularly in the last week but not universally.
    We've butted heads heavily on system CPU usage and high levels of
    migration even when it shows that overall performance is better.
    There are also cases where it regresses. Of interest is that for
    specjbb in some configurations it will regress for lower numbers of
    warehouses and show gains for higher numbers, which is not reported by
    the tool by default and sometimes missed in reports. Recently I
    reported for numacore that the JVM was crashing with
    NullPointerExceptions, but currently it's unclear what the source of
    this problem is. Initially I thought it was in how numacore batch-handles
    PTEs, but I no longer think this is the case. It's possible
    numacore is just able to trigger it due to higher rates of migration.

    These reports were quite late in the cycle so I/we would like to start
    with this tree as it contains much of the code we can agree on and has
    not changed significantly over the last 2-3 weeks."

    * tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
    mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
    mm/rmap: Convert the struct anon_vma::mutex to an rwsem
    mm: migrate: Account a transhuge page properly when rate limiting
    mm: numa: Account for failed allocations and isolations as migration failures
    mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
    mm: numa: Add THP migration for the NUMA working set scanning fault case.
    mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
    mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
    mm: sched: numa: Control enabling and disabling of NUMA balancing
    mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
    mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely tasknode relationships
    mm: numa: migrate: Set last_nid on newly allocated page
    mm: numa: split_huge_page: Transfer last_nid on tail page
    mm: numa: Introduce last_nid to the page frame
    sched: numa: Slowly increase the scanning period as NUMA faults are handled
    mm: numa: Rate limit setting of pte_numa if node is saturated
    mm: numa: Rate limit the amount of memory that is migrated between nodes
    mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
    mm: numa: Migrate pages handled during a pmd_numa hinting fault
    mm: numa: Migrate on reference policy
    ...

    Linus Torvalds
     

11 Dec, 2012

3 commits

  • Due to the fact that migrations are driven by the CPU a task is running
    on, there is no point tracking NUMA faults until one task runs on a new
    node. This patch tracks the first node used by an address space. Until
    it changes, PTE scanning is disabled and no NUMA hinting faults are
    trapped. This should help workloads that are short-lived, do not care
    about NUMA placement or have bound themselves to a single node.

    This takes advantage of the logic in "mm: sched: numa: Implement slow
    start for working set sampling" to delay when the checks are made. This
    will take advantage of processes that set their CPU and node bindings
    early in their lifetime. It will also potentially allow any initial load
    balancing to take place.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • This patch adds Kconfig options and kernel parameters to allow the
    enabling and disabling of automatic NUMA balancing. The existence
    of such a switch was and is very important when debugging problems
    related to transparent hugepages and we should have the same for
    automatic NUMA placement.

    Signed-off-by: Mel Gorman

    Mel Gorman
     
  • NOTE: This patch is based on "sched, numa, mm: Add fault driven
    placement and migration policy" but as it throws away all the policy
    to just leave a basic foundation I had to drop the signed-offs-by.

    This patch creates a bare-bones method for setting PTEs pte_numa in the
    context of the scheduler so that, when faulted later, they will be
    faulted onto the node the CPU is running on. In itself this does nothing
    useful, but any placement policy will fundamentally depend on receiving
    hints on placement from fault context and doing something intelligent
    about it.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel

    Peter Zijlstra
     

16 Oct, 2012

1 commit


13 Sep, 2012

1 commit

  • Heterogeneous ARM platforms use the arch_scale_freq_power function
    to reflect the relative capacity of each core.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1341826026-6504-6-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

04 Sep, 2012

1 commit

  • Commit beac4c7e4a1c ("sched: Remove AFFINE_WAKEUPS feature") removed
    use of the flag but left the definition. Get rid of it.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1345090865-20851-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Ingo Molnar

    Namhyung Kim
     

26 Apr, 2012

1 commit

  • Commits 367456c756a6 ("sched: Ditch per cgroup task lists for
    load-balancing") and 5d6523ebd ("sched: Fix load-balance wreckage")
    left some more wreckage.

    By setting loop_max unconditionally to ->nr_running, load-balancing
    could take a lot of time on very long runqueues (hackbench!). So keep
    the sysctl as a max limit on the number of tasks we'll iterate, as
    sketched below.

    Furthermore, the min load filter for migration completely fails with
    cgroups since inequality in per-cpu state can easily lead to such
    small loads :/

    Furthermore, the change to add new tasks to the tail of the queue
    instead of the head seems to have some effect... not quite sure I
    understand why.

    Combined, these fixes solve the huge hackbench regression reported by
    Tim when hackbench is run in a cgroup.
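
    A minimal sketch of the loop_max part of the fix, under simplified names
    (not the exact kernel structures): the per-pass iteration budget becomes
    the smaller of the sysctl and the busiest runqueue's length, instead of
    unconditionally following ->nr_running.

    struct lb_env_sketch {
        unsigned int loop;     /* tasks inspected so far in this pass */
        unsigned int loop_max; /* budget for this pass */
    };

    static void lb_set_loop_max(struct lb_env_sketch *env,
                                unsigned int sysctl_sched_nr_migrate,
                                unsigned int busiest_nr_running)
    {
        env->loop = 0;
        env->loop_max = sysctl_sched_nr_migrate < busiest_nr_running ?
                        sysctl_sched_nr_migrate : busiest_nr_running;
    }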

    Reported-by: Tim Chen
    Acked-by: Tim Chen
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1335365763.28150.267.camel@twins
    [ got rid of the CONFIG_PREEMPT tuning and made small readability edits ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Dec, 2011

1 commit

  • Now that we initialize jump_labels before sched_init() we can use them
    for the debug features without having to worry about a window where
    they have the wrong setting.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-vpreo4hal9e0kzqmg5y0io2k@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

17 Nov, 2011

1 commit