24 Jan, 2018

1 commit

  • commit c96f5471ce7d2aefd0dda560cc23f08ab00bc65d upstream.

    Before commit:

    e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")

    delayacct_blkio_end() was called after context-switching into the task which
    completed I/O.

    This resulted in double counting: the task would account a delay both waiting
    for I/O and for time spent in the runqueue.

    With e33a9bba85a8, delayacct_blkio_end() is called by try_to_wake_up().
    In ttwu, we have not yet context-switched. This is more correct, in that
    the delay accounting ends when the I/O is complete.

    But delayacct_blkio_end() relies on 'get_current()', and we have not yet
    context-switched into the task whose I/O completed. This results in the
    wrong task having its delay accounting statistics updated.

    Instead of doing that, pass the task_struct being woken to delayacct_blkio_end(),
    so that it can update the statistics of the correct task.
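
    A minimal sketch of the shape of the change, assuming the
    __delayacct_blkio_end() helper is reworked to take the woken task (the
    exact upstream diff may differ):

        /* Sketch only: the callee used to rely implicitly on get_current(). */
        static inline void delayacct_blkio_end(struct task_struct *p)
        {
                if (p->delays)
                        __delayacct_blkio_end(p); /* charge the delay to @p, not to current */
        }

        /* try_to_wake_up(), simplified: */
        if (p->in_iowait) {
                delayacct_blkio_end(p);           /* was: delayacct_blkio_end(); */
                atomic_dec(&task_rq(p)->nr_iowait);
        }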

    Signed-off-by: Josh Snyder
    Acked-by: Tejun Heo
    Acked-by: Balbir Singh
    Cc: Brendan Gregg
    Cc: Jens Axboe
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-block@vger.kernel.org
    Fixes: e33a9bba85a8 ("sched/core: move IO scheduling accounting from io_schedule_timeout() into scheduler")
    Link: http://lkml.kernel.org/r/1513613712-571-1-git-send-email-joshs@netflix.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Josh Snyder
     

17 Jan, 2018

1 commit

  • commit 541676078b52f365f53d46ee5517d305cd1b6350 upstream.

    smp_call_function_many() requires disabling preemption around the call.
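
    A sketch of the pattern the fix applies (preempt_disable()/preempt_enable()
    are the real kernel primitives; the membarrier call site shown here is
    simplified and its surroundings are assumptions):

        /* smp_call_function_many() must not run with preemption enabled. */
        preempt_disable();
        smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
        preempt_enable();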

    Signed-off-by: Mathieu Desnoyers
    Cc: Andrea Parri
    Cc: Andrew Hunter
    Cc: Avi Kivity
    Cc: Benjamin Herrenschmidt
    Cc: Boqun Feng
    Cc: Dave Watson
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Maged Michael
    Cc: Michael Ellerman
    Cc: Paul E . McKenney
    Cc: Paul E. McKenney
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20171215192310.25293-1-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Mathieu Desnoyers
     

03 Jan, 2018

1 commit

  • commit 466a2b42d67644447a1765276259a3ea5531ddff upstream.

    Since the recent remote cpufreq callback work, it's possible that a cpufreq
    update is triggered from a remote CPU. For single policies however, the current
    code uses the local CPU when trying to determine if the remote sg_cpu entered
    idle or is busy. This is incorrect. To remedy this, compare with the nohz tick
    idle_calls counter of the remote CPU.
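
    A hedged sketch of the busy check after the fix, assuming the
    tick_nohz_get_idle_calls_cpu() helper introduced by the patch (simplified
    from the schedutil governor):

        static bool sugov_cpu_is_busy(struct sugov_cpu *sg_cpu)
        {
                unsigned long idle_calls = tick_nohz_get_idle_calls_cpu(sg_cpu->cpu);
                bool ret = idle_calls == sg_cpu->saved_idle_calls; /* no new idle entries */

                sg_cpu->saved_idle_calls = idle_calls;
                return ret;
        }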

    Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
    Acked-by: Viresh Kumar
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Joel Fernandes
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes
     

20 Dec, 2017

1 commit

  • commit f73c52a5bcd1710994e53fbccc378c42b97a06b6 upstream.

    Daniel Wagner reported a crash on the BeagleBone Black SoC.

    This is a single CPU architecture, and does not have a functional
    arch_send_call_function_single_ipi() implementation which can crash
    the kernel if that is called.

    As it only has one CPU, the function shouldn't be called at all. But if the
    kernel is compiled for SMP, the push/pull RT scheduling logic now calls it
    for irq_work, and when the one CPU is overloaded it can use that function
    to call itself and crash the kernel.

    Ideally, we should disable the SCHED_FEAT(RT_PUSH_IPI) if the system
    only has a single CPU. But SCHED_FEAT is a constant if sched debugging
    is turned off. Another fix can also be used, and this should also help
    with normal SMP machines. That is, do not initiate the pull code if
    there's only one RT overloaded CPU, and that CPU happens to be the
    current CPU that is scheduling in a lower priority task.

    Even on a system with many CPUs, if there's many RT tasks waiting to
    run on a single CPU, and that CPU schedules in another RT task of lower
    priority, it will initiate the PULL logic in case there's a higher
    priority RT task on another CPU that is waiting to run. But if there is
    no other CPU with waiting RT tasks, it will initiate the RT pull logic
    on itself (as it still has RT tasks waiting to run). This is a wasted
    effort.

    Not only does this help with SMP code where the current CPU is the only
    one with RT overloaded tasks, it should also solve the issue that
    Daniel encountered, because it will prevent the PULL logic from
    executing, as there's only one CPU on the system, and the check added
    here will cause it to exit the RT pull code.
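
    A sketch of the added early exit at the top of the RT pull path
    (simplified; the exact upstream condition may differ slightly):

        /* pull_rt_task(), simplified: */
        int rt_overload_count = rt_overloaded(this_rq);

        if (likely(!rt_overload_count))
                return;

        /* If we are the only overloaded CPU, there is nothing to pull. */
        if (rt_overload_count == 1 &&
            cpumask_test_cpu(this_rq->cpu, this_rq->rd->rto_mask))
                return;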

    Reported-by: Daniel Wagner
    Signed-off-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Sebastian Andrzej Siewior
    Cc: Thomas Gleixner
    Cc: linux-rt-users
    Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt
     

30 Nov, 2017

3 commits

  • commit 4bdced5c9a2922521e325896a7bbbf0132c94e56 upstream.

    When a CPU lowers its priority (schedules out a high priority task for a
    lower priority one), a check is made to see if any other CPU has overloaded
    RT tasks (more than one). It checks the rto_mask to determine this and if so
    it will request to pull one of those tasks to itself if the non running RT
    task is of higher priority than the new priority of the next task to run on
    the current CPU.

    When we deal with a large number of CPUs, the original pull logic suffered
    from heavy lock contention on a single CPU run queue, which caused huge
    latency across all CPUs. This was caused by only one CPU having overloaded
    RT tasks while a bunch of other CPUs were lowering their priority. To
    solve this issue, commit:

    b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

    changed the way to request a pull. Instead of grabbing the lock of the
    overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

    Although the IPI logic worked very well in removing the large latency build
    up, it still could suffer from a large number of IPIs being sent to a single
    CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
    when I tested this on a 120 CPU box, with a stress test that had lots of
    RT tasks scheduling on all CPUs, it actually triggered the hard lockup
    detector! One CPU had so many IPIs sent to it, and due to the restart
    mechanism that is triggered when the source run queue has a priority status
    change, the CPU spent minutes! processing the IPIs.

    Thinking about this further, I realized there's no reason for each run queue
    to send its own IPI. All CPUs with overloaded tasks must be scanned
    regardless of whether one or many CPUs are lowering their priority, because
    there's currently no way to find the CPU with the highest priority task that
    can schedule to one of these CPUs; so there really only needs to be one IPI
    being sent around at a time.

    This greatly simplifies the code!

    The new approach is to have each root domain have its own irq work, as the
    rto_mask is per root domain. The root domain has the following fields
    attached to it:

    rto_push_work  - the irq work to process each CPU set in rto_mask
    rto_lock       - the lock to protect some of the other rto fields
    rto_loop_start - an atomic that keeps contention down on rto_lock;
                     the first CPU scheduling in a lower priority task
                     is the one to kick off the process.
    rto_loop_next  - an atomic that gets incremented for each CPU that
                     schedules in a lower priority task.
    rto_loop       - a variable protected by rto_lock that is used to
                     compare against rto_loop_next.
    rto_cpu        - the CPU to send the next IPI to, also protected by
                     the rto_lock.

    When a CPU schedules in a lower priority task and wants to make sure
    overloaded CPUs know about it, it increments rto_loop_next. Then it
    atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
    then it is done, as another CPU is kicking off the IPI loop. If the old
    value is "0", then it will take the rto_lock to synchronize with a possible
    IPI being sent around to the overloaded CPUs.

    If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
    IPI being sent around, or one is about to finish. Then rto_cpu is set to the
    first CPU in rto_mask and an IPI is sent to that CPU. If there are no CPUs
    set in rto_mask, then there's nothing to be done.

    When the CPU receives the IPI, it will first try to push any RT tasks that
    are queued on the CPU but can't run because a higher priority RT task is
    currently running on that CPU.

    Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
    finds one, it simply sends an IPI to that CPU and the process continues.

    If there's no more CPUs in the rto_mask, then rto_loop is compared with
    rto_loop_next. If they match, everything is done and the process is over. If
    they do not match, then a CPU scheduled in a lower priority task as the IPI
    was being passed around, and the process needs to start again. The first CPU
    in rto_mask is sent the IPI.

    This change removes this duplication of work in the IPI logic, and greatly
    lowers the latency caused by the IPIs. This removed the lockup happening on
    the 120 CPU machine. It also simplifies the code tremendously. What else
    could anyone ask for?

    Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
    supplying me with the rto_start_trylock() and rto_start_unlock() helper
    functions.
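
    A sketch of those helpers, assuming they wrap acquire/release atomics on
    rto_loop_start (treat the exact form as approximate):

        static inline bool rto_start_trylock(atomic_t *v)
        {
                return !atomic_cmpxchg_acquire(v, 0, 1);
        }

        static inline void rto_start_unlock(atomic_t *v)
        {
                atomic_set_release(v, 0);
        }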

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Clark Williams
    Cc: Daniel Bristot de Oliveira
    Cc: John Kacur
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Scott Wood
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (Red Hat)
     
  • commit 7c2102e56a3f7d85b5d8f33efbd7aecc1f36fdd8 upstream.

    The current implementation of synchronize_sched_expedited() incorrectly
    assumes that resched_cpu() is unconditional, which it is not. This means
    that synchronize_sched_expedited() can hang when resched_cpu()'s trylock
    fails as follows (analysis by Neeraj Upadhyay):

    o CPU1 is waiting for expedited wait to complete:

          sync_rcu_exp_select_cpus
              rdp->exp_dynticks_snap & 0x1   // returns 1 for CPU5
              IPI sent to CPU5

          synchronize_sched_expedited_wait
              ret = swait_event_timeout(rsp->expedited_wq,
                                        sync_rcu_preempt_exp_done(rnp_root),
                                        jiffies_stall);

          expmask = 0x20, CPU 5 in idle path (in cpuidle_enter())

    o CPU5 handles IPI and fails to acquire rq lock.

          Handles IPI
              sync_sched_exp_handler
                  resched_cpu
                      returns while failing to try lock acquire rq->lock
              need_resched is not set

    o CPU5 calls rcu_idle_enter() and as need_resched is not set, goes to
      idle (schedule() is not called).

    o CPU 1 reports RCU stall.

    Given that resched_cpu() is now used only by RCU, this commit fixes the
    assumption by making resched_cpu() unconditional.
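
    A sketch of resched_cpu() after the change, assuming the trylock is simply
    replaced by an unconditional lock:

        void resched_cpu(int cpu)
        {
                struct rq *rq = cpu_rq(cpu);
                unsigned long flags;

                raw_spin_lock_irqsave(&rq->lock, flags);  /* was: a trylock that could bail out */
                resched_curr(rq);
                raw_spin_unlock_irqrestore(&rq->lock, flags);
        }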

    Reported-by: Neeraj Upadhyay
    Suggested-by: Neeraj Upadhyay
    Signed-off-by: Paul E. McKenney
    Acked-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Greg Kroah-Hartman

    Paul E. McKenney
     
  • commit 07458f6a5171d97511dfbdf6ce549ed2ca0280c7 upstream.

    'cached_raw_freq' is used to get the next frequency quickly but should
    always be in sync with sg_policy->next_freq. There is a case where it is
    not and in such cases it should be reset to avoid switching to incorrect
    frequencies.

    Consider this case for example:

    - policy->cur is 1.2 GHz (Max)
    - New request comes for 780 MHz and we store that in cached_raw_freq.
    - Based on 780 MHz, we calculate the effective frequency as 800 MHz.
    - We then see the CPU wasn't idle recently and choose to keep the next
      freq as 1.2 GHz.
    - Now cached_raw_freq is 780 MHz and sg_policy->next_freq is 1.2 GHz.
    - If the utilization doesn't change in the next request, the next target
      frequency will still be 780 MHz and it will match cached_raw_freq. But
      we will choose 1.2 GHz instead of 800 MHz here.
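
    A sketch of the fix in sugov_update_single(), assuming the reset is placed
    where the busy-CPU check overrides the freshly computed frequency:

        if (busy && next_f < sg_policy->next_freq) {
                next_f = sg_policy->next_freq;

                /* Keep cached_raw_freq in sync with next_freq. */
                sg_policy->cached_raw_freq = 0;
        }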

    Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
    Signed-off-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Viresh Kumar
     

05 Nov, 2017

1 commit

  • After commit 674e75411fc2 (sched: cpufreq: Allow remote cpufreq
    callbacks) we stopped always reading the utilization for the CPU we
    are running the governor on, and instead read it for the CPU which
    we've been told has updated utilization. This is stored in
    sugov_cpu->cpu.

    The value is set in sugov_register(), but we clear it in sugov_start(),
    which leads to always looking at the utilization of CPU0 instead of
    the correct one.

    Fix this by consolidating the initialization code into sugov_start().
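
    A sketch of the consolidated initialization in sugov_start(); the memset()
    is what used to wipe sg_cpu->cpu, so it is re-set right afterwards (field
    names follow the driver, details hedged):

        for_each_cpu(cpu, policy->cpus) {
                struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

                memset(sg_cpu, 0, sizeof(*sg_cpu));
                sg_cpu->cpu       = cpu;        /* re-initialize after the memset */
                sg_cpu->sg_policy = sg_policy;
        }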

    Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
    Signed-off-by: Chris Redpath
    Reviewed-by: Patrick Bellasi
    Reviewed-by: Brendan Jackman
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Chris Redpath
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

20 Oct, 2017

1 commit

  • This introduces a "register private expedited" membarrier command which
    allows eventual removal of important memory barrier constraints on the
    scheduler fast-paths. It changes how the "private expedited" membarrier
    command (new to 4.14) is used from user-space.

    This new command allows processes to register their intent to use the
    private expedited command. This affects how the expedited private
    command introduced in 4.14-rc is meant to be used, and should be merged
    before 4.14 final.

    Processes are now required to register before using
    MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

    This fixes a problem that arose when designing requested extensions to
    sys_membarrier() to allow JITs to efficiently flush old code from
    instruction caches. Several potential algorithms are much less painful
    if the user registers intent to use this functionality early on, for
    example before the process spawns the second thread. Registering at
    this time removes the need to interrupt each and every thread in that
    process at the first expedited sys_membarrier() system call.
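
    A minimal user-space sketch of the resulting two-step usage (command names
    per the 4.14 uapi headers; error handling kept to a minimum):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        static int membarrier(int cmd, int flags)
        {
                return syscall(__NR_membarrier, cmd, flags);
        }

        int main(void)
        {
                /* Register intent early, e.g. before spawning the second thread. */
                if (membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0))
                        return 1;

                /* Without the registration above, this would fail with EPERM. */
                return membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0);
        }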

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

10 Oct, 2017

3 commits

  • While load_balance() masks the source CPUs against active_mask, it had
    a hole against the destination CPU. Ensure the destination CPU is also
    part of the 'domain-mask & active-mask' set.

    Reported-by: Levin, Alexander (Sasha Levin)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The trivial wake_affine_idle() implementation is very good for a
    number of workloads, but it comes apart at the moment there are no
    idle CPUs left; IOW, the overloaded case.

    hackbench:

    NO_WA_WEIGHT WA_WEIGHT

    hackbench-20 : 7.362717561 seconds 6.450509391 seconds

    (win)

    netperf:

    NO_WA_WEIGHT WA_WEIGHT

    TCP_SENDFILE-1 : Avg: 54524.6 Avg: 52224.3
    TCP_SENDFILE-10 : Avg: 48185.2 Avg: 46504.3
    TCP_SENDFILE-20 : Avg: 29031.2 Avg: 28610.3
    TCP_SENDFILE-40 : Avg: 9819.72 Avg: 9253.12
    TCP_SENDFILE-80 : Avg: 5355.3 Avg: 4687.4

    TCP_STREAM-1 : Avg: 41448.3 Avg: 42254
    TCP_STREAM-10 : Avg: 24123.2 Avg: 25847.9
    TCP_STREAM-20 : Avg: 15834.5 Avg: 18374.4
    TCP_STREAM-40 : Avg: 5583.91 Avg: 5599.57
    TCP_STREAM-80 : Avg: 2329.66 Avg: 2726.41

    TCP_RR-1 : Avg: 80473.5 Avg: 82638.8
    TCP_RR-10 : Avg: 72660.5 Avg: 73265.1
    TCP_RR-20 : Avg: 52607.1 Avg: 52634.5
    TCP_RR-40 : Avg: 57199.2 Avg: 56302.3
    TCP_RR-80 : Avg: 25330.3 Avg: 26867.9

    UDP_RR-1 : Avg: 108266 Avg: 107844
    UDP_RR-10 : Avg: 95480 Avg: 95245.2
    UDP_RR-20 : Avg: 68770.8 Avg: 68673.7
    UDP_RR-40 : Avg: 76231 Avg: 75419.1
    UDP_RR-80 : Avg: 34578.3 Avg: 35639.1

    UDP_STREAM-1 : Avg: 64684.3 Avg: 66606
    UDP_STREAM-10 : Avg: 52701.2 Avg: 52959.5
    UDP_STREAM-20 : Avg: 30376.4 Avg: 29704
    UDP_STREAM-40 : Avg: 15685.8 Avg: 15266.5
    UDP_STREAM-80 : Avg: 8415.13 Avg: 7388.97

    (wins and losses)

    sysbench:

    NO_WA_WEIGHT WA_WEIGHT

    sysbench-mysql-2 : 2135.17 per sec. 2142.51 per sec.
    sysbench-mysql-5 : 4809.68 per sec. 4800.19 per sec.
    sysbench-mysql-10 : 9158.59 per sec. 9157.05 per sec.
    sysbench-mysql-20 : 14570.70 per sec. 14543.55 per sec.
    sysbench-mysql-40 : 22130.56 per sec. 22184.82 per sec.
    sysbench-mysql-80 : 20995.56 per sec. 21904.18 per sec.

    sysbench-psql-2 : 1679.58 per sec. 1705.06 per sec.
    sysbench-psql-5 : 3797.69 per sec. 3879.93 per sec.
    sysbench-psql-10 : 7253.22 per sec. 7258.06 per sec.
    sysbench-psql-20 : 11166.75 per sec. 11220.00 per sec.
    sysbench-psql-40 : 17277.28 per sec. 17359.78 per sec.
    sysbench-psql-80 : 17112.44 per sec. 17221.16 per sec.

    (increase on the top end)

    tbench:

    NO_WA_WEIGHT

    Throughput 685.211 MB/sec 2 clients 2 procs max_latency=0.123 ms
    Throughput 1596.64 MB/sec 5 clients 5 procs max_latency=0.119 ms
    Throughput 2985.47 MB/sec 10 clients 10 procs max_latency=0.262 ms
    Throughput 4521.15 MB/sec 20 clients 20 procs max_latency=0.506 ms
    Throughput 9438.1 MB/sec 40 clients 40 procs max_latency=2.052 ms
    Throughput 8210.5 MB/sec 80 clients 80 procs max_latency=8.310 ms

    WA_WEIGHT

    Throughput 697.292 MB/sec 2 clients 2 procs max_latency=0.127 ms
    Throughput 1596.48 MB/sec 5 clients 5 procs max_latency=0.080 ms
    Throughput 2975.22 MB/sec 10 clients 10 procs max_latency=0.254 ms
    Throughput 4575.14 MB/sec 20 clients 20 procs max_latency=0.502 ms
    Throughput 9468.65 MB/sec 40 clients 40 procs max_latency=2.069 ms
    Throughput 8631.73 MB/sec 80 clients 80 procs max_latency=8.605 ms

    (increase on the top end)

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Rik van Riel
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Eric reported a sysbench regression against commit:

    3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")

    Similarly, Rik was looking at the NAS-lu.C benchmark, which regressed
    against his v3.10 enterprise kernel.

    PRE (current tip/master):

    ivb-ep sysbench:

    2: [30 secs] transactions: 64110 (2136.94 per sec.)
    5: [30 secs] transactions: 143644 (4787.99 per sec.)
    10: [30 secs] transactions: 274298 (9142.93 per sec.)
    20: [30 secs] transactions: 418683 (13955.45 per sec.)
    40: [30 secs] transactions: 320731 (10690.15 per sec.)
    80: [30 secs] transactions: 355096 (11834.28 per sec.)

    hsw-ex NAS:

    OMP_PROC_BIND/lu.C.x_threads_144_run_1.log: Time in seconds = 18.01
    OMP_PROC_BIND/lu.C.x_threads_144_run_2.log: Time in seconds = 17.89
    OMP_PROC_BIND/lu.C.x_threads_144_run_3.log: Time in seconds = 17.93
    lu.C.x_threads_144_run_1.log: Time in seconds = 434.68
    lu.C.x_threads_144_run_2.log: Time in seconds = 405.36
    lu.C.x_threads_144_run_3.log: Time in seconds = 433.83

    POST (+patch):

    ivb-ep sysbench:

    2: [30 secs] transactions: 64494 (2149.75 per sec.)
    5: [30 secs] transactions: 145114 (4836.99 per sec.)
    10: [30 secs] transactions: 278311 (9276.69 per sec.)
    20: [30 secs] transactions: 437169 (14571.60 per sec.)
    40: [30 secs] transactions: 669837 (22326.73 per sec.)
    80: [30 secs] transactions: 631739 (21055.88 per sec.)

    hsw-ex NAS:

    lu.C.x_threads_144_run_1.log: Time in seconds = 23.36
    lu.C.x_threads_144_run_2.log: Time in seconds = 22.96
    lu.C.x_threads_144_run_3.log: Time in seconds = 22.52

    This patch takes out all the shiny wake_affine() stuff and goes back to
    utter basics. Between the two CPUs involved with the wakeup (the CPU
    doing the wakeup and the CPU we ran on previously) pick the CPU we can
    run on _now_.

    This restores much of the regressions against the older kernels,
    but leaves some ground in the overloaded case. The default-enabled
    WA_WEIGHT (which will be introduced in the next patch) is an attempt
    to address the overloaded situation.
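
    A sketch of the "utter basics" decision described above (not the literal
    upstream function): prefer the waking CPU only when it can take the task
    right now, otherwise stay on the previous CPU.

        static bool wake_affine_idle(int this_cpu, int prev_cpu, int sync)
        {
                if (idle_cpu(this_cpu))
                        return true;    /* waking CPU is idle: run here */

                if (sync && cpu_rq(this_cpu)->nr_running == 1)
                        return true;    /* sync wakeup, the waker is about to sleep */

                return false;           /* otherwise keep prev_cpu */
        }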

    Reported-by: Eric Farman
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Christian Borntraeger
    Cc: Linus Torvalds
    Cc: Matthew Rosato
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: jinpuwang@gmail.com
    Cc: vcaputo@pengaru.com
    Fixes: 3fed382b46ba ("sched/numa: Implement NUMA node level wake_affine()")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

15 Sep, 2017

2 commits

  • Now that we have added breaks in the wait queue scan and allow a
    bookmark on the scan position, we put this logic in the
    wake_up_page_bit() function.

    We can have very long page wait lists on large systems where multiple
    pages share the same wait list. We break the wake-up walk here to give
    other CPUs a chance to access the list, and to avoid disabling
    interrupts while traversing the list for too long. This reduces the
    interrupt and rescheduling latency, and excessive page wait queue lock
    hold time.

    [ v2: Remove bookmark_wake_function ]

    Signed-off-by: Tim Chen
    Signed-off-by: Linus Torvalds

    Tim Chen
     
  • We encountered workloads that have very long wake up lists on large
    systems. A waker takes a long time to traverse the entire wake list
    and execute all the wake functions.

    We saw page wait lists that are up to 3700+ entries long in tests of
    large 4 and 8 socket systems. It took 0.8 sec to traverse such a list
    during wake up. Any other CPU that contends for the list spin lock
    will spin for a long time. It is a result of the NUMA balancing
    migration of hot pages that are shared by many threads.

    Multiple CPUs waking are queued up behind the lock, and the last one
    queued has to wait until all CPUs did all the wakeups.

    The page wait list is traversed with interrupts disabled, which caused
    various problems. This was the original cause that triggered the NMI
    watchdog timer in: https://patchwork.kernel.org/patch/9800303/ . Only
    extending the NMI watchdog timer there helped.

    This patch bookmarks the waker's scan position in the wake list and
    breaks the wake-up walk, to allow access to the list before the waker
    resumes its walk down the rest of the wait list. It lowers the
    interrupt and rescheduling latency.

    This patch also provides a performance boost when combined with the next
    patch to break up page wakeup list walk. We saw 22% improvement in the
    will-it-scale file pread2 test on a Xeon Phi system running 256 threads.

    [ v2: Merged in Linus' changes to remove the bookmark_wake_function,
      and simplify access to flags. ]
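
    A hedged sketch of the bookmark mechanism: a dummy wait-queue entry marks
    where the walk stopped so the waker can drop the lock and resume from the
    same spot (constants and flags follow the patch, but treat the body as
    illustrative):

        #define WAIT_QUEUE_WALK_BREAK_CNT 64

        /* Inside __wake_up_common(), simplified: */
        list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
                unsigned flags = curr->flags;

                if (flags & WQ_FLAG_BOOKMARK)
                        continue;       /* skip other wakers' bookmarks */

                if (curr->func(curr, mode, wake_flags, key) < 0)
                        break;

                if (bookmark && (++cnt > WAIT_QUEUE_WALK_BREAK_CNT) &&
                    (&next->entry != &wq_head->head)) {
                        bookmark->flags = WQ_FLAG_BOOKMARK;
                        list_add_tail(&bookmark->entry, &next->entry);
                        break;          /* caller drops the lock, then resumes here */
                }
        }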

    Reported-by: Kan Liang
    Tested-by: Kan Liang
    Signed-off-by: Tim Chen
    Signed-off-by: Linus Torvalds

    Tim Chen
     

12 Sep, 2017

4 commits

  • I'm forever late for editing my kernel cmdline, add a runtime knob to
    disable the "sched_debug" thing.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170907150614.142924283@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Migrating tasks to offline CPUs is a pretty big fail, warn about it.
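
    A sketch of the added check, assuming it lands in set_task_cpu() under the
    existing SCHED_DEBUG sanity block:

        /* set_task_cpu(), simplified: */
        #ifdef CONFIG_SCHED_DEBUG
                WARN_ON_ONCE(!cpu_online(new_cpu));
        #endif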

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170907150614.094206976@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The load balancer applies cpu_active_mask to whatever sched_domains it
    finds, however in the case of active_balance there is a hole between
    setting rq->{active_balance,push_cpu} and running the stop_machine
    work doing the actual migration.

    The @push_cpu can go offline in this window, which would result in us
    moving a task onto a dead cpu, which is a fairly bad thing.

    Double check the active mask before the stop work does the migration.

    CPU0                                   CPU1

    stop_machine(takedown_cpu)
    load_balance()                         cpu_stopper_thread()
      ...                                    work = multi_cpu_stop
      stop_one_cpu_nowait(                     /* wait for CPU0 */
        .func = active_load_balance_cpu_stop
      );

    cpu_stopper_thread()
      work = multi_cpu_stop
        /* sync with CPU1 */
      take_cpu_down()

        play_dead();

      work = active_load_balance_cpu_stop
        set_task_cpu(p, CPU1); /* oops!! */
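
    A sketch of the double check at the top of active_load_balance_cpu_stop()
    (simplified; the exact upstream condition may differ):

        /* active_load_balance_cpu_stop(), simplified: */
        raw_spin_lock_irq(&busiest_rq->lock);

        /* Either CPU may have gone offline between queueing the stop work
         * and running it; bail out rather than migrate onto a dead CPU. */
        if (!cpu_active(busiest_cpu) || !cpu_active(target_cpu))
                goto out_unlock;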

    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20170907150614.044460912@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • On CPU hot unplug, when parking the last kthread we'll try and
    schedule into idle to kill the CPU. This last schedule can (and does)
    trigger newidle balance because at this point the sched domains are
    still up because of commit:

    77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")

    Obviously pulling tasks to an already offline CPU is a bad idea, and
    all balancing operations _should_ be subject to cpu_active_mask, make
    it so.
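
    A sketch of the resulting guard in the newidle balance path (placement and
    exact form are assumptions):

        /* idle_balance(), simplified: */
        if (!cpu_active(this_cpu))
                return 0;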

    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Fixes: 77d1dfda0e79 ("sched/topology, cpuset: Avoid spurious/wrong domain rebuilds")
    Link: http://lkml.kernel.org/r/20170907150613.994135806@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

11 Sep, 2017

1 commit

  • Work around kernel-doc warning ('*' in Sphinx doc means "emphasis"):

    ../kernel/sched/fair.c:7584: WARNING: Inline emphasis start-string without end-string.

    Signed-off-by: Randy Dunlap
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/f18b30f9-6251-6d86-9d44-16501e386891@infradead.org
    Signed-off-by: Ingo Molnar

    Randy Dunlap
     

09 Sep, 2017

3 commits

  • ... with the generic rbtree flavor instead. No changes
    in semantics whatsoever.

    Link: http://lkml.kernel.org/r/20170719014603.19029-9-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • ... with the generic rbtree flavor instead. No changes
    in semantics whatsoever.

    Link: http://lkml.kernel.org/r/20170719014603.19029-8-dave@stgolabs.net
    Signed-off-by: Davidlohr Bueso
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • First, the number of CPUs can't be a negative number.

    Second, different signedness leads to suboptimal code in the following
    cases:

    1)
    kmalloc(nr_cpu_ids * sizeof(X));

    "int" has to be sign extended to size_t.

    2)
    while (loff_t *pos < nr_cpu_ids)

    MOVSXD is 1 byte longer than the same MOV.

    Other cases exist as well. Basically the compiler is told that nr_cpu_ids
    can't be negative, which can't be deduced if it is "int".
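
    A sketch of the type change and its effect at an illustrative call site:

        extern unsigned int nr_cpu_ids;         /* was: extern int nr_cpu_ids; */

        /* No sign extension (MOVSXD) is needed to widen the multiplicand: */
        mask = kmalloc(nr_cpu_ids * sizeof(long), GFP_KERNEL);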

    Code savings on allyesconfig kernel: -3KB

    add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
    function old new delta
    coretemp_cpu_online 450 512 +62
    rcu_init_one 1234 1272 +38
    pci_device_probe 374 399 +25

    ...

    pgdat_reclaimable_pages 628 556 -72
    select_fallback_rq 446 369 -77
    task_numa_find_cpu 1923 1807 -116

    Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

07 Sep, 2017

2 commits

  • Cpusets vs. suspend-resume is _completely_ broken. And it got noticed
    because it now resulted in non-cpuset usage breaking too.

    On suspend cpuset_cpu_inactive() doesn't call into
    cpuset_update_active_cpus() because it doesn't want to move tasks about,
    there is no need, all tasks are frozen and won't run again until after
    we've resumed everything.

    But this means that when we finally do call into
    cpuset_update_active_cpus() after resuming the last frozen cpu in
    cpuset_cpu_active(), the top_cpuset will not have any difference with
    the cpu_active_mask and thus it will not in fact do _anything_.

    So the cpuset configuration will not be restored. This was largely
    hidden because we would unconditionally create identity domains and
    mobile users would not in fact use cpusets much. And servers that do
    use cpusets tend not to suspend-resume much.

    An additional problem is that we'd not in fact wait for the cpuset work
    to finish before resuming the tasks, allowing spurious migrations
    outside of the specified domains.

    Fix the rebuild by introducing cpuset_force_rebuild() and fix the
    ordering with cpuset_wait_for_hotplug().
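
    A hedged sketch of the flag-based rebuild described above (names other
    than cpuset_force_rebuild() and rebuild_sched_domains() are assumptions):

        static bool force_rebuild;

        void cpuset_force_rebuild(void)
        {
                force_rebuild = true;
        }

        /* In the cpuset hotplug work function, simplified: */
        if (cpus_updated || force_rebuild) {
                force_rebuild = false;
                rebuild_sched_domains();
        }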

    Reported-by: Andy Lutomirski
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Fixes: deb7aa308ea2 ("cpuset: reorganize CPU / memory hotplug handling")
    Link: http://lkml.kernel.org/r/20170907091338.orwxrqkbfkki3c24@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Chris Wilson reported that the SMT balance rules got the +1 on the
    wrong side, resulting in a bias towards the current LLC; which the
    load-balancer would then try and undo.

    Reported-by: Chris Wilson
    Tested-by: Chris Wilson
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andy Lutomirski
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Fixes: 90001d67be2f ("sched/fair: Fix wake_affine() for !NUMA_BALANCING")
    Link: http://lkml.kernel.org/r/20170906105131.gqjmaextmn3u6tj2@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Sep, 2017

1 commit

  • Pull power management updates from Rafael Wysocki:
    "This time (again) cpufreq gets the majority of changes which mostly
    are driver updates (including a major consolidation of intel_pstate),
    some schedutil governor modifications and core cleanups.

    There also are some changes in the system suspend area, mostly related
    to diagnostics and debug messages plus some renames of things related
    to suspend-to-idle. One major change here is that suspend-to-idle is
    now going to be preferred over S3 on systems where the ACPI tables
    indicate to do so and provide requisite support (the Low Power Idle S0
    _DSM in particular). The system sleep documentation and the tools
    related to it are updated too.

    The rest is a few cpuidle changes (nothing major), devfreq updates,
    generic power domains (genpd) framework updates and a few assorted
    modifications elsewhere.

    Specifics:

    - Drop the P-state selection algorithm based on a PID controller from
    intel_pstate and make it use the same P-state selection method
    (based on the CPU load) for all types of systems in the active mode
    (Rafael Wysocki, Srinivas Pandruvada).

    - Rework the cpufreq core and governors to make it possible to take
    cross-CPU utilization updates into account and modify the schedutil
    governor to actually do so (Viresh Kumar).

    - Clean up the handling of transition latency information in the
    cpufreq core and untangle it from the information on which drivers
    cannot do dynamic frequency switching (Viresh Kumar).

    - Add support for new SoCs (MT2701/MT7623 and MT7622) to the mediatek
    cpufreq driver and update its DT bindings (Sean Wang).

    - Modify the cpufreq dt-platdev driver to automatically create
    cpufreq devices for the new (v2) Operating Performance Points (OPP)
    DT bindings and update its whitelist of supported systems (Viresh
    Kumar, Shubhrajyoti Datta, Marc Gonzalez, Khiem Nguyen, Finley
    Xiao).

    - Add support for Ux500 to the cpufreq-dt driver and drop the
    obsolete dbx500 cpufreq driver (Linus Walleij, Arnd Bergmann).

    - Add new SoC (R8A7795) support to the cpufreq rcar driver (Khiem
    Nguyen).

    - Fix and clean up assorted issues in the cpufreq drivers and core
    (Arvind Yadav, Christophe Jaillet, Colin Ian King, Gustavo Silva,
    Julia Lawall, Leonard Crestez, Rob Herring, Sudeep Holla).

    - Update the IO-wait boost handling in the schedutil governor to make
    it less aggressive (Joel Fernandes).

    - Rework system suspend diagnostics to make it print fewer messages
    to the kernel log by default, add a sysfs knob to allow more
    suspend-related messages to be printed and add Low Power S0 Idle
    constraints checks to the ACPI suspend-to-idle code (Rafael
    Wysocki, Srinivas Pandruvada).

    - Prefer suspend-to-idle over S3 on ACPI-based systems with the
    ACPI_FADT_LOW_POWER_S0 flag set and the Low Power Idle S0 _DSM
    interface present in the ACPI tables (Rafael Wysocki).

    - Update documentation related to system sleep and rename a number of
    items in the code to make it clearer that they are related to
    suspend-to-idle (Rafael Wysocki).

    - Export a variable allowing device drivers to check the target
    system sleep state from the core system suspend code (Florian
    Fainelli).

    - Clean up the cpuidle subsystem to handle the polling state on x86
    in a more straightforward way and to use %pOF instead of full_name
    (Rafael Wysocki, Rob Herring).

    - Update the devfreq framework to fix and clean up a few minor issues
    (Chanwoo Choi, Rob Herring).

    - Extend diagnostics in the generic power domains (genpd) framework
    and clean it up slightly (Thara Gopinath, Rob Herring).

    - Fix and clean up a couple of issues in the operating performance
    points (OPP) framework (Viresh Kumar, Waldemar Rymarkiewicz).

    - Add support for RV1108 to the rockchip-io Adaptive Voltage Scaling
    (AVS) driver (David Wu).

    - Fix the usage of notifiers in CPU power management on some
    platforms (Alex Shi).

    - Update the pm-graph system suspend/hibernation and boot profiling
    utility (Todd Brandt).

    - Make it possible to run the cpupower utility without CPU0 (Prarit
    Bhargava)"

    * tag 'pm-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (87 commits)
    cpuidle: Make drivers initialize polling state
    cpuidle: Move polling state initialization code to separate file
    cpuidle: Eliminate the CPUIDLE_DRIVER_STATE_START symbol
    cpufreq: imx6q: Fix imx6sx low frequency support
    cpufreq: speedstep-lib: make several arrays static, makes code smaller
    PM: docs: Delete the obsolete states.txt document
    PM: docs: Describe high-level PM strategies and sleep states
    PM / devfreq: Fix memory leak when fail to register device
    PM / devfreq: Add dependency on PM_OPP
    PM / devfreq: Move private devfreq_update_stats() into devfreq
    PM / devfreq: Convert to using %pOF instead of full_name
    PM / AVS: rockchip-io: add io selectors and supplies for RV1108
    cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
    cpufreq: dt-platdev: Drop few entries from whitelist
    cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
    ARM: ux500: don't select CPUFREQ_DT
    cpuidle: Convert to using %pOF instead of full_name
    cpufreq: Convert to using %pOF instead of full_name
    PM / Domains: Convert to using %pOF instead of full_name
    cpufreq: Cap the default transition delay value to 10 ms
    ...

    Linus Torvalds
     

05 Sep, 2017

2 commits

  • Pull locking updates from Ingo Molnar:

    - Add 'cross-release' support to lockdep, which allows APIs like
    completions, where it's not the 'owner' who releases the lock, to be
    tracked. It's all activated automatically under
    CONFIG_PROVE_LOCKING=y.

    - Clean up (restructure) the x86 atomics op implementation to be more
    readable, in preparation of KASAN annotations. (Dmitry Vyukov)

    - Fix static keys (Paolo Bonzini)

    - Add killable versions of down_read() et al (Kirill Tkhai)

    - Rework and fix jump_label locking (Marc Zyngier, Paolo Bonzini)

    - Rework (and fix) tlb_flush_pending() barriers (Peter Zijlstra)

    - Remove smp_mb__before_spinlock() and convert its usages, introduce
    smp_mb__after_spinlock() (Peter Zijlstra)

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (56 commits)
    locking/lockdep/selftests: Fix mixed read-write ABBA tests
    sched/completion: Avoid unnecessary stack allocation for COMPLETION_INITIALIZER_ONSTACK()
    acpi/nfit: Fix COMPLETION_INITIALIZER_ONSTACK() abuse
    locking/pvqspinlock: Relax cmpxchg's to improve performance on some architectures
    smp: Avoid using two cache lines for struct call_single_data
    locking/lockdep: Untangle xhlock history save/restore from task independence
    locking/refcounts, x86/asm: Disable CONFIG_ARCH_HAS_REFCOUNT for the time being
    futex: Remove duplicated code and fix undefined behaviour
    Documentation/locking/atomic: Finish the document...
    locking/lockdep: Fix workqueue crossrelease annotation
    workqueue/lockdep: 'Fix' flush_work() annotation
    locking/lockdep/selftests: Add mixed read-write ABBA tests
    mm, locking/barriers: Clarify tlb_flush_pending() barriers
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE and CONFIG_LOCKDEP_COMPLETIONS truly non-interactive
    locking/lockdep: Explicitly initialize wq_barrier::done::map
    locking/lockdep: Rename CONFIG_LOCKDEP_COMPLETE to CONFIG_LOCKDEP_COMPLETIONS
    locking/lockdep: Reword title of LOCKDEP_CROSSRELEASE config
    locking/lockdep: Make CONFIG_LOCKDEP_CROSSRELEASE part of CONFIG_PROVE_LOCKING
    locking/refcounts, x86/asm: Implement fast refcount overflow protection
    locking/lockdep: Fix the rollback and overwrite detection logic in crossrelease
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - fix affine wakeups (Peter Zijlstra)

    - improve CPU onlining (and general bootup) scalability on systems
    with ridiculous number (thousands) of CPUs (Peter Zijlstra)

    - sched/numa updates (Rik van Riel)

    - sched/deadline updates (Byungchul Park)

    - sched/cpufreq enhancements and related cleanups (Viresh Kumar)

    - sched/debug enhancements (Xie XiuQi)

    - various fixes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
    sched/debug: Optimize sched_domain sysctl generation
    sched/topology: Avoid pointless rebuild
    sched/topology, cpuset: Avoid spurious/wrong domain rebuilds
    sched/topology: Improve comments
    sched/topology: Fix memory leak in __sdt_alloc()
    sched/completion: Document that reinit_completion() must be called after complete_all()
    sched/autogroup: Fix error reporting printk text in autogroup_create()
    sched/fair: Fix wake_affine() for !NUMA_BALANCING
    sched/debug: Intruduce task_state_to_char() helper function
    sched/debug: Show task state in /proc/sched_debug
    sched/debug: Use task_pid_nr_ns in /proc/$pid/sched
    sched/core: Remove unnecessary initialization init_idle_bootup_task()
    sched/deadline: Change return value of cpudl_find()
    sched/deadline: Make find_later_rq() choose a closer CPU in topology
    sched/numa: Scale scan period with tasks in group and shared/private
    sched/numa: Slow down scan rate if shared faults dominate
    sched/pelt: Fix false running accounting
    sched: Mark pick_next_task_dl() and build_sched_domain() as static
    sched/cpupri: Don't re-initialize 'struct cpupri'
    sched/deadline: Don't re-initialize 'struct cpudl'
    ...

    Linus Torvalds
     

04 Sep, 2017

5 commits

  • Pull RCU updates from Ingo Molnar:
    "The main RCU related changes in this cycle were:

    - Removal of spin_unlock_wait()
    - SRCU updates
    - RCU torture-test updates
    - RCU Documentation updates
    - Extend the sys_membarrier() ABI with the MEMBARRIER_CMD_PRIVATE_EXPEDITED variant
    - Miscellaneous RCU fixes
    - CPU-hotplug fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (63 commits)
    arch: Remove spin_unlock_wait() arch-specific definitions
    locking: Remove spin_unlock_wait() generic definitions
    drivers/ata: Replace spin_unlock_wait() with lock/unlock pair
    ipc: Replace spin_unlock_wait() with lock/unlock pair
    exit: Replace spin_unlock_wait() with lock/unlock pair
    completion: Replace spin_unlock_wait() with lock/unlock pair
    doc: Set down RCU's scheduling-clock-interrupt needs
    doc: No longer allowed to use rcu_dereference on non-pointers
    doc: Add RCU files to docbook-generation files
    doc: Update memory-barriers.txt for read-to-write dependencies
    doc: Update RCU documentation
    membarrier: Provide expedited private command
    rcu: Remove exports from rcu_idle_exit() and rcu_idle_enter()
    rcu: Add warning to rcu_idle_enter() for irqs enabled
    rcu: Make rcu_idle_enter() rely on callers disabling irqs
    rcu: Add assertions verifying blocked-tasks list
    rcu/tracing: Set disable_rcu_irq_enter on rcu_eqs_exit()
    rcu: Add TPS() protection for _rcu_barrier_trace strings
    rcu: Use idle versions of swait to make idle-hack clear
    swait: Add idle variants which don't contribute to load average
    ...

    Linus Torvalds
     
  • Conflicts:
    mm/page_alloc.c

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • * pm-sleep:
    ACPI / PM: Check low power idle constraints for debug only
    PM / s2idle: Rename platform operations structure
    PM / s2idle: Rename ->enter_freeze to ->enter_s2idle
    PM / s2idle: Rename freeze_state enum and related items
    PM / s2idle: Rename PM_SUSPEND_FREEZE to PM_SUSPEND_TO_IDLE
    ACPI / PM: Prefer suspend-to-idle over S3 on some systems
    platform/x86: intel-hid: Wake up Dell Latitude 7275 from suspend-to-idle
    PM / suspend: Define pr_fmt() in suspend.c
    PM / suspend: Use mem_sleep_labels[] strings in messages
    PM / sleep: Put pm_test under CONFIG_PM_SLEEP_DEBUG
    PM / sleep: Check pm_wakeup_pending() in __device_suspend_noirq()
    PM / core: Add error argument to dpm_show_time()
    PM / core: Split dpm_suspend_noirq() and dpm_resume_noirq()
    PM / s2idle: Rearrange the main suspend-to-idle loop
    PM / timekeeping: Print debug messages when requested
    PM / sleep: Mark suspend/hibernation start and finish
    PM / sleep: Do not print debug messages by default
    PM / suspend: Export pm_suspend_target_state

    Rafael J. Wysocki
     
  • * pm-cpufreq-sched:
    cpufreq: schedutil: Always process remote callback with slow switching
    cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily
    cpufreq: Return 0 from ->fast_switch() on errors
    cpufreq: Simplify cpufreq_can_do_remote_dvfs()
    cpufreq: Process remote callbacks from any CPU if the platform permits
    sched: cpufreq: Allow remote cpufreq callbacks
    cpufreq: schedutil: Use unsigned int for iowait boost
    cpufreq: schedutil: Make iowait boost more energy efficient

    Rafael J. Wysocki
     
  • * pm-cpufreq: (33 commits)
    cpufreq: imx6q: Fix imx6sx low frequency support
    cpufreq: speedstep-lib: make several arrays static, makes code smaller
    cpufreq: ti: Fix 'of_node_put' being called twice in error handling path
    cpufreq: dt-platdev: Drop few entries from whitelist
    cpufreq: dt-platdev: Automatically create cpufreq device with OPP v2
    ARM: ux500: don't select CPUFREQ_DT
    cpufreq: Convert to using %pOF instead of full_name
    cpufreq: Cap the default transition delay value to 10 ms
    cpufreq: dbx500: Delete obsolete driver
    mfd: db8500-prcmu: Get rid of cpufreq dependency
    cpufreq: enable the DT cpufreq driver on the Ux500
    cpufreq: Loongson2: constify platform_device_id
    cpufreq: dt: Add r8a7796 support to to use generic cpufreq driver
    cpufreq: remove setting of policy->cpu in policy->cpus during init
    cpufreq: mediatek: add support of cpufreq to MT7622 SoC
    cpufreq: mediatek: add cleanups with the more generic naming
    cpufreq: rcar: Add support for R8A7795 SoC
    cpufreq: dt: Add rk3328 compatible to use generic cpufreq driver
    cpufreq: s5pv210: add missing of_node_put()
    cpufreq: Allow dynamic switching with CPUFREQ_ETERNAL latency
    ...

    Rafael J. Wysocki
     

29 Aug, 2017

1 commit

  • struct call_single_data is used in IPIs to transfer information between
    CPUs. Its size is bigger than sizeof(unsigned long) and less than the
    cache line size. Currently it is not allocated with any explicit
    alignment requirements. This makes it possible for an allocated
    call_single_data to cross two cache lines, which doubles the number of
    cache lines that need to be transferred among CPUs.

    This can be fixed by requiring call_single_data to be aligned to the
    size of call_single_data. Currently the size of call_single_data is a
    power of 2. If we add new fields to call_single_data, we may need to
    add padding to make sure the size of the new definition is a power of 2
    as well.

    Fortunately, this is enforced by GCC, which will report bad sizes.

    To set the alignment requirement of call_single_data to the size of
    call_single_data, a struct definition and a typedef are used.

    To test the effect of the patch, I used the vm-scalability multiple
    thread swap test case (swap-w-seq-mt). The test will create multiple
    threads and each thread will eat memory until all RAM and part of swap
    is used, so that huge number of IPIs are triggered when unmapping
    memory. In the test, the throughput of memory writing improves ~5%
    compared with misaligned call_single_data, because of faster IPIs.
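
    A sketch of the resulting definition (close to the upstream <linux/smp.h>,
    but treat the member list as approximate):

        struct __call_single_data {
                struct llist_node llist;
                smp_call_func_t func;
                void *info;
                unsigned int flags;
        };

        /* Use __aligned() to keep one csd from spanning two cache lines. */
        typedef struct __call_single_data call_single_data_t
                __aligned(sizeof(struct __call_single_data));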

    Suggested-by: Peter Zijlstra
    Signed-off-by: Huang, Ying
    [ Add call_single_data_t and align with size of call_single_data. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Lu
    Cc: Borislav Petkov
    Cc: Eric Dumazet
    Cc: Juergen Gross
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.com
    Signed-off-by: Ingo Molnar

    Ying Huang
     

28 Aug, 2017

1 commit

  • Tim Chen and Kan Liang have been battling a customer load that shows
    extremely long page wakeup lists. The cause seems to be constant NUMA
    migration of a hot page that is shared across a lot of threads, but the
    actual root cause for the exact behavior has not been found.

    Tim has a patch that batches the wait list traversal at wakeup time, so
    that we at least don't get long uninterruptible cases where we traverse
    and wake up thousands of processes and get nasty latency spikes. That
    is likely 4.14 material, but we're still discussing the page waitqueue
    specific parts of it.

    In the meantime, I've tried to look at making the page wait queues less
    expensive, and failing miserably. If you have thousands of threads
    waiting for the same page, it will be painful. We'll need to try to
    figure out the NUMA balancing issue some day, in addition to avoiding
    the excessive spinlock hold times.

    That said, having tried to rewrite the page wait queues, I can at least
    fix up some of the braindamage in the current situation. In particular:

    (a) we don't want to continue walking the page wait list if the bit
    we're waiting for already got set again (which seems to be one of
    the patterns of the bad load). That makes no progress and just
    causes pointless cache pollution chasing the pointers.

    (b) we don't want to put the non-locking waiters always on the front of
    the queue, and the locking waiters always on the back. Not only is
    that unfair, it means that we wake up thousands of reading threads
    that will just end up being blocked by the writer later anyway.

    Also add a comment about the layout of 'struct wait_page_key' - there is
    an external user of it in the cachefiles code that means that it has to
    match the layout of 'struct wait_bit_key' in the two first members. It
    so happens to match, because 'struct page *' and 'unsigned long *' end
    up having the same values simply because the page flags are the first
    member in struct page.
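
    A sketch of the layout constraint mentioned above: the first two members
    of the two key structures line up, which is what the external cachefiles
    user relies on (member lists approximate):

        struct wait_bit_key {
                void *flags;            /* &page->flags for page-bit waiters */
                int bit_nr;
                unsigned long timeout;
        };

        struct wait_page_key {
                struct page *page;      /* occupies the wait_bit_key.flags slot */
                int bit_nr;             /* matches wait_bit_key.bit_nr */
                int page_match;
        };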

    Cc: Tim Chen
    Cc: Kan Liang
    Cc: Mel Gorman
    Cc: Christopher Lameter
    Cc: Andi Kleen
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds