15 Dec, 2017

1 commit

  • Daniel Wagner reported a crash on the BeagleBone Black SoC.

    This is a single-CPU architecture and does not have a functional
    arch_send_call_function_single_ipi() implementation; calling it can
    crash the kernel.

    As there is only one CPU, that function should never be called. But if
    the kernel is compiled for SMP, the push/pull RT scheduling logic now
    uses it for irq_work: when the one CPU is overloaded, it can end up
    using that function to send an IPI to itself and crash the kernel.

    Ideally, we should disable SCHED_FEAT(RT_PUSH_IPI) if the system only
    has a single CPU. But SCHED_FEAT is a constant if sched debugging is
    turned off. An alternative fix, which also helps normal SMP machines,
    is to not initiate the pull code if there's only one RT overloaded CPU,
    and that CPU happens to be the current CPU that is scheduling in a
    lower priority task.

    Even on a system with many CPUs, if there are many RT tasks waiting to
    run on a single CPU, and that CPU schedules in another RT task of lower
    priority, it will initiate the PULL logic in case there's a higher
    priority RT task on another CPU that is waiting to run. But if there is
    no other CPU with waiting RT tasks, it will initiate the RT pull logic
    on itself (as it still has RT tasks waiting to run). This is a wasted
    effort.

    Not only does this help with SMP code where the current CPU is the only
    one with RT overloaded tasks, it should also solve the issue that
    Daniel encountered, because it will prevent the PULL logic from
    executing, as there's only one CPU on the system, and the check added
    here will cause it to exit the RT pull code.
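
    As a rough sketch (reusing field and helper names already present in
    kernel/sched/rt.c, not necessarily the verbatim patch), the check sits
    at the top of pull_rt_task():

        int rt_overload_count = rt_overloaded(this_rq);

        /* Nothing to do if no CPU is RT overloaded */
        if (likely(!rt_overload_count))
                return;

        /* If we are the only overloaded CPU, there is nobody to pull from */
        if (rt_overload_count == 1 &&
            cpumask_test_cpu(this_rq->cpu, this_rq->rd->rto_mask))
                return;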

    Reported-by: Daniel Wagner
    Signed-off-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Sebastian Andrzej Siewior
    Cc: Thomas Gleixner
    Cc: linux-rt-users
    Cc: stable@vger.kernel.org
    Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

11 Dec, 2017

1 commit

  • Fix the following kernel-doc warnings after code restructuring:

    ../kernel/sched/core.c:5113: warning: No description found for parameter 't'
    ../kernel/sched/core.c:5113: warning: Excess function parameter 'interval' description in 'sched_rr_get_interval'

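    One way to silence them (a sketch only; the parameter name is taken from
    the warnings above, not necessarily the exact upstream change) is to make
    the kernel-doc comment describe the parameter the function now takes:

        /**
         * sched_rr_get_interval - return the default timeslice of a process.
         * @pid: pid of the process.
         * @t: pointer to the returned timeslice value.
         */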

    Signed-off-by: Randy Dunlap
    Cc: Al Viro
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: abca5fc535a3e ("sched_rr_get_interval(): move compat to native, get rid of set_fs()")
    Link: http://lkml.kernel.org/r/995c6ded-b32e-bbe4-d9f5-4d42d121aff1@infradead.org
    Signed-off-by: Ingo Molnar

    Randy Dunlap
     

07 Dec, 2017

2 commits

  • Unlike running, the runnable part can't be directly propagated through
    the hierarchy when we migrate a task. The main reason is that runnable
    time can be shared with other sched_entities that stay on the rq and
    this runnable time will also remain on prev cfs_rq and must not be
    removed.

    Instead, we can estimate what the new runnable of the prev cfs_rq
    should be and check that this estimate stays in a plausible range.
    prop_runnable_sum is a good estimate when adding runnable_sum, but it
    fails most often when we remove it. In that case we can use the
    formula below:

    gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight

    which assumes that tasks are equally runnable, which is not true but is
    easy to compute.
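
    In code, the estimate above amounts to something like the following
    sketch (using the cfs_rq/sched_entity fields named in this message; the
    real logic lives in the runnable propagation path of kernel/sched/fair.c):

        long load_sum = 0, runnable_sum;

        /*
         * Estimate the group's unweighted runnable_sum by assuming all of
         * its tasks are equally runnable.
         */
        if (scale_load_down(gcfs_rq->load.weight))
                load_sum = div_s64(gcfs_rq->avg.load_sum,
                                   scale_load_down(gcfs_rq->load.weight));

        /* But never inflate the entity's own runnable time. */
        runnable_sum = min(se->avg.load_sum, load_sum);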

    Besides these estimates, we have several simple rules that help us
    filter out wrong ones:

    - ge->avg.runnable_sum can't be higher than LOAD_AVG_MAX
    - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
    - ge->avg.runnable_sum can't increase when we detach a task

    The effect of these fixes is better cgroups balancing.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Peter Zijlstra (Intel)
    Cc: Ben Segall
    Cc: Chris Mason
    Cc: Dietmar Eggemann
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • The following cleanup commit:

    50816c48997a ("sched/wait: Standardize internal naming of wait-queue entries")

    ... unintentionally changed the behavior of add_wait_queue() from
    inserting the wait entry at the head of the wait queue to the tail
    of the wait queue.

    Beyond a negative performance impact this change in behavior
    theoretically also breaks wait queues which mix exclusive and
    non-exclusive waiters, as non-exclusive waiters will not be
    woken up if they are queued behind enough exclusive waiters.
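
    The fix is to restore head insertion for non-exclusive waiters; roughly
    (a sketch of kernel/sched/wait.c, not necessarily the verbatim patch):

        void add_wait_queue(struct wait_queue_head *wq_head,
                            struct wait_queue_entry *wq_entry)
        {
                unsigned long flags;

                wq_entry->flags &= ~WQ_FLAG_EXCLUSIVE;
                spin_lock_irqsave(&wq_head->lock, flags);
                __add_wait_queue(wq_head, wq_entry);    /* head again, not tail */
                spin_unlock_irqrestore(&wq_head->lock, flags);
        }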

    Signed-off-by: Omar Sandoval
    Reviewed-by: Jens Axboe
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Fixes: ("sched/wait: Standardize internal naming of wait-queue entries")
    Link: http://lkml.kernel.org/r/a16c8ccffd39bd08fdaa45a5192294c784b803a7.1512544324.git.osandov@fb.com
    Signed-off-by: Ingo Molnar

    Omar Sandoval
     

18 Nov, 2017

1 commit

  • Pull compat and uaccess updates from Al Viro:

    - {get,put}_compat_sigset() series

    - assorted compat ioctl stuff

    - more set_fs() elimination

    - a few more timespec64 conversions

    - several removals of pointless access_ok() in places where it was
    followed only by non-__ variants of primitives

    * 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (24 commits)
    coredump: call do_unlinkat directly instead of sys_unlink
    fs: expose do_unlinkat for built-in callers
    ext4: take handling of EXT4_IOC_GROUP_ADD into a helper, get rid of set_fs()
    ipmi: get rid of pointless access_ok()
    pi433: sanitize ioctl
    cxlflash: get rid of pointless access_ok()
    mtdchar: get rid of pointless access_ok()
    r128: switch compat ioctls to drm_ioctl_kernel()
    selection: get rid of field-by-field copyin
    VT_RESIZEX: get rid of field-by-field copyin
    i2c compat ioctls: move to ->compat_ioctl()
    sched_rr_get_interval(): move compat to native, get rid of set_fs()
    mips: switch to {get,put}_compat_sigset()
    sparc: switch to {get,put}_compat_sigset()
    s390: switch to {get,put}_compat_sigset()
    ppc: switch to {get,put}_compat_sigset()
    parisc: switch to {get,put}_compat_sigset()
    get_compat_sigset()
    get rid of {get,put}_compat_itimerspec()
    io_getevents: Use timespec64 to represent timeouts
    ...

    Linus Torvalds
     

17 Nov, 2017

1 commit

  • Pull AFS updates from David Howells:
    "kAFS filesystem driver overhaul.

    The major points of the overhaul are:

    (1) Preliminary groundwork is laid for supporting network-namespacing
    of kAFS. The remainder of the namespacing work requires some way
    to pass namespace information to submounts triggered by an
    automount. This requires something like the mount overhaul that's
    in progress.

    (2) sockaddr_rxrpc is used in preference to in_addr for holding
    addresses internally and add support for talking to the YFS VL
    server. With this, kAFS can do everything over IPv6 as well as
    IPv4 if it's talking to servers that support it.

    (3) Callback handling is overhauled to be generally passive rather
    than active. 'Callbacks' are promises by the server to tell us
    about data and metadata changes. Callbacks are now checked when
    we next touch an inode rather than actively going and looking for
    it where possible.

    (4) File access permit caching is overhauled to store the caching
    information per-inode rather than per-directory, shared over
    subordinate files. Whilst older AFS servers only allow ACLs on
    directories (shared to the files in that directory), newer AFS
    servers break that restriction.

    To improve memory usage and to make it easier to do mass-key
    removal, permit combinations are cached and shared.

    (5) Cell database management is overhauled to allow lighter locks to
    be used and to make cell records autonomous state machines that
    look after getting their own DNS records and cleaning themselves
    up, in particular preventing races in acquiring and relinquishing
    the fscache token for the cell.

    (6) Volume caching is overhauled. The afs_vlocation record is got rid
    of to simplify things and the superblock is now keyed on the cell
    and the numeric volume ID only. The volume record is tied to a
    superblock and normal superblock management is used to mediate
    the lifetime of the volume fscache token.

    (7) File server record caching is overhauled to make server records
    independent of cells and volumes. A server can be in multiple
    cells (in such a case, the administrator must make sure that the
    VL services for all cells correctly reflect the volumes shared
    between those cells).

    Server records are now indexed using the UUID of the server
    rather than the address since a server can have multiple
    addresses.

    (8) File server rotation is overhauled to handle VMOVED, VBUSY (and
    similar), VOFFLINE and VNOVOL indications and to handle rotation
    both of servers and addresses of those servers. The rotation will
    also wait and retry if the server says it is busy.

    (9) Data writeback is overhauled. Each inode no longer stores a list
    of modified sections tagged with the key that authorised it in
    favour of noting the modified region of a page in page->private
    and storing a list of keys that made modifications in the inode.

    This simplifies things and allows other keys to be used to
    actually write to the server if a key that made a modification
    becomes useless.

    (10) Writable mmap() is implemented. This allows a kernel to be build
    entirely on AFS.

    Note that Pre AFS-3.4 servers are no longer supported, though this can
    be added back if necessary (AFS-3.4 was released in 1998)"

    * tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
    afs: Protect call->state changes against signals
    afs: Trace page dirty/clean
    afs: Implement shared-writeable mmap
    afs: Get rid of the afs_writeback record
    afs: Introduce a file-private data record
    afs: Use a dynamic port if 7001 is in use
    afs: Fix directory read/modify race
    afs: Trace the sending of pages
    afs: Trace the initiation and completion of client calls
    afs: Fix documentation on # vs % prefix in mount source specification
    afs: Fix total-length calculation for multiple-page send
    afs: Only progress call state at end of Tx phase from rxrpc callback
    afs: Make use of the YFS service upgrade to fully support IPv6
    afs: Overhaul volume and server record caching and fileserver rotation
    afs: Move server rotation code into its own file
    afs: Add an address list concept
    afs: Overhaul cell database management
    afs: Overhaul permit caching
    afs: Overhaul the callback handling
    afs: Rename struct afs_call server member to cm_server
    ...

    Linus Torvalds
     

16 Nov, 2017

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Cgroup2 cpu controller support is finally merged.

    - Basic cpu statistics support to allow monitoring by default without
    the CPU controller enabled.

    - cgroup2 cpu controller support.

    - /sys/kernel/cgroup files to help dealing with new / optional
    features"

    * 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: export list of cgroups v2 features using sysfs
    cgroup: export list of delegatable control files using sysfs
    cgroup: mark @cgrp __maybe_unused in cpu_stat_show()
    MAINTAINERS: relocate cpuset.c
    cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
    sched: Implement interface for cgroup unified hierarchy
    sched: Misc preps for cgroup unified hierarchy interface
    sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
    cgroup: statically initialize init_css_set->dfl_cgrp
    cgroup: Implement cgroup2 basic CPU usage accounting
    cpuacct: Introduce cgroup_account_cputime[_field]()
    sched/cputime: Expose cputime_adjust()

    Linus Torvalds
     

14 Nov, 2017

4 commits

  • Pull power management updates from Rafael Wysocki:
    "There are no real big ticket items here this time.

    The most noticeable change is probably the relocation of the OPP
    (Operating Performance Points) framework to its own directory under
    drivers/ as it has grown big enough for that. Also Viresh is now going
    to maintain it and send pull requests for it to me, so you will see
    this change in the git history going forward (but still not right
    now).

    Another noticeable set of changes is the modifications of the PM core,
    the PCI subsystem and the ACPI PM domain to allow of more integration
    between system-wide suspend/resume and runtime PM. For now it's just a
    way to avoid resuming devices from runtime suspend unnecessarily
    during system suspend (if the driver sets a flag to indicate its
    readiness for that) and in the works is an analogous mechanism to
    allow devices to stay suspended after system resume.

    In addition to that, we have some changes related to supporting
    frequency-invariant CPU utilization metrics in the scheduler and in
    the schedutil cpufreq governor on ARM and changes to add support for
    device performance states to the generic power domains (genpd)
    framework.

    The rest is mostly fixes and cleanups of various sorts.

    Specifics:

    - Relocate the OPP (Operating Performance Points) framework to its
    own directory under drivers/ and add support for power domain
    performance states to it (Viresh Kumar).

    - Modify the PM core, the PCI bus type and the ACPI PM domain to
    support power management driver flags allowing device drivers to
    specify their capabilities and preferences regarding the handling
    of devices with enabled runtime PM during system suspend/resume and
    clean up that code somewhat (Rafael Wysocki, Ulf Hansson).

    - Add frequency-invariant accounting support to the task scheduler on
    ARM and ARM64 (Dietmar Eggemann).

    - Fix PM QoS device resume latency framework to prevent "no
    restriction" requests from overriding requests with specific
    requirements and drop the confusing PM_QOS_FLAG_REMOTE_WAKEUP
    device PM QoS flag (Rafael Wysocki).

    - Drop legacy class suspend/resume operations from the PM core and
    drop legacy bus type suspend and resume callbacks from ARM/locomo
    (Rafael Wysocki).

    - Add min/max frequency support to devfreq and clean it up somewhat
    (Chanwoo Choi).

    - Rework wakeup support in the generic power domains (genpd)
    framework and update some of its users accordingly (Geert
    Uytterhoeven).

    - Convert timers in the PM core to use timer_setup() (Kees Cook).

    - Add support for exposing the SLP_S0 (Low Power S0 Idle) residency
    counter based on the LPIT ACPI table on Intel platforms (Srinivas
    Pandruvada).

    - Add per-CPU PM QoS resume latency support to the ladder cpuidle
    governor (Ramesh Thomas).

    - Fix a deadlock between the wakeup notify handler and the notifier
    removal in the ACPI core (Ville Syrjälä).

    - Fix a cpufreq schedutil governor issue causing it to use stale
    cached frequency values sometimes (Viresh Kumar).

    - Fix an issue in the system suspend core support code causing wakeup
    events detection to fail in some cases (Rajat Jain).

    - Fix the generic power domains (genpd) framework to prevent the PM
    core from using the direct-complete optimization with it as that is
    guaranteed to fail (Ulf Hansson).

    - Fix a minor issue in the cpuidle core and clean it up a bit (Gaurav
    Jindal, Nicholas Piggin).

    - Fix and clean up the intel_idle and ARM cpuidle drivers (Jason
    Baron, Len Brown, Leo Yan).

    - Fix a couple of minor issues in the OPP framework and clean it up
    (Arvind Yadav, Fabio Estevam, Sudeep Holla, Tobias Jordan).

    - Fix and clean up some cpufreq drivers and fix a minor issue in the
    cpufreq statistics code (Arvind Yadav, Bhumika Goyal, Fabio
    Estevam, Gautham Shenoy, Gustavo Silva, Marek Szyprowski, Masahiro
    Yamada, Robert Jarzmik, Zumeng Chen).

    - Fix minor issues in the system suspend and hibernation core, in
    power management documentation and in the AVS (Adaptive Voltage
    Scaling) framework (Helge Deller, Himanshu Jha, Joe Perches, Rafael
    Wysocki).

    - Fix some issues in the cpupower utility and document that Shuah
    Khan is going to maintain it going forward (Prarit Bhargava, Shuah
    Khan)"

    * tag 'pm-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (88 commits)
    tools/power/cpupower: add libcpupower.so.0.0.1 to .gitignore
    tools/power/cpupower: Add 64 bit library detection
    intel_idle: Graceful probe failure when MWAIT is disabled
    cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freq
    freezer: Fix typo in freezable_schedule_timeout() comment
    PM / s2idle: Clear the events_check_enabled flag
    cpufreq: stats: Handle the case when trans_table goes beyond PAGE_SIZE
    cpufreq: arm_big_little: make cpufreq_arm_bL_ops structures const
    cpufreq: arm_big_little: make function arguments and structure pointer const
    cpuidle: Avoid assignment in if () argument
    cpuidle: Clean up cpuidle_enable_device() error handling a bit
    ACPI / PM: Fix acpi_pm_notifier_lock vs flush_workqueue() deadlock
    PM / Domains: Fix genpd to deal with drivers returning 1 from ->prepare()
    cpuidle: ladder: Add per CPU PM QoS resume latency support
    PM / QoS: Fix device resume latency framework
    PM / domains: Rework governor code to be more consistent
    PM / Domains: Remove gpd_dev_ops.active_wakeup() callback
    soc: rockchip: power-domain: Use GENPD_FLAG_ACTIVE_WAKEUP
    soc: mediatek: Use GENPD_FLAG_ACTIVE_WAKEUP
    ARM: shmobile: pm-rmobile: Use GENPD_FLAG_ACTIVE_WAKEUP
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main updates in this cycle were:

    - Group balancing enhancements and cleanups (Brendan Jackman)

    - Move CPU isolation related functionality into its separate
    kernel/sched/isolation.c file, with related 'housekeeping_*()'
    namespace and nomenclature et al. (Frederic Weisbecker)

    - Improve the interactive/cpu-intense fairness calculation (Josef
    Bacik)

    - Improve the PELT code and related cleanups (Peter Zijlstra)

    - Improve the logic of pick_next_task_fair() (Uladzislau Rezki)

    - Improve the RT IPI based balancing logic (Steven Rostedt)

    - Various micro-optimizations:

    - better !CONFIG_SCHED_DEBUG optimizations (Patrick Bellasi)

    - better idle loop (Cheng Jian)

    - ... plus misc fixes, cleanups and updates"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
    sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds
    sched/sysctl: Fix attributes of some extern declarations
    sched/isolation: Document isolcpus= boot parameter flags, mark it deprecated
    sched/isolation: Add basic isolcpus flags
    sched/isolation: Move isolcpus= handling to the housekeeping code
    sched/isolation: Handle the nohz_full= parameter
    sched/isolation: Introduce housekeeping flags
    sched/isolation: Split out new CONFIG_CPU_ISOLATION=y config from CONFIG_NO_HZ_FULL
    sched/isolation: Rename is_housekeeping_cpu() to housekeeping_cpu()
    sched/isolation: Use its own static key
    sched/isolation: Make the housekeeping cpumask private
    sched/isolation: Provide a dynamic off-case to housekeeping_any_cpu()
    sched/isolation, watchdog: Use housekeeping_cpumask() instead of ad-hoc version
    sched/isolation: Move housekeeping related code to its own file
    sched/idle: Micro-optimize the idle loop
    sched/isolcpus: Fix "isolcpus=" boot parameter handling when !CONFIG_CPUMASK_OFFSTACK
    x86/tsc: Append the 'tsc=' description for the 'tsc=unstable' boot parameter
    sched/rt: Simplify the IPI based RT balancing logic
    block/ioprio: Use a helper to check for RT prio
    sched/rt: Add a helper to test for a RT task
    ...

    Linus Torvalds
     
  • Pull core locking updates from Ingo Molnar:
    "The main changes in this cycle are:

    - Another attempt at enabling cross-release lockdep dependency
    tracking (automatically part of CONFIG_PROVE_LOCKING=y), this time
    with better performance and fewer false positives. (Byungchul Park)

    - Introduce lockdep_assert_irqs_enabled()/disabled() and convert
    open-coded equivalents to lockdep variants. (Frederic Weisbecker)

    - Add down_read_killable() and use it in the VFS's iterate_dir()
    method. (Kirill Tkhai)

    - Convert remaining uses of ACCESS_ONCE() to
    READ_ONCE()/WRITE_ONCE(). Most of the conversion was Coccinelle
    driven. (Mark Rutland, Paul E. McKenney)

    - Get rid of lockless_dereference(), by strengthening Alpha atomics,
    strengthening READ_ONCE() with smp_read_barrier_depends() and thus
    being able to convert users of lockless_dereference() to
    READ_ONCE(). (Will Deacon)

    - Various micro-optimizations:

    - better PV qspinlocks (Waiman Long),
    - better x86 barriers (Michael S. Tsirkin)
    - better x86 refcounts (Kees Cook)

    - ... plus other fixes and enhancements. (Borislav Petkov, Juergen
    Gross, Miguel Bernal Marin)"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE
    rcu: Use lockdep to assert IRQs are disabled/enabled
    netpoll: Use lockdep to assert IRQs are disabled/enabled
    timers/posix-cpu-timers: Use lockdep to assert IRQs are disabled/enabled
    sched/clock, sched/cputime: Use lockdep to assert IRQs are disabled/enabled
    irq_work: Use lockdep to assert IRQs are disabled/enabled
    irq/timings: Use lockdep to assert IRQs are disabled/enabled
    perf/core: Use lockdep to assert IRQs are disabled/enabled
    x86: Use lockdep to assert IRQs are disabled/enabled
    smp/core: Use lockdep to assert IRQs are disabled/enabled
    timers/hrtimer: Use lockdep to assert IRQs are disabled/enabled
    timers/nohz: Use lockdep to assert IRQs are disabled/enabled
    workqueue: Use lockdep to assert IRQs are disabled/enabled
    irq/softirqs: Use lockdep to assert IRQs are disabled/enabled
    locking/lockdep: Add IRQs disabled/enabled assertion APIs: lockdep_assert_irqs_enabled()/disabled()
    locking/pvqspinlock: Implement hybrid PV queued/unfair locks
    locking/rwlocks: Fix comments
    x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized
    block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion()
    workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle are:

    - Documentation updates

    - RCU CPU stall-warning updates

    - Torture-test updates

    - Miscellaneous fixes

    Size wise the biggest updates are to documentation. Excluding
    documentation most of the code increase comes from a single commit
    which expands debugging"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    srcu: Add parameters to SRCU docbook comments
    doc: Rewrite confusing statement about memory barriers
    memory-barriers.txt: Fix typo in pairing example
    rcu/segcblist: Include rcupdate.h
    rcu: Add extended-quiescent-state testing advice
    rcu: Suppress lockdep false-positive ->boost_mtx complaints
    rcu: Do not include rtmutex_common.h unconditionally
    torture: Provide TMPDIR environment variable to specify tmpdir
    rcutorture: Dump writer stack if stalled
    rcutorture: Add interrupt-disable capability to stall-warning tests
    rcu: Suppress RCU CPU stall warnings while dumping trace
    rcu: Turn off tracing before dumping trace
    rcu: Make RCU CPU stall warnings check for irq-disabled CPUs
    sched,rcu: Make cond_resched() provide RCU quiescent state
    sched: Make resched_cpu() unconditional
    irq_work: Map irq_work_on_queue() to irq_work_on() in !SMP
    rcu: Create call_rcu_tasks() kthread at boot time
    rcu: Fix up pending cbs check in rcu_prepare_for_idle
    memory-barriers: Rework multicopy-atomicity section
    memory-barriers: Replace uses of "transitive"
    ...

    Linus Torvalds
     

13 Nov, 2017

2 commits

  • Make wait_on_atomic_t() pass the TASK_* mode onto its action function as an
    extra argument and make it 'unsigned int' throughout.

    Also, consolidate a bunch of identical action functions into a default
    function that can do the appropriate thing for the mode.

    Also, change the argument name in the bit_wait*() function declarations to
    reflect the fact that it's the mode and not the bit number.

    [Peter Z gives this a grudging ACK, but thinks that the whole atomic_t wait
    should be done differently, though he's not immediately sure as to how]
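
    A sketch of the resulting interface (signatures assumed from the
    description above; 'obj' is an illustrative object with an atomic_t
    reference count): the action callback now receives the TASK_* mode as an
    unsigned int, and a default action replaces the old copy-pasted ones:

        int atomic_t_wait(atomic_t *counter, unsigned int mode)
        {
                schedule();
                if (signal_pending_state(mode, current))
                        return -EINTR;
                return 0;
        }

        /* callers pass the mode once; it is forwarded to the action */
        wait_on_atomic_t(&obj->usage, atomic_t_wait, TASK_UNINTERRUPTIBLE);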

    Signed-off-by: David Howells
    Acked-by: Peter Zijlstra
    cc: Ingo Molnar

    David Howells
     
  • * pm-cpufreq-sched:
    cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freq

    * pm-opp:
    PM / OPP: Add dev_pm_opp_{un}register_get_pstate_helper()
    PM / OPP: Support updating performance state of device's power domain
    PM / OPP: add missing of_node_put() for of_get_cpu_node()
    PM / OPP: Rename dev_pm_opp_register_put_opp_helper()
    PM / OPP: Add missing of_node_put(np)
    PM / OPP: Move error message to debug level
    PM / OPP: Use snprintf() to avoid kasprintf() and kfree()
    PM / OPP: Move the OPP directory out of power/

    Rafael J. Wysocki
     

09 Nov, 2017

3 commits

  • When the kernel is compiled with !CONFIG_SCHED_DEBUG support, we expect
    all SCHED_FEATs to be turned into compile-time constants that can be
    propagated to support compiler optimizations.

    Specifically, we expect that code blocks like this:

    if (sched_feat(FEATURE_NAME) [&& <other_conditions>]) {
            /* FEATURE CODE */
    }

    are turned into dead code in case FEATURE_NAME defaults to FALSE, and thus
    are removed by the compiler from the final image.

    For this mechanism to work properly, the compiler needs full access,
    from each translation unit, to the value used by the sched_feat()
    macro. This macro is defined as:

    #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))

    and thus, the compiler can optimize that code only if the value of
    sysctl_sched_features is visible within each translation unit.

    Since:

    029632fbb ("sched: Make separate sched*.c translation units")

    the scheduler code has been split into separate translation units
    however the definition of sysctl_sched_features is part of
    kernel/sched/core.c while, for all the other scheduler modules, it is
    visible only via kernel/sched/sched.h as an:

    extern const_debug unsigned int sysctl_sched_features

    Unfortunately, an extern reference does not allow the compiler to apply
    constant propagation. Thus, on a !CONFIG_SCHED_DEBUG kernel we still end
    up with code that loads a memory reference and (eventually) branches
    over a chunk of code.

    This mechanism is unavoidable when sched_features can be turned on and off at
    run-time. However, this is not the case for "production" kernels compiled with
    !CONFIG_SCHED_DEBUG. In this case, sysctl_sched_features is just a constant value
    which cannot be changed at run-time and thus memory loads and jumps can be
    avoided altogether.

    This patch fixes the !CONFIG_SCHED_DEBUG case by declaring a local
    version of the sysctl_sched_features constant for each translation
    unit. This ultimately allows the compiler to perform constant
    propagation and dead-code pruning.
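
    Roughly, the !CONFIG_SCHED_DEBUG side of kernel/sched/sched.h then looks
    like this (a sketch, not the verbatim patch):

        /*
         * Give each translation unit its own constant copy so the compiler
         * can propagate it and prune disabled-feature code.
         */
        #define SCHED_FEAT(name, enabled)       \
                (1UL << __SCHED_FEAT_##name) * enabled |
        static const_debug __maybe_unused unsigned int sysctl_sched_features =
        #include "features.h"
                0;
        #undef SCHED_FEAT

        #define sched_feat(x) !!(sysctl_sched_features & (1UL << __SCHED_FEAT_##x))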

    Tests have been done, with !CONFIG_SCHED_DEBUG on a v4.14-rc8 with and without
    the patch, by running 30 iterations of:

    perf bench sched messaging --pipe --thread --group 4 --loop 50000

    on a 40 cores Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz using the
    powersave governor to rule out variations due to frequency scaling.

    Statistics on the reported completion time:

                         count  mean      std        min      99%        max
    v4.14-rc8            30.0   15.7831   0.176032   15.442   16.01226   16.014
    v4.14-rc8+patch      30.0   15.5033   0.189681   15.232   15.93938   15.962

    ... show a 1.8% speedup in average completion time and a 0.5% speedup at
    the 99th percentile.

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Chris Redpath
    Reviewed-by: Dietmar Eggemann
    Reviewed-by: Brendan Jackman
    Acked-by: Peter Zijlstra
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Link: http://lkml.kernel.org/r/20171108184101.16006-1-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
  • * pm-cpufreq-sched:
    cpufreq: schedutil: Examine the correct CPU when we update util

    Rafael J. Wysocki
     
  • 'cached_raw_freq' is used to get the next frequency quickly but should
    always be in sync with sg_policy->next_freq. There is a case where it is
    not and in such cases it should be reset to avoid switching to incorrect
    frequencies.

    Consider this case for example:

    - policy->cur is 1.2 GHz (Max)
    - New request comes for 780 MHz and we store that in cached_raw_freq.
    - Based on 780 MHz, we calculate the effective frequency as 800 MHz.
    - We then see the CPU wasn't idle recently and choose to keep the next
    freq as 1.2 GHz.
    - Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is
    1.2 GHz.
    - Now if the utilization doesn't change in the next request, then the
    next target frequency will still be 780 MHz and it will match with
    cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.
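
    The fix described above boils down to clearing the cache whenever
    next_freq is forced away from the freshly computed value; a sketch
    (based on the sugov_update_single() logic referenced by the Fixes tag):

        if (busy && next_f < sg_policy->next_freq) {
                next_f = sg_policy->next_freq;

                /* Reset cached freq as next_freq has changed */
                sg_policy->cached_raw_freq = 0;
        }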

    Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
    Signed-off-by: Viresh Kumar
    Cc: 4.12+ # 4.12+
    Signed-off-by: Rafael J. Wysocki

    Viresh Kumar
     

08 Nov, 2017

2 commits


05 Nov, 2017

1 commit

  • After commit 674e75411fc2 (sched: cpufreq: Allow remote cpufreq
    callbacks) we stopped always reading the utilization for the CPU we
    are running the governor on, and instead read it for the CPU
    which we've been told has updated utilization. That CPU is stored in
    sugov_cpu->cpu.

    The value is set in sugov_register(), but we clear it in sugov_start(),
    which leads to always looking at the utilization of CPU0 instead of
    the correct one.

    Fix this by consolidating the initialization code into sugov_start().
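
    A sketch of the consolidated initialization in sugov_start() (not
    necessarily the verbatim patch):

        for_each_cpu(cpu, policy->cpus) {
                struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

                memset(sg_cpu, 0, sizeof(*sg_cpu));
                sg_cpu->cpu = cpu;              /* (re)set after the memset */
                sg_cpu->sg_policy = sg_policy;
        }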

    Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
    Signed-off-by: Chris Redpath
    Reviewed-by: Patrick Bellasi
    Reviewed-by: Brendan Jackman
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Chris Redpath
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines)

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

27 Oct, 2017

10 commits

  • Add flags to control NOHZ and domain isolation from "isolcpus=", in
    order to centralize the isolation features to a common interface. Domain
    isolation remains the default so as not to break the existing isolcpus=
    boot parameter behaviour.

    Further flags in the future may include 0hz (1hz tick offload) and timers,
    workqueue, RCU, kthread, watchdog, likely all merged together in a
    common flag ("async"?). In any case, this will have to be modifiable by
    cpusets.
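
    As an illustration of the resulting interface, the boot parameter now
    accepts an optional comma-separated flag list in front of the CPU list
    (example values only):

        isolcpus=nohz,domain,2-7

    where passing "domain" alone preserves today's behaviour.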

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-12-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • We want to centralize the isolation features in the housekeeping
    subsystem, and scheduler domain isolation is a significant part of it.

    No intended behaviour change, we just reuse the housekeeping cpumask
    and core code.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-11-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • We want to centralize the isolation management, done by the housekeeping
    subsystem. Therefore we need to handle the nohz_full= parameter from
    there.

    Since nohz_full= so far has involved unbound timers, watchdog, RCU
    and tilegx NAPI isolation, we keep that default behaviour.

    nohz_full= will be deprecated in the future. We want to control
    the isolation features from the isolcpus= parameter.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-10-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Before we implement isolcpus under housekeeping, we need the isolation
    features to be more fine-grained. For example, some people want NOHZ_FULL
    without the full scheduler isolation, others want full scheduler
    isolation without NOHZ_FULL.

    So let's cut all these isolation features piecewise, at the risk of
    overcutting it right now. We can still merge some flags later if they
    always make sense together.
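
    A sketch of the piecewise flags this introduces (names as they appear in
    include/linux/sched/isolation.h; treat the exact set as illustrative):

        enum hk_flags {
                HK_FLAG_TIMER   = 1,
                HK_FLAG_RCU     = (1 << 1),
                HK_FLAG_MISC    = (1 << 2),
                HK_FLAG_SCHED   = (1 << 3),
        };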

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-9-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Split the housekeeping config from CONFIG_NO_HZ_FULL. This way we finally
    separate the isolation code from NOHZ.

    A dependency on CONFIG_NO_HZ_FULL remains for now, though, while the
    housekeeping code still deals with NOHZ internals.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-8-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Fit it into the housekeeping_*() namespace.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-7-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Housekeeping code still depends on the nohz_full static key. Since we want
    to decouple housekeeping from NOHZ, let's create a housekeeping specific
    static key.

    It's mostly relevant for calls to is_housekeeping_cpu() from the scheduler.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-6-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Nobody needs to access this detail. housekeeping_cpumask() already
    takes care of it.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-5-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The housekeeping code is currently tied to the NOHZ code. As we are
    planning to make housekeeping independent from it, start with moving
    the relevant code to its own file.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Acked-by: Paul E. McKenney
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-2-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The basic cpu stats are currently shown with the "cpu." prefix in
    cgroup.stat, and the same information is duplicated in cpu.stat when
    the cpu controller is enabled. This is ugly and not very scalable, as we
    want to expand the coverage of stat information which is always
    available.

    This patch makes cgroup core always create "cpu.stat" file and show
    the basic cpu stat there and calls the cpu controller to show the
    extra stats when enabled. This ensures that the same information
    isn't presented in multiple places and makes future expansion of basic
    stats easier.
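
    For example, every cgroup2 group now carries a cpu.stat file with the
    basic usage fields (illustrative path and values):

        # cat /sys/fs/cgroup/test/cpu.stat
        usage_usec 876543
        user_usec 654321
        system_usec 222222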

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra (Intel)

    Tejun Heo
     

26 Oct, 2017

1 commit

  • Move the loop-invariant calculation of 'cpu' in do_idle() out of the loop body,
    because the current CPU is always constant.

    This improves the generated code both on x86-64 and ARM64:

    x86-64:

    Before patch (execution in loop):
    864: 0f ae e8 lfence
    867: 65 8b 05 c2 38 f1 7e mov %gs:0x7ef138c2(%rip),%eax
    86e: 89 c0 mov %eax,%eax
    870: 48 0f a3 05 68 19 08 bt %rax,0x1081968(%rip)
    877: 01

    After patch (execution in loop):
    872: 0f ae e8 lfence
    875: 4c 0f a3 25 63 19 08 bt %r12,0x1081963(%rip)
    87c: 01

    ARM64:

    Before patch (execution in loop):
    c58: d5033d9f dsb ld
    c5c: d538d080 mrs x0, tpidr_el1
    c60: b8606a61 ldr w1, [x19,x0]
    c64: 1100fc20 add w0, w1, #0x3f
    c68: 7100003f cmp w1, #0x0
    c6c: 1a81b000 csel w0, w0, w1, lt
    c70: 13067c00 asr w0, w0, #6
    c74: 93407c00 sxtw x0, w0
    c78: f8607a80 ldr x0, [x20,x0,lsl #3]
    c7c: 9ac12401 lsr x1, x0, x1
    c80: 36000581 tbz w1, #0, d30

    After patch (execution in loop):
    c84: d5033d9f dsb ld
    c88: f9400260 ldr x0, [x19]
    c8c: ea14001f tst x0, x20
    c90: 54000580 b.eq d40
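
    In C terms, the change is simply hoisting the smp_processor_id() call out
    of the loop (a sketch of do_idle() in kernel/sched/idle.c):

        static void do_idle(void)
        {
                int cpu = smp_processor_id();   /* loop-invariant, hoisted */

                while (!need_resched()) {
                        rmb();

                        /* was: cpu_is_offline(smp_processor_id()) */
                        if (cpu_is_offline(cpu))
                                arch_cpu_idle_dead();

                        /* ... rest of the idle loop ... */
                }
        }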

    Signed-off-by: Cheng Jian
    [ Rewrote the title and the changelog. ]
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: huawei.libin@huawei.com
    Cc: xiexiuqi@huawei.com
    Link: http://lkml.kernel.org/r/1508930907-107755-1-git-send-email-cj.chengjian@huawei.com
    Signed-off-by: Ingo Molnar

    Cheng Jian
     

24 Oct, 2017

2 commits

  • cpulist_parse() uses nr_cpumask_bits as a limit to parse the
    passed buffer from kernel commandline. What nr_cpumask_bits
    represents varies depending upon the CONFIG_CPUMASK_OFFSTACK option:

    - If CONFIG_CPUMASK_OFFSTACK=n, then nr_cpumask_bits is the same as
    NR_CPUS, which might not represent the # of CPUs that really exist
    (default 64). So, there's a chance of a gap between nr_cpu_ids
    and NR_CPUS, which ultimately leads to an invalid cpulist_parse()
    operation. For example, if isolcpus=9 is passed on an 8-CPU
    system (CONFIG_CPUMASK_OFFSTACK=n) it doesn't show the error
    that it's supposed to.

    This patch fixes this bug by finding the last CPU of the passed
    isolcpus= list and checking it against nr_cpu_ids.

    It also fixes the error message to print nr_cpu_ids-1 rather than
    nr_cpu_ids, since CPU numbering starts from 0.
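
    A sketch of the described check (helper and variable names assumed for
    illustration only):

        /* Reject CPU numbers beyond the ones that actually exist. */
        if (cpumask_last(parsed_mask) >= nr_cpu_ids)
                pr_err("isolcpus: CPU range must be within 0-%u\n",
                       nr_cpu_ids - 1);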

    Signed-off-by: Rakib Mullick
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: adobriyan@gmail.com
    Cc: akpm@linux-foundation.org
    Cc: longman@redhat.com
    Cc: mka@chromium.org
    Cc: tj@kernel.org
    Link: http://lkml.kernel.org/r/20171023130154.9050-1-rakib.mullick@gmail.com
    [ Enhanced the changelog and the kernel message. ]
    Signed-off-by: Ingo Molnar

    include/linux/cpumask.h | 16 ++++++++++++++++
    kernel/sched/topology.c | 4 ++--
    2 files changed, 18 insertions(+), 2 deletions(-)

    Rakib Mullick
     
  • …k/linux-rcu into core/rcu

    Pull RCU updates from Paul E. McKenney:

    - Documentation updates
    - Miscellaneous fixes
    - RCU CPU stall-warning updates
    - Torture-test updates

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

21 Oct, 2017

1 commit


20 Oct, 2017

1 commit

  • This introduces a "register private expedited" membarrier command which
    allows eventual removal of important memory barrier constraints on the
    scheduler fast-paths. It changes how the "private expedited" membarrier
    command (new to 4.14) is used from user-space.

    This new command allows processes to register their intent to use the
    private expedited command. This affects how the expedited private
    command introduced in 4.14-rc is meant to be used, and should be merged
    before 4.14 final.

    Processes are now required to register before using
    MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

    This fixes a problem that arose when designing requested extensions to
    sys_membarrier() to allow JITs to efficiently flush old code from
    instruction caches. Several potential algorithms are much less painful
    if the user registers the intent to use this functionality early on, for
    example, before the process spawns the second thread. Registering at
    this time removes the need to interrupt each and every thread in that
    process at the first expedited sys_membarrier() system call.
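
    An illustrative user-space flow (error handling trimmed; command names
    from include/uapi/linux/membarrier.h):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
                /* Register intent once, e.g. before spawning more threads. */
                if (syscall(__NR_membarrier,
                            MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0))
                        return 1;

                /* Later, any thread of this process may use the expedited command. */
                return syscall(__NR_membarrier,
                               MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) ? 1 : 0;
        }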

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

10 Oct, 2017

5 commits

  • When a CPU lowers its priority (schedules out a high priority task for a
    lower priority one), a check is made to see if any other CPU has overloaded
    RT tasks (more than one). It checks the rto_mask to determine this and if so
    it will request to pull one of those tasks to itself if the non running RT
    task is of higher priority than the new priority of the next task to run on
    the current CPU.

    When we deal with large number of CPUs, the original pull logic suffered
    from large lock contention on a single CPU run queue, which caused a huge
    latency across all CPUs. This was caused by only having one CPU having
    overloaded RT tasks and a bunch of other CPUs lowering their priority. To
    solve this issue, commit:

    b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

    changed the way to request a pull. Instead of grabbing the lock of the
    overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

    Although the IPI logic worked very well in removing the large latency build
    up, it still could suffer from a large number of IPIs being sent to a single
    CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
    when I tested this on a 120 CPU box, with a stress test that had lots of
    RT tasks scheduling on all CPUs, it actually triggered the hard lockup
    detector! One CPU had so many IPIs sent to it, and due to the restart
    mechanism that is triggered when the source run queue has a priority status
    change, the CPU spent minutes! processing the IPIs.

    Thinking about this further, I realized there's no reason for each run queue
    to send its own IPI. As all CPUs with overloaded tasks must be scanned
    regardless if there's one or many CPUs lowering their priority, because
    there's no current way to find the CPU with the highest priority task that
    can schedule to one of these CPUs, there really only needs to be one IPI
    being sent around at a time.

    This greatly simplifies the code!

    The new approach is to have each root domain have its own irq work, as the
    rto_mask is per root domain. The root domain has the following fields
    attached to it:

    rto_push_work  - the irq work to process each CPU set in rto_mask
    rto_lock       - the lock to protect some of the other rto fields
    rto_loop_start - an atomic that keeps contention down on rto_lock;
                     the first CPU scheduling in a lower priority task
                     is the one to kick off the process
    rto_loop_next  - an atomic that gets incremented for each CPU that
                     schedules in a lower priority task
    rto_loop       - a variable protected by rto_lock that is used to
                     compare against rto_loop_next
    rto_cpu        - the CPU to send the next IPI to, also protected by
                     the rto_lock

    When a CPU schedules in a lower priority task and wants to make sure
    overloaded CPUs know about it, it increments the rto_loop_next. Then it
    atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
    then it is done, as another CPU is kicking off the IPI loop. If the old
    value is "0", then it will take the rto_lock to synchronize with a possible
    IPI being sent around to the overloaded CPUs.
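
    The rto_loop_start handshake reduces to a tiny try-lock built on an
    atomic; a sketch of the helpers named at the end of this message
    (illustrative, not necessarily the verbatim patch):

        static inline bool rto_start_trylock(atomic_t *v)
        {
                return !atomic_cmpxchg_acquire(v, 0, 1);
        }

        static inline void rto_start_unlock(atomic_t *v)
        {
                atomic_set_release(v, 0);
        }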

    If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
    IPI being sent around, or one is about to finish. Then rto_cpu is set to the
    first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
    in rto_mask, then there's nothing to be done.

    When the CPU receives the IPI, it will first try to push any RT tasks that are
    queued on the CPU but can't run because a higher priority RT task is
    currently running on that CPU.

    Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
    finds one, it simply sends an IPI to that CPU and the process continues.

    If there's no more CPUs in the rto_mask, then rto_loop is compared with
    rto_loop_next. If they match, everything is done and the process is over. If
    they do not match, then a CPU scheduled in a lower priority task as the IPI
    was being passed around, and the process needs to start again. The first CPU
    in rto_mask is sent the IPI.

    This change removes this duplication of work in the IPI logic, and greatly
    lowers the latency caused by the IPIs. This removed the lockup happening on
    the 120 CPU machine. It also simplifies the code tremendously. What else
    could anyone ask for?

    Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
    supplying me with the rto_start_trylock() and rto_start_unlock() helper
    functions.

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Clark Williams
    Cc: Daniel Bristot de Oliveira
    Cc: John Kacur
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Scott Wood
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt (Red Hat)
     
  • find_idlest_group() returns NULL when the local group is idlest. The
    caller then continues the find_idlest_group() search at a lower level
    of the current CPU's sched_domain hierarchy. find_idlest_group_cpu() is
    not consulted and, crucially, @new_cpu is not updated. This means the
    search is pointless and we return @prev_cpu from select_task_rq_fair().

    This is fixed by initialising @new_cpu to @cpu instead of @prev_cpu.
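
    The change itself is a one-liner in the idlest-CPU search (a sketch; the
    surrounding function lives in kernel/sched/fair.c):

        int new_cpu = cpu;      /* was: int new_cpu = prev_cpu; */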

    Signed-off-by: Brendan Jackman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Josef Bacik
    Reviewed-by: Vincent Guittot
    Cc: Dietmar Eggemann
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20171005114516.18617-6-brendan.jackman@arm.com
    Signed-off-by: Ingo Molnar

    Brendan Jackman
     
  • When 'p' is not allowed on any of the CPUs in the sched_domain, we
    currently return NULL from find_idlest_group(), and pointlessly
    continue the search on lower sched_domain levels (where 'p' is also not
    allowed) before returning prev_cpu regardless (as we have not updated
    new_cpu).

    Add an explicit check for this case, and add a comment to
    find_idlest_group(). Now when find_idlest_group() returns NULL, it always
    means that the local group is allowed and idlest.
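
    A sketch of the added early-exit (placement and exact form assumed):

        /* Bail out early if 'p' cannot run anywhere in this domain. */
        if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
                return prev_cpu;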

    Signed-off-by: Brendan Jackman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Reviewed-by: Josef Bacik
    Cc: Dietmar Eggemann
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20171005114516.18617-5-brendan.jackman@arm.com
    Signed-off-by: Ingo Molnar

    Brendan Jackman
     
  • When the local group is not allowed we do not modify this_*_load from
    their initial value of 0. That means that the load checks at the end
    of find_idlest_group cause us to incorrectly return NULL. Fixing the
    initial values to ULONG_MAX means we will instead return the idlest
    remote group in that case.
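
    A sketch of the corrected initial values (variable names as implied by
    "this_*_load" above):

        unsigned long this_runnable_load = ULONG_MAX;   /* was: 0 */
        unsigned long this_avg_load = ULONG_MAX;        /* was: 0 */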

    Signed-off-by: Brendan Jackman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Reviewed-by: Josef Bacik
    Cc: Dietmar Eggemann
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20171005114516.18617-4-brendan.jackman@arm.com
    Signed-off-by: Ingo Molnar

    Brendan Jackman
     
  • Since commit:

    83a0a96a5f26 ("sched/fair: Leverage the idle state info when choosing the "idlest" cpu")

    find_idlest_group_cpu() (formerly find_idlest_cpu) no longer returns -1,
    so we can simplify the checking of the return value in find_idlest_cpu().

    Signed-off-by: Brendan Jackman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Josef Bacik
    Reviewed-by: Vincent Guittot
    Cc: Dietmar Eggemann
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20171005114516.18617-3-brendan.jackman@arm.com
    Signed-off-by: Ingo Molnar

    Brendan Jackman