15 Dec, 2017

1 commit

  • Daniel Wagner reported a crash on the BeagleBone Black SoC.

    This is a single-CPU architecture and does not have a functional
    arch_send_call_function_single_ipi() implementation; calling it can
    crash the kernel.

    As there is only one CPU, that function should never be called. But if
    the kernel is compiled for SMP, the push/pull RT scheduling logic now
    uses it for irq_work: when the one CPU is overloaded, it can end up
    using that function to send an IPI to itself and crash the kernel.

    Ideally, we should disable SCHED_FEAT(RT_PUSH_IPI) if the system only
    has a single CPU. But SCHED_FEAT is a constant if sched debugging is
    turned off. An alternative fix, which also helps normal SMP machines,
    is to not initiate the pull code if there's only one RT overloaded CPU,
    and that CPU happens to be the current CPU that is scheduling in a
    lower priority task.

    Even on a system with many CPUs, if there are many RT tasks waiting to
    run on a single CPU, and that CPU schedules in another RT task of lower
    priority, it will initiate the PULL logic in case there's a higher
    priority RT task on another CPU that is waiting to run. But if there is
    no other CPU with waiting RT tasks, it will initiate the RT pull logic
    on itself (as it still has RT tasks waiting to run). This is a wasted
    effort.

    Not only does this help with SMP code where the current CPU is the only
    one with RT overloaded tasks, it should also solve the issue that
    Daniel encountered, because it will prevent the PULL logic from
    executing, as there's only one CPU on the system, and the check added
    here will cause it to exit the RT pull code.
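
    As a rough sketch (reusing field and helper names already present in
    kernel/sched/rt.c, not necessarily the verbatim patch), the check sits
    at the top of pull_rt_task():

        int rt_overload_count = rt_overloaded(this_rq);

        /* Nothing to do if no CPU is RT overloaded */
        if (likely(!rt_overload_count))
                return;

        /* If we are the only overloaded CPU, there is nobody to pull from */
        if (rt_overload_count == 1 &&
            cpumask_test_cpu(this_rq->cpu, this_rq->rd->rto_mask))
                return;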

    Reported-by: Daniel Wagner
    Signed-off-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Sebastian Andrzej Siewior
    Cc: Thomas Gleixner
    Cc: linux-rt-users
    Cc: stable@vger.kernel.org
    Fixes: 4bdced5c9 ("sched/rt: Simplify the IPI based RT balancing logic")
    Link: http://lkml.kernel.org/r/20171202130454.4cbbfe8d@vmware.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

11 Dec, 2017

1 commit

  • Fix the following kernel-doc warnings after code restructuring:

    ../kernel/sched/core.c:5113: warning: No description found for parameter 't'
    ../kernel/sched/core.c:5113: warning: Excess function parameter 'interval' description in 'sched_rr_get_interval'

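    One way to silence them (a sketch only; the parameter name is taken from
    the warnings above, not necessarily the exact upstream change) is to make
    the kernel-doc comment describe the parameter the function now takes:

        /**
         * sched_rr_get_interval - return the default timeslice of a process.
         * @pid: pid of the process.
         * @t: pointer to the returned timeslice value.
         */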

    Signed-off-by: Randy Dunlap
    Cc: Al Viro
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: abca5fc535a3e ("sched_rr_get_interval(): move compat to native, get rid of set_fs()")
    Link: http://lkml.kernel.org/r/995c6ded-b32e-bbe4-d9f5-4d42d121aff1@infradead.org
    Signed-off-by: Ingo Molnar

    Randy Dunlap
     

07 Dec, 2017

2 commits

  • Unlike running, the runnable part can't be directly propagated through
    the hierarchy when we migrate a task. The main reason is that runnable
    time can be shared with other sched_entities that stay on the rq and
    this runnable time will also remain on prev cfs_rq and must not be
    removed.

    Instead, we can estimate what the new runnable of the prev cfs_rq
    should be and check that this estimate stays in a plausible range.
    prop_runnable_sum is a good estimate when adding runnable_sum, but it
    fails most often when we remove it. In that case we can use the
    formula below:

    gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight

    which assumes that tasks are equally runnable, which is not true but is
    easy to compute.
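
    In code, the estimate above amounts to something like the following
    sketch (using the cfs_rq/sched_entity fields named in this message; the
    real logic lives in the runnable propagation path of kernel/sched/fair.c):

        long load_sum = 0, runnable_sum;

        /*
         * Estimate the group's unweighted runnable_sum by assuming all of
         * its tasks are equally runnable.
         */
        if (scale_load_down(gcfs_rq->load.weight))
                load_sum = div_s64(gcfs_rq->avg.load_sum,
                                   scale_load_down(gcfs_rq->load.weight));

        /* But never inflate the entity's own runnable time. */
        runnable_sum = min(se->avg.load_sum, load_sum);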

    Besides these estimates, we have several simple rules that help us
    filter out wrong ones:

    - ge->avg.runnable_sum can't be higher than LOAD_AVG_MAX
    - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX)
    - ge->avg.runnable_sum can't increase when we detach a task

    The effect of these fixes is better cgroups balancing.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Peter Zijlstra (Intel)
    Cc: Ben Segall
    Cc: Chris Mason
    Cc: Dietmar Eggemann
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • The following cleanup commit:

    50816c48997a ("sched/wait: Standardize internal naming of wait-queue entries")

    ... unintentionally changed the behavior of add_wait_queue() from
    inserting the wait entry at the head of the wait queue to the tail
    of the wait queue.

    Beyond a negative performance impact this change in behavior
    theoretically also breaks wait queues which mix exclusive and
    non-exclusive waiters, as non-exclusive waiters will not be
    woken up if they are queued behind enough exclusive waiters.
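
    The fix is to restore head insertion for non-exclusive waiters; roughly
    (a sketch of kernel/sched/wait.c, not necessarily the verbatim patch):

        void add_wait_queue(struct wait_queue_head *wq_head,
                            struct wait_queue_entry *wq_entry)
        {
                unsigned long flags;

                wq_entry->flags &= ~WQ_FLAG_EXCLUSIVE;
                spin_lock_irqsave(&wq_head->lock, flags);
                __add_wait_queue(wq_head, wq_entry);    /* head again, not tail */
                spin_unlock_irqrestore(&wq_head->lock, flags);
        }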

    Signed-off-by: Omar Sandoval
    Reviewed-by: Jens Axboe
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Fixes: ("sched/wait: Standardize internal naming of wait-queue entries")
    Link: http://lkml.kernel.org/r/a16c8ccffd39bd08fdaa45a5192294c784b803a7.1512544324.git.osandov@fb.com
    Signed-off-by: Ingo Molnar

    Omar Sandoval
     

18 Nov, 2017

1 commit

  • Pull compat and uaccess updates from Al Viro:

    - {get,put}_compat_sigset() series

    - assorted compat ioctl stuff

    - more set_fs() elimination

    - a few more timespec64 conversions

    - several removals of pointless access_ok() in places where it was
    followed only by non-__ variants of primitives

    * 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (24 commits)
    coredump: call do_unlinkat directly instead of sys_unlink
    fs: expose do_unlinkat for built-in callers
    ext4: take handling of EXT4_IOC_GROUP_ADD into a helper, get rid of set_fs()
    ipmi: get rid of pointless access_ok()
    pi433: sanitize ioctl
    cxlflash: get rid of pointless access_ok()
    mtdchar: get rid of pointless access_ok()
    r128: switch compat ioctls to drm_ioctl_kernel()
    selection: get rid of field-by-field copyin
    VT_RESIZEX: get rid of field-by-field copyin
    i2c compat ioctls: move to ->compat_ioctl()
    sched_rr_get_interval(): move compat to native, get rid of set_fs()
    mips: switch to {get,put}_compat_sigset()
    sparc: switch to {get,put}_compat_sigset()
    s390: switch to {get,put}_compat_sigset()
    ppc: switch to {get,put}_compat_sigset()
    parisc: switch to {get,put}_compat_sigset()
    get_compat_sigset()
    get rid of {get,put}_compat_itimerspec()
    io_getevents: Use timespec64 to represent timeouts
    ...

    Linus Torvalds
     

17 Nov, 2017

1 commit

  • Pull AFS updates from David Howells:
    "kAFS filesystem driver overhaul.

    The major points of the overhaul are:

    (1) Preliminary groundwork is laid for supporting network-namespacing
    of kAFS. The remainder of the namespacing work requires some way
    to pass namespace information to submounts triggered by an
    automount. This requires something like the mount overhaul that's
    in progress.

    (2) sockaddr_rxrpc is used in preference to in_addr for holding
    addresses internally and add support for talking to the YFS VL
    server. With this, kAFS can do everything over IPv6 as well as
    IPv4 if it's talking to servers that support it.

    (3) Callback handling is overhauled to be generally passive rather
    than active. 'Callbacks' are promises by the server to tell us
    about data and metadata changes. Callbacks are now checked when
    we next touch an inode rather than actively going and looking for
    it where possible.

    (4) File access permit caching is overhauled to store the caching
    information per-inode rather than per-directory, shared over
    subordinate files. Whilst older AFS servers only allow ACLs on
    directories (shared to the files in that directory), newer AFS
    servers break that restriction.

    To improve memory usage and to make it easier to do mass-key
    removal, permit combinations are cached and shared.

    (5) Cell database management is overhauled to allow lighter locks to
    be used and to make cell records autonomous state machines that
    look after getting their own DNS records and cleaning themselves
    up, in particular preventing races in acquiring and relinquishing
    the fscache token for the cell.

    (6) Volume caching is overhauled. The afs_vlocation record is got rid
    of to simplify things and the superblock is now keyed on the cell
    and the numeric volume ID only. The volume record is tied to a
    superblock and normal superblock management is used to mediate
    the lifetime of the volume fscache token.

    (7) File server record caching is overhauled to make server records
    independent of cells and volumes. A server can be in multiple
    cells (in such a case, the administrator must make sure that the
    VL services for all cells correctly reflect the volumes shared
    between those cells).

    Server records are now indexed using the UUID of the server
    rather than the address since a server can have multiple
    addresses.

    (8) File server rotation is overhauled to handle VMOVED, VBUSY (and
    similar), VOFFLINE and VNOVOL indications and to handle rotation
    both of servers and addresses of those servers. The rotation will
    also wait and retry if the server says it is busy.

    (9) Data writeback is overhauled. Each inode no longer stores a list
    of modified sections tagged with the key that authorised it in
    favour of noting the modified region of a page in page->private
    and storing a list of keys that made modifications in the inode.

    This simplifies things and allows other keys to be used to
    actually write to the server if a key that made a modification
    becomes useless.

    (10) Writable mmap() is implemented. This allows a kernel to be build
    entirely on AFS.

    Note that Pre AFS-3.4 servers are no longer supported, though this can
    be added back if necessary (AFS-3.4 was released in 1998)"

    * tag 'afs-next-20171113' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (35 commits)
    afs: Protect call->state changes against signals
    afs: Trace page dirty/clean
    afs: Implement shared-writeable mmap
    afs: Get rid of the afs_writeback record
    afs: Introduce a file-private data record
    afs: Use a dynamic port if 7001 is in use
    afs: Fix directory read/modify race
    afs: Trace the sending of pages
    afs: Trace the initiation and completion of client calls
    afs: Fix documentation on # vs % prefix in mount source specification
    afs: Fix total-length calculation for multiple-page send
    afs: Only progress call state at end of Tx phase from rxrpc callback
    afs: Make use of the YFS service upgrade to fully support IPv6
    afs: Overhaul volume and server record caching and fileserver rotation
    afs: Move server rotation code into its own file
    afs: Add an address list concept
    afs: Overhaul cell database management
    afs: Overhaul permit caching
    afs: Overhaul the callback handling
    afs: Rename struct afs_call server member to cm_server
    ...

    Linus Torvalds
     

16 Nov, 2017

1 commit

  • Pull cgroup updates from Tejun Heo:
    "Cgroup2 cpu controller support is finally merged.

    - Basic cpu statistics support to allow monitoring by default without
    the CPU controller enabled.

    - cgroup2 cpu controller support.

    - /sys/kernel/cgroup files to help dealing with new / optional
    features"

    * 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: export list of cgroups v2 features using sysfs
    cgroup: export list of delegatable control files using sysfs
    cgroup: mark @cgrp __maybe_unused in cpu_stat_show()
    MAINTAINERS: relocate cpuset.c
    cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat
    sched: Implement interface for cgroup unified hierarchy
    sched: Misc preps for cgroup unified hierarchy interface
    sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
    cgroup: statically initialize init_css_set->dfl_cgrp
    cgroup: Implement cgroup2 basic CPU usage accounting
    cpuacct: Introduce cgroup_account_cputime[_field]()
    sched/cputime: Expose cputime_adjust()

    Linus Torvalds
     

14 Nov, 2017

4 commits

  • Pull power management updates from Rafael Wysocki:
    "There are no real big ticket items here this time.

    The most noticeable change is probably the relocation of the OPP
    (Operating Performance Points) framework to its own directory under
    drivers/ as it has grown big enough for that. Also Viresh is now going
    to maintain it and send pull requests for it to me, so you will see
    this change in the git history going forward (but still not right
    now).

    Another noticeable set of changes is the modifications of the PM core,
    the PCI subsystem and the ACPI PM domain to allow of more integration
    between system-wide suspend/resume and runtime PM. For now it's just a
    way to avoid resuming devices from runtime suspend unnecessarily
    during system suspend (if the driver sets a flag to indicate its
    readiness for that) and in the works is an analogous mechanism to
    allow devices to stay suspended after system resume.

    In addition to that, we have some changes related to supporting
    frequency-invariant CPU utilization metrics in the scheduler and in
    the schedutil cpufreq governor on ARM and changes to add support for
    device performance states to the generic power domains (genpd)
    framework.

    The rest is mostly fixes and cleanups of various sorts.

    Specifics:

    - Relocate the OPP (Operating Performance Points) framework to its
    own directory under drivers/ and add support for power domain
    performance states to it (Viresh Kumar).

    - Modify the PM core, the PCI bus type and the ACPI PM domain to
    support power management driver flags allowing device drivers to
    specify their capabilities and preferences regarding the handling
    of devices with enabled runtime PM during system suspend/resume and
    clean up that code somewhat (Rafael Wysocki, Ulf Hansson).

    - Add frequency-invariant accounting support to the task scheduler on
    ARM and ARM64 (Dietmar Eggemann).

    - Fix PM QoS device resume latency framework to prevent "no
    restriction" requests from overriding requests with specific
    requirements and drop the confusing PM_QOS_FLAG_REMOTE_WAKEUP
    device PM QoS flag (Rafael Wysocki).

    - Drop legacy class suspend/resume operations from the PM core and
    drop legacy bus type suspend and resume callbacks from ARM/locomo
    (Rafael Wysocki).

    - Add min/max frequency support to devfreq and clean it up somewhat
    (Chanwoo Choi).

    - Rework wakeup support in the generic power domains (genpd)
    framework and update some of its users accordingly (Geert
    Uytterhoeven).

    - Convert timers in the PM core to use timer_setup() (Kees Cook).

    - Add support for exposing the SLP_S0 (Low Power S0 Idle) residency
    counter based on the LPIT ACPI table on Intel platforms (Srinivas
    Pandruvada).

    - Add per-CPU PM QoS resume latency support to the ladder cpuidle
    governor (Ramesh Thomas).

    - Fix a deadlock between the wakeup notify handler and the notifier
    removal in the ACPI core (Ville Syrjälä).

    - Fix a cpufreq schedutil governor issue causing it to use stale
    cached frequency values sometimes (Viresh Kumar).

    - Fix an issue in the system suspend core support code causing wakeup
    events detection to fail in some cases (Rajat Jain).

    - Fix the generic power domains (genpd) framework to prevent the PM
    core from using the direct-complete optimization with it as that is
    guaranteed to fail (Ulf Hansson).

    - Fix a minor issue in the cpuidle core and clean it up a bit (Gaurav
    Jindal, Nicholas Piggin).

    - Fix and clean up the intel_idle and ARM cpuidle drivers (Jason
    Baron, Len Brown, Leo Yan).

    - Fix a couple of minor issues in the OPP framework and clean it up
    (Arvind Yadav, Fabio Estevam, Sudeep Holla, Tobias Jordan).

    - Fix and clean up some cpufreq drivers and fix a minor issue in the
    cpufreq statistics code (Arvind Yadav, Bhumika Goyal, Fabio
    Estevam, Gautham Shenoy, Gustavo Silva, Marek Szyprowski, Masahiro
    Yamada, Robert Jarzmik, Zumeng Chen).

    - Fix minor issues in the system suspend and hibernation core, in
    power management documentation and in the AVS (Adaptive Voltage
    Scaling) framework (Helge Deller, Himanshu Jha, Joe Perches, Rafael
    Wysocki).

    - Fix some issues in the cpupower utility and document that Shuah
    Khan is going to maintain it going forward (Prarit Bhargava, Shuah
    Khan)"

    * tag 'pm-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (88 commits)
    tools/power/cpupower: add libcpupower.so.0.0.1 to .gitignore
    tools/power/cpupower: Add 64 bit library detection
    intel_idle: Graceful probe failure when MWAIT is disabled
    cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freq
    freezer: Fix typo in freezable_schedule_timeout() comment
    PM / s2idle: Clear the events_check_enabled flag
    cpufreq: stats: Handle the case when trans_table goes beyond PAGE_SIZE
    cpufreq: arm_big_little: make cpufreq_arm_bL_ops structures const
    cpufreq: arm_big_little: make function arguments and structure pointer const
    cpuidle: Avoid assignment in if () argument
    cpuidle: Clean up cpuidle_enable_device() error handling a bit
    ACPI / PM: Fix acpi_pm_notifier_lock vs flush_workqueue() deadlock
    PM / Domains: Fix genpd to deal with drivers returning 1 from ->prepare()
    cpuidle: ladder: Add per CPU PM QoS resume latency support
    PM / QoS: Fix device resume latency framework
    PM / domains: Rework governor code to be more consistent
    PM / Domains: Remove gpd_dev_ops.active_wakeup() callback
    soc: rockchip: power-domain: Use GENPD_FLAG_ACTIVE_WAKEUP
    soc: mediatek: Use GENPD_FLAG_ACTIVE_WAKEUP
    ARM: shmobile: pm-rmobile: Use GENPD_FLAG_ACTIVE_WAKEUP
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main updates in this cycle were:

    - Group balancing enhancements and cleanups (Brendan Jackman)

    - Move CPU isolation related functionality into its separate
    kernel/sched/isolation.c file, with related 'housekeeping_*()'
    namespace and nomenclature et al. (Frederic Weisbecker)

    - Improve the interactive/cpu-intense fairness calculation (Josef
    Bacik)

    - Improve the PELT code and related cleanups (Peter Zijlstra)

    - Improve the logic of pick_next_task_fair() (Uladzislau Rezki)

    - Improve the RT IPI based balancing logic (Steven Rostedt)

    - Various micro-optimizations:

    - better !CONFIG_SCHED_DEBUG optimizations (Patrick Bellasi)

    - better idle loop (Cheng Jian)

    - ... plus misc fixes, cleanups and updates"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
    sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds
    sched/sysctl: Fix attributes of some extern declarations
    sched/isolation: Document isolcpus= boot parameter flags, mark it deprecated
    sched/isolation: Add basic isolcpus flags
    sched/isolation: Move isolcpus= handling to the housekeeping code
    sched/isolation: Handle the nohz_full= parameter
    sched/isolation: Introduce housekeeping flags
    sched/isolation: Split out new CONFIG_CPU_ISOLATION=y config from CONFIG_NO_HZ_FULL
    sched/isolation: Rename is_housekeeping_cpu() to housekeeping_cpu()
    sched/isolation: Use its own static key
    sched/isolation: Make the housekeeping cpumask private
    sched/isolation: Provide a dynamic off-case to housekeeping_any_cpu()
    sched/isolation, watchdog: Use housekeeping_cpumask() instead of ad-hoc version
    sched/isolation: Move housekeeping related code to its own file
    sched/idle: Micro-optimize the idle loop
    sched/isolcpus: Fix "isolcpus=" boot parameter handling when !CONFIG_CPUMASK_OFFSTACK
    x86/tsc: Append the 'tsc=' description for the 'tsc=unstable' boot parameter
    sched/rt: Simplify the IPI based RT balancing logic
    block/ioprio: Use a helper to check for RT prio
    sched/rt: Add a helper to test for a RT task
    ...

    Linus Torvalds
     
  • Pull core locking updates from Ingo Molnar:
    "The main changes in this cycle are:

    - Another attempt at enabling cross-release lockdep dependency
    tracking (automatically part of CONFIG_PROVE_LOCKING=y), this time
    with better performance and fewer false positives. (Byungchul Park)

    - Introduce lockdep_assert_irqs_enabled()/disabled() and convert
    open-coded equivalents to lockdep variants. (Frederic Weisbecker)

    - Add down_read_killable() and use it in the VFS's iterate_dir()
    method. (Kirill Tkhai)

    - Convert remaining uses of ACCESS_ONCE() to
    READ_ONCE()/WRITE_ONCE(). Most of the conversion was Coccinelle
    driven. (Mark Rutland, Paul E. McKenney)

    - Get rid of lockless_dereference(), by strengthening Alpha atomics,
    strengthening READ_ONCE() with smp_read_barrier_depends() and thus
    being able to convert users of lockless_dereference() to
    READ_ONCE(). (Will Deacon)

    - Various micro-optimizations:

    - better PV qspinlocks (Waiman Long),
    - better x86 barriers (Michael S. Tsirkin)
    - better x86 refcounts (Kees Cook)

    - ... plus other fixes and enhancements. (Borislav Petkov, Juergen
    Gross, Miguel Bernal Marin)"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    locking/x86: Use LOCK ADD for smp_mb() instead of MFENCE
    rcu: Use lockdep to assert IRQs are disabled/enabled
    netpoll: Use lockdep to assert IRQs are disabled/enabled
    timers/posix-cpu-timers: Use lockdep to assert IRQs are disabled/enabled
    sched/clock, sched/cputime: Use lockdep to assert IRQs are disabled/enabled
    irq_work: Use lockdep to assert IRQs are disabled/enabled
    irq/timings: Use lockdep to assert IRQs are disabled/enabled
    perf/core: Use lockdep to assert IRQs are disabled/enabled
    x86: Use lockdep to assert IRQs are disabled/enabled
    smp/core: Use lockdep to assert IRQs are disabled/enabled
    timers/hrtimer: Use lockdep to assert IRQs are disabled/enabled
    timers/nohz: Use lockdep to assert IRQs are disabled/enabled
    workqueue: Use lockdep to assert IRQs are disabled/enabled
    irq/softirqs: Use lockdep to assert IRQs are disabled/enabled
    locking/lockdep: Add IRQs disabled/enabled assertion APIs: lockdep_assert_irqs_enabled()/disabled()
    locking/pvqspinlock: Implement hybrid PV queued/unfair locks
    locking/rwlocks: Fix comments
    x86/paravirt: Set up the virt_spin_lock_key after static keys get initialized
    block, locking/lockdep: Assign a lock_class per gendisk used for wait_for_completion()
    workqueue: Remove now redundant lock acquisitions wrt. workqueue flushes
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "The main changes in this cycle are:

    - Documentation updates

    - RCU CPU stall-warning updates

    - Torture-test updates

    - Miscellaneous fixes

    Size wise the biggest updates are to documentation. Excluding
    documentation most of the code increase comes from a single commit
    which expands debugging"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    srcu: Add parameters to SRCU docbook comments
    doc: Rewrite confusing statement about memory barriers
    memory-barriers.txt: Fix typo in pairing example
    rcu/segcblist: Include rcupdate.h
    rcu: Add extended-quiescent-state testing advice
    rcu: Suppress lockdep false-positive ->boost_mtx complaints
    rcu: Do not include rtmutex_common.h unconditionally
    torture: Provide TMPDIR environment variable to specify tmpdir
    rcutorture: Dump writer stack if stalled
    rcutorture: Add interrupt-disable capability to stall-warning tests
    rcu: Suppress RCU CPU stall warnings while dumping trace
    rcu: Turn off tracing before dumping trace
    rcu: Make RCU CPU stall warnings check for irq-disabled CPUs
    sched,rcu: Make cond_resched() provide RCU quiescent state
    sched: Make resched_cpu() unconditional
    irq_work: Map irq_work_on_queue() to irq_work_on() in !SMP
    rcu: Create call_rcu_tasks() kthread at boot time
    rcu: Fix up pending cbs check in rcu_prepare_for_idle
    memory-barriers: Rework multicopy-atomicity section
    memory-barriers: Replace uses of "transitive"
    ...

    Linus Torvalds
     

13 Nov, 2017

2 commits

  • Make wait_on_atomic_t() pass the TASK_* mode onto its action function as an
    extra argument and make it 'unsigned int' throughout.

    Also, consolidate a bunch of identical action functions into a default
    function that can do the appropriate thing for the mode.

    Also, change the argument name in the bit_wait*() function declarations to
    reflect the fact that it's the mode and not the bit number.

    [Peter Z gives this a grudging ACK, but thinks that the whole atomic_t wait
    should be done differently, though he's not immediately sure as to how]
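
    A sketch of the resulting interface (signatures assumed from the
    description above; 'obj' is an illustrative object with an atomic_t
    reference count): the action callback now receives the TASK_* mode as an
    unsigned int, and a default action replaces the old copy-pasted ones:

        int atomic_t_wait(atomic_t *counter, unsigned int mode)
        {
                schedule();
                if (signal_pending_state(mode, current))
                        return -EINTR;
                return 0;
        }

        /* callers pass the mode once; it is forwarded to the action */
        wait_on_atomic_t(&obj->usage, atomic_t_wait, TASK_UNINTERRUPTIBLE);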

    Signed-off-by: David Howells
    Acked-by: Peter Zijlstra
    cc: Ingo Molnar

    David Howells
     
  • * pm-cpufreq-sched:
    cpufreq: schedutil: Reset cached_raw_freq when not in sync with next_freq

    * pm-opp:
    PM / OPP: Add dev_pm_opp_{un}register_get_pstate_helper()
    PM / OPP: Support updating performance state of device's power domain
    PM / OPP: add missing of_node_put() for of_get_cpu_node()
    PM / OPP: Rename dev_pm_opp_register_put_opp_helper()
    PM / OPP: Add missing of_node_put(np)
    PM / OPP: Move error message to debug level
    PM / OPP: Use snprintf() to avoid kasprintf() and kfree()
    PM / OPP: Move the OPP directory out of power/

    Rafael J. Wysocki
     

09 Nov, 2017

3 commits

  • When the kernel is compiled with !CONFIG_SCHED_DEBUG support, we expect
    all SCHED_FEATs to be turned into compile-time constants that can be
    propagated to support compiler optimizations.

    Specifically, we expect that code blocks like this:

    if (sched_feat(FEATURE_NAME) [&& <other_conditions>]) {
            /* FEATURE CODE */
    }

    are turned into dead code in case FEATURE_NAME defaults to FALSE, and thus
    are removed by the compiler from the final image.

    For this mechanism to work properly, the compiler needs full access,
    from each translation unit, to the value used by the sched_feat()
    macro. This macro is defined as:

    #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))

    and thus, the compiler can optimize that code only if the value of
    sysctl_sched_features is visible within each translation unit.

    Since:

    029632fbb ("sched: Make separate sched*.c translation units")

    the scheduler code has been split into separate translation units
    however the definition of sysctl_sched_features is part of
    kernel/sched/core.c while, for all the other scheduler modules, it is
    visible only via kernel/sched/sched.h as an:

    extern const_debug unsigned int sysctl_sched_features

    Unfortunately, an extern reference does not allow the compiler to apply
    constant propagation. Thus, on a !CONFIG_SCHED_DEBUG kernel we still end
    up with code that loads a memory reference and (eventually) branches
    over a chunk of code.

    This mechanism is unavoidable when sched_features can be turned on and off at
    run-time. However, this is not the case for "production" kernels compiled with
    !CONFIG_SCHED_DEBUG. In this case, sysctl_sched_features is just a constant value
    which cannot be changed at run-time and thus memory loads and jumps can be
    avoided altogether.

    This patch fixes the !CONFIG_SCHED_DEBUG case by declaring a local
    version of the sysctl_sched_features constant for each translation
    unit. This ultimately allows the compiler to perform constant
    propagation and dead-code pruning.
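
    Roughly, the !CONFIG_SCHED_DEBUG side of kernel/sched/sched.h then looks
    like this (a sketch, not the verbatim patch):

        /*
         * Give each translation unit its own constant copy so the compiler
         * can propagate it and prune disabled-feature code.
         */
        #define SCHED_FEAT(name, enabled)       \
                (1UL << __SCHED_FEAT_##name) * enabled |
        static const_debug __maybe_unused unsigned int sysctl_sched_features =
        #include "features.h"
                0;
        #undef SCHED_FEAT

        #define sched_feat(x) !!(sysctl_sched_features & (1UL << __SCHED_FEAT_##x))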

    Tests have been done, with !CONFIG_SCHED_DEBUG on a v4.14-rc8 with and without
    the patch, by running 30 iterations of:

    perf bench sched messaging --pipe --thread --group 4 --loop 50000

    on a 40 cores Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz using the
    powersave governor to rule out variations due to frequency scaling.

    Statistics on the reported completion time:

                         count  mean      std        min      99%        max
    v4.14-rc8            30.0   15.7831   0.176032   15.442   16.01226   16.014
    v4.14-rc8+patch      30.0   15.5033   0.189681   15.232   15.93938   15.962

    ... show a 1.8% speedup in average completion time and a 0.5% speedup at
    the 99th percentile.

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Chris Redpath
    Reviewed-by: Dietmar Eggemann
    Reviewed-by: Brendan Jackman
    Acked-by: Peter Zijlstra
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Link: http://lkml.kernel.org/r/20171108184101.16006-1-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
  • * pm-cpufreq-sched:
    cpufreq: schedutil: Examine the correct CPU when we update util

    Rafael J. Wysocki
     
  • 'cached_raw_freq' is used to get the next frequency quickly but should
    always be in sync with sg_policy->next_freq. There is a case where it is
    not and in such cases it should be reset to avoid switching to incorrect
    frequencies.

    Consider this case for example:

    - policy->cur is 1.2 GHz (Max)
    - New request comes for 780 MHz and we store that in cached_raw_freq.
    - Based on 780 MHz, we calculate the effective frequency as 800 MHz.
    - We then see the CPU wasn't idle recently and choose to keep the next
    freq as 1.2 GHz.
    - Now we have cached_raw_freq is 780 MHz and sg_policy->next_freq is
    1.2 GHz.
    - Now if the utilization doesn't change in the next request, then the
    next target frequency will still be 780 MHz and it will match with
    cached_raw_freq. But we will choose 1.2 GHz instead of 800 MHz here.
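
    The fix described above boils down to clearing the cache whenever
    next_freq is forced away from the freshly computed value; a sketch
    (based on the sugov_update_single() logic referenced by the Fixes tag):

        if (busy && next_f < sg_policy->next_freq) {
                next_f = sg_policy->next_freq;

                /* Reset cached freq as next_freq has changed */
                sg_policy->cached_raw_freq = 0;
        }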

    Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
    Signed-off-by: Viresh Kumar
    Cc: 4.12+ # 4.12+
    Signed-off-by: Rafael J. Wysocki

    Viresh Kumar
     

08 Nov, 2017

2 commits


05 Nov, 2017

1 commit

  • After commit 674e75411fc2 (sched: cpufreq: Allow remote cpufreq
    callbacks) we stopped always reading the utilization for the CPU we
    are running the governor on, and instead read it for the CPU
    which we've been told has updated utilization. That CPU is stored in
    sugov_cpu->cpu.

    The value is set in sugov_register(), but we clear it in sugov_start(),
    which leads to always looking at the utilization of CPU0 instead of
    the correct one.

    Fix this by consolidating the initialization code into sugov_start().
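
    A sketch of the consolidated initialization in sugov_start() (not
    necessarily the verbatim patch):

        for_each_cpu(cpu, policy->cpus) {
                struct sugov_cpu *sg_cpu = &per_cpu(sugov_cpu, cpu);

                memset(sg_cpu, 0, sizeof(*sg_cpu));
                sg_cpu->cpu = cpu;              /* (re)set after the memset */
                sg_cpu->sg_policy = sg_policy;
        }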

    Fixes: 674e75411fc2 (sched: cpufreq: Allow remote cpufreq callbacks)
    Signed-off-by: Chris Redpath
    Reviewed-by: Patrick Bellasi
    Reviewed-by: Brendan Jackman
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Chris Redpath
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5
    lines)

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

27 Oct, 2017

10 commits

  • Add flags to control NOHZ and domain isolation from "isolcpus=", in
    order to centralize the isolation features to a common interface. Domain
    isolation remains the default so as not to break the existing isolcpus=
    boot parameter behaviour.

    Further flags in the future may include 0hz (1hz tick offload) and timers,
    workqueue, RCU, kthread, watchdog, likely all merged together in a
    common flag ("async"?). In any case, this will have to be modifiable by
    cpusets.
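
    As an illustration of the resulting interface, the boot parameter now
    accepts an optional comma-separated flag list in front of the CPU list
    (example values only):

        isolcpus=nohz,domain,2-7

    where passing "domain" alone preserves today's behaviour.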

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-12-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • We want to centralize the isolation features in the housekeeping
    subsystem, and scheduler domain isolation is a significant part of it.

    No intended behaviour change, we just reuse the housekeeping cpumask
    and core code.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-11-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • We want to centralize the isolation management, done by the housekeeping
    subsystem. Therefore we need to handle the nohz_full= parameter from
    there.

    Since nohz_full= so far has involved unbound timers, watchdog, RCU
    and tilegx NAPI isolation, we keep that default behaviour.

    nohz_full= will be deprecated in the future. We want to control
    the isolation features from the isolcpus= parameter.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-10-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Before we implement isolcpus under housekeeping, we need the isolation
    features to be more fine-grained. For example, some people want NOHZ_FULL
    without the full scheduler isolation, others want full scheduler
    isolation without NOHZ_FULL.

    So let's cut all these isolation features piecewise, at the risk of
    overcutting it right now. We can still merge some flags later if they
    always make sense together.
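
    A sketch of the piecewise flags this introduces (names as they appear in
    include/linux/sched/isolation.h; treat the exact set as illustrative):

        enum hk_flags {
                HK_FLAG_TIMER   = 1,
                HK_FLAG_RCU     = (1 << 1),
                HK_FLAG_MISC    = (1 << 2),
                HK_FLAG_SCHED   = (1 << 3),
        };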

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-9-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Split the housekeeping config from CONFIG_NO_HZ_FULL. This way we finally
    separate the isolation code from NOHZ.

    A dependency on CONFIG_NO_HZ_FULL remains for now, though, while the
    housekeeping code still deals with NOHZ internals.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-8-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Fit it into the housekeeping_*() namespace.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-7-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Housekeeping code still depends on the nohz_full static key. Since we want
    to decouple housekeeping from NOHZ, let's create a housekeeping specific
    static key.

    It's mostly relevant for calls to is_housekeeping_cpu() from the scheduler.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-6-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • Nobody needs to access this detail. housekeeping_cpumask() already
    takes care of it.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-5-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The housekeeping code is currently tied to the NOHZ code. As we are
    planning to make housekeeping independent from it, start with moving
    the relevant code to its own file.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Thomas Gleixner
    Acked-by: Paul E. McKenney
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Luiz Capitulino
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1509072159-31808-2-git-send-email-frederic@kernel.org
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The basic cpu stats are currently shown with the "cpu." prefix in
    cgroup.stat, and the same information is duplicated in cpu.stat when
    the cpu controller is enabled. This is ugly and not very scalable, as we
    want to expand the coverage of stat information which is always
    available.

    This patch makes cgroup core always create "cpu.stat" file and show
    the basic cpu stat there and calls the cpu controller to show the
    extra stats when enabled. This ensures that the same information
    isn't presented in multiple places and makes future expansion of basic
    stats easier.
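
    For example, every cgroup2 group now carries a cpu.stat file with the
    basic usage fields (illustrative path and values):

        # cat /sys/fs/cgroup/test/cpu.stat
        usage_usec 876543
        user_usec 654321
        system_usec 222222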

    Signed-off-by: Tejun Heo
    Acked-by: Peter Zijlstra (Intel)

    Tejun Heo
     

26 Oct, 2017

1 commit

  • Move the loop-invariant calculation of 'cpu' in do_idle() out of the loop body,
    because the current CPU is always constant.

    This improves the generated code both on x86-64 and ARM64:

    x86-64:

    Before patch (execution in loop):
    864: 0f ae e8 lfence
    867: 65 8b 05 c2 38 f1 7e mov %gs:0x7ef138c2(%rip),%eax
    86e: 89 c0 mov %eax,%eax
    870: 48 0f a3 05 68 19 08 bt %rax,0x1081968(%rip)
    877: 01

    After patch (execution in loop):
    872: 0f ae e8 lfence
    875: 4c 0f a3 25 63 19 08 bt %r12,0x1081963(%rip)
    87c: 01

    ARM64:

    Before patch (execution in loop):
    c58: d5033d9f dsb ld
    c5c: d538d080 mrs x0, tpidr_el1
    c60: b8606a61 ldr w1, [x19,x0]
    c64: 1100fc20 add w0, w1, #0x3f
    c68: 7100003f cmp w1, #0x0
    c6c: 1a81b000 csel w0, w0, w1, lt
    c70: 13067c00 asr w0, w0, #6
    c74: 93407c00 sxtw x0, w0
    c78: f8607a80 ldr x0, [x20,x0,lsl #3]
    c7c: 9ac12401 lsr x1, x0, x1
    c80: 36000581 tbz w1, #0, d30

    After patch (execution in loop):
    c84: d5033d9f dsb ld
    c88: f9400260 ldr x0, [x19]
    c8c: ea14001f tst x0, x20
    c90: 54000580 b.eq d40
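
    In C terms, the change is simply hoisting the smp_processor_id() call out
    of the loop (a sketch of do_idle() in kernel/sched/idle.c):

        static void do_idle(void)
        {
                int cpu = smp_processor_id();   /* loop-invariant, hoisted */

                while (!need_resched()) {
                        rmb();

                        /* was: cpu_is_offline(smp_processor_id()) */
                        if (cpu_is_offline(cpu))
                                arch_cpu_idle_dead();

                        /* ... rest of the idle loop ... */
                }
        }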

    Signed-off-by: Cheng Jian
    [ Rewrote the title and the changelog. ]
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: huawei.libin@huawei.com
    Cc: xiexiuqi@huawei.com
    Link: http://lkml.kernel.org/r/1508930907-107755-1-git-send-email-cj.chengjian@huawei.com
    Signed-off-by: Ingo Molnar

    Cheng Jian
     

24 Oct, 2017

2 commits

  • cpulist_parse() uses nr_cpumask_bits as a limit to parse the
    passed buffer from kernel commandline. What nr_cpumask_bits
    represents varies depending upon the CONFIG_CPUMASK_OFFSTACK option:

    - If CONFIG_CPUMASK_OFFSTACK=n, then nr_cpumask_bits is the same as
    NR_CPUS, which might not represent the # of CPUs that really exist
    (default 64). So, there's a chance of a gap between nr_cpu_ids
    and NR_CPUS, which ultimately leads to an invalid cpulist_parse()
    operation. For example, if isolcpus=9 is passed on an 8-CPU
    system (CONFIG_CPUMASK_OFFSTACK=n) it doesn't show the error
    that it's supposed to.

    This patch fixes this bug by finding the last CPU of the passed
    isolcpus= list and checking it against nr_cpu_ids.

    It also fixes the error message to print nr_cpu_ids-1 rather than
    nr_cpu_ids, since CPU numbering starts from 0.
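
    A sketch of the described check (helper and variable names assumed for
    illustration only):

        /* Reject CPU numbers beyond the ones that actually exist. */
        if (cpumask_last(parsed_mask) >= nr_cpu_ids)
                pr_err("isolcpus: CPU range must be within 0-%u\n",
                       nr_cpu_ids - 1);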

    Signed-off-by: Rakib Mullick
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: adobriyan@gmail.com
    Cc: akpm@linux-foundation.org
    Cc: longman@redhat.com
    Cc: mka@chromium.org
    Cc: tj@kernel.org
    Link: http://lkml.kernel.org/r/20171023130154.9050-1-rakib.mullick@gmail.com
    [ Enhanced the changelog and the kernel message. ]
    Signed-off-by: Ingo Molnar

    include/linux/cpumask.h | 16 ++++++++++++++++
    kernel/sched/topology.c | 4 ++--
    2 files changed, 18 insertions(+), 2 deletions(-)

    Rakib Mullick
     
  • …k/linux-rcu into core/rcu

    Pull RCU updates from Paul E. McKenney:

    - Documentation updates
    - Miscellaneous fixes
    - RCU CPU stall-warning updates
    - Torture-test updates

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

21 Oct, 2017

1 commit


20 Oct, 2017

1 commit

  • This introduces a "register private expedited" membarrier command which
    allows eventual removal of important memory barrier constraints on the
    scheduler fast-paths. It changes how the "private expedited" membarrier
    command (new to 4.14) is used from user-space.

    This new command allows processes to register their intent to use the
    private expedited command. This affects how the expedited private
    command introduced in 4.14-rc is meant to be used, and should be merged
    before 4.14 final.

    Processes are now required to register before using
    MEMBARRIER_CMD_PRIVATE_EXPEDITED, otherwise that command returns EPERM.

    This fixes a problem that arose when designing requested extensions to
    sys_membarrier() to allow JITs to efficiently flush old code from
    instruction caches. Several potential algorithms are much less painful
    if the user registers the intent to use this functionality early on, for
    example, before the process spawns the second thread. Registering at
    this time removes the need to interrupt each and every thread in that
    process at the first expedited sys_membarrier() system call.
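
    An illustrative user-space flow (error handling trimmed; command names
    from include/uapi/linux/membarrier.h):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        int main(void)
        {
                /* Register intent once, e.g. before spawning more threads. */
                if (syscall(__NR_membarrier,
                            MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED, 0))
                        return 1;

                /* Later, any thread of this process may use the expedited command. */
                return syscall(__NR_membarrier,
                               MEMBARRIER_CMD_PRIVATE_EXPEDITED, 0) ? 1 : 0;
        }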

    Signed-off-by: Mathieu Desnoyers
    Acked-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Signed-off-by: Linus Torvalds

    Mathieu Desnoyers
     

10 Oct, 2017

5 commits

  • When a CPU lowers its priority (schedules out a high priority task for a
    lower priority one), a check is made to see if any other CPU has overloaded
    RT tasks (more than one). It checks the rto_mask to determine this and if so
    it will request to pull one of those tasks to itself if the non running RT
    task is of higher priority than the new priority of the next task to run on
    the current CPU.

    When we deal with large number of CPUs, the original pull logic suffered
    from large lock contention on a single CPU run queue, which caused a huge
    latency across all CPUs. This was caused by only having one CPU having
    overloaded RT tasks and a bunch of other CPUs lowering their priority. To
    solve this issue, commit:

    b6366f048e0c ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")

    changed the way to request a pull. Instead of grabbing the lock of the
    overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.

    Although the IPI logic worked very well in removing the large latency build
    up, it still could suffer from a large number of IPIs being sent to a single
    CPU. On an 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
    when I tested this on a 120 CPU box, with a stress test that had lots of
    RT tasks scheduling on all CPUs, it actually triggered the hard lockup
    detector! One CPU had so many IPIs sent to it, and due to the restart
    mechanism that is triggered when the source run queue has a priority status
    change, the CPU spent minutes! processing the IPIs.

    Thinking about this further, I realized there's no reason for each run queue
    to send its own IPI. As all CPUs with overloaded tasks must be scanned
    regardless if there's one or many CPUs lowering their priority, because
    there's no current way to find the CPU with the highest priority task that
    can schedule to one of these CPUs, there really only needs to be one IPI
    being sent around at a time.

    This greatly simplifies the code!

    The new approach is to have each root domain have its own irq work, as the
    rto_mask is per root domain. The root domain has the following fields
    attached to it:

    rto_push_work  - the irq work to process each CPU set in rto_mask
    rto_lock       - the lock to protect some of the other rto fields
    rto_loop_start - an atomic that keeps contention down on rto_lock;
                     the first CPU scheduling in a lower priority task
                     is the one to kick off the process
    rto_loop_next  - an atomic that gets incremented for each CPU that
                     schedules in a lower priority task
    rto_loop       - a variable protected by rto_lock that is used to
                     compare against rto_loop_next
    rto_cpu        - the CPU to send the next IPI to, also protected by
                     the rto_lock

    When a CPU schedules in a lower priority task and wants to make sure
    overloaded CPUs know about it, it increments the rto_loop_next. Then it
    atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
    then it is done, as another CPU is kicking off the IPI loop. If the old
    value is "0", then it will take the rto_lock to synchronize with a possible
    IPI being sent around to the overloaded CPUs.
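
    The rto_loop_start handshake reduces to a tiny try-lock built on an
    atomic; a sketch of the helpers named at the end of this message
    (illustrative, not necessarily the verbatim patch):

        static inline bool rto_start_trylock(atomic_t *v)
        {
                return !atomic_cmpxchg_acquire(v, 0, 1);
        }

        static inline void rto_start_unlock(atomic_t *v)
        {
                atomic_set_release(v, 0);
        }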

    If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
    IPI being sent around, or one is about to finish. Then rto_cpu is set to the
    first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
    in rto_mask, then there's nothing to be done.

    When the CPU receives the IPI, it will first try to push any RT tasks that are
    queued on the CPU but can't run because a higher priority RT task is
    currently running on that CPU.

    Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
    finds one, it simply sends an IPI to that CPU and the process continues.

    If there's no more CPUs in the rto_mask, then rto_loop is compared with
    rto_loop_next. If they match, everything is done and the process is over. If
    they do not match, then a CPU scheduled in a lower priority task as the IPI
    was being passed around, and the process needs to start again. The first CPU
    in rto_mask is sent the IPI.

    This change removes this duplication of work in the IPI logic, and greatly
    lowers the latency caused by the IPIs. This removed the lockup happening on
    the 120 CPU machine. It also simplifies the code tremendously. What else
    could anyone ask for?

    Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
    supplying me with the rto_start_trylock() and rto_start_unlock() helper
    functions.

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Clark Williams
    Cc: Daniel Bristot de Oliveira
    Cc: John Kacur
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Scott Wood
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt (Red Hat)
     
  • find_idlest_group() returns NULL when the local group is idlest. The
    caller then continues the find_idlest_group() search at a lower level
    of the current CPU's sched_domain hierarchy. find_idlest_group_cpu() is
    not consulted and, crucially, @new_cpu is not updated. This means the
    search is pointless and we return @prev_cpu from select_task_rq_fair().

    This is fixed by initialising @new_cpu to @cpu instead of @prev_cpu.
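
    The change itself is a one-liner in the idlest-CPU search (a sketch; the
    surrounding function lives in kernel/sched/fair.c):

        int new_cpu = cpu;      /* was: int new_cpu = prev_cpu; */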

    Signed-off-by: Brendan Jackman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Josef Bacik
    Reviewed-by: Vincent Guittot
    Cc: Dietmar Eggemann
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20171005114516.18617-6-brendan.jackman@arm.com
    Signed-off-by: Ingo Molnar

    Brendan Jackman
     
  • When 'p' is not allowed on any of the CPUs in the sched_domain, we
    currently return NULL from find_idlest_group(), and pointlessly
    continue the search on lower sched_domain levels (where 'p' is also not
    allowed) before returning prev_cpu regardless (as we have not updated
    new_cpu).

    Add an explicit check for this case, and add a comment to
    find_idlest_group(). Now when find_idlest_group() returns NULL, it always
    means that the local group is allowed and idlest.
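
    A sketch of the added early-exit (placement and exact form assumed):

        /* Bail out early if 'p' cannot run anywhere in this domain. */
        if (!cpumask_intersects(sched_domain_span(sd), &p->cpus_allowed))
                return prev_cpu;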

    Signed-off-by: Brendan Jackman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Reviewed-by: Josef Bacik
    Cc: Dietmar Eggemann
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20171005114516.18617-5-brendan.jackman@arm.com
    Signed-off-by: Ingo Molnar

    Brendan Jackman
     
  • When the local group is not allowed we do not modify this_*_load from
    their initial value of 0. That means that the load checks at the end
    of find_idlest_group cause us to incorrectly return NULL. Fixing the
    initial values to ULONG_MAX means we will instead return the idlest
    remote group in that case.
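
    A sketch of the corrected initial values (variable names as implied by
    "this_*_load" above):

        unsigned long this_runnable_load = ULONG_MAX;   /* was: 0 */
        unsigned long this_avg_load = ULONG_MAX;        /* was: 0 */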

    Signed-off-by: Brendan Jackman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Reviewed-by: Josef Bacik
    Cc: Dietmar Eggemann
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20171005114516.18617-4-brendan.jackman@arm.com
    Signed-off-by: Ingo Molnar

    Brendan Jackman
     
  • Since commit:

    83a0a96a5f26 ("sched/fair: Leverage the idle state info when choosing the "idlest" cpu")

    find_idlest_group_cpu() (formerly find_idlest_cpu) no longer returns -1,
    so we can simplify the checking of the return value in find_idlest_cpu().

    Signed-off-by: Brendan Jackman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Josef Bacik
    Reviewed-by: Vincent Guittot
    Cc: Dietmar Eggemann
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20171005114516.18617-3-brendan.jackman@arm.com
    Signed-off-by: Ingo Molnar

    Brendan Jackman