28 Jul, 2016

1 commit

  • Pull xen updates from David Vrabel:
    "Features and fixes for 4.8-rc0:

    - ACPI support for guests on ARM platforms.
    - Generic steal time support for arm and x86.
    - Support cases where kernel cpu is not Xen VCPU number (e.g., if
    in-guest kexec is used).
    - Use the system workqueue instead of a custom workqueue in various
    places"

    * tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (47 commits)
    xen: add static initialization of steal_clock op to xen_time_ops
    xen/pvhvm: run xen_vcpu_setup() for the boot CPU
    xen/evtchn: use xen_vcpu_id mapping
    xen/events: fifo: use xen_vcpu_id mapping
    xen/events: use xen_vcpu_id mapping in events_base
    x86/xen: use xen_vcpu_id mapping when pointing vcpu_info to shared_info
    x86/xen: use xen_vcpu_id mapping for HYPERVISOR_vcpu_op
    xen: introduce xen_vcpu_id mapping
    x86/acpi: store ACPI ids from MADT for future usage
    x86/xen: update cpuid.h from Xen-4.7
    xen/evtchn: add IOCTL_EVTCHN_RESTRICT
    xen-blkback: really don't leak mode property
    xen-blkback: constify instance of "struct attribute_group"
    xen-blkfront: prefer xenbus_scanf() over xenbus_gather()
    xen-blkback: prefer xenbus_scanf() over xenbus_gather()
    xen: support runqueue steal time on xen
    arm/xen: add support for vm_assist hypercall
    xen: update xen headers
    xen-pciback: drop superfluous variables
    xen-pciback: short-circuit read path used for merging write values
    ...

    Linus Torvalds
     

27 Jul, 2016

1 commit

  • Pull power management updates from Rafael Wysocki:
    "Again, the majority of changes go into the cpufreq subsystem, but
    there are no big features this time. The cpufreq changes that stand
    out somewhat are the governor interface rework and improvements
    related to the handling of frequency tables. Apart from those, there
    are fixes and new device/CPU IDs in drivers, cleanups and an
    improvement of the new schedutil governor.

    Next, there are some changes in the hibernation core, including a fix
    for a nasty problem related to the MONITOR/MWAIT usage by CPU offline
    during resume from hibernation, a few core improvements related to
    memory management during resume, a couple of additional debug features
    and cleanups.

    Finally, we have some fixes and cleanups in the devfreq subsystem,
    generic power domains framework improvements related to system
    suspend/resume, support for some new chips in intel_idle and in the
    power capping RAPL driver, a new version of the AnalyzeSuspend utility
    and some assorted fixes and cleanups.

    Specifics:

    - Rework the cpufreq governor interface to make it more
    straightforward and modify the conservative governor to avoid using
    transition notifications (Rafael Wysocki).

    - Rework the handling of frequency tables by the cpufreq core to make
    it more efficient (Viresh Kumar).

    - Modify the schedutil governor to reduce the number of wakeups it
    causes to occur in cases when the CPU frequency doesn't need to be
    changed (Steve Muckle, Viresh Kumar).

    - Fix some minor issues and clean up code in the cpufreq core and
    governors (Rafael Wysocki, Viresh Kumar).

    - Add Intel Broxton support to the intel_pstate driver (Srinivas
    Pandruvada).

    - Fix problems related to the config TDP feature and to the validity
    of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka,
    Srinivas Pandruvada).

    - Make intel_pstate update the cpu_frequency tracepoint even if the
    frequency doesn't change to avoid confusing powertop (Rafael
    Wysocki).

    - Clean up the usage of __init/__initdata in intel_pstate, mark some
    of its internal variables as __read_mostly and drop an unused
    structure element from it (Jisheng Zhang, Carsten Emde).

    - Clean up the usage of some duplicate MSR symbols in intel_pstate
    and turbostat (Srinivas Pandruvada).

    - Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay
    Adiga, Viresh Kumar, Ben Dooks).

    - Fix a regression (introduced during the 4.5 cycle) in the
    pcc-cpufreq driver by reverting the problematic commit (Andreas
    Herrmann).

    - Add support for Intel Denverton to intel_idle, clean up Broxton
    support in it and make it explicitly non-modular (Jacob Pan, Jan
    Beulich, Paul Gortmaker).

    - Add support for Denverton and Ivy Bridge server to the Intel RAPL
    power capping driver and make it more careful about the handling of
    MSRs that may not be present (Jacob Pan, Xiaolong Wang).

    - Fix resume from hibernation on x86-64 by making the CPU offline
    during resume avoid using MONITOR/MWAIT in the "play dead" loop
    which may lead to an inadvertent "revival" of a "dead" CPU and a
    page fault leading to a kernel crash from it (Rafael Wysocki).

    - Make memory management during resume from hibernation more
    straightforward (Rafael Wysocki).

    - Add debug features that should help to detect problems related to
    hibernation and resume from it (Rafael Wysocki, Chen Yu).

    - Clean up hibernation core somewhat (Rafael Wysocki).

    - Prevent KASAN from instrumenting the hibernation core which leads
    to large numbers of false-positives from it (James Morse).

    - Prevent PM (hibernate and suspend) notifiers from being called
    during the cleanup phase if they have not been called during the
    corresponding preparation phase which is possible if one of the
    other notifiers returns an error at that time (Lianwei Wang).

    - Improve suspend-related debug printout in the tasks freezer and
    clean up suspend-related console handling (Roger Lu, Borislav
    Petkov).

    - Update the AnalyzeSuspend script in the kernel sources to version
    4.2 (Todd Brandt).

    - Modify the generic power domains framework to make it handle system
    suspend/resume better (Ulf Hansson).

    - Make the runtime PM framework avoid resuming devices synchronously
    when user space changes the runtime PM settings for them and
    improve its error reporting (Rafael Wysocki, Linus Walleij).

    - Fix error paths in devfreq drivers (exynos, exynos-ppmu,
    exynos-bus) and in the core, make some devfreq code explicitly
    non-modular and change some of it into tristate (Bartlomiej
    Zolnierkiewicz, Peter Chen, Paul Gortmaker).

    - Add DT support to the generic PM clocks management code and make it
    export some more symbols (Jon Hunter, Paul Gortmaker).

    - Make the PCI PM core code slightly more robust against possible
    driver errors (Andy Shevchenko).

    - Make it possible to change DESTDIR and PREFIX in turbostat (Andy
    Shevchenko)"

    * tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (89 commits)
    Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
    PM / hibernate: Introduce test_resume mode for hibernation
    cpufreq: export cpufreq_driver_resolve_freq()
    cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()
    PCI / PM: check all fields in pci_set_platform_pm()
    cpufreq: acpi-cpufreq: use cached frequency mapping when possible
    cpufreq: schedutil: map raw required frequency to driver frequency
    cpufreq: add cpufreq_driver_resolve_freq()
    cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT
    intel_pstate: Update cpu_frequency tracepoint every time
    cpufreq: intel_pstate: clean remnant struct element
    PM / tools: scripts: AnalyzeSuspend v4.2
    x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
    cpufreq: powernv: Replacing pstate_id with frequency table index
    intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
    PM / hibernate: Image data protection during restoration
    PM / hibernate: Add missing braces in __register_nosave_region()
    PM / hibernate: Clean up comments in snapshot.c
    PM / hibernate: Clean up function headers in snapshot.c
    PM / hibernate: Add missing braces in hibernate_setup()
    ...

    Linus Torvalds
     

26 Jul, 2016

3 commits

  • Pull NOHZ updates from Ingo Molnar:

    - fix system/idle cputime being leaked in cputime accounting (all nohz
    configs) (Rik van Riel)

    - remove the messy, ad-hoc irqtime accounting on nohz-full and make it
    compatible with CONFIG_IRQ_TIME_ACCOUNTING=y instead (Rik van Riel)

    - cleanups (Frederic Weisbecker)

    - remove unnecessary irq disablement in the irqtime code (Rik van Riel)

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
    sched/cputime: Reorganize vtime native irqtime accounting headers
    sched/cputime: Clean up the old vtime gen irqtime accounting completely
    sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
    sched/cputime: Count actually elapsed irq & softirq time

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - introduce and use task_rcu_dereference()/try_get_task_struct() to fix
    and generalize task_struct handling (Oleg Nesterov)

    - do various per entity load tracking (PELT) fixes and optimizations
    (Peter Zijlstra)

    - cputime virt-steal time accounting enhancements/fixes (Wanpeng Li)

    - introduce consolidated cputime output file cpuacct.usage_all and
    related refactorings (Zhao Lei)

    - ... plus misc fixes and enhancements

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is set
    sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together
    sched/cpuacct: Use loop to consolidate code in cpuacct_stats_show()
    sched/cpuacct: Merge cpuacct_usage_index and cpuacct_stat_index enums
    sched/fair: Rework throttle_count sync
    sched/core: Fix sched_getaffinity() return value kerneldoc comment
    sched/fair: Reorder cgroup creation code
    sched/fair: Apply more PELT fixes
    sched/fair: Fix PELT integrity for new tasks
    sched/cgroup: Fix cpu_cgroup_fork() handling
    sched/fair: Fix PELT integrity for new groups
    sched/fair: Fix and optimize the fork() path
    sched/cputime: Add steal time support to full dynticks CPU time accounting
    sched/cputime: Fix prev steal time accouting during CPU hotplug
    KVM: Fix steal clock warp during guest CPU hotplug
    sched/debug: Always show 'nr_migrations'
    sched/fair: Use task_rcu_dereference()
    sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
    sched/idle: Optimize the generic idle loop
    sched/fair: Fix the wrong throttled clock time for cfs_rq_clock_task()

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The locking tree was busier in this cycle than the usual pattern - a
    couple of major projects happened to coincide.

    The main changes are:

    - implement the atomic_fetch_{add,sub,and,or,xor}() API natively
    across all SMP architectures (Peter Zijlstra)

    - add atomic_fetch_{inc/dec}() as well, using the generic primitives
    (Davidlohr Bueso)

    - optimize various aspects of rwsems (Jason Low, Davidlohr Bueso,
    Waiman Long)

    - optimize smp_cond_load_acquire() on arm64 and implement LSE based
    atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
    on arm64 (Will Deacon)

    - introduce smp_acquire__after_ctrl_dep() and fix various barrier
    mis-uses and bugs (Peter Zijlstra)

    - after discovering ancient spin_unlock_wait() barrier bugs in its
    implementation and usage, strengthen its semantics and update/fix
    usage sites (Peter Zijlstra)

    - optimize mutex_trylock() fastpath (Peter Zijlstra)

    - ... misc fixes and cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
    locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API
    locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire()
    locking/static_keys: Fix non static symbol Sparse warning
    locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec()
    locking/atomic, arch/tile: Fix tilepro build
    locking/atomic, arch/m68k: Remove comment
    locking/atomic, arch/arc: Fix build
    locking/Documentation: Clarify limited control-dependency scope
    locking/atomic, arch/rwsem: Employ atomic_long_fetch_add()
    locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire()
    locking/atomic, arch/mips: Convert to _relaxed atomics
    locking/atomic, arch/alpha: Convert to _relaxed atomics
    locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions
    locking/atomic: Remove linux/atomic.h:atomic_fetch_or()
    locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
    locking/atomic: Fix atomic64_relaxed() bits
    locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    ...

    Linus Torvalds
     

25 Jul, 2016

1 commit

  • * pm-cpufreq: (41 commits)
    Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
    cpufreq: export cpufreq_driver_resolve_freq()
    cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()
    cpufreq: acpi-cpufreq: use cached frequency mapping when possible
    cpufreq: schedutil: map raw required frequency to driver frequency
    cpufreq: add cpufreq_driver_resolve_freq()
    cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT
    intel_pstate: Update cpu_frequency tracepoint every time
    cpufreq: intel_pstate: clean remnant struct element
    cpufreq: powernv: Replacing pstate_id with frequency table index
    intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
    cpufreq: Reuse new freq-table helpers
    cpufreq: Handle sorted frequency tables more efficiently
    cpufreq: Drop redundant check from cpufreq_update_current_freq()
    intel_pstate: Declare pid_params/pstate_funcs/hwp_active __read_mostly
    intel_pstate: add __init/__initdata marker to some functions/variables
    intel_pstate: Fix incorrect placement of __initdata
    cpufreq: mvebu: fix integer to pointer cast
    cpufreq: intel_pstate: Broxton support
    cpufreq: conservative: Do not use transition notifications
    ...

    Rafael J. Wysocki
     

22 Jul, 2016

1 commit

  • The slow-path frequency transition path is relatively expensive as it
    requires waking up a thread to do work. Should support be added for
    remote CPU cpufreq updates, those would also be expensive since they
    require an IPI. These activities should be avoided if they are not
    necessary.

    To that end, calculate the actual driver-supported frequency required by
    the new utilization value in schedutil by using the recently added
    cpufreq_driver_resolve_freq API. If it is the same as the previously
    requested driver frequency then there is no need to continue with the
    update assuming the cpu frequency limits have not changed. This will
    have additional benefits should the semantics of the rate limit be
    changed to apply solely to frequency transitions rather than to
    frequency calculations in schedutil.

    The last raw required frequency is cached. This allows the driver
    frequency lookup to be skipped in the event that the new raw required
    frequency matches the last one, assuming a frequency update has not been
    forced due to limits changing (indicated by a next_freq value of
    UINT_MAX, see sugov_should_update_freq).
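
    As a rough sketch of that two-level check (structure and function names
    here are illustrative, not the actual schedutil code, apart from
    cpufreq_driver_resolve_freq() which the text names):

    struct gov_state {
            struct cpufreq_policy *policy;
            unsigned int cached_raw_freq;   /* last raw required frequency */
            unsigned int next_freq;         /* last frequency requested from the driver */
    };

    static unsigned int pick_next_freq(struct gov_state *g, unsigned int raw_freq)
    {
            /* Same raw frequency and no forced update: skip the driver lookup. */
            if (raw_freq == g->cached_raw_freq && g->next_freq != UINT_MAX)
                    return g->next_freq;

            g->cached_raw_freq = raw_freq;
            return cpufreq_driver_resolve_freq(g->policy, raw_freq);
    }

    The caller can then compare the returned value against the previously
    requested driver frequency and bail out early when they match and the
    frequency limits have not changed.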

    Signed-off-by: Steve Muckle
    Reviewed-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Steve Muckle
     

14 Jul, 2016

4 commits

  • Paolo pointed out that irqs are already blocked when irqtime_account_irq()
    is called. That means there is no reason to call local_irq_save/restore()
    again.

    Suggested-by: Paolo Bonzini
    Signed-off-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Paolo Bonzini
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1468421405-20056-6-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Vtime generic irqtime accounting has been removed but there are a few
    remnants to clean up:

    * The vtime_accounting_cpu_enabled() check in irq entry was only used
    by CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can safely remove it.

    * Without the vtime_accounting_cpu_enabled(), we no longer need to
    have a vtime_common_account_irq_enter() indirect function.

    * Move vtime_account_irq_enter() implementation under
    CONFIG_VIRT_CPU_ACCOUNTING_NATIVE which is the last user.

    * The vtime_account_user() call was only used on irq entry for
    CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can remove that too.

    Signed-off-by: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1468421405-20056-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
    appear to currently work right.

    On CPUs without nohz_full=, only tick based irq time sampling is
    done, which breaks down when dealing with a nohz_idle CPU.

    On firewalls and similar systems, no ticks may happen on a CPU for a
    while, and the irq time spent may never get accounted properly. This
    can cause issues with capacity planning and power saving, which use
    the CPU statistics as inputs in decision making.

    Remove the VTIME_GEN vtime irq time code, and replace it with the
    IRQ_TIME_ACCOUNTING code, when selected as a config option by the user.

    Signed-off-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1468421405-20056-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Currently, if there was any irq or softirq time during 'ticks'
    jiffies, the entire period will be accounted as irq or softirq
    time.

    This is inaccurate if only a subset of the time was actually spent
    handling irqs, and could conceivably mis-count all of the ticks during
    a period as irq time, when there was some irq and some softirq time.

    This can actually happen when irqtime_account_process_tick is called
    from account_idle_ticks, which can pass a larger number of ticks down
    all at once.

    Fix this by changing irqtime_account_hi_update(), irqtime_account_si_update(),
    and steal_account_process_ticks() to work with cputime_t time units, and
    return the amount of time spent in each mode.

    Rename steal_account_process_ticks() to steal_account_process_time(), to
    reflect that time is now accounted in cputime_t, instead of ticks.

    Additionally, have irqtime_account_process_tick() take into account how
    much time was spent in each of steal, irq, and softirq time.

    The latter could help improve the accuracy of cputime
    accounting when returning from idle on a NO_HZ_IDLE CPU.

    Properly accounting how much time was spent in hardirq and
    softirq time will also allow the NO_HZ_FULL code to re-use
    these same functions for hardirq and softirq accounting.
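
    A sketch of the resulting accounting flow (the helper names below are
    illustrative stand-ins for the functions named above; each one accounts
    the time actually spent in its mode since the last accounting, capped by
    the remaining interval, and returns that amount in cputime_t):

    static void account_interval(struct task_struct *p, cputime_t delta)
    {
            cputime_t other = 0;

            other += steal_time_delta(delta - other);     /* stolen by host   */
            other += hardirq_time_delta(delta - other);   /* hardirq          */
            other += softirq_time_delta(delta - other);   /* softirq          */

            if (other < delta)
                    account_task_time(p, delta - other);  /* user/system/idle */
    }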

    Signed-off-by: Rik van Riel
    [ Make nsecs_to_cputime64() actually return cputime64_t. ]
    Signed-off-by: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1468421405-20056-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

13 Jul, 2016

1 commit

  • The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into
    account that the function is now called from a thread running on the outgoing
    CPU. As a result, a CPU unplug leaks a load of 1 into the global load
    accounting mechanism.

    Fix it by adjusting for the currently running thread which calls
    calc_load_migrate().
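
    A minimal sketch of the adjustment (illustrative, not the literal diff):

    /* When the fold runs on the outgoing CPU itself, the calling thread is
     * still part of nr_running and must not be folded into the global load. */
    long nr_active = rq->nr_running - 1 /* the thread doing the unplug */
                   + (long)rq->nr_uninterruptible;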

    Reported-by: Anton Blanchard
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Vaidyanathan Srinivasan
    Cc: rt@linutronix.de
    Cc: shreyas@linux.vnet.ibm.com
    Fixes: e9cd8fa4fcfd: ("sched/migration: Move calc_load_migrate() into CPU_DYING")
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1607121744350.4083@nanos
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

11 Jul, 2016

1 commit

  • Currently, a schedule-while-atomic error prints the stack trace to the
    kernel log and the system continues running.

    Although it is possible to collect the kernel log messages and analyze
    them, often more information is needed. Furthermore, keeping the system
    running is not always the best choice. For example, when the preempt
    count underflows the system will not stop complaining about scheduling
    while atomic, so the kernel log can wrap around, overwriting the first
    stack trace and making the analysis even more challenging.

    This patch uses the kernel.panic_on_warn sysctl to help out on these
    more complex situations.

    When kernel.panic_on_warn is set to 1, the kernel will panic() in the
    schedule while atomic detection.

    The default value of the sysctl is 0, maintaining the current behavior.
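
    A sketch of the check added to the report path (the surrounding function
    body is elided; panic_on_warn is the existing sysctl-backed variable):

    static noinline void __schedule_bug(struct task_struct *prev)
    {
            /* ... existing printk()/stack-dump reporting ... */

            if (panic_on_warn)
                    panic("scheduling while atomic\n");
    }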

    Signed-off-by: Daniel Bristot de Oliveira
    Reviewed-by: Luis Claudio R. Goncalves
    Cc: Christian Borntraeger
    Cc: Linus Torvalds
    Cc: Luis Claudio R. Goncalves
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/e8f7b80f353aa22c63bd8557208163989af8493d.1464983675.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Daniel Bristot de Oliveira
     

09 Jul, 2016

3 commits

  • In the current code, we can get cpuacct data from several files,
    but each file has its own limitations.

    For example:

    - We can get CPU usage in user and kernel mode via cpuacct.stat,
    but we can't get detailed data about each CPU.

    - We can get each CPU's kernel mode usage in cpuacct.usage_percpu_sys,
    but we can't get user mode usage data at the same time.

    This patch introduces cpuacct.usage_all, to show all detailed CPU
    accounting data together:

    # cat cpuacct.usage_all
    cpu user system
    0 3809760299 5807968992
    1 3250329855 454612211
    ..

    Signed-off-by: Zhao Lei
    Cc: KOSAKI Motohiro
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/7744460969edd7caaf0e903592ee52353ed9bdd6.1466415271.git.zhaolei@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Zhao Lei
     
  • In cpuacct_stats_show() we currently have copies of similar code
    for each cpustat (system/user) variant.

    Use a loop instead to consolidate the code. This will also work better
    if we extend the CPUACCT_STAT_NSTATS type.
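
    The rough shape of the consolidated loop (the array, descriptor and
    seq_file names are assumptions for illustration):

    for (stat = 0; stat < CPUACCT_STAT_NSTATS; stat++)
            seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[stat],
                       (long long)cputime64_to_clock_t(val[stat]));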

    Signed-off-by: Zhao Lei
    Cc: KOSAKI Motohiro
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/b0597d4224655e9f333f1a6224ed9654c7d7d36a.1466415271.git.zhaolei@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Zhao Lei
     
  • These two types have similar function, no need to separate them.

    Signed-off-by: Zhao Lei
    Cc: KOSAKI Motohiro
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/436748885270d64363c7dc67167507d486c2057a.1466415271.git.zhaolei@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Zhao Lei
     

06 Jul, 2016

1 commit

  • The pv_time_ops structure contains a function pointer for the
    "steal_clock" functionality used only by KVM and Xen on ARM. Xen on x86
    uses its own mechanism to account for the "stolen" time a thread wasn't
    able to run due to hypervisor scheduling.

    Add support for this feature in the Xen arch-independent time handling
    by moving it out of the ARM arch into drivers/xen, and remove the x86
    Xen hack.
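
    As a rough illustration of the shape such a hook takes (the runstate
    helper and the setup function name are assumptions, not necessarily what
    this patch uses):

    static u64 xen_steal_clock(int cpu)
    {
            struct vcpu_runstate_info state;

            xen_read_vcpu_runstate(cpu, &state);    /* assumed runstate helper */
            return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
    }

    static void __init xen_setup_steal_clock(void)
    {
            pv_time_ops.steal_clock = xen_steal_clock;
    }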

    Signed-off-by: Juergen Gross
    Reviewed-by: Boris Ostrovsky
    Reviewed-by: Stefano Stabellini
    Signed-off-by: David Vrabel

    Juergen Gross
     

04 Jul, 2016

1 commit


27 Jun, 2016

11 commits

  • Since we already take rq->lock when creating a cgroup, use it to also
    sync the throttle_count and avoid the extra state and enqueue path
    branch.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Cc: linux-kernel@vger.kernel.org
    [ Fixed build warning. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The previous version was probably written referencing the man page for
    glibc's wrapper, but the wrapper's behavior differs from that of the
    syscall itself in this case.

    Signed-off-by: Zev Weiss
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/1466975603-25408-1-git-send-email-zev@bewilderbeest.net
    Signed-off-by: Ingo Molnar

    Zev Weiss
     
  • A future patch needs rq->lock held _after_ we link the task_group into
    the hierarchy. In order to avoid taking every rq->lock twice, reorder
    things a little and create online_fair_sched_group() to be called
    after we link the task_group.

    All this code is still run from css_alloc(), so css_online() isn't in
    fact used for this.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • One additional 'rule' for using update_cfs_rq_load_avg() is that one
    should call update_tg_load_avg() if it returns true.

    Add a bunch of comments to hopefully clarify some of the rules:

    o You need to update the cfs_rq _before_ any entity attach/detach.
    This is important because, while it isn't strictly needed for
    mathematical consistency, it is required for the physical
    interpretation of the model: you attach/detach _now_.

    o When you modify the cfs_rq avg, you have to then call
    update_tg_load_avg() in order to propagate changes upwards.

    o (Fair) entities are always attached, switched_{to,from}_fair()
    deal with !fair. This directly follows from the definition of the
    cfs_rq averages, namely that they are a direct sum of all
    (runnable or blocked) entities on that rq.

    It is the second rule that this patch enforces, but it adds comments
    pertaining to all of them.
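
    The usage pattern implied by the second rule looks roughly like this
    (argument lists abbreviated):

    if (update_cfs_rq_load_avg(now, cfs_rq))
            update_tg_load_avg(cfs_rq, 0);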

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Vincent and Yuyang found another few scenarios in which entity
    tracking goes wobbly.

    The scenarios are basically due to the fact that new tasks are not
    immediately attached and thereby differ from the normal situation -- a
    task is always attached to a cfs_rq load average (such that it
    includes its blocked contribution) and are explicitly
    detached/attached on migration to another cfs_rq.

    Scenario 1: switch to fair class

    p->sched_class = fair_class;

    if (queued)
      enqueue_task(p);
        ...
          enqueue_entity()
            enqueue_entity_load_avg()
              migrated = !sa->last_update_time (true)
              if (migrated)
                attach_entity_load_avg()

    check_class_changed()
      switched_from() (!fair)
      switched_to()   (fair)
        switched_to_fair()
          attach_entity_load_avg()

    If @p is a new task that hasn't been fair before, it will have
    !last_update_time and, per the above, end up in
    attach_entity_load_avg() _twice_.

    Scenario 2: change between cgroups

    sched_move_group(p)
      if (queued)
        dequeue_task()
      task_move_group_fair()
        detach_task_cfs_rq()
          detach_entity_load_avg()
        set_task_rq()
        attach_task_cfs_rq()
          attach_entity_load_avg()
      if (queued)
        enqueue_task();
          ...
            enqueue_entity()
              enqueue_entity_load_avg()
                migrated = !sa->last_update_time (true)
                if (migrated)
                  attach_entity_load_avg()

    Similar as with scenario 1, if @p is a new task, it will have
    !load_update_time and we'll end up in attach_entity_load_avg()
    _twice_.

    Furthermore, notice how we do a detach_entity_load_avg() on something
    that wasn't attached to begin with.

    As stated above; the problem is that the new task isn't yet attached
    to the load tracking and thereby violates the invariant assumption.

    This patch remedies this by ensuring a new task is indeed properly
    attached to the load tracking on creation, through
    post_init_entity_util_avg().

    Of course, this isn't entirely as straightforward as one might think,
    since the task is hashed before we call wake_up_new_task() and thus
    can be poked at. We avoid this by adding TASK_NEW and teaching
    cpu_cgroup_can_attach() to refuse such tasks.

    Reported-by: Yuyang Du
    Reported-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • A new fair task is detached and attached from/to task_group with:

    cgroup_post_fork()
      ss->fork(child) := cpu_cgroup_fork()
        sched_move_task()
          task_move_group_fair()

    Which is wrong, because at this point in fork() the task isn't fully
    initialized and it cannot 'move' to another group, because it's not
    attached to any group yet.

    In fact, cpu_cgroup_fork() needs only a small part of sched_move_task(),
    so we can just call that small part directly instead of sched_move_task().
    And the task doesn't really migrate, because it is not yet attached, so
    we need the following sequence:

    do_fork()
      sched_fork()
        __set_task_cpu()

      cgroup_post_fork()
        set_task_rq() # set task group and runqueue

      wake_up_new_task()
        select_task_rq() can select a new cpu
        __set_task_cpu
        post_init_entity_util_avg
          attach_task_cfs_rq()
        activate_task
          enqueue_task

    This patch makes that happen.

    Signed-off-by: Vincent Guittot
    [ Added TASK_SET_GROUP to set depth properly. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • Vincent reported that when a new task is moved into a new cgroup it
    gets attached twice to the load tracking:

    sched_move_task()
      task_move_group_fair()
        detach_task_cfs_rq()
        set_task_rq()
        attach_task_cfs_rq()
          attach_entity_load_avg()
            se->avg.last_load_update = cfs_rq->avg.last_load_update // == 0

    enqueue_entity()
      enqueue_entity_load_avg()
        update_cfs_rq_load_avg()
          now = clock()
          __update_load_avg(&cfs_rq->avg)
            cfs_rq->avg.last_load_update = now
            // ages load/util for: now - 0, load/util -> 0
        if (migrated)
          attach_entity_load_avg()
            se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0

    The problem is that we don't update cfs_rq load_avg before all
    entity attach/detach operations. Only enqueue_task() and migrate_task()
    do this.

    By fixing this, the above will not happen, because the
    sched_move_task() attach will have updated cfs_rq's last_load_update
    time before attach, and in turn the attach will have set the entity's
    last_load_update stamp.

    Note that there is a further problem with sched_move_task() calling
    detach on a task that hasn't yet been attached; this will be taken
    care of in a subsequent patch.

    Reported-by: Vincent Guittot
    Tested-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The task_fork_fair() callback already calls __set_task_cpu() and takes
    rq->lock.

    If we move the sched_class::task_fork callback in sched_fork() under
    the existing p->pi_lock, right after its set_task_cpu() call, we can
    avoid doing two such calls and omit the IRQ disabling on the rq->lock.

    Change it to use __set_task_cpu() to skip the migration bits; this is a
    new task, not a migration. Similarly, make wake_up_new_task() use
    __set_task_cpu() for the same reason: the task hasn't actually migrated
    as it hasn't ever run.

    This cures the problem of calling migrate_task_rq_fair(), which does
    remove_entity_from_load_avg() on tasks that have never been added to
    the load avg to begin with.

    This bug would result in transiently messed up load_avg values, averaged
    out after a few dozen milliseconds. This is probably the reason why
    this bug was not found for such a long time.

    Reported-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Commit:

    fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")

    did something non-obvious, and did so in a way that was buggy yet latent.

    The problem was exposed for real by a later commit in the v4.7 merge window:

    2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")

    ... after which tg->load_avg and cfs_rq->load.weight had different
    units (10 bit fixed point and 20 bit fixed point resp.).

    Add a comment to explain the use of cfs_rq->load.weight over the
    'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct
    for the difference in unit.
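
    For reference, the scaling behind that unit difference is roughly:

    #define SCHED_FIXEDPOINT_SHIFT 10
    #define scale_load(w)          ((w) << SCHED_FIXEDPOINT_SHIFT)
    #define scale_load_down(w)     ((w) >> SCHED_FIXEDPOINT_SHIFT)

    /* nice-0 weight with the increased 64-bit load resolution: */
    unsigned long w20 = scale_load(1024);           /* 1048576, 20-bit fixed point */
    unsigned long w10 = scale_load_down(w20);       /* 1024,    10-bit fixed point */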

    Since this is (now, as per a previous commit) the only user of
    calc_tg_weight(), collapse it.

    The effects of this bug should be randomly inconsistent SMP-balancing
    of cgroups workloads.

    Reported-by: Jirka Hladky
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")
    Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Starting with the following commit:

    fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")

    calc_tg_weight() doesn't compute the right value as expected by effective_load().

    The difference is in the 'correction' term. In order to ensure \Sum
    rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
    lagging a correction on the current cfs_rq->avg.load_avg value.
    Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
    cfs_rq->avg.load_avg.
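
    Spelled out as code, the correction term above amounts to something like
    (a sketch using the fields named in the text):

    long tg_weight;

    tg_weight  = atomic_long_read(&tg->load_avg);
    tg_weight -= cfs_rq->tg_load_avg_contrib;
    tg_weight += cfs_rq->avg.load_avg;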

    Now, per the referenced commit, calc_tg_weight() doesn't use
    cfs_rq->avg.load_avg, as is later used in @w, but uses
    cfs_rq->load.weight instead.

    So stop using calc_tg_weight() and do it explicitly.

    The effects of this bug are wake_affine() making randomly
    poor choices in cgroup-intense workloads.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: # v4.3+
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

24 Jun, 2016

3 commits

  • During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is
    online but not active. A CPU_ONLINE callback may create or bind a
    kthread so that its cpus_allowed mask only allows the CPU which is
    being brought online. The kthread may start executing before the CPU
    is made active and can end up in select_fallback_rq().

    In such cases, the expected behavior is selecting the CPU which is
    coming online; however, because select_fallback_rq() only chooses from
    active CPUs, it determines that the task doesn't have any viable CPU
    in its allowed mask and ends up overriding it to cpu_possible_mask.

    CPU_ONLINE callbacks should be able to put kthreads on the CPU which
    is coming online. Update select_fallback_rq() so that it follows
    cpu_online() rather than cpu_active() for kthreads.
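
    A sketch of the resulting selection policy (not the literal upstream diff):

    for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {
            if (!cpu_online(dest_cpu))
                    continue;
            /* Kernel threads may use online-but-not-yet-active CPUs. */
            if (!cpu_active(dest_cpu) && !(p->flags & PF_KTHREAD))
                    continue;
            return dest_cpu;
    }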

    Reported-by: Gautham R Shenoy
    Tested-by: Gautham R. Shenoy
    Signed-off-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Abdul Haleem
    Cc: Aneesh Kumar
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
     
  • The hierarchy could already be throttled at this point. A throttled next
    buddy could trigger a NULL pointer dereference in pick_next_task_fair().

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Ben Segall
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz
    Signed-off-by: Ingo Molnar

    Konstantin Khlebnikov
     
  • A cgroup created inside a throttled group must inherit the current
    throttle_count. A broken throttle_count allows throttled entries to be
    nominated as the next buddy, which later leads to a NULL pointer
    dereference in pick_next_task_fair().

    This patch initializes cfs_rq->throttle_count at the first enqueue:
    laziness allows us to skip locking all runqueues at group creation. The
    lazy approach also allows skipping a full sub-tree scan when throttling
    the hierarchy (not in this patch).

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
    Signed-off-by: Ingo Molnar

    Konstantin Khlebnikov
     

20 Jun, 2016

1 commit

  • As per commit:

    b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

    > the code generated from update_cfs_rq_load_avg():
    >
    > if (atomic_long_read(&cfs_rq->removed_load_avg)) {
    >         s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
    >         sa->load_avg = max_t(long, sa->load_avg - r, 0);
    >         sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
    >         removed_load = 1;
    > }
    >
    > turns into:
    >
    > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
    > ffffffff8108706b: 48 85 c0 test %rax,%rax
    > ffffffff8108706e: 74 40 je ffffffff810870b0
    > ffffffff81087070: 4c 89 f8 mov %r15,%rax
    > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
    > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
    > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
    > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
    > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
    > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx
    >
    > Which you'll note ends up with sa->load_avg -= r in memory at
    > ffffffff8108707a.

    So I _should_ have looked at other unserialized users of ->load_avg,
    but alas. Luckily nikbor reported a similar /0 from task_h_load() which
    instantly triggered recollection of this here problem.

    Aside from the intermediate value hitting memory and causing problems,
    there's another problem: the underflow detection relies on the signed
    bit. This reduces the effective width of the variables; IOW it's
    effectively the same as having these variables be of signed type.

    This patch changes to a different means of unsigned underflow
    detection to not rely on the signed bit. This allows the variables to
    use the 'full' unsigned range. And it does so with explicit LOAD -
    STORE to ensure any intermediate value will never be visible in
    memory, allowing these unserialized loads.

    Note: GCC generates crap code for this, might warrant a look later.

    Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
    maybe we should do clamping on add too.
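
    A sketch of such unsigned, single-store underflow clamping (the helper
    name is illustrative):

    static inline void sub_clamped(unsigned long *ptr, unsigned long val)
    {
            unsigned long old = READ_ONCE(*ptr);
            unsigned long res = old - val;

            if (res > old)          /* unsigned wrap-around => underflow */
                    res = 0;
            WRITE_ONCE(*ptr, res);  /* single store, no intermediate value in memory */
    }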

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Cc: bsegall@google.com
    Cc: kernel@kyup.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
    Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Jun, 2016

7 commits

  • Lengthy output of sysrq-w may take a lot of time on a slow serial console.

    Currently we reset the NMI watchdog on the current CPU to avoid spurious
    lockup messages. Sometimes this doesn't work, since the softlockup
    watchdog might trigger on another CPU which is waiting for an IPI to
    proceed. We reset softlockup watchdogs on all CPUs, but we do this only
    after listing all tasks, and this may be too late on a busy system.

    So, reset the watchdogs on all CPUs earlier, in the
    for_each_process_thread() loop.

    Signed-off-by: Andrey Ryabinin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc:
    Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     
  • I see a hang when enabling sched events:

    echo 1 > /sys/kernel/debug/tracing/events/sched/enable

    The printk buffer shows:

    BUG: spinlock recursion on CPU#1, swapper/1/0
    lock: 0xffff88007d5d8c00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc2+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
    ...
    Call Trace:
    [] dump_stack+0x85/0xc2
    [] spin_dump+0x78/0xc0
    [] do_raw_spin_lock+0x11a/0x150
    [] _raw_spin_lock+0x61/0x80
    [] ? try_to_wake_up+0x256/0x4e0
    [] try_to_wake_up+0x256/0x4e0
    [] ? _raw_spin_unlock_irqrestore+0x4a/0x80
    [] wake_up_process+0x15/0x20
    [] insert_work+0x84/0xc0
    [] __queue_work+0x18f/0x660
    [] queue_work_on+0x46/0x90
    [] drm_fb_helper_dirty.isra.11+0xcb/0xe0 [drm_kms_helper]
    [] drm_fb_helper_sys_imageblit+0x30/0x40 [drm_kms_helper]
    [] soft_cursor+0x1ad/0x230
    [] bit_cursor+0x649/0x680
    [] ? update_attr.isra.2+0x90/0x90
    [] fbcon_cursor+0x14a/0x1c0
    [] hide_cursor+0x28/0x90
    [] vt_console_print+0x3bf/0x3f0
    [] call_console_drivers.constprop.24+0x183/0x200
    [] console_unlock+0x3d4/0x610
    [] vprintk_emit+0x3c5/0x610
    [] vprintk_default+0x29/0x40
    [] printk+0x57/0x73
    [] enqueue_entity+0xc2e/0xc70
    [] enqueue_task_fair+0x59/0xab0
    [] ? kvm_sched_clock_read+0x9/0x20
    [] ? sched_clock+0x9/0x10
    [] activate_task+0x5c/0xa0
    [] ttwu_do_activate+0x54/0xb0
    [] sched_ttwu_pending+0x7a/0xb0
    [] scheduler_ipi+0x61/0x170
    [] smp_trace_reschedule_interrupt+0x4f/0x2a0
    [] trace_reschedule_interrupt+0x96/0xa0
    [] ? native_safe_halt+0x6/0x10
    [] ? trace_hardirqs_on+0xd/0x10
    [] default_idle+0x20/0x1a0
    [] arch_cpu_idle+0xf/0x20
    [] default_idle_call+0x2f/0x50
    [] cpu_startup_entry+0x37e/0x450
    [] start_secondary+0x160/0x1a0

    Note the hang only occurs when echoing the above from a physical serial
    console, not from an ssh session.

    The bug is caused by a deadlock where the task is trying to grab the rq
    lock twice because printk()'s aren't safe in sched code.

    Signed-off-by: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: cb2517653fcc ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
    Link: http://lkml.kernel.org/r/20160613073209.gdvdybiruljbkn3p@treble
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     
  • This new form allows using hardware assisted waiting.

    Some hardware (ARM64 and x86) allow monitoring an address for changes,
    so by providing a pointer we can use this to replace the cpu_relax()
    with hardware optimized methods in the future.
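
    For instance, a waiter can spin on a variable with acquire semantics like
    this (the smp_cond_load_acquire() form from the locking updates above;
    VAL names the freshly loaded value inside the condition, and the field
    shown is just an example):

    /* Wait until *ptr reads as zero, then proceed with ACQUIRE ordering;
     * the spin may be backed by a hardware wait-for-event on the address. */
    smp_cond_load_acquire(&p->on_cpu, !VAL);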

    Requested-by: Will Deacon
    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This patch adds guest steal-time support to full dynticks CPU
    time accounting. After the following commit:

    ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")

    ... time sampling became jiffy based, even if we do the sampling from the
    context tracking code, so steal_account_process_tick() can be reused
    to account how many 'ticks' are stolen-time, after the last accumulation.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Paolo Bonzini
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Radim Krčmář
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1465813966-3116-4-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Commit:

    e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")

    ... set rq->prev_* to 0 after a CPU hotplug comes back, in order to
    fix the case where (after CPU hotplug) steal time is smaller than
    rq->prev_steal_time.

    However, this should never happen. Steal time was only smaller because of the
    KVM-specific bug fixed by the previous patch. Worse, the previous patch
    triggers a bug on CPU hot-unplug/plug operation: because
    rq->prev_steal_time is cleared, all of the CPU's past steal time will be
    accounted again on hot-plug.

    Since the root cause has been fixed, we can just revert commit e9532e69b8d1.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Paolo Bonzini
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Radim Krčmář
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Fixes: 'commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")'
    Link: http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Chris Wilson reported a divide by 0 at:

    post_init_entity_util_avg():

    > 725         if (cfs_rq->avg.util_avg != 0) {
    > 726                 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
    > -> 727              sa->util_avg /= (cfs_rq->avg.load_avg + 1);
    > 728
    > 729                 if (sa->util_avg > cap)
    > 730                         sa->util_avg = cap;
    > 731         } else {

    Which given the lack of serialization, and the code generated from
    update_cfs_rq_load_avg() is entirely possible:

    if (atomic_long_read(&cfs_rq->removed_load_avg)) {
            s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
            sa->load_avg = max_t(long, sa->load_avg - r, 0);
            sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
            removed_load = 1;
    }

    turns into:

    ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
    ffffffff8108706b: 48 85 c0 test %rax,%rax
    ffffffff8108706e: 74 40 je ffffffff810870b0
    ffffffff81087070: 4c 89 f8 mov %r15,%rax
    ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
    ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
    ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
    ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
    ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
    ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx

    Which you'll note ends up with 'sa->load_avg - r' in memory at
    ffffffff8108707a.

    By calling post_init_entity_util_avg() under rq->lock we're sure to be
    fully serialized against PELT updates and cannot observe intermediate
    state like this.

    Reported-by: Chris Wilson
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrey Ryabinin
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Cc: bsegall@google.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value")
    Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra