28 Jul, 2016

1 commit

  • Pull xen updates from David Vrabel:
    "Features and fixes for 4.8-rc0:

    - ACPI support for guests on ARM platforms.
    - Generic steal time support for arm and x86.
    - Support cases where kernel cpu is not Xen VCPU number (e.g., if
    in-guest kexec is used).
    - Use the system workqueue instead of a custom workqueue in various
    places"

    * tag 'for-linus-4.8-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: (47 commits)
    xen: add static initialization of steal_clock op to xen_time_ops
    xen/pvhvm: run xen_vcpu_setup() for the boot CPU
    xen/evtchn: use xen_vcpu_id mapping
    xen/events: fifo: use xen_vcpu_id mapping
    xen/events: use xen_vcpu_id mapping in events_base
    x86/xen: use xen_vcpu_id mapping when pointing vcpu_info to shared_info
    x86/xen: use xen_vcpu_id mapping for HYPERVISOR_vcpu_op
    xen: introduce xen_vcpu_id mapping
    x86/acpi: store ACPI ids from MADT for future usage
    x86/xen: update cpuid.h from Xen-4.7
    xen/evtchn: add IOCTL_EVTCHN_RESTRICT
    xen-blkback: really don't leak mode property
    xen-blkback: constify instance of "struct attribute_group"
    xen-blkfront: prefer xenbus_scanf() over xenbus_gather()
    xen-blkback: prefer xenbus_scanf() over xenbus_gather()
    xen: support runqueue steal time on xen
    arm/xen: add support for vm_assist hypercall
    xen: update xen headers
    xen-pciback: drop superfluous variables
    xen-pciback: short-circuit read path used for merging write values
    ...

    Linus Torvalds
     

27 Jul, 2016

1 commit

  • Pull power management updates from Rafael Wysocki:
    "Again, the majority of changes go into the cpufreq subsystem, but
    there are no big features this time. The cpufreq changes that stand
    out somewhat are the governor interface rework and improvements
    related to the handling of frequency tables. Apart from those, there
    are fixes and new device/CPU IDs in drivers, cleanups and an
    improvement of the new schedutil governor.

    Next, there are some changes in the hibernation core, including a fix
    for a nasty problem related to the MONITOR/MWAIT usage by CPU offline
    during resume from hibernation, a few core improvements related to
    memory management during resume, a couple of additional debug features
    and cleanups.

    Finally, we have some fixes and cleanups in the devfreq subsystem,
    generic power domains framework improvements related to system
    suspend/resume, support for some new chips in intel_idle and in the
    power capping RAPL driver, a new version of the AnalyzeSuspend utility
    and some assorted fixes and cleanups.

    Specifics:

    - Rework the cpufreq governor interface to make it more
    straightforward and modify the conservative governor to avoid using
    transition notifications (Rafael Wysocki).

    - Rework the handling of frequency tables by the cpufreq core to make
    it more efficient (Viresh Kumar).

    - Modify the schedutil governor to reduce the number of wakeups it
    causes to occur in cases when the CPU frequency doesn't need to be
    changed (Steve Muckle, Viresh Kumar).

    - Fix some minor issues and clean up code in the cpufreq core and
    governors (Rafael Wysocki, Viresh Kumar).

    - Add Intel Broxton support to the intel_pstate driver (Srinivas
    Pandruvada).

    - Fix problems related to the config TDP feature and to the validity
    of the MSR_HWP_INTERRUPT register in intel_pstate (Jan Kiszka,
    Srinivas Pandruvada).

    - Make intel_pstate update the cpu_frequency tracepoint even if the
    frequency doesn't change to avoid confusing powertop (Rafael
    Wysocki).

    - Clean up the usage of __init/__initdata in intel_pstate, mark some
    of its internal variables as __read_mostly and drop an unused
    structure element from it (Jisheng Zhang, Carsten Emde).

    - Clean up the usage of some duplicate MSR symbols in intel_pstate
    and turbostat (Srinivas Pandruvada).

    - Update/fix the powernv, s3c24xx and mvebu cpufreq drivers (Akshay
    Adiga, Viresh Kumar, Ben Dooks).

    - Fix a regression (introduced during the 4.5 cycle) in the
    pcc-cpufreq driver by reverting the problematic commit (Andreas
    Herrmann).

    - Add support for Intel Denverton to intel_idle, clean up Broxton
    support in it and make it explicitly non-modular (Jacob Pan, Jan
    Beulich, Paul Gortmaker).

    - Add support for Denverton and Ivy Bridge server to the Intel RAPL
    power capping driver and make it more careful about the handling of
    MSRs that may not be present (Jacob Pan, Xiaolong Wang).

    - Fix resume from hibernation on x86-64 by making the CPU offline
    during resume avoid using MONITOR/MWAIT in the "play dead" loop
    which may lead to an inadvertent "revival" of a "dead" CPU and a
    page fault leading to a kernel crash from it (Rafael Wysocki).

    - Make memory management during resume from hibernation more
    straightforward (Rafael Wysocki).

    - Add debug features that should help to detect problems related to
    hibernation and resume from it (Rafael Wysocki, Chen Yu).

    - Clean up hibernation core somewhat (Rafael Wysocki).

    - Prevent KASAN from instrumenting the hibernation core which leads
    to large numbers of false-positives from it (James Morse).

    - Prevent PM (hibernate and suspend) notifiers from being called
    during the cleanup phase if they have not been called during the
    corresponding preparation phase which is possible if one of the
    other notifiers returns an error at that time (Lianwei Wang).

    - Improve suspend-related debug printout in the tasks freezer and
    clean up suspend-related console handling (Roger Lu, Borislav
    Petkov).

    - Update the AnalyzeSuspend script in the kernel sources to version
    4.2 (Todd Brandt).

    - Modify the generic power domains framework to make it handle system
    suspend/resume better (Ulf Hansson).

    - Make the runtime PM framework avoid resuming devices synchronously
    when user space changes the runtime PM settings for them and
    improve its error reporting (Rafael Wysocki, Linus Walleij).

    - Fix error paths in devfreq drivers (exynos, exynos-ppmu,
    exynos-bus) and in the core, make some devfreq code explicitly
    non-modular and change some of it into tristate (Bartlomiej
    Zolnierkiewicz, Peter Chen, Paul Gortmaker).

    - Add DT support to the generic PM clocks management code and make it
    export some more symbols (Jon Hunter, Paul Gortmaker).

    - Make the PCI PM core code slightly more robust against possible
    driver errors (Andy Shevchenko).

    - Make it possible to change DESTDIR and PREFIX in turbostat (Andy
    Shevchenko)"

    * tag 'pm-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (89 commits)
    Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
    PM / hibernate: Introduce test_resume mode for hibernation
    cpufreq: export cpufreq_driver_resolve_freq()
    cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()
    PCI / PM: check all fields in pci_set_platform_pm()
    cpufreq: acpi-cpufreq: use cached frequency mapping when possible
    cpufreq: schedutil: map raw required frequency to driver frequency
    cpufreq: add cpufreq_driver_resolve_freq()
    cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT
    intel_pstate: Update cpu_frequency tracepoint every time
    cpufreq: intel_pstate: clean remnant struct element
    PM / tools: scripts: AnalyzeSuspend v4.2
    x86 / hibernate: Use hlt_play_dead() when resuming from hibernation
    cpufreq: powernv: Replacing pstate_id with frequency table index
    intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
    PM / hibernate: Image data protection during restoration
    PM / hibernate: Add missing braces in __register_nosave_region()
    PM / hibernate: Clean up comments in snapshot.c
    PM / hibernate: Clean up function headers in snapshot.c
    PM / hibernate: Add missing braces in hibernate_setup()
    ...

    Linus Torvalds
     

26 Jul, 2016

3 commits

  • Pull NOHZ updates from Ingo Molnar:

    - fix system/idle cputime being leaked in cputime accounting (all nohz
    configs) (Rik van Riel)

    - remove the messy, ad-hoc irqtime accounting on nohz-full and make it
    compatible with CONFIG_IRQ_TIME_ACCOUNTING=y instead (Rik van Riel)

    - cleanups (Frederic Weisbecker)

    - remove unnecessary irq disablement in the irqtime code (Rik van Riel)

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/cputime: Drop local_irq_save/restore from irqtime_account_irq()
    sched/cputime: Reorganize vtime native irqtime accounting headers
    sched/cputime: Clean up the old vtime gen irqtime accounting completely
    sched/cputime: Replace VTIME_GEN irq time code with IRQ_TIME_ACCOUNTING code
    sched/cputime: Count actually elapsed irq & softirq time

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - introduce and use task_rcu_dereference()/try_get_task_struct() to fix
    and generalize task_struct handling (Oleg Nesterov)

    - do various per entity load tracking (PELT) fixes and optimizations
    (Peter Zijlstra)

    - cputime virt-steal time accounting enhancements/fixes (Wanpeng Li)

    - introduce consolidated cputime output file cpuacct.usage_all and
    related refactorings (Zhao Lei)

    - ... plus misc fixes and enhancements

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/core: Panic on scheduling while atomic bugs if kernel.panic_on_warn is set
    sched/cpuacct: Introduce cpuacct.usage_all to show all CPU stats together
    sched/cpuacct: Use loop to consolidate code in cpuacct_stats_show()
    sched/cpuacct: Merge cpuacct_usage_index and cpuacct_stat_index enums
    sched/fair: Rework throttle_count sync
    sched/core: Fix sched_getaffinity() return value kerneldoc comment
    sched/fair: Reorder cgroup creation code
    sched/fair: Apply more PELT fixes
    sched/fair: Fix PELT integrity for new tasks
    sched/cgroup: Fix cpu_cgroup_fork() handling
    sched/fair: Fix PELT integrity for new groups
    sched/fair: Fix and optimize the fork() path
    sched/cputime: Add steal time support to full dynticks CPU time accounting
    sched/cputime: Fix prev steal time accouting during CPU hotplug
    KVM: Fix steal clock warp during guest CPU hotplug
    sched/debug: Always show 'nr_migrations'
    sched/fair: Use task_rcu_dereference()
    sched/api: Introduce task_rcu_dereference() and try_get_task_struct()
    sched/idle: Optimize the generic idle loop
    sched/fair: Fix the wrong throttled clock time for cfs_rq_clock_task()

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The locking tree was busier in this cycle than the usual pattern - a
    couple of major projects happened to coincide.

    The main changes are:

    - implement the atomic_fetch_{add,sub,and,or,xor}() API natively
    across all SMP architectures (Peter Zijlstra)

    - add atomic_fetch_{inc/dec}() as well, using the generic primitives
    (Davidlohr Bueso)

    - optimize various aspects of rwsems (Jason Low, Davidlohr Bueso,
    Waiman Long)

    - optimize smp_cond_load_acquire() on arm64 and implement LSE based
    atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
    on arm64 (Will Deacon)

    - introduce smp_acquire__after_ctrl_dep() and fix various barrier
    mis-uses and bugs (Peter Zijlstra)

    - after discovering ancient spin_unlock_wait() barrier bugs in its
    implementation and usage, strengthen its semantics and update/fix
    usage sites (Peter Zijlstra)

    - optimize mutex_trylock() fastpath (Peter Zijlstra)

    - ... misc fixes and cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
    locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API
    locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire()
    locking/static_keys: Fix non static symbol Sparse warning
    locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec()
    locking/atomic, arch/tile: Fix tilepro build
    locking/atomic, arch/m68k: Remove comment
    locking/atomic, arch/arc: Fix build
    locking/Documentation: Clarify limited control-dependency scope
    locking/atomic, arch/rwsem: Employ atomic_long_fetch_add()
    locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire()
    locking/atomic, arch/mips: Convert to _relaxed atomics
    locking/atomic, arch/alpha: Convert to _relaxed atomics
    locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions
    locking/atomic: Remove linux/atomic.h:atomic_fetch_or()
    locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
    locking/atomic: Fix atomic64_relaxed() bits
    locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    ...

    Linus Torvalds
     

25 Jul, 2016

1 commit

  • * pm-cpufreq: (41 commits)
    Revert "cpufreq: pcc-cpufreq: update default value of cpuinfo_transition_latency"
    cpufreq: export cpufreq_driver_resolve_freq()
    cpufreq: Disallow ->resolve_freq() for drivers providing ->target_index()
    cpufreq: acpi-cpufreq: use cached frequency mapping when possible
    cpufreq: schedutil: map raw required frequency to driver frequency
    cpufreq: add cpufreq_driver_resolve_freq()
    cpufreq: intel_pstate: Check cpuid for MSR_HWP_INTERRUPT
    intel_pstate: Update cpu_frequency tracepoint every time
    cpufreq: intel_pstate: clean remnant struct element
    cpufreq: powernv: Replacing pstate_id with frequency table index
    intel_pstate: Fix MSR_CONFIG_TDP_x addressing in core_get_max_pstate()
    cpufreq: Reuse new freq-table helpers
    cpufreq: Handle sorted frequency tables more efficiently
    cpufreq: Drop redundant check from cpufreq_update_current_freq()
    intel_pstate: Declare pid_params/pstate_funcs/hwp_active __read_mostly
    intel_pstate: add __init/__initdata marker to some functions/variables
    intel_pstate: Fix incorrect placement of __initdata
    cpufreq: mvebu: fix integer to pointer cast
    cpufreq: intel_pstate: Broxton support
    cpufreq: conservative: Do not use transition notifications
    ...

    Rafael J. Wysocki
     

22 Jul, 2016

1 commit

  • The slow-path frequency transition path is relatively expensive as it
    requires waking up a thread to do work. Should support be added for
    remote CPU cpufreq updates, those would also be expensive since they
    require an IPI. These activities should be avoided if they are not
    necessary.

    To that end, calculate the actual driver-supported frequency required by
    the new utilization value in schedutil by using the recently added
    cpufreq_driver_resolve_freq API. If it is the same as the previously
    requested driver frequency then there is no need to continue with the
    update assuming the cpu frequency limits have not changed. This will
    have additional benefits should the semantics of the rate limit be
    changed to apply solely to frequency transitions rather than to
    frequency calculations in schedutil.

    The last raw required frequency is cached. This allows the driver
    frequency lookup to be skipped in the event that the new raw required
    frequency matches the last one, assuming a frequency update has not been
    forced due to limits changing (indicated by a next_freq value of
    UINT_MAX, see sugov_should_update_freq).
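
    As a rough sketch of that two-level check (structure and function names
    here are illustrative, not the actual schedutil code, apart from
    cpufreq_driver_resolve_freq() which the text names):

    struct gov_state {
            struct cpufreq_policy *policy;
            unsigned int cached_raw_freq;   /* last raw required frequency */
            unsigned int next_freq;         /* last frequency requested from the driver */
    };

    static unsigned int pick_next_freq(struct gov_state *g, unsigned int raw_freq)
    {
            /* Same raw frequency and no forced update: skip the driver lookup. */
            if (raw_freq == g->cached_raw_freq && g->next_freq != UINT_MAX)
                    return g->next_freq;

            g->cached_raw_freq = raw_freq;
            return cpufreq_driver_resolve_freq(g->policy, raw_freq);
    }

    The caller can then compare the returned value against the previously
    requested driver frequency and bail out early when they match and the
    frequency limits have not changed.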

    Signed-off-by: Steve Muckle
    Reviewed-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Steve Muckle
     

14 Jul, 2016

4 commits

  • Paolo pointed out that irqs are already blocked when irqtime_account_irq()
    is called. That means there is no reason to call local_irq_save/restore()
    again.

    Suggested-by: Paolo Bonzini
    Signed-off-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Paolo Bonzini
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1468421405-20056-6-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Vtime generic irqtime accounting has been removed but there are a few
    remnants to clean up:

    * The vtime_accounting_cpu_enabled() check in irq entry was only used
    by CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can safely remove it.

    * Without the vtime_accounting_cpu_enabled(), we no longer need to
    have a vtime_common_account_irq_enter() indirect function.

    * Move vtime_account_irq_enter() implementation under
    CONFIG_VIRT_CPU_ACCOUNTING_NATIVE which is the last user.

    * The vtime_account_user() call was only used on irq entry for
    CONFIG_VIRT_CPU_ACCOUNTING_GEN. We can remove that too.

    Signed-off-by: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1468421405-20056-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     
  • The CONFIG_VIRT_CPU_ACCOUNTING_GEN irq time tracking code does not
    appear to currently work right.

    On CPUs without nohz_full=, only tick based irq time sampling is
    done, which breaks down when dealing with a nohz_idle CPU.

    On firewalls and similar systems, no ticks may happen on a CPU for a
    while, and the irq time spent may never get accounted properly. This
    can cause issues with capacity planning and power saving, which use
    the CPU statistics as inputs in decision making.

    Remove the VTIME_GEN vtime irq time code, and replace it with the
    IRQ_TIME_ACCOUNTING code, when selected as a config option by the user.

    Signed-off-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1468421405-20056-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Currently, if there was any irq or softirq time during 'ticks'
    jiffies, the entire period will be accounted as irq or softirq
    time.

    This is inaccurate if only a subset of the time was actually spent
    handling irqs, and could conceivably mis-count all of the ticks during
    a period as irq time, when there was some irq and some softirq time.

    This can actually happen when irqtime_account_process_tick is called
    from account_idle_ticks, which can pass a larger number of ticks down
    all at once.

    Fix this by changing irqtime_account_hi_update(), irqtime_account_si_update(),
    and steal_account_process_ticks() to work with cputime_t time units, and
    return the amount of time spent in each mode.

    Rename steal_account_process_ticks() to steal_account_process_time(), to
    reflect that time is now accounted in cputime_t, instead of ticks.

    Additionally, have irqtime_account_process_tick() take into account how
    much time was spent in each of steal, irq, and softirq time.

    The latter could help improve the accuracy of cputime
    accounting when returning from idle on a NO_HZ_IDLE CPU.

    Properly accounting how much time was spent in hardirq and
    softirq time will also allow the NO_HZ_FULL code to re-use
    these same functions for hardirq and softirq accounting.
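
    A sketch of the resulting accounting flow (the helper names below are
    illustrative stand-ins for the functions named above; each one accounts
    the time actually spent in its mode since the last accounting, capped by
    the remaining interval, and returns that amount in cputime_t):

    static void account_interval(struct task_struct *p, cputime_t delta)
    {
            cputime_t other = 0;

            other += steal_time_delta(delta - other);     /* stolen by host   */
            other += hardirq_time_delta(delta - other);   /* hardirq          */
            other += softirq_time_delta(delta - other);   /* softirq          */

            if (other < delta)
                    account_task_time(p, delta - other);  /* user/system/idle */
    }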

    Signed-off-by: Rik van Riel
    [ Make nsecs_to_cputime64() actually return cputime64_t. ]
    Signed-off-by: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paolo Bonzini
    Cc: Peter Zijlstra
    Cc: Radim Krcmar
    Cc: Thomas Gleixner
    Cc: Wanpeng Li
    Link: http://lkml.kernel.org/r/1468421405-20056-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

13 Jul, 2016

1 commit

  • The move of calc_load_migrate() from CPU_DEAD to CPU_DYING did not take into
    account that the function is now called from a thread running on the outgoing
    CPU. As a result, a CPU unplug leaks a load of 1 into the global load
    accounting mechanism.

    Fix it by adjusting for the currently running thread which calls
    calc_load_migrate().
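
    A minimal sketch of the adjustment (illustrative, not the literal diff):

    /* When the fold runs on the outgoing CPU itself, the calling thread is
     * still part of nr_running and must not be folded into the global load. */
    long nr_active = rq->nr_running - 1 /* the thread doing the unplug */
                   + (long)rq->nr_uninterruptible;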

    Reported-by: Anton Blanchard
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Vaidyanathan Srinivasan
    Cc: rt@linutronix.de
    Cc: shreyas@linux.vnet.ibm.com
    Fixes: e9cd8fa4fcfd: ("sched/migration: Move calc_load_migrate() into CPU_DYING")
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1607121744350.4083@nanos
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

11 Jul, 2016

1 commit

  • Currently, a schedule-while-atomic error prints the stack trace to the
    kernel log and the system continues running.

    Although it is possible to collect the kernel log messages and analyze
    them, often more information is needed. Furthermore, keeping the system
    running is not always the best choice. For example, when the preempt
    count underflows the system will not stop complaining about scheduling
    while atomic, so the kernel log can wrap around, overwriting the first
    stack trace and making the analysis even more challenging.

    This patch uses the kernel.panic_on_warn sysctl to help out on these
    more complex situations.

    When kernel.panic_on_warn is set to 1, the kernel will panic() in the
    schedule while atomic detection.

    The default value of the sysctl is 0, maintaining the current behavior.
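
    A sketch of the check added to the report path (the surrounding function
    body is elided; panic_on_warn is the existing sysctl-backed variable):

    static noinline void __schedule_bug(struct task_struct *prev)
    {
            /* ... existing printk()/stack-dump reporting ... */

            if (panic_on_warn)
                    panic("scheduling while atomic\n");
    }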

    Signed-off-by: Daniel Bristot de Oliveira
    Reviewed-by: Luis Claudio R. Goncalves
    Cc: Christian Borntraeger
    Cc: Linus Torvalds
    Cc: Luis Claudio R. Goncalves
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/e8f7b80f353aa22c63bd8557208163989af8493d.1464983675.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Daniel Bristot de Oliveira
     

09 Jul, 2016

3 commits

  • In the current code, we can get cpuacct data from several files,
    but each file has its own limitations.

    For example:

    - We can get CPU usage in user and kernel mode via cpuacct.stat,
    but we can't get detailed data about each CPU.

    - We can get each CPU's kernel mode usage in cpuacct.usage_percpu_sys,
    but we can't get user mode usage data at the same time.

    This patch introduces cpuacct.usage_all, to show all detailed CPU
    accounting data together:

    # cat cpuacct.usage_all
    cpu user system
    0 3809760299 5807968992
    1 3250329855 454612211
    ..

    Signed-off-by: Zhao Lei
    Cc: KOSAKI Motohiro
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/7744460969edd7caaf0e903592ee52353ed9bdd6.1466415271.git.zhaolei@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Zhao Lei
     
  • In cpuacct_stats_show() we currently have copies of similar code
    for each cpustat (system/user) variant.

    Use a loop instead to consolidate the code. This will also work better
    if we extend the CPUACCT_STAT_NSTATS type.
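
    The rough shape of the consolidated loop (the array, descriptor and
    seq_file names are assumptions for illustration):

    for (stat = 0; stat < CPUACCT_STAT_NSTATS; stat++)
            seq_printf(sf, "%s %lld\n", cpuacct_stat_desc[stat],
                       (long long)cputime64_to_clock_t(val[stat]));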

    Signed-off-by: Zhao Lei
    Cc: KOSAKI Motohiro
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/b0597d4224655e9f333f1a6224ed9654c7d7d36a.1466415271.git.zhaolei@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Zhao Lei
     
  • These two types have similar function, no need to separate them.

    Signed-off-by: Zhao Lei
    Cc: KOSAKI Motohiro
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/436748885270d64363c7dc67167507d486c2057a.1466415271.git.zhaolei@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Zhao Lei
     

06 Jul, 2016

1 commit

  • The pv_time_ops structure contains a function pointer for the
    "steal_clock" functionality used only by KVM and Xen on ARM. Xen on x86
    uses its own mechanism to account for the "stolen" time a thread wasn't
    able to run due to hypervisor scheduling.

    Add support for this feature in the Xen arch-independent time handling
    by moving it out of the ARM arch into drivers/xen, and remove the x86
    Xen hack.
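
    As a rough illustration of the shape such a hook takes (the runstate
    helper and the setup function name are assumptions, not necessarily what
    this patch uses):

    static u64 xen_steal_clock(int cpu)
    {
            struct vcpu_runstate_info state;

            xen_read_vcpu_runstate(cpu, &state);    /* assumed runstate helper */
            return state.time[RUNSTATE_runnable] + state.time[RUNSTATE_offline];
    }

    static void __init xen_setup_steal_clock(void)
    {
            pv_time_ops.steal_clock = xen_steal_clock;
    }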

    Signed-off-by: Juergen Gross
    Reviewed-by: Boris Ostrovsky
    Reviewed-by: Stefano Stabellini
    Signed-off-by: David Vrabel

    Juergen Gross
     

04 Jul, 2016

1 commit


27 Jun, 2016

11 commits

  • Since we already take rq->lock when creating a cgroup, use it to also
    sync the throttle_count and avoid the extra state and enqueue path
    branch.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Cc: linux-kernel@vger.kernel.org
    [ Fixed build warning. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The previous version was probably written referencing the man page for
    glibc's wrapper, but the wrapper's behavior differs from that of the
    syscall itself in this case.

    Signed-off-by: Zev Weiss
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/1466975603-25408-1-git-send-email-zev@bewilderbeest.net
    Signed-off-by: Ingo Molnar

    Zev Weiss
     
  • A future patch needs rq->lock held _after_ we link the task_group into
    the hierarchy. In order to avoid taking every rq->lock twice, reorder
    things a little and create online_fair_sched_group() to be called
    after we link the task_group.

    All this code is still run from css_alloc(), so css_online() isn't in
    fact used for this.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Konstantin Khlebnikov
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • One additional 'rule' for using update_cfs_rq_load_avg() is that one
    should call update_tg_load_avg() if it returns true.

    Add a bunch of comments to hopefully clarify some of the rules:

    o You need to update the cfs_rq _before_ any entity attach/detach.
    This is important because, while it isn't strictly needed for
    mathematical consistency, it is required for the physical
    interpretation of the model: you attach/detach _now_.

    o When you modify the cfs_rq avg, you have to then call
    update_tg_load_avg() in order to propagate changes upwards.

    o (Fair) entities are always attached, switched_{to,from}_fair()
    deal with !fair. This directly follows from the definition of the
    cfs_rq averages, namely that they are a direct sum of all
    (runnable or blocked) entities on that rq.

    It is the second rule that this patch enforces, but it adds comments
    pertaining to all of them.
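
    The usage pattern implied by the second rule looks roughly like this
    (argument lists abbreviated):

    if (update_cfs_rq_load_avg(now, cfs_rq))
            update_tg_load_avg(cfs_rq, 0);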

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Vincent and Yuyang found another few scenarios in which entity
    tracking goes wobbly.

    The scenarios are basically due to the fact that new tasks are not
    immediately attached and thereby differ from the normal situation -- a
    task is always attached to a cfs_rq load average (such that it
    includes its blocked contribution) and are explicitly
    detached/attached on migration to another cfs_rq.

    Scenario 1: switch to fair class

    p->sched_class = fair_class;

    if (queued)
      enqueue_task(p);
        ...
          enqueue_entity()
            enqueue_entity_load_avg()
              migrated = !sa->last_update_time (true)
              if (migrated)
                attach_entity_load_avg()

    check_class_changed()
      switched_from() (!fair)
      switched_to()   (fair)
        switched_to_fair()
          attach_entity_load_avg()

    If @p is a new task that hasn't been fair before, it will have
    !last_update_time and, per the above, end up in
    attach_entity_load_avg() _twice_.

    Scenario 2: change between cgroups

    sched_move_group(p)
      if (queued)
        dequeue_task()
      task_move_group_fair()
        detach_task_cfs_rq()
          detach_entity_load_avg()
        set_task_rq()
        attach_task_cfs_rq()
          attach_entity_load_avg()
      if (queued)
        enqueue_task();
          ...
            enqueue_entity()
              enqueue_entity_load_avg()
                migrated = !sa->last_update_time (true)
                if (migrated)
                  attach_entity_load_avg()

    Similar as with scenario 1, if @p is a new task, it will have
    !load_update_time and we'll end up in attach_entity_load_avg()
    _twice_.

    Furthermore, notice how we do a detach_entity_load_avg() on something
    that wasn't attached to begin with.

    As stated above; the problem is that the new task isn't yet attached
    to the load tracking and thereby violates the invariant assumption.

    This patch remedies this by ensuring a new task is indeed properly
    attached to the load tracking on creation, through
    post_init_entity_util_avg().

    Of course, this isn't entirely as straightforward as one might think,
    since the task is hashed before we call wake_up_new_task() and thus
    can be poked at. We avoid this by adding TASK_NEW and teaching
    cpu_cgroup_can_attach() to refuse such tasks.

    Reported-by: Yuyang Du
    Reported-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • A new fair task is detached and attached from/to task_group with:

    cgroup_post_fork()
      ss->fork(child) := cpu_cgroup_fork()
        sched_move_task()
          task_move_group_fair()

    Which is wrong, because at this point in fork() the task isn't fully
    initialized and it cannot 'move' to another group, because it's not
    attached to any group yet.

    In fact, cpu_cgroup_fork() needs only a small part of sched_move_task(),
    so we can just call that small part directly instead of sched_move_task().
    And the task doesn't really migrate, because it is not yet attached, so
    we need the following sequence:

    do_fork()
      sched_fork()
        __set_task_cpu()

      cgroup_post_fork()
        set_task_rq() # set task group and runqueue

      wake_up_new_task()
        select_task_rq() can select a new cpu
        __set_task_cpu
        post_init_entity_util_avg
          attach_task_cfs_rq()
        activate_task
          enqueue_task

    This patch makes that happen.

    Signed-off-by: Vincent Guittot
    [ Added TASK_SET_GROUP to set depth properly. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • Vincent reported that when a new task is moved into a new cgroup it
    gets attached twice to the load tracking:

    sched_move_task()
      task_move_group_fair()
        detach_task_cfs_rq()
        set_task_rq()
        attach_task_cfs_rq()
          attach_entity_load_avg()
            se->avg.last_load_update = cfs_rq->avg.last_load_update // == 0

    enqueue_entity()
      enqueue_entity_load_avg()
        update_cfs_rq_load_avg()
          now = clock()
          __update_load_avg(&cfs_rq->avg)
            cfs_rq->avg.last_load_update = now
            // ages load/util for: now - 0, load/util -> 0
        if (migrated)
          attach_entity_load_avg()
            se->avg.last_load_update = cfs_rq->avg.last_load_update; // now != 0

    The problem is that we don't update cfs_rq load_avg before all
    entity attach/detach operations. Only enqueue_task() and migrate_task()
    do this.

    By fixing this, the above will not happen, because the
    sched_move_task() attach will have updated cfs_rq's last_load_update
    time before attach, and in turn the attach will have set the entity's
    last_load_update stamp.

    Note that there is a further problem with sched_move_task() calling
    detach on a task that hasn't yet been attached; this will be taken
    care of in a subsequent patch.

    Reported-by: Vincent Guittot
    Tested-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The task_fork_fair() callback already calls __set_task_cpu() and takes
    rq->lock.

    If we move the sched_class::task_fork callback in sched_fork() under
    the existing p->pi_lock, right after its set_task_cpu() call, we can
    avoid doing two such calls and omit the IRQ disabling on the rq->lock.

    Change it to use __set_task_cpu() to skip the migration bits; this is a
    new task, not a migration. Similarly, make wake_up_new_task() use
    __set_task_cpu() for the same reason: the task hasn't actually migrated
    as it hasn't ever run.

    This cures the problem of calling migrate_task_rq_fair(), which does
    remove_entity_from_load_avg() on tasks that have never been added to
    the load avg to begin with.

    This bug would result in transiently messed up load_avg values, averaged
    out after a few dozen milliseconds. This is probably the reason why
    this bug was not found for such a long time.

    Reported-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Commit:

    fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")

    did something non-obvious, and did so in a way that was buggy yet latent.

    The problem was exposed for real by a later commit in the v4.7 merge window:

    2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")

    ... after which tg->load_avg and cfs_rq->load.weight had different
    units (10 bit fixed point and 20 bit fixed point resp.).

    Add a comment to explain the use of cfs_rq->load.weight over the
    'natural' cfs_rq->avg.load_avg and add scale_load_down() to correct
    for the difference in unit.
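
    For reference, the scaling behind that unit difference is roughly:

    #define SCHED_FIXEDPOINT_SHIFT 10
    #define scale_load(w)          ((w) << SCHED_FIXEDPOINT_SHIFT)
    #define scale_load_down(w)     ((w) >> SCHED_FIXEDPOINT_SHIFT)

    /* nice-0 weight with the increased 64-bit load resolution: */
    unsigned long w20 = scale_load(1024);           /* 1048576, 20-bit fixed point */
    unsigned long w10 = scale_load_down(w20);       /* 1024,    10-bit fixed point */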

    Since this is (now, as per a previous commit) the only user of
    calc_tg_weight(), collapse it.

    The effects of this bug should be randomly inconsistent SMP-balancing
    of cgroups workloads.

    Reported-by: Jirka Hladky
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 2159197d6677 ("sched/core: Enable increased load resolution on 64-bit kernels")
    Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Starting with the following commit:

    fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")

    calc_tg_weight() doesn't compute the right value as expected by effective_load().

    The difference is in the 'correction' term. In order to ensure \Sum
    rw_j >= rw_i we cannot use tg->load_avg directly, since that might be
    lagging a correction on the current cfs_rq->avg.load_avg value.
    Therefore we use tg->load_avg - cfs_rq->tg_load_avg_contrib +
    cfs_rq->avg.load_avg.
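
    Spelled out as code, the correction term above amounts to something like
    (a sketch using the fields named in the text):

    long tg_weight;

    tg_weight  = atomic_long_read(&tg->load_avg);
    tg_weight -= cfs_rq->tg_load_avg_contrib;
    tg_weight += cfs_rq->avg.load_avg;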

    Now, per the referenced commit, calc_tg_weight() doesn't use
    cfs_rq->avg.load_avg, as is later used in @w, but uses
    cfs_rq->load.weight instead.

    So stop using calc_tg_weight() and do it explicitly.

    The effects of this bug are wake_affine() making randomly
    poor choices in cgroup-intense workloads.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: # v4.3+
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: fde7d22e01aa ("sched/fair: Fix overly small weight for interactive group entities")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

24 Jun, 2016

3 commits

  • During CPU hotplug, CPU_ONLINE callbacks are run while the CPU is
    online but not active. A CPU_ONLINE callback may create or bind a
    kthread so that its cpus_allowed mask only allows the CPU which is
    being brought online. The kthread may start executing before the CPU
    is made active and can end up in select_fallback_rq().

    In such cases, the expected behavior is selecting the CPU which is
    coming online; however, because select_fallback_rq() only chooses from
    active CPUs, it determines that the task doesn't have any viable CPU
    in its allowed mask and ends up overriding it to cpu_possible_mask.

    CPU_ONLINE callbacks should be able to put kthreads on the CPU which
    is coming online. Update select_fallback_rq() so that it follows
    cpu_online() rather than cpu_active() for kthreads.
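
    A sketch of the resulting selection policy (not the literal upstream diff):

    for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {
            if (!cpu_online(dest_cpu))
                    continue;
            /* Kernel threads may use online-but-not-yet-active CPUs. */
            if (!cpu_active(dest_cpu) && !(p->flags & PF_KTHREAD))
                    continue;
            return dest_cpu;
    }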

    Reported-by: Gautham R Shenoy
    Tested-by: Gautham R. Shenoy
    Signed-off-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Abdul Haleem
    Cc: Aneesh Kumar
    Cc: Linus Torvalds
    Cc: Michael Ellerman
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: kernel-team@fb.com
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20160616193504.GB3262@mtj.duckdns.org
    Signed-off-by: Ingo Molnar

    Tejun Heo
     
  • The hierarchy could already be throttled at this point. A throttled next
    buddy could trigger a NULL pointer dereference in pick_next_task_fair().

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Ben Segall
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz
    Signed-off-by: Ingo Molnar

    Konstantin Khlebnikov
     
  • A cgroup created inside a throttled group must inherit the current
    throttle_count. A broken throttle_count allows throttled entries to be
    nominated as the next buddy, which later leads to a NULL pointer
    dereference in pick_next_task_fair().

    This patch initializes cfs_rq->throttle_count at the first enqueue:
    laziness allows us to skip locking all runqueues at group creation. The
    lazy approach also allows skipping a full sub-tree scan when throttling
    the hierarchy (not in this patch).

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
    Signed-off-by: Ingo Molnar

    Konstantin Khlebnikov
     

20 Jun, 2016

1 commit

  • As per commit:

    b7fa30c9cc48 ("sched/fair: Fix post_init_entity_util_avg() serialization")

    > the code generated from update_cfs_rq_load_avg():
    >
    > if (atomic_long_read(&cfs_rq->removed_load_avg)) {
    >         s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
    >         sa->load_avg = max_t(long, sa->load_avg - r, 0);
    >         sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
    >         removed_load = 1;
    > }
    >
    > turns into:
    >
    > ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
    > ffffffff8108706b: 48 85 c0 test %rax,%rax
    > ffffffff8108706e: 74 40 je ffffffff810870b0
    > ffffffff81087070: 4c 89 f8 mov %r15,%rax
    > ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
    > ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
    > ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
    > ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
    > ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
    > ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx
    >
    > Which you'll note ends up with sa->load_avg -= r in memory at
    > ffffffff8108707a.

    So I _should_ have looked at other unserialized users of ->load_avg,
    but alas. Luckily nikbor reported a similar /0 from task_h_load() which
    instantly triggered recollection of this here problem.

    Aside from the intermediate value hitting memory and causing problems,
    there's another problem: the underflow detection relies on the signed
    bit. This reduces the effective width of the variables; IOW it's
    effectively the same as having these variables be of signed type.

    This patch changes to a different means of unsigned underflow
    detection to not rely on the signed bit. This allows the variables to
    use the 'full' unsigned range. And it does so with explicit LOAD -
    STORE to ensure any intermediate value will never be visible in
    memory, allowing these unserialized loads.

    Note: GCC generates crap code for this, might warrant a look later.

    Note2: I say 'full' above, if we end up at U*_MAX we'll still explode;
    maybe we should do clamping on add too.
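
    A sketch of such unsigned, single-store underflow clamping (the helper
    name is illustrative):

    static inline void sub_clamped(unsigned long *ptr, unsigned long val)
    {
            unsigned long old = READ_ONCE(*ptr);
            unsigned long res = old - val;

            if (res > old)          /* unsigned wrap-around => underflow */
                    res = 0;
            WRITE_ONCE(*ptr, res);  /* single store, no intermediate value in memory */
    }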

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrey Ryabinin
    Cc: Chris Wilson
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Cc: bsegall@google.com
    Cc: kernel@kyup.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
    Link: http://lkml.kernel.org/r/20160617091948.GJ30927@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Jun, 2016

7 commits

  • Lengthy output of sysrq-w may take a lot of time on a slow serial console.

    Currently we reset the NMI watchdog on the current CPU to avoid spurious
    lockup messages. Sometimes this doesn't work, since the softlockup
    watchdog might trigger on another CPU which is waiting for an IPI to
    proceed. We reset softlockup watchdogs on all CPUs, but we do this only
    after listing all tasks, and this may be too late on a busy system.

    So, reset the watchdogs on all CPUs earlier, in the
    for_each_process_thread() loop.

    Signed-off-by: Andrey Ryabinin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc:
    Link: http://lkml.kernel.org/r/1465474805-14641-1-git-send-email-aryabinin@virtuozzo.com
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     
  • I see a hang when enabling sched events:

    echo 1 > /sys/kernel/debug/tracing/events/sched/enable

    The printk buffer shows:

    BUG: spinlock recursion on CPU#1, swapper/1/0
    lock: 0xffff88007d5d8c00, .magic: dead4ead, .owner: swapper/1/0, .owner_cpu: 1
    CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.7.0-rc2+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014
    ...
    Call Trace:
    [] dump_stack+0x85/0xc2
    [] spin_dump+0x78/0xc0
    [] do_raw_spin_lock+0x11a/0x150
    [] _raw_spin_lock+0x61/0x80
    [] ? try_to_wake_up+0x256/0x4e0
    [] try_to_wake_up+0x256/0x4e0
    [] ? _raw_spin_unlock_irqrestore+0x4a/0x80
    [] wake_up_process+0x15/0x20
    [] insert_work+0x84/0xc0
    [] __queue_work+0x18f/0x660
    [] queue_work_on+0x46/0x90
    [] drm_fb_helper_dirty.isra.11+0xcb/0xe0 [drm_kms_helper]
    [] drm_fb_helper_sys_imageblit+0x30/0x40 [drm_kms_helper]
    [] soft_cursor+0x1ad/0x230
    [] bit_cursor+0x649/0x680
    [] ? update_attr.isra.2+0x90/0x90
    [] fbcon_cursor+0x14a/0x1c0
    [] hide_cursor+0x28/0x90
    [] vt_console_print+0x3bf/0x3f0
    [] call_console_drivers.constprop.24+0x183/0x200
    [] console_unlock+0x3d4/0x610
    [] vprintk_emit+0x3c5/0x610
    [] vprintk_default+0x29/0x40
    [] printk+0x57/0x73
    [] enqueue_entity+0xc2e/0xc70
    [] enqueue_task_fair+0x59/0xab0
    [] ? kvm_sched_clock_read+0x9/0x20
    [] ? sched_clock+0x9/0x10
    [] activate_task+0x5c/0xa0
    [] ttwu_do_activate+0x54/0xb0
    [] sched_ttwu_pending+0x7a/0xb0
    [] scheduler_ipi+0x61/0x170
    [] smp_trace_reschedule_interrupt+0x4f/0x2a0
    [] trace_reschedule_interrupt+0x96/0xa0
    [] ? native_safe_halt+0x6/0x10
    [] ? trace_hardirqs_on+0xd/0x10
    [] default_idle+0x20/0x1a0
    [] arch_cpu_idle+0xf/0x20
    [] default_idle_call+0x2f/0x50
    [] cpu_startup_entry+0x37e/0x450
    [] start_secondary+0x160/0x1a0

    Note the hang only occurs when echoing the above from a physical serial
    console, not from an ssh session.

    The bug is caused by a deadlock where the task is trying to grab the rq
    lock twice because printk()'s aren't safe in sched code.

    Signed-off-by: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Matt Fleming
    Cc: Mel Gorman
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: cb2517653fcc ("sched/debug: Make schedstats a runtime tunable that is disabled by default")
    Link: http://lkml.kernel.org/r/20160613073209.gdvdybiruljbkn3p@treble
    Signed-off-by: Ingo Molnar

    Josh Poimboeuf
     
  • This new form allows using hardware assisted waiting.

    Some hardware (ARM64 and x86) allow monitoring an address for changes,
    so by providing a pointer we can use this to replace the cpu_relax()
    with hardware optimized methods in the future.
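
    For instance, a waiter can spin on a variable with acquire semantics like
    this (the smp_cond_load_acquire() form from the locking updates above;
    VAL names the freshly loaded value inside the condition, and the field
    shown is just an example):

    /* Wait until *ptr reads as zero, then proceed with ACQUIRE ordering;
     * the spin may be backed by a hardware wait-for-event on the address. */
    smp_cond_load_acquire(&p->on_cpu, !VAL);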

    Requested-by: Will Deacon
    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This patch adds guest steal-time support to full dynticks CPU
    time accounting. After the following commit:

    ff9a9b4c4334 ("sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity")

    ... time sampling became jiffy based, even if we do the sampling from the
    context tracking code, so steal_account_process_tick() can be reused
    to account how many 'ticks' are stolen-time, after the last accumulation.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Paolo Bonzini
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Radim Krčmář
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1465813966-3116-4-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Commit:

    e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")

    ... set rq->prev_* to 0 after a CPU hotplug comes back, in order to
    fix the case where (after CPU hotplug) steal time is smaller than
    rq->prev_steal_time.

    However, this should never happen. Steal time was only smaller because of the
    KVM-specific bug fixed by the previous patch. Worse, the previous patch
    triggers a bug on CPU hot-unplug/plug operation: because
    rq->prev_steal_time is cleared, all of the CPU's past steal time will be
    accounted again on hot-plug.

    Since the root cause has been fixed, we can just revert commit e9532e69b8d1.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Paolo Bonzini
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Radim Krčmář
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Fixes: 'commit e9532e69b8d1 ("sched/cputime: Fix steal time accounting vs. CPU hotplug")'
    Link: http://lkml.kernel.org/r/1465813966-3116-3-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Chris Wilson reported a divide by 0 at:

    post_init_entity_util_avg():

    > 725         if (cfs_rq->avg.util_avg != 0) {
    > 726                 sa->util_avg = cfs_rq->avg.util_avg * se->load.weight;
    > -> 727              sa->util_avg /= (cfs_rq->avg.load_avg + 1);
    > 728
    > 729                 if (sa->util_avg > cap)
    > 730                         sa->util_avg = cap;
    > 731         } else {

    Which given the lack of serialization, and the code generated from
    update_cfs_rq_load_avg() is entirely possible:

    if (atomic_long_read(&cfs_rq->removed_load_avg)) {
            s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);
            sa->load_avg = max_t(long, sa->load_avg - r, 0);
            sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
            removed_load = 1;
    }

    turns into:

    ffffffff81087064: 49 8b 85 98 00 00 00 mov 0x98(%r13),%rax
    ffffffff8108706b: 48 85 c0 test %rax,%rax
    ffffffff8108706e: 74 40 je ffffffff810870b0
    ffffffff81087070: 4c 89 f8 mov %r15,%rax
    ffffffff81087073: 49 87 85 98 00 00 00 xchg %rax,0x98(%r13)
    ffffffff8108707a: 49 29 45 70 sub %rax,0x70(%r13)
    ffffffff8108707e: 4c 89 f9 mov %r15,%rcx
    ffffffff81087081: bb 01 00 00 00 mov $0x1,%ebx
    ffffffff81087086: 49 83 7d 70 00 cmpq $0x0,0x70(%r13)
    ffffffff8108708b: 49 0f 49 4d 70 cmovns 0x70(%r13),%rcx

    Which you'll note ends up with 'sa->load_avg - r' in memory at
    ffffffff8108707a.

    By calling post_init_entity_util_avg() under rq->lock we're sure to be
    fully serialized against PELT updates and cannot observe intermediate
    state like this.

    Reported-by: Chris Wilson
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrey Ryabinin
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Yuyang Du
    Cc: bsegall@google.com
    Cc: morten.rasmussen@arm.com
    Cc: pjt@google.com
    Cc: steve.muckle@linaro.org
    Fixes: 2b8c41daba32 ("sched/fair: Initiate a new task's util avg to a bounded value")
    Link: http://lkml.kernel.org/r/20160609130750.GQ30909@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra