02 Sep, 2015

1 commit

  • The problem addressed by this patch is the affinity of unpinned
    timers. Adaptive (full) dynticks CPUs are currently disturbed by
    unnecessary jitter when such timers fire on them.

    This patch affines timers to online CPUs that are not full
    dynticks on NOHZ_FULL-configured systems. Thanks to static keys,
    it should not introduce overhead when nohz full is off.
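
    A minimal sketch of the idea, assuming the housekeeping_mask and
    tick_nohz_full_enabled() helpers from <linux/tick.h>; the function
    name below is illustrative, not the upstream one:

        #include <linux/smp.h>
        #include <linux/cpumask.h>
        #include <linux/tick.h>

        /* Illustrative: pick a target CPU for an unpinned timer,
         * preferring online housekeeping (non full dynticks) CPUs. */
        static int pick_timer_target_cpu(void)
        {
        #ifdef CONFIG_NO_HZ_FULL
                /* static-key backed: the off case is a patched-out branch */
                if (tick_nohz_full_enabled())
                        return cpumask_any_and(housekeeping_mask, cpu_online_mask);
        #endif
                return smp_processor_id();
        }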

    Signed-off-by: Vatika Harlalka
    Signed-off-by: Frederic Weisbecker
    Reviewed-by: Preeti U Murthy
    Acked-by: Thomas Gleixner
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1441119060-2230-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Vatika Harlalka

29 Jul, 2015

3 commits

  • Leftover from early code.

    Cc: Christoph Lameter
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Viresh Kumar
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
  • Restart the tick when necessary from the irq exit path. This makes
    nohz full more flexible, simplifies the related IPIs, and doesn't
    add significant overhead on irq exit.

    In the longer term, this will allow us to piggyback the nohz kick
    on the scheduler IPI instead of sending a dedicated IPI that often
    doubles the scheduler IPI on task wakeup. That will require more
    changes, though, including a careful review of resched_curr()
    callers to include nohz full needs.
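
    A rough sketch of that irq-exit re-evaluation (simplified; the
    stop_tick()/restart_tick() wrappers stand in for the internal
    tick-sched helpers and are not the upstream names):

        #include <linux/smp.h>
        #include <linux/tick.h>

        /* On irq exit, a full dynticks CPU keeps its tick stopped when it
         * can, and restarts it when a new dependency has appeared. */
        static void nohz_full_update_tick_on_irq_exit(struct tick_sched *ts)
        {
                if (!tick_nohz_full_cpu(smp_processor_id()))
                        return;

                if (can_stop_full_tick())       /* no task/timer/perf dependency */
                        stop_tick(ts);          /* hypothetical wrapper */
                else if (ts->tick_stopped)
                        restart_tick(ts);       /* hypothetical wrapper */
        }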

    Reviewed-by: Rik van Riel
    Cc: Christoph Lameter
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Viresh Kumar
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
  • Normally the tilegx networking shim sends irqs to all the cores
    to distribute the load of processing incoming-packet interrupts,
    so that inbound traffic can reach multiple Gb/s.

    However, in nohz_full mode we don't want to interrupt the
    nohz_full cores by default, so we limit the set of cores we use
    to only the online housekeeping cores.

    To make client code easier to read, we introduce a new nohz_full
    accessor, housekeeping_cpumask(), which returns a pointer to
    housekeeping_mask if nohz_full is enabled and to cpu_possible_mask
    otherwise.
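
    Per the description above, the accessor has essentially this shape
    (a sketch, not necessarily the verbatim upstream code):

        static inline const struct cpumask *housekeeping_cpumask(void)
        {
        #ifdef CONFIG_NO_HZ_FULL
                if (tick_nohz_full_enabled())
                        return housekeeping_mask;
        #endif
                return cpu_possible_mask;
        }

    Client code can then restrict its irq distribution with something
    like cpumask_and(dst, src, housekeeping_cpumask()).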

    Signed-off-by: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Cc: Viresh Kumar
    Signed-off-by: Frederic Weisbecker

    Chris Metcalf

08 Jul, 2015

2 commits

  • Making tick_broadcast_oneshot_control() independent from
    CONFIG_GENERIC_CLOCKEVENTS_BROADCAST broke the build for
    CONFIG_GENERIC_CLOCKEVENTS=n because the function is not defined
    there.

    Provide a proper stub inline.
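
    The fix boils down to a no-op inline for the
    CONFIG_GENERIC_CLOCKEVENTS=n case, along these lines (a sketch of
    the shape of the stub):

        #ifdef CONFIG_GENERIC_CLOCKEVENTS
        extern int tick_broadcast_oneshot_control(enum tick_broadcast_state state);
        #else
        static inline int tick_broadcast_oneshot_control(enum tick_broadcast_state state)
        {
                return 0;       /* no clockevents: nothing can be busy */
        }
        #endif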

    Fixes: f32dd1170511 'tick/broadcast: Make idle check independent from mode and config'
    Reported-by: kbuild test robot
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
  • Currently the broadcast busy check, which prevents the idle code from
    going into deep idle, works only in oneshot mode.

    If NOHZ and HIGHRES are off (in the config or on the command line)
    there is no sanity check at all, so under certain conditions CPUs
    are allowed to go into deep idle, where the local timer stops, and
    are never woken up again because no broadcast timer is installed or
    an hrtimer-based broadcast device is not evaluated.

    Move tick_broadcast_oneshot_control() into the common code and provide
    proper subfunctions for the various config combinations.

    The common check in tick_broadcast_oneshot_control() is for the
    C3STOP misfeature flag of the local clock event device: if it's not
    set, idle can proceed; if it is, further checks are necessary, as
    sketched after the list below.

    Provide checks for the trivial cases:

    - If broadcast is disabled in the config, then return busy

    - If oneshot mode (NOHZ/HIGHRES) is disabled in the config, return
    busy if the broadcast device is hrtimer based.

    - If oneshot mode is enabled in the config, call the original
    tick_broadcast_oneshot_control() function. That function needs
    extra checks which will be implemented in separate patches.
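
    Sketched, the common entry point then looks roughly like this (the
    __tick_broadcast_oneshot_control() name follows the split described
    above):

        int tick_broadcast_oneshot_control(enum tick_broadcast_state state)
        {
                struct tick_device *td = this_cpu_ptr(&tick_cpu_device);

                if (!(td->evtdev->features & CLOCK_EVT_FEAT_C3STOP))
                        return 0;       /* local timer survives deep idle */

                return __tick_broadcast_oneshot_control(state);
        }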

    [ Split out from a larger combo patch ]

    Reported-and-tested-by: Sudeep Holla
    Signed-off-by: Thomas Gleixner
    Cc: Suzuki Poulose
    Cc: Lorenzo Pieralisi
    Cc: Catalin Marinas
    Cc: Rafael J. Wysocki
    Cc: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Ingo Molnar
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos

    Thomas Gleixner

24 Jun, 2015

1 commit

  • Pull power management and ACPI updates from Rafael Wysocki:
    "The rework of backlight interface selection API from Hans de Goede
    stands out from the number of commits and the number of affected
    places perspective. The cpufreq core fixes from Viresh Kumar are
    quite significant too as far as the number of commits goes and because
    they should reduce CPU online/offline overhead quite a bit in the
    majority of cases.

    From the new features point of view, the ACPICA update (to upstream
    revision 20150515) adding support for new ACPI 6 material to ACPICA is
    the one that matters the most as some new significant features will be
    based on it going forward. Also included is an update of the ACPI
    device power management core to follow ACPI 6 (which in turn reflects
    the Windows' device PM implementation), a PM core extension to support
    wakeup interrupts in a more generic way and support for the ACPI _CCA
    device configuration object.

    The rest is mostly fixes and cleanups all over and some documentation
    updates, including new DT bindings for Operating Performance Points.

    There is one fix for a regression introduced in the 4.1 cycle, but it
    adds quite a number of lines of code, it wasn't really ready before
    Thursday and you were on vacation, so I refrained from pushing it at
    the last minute for 4.1.

    Specifics:

    - ACPICA update to upstream revision 20150515 including basic support
    for ACPI 6 features: new ACPI tables introduced by ACPI 6 (STAO,
    XENV, WPBT, NFIT, IORT), changes related to the other tables (DRTM,
    FADT, LPIT, MADT), new predefined names (_BTH, _CR3, _DSD, _LPI,
    _MTL, _PRR, _RDI, _RST, _TFP, _TSN), fixes and cleanups (Bob Moore,
    Lv Zheng).

    - ACPI device power management core code update to follow ACPI 6
    which reflects the ACPI device power management implementation in
    Windows (Rafael J Wysocki).

    - rework of the backlight interface selection logic to reduce the
    number of kernel command line options and improve the handling of
    DMI quirks that may be involved in that and to make the code
    generally more straightforward (Hans de Goede).

    - fixes for the ACPI Embedded Controller (EC) driver related to the
    handling of EC transactions (Lv Zheng).

    - fix for a regression related to the ACPI resources management and
    resulting from a recent change of ACPI initialization code ordering
    (Rafael J Wysocki).

    - fix for a system initialization regression related to ACPI
    introduced during the 3.14 cycle and caused by running the code
    that switches the platform over to the ACPI mode too early in the
    initialization sequence (Rafael J Wysocki).

    - support for the ACPI _CCA device configuration object related to
    DMA cache coherence (Suravee Suthikulpanit).

    - ACPI/APEI fixes and cleanups (Jiri Kosina, Borislav Petkov).

    - ACPI battery driver cleanups (Luis Henriques, Mathias Krause).

    - ACPI processor driver cleanups (Hanjun Guo).

    - cleanups and documentation update related to the ACPI device
    properties interface based on _DSD (Rafael J Wysocki).

    - ACPI device power management fixes (Rafael J Wysocki).

    - assorted cleanups related to ACPI (Dominik Brodowski, Fabian
    Frederick, Lorenzo Pieralisi, Mathias Krause, Rafael J Wysocki).

    - fix for a long-standing issue causing General Protection Faults to
    be generated occasionally on return to user space after resume from
    ACPI-based suspend-to-RAM on 32-bit x86 (Ingo Molnar).

    - fix to make the suspend core code return -EBUSY consistently in all
    cases when system suspend is aborted due to wakeup detection (Ruchi
    Kandoi).

    - support for automated device wakeup IRQ handling allowing drivers
    to make their PM support more straightforward (Tony Lindgren).

    - new tracepoints for suspend-to-idle tracing and rework of the
    prepare/complete callbacks tracing in the PM core (Todd E Brandt,
    Rafael J Wysocki).

    - wakeup sources framework enhancements (Jin Qian).

    - new macro for noirq system PM callbacks (Grygorii Strashko).

    - assorted cleanups related to system suspend (Rafael J Wysocki).

    - cpuidle core cleanups to make the code more efficient (Rafael J
    Wysocki).

    - powernv/pseries cpuidle driver update (Shilpasri G Bhat).

    - cpufreq core fixes related to CPU online/offline that should reduce
    the overhead of these operations quite a bit, unless the CPU in
    question is physically going away (Viresh Kumar, Saravana Kannan).

    - serialization of cpufreq governor callbacks to avoid race
    conditions in some cases (Viresh Kumar).

    - intel_pstate driver fixes and cleanups (Doug Smythies, Prarit
    Bhargava, Joe Konno).

    - cpufreq driver (arm_big_little, cpufreq-dt, qoriq) updates (Sudeep
    Holla, Felipe Balbi, Tang Yuantian).

    - assorted cleanups in cpufreq drivers and core (Shailendra Verma,
    Fabian Frederick, Wang Long).

    - new Device Tree bindings for representing Operating Performance
    Points (Viresh Kumar).

    - updates for the common clock operations support code in the PM core
    (Rajendra Nayak, Geert Uytterhoeven).

    - PM domains core code update (Geert Uytterhoeven).

    - Intel Knights Landing support for the RAPL (Running Average Power
    Limit) power capping driver (Dasaratharaman Chandramouli).

    - fixes related to the floor frequency setting on Atom SoCs in the
    RAPL power capping driver (Ajay Thomas).

    - runtime PM framework documentation update (Ben Dooks).

    - cpupower tool fix (Herton R Krzesinski)"

    * tag 'pm+acpi-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (194 commits)
    cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state
    x86: Load __USER_DS into DS/ES after resume
    PM / OPP: Add binding for 'opp-suspend'
    PM / OPP: Allow multiple OPP tables to be passed via DT
    PM / OPP: Add new bindings to address shortcomings of existing bindings
    ACPI: Constify ACPI device IDs in documentation
    ACPI / enumeration: Document the rules regarding the PRP0001 device ID
    ACPI / video: Make acpi_video_unregister_backlight() private
    acpi-video-detect: Remove old API
    toshiba-acpi: Port to new backlight interface selection API
    thinkpad-acpi: Port to new backlight interface selection API
    sony-laptop: Port to new backlight interface selection API
    samsung-laptop: Port to new backlight interface selection API
    msi-wmi: Port to new backlight interface selection API
    msi-laptop: Port to new backlight interface selection API
    intel-oaktrail: Port to new backlight interface selection API
    ideapad-laptop: Port to new backlight interface selection API
    fujitsu-laptop: Port to new backlight interface selection API
    eeepc-laptop: Port to new backlight interface selection API
    dell-wmi: Port to new backlight interface selection API
    ...

    Linus Torvalds

07 May, 2015

1 commit

  • This API is useful for modifying a cpumask indicating some special
    nohz-type functionality, so that the nohz cores are automatically
    added to that set.
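
    In spirit the helper is a one-liner, roughly:

        static inline void tick_nohz_full_add_cpus_to(struct cpumask *mask)
        {
                if (tick_nohz_full_enabled())
                        cpumask_or(mask, mask, tick_nohz_full_mask);
        }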

    Signed-off-by: Chris Metcalf
    Signed-off-by: Frederic Weisbecker
    Acked-by: Peter Zijlstra (Intel)
    Cc: Borislav Petkov
    Cc: Dave Jones
    Cc: H. Peter Anvin
    Cc: Martin Schwidefsky
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Rafael J. Wysocki
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1429024675-18938-1-git-send-email-cmetcalf@ezchip.com
    Link: http://lkml.kernel.org/r/1430928266-24888-4-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Chris Metcalf

03 Apr, 2015

4 commits

  • clockevents_notify() is a leftover from the early design of the
    clockevents facility. It's really not a notification mechanism,
    it's a multiplex call. We are way better off with explicit
    calls instead of this monstrosity.

    Split out the cleanup function for a dead cpu and invoke it
    directly from the cpu down code. Make it conditional on
    CPU_HOTPLUG as well.

    Temporary change, will be refined in the future.

    Signed-off-by: Thomas Gleixner
    [ Rebased, added clockevents_notify() removal ]
    Signed-off-by: Rafael J. Wysocki
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1735025.raBZdQHM3m@vostro.rjw.lan
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
  • clockevents_notify() is a leftover from the early design of the
    clockevents facility. It's really not a notification mechanism,
    it's a multiplex call. We are way better off with explicit
    calls instead of this monstrosity.

    Split out the tick_handover call and invoke it explicitly from
    the hotplug code. This temporary solution will be cleaned up in
    later patches.

    Signed-off-by: Thomas Gleixner
    [ Rebase ]
    Signed-off-by: Rafael J. Wysocki
    Cc: Peter Zijlstra
    Cc: John Stultz
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/1658173.RkEEILFiQZ@vostro.rjw.lan
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
  • clockevents_notify() is a leftover from the early design of the
    clockevents facility. It's really not a notification mechanism,
    it's a multiplex call. We are way better off with explicit
    calls instead of this monstrosity.

    Split out the broadcast oneshot control into a separate function
    and provide inline helpers. Switch clockevents_notify() over.
    This will go away once all callers are converted.

    This also gets rid of the nested locking of clockevents_lock and
    broadcast_lock. The broadcast oneshot control functions do not
    require clockevents_lock; only the managing functions
    (setup/shutdown/suspend/resume of the broadcast device) require
    clockevents_lock.
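
    The inline helpers simply pass the explicit state, roughly:

        static inline int tick_broadcast_enter(void)
        {
                return tick_broadcast_oneshot_control(TICK_BROADCAST_ENTER);
        }

        static inline void tick_broadcast_exit(void)
        {
                tick_broadcast_oneshot_control(TICK_BROADCAST_EXIT);
        }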

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Rafael J. Wysocki
    Cc: Alexandre Courbot
    Cc: Daniel Lezcano
    Cc: Len Brown
    Cc: Peter Zijlstra
    Cc: Stephen Warren
    Cc: Thierry Reding
    Cc: Tony Lindgren
    Link: http://lkml.kernel.org/r/13000649.8qZuEDV0OA@vostro.rjw.lan
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
  • clockevents_notify() is a leftover from the early design of the
    clockevents facility. It's really not a notification mechanism,
    it's a multiplex call. We are way better off with explicit
    calls instead of this monstrosity.

    Split out the broadcast control into a separate function and
    provide inline helpers. Switch clockevents_notify() over. This
    will go away once all callers are converted.

    This also gets rid of the nested locking of clockevents_lock and
    broadcast_lock. The broadcast control functions do not require
    clockevents_lock; only the managing functions
    (setup/shutdown/suspend/resume of the broadcast device) require
    clockevents_lock.
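
    The resulting helpers have this shape (a sketch; the enum values
    follow the on/off/force modes that clockevents_notify() used to
    multiplex):

        static inline void tick_broadcast_enable(void)
        {
                tick_broadcast_control(TICK_BROADCAST_ON);
        }

        static inline void tick_broadcast_disable(void)
        {
                tick_broadcast_control(TICK_BROADCAST_OFF);
        }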

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Rafael J. Wysocki
    Cc: Daniel Lezcano
    Cc: Len Brown
    Cc: Peter Zijlstra
    Cc: Tony Lindgren
    Link: http://lkml.kernel.org/r/8086559.ttsuS0n1Xr@vostro.rjw.lan
    Signed-off-by: Ingo Molnar

    Thomas Gleixner

02 Apr, 2015

1 commit

  • It was found, when doing a hotplug stress test on POWER, that the
    machine hit either softlockups or rcu_sched stall warnings. The
    issue was traced to commit:

    7cba160ad789 ("powernv/cpuidle: Redesign idle states management")

    which exposed the cpu_down() race with hrtimer based broadcast mode:

    5d1638acb9f6 ("tick: Introduce hrtimer based broadcast")

    The race is the following:

    Assume CPU1 is the CPU which holds the hrtimer broadcasting duty
    before it is taken down.

    CPU0                                  CPU1

    cpu_down()                            take_cpu_down()
                                          disable_interrupts()
    cpu_die()

    while (CPU1 != CPU_DEAD) {
            msleep(100);
            switch_to_idle();
            stop_cpu_timer();
            schedule_broadcast();
    }

    tick_cleanup_cpu_dead()
            take_over_broadcast()

    After CPU1 has disabled interrupts it cannot handle the broadcast
    hrtimer anymore, so CPU0 will be stuck forever.

    Fix this by explicitly taking over broadcast duty before cpu_die().

    This is a temporary workaround. What we really want is a callback
    in the clockevent device which allows us to do that from the dying
    CPU by pushing the hrtimer onto a different cpu. That might involve
    an IPI and is definitely more complex than this immediate fix.

    Changelog was picked up from:

    https://lkml.org/lkml/2015/2/16/213

    Suggested-by: Thomas Gleixner
    Tested-by: Nicolas Pitre
    Signed-off-by: Preeti U. Murthy
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: mpe@ellerman.id.au
    Cc: nicolas.pitre@linaro.org
    Cc: peterz@infradead.org
    Cc: rjw@rjwysocki.net
    Fixes: http://linuxppc.10917.n7.nabble.com/offlining-cpus-breakage-td88619.html
    Link: http://lkml.kernel.org/r/20150330092410.24979.59887.stgit@preeti.in.ibm.com
    [ Merged it to the latest timer tree, renamed the callback, tidied up the changelog. ]
    Signed-off-by: Ingo Molnar

    Preeti U Murthy

01 Apr, 2015

4 commits

  • Use the new tick_suspend_local()/tick_resume_local() functions and
    get rid of the homebrew implementation of these in the ARM bL
    switcher. The check for the cpumask is completely pointless: there
    is no harm in suspending a per-cpu tick device unconditionally. If
    that were a real issue, we would fix it properly at the core level,
    not with some completely undocumented hacks in some random core
    code.

    Move the tick internals to the core code, now that this nuisance
    is gone.

    Signed-off-by: Thomas Gleixner
    [ rjw: Rebase, changelog ]
    Signed-off-by: Rafael J. Wysocki
    Cc: Nicolas Pitre
    Cc: Peter Zijlstra
    Cc: Russell King
    Link: http://lkml.kernel.org/r/1655112.Ws17YsMfN7@vostro.rjw.lan
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
  • Xen calls into tick_resume() on every cpu, which is just wrong.
    tick_resume() is for the syscore global suspend/resume
    invocation. What Xen really wants is a per-cpu local resume
    function.

    Provide a tick_resume_local() function and use it in Xen.

    Also provide a complementary tick_suspend_local() and modify
    tick_unfreeze() and tick_freeze(), respectively, to use the
    new local tick resume/suspend functions.
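
    On the freeze side, the per-cpu suspend slots in roughly like this
    (a simplified sketch of tick_freeze(); the lock and depth-counter
    names may differ from upstream):

        static DEFINE_RAW_SPINLOCK(tick_freeze_lock);
        static unsigned int tick_freeze_depth;

        void tick_freeze(void)
        {
                raw_spin_lock(&tick_freeze_lock);

                tick_freeze_depth++;
                if (tick_freeze_depth == num_online_cpus())
                        timekeeping_suspend();  /* last CPU stops timekeeping too */
                else
                        tick_suspend_local();   /* others stop their own tick only */

                raw_spin_unlock(&tick_freeze_lock);
        }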

    Signed-off-by: Thomas Gleixner
    [ Combined two patches, rebased, modified subject/changelog. ]
    Signed-off-by: Rafael J. Wysocki
    Cc: Boris Ostrovsky
    Cc: David Vrabel
    Cc: Konrad Rzeszutek Wilk
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1698741.eezk9tnXtG@vostro.rjw.lan
    [ Merged to latest timers/core. ]
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
  • clockevents_notify() is a leftover from the early design of the
    clockevents facility. It's really not a notification mechanism,
    it's a multiplex call.

    We are way better off with explicit calls instead of this
    monstrosity. Split out the suspend/resume() calls and invoke
    them directly from the call sites.

    No locking required at this point because these calls happen
    with interrupts disabled and a single cpu online.

    Signed-off-by: Thomas Gleixner
    [ Rebased on top of 4.0-rc5. ]
    Signed-off-by: Rafael J. Wysocki
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/713674030.jVm1qaHuPf@vostro.rjw.lan
    [ Rebased on top of latest timers/core. ]
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
  • No point in exposing everything to the world. People just believe
    such functions can be abused for whatever purpose. Sigh.

    Signed-off-by: Thomas Gleixner
    [ Rebased on top of 4.0-rc5 ]
    Signed-off-by: Rafael J. Wysocki
    Cc: Nicolas Pitre
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/28017337.VbCUc39Gme@vostro.rjw.lan
    [ Merged to latest timers/core ]
    Signed-off-by: Ingo Molnar

    Thomas Gleixner

16 Feb, 2015

1 commit

  • The efficiency of suspend-to-idle depends on being able to keep CPUs
    in the deepest available idle states for as much time as possible.
    Ideally, they should only be brought out of idle by system wakeup
    interrupts.

    However, timer interrupts occurring periodically prevent that from
    happening and it is not practical to chase all of the "misbehaving"
    timers in a whack-a-mole fashion. A much more effective approach is
    to suspend the local ticks for all CPUs and the entire timekeeping
    along the lines of what is done during full suspend, which also
    helps to keep suspend-to-idle and full suspend reasonably similar.

    The idea is to suspend the local tick on each CPU executing
    cpuidle_enter_freeze() and to make the last of them suspend the
    entire timekeeping. That should prevent timer interrupts from
    triggering until an IO interrupt wakes up one of the CPUs. It
    needs to be done with interrupts disabled on all of the CPUs,
    though, because otherwise the suspended clocksource might be
    accessed by an interrupt handler which might lead to fatal
    consequences.

    Unfortunately, the existing ->enter callbacks provided by cpuidle
    drivers generally cannot be used for implementing that, because some
    of them re-enable interrupts temporarily and some idle entry methods
    cause interrupts to be re-enabled automatically on exit. Also some
    of these callbacks manipulate local clock event devices of the CPUs
    which really shouldn't be done after suspending their ticks.

    To overcome that difficulty, introduce a new cpuidle state callback,
    ->enter_freeze, that will be guaranteed (1) to keep interrupts
    disabled all the time (and return with interrupts disabled) and (2)
    not to touch the CPU timer devices. Modify cpuidle_enter_freeze() to
    look for the deepest available idle state with ->enter_freeze present
    and to make the CPU execute that callback with suspended tick (and the
    last of the online CPUs to execute it with suspended timekeeping).
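
    A hypothetical driver wiring for the new callback (all my_* names
    are made up; only the ->enter/->enter_freeze signatures follow the
    description above):

        #include <linux/cpuidle.h>

        extern void my_platform_deep_idle(void);        /* made-up hook */

        static int my_deep_enter(struct cpuidle_device *dev,
                                 struct cpuidle_driver *drv, int index)
        {
                my_platform_deep_idle();
                return index;
        }

        /* Suspend-to-idle path: must keep interrupts disabled and must
         * not touch the local clock event device. */
        static void my_deep_enter_freeze(struct cpuidle_device *dev,
                                         struct cpuidle_driver *drv, int index)
        {
                my_platform_deep_idle();
        }

        static struct cpuidle_driver my_cpuidle_driver = {
                .name           = "my_cpuidle",
                .state_count    = 1,
                .states[0]      = {
                        .name           = "DEEP",
                        .enter          = my_deep_enter,
                        .enter_freeze   = my_deep_enter_freeze,
                },
        };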

    Suggested-by: Thomas Gleixner
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki

14 Oct, 2014

1 commit

  • Pull s390 updates from Martin Schwidefsky:
    "This patch set contains the main portion of the changes for 3.18 in
    regard to the s390 architecture. It is a bit bigger than usual,
    mainly because of a new driver and the vector extension patches.

    The interesting bits are:
    - Quite a bit of work on the tracing front. Uprobes is enabled and
    the ftrace code is reworked to get some of the lost performance
    back if CONFIG_FTRACE is enabled.
    - To improve boot time with CONFIG_DEBUG_PAGEALLOC, support for the
    IPTE range facility is added.
    - The rwlock code is re-factored to improve writer fairness and to be
    able to use the interlocked-access instructions.
    - The kernel part for the support of the vector extension is added.
    - The device driver to access the CD/DVD on the HMC is added, this
    will hopefully come in handy to improve the installation process.
    - Add support for control-unit initiated reconfiguration.
    - The crypto device driver is enhanced to enable the additional AP
    domains and to allow the new crypto hardware to be used.
    - Bug fixes"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (39 commits)
    s390/ftrace: simplify enabling/disabling of ftrace_graph_caller
    s390/ftrace: remove 31 bit ftrace support
    s390/kdump: add support for vector extension
    s390/disassembler: add vector instructions
    s390: add support for vector extension
    s390/zcrypt: Toleration of new crypto hardware
    s390/idle: consolidate idle functions and definitions
    s390/nohz: use a per-cpu flag for arch_needs_cpu
    s390/vtime: do not reset idle data on CPU hotplug
    s390/dasd: add support for control unit initiated reconfiguration
    s390/dasd: fix infinite loop during format
    s390/mm: make use of ipte range facility
    s390/setup: correct 4-level kernel page table detection
    s390/topology: call set_sched_topology early
    s390/uprobes: architecture backend for uprobes
    s390/uprobes: common library for kprobes and uprobes
    s390/rwlock: use the interlocked-access facility 1 instructions
    s390/rwlock: improve writer fairness
    s390/rwlock: remove interrupt-enabling rwlock variant.
    s390/mm: remove change bit override support
    ...

    Linus Torvalds

14 Sep, 2014

1 commit

  • This way we unbloat main.c a bit and, more importantly, we
    initialize nohz full after init_IRQ(). This dependency will be
    needed in further patches because nohz full needs irq work to raise
    its own IRQ. Information about the support for this ability on
    ARM64 is obtained in init_IRQ(), which initializes the pointer to
    __smp_call_function.

    Since tick_init() is called right after init_IRQ(), this is a good place
    to call tick_nohz_init() and prepare for that dependency.
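
    After the move, the ordering is simply (the shape of the resulting
    code in kernel/time/tick-common.c):

        void __init tick_init(void)
        {
                tick_broadcast_init();
                tick_nohz_init();       /* now guaranteed to run after init_IRQ() */
        }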

    Acked-by: Peter Zijlstra (Intel)
    Cc: Ingo Molnar
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker

05 Sep, 2014

1 commit

  • The local nohz kick is currently used by perf, which needs it to be
    NMI-safe. A recent commit, though (7d1311b93e58ed55f3a31cc8f94c4b8fe988a2b9),
    changed its implementation to fire the local kick using the remote
    kick API. That made the code more generic, but the remote kick isn't
    NMI-safe.

    As a result:

    WARNING: CPU: 3 PID: 18062 at kernel/irq_work.c:72 irq_work_queue_on+0x11e/0x140()
    CPU: 3 PID: 18062 Comm: trinity-subchil Not tainted 3.16.0+ #34
    0000000000000009 00000000903774d1 ffff880244e06c00 ffffffff9a7f1e37
    0000000000000000 ffff880244e06c38 ffffffff9a0791dd ffff880244fce180
    0000000000000003 ffff880244e06d58 ffff880244e06ef8 0000000000000000
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] warn_slowpath_common+0x7d/0xa0
    [] warn_slowpath_null+0x1a/0x20
    [] irq_work_queue_on+0x11e/0x140
    [] tick_nohz_full_kick_cpu+0x57/0x90
    [] __perf_event_overflow+0x275/0x350
    [] ? perf_event_task_disable+0xa0/0xa0
    [] ? x86_perf_event_set_period+0xbf/0x150
    [] perf_event_overflow+0x14/0x20
    [] intel_pmu_handle_irq+0x206/0x410
    [] ? arch_vtime_task_switch+0x63/0x130
    [] perf_event_nmi_handler+0x2b/0x50
    [] nmi_handle+0xd2/0x390
    [] ? nmi_handle+0x5/0x390
    [] ? lock_release+0xab/0x330
    [] default_do_nmi+0x72/0x1c0
    [] ? cpuacct_account_field+0xcf/0x200
    [] do_nmi+0xb8/0x100

    Let's fix this by restoring the use of local irq work for the nohz
    local kick.
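
    The restored local kick, in sketch form (nohz_full_kick_work being
    the per-cpu irq_work used by tick-sched.c):

        /* irq_work_queue() is NMI-safe on the local CPU, unlike
         * irq_work_queue_on(), so the local kick must use it. */
        void tick_nohz_full_kick(void)
        {
                if (!tick_nohz_full_cpu(smp_processor_id()))
                        return;

                irq_work_queue(this_cpu_ptr(&nohz_full_kick_work));
        }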

    Reported-by: Catalin Iacob
    Reported-and-tested-by: Dave Jones
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker

05 Aug, 2014

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - Move the nohz kick code out of the scheduler tick to a dedicated IPI,
    from Frederic Weisbecker.

    This necessitated quite some background infrastructure rework,
    including:

    * Clean up some irq-work internals
    * Implement remote irq-work
    * Implement nohz kick on top of remote irq-work
    * Move full dynticks timer enqueue notification to new kick
    * Move multi-task notification to new kick
    * Remove unnecessary barriers on multi-task notification

    - Remove proliferation of wait_on_bit() action functions and allow
    wait_on_bit_action() functions to support a timeout. (Neil Brown)

    - Another round of sched/numa improvements, cleanups and fixes. (Rik
    van Riel)

    - Implement fast idling of CPUs when the system is partially loaded,
    for better scalability. (Tim Chen)

    - Restructure and fix the CPU hotplug handling code that may leave
    cfs_rq and rt_rq's throttled when tasks are migrated away from a dead
    cpu. (Kirill Tkhai)

    - Robustify the sched topology setup code. (Peter Zijlstra)

    - Improve sched_feat() handling wrt. static_keys (Jason Baron)

    - Misc fixes.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    sched/fair: Fix 'make xmldocs' warning caused by missing description
    sched: Use macro for magic number of -1 for setparam
    sched: Robustify topology setup
    sched: Fix sched_setparam() policy == -1 logic
    sched: Allow wait_on_bit_action() functions to support a timeout
    sched: Remove proliferation of wait_on_bit() action functions
    sched/numa: Revert "Use effective_load() to balance NUMA loads"
    sched: Fix static_key race with sched_feat()
    sched: Remove extra static_key*() function indirection
    sched/rt: Fix replenish_dl_entity() comments to match the current upstream code
    sched: Transform resched_task() into resched_curr()
    sched/deadline: Kill task_struct->pi_top_task
    sched: Rework check_for_tasks()
    sched/rt: Enqueue just unthrottled rt_rq back on the stack in __disable_runtime()
    sched/fair: Disable runtime_enabled on dying rq
    sched/numa: Change scan period code to match intent
    sched/numa: Rework best node setting in task_numa_migrate()
    sched/numa: Examine a task move when examining a task swap
    sched/numa: Simplify task_numa_compare()
    sched/numa: Use effective_load() to balance NUMA loads
    ...

    Linus Torvalds

10 Jul, 2014

1 commit

  • Binding the grace-period kthreads to the timekeeping CPU resulted in
    significant performance decreases for some workloads. For more detail,
    see:

    https://lkml.org/lkml/2014/6/3/395 for benchmark numbers

    https://lkml.org/lkml/2014/6/4/218 for CPU statistics

    It turns out that it is necessary to bind the grace-period kthreads
    to the timekeeping CPU only when every CPU except CPU 0 is a
    nohz_full CPU, or when CONFIG_NO_HZ_FULL_SYSIDLE=y.
    In other cases, it suffices to bind the grace-period kthreads to the
    set of non-nohz_full CPUs.

    This commit therefore creates a tick_nohz_not_full_mask that is the
    complement of tick_nohz_full_mask, and then binds the grace-period
    kthread to the set of CPUs indicated by this new mask, which covers
    the CONFIG_NO_HZ_FULL_SYSIDLE=n case. The CONFIG_NO_HZ_FULL_SYSIDLE=y
    case still binds the grace-period kthreads to the timekeeping CPU.
    This commit also includes the tick_nohz_full_enabled() check suggested
    by Frederic Weisbecker.
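
    The resulting helper looks roughly like this (sketched from the
    description above):

        static inline void housekeeping_affine(struct task_struct *t)
        {
        #ifdef CONFIG_NO_HZ_FULL
                if (tick_nohz_full_enabled())
                        set_cpus_allowed_ptr(t, housekeeping_mask);
        #endif
        }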

    Reported-by: Jet Chen
    Signed-off-by: Paul E. McKenney
    [ paulmck: Created housekeeping_affine() and housekeeping_mask per
    fweisbec feedback. ]

    Paul E. McKenney

16 Jun, 2014

1 commit

  • Remotely kicking a full nohz CPU in order to make it re-evaluate its
    next tick is currently implemented using the scheduler IPI.

    However, this bloats a scheduler fast path with an off-topic feature.
    The scheduler IPI was abused here for its cool "callable
    anywhere/anytime" properties.

    But now that the irq work subsystem can queue remote callbacks, it's
    a perfect fit to safely queue IPIs when interrupts are disabled
    without worrying about concurrent callers.

    So let's implement the remote kick on top of irq work. This is going
    to be used when a new event requires the next tick to be recalculated:
    more than one task competing for the CPU, a timer being armed, ...
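
    In sketch form, the remote kick then becomes (with
    nohz_full_kick_work as the per-cpu irq_work):

        /* Kick a remote full dynticks CPU so it re-evaluates its tick.
         * Safe with interrupts disabled thanks to remote irq work. */
        void tick_nohz_full_kick_cpu(int cpu)
        {
                if (!tick_nohz_full_cpu(cpu))
                        return;

                irq_work_queue_on(&per_cpu(nohz_full_kick_work, cpu), cpu);
        }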

    Acked-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Viresh Kumar
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker

16 Jan, 2014

1 commit

  • This makes the code more symmetric with the existing tick functions
    called on irq exit: tick_irq_exit() and tick_nohz_irq_exit().

    These functions are also symmetric in that they mirror each other's
    actions: we start accounting idle time on irq exit and stop that
    accounting on irq entry. Likewise, the tick is stopped on irq exit
    and timekeeping catches up with the tickless time elapsed when we
    reach irq entry.

    This rename was suggested by Peter Zijlstra a long while ago, but it
    got lost in the shuffle.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Alex Shi
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: John Stultz
    Cc: Kevin Hilman
    Link: http://lkml.kernel.org/r/1387320692-28460-2-git-send-email-fweisbec@gmail.com
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker

14 Aug, 2013

3 commits

  • …rederic/linux-dynticks into timers/nohz

    Pull nohz improvements from Frederic Weisbecker:

    " It mostly contains fixes and full dynticks off-case optimizations. I believe that
    distros want to enable this feature so it seems important to optimize the case
    where the "nohz_full=" parameter is empty. ie: I'm trying to remove any performance
    regression that comes with NO_HZ_FULL=y when the feature is not used.

    This patchset improves the current situation a lot (off-case appears to be around 11% faster
    with hackbench, although I guess it may vary depending on the configuration but it should be
    significantly faster in any case) now there is still some work to do: I can still observe a
    remaining loss of 1.6% throughput seen with hackbench compared to CONFIG_NO_HZ_FULL=n. "

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
  • Scheduler IPIs and task context switches are serious fast paths.
    Let's hide, as much as we can, the off-case impact of the full
    dynticks APIs called from these sites, through the use of static
    keys.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
  • These APIs are frequently accessed, and priority is given to
    optimizing the full dynticks off-case so that distros can enable
    this feature without suffering significant performance
    regressions.

    Let's inline these APIs and optimize them with static keys.
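
    The pattern, sketched (the static key name here is illustrative;
    upstream reuses an existing key rather than defining a new one):

        #include <linux/jump_label.h>
        #include <linux/cpumask.h>

        extern struct static_key nohz_full_key;         /* illustrative */
        extern cpumask_var_t tick_nohz_full_mask;

        static inline bool tick_nohz_full_enabled(void)
        {
                /* off case: a patched-out jump, essentially free */
                return static_key_false(&nohz_full_key);
        }

        static inline bool tick_nohz_full_cpu(int cpu)
        {
                if (!tick_nohz_full_enabled())
                        return false;

                return cpumask_test_cpu(cpu, tick_nohz_full_mask);
        }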

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker

29 Jul, 2013

1 commit

  • Revert commit 69a37bea (cpuidle: Quickly notice prediction failure for
    repeat mode), because it has been identified as the source of a
    significant performance regression in v3.8 and later, as explained
    by Jeremy Eder:

    We believe we've identified a particular commit to the cpuidle code
    that seems to be impacting performance of a variety of workloads.
    The simplest way to reproduce is using the netperf TCP_RR test, so
    we're using that, on a pair of Sandy Bridge based servers. We also
    have data from a large database setup where performance is also
    measurably/positively impacted, though that test data isn't easily
    share-able.

    Included below are test results from 3 test kernels:

    kernel reverts
    -----------------------------------------------------------
    1) vanilla upstream (no reverts)

    2) perfteam2 reverts e11538d1f03914eb92af5a1a378375c05ae8520c

    3) test reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
    e11538d1f03914eb92af5a1a378375c05ae8520c

    In summary, netperf TCP_RR numbers improve by approximately 4%
    after reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4. When
    69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency
    never seems to get above 40%. Taking that patch out gets C0 near
    100% quite often, and performance increases.

    The below data are histograms representing the %c0 residency @
    1-second sample rates (using turbostat), while under netperf test.

    - If you look at the first 4 histograms, you can see %c0 residency
    almost entirely in the 30,40% bin.
    - The last pair, which reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4,
    shows %c0 in the 80,90,100% bins.

    Below each kernel name are netperf TCP_RR trans/s numbers for the
    particular kernel that can be disclosed publicly, comparing the 3
    test kernels. We ran a 4th test with the vanilla kernel where
    we've also set /dev/cpu_dma_latency=0 to show overall impact
    boosting single-threaded TCP_RR performance over 11% above
    baseline.

    3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):
    TCP_RR trans/s 54323.78

    -----------------------------------------------------------
    3.10-rc2 vanilla RX (no reverts)
    TCP_RR trans/s 48192.47

    Receiver %c0
    0.0000 - 10.0000 [ 1]: *
    10.0000 - 20.0000 [ 0]:
    20.0000 - 30.0000 [ 0]:
    30.0000 - 40.0000 [ 59]: ***********************************************************
    40.0000 - 50.0000 [ 1]: *
    50.0000 - 60.0000 [ 0]:
    60.0000 - 70.0000 [ 0]:
    70.0000 - 80.0000 [ 0]:
    80.0000 - 90.0000 [ 0]:
    90.0000 - 100.0000 [ 0]:

    Sender %c0
    0.0000 - 10.0000 [ 1]: *
    10.0000 - 20.0000 [ 0]:
    20.0000 - 30.0000 [ 0]:
    30.0000 - 40.0000 [ 11]: ***********
    40.0000 - 50.0000 [ 49]: *************************************************
    50.0000 - 60.0000 [ 0]:
    60.0000 - 70.0000 [ 0]:
    70.0000 - 80.0000 [ 0]:
    80.0000 - 90.0000 [ 0]:
    90.0000 - 100.0000 [ 0]:

    -----------------------------------------------------------
    3.10-rc2 perfteam2 RX (reverts commit e11538d1f03914eb92af5a1a378375c05ae8520c)
    TCP_RR trans/s 49698.69

    Receiver %c0
    0.0000 - 10.0000 [ 1]: *
    10.0000 - 20.0000 [ 1]: *
    20.0000 - 30.0000 [ 0]:
    30.0000 - 40.0000 [ 59]: ***********************************************************
    40.0000 - 50.0000 [ 0]:
    50.0000 - 60.0000 [ 0]:
    60.0000 - 70.0000 [ 0]:
    70.0000 - 80.0000 [ 0]:
    80.0000 - 90.0000 [ 0]:
    90.0000 - 100.0000 [ 0]:

    Sender %c0
    0.0000 - 10.0000 [ 1]: *
    10.0000 - 20.0000 [ 0]:
    20.0000 - 30.0000 [ 0]:
    30.0000 - 40.0000 [ 2]: **
    40.0000 - 50.0000 [ 58]: **********************************************************
    50.0000 - 60.0000 [ 0]:
    60.0000 - 70.0000 [ 0]:
    70.0000 - 80.0000 [ 0]:
    80.0000 - 90.0000 [ 0]:
    90.0000 - 100.0000 [ 0]:

    -----------------------------------------------------------
    3.10-rc2 test RX (reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
    and e11538d1f03914eb92af5a1a378375c05ae8520c)
    TCP_RR trans/s 47766.95

    Receiver %c0
    0.0000 - 10.0000 [ 1]: *
    10.0000 - 20.0000 [ 1]: *
    20.0000 - 30.0000 [ 0]:
    30.0000 - 40.0000 [ 27]: ***************************
    40.0000 - 50.0000 [ 2]: **
    50.0000 - 60.0000 [ 0]:
    60.0000 - 70.0000 [ 2]: **
    70.0000 - 80.0000 [ 0]:
    80.0000 - 90.0000 [ 0]:
    90.0000 - 100.0000 [ 28]: ****************************

    Sender:
    0.0000 - 10.0000 [ 1]: *
    10.0000 - 20.0000 [ 0]:
    20.0000 - 30.0000 [ 0]:
    30.0000 - 40.0000 [ 11]: ***********
    40.0000 - 50.0000 [ 0]:
    50.0000 - 60.0000 [ 1]: *
    60.0000 - 70.0000 [ 0]:
    70.0000 - 80.0000 [ 3]: ***
    80.0000 - 90.0000 [ 7]: *******
    90.0000 - 100.0000 [ 38]: **************************************

    These results demonstrate gaining back the tendency of the CPU to
    stay in more responsive, performant C-states (and thus yield
    measurably better performance), by reverting commit
    69a37beabf1f0a6705c08e879bdd5d82ff6486c4.

    Requested-by: Jeremy Eder
    Tested-by: Len Brown
    Cc: 3.8+
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki

23 Apr, 2013

2 commits

  • When a task is scheduled in, it may have some properties
    of its own that could make the CPU reconsider the need for
    the tick: posix cpu timers, perf events, ...

    So notify the full dynticks subsystem when a task gets
    scheduled in and re-check the tick dependency at this
    stage. This is done through a self-IPI to avoid messing
    with any current lock scenario.

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
  • The scheduler IPI is used by the scheduler to kick
    full dynticks CPUs asynchronously when more than one
    task is running or when a new timer list timer is
    enqueued. This way the destination CPU can decide
    to restart the tick to handle this new situation.

    Now let's issue that kick from the scheduler IPI.

    (Reusing the scheduler IPI rather than implementing
    a new IPI was suggested by Peter Zijlstra a while ago)

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker

19 Apr, 2013

2 commits

  • We need full dynticks CPUs to also be RCU nocb CPUs so
    that we don't have to keep the tick around to handle RCU
    callbacks.

    Make sure the range passed to the nohz_full= boot
    parameter is a subset of rcu_nocbs=.

    The CPUs that fail to meet this requirement will be
    excluded from the nohz_full range. This is checked
    early at boot time, before any CPU has the opportunity
    to stop its tick.
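
    The boot-time check amounts to something like this (a sketch,
    assuming tick_nohz_full_mask and rcu_nocb_mask have already been
    parsed; the function name is illustrative):

        static void __init nohz_full_trim_to_nocb(void)
        {
                if (cpumask_subset(tick_nohz_full_mask, rcu_nocb_mask))
                        return;

                pr_warn("NO_HZ: nohz_full CPUs not in rcu_nocbs=, excluding them\n");
                cpumask_and(tick_nohz_full_mask, tick_nohz_full_mask,
                            rcu_nocb_mask);
        }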

    Suggested-by: Steven Rostedt
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
  • Provide two new helpers in order to notify the full dynticks CPUs
    about some internal system changes against which they may need to
    reconsider the state of their tick. Practical examples include:
    posix cpu timers, the perf tick and the sched clock tick.

    For now the notifying handler, implemented through IPIs, is a stub
    that will be filled in when we get the tick stop/restart
    infrastructure in place.

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker

16 Apr, 2013

1 commit

  • "Extended nohz" was used as a naming base for the full dynticks
    API and Kconfig symbols. It reflects the fact the system tries
    to stop the tick in more places than just idle.

    But that "extended" name is a bit opaque and vague. Rename it to
    "full" makes it clearer what the system tries to do under this
    config: try to shutdown the tick anytime it can. The various
    constraints that prevent that to happen shouldn't be considered
    as fundamental properties of this feature but rather technical
    issues that may be solved in the future.

    Reported-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker

03 Apr, 2013

1 commit

  • We are planning to convert the dynticks Kconfig options layout
    into a choice menu. The user must be able to easily pick
    any of the following implementations: constant periodic tick,
    idle dynticks, full dynticks.

    As this implies mutual exclusion, the two dynticks implementations
    need to converge on the selection of a common Kconfig option in order
    to ease the sharing of a common infrastructure.

    It would thus seem pretty natural to reuse CONFIG_NO_HZ to
    that end. It already implements all the idle dynticks code,
    and the full dynticks code depends on all of that for now.
    So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and
    CONFIG_NO_HZ_EXTENDED, and both would select CONFIG_NO_HZ.

    On the other hand we want to stay backward compatible: if
    CONFIG_NO_HZ is set in an older config file, we want to
    enable CONFIG_NO_HZ_IDLE by default.

    But we can't afford both at the same time or we run into
    a circular dependency:

    1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select
    CONFIG_NO_HZ
    2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE

    We might be able to support that from Kconfig/Kbuild but it
    may not be wise to introduce such a confusing behaviour.

    So to solve this, create a new CONFIG_NO_HZ_COMMON option
    which gathers the common code between idle and full dynticks
    (that common code for now is simply the idle dynticks code)
    and select it from their referring Kconfig.

    Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ
    to it for backward compatibility.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker