09 Jun, 2017

1 commit

  • When the tick is stopped and we reach the dynticks evaluation code on
    IRQ exit, we perform a soft tick restart if we observe an expired timer
    from there. It means we program the nearest possible tick but we stay in
    dynticks mode (ts->tick_stopped = 1) because we may need to stop the tick
    again after that expired timer is handled.

    Now this solution works most of the time but if we suffer an IRQ storm
    and those interrupts trigger faster than the hardware clockevents min
    delay, our tick won't fire until that IRQ storm is finished.

    Here is the problem: on IRQ exit we reprogram the timer to at least
    NOW() + min_clockevents_delay. Another IRQ fires before the tick does,
    so we reschedule again to NOW() + min_clockevents_delay, etc... The
    tick is eternally rescheduled min_clockevents_delay ahead.
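
    In pseudo-code (hypothetical variable names, not the actual kernel
    code), the pathological loop looks like this:

        /* On every IRQ exit while the tick is stopped: */
        if (tick_deadline <= now)                        /* expired timer seen */
            tick_deadline = now + min_clockevents_delay; /* soft restart */

        /* If IRQs arrive faster than min_clockevents_delay, the deadline
           stays ahead of now forever and the tick never fires. */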

    A solution is to simply remove this soft tick restart. After all, the
    normal dynticks evaluation path can handle a 0 delay just fine. By
    doing that we also benefit from the optimization branch which avoids
    clock reprogramming if the clockevents deadline hasn't changed since
    the last reprogramming. This fixes our issue because we no longer do
    repetitive clock reprogramming that always adds the hardware min delay.

    As a side effect it should even optimize the 0 delay path in general.

    Reported-and-tested-by: Octavian Purdila
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

12 Jan, 2017

1 commit

  • commit c1a9eeb938b5433947e5ea22f89baff3182e7075 upstream.

    When a dysfunctional timer, e.g. a dummy timer, is installed, the tick core
    tries to setup the broadcast timer.

    If no broadcast device is installed, the kernel crashes with a NULL pointer
    dereference in tick_broadcast_setup_oneshot() because the function has no
    sanity check.

    Reported-by: Mason
    Signed-off-by: Thomas Gleixner
    Cc: Mark Rutland
    Cc: Anna-Maria Gleixner
    Cc: Richard Cochran
    Cc: Sebastian Andrzej Siewior
    Cc: Daniel Lezcano
    Cc: Peter Zijlstra ,
    Cc: Sebastian Frias
    Cc: Thibaud Cornic
    Cc: Robin Murphy
    Link: http://lkml.kernel.org/r/1147ef90-7877-e4d2-bb2b-5c4fa8d3144b@free.fr
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

09 Jan, 2017

1 commit

  • commit 9c1645727b8fa90d07256fdfcc45bf831242a3ab upstream.

    The clocksource delta to nanoseconds conversion is using signed math, but
    the delta is unsigned. This makes the conversion space smaller than
    necessary and in case of a multiplication overflow the conversion can
    become negative. The conversion is done with scaled math:

    s64 nsec_delta = ((s64)clkdelta * clk->mult) >> clk->shift;

    Shifting a signed integer right obviously preserves the sign, which has
    interesting consequences:

    - Time jumps backwards

    - __iter_div_u64_rem(), which is used in one of the calling code paths,
    will take forever to piecewise calculate the seconds/nanoseconds part.

    This has been reported by several people with different scenarios:

    David observed that when stopping a VM with a debugger:

    "It was essentially the stopped by debugger case. I forget exactly why,
    but the guest was being explicitly stopped from outside, it wasn't just
    scheduling lag. I think it was something in the vicinity of 10 minutes
    stopped."

    When lifting the stop the machine went dead.

    The stopped by debugger case is not really interesting, but nevertheless it
    would be a good thing not to die completely.

    But this was also observed on a live system by Liav:

    "When the OS is too overloaded, delta will get a high enough value for the
    msb of the sum delta * tkr->mult + tkr->xtime_nsec to be set, and so
    after the shift the nsec variable will gain a value similar to
    0xffffffffff000000."

    Unfortunately this has been reintroduced recently with commit 6bd58f09e1d8
    ("time: Add cycles to nanoseconds translation"). It had been fixed a year
    ago already in commit 35a4933a8959 ("time: Avoid signed overflow in
    timekeeping_get_ns()").

    Though it's not surprising that the issue has been reintroduced, because
    the function itself and the whole call chain use s64 for the result and
    its propagation. The change in this recent commit is subtle:

    s64 nsec;

    - nsec = (d * m + n) >> s;
    + nsec = d * m + n;
    + nsec >>= s;

    d being of type cycle_t adds another level of obfuscation.
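
    A userspace demonstration (with made-up mult/shift values; the kernel
    itself relies on -fno-strict-overflow for the wrapping multiplication)
    makes the failure mode concrete:

        #include <stdio.h>
        #include <stdint.h>

        int main(void)
        {
            uint64_t delta = 1ULL << 41; /* huge delta after a long stop */
            uint32_t mult  = 1U << 22;   /* illustrative scaled mult */
            uint32_t shift = 22;

            /* The product sets bit 63; the arithmetic shift sign-extends. */
            int64_t  s = (int64_t)(delta * mult) >> shift;
            /* Unsigned math uses the full conversion space. */
            uint64_t u = (delta * mult) >> shift;

            printf("signed:   %lld\n", (long long)s);   /* negative */
            printf("unsigned: %llu\n", (unsigned long long)u);
            return 0;
        }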

    This wouldn't have happened if the previous change to unsigned
    computation had made the 'nsec' variable u64 right away and a follow-up
    patch had cleaned up the whole call chain.

    There have been patches submitted which basically did a revert of the
    above patch, leaving everything else unchanged as signed. Back to square
    one. This spawned an admittedly pointless discussion about potential
    users which rely on the unsigned behaviour, until someone pointed out
    that it had been fixed before. The changelogs of said patches added
    further confusion as they ultimately made false claims about the
    consequences for eventual users which expect signed results.

    Despite delta being cycle_t, aka. u64, it's very well possible to hand
    in a signed negative value and the signed computation will happily
    return the correct result. But nobody actually sat down and analyzed
    the code which was added as a user after the probably unintended signed
    conversion.

    Though in sensitive code like this it's better to analyze it properly
    and make sure that nothing relies on this than to hunt the subtle
    wreckage half a year later. After analyzing all call chains it stands
    that no caller can hand in a negative value (which actually would work
    due to the s64 cast) and rely on the signed math to do the right thing.

    Change the conversion function to unsigned math. The conversion of all call
    chains is done in a follow up patch.

    This solves the starvation issue, which was caused by the negative
    result, but it does not solve the underlying problem. It merely
    postpones it. When the timekeeper update is deferred long enough that
    the unsigned multiplication overflows, time going backwards becomes
    observable again.

    Nor does it solve the issue of clocksources with a small counter width,
    which will wrap around possibly several times and cause random time
    stamps to be generated. But those are usually not found on systems used
    for virtualization, so this is likely a non-issue.

    I took the liberty to claim authorship for this simply because
    analyzing all callsites and writing the changelog took substantially
    more time than just making the simple s/s64/u64/ change and ignore the
    rest.

    Fixes: 6bd58f09e1d8 ("time: Add cycles to nanoseconds translation")
    Reported-by: David Gibson
    Reported-by: Liav Rehana
    Signed-off-by: Thomas Gleixner
    Reviewed-by: David Gibson
    Acked-by: Peter Zijlstra (Intel)
    Cc: Prarit Bhargava
    Cc: Laurent Vivier
    Cc: "Christopher S. Hall"
    Cc: Chris Metcalf
    Cc: Richard Cochran
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/20161208204228.688545601@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

25 Oct, 2016

4 commits

  • When a timer is enqueued we try to forward the timer base clock. This
    mechanism has two issues:

    1) Forwarding a remote base unlocked

    The forwarding function is called from get_target_base() with the current
    timer base lock held. But if the new target base is a different base than
    the current base (can happen with NOHZ, sigh!) then the forwarding is done
    on an unlocked base. This can lead to corruption of base->clk.

    Solution is simple: Invoke the forwarding after the target base is locked.

    2) Possible corruption due to jiffies advancing

    This is similar to the issue in get_next_timer_interrupt() which was
    fixed in the previous patch. jiffies can advance between check and
    assignment, thereby advancing base->clk beyond the next expiry value.

    So we need to read jiffies into a local variable once and do the checks and
    assignment with the local copy.
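
    A simplified sketch of the fixed forwarding logic:

        unsigned long jnow = READ_ONCE(jiffies);    /* snapshot once */

        if (time_after(base->next_expiry, jnow))
            base->clk = jnow;
        else
            base->clk = base->next_expiry;          /* never skip an expiry */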

    Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
    Reported-by: Ashton Holmes
    Reported-by: Michael Thayer
    Signed-off-by: Thomas Gleixner
    Cc: Michal Necasek
    Cc: Peter Zijlstra
    Cc: knut.osmundsen@oracle.com
    Cc: stable@vger.kernel.org
    Cc: stern@rowland.harvard.edu
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20161022110552.253640125@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Ashton and Michael reported that kernel versions 4.8 and later suffer
    from USB timeouts which are caused by the timer wheel rework.

    This is caused by a bug in the base clock forwarding mechanism, which leads
    to timers expiring early. The scenario which leads to this is:

    run_timers()
        while (jiffies >= base->clk) {
            collect_expired_timers();
            base->clk++;
            expire_timers();
        }

    So base->clk = jiffies + 1. Now the cpu goes idle:

    idle()
        get_next_timer_interrupt()
            nextevt = __next_timer_interrupt();
            if (time_after(nextevt, base->clk))
                base->clk = jiffies;

    jiffies has not advanced since run_timers(), so this assignment effectively
    decrements base->clk by one.

    base->clk is the index into the timer wheel arrays. So let's assume the
    following state after the base->clk increment in run_timers():

    jiffies = 0
    base->clk = 1

    A timer gets enqueued with an expiry delta of 63 ticks (which is the case
    with the USB timeout and HZ=250) so the resulting bucket index is:

    base->clk + delta = 1 + 63 = 64

    The timer goes into the first wheel level. The array size is 64 so it ends
    up in bucket 0, which is correct as it takes 63 ticks to advance base->clk
    to index into bucket 0 again.

    If the cpu goes idle before jiffies advance, then the bug in the forwarding
    mechanism sets base->clk back to 0, so the next invocation of run_timers()
    at the next tick will index into bucket 0 and therefore expire the timer 62
    ticks too early.

    Instead of blindly setting base->clk to jiffies we must make the forwarding
    conditional on jiffies > base->clk, but we cannot use jiffies for this as
    we might run into the following issue:

    if (time_after(jiffies, base->clk)) {
        if (time_after(nextevt, base->clk))
            base->clk = jiffies;
    }

    jiffies can increment between the check and the assignment far enough
    to advance beyond nextevt. So we need to use a stable value for
    checking.

    get_next_timer_interrupt() has the basej argument, which is the jiffies
    value snapshot taken in the calling code. So we can just use that.
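
    The conditional forward then becomes roughly (simplified sketch):

        /* basej is the caller's stable jiffies snapshot. */
        if (time_after(basej, base->clk)) {
            if (time_after(nextevt, basej))
                base->clk = basej;
            else
                base->clk = nextevt;
        }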

    Thanks to Ashton for bisecting and providing trace data!

    Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
    Reported-by: Ashton Holmes
    Reported-by: Michael Thayer
    Signed-off-by: Thomas Gleixner
    Cc: Michal Necasek
    Cc: Peter Zijlstra
    Cc: knut.osmundsen@oracle.com
    Cc: stable@vger.kernel.org
    Cc: stern@rowland.harvard.edu
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20161022110552.175308322@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Linus stumbled over the unlocked modification of the timer expiry value in
    mod_timer() which is an optimization for timers which stay in the same
    bucket - due to the bucket granularity - despite their expiry time getting
    updated.

    The optimization itself still makes sense even if we take the lock, because
    in case that the bucket stays the same, we avoid the pointless
    queue/enqueue dance.

    Make the check and the modification of timer->expires protected by the base
    lock and shuffle the remaining code around so we can keep the lock held
    when we actually have to requeue the timer to a different bucket.
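
    A simplified sketch of the reshuffled flow:

        if (timer_pending(timer)) {
            if (timer->expires == expires)
                return 1;

            /*
             * Compute the bucket index with the base lock held. If the
             * timer stays in the same bucket, just update the expiry
             * time and skip the dequeue/enqueue dance.
             */
            base = lock_timer_base(timer, &flags);
            idx = calc_wheel_index(expires, base->clk);

            if (idx == timer_get_idx(timer)) {
                timer->expires = expires;
                goto out_unlock;
            }
        } else {
            base = lock_timer_base(timer, &flags);
        }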

    Fixes: f00c0afdfa62 ("timers: Implement optimization for same expiry time in mod_timer()")
    Reported-by: Linus Torvalds
    Signed-off-by: Thomas Gleixner
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
    Cc: stable@vger.kernel.org
    Cc: Andrew Morton
    Cc: Peter Zijlstra

    Thomas Gleixner
     
  • Linus noticed that lock_timer_base() lacks a READ_ONCE() for accessing the
    timer flags. As a consequence the compiler is allowed to reload the flags
    between the initial check for TIMER_MIGRATION and the following timer base
    computation and the spin lock of the base.

    While this has not been observed (yet), we need to make sure that it never
    happens.
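
    The fix reads roughly as follows (simplified):

        static struct timer_base *lock_timer_base(struct timer_list *timer,
                                                  unsigned long *flags)
        {
            for (;;) {
                struct timer_base *base;
                u32 tf;

                /*
                 * READ_ONCE() keeps the compiler from re-reading
                 * timer->flags between the TIMER_MIGRATING check and
                 * the base computation below.
                 */
                tf = READ_ONCE(timer->flags);

                if (!(tf & TIMER_MIGRATING)) {
                    base = get_timer_base(tf);
                    spin_lock_irqsave(&base->lock, *flags);
                    if (timer->flags == tf)
                        return base;
                    spin_unlock_irqrestore(&base->lock, *flags);
                }
                cpu_relax();
            }
        }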

    Fixes: 0eeda71bc30d ("timer: Replace timer base by a cpu index")
    Reported-by: Linus Torvalds
    Signed-off-by: Thomas Gleixner
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1610241711220.4983@nanos
    Cc: stable@vger.kernel.org
    Cc: Andrew Morton
    Cc: Peter Zijlstra

    Thomas Gleixner
     

17 Oct, 2016

1 commit

  • Remove the set but unused variable base in alarm_timer_create() to fix
    the following warning when building with 'W=1':

    kernel/time/alarmtimer.c: In function ‘alarm_timer_create’:
    kernel/time/alarmtimer.c:545:21: warning: variable ‘base’ set but not used [-Wunused-but-set-variable]

    Signed-off-by: Tobias Klauser
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/20161017094702.10873-1-tklauser@distanz.ch
    Signed-off-by: Thomas Gleixner

    Tobias Klauser
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much possible uncertainty from a running system at boot
    time as possible, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

11 Oct, 2016

1 commit

  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at
    unpredictable times, or have variable loops, each of which provides
    some level of latent entropy.
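
    A sketch of how the attribute is applied (the #define guard mirrors
    the kernel's compiler header; the function and variable names here are
    made up):

        #ifdef LATENT_ENTROPY_PLUGIN
        #define __latent_entropy __attribute__((latent_entropy))
        #else
        #define __latent_entropy
        #endif

        /* On a variable: seeded with random contents at build time. */
        static unsigned long seed_pool[4] __latent_entropy;

        /* On a function: control flow is instrumented so that runtime
           variation mixes entropy into the latent entropy pool. */
        static void __init __latent_entropy gather_boot_entropy(void)
        {
            /* ... ordinary init work ... */
        }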

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     

05 Oct, 2016

1 commit

  • In commit 27727df240c7 ("Avoid taking lock in NMI path with
    CONFIG_DEBUG_TIMEKEEPING"), I changed the logic to open-code
    the timekeeping_get_ns() function, but I forgot to include
    the unit conversion from cycles to nanoseconds, breaking the
    function's output, which impacts users like perf.

    This results in bogus perf timestamps like:
    swapper 0 [000] 253.427536: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.426573: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.426687: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.426800: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.426905: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427022: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427127: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427239: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427346: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 254.427463: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 255.426572: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])

    Instead of more reasonable expected timestamps like:
    swapper 0 [000] 39.953768: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.064839: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.175956: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.287103: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.398217: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.509324: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.620437: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.731546: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.842654: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 40.953772: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])
    swapper 0 [000] 41.064881: 111111111 cpu-clock: ffffffff810a0de6 native_safe_halt+0x6 ([kernel.kallsyms])

    Add the proper use of timekeeping_delta_to_ns() to convert
    the cycle delta to nanoseconds as needed.
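
    In the fast path this amounts to converting the delta before
    accumulating it (simplified sketch of the fixed read loop body):

        now = ktime_to_ns(tkr->base);
        now += timekeeping_delta_to_ns(tkr,
                clocksource_delta(tkr->read(tkr->clock),
                                  tkr->cycle_last, tkr->mask));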

    Thanks to Brendan and Alexei for finding this quickly after
    the v4.8 release. Unfortunately the problematic commit has
    landed in some -stable trees so they'll need this fix as
    well.

    Many apologies for this mistake. I'll be looking to add a
    perf-clock sanity test to the kselftest timers tests soon.

    Fixes: 27727df240c7 ("timekeeping: Avoid taking lock in NMI path with CONFIG_DEBUG_TIMEKEEPING")
    Reported-by: Brendan Gregg
    Reported-by: Alexei Starovoitov
    Tested-and-reviewed-by: Mathieu Desnoyers
    Signed-off-by: John Stultz
    Cc: Peter Zijlstra
    Cc: stable
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/1475636148-26539-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    John Stultz
     

13 Sep, 2016

1 commit

  • can_stop_full_tick() has no check for offline cpus. So it allows the
    tick to be stopped on an offline cpu from the interrupt return path,
    which is wrong and subsequently makes irq_work_needs_cpu() warn about
    being called for an offline cpu.

    Commit f7ea0fd639c2c4 ("tick: Don't invoke tick_nohz_stop_sched_tick() if
    the cpu is offline") added prevention for can_stop_idle_tick(), but forgot
    to do the same in can_stop_full_tick(). Add it.
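
    The added check mirrors the one in can_stop_idle_tick() (sketch):

        static bool can_stop_full_tick(int cpu, struct tick_sched *ts)
        {
            WARN_ON_ONCE(!irqs_disabled());

            if (unlikely(!cpu_online(cpu)))
                return false;

            /* ... existing checks ... */
        }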

    [ tglx: Massaged changelog ]

    Signed-off-by: Wanpeng Li
    Cc: Peter Zijlstra
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/1473245473-4463-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Thomas Gleixner

    Wanpeng Li
     

08 Sep, 2016

1 commit


02 Sep, 2016

1 commit

  • tick_nohz_start_idle() has been prevented from being called when the
    idle tick can't be stopped since commit 1f3b0f8243cb934 ("tick/nohz:
    Optimize nohz idle enter"). As a result, after suspend/resume of the
    host machine, a full dynticks kvm guest will softlockup:

    NMI watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [swapper/0:0]
    Call Trace:
    default_idle+0x31/0x1a0
    arch_cpu_idle+0xf/0x20
    default_idle_call+0x2a/0x50
    cpu_startup_entry+0x39b/0x4d0
    rest_init+0x138/0x140
    ? rest_init+0x5/0x140
    start_kernel+0x4c1/0x4ce
    ? set_init_arg+0x55/0x55
    ? early_idt_handler_array+0x120/0x120
    x86_64_start_reservations+0x24/0x26
    x86_64_start_kernel+0x142/0x14f

    In addition, cat /proc/stat | grep cpu in guest or host shows:

    cpu 398 16 5049 15754 5490 0 1 46 0 0
    cpu0 206 5 450 0 0 0 1 14 0 0
    cpu1 81 0 3937 3149 1514 0 0 9 0 0
    cpu2 45 6 332 6052 2243 0 0 11 0 0
    cpu3 65 2 328 6552 1732 0 0 11 0 0

    The idle and iowait values are a suspicious 0 for cpu0 (the housekeeping CPU).

    The bug is present in both guest and host kernels, and both show the
    cpu0 idle and iowait accounting issue; however, the host kernel's
    suspend/resume path etc. touches the watchdog and thereby avoids the
    softlockup.

    - The watchdog will not be touched in the tick_nohz_stop_idle() path
    (it needs to be touched since the scheduler stall is expected) if the
    idle_active flag is not set.
    - The idle and iowait states will not be accounted when exiting the
    idle loop (on resched or interrupt) if the idle start time and the
    idle_active flag are not set.

    This patch fixes it by reverting commit 1f3b0f8243cb934: not being
    able to stop the idle tick doesn't mean the CPU can't be idle.
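
    After the revert the idle entry path again looks roughly like this
    (simplified), so idle accounting starts even when the tick stays on:

        now = tick_nohz_start_idle(ts);     /* always account idle entry */

        if (can_stop_idle_tick(cpu, ts))
            tick_nohz_stop_sched_tick(ts, now, cpu);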

    Fixes: 1f3b0f8243cb934 ("tick/nohz: Optimize nohz idle enter")
    Signed-off-by: Wanpeng Li
    Cc: Sanjeev Yadav
    Cc: Gaurav Jindal
    Cc: stable@vger.kernel.org
    Cc: kvm@vger.kernel.org
    Cc: Radim Krčmář
    Cc: Peter Zijlstra
    Cc: Paolo Bonzini
    Link: http://lkml.kernel.org/r/1472798303-4154-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Thomas Gleixner

    Wanpeng Li
     

01 Sep, 2016

5 commits

  • I ran into this:

    ================================================================================
    UBSAN: Undefined behaviour in kernel/time/hrtimer.c:310:16
    signed integer overflow:
    9223372036854775807 + 50000 cannot be represented in type 'long long int'
    CPU: 2 PID: 4798 Comm: trinity-c2 Not tainted 4.8.0-rc1+ #91
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
    0000000000000000 ffff88010ce6fb88 ffffffff82344740 0000000041b58ab3
    ffffffff84f97a20 ffffffff82344694 ffff88010ce6fbb0 ffff88010ce6fb60
    000000000000c350 ffff88010ce6f968 dffffc0000000000 ffffffff857bc320
    Call Trace:
    [] dump_stack+0xac/0xfc
    [] ? _atomic_dec_and_lock+0xc4/0xc4
    [] ubsan_epilogue+0xd/0x8a
    [] handle_overflow+0x202/0x23d
    [] ? val_to_string.constprop.6+0x11e/0x11e
    [] ? timerqueue_add+0x151/0x410
    [] ? hrtimer_start_range_ns+0x3b8/0x1380
    [] ? memset+0x31/0x40
    [] __ubsan_handle_add_overflow+0xe/0x10
    [] hrtimer_nanosleep+0x5d9/0x790
    [] ? hrtimer_init_sleeper+0x80/0x80
    [] ? __might_sleep+0x5b/0x260
    [] common_nsleep+0x20/0x30
    [] SyS_clock_nanosleep+0x197/0x210
    [] ? SyS_clock_getres+0x150/0x150
    [] ? __this_cpu_preempt_check+0x13/0x20
    [] ? __context_tracking_exit.part.3+0x30/0x1b0
    [] ? SyS_clock_getres+0x150/0x150
    [] do_syscall_64+0x1b3/0x4b0
    [] entry_SYSCALL64_slow_path+0x25/0x25
    ================================================================================

    Add a new ktime_add_unsafe() helper which doesn't check for overflow, but
    doesn't throw a UBSAN warning when it does overflow either.
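
    The helper presumably looks like this (sketch against the 4.8-era
    ktime_t union): the addition is done in unsigned space, where overflow
    is well-defined.

        #define ktime_add_unsafe(lhs, rhs) \
                ({ (ktime_t){ .tv64 = (u64) (lhs).tv64 + (rhs).tv64 }; })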

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Signed-off-by: Vegard Nossum
    Signed-off-by: John Stultz

    Vegard Nossum
     
  • I ran into this:

    ================================================================================
    UBSAN: Undefined behaviour in kernel/time/time.c:783:2
    signed integer overflow:
    5273 + 9223372036854771711 cannot be represented in type 'long int'
    CPU: 0 PID: 17363 Comm: trinity-c0 Not tainted 4.8.0-rc1+ #88
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org
    04/01/2014
    0000000000000000 ffff88011457f8f0 ffffffff82344f50 0000000041b58ab3
    ffffffff84f98080 ffffffff82344ea4 ffff88011457f918 ffff88011457f8c8
    ffff88011457f8e0 7fffffffffffefff ffff88011457f6d8 dffffc0000000000
    Call Trace:
    [] dump_stack+0xac/0xfc
    [] ? _atomic_dec_and_lock+0xc4/0xc4
    [] ubsan_epilogue+0xd/0x8a
    [] handle_overflow+0x202/0x23d
    [] ? val_to_string.constprop.6+0x11e/0x11e
    [] ? debug_smp_processor_id+0x17/0x20
    [] ? __sigqueue_free.part.13+0x51/0x70
    [] ? rcu_is_watching+0x110/0x110
    [] __ubsan_handle_add_overflow+0xe/0x10
    [] timespec64_add_safe+0x298/0x340
    [] ? timespec_add_safe+0x330/0x330
    [] ? wait_noreap_copyout+0x1d0/0x1d0
    [] poll_select_set_timeout+0xf8/0x170
    [] ? poll_schedule_timeout+0x2b0/0x2b0
    [] ? __might_sleep+0x5b/0x260
    [] __sys_recvmmsg+0x107/0x790
    [] ? SyS_recvmsg+0x20/0x20
    [] ? hrtimer_start_range_ns+0x3b8/0x1380
    [] ? _raw_spin_unlock_irqrestore+0x3b/0x60
    [] ? do_setitimer+0x39a/0x8e0
    [] ? __might_sleep+0x5b/0x260
    [] ? __sys_recvmmsg+0x790/0x790
    [] SyS_recvmmsg+0xd9/0x160
    [] ? __sys_recvmmsg+0x790/0x790
    [] ? __this_cpu_preempt_check+0x13/0x20
    [] ? __context_tracking_exit.part.3+0x30/0x1b0
    [] ? __sys_recvmmsg+0x790/0x790
    [] do_syscall_64+0x1b3/0x4b0
    [] entry_SYSCALL64_slow_path+0x25/0x25
    ================================================================================

    Line 783 is this:

    783         set_normalized_timespec64(&res, lhs.tv_sec + rhs.tv_sec,
    784                                   lhs.tv_nsec + rhs.tv_nsec);

    In other words, since lhs.tv_sec and rhs.tv_sec are both time64_t, this
    is a signed addition which will cause undefined behaviour on overflow.

    Note that this is not currently a huge concern since the kernel should
    be built with -fno-strict-overflow by default, but it could be a
    problem in the future, with older compilers, or with compilers other
    than gcc.

    The easiest way to avoid the overflow is to cast one of the arguments to
    unsigned (so the addition will be done using unsigned arithmetic).
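
    Concretely (sketch; timeu64_t is the kernel's unsigned 64-bit seconds
    type):

        set_normalized_timespec64(&res, (timeu64_t) lhs.tv_sec + rhs.tv_sec,
                                  lhs.tv_nsec + rhs.tv_nsec);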

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Signed-off-by: Vegard Nossum
    Signed-off-by: John Stultz

    Vegard Nossum
     
  • In addition to keeping a histogram of suspend times, also
    print out the time spent in suspend to dmesg.

    This helps to keep track of suspend time while debugging using
    kernel logs.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Signed-off-by: Ruchi Kandoi
    [jstultz: Tweaked commit message]
    Signed-off-by: John Stultz

    Ruchi Kandoi
     
  • Clocksources don't get the VALID_FOR_HRES flag until they have been
    checked by a watchdog. However, when using an override, the
    clocksource_select logic will clear the override value if the
    clocksource is not marked VALID_FOR_HRES during that initial check.
    When using the boot argument clocksource=, this selection can run
    before the watchdog, which can cause the override to be incorrectly
    cleared.

    To address this condition, the override_name is only invalidated for
    unstable clocksources. Otherwise, the override is left intact until after
    the watchdog has validated the clocksource as stable/unstable.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Cc: Martin Schwidefsky
    Signed-off-by: Kyle Walker
    Signed-off-by: John Stultz

    Kyle Walker
     
  • Fix a minor spelling error.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Richard Cochran
    Cc: Prarit Bhargava
    Signed-off-by: Pratyush Patel
    [jstultz: Added commit message]
    Signed-off-by: John Stultz

    Pratyush Patel
     

24 Aug, 2016

2 commits

  • It was reported that hibernation could fail on the 2nd attempt, where the
    system hangs at hibernate() -> syscore_resume() -> i8237A_resume() ->
    claim_dma_lock(), because the lock has already been taken.

    However, there is actually no other process that would want to grab
    this lock on that problematic platform.

    Further investigation showed that the problem is triggered by setting
    /sys/power/pm_trace to 1 before the 1st hibernation.

    Once pm_trace is enabled, the RTC content becomes meaningless after
    suspend; meanwhile, some BIOSes adjust an 'invalid' RTC (e.g., one
    earlier than 1970) to the release date of the motherboard during the
    POST stage. After resume, the system may thus appear to have slept for
    a significantly long time, which is a completely meaningless value.

    Then in timekeeping_resume() -> tk_debug_account_sleep_time(), if bit
    31 of the sleep time happens to be set, fls() returns 32 and we add 1
    to sleep_time_bin[32], which causes an out-of-bounds array access that
    overwrites memory.

    As depicted by System.map:
    0xffffffff81c9d080 b sleep_time_bin
    0xffffffff81c9d100 B dma_spin_lock
    the dma_spin_lock.val is set to 1, which caused this problem.

    This patch adds a sanity check in tk_debug_account_sleep_time()
    to ensure we don't index past the sleep_time_bin array.
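
    The check amounts to clamping the fls() result (sketch, assuming the
    array has NUM_BINS entries):

        static void tk_debug_account_sleep_time(struct timespec64 *t)
        {
            /* Cap the bin index so we can't index past the array. */
            int bin = min(fls(t->tv_sec), NUM_BINS - 1);

            sleep_time_bin[bin]++;
        }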

    [jstultz: Problem diagnosed and original patch by Chen Yu. I've solved
    the issue slightly differently, but borrowed his excellent explanation
    of the issue here.]

    Fixes: 5c83545f24ab ("power: Add option to log time spent in suspend")
    Reported-by: Janek Kozicki
    Reported-by: Chen Yu
    Signed-off-by: John Stultz
    Cc: linux-pm@vger.kernel.org
    Cc: Peter Zijlstra
    Cc: Xunlei Pang
    Cc: "Rafael J. Wysocki"
    Cc: stable
    Cc: Zhang Rui
    Link: http://lkml.kernel.org/r/1471993702-29148-3-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    John Stultz
     
  • When I added some extra sanity checking in timekeeping_get_ns() under
    CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI safe __ktime_get_fast_ns()
    method was using timekeeping_get_ns().

    Thus the locking added to the debug checks broke the NMI-safety of
    __ktime_get_fast_ns().

    This patch open-codes the timekeeping_get_ns() logic for
    __ktime_get_fast_ns(), so it can avoid any deadlocks in NMI context.

    Fixes: 4ca22c2648f9 ("timekeeping: Add warnings when overflows or underflows are observed")
    Reported-by: Steven Rostedt
    Reported-by: Peter Zijlstra
    Signed-off-by: John Stultz
    Cc: stable
    Link: http://lkml.kernel.org/r/1471993702-29148-2-git-send-email-john.stultz@linaro.org
    Signed-off-by: Thomas Gleixner

    John Stultz
     

09 Aug, 2016

1 commit

  • The tick_nohz_stop_sched_tick() routine is not properly
    canceling the sched timer when nothing is pending, because
    get_next_timer_interrupt() is no longer returning KTIME_MAX in
    that case. This causes periodic interrupts when none are needed.

    When determining the next interrupt time, we first use
    __next_timer_interrupt() to get the first expiring timer in the
    timer wheel. If no timer is found, we return the base clock value
    plus NEXT_TIMER_MAX_DELTA to indicate there is no timer in the
    timer wheel.

    Back in get_next_timer_interrupt(), we set the "expires" value
    by converting the timer wheel expiry (in ticks) to a nsec value.
    But we don't want to do this if the timer wheel expiry value
    indicates no timer; we want to return KTIME_MAX.

    Prior to commit 500462a9de65 ("timers: Switch to a non-cascading
    wheel") we checked base->active_timers to see if any timers
    were active, and if not, we didn't touch the expiry value and so
    properly returned KTIME_MAX. Now we don't have active_timers.

    To fix this, we now just check the timer wheel expiry value to
    see if it is "now + NEXT_TIMER_MAX_DELTA", and if it is, we don't
    try to compute a new value based on it, but instead simply let the
    KTIME_MAX value in expires remain.
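
    In code this looks roughly like (simplified):

        u64 expires = KTIME_MAX;
        unsigned long nextevt = __next_timer_interrupt(base);

        if (time_before_eq(nextevt, basej))
            expires = basem;    /* already expired: fire now */
        else if (nextevt != basej + NEXT_TIMER_MAX_DELTA)
            expires = basem + (nextevt - basej) * TICK_NSEC;
        /* else: no timer queued, leave expires at KTIME_MAX */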

    Fixes: 500462a9de65 ("timers: Switch to a non-cascading wheel")
    Signed-off-by: Chris Metcalf
    Cc: Frederic Weisbecker
    Cc: Christoph Lameter
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/1470688147-22287-1-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Thomas Gleixner

    Chris Metcalf
     

30 Jul, 2016

1 commit

  • Pull smp hotplug updates from Thomas Gleixner:
    "This is the next part of the hotplug rework.

    - Convert all notifiers with a priority assigned

    - Convert all CPU_STARTING/DYING notifiers

    The final removal of the STARTING/DYING infrastructure will happen
    when the merge window closes.

    Another 700 lines of impenetrable maze gone :)"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
    timers/core: Correct callback order during CPU hot plug
    leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
    powerpc/numa: Convert to hotplug state machine
    arm/perf: Fix hotplug state machine conversion
    irqchip/armada: Avoid unused function warnings
    ARC/time: Convert to hotplug state machine
    clocksource/atlas7: Convert to hotplug state machine
    clocksource/armada-370-xp: Convert to hotplug state machine
    clocksource/exynos_mct: Convert to hotplug state machine
    clocksource/arm_global_timer: Convert to hotplug state machine
    rcu: Convert rcutree to hotplug state machine
    KVM/arm/arm64/vgic-new: Convert to hotplug state machine
    smp/cfd: Convert core to hotplug state machine
    x86/x2apic: Convert to CPU hotplug state machine
    profile: Convert to hotplug state machine
    timers/core: Convert to hotplug state machine
    hrtimer: Convert to hotplug state machine
    x86/tboot: Convert to hotplug state machine
    arm64/armv8 deprecated: Convert to hotplug state machine
    hwtracing/coresight-etm4x: Convert to hotplug state machine
    ...

    Linus Torvalds
     

26 Jul, 2016

1 commit

  • Pull timer updates from Thomas Gleixner:
    "This update provides the following changes:

    - The rework of the timer wheel which addresses the shortcomings of
    the current wheel (cascading, slow search for the next expiring timer,
    etc). That's the first major change to the wheel in almost 20
    years since Finn implemented it.

    - A large overhaul of the clocksource drivers init functions to
    consolidate the Device Tree initialization

    - Some more Y2038 updates

    - A capability fix for timerfd

    - Yet another clock chip driver

    - The usual pile of updates, comment improvements all over the place"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (130 commits)
    tick/nohz: Optimize nohz idle enter
    clockevents: Make clockevents_subsys static
    clocksource/drivers/time-armada-370-xp: Fix return value check
    timers: Implement optimization for same expiry time in mod_timer()
    timers: Split out index calculation
    timers: Only wake softirq if necessary
    timers: Forward the wheel clock whenever possible
    timers/nohz: Remove pointless tick_nohz_kick_tick() function
    timers: Optimize collect_expired_timers() for NOHZ
    timers: Move __run_timers() function
    timers: Remove set_timer_slack() leftovers
    timers: Switch to a non-cascading wheel
    timers: Reduce the CPU index space to 256k
    timers: Give a few structs and members proper names
    hlist: Add hlist_is_singular_node() helper
    signals: Use hrtimer for sigtimedwait()
    timers: Remove the deprecated mod_timer_pinned() API
    timers, net/ipv4/inet: Initialize connection request timers as pinned
    timers, drivers/tty/mips_ejtag: Initialize the poll timer as pinned
    timers, drivers/tty/metag_da: Initialize the poll timer as pinned
    ...

    Linus Torvalds
     

25 Jul, 2016

1 commit

  • Pull staging and IIO driver updates from Greg KH:
    "Here is the big Staging and IIO driver update for 4.8-rc1.

    We ended up adding more code than removing, again, but it's not all
    that bad. Lots of cleanups all over the staging tree, and new IIO
    drivers, full details in the shortlog.

    All of these have been in linux-next for a while with no reported
    issues"

    * tag 'staging-4.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (417 commits)
    drivers:iio:accel:mma8452: removed unwanted return statements
    drivers:iio:accel:mma8452: added cleanup provision in case of failure.
    iio: Add iio.git tree to MAINTAINERS
    iio:st_pressure: clean useless static channel initializers
    iio:st_pressure:lps22hb: temperature support
    iio:st_pressure:lps22hb: open drain support
    iio:st_pressure: temperature triggered buffering
    iio:st_pressure: document sampling gains
    iio:st_pressure: align storagebits on power of 2
    iio:st_sensors: align on storagebits boundaries
    staging:iio:lis3l02dq drop separate driver
    iio: accel: st_accel: Add lis3l02dq support
    iio: adc: add missing of_node references to iio_dev
    iio: adc: ti-ads1015: add indio_dev->dev.of_node reference
    iio: potentiometer: Fix typo in Kconfig
    iio: potentiometer: mcp4531: Add device tree binding
    iio: potentiometer: mcp4531: Add device tree binding documentation
    iio: potentiometer: mcp4531: Add support for MCP454x, MCP456x, MCP464x and MCP466x
    iio:imu:mpu6050: icm20608 initial support
    iio: adc: max1363: Add device tree binding
    ...

    Linus Torvalds
     

19 Jul, 2016

2 commits

  • tick_nohz_start_idle() is called before checking whether the idle tick
    can be stopped. If the tick cannot be stopped, calling
    tick_nohz_start_idle() is pointless and just wastes CPU cycles.

    Only invoke tick_nohz_start_idle() when can_stop_idle_tick() returns true. A
    short one minute observation of the effect on ARM64 shows a reduction of calls
    by 1.5% thus optimizing the idle entry sequence.

    [ tglx: Massaged changelog ]

    Co-developed-by: Sanjeev Yadav
    Signed-off-by: Gaurav Jindal
    Link: http://lkml.kernel.org/r/20160714120416.GB21099@gaurav.jindal@spreadtrum.com
    Signed-off-by: Thomas Gleixner

    Gaurav Jindal
     
  • The clockevents_subsys struct is used for sysfs support and
    is not declared or used outside the file it is defined in.
    Fix the following warning by making it static:

    kernel/time/clockevents.c:648:17: warning: symbol 'clockevents_subsys' was not declared. Should it be static?

    Signed-off-by: Ben Dooks
    Cc: linux-kernel@lists.codethink.co.uk
    Link: http://lkml.kernel.org/r/1466178974-7105-1-git-send-email-ben.dooks@codethink.co.uk
    Signed-off-by: Thomas Gleixner

    Ben Dooks
     

15 Jul, 2016

2 commits

  • When tearing down, call timers_dead_cpu() before notify_dead().
    There is a hidden dependency between:

    - timers
    - block multiqueue
    - rcutree

    If timers_dead_cpu() comes later than blk_mq_queue_reinit_notify(),
    the latter function causes an RCU stall.

    Signed-off-by: Richard Cochran
    Signed-off-by: Anna-Maria Gleixner
    Reviewed-by: Sebastian Andrzej Siewior
    Cc: John Stultz
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rasmus Villemoes
    Cc: Thomas Gleixner
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160713153337.566790058@linutronix.de
    Signed-off-by: Ingo Molnar

    Richard Cochran
     
  • Split out the clockevents callbacks instead of piggybacking them on
    hrtimers.

    This gets rid of a POST_DEAD user. See commit:

    54e88fad223c ("sched: Make sure timers have migrated before killing the migration_thread")

    We just move the callback state to the proper place in the state machine.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Anna-Maria Gleixner
    Reviewed-by: Sebastian Andrzej Siewior
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rasmus Villemoes
    Cc: Rusty Russell
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160713153337.485419196@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

11 Jul, 2016

1 commit

  • Variable "now" seems to be genuinely used unintialized
    if branch

    if (CPUCLOCK_PERTHREAD(timer->it_clock)) {

    is not taken and branch

    if (unlikely(sighand == NULL)) {

    is taken. In this case the process has been reaped and the timer is marked as
    disarmed anyway. So none of the postprocessing of the sample is
    required. Return right away.
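
    The fix is a simple early return (sketch):

        if (unlikely(sighand == NULL)) {
            /*
             * The process has been reaped. We can't even collect a
             * sample any more. Mark the timer disarmed and bail out.
             */
            timer->it.cpu.expires = 0;
            return;
        }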

    Signed-off-by: Alexey Dobriyan
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20160707223911.GA26483@p183.telecom.by
    Signed-off-by: Thomas Gleixner

    Alexey Dobriyan
     

07 Jul, 2016

10 commits

  • Ingo Molnar
     
  • The existing optimization for same expiry time in mod_timer() checks
    whether the timer expiry time is the same as the new requested expiry
    time. In the old timer wheel implementation this does not take the
    slack batching into account, nor does the new implementation evaluate
    whether the new expiry time will requeue the timer to the same bucket.

    To optimize that, we can calculate the resulting bucket and check if the new
    expiry time is different from the current expiry time. This calculation
    happens outside the base lock held region. If the resulting bucket is the same
    we can avoid taking the base lock and requeueing the timer.

    If the timer needs to be requeued then we have to check under the base lock
    whether the base time has changed between the lockless calculation and taking
    the lock. If it has changed we need to recalculate under the lock.

    This optimization takes effect for timers which are enqueued into the less
    granular wheel levels (1 and above). With a simple test case the functionality
    has been verified:

                Before    After
    Match:       5.5%     86.6%
    Requeue:    94.5%     13.4%
    Recalc:               <0.1%

    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.778527749@linutronix.de
    Signed-off-by: Ingo Molnar

    Anna-Maria Gleixner
     
  • For further optimizations we need to separate the index calculation
    from the queueing. No functional change.

    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.691159619@linutronix.de
    Signed-off-by: Ingo Molnar

    Anna-Maria Gleixner
     
  • With the wheel forwarding in place and with the HZ=1000 4ms folding we
    can avoid running the softirq at all.

    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.607650550@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • The wheel clock is stale when a CPU goes into a long idle sleep. This has the
    side effect that timers which are queued end up in the outer wheel levels.
    That results in coarser granularity.

    To solve this, we keep track of the idle state and forward the wheel clock
    whenever possible.

    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.512039360@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • This was a failed attempt to optimize the timer expiry in idle, which was
    disabled and never revisited. Remove the cruft.

    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.431073782@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • After a NOHZ idle sleep the timer wheel must be forwarded to current jiffies.
    There might be expired timers so the current code loops and checks the expired
    buckets for timers. This can take quite some time for long NOHZ idle periods.

    The pending bitmask in the timer base allows us to do a quick search
    for the next expiring timer and therefore fast forward the base time,
    which prevents pointless long-lasting loops.

    For a 3 seconds idle sleep this reduces the catchup time from ~1ms to 5us.
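
    A simplified sketch of the optimized collection:

        static int collect_expired_timers(struct timer_base *base,
                                          struct hlist_head *heads)
        {
            /* After a long NOHZ sleep, search the pending bitmap instead
               of stepping base->clk one jiffy at a time. */
            if ((long)(jiffies - base->clk) > 2) {
                unsigned long next = __next_timer_interrupt(base);

                if (time_after(next, jiffies)) {
                    /* Nothing expired; the call site increments clk. */
                    base->clk = jiffies - 1;
                    return 0;
                }
                base->clk = next;   /* jump to the first expired bucket */
            }
            return __collect_expired_timers(base, heads);
        }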

    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.351296290@linutronix.de
    Signed-off-by: Ingo Molnar

    Anna-Maria Gleixner
     
  • Move __run_timers() below __next_timer_interrupt() and next_pending_bucket()
    in preparation for __run_timers() NOHZ optimization.

    No functional change.

    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.271872665@linutronix.de
    Signed-off-by: Ingo Molnar

    Anna-Maria Gleixner
     
  • We now have implicit batching in the timer wheel. The slack API is no longer
    used, so remove it.

    Signed-off-by: Thomas Gleixner
    Cc: Alan Stern
    Cc: Andrew F. Davis
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: David S. Miller
    Cc: David Woodhouse
    Cc: Dmitry Eremin-Solenikov
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Greg Kroah-Hartman
    Cc: Jaehoon Chung
    Cc: Jens Axboe
    Cc: John Stultz
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Mathias Nyman
    Cc: Pali Rohár
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sebastian Reichel
    Cc: Ulf Hansson
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mmc@vger.kernel.org
    Cc: linux-pm@vger.kernel.org
    Cc: linux-usb@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • The current timer wheel has some drawbacks:

    1) Cascading:

    Cascading can be an unbound operation and is completely pointless in most
    cases because the vast majority of the timer wheel timers are canceled or
    rearmed before expiration. (They are used as timeout safeguards, not as
    real timers to measure time.)

    2) No fast lookup of the next expiring timer:

    In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
    must fast forward the base time to the current value of jiffies. As we
    have no way to find the next expiring timer fast, the code loops linearly
    and increments the base time one by one and checks for expired timers
    in each step. This causes unbound overhead spikes exactly in the moment
    when we should wake up as fast as possible.

    After a thorough analysis of real world data gathered on laptops,
    workstations, webservers and other machines (thanks Chris!) I came to the
    conclusion that the current 'classic' timer wheel implementation can be
    modified to address the above issues.

    The vast majority of timer wheel timers is canceled or rearmed before
    expiry. Most of them are timeouts for networking and other I/O tasks. The
    nature of timeouts is to catch the exception from normal operation (TCP ack
    timed out, disk does not respond, etc.). For these kinds of timeouts the
    accuracy of the timeout is not really a concern. Timeouts are very often
    approximate worst-case values and in case the timeout fires, we already
    waited for a long time and performance is down the drain already.

    The few timers which actually expire can be split into two categories:

    1) Short expiry times which expect halfway accurate expiry

    2) Long term expiry times are inaccurate today already due to the
    batching which is done for NOHZ automatically and also via the
    set_timer_slack() API.

    So for long term expiry timers we can avoid the cascading property and just
    leave them in the less granular outer wheels until expiry or
    cancelation. Timers which are armed with a timeout larger than the wheel
    capacity are no longer cascaded. We expire them with the longest possible
    timeout (6+ days). We have not observed such timeouts in our data collection,
    but at least we handle them, applying the rule of least surprise.

    To avoid extending the wheel levels for HZ=1000 so we can accommodate the
    longest observed timeouts (5 days in the network conntrack code) we reduce the
    first level granularity on HZ=1000 to 4ms, which effectively is the same as
    the HZ=250 behaviour. From our data analysis there is nothing which relies on
    that 1ms granularity and as a side effect we get better batching and timer
    locality for the networking code as well.

    Contrary to the classic wheel, the granularity of the next wheel is not
    the capacity of the first wheel. In the currently chosen setting, the
    granularity of each wheel is 8 times the granularity of the previous
    wheel.

    So for HZ=250 we end up with the following granularity levels:

    Level  Offset  Granularity            Range
      0       0         4 ms              0 ms -        252 ms
      1      64        32 ms            256 ms -       2044 ms (256ms - ~2s)
      2     128       256 ms           2048 ms -      16380 ms (~2s - ~16s)
      3     192      2048 ms (~2s)    16384 ms -     131068 ms (~16s - ~2m)
      4     256     16384 ms (~16s)  131072 ms -    1048572 ms (~2m - ~17m)
      5     320    131072 ms (~2m)  1048576 ms -    8388604 ms (~17m - ~2h)
      6     384   1048576 ms (~17m) 8388608 ms -   67108863 ms (~2h - ~18h)
      7     448   8388608 ms (~2h) 67108864 ms -  536870911 ms (~18h - ~6d)

    That's a worst case inaccuracy of 12.5% for the timers which are queued at the
    beginning of a level.
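
    (E.g. a timer queued at the start of level 1 targets 256 ms but has
    32 ms granularity: 32/256 = 12.5%.) Bucket selection then reduces to a
    shift and a mask per level; a sketch with the constants implied by the
    table (64 slots per level, each level 8 times coarser):

        #define LVL_CLK_SHIFT   3
        #define LVL_BITS        6
        #define LVL_SIZE        (1UL << LVL_BITS)
        #define LVL_MASK        (LVL_SIZE - 1)
        #define LVL_SHIFT(n)    ((n) * LVL_CLK_SHIFT)
        #define LVL_GRAN(n)     (1UL << LVL_SHIFT(n))
        #define LVL_OFFS(n)     ((n) * LVL_SIZE)

        /* Round the expiry up to the level's granularity, pick the slot. */
        static inline unsigned calc_index(unsigned long expires, unsigned lvl)
        {
            expires = (expires + LVL_GRAN(lvl)) >> LVL_SHIFT(lvl);
            return LVL_OFFS(lvl) + (expires & LVL_MASK);
        }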

    So the new wheel concept addresses the old issues:

    1) Cascading is avoided completely

    2) By keeping the timers in their bucket until expiry/cancelation we
    can track the buckets which have timers enqueued in a bucket bitmap
    and can therefore look up the next expiring timer very fast, in O(1).

    A further benefit of the concept is that the slack calculation which is done
    on every timer start is no longer necessary because the granularity levels
    provide natural batching already.

    Our extensive testing with various loads did not show any performance
    degradation vs. the current wheel implementation.

    This patch does not address the 'fast lookup' issue as we wanted to make sure
    that there is no regression introduced by the wheel redesign. The
    optimizations are in follow up patches.

    This patch contains fixes from Anna-Maria Gleixner and Richard Cochran.

    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.108621834@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner