07 Jul, 2016

12 commits

  • Ingo Molnar
     
  • The existing optimization for same expiry time in mod_timer() checks whether
    the timer expiry time is the same as the new requested expiry time. In the old
    timer wheel implementation this does not take the slack batching into account,
    nor does the new implementation evaluate whether the new expiry time will
    requeue the timer to the same bucket.

    To optimize that, we can calculate the resulting bucket and check if the new
    expiry time is different from the current expiry time. This calculation
    happens outside the base lock held region. If the resulting bucket is the same
    we can avoid taking the base lock and requeueing the timer.

    If the timer needs to be requeued then we have to check under the base lock
    whether the base time has changed between the lockless calculation and taking
    the lock. If it has changed we need to recalculate under the lock.

    This optimization takes effect for timers which are enqueued into the less
    granular wheel levels (1 and above). With a simple test case the functionality
    has been verified:

                Before    After
    Match:        5.5%    86.6%
    Requeue:     94.5%    13.4%
    Recalc:
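
    A minimal user-space sketch of the idea (the level/bucket math and constants
    below are simplified assumptions, not the kernel code; as described above,
    the real check is redone under the base lock if the base clock moved):

    #include <stdbool.h>
    #include <stdio.h>

    #define LVL_BITS   6   /* 64 buckets per level (assumed) */
    #define LVL_SHIFT  3   /* each level is 8x coarser (assumed) */

    static unsigned int bucket_index(unsigned long clk, unsigned long expires)
    {
        unsigned long delta = expires - clk;
        unsigned int lvl = 0;

        /* Pick the level whose granularity can still represent the delta. */
        while (delta >= (1UL << LVL_BITS)) {
            delta >>= LVL_SHIFT;
            expires >>= LVL_SHIFT;
            lvl++;
        }
        return lvl * (1U << LVL_BITS) + (expires & ((1UL << LVL_BITS) - 1));
    }

    int main(void)
    {
        unsigned long base_clk = 1000;
        unsigned long old_expires = base_clk + 300;   /* already queued */
        unsigned long new_expires = base_clk + 296;   /* new request */

        /* Lockless check: same bucket means no requeue is needed. */
        bool same = bucket_index(base_clk, old_expires) ==
                    bucket_index(base_clk, new_expires);
        printf("requeue needed: %s\n", same ? "no" : "yes");
        return 0;
    }
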
    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.778527749@linutronix.de
    Signed-off-by: Ingo Molnar

    Anna-Maria Gleixner
     
  • For further optimizations we need to separate index calculation
    from queueing. No functional change.

    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.691159619@linutronix.de
    Signed-off-by: Ingo Molnar

    Anna-Maria Gleixner
     
  • With the wheel forwarding in place and with the HZ=1000 4ms folding we can
    avoid running the softirq at all.

    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.607650550@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • The wheel clock is stale when a CPU goes into a long idle sleep. This has the
    side effect that timers which are queued end up in the outer wheel levels.
    That results in coarser granularity.

    To solve this, we keep track of the idle state and forward the wheel clock
    whenever possible.
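
    A tiny user-space model of the approach (field and function names are
    assumptions, not the actual kernel implementation): when the base was idle,
    pull its clock forward to "now" before placing new timers, so they are not
    enqueued relative to a stale clock.

    #include <stdbool.h>
    #include <stdio.h>

    struct model_timer_base {
        unsigned long clk;   /* wheel clock, in jiffies */
        bool is_idle;        /* set when the CPU went into NOHZ idle */
    };

    static void forward_timer_base(struct model_timer_base *base, unsigned long now)
    {
        /* Only forward when the base went idle and time actually moved on. */
        if (base->is_idle && (long)(now - base->clk) > 0)
            base->clk = now;
    }

    int main(void)
    {
        struct model_timer_base base = { .clk = 1000, .is_idle = true };

        forward_timer_base(&base, 1500);   /* woke up 500 jiffies later */
        printf("base clock is now %lu\n", base.clk);
        return 0;
    }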

    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.512039360@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • After a NOHZ idle sleep the timer wheel must be forwarded to current jiffies.
    There might be expired timers so the current code loops and checks the expired
    buckets for timers. This can take quite some time for long NOHZ idle periods.

    The pending bitmask in the timer base allows us to do a quick search for the
    next expiring timer and therefore to fast-forward the base time, which
    prevents pointless long-lasting loops.

    For a 3 second idle sleep this reduces the catchup time from ~1ms to 5us.
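
    A toy model of that fast-forward (names and sizes are assumptions; the real
    wheel has multiple levels and uses the kernel bitmap helpers): instead of
    stepping the base clock one jiffy at a time, scan the pending bitmap for the
    next bucket which actually has a timer queued.

    #include <stdio.h>

    #define WHEEL_SIZE 64                     /* one level only, for illustration */

    static unsigned long pending;             /* bit n set => bucket n is non-empty */

    static int next_pending_bucket(unsigned int start)
    {
        for (unsigned int i = start; i < WHEEL_SIZE; i++)
            if (pending & (1UL << i))
                return i;
        return -1;                            /* nothing pending */
    }

    int main(void)
    {
        pending = 1UL << 42;                  /* single timer queued in bucket 42 */

        /* After a long idle sleep we can jump straight to bucket 42. */
        printf("next pending bucket: %d\n", next_pending_bucket(0));
        return 0;
    }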

    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.351296290@linutronix.de
    Signed-off-by: Ingo Molnar

    Anna-Maria Gleixner
     
  • Move __run_timers() below __next_timer_interrupt() and next_pending_bucket()
    in preparation for __run_timers() NOHZ optimization.

    No functional change.

    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.271872665@linutronix.de
    Signed-off-by: Ingo Molnar

    Anna-Maria Gleixner
     
  • We now have implicit batching in the timer wheel. The slack API is no longer
    used, so remove it.

    Signed-off-by: Thomas Gleixner
    Cc: Alan Stern
    Cc: Andrew F. Davis
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: David S. Miller
    Cc: David Woodhouse
    Cc: Dmitry Eremin-Solenikov
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Greg Kroah-Hartman
    Cc: Jaehoon Chung
    Cc: Jens Axboe
    Cc: John Stultz
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Mathias Nyman
    Cc: Pali Rohár
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sebastian Reichel
    Cc: Ulf Hansson
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mmc@vger.kernel.org
    Cc: linux-pm@vger.kernel.org
    Cc: linux-usb@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • The current timer wheel has some drawbacks:

    1) Cascading:

    Cascading can be an unbound operation and is completely pointless in most
    cases because the vast majority of the timer wheel timers are canceled or
    rearmed before expiration. (They are used as timeout safeguards, not as
    real timers to measure time.)

    2) No fast lookup of the next expiring timer:

    In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
    must fast forward the base time to the current value of jiffies. As we
    have no way to find the next expiring timer fast, the code loops linearly
    and increments the base time one by one and checks for expired timers
    in each step. This causes unbound overhead spikes exactly in the moment
    when we should wake up as fast as possible.

    After a thorough analysis of real world data gathered on laptops,
    workstations, webservers and other machines (thanks Chris!) I came to the
    conclusion that the current 'classic' timer wheel implementation can be
    modified to address the above issues.

    The vast majority of timer wheel timers is canceled or rearmed before
    expiry. Most of them are timeouts for networking and other I/O tasks. The
    nature of timeouts is to catch the exception from normal operation (TCP ack
    timed out, disk does not respond, etc.). For these kinds of timeouts the
    accuracy of the timeout is not really a concern. Timeouts are very often
    approximate worst-case values and in case the timeout fires, we already
    waited for a long time and performance is down the drain already.

    The few timers which actually expire can be split into two categories:

    1) Short expiry times which expect halfway accurate expiry

    2) Long term expiry times are inaccurate today already due to the
    batching which is done for NOHZ automatically and also via the
    set_timer_slack() API.

    So for long term expiry timers we can avoid the cascading property and just
    leave them in the less granular outer wheels until expiry or
    cancelation. Timers which are armed with a timeout larger than the wheel
    capacity are no longer cascaded. We expire them with the longest possible
    timeout (6+ days). We have not observed such timeouts in our data collection,
    but at least we handle them, applying the rule of least surprise.

    To avoid having to extend the wheel levels for HZ=1000 in order to accommodate
    the longest observed timeouts (5 days in the network conntrack code), we reduce
    the first level granularity on HZ=1000 to 4ms, which is effectively the same as
    the HZ=250 behaviour. From our data analysis there is nothing which relies on
    that 1ms granularity and as a side effect we get better batching and timer
    locality for the networking code as well.

    Contrary to the classic wheel, the granularity of a wheel level is not the
    capacity of the level below it. In the currently chosen setting the
    granularity of each level is 8 times the granularity of the previous level.

    So for HZ=250 we end up with the following granularity levels:

    Level Offset   Granularity          Range
      0      0          4 ms             0 ms -         252 ms
      1     64         32 ms           256 ms -        2044 ms (256ms - ~2s)
      2    128        256 ms          2048 ms -       16380 ms (~2s - ~16s)
      3    192       2048 ms (~2s)   16384 ms -      131068 ms (~16s - ~2m)
      4    256      16384 ms (~16s) 131072 ms -     1048572 ms (~2m - ~17m)
      5    320     131072 ms (~2m)  1048576 ms -    8388604 ms (~17m - ~2h)
      6    384    1048576 ms (~17m) 8388608 ms -   67108863 ms (~2h - ~18h)
      7    448    8388608 ms (~2h)  67108864 ms - 536870911 ms (~18h - ~6d)

    That's a worst case inaccuracy of 12.5% for the timers which are queued at the
    beginning of a level.
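
    As a worked example (not part of the original message): a timer armed for
    256 ms lands at the start of level 1, where the granularity is 32 ms, so it
    can fire up to 32 ms late; 32 / 256 = 12.5%. At the end of that level
    (2044 ms) the same 32 ms granularity amounts to only a ~1.6% error.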

    So the new wheel concept addresses the old issues:

    1) Cascading is avoided completely

    2) By keeping the timers in the bucket until expiry/cancelation we can track
    the buckets which have timers enqueued in a bucket bitmap and therefore can
    look up the next expiring timer very fast, in O(1).

    A further benefit of the concept is that the slack calculation which is done
    on every timer start is no longer necessary because the granularity levels
    provide natural batching already.

    Our extensive testing with various loads did not show any performance
    degradation vs. the current wheel implementation.

    This patch does not address the 'fast lookup' issue as we wanted to make sure
    that there is no regression introduced by the wheel redesign. The
    optimizations are in follow up patches.

    This patch contains fixes from Anna-Maria Gleixner and Richard Cochran.

    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.108621834@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Some of the names in the internal implementation of the timer code
    are no longer correct and others are simply too long to type.

    Clean it up before we switch the wheel implementation over to
    the new scheme.

    No functional change.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Frederic Weisbecker
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094341.948752516@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • We switched all users to initialize the timers as pinned and call
    mod_timer(). Remove the now unused timer API function.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Frederic Weisbecker
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094341.706205231@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • We want to move the timer migration logic from a 'push' to a 'pull' model.

    Under the current 'push' model pinned timers are handled via
    a runtime API variant: mod_timer_pinned().

    The 'pull' model requires us to store the pinned attribute of a timer
    in the timer_list structure itself, as a new TIMER_PINNED bit in
    timer->flags.

    This flag must be set at initialization time and the timer APIs
    recognize the flag.

    This patch:

    - implements the new flag and associated new-style initialization
      methods,

    - makes mod_timer() recognize new-style pinned timers, and

    - adds some migration helper facility to allow step-by-step
      conversion of old-style to new-style pinned timers.
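
    A rough sketch of the flag-based scheme (constants, names and bit values here
    are illustrative, not the kernel's): "pinned" becomes a property stored in
    the timer's flags at init time, instead of being selected by calling a
    special mod_timer() variant.

    #define EX_TIMER_PINNED  0x00080000u        /* assumed flag bit */

    struct ex_timer {
        unsigned long expires;
        unsigned int flags;                     /* CPU index + property bits */
    };

    void ex_init_timer(struct ex_timer *t, unsigned int flags)
    {
        t->flags = flags;                       /* e.g. EX_TIMER_PINNED */
        t->expires = 0;
    }

    void ex_mod_timer(struct ex_timer *t, unsigned long expires)
    {
        if (!(t->flags & EX_TIMER_PINNED)) {
            /* normal "pick a suitable target CPU" logic would run here */
        }
        /* a pinned timer stays on the CPU it was armed on */
        t->expires = expires;
    }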

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Frederic Weisbecker
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094341.049338558@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

10 Jun, 2016

1 commit

  • Update the usleep_range() function comment to make it clear that it can
    only be used in non-atomic context.

    Previously we claimed usleep_range() was a drop-in replacement for udelay()
    where wakeup is flexible. But that's only true in non-atomic contexts,
    where it's possible to sleep instead of delay.
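
    An illustrative fragment (not from the commit) contrasting the two contexts;
    usleep_range() and udelay() are the real APIs, the surrounding functions are
    made up:

    #include <linux/delay.h>

    /* Process (non-atomic) context, e.g. a probe or ioctl path: sleeping is
     * fine and the min/max range lets the scheduler batch the wakeup. */
    static void wait_for_device_ready(void)
    {
        usleep_range(100, 200);
    }

    /* Atomic context (under a spinlock, in an interrupt handler): sleeping is
     * forbidden, so a busy-wait delay must be used instead. */
    static void poll_status_atomic(void)
    {
        udelay(100);
    }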

    Signed-off-by: Bjorn Helgaas
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/20160531212302.28502.44995.stgit@bhelgaas-glaptop2.roam.corp.google.com
    Signed-off-by: Thomas Gleixner

    Bjorn Helgaas
     

20 May, 2016

2 commits

  • When activating a static object we need to make sure that the object is
    tracked in the object tracker. If it is a non-static object then the
    activation is illegal.

    In the previous implementation, each subsystem had to take care of this in
    its fixup callbacks. We can actually put it into the debugobjects core,
    which saves duplicated code and gives us *pure* fixup callbacks.

    To achieve this, a new callback "is_static_object" is introduced to let the
    type specific code decide whether an object is static or not. If it is, we
    take it into the object tracker, otherwise we warn and invoke the fixup
    callback.
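
    A sketch of what such a callback could look like for timers (the descriptor
    field name follows the commit; the check shown is an assumption, not the
    exact kernel code):

    static bool timer_is_static_object(void *addr)
    {
        struct timer_list *timer = addr;

        /* Assumed convention for statically initialized, never-queued timers. */
        return timer->entry.pprev == NULL;
    }

    static struct debug_obj_descr timer_debug_descr = {
        .name             = "timer_list",
        .is_static_object = timer_is_static_object,
        /* ... fixup callbacks stay "pure", as described above ... */
    };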

    This change has passed the debugobjects selftest, and I also did some
    testing with all debugobjects support enabled.

    Finally, I have a concern about the fixups: can a fixup change an object
    which is in an incorrect state? Because 'addr' may not point to any valid
    object if a non-static object is not tracked, changing such an object can
    overwrite someone else's memory and cause unexpected behaviour. For
    example, timer_fixup_activate() binds the timer to the stub_timer()
    function.

    Link: http://lkml.kernel.org/r/1462576157-14539-1-git-send-email-changbin.du@intel.com
    [changbin.du@intel.com: improve code comments where invoke the new is_static_object callback]
    Link: http://lkml.kernel.org/r/1462777431-8171-1-git-send-email-changbin.du@intel.com
    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • Update the return type to use bool instead of int, corresponding to the
    change (debugobjects: make fixup functions return bool instead of int).

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     

26 Mar, 2016

1 commit


18 Mar, 2016

1 commit

  • This patchset introduces a /proc/<pid>/timerslack_ns interface which
    would allow controlling processes to set the timer slack value on other
    processes in order to save power by avoiding wakeups (something Android
    currently does via out-of-tree patches).

    The first patch tries to fix the internal timer_slack_ns usage which was
    defined as a long, which limits the slack range to ~4 seconds on 32bit
    systems. It converts it to a u64, which provides the same basically
    unlimited slack (500 years) on both 32bit and 64bit machines.

    The second patch introduces the /proc/<pid>/timerslack_ns interface
    which allows the full 64bit slack range for a task to be read or set on
    both 32bit and 64bit machines.

    With these two patches, on a 32bit machine, after setting the slack on
    bash to 10 seconds:

    $ time sleep 1

    real 0m10.747s
    user 0m0.001s
    sys 0m0.005s

    The first patch is a little ugly, since I had to chase the slack delta
    arguments through a number of functions converting them to u64s. Let me
    know if it makes sense to break that up more or not.

    Other than that things are fairly straightforward.

    This patch (of 2):

    The timer_slack_ns value in the task struct is currently an unsigned
    long. This means that for 32bit applications, the maximum slack is just
    over 4 seconds. However, on 64bit machines, it's much much larger (~500
    years).

    This disparity could make application development a little more difficult,
    so convert timer_slack_ns (as well as the default_slack) to a u64. This
    means both 32bit and 64bit systems have the same effective internal slack
    range.

    Now, the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specifies
    the interface as an unsigned long, so we preserve that limitation on 32bit
    systems: SET_TIMERSLACK can only set the slack to an unsigned long value,
    and GET_TIMERSLACK will return ULONG_MAX if the slack is actually larger
    than what can be stored in an unsigned long.
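
    A small user-space illustration of that existing prctl ABI (values chosen
    arbitrarily); on 32bit systems the value passed and returned is an unsigned
    long even though the kernel now stores a u64 internally:

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
        /* Ask for 50us of timer slack for the current thread. */
        if (prctl(PR_SET_TIMERSLACK, 50000UL, 0, 0, 0))
            perror("PR_SET_TIMERSLACK");

        /* PR_GET_TIMERSLACK returns the current slack as the return value. */
        int slack = prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);
        printf("current timer slack: %d ns\n", slack);
        return 0;
    }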

    This patch also modifies the hrtimer functions which specified the slack
    delta as an unsigned long.

    Signed-off-by: John Stultz
    Cc: Arjan van de Ven
    Cc: Thomas Gleixner
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Kees Cook
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     

05 Nov, 2015

1 commit

  • Regardless of the previous CPU a timer was on, add_timer_on()
    currently simply sets timer->flags to the new CPU. As the caller must
    be seeing the timer as idle, this is locally fine, but the timer
    leaving the old base while unlocked can lead to race conditions as
    follows.

    Let's say timer was on cpu 0.

    cpu 0                                   cpu 1
    -----------------------------------------------------------------------------
    del_timer(timer) succeeds
                                            del_timer(timer)
                                            lock_timer_base(timer) locks cpu_0_base
    add_timer_on(timer, 1)
    spin_lock(&cpu_1_base->lock)
    timer->flags set to cpu_1_base
    operates on @timer                      operates on @timer

    This triggered with mod_delayed_work_on() which contains
    "if (del_timer()) add_timer_on()" sequence eventually leading to the
    following oops.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] detach_if_pending+0x69/0x1a0
    ...
    Workqueue: wqthrash wqthrash_workfunc [wqthrash]
    task: ffff8800172ca680 ti: ffff8800172d0000 task.ti: ffff8800172d0000
    RIP: 0010:[] [] detach_if_pending+0x69/0x1a0
    ...
    Call Trace:
    [] del_timer+0x44/0x60
    [] try_to_grab_pending+0xb6/0x160
    [] mod_delayed_work_on+0x33/0x80
    [] wqthrash_workfunc+0x61/0x90 [wqthrash]
    [] process_one_work+0x1e8/0x650
    [] worker_thread+0x4e/0x450
    [] kthread+0xef/0x110
    [] ret_from_fork+0x3f/0x70

    Fix it by updating add_timer_on() to perform proper migration as
    __mod_timer() does.
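
    A simplified sketch of that fix (helper names approximate, details omitted;
    not the exact patch): lock the base the timer currently belongs to, mark it
    TIMER_MIGRATING, and only then switch it to the target CPU's base, mirroring
    __mod_timer():

    void add_timer_on(struct timer_list *timer, int cpu)
    {
        struct tvec_base *new_base = per_cpu_ptr(&tvec_bases, cpu);
        struct tvec_base *base;
        unsigned long flags;

        base = lock_timer_base(timer, &flags);      /* old base, now locked */
        if (base != new_base) {
            timer->flags |= TIMER_MIGRATING;
            spin_unlock(&base->lock);
            base = new_base;
            spin_lock(&base->lock);
            /* publish the new CPU and clear MIGRATING in one store */
            WRITE_ONCE(timer->flags,
                       (timer->flags & ~TIMER_BASEMASK) | cpu);
        }
        internal_add_timer(base, timer);
        spin_unlock_irqrestore(&base->lock, flags);
    }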

    Reported-and-tested-by: Jeff Layton
    Signed-off-by: Tejun Heo
    Cc: Chris Worley
    Cc: bfields@fieldses.org
    Cc: Michael Skralivetsky
    Cc: Trond Myklebust
    Cc: Shaohua Li
    Cc: Jeff Layton
    Cc: kernel-team@fb.com
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20151029103113.2f893924@tlielax.poochiereds.net
    Link: http://lkml.kernel.org/r/20151104171533.GI5749@mtj.duckdns.org
    Signed-off-by: Thomas Gleixner

    Tejun Heo
     

12 Oct, 2015

1 commit

  • In apply_slack(), find_last_bit() is applied to a bitmask consisting
    of precisely BITS_PER_LONG bits. Since the mask is non-zero, we might as
    well eliminate the function call and use __fls() directly. On x86_64,
    this shaves 23 bytes off the only caller, mod_timer().

    This also gets rid of Coverity CID 1192106, but that is a false
    positive: Coverity is not aware that mask != 0 implies that
    find_last_bit will not return BITS_PER_LONG.
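
    A user-space illustration of the same replacement (stand-ins for the kernel
    helpers): when the word is known to be non-zero, finding its last set bit
    needs no out-of-line function call at all.

    #include <stdio.h>

    /* User-space equivalent of the kernel's __fls() for a non-zero word. */
    static unsigned long my_fls(unsigned long mask)
    {
        return (sizeof(mask) * 8 - 1) - __builtin_clzl(mask);
    }

    int main(void)
    {
        unsigned long mask = 0x0000f0f0UL;             /* non-zero by construction */

        printf("last set bit: %lu\n", my_fls(mask));   /* prints 15 */
        return 0;
    }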

    Signed-off-by: Rasmus Villemoes
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/1443771931-6284-1-git-send-email-linux@rasmusvillemoes.dk
    Signed-off-by: Thomas Gleixner

    Rasmus Villemoes
     

22 Sep, 2015

1 commit

  • timer_stats_account_timer() reads timer->start_site, then checks it
    for NULL and then re-reads it, while
    timer_stats_timer_clear_start_info() can concurrently reset
    timer->start_site to NULL. This should not lead to crashes, but can
    double the number of entries in timer stats, as start_site is used during
    comparison; the doubled entries will have a useless NULL start_site.

    Read timer->start_site only once in timer_stats_account_timer().
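
    A simplified user-space illustration of the pattern (types and helpers made
    up); the fix is simply to load the shared pointer once into a local:

    #include <stddef.h>

    struct stats_timer {
        void *start_site;          /* may be reset to NULL concurrently */
    };

    void record(void *site) { (void)site; /* stands in for the accounting */ }

    /* Racy: the field is read again after the NULL check. */
    void account_racy(struct stats_timer *t)
    {
        if (t->start_site == NULL)
            return;
        record(t->start_site);     /* a concurrent clear => may record NULL */
    }

    /* Fixed: read the field exactly once (the kernel would use READ_ONCE()). */
    void account_fixed(struct stats_timer *t)
    {
        void *site = *(void * volatile *)&t->start_site;

        if (site == NULL)
            return;
        record(site);
    }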

    The data race was found with KernelThreadSanitizer (KTSAN).

    Signed-off-by: Dmitry Vyukov
    Cc: andreyknvl@google.com
    Cc: glider@google.com
    Cc: kcc@google.com
    Cc: ktsan@googlegroups.com
    Cc: john.stultz@linaro.org
    Link: http://lkml.kernel.org/r/1442584463-69553-1-git-send-email-dvyukov@google.com
    Signed-off-by: Thomas Gleixner

    Dmitry Vyukov
     

18 Aug, 2015

1 commit

  • lock_timer_base() cannot prevent the following :

    CPU1 (in __mod_timer())
    timer->flags |= TIMER_MIGRATING;
    spin_unlock(&base->lock);
    base = new_base;
    spin_lock(&base->lock);
    // The next line clears TIMER_MIGRATING
    timer->flags &= ~TIMER_BASEMASK;
                                          CPU2 (in lock_timer_base())
                                          see timer base is cpu0 base
                                          spin_lock_irqsave(&base->lock, *flags);
                                          if (timer->flags == tf)
                                              return base; // oops, wrong base
    timer->flags |= base->cpu // too late

    We must write timer->flags in one go, otherwise we can fool other cpus.
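
    Sketch of the resulting pattern (simplified; not the full __mod_timer()
    hunk). The two separate stores are folded into a single one, so a concurrent
    lock_timer_base() can never observe flags with TIMER_MIGRATING cleared but
    the old CPU index still present:

    /* before: two separately visible updates */
    timer->flags &= ~TIMER_BASEMASK;
    /* ... */
    timer->flags |= base->cpu;

    /* after: one store publishing the final value */
    WRITE_ONCE(timer->flags,
               (timer->flags & ~TIMER_BASEMASK) | base->cpu);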

    Fixes: bc7a34b8b9eb ("timer: Reduce timer migration overhead if disabled")
    Signed-off-by: Eric Dumazet
    Cc: Jon Christopherson
    Cc: David Miller
    Cc: xen-devel@lists.xen.org
    Cc: david.vrabel@citrix.com
    Cc: Sander Eikelenboom
    Link: http://lkml.kernel.org/r/1439831928.32680.11.camel@edumazet-glaptop2.roam.corp.google.com
    Signed-off-by: Thomas Gleixner
    Cc: Thomas Gleixner

    Eric Dumazet
     

27 Jun, 2015

1 commit

  • The recent timer wheel rework removed the get/put_cpu_var() pair in
    the hotplug migration code, which results in:

    BUG: using smp_processor_id() in preemptible [00000000] code: hib.sh/2845
    ...
    [] timer_cpu_notify+0x53/0x12

    That hunk is a leftover from an earlier iteration and went unnoticed
    so far.

    Restore the previous code which was obviously correct.

    Fixes: 0eeda71bc30d 'timer: Replace timer base by a cpu index'
    Reported-and-tested-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

19 Jun, 2015

7 commits

  • If nohz is disabled on the kernel command line the [hr]timer code
    still calls wake_up_nohz_cpu() and tick_nohz_full_cpu(), a pretty
    pointless exercise. Cache nohz_active in [hr]timer per cpu bases and
    avoid the overhead.
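
    In essence (sketch only, surrounding code omitted; nohz_active is the cached
    field described above), the hot path shrinks to a single local test:

    static void maybe_wake_nohz_cpu(struct tvec_base *base, int cpu)
    {
        if (!base->nohz_active)     /* nohz disabled: nothing to do */
            return;
        wake_up_nohz_cpu(cpu);
    }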

    Before:
    48.10% hog [.] main
    15.25% [kernel] [k] _raw_spin_lock_irqsave
    9.76% [kernel] [k] _raw_spin_unlock_irqrestore
    6.50% [kernel] [k] mod_timer
    6.44% [kernel] [k] lock_timer_base.isra.38
    3.87% [kernel] [k] detach_if_pending
    3.80% [kernel] [k] del_timer
    2.67% [kernel] [k] internal_add_timer
    1.33% [kernel] [k] __internal_add_timer
    0.73% [kernel] [k] timerfn
    0.54% [kernel] [k] wake_up_nohz_cpu

    After:
    48.73% hog [.] main
    15.36% [kernel] [k] _raw_spin_lock_irqsave
    9.77% [kernel] [k] _raw_spin_unlock_irqrestore
    6.61% [kernel] [k] lock_timer_base.isra.38
    6.42% [kernel] [k] mod_timer
    3.90% [kernel] [k] detach_if_pending
    3.76% [kernel] [k] del_timer
    2.41% [kernel] [k] internal_add_timer
    1.39% [kernel] [k] __internal_add_timer
    0.76% [kernel] [k] timerfn

    We probably should have a cached value for nohz full in the per cpu
    bases as well to avoid the cpumask check. The base cache line is hot
    already, the cpumask not necessarily.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Eric Dumazet
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224512.207378134@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Eric reported that the timer_migration sysctl is not really nice
    performance wise as it needs to check at every timer insertion whether
    the feature is enabled or not. Further the check does not live in the
    timer code, so we have an extra function call which checks an extra
    cache line to figure out that it is disabled.

    We can do better and store that information in the per cpu (hr)timer
    bases. I pondered to use a static key, but that's a nightmare to
    update from the nohz code and the timer base cache line is hot anyway
    when we select a timer base.

    The old logic enabled the timer migration unconditionally if
    CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
    line.

    With this modification, we start off with migration disabled. The user
    visible sysctl is still set to enabled. If the kernel switches to NOHZ
    mode, migration is enabled, provided the user did not disable it via the
    sysctl prior to the switch. If nohz=off is on the kernel command line,
    migration stays disabled no matter what.

    Before:
    47.76% hog [.] main
    14.84% [kernel] [k] _raw_spin_lock_irqsave
    9.55% [kernel] [k] _raw_spin_unlock_irqrestore
    6.71% [kernel] [k] mod_timer
    6.24% [kernel] [k] lock_timer_base.isra.38
    3.76% [kernel] [k] detach_if_pending
    3.71% [kernel] [k] del_timer
    2.50% [kernel] [k] internal_add_timer
    1.51% [kernel] [k] get_nohz_timer_target
    1.28% [kernel] [k] __internal_add_timer
    0.78% [kernel] [k] timerfn
    0.48% [kernel] [k] wake_up_nohz_cpu

    After:
    48.10% hog [.] main
    15.25% [kernel] [k] _raw_spin_lock_irqsave
    9.76% [kernel] [k] _raw_spin_unlock_irqrestore
    6.50% [kernel] [k] mod_timer
    6.44% [kernel] [k] lock_timer_base.isra.38
    3.87% [kernel] [k] detach_if_pending
    3.80% [kernel] [k] del_timer
    2.67% [kernel] [k] internal_add_timer
    1.33% [kernel] [k] __internal_add_timer
    0.73% [kernel] [k] timerfn
    0.54% [kernel] [k] wake_up_nohz_cpu

    Reported-by: Eric Dumazet
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Simplify the handling of the flag storage for the timer statistics. No
    intermediate storage anymore. Just hand over the flags field.

    I left the printout of 'deferrable' for now because changing this
    would be an ABI update and I have no idea how strongly people feel about
    that. OTOH, I wonder whether we should kill the whole timer stats
    stuff because all of that information can be retrieved via ftrace/perf
    as well.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Eric Dumazet
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224512.046626248@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Instead of storing a pointer to the per cpu tvec_base we can simply
    cache a CPU index in the timer_list and use that to get hold of the
    correct per cpu tvec_base. This is only used in lock_timer_base() and
    the slightly larger code is peanuts versus the spinlock operation and
    the d-cache footprint of the timer wheel.
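
    A sketch of that lookup (macro and helper names approximate, not the exact
    kernel code): the low bits of timer->flags hold the CPU index, and the base
    is simply the per-cpu structure of that CPU:

    static struct tvec_base *get_timer_base(unsigned int tflags)
    {
        return per_cpu_ptr(&tvec_bases, tflags & TIMER_CPUMASK);
    }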

    Aside of that this allows to get rid of following nuisances:

    - boot_tvec_base

    That statically allocated 4k bss data is just kept around so the
    timer has a home when it gets statically initialized. It serves no
    other purpose.

    With the CPU index we assign the timer to CPU0 at static
    initialization time and therefore can avoid the whole boot_tvec_base
    dance. That also simplifies the init code, which can just use the
    per cpu base.

    Before:
    text data bss dec hex filename
    17491 9201 4160 30852 7884 ../build/kernel/time/timer.o
    After:
    text data bss dec hex filename
    17440 9193 0 26633 6809 ../build/kernel/time/timer.o

    - Overloading the base pointer with various flags

    The CPU index has enough space to hold the flags (deferrable,
    irqsafe) so we can get rid of the extra masking and bit fiddling
    with the base pointer.

    As a benefit we reduce the size of struct timer_list on 64 bit
    machines by 4 - 8 bytes, a size reduction of up to 15% per struct
    timer_list, which is a real win as we have tons of them embedded in
    other structs.

    This changes also the newly added deferrable printout of the timer
    start trace point to capture and print all timer->flags, which allows
    us to decode the target cpu of the timer as well.

    We might have used bitfields for this, but that would change the
    static initializers and the init function, for no real gain, just to
    accommodate big endian bitfields.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Eric Dumazet
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Cc: Steven Rostedt
    Cc: Badhri Jagan Sridharan
    Link: http://lkml.kernel.org/r/20150526224511.950084301@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • This reduces the size of struct tvec_base by 50% and results in
    slightly smaller code as well.

    Before:
    struct tvec_base: size: 8256, cachelines: 129

    text data bss dec hex filename
    17698 13297 8256 39251 9953 ../build/kernel/time/timer.o

    After:
    struct tvec_base: size: 4160, cachelines: 65

    text data bss dec hex filename
    17491 9201 4160 30852 7884 ../build/kernel/time/timer.o

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Viresh Kumar
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Eric Dumazet
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224511.854731214@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The FIFO guarantee is only there if two timers are queued into the
    same bucket at the same jiffie on the same cpu:

    - The slack value depends on the delta between expiry and enqueue
    time, so the resulting expiry time can be different for timers
    which are queued in different jiffies.

    - Timers which are queued into the secondary array end up after a
    later queued timer which was queued into the primary array due to
    cascading.

    - Timers can end up on different cpus due to the NOHZ target moving
    around. Obviously there is no guarantee of expiry ordering between
    cpus.

    So anything which relies on FIFO behaviour of the timer wheel is
    broken already.

    This is a preparatory patch for converting the timer wheel to hlist
    which reduces the memory footprint of the wheel by 50%.

    It's a separate patch so any (unlikely to happen) regression caused by
    this can be identified clearly.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Viresh Kumar
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Eric Dumazet
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Cc: George Spelvin
    Link: http://lkml.kernel.org/r/20150526224511.757520403@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • catchup_timer_jiffies() has been applied blindly to several functions
    without looking for possible better ways to do it.

    1) internal_add_timer()

    Move the update to base->all_timers before we actually insert the
    timer into the wheel.

    2) detach_if_pending()

    Again the update to base->all_timers allows us to explicitly do
    the timer_jiffies update in place, if this was the last timer which
    got removed.

    3) __run_timers()

    We only check on entry, which is silly, because base->timer_jiffies
    can be behind - especially on NOHZ kernels - and if there is a
    single deferrable timer somewhere between base->timer_jiffies and
    jiffies we expire it and then loop until base->timer_jiffies ==
    jiffies.

    Move it into the loop.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Eric Dumazet
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224511.662994644@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

23 May, 2015

1 commit

  • The timer_start event now shows whether the timer is
    deferrable in case of a low-res timer. The debug_activate
    function now includes a deferrable flag while calling
    the trace_timer_start event.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Acked-by: Steven Rostedt
    Signed-off-by: Badhri Jagan Sridharan
    [jstultz: Fixed minor whitespace and grammar tweaks
    pointed out by Ingo]
    Signed-off-by: John Stultz

    Badhri Jagan Sridharan
     

05 May, 2015

1 commit

  • At present, internal_add_timer() examines flags with 'base', which doesn't
    contain flags. Examine 'timer->base' instead, to avoid unnecessarily waking
    up a nohz CPU when the timer base has TIMER_DEFERRABLE set.

    Signed-off-by: Joonwoo Park
    Cc: sboyd@codeaurora.org
    Cc: skannan@codeaurora.org
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/1430187709-21087-1-git-send-email-joonwoop@codeaurora.org
    Signed-off-by: Thomas Gleixner

    Joonwoo Park
     

22 Apr, 2015

4 commits

  • do_usleep_range() and schedule_hrtimeout_range() are __sched as
    well. So it makes no sense to have the exported function in a
    different section.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20150414203503.833709502@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The only user ignores it anyway and rightfully so.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20150414203503.756060258@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The evaluation of the next timer in the nohz code is based on jiffies
    while all the tick internals are nanosecond based. We also have to
    convert hrtimer nanoseconds to jiffies in the !highres case. That's
    just wrong and introduces interesting corner cases.

    Turn it around and convert the next timer wheel timer expiry and the
    rcu event to clock monotonic and base all calculations on
    nanoseconds. That identifies the case where no timer is pending
    clearly with an absolute expiry value of KTIME_MAX.

    Makes the code more readable and gets rid of the jiffies magic in the
    nohz code.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Paul E. McKenney
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Cc: Josh Triplett
    Cc: Lai Jiangshan
    Cc: John Stultz
    Cc: Marcelo Tosatti
    Link: http://lkml.kernel.org/r/20150414203502.184198593@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • hrtimer softirq is a leftover from the initial implementation and
    serves only the purpose to handle the enqueueing of already expired
    timers in the high resolution timer mode. We discussed whether we
    change the return value and force all start sites to handle that the
    timer is already expired, but that would be a Herculean task and I'm
    not sure whether it's a good idea to enforce that handling on
    everyone.

    A simpler solution is to enforce a timer interrupt instead of raising
    and scheduling a softirq. Just use the existing infrastructure to do
    so and remove all the softirq leftovers.

    The HRTIMER softirq enum is now unused, but kept around because trace
    parsers rely on the existing numbering.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20150414203501.840834708@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

02 Apr, 2015

3 commits

  • Remove one CONFIG_HOTPLUG_CPU #ifdef in trade for introducing one
    CONFIG_SMP #ifdef.

    The CONFIG_SMP ifdef avoids declaring the per-CPU __tvec_bases storage
    on UP systems since they already have boot_tvec_bases.

    Also (re)add a runtime check on the base alignment -- for the paranoid
    amongst us :-)

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/fdd2d35e169bdc554ffa3fe77f77716298c75ada.1427814611.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There is no need to call init_timers_cpu() on every CPU hotplug event,
    there is not much we need to reset.

    - Timer-lists are already empty at the end of migrate_timers().
    - timer_jiffies will be refreshed while adding a new timer, after the
    CPU is online again.
    - active_timers and all_timers can be reset from migrate_timers().

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/54a1c30ea7b805af55beb220cadf5a07a21b0a4d.1427814611.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     
  • Memory for the 'tvec_base' array is allocated separately for the boot CPU (statically)
    and non-boot CPUs (dynamically).

    The reason is because __TIMER_INITIALIZER() needs to set ->base to a
    valid pointer (because we've made NULL special, hint: lock_timer_base())
    and we cannot get a compile time pointer to per-cpu entries because we
    don't know where we'll map the section, even for the boot cpu.

    This can be simplified a bit by statically allocating per-cpu memory.
    The only disadvantage is that memory for one of the structures will stay
    unused, i.e. for the boot CPU, which uses boot_tvec_bases.

    This will also guarantee that tvec_base is cacheline aligned. Even
    though tvec_base has ____cacheline_aligned stuck on, kzalloc_node() does
    not actually respect that (but guarantees a minimum u64 alignment).
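
    In essence the change boils down to a single static per-cpu definition
    (sketch; the variable name follows the surrounding text), laid out at build
    time so the struct's own cacheline alignment attribute is honoured and no
    runtime allocator is involved:

    static DEFINE_PER_CPU(struct tvec_base, tvec_bases);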

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/17cdf560f2727f687ab159707d0aa591f8a2f82d.1427814611.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Nov, 2014

1 commit

  • The "cpu" argument was kept around on the off-chance that RCU might
    offload scheduler-clock interrupts. However, this offload approach
    has been replaced by NO_HZ_FULL, which offloads -all- RCU processing
    from qualifying CPUs. It is therefore time to remove the "cpu" argument
    to rcu_check_callbacks(), which this commit does.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Pranith Kumar

    Paul E. McKenney
     

15 Oct, 2014

1 commit

  • Pull percpu consistent-ops changes from Tejun Heo:
    "Way back, before the current percpu allocator was implemented, static
    and dynamic percpu memory areas were allocated and handled separately
    and had their own accessors. The distinction has been gone for many
    years now; however, the now duplicate two sets of accessors remained
    with the pointer based ones - this_cpu_*() - evolving various other
    operations over time. During the process, we also accumulated other
    inconsistent operations.

    This pull request contains Christoph's patches to clean up the
    duplicate accessor situation. __get_cpu_var() uses are replaced with
    with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

    Unfortunately, the former sometimes is tricky thanks to C being a bit
    messy with the distinction between lvalues and pointers, which led to
    a rather ugly solution for cpumask_var_t involving the introduction of
    this_cpu_cpumask_var_ptr().

    This converts most of the uses but not all. Christoph will follow up
    with the remaining conversions in this merge window and hopefully
    remove the obsolete accessors"

    * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
    irqchip: Properly fetch the per cpu offset
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
    ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
    Revert "powerpc: Replace __get_cpu_var uses"
    percpu: Remove __this_cpu_ptr
    clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
    sparc: Replace __get_cpu_var uses
    avr32: Replace __get_cpu_var with __this_cpu_write
    blackfin: Replace __get_cpu_var uses
    tile: Use this_cpu_ptr() for hardware counters
    tile: Replace __get_cpu_var uses
    powerpc: Replace __get_cpu_var uses
    alpha: Replace __get_cpu_var
    ia64: Replace __get_cpu_var uses
    s390: cio driver &__get_cpu_var replacements
    s390: Replace __get_cpu_var uses
    mips: Replace __get_cpu_var uses
    MIPS: Replace __get_cpu_var uses in FPU emulator.
    arm: Replace __this_cpu_ptr with raw_cpu_ptr
    ...

    Linus Torvalds