07 Jul, 2016

12 commits

  • Ingo Molnar
     
  • The existing optimization for same expiry time in mod_timer() checks whether
    the timer expiry time is the same as the new requested expiry time. In the old
    timer wheel implementation this does not take the slack batching into account,
    nor does the new implementation evaluate whether the new expiry time will
    requeue the timer to the same bucket.

    To optimize that, we can calculate the resulting bucket and check if the new
    expiry time is different from the current expiry time. This calculation
    happens outside the base lock held region. If the resulting bucket is the same
    we can avoid taking the base lock and requeueing the timer.

    If the timer needs to be requeued then we have to check under the base lock
    whether the base time has changed between the lockless calculation and taking
    the lock. If it has changed we need to recalculate under the lock.

    This optimization takes effect for timers which are enqueued into the less
    granular wheel levels (1 and above). With a simple test case the functionality
    has been verified:

                Before    After
    Match:        5.5%    86.6%
    Requeue:     94.5%    13.4%
    Recalc:
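
    A minimal user-space sketch of the idea (the level/bucket math and constants
    below are simplified assumptions, not the kernel code; as described above,
    the real check is redone under the base lock if the base clock moved):

    #include <stdbool.h>
    #include <stdio.h>

    #define LVL_BITS   6   /* 64 buckets per level (assumed) */
    #define LVL_SHIFT  3   /* each level is 8x coarser (assumed) */

    static unsigned int bucket_index(unsigned long clk, unsigned long expires)
    {
        unsigned long delta = expires - clk;
        unsigned int lvl = 0;

        /* Pick the level whose granularity can still represent the delta. */
        while (delta >= (1UL << LVL_BITS)) {
            delta >>= LVL_SHIFT;
            expires >>= LVL_SHIFT;
            lvl++;
        }
        return lvl * (1U << LVL_BITS) + (expires & ((1UL << LVL_BITS) - 1));
    }

    int main(void)
    {
        unsigned long base_clk = 1000;
        unsigned long old_expires = base_clk + 300;   /* already queued */
        unsigned long new_expires = base_clk + 296;   /* new request */

        /* Lockless check: same bucket means no requeue is needed. */
        bool same = bucket_index(base_clk, old_expires) ==
                    bucket_index(base_clk, new_expires);
        printf("requeue needed: %s\n", same ? "no" : "yes");
        return 0;
    }
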
    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.778527749@linutronix.de
    Signed-off-by: Ingo Molnar

    Anna-Maria Gleixner
     
  • For further optimizations we need to separate index calculation
    from queueing. No functional change.

    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.691159619@linutronix.de
    Signed-off-by: Ingo Molnar

    Anna-Maria Gleixner
     
  • With the wheel forwarding in place and with the HZ=1000 4ms folding we can
    avoid running the softirq at all.

    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.607650550@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • The wheel clock is stale when a CPU goes into a long idle sleep. This has the
    side effect that timers which are queued end up in the outer wheel levels.
    That results in coarser granularity.

    To solve this, we keep track of the idle state and forward the wheel clock
    whenever possible.
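
    A tiny user-space model of the approach (field and function names are
    assumptions, not the actual kernel implementation): when the base was idle,
    pull its clock forward to "now" before placing new timers, so they are not
    enqueued relative to a stale clock.

    #include <stdbool.h>
    #include <stdio.h>

    struct model_timer_base {
        unsigned long clk;   /* wheel clock, in jiffies */
        bool is_idle;        /* set when the CPU went into NOHZ idle */
    };

    static void forward_timer_base(struct model_timer_base *base, unsigned long now)
    {
        /* Only forward when the base went idle and time actually moved on. */
        if (base->is_idle && (long)(now - base->clk) > 0)
            base->clk = now;
    }

    int main(void)
    {
        struct model_timer_base base = { .clk = 1000, .is_idle = true };

        forward_timer_base(&base, 1500);   /* woke up 500 jiffies later */
        printf("base clock is now %lu\n", base.clk);
        return 0;
    }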

    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.512039360@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • After a NOHZ idle sleep the timer wheel must be forwarded to current jiffies.
    There might be expired timers so the current code loops and checks the expired
    buckets for timers. This can take quite some time for long NOHZ idle periods.

    The pending bitmask in the timer base allows us to do a quick search for the
    next expiring timer and therefore to fast-forward the base time, which
    prevents pointless long-lasting loops.

    For a 3 second idle sleep this reduces the catchup time from ~1ms to 5us.
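
    A toy model of that fast-forward (names and sizes are assumptions; the real
    wheel has multiple levels and uses the kernel bitmap helpers): instead of
    stepping the base clock one jiffy at a time, scan the pending bitmap for the
    next bucket which actually has a timer queued.

    #include <stdio.h>

    #define WHEEL_SIZE 64                     /* one level only, for illustration */

    static unsigned long pending;             /* bit n set => bucket n is non-empty */

    static int next_pending_bucket(unsigned int start)
    {
        for (unsigned int i = start; i < WHEEL_SIZE; i++)
            if (pending & (1UL << i))
                return i;
        return -1;                            /* nothing pending */
    }

    int main(void)
    {
        pending = 1UL << 42;                  /* single timer queued in bucket 42 */

        /* After a long idle sleep we can jump straight to bucket 42. */
        printf("next pending bucket: %d\n", next_pending_bucket(0));
        return 0;
    }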

    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.351296290@linutronix.de
    Signed-off-by: Ingo Molnar

    Anna-Maria Gleixner
     
  • Move __run_timers() below __next_timer_interrupt() and next_pending_bucket()
    in preparation for __run_timers() NOHZ optimization.

    No functional change.

    Signed-off-by: Anna-Maria Gleixner
    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.271872665@linutronix.de
    Signed-off-by: Ingo Molnar

    Anna-Maria Gleixner
     
  • We now have implicit batching in the timer wheel. The slack API is no longer
    used, so remove it.

    Signed-off-by: Thomas Gleixner
    Cc: Alan Stern
    Cc: Andrew F. Davis
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: David S. Miller
    Cc: David Woodhouse
    Cc: Dmitry Eremin-Solenikov
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Greg Kroah-Hartman
    Cc: Jaehoon Chung
    Cc: Jens Axboe
    Cc: John Stultz
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Mathias Nyman
    Cc: Pali Rohár
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Sebastian Reichel
    Cc: Ulf Hansson
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mmc@vger.kernel.org
    Cc: linux-pm@vger.kernel.org
    Cc: linux-usb@vger.kernel.org
    Cc: netdev@vger.kernel.org
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.189813118@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • The current timer wheel has some drawbacks:

    1) Cascading:

    Cascading can be an unbound operation and is completely pointless in most
    cases because the vast majority of the timer wheel timers are canceled or
    rearmed before expiration. (They are used as timeout safeguards, not as
    real timers to measure time.)

    2) No fast lookup of the next expiring timer:

    In NOHZ scenarios the first timer soft interrupt after a long NOHZ period
    must fast forward the base time to the current value of jiffies. As we
    have no way to find the next expiring timer fast, the code loops linearly
    and increments the base time one by one and checks for expired timers
    in each step. This causes unbound overhead spikes exactly in the moment
    when we should wake up as fast as possible.

    After a thorough analysis of real world data gathered on laptops,
    workstations, webservers and other machines (thanks Chris!) I came to the
    conclusion that the current 'classic' timer wheel implementation can be
    modified to address the above issues.

    The vast majority of timer wheel timers is canceled or rearmed before
    expiry. Most of them are timeouts for networking and other I/O tasks. The
    nature of timeouts is to catch the exception from normal operation (TCP ack
    timed out, disk does not respond, etc.). For these kinds of timeouts the
    accuracy of the timeout is not really a concern. Timeouts are very often
    approximate worst-case values and in case the timeout fires, we already
    waited for a long time and performance is down the drain already.

    The few timers which actually expire can be split into two categories:

    1) Short expiry times which expect halfway accurate expiry

    2) Long term expiry times are inaccurate today already due to the
    batching which is done for NOHZ automatically and also via the
    set_timer_slack() API.

    So for long term expiry timers we can avoid the cascading property and just
    leave them in the less granular outer wheels until expiry or
    cancelation. Timers which are armed with a timeout larger than the wheel
    capacity are no longer cascaded. We expire them with the longest possible
    timeout (6+ days). We have not observed such timeouts in our data collection,
    but at least we handle them, applying the rule of least surprise.

    To avoid having to extend the wheel levels for HZ=1000 in order to accommodate
    the longest observed timeouts (5 days in the network conntrack code), we reduce
    the first level granularity on HZ=1000 to 4ms, which is effectively the same as
    the HZ=250 behaviour. From our data analysis there is nothing which relies on
    that 1ms granularity and as a side effect we get better batching and timer
    locality for the networking code as well.

    Contrary to the classic wheel, the granularity of a wheel level is not the
    capacity of the level below it. In the currently chosen setting the
    granularity of each level is 8 times the granularity of the previous level.

    So for HZ=250 we end up with the following granularity levels:

    Level Offset   Granularity          Range
      0      0          4 ms             0 ms -         252 ms
      1     64         32 ms           256 ms -        2044 ms (256ms - ~2s)
      2    128        256 ms          2048 ms -       16380 ms (~2s - ~16s)
      3    192       2048 ms (~2s)   16384 ms -      131068 ms (~16s - ~2m)
      4    256      16384 ms (~16s) 131072 ms -     1048572 ms (~2m - ~17m)
      5    320     131072 ms (~2m)  1048576 ms -    8388604 ms (~17m - ~2h)
      6    384    1048576 ms (~17m) 8388608 ms -   67108863 ms (~2h - ~18h)
      7    448    8388608 ms (~2h)  67108864 ms - 536870911 ms (~18h - ~6d)

    That's a worst case inaccuracy of 12.5% for the timers which are queued at the
    beginning of a level.
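
    As a worked example (not part of the original message): a timer armed for
    256 ms lands at the start of level 1, where the granularity is 32 ms, so it
    can fire up to 32 ms late; 32 / 256 = 12.5%. At the end of that level
    (2044 ms) the same 32 ms granularity amounts to only a ~1.6% error.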

    So the new wheel concept addresses the old issues:

    1) Cascading is avoided completely

    2) By keeping the timers in the bucket until expiry/cancelation we can track
    the buckets which have timers enqueued in a bucket bitmap and therefore can
    look up the next expiring timer very fast, in O(1).

    A further benefit of the concept is that the slack calculation which is done
    on every timer start is no longer necessary because the granularity levels
    provide natural batching already.

    Our extensive testing with various loads did not show any performance
    degradation vs. the current wheel implementation.

    This patch does not address the 'fast lookup' issue as we wanted to make sure
    that there is no regression introduced by the wheel redesign. The
    optimizations are in follow up patches.

    This patch contains fixes from Anna-Maria Gleixner and Richard Cochran.

    Signed-off-by: Thomas Gleixner
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: Frederic Weisbecker
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094342.108621834@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Some of the names in the internal implementation of the timer code
    are no longer correct and others are simply too long to type.

    Clean it up before we switch the wheel implementation over to
    the new scheme.

    No functional change.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Frederic Weisbecker
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094341.948752516@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • We switched all users to initialize the timers as pinned and call
    mod_timer(). Remove the now unused timer API function.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Frederic Weisbecker
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094341.706205231@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • We want to move the timer migration logic from a 'push' to a 'pull' model.

    Under the current 'push' model pinned timers are handled via
    a runtime API variant: mod_timer_pinned().

    The 'pull' model requires us to store the pinned attribute of a timer
    in the timer_list structure itself, as a new TIMER_PINNED bit in
    timer->flags.

    This flag must be set at initialization time and the timer APIs
    recognize the flag.

    This patch:

    - implements the new flag and associated new-style initialization
      methods,

    - makes mod_timer() recognize new-style pinned timers, and

    - adds some migration helper facility to allow step-by-step
      conversion of old-style to new-style pinned timers.
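
    A rough sketch of the flag-based scheme (constants, names and bit values here
    are illustrative, not the kernel's): "pinned" becomes a property stored in
    the timer's flags at init time, instead of being selected by calling a
    special mod_timer() variant.

    #define EX_TIMER_PINNED  0x00080000u        /* assumed flag bit */

    struct ex_timer {
        unsigned long expires;
        unsigned int flags;                     /* CPU index + property bits */
    };

    void ex_init_timer(struct ex_timer *t, unsigned int flags)
    {
        t->flags = flags;                       /* e.g. EX_TIMER_PINNED */
        t->expires = 0;
    }

    void ex_mod_timer(struct ex_timer *t, unsigned long expires)
    {
        if (!(t->flags & EX_TIMER_PINNED)) {
            /* normal "pick a suitable target CPU" logic would run here */
        }
        /* a pinned timer stays on the CPU it was armed on */
        t->expires = expires;
    }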

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Frederic Weisbecker
    Cc: Arjan van de Ven
    Cc: Chris Mason
    Cc: Eric Dumazet
    Cc: George Spelvin
    Cc: Josh Triplett
    Cc: Len Brown
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160704094341.049338558@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

10 Jun, 2016

1 commit

  • Update the usleep_range() function comment to make it clear that it can
    only be used in non-atomic context.

    Previously we claimed usleep_range() was a drop-in replacement for udelay()
    where wakeup is flexible. But that's only true in non-atomic contexts,
    where it's possible to sleep instead of delay.
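
    An illustrative fragment (not from the commit) contrasting the two contexts;
    usleep_range() and udelay() are the real APIs, the surrounding functions are
    made up:

    #include <linux/delay.h>

    /* Process (non-atomic) context, e.g. a probe or ioctl path: sleeping is
     * fine and the min/max range lets the scheduler batch the wakeup. */
    static void wait_for_device_ready(void)
    {
        usleep_range(100, 200);
    }

    /* Atomic context (under a spinlock, in an interrupt handler): sleeping is
     * forbidden, so a busy-wait delay must be used instead. */
    static void poll_status_atomic(void)
    {
        udelay(100);
    }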

    Signed-off-by: Bjorn Helgaas
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/20160531212302.28502.44995.stgit@bhelgaas-glaptop2.roam.corp.google.com
    Signed-off-by: Thomas Gleixner

    Bjorn Helgaas
     

20 May, 2016

2 commits

  • When activating a static object we need to make sure that the object is
    tracked in the object tracker. If it is a non-static object then the
    activation is illegal.

    In the previous implementation, each subsystem had to take care of this in
    its fixup callbacks. We can actually put it into the debugobjects core,
    which saves duplicated code and gives us *pure* fixup callbacks.

    To achieve this, a new callback "is_static_object" is introduced to let the
    type specific code decide whether an object is static or not. If it is, we
    take it into the object tracker, otherwise we warn and invoke the fixup
    callback.
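
    A sketch of what such a callback could look like for timers (the descriptor
    field name follows the commit; the check shown is an assumption, not the
    exact kernel code):

    static bool timer_is_static_object(void *addr)
    {
        struct timer_list *timer = addr;

        /* Assumed convention for statically initialized, never-queued timers. */
        return timer->entry.pprev == NULL;
    }

    static struct debug_obj_descr timer_debug_descr = {
        .name             = "timer_list",
        .is_static_object = timer_is_static_object,
        /* ... fixup callbacks stay "pure", as described above ... */
    };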

    This change has passed the debugobjects selftest, and I also did some
    testing with all debugobjects support enabled.

    Finally, I have a concern about the fixups: can a fixup change an object
    which is in an incorrect state? Because 'addr' may not point to any valid
    object if a non-static object is not tracked, changing such an object can
    overwrite someone else's memory and cause unexpected behaviour. For
    example, timer_fixup_activate() binds the timer to the stub_timer()
    function.

    Link: http://lkml.kernel.org/r/1462576157-14539-1-git-send-email-changbin.du@intel.com
    [changbin.du@intel.com: improve code comments where invoke the new is_static_object callback]
    Link: http://lkml.kernel.org/r/1462777431-8171-1-git-send-email-changbin.du@intel.com
    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     
  • Update the return type to use bool instead of int, corresponding to the
    change (debugobjects: make fixup functions return bool instead of int).

    Signed-off-by: Du, Changbin
    Cc: Jonathan Corbet
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tejun Heo
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Du, Changbin
     

26 Mar, 2016

1 commit


18 Mar, 2016

1 commit

  • This patchset introduces a /proc/<pid>/timerslack_ns interface which
    would allow controlling processes to set the timer slack value on other
    processes in order to save power by avoiding wakeups (something Android
    currently does via out-of-tree patches).

    The first patch tries to fix the internal timer_slack_ns usage which was
    defined as a long, which limits the slack range to ~4 seconds on 32bit
    systems. It converts it to a u64, which provides the same basically
    unlimited slack (500 years) on both 32bit and 64bit machines.

    The second patch introduces the /proc/<pid>/timerslack_ns interface
    which allows the full 64bit slack range for a task to be read or set on
    both 32bit and 64bit machines.

    With these two patches, on a 32bit machine, after setting the slack on
    bash to 10 seconds:

    $ time sleep 1

    real 0m10.747s
    user 0m0.001s
    sys 0m0.005s

    The first patch is a little ugly, since I had to chase the slack delta
    arguments through a number of functions converting them to u64s. Let me
    know if it makes sense to break that up more or not.

    Other than that things are fairly straightforward.

    This patch (of 2):

    The timer_slack_ns value in the task struct is currently an unsigned
    long. This means that for 32bit applications, the maximum slack is just
    over 4 seconds. However, on 64bit machines, it's much much larger (~500
    years).

    This disparity could make application development a little more difficult,
    so convert timer_slack_ns (as well as the default_slack) to a u64. This
    means both 32bit and 64bit systems have the same effective internal slack
    range.

    Now, the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specifies
    the interface as an unsigned long, so we preserve that limitation on 32bit
    systems: SET_TIMERSLACK can only set the slack to an unsigned long value,
    and GET_TIMERSLACK will return ULONG_MAX if the slack is actually larger
    than what can be stored in an unsigned long.
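
    A small user-space illustration of that existing prctl ABI (values chosen
    arbitrarily); on 32bit systems the value passed and returned is an unsigned
    long even though the kernel now stores a u64 internally:

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
        /* Ask for 50us of timer slack for the current thread. */
        if (prctl(PR_SET_TIMERSLACK, 50000UL, 0, 0, 0))
            perror("PR_SET_TIMERSLACK");

        /* PR_GET_TIMERSLACK returns the current slack as the return value. */
        int slack = prctl(PR_GET_TIMERSLACK, 0, 0, 0, 0);
        printf("current timer slack: %d ns\n", slack);
        return 0;
    }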

    This patch also modifies the hrtimer functions which specified the slack
    delta as an unsigned long.

    Signed-off-by: John Stultz
    Cc: Arjan van de Ven
    Cc: Thomas Gleixner
    Cc: Oren Laadan
    Cc: Ruchi Kandoi
    Cc: Rom Lemarchand
    Cc: Kees Cook
    Cc: Android Kernel Team
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    John Stultz
     

05 Nov, 2015

1 commit

  • Regardless of the previous CPU a timer was on, add_timer_on()
    currently simply sets timer->flags to the new CPU. As the caller must
    be seeing the timer as idle, this is locally fine, but the timer
    leaving the old base while unlocked can lead to race conditions as
    follows.

    Let's say timer was on cpu 0.

    cpu 0                                   cpu 1
    -----------------------------------------------------------------------------
    del_timer(timer) succeeds
                                            del_timer(timer)
                                            lock_timer_base(timer) locks cpu_0_base
    add_timer_on(timer, 1)
    spin_lock(&cpu_1_base->lock)
    timer->flags set to cpu_1_base
    operates on @timer                      operates on @timer

    This triggered with mod_delayed_work_on() which contains
    "if (del_timer()) add_timer_on()" sequence eventually leading to the
    following oops.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] detach_if_pending+0x69/0x1a0
    ...
    Workqueue: wqthrash wqthrash_workfunc [wqthrash]
    task: ffff8800172ca680 ti: ffff8800172d0000 task.ti: ffff8800172d0000
    RIP: 0010:[] [] detach_if_pending+0x69/0x1a0
    ...
    Call Trace:
    [] del_timer+0x44/0x60
    [] try_to_grab_pending+0xb6/0x160
    [] mod_delayed_work_on+0x33/0x80
    [] wqthrash_workfunc+0x61/0x90 [wqthrash]
    [] process_one_work+0x1e8/0x650
    [] worker_thread+0x4e/0x450
    [] kthread+0xef/0x110
    [] ret_from_fork+0x3f/0x70

    Fix it by updating add_timer_on() to perform proper migration as
    __mod_timer() does.
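
    A simplified sketch of that fix (helper names approximate, details omitted;
    not the exact patch): lock the base the timer currently belongs to, mark it
    TIMER_MIGRATING, and only then switch it to the target CPU's base, mirroring
    __mod_timer():

    void add_timer_on(struct timer_list *timer, int cpu)
    {
        struct tvec_base *new_base = per_cpu_ptr(&tvec_bases, cpu);
        struct tvec_base *base;
        unsigned long flags;

        base = lock_timer_base(timer, &flags);      /* old base, now locked */
        if (base != new_base) {
            timer->flags |= TIMER_MIGRATING;
            spin_unlock(&base->lock);
            base = new_base;
            spin_lock(&base->lock);
            /* publish the new CPU and clear MIGRATING in one store */
            WRITE_ONCE(timer->flags,
                       (timer->flags & ~TIMER_BASEMASK) | cpu);
        }
        internal_add_timer(base, timer);
        spin_unlock_irqrestore(&base->lock, flags);
    }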

    Reported-and-tested-by: Jeff Layton
    Signed-off-by: Tejun Heo
    Cc: Chris Worley
    Cc: bfields@fieldses.org
    Cc: Michael Skralivetsky
    Cc: Trond Myklebust
    Cc: Shaohua Li
    Cc: Jeff Layton
    Cc: kernel-team@fb.com
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20151029103113.2f893924@tlielax.poochiereds.net
    Link: http://lkml.kernel.org/r/20151104171533.GI5749@mtj.duckdns.org
    Signed-off-by: Thomas Gleixner

    Tejun Heo
     

12 Oct, 2015

1 commit

  • In apply_slack(), find_last_bit() is applied to a bitmask consisting
    of precisely BITS_PER_LONG bits. Since the mask is non-zero, we might as
    well eliminate the function call and use __fls() directly. On x86_64,
    this shaves 23 bytes off the only caller, mod_timer().

    This also gets rid of Coverity CID 1192106, but that is a false
    positive: Coverity is not aware that mask != 0 implies that
    find_last_bit will not return BITS_PER_LONG.
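
    A user-space illustration of the same replacement (stand-ins for the kernel
    helpers): when the word is known to be non-zero, finding its last set bit
    needs no out-of-line function call at all.

    #include <stdio.h>

    /* User-space equivalent of the kernel's __fls() for a non-zero word. */
    static unsigned long my_fls(unsigned long mask)
    {
        return (sizeof(mask) * 8 - 1) - __builtin_clzl(mask);
    }

    int main(void)
    {
        unsigned long mask = 0x0000f0f0UL;             /* non-zero by construction */

        printf("last set bit: %lu\n", my_fls(mask));   /* prints 15 */
        return 0;
    }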

    Signed-off-by: Rasmus Villemoes
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/1443771931-6284-1-git-send-email-linux@rasmusvillemoes.dk
    Signed-off-by: Thomas Gleixner

    Rasmus Villemoes
     

22 Sep, 2015

1 commit

  • timer_stats_account_timer() reads timer->start_site, then checks it
    for NULL and then re-reads it, while
    timer_stats_timer_clear_start_info() can concurrently reset
    timer->start_site to NULL. This should not lead to crashes, but can
    double the number of entries in timer stats, as start_site is used during
    comparison; the doubled entries will have a useless NULL start_site.

    Read timer->start_site only once in timer_stats_account_timer().
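
    A simplified user-space illustration of the pattern (types and helpers made
    up); the fix is simply to load the shared pointer once into a local:

    #include <stddef.h>

    struct stats_timer {
        void *start_site;          /* may be reset to NULL concurrently */
    };

    void record(void *site) { (void)site; /* stands in for the accounting */ }

    /* Racy: the field is read again after the NULL check. */
    void account_racy(struct stats_timer *t)
    {
        if (t->start_site == NULL)
            return;
        record(t->start_site);     /* a concurrent clear => may record NULL */
    }

    /* Fixed: read the field exactly once (the kernel would use READ_ONCE()). */
    void account_fixed(struct stats_timer *t)
    {
        void *site = *(void * volatile *)&t->start_site;

        if (site == NULL)
            return;
        record(site);
    }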

    The data race was found with KernelThreadSanitizer (KTSAN).

    Signed-off-by: Dmitry Vyukov
    Cc: andreyknvl@google.com
    Cc: glider@google.com
    Cc: kcc@google.com
    Cc: ktsan@googlegroups.com
    Cc: john.stultz@linaro.org
    Link: http://lkml.kernel.org/r/1442584463-69553-1-git-send-email-dvyukov@google.com
    Signed-off-by: Thomas Gleixner

    Dmitry Vyukov
     

18 Aug, 2015

1 commit

  • lock_timer_base() cannot prevent the following :

    CPU1 (in __mod_timer())
    timer->flags |= TIMER_MIGRATING;
    spin_unlock(&base->lock);
    base = new_base;
    spin_lock(&base->lock);
    // The next line clears TIMER_MIGRATING
    timer->flags &= ~TIMER_BASEMASK;
                                          CPU2 (in lock_timer_base())
                                          see timer base is cpu0 base
                                          spin_lock_irqsave(&base->lock, *flags);
                                          if (timer->flags == tf)
                                              return base; // oops, wrong base
    timer->flags |= base->cpu // too late

    We must write timer->flags in one go, otherwise we can fool other cpus.
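
    Sketch of the resulting pattern (simplified; not the full __mod_timer()
    hunk). The two separate stores are folded into a single one, so a concurrent
    lock_timer_base() can never observe flags with TIMER_MIGRATING cleared but
    the old CPU index still present:

    /* before: two separately visible updates */
    timer->flags &= ~TIMER_BASEMASK;
    /* ... */
    timer->flags |= base->cpu;

    /* after: one store publishing the final value */
    WRITE_ONCE(timer->flags,
               (timer->flags & ~TIMER_BASEMASK) | base->cpu);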

    Fixes: bc7a34b8b9eb ("timer: Reduce timer migration overhead if disabled")
    Signed-off-by: Eric Dumazet
    Cc: Jon Christopherson
    Cc: David Miller
    Cc: xen-devel@lists.xen.org
    Cc: david.vrabel@citrix.com
    Cc: Sander Eikelenboom
    Link: http://lkml.kernel.org/r/1439831928.32680.11.camel@edumazet-glaptop2.roam.corp.google.com
    Signed-off-by: Thomas Gleixner
    Cc: Thomas Gleixner

    Eric Dumazet
     

27 Jun, 2015

1 commit

  • The recent timer wheel rework removed the get/put_cpu_var() pair in
    the hotplug migration code, which results in:

    BUG: using smp_processor_id() in preemptible [00000000] code: hib.sh/2845
    ...
    [] timer_cpu_notify+0x53/0x12

    That hunk is a leftover from an earlier iteration and went unnoticed
    so far.

    Restore the previous code which was obviously correct.

    Fixes: 0eeda71bc30d 'timer: Replace timer base by a cpu index'
    Reported-and-tested-by: Borislav Petkov
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

19 Jun, 2015

7 commits

  • If nohz is disabled on the kernel command line the [hr]timer code
    still calls wake_up_nohz_cpu() and tick_nohz_full_cpu(), a pretty
    pointless exercise. Cache nohz_active in [hr]timer per cpu bases and
    avoid the overhead.
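
    In essence (sketch only, surrounding code omitted; nohz_active is the cached
    field described above), the hot path shrinks to a single local test:

    static void maybe_wake_nohz_cpu(struct tvec_base *base, int cpu)
    {
        if (!base->nohz_active)     /* nohz disabled: nothing to do */
            return;
        wake_up_nohz_cpu(cpu);
    }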

    Before:
    48.10% hog [.] main
    15.25% [kernel] [k] _raw_spin_lock_irqsave
    9.76% [kernel] [k] _raw_spin_unlock_irqrestore
    6.50% [kernel] [k] mod_timer
    6.44% [kernel] [k] lock_timer_base.isra.38
    3.87% [kernel] [k] detach_if_pending
    3.80% [kernel] [k] del_timer
    2.67% [kernel] [k] internal_add_timer
    1.33% [kernel] [k] __internal_add_timer
    0.73% [kernel] [k] timerfn
    0.54% [kernel] [k] wake_up_nohz_cpu

    After:
    48.73% hog [.] main
    15.36% [kernel] [k] _raw_spin_lock_irqsave
    9.77% [kernel] [k] _raw_spin_unlock_irqrestore
    6.61% [kernel] [k] lock_timer_base.isra.38
    6.42% [kernel] [k] mod_timer
    3.90% [kernel] [k] detach_if_pending
    3.76% [kernel] [k] del_timer
    2.41% [kernel] [k] internal_add_timer
    1.39% [kernel] [k] __internal_add_timer
    0.76% [kernel] [k] timerfn

    We probably should have a cached value for nohz full in the per cpu
    bases as well to avoid the cpumask check. The base cache line is hot
    already, the cpumask not necessarily.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Eric Dumazet
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224512.207378134@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Eric reported that the timer_migration sysctl is not really nice
    performance wise as it needs to check at every timer insertion whether
    the feature is enabled or not. Further the check does not live in the
    timer code, so we have an extra function call which checks an extra
    cache line to figure out that it is disabled.

    We can do better and store that information in the per cpu (hr)timer
    bases. I pondered to use a static key, but that's a nightmare to
    update from the nohz code and the timer base cache line is hot anyway
    when we select a timer base.

    The old logic enabled the timer migration unconditionally if
    CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
    line.

    With this modification, we start off with migration disabled. The user
    visible sysctl is still set to enabled. If the kernel switches to NOHZ
    mode, migration is enabled, provided the user did not disable it via the
    sysctl prior to the switch. If nohz=off is on the kernel command line,
    migration stays disabled no matter what.

    Before:
    47.76% hog [.] main
    14.84% [kernel] [k] _raw_spin_lock_irqsave
    9.55% [kernel] [k] _raw_spin_unlock_irqrestore
    6.71% [kernel] [k] mod_timer
    6.24% [kernel] [k] lock_timer_base.isra.38
    3.76% [kernel] [k] detach_if_pending
    3.71% [kernel] [k] del_timer
    2.50% [kernel] [k] internal_add_timer
    1.51% [kernel] [k] get_nohz_timer_target
    1.28% [kernel] [k] __internal_add_timer
    0.78% [kernel] [k] timerfn
    0.48% [kernel] [k] wake_up_nohz_cpu

    After:
    48.10% hog [.] main
    15.25% [kernel] [k] _raw_spin_lock_irqsave
    9.76% [kernel] [k] _raw_spin_unlock_irqrestore
    6.50% [kernel] [k] mod_timer
    6.44% [kernel] [k] lock_timer_base.isra.38
    3.87% [kernel] [k] detach_if_pending
    3.80% [kernel] [k] del_timer
    2.67% [kernel] [k] internal_add_timer
    1.33% [kernel] [k] __internal_add_timer
    0.73% [kernel] [k] timerfn
    0.54% [kernel] [k] wake_up_nohz_cpu

    Reported-by: Eric Dumazet
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Simplify the handling of the flag storage for the timer statistics. No
    intermediate storage anymore. Just hand over the flags field.

    I left the printout of 'deferrable' for now because changing this
    would be an ABI update and I have no idea how strongly people feel about
    that. OTOH, I wonder whether we should kill the whole timer stats
    stuff because all of that information can be retrieved via ftrace/perf
    as well.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Eric Dumazet
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224512.046626248@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Instead of storing a pointer to the per cpu tvec_base we can simply
    cache a CPU index in the timer_list and use that to get hold of the
    correct per cpu tvec_base. This is only used in lock_timer_base() and
    the slightly larger code is peanuts versus the spinlock operation and
    the d-cache footprint of the timer wheel.
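
    A sketch of that lookup (macro and helper names approximate, not the exact
    kernel code): the low bits of timer->flags hold the CPU index, and the base
    is simply the per-cpu structure of that CPU:

    static struct tvec_base *get_timer_base(unsigned int tflags)
    {
        return per_cpu_ptr(&tvec_bases, tflags & TIMER_CPUMASK);
    }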

    Aside of that this allows to get rid of following nuisances:

    - boot_tvec_base

    That statically allocated 4k bss data is just kept around so the
    timer has a home when it gets statically initialized. It serves no
    other purpose.

    With the CPU index we assign the timer to CPU0 at static
    initialization time and therefore can avoid the whole boot_tvec_base
    dance. That also simplifies the init code, which can just use the
    per cpu base.

    Before:
    text data bss dec hex filename
    17491 9201 4160 30852 7884 ../build/kernel/time/timer.o
    After:
    text data bss dec hex filename
    17440 9193 0 26633 6809 ../build/kernel/time/timer.o

    - Overloading the base pointer with various flags

    The CPU index has enough space to hold the flags (deferrable,
    irqsafe) so we can get rid of the extra masking and bit fiddling
    with the base pointer.

    As a benefit we reduce the size of struct timer_list on 64 bit
    machines by 4 - 8 bytes, a size reduction of up to 15% per struct
    timer_list, which is a real win as we have tons of them embedded in
    other structs.

    This changes also the newly added deferrable printout of the timer
    start trace point to capture and print all timer->flags, which allows
    us to decode the target cpu of the timer as well.

    We might have used bitfields for this, but that would change the
    static initializers and the init function, for no real gain, just to
    accommodate big endian bitfields.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Eric Dumazet
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Cc: Steven Rostedt
    Cc: Badhri Jagan Sridharan
    Link: http://lkml.kernel.org/r/20150526224511.950084301@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • This reduces the size of struct tvec_base by 50% and results in
    slightly smaller code as well.

    Before:
    struct tvec_base: size: 8256, cachelines: 129

    text data bss dec hex filename
    17698 13297 8256 39251 9953 ../build/kernel/time/timer.o

    After:
    struct tvec_base: size: 4160, cachelines: 65

    text data bss dec hex filename
    17491 9201 4160 30852 7884 ../build/kernel/time/timer.o

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Viresh Kumar
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Eric Dumazet
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224511.854731214@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The FIFO guarantee is only there if two timers are queued into the
    same bucket at the same jiffie on the same cpu:

    - The slack value depends on the delta between expiry and enqueue
    time, so the resulting expiry time can be different for timers
    which are queued in different jiffies.

    - Timers which are queued into the secondary array end up after a
    later queued timer which was queued into the primary array due to
    cascading.

    - Timers can end up on different cpus due to the NOHZ target moving
    around. Obviously there is no guarantee of expiry ordering between
    cpus.

    So anything which relies on FIFO behaviour of the timer wheel is
    broken already.

    This is a preparatory patch for converting the timer wheel to hlist
    which reduces the memory footprint of the wheel by 50%.

    It's a separate patch so any (unlikely to happen) regression caused by
    this can be identified clearly.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Viresh Kumar
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Eric Dumazet
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Cc: George Spelvin
    Link: http://lkml.kernel.org/r/20150526224511.757520403@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • catchup_timer_jiffies() has been applied blindly to several functions
    without looking for possible better ways to do it.

    1) internal_add_timer()

    Move the update to base->all_timers before we actually insert the
    timer into the wheel.

    2) detach_if_pending()

    Again the update to base->all_timers allows us to explicitly do
    the timer_jiffies update in place, if this was the last timer which
    got removed.

    3) __run_timers()

    We only check on entry, which is silly, because base->timer_jiffies
    can be behind - especially on NOHZ kernels - and if there is a
    single deferrable timer somewhere between base->timer_jiffies and
    jiffies we expire it and then loop until base->timer_jiffies ==
    jiffies.

    Move it into the loop.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Cc: Frederic Weisbecker
    Cc: Eric Dumazet
    Cc: Viresh Kumar
    Cc: John Stultz
    Cc: Joonwoo Park
    Cc: Wenbo Wang
    Link: http://lkml.kernel.org/r/20150526224511.662994644@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

23 May, 2015

1 commit

  • The timer_start event now shows whether the timer is
    deferrable in case of a low-res timer. The debug_activate
    function now includes a deferrable flag while calling
    the trace_timer_start event.

    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Acked-by: Steven Rostedt
    Signed-off-by: Badhri Jagan Sridharan
    [jstultz: Fixed minor whitespace and grammar tweaks
    pointed out by Ingo]
    Signed-off-by: John Stultz

    Badhri Jagan Sridharan
     

05 May, 2015

1 commit

  • At present, internal_add_timer() examines flags with 'base', which doesn't
    contain flags. Examine 'timer->base' instead, to avoid unnecessarily waking
    up a nohz CPU when the timer base has TIMER_DEFERRABLE set.

    Signed-off-by: Joonwoo Park
    Cc: sboyd@codeaurora.org
    Cc: skannan@codeaurora.org
    Cc: John Stultz
    Link: http://lkml.kernel.org/r/1430187709-21087-1-git-send-email-joonwoop@codeaurora.org
    Signed-off-by: Thomas Gleixner

    Joonwoo Park
     

22 Apr, 2015

4 commits

  • do_usleep_range() and schedule_hrtimeout_range() are __sched as
    well. So it makes no sense to have the exported function in a
    different section.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20150414203503.833709502@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The only user ignores it anyway and rightfully so.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20150414203503.756060258@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The evaluation of the next timer in the nohz code is based on jiffies
    while all the tick internals are nanosecond based. We also have to
    convert hrtimer nanoseconds to jiffies in the !highres case. That's
    just wrong and introduces interesting corner cases.

    Turn it around and convert the next timer wheel timer expiry and the
    rcu event to clock monotonic and base all calculations on
    nanoseconds. That identifies the case where no timer is pending
    clearly with an absolute expiry value of KTIME_MAX.

    Makes the code more readable and gets rid of the jiffies magic in the
    nohz code.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Paul E. McKenney
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Cc: Josh Triplett
    Cc: Lai Jiangshan
    Cc: John Stultz
    Cc: Marcelo Tosatti
    Link: http://lkml.kernel.org/r/20150414203502.184198593@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • hrtimer softirq is a leftover from the initial implementation and
    serves only the purpose to handle the enqueueing of already expired
    timers in the high resolution timer mode. We discussed whether we
    change the return value and force all start sites to handle that the
    timer is already expired, but that would be a Herculean task and I'm
    not sure whether it's a good idea to enforce that handling on
    everyone.

    A simpler solution is to enforce a timer interrupt instead of raising
    and scheduling a softirq. Just use the existing infrastructure to do
    so and remove all the softirq leftovers.

    The HRTIMER softirq enum is now unused, but kept around because trace
    parsers rely on the existing numbering.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20150414203501.840834708@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

02 Apr, 2015

3 commits

  • Remove one CONFIG_HOTPLUG_CPU #ifdef in trade for introducing one
    CONFIG_SMP #ifdef.

    The CONFIG_SMP ifdef avoids declaring the per-CPU __tvec_bases storage
    on UP systems since they already have boot_tvec_bases.

    Also (re)add a runtime check on the base alignment -- for the paranoid
    amongst us :-)

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/fdd2d35e169bdc554ffa3fe77f77716298c75ada.1427814611.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There is no need to call init_timers_cpu() on every CPU hotplug event,
    there is not much we need to reset.

    - Timer-lists are already empty at the end of migrate_timers().
    - timer_jiffies will be refreshed while adding a new timer, after the
    CPU is online again.
    - active_timers and all_timers can be reset from migrate_timers().

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/54a1c30ea7b805af55beb220cadf5a07a21b0a4d.1427814611.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     
  • Memory for the 'tvec_base' array is allocated separately for the boot CPU (statically)
    and non-boot CPUs (dynamically).

    The reason is because __TIMER_INITIALIZER() needs to set ->base to a
    valid pointer (because we've made NULL special, hint: lock_timer_base())
    and we cannot get a compile time pointer to per-cpu entries because we
    don't know where we'll map the section, even for the boot cpu.

    This can be simplified a bit by statically allocating per-cpu memory.
    The only disadvantage is that memory for one of the structures will stay
    unused, i.e. for the boot CPU, which uses boot_tvec_bases.

    This will also guarantee that tvec_base is cacheline aligned. Even
    though tvec_base has ____cacheline_aligned stuck on, kzalloc_node() does
    not actually respect that (but guarantees a minimum u64 alignment).
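
    In essence the change boils down to a single static per-cpu definition
    (sketch; the variable name follows the surrounding text), laid out at build
    time so the struct's own cacheline alignment attribute is honoured and no
    runtime allocator is involved:

    static DEFINE_PER_CPU(struct tvec_base, tvec_bases);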

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/17cdf560f2727f687ab159707d0aa591f8a2f82d.1427814611.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Nov, 2014

1 commit

  • The "cpu" argument was kept around on the off-chance that RCU might
    offload scheduler-clock interrupts. However, this offload approach
    has been replaced by NO_HZ_FULL, which offloads -all- RCU processing
    from qualifying CPUs. It is therefore time to remove the "cpu" argument
    to rcu_check_callbacks(), which this commit does.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Pranith Kumar

    Paul E. McKenney
     

15 Oct, 2014

1 commit

  • Pull percpu consistent-ops changes from Tejun Heo:
    "Way back, before the current percpu allocator was implemented, static
    and dynamic percpu memory areas were allocated and handled separately
    and had their own accessors. The distinction has been gone for many
    years now; however, the now duplicate two sets of accessors remained
    with the pointer based ones - this_cpu_*() - evolving various other
    operations over time. During the process, we also accumulated other
    inconsistent operations.

    This pull request contains Christoph's patches to clean up the
    duplicate accessor situation. __get_cpu_var() uses are replaced with
    with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

    Unfortunately, the former sometimes is tricky thanks to C being a bit
    messy with the distinction between lvalues and pointers, which led to
    a rather ugly solution for cpumask_var_t involving the introduction of
    this_cpu_cpumask_var_ptr().

    This converts most of the uses but not all. Christoph will follow up
    with the remaining conversions in this merge window and hopefully
    remove the obsolete accessors"

    * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
    irqchip: Properly fetch the per cpu offset
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
    ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
    Revert "powerpc: Replace __get_cpu_var uses"
    percpu: Remove __this_cpu_ptr
    clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
    sparc: Replace __get_cpu_var uses
    avr32: Replace __get_cpu_var with __this_cpu_write
    blackfin: Replace __get_cpu_var uses
    tile: Use this_cpu_ptr() for hardware counters
    tile: Replace __get_cpu_var uses
    powerpc: Replace __get_cpu_var uses
    alpha: Replace __get_cpu_var
    ia64: Replace __get_cpu_var uses
    s390: cio driver &__get_cpu_var replacements
    s390: Replace __get_cpu_var uses
    mips: Replace __get_cpu_var uses
    MIPS: Replace __get_cpu_var uses in FPU emulator.
    arm: Replace __this_cpu_ptr with raw_cpu_ptr
    ...

    Linus Torvalds