03 Oct, 2014

1 commit

  • On 32 bit systems cmpxchg cannot handle 64 bit values, so
    some additional magic is required to allow a 32 bit system
    with CONFIG_VIRT_CPU_ACCOUNTING_GEN=y to build.

    Make sure the correct cmpxchg function is used when doing
    an atomic swap of a cputime_t.
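
    A minimal sketch of the kind of size dispatch this describes (the
    cmpxchg_cputime() helper name is illustrative, not necessarily what the
    patch introduces); cmpxchg64() is the kernel's existing 64-bit variant:

        #ifdef CONFIG_VIRT_CPU_ACCOUNTING_GEN
        /* nsec based cputime_t is u64: too wide for a 32-bit cmpxchg() */
        # define cmpxchg_cputime(ptr, old, new)  cmpxchg64(ptr, old, new)
        #else
        /* jiffies based cputime_t is an unsigned long: plain cmpxchg() works */
        # define cmpxchg_cputime(ptr, old, new)  cmpxchg(ptr, old, new)
        #endif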

    Reported-by: Arnd Bergmann
    Signed-off-by: Rik van Riel
    Acked-by: Arnd Bergmann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Cc: oleg@redhat.com
    Cc: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: linux390@de.ibm.com
    Cc: linux-arch@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-s390@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140930155947.070cdb1f@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

19 Sep, 2014

1 commit

  • The sig->stats_lock nests inside the tasklist_lock and the
    sighand->siglock in __exit_signal and wait_task_zombie.

    However, both of those locks can be taken from irq context,
    which means we need to use the interrupt safe variant of
    read_seqbegin_or_lock. This blocks interrupts when the "lock"
    branch is taken (seq is odd), preventing the lock inversion.

    On the first (lockless) pass through the loop, irqs are not
    blocked.
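
    A sketch of the resulting read-side pattern (adapted from the description
    above; the *_irqsave/_irqrestore helper names stand for the interrupt-safe
    variants mentioned here and are assumptions about their exact spelling):

        unsigned int seq, nextseq;
        unsigned long flags;

        rcu_read_lock();
        nextseq = 0;                    /* first pass: lockless, irqs stay enabled */
        do {
                seq = nextseq;
                flags = read_seqbegin_or_lock_irqsave(&sig->stats_lock, &seq);
                /* ... sum up the per-thread and already-exited statistics ... */
                nextseq = 1;            /* if we must retry: take the lock, irqs off */
        } while (need_seqretry(&sig->stats_lock, seq));
        done_seqretry_irqrestore(&sig->stats_lock, seq, flags);
        rcu_read_unlock();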

    Reported-by: Stanislaw Gruszka
    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: prarit@redhat.com
    Cc: oleg@redhat.com
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1410527535-9814-3-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

08 Sep, 2014

2 commits

  • The functions task_cputime_adjusted() and thread_group_cputime_adjusted()
    can be called locklessly, as well as concurrently on many different CPUs.

    This can occasionally lead to the utime and stime reported by times(), and
    other syscalls like it, going backward. The cause appears to be
    multiple threads racing in cputime_adjust(), each with a utime or stime
    value larger than the original, but different from one another.

    Sometimes the larger value gets saved first, only to be immediately
    overwritten with a smaller value by another thread.

    Using atomic exchange prevents that problem, and ensures time
    progresses monotonically.
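
    A sketch of that technique (the helper name and the plain cmpxchg() call
    are illustrative; on 32-bit with a 64-bit cputime_t this needs the
    cmpxchg variant discussed in the 03 Oct, 2014 entry above):

        static void cputime_advance(cputime_t *counter, cputime_t new)
        {
                cputime_t old;

                /* Move *counter forward to new, but never backward. */
                while (new > (old = READ_ONCE(*counter))) {
                        if (cmpxchg(counter, old, new) == old)
                                break;  /* we won the race */
                }
        }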

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: akpm@linux-foundation.org
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Cc: oleg@redhat.com
    Link: http://lkml.kernel.org/r/1408133138-22048-4-git-send-email-riel@redhat.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • Both times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID) have scalability
    issues on large systems, due to both functions being serialized with a
    lock.

    The lock protects against reporting a wrong value when a thread in the
    task group exits and its statistics are folded into the signal struct,
    which could otherwise lead to that exited task's statistics being
    counted twice (or not at all).

    Protecting that with a lock results in times() and clock_gettime() being
    completely serialized on large systems.

    This can be fixed by using a seqlock around the events that gather and
    propagate statistics. As an additional benefit, the protection code can
    be moved into thread_group_cputime(), slightly simplifying the calling
    functions.

    In the case of posix_cpu_clock_get_task() things can be simplified a
    lot, because the calling function already ensures that the task sticks
    around, and the rest is now taken care of in thread_group_cputime().

    This way the statistics reporting code can run lockless.
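
    A sketch of the write side this implies (simplified; the fields follow
    the signal_struct accounting described above, the local variables are
    illustrative):

        /* e.g. when an exiting thread's times are folded into the signal struct */
        write_seqlock(&sig->stats_lock);
        sig->utime += utime;
        sig->stime += stime;
        sig->sum_sched_runtime += sum_exec_runtime;
        write_sequnlock(&sig->stats_lock);

    Lockless readers that race with this bump of the sequence count simply
    retry (or take the lock), instead of serializing every times() call.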

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alex Thorlton
    Cc: Andrew Morton
    Cc: Daeseok Youn
    Cc: David Rientjes
    Cc: Dongsheng Yang
    Cc: Geert Uytterhoeven
    Cc: Guillaume Morin
    Cc: Ionut Alexa
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Li Zefan
    Cc: Michal Hocko
    Cc: Michal Schmidt
    Cc: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: umgwanakikbuti@gmail.com
    Cc: fweisbec@gmail.com
    Cc: srao@redhat.com
    Cc: lwoodman@redhat.com
    Cc: atheurer@redhat.com
    Link: http://lkml.kernel.org/r/20140816134010.26a9b572@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

20 Aug, 2014

1 commit

  • Change thread_group_cputime() to use for_each_thread() instead of
    buggy while_each_thread(). This also makes the pid_alive() check
    unnecessary.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Hidetoshi Seto
    Cc: Frank Mayhar
    Cc: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Sanjay Rao
    Cc: Larry Woodman
    Cc: Rik van Riel
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140813192000.GA19327@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

07 May, 2014

1 commit

  • Russell reported that irqtime_account_idle_ticks() takes ages due to:

        for (i = 0; i < ticks; i++)
                irqtime_account_process_tick(current, 0, rq);

    It's sad that this code was written way _AFTER_ the NOHZ idle
    functionality was available. I charge myself guilty for not paying
    attention when that crap got merged with commit abb74cefa ("sched:
    Export ns irqtimes through /proc/stat").

    So instead of looping nr_ticks times, just apply the whole thing at
    once.
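
    A sketch of the shape of the fix (the extra ticks argument reflects the
    changelog's description; the exact signature in the patch may differ):

        static void irqtime_account_idle_ticks(int ticks)
        {
                struct rq *rq = this_rq();

                /* charge all pending ticks in one call */
                irqtime_account_process_tick(current, 0, rq, ticks);
        }

    ... with irqtime_account_process_tick() multiplying cputime_one_jiffy by
    the number of ticks instead of being called once per tick.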

    As a side note: The whole cputime_t vs. u64 business in that context
    wants to be cleaned up as well. There is no point in having all these
    back and forth conversions. Let's standardise on u64 nsec for all
    kernel internal accounting and be done with it. Everything else does
    not make sense at all for fine grained accounting. Frederic, can you
    please take care of that?

    Reported-by: Russell King
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Cc: Venkatesh Pallipadi
    Cc: Shaun Ruffell
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1405022307000.6261@ionos.tec.linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

02 Apr, 2014

1 commit

  • Pull timer updates from Ingo Molnar:
    "The main purpose is to fix a full dynticks bug related to
    virtualization, where steal time accounting appears to be zero in
    /proc/stat even after a few seconds of competing guests running busy
    loops on the same host CPU. It's not a regression, though, as it has
    been there since the beginning.

    The other commits are preparatory work to fix the bug and various
    cleanups"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    arch: Remove stub cputime.h headers
    sched: Remove needless round trip nsecs tick conversion of steal time
    cputime: Fix jiffies based cputime assumption on steal accounting
    cputime: Bring cputime -> nsecs conversion
    cputime: Default implementation of nsecs -> cputime conversion
    cputime: Fix nsecs_to_cputime() return type cast

    Linus Torvalds
     

13 Mar, 2014

1 commit

  • The guest steal time accounting code assumes that cputime_t is based on
    jiffies. So when CONFIG_NO_HZ_FULL=y, which implies that cputime_t
    is based on nsecs, steal_account_process_tick() passes the delta in
    jiffies to account_steal_time() which then accounts it as if it's a
    value in nsecs.

    As a result, accounting 1 second of steal time (with HZ=100 that would
    be 100 jiffies) is spuriously accounted as 100 nsecs.

    As such /proc/stat may report 0 values of steal time even when two
    guests have run concurrently for a few seconds on the same host and
    same CPU.

    In order to fix this, let's convert the nsecs based steal delta to
    cputime instead of jiffies by using the right conversion API.

    Given that the steal time is stored in cputime_t and this type can have
    a coarser granularity than nsecs, we only account the rounded converted
    value and leave the remaining nsecs for the next deltas.
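
    A sketch of the accounting step described (paravirt_steal_clock() is the
    existing steal clock interface; the conversion helpers are the ones listed
    in the 02 Apr, 2014 pull request above; details are illustrative):

        u64 steal;
        cputime_t steal_ct;

        steal = paravirt_steal_clock(smp_processor_id());
        steal -= this_rq()->prev_steal_time;

        /* account only whole cputime units; the nsec remainder stays in
         * prev_steal_time and is picked up by the next delta */
        steal_ct = nsecs_to_cputime(steal);
        this_rq()->prev_steal_time += cputime_to_nsecs(steal_ct);

        account_steal_time(steal_ct);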

    Reported-by: Huiqingding
    Reported-by: Marcelo Tosatti
    Cc: Ingo Molnar
    Cc: Marcelo Tosatti
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Acked-by: Rik van Riel
    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

09 Feb, 2014

1 commit

  • Since the patch "sched: Move the priority specific bits into a new header file"
    exposes the priority related macros in linux/sched/prio.h, we no longer have to
    implement task_nice() in kernel/sched/core.c.

    This patch implements it in linux/sched/sched.h as a static inline function,
    saving kernel stack and enhancing performance a bit.
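
    The resulting helper is essentially a one-liner on top of the macros now
    exported by linux/sched/prio.h (sketch):

        static inline int task_nice(const struct task_struct *p)
        {
                return PRIO_TO_NICE((p)->static_prio);
        }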

    Signed-off-by: Dongsheng Yang
    Cc: clark.williams@gmail.com
    Cc: rostedt@goodmis.org
    Cc: raistlin@linux.it
    Cc: juri.lelli@gmail.com
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1390878045-7096-1-git-send-email-yangds.fnst@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Dongsheng Yang
     


05 Sep, 2013

1 commit

  • Pull timers/nohz changes from Ingo Molnar:
    "It mostly contains fixes and full dynticks off-case optimizations, by
    Frederic Weisbecker"

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    nohz: Include local CPU in full dynticks global kick
    nohz: Optimize full dynticks's sched hooks with static keys
    nohz: Optimize full dynticks state checks with static keys
    nohz: Rename a few state variables
    vtime: Always debug check snapshot source _before_ updating it
    vtime: Always scale generic vtime accounting results
    vtime: Optimize full dynticks accounting off case with static keys
    vtime: Describe overriden functions in dedicated arch headers
    m68k: hardirq_count() only need preempt_mask.h
    hardirq: Split preempt count mask definitions
    context_tracking: Split low level state headers
    vtime: Fix racy cputime delta update
    vtime: Remove a few unneeded generic vtime state checks
    context_tracking: User/kernel broundary cross trace events
    context_tracking: Optimize context switch off case with static keys
    context_tracking: Optimize guest APIs off case with static key
    context_tracking: Optimize main APIs off case with static key
    context_tracking: Ground setup for static key use
    context_tracking: Remove full dynticks' hacky dependency on wide context tracking
    nohz: Only enable context tracking on full dynticks CPUs
    ...

    Linus Torvalds
     

04 Sep, 2013

1 commit

  • scale_stime() silently assumes that stime < rtime. Otherwise, when
    stime == rtime and both values are big enough (operations
    on them do not fit in 32 bits), the resulting scaled stime can
    be bigger than rtime. As a consequence, utime = rtime - stime
    results in a negative value.

    User space visible symptoms of the bug are overflowed TIME
    values on ps/top, for example:

    $ ps aux | grep rcu
    root 8 0.0 0.0 0 0 ? S 12:42 0:00 [rcuc/0]
    root 9 0.0 0.0 0 0 ? S 12:42 0:00 [rcub/0]
    root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt]
    root 11 0.1 0.0 0 0 ? S 12:42 0:02 [rcuop/0]
    root 12 62422329 0.0 0 0 ? S 12:42 21114581:35 [rcuop/1]
    root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt]

    or overflowed utime values read directly from /proc/$PID/stat

    Reference:

    https://lkml.org/lkml/2013/8/20/259
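
    One way to sidestep the corner case (a hedged sketch of the kind of guard
    involved, not necessarily the exact patch) is to special-case the
    degenerate splits before scaling, so the scaling never runs with
    utime == 0 (the stime == rtime case above):

        if (utime == 0) {
                stime = rtime;          /* everything was system time */
        } else if (stime == 0) {
                utime = rtime;          /* everything was user time */
        } else {
                stime = scale_stime(stime, rtime, stime + utime);
                utime = rtime - stime;
        }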

    Reported-and-tested-by: Sergey Senozhatsky
    Signed-off-by: Stanislaw Gruszka
    Cc: stable@vger.kernel.org
    Cc: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Paul E. McKenney
    Cc: Borislav Petkov
    Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

16 Aug, 2013

1 commit

  • Using a this_cpu() operation reduces the number of instructions needed
    for accounting (account_user_time()) and frees up some registers. This is in
    the scheduler tick hotpath.
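
    A sketch of the kind of change this refers to (kernel_cpustat is the
    existing per-CPU statistics structure; treat the exact line as
    illustrative):

        /* before: fetch the per-CPU pointer, then modify through it */
        kcpustat_this_cpu->cpustat[index] += tmp;

        /* after: a single per-CPU read-modify-write, no pointer materialized */
        __this_cpu_add(kernel_cpustat.cpustat[index], tmp);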

    Signed-off-by: Christoph Lameter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/00000140596dd165-338ff7f5-893b-4fec-b251-aaac5557239e-000000@email.amazonses.com
    Signed-off-by: Ingo Molnar

    Christoph Lameter
     

14 Aug, 2013

6 commits

  • The vtime delta update performed by get_vtime_delta() always checks
    that the source of the snapshot is valid.

    Meanwhile, the snapshot updaters that rely on get_vtime_delta() also
    set the new snapshot origin. But some of them do this right before
    the call to get_vtime_delta(), making its debug check useless.

    This is easily fixed by moving the snapshot origin update after
    the call to get_vtime_delta(). The order doesn't matter there.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     
  • The cputime accounting in full dynticks can be a subtle
    mix of CPUs using tick based accounting and others using
    generic vtime.

    As long as the tick has a share in producing these stats, we
    want to scale the result against CFS's precise accounting, as the tick
    can miss tasks hiding between the periodic interrupts.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     
  • If no CPU is in the full dynticks range, we can avoid the full
    dynticks cputime accounting through generic vtime, along with its
    overhead, and use the traditional tick based accounting instead.

    Let's do this and turn the off case into a no-op with static keys.
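
    A generic sketch of the static key pattern (the names here are
    illustrative, not the actual vtime symbols): while the key stays false
    the branch is patched out, so the off case costs about one no-op in the
    hot path.

        static struct static_key vtime_key = STATIC_KEY_INIT_FALSE;    /* illustrative */

        static inline void account_hook(struct task_struct *tsk)
        {
                if (static_key_false(&vtime_key))       /* patched out when disabled */
                        slow_vtime_accounting(tsk);     /* illustrative slow path */
        }

        static void __init vtime_key_enable(void)
        {
                /* flipped once during boot if a nohz_full= range was configured */
                static_key_slow_inc(&vtime_key);
        }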

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     
  • get_vtime_delta() must be called under the task vtime_seqlock
    with the code that does the cputime accounting flush.

    Otherwise the cputime reader can be fooled and run into
    a race where it sees the snapshot update but misses the
    cputime flush. As a result it can report a cputime that is
    way too short.

    Fix vtime_account_user(), which wasn't complying with that rule.
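
    A sketch of what complying with that rule looks like for
    vtime_account_user() (simplified; the field names follow the vtime code
    of that era and the exact accounting calls may differ):

        void vtime_account_user(struct task_struct *tsk)
        {
                cputime_t delta_cpu;

                write_seqlock(&tsk->vtime_seqlock);
                delta_cpu = get_vtime_delta(tsk);       /* snapshot update ... */
                tsk->vtime_snap_whence = VTIME_SYS;
                account_user_time(tsk, delta_cpu, cputime_to_scaled(delta_cpu));
                write_sequnlock(&tsk->vtime_seqlock);   /* ... and flush, atomic
                                                           with respect to readers */
        }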

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     
  • Some generic vtime APIs check if the vtime accounting
    is enabled on the local CPU before doing their work.

    Some of these checks are not needed because all their callers already
    take care of that. Let's remove them.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     
  • Optimize the guest entry/exit APIs with static keys. This minimizes
    the overhead for those who enable CONFIG_NO_HZ_FULL without
    always using it. With no range passed to nohz_full=, the probe
    overhead is kept to a minimum.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     

13 Aug, 2013

1 commit

  • Update a stale comment from the old vtime era and document some
    locking that might be non-obvious.

    Signed-off-by: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Cc: Kevin Hilman

    Frederic Weisbecker
     

01 Jul, 2013

1 commit

  • Merge in a recent upstream commit:

    c2853c8df57f include/linux/math64.h: add div64_ul()

    because:

    72a4cf20cb71 sched: Change cfs_rq load avg to unsigned long

    relies on it.

    [ We don't rebase sched/core for this, because the handful of
    followup commits after the broken commit are not behavioral
    changes so are unlikely to be needed during bisection. ]

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

31 May, 2013

1 commit

  • While computing the cputime delta of dynticks CPUs,
    we are mixing up clocks of different natures:

    * local_clock(), which takes care of unstable clock
    sources and fixes them up if needed.

    * sched_clock(), which is the weaker version of
    local_clock(). It doesn't compute any fixup in case
    of an unstable source.

    If the clock source is stable, those two clocks are the
    same and we can safely compute the difference between
    two arbitrary points.

    Otherwise it results in random deltas, as sched_clock()
    can randomly drift away, backward or forward, from local_clock().

    As a consequence, some strange behaviour with an unstable TSC
    has been observed, such as cputime stuck at a constant zero
    (the 'top' command showing no load).

    Fix this by only using local_clock(), or its irq-safe/remote
    equivalent, in vtime code.

    Reported-by: Mike Galbraith
    Suggested-by: Mike Galbraith
    Cc: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Li Zhong
    Cc: Mike Galbraith
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     


01 May, 2013

3 commits

  • Dave Hansen reported strange utime/stime values on his system:
    https://lkml.org/lkml/2013/4/4/435

    This happens because the prev->stime value is bigger than the rtime
    value. The root of the problem is non-monotonic rtime values (i.e.
    the current rtime is smaller than the previous rtime), and that should
    be debugged and fixed.

    But since the problem did not manifest itself before commit
    62188451f0d63add7ad0cd2a1ae269d600c1663d ("cputime: Avoid
    multiplication overflow on utime scaling"), it should be treated
    as a regression, which we can easily fix in the cputime_adjust()
    function.

    For now, let's apply this fix, but further work is needed to address
    the root of the problem.

    Reported-and-tested-by: Dave Hansen
    Cc: # 3.9+
    Signed-off-by: Stanislaw Gruszka
    Cc: Frederic Weisbecker
    Cc: rostedt@goodmis.org
    Cc: Linus Torvalds
    Cc: Dave Hansen
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1367314507-9728-3-git-send-email-sgruszka@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • Due to rounding in scale_stime(), for big numbers, scaled stime
    values will grow in chunks. Since rtime grows in jiffies and we
    calculate utime like below:

        prev->stime = max(prev->stime, stime);
        prev->utime = max(prev->utime, rtime - prev->stime);

    we could erroneously account stime values as utime. To prevent
    that, only update the prev->{u,s}time values when they are smaller
    than the current rtime.
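
    A sketch of that guard (simplified from the description above; the exact
    condition in the patch may differ):

        /* only update the exported values if real execution time moved
         * past what was already reported; otherwise keep the old split */
        if (prev->utime + prev->stime < rtime) {
                prev->stime = max(prev->stime, stime);
                prev->utime = max(prev->utime, rtime - prev->stime);
        }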

    Signed-off-by: Stanislaw Gruszka
    Cc: Frederic Weisbecker
    Cc: rostedt@goodmis.org
    Cc: Linus Torvalds
    Cc: Dave Hansen
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1367314507-9728-2-git-send-email-sgruszka@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • This patch adds Linus's cputime scaling algorithm to the
    kernel.

    It is a follow-up (well, a fix) to commit
    d9a3c9823a2e6a543eb7807fb3d15d8233817ec5 ("sched: Lower chances
    of cputime scaling overflow"), which tried to avoid
    multiplication overflow but did not guarantee that the overflow
    would not happen.

    Linus created a different algorithm, which completely avoids the
    multiplication overflow by dropping precision when the numbers are
    big.

    I tested it and it gives a good relative error for the
    scaled numbers. The testing method is described here:
    http://marc.info/?l=linux-kernel&m=136733059505406&w=2
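
    A compact reconstruction of the idea (not the verbatim kernel code):
    halve a numerator factor together with the denominator until every
    operand fits in 32 bits, so the final multiply is a 32x32->64 that
    cannot overflow and only low-order precision is lost.

        static u64 scale_stime_sketch(u64 stime, u64 rtime, u64 total)
        {
                /* each step divides one numerator factor and the denominator
                 * by two, which (roughly) preserves stime * rtime / total */
                while (stime >> 32 || total >> 32) {
                        stime >>= 1;
                        total >>= 1;
                }
                while (rtime >> 32) {
                        rtime >>= 1;
                        total >>= 1;
                }
                if (!total)
                        return rtime;   /* degenerate, avoid dividing by zero */

                return div_u64((u64)(u32)stime * (u32)rtime, (u32)total);
        }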

    Originally-From: Linus Torvalds
    Signed-off-by: Stanislaw Gruszka
    Cc: Frederic Weisbecker
    Cc: rostedt@goodmis.org
    Cc: Dave Hansen
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130430151441.GC10465@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

30 Apr, 2013

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this development cycle were:

    - full dynticks preparatory work by Frederic Weisbecker

    - factor out the cpu time accounting code better, by Li Zefan

    - multi-CPU load balancer cleanups and improvements by Joonsoo Kim

    - various smaller fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    sched: Fix init NOHZ_IDLE flag
    sched: Prevent to re-select dst-cpu in load_balance()
    sched: Rename load_balance_tmpmask to load_balance_mask
    sched: Move up affinity check to mitigate useless redoing overhead
    sched: Don't consider other cpus in our group in case of NEWLY_IDLE
    sched: Explicitly cpu_idle_type checking in rebalance_domains()
    sched: Change position of resched_cpu() in load_balance()
    sched: Fix wrong rq's runnable_avg update with rt tasks
    sched: Document task_struct::personality field
    sched/cpuacct/UML: Fix header file dependency bug on the UML build
    cgroup: Kill subsys.active flag
    sched/cpuacct: No need to check subsys active state
    sched/cpuacct: Initialize cpuacct subsystem earlier
    sched/cpuacct: Initialize root cpuacct earlier
    sched/cpuacct: Allocate per_cpu cpuusage for root cpuacct statically
    sched/cpuacct: Clean up cpuacct.h
    sched/cpuacct: Remove redundant NULL checks in cpuacct_acount_field()
    sched/cpuacct: Remove redundant NULL checks in cpuacct_charge()
    sched/cpuacct: Add cpuacct_acount_field()
    sched/cpuacct: Add cpuacct_init()
    ...

    Linus Torvalds
     


08 Apr, 2013

1 commit

  • Recent commit 6fac4829 ("cputime: Use accessors to read task
    cputime stats") introduced a bug where we account the cputime of
    the first thread many times, instead of the cputimes of all the
    different threads.

    Signed-off-by: Stanislaw Gruszka
    Acked-by: Frederic Weisbecker
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20130404085740.GA2495@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

14 Mar, 2013

1 commit

  • Some users have reported that after running a process with
    hundreds of threads on intensive CPU-bound loads, the cputime
    of the group started to freeze after a few days.

    This is due to how we scale the tick-based cputime against
    the scheduler precise execution time value.

    We add the values of all threads in the group and we multiply
    that against the sum of the scheduler exec runtime of the whole
    group.

    This easily overflows after a few days/weeks of execution.

    A proposed solution to solve this was to compute that multiplication
    on stime instead of utime:
    62188451f0d63add7ad0cd2a1ae269d600c1663d
    ("cputime: Avoid multiplication overflow on utime scaling")

    The rationale behind that was that it's easy for a thread to
    spend most of its time in userspace under an intensive CPU-bound workload,
    but it's much harder to do an intensive CPU-bound long run in the kernel.

    This postulate got defeated when a user recently reported he was still
    seeing cputime freezes after the above patch. The workload that
    triggers this issue relates to intensive networking workloads where
    most of the cputime is consumed in the kernel.

    To further reduce the opportunities for multiplication overflow,
    let's reduce the multiplication factors to the remainders of the division
    between sched exec runtime and cputime. Assuming the difference between
    these shouldn't ever be that large, this should work in many situations.
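
    The identity behind that rework, as a sketch (it narrows the overflow
    window rather than eliminating it, which is exactly the caveat above):

        /* stime * rtime / total
         *     == stime * (rtime / total) + stime * (rtime % total) / total
         * so the factors that get multiplied are a (hopefully small)
         * quotient and a remainder smaller than total. */
        static u64 scale_stime_by_parts(u64 stime, u64 rtime, u64 total)
        {
                u64 quot, rem;

                if (!total)
                        return rtime;

                quot = div64_u64(rtime, total);
                rem  = rtime - quot * total;            /* rtime % total */

                return stime * quot + div64_u64(stime * rem, total);
        }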

    This gets the same results as the upstream scaling code except for
    a small difference: the upstream code always rounds the result down to
    the nearest integer not greater than the precise value, while the new
    code rounds to the nearest integer, which may be either greater or
    smaller. In practice this difference probably doesn't matter, but it's
    worth mentioning.

    If this solution turns out not to be enough in the end, we'll
    need to partly revert to the behaviour prior to commit
    0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
    ("sched, cputime: Introduce thread_group_times()")

    Back then, the scaling was done at exit() time, before adding the cputime
    of an exiting thread to the signal struct. And then we'd need to
    scale the live threads' cputime one by one in thread_group_cputime(). The
    drawback may be slightly slower code at exit time.

    Signed-off-by: Frederic Weisbecker
    Cc: Stanislaw Gruszka
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Andrew Morton

    Frederic Weisbecker
     

08 Mar, 2013

1 commit

  • The full dynticks cputime accounting can account either
    using the tick or the context tracking subsystem. This way
    the housekeeping CPU can keep the low-overhead tick based
    solution.

    The tick based mode has a coarse jiffies resolution and
    needs to be scaled against CFS's precise runtime accounting to
    improve its result. We already do this for CONFIG_TICK_CPU_ACCOUNTING;
    now we also need to expand it to the dynamic off-case of full
    dynticks accounting as well.

    Signed-off-by: Frederic Weisbecker
    Cc: Li Zhong
    Cc: Kevin Hilman
    Cc: Mats Liljegren
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Namhyung Kim
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Paul E. McKenney

    Frederic Weisbecker
     

24 Feb, 2013

1 commit

  • Running the full dynticks cputime accounting with preemptible
    kernel debugging triggers the following warning:

    [ 4.488303] BUG: using smp_processor_id() in preemptible [00000000] code: init/1
    [ 4.490971] caller is native_sched_clock+0x22/0x80
    [ 4.493663] Pid: 1, comm: init Not tainted 3.8.0+ #13
    [ 4.496376] Call Trace:
    [ 4.498996] [] debug_smp_processor_id+0xdb/0xf0
    [ 4.501716] [] native_sched_clock+0x22/0x80
    [ 4.504434] [] sched_clock+0x9/0x10
    [ 4.507185] [] fetch_task_cputime+0xad/0x120
    [ 4.509916] [] task_cputime+0x35/0x60
    [ 4.512622] [] acct_update_integrals+0x1e/0x40
    [ 4.515372] [] do_execve_common+0x4ff/0x5c0
    [ 4.518117] [] ? do_execve_common+0x144/0x5c0
    [ 4.520844] [] ? rest_init+0x160/0x160
    [ 4.523554] [] do_execve+0x37/0x40
    [ 4.526276] [] run_init_process+0x23/0x30
    [ 4.528953] [] kernel_init+0x9c/0xf0
    [ 4.531608] [] ret_from_fork+0x7c/0xb0

    We use sched_clock() to perform and fix up the cputime
    accounting. However, we are calling it with preemption enabled
    from the read side, which triggers the bug above.

    To fix this up, use local_clock() instead. It takes care of
    preemption and also provides a more reliable clock source. This
    is welcome for this kind of statistic, which is widely relied on
    in userspace.

    Reported-by: Thomas Gleixner
    Reported-by: Ingo Molnar
    Suggested-by: Thomas Gleixner
    Signed-off-by: Frederic Weisbecker
    Cc: Li Zhong
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Kevin Hilman
    Link: http://lkml.kernel.org/r/1361636925-22288-3-git-send-email-fweisbec@gmail.com
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     


05 Feb, 2013

1 commit

  • …x/kernel/git/frederic/linux-dynticks into sched/core

    Pull full-dynticks (user-space execution is undisturbed and
    receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready,
    from Frederic Weisbecker:

    "This implements the cputime accounting on full dynticks CPUs.

    Typical cputime stats infrastructure relies on the timer tick and
    its periodic polling on the CPU to account the amount of time
    spent by the CPUs and the tasks per high level domains such as
    userspace, kernelspace, guest, ...

    Now we are preparing to implement full dynticks capability on
    Linux for Real Time and HPC users who want full CPU isolation.
    This feature requires a cputime accounting that doesn't depend
    on the timer tick.

    To implement it, this new cputime infrastructure plugs into
    kernel/user/guest boundaries to take snapshots of cputime and
    flush these to the stats when needed. This performs pretty
    much like CONFIG_VIRT_CPU_ACCOUNTING except that context location
    and cputime snapshots are synchronized between the write and read
    side such that the latter can safely retrieve the pending tickless
    cputime of a task and add it to its latest cputime snapshot to
    return the correct result to the user."

    Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

28 Jan, 2013

5 commits

  • While remotely reading the cputime of a task running on a
    full dynticks CPU, the values stored in the utime/stime fields
    of struct task_struct may be stale. They may be those
    of the last kernel/user transition snapshot, and
    we need to add the tickless time spent since that snapshot.

    To fix this, flush the cputime of the dynticks CPUs on
    kernel/user transitions and record the time and context
    where we did this. Then, on top of this snapshot and the current
    time, perform the fixup on the reader side from the task_times()
    accessors.
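
    A sketch of that reader-side fixup (the vtime_snap / vtime_snap_whence
    names are assumptions about the bookkeeping the changelog mentions, and
    'now' stands for a scheduler clock timestamp; see the 24 Feb, 2013 entry
    above for which clock is safe to use there):

        unsigned int seq;
        cputime_t utime, stime;
        u64 now;        /* scheduler clock timestamp captured by the caller */

        /* t is the (possibly remote) task being sampled */
        do {
                seq = read_seqbegin(&t->vtime_seqlock);

                utime = t->utime;
                stime = t->stime;

                /* add the tickless time accumulated since the last flush */
                if (t->vtime_snap_whence == VTIME_USER)
                        utime += nsecs_to_cputime(now - t->vtime_snap);
        } while (read_seqretry(&t->vtime_seqlock, seq));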

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    [fixed kvm module related build errors]
    Signed-off-by: Sedat Dilek

    Frederic Weisbecker
     
  • Do some ground preparatory work before adding the guest_enter()
    and guest_exit() context tracking callbacks. Those will later
    be used to read the guest cputime safely when we
    run in full dynticks mode.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Gleb Natapov
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Marcelo Tosatti
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • This is in preparation for the full dynticks feature. While
    remotely reading the cputime of a task running on a full
    dynticks CPU, we'll need to do some extra computation. This
    way we can account the time it has spent tickless in userspace
    since its last cputime snapshot.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • Allow dynamically switching between tick based and virtual
    cputime accounting. This way we can provide a kind of "on-demand"
    virtual cputime accounting. In this mode, the kernel relies
    on the context tracking subsystem to dynamically probe kernel
    boundaries.

    This is in preparation for being able to stop the timer tick in
    more places than just the idle state. Doing so will depend on
    CONFIG_VIRT_CPU_ACCOUNTING_GEN, which makes it possible to account
    the cputime without the tick by hooking into kernel/user boundaries.

    Depending on whether the tick is stopped or not, we can switch between
    tick based and vtime based accounting at any time in order to minimize
    the overhead associated with the user hooks.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • If we want to stop the tick beyond idle, we need to be
    able to account the cputime without using the tick.

    Virtual cputime accounting solves that problem by
    hooking into kernel/user boundaries.

    However, implementing CONFIG_VIRT_CPU_ACCOUNTING requires
    low level hooks and involves more overhead. But we already
    have a generic context tracking subsystem that is required
    for RCU by archs which plan to shut down the tick
    outside idle.

    This patch implements a generic virtual cputime
    accounting that relies on these generic kernel/user hooks.

    There are some upsides to doing this:

    - No arch code is needed to implement CONFIG_VIRT_CPU_ACCOUNTING
    if context tracking is already built (which is already necessary for
    RCU in full tickless mode).

    - We can rely on the generic context tracking subsystem to dynamically
    (de)activate the hooks, so that we can switch at any time between virtual
    and tick based accounting. This way we don't pay the overhead
    of the virtual accounting when the tick is running periodically.

    And one downside:

    - There is probably more overhead than with a native virtual cputime
    accounting. But this relies on hooks that are already set anyway.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker