Eric Lee / smarc-fsl-linux-kernel

14 Aug, 2011

2 commits

5710f15b5 sched/cpupri: Remove cpupri->pri_active ... Browse Code »

Since [sched/cpupri: Remove the vec->lock], member pri_active
of struct cpupri is not needed any more, just remove it. Also
clean stuff related to it.

Signed-off-by: Yong Zhang
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110806001004.GA2207@zhy
Signed-off-by: Ingo Molnar

Yong Zhang
2011-08-14 18:01:11 +0800
c92211d9b sched/cpupri: Remove the vec->lock ... Browse Code »

sched/cpupri: Remove the vec->lock

The cpupri vec->lock has been showing up as a top contention
lately. This is because of the RT push/pull logic takes an
agressive approach for migrating RT tasks. The cpupri logic is
in place to improve the performance of the push/pull when dealing
with large number CPU machines.

The problem though is a vec->lock is required, where a vec is a
global per RT priority structure. That is, if there are lots of
RT tasks at the same priority, every time they are added or removed
from the RT queue, this global vec->lock is taken. Now that more
kernel threads are becoming RT (RCU boost and threaded interrupts)
this is becoming much more of an issue.

There are two variables that are being synced by the vec->lock.
The cpupri bitmask, and the vec->counter. The cpupri bitmask
is one bit per priority. If a RT priority vec has a process queued,
then the vec->count is > 0 and the cpupri bitmask is set for that
RT priority.

If the cpupri bitmask gets out of sync with the vec->counter, we could
end up pushing a low proirity RT task to a high priority queue.
That RT task that could have run immediately could be queued on a
run queue with a higher priority task indefinitely.

The solution is not to use the cpupri bitmask and just look at the
vec->count directly when doing a pull. The cpupri bitmask is just
a fast way to scan the RT priorities when a pull is made. Instead
of using the bitmask, and just examine all RT priorities, and
look at the vec->counts, we could eliminate the vec->lock. The
scan of RT tasks is to find a run queue that we can push an RT task
to, and we do not push to a high priority queue, thus the scan only
needs to go from 1 to RT task->prio, and not all 100 RT priorities.

The push algorithm, which does the scan of RT priorities (and
scan of the bitmask) only happens when we have an overloaded RT run
queue (more than one RT task queued). The grabbing of the vec->lock
happens every time any RT task is queued or dequeued on the run
queue for that priority. The slowing down of the scan by not using
a bitmask is negligible by the speed up of removing the vec->lock
contention, and replacing it with an atomic counter and memory barrier.

To prove this, I wrote a patch that times both the loop and the code
that grabs the vec->locks. I passed the patches to various people
(and companies) to test and show the results. I let everyone choose
their own load to test, giving different loads on the system,
for various different setups.

Here's some of the results: (snipping to a few CPUs to not make
this change log huge, but the results were consistent across
the entire system).

System 1 (24 CPUs)

Before patch:
CPU: Name Count Max Min Average Total
---- ---- ----- --- --- ------- -----
[...]
cpu 20: loop 3057 1.766 0.061 0.642 1963.170
vec 6782949 90.469 0.089 0.414 2811760.503
cpu 21: loop 2617 1.723 0.062 0.641 1679.074
vec 6782810 90.499 0.089 0.291 1978499.900
cpu 22: loop 2212 1.863 0.063 0.699 1547.160
vec 6767244 85.685 0.089 0.435 2949676.898
cpu 23: loop 2320 2.013 0.062 0.594 1380.265
vec 6781694 87.923 0.088 0.431 2928538.224

After patch:
cpu 20: loop 2078 1.579 0.061 0.533 1108.006
vec 6164555 5.704 0.060 0.143 885185.809
cpu 21: loop 2268 1.712 0.065 0.575 1305.248
vec 6153376 5.558 0.060 0.187 1154960.469
cpu 22: loop 1542 1.639 0.095 0.533 823.249
vec 6156510 5.720 0.060 0.190 1172727.232
cpu 23: loop 1650 1.733 0.068 0.545 900.781
vec 6170784 5.533 0.060 0.167 1034287.953

All times are in microseconds. The 'loop' is the amount of time spent
doing the loop across the priorities (before patch uses bitmask).
the 'vec' is the amount of time in the code that requires grabbing
the vec->lock. The second patch just does not have the vec lock, but
encompasses the same code.

Amazingly the loop code even went down on average. The vec code went
from .5 down to .18, that's more than half the time spent!

Note, more than one test was run, but they all had the same results.

System 2 (64 CPUs)

Before patch:
CPU: Name Count Max Min Average Total
---- ---- ----- --- --- ------- -----
cpu 60: loop 0 0 0 0 0
vec 5410840 277.954 0.084 0.782 4232895.727
cpu 61: loop 0 0 0 0 0
vec 4915648 188.399 0.084 0.570 2803220.301
cpu 62: loop 0 0 0 0 0
vec 5356076 276.417 0.085 0.786 4214544.548
cpu 63: loop 0 0 0 0 0
vec 4891837 170.531 0.085 0.799 3910948.833

After patch:
cpu 60: loop 0 0 0 0 0
vec 5365118 5.080 0.021 0.063 340490.267
cpu 61: loop 0 0 0 0 0
vec 4898590 1.757 0.019 0.071 347903.615
cpu 62: loop 0 0 0 0 0
vec 5737130 3.067 0.021 0.119 687108.734
cpu 63: loop 0 0 0 0 0
vec 4903228 1.822 0.021 0.071 348506.477

The test run during the measurement did not have any (very few,
from other CPUs) RT tasks pushing. But this shows that it helped
out tremendously with the contention, as the contention happens
because the vec->lock is taken only on queuing at an RT priority,
and different CPUs that queue tasks at the same priority will
have contention.

I tested on my own 4 CPU machine with the following results:

Before patch:
CPU: Name Count Max Min Average Total
---- ---- ----- --- --- ------- -----
cpu 0: loop 2377 1.489 0.158 0.588 1398.395
vec 4484 770.146 2.301 4.396 19711.755
cpu 1: loop 2169 1.962 0.160 0.576 1250.110
vec 4425 152.769 2.297 4.030 17834.228
cpu 2: loop 2324 1.749 0.155 0.559 1299.799
vec 4368 779.632 2.325 4.665 20379.268
cpu 3: loop 2325 1.629 0.157 0.561 1306.113
vec 4650 408.782 2.394 4.348 20222.577

After patch:
CPU: Name Count Max Min Average Total
---- ---- ----- --- --- ------- -----
cpu 0: loop 2121 1.616 0.113 0.636 1349.189
vec 4303 1.151 0.225 0.421 1811.966
cpu 1: loop 2130 1.638 0.178 0.644 1372.927
vec 4627 1.379 0.235 0.428 1983.648
cpu 2: loop 2056 1.464 0.165 0.637 1310.141
vec 4471 1.311 0.217 0.433 1937.927
cpu 3: loop 2154 1.481 0.162 0.601 1295.083
vec 4236 1.253 0.230 0.425 1803.008

This was running my migrate.c code that can be found at:
http://lwn.net/Articles/425763/

The migrate code does stress the RT tasks a bit. This shows that
the loop did increase a little after the patch, but not by much.
The vec code dropped dramatically. From 4.3us down to .42us.
That's a 10x improvement!

Tested-by: Mike Galbraith
Tested-by: Luis Claudio R. Gonçalves
Tested-by: Matthew Hank Sabins
Signed-off-by: Steven Rostedt
Reviewed-by: Gregory Haskins
Acked-by: Hillf Danton
Signed-off-by: Peter Zijlstra
Cc: Chris Mason
Link: http://lkml.kernel.org/r/1312317372.18583.101.camel@gandalf.stny.rr.com
Signed-off-by: Ingo Molnar

Steven Rostedt
2011-08-14 18:01:03 +0800

17 Jul, 2010

1 commit

68c38fc3c sched: No need for bootmem special cases ... Browse Code »

As of commit dcce284 ("mm: Extend gfp masking to the page
allocator") and commit 7e85ee0 ("slab,slub: don't enable
interrupts during early boot"), the slab allocator makes
sure we don't attempt to sleep during boot.

Therefore, remove bootmem special cases from the scheduler
and use plain GFP_KERNEL instead.

Signed-off-by: Pekka Enberg
Cc: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Pekka Enberg
2010-07-17 18:06:22 +0800

15 Dec, 2009

1 commit

fe841226b sched: Convert cpupri lock to raw_spinlock ... Browse Code »

Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner
Acked-by: Peter Zijlstra
Acked-by: Ingo Molnar

Thomas Gleixner
2009-12-15 06:55:33 +0800

30 Mar, 2009

1 commit

612a726fa cpumask: remove cpumask_t from core ... Browse Code »

Impact: cleanup

struct cpumask is nicer, and we use it to make where we've made code
safe for CONFIG_CPUMASK_OFFSTACK=y.

Signed-off-by: Rusty Russell

Rusty Russell
2009-03-30 19:35:17 +0800

25 Nov, 2008

1 commit

68e74568f sched: convert struct cpupri_vec cpumask_var_t. ... Browse Code »

Impact: stack usage reduction, (future) size reduction for large NR_CPUS.

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS.

The fact cpupro_init is called both before and after the slab is
available makes for an ugly parameter unfortunately.

We also use cpumask_any_and to get rid of a temporary in cpupri_find.

Signed-off-by: Rusty Russell
Signed-off-by: Ingo Molnar

Rusty Russell
2008-11-25 00:52:22 +0800

06 Jun, 2008

3 commits

e539d8fcd sched: fix the cpuprio count really ... Browse Code »

Peter pointed out that the last version of the "fix" was still one off
under certain circumstances. Use BITS_TO_LONG instead to get an
accurate result.

Signed-off-by: Thomas Gleixner

Thomas Gleixner
2008-06-06 21:19:44 +0800
709d4b0c6 sched: fix cpupri priocount ... Browse Code »

A rounding error was pointed out by Peter Zijlstra which would result
in the structure holding priorities to be off by one.

Signed-off-by: Gregory Haskins
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Arnaldo Carvalho de Melo
Signed-off-by: Thomas Gleixner

Gregory Haskins
2008-06-06 21:19:43 +0800
6e0534f27 sched: use a 2-d bitmap for searching lowest-pri CPU ... Browse Code »

The current code use a linear algorithm which causes scaling issues
on larger SMP machines. This patch replaces that algorithm with a
2-dimensional bitmap to reduce latencies in the wake-up path.

Signed-off-by: Gregory Haskins
Acked-by: Steven Rostedt
Signed-off-by: Ingo Molnar
Signed-off-by: Thomas Gleixner

Gregory Haskins
2008-06-06 21:19:28 +0800