01 Mar, 2010
1 commit
-
Make rcu_dereference() of runqueue data structures be
rcu_dereference_sched().
Located-by: Ingo Molnar
Signed-off-by: Paul E. McKenney
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
LKML-Reference:
Signed-off-by: Ingo Molnar
26 Feb, 2010
1 commit
-
On platforms like a dual-socket quad-core system, the scheduler load
balancer is not detecting the load imbalances in certain scenarios. This
leads to scenarios where one socket is completely busy (with
all 4 cores running 4 tasks), leaving the other socket
completely idle. This causes performance issues as those 4 tasks share
the memory controller, last-level cache bandwidth etc. We also won't be
taking advantage of turbo mode as much as we would like.
Some of the comparisons in the scheduler load balancing code are
comparing the "weighted cpu load that is scaled wrt sched_group's
cpu_power" with the "weighted average load per task that is not scaled
wrt sched_group's cpu_power". While this has probably been broken for a
longer time (for multi-socket NUMA nodes etc), the problem got aggravated
via this recent change:
|
| commit f93e65c186ab3c05ce2068733ca10e34fd00125e
| Author: Peter Zijlstra
| Date: Tue Sep 1 10:34:32 2009 +0200
|
| sched: Restore __cpu_power to a straight sum of power
|
Also with this change, the sched group cpu power alone no longer reflects
the group capacity that is needed to implement MC, MT performance
(default) and power-savings (user-selectable) policies.
We need to use the computed group capacity (sgs.group_capacity, that is
computed using the SD_PREFER_SIBLING logic in update_sd_lb_stats()) to
find out if the group with the max load is above its capacity and how
much load to move etc.
Reported-by: Ma Ling
Initial-Analysis-by: Zhang, Yanmin
Signed-off-by: Suresh Siddha
[ -v2: build fix ]
Signed-off-by: Peter Zijlstra
Cc: # [2.6.32.x, 2.6.33.x]
LKML-Reference:
Signed-off-by: Ingo Molnar
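The scale mismatch above is easiest to see with a small arithmetic sketch
(plain userspace C for illustration only; SCHED_LOAD_SCALE, group_load and
cpu_power here are simplified stand-ins, not the kernel's actual structures):

#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL	/* one CPU's worth of capacity */

int main(void)
{
	/* Illustrative numbers: a 2-thread SMT sched_group whose combined
	 * cpu_power is less than 2 * SCHED_LOAD_SCALE. */
	unsigned long group_load = 2048;	/* weighted load on the group */
	unsigned long cpu_power  = 1178;	/* group's combined cpu_power  */
	unsigned long nr_running = 2;

	/* Scaled wrt cpu_power, as used for the group's avg_load. */
	unsigned long avg_load = group_load * SCHED_LOAD_SCALE / cpu_power;

	/* Not scaled wrt cpu_power: a plain per-task average. */
	unsigned long load_per_task = group_load / nr_running;

	/*
	 * avg_load (~1780) and load_per_task (1024) live on different
	 * scales, so comparing them directly is meaningless; the fix uses
	 * the computed group capacity (sgs.group_capacity) instead.
	 */
	printf("avg_load=%lu load_per_task=%lu\n", avg_load, load_per_task);
	return 0;
}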
16 Feb, 2010
1 commit
-
Conflicts: kernel/sched.c
Necessary due to the urgent fixes which conflict with the code move
from sched.c to sched_fair.c
Signed-off-by: Thomas Gleixner
08 Feb, 2010
1 commit
-
Merge reason: Merge dependent fix, update to latest -rc.
Signed-off-by: Ingo Molnar
23 Jan, 2010
1 commit
-
The ability of enqueueing a task to the head of a SCHED_FIFO priority
list is required to fix some violations of POSIX scheduling policy.
Extend the related functions with a "head" argument.
Signed-off-by: Thomas Gleixner
Acked-by: Peter Zijlstra
Tested-by: Carsten Emde
Tested-by: Mathias Weber
LKML-Reference:
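A minimal sketch of the new "head" argument in plain C (toy list types,
hypothetical names, not the kernel's rt runqueue code): the enqueue path
gains a flag that selects insertion at the front of the priority list
rather than the back.

#include <stdbool.h>
#include <stdio.h>

/* Toy list types for illustration; not the kernel's rt_prio_array. */
struct task {
	const char *name;
	struct task *next;
};

struct prio_list {
	struct task *head;
	struct task *tail;
};

/* Enqueue at the tail by default; at the head when the caller needs the
 * task to run next within its priority (the POSIX-compliance cases). */
static void enqueue_task(struct prio_list *q, struct task *t, bool head)
{
	t->next = NULL;
	if (!q->head) {
		q->head = q->tail = t;
	} else if (head) {
		t->next = q->head;
		q->head = t;
	} else {
		q->tail->next = t;
		q->tail = t;
	}
}

int main(void)
{
	struct prio_list q = { 0 };
	struct task a = { "A" }, b = { "B" }, c = { "C" };

	enqueue_task(&q, &a, false);
	enqueue_task(&q, &b, false);
	enqueue_task(&q, &c, true);		/* C jumps to the head */

	for (struct task *t = q.head; t; t = t->next)
		printf("%s ", t->name);		/* prints: C A B */
	printf("\n");
	return 0;
}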
21 Jan, 2010
11 commits
-
We want to update the sched_group_powers when balance_cpu == this_cpu.
Currently the group powers are updated only if the balance_cpu is the
first CPU in the local group. But balance_cpu = this_cpu could also be
the first idle cpu in the group. Hence fix the place where the group
powers are updated.
Signed-off-by: Gautham R Shenoy
Signed-off-by: Joel Schopp
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Since all load_balance() callers will have !NULL balance parameters we
can now assume so and remove a few checks.
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
The two functions: load_balance{,_newidle}() are very similar, with the
following differences:
- rq->lock usage
- sd->balance_interval updates
- *balance check
So replace the load_balance_newidle() call with load_balance(.idle =
CPU_NEWLY_IDLE), explicitly unlock the rq->lock before calling (would be
done by double_lock_balance() anyway), and ignore the other differences
for now.
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
load_balance() and load_balance_newidle() look remarkably similar; one
key point they differ in is the condition on when to active balance.
So split out that logic into a separate function.
One side effect is that previously load_balance_newidle() used to fail
and return -1 under these conditions, whereas now it doesn't. I've not
yet fully figured out the whole -1 return case for either
load_balance{,_newidle}().
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Since load-balancing can hold rq->locks for quite a long while, allow
breaking out early when there is lock contention.
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Move code around to get rid of fwd declarations.
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Again, since we only iterate the fair class, remove the abstraction.
Since this is the last user of the rq_iterator, remove all that too.
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Since we only ever iterate the fair class, do away with this abstraction.
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Take out the sched_class methods for load-balancing.
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Straight fwd code movement.
Since none of the load-balance abstractions are used anymore, do away with
them and simplify the code some. In preparation move the code around.
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
SD_PREFER_SIBLING is set at the CPU domain level if power saving isn't
enabled, leading to many cache misses on large machines as we traverse
looking for an idle shared cache to wake to. Change the enabler of
select_idle_sibling() to SD_SHARE_PKG_RESOURCES, and enable same at the
sibling domain level.
Reported-by: Lin Ming
Signed-off-by: Mike Galbraith
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
17 Jan, 2010
1 commit
-
kernel/sched: don't expose local functions
The get_rr_interval_* functions are all class methods of
struct sched_class. They are not exported so make them
static.
Signed-off-by: H Hartley Sweeten
Cc: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
17 Dec, 2009
2 commits
-
In order to remove the cfs_rq dependency from set_task_cpu() we
need to ensure the task is cfs_rq invariant for all callsites.
The simple approach is to subtract cfs_rq->min_vruntime from
se->vruntime on dequeue, and add cfs_rq->min_vruntime on
enqueue.
However, this has the downside of breaking FAIR_SLEEPERS since
we lose the old vruntime as we only maintain the relative
position.
To solve this, we observe that we only migrate runnable tasks,
we do this using deactivate_task(.sleep=0) and
activate_task(.wakeup=0), therefore we can restrain the
min_vruntime invariance to that state.
The only other case is wakeup balancing, since we want to
maintain the old vruntime we cannot make it relative on dequeue,
but since we don't migrate inactive tasks, we can do so right
before we activate it again.
This is where we need the new pre-wakeup hook, we need to call
this while still holding the old rq->lock. We could fold it into
->select_task_rq(), but since that has multiple callsites and
would obfuscate the locking requirements, that seems like a
fudge.
This leaves the fork() case, simply make sure that ->task_fork()
leaves the ->vruntime in a relative state.
This covers all cases where set_task_cpu() gets called, and
ensures it sees a relative vruntime.
Signed-off-by: Peter Zijlstra
Cc: Mike Galbraith
LKML-Reference:
Signed-off-by: Ingo Molnar
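A tiny userspace sketch of the vruntime invariant described above
(hypothetical structs and helpers, not the actual sched_fair.c code):
vruntime is made relative to the old cfs_rq's min_vruntime when the task
is dequeued for migration, and re-based on the new cfs_rq's min_vruntime
when it is enqueued again.

#include <stdio.h>

/* Hypothetical, heavily simplified stand-ins for cfs_rq and sched_entity. */
struct cfs_rq { unsigned long long min_vruntime; };
struct sched_entity { unsigned long long vruntime; };

/* On dequeue for migration: make vruntime relative to the old queue. */
static void dequeue_for_migration(struct sched_entity *se, struct cfs_rq *old)
{
	se->vruntime -= old->min_vruntime;
}

/* On enqueue after migration: re-base on the new queue's min_vruntime. */
static void enqueue_after_migration(struct sched_entity *se, struct cfs_rq *new_rq)
{
	se->vruntime += new_rq->min_vruntime;
}

int main(void)
{
	struct cfs_rq src = { .min_vruntime = 1000000 };
	struct cfs_rq dst = { .min_vruntime =  400000 };
	struct sched_entity se = { .vruntime = 1002500 }; /* 2500 ahead of src min */

	dequeue_for_migration(&se, &src);	/* relative: 2500 */
	enqueue_after_migration(&se, &dst);	/* absolute again: 402500 */
	printf("vruntime after migration: %llu\n", se.vruntime);
	return 0;
}
-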
We should skip !SD_LOAD_BALANCE domains.
Signed-off-by: Peter Zijlstra
Cc: Mike Galbraith
LKML-Reference:
CC: stable@kernel.org
Signed-off-by: Ingo Molnar
15 Dec, 2009
1 commit
-
Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.
Signed-off-by: Thomas Gleixner
Acked-by: Peter Zijlstra
Acked-by: Ingo Molnar
09 Dec, 2009
9 commits
-
The normalized values are also recalculated in case the scaling factor
changes.
This patch updates the internally used scheduler tuning values that are
normalized to one cpu in case a user sets new values via sysfs.
Together with patch 2 of this series this allows user-configured
values to scale (or not) with cpu add/remove events taking place later.
Signed-off-by: Christian Ehrhardt
Signed-off-by: Peter Zijlstra
LKML-Reference:
[ v2: fix warning ]
Signed-off-by: Ingo Molnar
-
As scaling now takes place on all kinds of cpu add/remove events, a user
that configures values via proc should be able to configure whether his set
values are still rescaled or kept whatever happens.
As the comments state that log2 was just a second guess that worked, the
interface is not just designed for on/off, but to choose a scaling type.
Currently this allows none, log and linear, but more importantly it allows
us to keep the interface even if someone has an even better idea how to
scale the values.
Signed-off-by: Christian Ehrhardt
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Based on Peter Zijlstra's patch suggestion this enables recalculation of
the scheduler tunables in response to a change in the number of cpus. It
also adds a max of eight cpus that are considered in that scaling.
Signed-off-by: Christian Ehrhardt
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
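Taken together, these three commits amount to the following scaling scheme.
This is a self-contained sketch with hypothetical names (the real code lives
in the tunables update paths), showing the none/log/linear modes and the
eight-cpu cap described above:

#include <stdio.h>

enum scaling { SCALE_NONE, SCALE_LOG, SCALE_LINEAR };

/* Integer log2, a stand-in for the kernel's ilog2(). */
static unsigned int ilog2_u(unsigned int v)
{
	unsigned int r = 0;
	while (v >>= 1)
		r++;
	return r;
}

/* Illustrative helper: scale a per-cpu-normalized tunable by the number of
 * online cpus, considering at most eight cpus, per the changelog above. */
static unsigned long scale_tunable(unsigned long base, unsigned int cpus,
				   enum scaling mode)
{
	unsigned int factor = cpus > 8 ? 8 : cpus;

	switch (mode) {
	case SCALE_LOG:    return base * (1 + ilog2_u(factor));
	case SCALE_LINEAR: return base * factor;
	case SCALE_NONE:
	default:           return base;
	}
}

int main(void)
{
	unsigned long base_latency_ns = 6000000;	/* example base value */

	for (unsigned int cpus = 1; cpus <= 16; cpus *= 2)
		printf("%2u cpus: none=%lu log=%lu linear=%lu\n", cpus,
		       scale_tunable(base_latency_ns, cpus, SCALE_NONE),
		       scale_tunable(base_latency_ns, cpus, SCALE_LOG),
		       scale_tunable(base_latency_ns, cpus, SCALE_LINEAR));
	return 0;
}
-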
As Nick pointed out, and realized by myself when doing:
sched: Fix balance vs hotplug race
the patch:
sched: for_each_domain() vs RCU
is wrong, sched_domains are freed after synchronize_sched(), which
means disabling preemption is enough.
Reported-by: Nick Piggin
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
WAKEUP_RUNNING was an experiment, not sure why that ever ended up being
merged...
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Streamline the wakeup preemption code a bit, unifying the preempt path
so that they all do the same.
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
If a RT task is woken up while a non-RT task is running,
check_preempt_wakeup() is called to check whether the new task can
preempt the old task. The function returns quickly without going deeper
because it is apparent that an RT task can always preempt a non-RT task.
In this situation, check_preempt_wakeup() always calls update_curr() to
update the vruntime value of the currently running task. However, the
function call is unnecessary and redundant at that moment because (1) a
non-RT task can always be preempted by an RT task regardless of its
vruntime value, and (2) update_curr() will be called shortly when the
context switch between the two occurs.
By moving update_curr() in check_preempt_wakeup(), we can avoid a
redundant call to update_curr(), slightly reducing the time taken to
wake up RT tasks.
Signed-off-by: Jupyung Lee
[ Place update_curr() right before the wake_preempt_entity() call, which
is the only thing that relies on the updated vruntime ]
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Currently we try to do task placement in wake_up_new_task() after we do
the load-balance pass in sched_fork(). This yields complicated semantics
in that we have to deal with tasks on different RQs and the
set_task_cpu() calls in copy_process() and sched_fork().
Rename ->task_new() to ->task_fork() and call it from sched_fork()
before the balancing, this gives the policy a clear point to place the
task.
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
-
sched_rr_get_param calls
task->sched_class->get_rr_interval(task) without protection
against a concurrent sched_setscheduler() call which modifies
task->sched_class.
Serialize the access with task_rq_lock(task) and hand the rq
pointer into get_rr_interval() as it's needed at least in the
sched_fair implementation.
Signed-off-by: Thomas Gleixner
Acked-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
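A generic userspace sketch of the race and the fix, with a pthread mutex
standing in for task_rq_lock() (all names here are illustrative, not the
kernel's): the class/ops pointer must be read and called through under the
same lock a concurrent policy change takes to swap it.

#include <pthread.h>
#include <stdio.h>

/* Illustrative stand-ins for sched_class and its get_rr_interval method. */
struct sched_ops { unsigned int (*get_rr_interval)(void); };

static unsigned int fifo_rr(void) { return 0; }
static unsigned int rr_rr(void)   { return 100; }

static const struct sched_ops fifo_ops = { .get_rr_interval = fifo_rr };
static const struct sched_ops rr_ops   = { .get_rr_interval = rr_rr };

/* The lock plays the role of task_rq_lock(): it serializes readers of
 * ops against a concurrent policy change that swaps the pointer. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static const struct sched_ops *ops = &fifo_ops;

static unsigned int sched_rr_get_interval(void)
{
	unsigned int val;

	pthread_mutex_lock(&lock);
	val = ops->get_rr_interval();	/* pointer cannot change under us */
	pthread_mutex_unlock(&lock);
	return val;
}

static void sched_setscheduler_rr(void)
{
	pthread_mutex_lock(&lock);
	ops = &rr_ops;
	pthread_mutex_unlock(&lock);
}

int main(void)
{
	printf("rr interval: %u\n", sched_rr_get_interval());
	sched_setscheduler_rr();
	printf("rr interval: %u\n", sched_rr_get_interval());
	return 0;
}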
26 Nov, 2009
1 commit
-
Merge reason: Pick up fixes that did not make it into .32.0
Signed-off-by: Ingo Molnar
24 Nov, 2009
1 commit
-
Branch hint profiling on my nehalem machine showed 90%
incorrect branch hints:
15728471 158903754 90 pick_next_task_fair sched_fair.c 1555
Signed-off-by: Tim Blechmann
Cc: Peter Zijlstra
Cc: Mike Galbraith
Cc: Paul Mackerras
Cc: Arnaldo Carvalho de Melo
Cc: Frederic Weisbecker
LKML-Reference:
Signed-off-by: Ingo Molnar
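For reference, a minimal sketch of what such a hint looks like (generic C,
not the actual sched_fair.c hunk): likely()/unlikely() expand to
__builtin_expect(), so a hint that profiling shows to be wrong roughly 90%
of the time pessimizes the hot path and is best removed.

#include <stdio.h>

/* The kernel's likely()/unlikely() boil down to __builtin_expect(). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

static int pick(int nr_running)
{
	/*
	 * If profiling shows this branch is taken ~90% of the time, marking
	 * it unlikely() steers the compiler the wrong way; dropping the
	 * hint (or inverting it) is the fix.
	 */
	if (unlikely(nr_running))
		return 1;
	return 0;
}

int main(void)
{
	printf("%d %d\n", pick(0), pick(3));
	return 0;
}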
13 Nov, 2009
2 commits
-
Instead of only considering SD_WAKE_AFFINE | SD_PREFER_SIBLING
domains also allow all SD_PREFER_SIBLING domains below a
SD_WAKE_AFFINE domain to change the affinity target.
Signed-off-by: Peter Zijlstra
Cc: Mike Galbraith
LKML-Reference:
Signed-off-by: Ingo Molnar
-
Clean up the new affine to idle sibling bits while trying to
grok them. Should not have any functional differences.
Signed-off-by: Peter Zijlstra
Cc: Mike Galbraith
LKML-Reference:
Signed-off-by: Ingo Molnar
05 Nov, 2009
2 commits
-
Ingo Molnar reported:
[ 26.804000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
[ 26.808000] caller is vmstat_update+0x26/0x70
[ 26.812000] Pid: 10, comm: events/1 Not tainted 2.6.32-rc5 #6887
[ 26.816000] Call Trace:
[ 26.820000] [] ? printk+0x28/0x3c
[ 26.824000] [] debug_smp_processor_id+0xf0/0x110
[ 26.824000] mount used greatest stack depth: 1464 bytes left
[ 26.828000] [] vmstat_update+0x26/0x70
[ 26.832000] [] worker_thread+0x188/0x310
[ 26.836000] [] ? worker_thread+0x127/0x310
[ 26.840000] [] ? autoremove_wake_function+0x0/0x60
[ 26.844000] [] ? worker_thread+0x0/0x310
[ 26.848000] [] kthread+0x7c/0x90
[ 26.852000] [] ? kthread+0x0/0x90
[ 26.856000] [] kernel_thread_helper+0x7/0x10
[ 26.860000] BUG: using smp_processor_id() in preemptible [00000000] code: events/1/10
[ 26.864000] caller is vmstat_update+0x3c/0x70
Because this commit:
a1f84a3: sched: Check for an idle shared cache in select_task_rq_fair()
broke ->cpus_allowed.
Signed-off-by: Mike Galbraith
Cc: Peter Zijlstra
Cc: arjan@infradead.org
Cc:
LKML-Reference:
Signed-off-by: Ingo Molnar
-
When waking affine, check for an idle shared cache, and if
found, wake to that CPU/sibling instead of the waker's CPU.
This improves pgsql+oltp ramp up by roughly 8%. Possibly more
for other loads, depending on overlap. The trade-off is a
roughly 1% peak downturn if tasks are truly synchronous.
Signed-off-by: Mike Galbraith
Cc: Arjan van de Ven
Cc: Peter Zijlstra
Cc:
LKML-Reference:
Signed-off-by: Ingo Molnar
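A heavily simplified sketch of the placement idea (plain C with made-up
data structures; select_task_rq_fair() is far more involved): prefer an
idle CPU that shares a cache with the waker over the waker's own busy CPU.

#include <stdio.h>

#define NR_CPUS 8

/* Hypothetical, simplified system state. */
static int cache_domain[NR_CPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 }; /* shared LLC id */
static int cpu_idle[NR_CPUS]     = { 0, 1, 0, 1, 0, 0, 0, 0 };

/* Wake-affine placement: if a CPU sharing the waker's cache is idle,
 * wake the task there instead of on the (busy) waking CPU. */
static int select_wake_cpu(int waking_cpu)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cache_domain[cpu] != cache_domain[waking_cpu])
			continue;	/* different last-level cache */
		if (cpu_idle[cpu])
			return cpu;	/* idle sibling: cache still warm */
	}
	return waking_cpu;		/* fall back to the waker's CPU */
}

int main(void)
{
	printf("woken task placed on CPU %d\n", select_wake_cpu(0)); /* -> 1 */
	printf("woken task placed on CPU %d\n", select_wake_cpu(4)); /* -> 4 */
	return 0;
}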
24 Oct, 2009
1 commit
-
This patch restores the effectiveness of LAST_BUDDY in preventing
pgsql+oltp from collapsing due to wakeup preemption. It also
switches LAST_BUDDY to exclusively do what it does best, namely
mitigate the effects of aggressive wakeup preemption, which
improves vmark throughput markedly, and restores mysql+oltp
scalability.
Since buddies are about scalability, enable them beginning at the
point where we begin expanding sched_latency, namely
sched_nr_latency. Previously, buddies were cleared aggressively,
which seriously reduced their effectiveness. Not clearing
aggressively however, produces a small drop in mysql+oltp
throughput immediately after peak, indicating that LAST_BUDDY is
actually doing some harm. This is right at the point where X on the
desktop in competition with another load wants low latency service.
Ergo, do not enable until we need to scale.
To mitigate latency induced by buddies, or by a task just missing
wakeup preemption, check latency at tick time.
Last hunk prevents buddies from stymieing BALANCE_NEWIDLE via
CACHE_HOT_BUDDY.
Supporting performance tests:
tip = v2.6.32-rc5-1497-ga525b32
tipx = NO_GENTLE_FAIR_SLEEPERS NEXT_BUDDY granularity knobs = 31 knobs + 31 buddies
tip+x = NO_GENTLE_FAIR_SLEEPERS granularity knobs = 31 knobs
(Three run averages except where noted.)
vmark:
------
tip 108466 messages per second
tip+ 125307 messages per second
tip+x 125335 messages per second
tipx 117781 messages per second
2.6.31.3 122729 messages per second
mysql+oltp:
-----------
clients 1 2 4 8 16 32 64 128 256
..........................................................................................
tip 9949.89 18690.20 34801.24 34460.04 32682.88 30765.97 28305.27 25059.64 19548.08
tip+ 10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47
tipx 9698.71 18002.70 34477.56 33420.01 32634.30 31657.27 29932.67 26827.52 21487.18
2.6.31.3 8243.11 18784.20 34404.83 33148.38 31900.32 31161.90 29663.81 25995.94 18058.86
pgsql+oltp:
-----------
clients 1 2 4 8 16 32 64 128 256
..........................................................................................
tip 13686.37 26609.25 51934.28 51347.81 49479.51 45312.65 36691.91 26851.57 24145.35
tip+ (1x) 13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94
tip+x 13906.78 27065.81 52951.19 52542.59 52176.11 51815.94 50838.90 49439.46 46891.00
tipx 13742.46 26769.81 52351.99 51891.73 51320.79 50938.98 50248.65 48908.70 46553.84
2.6.31.3 13815.35 26906.46 52683.34 52061.31 51937.10 51376.80 50474.28 49394.47 47003.25
Signed-off-by: Mike Galbraith
Cc: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar
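A minimal sketch of the gating described above (hypothetical names and
fields, not the actual patch): buddy hints are only recorded once the run
queue is loaded enough that sched_latency starts being expanded, i.e.
around sched_nr_latency runnable tasks.

#include <stdio.h>

/* Hypothetical stand-ins for the scheduler's tunable and queue state. */
static const unsigned int sched_nr_latency = 8;

struct cfs_rq_lite {
	unsigned int nr_running;
	int last_buddy;		/* -1 when unset */
};

/* Only record a "last buddy" once the queue is loaded enough that
 * sched_latency is being stretched; below that, buddies add latency
 * without buying scalability. */
static void set_last_buddy(struct cfs_rq_lite *cfs_rq, int task_id)
{
	if (cfs_rq->nr_running < sched_nr_latency)
		return;
	cfs_rq->last_buddy = task_id;
}

int main(void)
{
	struct cfs_rq_lite light = { .nr_running = 3,  .last_buddy = -1 };
	struct cfs_rq_lite busy  = { .nr_running = 12, .last_buddy = -1 };

	set_last_buddy(&light, 42);
	set_last_buddy(&busy, 42);
	printf("light: %d, busy: %d\n", light.last_buddy, busy.last_buddy); /* -1, 42 */
	return 0;
}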
14 Oct, 2009
1 commit
-
Yanmin reported a hackbench regression due to:
> commit de69a80be32445b0a71e8e3b757e584d7beb90f7
> Author: Peter Zijlstra
> Date: Thu Sep 17 09:01:20 2009 +0200
>
> sched: Stop buddies from hogging the system
I really liked de69a80b, and it affecting hackbench shows I wasn't
crazy ;-)
So hackbench is a multi-cast, with one sender spraying multiple
receivers, who in their turn don't spray back.
This would be exactly the scenario that patch 'cures'. Previously
we would not clear the last buddy after running the next task,
allowing the sender to get back to work sooner than it otherwise
ought to have been, increasing latencies for other tasks.
Now, since those receivers don't poke back, they don't enforce the
buddy relation, which means there's nothing to re-elect the sender.
Cure this by less aggressively clearing the buddy stats. Only clear
buddies when they were not chosen. It should still avoid a buddy
sticking around long after it's served its time.
Reported-by: "Zhang, Yanmin"
Signed-off-by: Peter Zijlstra
CC: Mike Galbraith
LKML-Reference:
Signed-off-by: Ingo Molnar
24 Sep, 2009
1 commit
-
It's unused.
It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.
It _was_ used in two places at arch/frv for some reason.
Signed-off-by: Alexey Dobriyan
Cc: David Howells
Cc: "Eric W. Biederman"
Cc: Al Viro
Cc: Ralf Baechle
Cc: Martin Schwidefsky
Cc: Ingo Molnar
Cc: "David S. Miller"
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 Sep, 2009
1 commit
-
…l/git/tip/linux-2.6-tip
* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Simplify sys_sched_rr_get_interval() system call
sched: Fix potential NULL dereference of doms_cur
sched: Fix raciness in runqueue_is_locked()
sched: Re-add lost cpu_allowed check to sched_fair.c::select_task_rq_fair()
sched: Remove unneeded indentation in sched_fair.c::place_entity()
21 Sep, 2009
1 commit
-
By removing the need for it to know details of scheduling classes.
This allows PlugSched to define orthogonal scheduling classes.
Signed-off-by: Peter Williams
Acked-by: Peter Zijlstra
Cc: Mike Galbraith
LKML-Reference:
Signed-off-by: Ingo Molnar