26 Jan, 2011

3 commits

  • The delta in clock_task is a fairer attribution of how much time a tg has
    been contributing load to the current cpu.

    While not really important, it also means we're more in sync (by magnitude)
    with respect to periodic updates (since __update_curr deltas are clock_task
    based).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Since updates are against an entity's queuing cfs_rq, it's not possible to
    enter update_cfs_{shares,load} with a NULL cfs_rq. (Indeed, update_cfs_load
    would crash before reaching the check anyway, since the load is examined
    during the initializers.)

    Also, in the update_cfs_load case there's no point in maintaining averages
    for rq->cfs_rq since we don't perform shares distribution at that level --
    the NULL check is replaced accordingly.

    Thanks to Dan Carpenter for pointing out the dereference before the NULL
    check (a small standalone sketch of that pattern follows this entry).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
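
    The pattern flagged here is generic: dereferencing a pointer in the
    initializers and only then testing it for NULL means the test is either
    dead or too late. A minimal standalone illustration of the anti-pattern
    and its fix (not the kernel code itself):

        #include <stdio.h>

        struct cfs_rq_like { long load_weight; };

        /* Anti-pattern: the initializer dereferences 'q' before the NULL
         * check, so a NULL argument crashes before the check is reached. */
        static long broken(struct cfs_rq_like *q)
        {
                long w = q->load_weight;        /* dereference happens here */

                if (!q)                         /* dead / too-late check */
                        return 0;
                return w;
        }

        /* Fixed: test first -- or, as in the entry above, drop the test
         * entirely when callers can never pass NULL. */
        static long fixed(struct cfs_rq_like *q)
        {
                if (!q)
                        return 0;
                return q->load_weight;
        }

        int main(void)
        {
                struct cfs_rq_like q = { .load_weight = 1024 };

                printf("%ld %ld\n", broken(&q), fixed(&q));
                return 0;
        }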
     
  • While care is taken around the zero-point in effective_load to not exceed
    the instantaneous rq->weight, it's still possible (e.g. using wake_idx != 0)
    for (load + effective_load) to underflow.

    In this case, comparing the unsigned values can result in incorrect balance
    decisions (a standalone sketch of the pitfall follows this entry).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
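
    A minimal standalone sketch of the pitfall (illustrative numbers, not
    the scheduler code): once a negative adjustment is folded into an
    unsigned load value the sum wraps to a huge number and the balance
    comparison goes the wrong way; doing the comparison in signed
    arithmetic restores a sensible decision.

        #include <stdio.h>

        int main(void)
        {
                unsigned long this_load = 512;   /* load on the waking cpu   */
                unsigned long prev_load = 2048;  /* load on the previous cpu */
                long this_eff = -1024;   /* effective_load() can be negative */

                /* Buggy: the signed adjustment is converted to unsigned,
                 * the sum underflows and looks enormous. */
                unsigned long buggy = this_load + this_eff;
                printf("unsigned: this=%lu prev=%lu -> this<=prev is %d\n",
                       buggy, prev_load, buggy <= prev_load);

                /* Safer: compare as signed values. */
                long s_this = (long)this_load + this_eff;
                printf("signed:   this=%ld prev=%ld -> this<=prev is %d\n",
                       s_this, (long)prev_load, s_this <= (long)prev_load);
                return 0;
        }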
     

24 Jan, 2011

1 commit

  • Michael Witten and Christian Kujau reported that the autogroup
    scheduling feature hurts interactivity on their UP systems.

    It turns out that this is an older bug in the group scheduling code,
    and the wider appeal provided by the autogroup feature exposed it
    more prominently.

    On UP with FAIR_GROUP_SCHED enabled, tuning shares only affects
    tg->shares, but is not reflected in tg->se->load. The reason is that
    update_cfs_shares() does nothing on UP.

    So introduce update_cfs_shares() for UP && FAIR_GROUP_SCHED (a simplified
    sketch of the effect follows this entry).

    This issue was found when autogroup scheduling was enabled, but it is an
    older bug that also exists with cgroup cpu on UP.

    Reported-and-Tested-by: Michael Witten
    Reported-and-Tested-by: Christian Kujau
    Signed-off-by: Yong Zhang
    Acked-by: Pekka Enberg
    Acked-by: Mike Galbraith
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Yong Zhang
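
    A simplified, standalone sketch of the effect described above (made-up
    struct and function names, not the kernel's definitions): on SMP the
    shares update recomputes the group entity's weight from tg->shares,
    while the UP build compiled that helper away, so writes to tg->shares
    never reached the entity's load.

        #include <stdio.h>

        struct load_weight { unsigned long weight; };
        struct sched_entity_like { struct load_weight load; };
        struct task_group_like {
                unsigned long shares;
                struct sched_entity_like se;
        };

        /* Stand-in for what the shares update achieves: the group entity's
         * weight must follow tg->shares (the real code also scales by the
         * cpu's share of the group load). */
        static void update_group_entity_weight(struct task_group_like *tg)
        {
                tg->se.load.weight = tg->shares;
        }

        int main(void)
        {
                struct task_group_like tg = {
                        .shares = 1024,
                        .se.load.weight = 1024,
                };

                tg.shares = 2048;       /* e.g. writing cpu.shares */
                printf("before update: shares=%lu se.weight=%lu\n",
                       tg.shares, tg.se.load.weight);

                /* The entity weight only changes once the update runs --
                 * which the UP build previously never did. */
                update_group_entity_weight(&tg);
                printf("after  update: shares=%lu se.weight=%lu\n",
                       tg.shares, tg.se.load.weight);
                return 0;
        }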
     

18 Jan, 2011

2 commits

  • A signed/unsigned comparison may lead to a superfluous resched if leftmost
    is to the right of the current task, wasting a few cycles and inadvertently
    _lengthening_ the current task's slice.

    Reported-by: Venkatesh Pallipadi
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Previously effective_load would approximate the global load weight present on
    a group taking advantage of:

    entity_weight = tg->shares * (lw / global_lw), where entity_weight was
    provided by tg_shares_up.

    This worked (approximately) for an 'empty' (at tg level) cpu since we would
    place boost load representative of what a newly woken task would receive.

    However, now that load is instantaneously updated this assumption is no longer
    true and the load calculation is rather incorrect in this case.

    Fix this (and improve the general case) by rewriting effective_load to take
    advantage of the new shares distribution code (a rough numeric sketch of
    the approximation follows this entry).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
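
    A rough numeric sketch of the approximation above (standalone, made-up
    numbers): a group entity's weight on a cpu is roughly tg->shares scaled
    by that cpu's fraction of the global group load, so the wakeup-balance
    code can estimate the weight change caused by placing wl of extra load
    here.

        #include <stdio.h>

        /* entity_weight ~= shares * (local_load / global_load) */
        static long entity_weight(long shares, long local_load,
                                  long global_load)
        {
                if (global_load <= 0)
                        return shares;
                return shares * local_load / global_load;
        }

        int main(void)
        {
                long shares = 1024;  /* tg->shares                        */
                long lw     = 512;   /* this cpu's contribution (lw)      */
                long glw    = 4096;  /* global group load (global_lw)     */
                long wl     = 1024;  /* load a newly woken task would add */

                long before = entity_weight(shares, lw, glw);
                long after  = entity_weight(shares, lw + wl, glw + wl);

                /* effective_load() cares about the *change* in the group
                 * entity's weight caused by the placement. */
                printf("weight before=%ld after=%ld delta=%ld\n",
                       before, after, after - before);
                return 0;
        }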
     

19 Dec, 2010

2 commits

  • Mike Galbraith reported poor interactivity[*] when the new shares distribution
    code was combined with autogroups.

    The root cause turns out to be a mis-ordering of accounting accrued execution
    time and shares updates. Since update_curr() is issued hierarchically,
    updating the parent entity weights to reflect child enqueue/dequeue results in
    the parent's unaccounted execution time then being accrued (vs vruntime) at the
    new weight as opposed to the weight present at accumulation.

    While this doesn't have much effect on processes with timeslices that cross a
    tick, it is particularly problematic for an interactive process (e.g. Xorg)
    which incurs many (tiny) timeslices. In this scenario almost all updates are
    at dequeue which can result in significant fairness perturbation (especially if
    it is the only thread, resulting in potential {tg->shares, MIN_SHARES}
    transitions).

    Correct this by ensuring unaccounted time is accumulated prior to
    manipulating an entity's weight (a small numeric sketch follows this
    entry).

    [*] http://xkcd.com/619/ is perversely Nostradamian here.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
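
    A small numeric sketch of why the ordering matters (standalone,
    simplified from the CFS rule that vruntime advances by
    delta_exec * NICE_0_LOAD / weight): time that ran at the old weight
    must be accrued at the old weight; accruing it only after the weight
    has been changed distorts vruntime, which is the dequeue-heavy Xorg
    case described above.

        #include <stdio.h>

        #define NICE_0_LOAD 1024L

        /* vruntime advances by delta_exec scaled inversely with weight. */
        static long calc_delta(long delta_exec, long weight)
        {
                return delta_exec * NICE_0_LOAD / weight;
        }

        int main(void)
        {
                long delta_exec = 3000;  /* unaccounted execution time (ns) */
                long old_weight = 1024;  /* weight while that time ran      */
                long new_weight = 2;     /* e.g. a MIN_SHARES-ish transition */

                /* Correct: accrue the pending time at the weight it ran at,
                 * then change the weight. */
                long correct = calc_delta(delta_exec, old_weight);

                /* Mis-ordered: the weight is changed first, so the same
                 * pending time is accrued at the new weight. */
                long wrong = calc_delta(delta_exec, new_weight);

                printf("vruntime credit: correct=%ld mis-ordered=%ld\n",
                       correct, wrong);
                return 0;
        }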
     
  • Long running entities that do not block (dequeue) require periodic updates to
    maintain accurate share values. (Note: group entities with several threads are
    quite likely to be non-blocking in many circumstances).

    By virtue of being long-running however, we will see entity ticks (otherwise
    the required update occurs in dequeue/put and we are done). Thus we can move
    the detection (and associated work) for these updates into the periodic path.

    This restores the 'atomicity' of update_curr() with respect to accounting.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     

09 Dec, 2010

1 commit


26 Nov, 2010

1 commit


23 Nov, 2010

1 commit


18 Nov, 2010

12 commits

  • Refactor the global load updates from update_shares_cpu() so that
    update_cfs_load() can update global load when it is more than ~10%
    out of sync.

    The new global_load parameter allows us to force an update, regardless of
    the error factor so that we can synchronize w/ update_shares().

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • When the system is busy, dilation of rq->next_balance makes lb->update_shares()
    insufficiently frequent for threads which don't sleep (no dequeue/enqueue
    updates). Adjust for this by making demand based updates based on the
    accumulation of execution time sufficient to wrap our averaging window.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Since shares updates are no longer expensive and effectively local, update them
    at idle_balance(). This allows us to more quickly redistribute shares to
    another cpu when our load becomes idle.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Introduce a new sysctl for the shares window and disambiguate it from
    sched_time_avg.

    A 10ms window appears to be a good compromise between accuracy and performance.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Avoid duplicate shares update calls by ensuring children always appear before
    parents in rq->leaf_cfs_rq_list.

    This allows us to do a single in-order traversal for update_shares().

    Since we always enqueue in bottom-up order this reduces to 2 cases:

    1) Our parent is already in the list, e.g.

         root
           \
            b
           / \
          c   d* (root->b->c already enqueued)

    Since d's parent is enqueued we push it to the head of the list,
    implicitly ahead of b.

    2) Our parent does not appear in the list (or we have no parent)

    In this case we enqueue to the tail of the list; if our parent is
    subsequently enqueued (bottom-up) it will appear to our right by the
    same rule.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
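
    A standalone sketch of the two-case insertion rule above (toy array
    "list" and made-up names, not the kernel's list_head code): insert at
    the head if our parent is already on the list, otherwise at the tail,
    which keeps every child to the left of its parent.

        #include <stdio.h>
        #include <string.h>

        #define MAX 8

        static const char *leaf_list[MAX];   /* head at index 0 */
        static int len;

        static int parent_on_list(const char *parent)
        {
                for (int i = 0; i < len; i++)
                        if (parent && !strcmp(leaf_list[i], parent))
                                return 1;
                return 0;
        }

        /* The rule from the entry above: head if our parent is already
         * queued, tail otherwise (a later-enqueued parent then lands to
         * our right by the same rule). */
        static void enqueue_leaf(const char *name, const char *parent)
        {
                if (parent_on_list(parent)) {
                        memmove(&leaf_list[1], &leaf_list[0],
                                len * sizeof(leaf_list[0]));
                        leaf_list[0] = name;
                } else {
                        leaf_list[len] = name;
                }
                len++;
        }

        int main(void)
        {
                /* Hierarchy root->b->{c,d}, enqueued bottom-up. */
                enqueue_leaf("c", "b");      /* parent not queued -> tail */
                enqueue_leaf("b", "root");   /* parent not queued -> tail */
                enqueue_leaf("root", NULL);  /* no parent -> tail         */
                enqueue_leaf("d", "b");      /* parent queued -> head     */

                for (int i = 0; i < len; i++)
                        printf("%s ", leaf_list[i]);  /* children first */
                printf("\n");
                return 0;
        }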
     
  • Using cfs_rq->nr_running is not sufficient to synchronize update_cfs_load with
    the put path since nr_running accounting occurs at deactivation.

    It's also not safe to make the removal decision based on load_avg as this fails
    with both high periods and low shares. Resolve this by clipping history after
    4 periods without activity.

    Note: the above will always occur from update_shares() since in the
    last-task-sleep-case that task will still be cfs_rq->curr when update_cfs_load
    is called.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • As part of enqueue_entity both a new entity weight and its contribution to the
    queuing cfs_rq / rq are updated. Since update_cfs_shares will only update the
    queueing weights when the entity is on_rq (which in this case it is not yet),
    there's a dependency loop here:

    update_cfs_shares needs account_entity_enqueue to update cfs_rq->load.weight
    account_entity_enqueue needs the updated weight for the queuing cfs_rq load[*]

    Fix this and avoid spurious dequeue/enqueues by issuing update_cfs_shares as
    if we had accounted the enqueue already.

    This was also resulting in rq->load corruption previously.

    [*]: this dependency also exists when using the group cfs_rq w/
    update_cfs_shares as the weight of the enqueued entity changes
    without the load being updated.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Make tg_shares_up() use the active cgroup list; this means we cannot
    do a strict bottom-up walk of the hierarchy, but assuming it's a very
    wide tree with a small number of active groups it should be a win.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Make certain load-balance actions scale per number of active cgroups
    instead of the number of existing cgroups.

    This makes wakeup/sleep paths more expensive, but is a win for systems
    where the vast majority of existing cgroups are idle.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • By tracking a per-cpu load-avg for each cfs_rq and folding it into a
    global task_group load on each tick we can rework tg_shares_up to be
    strictly per-cpu.

    This should improve cpu-cgroup performance for smp systems
    significantly.

    [ Paul: changed to use queueing cfs_rq + bug fixes ]

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • An earlier commit reverts idle balancing throttling reset to fix a 30%
    regression in volanomark throughput. We still need to reset idle_stamp
    when we pull a task in newidle balance.

    Reported-by: Alex Shi
    Signed-off-by: Nikhil Rao
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Nikhil Rao
     
  • Commit fab4762 triggers excessive idle balancing, causing a ~30% loss in
    volanomark throughput. Remove idle balancing throttle reset.

    Originally-by: Alex Shi
    Signed-off-by: Mike Galbraith
    Acked-by: Nikhil Rao
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Alex Shi
     

11 Nov, 2010

2 commits

  • Instead of dealing with sched classes inside each check_preempt_curr()
    implementation, pull out this logic into the generic wakeup preemption
    path.

    This fixes a hang in KVM (and others) where we are waiting for the
    stop machine thread to run ...

    Reported-by: Markus Trippelsdorf
    Tested-by: Marcelo Tosatti
    Tested-by: Sergey Senozhatsky
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently we consider a sched domain to be well balanced when the imbalance
    is less than the domain's imbalance_pct. As the number of cores and threads
    increases, current values of imbalance_pct (for example 25% for a
    NUMA domain) are not enough to detect imbalances like:

    a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical
    threads), 24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on
    the other, leading to an idle HT cpu.

    b) On a hypothetical 2-socket NHM-EX system (each socket having 8 cores and
    16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
    socket and 7 on the other, leaving one core in one socket idle while a core
    in the other socket has both its HT siblings busy.

    While this issue can be fixed by decreasing the domain's imbalance_pct
    (by making it a function of number of logical cpus in the domain), it
    can potentially cause more task migrations across sched groups in an
    overloaded case.

    Fix this by using imbalance_pct only during newly_idle and busy load
    balancing. During idle load balancing, instead check whether there is an
    imbalance in the number of idle cpus between the busiest group and this
    sched_group, or whether the busiest group has more tasks than its weight
    which the idle cpu in this_group can pull (a back-of-the-envelope check of
    case (a) follows this entry).

    Reported-by: Nikhil Rao
    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Suresh Siddha
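
    A back-of-the-envelope check of case (a) above (standalone; the loads
    are simply taken as task counts, and the percentage-style comparison is
    a simplification of the real check):

        #include <stdio.h>

        /* Call the domain balanced when busiest <= local * pct / 100. */
        static int considered_balanced(int busiest, int local, int pct)
        {
                return 100 * busiest <= pct * local;
        }

        int main(void)
        {
                /* 24 cpu hogs split 13/11 across two sockets: 13/11 is
                 * ~118%, under the 125% NUMA threshold, so the domain is
                 * reported balanced even though an HT sibling sits idle. */
                printf("13 vs 11, pct=125 -> balanced=%d\n",
                       considered_balanced(13, 11, 125));
                return 0;
        }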
     

29 Oct, 2010

1 commit


22 Oct, 2010

2 commits

  • Dima noticed that we fail to correct the ->vruntime of sleeping tasks
    when we move them between cgroups.

    Reported-by: Dima Zavin
    Signed-off-by: Peter Zijlstra
    Tested-by: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits)
    sched: Export account_system_vtime()
    sched: Call tick_check_idle before __irq_enter
    sched: Remove irq time from available CPU power
    sched: Do not account irq time to current task
    x86: Add IRQ_TIME_ACCOUNTING
    sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time
    sched: Add a PF flag for ksoftirqd identification
    sched: Consolidate account_system_vtime extern declaration
    sched: Fix softirq time accounting
    sched: Drop group_capacity to 1 only if local group has extra capacity
    sched: Force balancing on newidle balance if local group has capacity
    sched: Set group_imb only a task can be pulled from the busiest cpu
    sched: Do not consider SCHED_IDLE tasks to be cache hot
    sched: Drop all load weight manipulation for RT tasks
    sched: Create special class for stop/migrate work
    sched: Unindent labels
    sched: Comment updates: fix default latency and granularity numbers
    tracing/sched: Add sched_pi_setprio tracepoint
    sched: Give CPU bound RT tasks preference
    sched: Try not to migrate higher priority RT tasks
    ...

    Linus Torvalds
     

19 Oct, 2010

5 commits

  • The idea was suggested by Peter Zijlstra here:

    http://marc.info/?l=linux-kernel&m=127476934517534&w=2

    irq time is technically not available to the tasks running on the CPU.
    This patch removes irq time from CPU power piggybacking on
    sched_rt_avg_update().

    Tested this by keeping CPU X busy with a network-intensive task that has
    ~75% of a single CPU's worth of irq processing (hard+soft) on a 4-way
    system, and starting seven cycle soakers on the system. Without this
    change, there will be two tasks on each CPU. With this change, there is a
    single task on the irq-busy CPU X and the remaining 7 tasks are spread
    among the other 3 CPUs.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
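
    A tiny sketch of the scaling idea (standalone arithmetic, illustrative
    names rather than the scheduler's fixed-point code): if a cpu spends a
    fraction of its time in irq processing, the power the load balancer
    credits it with is scaled down by that fraction, so a cpu that is 75%
    busy with irqs looks like roughly a quarter of a cpu.

        #include <stdio.h>

        #define CPU_POWER_SCALE 1024  /* nominal power of one cpu */

        /* available ~= nominal * (1 - irq_fraction) */
        static long scale_power_by_irq(long power, long irq_pct)
        {
                return power * (100 - irq_pct) / 100;
        }

        int main(void)
        {
                /* The test above: one cpu ~75% busy in hard+soft irq. */
                printf("irq-busy cpu: %ld of %d\n",
                       scale_power_by_irq(CPU_POWER_SCALE, 75),
                       CPU_POWER_SCALE);

                /* An irq-free cpu keeps its full power. */
                printf("irq-free cpu: %ld of %d\n",
                       scale_power_by_irq(CPU_POWER_SCALE, 0),
                       CPU_POWER_SCALE);
                return 0;
        }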
     
  • The scheduler accounts both softirq and interrupt processing time to the
    currently running task. This means that if the interrupt processing was
    done on behalf of some other task in the system, the current task ends up
    being penalized, as it gets a shorter runtime than it otherwise would.

    Change sched task accounting to account only actual task time to the
    currently running task. update_curr() now computes delta_exec based on
    rq->clock_task.

    Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We
    can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort, but
    that's for later.

    This change will impact scheduling behavior in interrupt heavy conditions.

    Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
    task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
    spending 75%+ of its time in irq processing. CPU 3 spending around 35%
    time running nc task.

    Now, if I run another CPU intensive task on CPU 2, without this change
    /proc//schedstat shows 100% of time accounted to this task. With this
    change, it rightly shows less than 25% accounted to this task as remaining
    time is actually spent on irq processing.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
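
    A minimal sketch of the accounting change (standalone; field names are
    illustrative): the per-rq task clock advances only by the portion of
    wall time not spent in irq context, and update_curr-style deltas are
    taken from that clock, so irq time is no longer charged to whichever
    task happened to be current.

        #include <stdio.h>

        struct rq_like {                       /* times in nanoseconds */
                unsigned long long clock;      /* raw sched clock      */
                unsigned long long clock_task; /* clock minus irq time */
                unsigned long long irq_time;   /* hard + soft irq time */
        };

        /* Only the non-irq part of each delta flows into clock_task,
         * roughly what CONFIG_IRQ_TIME_ACCOUNTING arranges. */
        static void update_clock(struct rq_like *rq,
                                 unsigned long long delta,
                                 unsigned long long irq_delta)
        {
                rq->clock += delta;
                rq->irq_time += irq_delta;
                rq->clock_task += delta - irq_delta;
        }

        int main(void)
        {
                struct rq_like rq = { 0, 0, 0 };

                /* A 10ms interval, 7.5ms of which went to irq processing. */
                update_clock(&rq, 10000000ULL, 7500000ULL);

                /* A task current for the whole interval is charged only the
                 * clock_task delta, i.e. the 2.5ms it actually ran. */
                printf("clock=%llu clock_task=%llu\n",
                       rq.clock, rq.clock_task);
                return 0;
        }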
     
  • When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
    only if the local group has extra capacity. The extra check prevents the
    case where we always pull from the heaviest group when it is already
    under-utilized (possible when a large-weight task outweighs the tasks on
    the system).

    For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
    scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
    and each task is running on one core. In this case, we observe the following
    events when balancing at the NUMA domain:

    - find_busiest_group() will always pick the sched group containing the niced
    task to be the busiest group.
    - find_busiest_queue() will then always pick one of the cpus running the
    nice0 task (never picks the cpu with the nice -15 task since
    weighted_cpuload > imbalance).
    - The load balancer fails to migrate the task since it is the running task
    and increments sd->nr_balance_failed.
    - It repeats the above steps a few more times until sd->nr_balance_failed > 5,
    at which point it kicks off the active load balancer, wakes up the migration
    thread and kicks the nice 0 task off the cpu.

    The load balancer doesn't stop until we kick out all nice 0 tasks from
    the sched group, leaving you with 3 idle cpus and one cpu running the
    nice -15 task.

    When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
    domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load checks are
    not relevant because the niced task has a very large weight.

    In this patch, we add an extra condition to the "if(prefer_sibling)" check in
    update_sd_lb_stats(). We drop the capacity of a group only if the local group
    has extra capacity, ie. nr_running < group_capacity. This patch preserves the
    original intent of the prefer_siblings check (to spread tasks across the system
    in low utilization scenarios) and fixes the case above.

    It helps in the following ways:
    - In low utilization cases (where nr_tasks << nr_cpus), we still drop
    group_capacity down to 1 if we prefer siblings.
    - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
    likely be > sgs.group_capacity.
    - When balancing large weight tasks, if the local group does not have extra
    capacity, we do not pick the group with the niced task as the busiest group.
    This prevents failed balances, active migration and the under-utilization
    described above.

    Signed-off-by: Nikhil Rao
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Nikhil Rao
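
    A condensed sketch of the added condition (standalone; the field names
    are an illustrative subset of the balance statistics): the
    prefer_sibling capacity clamp now fires only when the local group
    itself has spare capacity.

        #include <stdio.h>

        struct group_stats {
                unsigned int nr_running;
                unsigned int group_capacity;
        };

        /* "Extra capacity": fewer runnable tasks than capacity. */
        static int has_extra_capacity(const struct group_stats *g)
        {
                return g->nr_running < g->group_capacity;
        }

        /* The prefer_sibling clamp with the new local-group condition. */
        static unsigned int clamped_capacity(int prefer_sibling,
                                             const struct group_stats *local,
                                             const struct group_stats *busiest)
        {
                if (prefer_sibling && has_extra_capacity(local))
                        return 1;
                return busiest->group_capacity;
        }

        int main(void)
        {
                struct group_stats busiest = { 4, 4 };
                struct group_stats local_spare = { 3, 4 };
                struct group_stats local_full  = { 4, 4 };

                /* Low utilization: local group has room, clamp applies. */
                printf("local has room -> capacity=%u\n",
                       clamped_capacity(1, &local_spare, &busiest));

                /* The niced-task case: local group is full, don't clamp. */
                printf("local is full  -> capacity=%u\n",
                       clamped_capacity(1, &local_full, &busiest));
                return 0;
        }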
     
  • This patch forces a load balance on a newly idle cpu when the local group has
    extra capacity and the busiest group does not have any. It improves system
    utilization when balancing tasks with a large weight differential.

    Under certain situations, such as a niced down task (i.e. nice = -15) in the
    presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
    kicks away other tasks because of its large weight. This leads to sub-optimal
    utilization of the machine. Even though the sched group has capacity, it does
    not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.

    With this patch, if the local group has extra capacity, we shortcut the checks
    in f_b_g() and try to pull a task over. A sched group has extra capacity if the
    group capacity is greater than the number of running tasks in that group.

    Thanks to Mike Galbraith for discussions leading to this patch and for the
    insight to reuse SD_NEWIDLE_BALANCE.

    Signed-off-by: Nikhil Rao
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Nikhil Rao
     
  • When cycling through sched groups to determine the busiest group, set
    group_imb only if the busiest cpu has more than 1 runnable task. This patch
    fixes the case where two cpus in a group have one runnable task each, but
    there is a large weight differential between these two tasks. The load
    balancer is unable to migrate any task from this group, and hence does not
    consider this group to be imbalanced.

    Signed-off-by: Nikhil Rao
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    [ small code readability edits ]
    Signed-off-by: Ingo Molnar

    Nikhil Rao
     

14 Oct, 2010

2 commits


08 Oct, 2010

1 commit

  • > ===================================================
    > [ INFO: suspicious rcu_dereference_check() usage. ]
    > ---------------------------------------------------
    > /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection!
    >
    > other info that might help us debug this:
    >
    > rcu_scheduler_active = 1, debug_locks = 1
    > 1 lock held by ifup/23517:
    > #0: (&rq->lock){-.-.-.}, at: [] task_fork_fair+0x3b/0x108
    >
    > stack backtrace:
    > Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5
    > Call Trace:
    > [] ? printk+0xf/0x16
    > [] lockdep_rcu_dereference+0x74/0x7d
    > [] task_group+0x6d/0x79
    > [] set_task_rq+0xe/0x57
    > [] task_fork_fair+0x57/0x108
    > [] sched_fork+0x82/0xf9
    > [] copy_process+0x569/0xe8e
    > [] do_fork+0x118/0x262
    > [] ? do_page_fault+0x16a/0x2cf
    > [] ? up_read+0x16/0x2a
    > [] sys_clone+0x1b/0x20
    > [] ptregs_clone+0x15/0x30
    > [] ? sysenter_do_call+0x12/0x38

    Here a newly created task is having its runqueue assigned. The new task
    is not yet on the tasklist, so cannot go away. This is therefore a false
    positive, suppress with an RCU read-side critical section.

    Reported-by: Ben Greear
    Tested-by: Ben Greear

    Paul E. McKenney
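
    A generic sketch of the suppression technique (standalone, with stub
    macros standing in for the kernel primitives): wrap the lookup in an
    RCU read-side critical section so rcu_dereference_check() sees a
    legitimate protection context, even though the new task cannot go away
    here for other reasons.

        #include <stdio.h>

        /* Stubs so the sketch compiles on its own; in the kernel these are
         * the real rcu_read_lock()/rcu_read_unlock()/rcu_dereference(). */
        #define rcu_read_lock()    do { } while (0)
        #define rcu_read_unlock()  do { } while (0)
        #define rcu_dereference(p) (p)

        struct task_group_like { int id; };
        static struct task_group_like root_tg = { .id = 0 };
        static struct task_group_like *current_tg = &root_tg;

        /* The lookup the splat complains about: the child task is not yet
         * on the tasklist, so this is a false positive, but adding the
         * read-side critical section satisfies the lockdep check. */
        static int lookup_group_id(void)
        {
                int id;

                rcu_read_lock();
                id = rcu_dereference(current_tg)->id;
                rcu_read_unlock();
                return id;
        }

        int main(void)
        {
                printf("group id = %d\n", lookup_group_id());
                return 0;
        }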
     

21 Sep, 2010

2 commits

  • The scheduler uses cache_nice_tries as an indicator to do cache_hot and
    active load balance when normal load balance fails. Currently, this value
    is changed on any failed load balance attempt. That ends up being not so
    nice to workloads that enter/exit idle often, as they do more frequent
    new_idle balance, and that pretty soon results in cache-hot tasks being
    pulled in.

    Making the cache_nice_tries ignore failed new_idle balance seems to
    make better sense. With that only the failed load balance in
    periodic load balance gets accounted and the rate of accumulation
    of cache_nice_tries will not depend on idle entry/exit (short
    running sleep-wakeup kind of tasks). This reduces movement of
    cache_hot tasks.

    schedstat diff (after-before) excerpt from a workload that has a frequent
    and short wakeup-idle pattern (:2 in the cpu column below refers to the
    NEWIDLE idx). This snapshot was taken across ~400 seconds.

    Without this change:
    domainstats: domain0
    cpu cnt bln fld imb gain hgain nobusyq nobusyg
    0:2 306487 219575 73167 110069413 44583 19070 1172 218403
    1:2 292139 194853 81421 120893383 50745 21902 1259 193594
    2:2 283166 174607 91359 129699642 54931 23688 1287 173320
    3:2 273998 161788 93991 132757146 57122 24351 1366 160422
    4:2 289851 215692 62190 83398383 36377 13680 851 214841
    5:2 316312 222146 77605 117582154 49948 20281 988 221158
    6:2 297172 195596 83623 122133390 52801 21301 929 194667
    7:2 283391 178078 86378 126622761 55122 22239 928 177150
    8:2 297655 210359 72995 110246694 45798 19777 1125 209234
    9:2 297357 202011 79363 119753474 50953 22088 1089 200922
    10:2 278797 178703 83180 122514385 52969 22726 1128 177575
    11:2 272661 167669 86978 127342327 55857 24342 1195 166474
    12:2 293039 204031 73211 110282059 47285 19651 948 203083
    13:2 289502 196762 76803 114712942 49339 20547 1016 195746
    14:2 264446 169609 78292 115715605 50459 21017 982 168627
    15:2 260968 163660 80142 116811793 51483 21281 1064 162596

    With this change:
    domainstats: domain0
    cpu cnt bln fld imb gain hgain nobusyq nobusyg
    0:2 272347 187380 77455 105420270 24975 1 953 186427
    1:2 267276 172360 86234 116242264 28087 6 1028 171332
    2:2 259769 156777 93281 123243134 30555 1 1043 155734
    3:2 250870 143129 97627 127370868 32026 6 1188 141941
    4:2 248422 177116 64096 78261112 22202 2 757 176359
    5:2 275595 180683 84950 116075022 29400 6 778 179905
    6:2 262418 162609 88944 119256898 31056 4 817 161792
    7:2 252204 147946 92646 122388300 32879 4 824 147122
    8:2 262335 172239 81631 110477214 26599 4 864 171375
    9:2 261563 164775 88016 117203621 28331 3 849 163926
    10:2 243389 140949 93379 121353071 29585 2 909 140040
    11:2 242795 134651 98310 124768957 30895 2 1016 133635
    12:2 255234 166622 79843 104696912 26483 4 746 165876
    13:2 244944 151595 83855 109808099 27787 3 801 150794
    14:2 241301 140982 89935 116954383 30403 6 845 140137
    15:2 232271 128564 92821 119185207 31207 4 1416 127148

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
  • There's a situation where the nohz balancer will try to wake itself:

    cpu-x is idle and is also the ilb_cpu;
    it gets a scheduler tick during idle,
    and the nohz_kick_needed() check in trigger_load_balance() sees
    rq_x->nr_running, which might not be zero (because someone woke a task
    on this rq, etc.), leading to cpu-x sending a kick to itself.

    And this can cause a lockup.

    Avoid this by not marking ourselves eligible for kicking (a small
    sketch of the guard follows this entry).

    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Suresh Siddha
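
    A small sketch of the guard (standalone; the helper only mirrors the
    shape of the check, not the kernel's nohz bookkeeping): the cpu that
    currently owns idle load balancing must not nominate itself for a kick,
    otherwise the scenario above ends with cpu-x kicking itself.

        #include <stdio.h>

        static int ilb_cpu = 3;   /* cpu owning idle load balancing */

        /* A busy cpu may kick the ilb owner, but the ilb owner must not
         * mark itself eligible to be kicked. */
        static int kick_needed(int this_cpu, int nr_running)
        {
                if (nr_running < 1)
                        return 0;           /* nothing to balance for */
                if (this_cpu == ilb_cpu)
                        return 0;           /* never kick ourselves   */
                return 1;
        }

        int main(void)
        {
                /* cpu 3 is the ilb owner and briefly sees nr_running == 1
                 * during its own tick: without the guard it self-kicks. */
                printf("cpu 3 kicks: %d\n", kick_needed(3, 1));
                printf("cpu 1 kicks: %d\n", kick_needed(1, 1));
                return 0;
        }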
     

14 Sep, 2010

1 commit

  • Mathieu reported bad latencies with make -j10 kind of kbuild
    workloads - which is mostly caused by us scheduling with too
    coarse a granularity.

    Reduce the minimum granularity some more, to make sure we
    can meet the latency target.

    I got the following results (make -j10 kbuild load, average of 3
    runs):

    vanilla:

    maximum latency: 38278.9 µs
    average latency: 7730.1 µs

    patched:

    maximum latency: 22702.1 µs
    average latency: 6684.8 µs

    Mathieu also measured it:

    |
    | * wakeup-latency.c (SIGEV_THREAD) with make -j10
    |
    | - Mainline 2.6.35.2 kernel
    |
    | maximum latency: 45762.1 µs
    | average latency: 7348.6 µs
    |
    | - With only Peter's smaller min_gran (shown below):
    |
    | maximum latency: 29100.6 µs
    | average latency: 6684.1 µs
    |

    Reported-by: Mathieu Desnoyers
    Reported-by: Linus Torvalds
    Acked-by: Mathieu Desnoyers
    Suggested-by: Peter Zijlstra
    Acked-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

11 Sep, 2010

1 commit