28 May, 2011

1 commit

  • Dima Zavin reported:

    "After pulling the thread off the run-queue during a cgroup change,
    the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
    then gets normalized to this new value. This can then lead to the thread
    getting an unfair boost in the new group if the vruntime of the next
    task in the old run-queue was way further ahead."

    Reported-by: Dima Zavin
    Signed-off-by: John Stultz
    Recalls-having-tested-once-upon-a-time-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 May, 2011

1 commit

  • SCHED_LOAD_SCALE is used to increase nice resolution and to
    scale cpu_power calculations in the scheduler. This patch
    introduces SCHED_POWER_SCALE and converts all uses of
    SCHED_LOAD_SCALE for scaling cpu_power to use SCHED_POWER_SCALE
    instead.

    This is a preparatory patch for increasing the resolution of
    SCHED_LOAD_SCALE, and there is no need to increase resolution
    for cpu_power calculations.

    Signed-off-by: Nikhil Rao
    Acked-by: Peter Zijlstra
    Cc: Nikunj A. Dadhania
    Cc: Srivatsa Vaddagiri
    Cc: Stephan Barwolf
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1305738580-9924-3-git-send-email-ncrao@google.com
    Signed-off-by: Ingo Molnar

    Nikhil Rao
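
    (ed: A rough sketch of what the split looks like, with simplified
    values and a made-up helper rather than the kernel's actual headers:)

      /* Illustrative only: two independent fixed-point scales. */
      #define SCHED_LOAD_SHIFT        10
      #define SCHED_LOAD_SCALE        (1UL << SCHED_LOAD_SHIFT)   /* task-load resolution */

      #define SCHED_POWER_SHIFT       10
      #define SCHED_POWER_SCALE       (1UL << SCHED_POWER_SHIFT)  /* cpu_power scaling */

      /* cpu_power math keeps using SCHED_POWER_SCALE, e.g. a CPU running
       * at 80% of nominal capacity: */
      static inline unsigned long scaled_capacity(unsigned long percent)
      {
              return SCHED_POWER_SCALE * percent / 100;
      }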
     

04 May, 2011

1 commit


19 Apr, 2011

2 commits

  • When a task in a taskgroup sleeps, pick_next_task starts all the way back at
    the root and picks the task/taskgroup with the min vruntime across all
    runnable tasks.

    But when there are many frequently sleeping tasks across different
    taskgroups, it makes better sense to stay with the same taskgroup for
    its slice period (or until all tasks in the taskgroup sleep) instead of
    switching to another taskgroup on each sleep after a short runtime.

    This helps specifically where a taskgroup corresponds to a process with
    multiple threads. The change reduces the number of CR3 switches in this
    case.

    Example:

    Two taskgroups with 2 threads each which are running for 2ms and
    sleeping for 1ms. Looking at sched:sched_switch shows:

    BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
    cpu-soaker-5004 [003] 3683.391089
    cpu-soaker-5016 [003] 3683.393106
    cpu-soaker-5005 [003] 3683.395119
    cpu-soaker-5017 [003] 3683.397130
    cpu-soaker-5004 [003] 3683.399143
    cpu-soaker-5016 [003] 3683.401155
    cpu-soaker-5005 [003] 3683.403168
    cpu-soaker-5017 [003] 3683.405170

    AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
    cpu-soaker-21890 [003] 865.895494
    cpu-soaker-21935 [003] 865.897506
    cpu-soaker-21934 [003] 865.899520
    cpu-soaker-21935 [003] 865.901532
    cpu-soaker-21934 [003] 865.903543
    cpu-soaker-21935 [003] 865.905546
    cpu-soaker-21891 [003] 865.907548
    cpu-soaker-21890 [003] 865.909560
    cpu-soaker-21891 [003] 865.911571
    cpu-soaker-21890 [003] 865.913582
    cpu-soaker-21891 [003] 865.915594
    cpu-soaker-21934 [003] 865.917606

    A similar problem exists when there are multiple taskgroups and, say, a
    task A preempts the currently running task B of taskgroup_1. On the next
    schedule, pick_next_task can pick an unrelated task from taskgroup_2,
    when it would be better to give some preference to task B.

    A simple (maybe extreme) benchmark I tried was tbench with 2 tbench
    client processes of 2 threads each, all running on a single CPU. Average
    throughput across five 50-second runs was:

    BEFORE: 105.84 MB/sec
    AFTER: 112.42 MB/sec

    Signed-off-by: Venkatesh Pallipadi
    Acked-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1302802253-25760-1-git-send-email-venki@google.com
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
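
    (ed: A hypothetical sketch of the buddy idea described above; entity,
    pick_next and max_lag are illustrative names, not the kernel code. The
    cached buddy is preferred over a strict min-vruntime pick as long as it
    is not too far ahead of the minimum:)

      #include <stddef.h>

      /* Hypothetical, simplified entities: just a name and a vruntime. */
      struct entity {
              const char *name;
              unsigned long long vruntime;
      };

      /*
       * Pick the entity with minimum vruntime, but prefer a cached "next
       * buddy" (set when a task in the running group sleeps or is
       * preempted) while it is within max_lag of that minimum, so we keep
       * running the same task group instead of bouncing between groups.
       * Assumes next_buddy, when set, is one of the queued entities.
       */
      static struct entity *pick_next(struct entity **queue, int n,
                                      struct entity *next_buddy,
                                      unsigned long long max_lag)
      {
              struct entity *min = NULL;

              for (int i = 0; i < n; i++)
                      if (min == NULL || queue[i]->vruntime < min->vruntime)
                              min = queue[i];

              if (min && next_buddy &&
                  next_buddy->vruntime - min->vruntime <= max_lag)
                      return next_buddy;  /* honour the hint: fewer group switches */

              return min;                 /* fall back to strict fairness */
      }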
     
  • Make set_*_buddy() work on non-task sched_entity, to facilitate the
    use of next_buddy to cache a group entity in cases where one of the
    tasks within that entity sleeps or gets preempted.

    set_skip_buddy() incorrectly required the policy of the yielding task
    to be different from SCHED_IDLE. Yielding should happen even when the
    yielding task is SCHED_IDLE, so this change removes the policy check on
    the yielding task.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1302744070-30079-2-git-send-email-venki@google.com
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
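
    (ed: A minimal sketch of what "working on non-task entities" amounts to:
    record the hint at every level by walking parent links. Structure
    layouts are simplified stand-ins, not the kernel's definitions:)

      #include <stddef.h>

      struct cfs_rq;

      struct sched_entity {
              struct sched_entity *parent;    /* NULL at the root level */
              struct cfs_rq *cfs_rq;          /* runqueue this entity is queued on */
      };

      struct cfs_rq {
              struct sched_entity *next;      /* "next buddy" hint */
      };

      /* Record the hint at every level so the pick path can follow it down
       * the tree, whether 'se' is a task or a group entity. */
      static void set_next_buddy(struct sched_entity *se)
      {
              for (; se != NULL; se = se->parent)
                      se->cfs_rq->next = se;
      }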
     

18 Apr, 2011

1 commit


14 Apr, 2011

3 commits

    In order to avoid reading partially updated min_vruntime values on
    32-bit, implement a seqcount-like solution.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.111378493@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
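
    (ed: The problem is that a 64-bit min_vruntime cannot be loaded
    atomically on 32-bit, so a reader may observe half an update. Below is
    a userspace sketch of the retry pattern only; the kernel patch uses its
    own copy field and memory barriers rather than these GCC builtins:)

      #include <stdint.h>

      struct min_vruntime_box {
              unsigned int seq;               /* odd while an update is in flight */
              uint64_t min_vruntime;
      };

      static void write_min_vruntime(struct min_vruntime_box *b, uint64_t val)
      {
              b->seq++;                       /* mark update in progress */
              __sync_synchronize();
              b->min_vruntime = val;
              __sync_synchronize();
              b->seq++;                       /* mark update complete */
      }

      static uint64_t read_min_vruntime(struct min_vruntime_box *b)
      {
              unsigned int seq;
              uint64_t val;

              do {                            /* retry on torn or in-flight reads */
                      seq = b->seq;
                      __sync_synchronize();
                      val = b->min_vruntime;
                      __sync_synchronize();
              } while ((seq & 1) || seq != b->seq);

              return val;
      }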
     
  • In preparation of calling this without rq->lock held, remove the
    dependency on the rq argument.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.071474242@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In preparation of calling select_task_rq() without rq->lock held, drop
    the dependency on the rq argument.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.031077745@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

11 Apr, 2011

5 commits

  • Don't use sd->level for identifying properties of the domain.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110407122942.350174079@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Instead of relying on static allocations for the sched_domain and
    sched_group trees, dynamically allocate and RCU free them.

    Allocating this dynamically also allows for some build_sched_groups()
    simplification since we can now (like with other simplifications) rely
    on the sched_domain tree instead of hard-coded knowledge.

    One tricky point to note is that detach_destroy_domains() needs to hold
    rcu_read_lock() over the entire tear-down; doing this per-cpu is not
    sufficient since that can lead to partial sched_group existence (this
    could possibly be solved by doing the tear-down backwards, but holding
    the lock throughout is much more robust).

    A consequence of the above is that we can no longer print the
    sched_domain debug stuff from cpu_attach_domain() since that might now
    run with preemption disabled (due to classic RCU etc.) and
    sched_domain_debug() does some GFP_KERNEL allocations.

    Another thing to note is that we now fully rely on normal RCU and not
    RCU-sched. This is because, with the new and exciting RCU flavours we
    grew over the years, BH doesn't necessarily hold off RCU-sched grace
    periods (-rt is known to break this). This would in fact already cause
    us grief since we do sched_domain/sched_group iterations from softirq
    context.

    This patch is somewhat larger than I would like it to be, but I didn't
    find any means of shrinking/splitting this.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110407122942.245307941@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • calc_delta_fair() checks NICE_0_LOAD already, delete duplicate check.

    Signed-off-by: Shaohua Li
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1302238389.3981.92.camel@sli10-conroe
    Signed-off-by: Ingo Molnar

    Shaohua Li
     
    The scheduler load balancer has specific code to deal with an
    unbalanced system caused by lots of unmovable tasks (for example
    because of hard CPU affinity). In those situations, it excludes the
    busiest CPU that has pinned tasks from load-balance consideration, so
    that it can perform a second load-balance pass on the rest of the
    system.

    This all works as designed if there is only one cgroup in the system.

    However, when we have multiple cgroups, this logic has false positives
    and triggers multiple load-balance passes even though there are
    actually no pinned tasks at all.

    The reason it has false positives is that the all-pinned logic sits
    deep in the lowest-level function, can_migrate_task(), and is too low
    level:

    load_balance_fair() iterates over each task group and calls
    balance_tasks() to migrate the target load. Along the way,
    balance_tasks() also sets an all_pinned variable. Given that task
    groups are iterated, this all_pinned variable essentially reflects the
    status of the last group in the scanning process. A task group may
    migrate no load for a number of reasons, none of them due to CPU
    affinity. Yet this status bit is propagated back up to the higher-level
    load_balance(), which incorrectly thinks that no tasks could be moved.
    It then kicks off the all-pinned logic and starts multiple passes
    attempting to move load onto the pulling CPU.

    To fix this, move the all_pinned aggregation up to the iterator level.
    This ensures that the status is aggregated over all task groups, not
    just the last one in the list.

    Signed-off-by: Ken Chen
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Ken Chen
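
    (ed: Schematically, with hypothetical names rather than kernel
    structures, the fix aggregates the flag across the whole group
    iteration instead of letting the last group overwrite it:)

      #include <stdbool.h>

      struct group_stub {
              bool has_movable_tasks;         /* at least one task is not pinned */
      };

      /* Pretend per-group balance pass: reports whether anything could move. */
      static bool balance_one_group(const struct group_stub *g)
      {
              return g->has_movable_tasks;
      }

      /* Aggregate "everything was pinned" across ALL groups, instead of
       * reporting whatever the last group happened to say. */
      static bool balance_all_groups(const struct group_stub *groups, int n)
      {
              bool all_pinned = true;

              for (int i = 0; i < n; i++)
                      if (balance_one_group(&groups[i]))
                              all_pinned = false;     /* one movable group is enough */

              return all_pinned;
      }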
     
    In find_busiest_group(), the sched-domain avg_load isn't calculated at
    all if there is a group imbalance within the domain. This causes an
    erroneous imbalance calculation.

    The reason is that calculate_imbalance() sees sds->avg_load = 0 and
    dumps the entire sds->max_load into the imbalance variable, which is
    used later on to migrate the entire load from the busiest CPU to the
    pulling CPU.

    This has two really bad effects:

    1. A stampede of task migrations, which cannot break out of the bad
       state because of a positive feedback loop: large load delta ->
       heavier load migration -> larger imbalance, and the cycle goes on.

    2. Severe imbalance in CPU queue depth. This causes really long
       scheduling latency blips, which badly affect applications with
       tight latency requirements.

    The fix is to have the kernel calculate the domain avg_load in both
    cases. This ensures that the imbalance calculation is always sensible
    and the target is usually halfway between the busiest and the pulling
    CPU.

    Signed-off-by: Ken Chen
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
    Signed-off-by: Ingo Molnar

    Ken Chen
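
    (ed: The effect is easiest to see as arithmetic. A toy example, using
    "move only the excess above the domain average" as the imbalance rule;
    the real calculate_imbalance() is more involved:)

      #include <stdio.h>

      static unsigned long imbalance(unsigned long max_load, unsigned long avg_load)
      {
              /* move only the excess above the domain average */
              return max_load > avg_load ? max_load - avg_load : 0;
      }

      int main(void)
      {
              unsigned long busiest = 800, average = 500;

              printf("avg computed: move %lu\n", imbalance(busiest, average)); /* 300 */
              printf("avg left 0:   move %lu\n", imbalance(busiest, 0));       /* 800 */
              return 0;
      }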
     

08 Apr, 2011

2 commits

  • …-linus', 'irq-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86-32, fpu: Fix FPU exception handling on non-SSE systems
    x86, hibernate: Initialize mmu_cr4_features during boot
    x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change
    x86: visws: Fixup irq overhaul fallout

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Clean up rebalance_domains() load-balance interval calculation

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86/mrst/vrtc: Fix boot crash in mrst_rtc_init()
    rtc, x86/mrst/vrtc: Fix boot crash in rtc_read_alarm()

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq: Fix cpumask leak in __setup_irq()

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf probe: Fix listing incorrect line number with inline function
    perf probe: Fix to find recursively inlined function
    perf probe: Fix multiple --vars options behavior
    perf probe: Fix to remove redundant close
    perf probe: Fix to ensure function declared file

    Linus Torvalds
     
  • * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
    Fix common misspellings

    Linus Torvalds
     

05 Apr, 2011

1 commit

    Instead of the possible multiple evaluation of num_online_cpus() in
    rebalance_domains() that Linus reported, avoid it altogether in the
    normal case, since it is implemented as a Hamming-weight (population
    count) over a CPU bitmask, which can be darn expensive on big iron.

    This also makes the code cleaner, smaller and better documented.

    Reported-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
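
    (ed: A sketch of the general pattern, illustrative rather than the
    kernel code: count the online CPUs once, cache the result, and clamp
    each domain's interval against a bound derived from the cached count
    instead of recounting mask bits on every pass:)

      /* Hamming weight over a CPU mask: cheap once, wasteful if redone for
       * every sched domain on every rebalance tick. */
      static unsigned int count_online_cpus(const unsigned long *mask,
                                            unsigned int words)
      {
              unsigned int n = 0;

              for (unsigned int i = 0; i < words; i++)
                      n += (unsigned int)__builtin_popcountl(mask[i]);
              return n;
      }

      /* Clamp a balance interval (in jiffies) against a bound derived from
       * the cached online-CPU count; the count is an input, not recomputed. */
      static unsigned long clamp_interval(unsigned long interval,
                                          unsigned int online_cpus,
                                          unsigned long hz)
      {
              unsigned long max = hz * online_cpus / 10;  /* illustrative bound */

              if (interval < 1)
                      interval = 1;
              if (interval > max)
                      interval = max;
              return interval;
      }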
     

31 Mar, 2011

2 commits

  • Fixes generated by 'codespell' and manually reviewed.

    Signed-off-by: Lucas De Marchi

    Lucas De Marchi
     
    The interval used to check whether scheduling domains are due to be
    balanced currently depends on the compile-time constant NR_CPUS, which
    may not accurately reflect the number of online CPUs at the time of the
    check.

    Thus replace NR_CPUS with num_online_cpus().

    (ed: Should only affect those who set NR_CPUS really high, such as 4096
    or so :-)

    Signed-off-by: Sisir Koppaka
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Sisir Koppaka
     

04 Mar, 2011

2 commits

    yield_to_task_fair() reschedules the CPU of the yielding task, when the
    intention is to reschedule the CPU of the task that is being yielded
    to.

    The change here fixes the problem and also makes the resched
    conditional on rq != p_rq.

    Signed-off-by: Venkatesh Pallipadi
    Reviewed-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
    Perform the test for SCHED_IDLE before testing for SCHED_BATCH (and
    ensure idle tasks don't preempt idle tasks) so that the
    non-interactive, but still important, SCHED_BATCH tasks run in
    preference to the very low priority SCHED_IDLE tasks.

    Signed-off-by: Darren Hart
    Signed-off-by: Peter Zijlstra
    Acked-by: Mike Galbraith
    Cc: Richard Purdie
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
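
    (ed: A condensed sketch of just the policy checks described above; the
    real check_preempt_wakeup() continues with buddy and vruntime logic
    when neither rule fires:)

      enum policy { SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE };

      /* 1 = preempt, 0 = don't, -1 = undecided (fall through to the
       * vruntime-based checks). 'curr' is running, 'wakee' just woke up. */
      static int policy_preempt_decision(enum policy curr, enum policy wakee)
      {
              /* SCHED_IDLE tested first: a non-idle wakee (including
               * SCHED_BATCH) preempts an idle current task... */
              if (curr == SCHED_IDLE && wakee != SCHED_IDLE)
                      return 1;

              /* ...while batch and idle wakees never preempt via wakeup
               * (their preemption is driven by the tick). */
              if (wakee != SCHED_NORMAL)
                      return 0;

              return -1;
      }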
     

23 Feb, 2011

3 commits


16 Feb, 2011

1 commit

  • sd_idle logic was introduced way back in 2005 (commit 5969fe06),
    as an HT optimization.

    As per the discussion in the thread here:

    lkml - sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1
    https://patchwork.kernel.org/patch/532501/

    The capacity-based logic in the load balancer now handles this in a
    much cleaner way, handles more than 2 SMT siblings, etc., and sd_idle
    does not seem to bring any additional benefit. The sd_idle logic also
    has some bugs that have a performance impact. This patch removes the
    sd_idle logic altogether.

    Also, sched_mc_power_savings == 2 had a dependency on the sd_idle
    logic.

    Signed-off-by: Venkatesh Pallipadi
    Acked-by: Vaidyanathan Srinivasan
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     

03 Feb, 2011

4 commits

  • Currently only implemented for fair class tasks.

    Add a yield_to_task() method to the fair scheduling class, allowing the
    caller of yield_to() to accelerate another thread in its thread group /
    task group.

    Implemented via a scheduler hint, using cfs_rq->next to encourage the
    target being selected. We can rely on pick_next_entity to keep things
    fair, so no one can accelerate a thread that has already used its fair
    share of CPU time.

    This also means callers should only call yield_to when they really
    mean it. Calling it too often can result in the scheduler just
    ignoring the hint.

    Signed-off-by: Rik van Riel
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Use the buddy mechanism to implement yield_task_fair. This
    allows us to skip onto the next highest priority se at every
    level in the CFS tree, unless doing so would introduce gross
    unfairness in CPU time distribution.

    We order the buddy selection in pick_next_entity to check
    yield first, then last, then next. We need next to be able
    to override yield, because it is possible for the "next" and
    "yield" tasks to be different processes in the same sub-tree
    of the CFS tree. When they are, we need to go into that
    sub-tree regardless of the "yield" hint, and pick the correct
    entity once we get to the right level.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
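
    (ed: Schematically, the pick order above looks like the sketch below.
    entitled() and GRANULARITY are illustrative stand-ins for the
    wakeup_preempt_entity() fairness test, not the kernel source:)

      struct sched_entity {
              unsigned long long vruntime;
      };

      struct buddies {
              struct sched_entity *skip;      /* set by yield: don't pick me */
              struct sched_entity *last;      /* cache-hot previous entity */
              struct sched_entity *next;      /* explicitly requested successor */
      };

      /* May the hinted entity run ahead of the leftmost (min-vruntime)
       * entity without gross unfairness? GRANULARITY is a made-up bound. */
      static int entitled(const struct sched_entity *hint,
                          const struct sched_entity *leftmost)
      {
              const unsigned long long GRANULARITY = 1000000ULL;  /* ~1ms in ns */

              return hint->vruntime <= leftmost->vruntime + GRANULARITY;
      }

      static struct sched_entity *pick(const struct buddies *b,
                                       struct sched_entity *leftmost,
                                       struct sched_entity *second)
      {
              struct sched_entity *se = leftmost;

              /* 1. yield: step over the skip buddy if the runner-up may run */
              if (b->skip == se && second && entitled(second, leftmost))
                      se = second;

              /* 2. last: prefer the cache-hot previous entity */
              if (b->last && entitled(b->last, leftmost))
                      se = b->last;

              /* 3. next: checked last so it can override both of the above */
              if (b->next && entitled(b->next, leftmost))
                      se = b->next;

              return se;
      }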
     
  • The clear_buddies function does not seem to play well with the concept
    of hierarchical runqueues. In the following tree, task groups are
    represented by 'G', tasks by 'T', next by 'n' and last by 'l'.

            (nl)
           /    \
       G(nl)     G
       /   \      \
    T(l)   T(n)    T

    This situation can arise when a task T(n) is woken up, and the
    previously running task T(l) is marked last.

    When clear_buddies is called from either T(l) or T(n), the next and last
    buddies of the group G(nl) will be cleared. This is not the desired
    result, since we would like to be able to find the other type of buddy
    in many cases.

    This is especially a worry when implementing yield_task_fair through
    the buddy system.

    The fix is simple: only clear the buddy type that the task itself
    is indicated to be. As an added bonus, we stop walking up the tree
    when the buddy has already been cleared or pointed elsewhere.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
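
    (ed: A minimal sketch of the fix with simplified structures, not the
    kernel's: clear only the buddy type the task was marked as, and stop
    walking up once the pointer at some level no longer refers to this
    branch:)

      #include <stddef.h>

      struct cfs_rq;

      struct sched_entity {
              struct sched_entity *parent;    /* NULL at the root */
              struct cfs_rq *cfs_rq;
      };

      struct cfs_rq {
              struct sched_entity *next;
              struct sched_entity *last;
      };

      /* Clear only the 'last' hint, and only while it still points at this
       * branch; the 'next' hint at the same level (e.g. a sibling buddy) is
       * left alone. The 'next' variant would be the mirror image. */
      static void clear_buddy_last(struct sched_entity *se)
      {
              for (; se != NULL; se = se->parent) {
                      if (se->cfs_rq->last == se)
                              se->cfs_rq->last = NULL;
                      else
                              break;  /* already cleared or points elsewhere */
              }
      }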
     
  • With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq.
    Yielding to a task from another cfs_rq may be worthwhile, since
    a process calling yield typically cannot use the CPU right now.

    Therefore, we want to check the per-cpu nr_running, not the
    cgroup-local one.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

26 Jan, 2011

6 commits

    When a task is taken out of the fair class we must ensure the vruntime
    is properly normalized, because when we put it back in it will be
    assumed to be normalized.

    The case that goes wrong is when changing away from the fair class
    while sleeping. Sleeping tasks have non-normalized vruntime in order
    to make sleeper-fairness work. So treat the switch away from fair as a
    wakeup and preserve the relative vruntime.

    Also update sysrq-n to call the ->switch_{to,from} methods.

    Reported-by: Onkalo Samu
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
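
    (ed: In sketch form, with toy structures rather than the kernel's:
    queued entities carry absolute vruntime, sleepers carry a value
    relative to their old queue's min_vruntime, and switching away from
    fair while sleeping should keep that relative form so re-entry can
    re-base it:)

      struct cfs_rq_stub { unsigned long long min_vruntime; };
      struct entity      { unsigned long long vruntime; };

      /* On dequeue-for-sleep -- and, per this fix, when a sleeping task is
       * switched away from the fair class -- make vruntime relative. */
      static void make_vruntime_relative(struct entity *se,
                                         const struct cfs_rq_stub *rq)
      {
              se->vruntime -= rq->min_vruntime;
      }

      /* On wakeup / switching back to fair: re-base against the (possibly
       * advanced, possibly different) queue's min_vruntime. */
      static void make_vruntime_absolute(struct entity *se,
                                         const struct cfs_rq_stub *rq)
      {
              se->vruntime += rq->min_vruntime;
      }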
     
    Since cfs->{load_stamp,load_last} are zero-initialized, the initial
    load update will consider the delta to be 'since the beginning of
    time'.

    This results in a lot of pointless divisions to bring this large period to be
    within the sysctl_sched_shares_window.

    Fix this by initializing load_stamp to be 1 at cfs_rq initialization, this
    allows for an initial load_stamp > load_last which then lets standard idle
    truncation proceed.

    We avoid spinning (and slightly improve consistency) by fixing delta to be
    [period - 1] in this path resulting in a slightly more predictable shares ramp.
    (Previously the amount of idle time preserved by the overflow would range between
    [period/2,period-1].)

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Re-visiting this: Since update_cfs_shares will now only ever re-weight an
    entity that is a relative parent of the current entity in enqueue_entity; we
    can safely issue the account_entity_enqueue relative to that cfs_rq and avoid
    the requirement for special handling of the enqueue case in update_cfs_shares.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • The delta in clock_task is a more fair attribution of how much time a tg has
    been contributing load to the current cpu.

    While not really important it also means we're more in sync (by magnitude)
    with respect to periodic updates (since __update_curr deltas are clock_task
    based).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
    Since updates are against an entity's queuing cfs_rq, it's not possible
    to enter update_cfs_{shares,load} with a NULL cfs_rq. (Indeed,
    update_cfs_load would crash prior to the check if we did anyway, since
    load is examined during the initializers.)

    Also, in the update_cfs_load case there's no point
    in maintaining averages for rq->cfs_rq since we don't perform shares
    distribution at that level -- NULL check is replaced accordingly.

    Thanks to Dan Carpenter for pointing out the dereference before the
    NULL check.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • While care is taken around the zero-point in effective_load to not exceed
    the instantaneous rq->weight, it's still possible (e.g. using wake_idx != 0)
    for (load + effective_load) to underflow.

    In this case, comparing the unsigned values can result in incorrect
    balance decisions.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     

24 Jan, 2011

1 commit

  • Michael Witten and Christian Kujau reported that the autogroup
    scheduling feature hurts interactivity on their UP systems.

    It turns out that this is an older bug in the group scheduling code,
    and the wider appeal provided by the autogroup feature exposed it
    more prominently.

    On UP with FAIR_GROUP_SCHED enabled, tuning shares only
    affects tg->shares, and is not reflected in
    tg->se->load. The reason is that update_cfs_shares()
    does nothing on UP.

    So introduce update_cfs_shares() for UP && FAIR_GROUP_SCHED.

    This issue was found when autogroup scheduling was enabled,
    but it is an older bug that also exists with cgroup.cpu on UP.

    Reported-and-Tested-by: Michael Witten
    Reported-and-Tested-by: Christian Kujau
    Signed-off-by: Yong Zhang
    Acked-by: Pekka Enberg
    Acked-by: Mike Galbraith
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Yong Zhang
     

18 Jan, 2011

2 commits

    A signed/unsigned comparison may lead to a superfluous resched if the
    leftmost task is to the right of the current task, wasting a few cycles
    and inadvertently _lengthening_ the current task's slice.

    Reported-by: Venkatesh Pallipadi
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
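
    (ed: The pitfall in miniature, with made-up values: the vruntime delta
    is signed, and comparing it against an unsigned slice length promotes a
    negative delta to a huge unsigned value, so the "slice exceeded" test
    fires when it should not:)

      #include <stdio.h>
      #include <stdint.h>

      int main(void)
      {
              uint64_t curr_vruntime = 1000;
              uint64_t left_vruntime = 4000;  /* leftmost is "right of" curr */
              uint64_t ideal_runtime = 3000;

              int64_t delta = (int64_t)(curr_vruntime - left_vruntime); /* -3000 */

              /* Mixed signed/unsigned comparison: delta is converted to a
               * huge unsigned value, wrongly reporting "slice exceeded". */
              if (delta > ideal_runtime)
                      printf("buggy check: superfluous resched\n");

              /* Guarding the sign first gives the intended behaviour. */
              if (delta < 0)
                      printf("fixed check: curr keeps its slice\n");
              else if ((uint64_t)delta > ideal_runtime)
                      printf("fixed check: resched\n");

              return 0;
      }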
     
  • Previously effective_load would approximate the global load weight present on
    a group taking advantage of:

    entity_weight = tg->shares * (lw / global_lw), where entity_weight was
    provided by tg_shares_up.

    This worked (approximately) for an 'empty' (at tg level) cpu since we would
    place boost load representative of what a newly woken task would receive.

    However, now that load is instantaneously updated this assumption is no longer
    true and the load calculation is rather incorrect in this case.

    Fix this (and improve the general case) by re-writing effective_load to take
    advantage of the new shares distribution code.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     

19 Dec, 2010

2 commits

  • Mike Galbraith reported poor interactivity[*] when the new shares distribution
    code was combined with autogroups.

    The root cause turns out to be a mis-ordering of accounting accrued execution
    time and shares updates. Since update_curr() is issued hierarchically,
    updating the parent entity weights to reflect child enqueue/dequeue results in
    the parent's unaccounted execution time then being accrued (vs vruntime) at the
    new weight as opposed to the weight present at accumulation.

    While this doesn't have much effect on processes with timeslices that cross a
    tick, it is particularly problematic for an interactive process (e.g. Xorg)
    which incurs many (tiny) timeslices. In this scenario almost all updates are
    at dequeue which can result in significant fairness perturbation (especially if
    it is the only thread, resulting in potential {tg->shares, MIN_SHARES}
    transitions).

    Correct this by ensuring unaccounted time is accumulated prior to manipulating
    an entity's weight.

    [*] http://xkcd.com/619/ is perversely Nostradamian here.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Long running entities that do not block (dequeue) require periodic updates to
    maintain accurate share values. (Note: group entities with several threads are
    quite likely to be non-blocking in many circumstances).

    By virtue of being long-running however, we will see entity ticks (otherwise
    the required update occurs in dequeue/put and we are done). Thus we can move
    the detection (and associated work) for these updates into the periodic path.

    This restores the 'atomicity' of update_curr() with respect to accounting.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner