22 Jul, 2011

7 commits

  • No need to define a new "cfs_rq" variable in the "for" block.
    Just use the one at the top of the function.
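
    (ed: a minimal sketch of the pattern; the function and variable names here
    are illustrative, not the literal diff:)

        static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
        {
                struct cfs_rq *cfs_rq;          /* declared once at the top ... */
                struct sched_entity *se = &p->se;

                for_each_sched_entity(se) {
                        cfs_rq = cfs_rq_of(se); /* ... and reused here, instead of
                                                 * declaring a second local
                                                 * "struct cfs_rq *cfs_rq" inside
                                                 * the loop body */
                        /* ... enqueue/update work on cfs_rq ... */
                }
        }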

    Signed-off-by: Lin Ming
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311297271.3938.1352.camel@minggr.sh.intel.com
    Signed-off-by: Ingo Molnar

    Lin Ming
     
  • "entity_key()" is only used in "__enqueue_entity()" and
    its only function is to subtract the group's min_vruntime
    from a task's vruntime.
    Before this patch, an rbtree enqueue decision is done by
    comparing two tasks in the style:

    "if (entity_key(cfs_rq, se) < entity_key(cfs_rq, entry))"

    which would be

    "if (se->vruntime-cfs_rq->min_vruntime < entry->vruntime-cfs_rq->min_vruntime)"

    or (if reducing cfs_rq->min_vruntime out)

    "if (se->vruntime < entry->vruntime)"

    which is

    "if (entity_before(se, entry))"

    So we do not need "entity_key()".
    If "entity_before()" is inline we will also save one subtraction (only one,
    because "entity_key(cfs_rq, se)" was cached in "key")

    Signed-off-by: Stephan Baerwolf
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-ns12mnd2h5w8rb9agd8hnsfk@git.kernel.org
    Signed-off-by: Ingo Molnar

    Stephan Baerwolf
     
  • The last reference to cpu_cfs_rq() was removed with commit 88ec22d3
    ("sched: Remove the cfs_rq dependency from set_task_cpu()"). Thus,
    remove this function, too.

    Signed-off-by: Jan Schoenherr
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1310580816-10861-3-git-send-email-schnhrr@cs.tu-berlin.de
    Signed-off-by: Ingo Molnar

    Jan Schoenherr
     
  • Use for_each_leaf_cfs_rq() instead of list_for_each_entry_rcu(); this
    ensures that load_balance_fair() only iterates over those task_groups that
    actually have tasks on busiest, and that we iterate bottom-up, trying to
    move light groups before the heavier ones.

    No idea if it will actually work out to be beneficial in practice, does
    anybody have a cgroup workload that might show a difference one way or
    the other?

    [ Also move update_h_load to sched_fair.c, losing #ifdef-ery ]
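
    (ed: roughly, the iteration changes shape as sketched below; not the
    literal diff:)

        rcu_read_lock();
        update_h_load(cpu_of(busiest));

        /* old: walk every task_group in the system */
        /*   list_for_each_entry_rcu(tg, &task_groups, list) { ... } */

        /* new: walk only the cfs_rqs that actually have load on the busiest
         * rq, bottom-up, so light groups are tried before heavier ones */
        for_each_leaf_cfs_rq(busiest, busiest_cfs_rq) {
                /* ... balance_tasks() against busiest_cfs_rq ... */
        }
        rcu_read_unlock();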

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Paul Turner
    Link: http://lkml.kernel.org/r/1310557009.2586.28.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity
    with additional weight. However, we perform a double shares update on this
    entity as we continue the shares update traversal from this point, despite
    dequeue_entity() having already updated its queuing cfs_rq.
    Avoid this by starting from the parent when we resume.
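
    (ed: the gist of the fix in dequeue_task_fair(), sketched from memory; the
    exact diff may differ:)

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                dequeue_entity(cfs_rq, se, flags);

                /* Don't dequeue parent if it has other entities besides us */
                if (cfs_rq->load.weight) {
                        if (task_sleep && parent_entity(se))
                                set_next_buddy(parent_entity(se));

                        /* dequeue_entity() already updated this level, so
                         * resume the shares-update walk from the parent */
                        se = parent_entity(se);
                        break;
                }
                flags |= DEQUEUE_SLEEP;
        }

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                update_cfs_load(cfs_rq, 0);
                update_cfs_shares(cfs_rq);
        }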

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110707053059.797714697@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • While looking at check_preempt_wakeup() I realized that we are
    potentially updating the wrong entity in the fair-group scheduling
    case. In this case the current task's cfs_rq may not be the same as
    the one used for the comparison between the waking task and the
    existing task's vruntime.

    This potentially results in us using a stale vruntime in the
    pre-emption decision, providing a small false preference for the
    previous task. The effects of this are bounded since we always
    perform a hierarchical update on the tick.
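
    (ed: sketched from memory; the idea is to update the entity actually used
    in the comparison, not the task's own cfs_rq:)

        find_matching_se(&se, &pse);
        update_curr(cfs_rq_of(se));     /* refresh vruntime of the matched entity */
        BUG_ON(!pse);
        if (wakeup_preempt_entity(se, pse) == 1)
                goto preempt;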

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/CAPM31R+2Ke2urUZKao5W92_LupdR4AYEv-EZWiJ3tG=tEes2cw@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Merge reason: pick up the latest scheduler fixes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

21 Jul, 2011

1 commit

  • In order to prepare for non-unique sched_groups per domain, we need to
    carry the cpu_power elsewhere, so put a level of indirection in.
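
    (ed: roughly, the indirection looks like the sketch below; the field layout
    is a best-effort reconstruction:)

        struct sched_group_power {
                atomic_t ref;
                /*
                 * CPU power of this group, SCHED_POWER_SCALE being max power
                 * for a single CPU.
                 */
                unsigned int power, power_orig;
        };

        struct sched_group {
                struct sched_group *next;       /* Must be a circular list */
                atomic_t ref;
                struct sched_group_power *sgp;  /* cpu_power now lives behind a pointer */
                /* ... other fields ... */
                unsigned long cpumask[0];
        };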

    Reported-and-tested-by: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

01 Jul, 2011

1 commit

  • wake_affine() is only called from one path: select_task_rq_fair(),
    which already has the RCU read lock held.

    Signed-off-by: Nikunj A. Dadhania
    Signed-off-by: Peter Zijlstra
    Cc: Paul E. McKenney
    Link: http://lkml.kernel.org/r/20110607101251.777.34547.stgit@IBM-009124035060.in.ibm.com
    Signed-off-by: Ingo Molnar

    Nikunj A. Dadhania
     

28 May, 2011

1 commit

  • Dima Zavin reported:

    "After pulling the thread off the run-queue during a cgroup change,
    the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
    then gets normalized to this new value. This can then lead to the thread
    getting an unfair boost in the new group if the vruntime of the next
    task in the old run-queue was way further ahead."
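
    (ed: the gist of the fix in dequeue_entity(), sketched: normalize against
    the min_vruntime the entity was actually queued under, and only then let
    min_vruntime advance:)

        if (se != cfs_rq->curr)
                __dequeue_entity(cfs_rq, se);
        se->on_rq = 0;
        account_entity_dequeue(cfs_rq, se);

        /* normalize while min_vruntime still reflects the old queue ... */
        if (!(flags & DEQUEUE_SLEEP))
                se->vruntime -= cfs_rq->min_vruntime;

        /* ... and only then recalculate, so the departed entity gets no boost */
        update_min_vruntime(cfs_rq);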

    Reported-by: Dima Zavin
    Signed-off-by: John Stultz
    Recalls-having-tested-once-upon-a-time-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 May, 2011

1 commit

  • SCHED_LOAD_SCALE is used to increase nice resolution and to
    scale cpu_power calculations in the scheduler. This patch
    introduces SCHED_POWER_SCALE and converts all uses of
    SCHED_LOAD_SCALE for scaling cpu_power to use SCHED_POWER_SCALE
    instead.

    This is a preparatory patch for increasing the resolution of
    SCHED_LOAD_SCALE, and there is no need to increase resolution
    for cpu_power calculations.
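
    (ed: in effect, sketched:)

        /* load scaling: this is the one whose resolution gets increased later */
        #define SCHED_LOAD_SHIFT        10
        #define SCHED_LOAD_SCALE        (1L << SCHED_LOAD_SHIFT)

        /* cpu_power scaling: stays at 1024, now decoupled from SCHED_LOAD_SCALE */
        #define SCHED_POWER_SHIFT       10
        #define SCHED_POWER_SCALE       (1L << SCHED_POWER_SHIFT)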

    Signed-off-by: Nikhil Rao
    Acked-by: Peter Zijlstra
    Cc: Nikunj A. Dadhania
    Cc: Srivatsa Vaddagiri
    Cc: Stephan Barwolf
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1305738580-9924-3-git-send-email-ncrao@google.com
    Signed-off-by: Ingo Molnar

    Nikhil Rao
     

04 May, 2011

1 commit


19 Apr, 2011

2 commits

  • When a task in a taskgroup sleeps, pick_next_task starts all the way back at
    the root and picks the task/taskgroup with the min vruntime across all
    runnable tasks.

    But when there are many frequently sleeping tasks across different taskgroups,
    it makes better sense to stay with the same taskgroup for its slice period (or
    until all tasks in the taskgroup sleep) instead of switching across taskgroups
    on each sleep after a short runtime.

    This helps specifically where a taskgroup corresponds to a process with
    multiple threads. The change reduces the number of CR3 switches in this case.

    Example:

    Two taskgroups with 2 threads each which are running for 2ms and
    sleeping for 1ms. Looking at sched:sched_switch shows:

    BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
    cpu-soaker-5004 [003] 3683.391089
    cpu-soaker-5016 [003] 3683.393106
    cpu-soaker-5005 [003] 3683.395119
    cpu-soaker-5017 [003] 3683.397130
    cpu-soaker-5004 [003] 3683.399143
    cpu-soaker-5016 [003] 3683.401155
    cpu-soaker-5005 [003] 3683.403168
    cpu-soaker-5017 [003] 3683.405170

    AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
    cpu-soaker-21890 [003] 865.895494
    cpu-soaker-21935 [003] 865.897506
    cpu-soaker-21934 [003] 865.899520
    cpu-soaker-21935 [003] 865.901532
    cpu-soaker-21934 [003] 865.903543
    cpu-soaker-21935 [003] 865.905546
    cpu-soaker-21891 [003] 865.907548
    cpu-soaker-21890 [003] 865.909560
    cpu-soaker-21891 [003] 865.911571
    cpu-soaker-21890 [003] 865.913582
    cpu-soaker-21891 [003] 865.915594
    cpu-soaker-21934 [003] 865.917606

    A similar problem exists when there are multiple taskgroups and, say, a task A
    preempts the currently running task B of taskgroup_1. On schedule, pick_next_task
    can pick an unrelated task from taskgroup_2. Here it would be better to give some
    preference to task B in pick_next_task.

    A simple (maybe extreme) benchmark I tried was tbench with 2 tbench
    client processes with 2 threads each running on a single CPU. Avg throughput
    across five 50-second runs was:

    BEFORE: 105.84 MB/sec
    AFTER: 112.42 MB/sec
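
    (ed: the mechanism, sketched; set the next buddy so pick_next_entity()
    stays within the same group where fairness allows:)

        /* wakeup-preemption path in check_preempt_wakeup(): */
        if (wakeup_preempt_entity(se, pse) == 1) {
                /*
                 * Bias pick_next to pick the sched entity that is
                 * triggering this preemption.
                 */
                if (!next_buddy_marked)
                        set_next_buddy(pse);
                goto preempt;
        }

        /* sleep path in dequeue_task_fair(): */
        if (task_sleep && parent_entity(se))
                set_next_buddy(parent_entity(se));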

    Signed-off-by: Venkatesh Pallipadi
    Acked-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1302802253-25760-1-git-send-email-venki@google.com
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
  • Make set_*_buddy() work on non-task sched_entity, to facilitate the
    use of next_buddy to cache a group entity in cases where one of the
    tasks within that entity sleeps or gets preempted.

    set_skip_buddy() was incorrectly checking that the policy of the yielding
    task was not SCHED_IDLE. Yielding should happen even when the yielding
    task is SCHED_IDLE, so this change removes the policy check on the
    yielding task.
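
    (ed: roughly, after this change; a sketch:)

        static void set_next_buddy(struct sched_entity *se)
        {
                /* never boost a SCHED_IDLE *task*, but do allow group entities */
                if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
                        return;

                for_each_sched_entity(se)
                        cfs_rq_of(se)->next = se;
        }

        static void set_skip_buddy(struct sched_entity *se)
        {
                /* no policy check here: SCHED_IDLE tasks may yield too */
                for_each_sched_entity(se)
                        cfs_rq_of(se)->skip = se;
        }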

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1302744070-30079-2-git-send-email-venki@google.com
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     

18 Apr, 2011

1 commit


14 Apr, 2011

3 commits

  • In order to avoid reading partially updated min_vruntime values on 32bit,
    implement a seqcount-like solution.
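
    (ed: the shape of the solution, sketched; a shadow copy plus memory
    barriers, retried by the lockless reader until both halves agree:)

        /* writer, in update_min_vruntime(): */
        cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
        #ifndef CONFIG_64BIT
                smp_wmb();
                cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
        #endif

        /* lockless reader: */
        #ifndef CONFIG_64BIT
                u64 min_vruntime, min_vruntime_copy;

                do {
                        min_vruntime_copy = cfs_rq->min_vruntime_copy;
                        smp_rmb();
                        min_vruntime = cfs_rq->min_vruntime;
                } while (min_vruntime != min_vruntime_copy);
        #else
                u64 min_vruntime = cfs_rq->min_vruntime;
        #endif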

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.111378493@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In preparation of calling this without rq->lock held, remove the
    dependency on the rq argument.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.071474242@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In preparation of calling select_task_rq() without rq->lock held, drop
    the dependency on the rq argument.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.031077745@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

11 Apr, 2011

5 commits

  • Don't use sd->level for identifying properties of the domain.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110407122942.350174079@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Instead of relying on static allocations for the sched_domain and
    sched_group trees, dynamically allocate and RCU free them.

    Allocating this dynamically also allows for some build_sched_groups()
    simplification since we can now (like with other simplifications) rely
    on the sched_domain tree instead of hard-coded knowledge.

    One tricky thing to note is that detach_destroy_domains() needs to hold
    rcu_read_lock() over the entire tear-down; per-cpu is not sufficient
    since that can lead to partial sched_group existence (this could possibly
    be solved by doing the tear-down backwards, but this is much more
    robust).

    A consequence of the above is that we can no longer print the
    sched_domain debug stuff from cpu_attach_domain() since that might now
    run with preemption disabled (due to classic RCU etc.) and
    sched_domain_debug() does some GFP_KERNEL allocations.

    Another thing to note is that we now fully rely on normal RCU and not
    RCU-sched; this is because, with the new and exciting RCU flavours we
    have grown over the years, BH doesn't necessarily hold off RCU-sched grace
    periods (-rt is known to break this). This would in fact already cause
    us grief since we do sched_domain/sched_group iterations from softirq
    context.

    This patch is somewhat larger than I would like it to be, but I didn't
    find any means of shrinking/splitting this.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110407122942.245307941@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • calc_delta_fair() checks NICE_0_LOAD already, delete duplicate check.
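
    (ed: for reference, roughly the check that makes the caller's test
    redundant; a sketch close to mainline at the time:)

        static inline unsigned long
        calc_delta_fair(unsigned long delta, struct sched_entity *se)
        {
                /* already a no-op for NICE_0_LOAD weights */
                if (unlikely(se->load.weight != NICE_0_LOAD))
                        delta = calc_delta_mine(delta, NICE_0_LOAD, &se->load);

                return delta;
        }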

    Signed-off-by: Shaohua Li
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1302238389.3981.92.camel@sli10-conroe
    Signed-off-by: Ingo Molnar

    Shaohua Li
     
  • The scheduler load balancer has specific code to deal with cases of an
    unbalanced system due to lots of unmovable tasks (for example because of
    hard CPU affinity). In those situations, it excludes the busiest CPU that
    has pinned tasks from load-balance consideration so that it can perform
    a second load-balance pass on the rest of the system.

    This all works as designed if there is only one cgroup in the system.

    However, when we have multiple cgroups, this logic has false positives and
    triggers multiple load-balance passes even though there are actually no
    pinned tasks at all.

    The reason it has false positives is that the all-pinned logic is deep in
    the lowest-level function, can_migrate_task(), and is too low level:

    load_balance_fair() iterates over each task group and calls balance_tasks()
    to migrate the target load. Along the way, balance_tasks() will also set an
    all_pinned variable. Given that task groups are iterated, this all_pinned
    variable is essentially the status of the last group in the scanning process.
    A task group can fail to migrate any load for a number of reasons, none of
    them due to cpu affinity. However, this status bit is propagated back up
    to the higher-level load_balance(), which incorrectly thinks that no tasks
    could be moved. It kicks off the all-pinned logic and starts multiple passes
    attempting to move load onto the puller CPU.

    To fix this, move the all_pinned aggregation up to the iterator level.
    This ensures that the status is aggregated over all task groups, not just
    the last one in the list.
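
    (ed: conceptually, the aggregation moves up to the group iterator; an
    illustrative sketch with hypothetical helper names, not the actual diff:)

        int all_pinned = 1;                     /* assume everything is pinned ... */

        for_each_busy_group(busiest, group) {   /* hypothetical iterator */
                int pinned = 1;

                moved += balance_group_tasks(group, &pinned);   /* hypothetical helper */
                if (!pinned)
                        all_pinned = 0;         /* ... until any group proves otherwise */
        }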

    Signed-off-by: Ken Chen
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Ken Chen
     
  • In function find_busiest_group(), the sched-domain avg_load isn't
    calculated at all if there is a group imbalance within the domain. This
    will cause an erroneous imbalance calculation.

    The reason is that calculate_imbalance() sees sds->avg_load = 0 and it
    will dump entire sds->max_load into imbalance variable, which is used
    later on to migrate entire load from busiest CPU to the puller CPU.

    This has two really bad effects:

    1. a stampede of task migrations, which cannot break out
    of the bad state because of a positive feedback loop: large load
    delta -> heavier load migration -> larger imbalance, and the cycle
    goes on.

    2. severe imbalance in CPU queue depth. This causes really long
    scheduling latency blips, which badly affect applications that
    have tight latency requirements.

    The fix is to have the kernel calculate the domain avg_load in both cases.
    This will ensure that the imbalance calculation is always sensible and the
    target is usually halfway between the busiest and the puller CPU.
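
    (ed: the essence of the fix, sketched; compute the domain average
    unconditionally so calculate_imbalance() has a sane target:)

        /* always derive the domain-wide average load ... */
        sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;

        /*
         * ... so that calculate_imbalance() pulls roughly toward the midpoint
         * (max_load - avg_load) instead of dumping sds.max_load wholesale
         * into the imbalance when avg_load is still zero.
         */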

    Signed-off-by: Ken Chen
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
    Signed-off-by: Ingo Molnar

    Ken Chen
     

08 Apr, 2011

2 commits

  • …-linus', 'irq-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86-32, fpu: Fix FPU exception handling on non-SSE systems
    x86, hibernate: Initialize mmu_cr4_features during boot
    x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change
    x86: visws: Fixup irq overhaul fallout

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Clean up rebalance_domains() load-balance interval calculation

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86/mrst/vrtc: Fix boot crash in mrst_rtc_init()
    rtc, x86/mrst/vrtc: Fix boot crash in rtc_read_alarm()

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq: Fix cpumask leak in __setup_irq()

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf probe: Fix listing incorrect line number with inline function
    perf probe: Fix to find recursively inlined function
    perf probe: Fix multiple --vars options behavior
    perf probe: Fix to remove redundant close
    perf probe: Fix to ensure function declared file

    Linus Torvalds
     
  • * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
    Fix common misspellings

    Linus Torvalds
     

05 Apr, 2011

1 commit

  • Instead of the possible multiple-evaluation of num_online_cpus()
    in rebalance_domains() that Linus reported, avoid it altogether
    in the normal case since it's implemented with a Hamming weight
    function over a cpu bitmask which can be darn expensive for those
    with big iron.

    This also makes it cleaner, smaller and documents the code.
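
    (ed: roughly what the cleanup boils down to, sketched from memory:)

        /* updated from the CPU hotplug notifier instead of querying
         * num_online_cpus() on every rebalance tick */
        static unsigned long max_load_balance_interval = HZ/10;

        static void update_max_interval(void)
        {
                max_load_balance_interval = HZ * num_online_cpus() / 10;
        }

        /* in rebalance_domains(): scale ms to jiffies and clamp */
        interval = msecs_to_jiffies(interval);
        interval = clamp(interval, 1UL, max_load_balance_interval);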

    Reported-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

31 Mar, 2011

2 commits

  • Fixes generated by 'codespell' and manually reviewed.

    Signed-off-by: Lucas De Marchi

    Lucas De Marchi
     
  • The interval for checking whether scheduling domains are due to be
    balanced currently depends on the boot state NR_CPUS, which may not
    accurately reflect the number of online CPUs at the time of the check.

    Thus replace NR_CPUS with num_online_cpus().

    (ed: Should only affect those who set NR_CPUS really high, such as 4096
    or so :-)

    Signed-off-by: Sisir Koppaka
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Sisir Koppaka
     

04 Mar, 2011

2 commits

  • yield_to_task_fair() has code to resched the CPU of the yielding task when
    the intention is to resched the CPU of the task that is being yielded to.

    The change here fixes the problem and also makes the resched conditional on
    rq != p_rq.
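
    (ed: the gist in yield_to(), sketched; resched the target's runqueue, and
    only when it really is a different one:)

        yielded = curr->sched_class->yield_to_task(rq, p, preempt);
        if (yielded) {
                schedstat_inc(rq, yld_count);
                /*
                 * Make p's CPU reschedule; pick_next_entity takes care of
                 * fairness.
                 */
                if (preempt && rq != p_rq)
                        resched_task(p_rq->curr);
        }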

    Signed-off-by: Venkatesh Pallipadi
    Reviewed-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
  • Perform the test for SCHED_IDLE before testing for SCHED_BATCH (and
    ensure idle tasks don't preempt idle tasks) so the non-interactive,
    but still important, SCHED_BATCH tasks will run in favor of the very
    low priority SCHED_IDLE tasks.
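
    (ed: the resulting ordering in check_preempt_wakeup(), sketched:)

        /* Idle tasks are by definition preempted by non-idle tasks. */
        if (unlikely(curr->policy == SCHED_IDLE) &&
            likely(p->policy != SCHED_IDLE))
                goto preempt;

        /*
         * Batch and idle tasks do not preempt non-idle tasks (their
         * preemption is driven by the tick):
         */
        if (unlikely(p->policy != SCHED_NORMAL))
                return;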

    Signed-off-by: Darren Hart
    Signed-off-by: Peter Zijlstra
    Acked-by: Mike Galbraith
    Cc: Richard Purdie
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     

23 Feb, 2011

3 commits


16 Feb, 2011

1 commit

  • sd_idle logic was introduced way back in 2005 (commit 5969fe06),
    as an HT optimization.

    As per the discussion in the thread here:

    lkml - sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1
    https://patchwork.kernel.org/patch/532501/

    The capacity based logic in the load balancer right now handles this
    in a much cleaner way, handling more than 2 SMT siblings etc, and sd_idle
    does not seem to bring any additional benefits. sd_idle logic also has
    some bugs that have a performance impact. Here is the patch that removes
    the sd_idle logic altogether.

    Also, there was a dependency of sched_mc_power_savings == 2 on the sd_idle
    logic.

    Signed-off-by: Venkatesh Pallipadi
    Acked-by: Vaidyanathan Srinivasan
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     

03 Feb, 2011

4 commits

  • Currently only implemented for fair class tasks.

    Add a yield_to_task() method to the fair scheduling class, allowing the
    caller of yield_to() to accelerate another thread in its thread group /
    task group.

    Implemented via a scheduler hint, using cfs_rq->next to encourage the
    target to be selected. We can rely on pick_next_entity to keep things
    fair, so no one can accelerate a thread that has already used its fair
    share of CPU time.

    This also means callers should only call yield_to when they really
    mean it. Calling it too often can result in the scheduler just
    ignoring the hint.
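
    (ed: the fair-class hook, sketched; close to what landed, but treat it as
    illustrative:)

        static bool yield_to_task_fair(struct rq *rq, struct task_struct *p,
                                       bool preempt)
        {
                struct sched_entity *se = &p->se;

                if (!se->on_rq)
                        return false;

                /* Tell the scheduler that we'd really like se to run next. */
                set_next_buddy(se);

                yield_task_fair(rq);

                return true;
        }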

    Signed-off-by: Rik van Riel
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Use the buddy mechanism to implement yield_task_fair. This
    allows us to skip onto the next highest priority se at every
    level in the CFS tree, unless doing so would introduce gross
    unfairness in CPU time distribution.

    We order the buddy selection in pick_next_entity to check
    yield first, then last, then next. We need next to be able
    to override yield, because it is possible for the "next" and
    "yield" task to be different processen in the same sub-tree
    of the CFS tree. When they are, we need to go into that
    sub-tree regardless of the "yield" hint, and pick the correct
    entity once we get to the right level.
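
    (ed: the buddy ordering in pick_next_entity(), sketched; skip is honoured
    first, then last, then next, each gated by wakeup_preempt_entity() so it
    cannot introduce gross unfairness:)

        struct sched_entity *se = __pick_first_entity(cfs_rq);
        struct sched_entity *left = se;

        /* Avoid running the skip buddy if running it would violate fairness. */
        if (cfs_rq->skip == se) {
                struct sched_entity *second = __pick_next_entity(se);
                if (second && wakeup_preempt_entity(second, left) < 1)
                        se = second;
        }

        /* Prefer last buddy, try to return the CPU to a preempted task. */
        if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
                se = cfs_rq->last;

        /* Someone really wants this to run. If it's not unfair, run it. */
        if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
                se = cfs_rq->next;

        clear_buddies(cfs_rq, se);
        return se;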

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • The clear_buddies function does not seem to play well with the concept
    of hierarchical runqueues. In the following tree, task groups are
    represented by 'G', tasks by 'T', next by 'n' and last by 'l'.

            (nl)
           /    \
       G(nl)     G
       /   \      \
    T(l)   T(n)    T

    This situation can arise when a task is woken up T(n), and the previously
    running task T(l) is marked last.

    When clear_buddies is called from either T(l) or T(n), the next and last
    buddies of the group G(nl) will be cleared. This is not the desired
    result, since we would like to be able to find the other type of buddy
    in many cases.

    This is especially a worry when implementing yield_task_fair through the
    buddy system.

    The fix is simple: only clear the buddy type that the task itself
    is indicated to be. As an added bonus, we stop walking up the tree
    when the buddy has already been cleared or pointed elsewhere.
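
    (ed: the per-type clearing, sketched; one walker per buddy kind, stopping
    once the chain is broken:)

        static void __clear_buddies_next(struct sched_entity *se)
        {
                for_each_sched_entity(se) {
                        struct cfs_rq *cfs_rq = cfs_rq_of(se);
                        if (cfs_rq->next == se)
                                cfs_rq->next = NULL;
                        else
                                break;  /* already cleared or pointing elsewhere */
                }
        }

        /* __clear_buddies_last() and __clear_buddies_skip() follow the same
         * pattern; clear_buddies() only calls the ones that actually point
         * at this task's entity. */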

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq.
    Yielding to a task from another cfs_rq may be worthwhile, since
    a process calling yield typically cannot use the CPU right now.

    Therefore, we want to check the per-cpu nr_running, not the
    cgroup local one.
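
    (ed: in yield_task_fair(), sketched:)

        /* Are we the only runnable task on this CPU (across all cgroups)? */
        if (unlikely(rq->nr_running == 1))      /* was: cfs_rq->nr_running */
                return;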

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

26 Jan, 2011

2 commits

  • When a task is taken out of the fair class we must ensure the vruntime
    is properly normalized, because when we put it back in it will be assumed
    to be normalized.

    The case that goes wrong is when changing away from the fair class
    while sleeping. Sleeping tasks have non-normalized vruntime in order
    to make sleeper-fairness work. So treat the switch away from fair as a
    wakeup and preserve the relative vruntime.

    Also update sysrq-n to call the ->switch_{to,from} methods.
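
    (ed: the core of the fix, sketched from memory; treat the switch away like
    a dequeue-at-sleep and keep the vruntime relative:)

        static void switched_from_fair(struct rq *rq, struct task_struct *p)
        {
                struct sched_entity *se = &p->se;
                struct cfs_rq *cfs_rq = cfs_rq_of(se);

                /*
                 * A sleeping task carries an absolute vruntime (for sleeper
                 * fairness); normalize it so the later re-enqueue into the
                 * fair class can treat it as relative again.
                 */
                if (!se->on_rq && p->state != TASK_RUNNING) {
                        place_entity(cfs_rq, se, 0);
                        se->vruntime -= cfs_rq->min_vruntime;
                }
        }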

    Reported-by: Onkalo Samu
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since cfs->{load_stamp,load_last} are zero-initialized, the initial load update
    will consider the delta to be 'since the beginning of time'.

    This results in a lot of pointless divisions to bring this large period to be
    within the sysctl_sched_shares_window.

    Fix this by initializing load_stamp to be 1 at cfs_rq initialization, this
    allows for an initial load_stamp > load_last which then lets standard idle
    truncation proceed.

    We avoid spinning (and slightly improve consistency) by fixing delta to be
    [period - 1] in this path resulting in a slightly more predictable shares ramp.
    (Previously the amount of idle time preserved by the overflow would range between
    [period/2,period-1].)

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner