16 Dec, 2011

1 commit

  • Mike Galbraith reported that this recent commit:

    commit 4dcfe1025b513c2c1da5bf5586adb0e80148f612
    Author: Peter Zijlstra
    Date: Thu Nov 10 13:01:10 2011 +0100

    sched: Avoid SMT siblings in select_idle_sibling() if possible

    stopped selecting an idle SMT sibling when there are no idle
    cores in a single socket system.

    The intent of select_idle_sibling() was to fall back to an idle
    SMT sibling if it fails to identify an idle core. But this
    fallback was not happening on systems where all the scheduler
    domains had the `SD_SHARE_PKG_RESOURCES' flag set.

    Fix it. A slightly bigger patch cleaning up all these gotos etc.
    is queued up for the next release.

    Reported-by: Mike Galbraith
    Reported-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Suresh Siddha
    Link: http://lkml.kernel.org/r/1323978421.1984.244.camel@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
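
    For illustration only, a rough sketch of the selection order described
    above (simplified; not the actual patch -- the real code walks the
    SD_SHARE_PKG_RESOURCES domains): prefer a fully idle core, and only then
    fall back to an idle SMT sibling rather than giving up:

    static int pick_idle_cpu(struct task_struct *p, struct sched_domain *sd,
                             int target)
    {
            int cpu, sibling;

            /* pass 1: a core (all SMT siblings idle) sharing cache with target */
            for_each_cpu_and(cpu, sched_domain_span(sd), tsk_cpus_allowed(p)) {
                    bool core_idle = true;

                    for_each_cpu(sibling, cpu_smt_mask(cpu))
                            if (!idle_cpu(sibling))
                                    core_idle = false;
                    if (core_idle)
                            return cpu;
            }

            /* pass 2: no idle core -- any idle sibling still beats a busy cpu */
            for_each_cpu_and(cpu, sched_domain_span(sd), tsk_cpus_allowed(p))
                    if (idle_cpu(cpu))
                            return cpu;

            return target;
    }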
     

16 Nov, 2011

2 commits

  • In return_cfs_rq_runtime() we want to return bandwidth when there are no
    remaining tasks, not bail out ("return") when that is the case.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111108042736.623812423@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
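
    A minimal sketch of the intended guard (illustrative, not the literal
    diff): bail out while tasks remain, and only hand runtime back once the
    cfs_rq is empty:

    static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
    {
            /* tasks still queued: keep our local runtime, nothing to return */
            if (!cfs_rq->runtime_enabled || cfs_rq->nr_running)
                    return;

            /* no remaining tasks: give unused runtime back to the global pool */
            __return_cfs_rq_runtime(cfs_rq);
    }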
     
  • Keep select_idle_sibling() from picking a sibling thread if there's
    an idle core that shares cache.

    This fixes SMT balancing in the increasingly common case where there's
    a shared cache core available to balance to.

    Tested-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Suresh Siddha
    Link: http://lkml.kernel.org/r/1321350377.1421.55.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Oct, 2011

3 commits

  • This patch is preparatory for the migrate_disable() implementation, but
    stands on its own and provides a cleanup.

    It currently only converts those sites required for task-placement.
    Kosaki-san once mentioned replacing cpus_allowed with a proper
    cpumask_t instead of the NR_CPUS sized array it currently is; that
    would also require something like this.

    Signed-off-by: Peter Zijlstra
    Acked-by: Thomas Gleixner
    Cc: KOSAKI Motohiro
    Link: http://lkml.kernel.org/n/tip-e42skvaddos99psip0vce41o@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
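
    The wrapper itself is tiny; a sketch of the accessor and a typical
    converted call site (exact form may differ from the patch):

    /* accessor so call sites stop dereferencing p->cpus_allowed directly */
    #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

    /* before: cpumask_test_cpu(cpu, &p->cpus_allowed)   */
    /* after:  cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) */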
     
  • rq's idle_at_tick is set to idle/busy during the timer tick
    depending on whether the cpu was idle or not. This will be used later in
    the load balance that will be done in the softirq context (which is a
    process context in -RT kernels).

    For nohz kernels, for the cpu doing nohz idle load balance on behalf of
    all the idle cpus, its rq->idle_at_tick might have a stale value (recorded
    at its last timer tick, presumably while it was busy).

    As the nohz idle load balancing is also being done at the same place
    as the regular load balancing, nohz idle load balancing was bailing out
    when it saw rq's idle_at_tick not set.

    This leads to poor system utilization.

    Rename rq's idle_at_tick to idle_balance and set it when someone requests
    a nohz idle balance on an idle cpu.

    Reported-by: Srivatsa Vaddagiri
    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111003220934.892350549@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
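
    Conceptually (a sketch only, following the field names in the text
    above), the kick path now marks the target rq so the balance softirq
    does not bail out:

    /* when requesting nohz idle balance from an idle cpu's rq */
    rq->nohz_balance_kick = 1;
    rq->idle_balance = 1;   /* formerly idle_at_tick, possibly stale here */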
     
  • The current use of the smp call function to kick the nohz idle balance can
    deadlock in the following scenario.

    1. cpu-A did a generic_exec_single() to cpu-B and after queuing its call single
    data (csd) to the call single queue, cpu-A took a timer interrupt. Actual IPI
    to cpu-B to process the call single queue is not yet sent.

    2. As part of the timer interrupt handler, cpu-A decided to kick cpu-B
    for the idle load balancing (sets cpu-B's rq->nohz_balance_kick to 1)
    and __smp_call_function_single() with nowait will queue the csd to the
    cpu-B's queue. But the generic_exec_single() won't send an IPI to cpu-B
    as the call single queue was not empty.

    3. cpu-A is busy with a lot of interrupts

    4. Meanwhile cpu-B is entering and exiting idle and notices that its
    rq->nohz_balance_kick is set to '1'. So it goes ahead, runs the
    idle load balancer and clears its rq->nohz_balance_kick.

    5. At this point, csd queued as part of the step-2 above is still locked
    and waiting to be serviced on cpu-B.

    6. cpu-A is still busy with interrupt load and now it got another timer
    interrupt and as part of it decided to kick cpu-B for another idle load
    balancing (as it finds cpu-B's rq->nohz_balance_kick cleared in step-4
    above) and does __smp_call_function_single() with the same csd that is
    still locked.

    7. And we get a deadlock waiting for the csd_lock() in the
    __smp_call_function_single().

    The main issue here is that cpu-B can service the idle load balancer kick
    request from cpu-A even without receiving the IPI, and this leads to
    doing multiple __smp_call_function_single() calls on the same csd,
    resulting in deadlock.

    To kick a cpu, the scheduler already has the reschedule vector reserved. Use
    that mechanism (kick_process()) instead of using the generic smp call function
    mechanism to kick off the nohz idle load balancing and avoid the deadlock.

    [ This issue is present from 2.6.35+ kernels, but marking it -stable
    only from v3.0+ as the proposed fix depends on the scheduler_ipi()
    that is introduced recently. ]

    Reported-by: Prarit Bhargava
    Signed-off-by: Suresh Siddha
    Cc: stable@kernel.org # v3.0+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111003220934.834943260@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
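
    A sketch of the replacement mechanism (simplified; the real patch routes
    the kick through scheduler_ipi() on the target cpu):

    static void nohz_balancer_kick(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            if (rq->nohz_balance_kick)
                    return;
            rq->nohz_balance_kick = 1;

            /*
             * Use the reserved reschedule vector instead of the generic
             * smp-call-function machinery, so we can never re-queue a csd
             * that is still locked from a previous kick.
             */
            smp_send_reschedule(cpu);
    }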
     

14 Aug, 2011

14 commits

  • When a local cfs_rq blocks we return the majority of its remaining quota to the
    global bandwidth pool for use by other runqueues.

    We do this only when the quota is current and there is more than
    min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.

    In the case where there are throttled runqueues and we have sufficient
    bandwidth to meter out a slice, a second timer is kicked off to handle this
    delivery, unthrottling where appropriate.

    Using a 'worst case' antagonist which executes on each cpu
    for 1ms before moving onto the next on a fairly large machine:

    no quota generations:

    197.47 ms /cgroup/a/cpuacct.usage
    199.46 ms /cgroup/a/cpuacct.usage
    205.46 ms /cgroup/a/cpuacct.usage
    198.46 ms /cgroup/a/cpuacct.usage
    208.39 ms /cgroup/a/cpuacct.usage

    Since we are allowed to use "stale" quota our usage is effectively bounded by
    the rate of input into the global pool and performance is relatively stable.

    with quota generations [1s increments]:

    119.58 ms /cgroup/a/cpuacct.usage
    119.65 ms /cgroup/a/cpuacct.usage
    119.64 ms /cgroup/a/cpuacct.usage
    119.63 ms /cgroup/a/cpuacct.usage
    119.60 ms /cgroup/a/cpuacct.usage

    The large deficit here is due to quota generations (/intentionally/) preventing
    us from using previously stranded slack quota. The cost is that this quota
    becomes unavailable.

    with quota generations and quota return:

    200.09 ms /cgroup/a/cpuacct.usage
    200.09 ms /cgroup/a/cpuacct.usage
    198.09 ms /cgroup/a/cpuacct.usage
    200.09 ms /cgroup/a/cpuacct.usage
    200.06 ms /cgroup/a/cpuacct.usage

    By returning unused quota we're able to both stably consume our desired quota
    and prevent unintentional overages due to the abuse of slack quota from
    previous quota periods (especially on a large machine).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.306848658@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
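
    A rough sketch of the slack return described above (names follow the
    text; locking and corner cases simplified): on dequeue of the last task,
    anything beyond min_cfs_rq_quota goes back to the global pool, provided
    the local runtime has not already expired:

    static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
    {
            struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
            s64 slack = cfs_rq->runtime_remaining - min_cfs_rq_quota;

            /* keep a little local runtime; only return current (unexpired) quota */
            if (slack <= 0 || cfs_rq->runtime_expires != cfs_b->runtime_expires)
                    return;

            raw_spin_lock(&cfs_b->lock);
            cfs_b->runtime += slack;
            /* runqueues are throttled and a slice is available: arm the slack timer */
            if (!list_empty(&cfs_b->throttled_cfs_rq) &&
                cfs_b->runtime > sched_cfs_bandwidth_slice())
                    start_cfs_slack_bandwidth(cfs_b);
            raw_spin_unlock(&cfs_b->lock);

            cfs_rq->runtime_remaining -= slack;
    }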
     
  • This change introduces statistics exports for the cpu sub-system; these are
    added through the use of a stat file similar to that exported by other
    subsystems.

    The following exports are included:

    nr_periods: number of periods in which execution occurred
    nr_throttled: the number of periods above in which execution was throttled
    throttled_time: cumulative wall-time that any cpus have been throttled for
    this group

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.198901931@google.com
    Signed-off-by: Ingo Molnar

    Nikhil Rao
     
  • With the machinery in place to throttle and unthrottle entities, as well as
    handle their participation (or lack thereof), we can now enable throttling.

    There are 2 points at which we must check whether it's time to set the
    throttled state: put_prev_entity() and enqueue_entity().

    - put_prev_entity() is the typical throttle path, we reach it by exceeding our
    allocated run-time within update_curr()->account_cfs_rq_runtime() and going
    through a reschedule.

    - enqueue_entity() covers the case of a wake-up into an already throttled
    group. In this case we know the group cannot be on_rq and can throttle
    immediately. Checks are added at the time of put_prev_entity() and
    enqueue_entity().

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.091415417@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
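
    Schematically (a sketch, not the diff), both hook points funnel into a
    check of this shape -- put_prev_entity() calls it after update_curr()
    has charged the runtime, and enqueue_entity() performs the equivalent
    test for wakeups into an already-throttled group:

    static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
    {
            if (!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0)
                    return;                 /* still within quota */
            if (cfs_rq_throttled(cfs_rq))
                    return;                 /* already throttled */
            throttle_cfs_rq(cfs_rq);
    }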
     
  • Buddies allow us to select "on-rq" entities without actually selecting them
    from a cfs_rq's rb_tree. As a result we must ensure that throttled entities
    are not falsely nominated as buddies. The fact that entities are dequeued
    within throttle_entity is not sufficient for clearing buddy status as the
    nomination may occur after throttling.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.886850167@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • From the perspective of load-balance and shares distribution, throttled
    entities should be invisible.

    However, both of these operations work on 'active' lists and are not
    inherently aware of what group hierarchies may be present. In some cases this
    may be side-stepped (e.g. we could sideload via tg_load_down in load balance)
    while in others (e.g. update_shares()) it is more difficult to compute without
    incurring some O(n^2) costs.

    Instead, track hierarchical throttled state at the time of transition. This
    allows us to easily identify whether an entity belongs to a throttled hierarchy
    and avoid incorrect interactions with it.

    Also, when an entity leaves a throttled hierarchy we need to advance its
    time averaging for shares averaging so that the elapsed throttled time is not
    considered as part of the cfs_rq's operation.

    We also use this information to prevent buddy interactions in the wakeup and
    yield_to() paths.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
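
    A sketch of the hierarchical bookkeeping (simplified): each cfs_rq
    carries a count of throttled ancestors, updated only at throttle and
    unthrottle transitions, so membership in a throttled hierarchy becomes
    an O(1) test:

    /* non-zero => this cfs_rq, or some ancestor of it, is throttled */
    static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
    {
            return cfs_rq->throttle_count > 0;
    }

    /* applied to every group below the one being throttled */
    static int tg_throttle_down(struct task_group *tg, void *data)
    {
            struct rq *rq = data;

            tg->cfs_rq[cpu_of(rq)]->throttle_count++;
            return 0;
    }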
     
  • At the start of each period we refresh the global bandwidth pool. At this time
    we must also unthrottle any cfs_rq entities who are now within bandwidth once
    more (as quota permits).

    Unthrottled entities have their corresponding cfs_rq->throttled flag cleared
    and their entities re-enqueued.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Now that consumption is tracked (via update_curr()) we add support to throttle
    group entities (and their corresponding cfs_rqs) in the case where there is no
    run-time remaining.

    Throttled entities are dequeued to prevent scheduling; additionally, we mark
    them as throttled (using cfs_rq->throttled) to prevent them from becoming
    re-enqueued until they are unthrottled. A list of a task_group's throttled
    entities is maintained on the cfs_bandwidth structure.

    Note: While the machinery for throttling is added in this patch the act of
    throttling an entity exceeding its bandwidth is deferred until later within
    the series.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.480608533@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
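
    In outline (a sketch; accounting details elided), throttling dequeues
    the owning entity from its parents and records the cfs_rq for later
    unthrottling:

    static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
    {
            struct rq *rq = rq_of(cfs_rq);
            struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
            struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];

            /* dequeue up the hierarchy until a parent stays runnable */
            for_each_sched_entity(se) {
                    struct cfs_rq *qcfs_rq = cfs_rq_of(se);

                    dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
                    if (qcfs_rq->load.weight)
                            break;
            }

            cfs_rq->throttled = 1;
            raw_spin_lock(&cfs_b->lock);
            list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
            raw_spin_unlock(&cfs_b->lock);
    }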
     
  • Since quota is managed using a global state but consumed on a per-cpu basis
    we need to ensure that our per-cpu state is appropriately synchronized.
    Most importantly, runtime that is stale (from a previous period) should not be
    locally consumable.

    We take advantage of the existing sched_clock synchronization around the
    jiffy to efficiently detect whether we have (globally) crossed a quota
    boundary above.

    One catch is that the direction of spread on sched_clock is undefined;
    specifically, we don't know whether our local clock is behind or ahead
    of the one responsible for the current expiration time.

    Fortunately we can differentiate these by considering whether the
    global deadline has advanced. If it has not, then we assume our clock to be
    "fast" and advance our local expiration; otherwise, we know the deadline has
    truly passed and we expire our local runtime.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.379275352@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
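
    The direction check reads roughly like this (a sketch; compare with
    expire_cfs_rq_runtime() in the series itself):

    static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
    {
            struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
            struct rq *rq = rq_of(cfs_rq);

            /* local deadline not yet reached by our clock: nothing to do */
            if ((s64)(rq->clock - cfs_rq->runtime_expires) < 0)
                    return;

            if (cfs_rq->runtime_expires == cfs_b->runtime_expires) {
                    /*
                     * The global deadline has not advanced: our clock is
                     * merely "fast", so extend the local expiration.
                     */
                    cfs_rq->runtime_expires += TICK_NSEC;
            } else {
                    /* the quota period really rolled over: expire locally */
                    cfs_rq->runtime_remaining = 0;
            }
    }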
     
  • This patch adds a per-task_group timer which handles the refresh of the global
    CFS bandwidth pool.

    Since the RT pool is using a similar timer there's some small refactoring to
    share this support.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.277271273@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Account bandwidth usage on the cfs_rq level versus the task_groups to which
    they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained
    under cfs_rq->runtime_enabled.

    cfs_rq's which belong to a bandwidth constrained task_group have their runtime
    accounted via the update_curr() path, which withdraws bandwidth from the global
    pool as desired. Updates involving the global pool are currently protected
    under cfs_bandwidth->lock, local runtime is protected by rq->lock.

    This patch only assigns and tracks quota, no action is taken in the case that
    cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
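
    In outline (a sketch), the update_curr() path charges the locally held
    runtime and pulls more from the global pool when it runs dry:

    static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
    {
            if (!cfs_rq->runtime_enabled)
                    return;

            cfs_rq->runtime_remaining -= delta_exec;
            if (cfs_rq->runtime_remaining > 0)
                    return;

            /* refill a slice from the global pool, under cfs_bandwidth->lock */
            assign_cfs_rq_runtime(cfs_rq);
    }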
     
  • In this patch we introduce the notion of CFS bandwidth, partitioned into
    globally unassigned bandwidth, and locally claimed bandwidth.

    - The global bandwidth is per task_group, it represents a pool of unclaimed
    bandwidth that cfs_rqs can allocate from.
    - The local bandwidth is tracked per-cfs_rq; this represents allotments from
    the global pool, assigned to a specific cpu.

    Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
    - cpu.cfs_period_us : the bandwidth period in usecs
    - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
    to consume over the period above.

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
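
    As a worked example of the two knobs (values illustrative, not from the
    commit): cpu.cfs_period_us = 100000 with cpu.cfs_quota_us = 50000 caps
    the group at 50ms of cpu time per 100ms period, i.e. half of one cpu;
    a quota of 200000 over the same period allows up to two cpus' worth of
    runtime, and a quota of -1 means no limit.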
     
  • Introduce hierarchical task accounting for the group scheduling case in CFS, as
    well as promoting the responsibility for maintaining rq->nr_running to the
    scheduling classes.

    The primary motivation for this is that with scheduling classes supporting
    bandwidth throttling it is possible for entities participating in throttled
    sub-trees to not have root visible changes in rq->nr_running across activate
    and de-activate operations. This in turn leads to incorrect idle and
    weight-per-task load balance decisions.

    This also allows us to make a small fixlet to the fastpath in pick_next_task()
    under group scheduling.

    Note: this issue also exists with the existing sched_rt throttling mechanism.
    This patch does not address that.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Checking for the validity of sd is removed, since it is already
    checked by the for_each_domain macro.

    Signed-off-by: Hillf Danton
    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/BANLkTimT+Tut-3TshCDm-NiLLXrOznibNA@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Hillf Danton
     
  • Remove the WAKEUP_PREEMPT feature; disabling it doesn't make any sense,
    and it has outlived its usefulness by a long, long while.

    Signed-off-by: Yong Zhang
    Acked-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110729082033.GB12106@zhy
    Signed-off-by: Ingo Molnar

    Yong Zhang
     

22 Jul, 2011

7 commits

  • No need to define a new "cfs_rq" variable in the "for" block.
    Just use the one at the top of the function.

    Signed-off-by: Lin Ming
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311297271.3938.1352.camel@minggr.sh.intel.com
    Signed-off-by: Ingo Molnar

    Lin Ming
     
  • "entity_key()" is only used in "__enqueue_entity()" and
    its only function is to subtract a tasks vruntime by
    its groups minvruntime.
    Before this patch a rbtree enqueue-decision is done by
    comparing two tasks in the style:

    "if (entity_key(cfs_rq, se) < entity_key(cfs_rq, entry))"

    which would be

    "if (se->vruntime-cfs_rq->min_vruntime < entry->vruntime-cfs_rq->min_vruntime)"

    or (after cancelling cfs_rq->min_vruntime out of both sides)

    "if (se->vruntime < entry->vruntime)"

    which is

    "if (entity_before(se, entry))"

    So we do not need "entity_key()".
    If "entity_before()" is inline we will also save one subtraction (only one,
    because "entity_key(cfs_rq, se)" was cached in "key")

    Signed-off-by: Stephan Baerwolf
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-ns12mnd2h5w8rb9agd8hnsfk@git.kernel.org
    Signed-off-by: Ingo Molnar

    Stephan Baerwolf
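
    For reference, the comparison helper that remains is essentially
    (sketch of the mainline definition):

    static inline int entity_before(struct sched_entity *a,
                                    struct sched_entity *b)
    {
            return (s64)(a->vruntime - b->vruntime) < 0;
    }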
     
  • The last reference to cpu_cfs_rq() was removed with commit 88ec22d3
    ("sched: Remove the cfs_rq dependency from set_task_cpu()"). Thus,
    remove this function, too.

    Signed-off-by: Jan Schoenherr
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1310580816-10861-3-git-send-email-schnhrr@cs.tu-berlin.de
    Signed-off-by: Ingo Molnar

    Jan Schoenherr
     
  • Use for_each_leaf_cfs_rq() instead of list_for_each_entry_rcu(), this
    achieves that load_balance_fair() only iterates those task_groups that
    actually have tasks on busiest, and that we iterate bottom-up, trying to
    move light groups before the heavier ones.

    No idea if it will actually work out to be beneficial in practice; does
    anybody have a cgroup workload that might show a difference one way or
    the other?

    [ Also move update_h_load to sched_fair.c, losing the #ifdef-ery ]

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Paul Turner
    Link: http://lkml.kernel.org/r/1310557009.2586.28.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity
    with additional weight. However, we perform a double shares update on this
    entity as we continue the shares update traversal from this point, despite
    dequeue_entity() having already updated its queuing cfs_rq.
    Avoid this by starting from the parent when we resume.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110707053059.797714697@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • While looking at check_preempt_wakeup() I realized that we are
    potentially updating the wrong entity in the fair-group scheduling
    case. In this case the current task's cfs_rq may not be the same as
    the one used for the comparison between the waking task and the
    existing task's vruntime.

    This potentially results in us using a stale vruntime in the
    pre-emption decision, providing a small false preference for the
    previous task. The effects of this are bounded since we always
    perform a hierarchical update on the tick.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/CAPM31R+2Ke2urUZKao5W92_LupdR4AYEv-EZWiJ3tG=tEes2cw@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Merge reason: pick up the latest scheduler fixes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

21 Jul, 2011

1 commit

  • In order to prepare for non-unique sched_groups per domain, we need to
    carry the cpu_power elsewhere, so put a level of indirection in.

    Reported-and-tested-by: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
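
    A sketch of the added indirection (field names may differ slightly from
    the patch):

    struct sched_group_power {
            atomic_t ref;           /* shared by all groups spanning these cpus */
            unsigned int power;     /* was sched_group::cpu_power */
    };

    struct sched_group {
            struct sched_group *next;
            struct sched_group_power *sgp; /* the indirection */
            unsigned long cpumask[0];
    };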
     

01 Jul, 2011

1 commit

  • wake_affine() is only called from one path: select_task_rq_fair(),
    which already has the RCU read lock held.

    Signed-off-by: Nikunj A. Dadhania
    Signed-off-by: Peter Zijlstra
    Cc: Paul E. McKenney
    Link: http://lkml.kernel.org/r/20110607101251.777.34547.stgit@IBM-009124035060.in.ibm.com
    Signed-off-by: Ingo Molnar

    Nikunj A. Dadhania
     

28 May, 2011

1 commit

  • Dima Zavin reported:

    "After pulling the thread off the run-queue during a cgroup change,
    the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
    then gets normalized to this new value. This can then lead to the thread
    getting an unfair boost in the new group if the vruntime of the next
    task in the old run-queue was way further ahead."

    Reported-by: Dima Zavin
    Signed-off-by: John Stultz
    Recalls-having-tested-once-upon-a-time-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 May, 2011

1 commit

  • SCHED_LOAD_SCALE is used to increase nice resolution and to
    scale cpu_power calculations in the scheduler. This patch
    introduces SCHED_POWER_SCALE and converts all uses of
    SCHED_LOAD_SCALE for scaling cpu_power to use SCHED_POWER_SCALE
    instead.

    This is a preparatory patch for increasing the resolution of
    SCHED_LOAD_SCALE, and there is no need to increase resolution
    for cpu_power calculations.

    Signed-off-by: Nikhil Rao
    Acked-by: Peter Zijlstra
    Cc: Nikunj A. Dadhania
    Cc: Srivatsa Vaddagiri
    Cc: Stephan Barwolf
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1305738580-9924-3-git-send-email-ncrao@google.com
    Signed-off-by: Ingo Molnar

    Nikhil Rao
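
    The new constant mirrors the old value (a sketch; 1024, i.e. 2^10, as
    before), so cpu_power arithmetic is unchanged apart from the name:

    #define SCHED_POWER_SHIFT       10
    #define SCHED_POWER_SCALE       (1L << SCHED_POWER_SHIFT)

    /* e.g. a group's capacity expressed in whole tasks: */
    capacity = DIV_ROUND_CLOSEST(group->cpu_power, SCHED_POWER_SCALE);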
     

19 Apr, 2011

2 commits

  • When a task in a taskgroup sleeps, pick_next_task starts all the way back at
    the root and picks the task/taskgroup with the min vruntime across all
    runnable tasks.

    But when there are many frequently sleeping tasks across different taskgroups,
    it makes better sense to stay with the same taskgroup for its slice period (or
    until all tasks in the taskgroup sleep) instead of switching across taskgroups
    on each sleep after a short runtime.

    This helps specifically where a taskgroup corresponds to a process with
    multiple threads. The change reduces the number of CR3 switches in this case.

    Example:

    Two taskgroups with 2 threads each which are running for 2ms and
    sleeping for 1ms. Looking at sched:sched_switch shows:

    BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
    cpu-soaker-5004 [003] 3683.391089
    cpu-soaker-5016 [003] 3683.393106
    cpu-soaker-5005 [003] 3683.395119
    cpu-soaker-5017 [003] 3683.397130
    cpu-soaker-5004 [003] 3683.399143
    cpu-soaker-5016 [003] 3683.401155
    cpu-soaker-5005 [003] 3683.403168
    cpu-soaker-5017 [003] 3683.405170

    AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
    cpu-soaker-21890 [003] 865.895494
    cpu-soaker-21935 [003] 865.897506
    cpu-soaker-21934 [003] 865.899520
    cpu-soaker-21935 [003] 865.901532
    cpu-soaker-21934 [003] 865.903543
    cpu-soaker-21935 [003] 865.905546
    cpu-soaker-21891 [003] 865.907548
    cpu-soaker-21890 [003] 865.909560
    cpu-soaker-21891 [003] 865.911571
    cpu-soaker-21890 [003] 865.913582
    cpu-soaker-21891 [003] 865.915594
    cpu-soaker-21934 [003] 865.917606

    A similar problem exists when there are multiple taskgroups and, say, a task A
    preempts the currently running task B of taskgroup_1. On schedule, pick_next_task
    can pick an unrelated task from taskgroup_2. Here it would be better to give some
    preference to task B in pick_next_task.

    A simple (maybe extreme) benchmark I tried was tbench with 2 tbench
    client processes with 2 threads each, running on a single CPU. Avg throughput
    across 5 runs of 50 sec each was:

    BEFORE: 105.84 MB/sec
    AFTER: 112.42 MB/sec

    Signed-off-by: Venkatesh Pallipadi
    Acked-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1302802253-25760-1-git-send-email-venki@google.com
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
  • Make set_*_buddy() work on non-task sched_entity, to facilitate the
    use of next_buddy to cache a group entity in cases where one of the
    tasks within that entity sleeps or gets preempted.

    set_skip_buddy() was incorrectly checking that the policy of the yielding
    task is not equal to SCHED_IDLE. Yielding should happen even
    when the yielding task is SCHED_IDLE. This change removes the policy check
    on the yielding task.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1302744070-30079-2-git-send-email-venki@google.com
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     

14 Apr, 2011

2 commits

  • In order to avoid reading partially updated min_vruntime values on 32bit,
    implement a seqcount-like solution.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.111378493@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
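
    The scheme is essentially a hand-rolled seqcount (sketch): the writer,
    under rq->lock, publishes a copy after a write barrier; a lockless
    32-bit reader retries until both halves agree:

    /* writer side (64-bit stores are not atomic on 32-bit) */
    cfs_rq->min_vruntime = vruntime;
    #ifndef CONFIG_64BIT
    smp_wmb();
    cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
    #endif

    /* lockless reader, e.g. a remote wakeup */
    #ifndef CONFIG_64BIT
    u64 min_vruntime, min_vruntime_copy;

    do {
            min_vruntime_copy = cfs_rq->min_vruntime_copy;
            smp_rmb();
            min_vruntime = cfs_rq->min_vruntime;
    } while (min_vruntime != min_vruntime_copy);
    #else
    u64 min_vruntime = cfs_rq->min_vruntime;
    #endif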
     
  • In preparation of calling this without rq->lock held, remove the
    dependency on the rq argument.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.071474242@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra