16 Dec, 2011

1 commit

  • Mike Galbraith reported that this recent commit:

    commit 4dcfe1025b513c2c1da5bf5586adb0e80148f612
    Author: Peter Zijlstra
    Date: Thu Nov 10 13:01:10 2011 +0100

    sched: Avoid SMT siblings in select_idle_sibling() if possible

    stopped selecting an idle SMT sibling when there are no idle
    cores in a single socket system.

    The intent of select_idle_sibling() was to fall back to an idle
    SMT sibling if it fails to identify an idle core. But this
    fallback was not happening on systems where all the scheduler
    domains had the `SD_SHARE_PKG_RESOURCES' flag set.

    Fix it. A slightly bigger patch cleaning up all these gotos etc.
    is queued up for the next release.

    Reported-by: Mike Galbraith
    Reported-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Suresh Siddha
    Link: http://lkml.kernel.org/r/1323978421.1984.244.camel@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
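
    For illustration only, a rough sketch of the selection order described
    above (simplified; not the actual patch -- the real code walks the
    SD_SHARE_PKG_RESOURCES domains): prefer a fully idle core, and only then
    fall back to an idle SMT sibling rather than giving up:

    static int pick_idle_cpu(struct task_struct *p, struct sched_domain *sd,
                             int target)
    {
            int cpu, sibling;

            /* pass 1: a core (all SMT siblings idle) sharing cache with target */
            for_each_cpu_and(cpu, sched_domain_span(sd), tsk_cpus_allowed(p)) {
                    bool core_idle = true;

                    for_each_cpu(sibling, cpu_smt_mask(cpu))
                            if (!idle_cpu(sibling))
                                    core_idle = false;
                    if (core_idle)
                            return cpu;
            }

            /* pass 2: no idle core -- any idle sibling still beats a busy cpu */
            for_each_cpu_and(cpu, sched_domain_span(sd), tsk_cpus_allowed(p))
                    if (idle_cpu(cpu))
                            return cpu;

            return target;
    }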
     

16 Nov, 2011

2 commits

  • In return_cfs_rq_runtime() we want to return bandwidth when there are no
    remaining tasks, not bail out ("return") when that is the case.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111108042736.623812423@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
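
    A minimal sketch of the intended guard (illustrative, not the literal
    diff): bail out while tasks remain, and only hand runtime back once the
    cfs_rq is empty:

    static void return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
    {
            /* tasks still queued: keep our local runtime, nothing to return */
            if (!cfs_rq->runtime_enabled || cfs_rq->nr_running)
                    return;

            /* no remaining tasks: give unused runtime back to the global pool */
            __return_cfs_rq_runtime(cfs_rq);
    }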
     
  • Keep select_idle_sibling() from picking a sibling thread if there's
    an idle core that shares cache.

    This fixes SMT balancing in the increasingly common case where there's
    a shared cache core available to balance to.

    Tested-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Suresh Siddha
    Link: http://lkml.kernel.org/r/1321350377.1421.55.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 Oct, 2011

3 commits

  • This patch is preparatory for the migrate_disable() implementation, but
    stands on its own and provides a cleanup.

    It currently only converts those sites required for task-placement.
    Kosaki-san once mentioned replacing cpus_allowed with a proper
    cpumask_t instead of the NR_CPUS sized array it currently is; that
    would also require something like this.

    Signed-off-by: Peter Zijlstra
    Acked-by: Thomas Gleixner
    Cc: KOSAKI Motohiro
    Link: http://lkml.kernel.org/n/tip-e42skvaddos99psip0vce41o@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
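
    The wrapper itself is tiny; a sketch of the accessor and a typical
    converted call site (exact form may differ from the patch):

    /* accessor so call sites stop dereferencing p->cpus_allowed directly */
    #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

    /* before: cpumask_test_cpu(cpu, &p->cpus_allowed)   */
    /* after:  cpumask_test_cpu(cpu, tsk_cpus_allowed(p)) */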
     
  • rq's idle_at_tick is set to idle/busy during the timer tick
    depending on whether the cpu was idle or not. This will be used later in
    the load balance that will be done in the softirq context (which is a
    process context in -RT kernels).

    For nohz kernels, for the cpu doing nohz idle load balance on behalf of
    all the idle cpus, its rq->idle_at_tick might have a stale value (recorded
    at its last timer tick, presumably while it was busy).

    As the nohz idle load balancing is also being done at the same place
    as the regular load balancing, nohz idle load balancing was bailing out
    when it saw rq's idle_at_tick not set.

    This leads to poor system utilization.

    Rename rq's idle_at_tick to idle_balance and set it when someone requests
    a nohz idle balance on an idle cpu.

    Reported-by: Srivatsa Vaddagiri
    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111003220934.892350549@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
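
    Conceptually (a sketch only, following the field names in the text
    above), the kick path now marks the target rq so the balance softirq
    does not bail out:

    /* when requesting nohz idle balance from an idle cpu's rq */
    rq->nohz_balance_kick = 1;
    rq->idle_balance = 1;   /* formerly idle_at_tick, possibly stale here */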
     
  • The current use of the smp call function to kick the nohz idle balance can
    deadlock in the following scenario.

    1. cpu-A did a generic_exec_single() to cpu-B and after queuing its call single
    data (csd) to the call single queue, cpu-A took a timer interrupt. Actual IPI
    to cpu-B to process the call single queue is not yet sent.

    2. As part of the timer interrupt handler, cpu-A decided to kick cpu-B
    for the idle load balancing (sets cpu-B's rq->nohz_balance_kick to 1)
    and __smp_call_function_single() with nowait will queue the csd to the
    cpu-B's queue. But the generic_exec_single() won't send an IPI to cpu-B
    as the call single queue was not empty.

    3. cpu-A is busy with a lot of interrupts

    4. Meanwhile cpu-B is entering and exiting idle and notices that its
    rq->nohz_balance_kick is set to '1'. So it goes ahead, runs the
    idle load balancer and clears its rq->nohz_balance_kick.

    5. At this point, csd queued as part of the step-2 above is still locked
    and waiting to be serviced on cpu-B.

    6. cpu-A is still busy with interrupt load and now it got another timer
    interrupt and as part of it decided to kick cpu-B for another idle load
    balancing (as it finds cpu-B's rq->nohz_balance_kick cleared in step-4
    above) and does __smp_call_function_single() with the same csd that is
    still locked.

    7. And we get a deadlock waiting for the csd_lock() in the
    __smp_call_function_single().

    The main issue here is that cpu-B can service the idle load balancer kick
    request from cpu-A even without receiving the IPI, and this leads to
    doing multiple __smp_call_function_single() calls on the same csd,
    resulting in deadlock.

    To kick a cpu, the scheduler already has the reschedule vector reserved. Use
    that mechanism (kick_process()) instead of using the generic smp call function
    mechanism to kick off the nohz idle load balancing and avoid the deadlock.

    [ This issue is present from 2.6.35+ kernels, but marking it -stable
    only from v3.0+ as the proposed fix depends on the scheduler_ipi()
    that is introduced recently. ]

    Reported-by: Prarit Bhargava
    Signed-off-by: Suresh Siddha
    Cc: stable@kernel.org # v3.0+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20111003220934.834943260@sbsiddha-desk.sc.intel.com
    Signed-off-by: Ingo Molnar

    Suresh Siddha
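
    A sketch of the replacement mechanism (simplified; the real patch routes
    the kick through scheduler_ipi() on the target cpu):

    static void nohz_balancer_kick(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);

            if (rq->nohz_balance_kick)
                    return;
            rq->nohz_balance_kick = 1;

            /*
             * Use the reserved reschedule vector instead of the generic
             * smp-call-function machinery, so we can never re-queue a csd
             * that is still locked from a previous kick.
             */
            smp_send_reschedule(cpu);
    }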
     

14 Aug, 2011

14 commits

  • When a local cfs_rq blocks we return the majority of its remaining quota to the
    global bandwidth pool for use by other runqueues.

    We do this only when the quota is current and there is more than
    min_cfs_rq_quota [1ms by default] of runtime remaining on the rq.

    In the case where there are throttled runqueues and we have sufficient
    bandwidth to meter out a slice, a second timer is kicked off to handle this
    delivery, unthrottling where appropriate.

    Using a 'worst case' antagonist which executes on each cpu
    for 1ms before moving onto the next on a fairly large machine:

    no quota generations:

    197.47 ms /cgroup/a/cpuacct.usage
    199.46 ms /cgroup/a/cpuacct.usage
    205.46 ms /cgroup/a/cpuacct.usage
    198.46 ms /cgroup/a/cpuacct.usage
    208.39 ms /cgroup/a/cpuacct.usage

    Since we are allowed to use "stale" quota our usage is effectively bounded by
    the rate of input into the global pool and performance is relatively stable.

    with quota generations [1s increments]:

    119.58 ms /cgroup/a/cpuacct.usage
    119.65 ms /cgroup/a/cpuacct.usage
    119.64 ms /cgroup/a/cpuacct.usage
    119.63 ms /cgroup/a/cpuacct.usage
    119.60 ms /cgroup/a/cpuacct.usage

    The large deficit here is due to quota generations (/intentionally/) preventing
    us from using previously stranded slack quota. The cost is that this quota
    becomes unavailable.

    with quota generations and quota return:

    200.09 ms /cgroup/a/cpuacct.usage
    200.09 ms /cgroup/a/cpuacct.usage
    198.09 ms /cgroup/a/cpuacct.usage
    200.09 ms /cgroup/a/cpuacct.usage
    200.06 ms /cgroup/a/cpuacct.usage

    By returning unused quota we're able to both stably consume our desired quota
    and prevent unintentional overages due to the abuse of slack quota from
    previous quota periods (especially on a large machine).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.306848658@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
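
    A rough sketch of the slack return described above (names follow the
    text; locking and corner cases simplified): on dequeue of the last task,
    anything beyond min_cfs_rq_quota goes back to the global pool, provided
    the local runtime has not already expired:

    static void __return_cfs_rq_runtime(struct cfs_rq *cfs_rq)
    {
            struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
            s64 slack = cfs_rq->runtime_remaining - min_cfs_rq_quota;

            /* keep a little local runtime; only return current (unexpired) quota */
            if (slack <= 0 || cfs_rq->runtime_expires != cfs_b->runtime_expires)
                    return;

            raw_spin_lock(&cfs_b->lock);
            cfs_b->runtime += slack;
            /* runqueues are throttled and a slice is available: arm the slack timer */
            if (!list_empty(&cfs_b->throttled_cfs_rq) &&
                cfs_b->runtime > sched_cfs_bandwidth_slice())
                    start_cfs_slack_bandwidth(cfs_b);
            raw_spin_unlock(&cfs_b->lock);

            cfs_rq->runtime_remaining -= slack;
    }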
     
  • This change introduces statistics exports for the cpu sub-system; these are
    added through the use of a stat file similar to that exported by other
    subsystems.

    The following exports are included:

    nr_periods: number of periods in which execution occurred
    nr_throttled: the number of periods above in which execution was throttled
    throttled_time: cumulative wall-time that any cpus have been throttled for
    this group

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.198901931@google.com
    Signed-off-by: Ingo Molnar

    Nikhil Rao
     
  • With the machinery in place to throttle and unthrottle entities, as well as
    handle their participation (or lack thereof), we can now enable throttling.

    There are 2 points at which we must check whether it's time to set the
    throttled state: put_prev_entity() and enqueue_entity().

    - put_prev_entity() is the typical throttle path, we reach it by exceeding our
    allocated run-time within update_curr()->account_cfs_rq_runtime() and going
    through a reschedule.

    - enqueue_entity() covers the case of a wake-up into an already throttled
    group. In this case we know the group cannot be on_rq and can throttle
    immediately. Checks are added at the time of put_prev_entity() and
    enqueue_entity().

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184758.091415417@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
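
    Schematically (a sketch, not the diff), both hook points funnel into a
    check of this shape -- put_prev_entity() calls it after update_curr()
    has charged the runtime, and enqueue_entity() performs the equivalent
    test for wakeups into an already-throttled group:

    static void check_cfs_rq_runtime(struct cfs_rq *cfs_rq)
    {
            if (!cfs_rq->runtime_enabled || cfs_rq->runtime_remaining > 0)
                    return;                 /* still within quota */
            if (cfs_rq_throttled(cfs_rq))
                    return;                 /* already throttled */
            throttle_cfs_rq(cfs_rq);
    }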
     
  • Buddies allow us to select "on-rq" entities without actually selecting them
    from a cfs_rq's rb_tree. As a result we must ensure that throttled entities
    are not falsely nominated as buddies. The fact that entities are dequeued
    within throttle_entity is not sufficient for clearing buddy status as the
    nomination may occur after throttling.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.886850167@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • From the perspective of load-balance and shares distribution, throttled
    entities should be invisible.

    However, both of these operations work on 'active' lists and are not
    inherently aware of what group hierarchies may be present. In some cases this
    may be side-stepped (e.g. we could sideload via tg_load_down in load balance)
    while in others (e.g. update_shares()) it is more difficult to compute without
    incurring some O(n^2) costs.

    Instead, track hierarchical throttled state at the time of transition. This
    allows us to easily identify whether an entity belongs to a throttled hierarchy
    and avoid incorrect interactions with it.

    Also, when an entity leaves a throttled hierarchy we need to advance its
    time averaging for shares averaging so that the elapsed throttled time is not
    considered as part of the cfs_rq's operation.

    We also use this information to prevent buddy interactions in the wakeup and
    yield_to() paths.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.777916795@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
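
    A sketch of the hierarchical bookkeeping (simplified): each cfs_rq
    carries a count of throttled ancestors, updated only at throttle and
    unthrottle transitions, so membership in a throttled hierarchy becomes
    an O(1) test:

    /* non-zero => this cfs_rq, or some ancestor of it, is throttled */
    static inline int throttled_hierarchy(struct cfs_rq *cfs_rq)
    {
            return cfs_rq->throttle_count > 0;
    }

    /* applied to every group below the one being throttled */
    static int tg_throttle_down(struct task_group *tg, void *data)
    {
            struct rq *rq = data;

            tg->cfs_rq[cpu_of(rq)]->throttle_count++;
            return 0;
    }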
     
  • At the start of each period we refresh the global bandwidth pool. At this time
    we must also unthrottle any cfs_rq entities who are now within bandwidth once
    more (as quota permits).

    Unthrottled entities have their corresponding cfs_rq->throttled flag cleared
    and their entities re-enqueued.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.574628950@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Now that consumption is tracked (via update_curr()) we add support to throttle
    group entities (and their corresponding cfs_rqs) in the case where there is no
    run-time remaining.

    Throttled entities are dequeued to prevent scheduling; additionally, we mark
    them as throttled (using cfs_rq->throttled) to prevent them from becoming
    re-enqueued until they are unthrottled. A list of a task_group's throttled
    entities is maintained on the cfs_bandwidth structure.

    Note: While the machinery for throttling is added in this patch the act of
    throttling an entity exceeding its bandwidth is deferred until later within
    the series.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.480608533@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
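
    In outline (a sketch; accounting details elided), throttling dequeues
    the owning entity from its parents and records the cfs_rq for later
    unthrottling:

    static void throttle_cfs_rq(struct cfs_rq *cfs_rq)
    {
            struct rq *rq = rq_of(cfs_rq);
            struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
            struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];

            /* dequeue up the hierarchy until a parent stays runnable */
            for_each_sched_entity(se) {
                    struct cfs_rq *qcfs_rq = cfs_rq_of(se);

                    dequeue_entity(qcfs_rq, se, DEQUEUE_SLEEP);
                    if (qcfs_rq->load.weight)
                            break;
            }

            cfs_rq->throttled = 1;
            raw_spin_lock(&cfs_b->lock);
            list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
            raw_spin_unlock(&cfs_b->lock);
    }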
     
  • Since quota is managed using a global state but consumed on a per-cpu basis
    we need to ensure that our per-cpu state is appropriately synchronized.
    Most importantly, runtime that is stale (from a previous period) should not be
    locally consumable.

    We take advantage of the existing sched_clock synchronization around the
    jiffy to efficiently detect whether we have (globally) crossed a quota
    boundary above.

    One catch is that the direction of spread on sched_clock is undefined;
    specifically, we don't know whether our local clock is behind or ahead
    of the one responsible for the current expiration time.

    Fortunately we can differentiate these by considering whether the
    global deadline has advanced. If it has not, then we assume our clock to be
    "fast" and advance our local expiration; otherwise, we know the deadline has
    truly passed and we expire our local runtime.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.379275352@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
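
    The direction check reads roughly like this (a sketch; compare with
    expire_cfs_rq_runtime() in the series itself):

    static void expire_cfs_rq_runtime(struct cfs_rq *cfs_rq)
    {
            struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
            struct rq *rq = rq_of(cfs_rq);

            /* local deadline not yet reached by our clock: nothing to do */
            if ((s64)(rq->clock - cfs_rq->runtime_expires) < 0)
                    return;

            if (cfs_rq->runtime_expires == cfs_b->runtime_expires) {
                    /*
                     * The global deadline has not advanced: our clock is
                     * merely "fast", so extend the local expiration.
                     */
                    cfs_rq->runtime_expires += TICK_NSEC;
            } else {
                    /* the quota period really rolled over: expire locally */
                    cfs_rq->runtime_remaining = 0;
            }
    }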
     
  • This patch adds a per-task_group timer which handles the refresh of the global
    CFS bandwidth pool.

    Since the RT pool is using a similar timer there's some small refactoring to
    share this support.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.277271273@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Account bandwidth usage on the cfs_rq level versus the task_groups to which
    they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained
    under cfs_rq->runtime_enabled.

    cfs_rq's which belong to a bandwidth constrained task_group have their runtime
    accounted via the update_curr() path, which withdraws bandwidth from the global
    pool as desired. Updates involving the global pool are currently protected
    under cfs_bandwidth->lock, local runtime is protected by rq->lock.

    This patch only assigns and tracks quota, no action is taken in the case that
    cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned.

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184757.179386821@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
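
    In outline (a sketch), the update_curr() path charges the locally held
    runtime and pulls more from the global pool when it runs dry:

    static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
    {
            if (!cfs_rq->runtime_enabled)
                    return;

            cfs_rq->runtime_remaining -= delta_exec;
            if (cfs_rq->runtime_remaining > 0)
                    return;

            /* refill a slice from the global pool, under cfs_bandwidth->lock */
            assign_cfs_rq_runtime(cfs_rq);
    }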
     
  • In this patch we introduce the notion of CFS bandwidth, partitioned into
    globally unassigned bandwidth, and locally claimed bandwidth.

    - The global bandwidth is per task_group, it represents a pool of unclaimed
    bandwidth that cfs_rqs can allocate from.
    - The local bandwidth is tracked per-cfs_rq; this represents allotments from
    the global pool, assigned to a specific cpu.

    Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem:
    - cpu.cfs_period_us : the bandwidth period in usecs
    - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed
    to consume over the period above.

    Signed-off-by: Paul Turner
    Signed-off-by: Nikhil Rao
    Signed-off-by: Bharata B Rao
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184756.972636699@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
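
    As a worked example of the two knobs (values illustrative, not from the
    commit): cpu.cfs_period_us = 100000 with cpu.cfs_quota_us = 50000 caps
    the group at 50ms of cpu time per 100ms period, i.e. half of one cpu;
    a quota of 200000 over the same period allows up to two cpus' worth of
    runtime, and a quota of -1 means no limit.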
     
  • Introduce hierarchical task accounting for the group scheduling case in CFS, as
    well as promoting the responsibility for maintaining rq->nr_running to the
    scheduling classes.

    The primary motivation for this is that with scheduling classes supporting
    bandwidth throttling it is possible for entities participating in throttled
    sub-trees to not have root visible changes in rq->nr_running across activate
    and de-activate operations. This in turn leads to incorrect idle and
    weight-per-task load balance decisions.

    This also allows us to make a small fixlet to the fastpath in pick_next_task()
    under group scheduling.

    Note: this issue also exists with the existing sched_rt throttling mechanism.
    This patch does not address that.

    Signed-off-by: Paul Turner
    Reviewed-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Checking for the validity of sd is removed, since it is already
    checked by the for_each_domain macro.

    Signed-off-by: Hillf Danton
    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/BANLkTimT+Tut-3TshCDm-NiLLXrOznibNA@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Hillf Danton
     
  • Remove the WAKEUP_PREEMPT feature; disabling it doesn't make any sense,
    and it has outlived its usefulness by a long, long while.

    Signed-off-by: Yong Zhang
    Acked-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110729082033.GB12106@zhy
    Signed-off-by: Ingo Molnar

    Yong Zhang
     

22 Jul, 2011

7 commits

  • No need to define a new "cfs_rq" variable in the "for" block.
    Just use the one at the top of the function.

    Signed-off-by: Lin Ming
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311297271.3938.1352.camel@minggr.sh.intel.com
    Signed-off-by: Ingo Molnar

    Lin Ming
     
  • "entity_key()" is only used in "__enqueue_entity()" and
    its only function is to subtract a tasks vruntime by
    its groups minvruntime.
    Before this patch a rbtree enqueue-decision is done by
    comparing two tasks in the style:

    "if (entity_key(cfs_rq, se) < entity_key(cfs_rq, entry))"

    which would be

    "if (se->vruntime-cfs_rq->min_vruntime < entry->vruntime-cfs_rq->min_vruntime)"

    or (after cancelling cfs_rq->min_vruntime out of both sides)

    "if (se->vruntime < entry->vruntime)"

    which is

    "if (entity_before(se, entry))"

    So we do not need "entity_key()".
    If "entity_before()" is inline we will also save one subtraction (only one,
    because "entity_key(cfs_rq, se)" was cached in "key")

    Signed-off-by: Stephan Baerwolf
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-ns12mnd2h5w8rb9agd8hnsfk@git.kernel.org
    Signed-off-by: Ingo Molnar

    Stephan Baerwolf
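
    For reference, the comparison helper that remains is essentially
    (sketch of the mainline definition):

    static inline int entity_before(struct sched_entity *a,
                                    struct sched_entity *b)
    {
            return (s64)(a->vruntime - b->vruntime) < 0;
    }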
     
  • The last reference to cpu_cfs_rq() was removed with commit 88ec22d3
    ("sched: Remove the cfs_rq dependency from set_task_cpu()"). Thus,
    remove this function, too.

    Signed-off-by: Jan Schoenherr
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1310580816-10861-3-git-send-email-schnhrr@cs.tu-berlin.de
    Signed-off-by: Ingo Molnar

    Jan Schoenherr
     
  • Use for_each_leaf_cfs_rq() instead of list_for_each_entry_rcu(), this
    achieves that load_balance_fair() only iterates those task_groups that
    actually have tasks on busiest, and that we iterate bottom-up, trying to
    move light groups before the heavier ones.

    No idea if it will actually work out to be beneficial in practice; does
    anybody have a cgroup workload that might show a difference one way or
    the other?

    [ Also move update_h_load to sched_fair.c, losing the #ifdef-ery ]

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Paul Turner
    Link: http://lkml.kernel.org/r/1310557009.2586.28.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity
    with additional weight. However, we perform a double shares update on this
    entity as we continue the shares update traversal from this point, despite
    dequeue_entity() having already updated its queuing cfs_rq.
    Avoid this by starting from the parent when we resume.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110707053059.797714697@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • While looking at check_preempt_wakeup() I realized that we are
    potentially updating the wrong entity in the fair-group scheduling
    case. In this case the current task's cfs_rq may not be the same as
    the one used for the comparison between the waking task and the
    existing task's vruntime.

    This potentially results in us using a stale vruntime in the
    pre-emption decision, providing a small false preference for the
    previous task. The effects of this are bounded since we always
    perform a hierarchical update on the tick.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/CAPM31R+2Ke2urUZKao5W92_LupdR4AYEv-EZWiJ3tG=tEes2cw@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Merge reason: pick up the latest scheduler fixes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

21 Jul, 2011

1 commit

  • In order to prepare for non-unique sched_groups per domain, we need to
    carry the cpu_power elsewhere, so put a level of indirection in.

    Reported-and-tested-by: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
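
    A sketch of the added indirection (field names may differ slightly from
    the patch):

    struct sched_group_power {
            atomic_t ref;           /* shared by all groups spanning these cpus */
            unsigned int power;     /* was sched_group::cpu_power */
    };

    struct sched_group {
            struct sched_group *next;
            struct sched_group_power *sgp; /* the indirection */
            unsigned long cpumask[0];
    };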
     

01 Jul, 2011

1 commit

  • wake_affine() is only called from one path: select_task_rq_fair(),
    which already has the RCU read lock held.

    Signed-off-by: Nikunj A. Dadhania
    Signed-off-by: Peter Zijlstra
    Cc: Paul E. McKenney
    Link: http://lkml.kernel.org/r/20110607101251.777.34547.stgit@IBM-009124035060.in.ibm.com
    Signed-off-by: Ingo Molnar

    Nikunj A. Dadhania
     

28 May, 2011

1 commit

  • Dima Zavin reported:

    "After pulling the thread off the run-queue during a cgroup change,
    the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
    then gets normalized to this new value. This can then lead to the thread
    getting an unfair boost in the new group if the vruntime of the next
    task in the old run-queue was way further ahead."

    Reported-by: Dima Zavin
    Signed-off-by: John Stultz
    Recalls-having-tested-once-upon-a-time-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 May, 2011

1 commit

  • SCHED_LOAD_SCALE is used to increase nice resolution and to
    scale cpu_power calculations in the scheduler. This patch
    introduces SCHED_POWER_SCALE and converts all uses of
    SCHED_LOAD_SCALE for scaling cpu_power to use SCHED_POWER_SCALE
    instead.

    This is a preparatory patch for increasing the resolution of
    SCHED_LOAD_SCALE, and there is no need to increase resolution
    for cpu_power calculations.

    Signed-off-by: Nikhil Rao
    Acked-by: Peter Zijlstra
    Cc: Nikunj A. Dadhania
    Cc: Srivatsa Vaddagiri
    Cc: Stephan Barwolf
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1305738580-9924-3-git-send-email-ncrao@google.com
    Signed-off-by: Ingo Molnar

    Nikhil Rao
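
    The new constant mirrors the old value (a sketch; 1024, i.e. 2^10, as
    before), so cpu_power arithmetic is unchanged apart from the name:

    #define SCHED_POWER_SHIFT       10
    #define SCHED_POWER_SCALE       (1L << SCHED_POWER_SHIFT)

    /* e.g. a group's capacity expressed in whole tasks: */
    capacity = DIV_ROUND_CLOSEST(group->cpu_power, SCHED_POWER_SCALE);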
     

19 Apr, 2011

2 commits

  • When a task in a taskgroup sleeps, pick_next_task starts all the way back at
    the root and picks the task/taskgroup with the min vruntime across all
    runnable tasks.

    But when there are many frequently sleeping tasks across different taskgroups,
    it makes better sense to stay with the same taskgroup for its slice period (or
    until all tasks in the taskgroup sleep) instead of switching across taskgroups
    on each sleep after a short runtime.

    This helps specifically where a taskgroup corresponds to a process with
    multiple threads. The change reduces the number of CR3 switches in this case.

    Example:

    Two taskgroups with 2 threads each which are running for 2ms and
    sleeping for 1ms. Looking at sched:sched_switch shows:

    BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
    cpu-soaker-5004 [003] 3683.391089
    cpu-soaker-5016 [003] 3683.393106
    cpu-soaker-5005 [003] 3683.395119
    cpu-soaker-5017 [003] 3683.397130
    cpu-soaker-5004 [003] 3683.399143
    cpu-soaker-5016 [003] 3683.401155
    cpu-soaker-5005 [003] 3683.403168
    cpu-soaker-5017 [003] 3683.405170

    AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
    cpu-soaker-21890 [003] 865.895494
    cpu-soaker-21935 [003] 865.897506
    cpu-soaker-21934 [003] 865.899520
    cpu-soaker-21935 [003] 865.901532
    cpu-soaker-21934 [003] 865.903543
    cpu-soaker-21935 [003] 865.905546
    cpu-soaker-21891 [003] 865.907548
    cpu-soaker-21890 [003] 865.909560
    cpu-soaker-21891 [003] 865.911571
    cpu-soaker-21890 [003] 865.913582
    cpu-soaker-21891 [003] 865.915594
    cpu-soaker-21934 [003] 865.917606

    A similar problem exists when there are multiple taskgroups and, say, a task A
    preempts the currently running task B of taskgroup_1. On schedule, pick_next_task
    can pick an unrelated task from taskgroup_2. Here it would be better to give some
    preference to task B in pick_next_task.

    A simple (maybe extreme) benchmark I tried was tbench with 2 tbench
    client processes with 2 threads each, running on a single CPU. Avg throughput
    across 5 runs of 50 sec each was:

    BEFORE: 105.84 MB/sec
    AFTER: 112.42 MB/sec

    Signed-off-by: Venkatesh Pallipadi
    Acked-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1302802253-25760-1-git-send-email-venki@google.com
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
  • Make set_*_buddy() work on non-task sched_entity, to facilitate the
    use of next_buddy to cache a group entity in cases where one of the
    tasks within that entity sleeps or gets preempted.

    set_skip_buddy() was incorrectly checking that the policy of the yielding
    task is not equal to SCHED_IDLE. Yielding should happen even
    when the yielding task is SCHED_IDLE. This change removes the policy check
    on the yielding task.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1302744070-30079-2-git-send-email-venki@google.com
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     

14 Apr, 2011

2 commits

  • In order to avoid reading partially updated min_vruntime values on 32bit,
    implement a seqcount-like solution.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.111378493@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
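
    The scheme is essentially a hand-rolled seqcount (sketch): the writer,
    under rq->lock, publishes a copy after a write barrier; a lockless
    32-bit reader retries until both halves agree:

    /* writer side (64-bit stores are not atomic on 32-bit) */
    cfs_rq->min_vruntime = vruntime;
    #ifndef CONFIG_64BIT
    smp_wmb();
    cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
    #endif

    /* lockless reader, e.g. a remote wakeup */
    #ifndef CONFIG_64BIT
    u64 min_vruntime, min_vruntime_copy;

    do {
            min_vruntime_copy = cfs_rq->min_vruntime_copy;
            smp_rmb();
            min_vruntime = cfs_rq->min_vruntime;
    } while (min_vruntime != min_vruntime_copy);
    #else
    u64 min_vruntime = cfs_rq->min_vruntime;
    #endif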
     
  • In preparation of calling this without rq->lock held, remove the
    dependency on the rq argument.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.071474242@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra