28 May, 2011

1 commit

  • Dima Zavin reported:

    "After pulling the thread off the run-queue during a cgroup change,
    the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
    then gets normalized to this new value. This can then lead to the thread
    getting an unfair boost in the new group if the vruntime of the next
    task in the old run-queue was way further ahead."

    Reported-by: Dima Zavin
    Signed-off-by: John Stultz
    Recalls-having-tested-once-upon-a-time-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 May, 2011

1 commit

  • SCHED_LOAD_SCALE is used to increase nice resolution and to
    scale cpu_power calculations in the scheduler. This patch
    introduces SCHED_POWER_SCALE and converts all uses of
    SCHED_LOAD_SCALE for scaling cpu_power to use SCHED_POWER_SCALE
    instead.

    This is a preparatory patch for increasing the resolution of
    SCHED_LOAD_SCALE, and there is no need to increase resolution
    for cpu_power calculations.

    Signed-off-by: Nikhil Rao
    Acked-by: Peter Zijlstra
    Cc: Nikunj A. Dadhania
    Cc: Srivatsa Vaddagiri
    Cc: Stephan Barwolf
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1305738580-9924-3-git-send-email-ncrao@google.com
    Signed-off-by: Ingo Molnar

    Nikhil Rao
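
    (ed: A rough sketch of what the split looks like, with simplified
    values and a made-up helper rather than the kernel's actual headers:)

      /* Illustrative only: two independent fixed-point scales. */
      #define SCHED_LOAD_SHIFT        10
      #define SCHED_LOAD_SCALE        (1UL << SCHED_LOAD_SHIFT)   /* task-load resolution */

      #define SCHED_POWER_SHIFT       10
      #define SCHED_POWER_SCALE       (1UL << SCHED_POWER_SHIFT)  /* cpu_power scaling */

      /* cpu_power math keeps using SCHED_POWER_SCALE, e.g. a CPU running
       * at 80% of nominal capacity: */
      static inline unsigned long scaled_capacity(unsigned long percent)
      {
              return SCHED_POWER_SCALE * percent / 100;
      }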
     

04 May, 2011

1 commit


19 Apr, 2011

2 commits

  • When a task in a taskgroup sleeps, pick_next_task starts all the way back at
    the root and picks the task/taskgroup with the min vruntime across all
    runnable tasks.

    But when there are many frequently sleeping tasks across different
    taskgroups, it makes better sense to stay with the same taskgroup for
    its slice period (or until all tasks in the taskgroup sleep) instead of
    switching to another taskgroup on each sleep after a short runtime.

    This helps specifically where a taskgroup corresponds to a process with
    multiple threads. The change reduces the number of CR3 switches in this
    case.

    Example:

    Two taskgroups with 2 threads each which are running for 2ms and
    sleeping for 1ms. Looking at sched:sched_switch shows:

    BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
    cpu-soaker-5004 [003] 3683.391089
    cpu-soaker-5016 [003] 3683.393106
    cpu-soaker-5005 [003] 3683.395119
    cpu-soaker-5017 [003] 3683.397130
    cpu-soaker-5004 [003] 3683.399143
    cpu-soaker-5016 [003] 3683.401155
    cpu-soaker-5005 [003] 3683.403168
    cpu-soaker-5017 [003] 3683.405170

    AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
    cpu-soaker-21890 [003] 865.895494
    cpu-soaker-21935 [003] 865.897506
    cpu-soaker-21934 [003] 865.899520
    cpu-soaker-21935 [003] 865.901532
    cpu-soaker-21934 [003] 865.903543
    cpu-soaker-21935 [003] 865.905546
    cpu-soaker-21891 [003] 865.907548
    cpu-soaker-21890 [003] 865.909560
    cpu-soaker-21891 [003] 865.911571
    cpu-soaker-21890 [003] 865.913582
    cpu-soaker-21891 [003] 865.915594
    cpu-soaker-21934 [003] 865.917606

    A similar problem exists when there are multiple taskgroups and, say, a
    task A preempts the currently running task B of taskgroup_1. On the next
    schedule, pick_next_task can pick an unrelated task from taskgroup_2,
    when it would be better to give some preference to task B.

    A simple (maybe extreme) benchmark I tried was tbench with 2 tbench
    client processes of 2 threads each, all running on a single CPU. Average
    throughput across five 50-second runs was:

    BEFORE: 105.84 MB/sec
    AFTER: 112.42 MB/sec

    Signed-off-by: Venkatesh Pallipadi
    Acked-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1302802253-25760-1-git-send-email-venki@google.com
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
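
    (ed: A hypothetical sketch of the buddy idea described above; entity,
    pick_next and max_lag are illustrative names, not the kernel code. The
    cached buddy is preferred over a strict min-vruntime pick as long as it
    is not too far ahead of the minimum:)

      #include <stddef.h>

      /* Hypothetical, simplified entities: just a name and a vruntime. */
      struct entity {
              const char *name;
              unsigned long long vruntime;
      };

      /*
       * Pick the entity with minimum vruntime, but prefer a cached "next
       * buddy" (set when a task in the running group sleeps or is
       * preempted) while it is within max_lag of that minimum, so we keep
       * running the same task group instead of bouncing between groups.
       * Assumes next_buddy, when set, is one of the queued entities.
       */
      static struct entity *pick_next(struct entity **queue, int n,
                                      struct entity *next_buddy,
                                      unsigned long long max_lag)
      {
              struct entity *min = NULL;

              for (int i = 0; i < n; i++)
                      if (min == NULL || queue[i]->vruntime < min->vruntime)
                              min = queue[i];

              if (min && next_buddy &&
                  next_buddy->vruntime - min->vruntime <= max_lag)
                      return next_buddy;  /* honour the hint: fewer group switches */

              return min;                 /* fall back to strict fairness */
      }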
     
  • Make set_*_buddy() work on non-task sched_entity, to facilitate the
    use of next_buddy to cache a group entity in cases where one of the
    tasks within that entity sleeps or gets preempted.

    set_skip_buddy() incorrectly required the policy of the yielding task
    to be different from SCHED_IDLE. Yielding should happen even when the
    yielding task is SCHED_IDLE, so this change removes the policy check on
    the yielding task.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1302744070-30079-2-git-send-email-venki@google.com
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
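
    (ed: A minimal sketch of what "working on non-task entities" amounts to:
    record the hint at every level by walking parent links. Structure
    layouts are simplified stand-ins, not the kernel's definitions:)

      #include <stddef.h>

      struct cfs_rq;

      struct sched_entity {
              struct sched_entity *parent;    /* NULL at the root level */
              struct cfs_rq *cfs_rq;          /* runqueue this entity is queued on */
      };

      struct cfs_rq {
              struct sched_entity *next;      /* "next buddy" hint */
      };

      /* Record the hint at every level so the pick path can follow it down
       * the tree, whether 'se' is a task or a group entity. */
      static void set_next_buddy(struct sched_entity *se)
      {
              for (; se != NULL; se = se->parent)
                      se->cfs_rq->next = se;
      }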
     

18 Apr, 2011

1 commit


14 Apr, 2011

3 commits

    In order to avoid reading partially updated min_vruntime values on
    32-bit, implement a seqcount-like solution.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.111378493@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
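
    (ed: The problem is that a 64-bit min_vruntime cannot be loaded
    atomically on 32-bit, so a reader may observe half an update. Below is
    a userspace sketch of the retry pattern only; the kernel patch uses its
    own copy field and memory barriers rather than these GCC builtins:)

      #include <stdint.h>

      struct min_vruntime_box {
              unsigned int seq;               /* odd while an update is in flight */
              uint64_t min_vruntime;
      };

      static void write_min_vruntime(struct min_vruntime_box *b, uint64_t val)
      {
              b->seq++;                       /* mark update in progress */
              __sync_synchronize();
              b->min_vruntime = val;
              __sync_synchronize();
              b->seq++;                       /* mark update complete */
      }

      static uint64_t read_min_vruntime(struct min_vruntime_box *b)
      {
              unsigned int seq;
              uint64_t val;

              do {                            /* retry on torn or in-flight reads */
                      seq = b->seq;
                      __sync_synchronize();
                      val = b->min_vruntime;
                      __sync_synchronize();
              } while ((seq & 1) || seq != b->seq);

              return val;
      }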
     
  • In preparation of calling this without rq->lock held, remove the
    dependency on the rq argument.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.071474242@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In preparation of calling select_task_rq() without rq->lock held, drop
    the dependency on the rq argument.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.031077745@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

11 Apr, 2011

5 commits

  • Don't use sd->level for identifying properties of the domain.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110407122942.350174079@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Instead of relying on static allocations for the sched_domain and
    sched_group trees, dynamically allocate and RCU free them.

    Allocating this dynamically also allows for some build_sched_groups()
    simplification since we can now (like with other simplifications) rely
    on the sched_domain tree instead of hard-coded knowledge.

    One tricky point to note is that detach_destroy_domains() needs to hold
    rcu_read_lock() over the entire tear-down; doing this per-cpu is not
    sufficient since that can lead to partial sched_group existence (this
    could possibly be solved by doing the tear-down backwards, but holding
    the lock throughout is much more robust).

    A consequence of the above is that we can no longer print the
    sched_domain debug stuff from cpu_attach_domain() since that might now
    run with preemption disabled (due to classic RCU etc.) and
    sched_domain_debug() does some GFP_KERNEL allocations.

    Another thing to note is that we now fully rely on normal RCU and not
    RCU-sched. This is because, with the new and exciting RCU flavours we
    grew over the years, BH doesn't necessarily hold off RCU-sched grace
    periods (-rt is known to break this). This would in fact already cause
    us grief since we do sched_domain/sched_group iterations from softirq
    context.

    This patch is somewhat larger than I would like it to be, but I didn't
    find any means of shrinking/splitting this.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110407122942.245307941@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • calc_delta_fair() checks NICE_0_LOAD already, delete duplicate check.

    Signed-off-by: Shaohua Li
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1302238389.3981.92.camel@sli10-conroe
    Signed-off-by: Ingo Molnar

    Shaohua Li
     
    The scheduler load balancer has specific code to deal with an
    unbalanced system caused by lots of unmovable tasks (for example
    because of hard CPU affinity). In those situations, it excludes the
    busiest CPU that has pinned tasks from load-balance consideration, so
    that it can perform a second load-balance pass on the rest of the
    system.

    This all works as designed if there is only one cgroup in the system.

    However, when we have multiple cgroups, this logic has false positives
    and triggers multiple load-balance passes even though there are
    actually no pinned tasks at all.

    The reason it has false positives is that the all-pinned logic sits
    deep in the lowest-level function, can_migrate_task(), and is too low
    level:

    load_balance_fair() iterates over each task group and calls
    balance_tasks() to migrate the target load. Along the way,
    balance_tasks() also sets an all_pinned variable. Given that task
    groups are iterated, this all_pinned variable essentially reflects the
    status of the last group in the scanning process. A task group may
    migrate no load for a number of reasons, none of them due to CPU
    affinity. Yet this status bit is propagated back up to the higher-level
    load_balance(), which incorrectly thinks that no tasks could be moved.
    It then kicks off the all-pinned logic and starts multiple passes
    attempting to move load onto the pulling CPU.

    To fix this, move the all_pinned aggregation up to the iterator level.
    This ensures that the status is aggregated over all task groups, not
    just the last one in the list.

    Signed-off-by: Ken Chen
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Ken Chen
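
    (ed: Schematically, with hypothetical names rather than kernel
    structures, the fix aggregates the flag across the whole group
    iteration instead of letting the last group overwrite it:)

      #include <stdbool.h>

      struct group_stub {
              bool has_movable_tasks;         /* at least one task is not pinned */
      };

      /* Pretend per-group balance pass: reports whether anything could move. */
      static bool balance_one_group(const struct group_stub *g)
      {
              return g->has_movable_tasks;
      }

      /* Aggregate "everything was pinned" across ALL groups, instead of
       * reporting whatever the last group happened to say. */
      static bool balance_all_groups(const struct group_stub *groups, int n)
      {
              bool all_pinned = true;

              for (int i = 0; i < n; i++)
                      if (balance_one_group(&groups[i]))
                              all_pinned = false;     /* one movable group is enough */

              return all_pinned;
      }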
     
    In find_busiest_group(), the sched-domain avg_load isn't calculated at
    all if there is a group imbalance within the domain. This causes an
    erroneous imbalance calculation.

    The reason is that calculate_imbalance() sees sds->avg_load = 0 and
    dumps the entire sds->max_load into the imbalance variable, which is
    used later on to migrate the entire load from the busiest CPU to the
    pulling CPU.

    This has two really bad effects:

    1. A stampede of task migrations, which cannot break out of the bad
       state because of a positive feedback loop: large load delta ->
       heavier load migration -> larger imbalance, and the cycle goes on.

    2. Severe imbalance in CPU queue depth. This causes really long
       scheduling latency blips, which badly affect applications with
       tight latency requirements.

    The fix is to have the kernel calculate the domain avg_load in both
    cases. This ensures that the imbalance calculation is always sensible
    and the target is usually halfway between the busiest and the pulling
    CPU.

    Signed-off-by: Ken Chen
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
    Signed-off-by: Ingo Molnar

    Ken Chen
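
    (ed: The effect is easiest to see as arithmetic. A toy example, using
    "move only the excess above the domain average" as the imbalance rule;
    the real calculate_imbalance() is more involved:)

      #include <stdio.h>

      static unsigned long imbalance(unsigned long max_load, unsigned long avg_load)
      {
              /* move only the excess above the domain average */
              return max_load > avg_load ? max_load - avg_load : 0;
      }

      int main(void)
      {
              unsigned long busiest = 800, average = 500;

              printf("avg computed: move %lu\n", imbalance(busiest, average)); /* 300 */
              printf("avg left 0:   move %lu\n", imbalance(busiest, 0));       /* 800 */
              return 0;
      }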
     

08 Apr, 2011

2 commits

  • …-linus', 'irq-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86-32, fpu: Fix FPU exception handling on non-SSE systems
    x86, hibernate: Initialize mmu_cr4_features during boot
    x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change
    x86: visws: Fixup irq overhaul fallout

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Clean up rebalance_domains() load-balance interval calculation

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86/mrst/vrtc: Fix boot crash in mrst_rtc_init()
    rtc, x86/mrst/vrtc: Fix boot crash in rtc_read_alarm()

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq: Fix cpumask leak in __setup_irq()

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf probe: Fix listing incorrect line number with inline function
    perf probe: Fix to find recursively inlined function
    perf probe: Fix multiple --vars options behavior
    perf probe: Fix to remove redundant close
    perf probe: Fix to ensure function declared file

    Linus Torvalds
     
  • * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
    Fix common misspellings

    Linus Torvalds
     

05 Apr, 2011

1 commit

    Instead of the possible multiple evaluation of num_online_cpus() in
    rebalance_domains() that Linus reported, avoid it altogether in the
    normal case, since it is implemented as a Hamming-weight (population
    count) over a CPU bitmask, which can be darn expensive on big iron.

    This also makes the code cleaner, smaller and better documented.

    Reported-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
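
    (ed: A sketch of the general pattern, illustrative rather than the
    kernel code: count the online CPUs once, cache the result, and clamp
    each domain's interval against a bound derived from the cached count
    instead of recounting mask bits on every pass:)

      /* Hamming weight over a CPU mask: cheap once, wasteful if redone for
       * every sched domain on every rebalance tick. */
      static unsigned int count_online_cpus(const unsigned long *mask,
                                            unsigned int words)
      {
              unsigned int n = 0;

              for (unsigned int i = 0; i < words; i++)
                      n += (unsigned int)__builtin_popcountl(mask[i]);
              return n;
      }

      /* Clamp a balance interval (in jiffies) against a bound derived from
       * the cached online-CPU count; the count is an input, not recomputed. */
      static unsigned long clamp_interval(unsigned long interval,
                                          unsigned int online_cpus,
                                          unsigned long hz)
      {
              unsigned long max = hz * online_cpus / 10;  /* illustrative bound */

              if (interval < 1)
                      interval = 1;
              if (interval > max)
                      interval = max;
              return interval;
      }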
     

31 Mar, 2011

2 commits

  • Fixes generated by 'codespell' and manually reviewed.

    Signed-off-by: Lucas De Marchi

    Lucas De Marchi
     
    The interval used to check whether scheduling domains are due to be
    balanced currently depends on the compile-time constant NR_CPUS, which
    may not accurately reflect the number of online CPUs at the time of the
    check.

    Thus replace NR_CPUS with num_online_cpus().

    (ed: Should only affect those who set NR_CPUS really high, such as 4096
    or so :-)

    Signed-off-by: Sisir Koppaka
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Sisir Koppaka
     

04 Mar, 2011

2 commits

    yield_to_task_fair() reschedules the CPU of the yielding task, when the
    intention is to reschedule the CPU of the task that is being yielded
    to.

    The change here fixes the problem and also makes the resched
    conditional on rq != p_rq.

    Signed-off-by: Venkatesh Pallipadi
    Reviewed-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
    Perform the test for SCHED_IDLE before testing for SCHED_BATCH (and
    ensure idle tasks don't preempt idle tasks) so that the
    non-interactive, but still important, SCHED_BATCH tasks run in
    preference to the very low priority SCHED_IDLE tasks.

    Signed-off-by: Darren Hart
    Signed-off-by: Peter Zijlstra
    Acked-by: Mike Galbraith
    Cc: Richard Purdie
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
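
    (ed: A condensed sketch of just the policy checks described above; the
    real check_preempt_wakeup() continues with buddy and vruntime logic
    when neither rule fires:)

      enum policy { SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE };

      /* 1 = preempt, 0 = don't, -1 = undecided (fall through to the
       * vruntime-based checks). 'curr' is running, 'wakee' just woke up. */
      static int policy_preempt_decision(enum policy curr, enum policy wakee)
      {
              /* SCHED_IDLE tested first: a non-idle wakee (including
               * SCHED_BATCH) preempts an idle current task... */
              if (curr == SCHED_IDLE && wakee != SCHED_IDLE)
                      return 1;

              /* ...while batch and idle wakees never preempt via wakeup
               * (their preemption is driven by the tick). */
              if (wakee != SCHED_NORMAL)
                      return 0;

              return -1;
      }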
     

23 Feb, 2011

3 commits


16 Feb, 2011

1 commit

  • sd_idle logic was introduced way back in 2005 (commit 5969fe06),
    as an HT optimization.

    As per the discussion in the thread here:

    lkml - sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1
    https://patchwork.kernel.org/patch/532501/

    The capacity-based logic in the load balancer now handles this in a
    much cleaner way, handles more than 2 SMT siblings, etc., and sd_idle
    does not seem to bring any additional benefit. The sd_idle logic also
    has some bugs that have a performance impact. This patch removes the
    sd_idle logic altogether.

    Also, sched_mc_power_savings == 2 had a dependency on the sd_idle
    logic.

    Signed-off-by: Venkatesh Pallipadi
    Acked-by: Vaidyanathan Srinivasan
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     

03 Feb, 2011

4 commits

  • Currently only implemented for fair class tasks.

    Add a yield_to_task() method to the fair scheduling class, allowing the
    caller of yield_to() to accelerate another thread in its thread group /
    task group.

    Implemented via a scheduler hint, using cfs_rq->next to encourage the
    target being selected. We can rely on pick_next_entity to keep things
    fair, so no one can accelerate a thread that has already used its fair
    share of CPU time.

    This also means callers should only call yield_to when they really
    mean it. Calling it too often can result in the scheduler just
    ignoring the hint.

    Signed-off-by: Rik van Riel
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Use the buddy mechanism to implement yield_task_fair. This
    allows us to skip onto the next highest priority se at every
    level in the CFS tree, unless doing so would introduce gross
    unfairness in CPU time distribution.

    We order the buddy selection in pick_next_entity to check
    yield first, then last, then next. We need next to be able
    to override yield, because it is possible for the "next" and
    "yield" tasks to be different processes in the same sub-tree
    of the CFS tree. When they are, we need to go into that
    sub-tree regardless of the "yield" hint, and pick the correct
    entity once we get to the right level.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
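
    (ed: Schematically, the pick order above looks like the sketch below.
    entitled() and GRANULARITY are illustrative stand-ins for the
    wakeup_preempt_entity() fairness test, not the kernel source:)

      struct sched_entity {
              unsigned long long vruntime;
      };

      struct buddies {
              struct sched_entity *skip;      /* set by yield: don't pick me */
              struct sched_entity *last;      /* cache-hot previous entity */
              struct sched_entity *next;      /* explicitly requested successor */
      };

      /* May the hinted entity run ahead of the leftmost (min-vruntime)
       * entity without gross unfairness? GRANULARITY is a made-up bound. */
      static int entitled(const struct sched_entity *hint,
                          const struct sched_entity *leftmost)
      {
              const unsigned long long GRANULARITY = 1000000ULL;  /* ~1ms in ns */

              return hint->vruntime <= leftmost->vruntime + GRANULARITY;
      }

      static struct sched_entity *pick(const struct buddies *b,
                                       struct sched_entity *leftmost,
                                       struct sched_entity *second)
      {
              struct sched_entity *se = leftmost;

              /* 1. yield: step over the skip buddy if the runner-up may run */
              if (b->skip == se && second && entitled(second, leftmost))
                      se = second;

              /* 2. last: prefer the cache-hot previous entity */
              if (b->last && entitled(b->last, leftmost))
                      se = b->last;

              /* 3. next: checked last so it can override both of the above */
              if (b->next && entitled(b->next, leftmost))
                      se = b->next;

              return se;
      }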
     
  • The clear_buddies function does not seem to play well with the concept
    of hierarchical runqueues. In the following tree, task groups are
    represented by 'G', tasks by 'T', next by 'n' and last by 'l'.

            (nl)
           /    \
       G(nl)     G
       /   \      \
    T(l)   T(n)    T

    This situation can arise when a task T(n) is woken up, and the
    previously running task T(l) is marked last.

    When clear_buddies is called from either T(l) or T(n), the next and last
    buddies of the group G(nl) will be cleared. This is not the desired
    result, since we would like to be able to find the other type of buddy
    in many cases.

    This is especially a worry when implementing yield_task_fair through
    the buddy system.

    The fix is simple: only clear the buddy type that the task itself
    is indicated to be. As an added bonus, we stop walking up the tree
    when the buddy has already been cleared or pointed elsewhere.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
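
    (ed: A minimal sketch of the fix with simplified structures, not the
    kernel's: clear only the buddy type the task was marked as, and stop
    walking up once the pointer at some level no longer refers to this
    branch:)

      #include <stddef.h>

      struct cfs_rq;

      struct sched_entity {
              struct sched_entity *parent;    /* NULL at the root */
              struct cfs_rq *cfs_rq;
      };

      struct cfs_rq {
              struct sched_entity *next;
              struct sched_entity *last;
      };

      /* Clear only the 'last' hint, and only while it still points at this
       * branch; the 'next' hint at the same level (e.g. a sibling buddy) is
       * left alone. The 'next' variant would be the mirror image. */
      static void clear_buddy_last(struct sched_entity *se)
      {
              for (; se != NULL; se = se->parent) {
                      if (se->cfs_rq->last == se)
                              se->cfs_rq->last = NULL;
                      else
                              break;  /* already cleared or points elsewhere */
              }
      }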
     
  • With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq.
    Yielding to a task from another cfs_rq may be worthwhile, since
    a process calling yield typically cannot use the CPU right now.

    Therefore, we want to check the per-cpu nr_running, not the
    cgroup-local one.

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

26 Jan, 2011

6 commits

    When a task is taken out of the fair class we must ensure the vruntime
    is properly normalized, because when we put it back in it will be
    assumed to be normalized.

    The case that goes wrong is when changing away from the fair class
    while sleeping. Sleeping tasks have non-normalized vruntime in order
    to make sleeper-fairness work. So treat the switch away from fair as a
    wakeup and preserve the relative vruntime.

    Also update sysrq-n to call the ->switch_{to,from} methods.

    Reported-by: Onkalo Samu
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
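
    (ed: In sketch form, with toy structures rather than the kernel's:
    queued entities carry absolute vruntime, sleepers carry a value
    relative to their old queue's min_vruntime, and switching away from
    fair while sleeping should keep that relative form so re-entry can
    re-base it:)

      struct cfs_rq_stub { unsigned long long min_vruntime; };
      struct entity      { unsigned long long vruntime; };

      /* On dequeue-for-sleep -- and, per this fix, when a sleeping task is
       * switched away from the fair class -- make vruntime relative. */
      static void make_vruntime_relative(struct entity *se,
                                         const struct cfs_rq_stub *rq)
      {
              se->vruntime -= rq->min_vruntime;
      }

      /* On wakeup / switching back to fair: re-base against the (possibly
       * advanced, possibly different) queue's min_vruntime. */
      static void make_vruntime_absolute(struct entity *se,
                                         const struct cfs_rq_stub *rq)
      {
              se->vruntime += rq->min_vruntime;
      }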
     
    Since cfs->{load_stamp,load_last} are zero-initialized, the initial
    load update will consider the delta to be 'since the beginning of
    time'.

    This results in a lot of pointless divisions to bring this large period to be
    within the sysctl_sched_shares_window.

    Fix this by initializing load_stamp to be 1 at cfs_rq initialization, this
    allows for an initial load_stamp > load_last which then lets standard idle
    truncation proceed.

    We avoid spinning (and slightly improve consistency) by fixing delta to be
    [period - 1] in this path resulting in a slightly more predictable shares ramp.
    (Previously the amount of idle time preserved by the overflow would range between
    [period/2,period-1].)

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Re-visiting this: Since update_cfs_shares will now only ever re-weight an
    entity that is a relative parent of the current entity in enqueue_entity; we
    can safely issue the account_entity_enqueue relative to that cfs_rq and avoid
    the requirement for special handling of the enqueue case in update_cfs_shares.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • The delta in clock_task is a more fair attribution of how much time a tg has
    been contributing load to the current cpu.

    While not really important it also means we're more in sync (by magnitude)
    with respect to periodic updates (since __update_curr deltas are clock_task
    based).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
    Since updates are against an entity's queuing cfs_rq, it's not possible
    to enter update_cfs_{shares,load} with a NULL cfs_rq. (Indeed,
    update_cfs_load would crash prior to the check if we did anyway, since
    load is examined during the initializers.)

    Also, in the update_cfs_load case there's no point
    in maintaining averages for rq->cfs_rq since we don't perform shares
    distribution at that level -- NULL check is replaced accordingly.

    Thanks to Dan Carpenter for pointing out the dereference before the
    NULL check.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • While care is taken around the zero-point in effective_load to not exceed
    the instantaneous rq->weight, it's still possible (e.g. using wake_idx != 0)
    for (load + effective_load) to underflow.

    In this case, comparing the unsigned values can result in incorrect
    balance decisions.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     

24 Jan, 2011

1 commit

  • Michael Witten and Christian Kujau reported that the autogroup
    scheduling feature hurts interactivity on their UP systems.

    It turns out that this is an older bug in the group scheduling code,
    and the wider appeal provided by the autogroup feature exposed it
    more prominently.

    On UP with FAIR_GROUP_SCHED enabled, tuning shares only
    affects tg->shares, and is not reflected in
    tg->se->load. The reason is that update_cfs_shares()
    does nothing on UP.

    So introduce update_cfs_shares() for UP && FAIR_GROUP_SCHED.

    This issue was found when autogroup scheduling was enabled,
    but it is an older bug that also exists with cgroup.cpu on UP.

    Reported-and-Tested-by: Michael Witten
    Reported-and-Tested-by: Christian Kujau
    Signed-off-by: Yong Zhang
    Acked-by: Pekka Enberg
    Acked-by: Mike Galbraith
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Yong Zhang
     

18 Jan, 2011

2 commits

    A signed/unsigned comparison may lead to a superfluous resched if the
    leftmost task is to the right of the current task, wasting a few cycles
    and inadvertently _lengthening_ the current task's slice.

    Reported-by: Venkatesh Pallipadi
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
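
    (ed: The pitfall in miniature, with made-up values: the vruntime delta
    is signed, and comparing it against an unsigned slice length promotes a
    negative delta to a huge unsigned value, so the "slice exceeded" test
    fires when it should not:)

      #include <stdio.h>
      #include <stdint.h>

      int main(void)
      {
              uint64_t curr_vruntime = 1000;
              uint64_t left_vruntime = 4000;  /* leftmost is "right of" curr */
              uint64_t ideal_runtime = 3000;

              int64_t delta = (int64_t)(curr_vruntime - left_vruntime); /* -3000 */

              /* Mixed signed/unsigned comparison: delta is converted to a
               * huge unsigned value, wrongly reporting "slice exceeded". */
              if (delta > ideal_runtime)
                      printf("buggy check: superfluous resched\n");

              /* Guarding the sign first gives the intended behaviour. */
              if (delta < 0)
                      printf("fixed check: curr keeps its slice\n");
              else if ((uint64_t)delta > ideal_runtime)
                      printf("fixed check: resched\n");

              return 0;
      }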
     
  • Previously effective_load would approximate the global load weight present on
    a group taking advantage of:

    entity_weight = tg->shares * (lw / global_lw), where entity_weight was
    provided by tg_shares_up.

    This worked (approximately) for an 'empty' (at tg level) cpu since we would
    place boost load representative of what a newly woken task would receive.

    However, now that load is instantaneously updated this assumption is no longer
    true and the load calculation is rather incorrect in this case.

    Fix this (and improve the general case) by re-writing effective_load to take
    advantage of the new shares distribution code.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     

19 Dec, 2010

2 commits

  • Mike Galbraith reported poor interactivity[*] when the new shares distribution
    code was combined with autogroups.

    The root cause turns out to be a mis-ordering of accounting accrued execution
    time and shares updates. Since update_curr() is issued hierarchically,
    updating the parent entity weights to reflect child enqueue/dequeue results in
    the parent's unaccounted execution time then being accrued (vs vruntime) at the
    new weight as opposed to the weight present at accumulation.

    While this doesn't have much effect on processes with timeslices that cross a
    tick, it is particularly problematic for an interactive process (e.g. Xorg)
    which incurs many (tiny) timeslices. In this scenario almost all updates are
    at dequeue which can result in significant fairness perturbation (especially if
    it is the only thread, resulting in potential {tg->shares, MIN_SHARES}
    transitions).

    Correct this by ensuring unaccounted time is accumulated prior to manipulating
    an entity's weight.

    [*] http://xkcd.com/619/ is perversely Nostradamian here.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Long running entities that do not block (dequeue) require periodic updates to
    maintain accurate share values. (Note: group entities with several threads are
    quite likely to be non-blocking in many circumstances).

    By virtue of being long-running however, we will see entity ticks (otherwise
    the required update occurs in dequeue/put and we are done). Thus we can move
    the detection (and associated work) for these updates into the periodic path.

    This restores the 'atomicity' of update_curr() with respect to accounting.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner