22 Jul, 2011

7 commits

  • No need to define a new "cfs_rq" variable in the "for" block.
    Just use the one at the top of the function.
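
    (ed: a minimal sketch of the pattern; the function and variable names here
    are illustrative, not the literal diff:)

        static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
        {
                struct cfs_rq *cfs_rq;          /* declared once at the top ... */
                struct sched_entity *se = &p->se;

                for_each_sched_entity(se) {
                        cfs_rq = cfs_rq_of(se); /* ... and reused here, instead of
                                                 * declaring a second local
                                                 * "struct cfs_rq *cfs_rq" inside
                                                 * the loop body */
                        /* ... enqueue/update work on cfs_rq ... */
                }
        }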

    Signed-off-by: Lin Ming
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1311297271.3938.1352.camel@minggr.sh.intel.com
    Signed-off-by: Ingo Molnar

    Lin Ming
     
  • "entity_key()" is only used in "__enqueue_entity()" and
    its only function is to subtract the group's min_vruntime
    from a task's vruntime.
    Before this patch, an rbtree enqueue decision is done by
    comparing two tasks in the style:

    "if (entity_key(cfs_rq, se) < entity_key(cfs_rq, entry))"

    which would be

    "if (se->vruntime-cfs_rq->min_vruntime < entry->vruntime-cfs_rq->min_vruntime)"

    or (if reducing cfs_rq->min_vruntime out)

    "if (se->vruntime < entry->vruntime)"

    which is

    "if (entity_before(se, entry))"

    So we do not need "entity_key()".
    If "entity_before()" is inline we will also save one subtraction (only one,
    because "entity_key(cfs_rq, se)" was cached in "key")

    Signed-off-by: Stephan Baerwolf
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-ns12mnd2h5w8rb9agd8hnsfk@git.kernel.org
    Signed-off-by: Ingo Molnar

    Stephan Baerwolf
     
  • The last reference to cpu_cfs_rq() was removed with commit 88ec22d3
    ("sched: Remove the cfs_rq dependency from set_task_cpu()"). Thus,
    remove this function, too.

    Signed-off-by: Jan Schoenherr
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1310580816-10861-3-git-send-email-schnhrr@cs.tu-berlin.de
    Signed-off-by: Ingo Molnar

    Jan Schoenherr
     
  • Use for_each_leaf_cfs_rq() instead of list_for_each_entry_rcu(); this
    ensures that load_balance_fair() only iterates over those task_groups that
    actually have tasks on busiest, and that we iterate bottom-up, trying to
    move light groups before the heavier ones.

    No idea if it will actually work out to be beneficial in practice, does
    anybody have a cgroup workload that might show a difference one way or
    the other?

    [ Also move update_h_load to sched_fair.c, losing #ifdef-ery ]
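
    (ed: roughly, the iteration changes shape as sketched below; not the
    literal diff:)

        rcu_read_lock();
        update_h_load(cpu_of(busiest));

        /* old: walk every task_group in the system */
        /*   list_for_each_entry_rcu(tg, &task_groups, list) { ... } */

        /* new: walk only the cfs_rqs that actually have load on the busiest
         * rq, bottom-up, so light groups are tried before heavier ones */
        for_each_leaf_cfs_rq(busiest, busiest_cfs_rq) {
                /* ... balance_tasks() against busiest_cfs_rq ... */
        }
        rcu_read_unlock();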

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Paul Turner
    Link: http://lkml.kernel.org/r/1310557009.2586.28.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity
    with additional weight. However, we perform a double shares update on this
    entity as we continue the shares update traversal from this point, despite
    dequeue_entity() having already updated its queuing cfs_rq.
    Avoid this by starting from the parent when we resume.
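
    (ed: the gist of the fix in dequeue_task_fair(), sketched from memory; the
    exact diff may differ:)

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                dequeue_entity(cfs_rq, se, flags);

                /* Don't dequeue parent if it has other entities besides us */
                if (cfs_rq->load.weight) {
                        if (task_sleep && parent_entity(se))
                                set_next_buddy(parent_entity(se));

                        /* dequeue_entity() already updated this level, so
                         * resume the shares-update walk from the parent */
                        se = parent_entity(se);
                        break;
                }
                flags |= DEQUEUE_SLEEP;
        }

        for_each_sched_entity(se) {
                cfs_rq = cfs_rq_of(se);
                update_cfs_load(cfs_rq, 0);
                update_cfs_shares(cfs_rq);
        }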

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110707053059.797714697@google.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • While looking at check_preempt_wakeup() I realized that we are
    potentially updating the wrong entity in the fair-group scheduling
    case. In this case the current task's cfs_rq may not be the same as
    the one used for the comparison between the waking task and the
    existing task's vruntime.

    This potentially results in us using a stale vruntime in the
    pre-emption decision, providing a small false preference for the
    previous task. The effects of this are bounded since we always
    perform a hierarchical update on the tick.
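
    (ed: sketched from memory; the idea is to update the entity actually used
    in the comparison, not the task's own cfs_rq:)

        find_matching_se(&se, &pse);
        update_curr(cfs_rq_of(se));     /* refresh vruntime of the matched entity */
        BUG_ON(!pse);
        if (wakeup_preempt_entity(se, pse) == 1)
                goto preempt;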

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/CAPM31R+2Ke2urUZKao5W92_LupdR4AYEv-EZWiJ3tG=tEes2cw@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Merge reason: pick up the latest scheduler fixes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

21 Jul, 2011

1 commit

  • In order to prepare for non-unique sched_groups per domain, we need to
    carry the cpu_power elsewhere, so put a level of indirection in.
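
    (ed: roughly, the indirection looks like the sketch below; the field layout
    is a best-effort reconstruction:)

        struct sched_group_power {
                atomic_t ref;
                /*
                 * CPU power of this group, SCHED_POWER_SCALE being max power
                 * for a single CPU.
                 */
                unsigned int power, power_orig;
        };

        struct sched_group {
                struct sched_group *next;       /* Must be a circular list */
                atomic_t ref;
                struct sched_group_power *sgp;  /* cpu_power now lives behind a pointer */
                /* ... other fields ... */
                unsigned long cpumask[0];
        };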

    Reported-and-tested-by: Anton Blanchard
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-qkho2byuhe4482fuknss40ad@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

01 Jul, 2011

1 commit

  • wake_affine() is only called from one path: select_task_rq_fair(),
    which already has the RCU read lock held.

    Signed-off-by: Nikunj A. Dadhania
    Signed-off-by: Peter Zijlstra
    Cc: Paul E. McKenney
    Link: http://lkml.kernel.org/r/20110607101251.777.34547.stgit@IBM-009124035060.in.ibm.com
    Signed-off-by: Ingo Molnar

    Nikunj A. Dadhania
     

28 May, 2011

1 commit

  • Dima Zavin reported:

    "After pulling the thread off the run-queue during a cgroup change,
    the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
    then gets normalized to this new value. This can then lead to the thread
    getting an unfair boost in the new group if the vruntime of the next
    task in the old run-queue was way further ahead."
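
    (ed: the gist of the fix in dequeue_entity(), sketched: normalize against
    the min_vruntime the entity was actually queued under, and only then let
    min_vruntime advance:)

        if (se != cfs_rq->curr)
                __dequeue_entity(cfs_rq, se);
        se->on_rq = 0;
        account_entity_dequeue(cfs_rq, se);

        /* normalize while min_vruntime still reflects the old queue ... */
        if (!(flags & DEQUEUE_SLEEP))
                se->vruntime -= cfs_rq->min_vruntime;

        /* ... and only then recalculate, so the departed entity gets no boost */
        update_min_vruntime(cfs_rq);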

    Reported-by: Dima Zavin
    Signed-off-by: John Stultz
    Recalls-having-tested-once-upon-a-time-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 May, 2011

1 commit

  • SCHED_LOAD_SCALE is used to increase nice resolution and to
    scale cpu_power calculations in the scheduler. This patch
    introduces SCHED_POWER_SCALE and converts all uses of
    SCHED_LOAD_SCALE for scaling cpu_power to use SCHED_POWER_SCALE
    instead.

    This is a preparatory patch for increasing the resolution of
    SCHED_LOAD_SCALE, and there is no need to increase resolution
    for cpu_power calculations.
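
    (ed: in effect, sketched:)

        /* load scaling: this is the one whose resolution gets increased later */
        #define SCHED_LOAD_SHIFT        10
        #define SCHED_LOAD_SCALE        (1L << SCHED_LOAD_SHIFT)

        /* cpu_power scaling: stays at 1024, now decoupled from SCHED_LOAD_SCALE */
        #define SCHED_POWER_SHIFT       10
        #define SCHED_POWER_SCALE       (1L << SCHED_POWER_SHIFT)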

    Signed-off-by: Nikhil Rao
    Acked-by: Peter Zijlstra
    Cc: Nikunj A. Dadhania
    Cc: Srivatsa Vaddagiri
    Cc: Stephan Barwolf
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1305738580-9924-3-git-send-email-ncrao@google.com
    Signed-off-by: Ingo Molnar

    Nikhil Rao
     

04 May, 2011

1 commit


19 Apr, 2011

2 commits

  • When a task in a taskgroup sleeps, pick_next_task starts all the way back at
    the root and picks the task/taskgroup with the min vruntime across all
    runnable tasks.

    But when there are many frequently sleeping tasks across different taskgroups,
    it makes better sense to stay with the same taskgroup for its slice period (or
    until all tasks in the taskgroup sleep) instead of switching across taskgroups
    on each sleep after a short runtime.

    This helps specifically where a taskgroup corresponds to a process with
    multiple threads. The change reduces the number of CR3 switches in this case.

    Example:

    Two taskgroups with 2 threads each which are running for 2ms and
    sleeping for 1ms. Looking at sched:sched_switch shows:

    BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
    cpu-soaker-5004 [003] 3683.391089
    cpu-soaker-5016 [003] 3683.393106
    cpu-soaker-5005 [003] 3683.395119
    cpu-soaker-5017 [003] 3683.397130
    cpu-soaker-5004 [003] 3683.399143
    cpu-soaker-5016 [003] 3683.401155
    cpu-soaker-5005 [003] 3683.403168
    cpu-soaker-5017 [003] 3683.405170

    AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
    cpu-soaker-21890 [003] 865.895494
    cpu-soaker-21935 [003] 865.897506
    cpu-soaker-21934 [003] 865.899520
    cpu-soaker-21935 [003] 865.901532
    cpu-soaker-21934 [003] 865.903543
    cpu-soaker-21935 [003] 865.905546
    cpu-soaker-21891 [003] 865.907548
    cpu-soaker-21890 [003] 865.909560
    cpu-soaker-21891 [003] 865.911571
    cpu-soaker-21890 [003] 865.913582
    cpu-soaker-21891 [003] 865.915594
    cpu-soaker-21934 [003] 865.917606

    A similar problem exists when there are multiple taskgroups and, say, a task A
    preempts the currently running task B of taskgroup_1. On schedule, pick_next_task
    can pick an unrelated task from taskgroup_2. Here it would be better to give some
    preference to task B in pick_next_task.

    A simple (maybe extreme) benchmark I tried was tbench with 2 tbench
    client processes with 2 threads each running on a single CPU. Avg throughput
    across five 50-second runs was:

    BEFORE: 105.84 MB/sec
    AFTER: 112.42 MB/sec
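
    (ed: the mechanism, sketched; set the next buddy so pick_next_entity()
    stays within the same group where fairness allows:)

        /* wakeup-preemption path in check_preempt_wakeup(): */
        if (wakeup_preempt_entity(se, pse) == 1) {
                /*
                 * Bias pick_next to pick the sched entity that is
                 * triggering this preemption.
                 */
                if (!next_buddy_marked)
                        set_next_buddy(pse);
                goto preempt;
        }

        /* sleep path in dequeue_task_fair(): */
        if (task_sleep && parent_entity(se))
                set_next_buddy(parent_entity(se));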

    Signed-off-by: Venkatesh Pallipadi
    Acked-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1302802253-25760-1-git-send-email-venki@google.com
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
  • Make set_*_buddy() work on non-task sched_entity, to facilitate the
    use of next_buddy to cache a group entity in cases where one of the
    tasks within that entity sleeps or gets preempted.

    set_skip_buddy() was incorrectly checking that the policy of the yielding
    task was not SCHED_IDLE. Yielding should happen even when the yielding
    task is SCHED_IDLE, so this change removes the policy check on the
    yielding task.
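
    (ed: roughly, after this change; a sketch:)

        static void set_next_buddy(struct sched_entity *se)
        {
                /* never boost a SCHED_IDLE *task*, but do allow group entities */
                if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
                        return;

                for_each_sched_entity(se)
                        cfs_rq_of(se)->next = se;
        }

        static void set_skip_buddy(struct sched_entity *se)
        {
                /* no policy check here: SCHED_IDLE tasks may yield too */
                for_each_sched_entity(se)
                        cfs_rq_of(se)->skip = se;
        }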

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1302744070-30079-2-git-send-email-venki@google.com
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     

18 Apr, 2011

1 commit


14 Apr, 2011

3 commits

  • In order to avoid reading partially updated min_vruntime values on 32bit,
    implement a seqcount-like solution.
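
    (ed: the shape of the solution, sketched; a shadow copy plus memory
    barriers, retried by the lockless reader until both halves agree:)

        /* writer, in update_min_vruntime(): */
        cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
        #ifndef CONFIG_64BIT
                smp_wmb();
                cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
        #endif

        /* lockless reader: */
        #ifndef CONFIG_64BIT
                u64 min_vruntime, min_vruntime_copy;

                do {
                        min_vruntime_copy = cfs_rq->min_vruntime_copy;
                        smp_rmb();
                        min_vruntime = cfs_rq->min_vruntime;
                } while (min_vruntime != min_vruntime_copy);
        #else
                u64 min_vruntime = cfs_rq->min_vruntime;
        #endif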

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.111378493@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In preparation of calling this without rq->lock held, remove the
    dependency on the rq argument.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.071474242@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In preparation of calling select_task_rq() without rq->lock held, drop
    the dependency on the rq argument.

    Reviewed-by: Frank Rowand
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110405152729.031077745@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

11 Apr, 2011

5 commits

  • Don't use sd->level for identifying properties of the domain.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110407122942.350174079@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Instead of relying on static allocations for the sched_domain and
    sched_group trees, dynamically allocate and RCU free them.

    Allocating this dynamically also allows for some build_sched_groups()
    simplification since we can now (like with other simplifications) rely
    on the sched_domain tree instead of hard-coded knowledge.

    One tricky thing to note is that detach_destroy_domains() needs to hold
    rcu_read_lock() over the entire tear-down; per-cpu is not sufficient
    since that can lead to partial sched_group existence (this could possibly
    be solved by doing the tear-down backwards, but this is much more
    robust).

    A consequence of the above is that we can no longer print the
    sched_domain debug stuff from cpu_attach_domain() since that might now
    run with preemption disabled (due to classic RCU etc.) and
    sched_domain_debug() does some GFP_KERNEL allocations.

    Another thing to note is that we now fully rely on normal RCU and not
    RCU-sched; this is because, with the new and exciting RCU flavours we
    have grown over the years, BH doesn't necessarily hold off RCU-sched grace
    periods (-rt is known to break this). This would in fact already cause
    us grief since we do sched_domain/sched_group iterations from softirq
    context.

    This patch is somewhat larger than I would like it to be, but I didn't
    find any means of shrinking/splitting this.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110407122942.245307941@chello.nl
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • calc_delta_fair() checks NICE_0_LOAD already, delete duplicate check.
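
    (ed: for reference, roughly the check that makes the caller's test
    redundant; a sketch close to mainline at the time:)

        static inline unsigned long
        calc_delta_fair(unsigned long delta, struct sched_entity *se)
        {
                /* already a no-op for NICE_0_LOAD weights */
                if (unlikely(se->load.weight != NICE_0_LOAD))
                        delta = calc_delta_mine(delta, NICE_0_LOAD, &se->load);

                return delta;
        }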

    Signed-off-by: Shaohua Li
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1302238389.3981.92.camel@sli10-conroe
    Signed-off-by: Ingo Molnar

    Shaohua Li
     
  • The scheduler load balancer has specific code to deal with cases of an
    unbalanced system due to lots of unmovable tasks (for example because of
    hard CPU affinity). In those situations, it excludes the busiest CPU that
    has pinned tasks from load-balance consideration so that it can perform
    a second load-balance pass on the rest of the system.

    This all works as designed if there is only one cgroup in the system.

    However, when we have multiple cgroups, this logic has false positives and
    triggers multiple load-balance passes even though there are actually no
    pinned tasks at all.

    The reason it has false positives is that the all-pinned logic is deep in
    the lowest-level function, can_migrate_task(), and is too low level:

    load_balance_fair() iterates over each task group and calls balance_tasks()
    to migrate the target load. Along the way, balance_tasks() will also set an
    all_pinned variable. Given that task groups are iterated, this all_pinned
    variable is essentially the status of the last group in the scanning process.
    A task group can fail to migrate any load for a number of reasons, none of
    them due to cpu affinity. However, this status bit is propagated back up
    to the higher-level load_balance(), which incorrectly thinks that no tasks
    could be moved. It kicks off the all-pinned logic and starts multiple passes
    attempting to move load onto the puller CPU.

    To fix this, move the all_pinned aggregation up to the iterator level.
    This ensures that the status is aggregated over all task groups, not just
    the last one in the list.
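
    (ed: conceptually, the aggregation moves up to the group iterator; an
    illustrative sketch with hypothetical helper names, not the actual diff:)

        int all_pinned = 1;                     /* assume everything is pinned ... */

        for_each_busy_group(busiest, group) {   /* hypothetical iterator */
                int pinned = 1;

                moved += balance_group_tasks(group, &pinned);   /* hypothetical helper */
                if (!pinned)
                        all_pinned = 0;         /* ... until any group proves otherwise */
        }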

    Signed-off-by: Ken Chen
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Ken Chen
     
  • In function find_busiest_group(), the sched-domain avg_load isn't
    calculated at all if there is a group imbalance within the domain. This
    will cause an erroneous imbalance calculation.

    The reason is that calculate_imbalance() sees sds->avg_load = 0 and it
    will dump entire sds->max_load into imbalance variable, which is used
    later on to migrate entire load from busiest CPU to the puller CPU.

    This has two really bad effects:

    1. a stampede of task migrations, which cannot break out
    of the bad state because of a positive feedback loop: large load
    delta -> heavier load migration -> larger imbalance, and the cycle
    goes on.

    2. severe imbalance in CPU queue depth. This causes really long
    scheduling latency blips, which badly affect applications that
    have tight latency requirements.

    The fix is to have the kernel calculate the domain avg_load in both cases.
    This will ensure that the imbalance calculation is always sensible and the
    target is usually halfway between the busiest and the puller CPU.
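
    (ed: the essence of the fix, sketched; compute the domain average
    unconditionally so calculate_imbalance() has a sane target:)

        /* always derive the domain-wide average load ... */
        sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;

        /*
         * ... so that calculate_imbalance() pulls roughly toward the midpoint
         * (max_load - avg_load) instead of dumping sds.max_load wholesale
         * into the imbalance when avg_load is still zero.
         */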

    Signed-off-by: Ken Chen
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
    Signed-off-by: Ingo Molnar

    Ken Chen
     

08 Apr, 2011

2 commits

  • …-linus', 'irq-fixes-for-linus' and 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86-32, fpu: Fix FPU exception handling on non-SSE systems
    x86, hibernate: Initialize mmu_cr4_features during boot
    x86-32, NUMA: Fix ACPI NUMA init broken by recent x86-64 change
    x86: visws: Fixup irq overhaul fallout

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: Clean up rebalance_domains() load-balance interval calculation

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86/mrst/vrtc: Fix boot crash in mrst_rtc_init()
    rtc, x86/mrst/vrtc: Fix boot crash in rtc_read_alarm()

    * 'irq-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    genirq: Fix cpumask leak in __setup_irq()

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf probe: Fix listing incorrect line number with inline function
    perf probe: Fix to find recursively inlined function
    perf probe: Fix multiple --vars options behavior
    perf probe: Fix to remove redundant close
    perf probe: Fix to ensure function declared file

    Linus Torvalds
     
  • * 'for-linus2' of git://git.profusion.mobi/users/lucas/linux-2.6:
    Fix common misspellings

    Linus Torvalds
     

05 Apr, 2011

1 commit

  • Instead of the possible multiple-evaluation of num_online_cpus()
    in rebalance_domains() that Linus reported, avoid it altogether
    in the normal case since it's implemented with a Hamming weight
    function over a cpu bitmask which can be darn expensive for those
    with big iron.

    This also makes it cleaner, smaller and documents the code.
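
    (ed: roughly what the cleanup boils down to, sketched from memory:)

        /* updated from the CPU hotplug notifier instead of querying
         * num_online_cpus() on every rebalance tick */
        static unsigned long max_load_balance_interval = HZ/10;

        static void update_max_interval(void)
        {
                max_load_balance_interval = HZ * num_online_cpus() / 10;
        }

        /* in rebalance_domains(): scale ms to jiffies and clamp */
        interval = msecs_to_jiffies(interval);
        interval = clamp(interval, 1UL, max_load_balance_interval);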

    Reported-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

31 Mar, 2011

2 commits

  • Fixes generated by 'codespell' and manually reviewed.

    Signed-off-by: Lucas De Marchi

    Lucas De Marchi
     
  • The interval for checking whether scheduling domains are due to be
    balanced currently depends on the boot state NR_CPUS, which may not
    accurately reflect the number of online CPUs at the time of the check.

    Thus replace NR_CPUS with num_online_cpus().

    (ed: Should only affect those who set NR_CPUS really high, such as 4096
    or so :-)

    Signed-off-by: Sisir Koppaka
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Sisir Koppaka
     

04 Mar, 2011

2 commits

  • yield_to_task_fair() has code to resched the CPU of the yielding task when
    the intention is to resched the CPU of the task that is being yielded to.

    The change here fixes the problem and also makes the resched conditional on
    rq != p_rq.
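
    (ed: the gist in yield_to(), sketched; resched the target's runqueue, and
    only when it really is a different one:)

        yielded = curr->sched_class->yield_to_task(rq, p, preempt);
        if (yielded) {
                schedstat_inc(rq, yld_count);
                /*
                 * Make p's CPU reschedule; pick_next_entity takes care of
                 * fairness.
                 */
                if (preempt && rq != p_rq)
                        resched_task(p_rq->curr);
        }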

    Signed-off-by: Venkatesh Pallipadi
    Reviewed-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
  • Perform the test for SCHED_IDLE before testing for SCHED_BATCH (and
    ensure idle tasks don't preempt idle tasks) so the non-interactive,
    but still important, SCHED_BATCH tasks will run in favor of the very
    low priority SCHED_IDLE tasks.
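
    (ed: the resulting ordering in check_preempt_wakeup(), sketched:)

        /* Idle tasks are by definition preempted by non-idle tasks. */
        if (unlikely(curr->policy == SCHED_IDLE) &&
            likely(p->policy != SCHED_IDLE))
                goto preempt;

        /*
         * Batch and idle tasks do not preempt non-idle tasks (their
         * preemption is driven by the tick):
         */
        if (unlikely(p->policy != SCHED_NORMAL))
                return;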

    Signed-off-by: Darren Hart
    Signed-off-by: Peter Zijlstra
    Acked-by: Mike Galbraith
    Cc: Richard Purdie
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Darren Hart
     

23 Feb, 2011

3 commits


16 Feb, 2011

1 commit

  • sd_idle logic was introduced way back in 2005 (commit 5969fe06),
    as an HT optimization.

    As per the discussion in the thread here:

    lkml - sched: Resolve sd_idle and first_idle_cpu Catch-22 - v1
    https://patchwork.kernel.org/patch/532501/

    The capacity based logic in the load balancer right now handles this
    in a much cleaner way, handling more than 2 SMT siblings etc, and sd_idle
    does not seem to bring any additional benefits. sd_idle logic also has
    some bugs that have a performance impact. Here is the patch that removes
    the sd_idle logic altogether.

    Also, there was a dependency of sched_mc_power_savings == 2 on the sd_idle
    logic.

    Signed-off-by: Venkatesh Pallipadi
    Acked-by: Vaidyanathan Srinivasan
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     

03 Feb, 2011

4 commits

  • Currently only implemented for fair class tasks.

    Add a yield_to_task() method to the fair scheduling class, allowing the
    caller of yield_to() to accelerate another thread in its thread group /
    task group.

    Implemented via a scheduler hint, using cfs_rq->next to encourage the
    target to be selected. We can rely on pick_next_entity to keep things
    fair, so no one can accelerate a thread that has already used its fair
    share of CPU time.

    This also means callers should only call yield_to when they really
    mean it. Calling it too often can result in the scheduler just
    ignoring the hint.
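
    (ed: the fair-class hook, sketched; close to what landed, but treat it as
    illustrative:)

        static bool yield_to_task_fair(struct rq *rq, struct task_struct *p,
                                       bool preempt)
        {
                struct sched_entity *se = &p->se;

                if (!se->on_rq)
                        return false;

                /* Tell the scheduler that we'd really like se to run next. */
                set_next_buddy(se);

                yield_task_fair(rq);

                return true;
        }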

    Signed-off-by: Rik van Riel
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Use the buddy mechanism to implement yield_task_fair. This
    allows us to skip onto the next highest priority se at every
    level in the CFS tree, unless doing so would introduce gross
    unfairness in CPU time distribution.

    We order the buddy selection in pick_next_entity to check
    yield first, then last, then next. We need next to be able
    to override yield, because it is possible for the "next" and
    "yield" task to be different processen in the same sub-tree
    of the CFS tree. When they are, we need to go into that
    sub-tree regardless of the "yield" hint, and pick the correct
    entity once we get to the right level.
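
    (ed: the buddy ordering in pick_next_entity(), sketched; skip is honoured
    first, then last, then next, each gated by wakeup_preempt_entity() so it
    cannot introduce gross unfairness:)

        struct sched_entity *se = __pick_first_entity(cfs_rq);
        struct sched_entity *left = se;

        /* Avoid running the skip buddy if running it would violate fairness. */
        if (cfs_rq->skip == se) {
                struct sched_entity *second = __pick_next_entity(se);
                if (second && wakeup_preempt_entity(second, left) < 1)
                        se = second;
        }

        /* Prefer last buddy, try to return the CPU to a preempted task. */
        if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
                se = cfs_rq->last;

        /* Someone really wants this to run. If it's not unfair, run it. */
        if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
                se = cfs_rq->next;

        clear_buddies(cfs_rq, se);
        return se;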

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • The clear_buddies function does not seem to play well with the concept
    of hierarchical runqueues. In the following tree, task groups are
    represented by 'G', tasks by 'T', next by 'n' and last by 'l'.

            (nl)
           /    \
       G(nl)     G
       /   \      \
    T(l)   T(n)    T

    This situation can arise when a task is woken up T(n), and the previously
    running task T(l) is marked last.

    When clear_buddies is called from either T(l) or T(n), the next and last
    buddies of the group G(nl) will be cleared. This is not the desired
    result, since we would like to be able to find the other type of buddy
    in many cases.

    This is especially a worry when implementing yield_task_fair through the
    buddy system.

    The fix is simple: only clear the buddy type that the task itself
    is indicated to be. As an added bonus, we stop walking up the tree
    when the buddy has already been cleared or pointed elsewhere.
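
    (ed: the per-type clearing, sketched; one walker per buddy kind, stopping
    once the chain is broken:)

        static void __clear_buddies_next(struct sched_entity *se)
        {
                for_each_sched_entity(se) {
                        struct cfs_rq *cfs_rq = cfs_rq_of(se);
                        if (cfs_rq->next == se)
                                cfs_rq->next = NULL;
                        else
                                break;  /* already cleared or pointing elsewhere */
                }
        }

        /* __clear_buddies_last() and __clear_buddies_skip() follow the same
         * pattern; clear_buddies() only calls the ones that actually point
         * at this task's entity. */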

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
     
  • With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq.
    Yielding to a task from another cfs_rq may be worthwhile, since
    a process calling yield typically cannot use the CPU right now.

    Therefore, we want to check the per-cpu nr_running, not the
    cgroup local one.
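
    (ed: in yield_task_fair(), sketched:)

        /* Are we the only runnable task on this CPU (across all cgroups)? */
        if (unlikely(rq->nr_running == 1))      /* was: cfs_rq->nr_running */
                return;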

    Signed-off-by: Rik van Riel
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

26 Jan, 2011

2 commits

  • When a task is taken out of the fair class we must ensure the vruntime
    is properly normalized, because when we put it back in it will be assumed
    to be normalized.

    The case that goes wrong is when changing away from the fair class
    while sleeping. Sleeping tasks have non-normalized vruntime in order
    to make sleeper-fairness work. So treat the switch away from fair as a
    wakeup and preserve the relative vruntime.

    Also update sysrq-n to call the ->switch_{to,from} methods.
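
    (ed: the core of the fix, sketched from memory; treat the switch away like
    a dequeue-at-sleep and keep the vruntime relative:)

        static void switched_from_fair(struct rq *rq, struct task_struct *p)
        {
                struct sched_entity *se = &p->se;
                struct cfs_rq *cfs_rq = cfs_rq_of(se);

                /*
                 * A sleeping task carries an absolute vruntime (for sleeper
                 * fairness); normalize it so the later re-enqueue into the
                 * fair class can treat it as relative again.
                 */
                if (!se->on_rq && p->state != TASK_RUNNING) {
                        place_entity(cfs_rq, se, 0);
                        se->vruntime -= cfs_rq->min_vruntime;
                }
        }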

    Reported-by: Onkalo Samu
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since cfs->{load_stamp,load_last} are zero-initialized, the initial load update
    will consider the delta to be 'since the beginning of time'.

    This results in a lot of pointless divisions to bring this large period to be
    within the sysctl_sched_shares_window.

    Fix this by initializing load_stamp to be 1 at cfs_rq initialization, this
    allows for an initial load_stamp > load_last which then lets standard idle
    truncation proceed.

    We avoid spinning (and slightly improve consistency) by fixing delta to be
    [period - 1] in this path resulting in a slightly more predictable shares ramp.
    (Previously the amount of idle time preserved by the overflow would range between
    [period/2,period-1].)

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner