20 Sep, 2007

1 commit

  • add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield()
    more aggressive, by moving the yielding task to the last position
    in the rbtree.

    with sched_compat_yield=0:

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield
    2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop

    with sched_compat_yield=1:

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop
    2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
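    The sched_compat_yield=1 behaviour can be modelled in user space; this
    is a hypothetical sketch (a plain struct standing in for the rbtree
    entity), not the kernel implementation in kernel/sched_fair.c:

    ```c
    #include <assert.h>
    #include <stdio.h>

    /* Hypothetical model: under sched_compat_yield=1 the yielding task is
     * re-keyed past the rightmost (largest-key) entity, so it sorts last
     * in the rbtree and is not picked again immediately. Names are made
     * up for illustration. */
    struct task { const char *name; unsigned long key; };

    static void yield_aggressive(struct task *t, unsigned long rightmost_key)
    {
        t->key = rightmost_key + 1;     /* move behind every queued task */
    }

    int main(void)
    {
        struct task yielder = { "loop_yield", 100 };
        unsigned long rightmost = 500;  /* largest key currently queued */

        yield_aggressive(&yielder, rightmost);
        printf("%s key=%lu\n", yielder.name, yielder.key);
        assert(yielder.key > rightmost);
        return 0;
    }
    ```

    With sched_compat_yield=0 the yielder keeps its place near the left of
    the tree and is re-picked almost immediately, which matches the 50/50
    CPU split in the first top output above.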
     

05 Sep, 2007

5 commits

  • fix ideal_runtime:

    - do not scale it using niced_granularity()
    it is measured against sum_exec_delta, so it's wall-time, not fair-time.

    - move the whole check into __check_preempt_curr_fair()
    so that wakeup preemption can also benefit from the new logic.

    this also results in code size reduction:

    text data bss dec hex filename
    13391 228 1204 14823 39e7 sched.o.before
    13369 228 1204 14801 39d1 sched.o.after

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
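    The wall-time comparison described above can be sketched as follows;
    the field names mirror the kernel's sched_entity, but this is a
    stand-alone model, not the actual __check_preempt_curr_fair():

    ```c
    #include <assert.h>

    /* Stand-alone model: ideal_runtime is compared against the wall-clock
     * execution delta (sum_exec_runtime - prev_sum_exec_runtime), with no
     * niced_granularity() scaling applied. */
    struct entity {
        unsigned long long sum_exec_runtime;
        unsigned long long prev_sum_exec_runtime;
    };

    static int over_ideal_runtime(const struct entity *curr,
                                  unsigned long long ideal_runtime)
    {
        unsigned long long delta_exec =
            curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
        return delta_exec > ideal_runtime;  /* wall-time, not fair-time */
    }

    int main(void)
    {
        struct entity e = { 50, 0 };         /* ran 50 units since last pick */
        assert(over_ideal_runtime(&e, 40));  /* past its slice: preempt */
        assert(!over_ideal_runtime(&e, 60)); /* still within its slice */
        return 0;
    }
    ```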
     
  • Second preparatory patch for fix-ideal runtime:

    Mark prev_sum_exec_runtime at the beginning of our run, the same spot
    that adds our wait period to wait_runtime. This seems a more natural
    location to do this, and it also reduces the code a bit:

    text data bss dec hex filename
    13397 228 1204 14829 39ed sched.o.before
    13391 228 1204 14823 39e7 sched.o.after

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Preparatory patch for fix-ideal-runtime:

    simplify __check_preempt_curr_fair(): get rid of the integer return.

    text data bss dec hex filename
    13404 228 1204 14836 39f4 sched.o.before
    13393 228 1204 14825 39e9 sched.o.after

    functionality is unchanged.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • the cfs_rq->wait_runtime debug/statistics counter was not maintained
    properly - fix this.

    this also removes some code:

    text data bss dec hex filename
    13420 228 1204 14852 3a04 sched.o.before
    13404 228 1204 14836 39f4 sched.o.after

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
     
  • fix niced_granularity(): the bug resulted in under-scheduling for
    CPU-bound negative nice level tasks (and this in turn caused
    higher than necessary latencies in nice-0 tasks).

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Aug, 2007

6 commits

  • cleanup: we have the 'se' and 'curr' entity-pointers already,
    no need to use p->se and current->se.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith

    Ingo Molnar
     
  • small schedstat fix: the cfs_rq->wait_runtime 'sum of all runtimes'
    statistics counters missed newly forked tasks and thus had a constant
    negative skew. Fix this.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith

    Ingo Molnar
     
  • Peter Zijlstra noticed the following bug in SCHED_FEAT_SKIP_INITIAL (which
    is disabled by default at the moment): it relies on se.wait_start_fair
    being 0 while update_stats_wait_end() did not recognize a 0 value,
    so instead of 'skipping' the initial interval we gave the new child
    a maximum boost of +runtime-limit ...

    (No impact on the default kernel, but nice to fix for completeness.)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith

    Ingo Molnar
     
  • update the fair-clock before using it for the key value.

    [ mingo@elte.hu: small cleanups. ]

    Signed-off-by: Ting Yang
    Signed-off-by: Ingo Molnar
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra

    Ting Yang
     
  • de-HZ-ification of the granularity defaults unearthed a pre-existing
    property of CFS: while it correctly converges to the granularity goal,
    it does not prevent run-time fluctuations in the range of
    [-gran ... 0 ... +gran].

    With the increase of the granularity due to the removal of HZ
    dependencies, this becomes visible in chew-max output (with 5 tasks
    running):

    out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40
    out: 27 . 27. 32 | flu: 0 . 0 | ran: 17 . 13 | per: 44 . 40
    out: 27 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 36 . 40
    out: 29 . 27. 32 | flu: 2 . 0 | ran: 17 . 13 | per: 46 . 40
    out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40
    out: 29 . 27. 32 | flu: 0 . 0 | ran: 18 . 13 | per: 47 . 40
    out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40

    average slice is the ideal 13 msecs and the period is picture-perfect 40
    msecs. But the 'ran' field fluctuates around 13.33 msecs and there's no
    mechanism in CFS to keep that from happening: it's a perfectly valid
    solution that CFS finds.

    to fix this we add a granularity/preemption rule that knows about
    the "target latency", which makes tasks that run longer than the ideal
    latency run a bit less. The simplest approach is to simply decrease the
    preemption granularity when a task overruns its ideal latency. For this
    we have to track how much the task executed since its last preemption.

    ( this adds a new field to task_struct, but we can eliminate that
    overhead in 2.6.24 by putting all the scheduler timestamps into an
    anonymous union. )

    with this change in place, chew-max output is fluctuation-less all
    around:

    out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40

    this patch has no impact on any fastpath or on any globally observable
    scheduling property. (unless you have sharp enough eyes to see
    millisecond-level wrinkles in glxgears smoothness :-)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith

    Ingo Molnar
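    The rule described above — shrink the effective preemption granularity
    once a task overruns its ideal latency — can be sketched like this
    (illustrative names and units, not the kernel code):

    ```c
    #include <assert.h>

    /* Illustrative sketch: once a task has run past its ideal slice since
     * its last preemption, its effective granularity shrinks by the
     * overrun, so it is preempted sooner and the [-gran .. +gran]
     * fluctuation damps out. All values share one arbitrary time unit. */
    static unsigned long effective_granularity(unsigned long gran,
                                               unsigned long ran_since_preempt,
                                               unsigned long ideal_slice)
    {
        if (ran_since_preempt > ideal_slice) {
            unsigned long over = ran_since_preempt - ideal_slice;
            return over < gran ? gran - over : 0;   /* preempt ASAP */
        }
        return gran;                                /* within budget */
    }

    int main(void)
    {
        /* within the 13-unit ideal slice: full granularity */
        assert(effective_granularity(4, 9, 13) == 4);
        /* 2 units over the slice: granularity halved */
        assert(effective_granularity(4, 15, 13) == 2);
        /* 4 units over: granularity collapses to zero */
        assert(effective_granularity(4, 17, 13) == 0);
        return 0;
    }
    ```

    Tracking "ran since last preemption" is what the new task_struct field
    mentioned above provides.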
     
  • There is an Amarok song switch time increase (regression) under
    hefty load.

    What is happening is that sleeper_bonus is never consumed, and only
    rarely goes below runtime_limit, so for the most part, Amarok isn't
    getting any bonus at all. We're keeping sleeper_bonus right at
    runtime_limit (sched_latency == sched_runtime_limit == 40ms) forever, ie
    we don't consume if we're lower than that, and don't add if we're above
    it. One Amarok thread waking (or anybody else) will push us past the
    threshold, so the next thread waking gets nada, but will reap pain from
    the previous thread waking until we drop back to runtime_limit. It
    looks to me like under load, some random task gets a bonus, and
    everybody else pays, whether deserving or not.

    This diff fixed the regression for me at any load rate.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Mike Galbraith
     

26 Aug, 2007

2 commits

  • due to adaptive granularity scheduling the role of sched_granularity
    has changed to "minimum granularity", so rename the variable (and the
    tunable) accordingly.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
     
  • Instead of specifying the preemption granularity, specify the wanted
    latency. By fixing the granularity to a constant, the wakeup latency
    becomes a function of the number of running tasks on the rq.

    Invert this relation.

    sysctl_sched_granularity becomes a minimum for the dynamic granularity
    computed from the new sysctl_sched_latency.

    Then use this latency to make more intelligent granularity decisions: if
    there are fewer tasks running, we can schedule more coarsely. This helps
    performance while still always meeting the latency target.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
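    The inverted relation can be sketched as dividing the latency target by
    the number of runnable tasks, clipped from below by the minimum
    granularity (a hypothetical helper, not the kernel's exact code):

    ```c
    #include <assert.h>

    /* Hypothetical sketch: dynamic granularity is the latency target
     * divided by the number of runnable tasks, never dropping below the
     * configured minimum (sysctl_sched_granularity's new role). Units
     * here are microseconds. */
    static unsigned long dyn_granularity(unsigned long latency,
                                         unsigned long min_gran,
                                         unsigned int nr_running)
    {
        unsigned long gran = nr_running ? latency / nr_running : latency;
        return gran > min_gran ? gran : min_gran;
    }

    int main(void)
    {
        /* 20ms latency target, 2ms minimum granularity */
        assert(dyn_granularity(20000, 2000, 1) == 20000); /* alone: coarse */
        assert(dyn_granularity(20000, 2000, 5) == 4000);  /* 5 tasks: 4ms */
        assert(dyn_granularity(20000, 2000, 50) == 2000); /* clipped at min */
        return 0;
    }
    ```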
     

25 Aug, 2007

6 commits

  • fix task startup penalty miscalculation: sysctl_sched_granularity is
    unsigned int and wait_runtime is long so we first have to convert it
    to long before turning it negative ...

    Signed-off-by: Ingo Molnar

    Ingo Molnar
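    The conversion-order pitfall is easy to reproduce in isolation; a
    minimal demonstration (assumes an LP64 target, i.e. 32-bit int and
    64-bit long):

    ```c
    #include <assert.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int gran = 10000;  /* stands in for sysctl_sched_granularity */

        /* buggy order: -gran is computed in unsigned int arithmetic and
         * wraps to 4294957296u, which then converts to long unchanged on
         * LP64, turning the intended penalty into a huge positive value */
        long buggy = -gran;
        /* fixed order: convert to long first, then negate */
        long fixed = -(long)gran;

        printf("buggy=%ld fixed=%ld\n", buggy, fixed);
        assert(buggy == 4294957296L);   /* on LP64 */
        assert(fixed == -10000L);
        return 0;
    }
    ```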
     
  • current code:

    delta = calc_delta_mine(delta_exec, curr->load.weight, lw);
    delta = min((u64)delta, cfs_rq->sleeper_bonus);

    Notice that this calc_delta_mine() line is exactly delta_mine, which
    gives:

    delta = min((u64)delta_mine, cfs_rq->sleeper_bonus);

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • current code:

    delta = min(cfs_rq->sleeper_bonus, (u64)delta_exec);
    delta = calc_delta_mine(delta, curr->load.weight, lw);
    delta = min((u64)delta, cfs_rq->sleeper_bonus);

    drop the first min(), because we clip against sleeper_bonus in the 3rd line
    again. That gives:

    delta = calc_delta_mine(delta_exec, curr->load.weight, lw);
    delta = min((u64)delta, cfs_rq->sleeper_bonus);

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • make the bonus balance more consistent: do not hand out a bonus if
    there's too much in flight already, and only deduct as much from a
    runner as it has the capacity to cover. This makes the bonus engine
    a zero-sum game (as intended).

    this also simplifies the code:

    text data bss dec hex filename
    34770 2998 24 37792 93a0 sched.o.before
    34749 2998 24 37771 938b sched.o.after

    and it also avoids overscheduling in sleep-happy workloads like
    hackbench.c.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
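    The zero-sum rule can be sketched with a toy pool (illustrative names,
    not the kernel's sleeper_bonus code): grants stop at the in-flight
    limit, and deductions never exceed what the runner can pay or what is
    actually in flight:

    ```c
    #include <assert.h>

    /* Toy model of a zero-sum bonus pool: handed-out bonus is tracked as
     * "in flight", grants are refused beyond the limit, and deductions
     * are capped by both the runner's capacity and the in-flight amount,
     * so the books always balance. Illustrative only. */
    struct bonus_pool { long in_flight, limit; };

    static long grant_bonus(struct bonus_pool *p, long want)
    {
        long room = p->limit - p->in_flight;
        long give = room > 0 ? (want < room ? want : room) : 0;
        p->in_flight += give;
        return give;
    }

    static long deduct_bonus(struct bonus_pool *p, long capacity, long want)
    {
        long take = want < capacity ? want : capacity;  /* runner can pay */
        if (take > p->in_flight)
            take = p->in_flight;                        /* pool can cover */
        p->in_flight -= take;
        return take;
    }

    int main(void)
    {
        struct bonus_pool pool = { 0, 40 };

        assert(grant_bonus(&pool, 30) == 30);      /* room for all of it */
        assert(grant_bonus(&pool, 30) == 10);      /* only 10 left in budget */
        assert(pool.in_flight == 40);
        assert(deduct_bonus(&pool, 25, 40) == 25); /* capped by capacity */
        assert(pool.in_flight == 15);
        return 0;
    }
    ```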
     
  • remove HZ dependency from the granularity default. Use 10 msec for
    the base granularity, 1 msec for wakeup granularity and 25 msec for
    batch wakeup granularity. (These defaults are close to the values
    that the previous HZ=250 default produced, and HZ=250 is the most
    common setting.)

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • when building with CONFIG_FAIR_GROUP_SCHED=y, I needed the following
    change to make things right.

    [ From: mingo@elte.hu ]

    this config option is not upstream-configurable right now, but let's
    fix this for completeness.

    Signed-off-by: Bruce Ashfield
    Signed-off-by: Ingo Molnar

    Bruce Ashfield
     

13 Aug, 2007

1 commit

  • Peter Zijlstra noticed that the sleeper bonus deduction code
    was not properly rate-limited: a task that scheduled more
    frequently would get a disproportionately large deduction.
    So limit the deduction to delta_exec.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
