20 Sep, 2007

2 commits

  • When using rt_mutex, a NULL pointer dereference occurs in
    enqueue_task_rt(). Here is the scenario:
    1) There are two threads: thread A is in fair_sched_class and
    thread B is in rt_sched_class.
    2) Thread A is boosted up to rt_sched_class, because thread A
    holds an rt_mutex lock and thread B is waiting for that lock.
    3) At this point, when thread A creates a new thread C, thread
    C gets rt_sched_class.
    4) When wake_up_new_task() is done for thread C, the priority
    of thread C is outside the RT priority range, because the
    normal priority of thread A is not an RT priority. This
    corrupts data by overflowing the rt_prio_array.
    The new thread C should be in fair_sched_class.

    The new thread must have a valid scheduler class before it is
    queued. This patch sets the appropriate scheduler class, as shown
    in the sketch below.

    Signed-off-by: Hiroshi Shimamoto
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Hiroshi Shimamoto
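
    A minimal sketch of the fix described above, assuming it lands in the
    fork path (sched_fork()); normal_prio, rt_prio() and fair_sched_class
    are names from the scheduler code of that era, and the exact placement
    is an assumption:

        /*
         * Do not leak the parent's PI-boosted priority into the child:
         * base the child's class on the parent's normal (un-boosted)
         * priority, so a boosted fair_sched_class parent still forks a
         * fair_sched_class child.
         */
        p->prio = current->normal_prio;
        if (!rt_prio(p->prio))
                p->sched_class = &fair_sched_class;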
     
  • add /proc/sys/kernel/sched_compat_yield to make sys_sched_yield()
    more aggressive, by moving the yielding task to the last position
    in the rbtree (a sketch follows this entry).

    with sched_compat_yield=0:

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    2539 mingo 20 0 1576 252 204 R 50 0.0 0:02.03 loop_yield
    2541 mingo 20 0 1576 244 196 R 50 0.0 0:02.05 loop

    with sched_compat_yield=1:

    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
    2584 mingo 20 0 1576 248 196 R 99 0.0 0:52.45 loop
    2582 mingo 20 0 1576 256 204 R 0 0.0 0:00.00 loop_yield

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
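
    A rough sketch of the compat-yield behaviour described above (not the
    actual patch; requeue_behind_rightmost() is a placeholder helper):

        static void yield_task_fair(struct rq *rq, struct task_struct *p)
        {
                struct cfs_rq *cfs_rq = task_cfs_rq(p);

                if (!sysctl_sched_compat_yield) {
                        /* default behaviour: a plain requeue */
                        dequeue_entity(cfs_rq, &p->se, 0);
                        enqueue_entity(cfs_rq, &p->se, 0);
                        return;
                }
                /*
                 * Aggressive (compat) behaviour: queue the yielding task
                 * behind the rightmost entry of the rbtree, so every
                 * other runnable task gets to run before it does.
                 */
                requeue_behind_rightmost(cfs_rq, &p->se);
        }

    The tunable can be flipped at run time with
    echo 1 > /proc/sys/kernel/sched_compat_yield.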
     

05 Sep, 2007

3 commits

  • rename RSR to SRR - 'RSR' is already defined on xtensa.

    found by Adrian Bunk.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • the cfs_rq->wait_runtime debug/statistics counter was not maintained
    properly - fix this.

    this also removes some code:

    text data bss dec hex filename
    13420 228 1204 14852 3a04 sched.o.before
    13404 228 1204 14836 39f4 sched.o.after

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
     
  • First fix the check
    if (*imbalance + SCHED_LOAD_SCALE_FUZZ < busiest_load_per_task)
    with this
    if (*imbalance < busiest_load_per_task)

    The current check is always false for nice-0 tasks (since
    SCHED_LOAD_SCALE_FUZZ is the same as busiest_load_per_task for
    nice-0 tasks).

    With the above change, imbalance was getting reset to 0 in that
    corner case, making the FUZZ logic fail. Fix it by not corrupting
    the imbalance, and change the imbalance only when the HT/MC
    optimization is found to be needed (see the sketch below).

    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar

    Suresh Siddha
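
    A sketch of the intended control flow (illustrative only; pwr_now and
    pwr_move stand in for the power-comparison values computed earlier in
    find_busiest_group()):

        if (*imbalance < busiest_load_per_task) {
                /*
                 * Small-imbalance corner case: only bump the imbalance
                 * when the HT/MC power comparison says that moving one
                 * whole task is a win ...
                 */
                if (pwr_move > pwr_now)
                        *imbalance = busiest_load_per_task;
                /*
                 * ... and otherwise leave *imbalance untouched instead
                 * of resetting it to 0.
                 */
        }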
     

28 Aug, 2007

1 commit

  • de-HZ-ification of the granularity defaults unearthed a pre-existing
    property of CFS: while it correctly converges to the granularity goal,
    it does not prevent run-time fluctuations in the range of
    [-gran ... 0 ... +gran].

    With the increase of the granularity due to the removal of HZ
    dependencies, this becomes visible in chew-max output (with 5 tasks
    running):

    out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40
    out: 27 . 27. 32 | flu: 0 . 0 | ran: 17 . 13 | per: 44 . 40
    out: 27 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 36 . 40
    out: 29 . 27. 32 | flu: 2 . 0 | ran: 17 . 13 | per: 46 . 40
    out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40
    out: 29 . 27. 32 | flu: 0 . 0 | ran: 18 . 13 | per: 47 . 40
    out: 28 . 27. 32 | flu: 0 . 0 | ran: 9 . 13 | per: 37 . 40

    average slice is the ideal 13 msecs and the period is picture-perfect 40
    msecs. But the 'ran' field fluctuates around 13.33 msecs and there's no
    mechanism in CFS to keep that from happening: it's a perfectly valid
    solution that CFS finds.

    to fix this we add a granularity/preemption rule that knows about
    the "target latency", which makes tasks that run longer than the ideal
    latency run a bit less. The simplest approach is to simply decrease the
    preemption granularity when a task overruns its ideal latency. For this
    we have to track how much the task has executed since its last
    preemption (a sketch follows this entry).

    ( this adds a new field to task_struct, but we can eliminate that
    overhead in 2.6.24 by putting all the scheduler timestamps into an
    anonymous union. )

    with this change in place, chew-max output is fluctuation-less all
    around:

    out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 2 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40
    out: 28 . 27. 39 | flu: 0 . 1 | ran: 13 . 13 | per: 41 . 40

    this patch has no impact on any fastpath or on any globally observable
    scheduling property. (unless you have sharp enough eyes to see
    millisecond-level ruckles in glxgears smoothness :-)

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Mike Galbraith

    Ingo Molnar
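
    A condensed sketch of the overrun rule described in this entry (field
    and variable names are illustrative, not the exact patch):

        /*
         * How long has the current task been running since it was last
         * preempted? (tracked via the new per-task timestamp)
         */
        delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

        /*
         * If the task has already overrun its ideal slice, shrink the
         * effective preemption granularity so it is preempted earlier
         * and the [-gran ... +gran] fluctuation cancels out over time.
         */
        gran = sysctl_sched_granularity;
        if (delta_exec > ideal_slice)
                gran = 0;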
     

26 Aug, 2007

3 commits

  • runtime limit and wakeup granularity used to be a function of the
    granularity; that was incorrectly changed to sched_latency.

    Fix this by making the wakeup granularity a function of the
    min-granularity, and the runtime limit equal to the latency.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • due to adaptive granularity scheduling the role of sched_granularity
    has changed to "minimum granularity", so rename the variable (and the
    tunable) accordingly.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra

    Ingo Molnar
     
  • Instead of specifying the preemption granularity, specify the wanted
    latency. By fixing the granularity to a constant, the wakeup latency
    becomes a function of the number of running tasks on the rq.

    Invert this relation.

    sysctl_sched_granularity becomes a minimum for the dynamic granularity
    computed from the new sysctl_sched_latency.

    Then use this latency to make more intelligent granularity decisions:
    if fewer tasks are running, we can schedule more coarsely. This helps
    performance while still always meeting the latency target (see the
    sketch below).

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
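
    A sketch of the inverted relation (simplified; the real code may shape
    the curve differently):

        static u64 sched_granularity(struct cfs_rq *cfs_rq)
        {
                u64 gran = sysctl_sched_latency;

                /* spread the latency target over the runnable tasks ... */
                if (cfs_rq->nr_running > 1)
                        gran /= cfs_rq->nr_running;

                /* ... but never go below the configured minimum */
                if (gran < sysctl_sched_granularity)
                        gran = sysctl_sched_granularity;

                return gran;
        }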
     

25 Aug, 2007

2 commits

  • Remove trivial conditional branch in Linux scheduler's
    can_migrate_task() function.

    text data bss dec hex filename
    34770 2998 24 37792 93a0 sched.o.before
    34757 2998 24 37779 9393 sched.o.after

    Signed-off-by: Sven-Thorsten Dietrich
    Signed-off-by: Ingo Molnar

    Sven-Thorsten Dietrich
     
  • remove HZ dependency from the granularity default. Use 10 msec for
    the base granularity, 1 msec for wakeup granularity and 25 msec for
    batch wakeup granularity. (These defaults are close to the values
    that the default HZ=250 setting produced previously, and HZ=250 is
    the most common setting.)

    Signed-off-by: Ingo Molnar

    Ingo Molnar
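
    Expressed directly in nanoseconds, the new HZ-independent defaults
    look roughly like this (variable names and units are illustrative):

        unsigned int sysctl_sched_granularity              = 10000000;  /* 10 msec */
        unsigned int sysctl_sched_wakeup_granularity       =  1000000;  /*  1 msec */
        unsigned int sysctl_sched_batch_wakeup_granularity = 25000000;  /* 25 msec */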
     

23 Aug, 2007

5 commits

  • Michael Gerdau reported reniced task CPU usage weirdnesses.
    Such symptoms can be caused by limit underruns, so double the
    sched_runtime_limit.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Was playing with sched_smt_power_savings/sched_mc_power_savings and
    found out that while the scheduler domains are reconstructed when
    sysfs settings change, rebalance_domains() can get triggered with a
    null domain on other cpus, which sets next_balance to jiffies +
    60*HZ. This results in no idle/busy balancing for 60 seconds.

    Fix this, as sketched below.

    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar

    Suresh Siddha
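
    A sketch of the guard (close to the idea described above, though the
    exact placement in rebalance_domains() is an assumption):

        struct sched_domain *sd;
        int update_next_balance = 0;
        unsigned long next_balance = jiffies + 60*HZ;

        for_each_domain(cpu, sd) {
                /* regular balancing work goes here; it computes a
                   nearer balancing interval */
                update_next_balance = 1;
        }

        /*
         * Only push rq->next_balance out when a domain was actually
         * walked; with a NULL domain (domains being rebuilt via sysfs)
         * keep the old, near-term value so balancing resumes promptly.
         */
        if (likely(update_next_balance))
                rq->next_balance = next_balance;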
     
  • On a four package system with HT - HT load balancing optimizations were
    broken. For example, if two tasks end up running on two logical threads
    of one of the packages, scheduler is not able to pull one of the tasks
    to a completely idle package.

    In this scenario, for nice-0 tasks, the imbalance calculated by the
    scheduler will be 512 and find_busiest_queue() will return 0 (as each
    cpu's load is 1024 > imbalance and each cpu has only one task
    running); the numbers are worked out in the sketch below.

    Similarly, the MC scheduler optimizations are also fixed by this
    patch.

    [ mingo@elte.hu: restored fair balancing by increasing the fuzz and
    adding it back to the power decision, without the /2
    factor. ]

    Signed-off-by: Suresh Siddha
    Signed-off-by: Ingo Molnar

    Suresh Siddha
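
    The numbers from the HT scenario, against the single-task skip test in
    find_busiest_queue() (sketch; treat the exact condition as an
    approximation):

        /*
         * Two nice-0 tasks on two siblings of one package, another
         * package idle: SCHED_LOAD_SCALE = 1024, imbalance = 512.
         * A runqueue holding a single task whose weighted load exceeds
         * the requested imbalance is skipped:
         */
        if (rq->nr_running == 1 && wl > imbalance)
                continue;       /* 1024 > 512 -> nothing ever gets moved */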
     
  • There are two remaining gotchas (a sketch of the corrected entry
    follows below):

    - The directories have impossible permissions (they are writable).

    - The ctl_name for the kernel directory is inconsistent with
    everything else. It should be CTL_KERN.

    Signed-off-by: Eric W. Biederman
    Signed-off-by: Ingo Molnar

    Eric W. Biederman
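
    For illustration, a corrected ctl_table entry for the kernel directory
    would look roughly like this (sketch, not the actual tree):

        {
                .ctl_name = CTL_KERN,   /* consistent with everything else */
                .procname = "kernel",
                .mode     = 0555,       /* directories must not be writable */
                .child    = kern_table,
        },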
     
  • construct a more or less wall-clock time out of sched_clock(), by
    using ACPI-idle's existing knowledge about how much time we spent
    idling. This allows the rq clock to work around TSC-stops-in-C2,
    TSC-gets-corrupted-in-C3 type of problems.

    ( Besides the scheduler's statistics, this also benefits blktrace
    and printk timestamps. )

    Furthermore, the precise before-C2/C3-sleep and after-C2/C3-wakeup
    callbacks allow the scheduler to get the most out of the period
    during which the CPU has a reliable TSC. This results in slightly
    more precise task statistics.

    the ACPI bits were acked by Len.

    Signed-off-by: Ingo Molnar
    Acked-by: Len Brown

    Ingo Molnar
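
    A sketch of how an idle routine brackets a deep C-state with such
    callbacks (function names and signatures are assumptions based on the
    description above; enter_c2_or_c3() is a placeholder):

        ktime_t t1, t2;

        t1 = ktime_get();
        sched_clock_idle_sleep_event();     /* TSC about to stop or drift */
        enter_c2_or_c3();
        t2 = ktime_get();
        sched_clock_idle_wakeup_event(ktime_to_ns(ktime_sub(t2, t1)));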
     

13 Aug, 2007

2 commits


11 Aug, 2007

1 commit

  • improve the rq-clock overflow logic: limit the absolute rq->clock
    delta since the last scheduler tick, instead of limiting the delta
    itself.

    tested by Arjan van de Ven: his whole laptop was misbehaving due to
    an incorrectly calibrated cpu_khz confusing sched_clock().

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven

    Ingo Molnar
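
    A sketch of the clamp described above (illustrative): instead of
    limiting each individual delta, never let rq->clock run more than
    one tick ahead of the last scheduler tick:

        if (unlikely(clock + delta > rq->tick_timestamp + TICK_NSEC))
                clock = rq->tick_timestamp + TICK_NSEC;
        else
                clock += delta;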
     

09 Aug, 2007

21 commits