26 Jan, 2011

3 commits

  • The delta in clock_task is a fairer attribution of how much time a tg has
    been contributing load to the current cpu.

    While not really important, it also means we're more in sync (by magnitude)
    with respect to periodic updates (since __update_curr deltas are clock_task
    based).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Since updates are against an entity's queuing cfs_rq, it's not possible to
    enter update_cfs_{shares,load} with a NULL cfs_rq. (Indeed, update_cfs_load
    would crash before reaching the check anyway, since the load is examined
    during the initializers.)

    Also, in the update_cfs_load case there's no point in maintaining averages
    for rq->cfs_rq since we don't perform shares distribution at that level --
    the NULL check is replaced accordingly.

    Thanks to Dan Carpenter for pointing out the dereference before the NULL
    check (a small standalone sketch of that pattern follows this entry).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
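
    The pattern flagged here is generic: dereferencing a pointer in the
    initializers and only then testing it for NULL means the test is either
    dead or too late. A minimal standalone illustration of the anti-pattern
    and its fix (not the kernel code itself):

        #include <stdio.h>

        struct cfs_rq_like { long load_weight; };

        /* Anti-pattern: the initializer dereferences 'q' before the NULL
         * check, so a NULL argument crashes before the check is reached. */
        static long broken(struct cfs_rq_like *q)
        {
                long w = q->load_weight;        /* dereference happens here */

                if (!q)                         /* dead / too-late check */
                        return 0;
                return w;
        }

        /* Fixed: test first -- or, as in the entry above, drop the test
         * entirely when callers can never pass NULL. */
        static long fixed(struct cfs_rq_like *q)
        {
                if (!q)
                        return 0;
                return q->load_weight;
        }

        int main(void)
        {
                struct cfs_rq_like q = { .load_weight = 1024 };

                printf("%ld %ld\n", broken(&q), fixed(&q));
                return 0;
        }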
     
  • While care is taken around the zero-point in effective_load to not exceed
    the instantaneous rq->weight, it's still possible (e.g. using wake_idx != 0)
    for (load + effective_load) to underflow.

    In this case, comparing the unsigned values can result in incorrect balance
    decisions (a standalone sketch of the pitfall follows this entry).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
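
    A minimal standalone sketch of the pitfall (illustrative numbers, not
    the scheduler code): once a negative adjustment is folded into an
    unsigned load value the sum wraps to a huge number and the balance
    comparison goes the wrong way; doing the comparison in signed
    arithmetic restores a sensible decision.

        #include <stdio.h>

        int main(void)
        {
                unsigned long this_load = 512;   /* load on the waking cpu   */
                unsigned long prev_load = 2048;  /* load on the previous cpu */
                long this_eff = -1024;   /* effective_load() can be negative */

                /* Buggy: the signed adjustment is converted to unsigned,
                 * the sum underflows and looks enormous. */
                unsigned long buggy = this_load + this_eff;
                printf("unsigned: this=%lu prev=%lu -> this<=prev is %d\n",
                       buggy, prev_load, buggy <= prev_load);

                /* Safer: compare as signed values. */
                long s_this = (long)this_load + this_eff;
                printf("signed:   this=%ld prev=%ld -> this<=prev is %d\n",
                       s_this, (long)prev_load, s_this <= (long)prev_load);
                return 0;
        }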
     

24 Jan, 2011

1 commit

  • Michael Witten and Christian Kujau reported that the autogroup
    scheduling feature hurts interactivity on their UP systems.

    It turns out that this is an older bug in the group scheduling code,
    and the wider appeal provided by the autogroup feature exposed it
    more prominently.

    On UP with FAIR_GROUP_SCHED enabled, tuning shares only affects
    tg->shares, but is not reflected in tg->se->load. The reason is that
    update_cfs_shares() does nothing on UP.

    So introduce update_cfs_shares() for UP && FAIR_GROUP_SCHED (a simplified
    sketch of the effect follows this entry).

    This issue was found when autogroup scheduling was enabled, but it is an
    older bug that also exists with cgroup cpu on UP.

    Reported-and-Tested-by: Michael Witten
    Reported-and-Tested-by: Christian Kujau
    Signed-off-by: Yong Zhang
    Acked-by: Pekka Enberg
    Acked-by: Mike Galbraith
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Yong Zhang
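
    A simplified, standalone sketch of the effect described above (made-up
    struct and function names, not the kernel's definitions): on SMP the
    shares update recomputes the group entity's weight from tg->shares,
    while the UP build compiled that helper away, so writes to tg->shares
    never reached the entity's load.

        #include <stdio.h>

        struct load_weight { unsigned long weight; };
        struct sched_entity_like { struct load_weight load; };
        struct task_group_like {
                unsigned long shares;
                struct sched_entity_like se;
        };

        /* Stand-in for what the shares update achieves: the group entity's
         * weight must follow tg->shares (the real code also scales by the
         * cpu's share of the group load). */
        static void update_group_entity_weight(struct task_group_like *tg)
        {
                tg->se.load.weight = tg->shares;
        }

        int main(void)
        {
                struct task_group_like tg = {
                        .shares = 1024,
                        .se.load.weight = 1024,
                };

                tg.shares = 2048;       /* e.g. writing cpu.shares */
                printf("before update: shares=%lu se.weight=%lu\n",
                       tg.shares, tg.se.load.weight);

                /* The entity weight only changes once the update runs --
                 * which the UP build previously never did. */
                update_group_entity_weight(&tg);
                printf("after  update: shares=%lu se.weight=%lu\n",
                       tg.shares, tg.se.load.weight);
                return 0;
        }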
     

18 Jan, 2011

2 commits

  • A signed/unsigned comparison may lead to a superfluous resched if leftmost
    is to the right of the current task, wasting a few cycles and inadvertently
    _lengthening_ the current task's slice.

    Reported-by: Venkatesh Pallipadi
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Previously effective_load would approximate the global load weight present on
    a group taking advantage of:

    entity_weight = tg->shares * (lw / global_lw), where entity_weight was
    provided by tg_shares_up.

    This worked (approximately) for an 'empty' (at tg level) cpu since we would
    place boost load representative of what a newly woken task would receive.

    However, now that load is instantaneously updated this assumption is no longer
    true and the load calculation is rather incorrect in this case.

    Fix this (and improve the general case) by rewriting effective_load to take
    advantage of the new shares distribution code (a rough numeric sketch of
    the approximation follows this entry).

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
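
    A rough numeric sketch of the approximation above (standalone, made-up
    numbers): a group entity's weight on a cpu is roughly tg->shares scaled
    by that cpu's fraction of the global group load, so the wakeup-balance
    code can estimate the weight change caused by placing wl of extra load
    here.

        #include <stdio.h>

        /* entity_weight ~= shares * (local_load / global_load) */
        static long entity_weight(long shares, long local_load,
                                  long global_load)
        {
                if (global_load <= 0)
                        return shares;
                return shares * local_load / global_load;
        }

        int main(void)
        {
                long shares = 1024;  /* tg->shares                        */
                long lw     = 512;   /* this cpu's contribution (lw)      */
                long glw    = 4096;  /* global group load (global_lw)     */
                long wl     = 1024;  /* load a newly woken task would add */

                long before = entity_weight(shares, lw, glw);
                long after  = entity_weight(shares, lw + wl, glw + wl);

                /* effective_load() cares about the *change* in the group
                 * entity's weight caused by the placement. */
                printf("weight before=%ld after=%ld delta=%ld\n",
                       before, after, after - before);
                return 0;
        }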
     

19 Dec, 2010

2 commits

  • Mike Galbraith reported poor interactivity[*] when the new shares distribution
    code was combined with autogroups.

    The root cause turns out to be a mis-ordering of accounting accrued execution
    time and shares updates. Since update_curr() is issued hierarchically,
    updating the parent entity weights to reflect child enqueue/dequeue results in
    the parent's unaccounted execution time then being accrued (vs vruntime) at the
    new weight as opposed to the weight present at accumulation.

    While this doesn't have much effect on processes with timeslices that cross a
    tick, it is particularly problematic for an interactive process (e.g. Xorg)
    which incurs many (tiny) timeslices. In this scenario almost all updates are
    at dequeue which can result in significant fairness perturbation (especially if
    it is the only thread, resulting in potential {tg->shares, MIN_SHARES}
    transitions).

    Correct this by ensuring unaccounted time is accumulated prior to
    manipulating an entity's weight (a small numeric sketch follows this
    entry).

    [*] http://xkcd.com/619/ is perversely Nostradamian here.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Linus Torvalds
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
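
    A small numeric sketch of why the ordering matters (standalone,
    simplified from the CFS rule that vruntime advances by
    delta_exec * NICE_0_LOAD / weight): time that ran at the old weight
    must be accrued at the old weight; accruing it only after the weight
    has been changed distorts vruntime, which is the dequeue-heavy Xorg
    case described above.

        #include <stdio.h>

        #define NICE_0_LOAD 1024L

        /* vruntime advances by delta_exec scaled inversely with weight. */
        static long calc_delta(long delta_exec, long weight)
        {
                return delta_exec * NICE_0_LOAD / weight;
        }

        int main(void)
        {
                long delta_exec = 3000;  /* unaccounted execution time (ns) */
                long old_weight = 1024;  /* weight while that time ran      */
                long new_weight = 2;     /* e.g. a MIN_SHARES-ish transition */

                /* Correct: accrue the pending time at the weight it ran at,
                 * then change the weight. */
                long correct = calc_delta(delta_exec, old_weight);

                /* Mis-ordered: the weight is changed first, so the same
                 * pending time is accrued at the new weight. */
                long wrong = calc_delta(delta_exec, new_weight);

                printf("vruntime credit: correct=%ld mis-ordered=%ld\n",
                       correct, wrong);
                return 0;
        }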
     
  • Long running entities that do not block (dequeue) require periodic updates to
    maintain accurate share values. (Note: group entities with several threads are
    quite likely to be non-blocking in many circumstances).

    By virtue of being long-running however, we will see entity ticks (otherwise
    the required update occurs in dequeue/put and we are done). Thus we can move
    the detection (and associated work) for these updates into the periodic path.

    This restores the 'atomicity' of update_curr() with respect to accounting.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     

09 Dec, 2010

1 commit


26 Nov, 2010

1 commit


23 Nov, 2010

1 commit


18 Nov, 2010

12 commits

  • Refactor the global load updates from update_shares_cpu() so that
    update_cfs_load() can update global load when it is more than ~10%
    out of sync.

    The new global_load parameter allows us to force an update, regardless of
    the error factor so that we can synchronize w/ update_shares().

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • When the system is busy, dilation of rq->next_balance makes lb->update_shares()
    insufficiently frequent for threads which don't sleep (no dequeue/enqueue
    updates). Adjust for this by making demand based updates based on the
    accumulation of execution time sufficient to wrap our averaging window.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Since shares updates are no longer expensive and effectively local, update them
    at idle_balance(). This allows us to more quickly redistribute shares to
    another cpu when our load becomes idle.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Introduce a new sysctl for the shares window and disambiguate it from
    sched_time_avg.

    A 10ms window appears to be a good compromise between accuracy and performance.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Avoid duplicate shares update calls by ensuring children always appear before
    parents in rq->leaf_cfs_rq_list.

    This allows us to do a single in-order traversal for update_shares().

    Since we always enqueue in bottom-up order this reduces to 2 cases:

    1) Our parent is already in the list, e.g.

         root
           \
            b
           / \
          c   d* (root->b->c already enqueued)

    Since d's parent is enqueued we push it to the head of the list,
    implicitly ahead of b.

    2) Our parent does not appear in the list (or we have no parent)

    In this case we enqueue to the tail of the list; if our parent is
    subsequently enqueued (bottom-up) it will appear to our right by the
    same rule.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
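
    A standalone sketch of the two-case insertion rule above (toy array
    "list" and made-up names, not the kernel's list_head code): insert at
    the head if our parent is already on the list, otherwise at the tail,
    which keeps every child to the left of its parent.

        #include <stdio.h>
        #include <string.h>

        #define MAX 8

        static const char *leaf_list[MAX];   /* head at index 0 */
        static int len;

        static int parent_on_list(const char *parent)
        {
                for (int i = 0; i < len; i++)
                        if (parent && !strcmp(leaf_list[i], parent))
                                return 1;
                return 0;
        }

        /* The rule from the entry above: head if our parent is already
         * queued, tail otherwise (a later-enqueued parent then lands to
         * our right by the same rule). */
        static void enqueue_leaf(const char *name, const char *parent)
        {
                if (parent_on_list(parent)) {
                        memmove(&leaf_list[1], &leaf_list[0],
                                len * sizeof(leaf_list[0]));
                        leaf_list[0] = name;
                } else {
                        leaf_list[len] = name;
                }
                len++;
        }

        int main(void)
        {
                /* Hierarchy root->b->{c,d}, enqueued bottom-up. */
                enqueue_leaf("c", "b");      /* parent not queued -> tail */
                enqueue_leaf("b", "root");   /* parent not queued -> tail */
                enqueue_leaf("root", NULL);  /* no parent -> tail         */
                enqueue_leaf("d", "b");      /* parent queued -> head     */

                for (int i = 0; i < len; i++)
                        printf("%s ", leaf_list[i]);  /* children first */
                printf("\n");
                return 0;
        }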
     
  • Using cfs_rq->nr_running is not sufficient to synchronize update_cfs_load with
    the put path since nr_running accounting occurs at deactivation.

    It's also not safe to make the removal decision based on load_avg as this fails
    with both high periods and low shares. Resolve this by clipping history after
    4 periods without activity.

    Note: the above will always occur from update_shares() since in the
    last-task-sleep-case that task will still be cfs_rq->curr when update_cfs_load
    is called.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • As part of enqueue_entity both a new entity weight and its contribution to the
    queuing cfs_rq / rq are updated. Since update_cfs_shares will only update the
    queueing weights when the entity is on_rq (which in this case it is not yet),
    there's a dependency loop here:

    update_cfs_shares needs account_entity_enqueue to update cfs_rq->load.weight
    account_entity_enqueue needs the updated weight for the queuing cfs_rq load[*]

    Fix this and avoid spurious dequeue/enqueues by issuing update_cfs_shares as
    if we had accounted the enqueue already.

    This was also resulting in rq->load corruption previously.

    [*]: this dependency also exists when using the group cfs_rq w/
    update_cfs_shares as the weight of the enqueued entity changes
    without the load being updated.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Turner
     
  • Make tg_shares_up() use the active cgroup list; this means we cannot
    do a strict bottom-up walk of the hierarchy, but assuming it's a very
    wide tree with a small number of active groups it should be a win.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Make certain load-balance actions scale per number of active cgroups
    instead of the number of existing cgroups.

    This makes wakeup/sleep paths more expensive, but is a win for systems
    where the vast majority of existing cgroups are idle.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • By tracking a per-cpu load-avg for each cfs_rq and folding it into a
    global task_group load on each tick we can rework tg_shares_up to be
    strictly per-cpu.

    This should improve cpu-cgroup performance for smp systems
    significantly.

    [ Paul: changed to use queueing cfs_rq + bug fixes ]

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • An earlier commit reverts idle balancing throttling reset to fix a 30%
    regression in volanomark throughput. We still need to reset idle_stamp
    when we pull a task in newidle balance.

    Reported-by: Alex Shi
    Signed-off-by: Nikhil Rao
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Nikhil Rao
     
  • Commit fab4762 triggers excessive idle balancing, causing a ~30% loss in
    volanomark throughput. Remove idle balancing throttle reset.

    Originally-by: Alex Shi
    Signed-off-by: Mike Galbraith
    Acked-by: Nikhil Rao
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Alex Shi
     

11 Nov, 2010

2 commits

  • Instead of dealing with sched classes inside each check_preempt_curr()
    implementation, pull out this logic into the generic wakeup preemption
    path.

    This fixes a hang in KVM (and others) where we are waiting for the
    stop machine thread to run ...

    Reported-by: Markus Trippelsdorf
    Tested-by: Marcelo Tosatti
    Tested-by: Sergey Senozhatsky
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Currently we consider a sched domain to be well balanced when the imbalance
    is less than the domain's imbalance_pct. As the number of cores and threads
    increases, current values of imbalance_pct (for example 25% for a
    NUMA domain) are not enough to detect imbalances like:

    a) On a WSM-EP system (two sockets, each having 6 cores and 12 logical
    threads), 24 cpu-hogging tasks get scheduled as 13 on one socket and 11 on
    the other, leading to an idle HT cpu.

    b) On a hypothetical 2-socket NHM-EX system (each socket having 8 cores and
    16 logical threads), 16 cpu-hogging tasks can get scheduled as 9 on one
    socket and 7 on the other, leaving one core in one socket idle while a core
    in the other socket has both its HT siblings busy.

    While this issue can be fixed by decreasing the domain's imbalance_pct
    (by making it a function of number of logical cpus in the domain), it
    can potentially cause more task migrations across sched groups in an
    overloaded case.

    Fix this by using imbalance_pct only during newly_idle and busy load
    balancing. During idle load balancing, instead check whether there is an
    imbalance in the number of idle cpus between the busiest group and this
    sched_group, or whether the busiest group has more tasks than its weight
    which the idle cpu in this_group can pull (a back-of-the-envelope check of
    case (a) follows this entry).

    Reported-by: Nikhil Rao
    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Suresh Siddha
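
    A back-of-the-envelope check of case (a) above (standalone; the loads
    are simply taken as task counts, and the percentage-style comparison is
    a simplification of the real check):

        #include <stdio.h>

        /* Call the domain balanced when busiest <= local * pct / 100. */
        static int considered_balanced(int busiest, int local, int pct)
        {
                return 100 * busiest <= pct * local;
        }

        int main(void)
        {
                /* 24 cpu hogs split 13/11 across two sockets: 13/11 is
                 * ~118%, under the 125% NUMA threshold, so the domain is
                 * reported balanced even though an HT sibling sits idle. */
                printf("13 vs 11, pct=125 -> balanced=%d\n",
                       considered_balanced(13, 11, 125));
                return 0;
        }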
     

29 Oct, 2010

1 commit


22 Oct, 2010

2 commits

  • Dima noticed that we fail to correct the ->vruntime of sleeping tasks
    when we move them between cgroups.

    Reported-by: Dima Zavin
    Signed-off-by: Peter Zijlstra
    Tested-by: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (29 commits)
    sched: Export account_system_vtime()
    sched: Call tick_check_idle before __irq_enter
    sched: Remove irq time from available CPU power
    sched: Do not account irq time to current task
    x86: Add IRQ_TIME_ACCOUNTING
    sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time
    sched: Add a PF flag for ksoftirqd identification
    sched: Consolidate account_system_vtime extern declaration
    sched: Fix softirq time accounting
    sched: Drop group_capacity to 1 only if local group has extra capacity
    sched: Force balancing on newidle balance if local group has capacity
    sched: Set group_imb only a task can be pulled from the busiest cpu
    sched: Do not consider SCHED_IDLE tasks to be cache hot
    sched: Drop all load weight manipulation for RT tasks
    sched: Create special class for stop/migrate work
    sched: Unindent labels
    sched: Comment updates: fix default latency and granularity numbers
    tracing/sched: Add sched_pi_setprio tracepoint
    sched: Give CPU bound RT tasks preference
    sched: Try not to migrate higher priority RT tasks
    ...

    Linus Torvalds
     

19 Oct, 2010

5 commits

  • The idea was suggested by Peter Zijlstra here:

    http://marc.info/?l=linux-kernel&m=127476934517534&w=2

    irq time is technically not available to the tasks running on the CPU.
    This patch removes irq time from CPU power piggybacking on
    sched_rt_avg_update().

    Tested this by keeping CPU X busy with a network-intensive task that has
    ~75% of a single CPU's worth of irq processing (hard+soft) on a 4-way
    system, and starting seven cycle soakers on the system. Without this
    change, there will be two tasks on each CPU. With this change, there is a
    single task on the irq-busy CPU X and the remaining 7 tasks are spread
    among the other 3 CPUs.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
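
    A tiny sketch of the scaling idea (standalone arithmetic, illustrative
    names rather than the scheduler's fixed-point code): if a cpu spends a
    fraction of its time in irq processing, the power the load balancer
    credits it with is scaled down by that fraction, so a cpu that is 75%
    busy with irqs looks like roughly a quarter of a cpu.

        #include <stdio.h>

        #define CPU_POWER_SCALE 1024  /* nominal power of one cpu */

        /* available ~= nominal * (1 - irq_fraction) */
        static long scale_power_by_irq(long power, long irq_pct)
        {
                return power * (100 - irq_pct) / 100;
        }

        int main(void)
        {
                /* The test above: one cpu ~75% busy in hard+soft irq. */
                printf("irq-busy cpu: %ld of %d\n",
                       scale_power_by_irq(CPU_POWER_SCALE, 75),
                       CPU_POWER_SCALE);

                /* An irq-free cpu keeps its full power. */
                printf("irq-free cpu: %ld of %d\n",
                       scale_power_by_irq(CPU_POWER_SCALE, 0),
                       CPU_POWER_SCALE);
                return 0;
        }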
     
  • The scheduler accounts both softirq and interrupt processing time to the
    currently running task. This means that if the interrupt processing was
    done on behalf of some other task in the system, the current task ends up
    being penalized, as it gets a shorter runtime than it otherwise would.

    Change sched task accounting to account only actual task time to the
    currently running task. update_curr() now computes delta_exec based on
    rq->clock_task.

    Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING case. We
    can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort, but
    that's for later.

    This change will impact scheduling behavior in interrupt heavy conditions.

    Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
    task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
    spending 75%+ of its time in irq processing. CPU 3 spending around 35%
    time running nc task.

    Now, if I run another CPU intensive task on CPU 2, without this change
    /proc//schedstat shows 100% of time accounted to this task. With this
    change, it rightly shows less than 25% accounted to this task as remaining
    time is actually spent on irq processing.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
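
    A minimal sketch of the accounting change (standalone; field names are
    illustrative): the per-rq task clock advances only by the portion of
    wall time not spent in irq context, and update_curr-style deltas are
    taken from that clock, so irq time is no longer charged to whichever
    task happened to be current.

        #include <stdio.h>

        struct rq_like {                       /* times in nanoseconds */
                unsigned long long clock;      /* raw sched clock      */
                unsigned long long clock_task; /* clock minus irq time */
                unsigned long long irq_time;   /* hard + soft irq time */
        };

        /* Only the non-irq part of each delta flows into clock_task,
         * roughly what CONFIG_IRQ_TIME_ACCOUNTING arranges. */
        static void update_clock(struct rq_like *rq,
                                 unsigned long long delta,
                                 unsigned long long irq_delta)
        {
                rq->clock += delta;
                rq->irq_time += irq_delta;
                rq->clock_task += delta - irq_delta;
        }

        int main(void)
        {
                struct rq_like rq = { 0, 0, 0 };

                /* A 10ms interval, 7.5ms of which went to irq processing. */
                update_clock(&rq, 10000000ULL, 7500000ULL);

                /* A task current for the whole interval is charged only the
                 * clock_task delta, i.e. the 2.5ms it actually ran. */
                printf("clock=%llu clock_task=%llu\n",
                       rq.clock, rq.clock_task);
                return 0;
        }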
     
  • When SD_PREFER_SIBLING is set on a sched domain, drop group_capacity to 1
    only if the local group has extra capacity. The extra check prevents the
    case where we always pull from the heaviest group when it is already
    under-utilized (possible when a large-weight task outweighs the tasks on
    the system).

    For example, consider a 16-cpu quad-core quad-socket machine with MC and NUMA
    scheduling domains. Let's say we spawn 15 nice0 tasks and one nice-15 task,
    and each task is running on one core. In this case, we observe the following
    events when balancing at the NUMA domain:

    - find_busiest_group() will always pick the sched group containing the niced
    task to be the busiest group.
    - find_busiest_queue() will then always pick one of the cpus running the
    nice0 task (never picks the cpu with the nice -15 task since
    weighted_cpuload > imbalance).
    - The load balancer fails to migrate the task since it is the running task
    and increments sd->nr_balance_failed.
    - It repeats the above steps a few more times until sd->nr_balance_failed > 5,
    at which point it kicks off the active load balancer, wakes up the migration
    thread and kicks the nice 0 task off the cpu.

    The load balancer doesn't stop until we kick out all nice 0 tasks from
    the sched group, leaving you with 3 idle cpus and one cpu running the
    nice -15 task.

    When balancing at the NUMA domain, we drop sgs.group_capacity to 1 if the child
    domain (in this case MC) has SD_PREFER_SIBLING set. Subsequent load checks are
    not relevant because the niced task has a very large weight.

    In this patch, we add an extra condition to the "if(prefer_sibling)" check in
    update_sd_lb_stats(). We drop the capacity of a group only if the local group
    has extra capacity, ie. nr_running < group_capacity. This patch preserves the
    original intent of the prefer_siblings check (to spread tasks across the system
    in low utilization scenarios) and fixes the case above.

    It helps in the following ways:
    - In low utilization cases (where nr_tasks << nr_cpus), we still drop
    group_capacity down to 1 if we prefer siblings.
    - On very busy systems (where nr_tasks >> nr_cpus), sgs.nr_running will most
    likely be > sgs.group_capacity.
    - When balancing large weight tasks, if the local group does not have extra
    capacity, we do not pick the group with the niced task as the busiest group.
    This prevents failed balances, active migration and the under-utilization
    described above.

    Signed-off-by: Nikhil Rao
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Nikhil Rao
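
    A condensed sketch of the added condition (standalone; the field names
    are an illustrative subset of the balance statistics): the
    prefer_sibling capacity clamp now fires only when the local group
    itself has spare capacity.

        #include <stdio.h>

        struct group_stats {
                unsigned int nr_running;
                unsigned int group_capacity;
        };

        /* "Extra capacity": fewer runnable tasks than capacity. */
        static int has_extra_capacity(const struct group_stats *g)
        {
                return g->nr_running < g->group_capacity;
        }

        /* The prefer_sibling clamp with the new local-group condition. */
        static unsigned int clamped_capacity(int prefer_sibling,
                                             const struct group_stats *local,
                                             const struct group_stats *busiest)
        {
                if (prefer_sibling && has_extra_capacity(local))
                        return 1;
                return busiest->group_capacity;
        }

        int main(void)
        {
                struct group_stats busiest = { 4, 4 };
                struct group_stats local_spare = { 3, 4 };
                struct group_stats local_full  = { 4, 4 };

                /* Low utilization: local group has room, clamp applies. */
                printf("local has room -> capacity=%u\n",
                       clamped_capacity(1, &local_spare, &busiest));

                /* The niced-task case: local group is full, don't clamp. */
                printf("local is full  -> capacity=%u\n",
                       clamped_capacity(1, &local_full, &busiest));
                return 0;
        }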
     
  • This patch forces a load balance on a newly idle cpu when the local group has
    extra capacity and the busiest group does not have any. It improves system
    utilization when balancing tasks with a large weight differential.

    Under certain situations, such as a niced down task (i.e. nice = -15) in the
    presence of nr_cpus NICE0 tasks, the niced task lands on a sched group and
    kicks away other tasks because of its large weight. This leads to sub-optimal
    utilization of the machine. Even though the sched group has capacity, it does
    not pull tasks because sds.this_load >> sds.max_load, and f_b_g() returns NULL.

    With this patch, if the local group has extra capacity, we shortcut the checks
    in f_b_g() and try to pull a task over. A sched group has extra capacity if the
    group capacity is greater than the number of running tasks in that group.

    Thanks to Mike Galbraith for discussions leading to this patch and for the
    insight to reuse SD_NEWIDLE_BALANCE.

    Signed-off-by: Nikhil Rao
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Nikhil Rao
     
  • When cycling through sched groups to determine the busiest group, set
    group_imb only if the busiest cpu has more than 1 runnable task. This patch
    fixes the case where two cpus in a group have one runnable task each, but
    there is a large weight differential between these two tasks. The load
    balancer is unable to migrate any task from this group, and hence does not
    consider this group to be imbalanced.

    Signed-off-by: Nikhil Rao
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    [ small code readability edits ]
    Signed-off-by: Ingo Molnar

    Nikhil Rao
     

14 Oct, 2010

2 commits


08 Oct, 2010

1 commit

  • > ===================================================
    > [ INFO: suspicious rcu_dereference_check() usage. ]
    > ---------------------------------------------------
    > /home/greearb/git/linux.wireless-testing/kernel/sched.c:618 invoked rcu_dereference_check() without protection!
    >
    > other info that might help us debug this:
    >
    > rcu_scheduler_active = 1, debug_locks = 1
    > 1 lock held by ifup/23517:
    > #0: (&rq->lock){-.-.-.}, at: [] task_fork_fair+0x3b/0x108
    >
    > stack backtrace:
    > Pid: 23517, comm: ifup Not tainted 2.6.36-rc6-wl+ #5
    > Call Trace:
    > [] ? printk+0xf/0x16
    > [] lockdep_rcu_dereference+0x74/0x7d
    > [] task_group+0x6d/0x79
    > [] set_task_rq+0xe/0x57
    > [] task_fork_fair+0x57/0x108
    > [] sched_fork+0x82/0xf9
    > [] copy_process+0x569/0xe8e
    > [] do_fork+0x118/0x262
    > [] ? do_page_fault+0x16a/0x2cf
    > [] ? up_read+0x16/0x2a
    > [] sys_clone+0x1b/0x20
    > [] ptregs_clone+0x15/0x30
    > [] ? sysenter_do_call+0x12/0x38

    Here a newly created task is having its runqueue assigned. The new task
    is not yet on the tasklist, so cannot go away. This is therefore a false
    positive, suppress with an RCU read-side critical section.

    Reported-by: Ben Greear
    Tested-by: Ben Greear

    Paul E. McKenney
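
    A generic sketch of the suppression technique (standalone, with stub
    macros standing in for the kernel primitives): wrap the lookup in an
    RCU read-side critical section so rcu_dereference_check() sees a
    legitimate protection context, even though the new task cannot go away
    here for other reasons.

        #include <stdio.h>

        /* Stubs so the sketch compiles on its own; in the kernel these are
         * the real rcu_read_lock()/rcu_read_unlock()/rcu_dereference(). */
        #define rcu_read_lock()    do { } while (0)
        #define rcu_read_unlock()  do { } while (0)
        #define rcu_dereference(p) (p)

        struct task_group_like { int id; };
        static struct task_group_like root_tg = { .id = 0 };
        static struct task_group_like *current_tg = &root_tg;

        /* The lookup the splat complains about: the child task is not yet
         * on the tasklist, so this is a false positive, but adding the
         * read-side critical section satisfies the lockdep check. */
        static int lookup_group_id(void)
        {
                int id;

                rcu_read_lock();
                id = rcu_dereference(current_tg)->id;
                rcu_read_unlock();
                return id;
        }

        int main(void)
        {
                printf("group id = %d\n", lookup_group_id());
                return 0;
        }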
     

21 Sep, 2010

2 commits

  • The scheduler uses cache_nice_tries as an indicator to do cache_hot and
    active load balance when normal load balance fails. Currently, this value
    is changed on any failed load balance attempt. That ends up being not so
    nice to workloads that enter/exit idle often, as they do more frequent
    new_idle balance, and that pretty soon results in cache-hot tasks being
    pulled in.

    Making the cache_nice_tries ignore failed new_idle balance seems to
    make better sense. With that only the failed load balance in
    periodic load balance gets accounted and the rate of accumulation
    of cache_nice_tries will not depend on idle entry/exit (short
    running sleep-wakeup kind of tasks). This reduces movement of
    cache_hot tasks.

    schedstat diff (after-before) excerpt from a workload that has a frequent
    and short wakeup-idle pattern (:2 in the cpu column below refers to the
    NEWIDLE idx). This snapshot was taken across ~400 seconds.

    Without this change:
    domainstats: domain0
    cpu cnt bln fld imb gain hgain nobusyq nobusyg
    0:2 306487 219575 73167 110069413 44583 19070 1172 218403
    1:2 292139 194853 81421 120893383 50745 21902 1259 193594
    2:2 283166 174607 91359 129699642 54931 23688 1287 173320
    3:2 273998 161788 93991 132757146 57122 24351 1366 160422
    4:2 289851 215692 62190 83398383 36377 13680 851 214841
    5:2 316312 222146 77605 117582154 49948 20281 988 221158
    6:2 297172 195596 83623 122133390 52801 21301 929 194667
    7:2 283391 178078 86378 126622761 55122 22239 928 177150
    8:2 297655 210359 72995 110246694 45798 19777 1125 209234
    9:2 297357 202011 79363 119753474 50953 22088 1089 200922
    10:2 278797 178703 83180 122514385 52969 22726 1128 177575
    11:2 272661 167669 86978 127342327 55857 24342 1195 166474
    12:2 293039 204031 73211 110282059 47285 19651 948 203083
    13:2 289502 196762 76803 114712942 49339 20547 1016 195746
    14:2 264446 169609 78292 115715605 50459 21017 982 168627
    15:2 260968 163660 80142 116811793 51483 21281 1064 162596

    With this change:
    domainstats: domain0
    cpu cnt bln fld imb gain hgain nobusyq nobusyg
    0:2 272347 187380 77455 105420270 24975 1 953 186427
    1:2 267276 172360 86234 116242264 28087 6 1028 171332
    2:2 259769 156777 93281 123243134 30555 1 1043 155734
    3:2 250870 143129 97627 127370868 32026 6 1188 141941
    4:2 248422 177116 64096 78261112 22202 2 757 176359
    5:2 275595 180683 84950 116075022 29400 6 778 179905
    6:2 262418 162609 88944 119256898 31056 4 817 161792
    7:2 252204 147946 92646 122388300 32879 4 824 147122
    8:2 262335 172239 81631 110477214 26599 4 864 171375
    9:2 261563 164775 88016 117203621 28331 3 849 163926
    10:2 243389 140949 93379 121353071 29585 2 909 140040
    11:2 242795 134651 98310 124768957 30895 2 1016 133635
    12:2 255234 166622 79843 104696912 26483 4 746 165876
    13:2 244944 151595 83855 109808099 27787 3 801 150794
    14:2 241301 140982 89935 116954383 30403 6 845 140137
    15:2 232271 128564 92821 119185207 31207 4 1416 127148

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
  • There's a situation where the nohz balancer will try to wake itself:

    cpu-x is idle and is also the ilb_cpu;
    it gets a scheduler tick during idle,
    and the nohz_kick_needed() check in trigger_load_balance() sees
    rq_x->nr_running, which might not be zero (because someone woke a task
    on this rq, etc.), leading to cpu-x sending a kick to itself.

    And this can cause a lockup.

    Avoid this by not marking ourselves eligible for kicking (a small
    sketch of the guard follows this entry).

    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Suresh Siddha
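
    A small sketch of the guard (standalone; the helper only mirrors the
    shape of the check, not the kernel's nohz bookkeeping): the cpu that
    currently owns idle load balancing must not nominate itself for a kick,
    otherwise the scenario above ends with cpu-x kicking itself.

        #include <stdio.h>

        static int ilb_cpu = 3;   /* cpu owning idle load balancing */

        /* A busy cpu may kick the ilb owner, but the ilb owner must not
         * mark itself eligible to be kicked. */
        static int kick_needed(int this_cpu, int nr_running)
        {
                if (nr_running < 1)
                        return 0;           /* nothing to balance for */
                if (this_cpu == ilb_cpu)
                        return 0;           /* never kick ourselves   */
                return 1;
        }

        int main(void)
        {
                /* cpu 3 is the ilb owner and briefly sees nr_running == 1
                 * during its own tick: without the guard it self-kicks. */
                printf("cpu 3 kicks: %d\n", kick_needed(3, 1));
                printf("cpu 1 kicks: %d\n", kick_needed(1, 1));
                return 0;
        }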
     

14 Sep, 2010

1 commit

  • Mathieu reported bad latencies with make -j10 kind of kbuild
    workloads - which is mostly caused by us scheduling with too
    coarse a granularity.

    Reduce the minimum granularity some more, to make sure we
    can meet the latency target.

    I got the following results (make -j10 kbuild load, average of 3
    runs):

    vanilla:

    maximum latency: 38278.9 µs
    average latency: 7730.1 µs

    patched:

    maximum latency: 22702.1 µs
    average latency: 6684.8 µs

    Mathieu also measured it:

    |
    | * wakeup-latency.c (SIGEV_THREAD) with make -j10
    |
    | - Mainline 2.6.35.2 kernel
    |
    | maximum latency: 45762.1 µs
    | average latency: 7348.6 µs
    |
    | - With only Peter's smaller min_gran (shown below):
    |
    | maximum latency: 29100.6 µs
    | average latency: 6684.1 µs
    |

    Reported-by: Mathieu Desnoyers
    Reported-by: Linus Torvalds
    Acked-by: Mathieu Desnoyers
    Suggested-by: Peter Zijlstra
    Acked-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

11 Sep, 2010

1 commit