18 Oct, 2007

1 commit


17 Oct, 2007

7 commits

  • Change migration_call(CPU_DEAD) to use a direct spin_lock_irq() instead of
    task_rq_lock(rq->idle); rq->idle can't change its task_rq().

    This makes the code a bit more symmetrical with migrate_dead_tasks()'s path,
    which uses spin_lock_irq/spin_unlock_irq (a rough before/after sketch follows
    this entry).

    Signed-off-by: Oleg Nesterov
    Cc: Cliff Wickman
    Cc: Gautham R Shenoy
    Cc: Ingo Molnar
    Cc: Srivatsa Vaddagiri
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
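    Roughly, the change described above has this shape in migration_call()'s
    CPU_DEAD handling. This is a hand-written sketch of the before/after
    locking, not the verbatim diff; the body of the critical section is elided
    and only indicated by comments.

    /* Before: take the runqueue lock via the task helper. */
    rq = task_rq_lock(rq->idle, &flags);
    /* ... reset rq->idle's scheduling state ... */
    task_rq_unlock(rq, &flags);

    /* After: rq->idle can never change its task_rq(), so the runqueue
     * lock can be taken directly, mirroring migrate_dead_tasks(). */
    spin_lock_irq(&rq->lock);
    /* ... same body ... */
    spin_unlock_irq(&rq->lock);
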
  • Currently move_task_off_dead_cpu() is called under
    write_lock_irq(tasklist). This means it can't use task_lock(), which is
    needed to improve migration to take the task's ->cpuset into account.

    Change the code to call move_task_off_dead_cpu() with irqs enabled, and
    change migrate_live_tasks() to use read_lock(tasklist) (a sketch of the
    resulting function follows this entry).

    This is all preparation for the further changes proposed by Cliff Wickman; see
    http://marc.info/?t=117327786100003

    Signed-off-by: Oleg Nesterov
    Cc: Cliff Wickman
    Cc: Gautham R Shenoy
    Cc: Ingo Molnar
    Cc: Srivatsa Vaddagiri
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
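    For reference, a sketch of what migrate_live_tasks() looks like after this
    change, assuming the usual do_each_thread/while_each_thread iteration over
    the task list; treat it as illustrative rather than the exact patched
    source.

    static void migrate_live_tasks(int src_cpu)
    {
        struct task_struct *p, *t;

        /* Was write_lock_irq(&tasklist_lock): irqs off, task_lock() unusable. */
        read_lock(&tasklist_lock);

        do_each_thread(t, p) {
            if (p == current)
                continue;
            if (task_cpu(p) == src_cpu)
                move_task_off_dead_cpu(src_cpu, p);
        } while_each_thread(t, p);

        read_unlock(&tasklist_lock);
    }
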
  • A child task may be added on a different cpu than the one on which the
    parent is running. In that case, task_new_fair() should check whether the
    new-born task's parent entity should be added to the cfs_rq as well.

    The patch below fixes the problem in task_new_fair() (a conceptual sketch
    follows this entry).

    This could fix the reported put_prev_task_fair() crashes.

    Reported-by: Kamalesh Babulal
    Reported-by: Andy Whitcroft
    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
     
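    A conceptual sketch of the idea, modelled on the hierarchy walk that
    enqueue_task_fair() performs for group scheduling. The helper name below is
    made up for illustration and this is not the verbatim patch.

    /*
     * Hypothetical helper: enqueue the new task's entity and keep walking up
     * the group hierarchy, so that a parent entity which is not yet on the
     * target CPU's cfs_rq gets enqueued as well instead of being skipped.
     */
    static void enqueue_new_task_hierarchy(struct rq *rq, struct task_struct *p)
    {
        struct sched_entity *se = &p->se;

        for_each_sched_entity(se) {
            if (se->on_rq)
                break;          /* ancestor already queued, stop here */
            enqueue_entity(cfs_rq_of(se), se, 0);
        }

        resched_task(rq->curr);
    }
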
  • We recently discovered a nasty performance bug in the kernel CPU load
    balancer, where we were hit by a 50% performance regression.

    When tasks are assigned, via cpu affinity, to a subset of CPUs that spans
    sched_domains (either a ccNUMA node or the new multi-core domain), the
    kernel fails to load balance properly at these domains, because several
    pieces of logic in find_busiest_group() misidentify the busiest sched
    group within a given domain. This leads to inadequate load balancing and
    causes the 50% performance hit.

    To give a concrete example: on a dual-core, 2-socket NUMA system, there
    are 4 logical CPUs, organized as:

    CPU0 attaching sched-domain:
    domain 0: span 0003 groups: 0001 0002
    domain 1: span 000f groups: 0003 000c
    CPU1 attaching sched-domain:
    domain 0: span 0003 groups: 0002 0001
    domain 1: span 000f groups: 0003 000c
    CPU2 attaching sched-domain:
    domain 0: span 000c groups: 0004 0008
    domain 1: span 000f groups: 000c 0003
    CPU3 attaching sched-domain:
    domain 0: span 000c groups: 0008 0004
    domain 1: span 000f groups: 000c 0003

    If I run 2 tasks with CPU affinity set to 0x5, there are situations
    where cpu0 has a run queue length of 2 while cpu2 sits idle. The
    kernel load balancer is unable to spread these two tasks over
    cpu0 and cpu2 because at least three pieces of logic in
    find_busiest_group() heavily bias load balancing towards power-saving
    mode. E.g. while determining the "busiest" variable, the kernel only
    sets it when "sum_nr_running > group_capacity". This test is flawed in
    that "sum_nr_running" is not necessarily the same as the number of tasks
    allowed to run within the sched group. The end result is that the kernel
    "thinks" everything is balanced, but in reality we have an imbalance,
    causing one CPU to be over-subscribed while the other stays idle. Two
    other pieces of logic in the same function cause a similar effect. The
    nastiness of this bug is that the kernel is unable to get unstuck from
    this unfortunate broken state. From what we've seen in our environment,
    the kernel stays stuck in the imbalanced state for extended periods of
    time, and it is also very easy for the kernel to get stuck in that state
    (it's pretty much 100% reproducible for us).

    So we propose the following fix: add additional logic in
    find_busiest_group() to detect intrinsic imbalance within the busiest
    group. When such a condition is detected, load balancing goes into spread
    mode instead of the default grouping mode. (A toy model of the scenario
    above follows this entry.)
    Signed-off-by: Ken Chen
    Signed-off-by: Ingo Molnar

    Ken Chen
     
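    The following toy userspace program reconstructs the scenario above (the
    domain-1 group masks from the dump, two tasks affine to 0x5, both queued
    on cpu0). It shares no code with find_busiest_group() and is purely
    illustrative, but it shows why a "nr_running > capacity" test misses the
    imbalance that an affinity-aware check catches.

    #include <stdio.h>

    int main(void)
    {
        /* Domain-1 groups from the dump above: {cpu0,cpu1}=0x3, {cpu2,cpu3}=0xc. */
        unsigned group_span[2] = { 0x3, 0xc };

        /* Two tasks, both currently queued on cpu0, both affine to 0x5 (cpu0|cpu2). */
        int      task_cpu[2]  = { 0, 0 };
        unsigned task_mask[2] = { 0x5, 0x5 };

        for (int g = 0; g < 2; g++) {
            int sum_nr_running = 0;     /* tasks queued on this group's CPUs */
            unsigned usable = 0;        /* group CPUs these tasks may actually use */
            int capacity = 2;           /* one task per CPU in a two-CPU group */

            for (int t = 0; t < 2; t++) {
                if (group_span[g] & (1u << task_cpu[t]))
                    sum_nr_running++;
                usable |= group_span[g] & task_mask[t];
            }

            int usable_cpus = __builtin_popcount(usable);

            printf("group 0x%x: nr_running=%d capacity=%d usable_cpus=%d\n",
                   group_span[g], sum_nr_running, capacity, usable_cpus);
            printf("  'nr_running > capacity' test: %s it busiest\n",
                   sum_nr_running > capacity ? "marks" : "does NOT mark");
            printf("  affinity-aware 'nr_running > usable_cpus' test: %s it busiest\n",
                   sum_nr_running > usable_cpus ? "marks" : "does NOT mark");
        }

        /*
         * Group 0x3 ends up with nr_running=2 and capacity=2, so the stock
         * test never declares it busiest, yet only cpu0 is usable for these
         * tasks: cpu0 runs both while cpu2 (in group 0xc) stays idle.
         */
        return 0;
    }
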
  • It occurred to me this morning that the procname field was dynamically
    allocated and needed to be freed. I started to put in break statements for
    when allocation failed, but it was approaching 50% error-handling code.

    I came up with this alternative of looping while entry->mode is set and
    checking proc_handler instead of ->table (a sketch follows this entry).
    Alternatively, the string versions of the domain name and cpu number could
    be stored in the structs.

    I verified by compiling with CONFIG_DEBUG_SLAB and checking the allocation
    counts after making a cpuset exclusive and back.

    Signed-off-by: Ingo Molnar

    Milton Miller
     
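    A sketch of the freeing loop described above, assuming the then-current
    struct ctl_table fields (procname, mode, child, proc_handler); the helper
    name and details are from memory and should be treated as illustrative.

    static void sd_free_ctl_entry(struct ctl_table **tablep)
    {
        struct ctl_table *entry;

        /*
         * In the intermediate directories the procname strings (and child
         * tables) are dynamically allocated, but ->mode is always set; in
         * the leaf tables the names are static strings and every entry has
         * a proc_handler, so those must not be freed.
         */
        for (entry = *tablep; entry->mode; entry++) {
            if (entry->child)
                sd_free_ctl_entry(&entry->child);
            if (entry->proc_handler == NULL)
                kfree(entry->procname);
        }

        kfree(*tablep);
        *tablep = NULL;
    }
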
  • Remove the cpuset hooks that defined sched domains depending on the setting
    of the 'cpu_exclusive' flag.

    The cpu_exclusive flag can only be set on a child if it is set on the
    parent.

    This made that flag painfully unsuitable for use as a flag defining a
    partitioning of a system.

    It was entirely unobvious to a cpuset user what partitioning of sched
    domains they would be causing when they set that one cpu_exclusive bit on
    one cpuset, because it depended on what CPUs were in the remainder of that
    cpuset's siblings and child cpusets, after subtracting out other
    cpu_exclusive cpusets.

    Furthermore, there was no way on production systems to query the
    result.

    Using the cpu_exclusive flag for this was simply wrong from the get go.

    Fortunately, it was sufficiently borked that, so far as I know, almost no
    successful use has been made of this. One real-time group did use it to
    effectively isolate CPUs from any load balancing efforts. They are willing
    to adapt to alternative mechanisms for this, such as some way to manipulate
    the list of isolated CPUs on a running system. They can do without the
    present cpu_exclusive-based mechanism while we develop an alternative.

    There is a real risk, to the best of my understanding, of users
    accidentally setting up partitioned scheduler domains, inhibiting desired
    load balancing across all their CPUs, due to the nonobvious (from the
    cpuset perspective) side effects of the cpu_exclusive flag.

    Furthermore, since there was no way on a running system to see what one was
    doing with sched domains, this change will be invisible to any code that
    was using it. Unless they have real insight into the scheduler's load
    balancing choices, users will be unable to detect that this change has been
    made to the kernel's behaviour.

    Initial discussion of this patch on lkml has generated much comment. My
    (probably controversial) take on that discussion is that it has reached a
    rough consensus that the current cpuset cpu_exclusive mechanism for
    defining sched domains is borked. There is no consensus on the
    replacement. But since we can remove this mechanism, and since its
    continued presence risks causing unwanted partitioning of the scheduler's
    load balancing, we should remove it while we can, as we proceed to work on
    the replacement scheduler domain mechanisms.

    Signed-off-by: Paul Jackson
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Dinakar Guniguntala
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu
    variable (a sketch of the conversion follows this entry). This saves
    sizeof(cpumask_t) for each unused CPU. Access is mostly from startup and
    CPU hotplug functions.

    Signed-off-by: Mike Travis
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: "Siddha, Suresh B"
    Cc: "David S. Miller"
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Travis
     
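    The shape of the conversion, roughly; the exact declarations vary per
    architecture and the accessor function below is made up for illustration.

    /* Before: statically sized by NR_CPUS, wasting space for absent CPUs. */
    cpumask_t cpu_sibling_map[NR_CPUS];

    /* After: one cpumask_t per possible CPU. */
    DEFINE_PER_CPU(cpumask_t, cpu_sibling_map);

    /* Readers switch from array indexing to the per-cpu accessor: */
    static cpumask_t sibling_mask(int cpu)
    {
        return per_cpu(cpu_sibling_map, cpu);   /* was cpu_sibling_map[cpu] */
    }
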

15 Oct, 2007

32 commits