26 Sep, 2006

1 commit

  • The scheduler will stop load balancing if the busiest processor contains
    processes pinned via processor affinity.

    The scheduler currently does only one search for the busiest cpu. If it
    cannot pull any tasks away from the busiest cpu because they were pinned,
    then the scheduler goes into a corner and sulks, leaving the idle
    processors idle.

    For example, if processor 0 is busy running four tasks pinned via taskset,
    processor 1 has none, and someone has just started two processes on
    processor 2, then the scheduler will not move one of those two processes
    away from processor 2.

    This patch fixes that issue by forcing the scheduler to come out of its
    corner and retry the load balancing, this time considering other
    processors.

    This patch was originally developed by John Hawkes and discussed at

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113901368523205&w=2.

    I have removed extraneous material and gone back to equipping struct rq
    with the cpu the queue is associated with, since this makes the patch much
    easier and it is likely that others in the future will have the same
    difficulty figuring out which processor owns which runqueue.

    The overhead added by these patches is a single word on the stack if the
    kernel is configured to support 32 cpus or less (32 bit). For 32 bit
    environments the maximum number of cpus that can be configured is 255,
    which would result in an additional 32 bytes on the stack. On IA64 up to
    1k cpus can be configured, which will result in an additional 128 bytes
    on the stack. The maximum additional cache footprint is one cacheline.
    Typically memory use will be much less than a cacheline, and the
    additional cpumask will be placed on the stack in a cacheline that already
    contains other local variables.
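
    A minimal userspace sketch of the retry idea (illustrative names, not the
    kernel code): remember which cpus turned out to contain only pinned tasks
    in a small bitmask and repeat the "find busiest" search excluding them,
    instead of giving up after the first attempt.

    #include <stdio.h>

    #define NR_CPUS 4

    struct cpu {
        int nr_tasks;
        int nr_pinned;   /* tasks that may not leave this cpu */
    };

    /* find the busiest cpu that is not in the 'excluded' mask */
    static int find_busiest(const struct cpu *cpus, unsigned int excluded)
    {
        int i, busiest = -1, max = 0;

        for (i = 0; i < NR_CPUS; i++) {
            if (excluded & (1u << i))
                continue;
            if (cpus[i].nr_tasks > max) {
                max = cpus[i].nr_tasks;
                busiest = i;
            }
        }
        return busiest;
    }

    int main(void)
    {
        /* cpu0: four pinned tasks, cpu1: idle, cpu2: two movable tasks */
        struct cpu cpus[NR_CPUS] = { { 4, 4 }, { 0, 0 }, { 2, 0 }, { 0, 0 } };
        unsigned int excluded = 0;   /* the extra word on the stack */
        int busiest;

        while ((busiest = find_busiest(cpus, excluded)) >= 0) {
            if (cpus[busiest].nr_pinned == cpus[busiest].nr_tasks) {
                excluded |= 1u << busiest;   /* nothing movable: note it, retry */
                continue;
            }
            printf("pull one task from cpu%d\n", busiest);
            break;
        }
        return 0;
    }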

    Signed-off-by: Christoph Lameter
    Cc: John Hawkes
    Cc: "Siddha, Suresh B"
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Peter Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

28 Aug, 2006

1 commit

  • sched_setscheduler() looks at ->signal->rlim[]. It is unsafe to
    dereference ->signal unless tasklist_lock or ->siglock is held (or p ==
    current). We pin the task structure, but that cannot prevent
    release_task()->__exit_signal(), which sets ->signal = NULL.

    Restore tasklist_lock across the setscheduler call.
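
    A hedged userspace analogy of the bug (pthreads, invented names): holding
    a reference keeps the task object itself alive, but it does not stop the
    exit path from setting the ->signal pointer to NULL, so the reader has to
    take the same lock the writer takes and re-check the pointer under it.

    /* build: cc -pthread example.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct signal { long rlim_cur; };

    struct task {
        pthread_mutex_t lock;    /* stands in for tasklist_lock / ->siglock */
        struct signal *signal;   /* may be set to NULL by the exit path     */
    };

    static void exit_signal(struct task *t)   /* runs concurrently with readers */
    {
        pthread_mutex_lock(&t->lock);
        free(t->signal);
        t->signal = NULL;
        pthread_mutex_unlock(&t->lock);
    }

    static long read_rlim(struct task *t)     /* a bare reference is not enough */
    {
        long val = -1;

        pthread_mutex_lock(&t->lock);
        if (t->signal)                        /* re-check under the lock */
            val = t->signal->rlim_cur;
        pthread_mutex_unlock(&t->lock);
        return val;
    }

    int main(void)
    {
        struct task t = { PTHREAD_MUTEX_INITIALIZER, malloc(sizeof(struct signal)) };

        t.signal->rlim_cur = 42;
        printf("rlim = %ld\n", read_rlim(&t));
        exit_signal(&t);
        printf("rlim after exit = %ld\n", read_rlim(&t));
        return 0;
    }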

    Signed-off-by: Oleg Nesterov
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

01 Aug, 2006

3 commits

  • Initialize init task's pi_waiters plist. Otherwise cpu hotplug of cpu 0
    might crash, since rt_mutex_getprio() accesses an uninitialized list head.

    call chain which led to crash:

    take_cpu_down
    sched_idle_next
    __setscheduler
    rt_mutex_getprio

    Unfortunately, using PLIST_HEAD_INIT in the INIT_TASK macro doesn't work,
    since the pi_waiters member is only conditionally present.
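
    A small illustrative sketch of why a run-time initialization is used
    (config name and fields invented for the example): when a member exists
    only under a config option, a static initializer in a shared macro breaks
    the non-configured build, so the list head is set up early at run time.

    #include <stdio.h>

    #define CONFIG_RT_MUTEXES 1          /* flip to 0: no pi_waiters member */

    struct list_head { struct list_head *next, *prev; };

    struct task {
        int prio;
    #if CONFIG_RT_MUTEXES
        struct list_head pi_waiters;     /* only conditionally present */
    #endif
    };

    /* cannot live in a one-size-fits-all INIT_TASK-style macro */
    static void init_task_rt(struct task *t)
    {
    #if CONFIG_RT_MUTEXES
        t->pi_waiters.next = &t->pi_waiters;
        t->pi_waiters.prev = &t->pi_waiters;
    #endif
    }

    int main(void)
    {
        struct task init_task = { .prio = 120 };

        init_task_rt(&init_task);        /* done before anyone walks the list */
        printf("init task prio %d\n", init_task.prio);
        return 0;
    }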

    Cc: Arjan van de Ven
    Cc: Thomas Gleixner
    Acked-by: Ingo Molnar
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • cond_resched_lock() calls __resched_legal() before dropping the spin lock,
    so __resched_legal() will always find the preempt_count non-zero and will
    prevent the call to __cond_resched().

    The attached patch adds a parameter to __resched_legal() with the expected
    preempt_count value.
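
    A toy sketch of the idea (invented names): the legality check is told the
    nesting depth the caller expects at that point, so being called with the
    spin lock still held no longer disables rescheduling.

    #include <stdio.h>

    static int preempt_depth;          /* locks / preempt-disables currently held */
    static int need_resched_flag = 1;

    /* pass in the preempt count the caller expects at this point */
    static int resched_legal(int expected_preempt_count)
    {
        return preempt_depth == expected_preempt_count && need_resched_flag;
    }

    int main(void)
    {
        preempt_depth = 1;      /* as in cond_resched_lock(): one lock still held */
        printf("expect 0 -> %d\n", resched_legal(0));   /* old behaviour: never 1 */
        printf("expect 1 -> %d\n", resched_legal(1));   /* fixed: reschedule OK   */
        return 0;
    }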

    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Houston
     
  • Use the correct groups while initializing sched groups power for
    allnodes_domain. This fixes the crash observed while creating exclusive
    cpusets.

    Signed-off-by: Suresh Siddha
    Reported-and-tested-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddha, Suresh B
     

15 Jul, 2006

3 commits

  • Make the task-related schedstats functions callable by delay accounting even
    if schedstats collection isn't turned on. This removes the dependency of
    delay accounting on schedstats.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • Unlike earlier iterations of the delay accounting patches, delays are now
    collected only for actual I/O waits, rather than trying to cover the
    delays seen in I/O submission paths.

    Account separately for block I/O delays incurred as a result of swapin
    page faults, whose frequency can be affected by the task/process's rss
    limit. Hence swapin delays can act as feedback for rss limit changes
    independently of I/O priority changes.

    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
     
  • On platforms that have __ARCH_WANT_UNLOCKED_CTXSW set and want to implement
    lock validator support there's a bug in rq->lock handling: in this case we
    don't 'carry over' the runqueue lock into another task - but we still did a
    spinlock_release() of it. Fix this by making the spinlock_release() in
    context_switch() dependent on !__ARCH_WANT_UNLOCKED_CTXSW.

    (Reported by Ralf Baechle on MIPS, which has __ARCH_WANT_UNLOCKED_CTXSW.
    This fixes a lockdep-internal BUG message on such platforms.)

    Signed-off-by: Ingo Molnar
    Cc: Ralf Baechle
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

11 Jul, 2006

2 commits

  • - constify and optimize stat_nam (thanks to Michael Tokarev!)
    - spelling and comment fixes

    Signed-off-by: Andreas Mohr
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Mohr
     
  • Problem:

    In the function __migrate_task(), deactivate_task() followed by
    activate_task() is used to move the task from one run queue to
    another. This has two undesirable effects:

    1. The task's priority is recalculated. (Nowhere else in the
    scheduler code is the priority recalculated for a change of CPU.)

    2. The task's time stamp is set to the current time. At the very least,
    this makes the adjustment of the time stamp before the call to
    deactivate_task() redundant, but I believe the problem is more serious,
    as the time stamp now holds the time of the queue change instead of
    the time at which the task was woken. In addition, unless dest_rq is
    the same queue that "current" is on, the time stamp could be inaccurate
    due to inter-CPU drift.

    Solution:

    Replace the call to activate_task() with one to __activate_task().
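
    An illustrative userspace sketch of the distinction (names invented): a
    full activation recalculates the priority and restamps the task with the
    current time, while the bare variant just links the task into the
    destination queue, which is what a cross-CPU migration wants.

    #include <stdio.h>
    #include <time.h>

    struct task  { int prio; long timestamp; };
    struct queue { struct task *tasks[16]; int nr; };

    /* bare enqueue: preserves prio and timestamp */
    void __activate_task(struct task *t, struct queue *q)
    {
        q->tasks[q->nr++] = t;
    }

    /* full activation: also recalculates prio and restamps the task */
    void activate_task(struct task *t, struct queue *q)
    {
        t->prio = 120;               /* stand-in for the priority recalculation */
        t->timestamp = time(NULL);   /* wake time would be overwritten here     */
        __activate_task(t, q);
    }

    void migrate_task(struct task *t, struct queue *src, struct queue *dst)
    {
        src->nr--;                   /* dequeue from the source (simplified)    */
        __activate_task(t, dst);     /* ...but do NOT re-run activate_task()    */
    }

    int main(void)
    {
        struct queue q0 = { { 0 }, 0 }, q1 = { { 0 }, 0 };
        struct task t = { .prio = 110, .timestamp = 1000 };   /* woken at 1000 */

        __activate_task(&t, &q0);
        migrate_task(&t, &q0, &q1);
        printf("prio %d, timestamp %ld -- both preserved across the move\n",
               t.prio, t.timestamp);
        return 0;
    }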

    Signed-off-by: Peter Williams
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Williams
     

04 Jul, 2006

7 commits

  • convert:

    - runqueue_t to 'struct rq'
    - prio_array_t to 'struct prio_array'
    - migration_req_t to 'struct migration_req'

    I was the one who added these, but they are against the kernel coding
    style and were also used inconsistently in places. So just get rid of them
    at once, now that we are flushing the scheduler patch-queue anyway.

    Conversion was mostly scripted, the result was reviewed and all secondary
    whitespace and style impact (if any) was fixed up by hand.
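
    For illustration, the general shape of that kind of conversion (a
    standalone toy, not the actual kernel diff):

    #include <stdio.h>

    /* before: the type was hidden behind a typedef */
    typedef struct runqueue { unsigned long nr_running; } runqueue_t;
    void old_style(runqueue_t *rq)  { rq->nr_running++; }

    /* after: the struct type is spelled out, no typedef */
    struct rq { unsigned long nr_running; };
    void new_style(struct rq *rq)   { rq->nr_running++; }

    int main(void)
    {
        runqueue_t before = { 0 };
        struct rq  after  = { 0 };

        old_style(&before);
        new_style(&after);
        printf("%lu %lu\n", before.nr_running, after.nr_running);
        return 0;
    }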

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • cleanup: remove task_t and convert all the uses to struct task_struct. I
    introduced it for the scheduler ages ago and it was a mistake.

    Conversion was mostly scripted, the result was reviewed and all
    secondary whitespace and style impact (if any) was fixed up by hand.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Clean up some of the impact of recent (and not so recent) scheduler
    changes:

    - turning macros into nice inline functions
    - sanitizing and unifying variable definitions
    - whitespace, style consistency, 80-lines, comment correctness, spelling
    and curly braces police

    Due to the macro hell and variable placement simplifications there's even 26
    bytes of .text saved:

     text    data   bss     dec    hex  filename
    25510    4153   192   29855   749f  sched.o.before
    25484    4153   192   29829   7485  sched.o.after

    [akpm@osdl.org: build fix]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Teach per-CPU runqueue locks and recursive locking code to the lock validator.
    Has no effect on non-lockdep kernels.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Use the lock validator framework to prove spinlock and rwlock locking
    correctness.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Accurate hard-IRQ-flags and softirq-flags state tracing.

    This allows us to attach extra functionality to IRQ flags on/off
    events (such as trace-on/off).

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Generic lock debugging:

    - generalized lock debugging framework. For example, a bug detected in one
    lock subsystem turns off debugging in all lock subsystems.

    - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
    the mutex/rtmutex debugging code: it caused way too much prototype
    hackery, and lockdep will give the same information anyway.

    - ability to do silent tests

    - check lock freeing in vfree too.

    - more finegrained debugging options, to allow distributions to
    turn off more expensive debugging features.

    There's no separate 'held mutexes' list anymore - but there's a 'held locks'
    stack within lockdep, which unifies deadlock detection across all lock
    classes. (this is independent of the lockdep validation stuff - lockdep first
    checks whether we are holding a lock already)

    Here are the current debugging options:

    CONFIG_DEBUG_MUTEXES=y
    CONFIG_DEBUG_LOCK_ALLOC=y

    which do:

    config DEBUG_MUTEXES
    bool "Mutex debugging, basic checks"

    config DEBUG_LOCK_ALLOC
    bool "Detect incorrect freeing of live mutexes"

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

01 Jul, 2006

1 commit

  • Fix a bug identified by Zou Nan hai :

    If the system is in state SYSTEM_BOOTING, and need_resched() is true,
    cond_resched() returns true even though it didn't reschedule. Consequently
    need_resched() remains true and JBD locks up.

    Fix that by teaching cond_resched() to only return true if it really did call
    schedule().

    cond_resched_lock() and cond_resched_softirq() have a problem too. If we're
    in SYSTEM_BOOTING state and need_resched() is true, these functions will drop
    the lock and will then try to call schedule(), but the SYSTEM_BOOTING state
    will prevent schedule() from being called. So on return, need_resched() will
    still be true, but cond_resched_lock() has to return 1 to tell the caller that
    the lock was dropped. The caller will probably lock up.

    Bottom line: if these functions dropped the lock, they _must_ call schedule()
    to clear need_resched(). Make it so.

    Also, uninline __cond_resched(). It's largeish, and slowpath.
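
    A toy model of the fix (illustrative, not the kernel code): only report
    success when schedule() was actually called, so a caller that loops on the
    return value cannot spin forever while need_resched stays set.

    #include <stdio.h>

    static int system_booting = 1;      /* models SYSTEM_BOOTING */
    static int need_resched_flag = 1;   /* models need_resched() */

    static int do_schedule(void)
    {
        if (system_booting)
            return 0;                   /* scheduling suppressed during boot */
        need_resched_flag = 0;          /* a real reschedule clears the flag */
        return 1;
    }

    /* buggy: claims "I rescheduled" whenever the flag was set */
    static int cond_resched_old(void)
    {
        return need_resched_flag ? (do_schedule(), 1) : 0;
    }

    /* fixed: only return 1 if schedule() really ran */
    static int cond_resched_new(void)
    {
        return need_resched_flag ? do_schedule() : 0;
    }

    int main(void)
    {
        printf("old: %d (flag still %d)\n", cond_resched_old(), need_resched_flag);
        printf("new: %d (flag still %d, callers stop looping)\n",
               cond_resched_new(), need_resched_flag);
        return 0;
    }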

    Acked-by: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

28 Jun, 2006

20 commits

  • When the priority of a task which is blocked on a lock changes, we must
    propagate this change into the PI lock chain. Therefore the chain walk code
    is changed to get rid of the references to current, to avoid false
    positives in the deadlock detector, as setscheduler might be called by a
    task which holds the lock on which the task whose priority is changed is
    blocked.

    Also add some comments about the get/put_task_struct usage to avoid
    confusion.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • There is no need to hold tasklist_lock across the setscheduler call, when
    we pin the task structure with get_task_struct(). Interrupts are disabled
    in setscheduler anyway and the permission checks do not need interrupts
    disabled.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • Add framework to boost/unboost the priority of RT tasks.

    This consists of:

    - caching the 'normal' priority in ->normal_prio
    - providing functions to set/get the priority of the task
    - make sched_setscheduler() aware of boosting

    The effective_prio() cleanups also fix a priority-calculation bug pointed out
    by Andrey Gelman, in set_user_nice().

    has_rt_policy() fix: Peter Williams
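
    An illustrative sketch of the boosting idea (lower number means higher
    priority; all names invented here): the policy-derived priority is cached
    in normal_prio, and the effective priority is the more urgent of that and
    the priority of the top waiter blocked on a lock the task holds.

    #include <stdio.h>

    struct task {
        int normal_prio;   /* what the scheduling policy alone would give */
        int prio;          /* effective priority, possibly boosted        */
    };

    static void recompute_prio(struct task *t, int top_waiter_prio)
    {
        t->prio = top_waiter_prio < t->normal_prio ? top_waiter_prio
                                                   : t->normal_prio;
    }

    int main(void)
    {
        struct task holder = { .normal_prio = 120, .prio = 120 };

        recompute_prio(&holder, 50);    /* an RT waiter (prio 50) blocks on us */
        printf("boosted to %d\n", holder.prio);

        recompute_prio(&holder, 139);   /* no urgent waiters left: unboost     */
        printf("back to %d\n", holder.prio);
        return 0;
    }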

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Cc: Andrey Gelman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Thomas Gleixner is adding a call to an rtmutex function in setscheduler.
    This call grabs a spin_lock that is not always taken with interrupts
    disabled. So this means that setscheduler can't be called from interrupt
    context.

    To prevent this from happening in the future, this patch adds a
    BUG_ON(in_interrupt()) in that function. (Thanks to akpm for this suggestion).

    Signed-off-by: Steven Rostedt
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Saves 543 bytes from sched.o (gcc 3.3.3).

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Con Kolivas
    Cc: Peter Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in
    /sys/devices/system/cpu/ control the MC/SMT power savings policy for the
    scheduler.

    Based on the values (1 = enable, 0 = disable) of these controls, sched
    group cpu power will be determined for different domains. When the power
    savings policy is enabled, under light load conditions the scheduler will
    minimize the number of physical packages/cpu cores carrying the load, thus
    conserving power (with a performance impact that depends on the workload
    characteristics; see the OLS 2005 CMP kernel scheduler paper for more
    details).

    Signed-off-by: Suresh Siddha
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Con Kolivas
    Cc: "Chen, Kenneth W"
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddha, Suresh B
     
  • As explained here:
    http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2

    there is a problem with sharing sched_group structures between two
    separate sched_domains.

    The patch has been tested and found to avoid the kernel lockup problem
    described at the above URL.

    Signed-off-by: Srivatsa Vaddagiri
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: "Siddha, Suresh B"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa Vaddagiri
     
  • The sched group structures used to represent the various nodes need to be
    allocated from the respective nodes (as also suggested here:

    http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0603.3/0051.html)

    Signed-off-by: Srivatsa Vaddagiri
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: "Siddha, Suresh B"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa Vaddagiri
     
  • Replace GFP_ATOMIC allocation for sched_group_nodes with GFP_KERNEL based
    allocation.

    Signed-off-by: Srivatsa Vaddagiri
    Cc: Ingo Molnar
    Cc: "Siddha, Suresh B"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa Vaddagiri
     
  • Try to handle memory allocation failures in build_sched_domains by bailing
    out and cleaning up the memory allocated thus far. A direct consequence of
    this is that we disable load balancing completely (even at the sibling
    level) upon *any* memory allocation failure.
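
    The usual C pattern for this kind of bail-out, sketched in standalone form
    (names invented, not the actual build_sched_domains() code): each
    allocation failure jumps to a label that frees whatever was already set up.

    #include <stdio.h>
    #include <stdlib.h>

    struct domains { int *groups; int *power; };

    static int build_domains(struct domains *d, size_t n)
    {
        d->groups = malloc(n * sizeof(*d->groups));
        if (!d->groups)
            goto err;

        d->power = malloc(n * sizeof(*d->power));
        if (!d->power)
            goto err_free_groups;

        return 0;                 /* success: everything is allocated */

    err_free_groups:
        free(d->groups);
        d->groups = NULL;
    err:
        return -1;                /* caller falls back to "no load balancing" */
    }

    int main(void)
    {
        struct domains d = { NULL, NULL };

        if (build_domains(&d, 8))
            printf("allocation failed, partial setup torn down\n");
        else
            printf("domains built\n");
        free(d.groups);
        free(d.power);
        return 0;
    }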

    [Lee.Schermerhorn@hp.com: bugfix]
    Signed-off-by: Srivatsa Vaddagiri
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: "Siddha, Suresh B"
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa Vaddagiri
     
  • Problem:

    To help distribute high priority tasks evenly across the available CPUs,
    move_tasks() does not, under some circumstances, skip tasks whose load
    weight is bigger than the designated amount. Because the highest priority
    task on the busiest queue may be on the expired array, it may be moved as
    a result of this mechanism. Apart from not being the most desirable way to
    redistribute the high priority tasks (we'd rather move the second highest
    priority task), there is a risk that this could set up a loop with this
    task bouncing backwards and forwards between the two queues. (This latter
    possibility can be demonstrated by running a nice==-20 CPU bound task on
    an otherwise quiet 2 CPU system.)

    Solution:

    Modify the mechanism so that it does not override the skip for the highest
    priority task on the CPU. Of course, if there is more than one task at the
    highest priority then it will allow the override for one of them, as this
    is a desirable redistribution of high priority tasks.
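
    A hedged sketch of the selection rule (toy code, invented names): the
    override that ignores the load-weight limit is simply not applied to a
    task that is the only one at the top priority on its queue.

    #include <stdio.h>

    struct task { int prio; };           /* lower prio value = more important */

    /* may this candidate be moved even though its weight exceeds the limit? */
    static int may_override_skip(const struct task *t, const struct task *queue,
                                 int nr, int best_prio)
    {
        int i, at_best = 0;

        if (t->prio != best_prio)
            return 1;                    /* not the top task: override is fine */
        for (i = 0; i < nr; i++)
            if (queue[i].prio == best_prio)
                at_best++;
        return at_best > 1;              /* never move the sole top-prio task  */
    }

    int main(void)
    {
        struct task q[] = { { 100 }, { 120 }, { 120 } };

        printf("sole top task:   %d\n", may_override_skip(&q[0], q, 3, 100));
        printf("one of the rest: %d\n", may_override_skip(&q[1], q, 3, 100));
        return 0;
    }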

    Signed-off-by: Peter Williams
    Cc: Ingo Molnar
    Cc: "Siddha, Suresh B"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Williams
     
  • Problem:

    The move_tasks() function is designed to move UP TO the amount of load it
    is asked to move, and in doing this it skips over tasks looking for ones
    whose load weights are less than or equal to the remaining load to be
    moved. This is (in general) a good thing but it has the unfortunate result
    of breaking one of the original load balancer's good points: namely, that
    (within the limits imposed by the active/expired array model and the fact
    that the expired array is processed first) it moves high priority tasks
    before low priority ones. This means there's a good chance (see the
    active/expired problem for why it's only a chance) that the highest
    priority task on the queue, but not actually on the CPU, will be moved to
    the other CPU where (as a high priority task) it may preempt the current
    task.

    Solution:

    Modify move_tasks() so that high priority tasks are not skipped when moving
    them will make them the highest priority task on their new run queue.

    Signed-off-by: Peter Williams
    Cc: Ingo Molnar
    Cc: "Siddha, Suresh B"
    Cc: "Chen, Kenneth W"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Williams
     
  • Problem:

    The introduction of separate run queues per CPU has brought with it "nice"
    enforcement problems that are best described by a simple example.

    For the sake of argument, suppose that on a single CPU machine with a
    nice==19 hard spinner and a nice==0 hard spinner running, the nice==0
    task gets 95% of the CPU and the nice==19 task gets 5% of the CPU. Now
    suppose that there is a system with 2 CPUs, 2 nice==19 hard spinners and
    2 nice==0 hard spinners running. The user of this system would be entitled
    to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
    tasks only get 5% each. However, whether this expectation is met is pretty
    much down to luck, as there are four equally likely distributions of the
    tasks to the CPUs that the load balancing code will consider to be
    balanced with loads of 2.0 for each CPU. Two of these distributions
    involve one nice==0 and one nice==19 task per CPU, and in these
    circumstances the user's expectations will be met. The other two
    distributions both involve both nice==0 tasks being on one CPU and both
    nice==19 tasks on the other CPU; in these cases each task will get 50% of
    a CPU and the user's expectations will not be met.

    Solution:

    The solution to this problem that is implemented in the attached patch is
    to use weighted loads when determining if the system is balanced and, when
    an imbalance is detected, to move an amount of weighted load between run
    queues (as opposed to a number of tasks) to restore the balance. Once
    again, the easiest way to explain why both of these measures are necessary
    is to use a simple example. Suppose (in a slight variation of the above
    example) that we have a two CPU system with 4 nice==0 and 4 nice==19 hard
    spinning tasks running, and that the 4 nice==0 tasks are on one CPU and
    the 4 nice==19 tasks are on the other CPU. The weighted loads for the two
    CPUs would be 4.0 and 0.2 respectively, and the load balancing code would
    move 2 tasks, resulting in one CPU with a load of 2.0 and the other with a
    load of 2.2. If this was considered to be a big enough imbalance to
    justify moving a task, and that task was moved using the current
    move_tasks(), then it would move the highest priority task that it found,
    and this would result in one CPU with a load of 3.0 and the other with a
    load of 1.2, which would result in the movement of a task in the opposite
    direction, and so on -- infinite loop. If, on the other hand, an amount of
    load to be moved is calculated from the imbalance (in this case 0.1) and
    move_tasks() skips tasks until it finds ones whose contributions to the
    weighted load are less than this amount, it would move two of the nice==19
    tasks, resulting in a system with 2 nice==0 and 2 nice==19 tasks on each
    CPU, with loads of 2.1 for each CPU.

    One of the advantages of this mechanism is that on a system where all tasks
    have nice==0 the load balancing calculations would be mathematically
    identical to the current load balancing code.
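
    A toy rendering of the arithmetic (illustrative weights, not the kernel's
    exact table): each task contributes a nice-dependent weight, a queue's
    load is the sum of the weights of its runnable tasks, and the balancer
    compares and moves weighted load rather than task counts.

    #include <stdio.h>

    /* illustrative mapping: nice 0 -> 1.0, nice 19 -> 0.05, as in the example */
    static double load_weight(int nice)
    {
        return nice == 0 ? 1.0 : 0.05;
    }

    static double queue_load(const int *nices, int nr)
    {
        double sum = 0;
        for (int i = 0; i < nr; i++)
            sum += load_weight(nices[i]);
        return sum;
    }

    int main(void)
    {
        int cpu0[] = { 0, 0, 0, 0 };       /* four nice==0 hard spinners  */
        int cpu1[] = { 19, 19, 19, 19 };   /* four nice==19 hard spinners */

        /* task counts are 4 vs 4 ("balanced"); the weighted loads are not */
        printf("weighted loads: %.1f vs %.2f\n",
               queue_load(cpu0, 4), queue_load(cpu1, 4));
        return 0;
    }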

    Notes:

    struct task_struct:

    has a new field, load_weight, which (in a trade-off of space for speed)
    stores the contribution that this task makes to a CPU's weighted load when
    it is runnable.

    struct runqueue:

    has a new field raw_weighted_load which is the sum of the load_weight
    values for the currently runnable tasks on this run queue. This field
    always needs to be updated when nr_running is updated so two new inline
    functions inc_nr_running() and dec_nr_running() have been created to make
    sure that this happens. This also offers a convenient way to optimize away
    this part of the smpnice mechanism when CONFIG_SMP is not defined.

    int try_to_wake_up():

    in this function the value SCHED_LOAD_SCALE is used to represent the load
    contribution of a single task in various calculations in the code that
    decides which CPU to put the waking task on. While this would be valid
    on a system where the nice values for the runnable tasks were distributed
    evenly around zero, it will lead to anomalous load balancing if the
    distribution is skewed in either direction. To overcome this problem
    SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
    or by the average load_weight per task for the queue in question (as
    appropriate).

    int move_tasks():

    The modifications to this function were complicated by the fact that
    active_load_balance() uses it to move exactly one task without checking
    whether an imbalance actually exists. This precluded the simple
    overloading of max_nr_move with max_load_move and necessitated the addition
    of the latter as an extra argument to the function. The internal
    implementation is then modified to move up to max_nr_move tasks and
    max_load_move of weighted load. This slightly complicates the code where
    move_tasks() is called and if ever active_load_balance() is changed to not
    use move_tasks() the implementation of move_tasks() should be simplified
    accordingly.

    struct sched_group *find_busiest_group():

    Similar to try_to_wake_up(), there are places in this function where
    SCHED_LOAD_SCALE is used to represent the load contribution of a single
    task and the same issues are created. A similar solution is adopted except
    that it is now the average per task contribution to a group's load (as
    opposed to a run queue) that is required. As this value is not directly
    available from the group it is calculated on the fly as the queues in the
    groups are visited when determining the busiest group.

    A key change to this function is that it no longer scales down *imbalance
    on exit, as move_tasks() uses the load in its scaled form.

    void set_user_nice():

    has been modified to update the task's load_weight field when its nice
    value changes, and also to ensure that its run queue's raw_weighted_load
    field is updated if the task was runnable.

    From: "Siddha, Suresh B"

    With smpnice, sched groups containing the highest priority tasks can mask
    the imbalance between the other sched groups within the same domain. This
    patch fixes some of the scenarios listed below by not considering sched
    groups which are lightly loaded.

    a) on a simple 4-way MP system, if we have one high priority and 4 normal
    priority tasks, with smpnice we would like to see the high priority task
    scheduled on one cpu, two other cpus getting one normal task each and the
    fourth cpu getting the remaining two normal tasks. But with the current
    smpnice the extra normal priority task keeps jumping from one cpu to
    another cpu that has a normal priority task. This is because of the
    busiest_has_loaded_cpus, nr_loaded_cpus logic: we are not including the
    cpu with the high priority task in the max_load calculations, but we do
    include it in the total and avg_load calculations, leading to max_load <
    avg_load, and the load balance between the cpus running normal priority
    tasks (2 vs 1) will always show an imbalance, so the extra normal priority
    task will keep moving from one cpu to another cpu that has a normal
    priority task.

    b) 4-way system with HT (8 logical processors). Package-P0: T0 has a high
    priority task, T1 is idle. Package-P1: both T0 and T1 have 1 normal
    priority task each. P2 and P3 are idle. With this patch, one of the
    normal priority tasks on P1 will be moved to P2 or P3.

    c) With the current weighted smp nice calculations, it doesn't always make
    sense to look at the highest weighted runqueue in the busy group.
    Consider a load balance scenario on a DP with HT system, with Package-0
    containing one high priority and one low priority task, and Package-1
    containing one low priority task (with the other thread being idle).
    Package-1 thinks that it needs to take the low priority thread from
    Package-0, and find_busiest_queue() returns the cpu thread with the
    highest priority task, so ultimately (with the help of active load
    balance) we move the high priority task to Package-1. The same then
    continues with Package-0, moving the high priority task from Package-1
    back to Package-0. Even without the presence of active load balance, load
    balance will fail to balance the above scenario. Fix find_busiest_queue
    to use "imbalance" when it is lightly loaded.

    [kernel@kolivas.org: sched: store weighted load on up]
    [kernel@kolivas.org: sched: add discrete weighted cpu load function]
    [suresh.b.siddha@intel.com: sched: remove dead code]
    Signed-off-by: Peter Williams
    Cc: "Siddha, Suresh B"
    Cc: "Chen, Kenneth W"
    Acked-by: Ingo Molnar
    Cc: Nick Piggin
    Signed-off-by: Con Kolivas
    Cc: John Hawkes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Williams
     
  • There is a race between set_cpus_allowed() and move_task_off_dead_cpu().
    __migrate_task() doesn't report any error code, so the task can be left on
    its runqueue if its cpus_allowed mask changed so that dest_cpu is no
    longer a possible target. Also, changing the cpus_allowed mask requires
    rq->lock to be held.

    Signed-off-by: Kirill Korotaev
    Acked-By: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • Unless we expect to have more than 2G CPUs, there's no reason to have 'i'
    as a long long here.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
    and not explicit enough. The sleep boost is not supposed to be any larger
    than without this code, and the comment is not clear enough about what
    exactly it does, just the reason it does it. Fix it.

    There is a priority ceiling that tasks which only ever sleep for very long
    periods cannot surpass. Fix it.

    Prevent the on-runqueue bonus logic from defeating the idle sleep logic.

    Opportunity to micro-optimise.

    Signed-off-by: Con Kolivas
    Signed-off-by: Mike Galbraith
    Acked-by: Ingo Molnar
    Signed-off-by: Ken Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
  • Signed-off-by: Steven Rostedt
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Initial report and lock contention fix from Chris Mason:

    Recent benchmarks showed some performance regressions between 2.6.16 and
    2.6.5. We tracked down one of the regressions to lock contention in
    schedule-heavy workloads (~70,000 context switches per second).

    kernel/sched.c:dependent_sleeper() was responsible for most of the lock
    contention, hammering on the run queue locks. The patch below is more of a
    discussion point than a suggested fix (although it does reduce lock
    contention significantly). The dependent_sleeper code looks very expensive
    to me, especially for using a spinlock to bounce control between two
    different siblings on the same cpu.

    It is further optimized:

    * perform dependent_sleeper check after next task is determined
    * convert wake_sleeping_dependent to use trylock
    * skip smt runqueue check if trylock fails
    * optimize double_rq_lock now that smt nice is converted to trylock
    * early exit in searching first SD_SHARE_CPUPOWER domain
    * speedup fast path of dependent_sleeper
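
    A userspace sketch of the trylock conversion (pthreads, invented names):
    instead of unconditionally spinning on the sibling's lock, the check is
    skipped when the lock cannot be taken immediately, since the SMT-nice
    check is only an optimisation.

    /* build: cc -pthread example.c */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t sibling_lock = PTHREAD_MUTEX_INITIALIZER;

    /* returns 1 if the dependent-sleeper check was performed, 0 if skipped */
    static int check_smt_sibling(void)
    {
        if (pthread_mutex_trylock(&sibling_lock) != 0)
            return 0;                /* contended: skip rather than wait */

        /* ... inspect the sibling runqueue here ... */
        pthread_mutex_unlock(&sibling_lock);
        return 1;
    }

    int main(void)
    {
        printf("checked: %d\n", check_smt_sibling());   /* 1: lock was free    */

        pthread_mutex_lock(&sibling_lock);              /* simulate contention */
        printf("checked: %d\n", check_smt_sibling());   /* 0: check skipped    */
        pthread_mutex_unlock(&sibling_lock);
        return 0;
    }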

    [akpm@osdl.org: cleanup]
    Signed-off-by: Ken Chen
    Acked-by: Ingo Molnar
    Acked-by: Con Kolivas
    Signed-off-by: Nick Piggin
    Acked-by: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • Mark the notifier_call functions associated with cpu_notifier as __cpuinit.

    __cpuinit makes sure that the function is init time only unless
    CONFIG_HOTPLUG_CPU is defined.

    [akpm@osdl.org: section fix]
    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • This patch reverts the notifier_block changes made in 2.6.17.

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     

27 Jun, 2006

2 commits

  • * x86-64: (83 commits)
    [PATCH] x86_64: x86_64 stack usage debugging
    [PATCH] x86_64: (resend) x86_64 stack overflow debugging
    [PATCH] x86_64: msi_apic.c build fix
    [PATCH] x86_64: i386/x86-64 Add nmi watchdog support for new Intel CPUs
    [PATCH] x86_64: Avoid broadcasting NMI IPIs
    [PATCH] x86_64: fix apic error on bootup
    [PATCH] x86_64: enlarge window for stack growth
    [PATCH] x86_64: Minor string functions optimizations
    [PATCH] x86_64: Move export symbols to their C functions
    [PATCH] x86_64: Standardize i386/x86_64 handling of NMI_VECTOR
    [PATCH] x86_64: Fix modular pc speaker
    [PATCH] x86_64: remove sys32_ni_syscall()
    [PATCH] x86_64: Do not use -ffunction-sections for modules
    [PATCH] x86_64: Add cpu_relax to apic_wait_icr_idle
    [PATCH] x86_64: adjust kstack_depth_to_print default
    [PATCH] i386/x86-64: adjust /proc/interrupts column headings
    [PATCH] x86_64: Fix race in cpu_local_* on preemptible kernels
    [PATCH] x86_64: Fix fast check in safe_smp_processor_id
    [PATCH] x86_64: x86_64 setup.c - printing cmp related boottime information
    [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
    ...

    Manual resolve of trivial conflict in arch/i386/kernel/Makefile

    Linus Torvalds
     
  • During some profiling I noticed that default_idle causes a lot of
    memory traffic. I think that is caused by the atomic operations
    to clear/set the polling flag in thread_info. There is actually
    no reason to make this atomic - only the idle thread does it
    to itself, other CPUs only read it. So I moved it into ti->status.

    Converted i386/x86-64/ia64 for now because that was the easiest
    way to fix ACPI which also manipulates these flags in its idle
    function.
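
    A hedged sketch of the point being made (C11 atomics, only for
    illustration): a flag that only its owner ever writes does not need a
    locked read-modify-write; a plain (or relaxed) store is enough, and other
    CPUs just load it.

    #include <stdatomic.h>
    #include <stdio.h>

    #define POLLING_FLAG 0x0004u         /* illustrative flag value */

    struct thread_info {
        atomic_uint flags;               /* old home: set via atomic RMW bitops */
        atomic_uint status;              /* new home: only the owner stores it  */
    };

    int main(void)
    {
        struct thread_info ti = { 0 };

        /* before: a locked read-modify-write just to set one private bit */
        atomic_fetch_or(&ti.flags, POLLING_FLAG);

        /* after: the owner simply stores the value; other CPUs only read */
        atomic_store_explicit(&ti.status, POLLING_FLAG, memory_order_relaxed);

        printf("flags=%#x status=%#x\n",
               atomic_load(&ti.flags),
               atomic_load_explicit(&ti.status, memory_order_relaxed));
        return 0;
    }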

    Cc: Nick Piggin
    Cc: Tony Luck
    Cc: Len Brown
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen