12 Jan, 2009

1 commit


04 Jan, 2009

2 commits

  • Impact: prevents panic from stack overflow on numa-capable machines.

    Some of the "removal of stack hogs" changes in kernel/sched.c that used
    node_to_cpumask_ptr were undone by the early cpumask API updates, which
    causes a panic due to stack overflow. This patch undoes that reversal
    by using cpumask_of_node(), which returns a 'const struct cpumask *'.

    In addition, cpu_coregroup_map is replaced with cpu_coregroup_mask, further
    reducing stack usage. (Both of these updates removed 9 FIXME's!)

    Also:
    Pick up some remaining changes from the old 'cpumask_t' functions to
    the new 'struct cpumask *' functions.

    Optimize memory traffic by allocating each percpu local_cpu_mask on the
    same node as the referring cpu.

    Signed-off-by: Mike Travis
    Acked-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • …ux-2.6-cpumask into merge-rr-cpumask

    Conflicts:
    arch/x86/kernel/io_apic.c
    kernel/rcuclassic.c
    kernel/sched.c
    kernel/time/tick-sched.c

    Signed-off-by: Mike Travis <travis@sgi.com>
    [ mingo@elte.hu: backmerged typo fix for io_apic.c ]
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

    Mike Travis
     

25 Dec, 2008

1 commit


17 Dec, 2008

1 commit


12 Dec, 2008

1 commit


29 Nov, 2008

1 commit

  • Move double_lock_balance()/double_unlock_balance() higher to fix the following
    with gcc-3.4.6:

    CC kernel/sched.o
    In file included from kernel/sched.c:1605:
    kernel/sched_rt.c: In function `find_lock_lowest_rq':
    kernel/sched_rt.c:914: sorry, unimplemented: inlining failed in call to 'double_unlock_balance': function body not available
    kernel/sched_rt.c:1077: sorry, unimplemented: called from here
    make[2]: *** [kernel/sched.o] Error 1

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Ingo Molnar

    Alexey Dobriyan
     

26 Nov, 2008

1 commit


25 Nov, 2008

5 commits

  • Impact: Trivial API conversion

    NR_CPUS -> nr_cpu_ids
    cpumask_t -> struct cpumask
    sizeof(cpumask_t) -> cpumask_size()
    cpumask_a = cpumask_b -> cpumask_copy(&cpumask_a, &cpumask_b)

    cpu_set() -> cpumask_set_cpu()
    first_cpu() -> cpumask_first()
    cpumask_of_cpu() -> cpumask_of()
    cpus_* -> cpumask_*

    There are some FIXMEs where we need all archs to complete infrastructure
    (patches have been sent):

    cpu_coregroup_map -> cpu_coregroup_mask
    node_to_cpumask* -> cpumask_of_node

    There is also one FIXME where we pass an array of cpumasks to
    partition_sched_domains(): this implies knowing the definition of
    'struct cpumask' and the size of a cpumask. This will be fixed in a
    future patch.

    Signed-off-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Rusty Russell
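
The conversions listed in this commit can be illustrated with a minimal userspace sketch. The struct cpumask, nr_cpu_ids and the helper functions below are simplified stand-ins written for this example (assumptions, not the kernel's definitions). They only show the idea: masks are passed by pointer, and operations are sized by the runtime CPU count rather than by the compile-time NR_CPUS.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-ins, not the kernel's definitions. */
#define NR_CPUS        4096   /* compile-time maximum */
#define BITS_PER_LONG  (sizeof(unsigned long) * CHAR_BIT)
#define MASK_LONGS     ((NR_CPUS + BITS_PER_LONG - 1) / BITS_PER_LONG)

struct cpumask {
    unsigned long bits[MASK_LONGS];
};

unsigned int nr_cpu_ids = 8;  /* CPUs actually possible on this machine */

/* Bytes needed to cover nr_cpu_ids bits, rounded up to whole longs. */
size_t cpumask_size(void)
{
    return (nr_cpu_ids + BITS_PER_LONG - 1) / BITS_PER_LONG
           * sizeof(unsigned long);
}

void cpumask_clear(struct cpumask *m)
{
    memset(m->bits, 0, cpumask_size());
}

void cpumask_set_cpu(unsigned int cpu, struct cpumask *m)
{
    assert(cpu < nr_cpu_ids);
    m->bits[cpu / BITS_PER_LONG] |= 1UL << (cpu % BITS_PER_LONG);
}

/* Replaces "a = b": copies only the part of the mask that is in use. */
void cpumask_copy(struct cpumask *dst, const struct cpumask *src)
{
    memcpy(dst->bits, src->bits, cpumask_size());
}

/* Returns the first set CPU, or nr_cpu_ids if the mask is empty. */
unsigned int cpumask_first(const struct cpumask *m)
{
    for (unsigned int cpu = 0; cpu < nr_cpu_ids; cpu++)
        if (m->bits[cpu / BITS_PER_LONG] & (1UL << (cpu % BITS_PER_LONG)))
            return cpu;
    return nr_cpu_ids;
}
```

With these stand-ins, a copy touches cpumask_size() bytes instead of the full 512-byte structure that a plain assignment of a 4096-bit mask would move.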
     
  • Impact: (future) size reduction for large NR_CPUS.

    Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
    space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
    is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.

    Signed-off-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Rusty Russell
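
As a rough illustration of how one type can be either on-stack or dynamically allocated, the sketch below models cpumask_var_t in userspace. CPUMASK_OFFSTACK is a hypothetical macro standing in for CONFIG_CPUMASK_OFFSTACK, the allocator takes no flags argument (unlike the kernel's), and mark_and_test() is an invented caller used only to show that calling code looks the same in both configurations.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS 4096
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

struct cpumask {
    unsigned long bits[NR_CPUS / BITS_PER_LONG];
};

#ifdef CPUMASK_OFFSTACK   /* hypothetical stand-in for CONFIG_CPUMASK_OFFSTACK */
typedef struct cpumask *cpumask_var_t;

bool alloc_cpumask_var(cpumask_var_t *mask)
{
    *mask = malloc(sizeof(struct cpumask));
    return *mask != NULL;
}

void free_cpumask_var(cpumask_var_t mask)
{
    free(mask);
}
#else
/* One-element array: lives on the stack but is used like a pointer. */
typedef struct cpumask cpumask_var_t[1];

bool alloc_cpumask_var(cpumask_var_t *mask)
{
    (void)mask;
    return true;          /* nothing to allocate */
}

void free_cpumask_var(cpumask_var_t mask)
{
    (void)mask;
}
#endif

/* Invented caller: identical source for both configurations. */
bool mark_and_test(unsigned int cpu)
{
    cpumask_var_t tmp;
    bool set;

    assert(cpu < NR_CPUS);
    if (!alloc_cpumask_var(&tmp))
        return false;     /* simply bail out if the allocation fails */

    memset(tmp, 0, sizeof(struct cpumask));
    tmp->bits[cpu / BITS_PER_LONG] |= 1UL << (cpu % BITS_PER_LONG);
    set = (tmp->bits[cpu / BITS_PER_LONG] >> (cpu % BITS_PER_LONG)) & 1UL;

    free_cpumask_var(tmp);
    return set;
}
```

Building with -DCPUMASK_OFFSTACK switches the same caller from a stack-resident mask to a heap-allocated one.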
     
  • Impact: stack reduction for large NR_CPUS

    Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
    stack space.

    We simply return if the allocation fails: since we don't use it we
    could just pass NULL to cpupri_find and have it handle that.

    Signed-off-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Rusty Russell
     
  • Impact: (future) size reduction for large NR_CPUS.

    Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
    space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
    is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.

    def_root_domain is static, and so its masks are initialized with
    alloc_bootmem_cpumask_var. After that, alloc_cpumask_var is used.

    Signed-off-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Rusty Russell
     
  • Impact: trivial wrap of member accesses

    This eases the transition in the next patch.

    We also get rid of a temporary cpumask in find_idlest_cpu() thanks to
    for_each_cpu_and, and sched_balance_self() due to getting weight before
    setting sd to NULL.

    Signed-off-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Rusty Russell
     

07 Nov, 2008

1 commit

  • We have a test case which measures the variation in the amount of time
    needed to perform a fixed amount of work on the preempt_rt kernel. We
    started seeing deterioration in its performance recently. The test
    should never take more than 10 microseconds, but we started seeing a 5-10%
    failure rate.

    Using the elimination method, we traced the problem to commit
    1b12bbc747560ea68bcc132c3d05699e52271da0 (lockdep: re-annotate
    scheduler runqueues).

    When LOCKDEP is disabled, this patch only adds an additional function
    call to double_unlock_balance(). Hence I inlined double_unlock_balance()
    and the problem went away. Here is a patch to make this change.

    Signed-off-by: Sripathi Kodi
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Sripathi Kodi
     

03 Nov, 2008

1 commit

  • Impact: micro-optimization to SCHED_FIFO/RR scheduling

    A very minor improvement, but might it be better to check sched_rt_runtime(rt_rq)
    before taking the rt_runtime_lock?

    Peter Zijlstra observes:

    > Yes, I think its ok to do so.
    >
    > Like pointed out in the other thread, there are two races:
    >
    > - sched_rt_runtime() going to RUNTIME_INF, and that will be handled
    > properly by sched_rt_runtime_exceeded()
    >
    > - sched_rt_runtime() going to !RUNTIME_INF, and here we can miss an
    > accounting cycle, but I don't think that is something to worry too
    > much about.

    Signed-off-by: Dimitri Sivanich
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    --

    kernel/sched_rt.c | 4 ++--
    1 file changed, 2 insertions(+), 2 deletions(-)

    Dimitri Sivanich
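
The ordering change discussed in this commit can be sketched in userspace as follows. The struct layout, the lock_acquisitions counter (a stand-in for taking rt_runtime_lock) and the function name account_rt_time() are assumptions made for this example; only the idea of checking the runtime before locking comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RUNTIME_INF UINT64_MAX   /* "no bandwidth limit" marker */

struct rt_rq {
    uint64_t rt_runtime;         /* budget per period, or RUNTIME_INF */
    uint64_t rt_time;            /* time consumed in the current period */
    unsigned int lock_acquisitions; /* counts simulated lock/unlock pairs */
};

/* Hypothetical accounting step: peek at the runtime before locking. */
void account_rt_time(struct rt_rq *rt_rq, uint64_t delta)
{
    assert(rt_rq != NULL);

    /* Unlocked check: with no limit there is nothing to account. */
    if (rt_rq->rt_runtime == RUNTIME_INF)
        return;

    rt_rq->lock_acquisitions++;  /* spin_lock(&rt_rq->rt_runtime_lock) */
    rt_rq->rt_time += delta;
    /* spin_unlock(&rt_rq->rt_runtime_lock) */
}
```

The unlocked read is racy in the two ways quoted above; the sketch does not model those races, it only shows that the unlimited case never touches the lock.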
     

24 Oct, 2008

1 commit


22 Oct, 2008

1 commit

  • A patch from Henrik Austad did this:

    >> Do not declare select_task_rq as part of sched_class when CONFIG_SMP is
    >> not set.

    Peter observed:

    > While a proper cleanup, could you do it by re-arranging the methods so
    > as to not create an additional ifdef?

    Do not declare select_task_rq and some other methods as part of sched_class
    when CONFIG_SMP is not set.

    Also gather those methods to avoid CONFIG_SMP mess.

    Idea-by: Henrik Austad
    Signed-off-by: Li Zefan
    Acked-by: Peter Zijlstra
    Acked-by: Henrik Austad
    Signed-off-by: Ingo Molnar

    Li Zefan
     

20 Oct, 2008

1 commit


04 Oct, 2008

1 commit

  • While working on the new version of the code for SCHED_SPORADIC I
    noticed something strange in the present throttling mechanism. More
    specifically in the throttling timer handler in sched_rt.c
    (do_sched_rt_period_timer()) and in rt_rq_enqueue().

    The problem is that, when unthrottling a runqueue, rt_rq_enqueue() only
    asks for rescheduling if the runqueue has a sched_entity associated to
    it (i.e., rt_rq->rt_se != NULL).
    Now, if the runqueue is the root rq (which has a rt_se = NULL)
    rescheduling does not take place, and it is delayed to some undefined
    instant in the future.

    This implies some random bandwidth usage by the RT tasks under throttling.
    For instance, setting rt_runtime_us/rt_period_us = 950ms/1000ms, an RT
    task will get less than 95%. In our tests we got something varying
    between 70% and 95%.
    Using smaller time values, e.g., 95ms/100ms, things are even worse, and
    I can see values going down to 20-25%!!

    The tests we performed are simply running 'yes' as a SCHED_FIFO task,
    and checking the CPU usage with top, but we can investigate thoroughly
    if you think it is needed.

    Things go much better, for us, with the attached patch... Don't know if
    it is the best approach, but it solved the issue for us.

    Signed-off-by: Dario Faggioli
    Signed-off-by: Michael Trimarchi
    Acked-by: Peter Zijlstra
    Cc:
    Signed-off-by: Ingo Molnar

    Dario Faggioli
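
A simplified model of the unthrottling path described in this commit is sketched below. The field set, the need_resched flag (standing in for a resched request) and the function names are assumptions for illustration; the point is only that the reschedule request is made whether or not the runqueue has an associated entity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rt_rq {
    uint64_t rt_runtime;        /* budget per period */
    uint64_t rt_time;           /* time consumed so far */
    bool rt_throttled;
    unsigned int rt_nr_running; /* runnable RT tasks */
    void *rt_se;                /* NULL for the root runqueue */
    bool need_resched;          /* stand-in for a reschedule request */
};

void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
{
    if (!rt_rq->rt_nr_running)
        return;
    /* A group runqueue (rt_se != NULL) would also re-queue its entity. */
    rt_rq->need_resched = true; /* requested for the root runqueue too */
}

/* One firing of the period timer, covering 'overrun' elapsed periods. */
void do_sched_rt_period(struct rt_rq *rt_rq, unsigned int overrun)
{
    uint64_t refill;

    assert(rt_rq != NULL);
    refill = (uint64_t)overrun * rt_rq->rt_runtime;
    rt_rq->rt_time -= (rt_rq->rt_time < refill) ? rt_rq->rt_time : refill;

    if (rt_rq->rt_throttled && rt_rq->rt_time < rt_rq->rt_runtime) {
        rt_rq->rt_throttled = false;
        sched_rt_rq_enqueue(rt_rq);
    }
}
```

In this model a throttled root runqueue with one runnable task is rescheduled as soon as the timer unthrottles it, instead of waiting for an unrelated event.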
     

23 Sep, 2008

2 commits


22 Sep, 2008

1 commit

  • Lin Ming reported a 10% OLTP regression against 2.6.27-rc4.

    The difference seems to come from different preemption aggressiveness,
    which affects the cache footprint of the workload and its effective
    cache thrashing.

    Aggressively preempt a task if its avg overlap is very small; this should
    avoid the task going to sleep and finding it still running when we schedule
    back to it, saving a wakeup.

    Reported-by: Lin Ming
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Sep, 2008

1 commit

  • Overview

    This patch reworks the handling of POSIX CPU timers, including the
    ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together
    with the help of Roland McGrath, the owner and original writer of this code.

    The problem we ran into, and the reason for this rework, has to do with using
    a profiling timer in a process with a large number of threads. It appears
    that the performance of the old implementation of run_posix_cpu_timers() was
    at least O(n*3) (where "n" is the number of threads in a process) or worse.
    Everything is fine with an increasing number of threads until the time taken
    for that routine to run becomes the same as or greater than the tick time, at
    which point things degrade rather quickly.

    This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."

    Code Changes

    This rework corrects the implementation of run_posix_cpu_timers() to make it
    run in constant time for a particular machine. (Performance may vary between
    one machine and another depending upon whether the kernel is built as single-
    or multiprocessor and, in the latter case, depending upon the number of
    running processors.) To do this, at each tick we now update fields in
    signal_struct as well as task_struct. The run_posix_cpu_timers() function
    uses those fields to make its decisions.

    We define a new structure, "task_cputime," to contain user, system and
    scheduler times and use these in appropriate places:

    struct task_cputime {
            cputime_t utime;
            cputime_t stime;
            unsigned long long sum_exec_runtime;
    };

    This is included in the structure "thread_group_cputime," which is a new
    substructure of signal_struct and which varies for uniprocessor versus
    multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as
    a simple substructure, while for multiprocessor kernels it is a pointer:

    /* uniprocessor */
    struct thread_group_cputime {
            struct task_cputime totals;
    };

    /* multiprocessor */
    struct thread_group_cputime {
            struct task_cputime *totals;
    };

    We also add a new task_cputime substructure directly to signal_struct, to
    cache the earliest expiration of process-wide timers, and task_cputime also
    replaces the it_*_expires fields of task_struct (used for earliest expiration
    of thread timers). The "thread_group_cputime" structure contains process-wide
    timers that are updated via account_user_time() and friends. In the non-SMP
    case the structure is a simple aggregator; unfortunately in the SMP case that
    simplicity was not achievable due to cache-line contention between CPUs (in
    one measured case performance was actually _worse_ on a 16-cpu system than
    the same test on a 4-cpu system, due to this contention). For SMP, the
    thread_group_cputime counters are maintained as a per-cpu structure allocated
    using alloc_percpu(). The timer functions update only the timer field in
    the structure corresponding to the running CPU, obtained using per_cpu_ptr().

    We define a set of inline functions in sched.h that we use to maintain the
    thread_group_cputime structure and hide the differences between UP and SMP
    implementations from the rest of the kernel. The thread_group_cputime_init()
    function initializes the thread_group_cputime structure for the given task.
    The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
    out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
    in the per-cpu structures and fields. The thread_group_cputime_free()
    function, also a no-op for UP, in SMP frees the per-cpu structures. The
    thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
    thread_group_cputime_alloc() if the per-cpu structures haven't yet been
    allocated. The thread_group_cputime() function fills the task_cputime
    structure it is passed with the contents of the thread_group_cputime fields;
    in UP it's that simple, but in SMP it must also safely check that tsk->signal
    is non-NULL (if it is NULL, it just uses the appropriate fields of task_struct)
    and, if so, sums the per-cpu values for each online CPU. Finally, the three
    functions account_group_user_time(), account_group_system_time() and
    account_group_exec_runtime() are used by timer functions to update the
    respective fields of the thread_group_cputime structure.

    Non-SMP operation is trivial and will not be mentioned further.

    The per-cpu structure is always allocated when a task creates its first new
    thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
    It is freed at process exit via a call to thread_group_cputime_free() from
    cleanup_signal().

    All functions that formerly summed utime/stime/sum_sched_runtime values
    from all threads in the thread group now use thread_group_cputime() to
    snapshot the values in the thread_group_cputime structure or the values in
    the task structure itself if the per-cpu structure hasn't been allocated.

    Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
    The run_posix_cpu_timers() function has been split into a fast path and a
    slow path; the former safely checks whether there are any expired thread
    timers and, if not, just returns, while the slow path does the heavy lifting.
    With the dedicated thread group fields, timers are no longer "rebalanced" and
    the process_timer_rebalance() function and related code has gone away. All
    summing loops are gone and all code that used them now uses the
    thread_group_cputime() inline. When process-wide timers are set, the new
    task_cputime structure in signal_struct is used to cache the earliest
    expiration; this is checked in the fast path.
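
The per-cpu accounting and the fast-path check described above can be sketched in userspace as follows. This is a simplified model under stated assumptions: NR_CPUS is fixed at 4, a plain array stands in for alloc_percpu(), cputime_t is a plain integer, the functions take the structure and CPU number directly rather than a task, and any_timer_expired() is an invented, simplified comparison rather than the kernel's actual check.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_CPUS 4                  /* hypothetical CPU count */

typedef unsigned long cputime_t;   /* stand-in type */

struct task_cputime {
    cputime_t utime;
    cputime_t stime;
    unsigned long long sum_exec_runtime;
};

/* SMP flavour: one slot per CPU, so CPUs never write the same slot. */
struct thread_group_cputime {
    struct task_cputime totals[NR_CPUS];
};

void account_group_user_time(struct thread_group_cputime *tg, int cpu,
                             cputime_t t)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    tg->totals[cpu].utime += t;
}

void account_group_system_time(struct thread_group_cputime *tg, int cpu,
                               cputime_t t)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    tg->totals[cpu].stime += t;
}

/* Snapshot: sum the per-cpu slots into a single task_cputime. */
void thread_group_cputime(const struct thread_group_cputime *tg,
                          struct task_cputime *out)
{
    memset(out, 0, sizeof(*out));
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        out->utime += tg->totals[cpu].utime;
        out->stime += tg->totals[cpu].stime;
        out->sum_exec_runtime += tg->totals[cpu].sum_exec_runtime;
    }
}

/* Fast path: compare a sample with cached expirations (0 = not armed). */
bool any_timer_expired(const struct task_cputime *sample,
                       const struct task_cputime *expires)
{
    if (expires->utime && sample->utime >= expires->utime)
        return true;
    if (expires->stime && sample->stime >= expires->stime)
        return true;
    if (expires->sum_exec_runtime &&
        sample->sum_exec_runtime >= expires->sum_exec_runtime)
        return true;
    return false;
}
```

The cost of the snapshot depends on the number of CPUs, not on the number of threads, which is the property the rework relies on.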

    Performance

    The fix appears not to add significant overhead to existing operations. It
    generally performs the same as the current code except in two cases, one in
    which it performs slightly worse (Case 5 below) and one in which it performs
    very significantly better (Case 2 below). Overall it's a wash except in those
    two cases.

    I've since done somewhat more involved testing on a dual-core Opteron system.

    Case 1: With no itimer running, for a test with 100,000 threads, the fixed
    kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
    all of which was spent in the system. There were twice as many
    voluntary context switches with the fix as without it.

    Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
    an unmodified kernel can handle), the fixed kernel ran the test in
    eight percent of the time (5.8 seconds as opposed to 70 seconds) and
    had better tick accuracy (.012 seconds per tick as opposed to .023
    seconds per tick).

    Case 3: A 4000-thread test with an initial timer tick of .01 second and an
    interval of 10,000 seconds (i.e. a timer that ticks only once) had
    very nearly the same performance in both cases: 6.3 seconds elapsed
    for the fixed kernel versus 5.5 seconds for the unfixed kernel.

    With fewer threads (eight in these tests), the Case 1 test ran in essentially
    the same time on both the modified and unmodified kernels (5.2 seconds versus
    5.8 seconds). The Case 2 test ran in about the same time as well, 5.9 seconds
    versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
    tick versus .025 seconds per tick for the unmodified kernel.

    Since the fix affected the rlimit code, I also tested soft and hard CPU limits.

    Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
    running), the modified kernel was very slightly favored in that while
    it killed the process in 19.997 seconds of CPU time (5.002 seconds of
    wall time), only .003 seconds of that was system time, the rest was
    user time. The unmodified kernel killed the process in 20.001 seconds
    of CPU (5.014 seconds of wall time) of which .016 seconds was system
    time. Really, though, the results were too close to call. The results
    were essentially the same with no itimer running.

    Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
    (where the hard limit would never be reached) and an itimer running,
    the modified kernel exhibited worse tick accuracy than the unmodified
    kernel: .050 seconds/tick versus .028 seconds/tick. Otherwise,
    performance was almost indistinguishable. With no itimer running this
    test exhibited virtually identical behavior and times in both cases.

    In times past I did some limited performance testing. Those results are below.

    On a four-cpu Opteron system without this fix, a sixteen-thread test executed
    in 3569.991 seconds, of which user was 3568.435s and system was 1.556s. On
    the same system with the fix, user and elapsed time were about the same, but
    system time dropped to 0.007 seconds. Performance with eight, four and one
    thread were comparable. Interestingly, the timer ticks with the fix seemed
    more accurate: The sixteen-thread test with the fix received 149543 ticks
    for 0.024 seconds per tick, while the same test without the fix received 58720
    for 0.061 seconds per tick. Both cases were configured for an interval of
    0.01 seconds. Again, the other tests were comparable. Each thread in this
    test computed the primes up to 25,000,000.

    I also did a test with a large number of threads, 100,000 threads, which is
    impossible without the fix. In this case each thread computed the primes only
    up to 10,000 (to make the runtime manageable). System time dominated, at
    1546.968 seconds out of a total 2176.906 seconds (giving a user time of
    629.938s). It received 147651 ticks for 0.015 seconds per tick, still quite
    accurate. There is obviously no comparable test without the fix.

    Signed-off-by: Frank Mayhar
    Cc: Roland McGrath
    Cc: Alexey Dobriyan
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar

    Frank Mayhar
     

11 Sep, 2008

1 commit

  • On my tulsa x86-64 machine, kernel 2.6.25-rc5 randomly failed to boot.

    Basically, function __enable_runtime forgets to reset rt_rq->rt_throttled
    to 0. When every cpu is up, the per-cpu migration_thread is created and it
    runs very fast, sometimes setting the corresponding rt_rq->rt_throttled to 1
    very quickly. After all cpus are up, with the calling chain below:

    sched_init_smp => arch_init_sched_domains => build_sched_domains => ...
    => cpu_attach_domain => rq_attach_root => set_rq_online => ...
    => __enable_runtime

    __enable_runtime is called against every rt_rq again, so rt_rq->rt_time is
    reset to 0, but rt_rq->rt_throttled might still be 1. Later on, function
    do_sched_rt_period_timer couldn't reset it, and no RT task could be
    scheduled to run on that cpu. One such RT task is migration_thread, which is
    woken up when a task is migrated to another cpu.

    The patch below fixes it against 2.6.27-rc5.

    Signed-off-by: Zhang Yanmin
    Signed-off-by: Ingo Molnar

    Zhang, Yanmin
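
The stale-flag problem can be modelled with a small userspace sketch. The struct, enable_runtime() and period_timer() below are invented simplifications; in particular, the assumption that the timer only examines runqueues with non-zero consumed time is part of this model, made to show how a throttled flag can survive once rt_time has been zeroed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rt_rq {
    uint64_t rt_runtime;   /* budget per period */
    uint64_t rt_time;      /* time consumed so far */
    bool rt_throttled;
};

/* Fixed version: clears the throttled flag along with the counters. */
void enable_runtime(struct rt_rq *rt_rq, uint64_t runtime)
{
    assert(rt_rq != NULL);
    rt_rq->rt_runtime = runtime;
    rt_rq->rt_time = 0;
    rt_rq->rt_throttled = false;   /* the reset this commit adds */
}

/* Model of the period timer: idle runqueues are skipped entirely. */
void period_timer(struct rt_rq *rt_rq)
{
    uint64_t refill;

    if (!rt_rq->rt_time)
        return;            /* nothing consumed: a stale flag survives */

    refill = rt_rq->rt_runtime;
    rt_rq->rt_time -= (rt_rq->rt_time < refill) ? rt_rq->rt_time : refill;
    if (rt_rq->rt_throttled && rt_rq->rt_time < rt_rq->rt_runtime)
        rt_rq->rt_throttled = false;
}
```

Starting from rt_time = 0 with rt_throttled still set (the state the unfixed code could leave behind), period_timer() never clears the flag, while enable_runtime() as written here does.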
     

28 Aug, 2008

2 commits

  • It fixes an accounting bug where we would continue accumulating runtime
    even though the bandwidth control is disabled. This would lead to very long
    throttle periods once bandwidth control gets turned on again.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • When sysctl_sched_rt_runtime is set to something other than -1 and the
    CONFIG_RT_GROUP_SCHED kernel parameter is NOT enabled, we get into a state
    where we see one or more CPUs idling forever even though there are
    real-time tasks in their rt runqueue that are able to run (no longer
    throttled).

    The sequence is:

    - A real-time task is running when the timer sets the rt runqueue
    to throttled, and the rt task is resched_task()ed and switched
    out, and idle is switched in since there are no non-rt tasks to
    run on that cpu.

    - Eventually the do_sched_rt_period_timer() runs and un-throttles
    the rt runqueue, but we just exit the timer interrupt and go back
    to executing the idle task in the idle loop forever.

    If we change the sched_rt_rq_enqueue() routine to use some of the code
    from the CONFIG_RT_GROUP_SCHED enabled version of this same routine and
    resched_task() the currently executing task (idle in our case) if it is
    a lower priority task than the higher rt task in the now un-throttled
    runqueue, the problem is no longer observed.

    Signed-off-by: John Blackwood
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    John Blackwood
     

19 Aug, 2008

2 commits


14 Aug, 2008

1 commit

  • When we hot-unplug a cpu and rebuild the sched-domain, all cpus will be
    detached. Alex observed the case where a runqueue was stealing bandwidth
    from an already disabled runqueue to satisfy its own needs.

    Stop this by skipping over already disabled runqueues.

    Reported-by: Alex Nixon
    Signed-off-by: Peter Zijlstra
    Tested-by: Alex Nixon
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
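
The idea of skipping disabled runqueues while borrowing bandwidth can be sketched as below. The disabled flag, the array of runqueues and borrow_runtime() are assumptions introduced for this example; the kernel's own bookkeeping and its way of marking a runqueue as disabled are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct rt_rq {
    uint64_t rt_runtime;   /* budget currently assigned */
    uint64_t rt_time;      /* time consumed so far */
    bool disabled;         /* hypothetical marker for a torn-down runqueue */
};

/*
 * Borrow spare runtime from the other runqueues until rqs[self] reaches
 * 'period'. Returns the total amount borrowed.
 */
uint64_t borrow_runtime(struct rt_rq *rqs, size_t n, size_t self,
                        uint64_t period)
{
    struct rt_rq *me;
    uint64_t borrowed = 0;

    assert(self < n);
    me = &rqs[self];

    for (size_t i = 0; i < n && me->rt_runtime < period; i++) {
        struct rt_rq *iter = &rqs[i];
        uint64_t spare;

        if (i == self || iter->disabled)
            continue;      /* never steal from a disabled runqueue */
        if (iter->rt_runtime <= iter->rt_time)
            continue;      /* nothing spare here */

        spare = iter->rt_runtime - iter->rt_time;
        if (spare > period - me->rt_runtime)
            spare = period - me->rt_runtime;

        iter->rt_runtime -= spare;
        me->rt_runtime += spare;
        borrowed += spare;
    }
    return borrowed;
}
```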
     

11 Aug, 2008

1 commit

  • Instead of using a per-rq lock class, use the regular nesting operations.

    However, take extra care with double_lock_balance() as it can release the
    already held rq->lock (and therefore change its nesting class).

    So what can happen is:

    spin_lock(rq->lock); // this rq subclass 0

    double_lock_balance(rq, other_rq);
    // release rq
    // acquire other_rq->lock subclass 0
    // acquire rq->lock subclass 1

    spin_unlock(other_rq->lock);

    leaving you with rq->lock in subclass 1

    So a subsequent double_lock_balance() call can try to nest a subclass 1
    lock while already holding a subclass 1 lock.

    Fix this by introducing double_unlock_balance() which releases the other
    rq's lock, but also re-sets the subclass for this rq's lock to 0.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
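
The subclass bookkeeping described in this commit can be modelled without real locks. In the sketch below, the locked and subclass fields imitate what lockdep tracks, lock ordering is decided by a hypothetical id field, and same_subclass_held() is an invented helper that reports the situation lockdep would complain about.

```c
#include <assert.h>
#include <stdbool.h>

struct rq {
    int id;         /* hypothetical ordering key for lock acquisition */
    bool locked;
    int subclass;   /* nesting subclass recorded when the lock was taken */
};

void rq_lock_nested(struct rq *rq, int subclass)
{
    assert(!rq->locked);
    rq->locked = true;
    rq->subclass = subclass;
}

void rq_unlock(struct rq *rq)
{
    assert(rq->locked);
    rq->locked = false;
}

/* Take busiest's lock while holding this_rq's, keeping a fixed order. */
void double_lock_balance(struct rq *this_rq, struct rq *busiest)
{
    if (busiest->id < this_rq->id) {
        rq_unlock(this_rq);            /* release rq */
        rq_lock_nested(busiest, 0);    /* other rq: subclass 0 */
        rq_lock_nested(this_rq, 1);    /* rq again: now subclass 1 */
    } else {
        rq_lock_nested(busiest, 1);
    }
}

/* Release busiest and put this_rq's lock back into subclass 0. */
void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
{
    rq_unlock(busiest);
    this_rq->subclass = 0;
}

/* True when both locks are held in the same subclass. */
bool same_subclass_held(const struct rq *a, const struct rq *b)
{
    return a->locked && b->locked && a->subclass == b->subclass;
}
```

Releasing the other lock with a plain rq_unlock() leaves this_rq in subclass 1, so a later double_lock_balance() against a higher-id runqueue ends up holding two subclass 1 locks; using double_unlock_balance() avoids that in this model.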
     

25 Jul, 2008

1 commit


24 Jul, 2008

2 commits

  • Reported-by: Daniel Walker
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • * 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: hrtick_enabled() should use cpu_active()
    sched, x86: clean up hrtick implementation
    sched: fix build error, provide partition_sched_domains() unconditionally
    sched: fix warning in inc_rt_tasks() to not declare variable 'rq' if it's not needed
    cpu hotplug: Make cpu_active_map synchronization dependency clear
    cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2)
    sched: rework of "prioritize non-migratable tasks over migratable ones"
    sched: reduce stack size in isolated_cpu_setup()
    Revert parts of "ftrace: do not trace scheduler functions"

    Fixed up conflicts in include/asm-x86/thread_info.h (due to the
    TIF_SINGLESTEP unification vs TIF_HRTICK_RESCHED removal) and
    kernel/sched_fair.c (due to cpu_active_map vs for_each_cpu_mask_nr()
    introduction).

    Linus Torvalds
     

20 Jul, 2008

1 commit


18 Jul, 2008

3 commits

  • Fix inc_rt_tasks() to not declare variable 'rq' if it's not needed. It is
    declared if CONFIG_SMP or CONFIG_RT_GROUP_SCHED, but only used if CONFIG_SMP.

    This is a consequence of patch 1f11eb6a8bc92536d9e93ead48fa3ffbd1478571 plus
    patch 1100ac91b6af02d8639d518fad5b434b1bf44ed6.

    Signed-off-by: David Howells
    Signed-off-by: Ingo Molnar

    David Howells
     
  • This is based on Linus' idea of creating cpu_active_map that prevents
    scheduler load balancer from migrating tasks to the cpu that is going
    down.

    It allows us to simplify domain management code and avoid unnecessary
    domain rebuilds during cpu hotplug event handling.

    Please ignore the cpusets part for now. It needs some more work in order
    to avoid crazy lock nesting. Although I did simplify and unify domain
    reinitialization logic. We now simply call partition_sched_domains() in
    all the cases. This means that we're using the exact same code paths as in
    the cpusets case and hence the tests below cover cpusets too.
    Cpuset changes to make rebuild_sched_domains() callable from various
    contexts are in the separate patch (right next after this one).

    This not only boots but also easily handles
    while true; do make clean; make -j 8; done
    and
    while true; do on-off-cpu 1; done
    at the same time.
    (on-off-cpu 1 simply does the echo 0/1 > /sys/.../cpu1/online thing).

    Surprisingly the box (dual-core Core2) is quite usable. In fact I'm typing
    this on it right now in gnome-terminal and things are moving just fine.

    Also this is running with most of the debug features enabled (lockdep,
    mutex, etc); no BUG_ONs or lockdep complaints so far.

    I believe I addressed all of Dmitry's comments on Linus' original
    version. I changed both fair and rt balancer to mask out non-active cpus.
    And replaced cpu_is_offline() with !cpu_active() in the main scheduler
    code where it made sense (to me).

    Signed-off-by: Max Krasnyanskiy
    Acked-by: Linus Torvalds
    Acked-by: Peter Zijlstra
    Acked-by: Gregory Haskins
    Cc: dmitry.adamushko@gmail.com
    Cc: pj@sgi.com
    Signed-off-by: Ingo Molnar

    Max Krasnyansky
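
Masking out non-active CPUs when choosing a migration target can be sketched as below. The 64-bit masks and the pick_target_cpu() helper are assumptions for this example (the kernel uses its cpumask types and its own balancer logic); cpu_active() here is a userspace imitation that takes the map as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per CPU; this sketch supports at most 64 CPUs. */
bool cpu_active(unsigned int cpu, uint64_t active_map)
{
    assert(cpu < 64);
    return (active_map >> cpu) & 1;
}

/*
 * Pick the lowest-numbered CPU that the task may run on and that is still
 * active. Returns -1 if no such CPU exists.
 */
int pick_target_cpu(uint64_t allowed, uint64_t active_map)
{
    for (unsigned int cpu = 0; cpu < 64; cpu++)
        if (((allowed >> cpu) & 1) && cpu_active(cpu, active_map))
            return (int)cpu;
    return -1;
}
```

A CPU that is going down is cleared from the active map first, so it stops being a migration target even though it may still be online for a short while.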
     
  • (1) handle in a generic way all cases when a newly woken-up task is
    not migratable (not just a corner case when "rt_se->nr_cpus_allowed ==
    1")

    (2) if current is to be preempted, then make sure "p" will be picked
    up by pick_next_task_rt().
    i.e. move task's group at the head of its list as well.

    Currently, this is not the case for group scheduling, as described
    here: http://www.ussg.iu.edu/hypermail/linux/kernel/0807.0/0134.html

    Signed-off-by: Dmitry Adamushko
    Cc: Steven Rostedt
    Cc: Gregory Haskins
    Signed-off-by: Ingo Molnar

    Dmitry Adamushko
     

16 Jul, 2008

1 commit


06 Jul, 2008

1 commit


27 Jun, 2008

1 commit