10 Jul, 2009

1 commit

  • Fixes an easily triggerable BUG() when setting process affinities.

    Make sure to count the number of migratable tasks in the same place:
    the root rt_rq. Otherwise the number doesn't make sense and we'll hit
    the BUG in set_cpus_allowed_rt().

    Also, make sure we only count tasks, not groups (this is probably
    already taken care of by the fact that rt_se->nr_cpus_allowed will be 0
    for groups, but be more explicit). A simplified model of this counting
    follows the entry.

    Tested-by: Thomas Gleixner
    CC: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Acked-by: Gregory Haskins
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
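
    The counting rule above can be illustrated with a small userspace
    model. This is only a sketch: the struct and function names below
    (rt_rq, inc_rt_migration, ...) are simplified stand-ins, not the
    kernel code.

      #include <assert.h>
      #include <stdbool.h>
      #include <stdio.h>

      /* Simplified model: only the root rt_rq carries the migratable count. */
      struct rt_rq { int rt_nr_migratory; };

      struct entity {
          bool is_task;          /* groups are entities but not tasks */
          int  nr_cpus_allowed;  /* 0 for groups, >= 1 for tasks      */
      };

      /* Count only real tasks that may run on more than one CPU. */
      static void inc_rt_migration(struct rt_rq *root, const struct entity *se)
      {
          if (se->is_task && se->nr_cpus_allowed > 1)
              root->rt_nr_migratory++;
      }

      static void dec_rt_migration(struct rt_rq *root, const struct entity *se)
      {
          if (se->is_task && se->nr_cpus_allowed > 1) {
              assert(root->rt_nr_migratory > 0); /* the BUG() analogue */
              root->rt_nr_migratory--;
          }
      }

      int main(void)
      {
          struct rt_rq root = { 0 };
          struct entity pinned  = { true, 1 };   /* affinity: one CPU   */
          struct entity movable = { true, 4 };   /* affinity: four CPUs */
          struct entity group   = { false, 0 };  /* group entity        */

          inc_rt_migration(&root, &pinned);
          inc_rt_migration(&root, &movable);
          inc_rt_migration(&root, &group);
          printf("migratable: %d\n", root.rt_nr_migratory); /* prints 1 */

          dec_rt_migration(&root, &movable);
          return 0;
      }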
     

09 Jun, 2009

1 commit


08 Apr, 2009

1 commit


01 Apr, 2009

1 commit


28 Mar, 2009

1 commit


09 Feb, 2009

1 commit


01 Feb, 2009

1 commit


16 Jan, 2009

1 commit

  • Ingo Molnar wrote:

    > here's a new build failure with tip/sched/rt:
    >
    > LD .tmp_vmlinux1
    > kernel/built-in.o: In function `set_curr_task_rt':
    > sched.c:(.text+0x3675): undefined reference to `plist_del'
    > kernel/built-in.o: In function `pick_next_task_rt':
    > sched.c:(.text+0x37ce): undefined reference to `plist_del'
    > kernel/built-in.o: In function `enqueue_pushable_task':
    > sched.c:(.text+0x381c): undefined reference to `plist_del'

    Eliminate the plist library Kconfig option and make plist available
    unconditionally, so the scheduler code can always link against
    plist_del() and friends. (A sketch of a priority-sorted list follows
    this entry.)

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
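
    For context, what the scheduler needs from plist is a list kept sorted
    by priority with cheap removal. The snippet below is a tiny userspace
    stand-in for that idea; it is not the kernel's plist API.

      #include <stdio.h>

      struct pnode {
          int prio;               /* lower value = higher priority */
          struct pnode *next;
      };

      /* Insert keeping the list sorted by ascending prio. */
      static void padd(struct pnode **head, struct pnode *n)
      {
          while (*head && (*head)->prio <= n->prio)
              head = &(*head)->next;
          n->next = *head;
          *head = n;
      }

      /* Unlink a node from the list (the plist_del() counterpart here). */
      static void pdel(struct pnode **head, struct pnode *n)
      {
          while (*head && *head != n)
              head = &(*head)->next;
          if (*head)
              *head = n->next;
      }

      int main(void)
      {
          struct pnode a = { 3, NULL }, b = { 1, NULL }, c = { 2, NULL };
          struct pnode *head = NULL;

          padd(&head, &a);
          padd(&head, &b);
          padd(&head, &c);
          pdel(&head, &c);
          for (struct pnode *p = head; p; p = p->next)
              printf("%d ", p->prio);      /* prints: 1 3 */
          printf("\n");
          return 0;
      }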
     

14 Jan, 2009

2 commits


12 Jan, 2009

1 commit


11 Jan, 2009

1 commit


04 Jan, 2009

2 commits

  • Impact: prevents panic from stack overflow on numa-capable machines.

    Some of the "removal of stack hogs" changes in kernel/sched.c that used
    node_to_cpumask_ptr were undone by the early cpumask API updates, causing
    a panic due to stack overflow. This patch restores them by using
    cpumask_of_node(), which returns a 'const struct cpumask *'.

    In addition, cpu_coregroup_map is replaced with cpu_coregroup_mask, further
    reducing stack usage. (Both of these updates removed 9 FIXMEs!) A small
    model of the pointer-vs-copy difference follows this entry.

    Also:
    Pick up some remaining changes from the old 'cpumask_t' functions to
    the new 'struct cpumask *' functions.

    Optimize memory traffic by allocating each percpu local_cpu_mask on the
    same node as the referring cpu.

    Signed-off-by: Mike Travis
    Acked-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Mike Travis
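
    A minimal model of the stack-usage difference between returning a
    cpumask by value and handing out a 'const struct cpumask *'. The
    NR_CPUS value and function names here are illustrative, not the
    kernel's.

      #include <stdio.h>

      #define NR_CPUS 4096
      struct cpumask { unsigned long bits[NR_CPUS / (8 * sizeof(unsigned long))]; };

      static struct cpumask node_masks[2];     /* per-node masks, set up once */

      /* Old style: by-value copy lands ~512 bytes on the caller's stack. */
      static struct cpumask node_cpumask_copy(int node)
      {
          return node_masks[node];
      }

      /* New style: a pointer to the shared, read-only mask -- no copy. */
      static const struct cpumask *node_cpumask_ptr(int node)
      {
          return &node_masks[node];
      }

      int main(void)
      {
          struct cpumask local = node_cpumask_copy(0);
          const struct cpumask *ref = node_cpumask_ptr(0);

          printf("by value: %zu bytes, by pointer: %zu bytes\n",
                 sizeof(local), sizeof(ref));
          return 0;
      }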
     
  • …ux-2.6-cpumask into merge-rr-cpumask

    Conflicts:
    arch/x86/kernel/io_apic.c
    kernel/rcuclassic.c
    kernel/sched.c
    kernel/time/tick-sched.c

    Signed-off-by: Mike Travis <travis@sgi.com>
    [ mingo@elte.hu: backmerged typo fix for io_apic.c ]
    Signed-off-by: Ingo Molnar <mingo@elte.hu>

    Mike Travis
     

29 Dec, 2008

8 commits

  • A panic was discovered by Chirag Jog where a BUG_ON sanity check
    in the new "pushable_task" logic would trigger a panic under
    certain circumstances:

    http://lkml.org/lkml/2008/9/25/189

    Gilles Carry discovered that the root cause was attributed to the
    pushable_tasks list getting corrupted in the push_rt_task logic.
    This was the result of a dropped rq lock in double_lock_balance
    allowing a task in the process of being pushed to potentially migrate
    away, and thus corrupt the pushable_tasks() list.

    I traced the problem back to the pushable_tasks patch that went in
    recently. There is a "retry" path in push_rt_task()
    that actually had a compound conditional to decide whether to
    retry or exit. I missed the meaning behind the rationale for the
    virtual "if(!task) goto out;" portion of the compound statement and
    thus did not handle it properly. The new pushable_tasks logic
    actually creates three distinct conditions:

    1) an untouched and unpushable task should be dequeued
    2) a migrated task where more pushable tasks remain should be retried
    3) a migrated task where no more pushable tasks exist should exit

    The original logic mushed (1) and (3) together, resulting in the
    system dequeuing a migrated task (against an unlocked foreign run-queue
    nonetheless).

    To fix this, we get rid of the notion of "paranoid" and support the
    three distinct conditions properly. The paranoid feature is no longer
    relevant with the new pushable logic (since pushable naturally limits
    the loop) anyway, so let's just remove it. A small model of the
    three-way decision follows this entry.

    Reported-By: Chirag Jog
    Found-by: Gilles Carry
    Signed-off-by: Gregory Haskins

    Gregory Haskins
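
    The three cases above map to a simple decision. A compilable toy
    version follows; the names are made up for illustration and this is
    not push_rt_task() itself.

      #include <stdbool.h>
      #include <stdio.h>

      /* The three outcomes described above, in a simplified model of the
       * retry decision (illustrative names only). */
      enum push_result { DEQUEUE_TASK, RETRY_PUSH, GIVE_UP };

      static enum push_result classify(bool task_migrated_away, bool more_pushable)
      {
          if (!task_migrated_away)
              return DEQUEUE_TASK;   /* (1) untouched and unpushable: dequeue */
          if (more_pushable)
              return RETRY_PUSH;     /* (2) task moved, others remain: retry  */
          return GIVE_UP;            /* (3) task moved, nothing left: exit    */
      }

      int main(void)
      {
          /* The bug was treating case (3) like case (1): dequeuing a task
           * that had already migrated to another (unlocked) run-queue. */
          printf("%d %d %d\n",
                 classify(false, false),   /* 0: dequeue */
                 classify(true,  true),    /* 1: retry   */
                 classify(true,  false));  /* 2: give up */
          return 0;
      }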
     
  • The RT scheduler employs a "push/pull" design to actively balance tasks
    within the system (on a per disjoint cpuset basis). When a task is
    awoken, it is immediately determined if there are any lower priority
    cpus which should be preempted. This is opposed to the way normal
    SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
    operation to occur before spreading out load.

    When a particular RQ has more than 1 active RT task, it is said to
    be in an "overloaded" state. Once this occurs, the system enters
    the active balancing mode, where it will try to push the task away,
    or persuade a different cpu to pull it over. The system will stay
    in this state until the RQ falls back to having at most one queued RT
    task.

    To make that balancing cheaper, this commit introduces a priority-sorted
    "pushable_tasks" list so the push logic can pick the next candidate
    directly instead of rescanning the whole queue. This reduces latency
    both by shortening the critical section of the rq->lock for certain
    workloads, and by making sure the algorithm considers all eligible
    tasks in the system. (A simplified model follows this entry.)

    [ rostedt: added a couple more BUG_ONs ]

    Signed-off-by: Gregory Haskins
    Acked-by: Steven Rostedt

    Gregory Haskins
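
    A toy illustration of the overload notion described above: entering
    and leaving the "more than one queued RT task" state. The names are
    invented for the sketch.

      #include <stdbool.h>
      #include <stdio.h>

      /* Toy run-queue: overload is simply "more than one queued RT task". */
      struct rq { int rt_nr_running; bool overloaded; };

      static void update_overload(struct rq *rq)
      {
          rq->overloaded = rq->rt_nr_running > 1;
      }

      static void enqueue_rt(struct rq *rq) { rq->rt_nr_running++; update_overload(rq); }
      static void dequeue_rt(struct rq *rq) { rq->rt_nr_running--; update_overload(rq); }

      int main(void)
      {
          struct rq rq = { 0, false };

          enqueue_rt(&rq);                 /* 1 task: not overloaded      */
          enqueue_rt(&rq);                 /* 2 tasks: push/pull kicks in */
          printf("overloaded: %d\n", rq.overloaded);
          dequeue_rt(&rq);                 /* back to 1: overload clears  */
          printf("overloaded: %d\n", rq.overloaded);
          return 0;
      }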
     
    We currently run class->post_schedule() outside of the rq->lock, which
    means that we need to test for the need to post_schedule outside of
    the lock to avoid a forced reacquisition. This is currently not a problem
    as we only look at rq->rt.overloaded. However, we want to enhance this
    going forward to look at more state, to reduce the need to post_schedule
    to a bare minimum set. Therefore, we introduce a new member function,
    needs_post_schedule(), which tests for the post_schedule condition
    without actually performing the work. It is therefore safe to call this
    function before the rq->lock is released, because we are guaranteed not
    to drop the lock at an intermediate point (such as what post_schedule()
    may do).

    We will use this later in the series. A sketch of the check-then-do
    split follows this entry.

    [ rostedt: removed paranoid BUG_ON ]

    Signed-off-by: Gregory Haskins

    Gregory Haskins
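
    A minimal sketch of the check-then-act split described above: test a
    cheap condition while still holding the lock, then do the heavier work
    after releasing it. The mutex and function names are illustrative only.

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
      static bool rt_overloaded = true;     /* stand-in for rq state */

      /* Cheap test: safe under the lock, never drops it. */
      static bool needs_post_schedule(void)
      {
          return rt_overloaded;
      }

      /* Heavier work that may itself juggle locks, so run it unlocked. */
      static void post_schedule(void)
      {
          printf("balancing after schedule\n");
      }

      int main(void)
      {
          bool post;

          pthread_mutex_lock(&rq_lock);
          /* ... pick next task ... */
          post = needs_post_schedule();    /* decide while still locked  */
          pthread_mutex_unlock(&rq_lock);

          if (post)
              post_schedule();             /* act after the lock is gone */
          return 0;
      }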
     
    There is no sense in wasting time trying to push away a task that
    cannot move anywhere else. We gain no benefit from trying to push
    other tasks at this point, so if the task being woken up is
    non-migratable, just skip the whole operation. This reduces overhead
    in the wakeup path for certain tasks. (A sketch follows this entry.)

    Signed-off-by: Gregory Haskins

    Gregory Haskins
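
    The early-out described above amounts to one extra check on the wakeup
    path. A hedged sketch with illustrative names:

      #include <stdbool.h>
      #include <stdio.h>

      struct task { int nr_cpus_allowed; };

      static void push_rt_tasks(void) { printf("trying to push tasks\n"); }

      /* On wakeup: a task pinned to a single CPU cannot be pushed anywhere,
       * so skip the push machinery entirely. */
      static void task_woken(const struct task *p)
      {
          if (p->nr_cpus_allowed <= 1)
              return;
          push_rt_tasks();
      }

      int main(void)
      {
          struct task pinned = { 1 }, movable = { 4 };

          task_woken(&pinned);   /* prints nothing   */
          task_woken(&movable);  /* attempts a push  */
          return 0;
      }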
     
    We currently take the rq->lock for every cpu in an overload state during
    pull_rt_tasks(). However, we now have enough information via the
    highest_prio.[curr|next] fields to determine whether there are any tasks
    of interest to warrant the overhead of the rq->lock before we actually
    take it. So we use this information to reduce lock contention during the
    pull for the case where the source rq doesn't have tasks that preempt
    the current task. (A sketch of this early check follows this entry.)

    Signed-off-by: Gregory Haskins

    Gregory Haskins
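
    A sketch of the early check described above: look at the source CPU's
    cached highest priority and skip the expensive lock when it cannot
    preempt us anyway. Field and function names are stand-ins.

      #include <stdbool.h>
      #include <stdio.h>

      /* Lower number = higher priority, as in the kernel. */
      struct rq { int highest_prio_curr; };

      static bool worth_locking(const struct rq *src, const struct rq *dst)
      {
          /* Only take src's lock if src holds something that would preempt
           * what dst is currently running. */
          return src->highest_prio_curr < dst->highest_prio_curr;
      }

      int main(void)
      {
          struct rq dst       = { 10 };
          struct rq busy_low  = { 50 };  /* nothing interesting to pull */
          struct rq busy_high = { 2 };   /* holds a higher-prio task    */

          printf("lock low-prio src?  %d\n", worth_locking(&busy_low, &dst));
          printf("lock high-prio src? %d\n", worth_locking(&busy_high, &dst));
          return 0;
      }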
     
  • highest_prio.curr is actually a more accurate way to keep track of
    the pull_rt_task() threshold since it is always up to date, even
    if the "next" task migrates during double_lock. Therefore, stop
    looking at the "next" task object and simply use the highest_prio.curr.

    Signed-off-by: Gregory Haskins

    Gregory Haskins
     
    We will use this later in the series to reduce the amount of rq-lock
    contention during a pull operation.

    Signed-off-by: Gregory Haskins

    Gregory Haskins
     
    Move some common definitions up to the function prologue to simplify the
    body logic.

    Signed-off-by: Gregory Haskins

    Gregory Haskins
     

25 Dec, 2008

1 commit


17 Dec, 2008

1 commit


12 Dec, 2008

1 commit


29 Nov, 2008

1 commit

  • Move double_lock_balance()/double_unlock_balance() higher to fix the following
    with gcc-3.4.6:

    CC kernel/sched.o
    In file included from kernel/sched.c:1605:
    kernel/sched_rt.c: In function `find_lock_lowest_rq':
    kernel/sched_rt.c:914: sorry, unimplemented: inlining failed in call to 'double_unlock_balance': function body not available
    kernel/sched_rt.c:1077: sorry, unimplemented: called from here
    make[2]: *** [kernel/sched.o] Error 1

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Ingo Molnar

    Alexey Dobriyan
     

26 Nov, 2008

1 commit


25 Nov, 2008

5 commits

  • Impact: Trivial API conversion

    NR_CPUS -> nr_cpu_ids
    cpumask_t -> struct cpumask
    sizeof(cpumask_t) -> cpumask_size()
    cpumask_a = cpumask_b -> cpumask_copy(&cpumask_a, &cpumask_b)

    cpu_set() -> cpumask_set_cpu()
    first_cpu() -> cpumask_first()
    cpumask_of_cpu() -> cpumask_of()
    cpus_* -> cpumask_*

    There are some FIXMEs where we need all archs to complete the
    infrastructure (patches have been sent):

    cpu_coregroup_map -> cpu_coregroup_mask
    node_to_cpumask* -> cpumask_of_node

    There is also one FIXME where we pass an array of cpumasks to
    partition_sched_domains(): this implies knowing the definition of
    'struct cpumask' and the size of a cpumask. This will be fixed in a
    future patch.

    Signed-off-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Rusty Russell
     
  • Impact: (future) size reduction for large NR_CPUS.

    Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
    space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
    is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK. (A simplified
    model of cpumask_var_t follows this entry.)

    Signed-off-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Rusty Russell
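
    A simplified model of the cpumask_var_t idea: with an "offstack" build
    option the mask lives in heap memory and must be allocated, otherwise
    it is a plain struct. The macro name and helpers below mimic the
    kernel's shape but are not its implementation (the real
    alloc_cpumask_var also takes gfp flags).

      #include <stdbool.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      #define NR_CPUS 4096
      struct cpumask { unsigned long bits[NR_CPUS / (8 * sizeof(unsigned long))]; };

      #ifdef CPUMASK_OFFSTACK
      typedef struct cpumask *cpumask_var_t;       /* pointer: needs allocation */

      static bool alloc_cpumask_var(cpumask_var_t *mask)
      {
          *mask = calloc(1, sizeof(struct cpumask));
          return *mask != NULL;
      }
      static void free_cpumask_var(cpumask_var_t mask) { free(mask); }
      #else
      typedef struct cpumask cpumask_var_t[1];     /* just a struct, no alloc */

      static bool alloc_cpumask_var(cpumask_var_t *mask)
      {
          memset(*mask, 0, sizeof(struct cpumask));
          return true;
      }
      static void free_cpumask_var(cpumask_var_t mask) { (void)mask; }
      #endif

      int main(void)
      {
          cpumask_var_t mask;

          if (!alloc_cpumask_var(&mask))
              return 1;                    /* callers must handle failure */
          printf("mask storage: %zu bytes\n", sizeof(struct cpumask));
          free_cpumask_var(mask);
          return 0;
      }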
     
  • Impact: stack reduction for large NR_CPUS

    Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
    stack space.

    We simply return if the allocation fails: since we don't use it we
    could just pass NULL to cpupri_find and have it handle that.

    Signed-off-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Rusty Russell
     
  • Impact: (future) size reduction for large NR_CPUS.

    Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
    space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
    is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.

    def_root_domain is static, and so its masks are initialized with
    alloc_bootmem_cpumask_var. After that, alloc_cpumask_var is used.

    Signed-off-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Rusty Russell
     
  • Impact: trivial wrap of member accesses

    This eases the transition in the next patch.

    We also get rid of a temporary cpumask in find_idlest_cpu() thanks to
    for_each_cpu_and, and of one in sched_balance_self() by reading the
    weight before setting sd to NULL.

    Signed-off-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Rusty Russell
     

07 Nov, 2008

1 commit

    We have a test case which measures the variation in the amount of time
    needed to perform a fixed amount of work on the preempt_rt kernel. We
    started seeing deterioration in its performance recently. The test
    should never take more than 10 microseconds, but we started seeing a
    5-10% failure rate.

    Using the elimination method, we traced the problem to commit
    1b12bbc747560ea68bcc132c3d05699e52271da0 (lockdep: re-annotate
    scheduler runqueues).

    When LOCKDEP is disabled, this patch only adds an additional function
    call to double_unlock_balance(). Hence I inlined double_unlock_balance()
    and the problem went away. Here is a patch to make this change.

    Signed-off-by: Sripathi Kodi
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Sripathi Kodi
     

03 Nov, 2008

1 commit

  • Impact: micro-optimization to SCHED_FIFO/RR scheduling

    A very minor improvement, but might it be better to check
    sched_rt_runtime(rt_rq) before taking the rt_runtime_lock? (A sketch of
    this check-before-lock pattern follows this entry.)

    Peter Zijlstra observes:

    > Yes, I think its ok to do so.
    >
    > Like pointed out in the other thread, there are two races:
    >
    > - sched_rt_runtime() going to RUNTIME_INF, and that will be handled
    > properly by sched_rt_runtime_exceeded()
    >
    > - sched_rt_runtime() going to !RUNTIME_INF, and here we can miss an
    > accounting cycle, but I don't think that is something to worry too
    > much about.

    Signed-off-by: Dimitri Sivanich
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    --

    kernel/sched_rt.c | 4 ++--
    1 file changed, 2 insertions(+), 2 deletions(-)

    Dimitri Sivanich
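
    A sketch of the micro-optimization discussed above: read the cheap
    unlimited-runtime condition first and only take the lock when
    accounting is actually needed. The names and types are illustrative,
    not the kernel's.

      #include <pthread.h>
      #include <stdio.h>

      #define RUNTIME_INF (~0ULL)

      static pthread_mutex_t rt_runtime_lock = PTHREAD_MUTEX_INITIALIZER;
      static unsigned long long rt_runtime = RUNTIME_INF;  /* no throttling */
      static unsigned long long rt_time;

      static void account_rt_time(unsigned long long delta)
      {
          /* Cheap unlocked check first: with unlimited runtime there is
           * nothing to account, so skip the lock entirely. A racy read is
           * tolerable; at worst one accounting cycle is missed. */
          if (rt_runtime == RUNTIME_INF)
              return;

          pthread_mutex_lock(&rt_runtime_lock);
          rt_time += delta;
          pthread_mutex_unlock(&rt_runtime_lock);
      }

      int main(void)
      {
          account_rt_time(1000);                /* skips the lock           */
          rt_runtime = 950000;                  /* enable throttling        */
          account_rt_time(1000);                /* takes the lock, accounts */
          printf("accounted: %llu ns\n", rt_time);
          return 0;
      }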
     

24 Oct, 2008

1 commit


22 Oct, 2008

1 commit

  • a patch from Henrik Austad did this:

    >> Do not declare select_task_rq as part of sched_class when CONFIG_SMP is
    >> not set.

    Peter observed:

    > While a proper cleanup, could you do it by re-arranging the methods so
    > as to not create an additional ifdef?

    Do not declare select_task_rq and some other methods as part of
    sched_class when CONFIG_SMP is not set.

    Also gather those methods together to avoid a CONFIG_SMP #ifdef mess.
    (A sketch of the grouping follows this entry.)

    Idea-by: Henrik Austad
    Signed-off-by: Li Zefan
    Acked-by: Peter Zijlstra
    Acked-by: Henrik Austad
    Signed-off-by: Ingo Molnar

    Li Zefan
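
    The reorganization above is about grouping related members under a
    single conditional block instead of scattering several. A schematic:
    the struct below is a cut-down stand-in for sched_class, not the real
    definition.

      #include <stdio.h>

      struct rq;
      struct task_struct;

      /* Cut-down stand-in for sched_class: SMP-only hooks are gathered
       * under one conditional block rather than several scattered #ifdefs. */
      struct sched_class_model {
          void (*enqueue_task)(struct rq *rq, struct task_struct *p);
          void (*dequeue_task)(struct rq *rq, struct task_struct *p);

      #ifdef CONFIG_SMP
          int  (*select_task_rq)(struct task_struct *p, int flag);
          void (*set_cpus_allowed)(struct task_struct *p);
          void (*rq_online)(struct rq *rq);
          void (*rq_offline)(struct rq *rq);
      #endif
      };

      int main(void)
      {
          printf("sched_class model size: %zu bytes\n",
                 sizeof(struct sched_class_model));
          return 0;
      }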
     

20 Oct, 2008

1 commit


04 Oct, 2008

1 commit

  • While working on the new version of the code for SCHED_SPORADIC I
    noticed something strange in the present throttling mechanism. More
    specifically in the throttling timer handler in sched_rt.c
    (do_sched_rt_period_timer()) and in rt_rq_enqueue().

    The problem is that, when unthrottling a runqueue, rt_rq_enqueue() only
    asks for rescheduling if the runqueue has a sched_entity associated with
    it (i.e., rt_rq->rt_se != NULL).
    Now, if the runqueue is the root rq (which has rt_se == NULL),
    rescheduling does not take place, and it is delayed to some undefined
    instant in the future.

    This implies some random bandwidth usage by the RT tasks under
    throttling. For instance, setting rt_runtime_us/rt_period_us =
    950ms/1000ms, an RT task will get less than 95%. In our tests we got
    something varying between 70% and 95%.
    Using smaller time values, e.g., 95ms/100ms, things are even worse, and
    I can see values also going down to 20-25%!

    The tests we performed are simply running 'yes' as a SCHED_FIFO task,
    and checking the CPU usage with top, but we can investigate thoroughly
    if you think it is needed.

    Things go much better, for us, with the attached patch... Don't know if
    it is the best approach, but it solved the issue for us. (A simplified
    model of the fix follows this entry.)

    Signed-off-by: Dario Faggioli
    Signed-off-by: Michael Trimarchi
    Acked-by: Peter Zijlstra
    Cc:
    Signed-off-by: Ingo Molnar

    Dario Faggioli
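
    A toy model of the fix's intent as described above: when a throttled
    queue gets its bandwidth back, ask for a reschedule even when it is
    the root runqueue (rt_se == NULL), rather than only for child group
    queues. Everything below is a simplified stand-in, not the kernel code.

      #include <stdbool.h>
      #include <stdio.h>

      struct rt_rq {
          bool throttled;
          void *rt_se;            /* NULL for the root runqueue */
      };

      static bool need_resched_flag;

      static void rt_rq_enqueue(struct rt_rq *rt_rq)
      {
          rt_rq->throttled = false;

          /* Old behaviour: only child runqueues (rt_se != NULL) asked for a
           * reschedule, so the root rq kept running non-RT work for a while.
           * Fixed behaviour: always request a reschedule when bandwidth is
           * replenished. */
          need_resched_flag = true;
      }

      int main(void)
      {
          struct rt_rq root = { true, NULL };

          rt_rq_enqueue(&root);
          printf("resched requested: %d\n", need_resched_flag);  /* now 1 */
          return 0;
      }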
     

23 Sep, 2008

2 commits