31 Mar, 2011

1 commit


04 Mar, 2011

2 commits

  • Merge reason: Add fixes before applying dependent patches.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The current sched rt code is broken when it comes to hierarchical
    scheduling; this patch fixes two problems:

    1. It adds redundant (but harmless) enqueuing when it finds a queue
    that has tasks enqueued but no run time and is not throttled.

    2. The most important change is in sched_rt_rq_enqueue/dequeue.
    The code just picks the rt_rq belonging to the current cpu on
    which the period timer runs; the patch fixes it so that the
    correct rt_se is enqueued/dequeued.
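
    A condensed sketch of the second fix (not the literal patch; the
    helpers shown are the ones sched_rt.c of this era already provides):

    /*
     * Derive the CPU from the rt_rq being operated on, rather than
     * from smp_processor_id() (the CPU the period timer happens to
     * run on), so the matching group entity is picked.
     */
    static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
    {
            int cpu = cpu_of(rq_of_rt_rq(rt_rq));   /* CPU owning this rt_rq */
            struct sched_rt_entity *rt_se = rt_rq->tg->rt_se[cpu];

            if (rt_rq->rt_nr_running && rt_se && !on_rt_rq(rt_se))
                    enqueue_rt_entity(rt_se, false);
    }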

    Tested with a simple hierarchy

    /c/d: c and d are assigned similar runtimes of 50,000 and a while(1)
    loop runs within "d". Both c and d get throttled. Without the patch
    the task just stops running and never runs again (depending on where
    the sched_rt b/w timer runs); with the patch the task is throttled
    and runs as expected.

    [ bharata: suggestions on how to pick the rt_se belonging to the
    rt_rq and the correct cpu ]

    Signed-off-by: Balbir Singh
    Acked-by: Bharata B Rao
    Signed-off-by: Peter Zijlstra
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Balbir Singh
     

16 Feb, 2011

1 commit


03 Feb, 2011

1 commit

  • cpu_stopper_thread()
    migration_cpu_stop()
    __migrate_task()
    deactivate_task()
    dequeue_task()
    dequeue_task_rt()
    update_curr_rt()

    Will call update_curr_rt() on rq->curr, which at that time is
    rq->stop. The problem is that rq->stop's prio falls in the RT range,
    so update_curr_rt() falsely assumes it's an rt_sched_class task.
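
    A sketch of the kind of guard that addresses this (checking the
    class pointer instead of the priority range; condensed, not
    necessarily the literal patch):

    static void update_curr_rt(struct rq *rq)
    {
            struct task_struct *curr = rq->curr;

            /*
             * rq->stop has a priority in the RT range but is not an
             * rt_sched_class task; comparing the class pointer filters
             * it out where a prio check would not.
             */
            if (curr->sched_class != &rt_sched_class)
                    return;

            /* ... normal RT runtime accounting continues here ... */
    }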

    Reported-Debugged-Tested-Acked-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Cc: stable@kernel.org # .37
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

26 Jan, 2011

1 commit

  • When a task is taken out of the fair class we must ensure its
    vruntime is properly normalized, because when we put it back in,
    the fair class assumes it is normalized.

    The case that goes wrong is when changing away from the fair class
    while sleeping. Sleeping tasks have non-normalized vruntime in order
    to make sleeper-fairness work. So treat the switch away from fair as a
    wakeup and preserve the relative vruntime.

    Also update sysrq-n to call the ->switch_{to,from} methods.
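
    Roughly, the switch-away path then looks like this (a condensed
    sketch in the style of the fair-class code, not the exact patch):

    static void switched_from_fair(struct rq *rq, struct task_struct *p)
    {
            struct sched_entity *se = &p->se;
            struct cfs_rq *cfs_rq = cfs_rq_of(se);

            /*
             * A sleeping task is !on_rq and carries an absolute
             * vruntime; treat the class switch like a wakeup and keep
             * only the part relative to min_vruntime, so re-entering
             * the fair class later can normalize it correctly.
             */
            if (!se->on_rq && p->state != TASK_RUNNING) {
                    place_entity(cfs_rq, se, 0);
                    se->vruntime -= cfs_rq->min_vruntime;
            }
    }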

    Reported-by: Onkalo Samu
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

18 Nov, 2010

1 commit

  • Make certain load-balance actions scale per number of active cgroups
    instead of the number of existing cgroups.

    This makes wakeup/sleep paths more expensive, but is a win for systems
    where the vast majority of existing cgroups are idle.

    Signed-off-by: Paul Turner
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

19 Oct, 2010

2 commits

  • The scheduler accounts both softirq and interrupt processing times to
    the currently running task. This means that if the interrupt
    processing was done on behalf of some other task in the system, the
    current task ends up being penalized, as it gets a shorter runtime
    than it otherwise would.

    Change sched task accounting to account only actual task time to the
    currently running task. update_curr() now computes delta_exec based
    on rq->clock_task.

    Note that this change only handles the CONFIG_IRQ_TIME_ACCOUNTING
    case. We can extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal
    effort, but that's for later.

    This change will impact scheduling behavior in interrupt heavy conditions.
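
    In sched_rt.c terms the accounting change amounts to something like
    this (condensed sketch, not the full patch):

    static void update_curr_rt(struct rq *rq)
    {
            struct task_struct *curr = rq->curr;
            u64 delta_exec;

            /*
             * rq->clock_task advances only while the task itself runs;
             * hardirq/softirq time is excluded, so the task is no
             * longer charged for interrupt processing.
             */
            delta_exec = rq->clock_task - curr->se.exec_start;
            if (unlikely((s64)delta_exec < 0))
                    delta_exec = 0;

            curr->se.sum_exec_runtime += delta_exec;
            curr->se.exec_start = rq->clock_task;
            /* ... bandwidth/throttling accounting continues ... */
    }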

    Tested on a 4-way system with eth0 handled by CPU 2 and a network
    heavy task (nc) running on CPU 3 (and no RSS/RFS). With that, CPU 2
    spends 75%+ of its time in irq processing and CPU 3 spends around
    35% of its time running the nc task.

    Now, if I run another CPU intensive task on CPU 2, without this change
    /proc/<pid>/schedstat shows 100% of the time accounted to this task.
    With this change, it rightly shows less than 25% accounted to this
    task, as the remaining time is actually spent on irq processing.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
     
  • Labels should be on column 0.

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

21 Sep, 2010

2 commits

  • If a high priority task is waking up on a CPU that is running a
    lower priority task that is bound to a CPU, see if we can move the
    high RT task to another CPU first. Note that if all other CPUs are
    running higher priority tasks than the CPU-bound current task, it
    will be preempted regardless.
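
    Roughly, the wakeup-time decision looks like this (condensed into a
    hypothetical helper; the real change is in the RT class's wakeup CPU
    selection path):

    /* rt_wakeup_target_cpu() is an illustrative name, not a kernel API. */
    static int rt_wakeup_target_cpu(struct rq *rq, struct task_struct *p)
    {
            struct task_struct *curr = rq->curr;
            int cpu = task_cpu(p);

            if (curr && rt_task(curr) &&
                curr->rt.nr_cpus_allowed < 2 &&  /* current is pinned here   */
                p->rt.nr_cpus_allowed > 1) {     /* the waking task can move */
                    int target = find_lowest_rq(p);

                    if (target != -1)
                            cpu = target;
            }

            return cpu;
    }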

    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Cc: Gregory Haskins
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     
  • When first working on the RT scheduler design, we concentrated on
    keeping all CPUs running RT tasks instead of having multiple RT
    tasks on a single CPU waiting for the migration thread to move
    them. Instead we take a more proactive stance and push or pull RT
    tasks from one CPU to another on wakeup or scheduling.

    When an RT task wakes up on a CPU that is running another RT task,
    instead of preempting it and killing the cache of the running RT
    task, we look to see if we can migrate the RT task that is waking
    up, even if the RT task waking up is of higher priority.

    This may sound a bit odd, but RT tasks should be limited in
    migration by the user anyway. But in practice, people do not do
    this, which causes high prio RT tasks to bounce around the CPUs.
    This becomes even worse when we have priority inheritance, because
    a high prio task can block on a lower prio task and boost its
    priority. When the lower prio task wakes up the high prio task, if
    it happens to be on the same CPU it will migrate off of it.

    But in reality the above does not happen much either: when the
    lower prio task, which has already been boosted, wakes up the
    higher prio task, and the two are on the same CPU, the higher prio
    task would then migrate off of it. But anyway, we do not want to
    migrate them either.

    To examine the scheduling, I created a test program and examined it
    under kernelshark. The test program created (number of CPUs * 2)
    threads, where each thread had a different priority. The program
    takes different options; the options used in this change log were
    to have priority inheritance mutexes or not.

    All threads did the following loop:

    static void grab_lock(long id, int iter, int l)
    {
            ftrace_write("thread %ld iter %d, taking lock %d\n",
                         id, iter, l);
            pthread_mutex_lock(&locks[l]);
            ftrace_write("thread %ld iter %d, took lock %d\n",
                         id, iter, l);
            busy_loop(nr_tasks - id);
            ftrace_write("thread %ld iter %d, unlock lock %d\n",
                         id, iter, l);
            pthread_mutex_unlock(&locks[l]);
    }

    void *start_task(void *id)
    {
            [...]
            while (!done) {
                    for (l = 0; l < nr_locks; l++) {
                            grab_lock(id, i, l);
                            ftrace_write("thread %ld iter %d sleeping\n",
                                         id, i);
                            ms_sleep(id);
                    }
                    i++;
            }
            [...]
    }

    The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
    ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
    to the ftrace buffer to help analyze via ftrace.

    The higher the id, the higher the prio, the shorter its busy loop,
    but the longer it sleeps. This is usually the case with RT tasks:
    the lower priority tasks usually run longer than higher priority
    tasks.

    At the end of the test, it records the number of loops each thread
    took, as well as the number of voluntary preemptions, non-voluntary
    preemptions, and number of migrations each thread took, taking the
    information from /proc/$$/sched and /proc/$$/status.

    Running this on a 4 CPU processor, the results without changes to
    the kernel looked like this:

    Task    vol  nonvol  migrated  iterations
    ----   ----  ------  --------  ----------
      0:     53    3220      1470          98
      1:    562     773       724          98
      2:    752     933      1375          98
      3:    749      39       697          98
      4:    758       5       515          98
      5:    764       2       679          99
      6:    761       2       535          99
      7:    757       3       346          99

    total: 5156    4977      6341         787

    Each thread, regardless of priority, migrated a few hundred times.
    The higher priority tasks were a little better off but still took
    quite a hit.

    By letting higher priority tasks bump the lower prio task from the
    CPU, things changed a bit:

    Task    vol  nonvol  migrated  iterations
    ----   ----  ------  --------  ----------
      0:     37    2835      1937          98
      1:    666    1821      1865          98
      2:    654    1003      1385          98
      3:    664     635       973          99
      4:    698     197       352          99
      5:    703     101       159          99
      6:    708       1        75          99
      7:    713       1         2          99

    total: 4843    6594      6748         789

    The total # of migrations did not change (several runs showed the
    difference all within the noise). But we now see a dramatic
    improvement to the higher priority tasks. (kernelshark showed that
    the watchdog timer bumped the highest priority task to give it the
    2 count. This was actually consistent with every run).

    Notice that the # of iterations did not change either.

    The above was with priority inheritance mutexes. That is, when the
    higher priority task blocked on a lower priority task, the lower
    priority task would inherit the higher priority task's priority
    (which shows why task 6 was bumped so many times). When not using
    priority inheritance mutexes, the current kernel shows this:

    Task    vol  nonvol  migrated  iterations
    ----   ----  ------  --------  ----------
      0:     56    3101      1892          95
      1:    594     713       937          95
      2:    625     188       618          95
      3:    628       4       491          96
      4:    640       7       468          96
      5:    631       2       501          96
      6:    641       1       466          96
      7:    643       2       497          96

    total: 4458    4018      5870         765

    Not much changed with or without priority inheritance mutexes. But
    if we let the high priority task bump lower priority tasks on
    wakeup we see:

    Task    vol  nonvol  migrated  iterations
    ----   ----  ------  --------  ----------
      0:    115    3439      2782          98
      1:    633    1354      1583          99
      2:    652     919      1218          99
      3:    645     713       934          99
      4:    690       3         3          99
      5:    694       1         4          99
      6:    720       3         4          99
      7:    747       0         1         100

    This shows an even bigger change. The big difference between task 3
    and task 4 is because we have only 4 CPUs on the machine, causing
    the 4 highest prio tasks to always have preference.

    Although I did not measure cache misses, and I'm sure there would
    be little to measure since the test was not data intensive, I could
    imagine large improvements for higher priority tasks when dealing
    with lower priority tasks. Thus, I'm satisfied with making the
    change and agreeing with what Gregory Haskins argued a few years
    ago when we first had this discussion.

    One final note. All tasks in the above tests were RT tasks. Any RT
    task will always preempt a non RT task that is running on the CPU
    the RT task wants to run on.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Cc: Gregory Haskins
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     

18 Jun, 2010

1 commit


03 Apr, 2010

3 commits

  • In order to reduce the dependency on TASK_WAKING rework the enqueue
    interface to support a proper flags field.

    Replace the int wakeup, bool head arguments with an int flags argument
    and create the following flags:

    ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
    ENQUEUE_WAKING - the enqueue has relative vruntime due to
    having sched_class::task_waking() called,
    ENQUEUE_HEAD - the waking task should be placed on the head
    of the priority queue (where appropriate).

    For symmetry also convert sched_class::dequeue() to a flags scheme.
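
    Roughly, the interface change looks like this (the flag values are
    illustrative):

    #define ENQUEUE_WAKEUP  1  /* enqueue is a wakeup of a sleeping task  */
    #define ENQUEUE_WAKING  2  /* vruntime is relative: task_waking() ran */
    #define ENQUEUE_HEAD    4  /* place at the head of the priority queue */

    /* sched_class methods take one flags word instead of bool arguments: */
    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);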

    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Oleg noticed a few races with the TASK_WAKING usage on fork.

    - since TASK_WAKING is basically a spinlock, it should be IRQ safe
    - since we set TASK_WAKING (*) without holding rq->lock it could
    be that there still is a rq->lock holder, thereby not actually
    providing full serialization.

    (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.

    Cure the second issue by not setting TASK_WAKING in sched_fork(), but
    only temporarily in wake_up_new_task() while calling select_task_rq().

    Cure the first by holding rq->lock around the select_task_rq() call,
    this will disable IRQs, this however requires that we push down the
    rq->lock release into select_task_rq_fair()'s cgroup stuff.

    Because select_task_rq_fair() still needs to drop the rq->lock we
    cannot fully get rid of TASK_WAKING.

    Reported-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Merge reason: update to latest upstream

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

14 Mar, 2010

1 commit


11 Mar, 2010

2 commits


07 Mar, 2010

1 commit

  • Make sure the compiler won't do weird things with limits, e.g. fetching
    them twice may return 2 different values after writable limits are
    implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.
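
    For the RLIMIT_RTTIME check in sched_rt's watchdog(), that amounts
    to something like the following (a condensed sketch using the
    helpers; whether this spot used the helpers or ACCESS_ONCE is an
    assumption here):

    static void watchdog(struct rq *rq, struct task_struct *p)
    {
            unsigned long soft, hard;

            /* Each limit is fetched exactly once; the task_rlimit*()
             * helpers from commit 3e10e716abf3 do the single read. */
            soft = task_rlimit(p, RLIMIT_RTTIME);
            hard = task_rlimit_max(p, RLIMIT_RTTIME);

            if (soft != RLIM_INFINITY) {
                    /* ... compare p->rt.timeout against soft and hard ... */
            }
    }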

    Signed-off-by: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

04 Feb, 2010

1 commit


23 Jan, 2010

2 commits

  • The ability to enqueue a task at the head of a SCHED_FIFO priority
    list is required to fix some violations of POSIX scheduling policy.

    Implement the functionality in sched_rt.
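
    In sched_rt this boils down to letting the head flag pick which end
    of the per-priority FIFO list the entity goes on (condensed sketch,
    not the full patch):

    static void __enqueue_rt_entity(struct sched_rt_entity *rt_se, bool head)
    {
            struct rt_prio_array *array = &rt_rq_of_se(rt_se)->active;
            struct list_head *queue = array->queue + rt_se_prio(rt_se);

            if (head)
                    list_add(&rt_se->run_list, queue);      /* front of FIFO */
            else
                    list_add_tail(&rt_se->run_list, queue); /* back of FIFO  */

            __set_bit(rt_se_prio(rt_se), array->bitmap);
    }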

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Tested-by: Carsten Emde
    Tested-by: Mathias Weber
    LKML-Reference:

    Thomas Gleixner
     
  • The ability to enqueue a task at the head of a SCHED_FIFO priority
    list is required to fix some violations of POSIX scheduling policy.

    Extend the related functions with a "head" argument.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Tested-by: Carsten Emde
    Tested-by: Mathias Weber
    LKML-Reference:

    Thomas Gleixner
     

21 Jan, 2010

1 commit


17 Jan, 2010

1 commit

  • kernel/sched: don't expose local functions

    The get_rr_interval_* functions are all class methods of
    struct sched_class. They are not exported so make them
    static.

    Signed-off-by: H Hartley Sweeten
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    H Hartley Sweeten
     

17 Dec, 2009

1 commit

  • As will be apparent in the next patch, we need a pre wakeup hook
    for sched_fair task migration, hence rename the post wakeup hook
    and add a pre wakeup one.

    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

15 Dec, 2009

2 commits


09 Dec, 2009

1 commit

  • sched_rr_get_param calls
    task->sched_class->get_rr_interval(task) without protection
    against a concurrent sched_setscheduler() call which modifies
    task->sched_class.

    Serialize the access with task_rq_lock(task) and hand the rq
    pointer into get_rr_interval() as it's needed at least in the
    sched_fair implementation.
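
    The serialized call then looks roughly like this (a sketch of the
    caller side, not the literal patch):

    struct rq *rq;
    unsigned long flags;
    unsigned int time_slice;

    /*
     * Pin the task to its runqueue so a concurrent sched_setscheduler()
     * cannot change p->sched_class underneath us, and hand rq to the
     * method since the sched_fair implementation needs it.
     */
    rq = task_rq_lock(p, &flags);
    time_slice = p->sched_class->get_rr_interval(rq, p);
    task_rq_unlock(rq, &flags);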

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

04 Nov, 2009

1 commit

  • find_lowest_rq() wants to call pick_optimal_cpu() on the
    intersection of sched_domain_span(sd) and lowest_mask. Rather
    than doing a cpus_and into a temporary, we can open-code it.

    This actually makes the code slightly clearer, IMHO.
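
    Something along these lines (condensed; cpumask_first_and() walks
    the intersection directly):

    /* Instead of:
     *     cpumask_and(&tmp_mask, sched_domain_span(sd), lowest_mask);
     *     best_cpu = pick_optimal_cpu(this_cpu, &tmp_mask);
     * pick straight from the intersection, no temporary needed:
     */
    int best_cpu = cpumask_first_and(lowest_mask, sched_domain_span(sd));

    if (best_cpu < nr_cpu_ids)
            return best_cpu;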

    Signed-off-by: Rusty Russell
    Acked-by: Gregory Haskins
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rusty Russell
     

21 Sep, 2009

1 commit


15 Sep, 2009

3 commits


04 Sep, 2009

1 commit

  • Keep an average of the amount of time spent on RT tasks and use
    that fraction to scale down the cpu_power available to regular tasks.
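
    Condensed, the mechanism is roughly the following (a sketch, not the
    exact code; the consumer side lives in the load balancer rather than
    sched_rt.c):

    /* Producer side, in update_curr_rt(): feed RT runtime into a
     * decaying per-rq average. */
    sched_rt_avg_update(rq, delta_exec);

    /* Consumer side: scale cpu_power by the fraction of the averaging
     * period *not* consumed by RT tasks. */
    u64 total     = sched_avg_period() + (rq->clock - rq->age_stamp);
    u64 available = total - rq->rt_avg;

    power = div_u64(power * available, total);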

    Signed-off-by: Peter Zijlstra
    Tested-by: Andreas Herrmann
    Acked-by: Andreas Herrmann
    Acked-by: Gautham R Shenoy
    Cc: Balbir Singh
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

02 Aug, 2009

4 commits

  • This build bug:

    In file included from kernel/sched.c:1765:
    kernel/sched_rt.c: In function ‘has_pushable_tasks’:
    kernel/sched_rt.c:1069: error: ‘struct rt_rq’ has no member named ‘pushable_tasks’
    kernel/sched_rt.c: In function ‘pick_next_task_rt’:
    kernel/sched_rt.c:1084: error: ‘struct rq’ has no member named ‘post_schedule’

    Triggers because both pushable_tasks and post_schedule are
    SMP-only fields.

    Move pushable_tasks() to the SMP section and #ifdef the post_schedule use.

    Cc: Gregory Haskins
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • A frequent mistake appears to be to call task_of() on a
    scheduler entity that is not actually a task, which can result
    in a wild pointer.

    Add a check to catch these mistakes.
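
    The added check is essentially a debug-only assertion in task_of()
    (sketch):

    static inline struct task_struct *task_of(struct sched_entity *se)
    {
    #ifdef CONFIG_SCHED_DEBUG
            /* Group entities are not embedded in a task_struct; catch
             * callers passing one in before the pointer goes wild. */
            WARN_ON_ONCE(!entity_is_task(se));
    #endif
            return container_of(se, struct task_struct, se);
    }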

    Suggested-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Reflect "active" cpus in the rq->rd->online field, instead of
    the online_map.

    The motivation is that things that use the root-domain code
    (such as cpupri) only care about cpus classified as "active"
    anyway. By synchronizing the root-domain state with the active
    map, we allow several optimizations.

    For instance, we can remove an extra cpumask_and from the
    scheduler hotpath by utilizing rq->rd->online (since it is now
    a cached version of cpu_active_map & rq->rd->span).

    Signed-off-by: Gregory Haskins
    Acked-by: Peter Zijlstra
    Acked-by: Max Krasnyansky
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Gregory Haskins
     
  • We currently have an explicit "needs_post" vtable method which
    returns a stack variable for whether we should later run
    post-schedule. This leads to an awkward exchange of the
    variable as it bubbles back up out of the context switch. Peter
    Zijlstra observed that this information could be stored in the
    run-queue itself instead of handled on the stack.

    Therefore, we revert to the method of having context_switch
    return void, and update an internal rq->post_schedule variable
    when we require further processing.

    In addition, we fix a race condition where we try to access
    current->sched_class without holding the rq->lock. This is
    technically racy, as the sched-class could change out from
    under us. Instead, we reference the per-rq post_schedule
    variable with the runqueue unlocked, but with preemption
    disabled to see if we need to reacquire the rq->lock.

    Finally, we clean the code up slightly by removing the #ifdef
    CONFIG_SMP conditionals from the schedule() call, and implement
    some inline helper functions instead.

    This patch passes checkpatch, and rt-migrate.
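
    The post-schedule side then reduces to checking the cached flag with
    the runqueue unlocked (condensed sketch, not the literal patch):

    static inline void post_schedule(struct rq *rq)
    {
            if (rq->post_schedule) {
                    unsigned long flags;

                    /* Re-take rq->lock only when the flag, set during
                     * the context switch, says there is work to do. */
                    spin_lock_irqsave(&rq->lock, flags);
                    if (rq->curr->sched_class->post_schedule)
                            rq->curr->sched_class->post_schedule(rq);
                    spin_unlock_irqrestore(&rq->lock, flags);

                    rq->post_schedule = 0;
            }
    }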

    Signed-off-by: Gregory Haskins
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Gregory Haskins
     

10 Jul, 2009

1 commit

  • Fixes an easily triggerable BUG() when setting process affinities.

    Make sure to count the number of migratable tasks in the same place:
    the root rt_rq. Otherwise the number doesn't make sense and we'll hit
    the BUG in set_cpus_allowed_rt().

    Also, make sure we only count tasks, not groups (this is probably
    already taken care of by the fact that rt_se->nr_cpus_allowed will be 0
    for groups, but be more explicit).

    Tested-by: Thomas Gleixner
    CC: stable@kernel.org
    Signed-off-by: Peter Zijlstra
    Acked-by: Gregory Haskins
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Jun, 2009

1 commit