02 Apr, 2015

3 commits

  • I observed that DL tasks can't be migrated to other CPUs during CPU
    hotplug; in addition, the task may or may not run again if the CPU is
    added back.

    The root cause I found is that DL tasks are throttled and removed
    from the DL rq after consuming all their budget, so the stop task
    can't pick them up from the DL rq and migrate them to other CPUs
    during hotplug.

    The method to reproduce:

    schedtool -E -t 50000:100000 -e ./test

    Actually './test' is just a simple for loop. Then observe which CPU the
    test task is on and offline it:

    echo 0 > /sys/devices/system/cpu/cpuN/online
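
    For reference, './test' can be as simple as the following busy loop (a
    sketch, not part of the original commit; the file name test.c is just a
    placeholder):

    /* test.c - minimal CPU hog for the reproduction above. */
    int main(void)
    {
            for (;;)
                    ;       /* spin so the DL task keeps consuming its runtime budget */
            return 0;
    }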

    This patch adds DL task migration during CPU hotplug by finding the
    most suitable later-deadline rq after the DL timer fires, if the
    current rq is offline.

    If it fails to find a suitable later-deadline rq, it falls back to any
    eligible online CPU so that the deadline task comes back to us, and the
    push/pull mechanism should then move it around properly.

    Suggested-and-Acked-by: Juri Lelli
    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1427411315-4298-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • dl_task_timer() may fire on a different rq from where a task was removed
    after throttling. Since the call path is:

    dl_task_timer() ->
    enqueue_task_dl() ->
    enqueue_dl_entity() ->
    replenish_dl_entity()

    and replenish_dl_entity() uses dl_se's rq, we can't use current's rq
    in dl_task_timer(), but we need to lock the task's previous one.

    Tested-by: Wanpeng Li
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Kirill Tkhai
    Cc: Juri Lelli
    Fixes: 3960c8c0c789 ("sched: Make dl_task_time() use task_rq_lock()")
    Link: http://lkml.kernel.org/r/1427792017-7356-1-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • Obviously, 'rq' is not used in these two functions; therefore, there
    is no reason for it to be passed as an argument.

    Signed-off-by: Abel Vesa
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1425383427-26244-1-git-send-email-abelvesa@gmail.com
    Signed-off-by: Ingo Molnar

    Abel Vesa
     

27 Mar, 2015

1 commit

  • Since commit 40767b0dc768 ("sched/deadline: Fix deadline parameter
    modification handling") we clear the throttled state when switching
    away from a dl task; therefore we should never find it set when
    switching to a dl task.

    Signed-off-by: Wanpeng Li
    [ Improved the changelog. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Link: http://lkml.kernel.org/r/1426590931-4639-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

10 Mar, 2015

1 commit

  • This patch adds an rq->clock update skip for SCHED_DEADLINE task yield:
    it tells update_rq_clock() that we've just updated the clock, so that
    we don't do a microscopic update in schedule() and double the
    fastpath cost.

    Signed-off-by: Wanpeng Li
    Cc: Juri Lelli
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1425961200-3809-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

18 Feb, 2015

3 commits

  • update_curr_dl() needs actual rq clock.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1423040972.18770.10.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • A deadline task may be throttled and dequeued at the same time.
    This happens when it becomes throttled in schedule(), which
    is called to go to sleep:

    current->state = TASK_INTERRUPTIBLE;
    schedule()
      deactivate_task()
        dequeue_task_dl()
          update_curr_dl()
            start_dl_timer()
          __dequeue_task_dl()
      prev->on_rq = 0;

    Later the timer fires, but the task is still dequeued:

    dl_task_timer()
      enqueue_task_dl() /* queues on dl_rq; on_rq remains 0 */

    Someone wakes it up:

    try_to_wake_up()
      enqueue_dl_entity()
        BUG_ON(on_dl_rq())

    This patch fixes the problem by preventing !on_rq tasks from being
    queued on the dl_rq.

    Reported-by: Fengguang Wu
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    [ Wrote comment. ]
    Cc: Juri Lelli
    Fixes: 1019a359d3dc ("sched/deadline: Fix stale yield state")
    Link: http://lkml.kernel.org/r/1374601424090314@web4j.yandex.ru
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • Kirill reported that a dl task can be throttled and dequeued at the
    same time. This happens when it becomes throttled in schedule(),
    which is called to go to sleep:

    current->state = TASK_INTERRUPTIBLE;
    schedule()
      deactivate_task()
        dequeue_task_dl()
          update_curr_dl()
            start_dl_timer()
          __dequeue_task_dl()
      prev->on_rq = 0;

    This invalidates the assumption from commit 0f397f2c90ce ("sched/dl:
    Fix race in dl_task_timer()"):

    "The only reason we don't strictly need ->pi_lock now is because
    we're guaranteed to have p->state == TASK_RUNNING here and are
    thus free of ttwu races".

    And therefore we have to use the full task_rq_lock() here.

    This further amends the fact that we forgot to update the rq lock loop
    for TASK_ON_RQ_MIGRATING, from commit cca26e8009d1 ("sched: Teach
    scheduler to understand TASK_ON_RQ_MIGRATING state").

    Reported-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Feb, 2015

4 commits

  • When we fail to start the deadline timer in update_curr_dl(), we
    forget to clear ->dl_yielded, resulting in wrecked time keeping.

    The natural place to clear both ->dl_yielded and ->dl_throttled is
    replenish_dl_entity(), since both are after all waiting for that event;
    make it so.

    Luckily, since 67dfa1b756f2 ("sched/deadline: Implement
    cancel_dl_timer() to use in switched_from_dl()") the
    task_on_rq_queued() condition in dl_task_timer() must be true, so we
    can call enqueue_task_dl() unconditionally.

    Reported-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1416962647-76792-4-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • After update_curr_dl() the current task might not be the leftmost task
    anymore. In that case do not start a new hrtick for it.

    In this case NEED_RESCHED will be set and the next schedule will start
    the hrtick for the new task if and when appropriate.

    Signed-off-by: Wanpeng Li
    Acked-by: Juri Lelli
    [ Rewrote the changelog and comment. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1416962647-76792-2-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Commit 67dfa1b756f2 ("sched/deadline: Implement cancel_dl_timer() to
    use in switched_from_dl()") removed the hrtimer_try_cancel() function
    call out from init_dl_task_timer(), which gets called from
    __setparam_dl().

    The result is that we can now re-init the timer while it's active --
    this is bad and corrupts timer state.

    Furthermore, changing the parameters of an active deadline task is
    tricky in that you want to maintain guarantees, while an immediately
    effective change would allow one to circumvent the CBS guarantees --
    this too is bad, as one (bad) task should not be able to affect the
    others.

    Rework things to avoid both problems. We only need to initialize the
    timer once, so move that to __sched_fork() for new tasks.

    Then make sure __setparam_dl() doesn't affect the current running
    state but only updates the parameters used to calculate the next
    scheduling period -- this guarantees the CBS functions as expected
    (albeit slightly pessimistic).

    This however means we need to make sure __dl_clear_params() resets the
    active state; otherwise new tasks (and tasks flipping between classes)
    will not properly (re)compute their first instance.

    Todo: close class flipping CBS hole.
    Todo: implement delayed BW release.

    Reported-by: Luca Abeni
    Acked-by: Juri Lelli
    Tested-by: Luca Abeni
    Fixes: 67dfa1b756f2 ("sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()")
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20150128140803.GF23038@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

31 Jan, 2015

1 commit

  • Currently, cpudl::free_cpus contains all CPUs during init, see
    cpudl_init(). When calling cpudl_find(), we have to add rd->span
    to avoid selecting a CPU outside the current root domain, because
    cpus_allowed cannot be depended on when performing clustered
    scheduling using cpusets, see find_later_rq().

    This patch adds cpudl_set_freecpu() and cpudl_clear_freecpu() for
    changing cpudl::free_cpus when doing rq_online_dl()/rq_offline_dl(),
    so we can avoid the rd->span operation when calling cpudl_find()
    in find_later_rq().

    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1421642980-10045-1-git-send-email-pang.xunlei@linaro.org
    Signed-off-by: Ingo Molnar

    Xunlei Pang
     

09 Jan, 2015

2 commits

  • The dl_runtime_exceeded() function is supposed to check if
    a SCHED_DEADLINE task must be throttled, by checking if its
    current runtime is <= 0.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc: Dario Faggioli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.it
    Signed-off-by: Ingo Molnar

    Luca Abeni
     
  • According to global EDF, tasks should be migrated between runqueues
    without checking if their scheduling deadlines and runtimes are valid.
    However, SCHED_DEADLINE currently performs such a check:
    a migration happens doing:

    deactivate_task(rq, next_task, 0);
    set_task_cpu(next_task, later_rq->cpu);
    activate_task(later_rq, next_task, 0);

    which ends up calling dequeue_task_dl(), setting the new CPU, and then
    calling enqueue_task_dl().

    enqueue_task_dl() then calls enqueue_dl_entity(), which calls
    update_dl_entity(), which can modify scheduling deadline and runtime,
    breaking global EDF scheduling.

    As a result, some of the properties of global EDF are not respected:
    for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
    two cores can have unbounded response times for the third task even
    if 30/80+40/80+120/170 = 1.5809 < 2

    This can be fixed by invoking update_dl_entity() only in case of
    wakeup, or if this is a new SCHED_DEADLINE task.

    Signed-off-by: Luca Abeni
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc: Dario Faggioli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.it
    Signed-off-by: Ingo Molnar

    Luca Abeni
     

16 Nov, 2014

5 commits

  • Introduce start_hrtick_dl for !CONFIG_SCHED_HRTICK to align with
    the fair class.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415670747-58726-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Do not call dequeue_pushable_dl_task() when failing to push an eligible
    task, as it remains pushable, merely not at this particular moment.

    This is the same behavior as commit 311e800e16f6 ("sched, rt: Fix
    rq->rt.pushable_tasks bug in push_rt_task()") on the -rt side.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415258564-8573-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
    This change will make fair.c, rt.c, and deadline.c all start with the
    same logic.

    Suggested-and-Acked-by: Steven Rostedt
    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "pang.xunlei"
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Commit d670ec13178d0 ("posix-cpu-timers: Cure SMP wobbles") fixes one
    glibc test case at the cost of breaking another one. After that commit,
    calling clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y)
    can result in the Y time being smaller than the X time.

    A reproducer/tester can be found further below; it can be compiled and run with:

    gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
    while ./tst-cpuclock2 ; do : ; done

    This reproducer, when running on a buggy kernel, will complain
    about "clock_gettime difference too small".

    The issue happens because at start, in thread_group_cputimer(), we
    initialize the cputimer's sum_exec_runtime with the threads' runtime
    that is not yet accounted, and then add the threads' runtime to the
    running cputimer again on the scheduler tick, making its
    sum_exec_runtime bigger than the actual threads' runtime.

    KOSAKI Motohiro posted a fix for this problem, but that patch was never
    applied: https://lkml.org/lkml/2013/5/26/191 .

    This patch takes a different approach to cure the problem. It calls
    update_curr() when the cputimer starts, which assures we have updated
    stats for the running threads, and on the next scheduler tick we
    account only the runtime that elapsed since the cputimer start. That
    also assures we have a consistent state between the CPU times of
    individual threads and the CPU time of the process consisting of those
    threads.

    Full reproducer (tst-cpuclock2.c):

    #define _GNU_SOURCE
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <stdio.h>
    #include <time.h>
    #include <pthread.h>
    #include <stdint.h>
    #include <inttypes.h>

    /* Parameters for the Linux kernel ABI for CPU clocks. */
    #define CPUCLOCK_SCHED 2
    #define MAKE_PROCESS_CPUCLOCK(pid, clock) \
            ((~(clockid_t) (pid) << 3) | (clockid_t) (clock))

    static pthread_barrier_t barrier;

    /* Help advance the clock. */
    static void *chew_cpu(void *arg)
    {
            pthread_barrier_wait(&barrier);
            while (1) ;

            return NULL;
    }

    /* Don't use the glibc wrapper. */
    static int do_nanosleep(int flags, const struct timespec *req)
    {
            clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);

            return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
    }

    static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
    {
            int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
            int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;

            return after_i - before_i;
    }

    int main(void)
    {
            int result = 0;
            pthread_t th;

            pthread_barrier_init(&barrier, NULL, 2);

            if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
                    perror("pthread_create");
                    return 1;
            }

            pthread_barrier_wait(&barrier);

            /* The test. */
            struct timespec before, after, sleeptimeabs;
            int64_t sleepdiff, diffabs;
            const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };

            /* The relative nanosleep. Not sure why this is needed, but its presence
               seems to make it easier to reproduce the problem. */
            if (do_nanosleep(0, &sleeptime) != 0) {
                    perror("clock_nanosleep");
                    return 1;
            }

            /* Get the current time. */
            if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
                    perror("clock_gettime[2]");
                    return 1;
            }

            /* Compute the absolute sleep time based on the current time. */
            uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
            sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
            sleeptimeabs.tv_nsec = nsec % 1000000000;

            /* Sleep for the computed time. */
            if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
                    perror("absolute clock_nanosleep");
                    return 1;
            }

            /* Get the time after the sleep. */
            if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
                    perror("clock_gettime[3]");
                    return 1;
            }

            /* The time after sleep should always be equal to or after the absolute sleep
               time passed to clock_nanosleep. */
            sleepdiff = tsdiff(&sleeptimeabs, &after);
            if (sleepdiff < 0) {
                    printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
                    result = 1;

                    printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
                    printf("After %llu.%09llu\n", after.tv_sec, after.tv_nsec);
                    printf("Sleep %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
            }

            /* The difference between the timestamps taken before and after the
               clock_nanosleep call should be equal to or more than the duration of the
               sleep. */
            diffabs = tsdiff(&before, &after);
            if (diffabs < sleeptime.tv_nsec) {
                    printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
                    result = 1;
            }

            pthread_cancel(th);

            return result;
    }

    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

04 Nov, 2014

6 commits

  • There are both UP and SMP versions of pull_dl_task(), so there is no
    need to check CONFIG_SMP in switched_from_dl().

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-6-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • In switched_from_dl() we have to issue a resched if we successfully
    pulled some task from other CPUs. This patch also aligns the behavior
    with -rt.

    Suggested-by: Juri Lelli
    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-5-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • This patch pushes the task away if its deadline is equal to the current
    task's deadline during wakeup. This is the same behavior as the rt class.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-4-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • This patch adds a deadline rq status print.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-3-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • The yield semantic of the deadline class is to reduce the remaining
    runtime to zero, after which update_curr_dl() stops the task. However,
    the consumed bandwidth is subtracted from the yielding task's budget
    again even though it has already been set to zero, which leads to an
    artificial overrun. This patch fixes it by making sure update_curr_dl()
    doesn't steal any more time from a task that has yielded.

    Suggested-by: Juri Lelli
    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Kirill Tkhai
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414708776-124078-2-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Currently used hrtimer_try_to_cancel() is racy:

    raw_spin_lock(&rq->lock)
    ...                           dl_task_timer              raw_spin_lock(&rq->lock)
    ...                           raw_spin_lock(&rq->lock)   ...
    switched_from_dl()            ...                        ...
    hrtimer_try_to_cancel()       ...                        ...
    switched_to_fair()            ...                        ...
    ...                           ...                        ...
    ...                           ...                        ...
    raw_spin_unlock(&rq->lock)    ...                        (acquired)
    ...                           ...                        ...
    ...                           ...                        ...
    do_exit()                     ...                        ...
    schedule()                    ...                        ...
    raw_spin_lock(&rq->lock)      ...                        raw_spin_unlock(&rq->lock)
    ...                           ...                        ...
    raw_spin_unlock(&rq->lock)    ...                        raw_spin_lock(&rq->lock)
    ...                           ...                        (acquired)
    put_task_struct()             ...                        ...
    free_task_struct()            ...                        ...
    ...                           ...                        raw_spin_unlock(&rq->lock)
    ...                           (acquired)                 ...
    ...                           ...                        ...
    ...                           (use after free)           ...

    So, let's implement a 100% guaranteed way to cancel the timer and be
    sure we are safe even in very unlikely situations.

    Unlocking the rq does not limit where switched_from_dl() can be used,
    because this has already been possible in pull_dl_task() below.

    Let's consider the safety of this unlocking. The new code in the patch
    comes into play when hrtimer_try_to_cancel() fails. This means the
    callback is running. In this case hrtimer_cancel() just waits until the
    callback is finished. Two cases are possible:

    1) Since we are in switched_from_dl(), the new class is not
    dl_sched_class and the new prio is not less than MAX_DL_PRIO. So, the
    callback returns early; it's right after the !dl_task() check. After
    that, hrtimer_cancel() returns too.

    The above is:

    raw_spin_lock(rq->lock);      ...
    ...                           dl_task_timer()
    ...                           raw_spin_lock(rq->lock);
    switched_from_dl()            ...
    hrtimer_try_to_cancel()       ...
    raw_spin_unlock(rq->lock);    ...
    hrtimer_cancel()              ...
    ...                           raw_spin_unlock(rq->lock);
    ...                           return HRTIMER_NORESTART;
    ...                           ...
    raw_spin_lock(rq->lock);      ...

    2) But the below is also possible:
    ...                           dl_task_timer()
    ...                           raw_spin_lock(rq->lock);
    ...                           ...
    ...                           raw_spin_unlock(rq->lock);
    raw_spin_lock(rq->lock);      ...
    switched_from_dl()            ...
    hrtimer_try_to_cancel()       ...
    ...                           return HRTIMER_NORESTART;
    raw_spin_unlock(rq->lock);    ...
    hrtimer_cancel();             ...
    raw_spin_lock(rq->lock);      ...

    In this case hrtimer_cancel() returns immediately. It's a very unlikely
    case, mentioned just for completeness.

    Nobody can manipulate the task, because check_class_changed() is
    always called with pi_lock locked. Nobody can force the task to
    participate in (concurrent) priority inheritance schemes (the same reason).

    All concurrent task operations require pi_lock, which is held by us.
    No deadlocks with dl_task_timer() are possible, because it returns
    right after !dl_task() check (it does nothing).

    If we receive a new dl_task while the rq is unlocked, we simply no
    longer have to do pull_dl_task() in switched_from_dl().

    Signed-off-by: Kirill Tkhai
    [ Added comments]
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

28 Oct, 2014

7 commits

  • Use nr_cpus_allowed to bail from select_task_rq() when only one CPU
    can be used, which saves some cycles for pinned tasks.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413253360-5318-2-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • There is no need to do fork balancing since SCHED_DEADLINE tasks
    can't fork. This patch avoids the SD_BALANCE_FORK check.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1413253360-5318-1-git-send-email-wanpeng.li@linux.intel.com
    Cc: Linus Torvalds
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • Exclusive cpusets are the only way users can restrict SCHED_DEADLINE
    tasks' affinity (performing what is commonly called clustered
    scheduling). Unfortunately, this is currently broken for two reasons:

    - No check is performed when the user tries to attach a task to
    an exclusive cpuset (recall that exclusive cpusets have an
    associated maximum allowed bandwidth).

    - Bandwidths of source and destination cpusets are not correctly
    updated after a task is migrated between them.

    This patch fixes both things at once, as they are opposite faces
    of the same coin.

    The check is performed in cpuset_can_attach(), as there aren't any
    points of failure after that function. The update is split in two
    halves: we first reserve bandwidth in the destination cpuset, after
    we pass the check in cpuset_can_attach(), and we then release
    bandwidth from the source cpuset when the task's affinity is
    actually changed. Even if there can be time windows when sched_setattr()
    may erroneously fail in the source cpuset, we are fine with it, as
    we can't perform an atomic update of both cpusets at once.

    Reported-by: Daniel Wagner
    Reported-by: Vincent Legout
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Cc: michael@amarulasolutions.com
    Cc: luca.abeni@unitn.it
    Cc: Li Zefan
    Cc: Linus Torvalds
    Cc: cgroups@vger.kernel.org
    Link: http://lkml.kernel.org/r/1411118561-26323-3-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • As Kirill mentioned (https://lkml.org/lkml/2013/1/29/118):

    | If rq has already had 2 or more pushable tasks and we try to add a
    | pinned task then call of push_rt_task will just waste a time.

    A just-switched pinned task cannot be pushed. If the rq already had
    several dl tasks before, they have already been considered as candidates
    to be pushed (or pulled). This patch implements the same behavior as the
    rt class, which was introduced by commit 10447917551e ("sched/rt: Do not
    try to push tasks if pinned task switches to RT").

    Suggested-by: Kirill V Tkhai
    Acked-by: Juri Lelli
    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Steven Rostedt
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413938203-224610-1-git-send-email-wanpeng.li@linux.intel.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
  • 1) The switched_to_dl() check is wrong. We reschedule only
    if rq->curr is a deadline task, and we do not reschedule
    if it's a lower-priority task. But we must always
    preempt a task of another class.

    2) dl_task_timer():
    Policy does not change in case of priority inheritance.
    rt_mutex_setprio() changes prio, while policy remains old.

    So we lose some balancing logic in dl_task_timer() and
    switched_to_dl() when we check policy instead of priority. Boosted
    task may be rq->curr.

    (I didn't change switched_from_dl() because no check is necessary
    there at all).

    I've looked at this place (switched_to_dl()) several times and even
    fixed this function, but only noticed this just now... I suppose some
    performance tests may work better after this.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     
  • dl_task_timer() is racy against several paths. Daniel noticed that
    the replenishment timer may experience a race condition against an
    enqueue_dl_entity() called from rt_mutex_setprio(). In his own
    words:

    rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is:
    start_dl_timer() throttled = 1, rt_mutex_setprio() throttled = 0,
    sched_switch() -> enqueue_task(), dl_task_timer() -> enqueue_task()
    throttled is 0

    => BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already
    enqueued on the -deadline runqueue.

    As we do for the other races, we just bail out in the replenishment
    timer code.

    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: vincent@legout.info
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • In the deboost path, right after the dl_boosted flag has been
    reset, we can currently end up replenishing using -deadline
    parameters of a !SCHED_DEADLINE entity. This of course causes
    a bug, as those parameters are empty.

    In the case depicted above it is safe to simply bail out, as
    the deboosted task is going to be back to its original scheduling
    class anyway.

    Reported-by: Daniel Wagner
    Tested-by: Daniel Wagner
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: vincent@legout.info
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Fabio Checconi
    Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     

15 Oct, 2014

1 commit

  • Pull percpu consistent-ops changes from Tejun Heo:
    "Way back, before the current percpu allocator was implemented, static
    and dynamic percpu memory areas were allocated and handled separately
    and had their own accessors. The distinction has been gone for many
    years now; however, the now duplicate two sets of accessors remained
    with the pointer based ones - this_cpu_*() - evolving various other
    operations over time. During the process, we also accumulated other
    inconsistent operations.

    This pull request contains Christoph's patches to clean up the
    duplicate accessor situation. __get_cpu_var() uses are replaced with
    this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr().

    Unfortunately, the former sometimes is tricky thanks to C being a bit
    messy with the distinction between lvalues and pointers, which led to
    a rather ugly solution for cpumask_var_t involving the introduction of
    this_cpu_cpumask_var_ptr().

    This converts most of the uses but not all. Christoph will follow up
    with the remaining conversions in this merge window and hopefully
    remove the obsolete accessors"
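
    As a rough, kernel-internal illustration of the renaming described in
    the quote above (not code from this pull; 'my_counter' is a hypothetical
    per-cpu variable):

    DEFINE_PER_CPU(int, my_counter);

    int *p;

    p = &__get_cpu_var(my_counter);    /* old accessor */
    p = this_cpu_ptr(&my_counter);     /* replacement  */

    p = __this_cpu_ptr(&my_counter);   /* old accessor */
    p = raw_cpu_ptr(&my_counter);      /* replacement  */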

    * 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits)
    irqchip: Properly fetch the per cpu offset
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix
    ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write.
    percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t
    Revert "powerpc: Replace __get_cpu_var uses"
    percpu: Remove __this_cpu_ptr
    clocksource: Replace __this_cpu_ptr with raw_cpu_ptr
    sparc: Replace __get_cpu_var uses
    avr32: Replace __get_cpu_var with __this_cpu_write
    blackfin: Replace __get_cpu_var uses
    tile: Use this_cpu_ptr() for hardware counters
    tile: Replace __get_cpu_var uses
    powerpc: Replace __get_cpu_var uses
    alpha: Replace __get_cpu_var
    ia64: Replace __get_cpu_var uses
    s390: cio driver &__get_cpu_var replacements
    s390: Replace __get_cpu_var uses
    mips: Replace __get_cpu_var uses
    MIPS: Replace __get_cpu_var uses in FPU emulator.
    arm: Replace __this_cpu_ptr with raw_cpu_ptr
    ...

    Linus Torvalds
     

24 Sep, 2014

2 commits

  • Users can perform clustered scheduling using the cpuset facility.
    After an exclusive cpuset is created, task migrations happen only
    between CPUs belonging to the same cpuset. Inter-cpuset migrations
    can only happen when the user requires so, moving a task between
    different cpusets. This behaviour is broken in SCHED_DEADLINE, as
    currently spurious inter-cpuset migrations may happen without user
    intervention.

    This patch fixes the problem (and shuffles the code a bit to improve
    clarity).

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: raistlin@linux.it
    Cc: michael@amarulasolutions.com
    Cc: fchecconi@gmail.com
    Cc: daniel.wagner@bmw-carit.de
    Cc: vincent@legout.info
    Cc: luca.abeni@unitn.it
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1411118561-26323-4-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • When a task is using SCHED_DEADLINE and the user setschedules it to a
    different class, its sched_dl_entity static parameters are not cleaned
    up. This causes a bug if the user sets it back to SCHED_DEADLINE with
    the same parameters again. The problem resides in the check we
    perform at the very beginning of dl_overflow():

    if (new_bw == p->dl.dl_bw)
            return 0;

    This condition is met in the case depicted above, so the function
    returns and dl_b->total_bw is not updated (the p->dl.dl_bw is not
    added to it). After this, admission control is broken.

    This patch fixes the thing, properly clearing static parameters for a
    task that ceases to use SCHED_DEADLINE.

    Reported-by: Daniele Alessandrelli
    Reported-by: Daniel Wagner
    Reported-by: Vincent Legout
    Tested-by: Luca Abeni
    Tested-by: Daniel Wagner
    Tested-by: Vincent Legout
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Fabio Checconi
    Cc: Dario Faggioli
    Cc: Michael Trimarchi
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1411118561-26323-2-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     

19 Sep, 2014

1 commit

  • 1) Nobody calls pick_dl_task() with a negative cpu; it's an old RT
    leftover.

    2) If p->nr_cpus_allowed is 1, then the affinity has just been changed
    in set_cpus_allowed_ptr(); we'll pick it up just earlier than the
    migration thread.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1410529340.3569.27.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

07 Sep, 2014

1 commit

  • An overrun could happen in function start_hrtick_dl()
    when a task with SCHED_DEADLINE runs in the microseconds
    range.

    For example, if a task with SCHED_DEADLINE has the following parameters:

    Task   runtime   deadline   period
     P1    200us     500us      500us

    The deadline and period from task P1 are less than 1ms.

    In order to achieve microsecond precision, we need to enable the HRTICK
    feature with the following commands:

    PC#echo "HRTICK" > /sys/kernel/debug/sched_features
    PC#trace-cmd record -e sched_switch &
    PC#./schedtool -E -t 200000:500000:500000 -e ./test
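
    For reference, the same parameters can also be set from inside the test
    program via the sched_setattr() syscall instead of schedtool. A rough
    sketch (not part of the original report; it assumes SYS_sched_setattr is
    exposed by the libc headers, needs root, and note that the kernel ABI
    takes nanoseconds):

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    /* Not wrapped by glibc; layout as documented in sched_setattr(2). */
    struct sched_attr {
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;
            uint32_t sched_priority;
            uint64_t sched_runtime;         /* nanoseconds */
            uint64_t sched_deadline;        /* nanoseconds */
            uint64_t sched_period;          /* nanoseconds */
    };

    int main(void)
    {
            struct sched_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size           = sizeof(attr);
            attr.sched_policy   = SCHED_DEADLINE;
            attr.sched_runtime  = 200 * 1000;       /* 200us */
            attr.sched_deadline = 500 * 1000;       /* 500us */
            attr.sched_period   = 500 * 1000;       /* 500us */

            if (syscall(SYS_sched_setattr, 0, &attr, 0)) {
                    perror("sched_setattr");
                    return 1;
            }

            for (;;)
                    ;       /* endless loop, like the './test' binary */
            return 0;
    }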

    The binary test is in an endless while(1) loop here.
    Some pieces of trace.dat are as follows:

    -0 157.603157: sched_switch: :R ==> 2481:4294967295: test
    test-2481 157.603203: sched_switch: 2481:R ==> 0:120: swapper/2
    -0 157.605657: sched_switch: :R ==> 2481:4294967295: test
    test-2481 157.608183: sched_switch: 2481:R ==> 2483:120: trace-cmd
    trace-cmd-2483 157.609656: sched_switch:2483:R==>2481:4294967295: test

    We can get the runtime of P1 from the information above:

    runtime = 157.608183 - 157.605657
    runtime = 0.002526s (2.526ms)

    The correct runtime should be less than or equal to 200us at some point.

    The problem is caused by the conditional check "delta > 10000"
    in start_hrtick_dl(): no hrtimer is started to control the rest of
    the runtime when the remaining runtime is less than 10us, so the
    process keeps running until the next tick period arrives.

    Move the code enforcing the minimum time slice from hrtick_start_fair()
    to hrtick_start(), because the EDF scheduling class also needs this
    limit in start_hrtick_dl().

    To fix this problem, we call hrtick_start() unconditionally in
    start_hrtick_dl(), and make sure the scheduling slice won't be smaller
    than 10us in hrtick_start().

    Signed-off-by: Xiaofeng Yan
    Reviewed-by: Li Zefan
    Acked-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1409022941-5880-1-git-send-email-xiaofeng.yan@huawei.com
    [ Massaged the changelog and the code. ]
    Signed-off-by: Ingo Molnar

    xiaofeng.yan
     

28 Aug, 2014

1 commit

  • __get_cpu_var can paper over differences in the definitions of
    cpumask_var_t and either use the address of the cpumask variable
    directly or perform a fetch of the address of the struct cpumask
    allocated elsewhere. This is important particularly when using per cpu
    cpumask_var_t declarations because in one case we have an offset into
    a per cpu area to handle and in the other case we need to fetch a
    pointer from the offset.

    This patch introduces a new macro

    this_cpu_cpumask_var_ptr()

    that is defined where cpumask_var_t is defined and performs the proper
    actions. All use cases where __get_cpu_var is used with cpumask_var_t
    are converted to the use of this_cpu_cpumask_var_ptr().
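
    A minimal before/after illustration of that conversion (kernel-internal
    code, shown only for context; 'my_cpumasks' is a hypothetical per-cpu
    variable):

    DEFINE_PER_CPU(cpumask_var_t, my_cpumasks);

    struct cpumask *mask;

    mask = __get_cpu_var(my_cpumasks);              /* old: hides offset vs. pointer fetch */
    mask = this_cpu_cpumask_var_ptr(my_cpumasks);   /* new: correct for both definitions   */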

    Signed-off-by: Christoph Lameter
    Signed-off-by: Tejun Heo

    Christoph Lameter
     

20 Aug, 2014

1 commit

  • Implement task_on_rq_queued() and use it everywhere instead of
    the on_rq check. No functional changes.

    The only exception is that we do not use the wrapper in
    check_for_tasks(), because that would require exporting
    task_on_rq_queued() in global header files. The next patch in the
    series will bring it back, so we do not shuffle it back and forth.

    Signed-off-by: Kirill Tkhai
    Cc: Peter Zijlstra
    Cc: Paul Turner
    Cc: Oleg Nesterov
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Kirill Tkhai
    Cc: Tim Chen
    Cc: Nicolas Pitre
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1408528052.23412.87.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai