05 Jun, 2014

1 commit

  • A throttled task is still on the rq, and it may be moved to another CPU
    if the user is playing with sched_setaffinity(). Therefore, an unlocked
    task_rq() access is racy.

    Juri Lelli reports he got this race when dl_bandwidth_enabled()
    was not set.

    Another point, raised by Peter Zijlstra:

    "Now I suppose the problem can still actually happen when
    you change the root domain and trigger an effective affinity
    change that way".

    To fix that, we do the same as is done in __task_rq_lock(). We do not
    use __task_rq_lock() itself, because it has a useful lockdep check
    that is not correct in the case of dl_task_timer(). We do not need
    pi_lock locked here. This case is an exception (PeterZ):

    "The only reason we don't strictly need ->pi_lock now is because
    we're guaranteed to have p->state == TASK_RUNNING here and are
    thus free of ttwu races".

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Cc: stable@vger.kernel.org # v3.14+
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/3056991400578422@web14g.yandex.ru
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
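
    The lock-and-recheck pattern this commit refers to can be sketched in
    userspace as follows. This is a minimal illustration, not the kernel code:
    struct toy_rq, struct toy_task and the toy_* helpers are hypothetical
    stand-ins, and a C11 atomic_flag plays the role of rq->lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct rq; only the lock matters here. */
struct toy_rq {
    atomic_flag lock;
    int cpu;
};

/* Hypothetical stand-in for struct task_struct; rq models task_rq(p). */
struct toy_task {
    struct toy_rq *_Atomic rq;
};

void toy_rq_lock(struct toy_rq *rq)
{
    while (atomic_flag_test_and_set_explicit(&rq->lock, memory_order_acquire))
        ; /* spin until the lock is free */
}

void toy_rq_unlock(struct toy_rq *rq)
{
    atomic_flag_clear_explicit(&rq->lock, memory_order_release);
}

/*
 * Lock the runqueue the task is on. The task may migrate between the
 * unlocked read and the lock acquisition, so re-check under the lock
 * and retry if the answer changed.
 */
struct toy_rq *toy_task_rq_lock(struct toy_task *p)
{
    for (;;) {
        struct toy_rq *rq = atomic_load(&p->rq);

        assert(rq != NULL);
        toy_rq_lock(rq);
        if (rq == atomic_load(&p->rq))
            return rq;      /* still the right runqueue, now locked */
        toy_rq_unlock(rq);  /* task moved: drop the stale lock, retry */
    }
}
```

    The caller gets back a locked runqueue that is guaranteed to be the one
    the task is queued on, and releases it with toy_rq_unlock().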
     

07 May, 2014

1 commit

  • yield_task_dl() is broken:

    o it forces current to be throttled by setting its runtime to zero;
    o it sets current's dl_se->dl_new to one, expecting that dl_task_timer()
      will queue it back with proper parameters at replenish time.

    Unfortunately, dl_task_timer() has this check at the very beginning:

    if (!dl_task(p) || dl_se->dl_new)
        goto unlock;

    So, it just bails out and the task is never replenished. In effect, it
    yielded forever.

    To fix this, introduce a new flag indicating that the task properly yielded
    the CPU before its current runtime expired. While this is somewhat overkill
    at the moment, the flag would be useful in the future to discriminate between
    "good" jobs (whose remaining runtime could be reclaimed, i.e. recycled)
    and "bad" jobs (for which dl_throttled has been set) that needed to be
    stopped.

    Reported-by: yjay.kim
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140429103953.e68eba1b2ac3309214e3dc5a@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
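
    The failure mode and the idea behind the fix can be modelled with a toy
    state machine. Everything below is a hypothetical simplification: struct
    toy_yield_se and the toy_* functions are illustrative names, and the way
    the real patch wires the new flag into the scheduler is not shown in this
    log.

```c
#include <assert.h>

/* Hypothetical, reduced model of a deadline scheduling entity. */
struct toy_yield_se {
    long long runtime;     /* remaining budget */
    long long dl_runtime;  /* budget granted at each replenishment */
    int dl_new;            /* entity has not started an instance yet */
    int dl_yielded;        /* models the new "properly yielded" flag */
    int dl_throttled;
};

/* Old behaviour described above: zero the budget and mark the entity new. */
void toy_yield_old(struct toy_yield_se *se)
{
    se->runtime = 0;
    se->dl_new = 1;
    se->dl_throttled = 1;
}

/* Sketch of the fix: record the yield in its own flag, leave dl_new alone. */
void toy_yield_fixed(struct toy_yield_se *se)
{
    se->runtime = 0;
    se->dl_yielded = 1;
    se->dl_throttled = 1;
}

/* Replenishment timer: returns 1 if the entity was replenished, else 0. */
int toy_dl_timer(struct toy_yield_se *se)
{
    if (se->dl_new)
        return 0;               /* the early bail-out quoted above */
    se->dl_yielded = 0;
    se->dl_throttled = 0;
    se->runtime = se->dl_runtime;
    return 1;
}
```

    With toy_yield_old() the timer never replenishes the entity; with
    toy_yield_fixed() it does, because the early exit no longer triggers.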
     

17 Apr, 2014

1 commit


11 Mar, 2014

2 commits

  • The problems:

    1) We check for rt_nr_running before calling put_prev_task().
    If the previous task is RT, its rt_rq may become throttled
    and dequeued after this call.

    If p is from rt->rq, this just causes picking a task
    from a throttled queue, but if its rt_rq is a child,
    we are guaranteed to hit a BUG_ON.

    2) The same applies to the deadline class. The only difference is that
    we operate only on dl_rq.

    This patch fixes all the above problems and adds a small skip in the
    DL update, like we've already done for the RT class:

    if (unlikely((s64)delta_exec <= 0))
        return;

    Signed-off-by: Peter Zijlstra
    Cc: Juri Lelli
    Link: http://lkml.kernel.org/r/1393946746.3643.3.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
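
    The ordering problem in this commit can be illustrated with a toy queue.
    This is a sketch under simplifying assumptions: struct toy_pick_rq and
    the toy_* functions are hypothetical, and throttling is modelled as
    simply emptying the queue once the budget is exceeded.

```c
#include <assert.h>

/* Hypothetical, reduced model of an RT runqueue with a runtime budget. */
struct toy_pick_rq {
    unsigned int nr_running;
    unsigned long rt_time;     /* runtime consumed so far */
    unsigned long rt_runtime;  /* budget */
    int throttled;
};

/* Charge the outgoing task; crossing the budget throttles the queue. */
void toy_put_prev_task(struct toy_pick_rq *q, unsigned long ran)
{
    q->rt_time += ran;
    if (q->rt_time > q->rt_runtime) {
        q->throttled = 1;
        q->nr_running = 0;  /* modelled as: nothing left to pick */
    }
}

/* Buggy order: the emptiness check is made on a stale count. */
int toy_pick_check_first(struct toy_pick_rq *q, unsigned long ran)
{
    if (!q->nr_running)
        return 0;
    toy_put_prev_task(q, ran);
    return 1;  /* claims a task is available even if the queue was emptied */
}

/* Fixed order: account for the previous task first, then check. */
int toy_pick_put_first(struct toy_pick_rq *q, unsigned long ran)
{
    toy_put_prev_task(q, ran);
    return q->nr_running != 0;
}
```

    When the previous task's final accounting pushes the queue over budget,
    only the second variant notices that there is nothing left to pick.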
     
  • Pick up fixes before queueing up new changes.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

27 Feb, 2014

2 commits

  • Kirill Tkhai noted:

    Since deadline tasks share rt bandwidth, we must care about
    bandwidth timer set. Otherwise rt_time may grow up to infinity
    in update_curr_dl(), if there are no other available RT tasks
    on top level bandwidth.

    RT tasks were in fact throttled right after they got enqueued,
    and never executed again (rt_time never again went below rt_runtime).

    Peter then proposed to accrue DL execution on rt_time only when
    the rt timer is active, and proposed a patch (this patch is a slight
    modification of that) to implement that behavior. While this
    solves Kirill's problem, it has a drawback.

    Indeed, Kirill noted again:

    It looks like we may get into a situation where all CPU time is shared
    between RT and DL tasks:

    rt_runtime = n
    rt_period = 2n

    | RT working, DL sleeping  | DL working, RT sleeping      |
    |--------------------------|------------------------------|
    | (1) duration = n         | (2) duration = n             | (repeat)
    |--------------------------|------------------------------|
    | (rt_bw timer is running) | (rt_bw timer is not running) |

    No time for fair tasks at all.

    While this can happen during the first period, if the rq is always
    backlogged, RT tasks won't have the opportunity to execute anymore: rt_time
    reached rt_runtime during (1); suppose RT is enqueued back after (2): it gets
    throttled since the rt timer didn't fire, and from now on replenishment is
    eaten up by DL tasks that accrue their execution on rt_time (while the rt
    timer is active - we have an RT task waiting for replenishment). FAIR tasks
    are not touched after this first period. Ok, this is not ideal, and the
    situation is even worse!

    The above (the nice case) practically never happens in reality, where
    your rt timer is not aligned to task periods, tasks are in general not
    periodic, etc. Long story short, you always risk overloading your system.

    This patch is based on Peter's idea, but exploits an additional fact:
    if you don't have RT tasks enqueued, it makes little sense to continue
    incrementing rt_time once you reached the upper limit (DL tasks have their
    own mechanism for throttling).

    This cures both problems:

    - no matter how many DL instances ran in the past, you'll have an rt_time
    slightly above rt_runtime when an RT task is enqueued, and from that
    point on (after the first replenishment), the task will execute normally;

    - you can still eat up all bandwidth during the first period, but not
    anymore after that; remember that DL execution will increment rt_time
    only until the upper limit is reached.

    The situation is still not perfect! But, we have a simple solution for now,
    that limits how much you can jeopardize your system, as we keep working
    towards the right answer: RT groups scheduled using deadline servers.

    Reported-by: Kirill Tkhai
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
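
    One way to read the accounting rule described in this commit is: charge
    DL execution to rt_time while the rt timer is active, or while rt_time is
    still below rt_runtime. The sketch below encodes that reading; it is an
    assumption based on the message above, and struct toy_acc_rt and the
    toy_* functions are hypothetical names.

```c
#include <assert.h>

/* Hypothetical, reduced model of the top-level RT bandwidth state. */
struct toy_acc_rt {
    unsigned long long rt_time;     /* runtime accrued in this period */
    unsigned long long rt_runtime;  /* upper limit per period */
    int timer_active;               /* rt bandwidth timer is running */
};

/* Should DL execution be charged to rt_time right now? */
int toy_rt_bandwidth_account(const struct toy_acc_rt *rt)
{
    return rt->timer_active || rt->rt_time < rt->rt_runtime;
}

/* Called with the time a DL task just ran. */
void toy_update_curr_dl(struct toy_acc_rt *rt, unsigned long long delta_exec)
{
    if (toy_rt_bandwidth_account(rt))
        rt->rt_time += delta_exec;
}
```

    With the timer inactive, rt_time stops growing as soon as it crosses
    rt_runtime, so it ends up only slightly above the limit no matter how
    long DL tasks keep running.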
     
  • In deadline class we do not have group scheduling.

    So, let's remove unnecessary

    X = X;

    equations.

    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Cc: Juri Lelli
    Link: http://lkml.kernel.org/r/1393343543.4089.5.camel@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

22 Feb, 2014

4 commits

  • Remove a few gratuitous #ifdefs in pick_next_task*().

    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-nnzddp5c4fijyzzxxrwlxghf@git.kernel.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Dan Carpenter reported:

    > kernel/sched/rt.c:1347 pick_next_task_rt() warn: variable dereferenced before check 'prev' (see line 1338)
    > kernel/sched/deadline.c:1011 pick_next_task_dl() warn: variable dereferenced before check 'prev' (see line 1005)

    Kirill also spotted that migrate_tasks() will have an instant NULL
    deref because pick_next_task() will immediately deref prev.

    Instead of fixing all the corner cases because migrate_tasks() can
    pass in a NULL prev task in the unlikely case of hot-un-plug, provide
    a fake task such that we can remove all the NULL checks from the far
    more common paths.

    A further problem, not previously spotted, is that because we pushed
    pre_schedule() and idle_balance() into pick_next_task() we now need to
    avoid those getting called and pulling more tasks on our dying CPU.

    We avoid pull_{dl,rt}_task() by setting fake_task.prio to MAX_PRIO+1.
    We also note that since we call pick_next_task() exactly as many
    times as we have runnable tasks present, we should never land in
    idle_balance().

    Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()")
    Cc: Juri Lelli
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Reported-by: Kirill Tkhai
    Reported-by: Dan Carpenter
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140212094930.GB3545@laptop.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
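
    The sentinel idea can be sketched as follows. The shape of the pull test
    is an assumption made for illustration (pull only when the outgoing task
    outranks the best queued one, with lower values meaning higher priority);
    TOY_MAX_PRIO, struct toy_prio_task and toy_need_pull() are hypothetical
    names, not the kernel's.

```c
#include <assert.h>

/* Assumed priority range: 0 (highest) .. TOY_MAX_PRIO (nothing queued). */
#define TOY_MAX_PRIO 140

struct toy_prio_task {
    int prio;
};

/* Placeholder passed as prev, so callers never have to handle NULL. */
const struct toy_prio_task toy_fake_task = { .prio = TOY_MAX_PRIO + 1 };

/*
 * Assumed pull test: pulling is worthwhile only if the task leaving the
 * CPU had a higher priority (lower value) than the best task still queued.
 */
int toy_need_pull(int highest_queued_prio, const struct toy_prio_task *prev)
{
    return highest_queued_prio > prev->prio;
}
```

    Because the placeholder's priority value lies above every value a queue
    can report, the test is never true for it and no pull is attempted.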
     
  • In deadline class we do not have group scheduling like in RT.

    dl_nr_total is the same as dl_nr_running. So, one of them should
    be removed.

    Cc: Ingo Molnar
    Cc: Juri Lelli
    Signed-off-by: Kirill Tkhai
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/368631392675853@web20h.yandex.ru
    Signed-off-by: Thomas Gleixner

    Kirill Tkhai
     
  • Rostedt writes:

    My test suite was locking up hard when enabling mmiotracer. This was due
    to the mmiotracer placing all but one CPU offline. I found this out
    when I was able to reproduce the bug with just my stress-cpu-hotplug
    test. This bug baffled me because it would not always trigger, and
    would only trigger on the first run after boot up. The
    stress-cpu-hotplug test would crash hard the first run, or never crash
    at all. But a new reboot may cause it to crash on the first run again.

    I spent all week bisecting this, as I couldn't find a consistent
    reproducer. I finally narrowed it down to the sched deadline patches,
    and even more peculiar, to the commit that added the sched
    deadline boot up self test to the latency tracer. Then it dawned on me
    what the bug was.

    All it took was to run a task under sched deadline to screw up the CPU
    hot plugging. This explained why it would lock up only on the first run
    of the stress-cpu-hotplug test. The bug happened when the boot up self
    test of the schedule latency tracer would test a deadline task. The
    deadline task would corrupt something that would cause CPU hotplug to
    fail. If it didn't corrupt it, the stress test would always work
    (there's no other sched deadline tasks that would run to cause
    problems). If it did corrupt on boot up, the first test would lockup
    hard.

    I proved this theory by running my deadline test program on another box,
    and then running the stress-cpu-hotplug test, and it would now consistently
    lock up. I could run stress-cpu-hotplug over and over with no problem,
    but once I ran the deadline test, the next run of the
    stress-cpu-hotplug would lock hard.

    After adding lots of tracing to the code, I found the cause. The
    function tracer showed that migrate_tasks() was stuck in an infinite
    loop, where rq->nr_running never equaled 1 to break out of it. When I
    added a trace_printk() to see what that number was, it was 335 and
    never decrementing!

    Looking at the deadline code I found:

    static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
    {
        dequeue_dl_entity(&p->dl);
        dequeue_pushable_dl_task(rq, p);
    }

    static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags)
    {
        update_curr_dl(rq);
        __dequeue_task_dl(rq, p, flags);

        dec_nr_running(rq);
    }

    And this:

    if (dl_runtime_exceeded(rq, dl_se)) {
        __dequeue_task_dl(rq, curr, 0);
        if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted)))
            dl_se->dl_throttled = 1;
        else
            enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH);

        if (!is_leftmost(curr, &rq->dl))
            resched_task(curr);
    }

    Notice how we call __dequeue_task_dl() and in the else case we
    call enqueue_task_dl()? Also notice that dequeue_task_dl() has
    underscores where enqueue_task_dl() does not. The enqueue_task_dl()
    calls inc_nr_running(rq), but __dequeue_task_dl() does not. This is
    where we get nr_running out of sync.

    [snip]

    Another point where nr_running can get out of sync is when the dl_timer
    fires:

    dl_se->dl_throttled = 0;
    if (p->on_rq) {
        enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
        if (task_has_dl_policy(rq->curr))
            check_preempt_curr_dl(rq, p, 0);
        else
            resched_task(rq->curr);

    This patch does two things:

    - correctly accounts for throttled tasks (that are now considered
    !running);

    - fixes the bug by updating nr_running from {inc,dec}_dl_tasks(),
    since we risk updating it twice in some situations (e.g., a
    task is dequeued while it has exceeded its budget).

    Cc: mingo@redhat.com
    Cc: torvalds@linux-foundation.org
    Cc: akpm@linux-foundation.org
    Reported-by: Steven Rostedt
    Reviewed-by: Steven Rostedt
    Tested-by: Steven Rostedt
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1392884379-13744-1-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Thomas Gleixner

    Juri Lelli
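
    The counter drift described above can be reproduced with a toy model.
    struct toy_acct_rq and the toy_* functions are hypothetical; the point is
    only the asymmetry between an inner dequeue that leaves nr_running alone
    and an outer enqueue that increments it, versus keeping both counters
    updated in one place.

```c
#include <assert.h>

/* Hypothetical, reduced runqueue accounting. */
struct toy_acct_rq {
    unsigned int nr_running;     /* all runnable tasks */
    unsigned int dl_nr_running;  /* runnable -deadline tasks */
};

/* Buggy pairing: inner dequeue skips nr_running ... */
void toy_buggy_inner_dequeue(struct toy_acct_rq *rq)
{
    rq->dl_nr_running--;
}

/* ... while the outer enqueue bumps it. */
void toy_buggy_enqueue(struct toy_acct_rq *rq)
{
    rq->dl_nr_running++;
    rq->nr_running++;
}

/* Fixed pairing: both counters change together, in one place each way. */
void toy_inc_dl_tasks(struct toy_acct_rq *rq)
{
    rq->dl_nr_running++;
    rq->nr_running++;
}

void toy_dec_dl_tasks(struct toy_acct_rq *rq)
{
    assert(rq->dl_nr_running > 0 && rq->nr_running > 0);
    rq->dl_nr_running--;
    rq->nr_running--;
}

/* A budget overrun that immediately re-enqueues the task. */
void toy_buggy_overrun(struct toy_acct_rq *rq)
{
    toy_buggy_inner_dequeue(rq);
    toy_buggy_enqueue(rq);
}

void toy_fixed_overrun(struct toy_acct_rq *rq)
{
    toy_dec_dl_tasks(rq);
    toy_inc_dl_tasks(rq);
}
```

    Each buggy overrun leaks one count into nr_running, which is how a loop
    waiting for nr_running to drop to 1 can spin forever.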
     

11 Feb, 2014

1 commit

  • This patch merges idle_balance() and pre_schedule() and pushes
    both of them into pick_next_task().

    Conceptually pre_schedule() and idle_balance() are rather similar,
    both are used to pull more work onto the current CPU.

    We cannot however first move idle_balance() into pre_schedule_fair()
    since there is no guarantee the last runnable task is a fair task, and
    thus we would miss newidle balances.

    Similarly, the dl and rt pre_schedule calls must be run before
    idle_balance() since their respective tasks have higher priority and
    it would not do to delay their execution searching for less important
    tasks first.

    However, by noticing that pick_next_task() already traverses the
    sched_class hierarchy in the right order, we can get the right
    behaviour and do away with both calls.

    We must however change the special case optimization to also require
    that prev is of sched_class_fair, otherwise we can miss doing a dl or
    rt pull where we needed one.

    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Feb, 2014

1 commit

  • In order to avoid having to do put/set on a whole cgroup hierarchy
    when we context switch, push the put into pick_next_task() so that
    both operations are in the same function. Further changes then allow
    us to possibly optimize away redundant work.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Feb, 2014

1 commit

  • When p is current and it's not of dl class, then there are no other
    dl tasks in the rq. If we had had pushable tasks in some other rq,
    they would have been pushed earlier. So, skip the "p == rq->curr" case.

    Signed-off-by: Kirill Tkhai
    Acked-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140128072421.32315.25300.stgit@tkhai
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

28 Jan, 2014

1 commit

  • Add in Documentation/scheduler/ some hints about the design
    choices, the usage and the future possible developments of the
    sched_dl scheduling class and of the SCHED_DEADLINE policy.

    Reviewed-by: Henrik Austad
    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    [ Re-wrote sections 2 and 3. ]
    Signed-off-by: Luca Abeni
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1390821615-23247-1-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     

16 Jan, 2014

1 commit

  • Dan Carpenter reported new 'Smatch' warnings:

    > tree: git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
    > head: 130816ce4d5f69167324f7272e70aa3d641677c6
    > commit: 1baca4ce16b8cc7d4f50be1f7914799af30a2861 [17/50] sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic
    >
    > kernel/sched/deadline.c:937 pick_next_task_dl() warn: variable dereferenced before check 'p' (see line 934)

    BUG_ON() already fires if pick_next_dl_entity() doesn't return a valid
    dl_se. No need to check if p is valid afterward.

    Reported-by: Dan Carpenter
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Fixes: 1baca4ce16b8 ("sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic")
    Link: http://lkml.kernel.org/r/52D54E25.6060100@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     

13 Jan, 2014

8 commits

  • Remove the deadline specific sysctls for now. The problem with them is
    that the interaction with the existing rt knobs is nearly impossible
    to get right.

    The current (as per before this patch) situation is that the rt and dl
    bandwidth is completely separate and we enforce rt+dl < 100%. This is
    undesirable because this means that the rt default of 95% leaves us
    hardly any room, even though dl tasks are safer than rt tasks.

    Another proposed solution (a discarded patch) was to have the dl
    bandwidth be a fraction of the rt bandwidth. This is highly
    confusing imo.

    Furthermore neither proposal is consistent with the situation we
    actually want, which is rt tasks run from a dl server. In which case
    the rt bandwidth is a direct subset of dl.

    So whichever way we go, the introduction of dl controls at this point
    is painful. Therefore remove them and instead share the rt budget.

    This means that for now the rt knobs are used for dl admission control
    and the dl runtime is accounted against the rt runtime. I realise that
    this isn't entirely desirable either; but whatever we do we appear to
    need to change the interface later, so better have a small interface
    for now.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-zpyqbqds1r0vyxtxza1e7rdc@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Data from tests confirmed that the original active load balancing
    logic scaled neither in the number of CPUs nor in the number of
    tasks (as sched_rt does).

    Here we provide a global data structure to keep track of deadlines
    of the running tasks in the system. The structure is composed of
    a bitmask showing the free CPUs and a max-heap, needed when the system
    is heavily loaded.

    The implementation and concurrent access scheme are kept simple by
    design. However, our measurements show that we can compete with sched_rt
    on large multi-CPU machines [1].

    Only the push path is addressed; the extension to use this structure
    also for pull decisions is straightforward. However, we are currently
    evaluating different data structures (in order to decrease/avoid
    contention) to solve possibly both problems. We are also going to re-run
    tests considering recent changes inside cpupri [2].

    [1] http://retis.sssup.it/~jlelli/papers/Ospert11Lelli.pdf
    [2] http://www.spinics.net/lists/linux-rt-users/msg06778.html

    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-14-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
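
    The structure described in this commit (a free-CPU bitmask plus a
    max-heap of deadlines) can be sketched as below. This is a single-threaded
    illustration with hypothetical names (struct toy_cpudl, toy_cpudl_set(),
    toy_cpudl_find()); it ignores locking, affinity masks, deadline
    wraparound and updates of CPUs already in the heap.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_CPUS 8

struct toy_cpudl {
    uint32_t free_mask;  /* bit c set: CPU c runs no -deadline task */
    int size;            /* number of heap entries in use */
    struct {
        uint64_t dl;     /* deadline of the task running on cpu */
        int cpu;
    } heap[TOY_NR_CPUS]; /* max-heap keyed on dl: latest deadline on top */
};

/* Record that cpu now runs a task with absolute deadline dl. */
void toy_cpudl_set(struct toy_cpudl *cp, int cpu, uint64_t dl)
{
    int i;

    assert(cp->size < TOY_NR_CPUS);
    cp->free_mask &= ~(UINT32_C(1) << cpu);
    i = cp->size++;
    /* Sift up: move parents with an earlier deadline down one level. */
    while (i > 0 && cp->heap[(i - 1) / 2].dl < dl) {
        cp->heap[i] = cp->heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    cp->heap[i].dl = dl;
    cp->heap[i].cpu = cpu;
}

/*
 * Choose a CPU to push a task with deadline dl to: any free CPU first,
 * otherwise the CPU running the latest deadline if that is later than
 * dl, otherwise -1 (no suitable target).
 */
int toy_cpudl_find(const struct toy_cpudl *cp, uint64_t dl)
{
    int cpu;

    for (cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        if (cp->free_mask & (UINT32_C(1) << cpu))
            return cpu;
    if (cp->size > 0 && cp->heap[0].dl > dl)
        return cp->heap[0].cpu;
    return -1;
}
```

    The bitmask answers the common lightly loaded case cheaply, and the heap
    top gives the preemption candidate when every CPU is busy.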
     
  • In order for deadline scheduling to be effective and useful, it is
    important to have some method of keeping the allocation of the available
    CPU bandwidth to tasks and task groups under control.
    This is usually called "admission control" and if it is not performed
    at all, no guarantee can be given on the actual scheduling of the
    -deadline tasks.

    Since RT-throttling was introduced, each task group has had a
    bandwidth associated with it, calculated as a certain amount of
    runtime over a period. Moreover, to make it possible to manipulate
    such bandwidth, readable/writable controls have been added to both
    procfs (for system wide settings) and cgroupfs (for per-group
    settings).

    Therefore, the same interface is being used for controlling the
    bandwidth distribution to -deadline tasks and task groups, i.e.,
    new controls but with similar names, equivalent meaning and with
    the same usage paradigm are added.

    However, more discussion is needed in order to figure out how
    we want to manage SCHED_DEADLINE bandwidth at the task group level.
    Therefore, this patch adds a less sophisticated, but actually
    very sensible, mechanism to ensure that a certain utilization
    cap is not exceeded in each root_domain (the single rq for !SMP
    configurations).

    Another main difference between deadline bandwidth management and
    RT-throttling is that -deadline tasks have bandwidth of their own
    (while -rt ones don't!), and thus we don't need a higher level
    throttling mechanism to enforce the desired bandwidth.

    This patch, therefore:

    - adds system wide deadline bandwidth management by means of:
      * /proc/sys/kernel/sched_dl_runtime_us,
      * /proc/sys/kernel/sched_dl_period_us,
      that determine (i.e., runtime / period) the total bandwidth
      available on each CPU of each root_domain for -deadline tasks;

    - couples the RT and deadline bandwidth management, i.e., enforces
      that the sum of the bandwidth devoted to -rt and
      -deadline tasks stays below 100%.

    This means that, for a root_domain comprising M CPUs, -deadline tasks
    can be created as long as the sum of their bandwidths stays below:

    M * (sched_dl_runtime_us / sched_dl_period_us)

    It is also possible to disable this bandwidth management logic, and
    thus be free to oversubscribe the system up to any arbitrary level.

    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-12-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
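
    The admission test implied by the M * (runtime / period) bound can be
    sketched in fixed-point arithmetic. The 20-bit shift, struct toy_dl_bw
    and the toy_* functions are assumptions made for this illustration, and a
    task set that exactly reaches the cap is admitted here.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BW_SHIFT 20  /* assumed fixed-point scale: 1.0 == 1 << 20 */

/* runtime / period as a fixed-point fraction. */
uint64_t toy_to_ratio(uint64_t period, uint64_t runtime)
{
    assert(period != 0);
    return (runtime << TOY_BW_SHIFT) / period;
}

/* Hypothetical per-root_domain bandwidth bookkeeping. */
struct toy_dl_bw {
    uint64_t bw;        /* per-CPU cap (runtime_us / period_us) */
    uint64_t total_bw;  /* sum of admitted task bandwidths */
    int enabled;        /* 0: admission control disabled */
};

/*
 * Try to admit a task with bandwidth new_bw on a root_domain of cpus CPUs.
 * Returns 1 and accounts the task if it fits, 0 otherwise.
 */
int toy_dl_admit(struct toy_dl_bw *b, int cpus, uint64_t new_bw)
{
    if (b->enabled && b->bw * (uint64_t)cpus < b->total_bw + new_bw)
        return 0;
    b->total_bw += new_bw;
    return 1;
}
```

    With a 95% cap on two CPUs, three tasks that each need half a CPU are
    admitted (1.5 <= 1.9) and a fourth is rejected (2.0 > 1.9).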
     
  • Some method to deal with rt-mutexes and make sched_dl interact with
    the current PI code is needed, raising anything but trivial issues that
    need (in our view) to be solved with some restructuring of
    the pi-code (i.e., going toward a proxy execution-ish implementation).

    This is under development; in the meanwhile, as a temporary solution,
    what this commit does is:

    - ensure a pi-lock owner with waiters is never throttled down. Instead,
    when it runs out of runtime, it immediately gets replenished and its
    deadline is postponed;

    - the scheduling parameters (relative deadline and default runtime)
    used for those replenishments --during the whole period it holds the
    pi-lock-- are the ones of the waiting task with the earliest deadline.

    Acting this way, we provide some kind of boosting to the lock-owner,
    still by using the existing (actually, slightly modified by the previous
    commit) pi-architecture.

    We would stress the fact that this is only a surely needed, but anything
    but clean, solution to the problem. In the end it's only a way to re-start
    discussion within the community. So, as always, comments, ideas, rants,
    etc. are welcome! :-)

    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    [ Added !RT_MUTEXES build fix. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     
  • Make it possible to specify a period (different from or equal to the
    deadline) for -deadline tasks. Relative deadlines (D_i) are used on
    task arrivals to generate new scheduling (absolute) deadlines as "d =
    t + D_i", and periods (P_i) to postpone the scheduling deadlines as "d
    = d + P_i" when the budget is zero.

    This is in general useful to model (and schedule) tasks that have slow
    activation rates (long periods), but have to be scheduled soon once
    activated (short deadlines).

    Signed-off-by: Harald Gustafsson
    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-7-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Harald Gustafsson
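
    The two update rules, "d = t + D_i" on arrival and "d = d + P_i" on
    budget exhaustion, can be written down directly. struct toy_pd_se and the
    toy_* functions are hypothetical names; refilling the budget in a loop
    until it is positive again is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced deadline entity with separate D_i and P_i. */
struct toy_pd_se {
    uint64_t dl_runtime;   /* budget per period */
    uint64_t dl_deadline;  /* relative deadline D_i */
    uint64_t dl_period;    /* period P_i */
    int64_t runtime;       /* remaining budget, may go negative on overrun */
    uint64_t deadline;     /* current absolute deadline d */
};

/* Task arrival at time t: d = t + D_i, with a full budget. */
void toy_dl_arrive(struct toy_pd_se *se, uint64_t t)
{
    se->deadline = t + se->dl_deadline;
    se->runtime = (int64_t)se->dl_runtime;
}

/* Budget exhausted: postpone by P_i and refill until budget is positive. */
void toy_dl_replenish(struct toy_pd_se *se)
{
    assert(se->dl_runtime > 0);
    while (se->runtime <= 0) {
        se->deadline += se->dl_period;
        se->runtime += (int64_t)se->dl_runtime;
    }
}
```

    A task with D_i = 10 and P_i = 100 that arrives at t = 1000 gets deadline
    1010; once its budget runs out, the deadline moves to 1110 rather than
    1020, which models rare activations that must still be served quickly.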
     
  • Make the core scheduler and load balancer aware of the load
    produced by -deadline tasks, by updating the moving average
    like for sched_rt.

    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-6-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     
  • Introduces data structures relevant for implementing dynamic
    migration of -deadline tasks and the logic for checking if
    runqueues are overloaded with -deadline tasks and for choosing
    where a task should migrate, when that is the case.

    Also adds dynamic migrations to SCHED_DEADLINE, so that tasks can
    be moved among CPUs when necessary. It is also possible to bind a
    task to a (set of) CPU(s), thus restricting its capability of
    migrating, or forbidding migrations altogether.

    The very same approach used in sched_rt is utilised:
    - -deadline tasks are kept in CPU-specific runqueues,
    - -deadline tasks are migrated among runqueues to achieve the
      following:
      * on an M-CPU system the M earliest deadline ready tasks
        are always running;
      * affinity/cpusets settings of all the -deadline tasks are
        always respected.

    Therefore, this very special form of "load balancing" is done with
    an active method, i.e., the scheduler pushes or pulls tasks between
    runqueues when they are woken up and/or (de)scheduled.
    IOW, every time a preemption occurs, the descheduled task might be sent
    to some other CPU (depending on its deadline) to continue executing
    (push). On the other hand, every time a CPU becomes idle, it might pull
    the second earliest deadline ready task from some other CPU.

    To enforce this, a pull operation is always attempted before taking any
    scheduling decision (pre_schedule()), as well as a push one after each
    scheduling decision (post_schedule()). In addition, when a task arrives
    or wakes up, the best CPU on which to resume it is selected taking into
    account its affinity mask, the system topology, but also its deadline.
    E.g., from the scheduling point of view, the best CPU on which to wake
    up (and also to which to push) a task is the one which is running the task
    with the latest deadline among the M executing ones.

    In order to facilitate these decisions, per-runqueue "caching" of the
    deadlines of the currently running and of the first ready task is used.
    Queued but not running tasks are also parked in another rb-tree to
    speed-up pushes.

    Signed-off-by: Juri Lelli
    Signed-off-by: Dario Faggioli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • Introduces the data structures, constants and symbols needed for
    SCHED_DEADLINE implementation.

    Core data structures of SCHED_DEADLINE are defined, along with their
    initializers. Hooks for checking if a task belongs to the new policy
    are also added where they are needed.

    Adds a scheduling class, in sched/dl.c, and a new policy called
    SCHED_DEADLINE. It is an implementation of the Earliest Deadline
    First (EDF) scheduling algorithm, augmented with a mechanism (called
    Constant Bandwidth Server, CBS) that makes it possible to isolate
    the behaviour of tasks from each other.

    The typical -deadline task will be made up of a computation phase
    (instance) which is activated in a periodic or sporadic fashion. The
    expected (maximum) duration of such computation is called the task's
    runtime; the time interval by which each instance needs to be completed
    is called the task's relative deadline. The task's absolute deadline
    is dynamically calculated as the time instant a task (better, an
    instance) activates plus the relative deadline.

    The EDF algorithm selects the task with the smallest absolute
    deadline as the one to be executed first, while the CBS ensures that each
    task runs for at most its runtime every (relative) deadline
    length time interval, avoiding any interference between different
    tasks (bandwidth isolation).
    Thanks to this feature, even tasks that do not strictly comply with
    the computational model sketched above can effectively use the new
    policy.

    To summarize, this patch:
    - introduces the data structures, constants and symbols needed;
    - implements the core logic of the scheduling algorithm in the new
    scheduling class file;
    - provides all the glue code between the new scheduling class and
    the core scheduler and refines the interactions between sched/dl
    and the other existing scheduling classes.

    Signed-off-by: Dario Faggioli
    Signed-off-by: Michael Trimarchi
    Signed-off-by: Fabio Checconi
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-4-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
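
    The EDF selection rule described in this commit can be sketched as a
    comparison on absolute deadlines. The linear scan below is a
    simplification chosen for brevity, not the data structure the scheduler
    uses, and struct toy_edf_task and the toy_* functions are hypothetical
    names. The signed-difference comparison keeps the ordering correct when
    the 64-bit clock wraps, assuming deadlines are less than half the range
    apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_edf_task {
    uint64_t deadline;  /* absolute deadline */
};

/* Wraparound-safe "a is earlier than b". */
int toy_dl_time_before(uint64_t a, uint64_t b)
{
    return (int64_t)(a - b) < 0;
}

/* Return the index of the task with the earliest deadline, or -1 if none. */
int toy_edf_pick(const struct toy_edf_task *tasks, size_t n)
{
    size_t i, best = 0;

    if (n == 0)
        return -1;
    for (i = 1; i < n; i++)
        if (toy_dl_time_before(tasks[i].deadline, tasks[best].deadline))
            best = i;
    return (int)best;
}
```

    Given deadlines 300, 100 and 200, the task at index 1 is picked. A
    deadline just below the wrap point still counts as earlier than a small
    deadline just past it.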