20 Feb, 2013

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Main changes:

    - scheduler side full-dynticks (user-space execution is undisturbed
    and receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready, from Frederic
    Weisbecker.

    - Initial sched.h split-up changes, by Clark Williams

    - select_idle_sibling() performance improvement by Mike Galbraith:

    " 1 tbench pair (worst case) in a 10 core + SMT package:

    pre 15.22 MB/sec 1 procs
    post 252.01 MB/sec 1 procs "

    - sched_rr_get_interval() ABI fix/change. We think this detail is not
    used by apps (so it's not an ABI in practice), but let's keep it
    under observation.

    - misc RT scheduling cleanups, optimizations"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    sched/rt: Add header to
    cputime: Remove irqsave from seqlock readers
    sched, powerpc: Fix sched.h split-up build failure
    cputime: Restore CPU_ACCOUNTING config defaults for PPC64
    sched/rt: Move rt specific bits into new header file
    sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
    sched: Move sched.h sysctl bits into separate header
    sched: Fix signedness bug in yield_to()
    sched: Fix select_idle_sibling() bouncing cow syndrome
    sched/rt: Further simplify pick_rt_task()
    sched/rt: Do not account zero delta_exec in update_curr_rt()
    cputime: Safely read cputime of full dynticks CPUs
    kvm: Prepare to add generic guest entry/exit callbacks
    cputime: Use accessors to read task cputime stats
    cputime: Allow dynamic switch between tick/virtual based cputime accounting
    cputime: Generic on-demand virtual cputime accounting
    cputime: Move default nsecs_to_cputime() to jiffies based cputime file
    cputime: Librarize per nsecs resolution cputime definitions
    cputime: Avoid multiplication overflow on utime scaling
    context_tracking: Export context state for generic vtime
    ...

    Fix up conflict in kernel/context_tracking.c due to comment additions.

    Linus Torvalds
     

08 Feb, 2013

1 commit

  • Add a /proc/sys/kernel scheduler knob named
    sched_rr_timeslice_ms that allows the SCHED_RR timeslice value to
    be changed globally. The user-visible value is in milliseconds but
    is stored internally as jiffies. Setting it to 0 (zero) resets it
    to the default (currently 100ms). (A sketch of the ms/jiffies
    conversion follows this entry.)

    Signed-off-by: Clark Williams
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20130207094704.13751796@riff.lan
    Signed-off-by: Ingo Molnar

    Clark Williams
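
    A minimal sketch of the millisecond/jiffies handling such a knob implies.
    The wiring below is illustrative rather than the actual patch, though
    proc_dointvec(), msecs_to_jiffies() and RR_TIMESLICE are real kernel
    interfaces:

      static int sched_rr_timeslice_ms;              /* user-visible, milliseconds */
      static int sched_rr_timeslice = RR_TIMESLICE;  /* scheduler-side, jiffies */

      int sched_rr_handler(struct ctl_table *table, int write,
                           void __user *buffer, size_t *lenp, loff_t *ppos)
      {
              int ret = proc_dointvec(table, write, buffer, lenp, ppos);

              if (!ret && write) {
                      /* 0 (or negative) restores the default; otherwise ms -> jiffies */
                      sched_rr_timeslice = sched_rr_timeslice_ms <= 0 ?
                              RR_TIMESLICE : msecs_to_jiffies(sched_rr_timeslice_ms);
              }
              return ret;
      }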
     

04 Feb, 2013

1 commit

  • Function next_prio() has been removed, and pull_rt_task() is the
    only user of pick_next_highest_task_rt() at the moment.

    pull_rt_task() is not interested in p->nr_cpus_allowed; its only
    interest is whether cpu is allowed to execute p. If
    nr_cpus_allowed == 1, cpu != task_cpu(p) and cpu is allowed, then
    task p is in the middle of a migration: it is waiting to be moved
    by the migration thread. So let's pull it earlier. (A sketch of
    the simplified check follows this entry.)

    Signed-off-by: Kirill V Tkhai
    Acked-by: Steven Rostedt
    Cc: Peter Zijlstra
    CC: linux-rt-users
    Link: http://lkml.kernel.org/r/70871359644177@web16d.yandex.ru
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
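
    A rough sketch of what dropping the cpus-allowed-count test amounts to,
    assuming a pick_rt_task() helper along 3.8-era lines (not the verbatim
    diff):

      /* Only "can 'cpu' run 'p' right now?" matters to the caller. */
      static int pick_rt_task(struct rq *rq, struct task_struct *p, int cpu)
      {
              if (!task_running(rq, p) &&
                  cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
                      return 1;       /* p->nr_cpus_allowed > 1 test dropped */
              return 0;
      }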
     

31 Jan, 2013

1 commit

  • There are several places in the scheduler where dequeue_task_rt()
    and put_prev_task_rt() are called back to back; for example,
    rt_mutex_setprio() does it.

    Both calls lead to update_curr_rt(), and the second of them sees a
    zeroed delta_exec. The only effective action in that case is the
    call to sched_rt_avg_update(), which can change rq->age_stamp and
    rq->rt_avg, but only because of a "floating" rq->clock; that is
    not something worth accounting. The other actions do nothing. (A
    sketch of the zero-delta guard follows this entry.)

    Signed-off-by: Kirill V Tkhai
    Acked-by: Steven Rostedt
    Cc: Peter Zijlstra
    CC: linux-rt-users
    Link: http://lkml.kernel.org/r/931541359550236@web1g.yandex.ru
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
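
    The guard described above might look roughly like this sketch of the top
    of update_curr_rt() (illustrative, not the exact patch):

      delta_exec = rq->clock_task - curr->se.exec_start;
      /* Bail out early: a zero (or, with a "floating" clock, negative)
       * delta has nothing useful to account and would only perturb
       * sched_rt_avg_update(). */
      if (unlikely((s64)delta_exec <= 0))
              return;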
     

25 Jan, 2013

3 commits

  • The issue below was found in 2.6.34-rt rather than mainline rt
    kernel, but the issue still exists upstream as well.

    So please let me describe how it was noticed on 2.6.34-rt:

    On this version, each softirq has its own thread, which means
    there is at least one RT FIFO task per cpu. The priority of these
    tasks is set to 49 by default. If a user launches an RT FIFO task
    with a priority lower than the softirq RT tasks' priority of 49,
    it's possible for two RT FIFO tasks to be enqueued on one cpu's
    runqueue at the same moment. Under the current strategy of
    balancing RT tasks, we really need to push RT tasks off to a CPU
    that they can run on as soon as possible. Even if it means a bit
    of cache-line flushing, we want RT tasks to be run with the least
    latency.

    When the user RT FIFO task launched earlier is running, the sched
    timer tick of the current cpu fires. In this tick period, the
    timeout value of the user RT task is updated once. Subsequently,
    we try to wake up one softirq RT task on its local cpu. As the
    priority of the current user RT task is lower than that of the
    softirq RT task, the current task will be
    preempted by the higher priority softirq RT task. Before
    preemption, we check to see if current can readily move to a
    different cpu. If so, we will reschedule to allow the RT push logic
    to try to move current somewhere else. Whenever the woken
    softirq RT task runs, it first tries to migrate the user FIFO RT
    task over to a cpu that is running a task of lesser priority. If
    migration is done, it will send a reschedule request to the found
    cpu via an IPI. Once the target cpu handles the IPI, it will pick
    the migrated user RT task to preempt its current task. When the
    user RT task is running on the new cpu, the sched timer tick of
    that cpu fires, so the user RT task is ticked again, which means
    its timeout value is updated again. Since the migration can
    complete within a single tick period, the user RT task's timeout
    value can be updated twice within one tick.

    If we set a limit on the amount of cpu time for the user RT task
    by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted
    upon reaching the soft limit.

    But exactly when the SIGXCPU signal should be sent depends on the
    RT task timeout value. In fact the timeout mechanism of sending
    the SIGXCPU signal assumes the RT task timeout is increased once
    every tick.

    However, currently the timeout value may be added twice per
    tick. So it results in the SIGXCPU signal being sent earlier
    than expected.

    To solve this issue, we prevent the timeout value from increasing
    twice within one tick by remembering the jiffies value of the last
    timeout update. The timeout is only allowed to be updated when the
    RT task's recorded jiffies value differs from the global jiffies
    value. (A sketch of this guard follows this entry.)

    Signed-off-by: Ying Xue
    Signed-off-by: Fan Du
    Reviewed-by: Yong Zhang
    Acked-by: Steven Rostedt
    Cc:
    Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.com
    Signed-off-by: Ingo Molnar

    Ying Xue
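
    A sketch of the guard: remember the jiffies value of the last update and
    bump the RLIMIT_RTTIME timeout at most once per jiffy. The watchdog_stamp
    field name is used here for illustration:

      /* in the RT watchdog, called from the scheduler tick */
      if (p->rt.watchdog_stamp != jiffies) {
              p->rt.timeout++;
              p->rt.watchdog_stamp = jiffies;
      }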
     
  • When the system has multiple domains, do_sched_rt_period_timer()
    can run on any CPU and may iterate over all rt_rq in
    cpu_online_mask. This means that when balance_runtime() is run for
    a given rt_rq, that rt_rq may be in a different root domain (rd)
    than the current processor's. Thus if we use smp_processor_id()
    to get the rd in do_balance_runtime() we may borrow runtime from
    an rt_rq that is not part of our rd.

    This changes do_balance_runtime() to get the rd from the passed-in
    rt_rq, ensuring that we borrow runtime only from the correct rd
    for the given rt_rq. (A sketch follows this entry.)

    This fixes a BUG at kernel/sched/rt.c:687! in __disable_runtime()
    when we try to reclaim runtime lent to other rt_rq but the runtime
    has been lent to an rt_rq in another rd.

    Signed-off-by: Shawn Bohrer
    Acked-by: Steven Rostedt
    Acked-by: Mike Galbraith
    Cc: peterz@infradead.org
    Cc:
    Link: http://lkml.kernel.org/r/1358186131-29494-1-git-send-email-sbohrer@rgmadvisors.com
    Signed-off-by: Ingo Molnar

    Shawn Bohrer
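
    A sketch of the change: take the root domain from the rt_rq being
    balanced instead of from the CPU the timer happens to run on
    (illustrative, not the verbatim diff):

      /* in do_balance_runtime(rt_rq) */
      struct root_domain *rd = rq_of_rt_rq(rt_rq)->rd;
      /* previously (wrong on systems with multiple root domains):
       *     struct root_domain *rd = cpu_rq(smp_processor_id())->rd;
       */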
     
  • Reschedule rq->curr if the first RT task has just been
    pulled to the rq.

    Signed-off-by: Kirill V Tkhai
    Acked-by: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Tkhai Kirill
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/118761353614535@web28f.yandex.ru
    Signed-off-by: Ingo Molnar

    Kirill Tkhai
     

13 Sep, 2012

1 commit

  • Now that the last architecture to use this has stopped doing so (ARM,
    thanks Catalin!) we can remove this complexity from the scheduler
    core.

    Signed-off-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Catalin Marinas
    Link: http://lkml.kernel.org/n/tip-g9p2a1w81xxbrze25v9zpzbf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Sep, 2012

1 commit

  • migrate_tasks() uses _pick_next_task_rt() to get tasks from the
    real-time runqueues to be migrated. When an rt_rq is throttled,
    _pick_next_task_rt() won't return anything, in which case
    migrate_tasks() can't move all threads over and gets stuck in an
    infinite loop.

    Instead, unthrottle rt runqueues before migrating tasks. (A sketch
    follows this entry.)

    Additionally, move unthrottle_offline_cfs_rqs() to rq_offline_fair().

    Signed-off-by: Peter Boonstoppel
    Signed-off-by: Peter Zijlstra
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com
    Signed-off-by: Ingo Molnar

    Peter Boonstoppel
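
    A sketch of what "unthrottle rt runqueues before migrating tasks" can
    look like in the CPU-offline path; the exact placement is an assumption:

      /* While tearing down runtime for an offlining CPU, force throttled
       * groups back onto the runqueue so migrate_tasks() (via
       * _pick_next_task_rt()) can see and move their threads. */
      rt_rq->rt_throttled = 0;
      sched_rt_rq_enqueue(rt_rq);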
     

06 Jun, 2012

1 commit

  • Roland Dreier reported spurious, hard to trigger lockdep warnings
    within the scheduler - without any real lockup.

    This bit gives us the right clue:

    > [89945.640512] [] double_lock_balance+0x5a/0x90
    > [89945.640568] [] push_rt_task+0xc6/0x290

    If you look at that code you'll find the double_lock_balance() in
    question is the one in find_lock_lowest_rq() [yay for inlining].

    Now find_lock_lowest_rq() has a bug: it fails to use
    double_unlock_balance() in one exit path; if this results in a retry in
    push_rt_task() we'll call double_lock_balance() again, at which point
    we'll run into said lockdep confusion. (A sketch of that exit path
    follows this entry.)

    Reported-by: Roland Dreier
    Acked-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1337282386.4281.77.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
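
    A sketch of the exit path in question inside find_lock_lowest_rq(); the
    point is that the retry path has to undo double_lock_balance() with
    double_unlock_balance() rather than dropping only one lock:

      if (double_lock_balance(rq, lowest_rq)) {
              /* both locks were dropped; the task may have moved meanwhile */
              if (unlikely(task_rq(task) != rq ||
                           !cpumask_test_cpu(lowest_rq->cpu,
                                             tsk_cpus_allowed(task)) ||
                           task_running(rq, task) ||
                           !task->on_rq)) {
                      double_unlock_balance(rq, lowest_rq);  /* the missing unlock */
                      lowest_rq = NULL;
                      break;
              }
      }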
     

30 May, 2012

2 commits

  • task_tick_rt() has an optimization to only reschedule SCHED_RR tasks
    if they were the only element on their rq. However, with cgroups
    a SCHED_RR task could be the only element on its per-cgroup rq but
    still be competing with other SCHED_RR tasks in its parent's
    cgroup. In this case, the SCHED_RR task in the child cgroup would
    never yield at the end of its timeslice. If the child cgroup
    rt_runtime_us was the same as the parent cgroup rt_runtime_us,
    the task in the parent cgroup would starve completely.

    Modify task_tick_rt() to check that the task is the only task on its
    rq, and that each of the scheduling entities of its ancestors is
    also the only entity on its rq. (A sketch follows this entry.)

    Signed-off-by: Colin Cross
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1337229266-15798-1-git-send-email-ccross@android.com
    Signed-off-by: Ingo Molnar

    Colin Cross
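
    A sketch of the hierarchical check in task_tick_rt(): requeue and
    reschedule as soon as any level of the task's sched_rt_entity hierarchy
    has more than one entity queued (illustrative, not the verbatim patch):

      for_each_sched_rt_entity(rt_se) {
              if (rt_se->run_list.prev != rt_se->run_list.next) {
                      /* competition at this level: give up the timeslice */
                      requeue_task_rt(rq, p, 0);
                      set_tsk_need_resched(p);
                      return;
              }
      }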
     
  • Since nr_cpus_allowed is used outside of sched/rt.c and wants to be
    used outside of there more, move it to a more natural site.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-kr61f02y9brwzkh6x53pdptm@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

13 Apr, 2012

1 commit

  • Migration (pushable) status depends only on whether the weight of
    the allowed-cpu mask is greater than 1 or not: if weight > 1 (or
    <= 1) the task becomes pushable (or not pushable). We are not
    interested in its exact value, whether it is 3 or 4, for example.
    So if we are changing affinity from a set of 3 cpus to a set of 4,
    the task will be dequeued and enqueued sequentially without any
    important difference compared with the initial state. The only
    difference is in the internal representation of the plist queue of
    pushable tasks, and the fact that the task may no longer be the
    first in a sequence of tasks of the same priority. But that gains
    us nothing. (A sketch of the boundary check follows this entry.)

    Link: http://lkml.kernel.org/r/273741334120764@web83.yandex.ru

    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Signed-off-by: Tkhai Kirill
    Signed-off-by: Steven Rostedt

    Kirill Tkhai
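
    A sketch of the boundary check in set_cpus_allowed_rt(): only touch the
    pushable-tasks plist when the allowed-cpu count crosses the one-cpu
    boundary (the surrounding code and label are illustrative):

      weight = cpumask_weight(new_mask);

      /* Pushability only changes when we cross the "more than one cpu"
       * boundary; e.g. going from 3 allowed cpus to 4 needs no
       * dequeue/enqueue of the pushable entry at all. */
      if ((p->rt.nr_cpus_allowed > 1) == (weight > 1))
              goto update_mask_only;          /* hypothetical label */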
     

27 Mar, 2012

1 commit

  • Avoid extra work by continuing on to the next rt_rq if the highest
    prio task in current rt_rq is the same priority as our candidate
    task.

    More detailed explanation: if next is not NULL, then we have found a
    candidate task, and its priority is next->prio. Now we are looking
    for an even higher priority task in the other rt_rq's. idx is the
    highest priority in the current candidate rt_rq. In the current 3.3
    code, if idx is equal to next->prio, we would start scanning the tasks
    in that rt_rq and replace the current candidate task with a task from
    that rt_rq. But the new task would only have a priority that is equal
    to our previous candidate task, so we have not advanced our goal of
    finding a higher prio task. So we should avoid the extra work by
    continuing on to the next rt_rq if idx is equal to next->prio.

    Signed-off-by: Michael J Wang
    Acked-by: Steven Rostedt
    Reviewed-by: Yong Zhang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/2EF88150C0EF2C43A218742ED384C1BC0FC83D6B@IRVEXCHMB08.corp.ad.broadcom.com
    Signed-off-by: Ingo Molnar

    Michael J Wang
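
    The change boils down to one relational operator in
    pick_next_highest_task_rt(); a sketch of the intent (lower value means
    higher priority):

      /* idx: best priority present in this rt_rq; next: candidate so far */
      if (next && next->prio <= idx)
              continue;       /* was: next->prio < idx */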
     

13 Mar, 2012

1 commit

  • There are a few awkward printk()s inside of the scheduler guts that
    people prefer to keep but that really are rather deadlock prone.
    Fudge around it by storing the text in a per-cpu buffer and polling
    it from the existing printk_tick() handler.

    This will drop output when it fires more frequently than once a
    tick; however, only the affinity warning could possibly go that
    fast, and for that just one message should suffice to notify the
    admin he's done something silly. (A sketch follows this entry.)

    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/n/tip-wua3lmkt3dg8nfts66o6brne@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
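
    A condensed sketch of the scheme: format into a per-cpu buffer from
    scheduler context and let the tick-time printk path flush it later.
    Buffer size and the pending flag are illustrative:

      #define PRINTK_BUF_SIZE 512
      static DEFINE_PER_CPU(char [PRINTK_BUF_SIZE], printk_sched_buf);
      static DEFINE_PER_CPU(int, printk_pending);

      int printk_sched(const char *fmt, ...)
      {
              unsigned long flags;
              va_list args;
              int r;

              local_irq_save(flags);
              va_start(args, fmt);
              /* no console/log locks taken here, so safe under rq->lock */
              r = vsnprintf(__get_cpu_var(printk_sched_buf),
                            PRINTK_BUF_SIZE, fmt, args);
              va_end(args);
              __this_cpu_write(printk_pending, 1);
              local_irq_restore(flags);

              return r;
      }

      /* printk_tick() then notices printk_pending from the timer tick and
       * emits the buffered line outside of the scheduler locks. */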
     

01 Mar, 2012

2 commits

  • When a runqueue has rt_runtime_us = 0, the only way it can
    accumulate rt_time is via PI boosting. That causes the runqueue to
    be throttled, and replenishing does not change anything because
    rt_runtime_us = 0. So avoid that situation by clearing rt_time and
    skipping the throttling altogether. (A sketch follows this entry.)

    Signed-off-by: Peter Zijlstra
    [ Changelog ]
    Signed-off-by: Thomas Gleixner
    Link: http://lkml.kernel.org/n/tip-7x70cypsotjb4jvcor3edctk@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
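
    A sketch of the special case in the throttling check (illustrative, not
    the verbatim patch):

      if (rt_rq->rt_time > runtime) {
              struct rt_bandwidth *rt_b = sched_rt_bandwidth(rt_rq);

              if (likely(rt_b->rt_runtime)) {
                      rt_rq->rt_throttled = 1;
              } else {
                      /* rt_runtime_us == 0: the accumulated time can only
                       * come from PI boosting; drop it instead of throttling,
                       * since a replenish of 0 ns would never undo it. */
                      rt_rq->rt_time = 0;
              }
      }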
     
  • When a runqueue is throttled we cannot disable the period timer
    because that timer is the only way to undo the throttling.

    We got stale throttling entries when a rq was throttled and then the
    global sysctl was disabled, which stopped the timer.

    Signed-off-by: Peter Zijlstra
    [ Added changelog ]
    Signed-off-by: Thomas Gleixner
    Link: http://lkml.kernel.org/n/tip-nuj34q52p6ro7szapuz84i0v@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

22 Feb, 2012

1 commit

  • Currently the initial SCHED_RR timeslice of init_task is HZ, which
    means 1s, and is not the same as the default SCHED_RR timeslice
    DEF_TIMESLICE.

    Change that initial timeslice to DEF_TIMESLICE. (A sketch follows
    this entry.)

    Signed-off-by: Hiroshi Shimamoto
    [ s/DEF_TIMESLICE/RR_TIMESLICE/g ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4F3C9995.3010800@ct.jp.nec.com
    Signed-off-by: Ingo Molnar

    Hiroshi Shimamoto
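
    A sketch of the resulting INIT_TASK initializer; RR_TIMESLICE (the
    post-rename spelling of DEF_TIMESLICE) is 100ms expressed in jiffies:

      /* default SCHED_RR timeslice: 100 msecs worth of jiffies */
      #define RR_TIMESLICE    (100 * HZ / 1000)

      /* INIT_TASK(tsk): previously .time_slice = HZ, i.e. one full second */
      .rt = {
              .run_list   = LIST_HEAD_INIT(tsk.rt.run_list),
              .time_slice = RR_TIMESLICE,
      },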
     

27 Jan, 2012

1 commit

  • This issue happens under the following conditions:

    1. preemption is off
    2. __ARCH_WANT_INTERRUPTS_ON_CTXSW is defined
    3. RT scheduling class
    4. SMP system

    Sequence is as follows:

    1. Suppose the current task is A; schedule() starts.
    2. Task A is enqueued as a pushable task at the entry of schedule():
         __schedule
           prev = rq->curr;
           ...
           put_prev_task
             put_prev_task_rt
               enqueue_pushable_task
    3. Task B is picked as the next task:
         next = pick_next_task(rq);
    4. rq->curr is set to task B and context_switch() is started:
         rq->curr = next;
    5. At the entry of context_switch(), this cpu's rq->lock is released:
         context_switch
           prepare_task_switch
             prepare_lock_switch
               raw_spin_unlock_irq(&rq->lock);
    6. Shortly after rq->lock is released, an interrupt occurs and IRQ
       context starts.
    7. try_to_wake_up(), called from the ISR, acquires rq->lock:
         try_to_wake_up
           ttwu_remote
             rq = __task_rq_lock(p)
             ttwu_do_wakeup(rq, p, wake_flags);
               task_woken_rt
    8. push_rt_task() picks task A, which was enqueued earlier:
         task_woken_rt
           push_rt_tasks(rq)
             next_task = pick_next_pushable_task(rq)
    9. In find_lock_lowest_rq(), if double_lock_balance() returns 0,
       lowest_rq can be a remote rq. (But if preemption is on,
       double_lock_balance() always returns 1 and this doesn't happen.)
         push_rt_task
           find_lock_lowest_rq
             if (double_lock_balance(rq, lowest_rq))..
    10. find_lock_lowest_rq() returns an available rq, and task A is
        migrated to the remote cpu/rq:
         push_rt_task
           ...
           deactivate_task(rq, next_task, 0);
           set_task_cpu(next_task, lowest_rq->cpu);
           activate_task(lowest_rq, next_task, 0);
    11. But task A is still in IRQ context on this cpu. So task A is
        scheduled by two cpus at the same time until it returns from
        the IRQ, and task A's stack is corrupted.

    To fix it, don't migrate an RT task if it's still running. (A
    sketch follows this entry.)

    Signed-off-by: Chanho Min
    Signed-off-by: Peter Zijlstra
    Acked-by: Steven Rostedt
    Cc:
    Link: http://lkml.kernel.org/r/CAOAMb1BHA=5fm7KTewYyke6u-8DP0iUuJMpgQw54vNeXFsGpoQ@mail.gmail.com
    Signed-off-by: Ingo Molnar

    Chanho Min
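
    A sketch of the fix in push_rt_task(): with __ARCH_WANT_INTERRUPTS_ON_CTXSW
    the pushable task picked here may still be mid context-switch on this cpu,
    so refuse to push it while it is running (illustrative placement):

      next_task = pick_next_pushable_task(rq);
      if (!next_task)
              return 0;

      #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
              /* still running in IRQ-interrupted context_switch() on this cpu */
              if (unlikely(task_running(rq, next_task)))
                      return 0;
      #endif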
     

06 Dec, 2011

2 commits

  • The second call to sched_rt_period() is redundant, because the value of the
    rt_runtime was already read and it was protected by the ->rt_runtime_lock.

    Signed-off-by: Shan Hai
    Reviewed-by: Kamalesh Babulal
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1322535836-13590-2-git-send-email-haishan.bai@gmail.com
    Signed-off-by: Ingo Molnar

    Shan Hai
     
  • rt.nr_cpus_allowed is always available; use it to bail out of
    select_task_rq() when only one cpu can be used, saving some cycles
    for pinned tasks. (A sketch follows the perf comparison below.)

    See the line marked with '*' below:

    # taskset -c 3 pipe-test

    PerfTop: 997 irqs/sec kernel:89.5% exact: 0.0% [1000Hz cycles], (all, CPU: 3)
    ------------------------------------------------------------------------------------------------

                     Virgin                                        Patched
     samples  pcnt  function                      samples  pcnt  function
     _______  ____  ___________________________   _______  ____  ___________________________

     2880.00 10.2%  __schedule                    3136.00 11.3%  __schedule
     1634.00  5.8%  pipe_read                     1615.00  5.8%  pipe_read
     1458.00  5.2%  system_call                   1534.00  5.5%  system_call
     1382.00  4.9%  _raw_spin_lock_irqsave        1412.00  5.1%  _raw_spin_lock_irqsave
     1202.00  4.3%  pipe_write                    1255.00  4.5%  copy_user_generic_string
     1164.00  4.1%  copy_user_generic_string      1241.00  4.5%  __switch_to
     1097.00  3.9%  __switch_to                    929.00  3.3%  mutex_lock
      872.00  3.1%  mutex_lock                     846.00  3.0%  mutex_unlock
      687.00  2.4%  mutex_unlock                   804.00  2.9%  pipe_write
      682.00  2.4%  native_sched_clock             713.00  2.6%  native_sched_clock
      643.00  2.3%  system_call_after_swapgs       653.00  2.3%  _raw_spin_unlock_irqrestore
      617.00  2.2%  sched_clock_local              633.00  2.3%  fsnotify
      612.00  2.2%  fsnotify                       605.00  2.2%  sched_clock_local
      596.00  2.1%  _raw_spin_unlock_irqrestore    593.00  2.1%  system_call_after_swapgs
      542.00  1.9%  sysret_check                   559.00  2.0%  sysret_check
      467.00  1.7%  fget_light                     472.00  1.7%  fget_light
      462.00  1.6%  finish_task_switch             461.00  1.7%  finish_task_switch
      437.00  1.5%  vfs_write                      442.00  1.6%  vfs_write
      431.00  1.5%  do_sync_write                  428.00  1.5%  do_sync_write
    * 413.00  1.5%  select_task_rq_fair            404.00  1.5%  _raw_spin_lock_irq
      386.00  1.4%  update_curr                    402.00  1.4%  update_curr
      385.00  1.4%  rw_verify_area                 389.00  1.4%  do_sync_read
      377.00  1.3%  _raw_spin_lock_irq             378.00  1.4%  vfs_read
      369.00  1.3%  do_sync_read                   340.00  1.2%  pipe_iov_copy_from_user
      360.00  1.3%  vfs_read                       316.00  1.1%  __wake_up_sync_key
      342.00  1.2%  hrtick_start_fair              313.00  1.1%  __wake_up_common

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1321971504.6855.15.camel@marge.simson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
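
    A sketch of the early bail-out in select_task_rq(); the exact placement
    and fallback handling are assumptions:

      static int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
      {
              int cpu;

              /* A task pinned to a single cpu has nowhere else to go, so skip
               * the sched-class selection (and its sched-domain walk). */
              if (p->rt.nr_cpus_allowed == 1)
                      cpu = task_cpu(p);
              else
                      cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);

              return cpu;
      }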
     

17 Nov, 2011

1 commit