Eric Lee / smarc-fsl-linux-kernel

06 Oct, 2011

3 commits

1c83437e8 sched: Warn on rt throttling ... Browse Code »

The default rt-throttling is a source of never ending questions. Warn
once when we go into throttling so folks have that info in dmesg.

Signed-off-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1110051331480.18778@ionos
Signed-off-by: Ingo Molnar

Thomas Gleixner
2011-10-06 18:47:04 +0800
4939602a2 sched: Unify the ->cpus_allowed mask copy ... Browse Code »

Currently every sched_class::set_cpus_allowed() implementation has to
copy the cpumask into task_struct::cpus_allowed, this is pointless,
put this copy in the generic code.

Signed-off-by: Peter Zijlstra
Acked-by: Thomas Gleixner
Link: http://lkml.kernel.org/n/tip-jhl5s9fckd9ptw1fzbqqlrd3@git.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
2011-10-06 18:47:00 +0800
fa17b507f sched: Wrap scheduler p->cpus_allowed access ... Browse Code »

This task is preparatory for the migrate_disable() implementation, but
stands on its own and provides a cleanup.

It currently only converts those sites required for task-placement.
Kosaki-san once mentioned replacing cpus_allowed with a proper
cpumask_t instead of the NR_CPUS sized array it currently is, that
would also require something like this.

Signed-off-by: Peter Zijlstra
Acked-by: Thomas Gleixner
Cc: KOSAKI Motohiro
Link: http://lkml.kernel.org/n/tip-e42skvaddos99psip0vce41o@git.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
2011-10-06 18:46:56 +0800

04 Oct, 2011

1 commit

22f92bacb Merge branch 'linus' into sched/core ... Browse Code »

Merge reason: pick up the latest fixes.

Signed-off-by: Ingo Molnar

Ingo Molnar
2011-10-04 17:09:08 +0800

18 Sep, 2011

1 commit

3be209a8e sched/rt: Migrate equal priority tasks to available CPUs ... Browse Code »
1

Commit 43fa5460fe60dea5c610490a1d263415419c60f6 ("sched: Try not to
migrate higher priority RT tasks") also introduced a change in behavior
which keeps RT tasks on the same CPU if there is an equal priority RT
task currently running even if there are empty CPUs available.

This can cause unnecessary wakeup latencies, and can prevent the
scheduler from balancing all RT tasks across available CPUs.

This change causes an RT task to search for a new CPU if an equal
priority RT task is already running on wakeup. Lower priority tasks
will still have to wait on higher priority tasks, but the system should
still balance out because there is always the possibility that if there
are both a high and low priority RT tasks on a given CPU that the high
priority task could wakeup while the low priority task is running and
force it to search for a better runqueue.

Signed-off-by: Shawn Bohrer
Acked-by: Steven Rostedt
Tested-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Cc: stable@kernel.org # 37+
Link: http://lkml.kernel.org/r/1315837684-18733-1-git-send-email-sbohrer@rgmadvisors.com
Signed-off-by: Ingo Molnar

Shawn Bohrer
2011-09-18 19:48:56 +0800

14 Aug, 2011

6 commits

953bfcd10 sched: Implement hierarchical task accounting for SCHED_OTHER ... Browse Code »

Introduce hierarchical task accounting for the group scheduling case in CFS, as
well as promoting the responsibility for maintaining rq->nr_running to the
scheduling classes.

The primary motivation for this is that with scheduling classes supporting
bandwidth throttling it is possible for entities participating in throttled
sub-trees to not have root visible changes in rq->nr_running across activate
and de-activate operations. This in turn leads to incorrect idle and
weight-per-task load balance decisions.

This also allows us to make a small fixlet to the fastpath in pick_next_task()
under group scheduling.

Note: this issue also exists with the existing sched_rt throttling mechanism.
This patch does not address that.

Signed-off-by: Paul Turner
Reviewed-by: Hidetoshi Seto
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110721184756.878333391@google.com
Signed-off-by: Ingo Molnar

Paul Turner
2011-08-14 18:01:13 +0800
5181f4a46 sched: Use pushable_tasks to determine next highest prio ... Browse Code »

Hillf Danton proposed a patch (see link) that cleaned up the
sched_rt code that calculates the priority of the next highest priority
task to be used in finding run queues to pull from.

His patch removed the calculating of the next prio to just use the current
prio when deteriming if we should examine a run queue to pull from. The problem
with his patch was that it caused more false checks. Because we check a run
queue for pushable tasks if the current priority of that run queue is higher
in priority than the task about to run on our run queue. But after grabbing
the locks and doing the real check, we find that there may not be a task
that has a higher prio task to pull. Thus the locks were taken with nothing to
do.

I added some trace_printks() to record when and how many times the run queue
locks were taken to check for pullable tasks, compared to how many times we
pulled a task.

With the current method, it was:

3806 locks taken vs 2812 pulled tasks

With Hillf's patch:

6728 locks taken vs 2804 pulled tasks

The number of times locks were taken to pull a task went up almost double with
no more success rate.

But his patch did get me thinking. When we look at the priority of the highest
task to consider taking the locks to do a pull, a failure to pull can be one
of the following: (in order of most likely)

o RT task was pushed off already between the check and taking the lock
o Waiting RT task can not be migrated
o RT task's CPU affinity does not include the target run queue's CPU
o RT task's priority changed between the check and taking the lock

And with Hillf's patch, the thing that caused most of the failures, is
the RT task to pull was not at the right priority to pull (not greater than
the current RT task priority on the target run queue).

Most of the above cases we can't help. But the current method does not check
if the next highest prio RT task can be migrated or not, and if it can not,
we still grab the locks to do the test (we don't find out about this fact until
after we have the locks). I thought about this case, and realized that the
pushable task plist that is maintained only holds RT tasks that can migrate.
If we move the calculating of the next highest prio task from the inc/dec_rt_task()
functions into the queuing of the pushable tasks, then we only measure the
priorities of those tasks that we push, and we get this basically for free.

Not only does this patch make the code a little more efficient, it cleans it
up and makes it a little simpler.

Thanks to Hillf Danton for inspiring me on this patch.

Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Cc: Hillf Danton
Cc: Gregory Haskins
Link: http://lkml.kernel.org/r/BANLkTimQ67180HxCx5vgMqumqw1EkFh3qg@mail.gmail.com
Signed-off-by: Ingo Molnar

Steven Rostedt
2011-08-14 18:00:55 +0800
c37495fd0 sched: Balance RT tasks when forked as well ... Browse Code »

When a new task is woken, the code to balance the RT task is currently
skipped in the select_task_rq() call. But it will be pushed if the rq
is currently overloaded with RT tasks anyway. The issue is that we
already queued the task, and if it does get pushed, it will have to
be dequeued and requeued on the new run queue. The advantage with
pushing it first is that we avoid this requeuing as we are pushing it
off before the task is ever queued.

See commit 318e0893ce3f524 ("sched: pre-route RT tasks on wakeup")
for more details.

The return of select_task_rq() when it is not a wake up has also been
changed to return task_cpu() instead of smp_processor_id(). This is more
of a sanity because the current only other user of select_task_rq()
besides wake ups, is an exec, where task_cpu() should also be the same
as smp_processor_id(). But if it is used for other purposes, lets keep
the task on the same CPU. Why would we mant to migrate it to the current
CPU?

Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Cc: Hillf Danton
Link: http://lkml.kernel.org/r/20110617015919.832743148@goodmis.org
Signed-off-by: Ingo Molnar

Steven Rostedt
2011-08-14 18:00:52 +0800
1812a643c sched: Remove resetting exec_start in put_prev_task_rt() ... Browse Code »

There's no reason to clean the exec_start in put_prev_task_rt() as it is reset
when the task gets back to the run queue. This saves us doing a store() in the
fast path.

Signed-off-by: Hillf Danton
Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Cc: Mike Galbraith
Cc: Yong Zhang
Link: http://lkml.kernel.org/r/BANLkTimqWD=q6YnSDi-v9y=LMWecgEzEWg@mail.gmail.com
Signed-off-by: Ingo Molnar

Hillf Danton
2011-08-14 18:00:50 +0800
311e800e1 sched, rt: Fix rq->rt.pushable_tasks bug in push_rt_task() ... Browse Code »

Do not call dequeue_pushable_task() when failing to push an eligible
task, as it remains pushable, merely not at this particular moment.

Signed-off-by: Hillf Danton
Signed-off-by: Mike Galbraith
Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Cc: Yong Zhang
Link: http://lkml.kernel.org/r/1306895385.4791.26.camel@marge.simson.net
Signed-off-by: Ingo Molnar

Hillf Danton
2011-08-14 18:00:48 +0800
67d955383 sched: Remove noop in next_prio() ... Browse Code »

When computing the next priority for a given run-queue, the check for
RT priority of the task determined by the pick_next_highest_task_rt()
function could be removed, since only RT tasks are returned by the
function.

Reviewed-by: Yong Zhang
Signed-off-by: Hillf Danton
Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/BANLkTimxmWiof9s5AvS3v_0X+sMiE=0x5g@mail.gmail.com
Signed-off-by: Ingo Molnar

Hillf Danton
2011-08-14 18:00:45 +0800

01 Jul, 2011

2 commits

1c09ab0d2 sched: Skip autogroup when looking for all rt sched groups ... Browse Code »

Since commit ec514c48 ("sched: Fix rt_rq runtime leakage bug")
'cat /proc/sched_debug' will print data of root_task_group.rt_rq
multiple times.

This is because autogroup does not have its own rt group, instead
rt group of autogroup is linked to root_task_group.

So skip it when we are looking for all rt sched groups, and it
will also save some noop operation against root_task_group when
__disable_runtime()/__enable_runtime().

-v2: Based on Cheng Xu's idea which uses less code.

Signed-off-by: Yong Zhang
Cc: Mike Galbraith
Cc: Cheng Xu
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/BANLkTi=87P3RoTF_UEtamNfc_XGxQXE__Q@mail.gmail.com
Signed-off-by: Ingo Molnar

Yong Zhang
2011-07-01 16:39:08 +0800
36b2e922b Merge commit 'v3.0-rc5' into sched/core ... Browse Code »

Merge reason: Move to a (much) newer base.

Signed-off-by: Ingo Molnar

Ingo Molnar
2011-07-01 16:34:24 +0800

15 Jun, 2011

2 commits

0da938c44 sched: Check if lowest_mask is initialized in find_lowest_rq() ... Browse Code »

On system boot up, the lowest_mask is initialized with an
early_initcall(). But RT tasks may wake up on other
early_initcall() callers before the lowest_mask is initialized,
causing a system crash.

Commit "d72bce0e67 rcu: Cure load woes" was the first commit
to wake up RT tasks in early init. Before this commit this bug
should not happen.

Reported-by: Andrew Theurer
Tested-by: Andrew Theurer
Tested-by: Paul E. McKenney
Signed-off-by: Steven Rostedt
Acked-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110614223657.824872966@goodmis.org
Signed-off-by: Ingo Molnar

Steven Rostedt
2011-06-15 17:44:48 +0800
8dd0de8be sched: Fix need_resched() when checking peempt ... Browse Code »

The RT preempt check tests the wrong task if NEED_RESCHED is
set. It currently checks the local CPU task. It is supposed to
check the task that is running on the runqueue we are about to
wake another task on.

Signed-off-by: Hillf Danton
Reviewed-by: Yong Zhang
Signed-off-by: Steven Rostedt
Link: http://lkml.kernel.org/r/20110614223657.450239027@goodmis.org
Signed-off-by: Ingo Molnar

Hillf Danton
2011-06-15 15:50:32 +0800

03 Jun, 2011

1 commit

e197f094b Merge branch 'unlikely/sched' of git://git.kernel.org/pub/scm/linux/kernel/git/r… ... Browse Code »

…ostedt/linux-2.6-trace into sched/urgent

Ingo Molnar
2011-06-03 16:27:47 +0800

28 May, 2011

1 commit

cd4ae6adf sched: More sched_domain iterations fixes ... Browse Code »

sched_domain iterations needs to be protected by rcu_read_lock() now,
this patch adds another two places which needs the rcu lock, which is
spotted by following suspicious rcu_dereference_check() usage warnings.

kernel/sched_rt.c:1244 invoked rcu_dereference_check() without protection!
kernel/sched_stats.h:41 invoked rcu_dereference_check() without protection!

Signed-off-by: Xiaotian Feng
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1303469634-11678-1-git-send-email-dfeng@redhat.com
Signed-off-by: Ingo Molnar

Xiaotian Feng
2011-05-28 23:02:54 +0800

20 May, 2011

1 commit

80fe02b5d Merge branches 'sched-core-for-linus' and 'sched-urgent-for-linus' of git://git.… ... Browse Code »

…kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (60 commits)
sched: Fix and optimise calculation of the weight-inverse
sched: Avoid going ahead if ->cpus_allowed is not changed
sched, rt: Update rq clock when unthrottling of an otherwise idle CPU
sched: Remove unused parameters from sched_fork() and wake_up_new_task()
sched: Shorten the construction of the span cpu mask of sched domain
sched: Wrap the 'cfs_rq->nr_spread_over' field with CONFIG_SCHED_DEBUG
sched: Remove unused 'this_best_prio arg' from balance_tasks()
sched: Remove noop in alloc_rt_sched_group()
sched: Get rid of lock_depth
sched: Remove obsolete comment from scheduler_tick()
sched: Fix sched_domain iterations vs. RCU
sched: Next buddy hint on sleep and preempt path
sched: Make set_*_buddy() work on non-task entities
sched: Remove need_migrate_task()
sched: Move the second half of ttwu() to the remote cpu
sched: Restructure ttwu() some more
sched: Rename ttwu_post_activation() to ttwu_do_wakeup()
sched: Remove rq argument from ttwu_stat()
sched: Remove rq->lock from the first half of ttwu()
sched: Drop rq->lock from sched_exec()
...

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix rt_rq runtime leakage bug

Linus Torvalds
2011-05-20 08:41:22 +0800

16 May, 2011

2 commits

61eadef6a sched, rt: Update rq clock when unthrottling of an otherwise idle CPU ... Browse Code »

If an RT task is awakened while it's rt_rq is throttled, the time between
wakeup/enqueue and unthrottle/selection may be accounted as rt_time
if the CPU is idle. Set rq->skip_clock_update negative upon throttle
release to tell put_prev_task() that we need a clock update.

Reported-by: Thomas Giesel
Signed-off-by: Mike Galbraith
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1304059010.7472.1.camel@marge.simson.net
Signed-off-by: Ingo Molnar

Mike Galbraith
2011-05-16 17:01:17 +0800
ec514c487 sched: Fix rt_rq runtime leakage bug ... Browse Code »

This patch is to fix the real-time scheduler bug reported at:

https://lkml.org/lkml/2011/4/26/13

That is, when running multiple real-time threads on every logical CPUs
and then turning off one CPU, the kernel will bug at function
__disable_runtime().

Function __disable_runtime() bugs and reports leakage of rt_rq runtime.
The root cause is __disable_runtime() assumes it iterates through all
the existing rt_rq's while walking rq->leaf_rt_rq_list, which actually
contains only runnable rt_rq's. This problem also applies to
__enable_runtime() and print_rt_stats().

The patch is based on above analysis, appears to fix the problem, but is
only lightly tested.

Reported-by: Paul E. McKenney
Tested-by: Paul E. McKenney
Signed-off-by: Cheng Xu
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/4DCE1F12.6040609@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar

Cheng Xu
2011-05-16 17:00:54 +0800

14 Apr, 2011

2 commits

7608dec2c sched: Drop the rq argument to sched_class::select_task_rq() ... Browse Code »

In preparation of calling select_task_rq() without rq->lock held, drop
the dependency on the rq argument.

Reviewed-by: Frank Rowand
Signed-off-by: Peter Zijlstra
Cc: Mike Galbraith
Cc: Nick Piggin
Cc: Linus Torvalds
Cc: Andrew Morton
Link: http://lkml.kernel.org/r/20110405152729.031077745@chello.nl
Signed-off-by: Ingo Molnar

Peter Zijlstra
2011-04-14 14:52:36 +0800
fd2f4419b sched: Provide p->on_rq ... Browse Code »

Provide a generic p->on_rq because the p->se.on_rq semantics are
unfavourable for lockless wakeups but needed for sched_fair.

In particular, p->on_rq is only cleared when we actually dequeue the
task in schedule() and not on any random dequeue as done by things
like __migrate_task() and __sched_setscheduler().

This also allows us to remove p->se usage from !sched_fair code.

Reviewed-by: Frank Rowand
Cc: Mike Galbraith
Cc: Nick Piggin
Cc: Linus Torvalds
Cc: Andrew Morton
Signed-off-by: Ingo Molnar
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/20110405152728.949545047@chello.nl

Peter Zijlstra
2011-04-14 14:52:35 +0800

31 Mar, 2011

1 commit

25985edce Fix common misspellings ... Browse Code »

Fixes generated by 'codespell' and manually reviewed.

Signed-off-by: Lucas De Marchi

Lucas De Marchi
2011-03-31 22:26:23 +0800

04 Mar, 2011

2 commits

e0a92c174 Merge branch 'sched/urgent' into sched/core ... Browse Code »

Merge reason: Add fixes before applying dependent patches.

Signed-off-by: Ingo Molnar

Ingo Molnar
2011-03-04 18:12:26 +0800
0c3b91680 sched: Fix sched rt group scheduling when hierachy is enabled ... Browse Code »

The current sched rt code is broken when it comes to hierarchical
scheduling, this patch fixes two problems

1. It adds redundant enqueuing (harmless) when it finds a queue
has tasks enqueued, but it has no run time and it is not
throttled.

2. The most important change is in sched_rt_rq_enqueue/dequeue.
The code just picks the rt_rq belonging to the current cpu
on which the period timer runs, the patch fixes it, so that
the correct rt_se is enqueued/dequeued.

Tested with a simple hierarchy

/c/d, c and d assigned similar runtimes of 50,000 and a while
1 loop runs within "d". Both c and d get throttled, without
the patch, the task just stops running and never runs (depends
on where the sched_rt b/w timer runs). With the patch, the
task is throttled and runs as expected.

[ bharata, suggestions on how to pick the rt_se belong to the
rt_rq and correct cpu ]

Signed-off-by: Balbir Singh
Acked-by: Bharata B Rao
Signed-off-by: Peter Zijlstra
Cc: stable@kernel.org
LKML-Reference:
Signed-off-by: Ingo Molnar

Balbir Singh
2011-03-04 18:03:18 +0800

16 Feb, 2011

1 commit

48fa4b8ec Merge commit 'v2.6.38-rc5' into sched/core ... Browse Code »

Merge reason: Pick up upstream fixes.

Signed-off-by: Ingo Molnar

Ingo Molnar
2011-02-16 20:31:55 +0800

03 Feb, 2011

1 commit

06c3bc655 sched: Fix update_curr_rt() ... Browse Code »

cpu_stopper_thread()
migration_cpu_stop()
__migrate_task()
deactivate_task()
dequeue_task()
dequeue_task_rq()
update_curr_rt()

Will call update_curr_rt() on rq->curr, which at that time is
rq->stop. The problem is that rq->stop.prio matches an RT prio and
thus falsely assumes its a rt_sched_class task.

Reported-Debuged-Tested-Acked-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra
LKML-Reference:
Cc: stable@kernel.org # .37
Signed-off-by: Ingo Molnar

Peter Zijlstra
2011-02-03 19:21:33 +0800

26 Jan, 2011

1 commit

da7a735e5 sched: Fix switch_from_fair() ... Browse Code »

When a task is taken out of the fair class we must ensure the vruntime
is properly normalized because when we put it back in it will assume
to be normalized.

The case that goes wrong is when changing away from the fair class
while sleeping. Sleeping tasks have non-normalized vruntime in order
to make sleeper-fairness work. So treat the switch away from fair as a
wakeup and preserve the relative vruntime.

Also update sysrq-n to call the ->switch_{to,from} methods.

Reported-by: Onkalo Samu
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Peter Zijlstra
2011-01-26 19:33:22 +0800

14 Dec, 2010

2 commits

8e54a2c03 sched: Change pick_next_task_rt from unlikely to likely ... Browse Code »

The if (unlikely(!rt_rq->rt_nr_running)) test in pick_next_task_rt()
tests if there is another rt task ready to run. If so, then pick it.

In most systems, only one RT task runs at a time most of the time.
Running the branch unlikely annotator profiler on a system doing average
work "running firefox, evolution, xchat, distcc builds, etc", it showed the
following:

correct incorrect % Function File Line
------- --------- - -------- ---- ----
324344 135104992 99 _pick_next_task_rt sched_rt.c 1064

99% of the time the condition is true. When an RT task schedules out,
it is unlikely that another RT task is waiting to run on that same run queue.

Simply remove the unlikely() condition.

Acked-by: Gregory Haskins
Cc:Peter Zijlstra
Signed-off-by: Steven Rostedt

Steven Rostedt
2010-12-14 08:55:15 +0800
33c3d6c61 sched: Cleanup pre_schedule_rt ... Browse Code »

Since [commit 9a897c5a:
sched: RT-balance, replace hooks with pre/post schedule and wakeup methods]
we must call pre_schedule_rt if prev is rt task.
So condition rt_task(prev) is always true and the 'unlikely' declaration is
simply incorrect.

Signed-off-by: Yong Zhang
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Rusty Russell
Signed-off-by: Steven Rostedt

Yong Zhang
2010-12-14 04:02:46 +0800

18 Nov, 2010

1 commit

3d4b47b4b sched: Implement on-demand (active) cfs_rq list ... Browse Code »

Make certain load-balance actions scale per number of active cgroups
instead of the number of existing cgroups.

This makes wakeup/sleep paths more expensive, but is a win for systems
where the vast majority of existing cgroups are idle.

Signed-off-by: Paul Turner
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Peter Zijlstra
2010-11-18 20:27:47 +0800

19 Oct, 2010

2 commits

305e6835e sched: Do not account irq time to current task ... Browse Code »

Scheduler accounts both softirq and interrupt processing times to the
currently running task. This means, if the interrupt processing was
for some other task in the system, then the current task ends up being
penalized as it gets shorter runtime than otherwise.

Change sched task accounting to acoount only actual task time from
currently running task. Now update_curr(), modifies the delta_exec to
depend on rq->clock_task.

Note that this change only handles CONFIG_IRQ_TIME_ACCOUNTING case. We can
extend this to CONFIG_VIRT_CPU_ACCOUNTING with minimal effort. But, thats
for later.

This change will impact scheduling behavior in interrupt heavy conditions.

Tested on a 4-way system with eth0 handled by CPU 2 and a network heavy
task (nc) running on CPU 3 (and no RSS/RFS). With that I have CPU 2
spending 75%+ of its time in irq processing. CPU 3 spending around 35%
time running nc task.

Now, if I run another CPU intensive task on CPU 2, without this change
/proc//schedstat shows 100% of time accounted to this task. With this
change, it rightly shows less than 25% accounted to this task as remaining
time is actually spent on irq processing.

Signed-off-by: Venkatesh Pallipadi
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Venkatesh Pallipadi
2010-10-19 02:52:26 +0800
492462742 sched: Unindent labels ... Browse Code »

Labels should be on column 0.

Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Peter Zijlstra
2010-10-19 00:41:56 +0800

21 Sep, 2010

2 commits

b3bc211cf sched: Give CPU bound RT tasks preference ... Browse Code »

If a high priority task is waking up on a CPU that is running a
lower priority task that is bound to a CPU, see if we can move the
high RT task to another CPU first. Note, if all other CPUs are
running higher priority tasks than the CPU bounded current task,
then it will be preempted regardless.

Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Cc: Gregory Haskins
LKML-Reference:
Signed-off-by: Ingo Molnar

Steven Rostedt
2010-09-21 19:57:12 +0800
43fa5460f sched: Try not to migrate higher priority RT tasks ... Browse Code »
1

When first working on the RT scheduler design, we concentrated on
keeping all CPUs running RT tasks instead of having multiple RT
tasks on a single CPU waiting for the migration thread to move
them. Instead we take a more proactive stance and push or pull RT
tasks from one CPU to another on wakeup or scheduling.

When an RT task wakes up on a CPU that is running another RT task,
instead of preempting it and killing the cache of the running RT
task, we look to see if we can migrate the RT task that is waking
up, even if the RT task waking up is of higher priority.

This may sound a bit odd, but RT tasks should be limited in
migration by the user anyway. But in practice, people do not do
this, which causes high prio RT tasks to bounce around the CPUs.
This becomes even worse when we have priority inheritance, because
a high prio task can block on a lower prio task and boost its
priority. When the lower prio task wakes up the high prio task, if
it happens to be on the same CPU it will migrate off of it.

But in reality, the above does not happen much either, because the
wake up of the lower prio task, which has already been boosted, if
it was on the same CPU as the higher prio task, it would then
migrate off of it. But anyway, we do not want to migrate them
either.

To examine the scheduling, I created a test program and examined it
under kernelshark. The test program created CPU * 2 threads, where
each thread had a different priority. The program takes different
options. The options used in this change log was to have priority
inheritance mutexes or not.

All threads did the following loop:

static void grab_lock(long id, int iter, int l)
{
ftrace_write("thread %ld iter %d, taking lock %d\n",
id, iter, l);
pthread_mutex_lock(&locks[l]);
ftrace_write("thread %ld iter %d, took lock %d\n",
id, iter, l);
busy_loop(nr_tasks - id);
ftrace_write("thread %ld iter %d, unlock lock %d\n",
id, iter, l);
pthread_mutex_unlock(&locks[l]);
}

void *start_task(void *id)
{
[...]
while (!done) {
for (l = 0; l < nr_locks; l++) {
grab_lock(id, i, l);
ftrace_write("thread %ld iter %d sleeping\n",
id, i);
ms_sleep(id);
}
i++;
}
[...]
}

The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
to the ftrace buffer to help analyze via ftrace.

The higher the id, the higher the prio, the shorter it does the
busy loop, but the longer it spins. This is usually the case with
RT tasks, the lower priority tasks usually run longer than higher
priority tasks.

At the end of the test, it records the number of loops each thread
took, as well as the number of voluntary preemptions, non-voluntary
preemptions, and number of migrations each thread took, taking the
information from /proc/$$/sched and /proc/$$/status.

Running this on a 4 CPU processor, the results without changes to
the kernel looked like this:

Task vol nonvol migrated iterations
---- --- ------ -------- ----------
0: 53 3220 1470 98
1: 562 773 724 98
2: 752 933 1375 98
3: 749 39 697 98
4: 758 5 515 98
5: 764 2 679 99
6: 761 2 535 99
7: 757 3 346 99

total: 5156 4977 6341 787

Each thread regardless of priority migrated a few hundred times.
The higher priority tasks, were a little better but still took
quite an impact.

By letting higher priority tasks bump the lower prio task from the
CPU, things changed a bit:

Task vol nonvol migrated iterations
---- --- ------ -------- ----------
0: 37 2835 1937 98
1: 666 1821 1865 98
2: 654 1003 1385 98
3: 664 635 973 99
4: 698 197 352 99
5: 703 101 159 99
6: 708 1 75 99
7: 713 1 2 99

total: 4843 6594 6748 789

The total # of migrations did not change (several runs showed the
difference all within the noise). But we now see a dramatic
improvement to the higher priority tasks. (kernelshark showed that
the watchdog timer bumped the highest priority task to give it the
2 count. This was actually consistent with every run).

Notice that the # of iterations did not change either.

The above was with priority inheritance mutexes. That is, when the
higher prority task blocked on a lower priority task, the lower
priority task would inherit the higher priority task (which shows
why task 6 was bumped so many times). When not using priority
inheritance mutexes, the current kernel shows this:

Task vol nonvol migrated iterations
---- --- ------ -------- ----------
0: 56 3101 1892 95
1: 594 713 937 95
2: 625 188 618 95
3: 628 4 491 96
4: 640 7 468 96
5: 631 2 501 96
6: 641 1 466 96
7: 643 2 497 96

total: 4458 4018 5870 765

Not much changed with or without priority inheritance mutexes. But
if we let the high priority task bump lower priority tasks on
wakeup we see:

Task vol nonvol migrated iterations
---- --- ------ -------- ----------
0: 115 3439 2782 98
1: 633 1354 1583 99
2: 652 919 1218 99
3: 645 713 934 99
4: 690 3 3 99
5: 694 1 4 99
6: 720 3 4 99
7: 747 0 1 100

Which shows a even bigger change. The big difference between task 3
and task 4 is because we have only 4 CPUs on the machine, causing
the 4 highest prio tasks to always have preference.

Although I did not measure cache misses, and I'm sure there would
be little to measure since the test was not data intensive, I could
imagine large improvements for higher priority tasks when dealing
with lower priority tasks. Thus, I'm satisfied with making the
change and agreeing with what Gregory Haskins argued a few years
ago when we first had this discussion.

One final note. All tasks in the above tests were RT tasks. Any RT
task will always preempt a non RT task that is running on the CPU
the RT task wants to run on.

Signed-off-by: Steven Rostedt
Signed-off-by: Peter Zijlstra
Cc: Gregory Haskins
LKML-Reference:
Signed-off-by: Ingo Molnar

Steven Rostedt
2010-09-21 19:57:12 +0800

18 Jun, 2010

1 commit

c32b4fce7 sched: task_tick_rt: Remove the obsolete ->signal != NULL check ... Browse Code »

Remove the obsolete ->signal != NULL check in watchdog().
Since ea6d290c ->signal can't be NULL.

Signed-off-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Oleg Nesterov
2010-06-18 16:46:56 +0800

03 Apr, 2010

3 commits

371fd7e7a sched: Add enqueue/dequeue flags ... Browse Code »

In order to reduce the dependency on TASK_WAKING rework the enqueue
interface to support a proper flags field.

Replace the int wakeup, bool head arguments with an int flags argument
and create the following flags:

ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
ENQUEUE_WAKING - the enqueue has relative vruntime due to
having sched_class::task_waking() called,
ENQUEUE_HEAD - the waking task should be places on the head
of the priority queue (where appropriate).

For symmetry also convert sched_class::dequeue() to a flags scheme.

Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Peter Zijlstra
2010-04-03 02:12:05 +0800
0017d7350 sched: Fix TASK_WAKING vs fork deadlock ... Browse Code »

Oleg noticed a few races with the TASK_WAKING usage on fork.

- since TASK_WAKING is basically a spinlock, it should be IRQ safe
- since we set TASK_WAKING (*) without holding rq->lock it could
be there still is a rq->lock holder, thereby not actually
providing full serialization.

(*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.

Cure the second issue by not setting TASK_WAKING in sched_fork(), but
only temporarily in wake_up_new_task() while calling select_task_rq().

Cure the first by holding rq->lock around the select_task_rq() call,
this will disable IRQs, this however requires that we push down the
rq->lock release into select_task_rq_fair()'s cgroup stuff.

Because select_task_rq_fair() still needs to drop the rq->lock we
cannot fully get rid of TASK_WAKING.

Reported-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra
LKML-Reference:
Signed-off-by: Ingo Molnar

Peter Zijlstra
2010-04-03 02:12:03 +0800
c9494727c Merge branch 'linus' into sched/core ... Browse Code »

Merge reason: update to latest upstream

Signed-off-by: Ingo Molnar

Ingo Molnar
2010-04-03 02:03:08 +0800

14 Mar, 2010

1 commit

80a186074 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kerne… ... Browse Code »

…l/git/tip/linux-2.6-tip

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: Fix pick_next_highest_task_rt() for cgroups
sched: Cleanup: remove unused variable in try_to_wake_up()
x86: Fix sched_clock_cpu for systems with unsynchronized TSC

Linus Torvalds
2010-03-14 06:46:18 +0800