Eric Lee / smarc-fsl-linux-kernel

12 Jan, 2009

1 commit

d38b223c8 cpumask: reduce stack usage in find_lowest_rq

Impact: reduce stack usage, cleanup

Use a cpumask_var_t in find_lowest_rq() and clean up other old
cpumask_t calls.

Signed-off-by: Mike Travis

Mike Travis
2009-01-12 02:13:22 +0800

04 Jan, 2009

2 commits

6ca09dfc9 sched: put back some stack hog changes that were undone in kernel/sched.c

Impact: prevents panic from stack overflow on numa-capable machines.

Some of the "removal of stack hogs" changes in kernel/sched.c that used
node_to_cpumask_ptr were undone by the early cpumask API updates, which
causes a panic due to stack overflow. This patch puts those changes back
by using cpumask_of_node(), which returns a 'const struct cpumask *'.

In addition, cpu_coregroup_map is replaced with cpu_coregroup_mask, further
reducing stack usage. (Both of these updates removed 9 FIXME's!)

Also:
Pick up some remaining changes from the old 'cpumask_t' functions to
the new 'struct cpumask *' functions.

Optimize memory traffic by allocating each percpu local_cpu_mask on the
same node as the referring cpu.

Signed-off-by: Mike Travis
Acked-by: Rusty Russell
Signed-off-by: Ingo Molnar

Mike Travis
2009-01-04 02:00:09 +0800
7eb195533 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask into merge-rr-cpumask

Conflicts:
arch/x86/kernel/io_apic.c
kernel/rcuclassic.c
kernel/sched.c
kernel/time/tick-sched.c

Signed-off-by: Mike Travis <travis@sgi.com>
[ mingo@elte.hu: backmerged typo fix for io_apic.c ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Mike Travis
2009-01-04 01:53:31 +0800

25 Dec, 2008

1 commit

4e202284e Merge branch 'sched/urgent'; commit 'v2.6.28' into sched/core

Ingo Molnar
2008-12-25 20:42:23 +0800

17 Dec, 2008

1 commit

80f40ee4a sched: use RCU variant of list traversal in for_each_leaf_rt_rq()

Impact: fix potential of rare crash

for_each_leaf_rt_rq() walks an RCU protected list (rq->leaf_rt_rq_list),
but doesn't use list_for_each_entry_rcu(). Fix this.

Signed-off-by: Bharata B Rao
Cc: Peter Zijlstra
Signed-off-by: Ingo Molnar

Bharata B Rao
2008-12-17 04:39:14 +0800

12 Dec, 2008

1 commit

45ab6b0c7 Merge branch 'sched/core' into cpus4096

Conflicts:
include/linux/ftrace.h
kernel/sched.c

Ingo Molnar
2008-12-12 20:48:57 +0800

29 Nov, 2008

1 commit

70574a996 sched: move double_unlock_balance() higher

Move double_lock_balance()/double_unlock_balance() higher to fix the following
with gcc-3.4.6:

CC kernel/sched.o
In file included from kernel/sched.c:1605:
kernel/sched_rt.c: In function `find_lock_lowest_rq':
kernel/sched_rt.c:914: sorry, unimplemented: inlining failed in call to 'double_unlock_balance': function body not available
kernel/sched_rt.c:1077: sorry, unimplemented: called from here
make[2]: *** [kernel/sched.o] Error 1

Signed-off-by: Alexey Dobriyan
Signed-off-by: Ingo Molnar

Alexey Dobriyan
2008-11-29 03:11:15 +0800

26 Nov, 2008

1 commit

3d8cbdf86 sched: convert local_cpu_mask to cpumask_var_t, fix

Impact: build fix for !CONFIG_SMP

Signed-off-by: Rusty Russell
Acked-by: Mike Travis
Signed-off-by: Ingo Molnar

Rusty Russell
2008-11-26 14:58:28 +0800

25 Nov, 2008

5 commits

96f874e26 sched: convert remaining old-style cpumask operators

Impact: Trivial API conversion

NR_CPUS -> nr_cpu_ids
cpumask_t -> struct cpumask
sizeof(cpumask_t) -> cpumask_size()
cpumask_a = cpumask_b -> cpumask_copy(&cpumask_a, &cpumask_b)

cpu_set() -> cpumask_set_cpu()
first_cpu() -> cpumask_first()
cpumask_of_cpu() -> cpumask_of()
cpus_* -> cpumask_*

There are some FIXMEs where we need all archs to complete infrastructure
(patches have been sent):

cpu_coregroup_map -> cpu_coregroup_mask
node_to_cpumask* -> cpumask_of_node

There is also one FIXME where we pass an array of cpumasks to
partition_sched_domains(): this implies knowing the definition of
'struct cpumask' and the size of a cpumask. This will be fixed in a
future patch.

Signed-off-by: Rusty Russell
Signed-off-by: Ingo Molnar

Rusty Russell
2008-11-25 00:52:42 +0800
0e3900e6d sched: convert local_cpu_mask to cpumask_var_t.

Impact: (future) size reduction for large NR_CPUS.

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.
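As a rough illustration of the idea above, here is a user-space sketch of how one type name can be either an on-stack mask or a pointer to a heap-allocated one. It is not the kernel implementation: every `model_`/`MODEL_` name is hypothetical, the 4096-CPU size is an assumed example, and `calloc()` merely stands in for whatever allocator the kernel uses.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_NR_CPUS       4096  /* assumed stand-in for a large CONFIG_NR_CPUS */
#define MODEL_BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))
#define MODEL_MASK_LONGS    (MODEL_NR_CPUS / MODEL_BITS_PER_LONG)

/* A full-size mask: 4096 bits, i.e. 512 bytes if it lives on the stack. */
struct model_cpumask {
    unsigned long bits[MODEL_MASK_LONGS];
};

#ifdef MODEL_CPUMASK_OFFSTACK
/* Off-stack flavour: the variable is just a pointer to heap storage. */
typedef struct model_cpumask *model_cpumask_var_t;

static bool model_alloc_cpumask_var(model_cpumask_var_t *mask)
{
    *mask = calloc(1, sizeof(struct model_cpumask));
    return *mask != NULL;   /* callers must handle allocation failure */
}

static void model_free_cpumask_var(model_cpumask_var_t mask)
{
    free(mask);
}
#else
/* On-stack flavour: a one-element array, so it still decays to a pointer. */
typedef struct model_cpumask model_cpumask_var_t[1];

static bool model_alloc_cpumask_var(model_cpumask_var_t *mask)
{
    memset(*mask, 0, sizeof(struct model_cpumask)); /* zeroed for the sketch */
    return true;            /* nothing to allocate, so this cannot fail */
}

static void model_free_cpumask_var(model_cpumask_var_t mask)
{
    (void)mask;             /* nothing to free */
}
#endif

static void model_set_cpu(unsigned int cpu, struct model_cpumask *mask)
{
    mask->bits[cpu / MODEL_BITS_PER_LONG] |= 1UL << (cpu % MODEL_BITS_PER_LONG);
}

static bool model_test_cpu(unsigned int cpu, const struct model_cpumask *mask)
{
    return (mask->bits[cpu / MODEL_BITS_PER_LONG] >> (cpu % MODEL_BITS_PER_LONG)) & 1UL;
}
```

Calling code looks the same in both flavours: declare a `model_cpumask_var_t`, pass its address to the alloc helper, check the result, use the variable wherever a mask pointer is expected, then free it. Building with `-DMODEL_CPUMASK_OFFSTACK` switches to the heap-backed variant.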

Signed-off-by: Rusty Russell
Signed-off-by: Ingo Molnar

Rusty Russell
2008-11-25 00:52:35 +0800
24600ce89 sched: convert check_preempt_equal_prio to cpumask_var_t.

Impact: stack reduction for large NR_CPUS

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
stack space.

We simply return if the allocation fails: since we don't use it we
could just pass NULL to cpupri_find and have it handle that.

Signed-off-by: Rusty Russell
Signed-off-by: Ingo Molnar

Rusty Russell
2008-11-25 00:52:28 +0800
c6c4927b2 sched: convert struct root_domain to cpumask_var_t.

Impact: (future) size reduction for large NR_CPUS.

Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.

def_root_domain is static, and so its masks are initialized with
alloc_bootmem_cpumask_var. After that, alloc_cpumask_var is used.

Signed-off-by: Rusty Russell
Signed-off-by: Ingo Molnar

Rusty Russell
2008-11-25 00:51:18 +0800
758b2cdc6 sched: wrap sched_group and sched_domain cpumask accesses.

Impact: trivial wrap of member accesses

This eases the transition in the next patch.

We also get rid of a temporary cpumask in find_idlest_cpu() thanks to
for_each_cpu_and, and sched_balance_self() due to getting weight before
setting sd to NULL.

Signed-off-by: Rusty Russell
Signed-off-by: Ingo Molnar

Rusty Russell
2008-11-25 00:50:45 +0800

07 Nov, 2008

1 commit

cf7f8690e sched, lockdep: inline double_unlock_balance()

We have a test case which measures the variation in the amount of time
needed to perform a fixed amount of work on the preempt_rt kernel. We
started seeing deterioration in its performance recently. The test
should never take more than 10 microseconds, but we started seeing a
5-10% failure rate.

By elimination, we traced the problem to commit
1b12bbc747560ea68bcc132c3d05699e52271da0 (lockdep: re-annotate
scheduler runqueues).

When LOCKDEP is disabled, this patch only adds an additional function
call to double_unlock_balance(). Hence I inlined double_unlock_balance()
and the problem went away. Here is a patch to make this change.

Signed-off-by: Sripathi Kodi
Acked-by: Peter Zijlstra
Signed-off-by: Ingo Molnar

Sripathi Kodi
2008-11-07 05:12:09 +0800

03 Nov, 2008

1 commit

e113a745f sched/rt: small optimization to update_curr_rt()

Impact: micro-optimization to SCHED_FIFO/RR scheduling

A very minor improvement, but might it be better to check sched_rt_runtime(rt_rq)
before taking the rt_runtime_lock?

Peter Zijlstra observes:

> Yes, I think its ok to do so.
>
> Like pointed out in the other thread, there are two races:
>
> - sched_rt_runtime() going to RUNTIME_INF, and that will be handled
> properly by sched_rt_runtime_exceeded()
>
> - sched_rt_runtime() going to !RUNTIME_INF, and here we can miss an
> accounting cycle, but I don't think that is something to worry too
> much about.

Signed-off-by: Dimitri Sivanich
Acked-by: Peter Zijlstra
Signed-off-by: Ingo Molnar

--

kernel/sched_rt.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Dimitri Sivanich
2008-11-03 18:29:00 +0800

24 Oct, 2008

1 commit

8c82a17e9 Merge commit 'v2.6.28-rc1' into sched/urgent

Ingo Molnar
2008-10-24 18:48:46 +0800

22 Oct, 2008

1 commit

4ce72a2c0 sched: add CONFIG_SMP consistency

A patch from Henrik Austad did this:

>> Do not declare select_task_rq as part of sched_class when CONFIG_SMP is
>> not set.

Peter observed:

> While a proper cleanup, could you do it by re-arranging the methods so
> as to not create an additional ifdef?

Do not declare select_task_rq and some other methods as part of sched_class
when CONFIG_SMP is not set.

Also gather those methods to avoid CONFIG_SMP mess.

Idea-by: Henrik Austad
Signed-off-by: Li Zefan
Acked-by: Peter Zijlstra
Acked-by: Henrik Austad
Signed-off-by: Ingo Molnar

Li Zefan
2008-10-22 16:01:52 +0800

20 Oct, 2008

1 commit

c465a76af Merge branches 'timers/clocksource', 'timers/hrtimers', 'timers/nohz', 'timers/ntp', 'timers/posixtimers' and 'timers/debug' into v28-timers-for-linus

Thomas Gleixner
2008-10-20 19:14:06 +0800

04 Oct, 2008

1 commit

f6121f4f8 sched_rt.c: resch needed in rt_rq_enqueue() for the root rt_rq

While working on the new version of the code for SCHED_SPORADIC I
noticed something strange in the present throttling mechanism. More
specifically in the throttling timer handler in sched_rt.c
(do_sched_rt_period_timer()) and in rt_rq_enqueue().

The problem is that, when unthrottling a runqueue, rt_rq_enqueue() only
asks for rescheduling if the runqueue has a sched_entity associated to
it (i.e., rt_rq->rt_se != NULL).
Now, if the runqueue is the root rq (which has a rt_se = NULL)
rescheduling does not take place, and it is delayed to some undefined
instant in the future.
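The condition described above can be modelled in a few lines. This is only a simplified user-space sketch: the structure and function names are hypothetical, the "new" variant is a guess at the shape of the fix rather than the actual patch, and the real code also compares priorities before rescheduling.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_rt_entity { int unused; };

struct model_unthrottle_rq {
    struct model_rt_entity *rt_se;  /* NULL for the root runqueue */
    unsigned int rt_nr_running;     /* runnable RT tasks queued here */
    bool resched_requested;         /* models a reschedule request */
};

/* Behaviour as described: only runqueues with an entity ask for a resched. */
static void model_rt_rq_enqueue_old(struct model_unthrottle_rq *rq)
{
    if (rq->rt_se != NULL && rq->rt_nr_running > 0)
        rq->resched_requested = true;
}

/* Sketch of a fix: any runqueue with runnable tasks asks for a resched. */
static void model_rt_rq_enqueue_new(struct model_unthrottle_rq *rq)
{
    if (rq->rt_nr_running > 0)
        rq->resched_requested = true;
}
```

With a root runqueue (`rt_se == NULL`) holding one runnable task, the old variant leaves `resched_requested` false, while the new one sets it.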

This implies some random bandwidth usage by the RT tasks under throttling.
For instance, setting rt_runtime_us/rt_period_us = 950ms/1000ms an RT
task will get less than 95%. In our tests we got something varying
between 70% and 95%.
Using smaller time values, e.g., 95ms/100ms, things are even worse, and
I can see values going down to 20-25%!

The tests we performed are simply running 'yes' as a SCHED_FIFO task,
and checking the CPU usage with top, but we can investigate thoroughly
if you think it is needed.

Things go much better, for us, with the attached patch... Don't know if
it is the best approach, but it solved the issue for us.

Signed-off-by: Dario Faggioli
Signed-off-by: Michael Trimarchi
Acked-by: Peter Zijlstra
Cc:
Signed-off-by: Ingo Molnar

Dario Faggioli
2008-10-04 20:31:54 +0800

23 Sep, 2008

2 commits

78333cdd0 sched: add some comments to the bandwidth code

Hopefully clarify some of this code a little.

Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar

Peter Zijlstra
2008-09-23 22:23:16 +0800
63e5c3985 Merge branches 'sched/urgent' and 'sched/rt' into sched/devel

Ingo Molnar
2008-09-23 22:23:05 +0800

22 Sep, 2008

1 commit

15afe09bf sched: wakeup preempt when small overlap

Lin Ming reported a 10% OLTP regression against 2.6.27-rc4.

The difference seems to come from different preemption aggressiveness,
which affects the cache footprint of the workload and its effective
cache thrashing.

Aggressively preempt a task if its avg overlap is very small; this should
avoid the task going to sleep and finding it still running when we schedule
back to it, saving a wakeup.

Reported-by: Lin Ming
Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar

Peter Zijlstra
2008-09-22 22:28:32 +0800

14 Sep, 2008

1 commit

f06febc96 timers: fix itimer/many thread hang

Overview

This patch reworks the handling of POSIX CPU timers, including the
ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together
with the help of Roland McGrath, the owner and original writer of this code.

The problem we ran into, and the reason for this rework, has to do with using
a profiling timer in a process with a large number of threads. It appears
that the performance of the old implementation of run_posix_cpu_timers() was
at least O(n*3) (where "n" is the number of threads in a process) or worse.
Everything is fine with an increasing number of threads until the time taken
for that routine to run becomes the same as or greater than the tick time, at
which point things degrade rather quickly.

This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."

Code Changes

This rework corrects the implementation of run_posix_cpu_timers() to make it
run in constant time for a particular machine. (Performance may vary between
one machine and another depending upon whether the kernel is built as single-
or multiprocessor and, in the latter case, depending upon the number of
running processors.) To do this, at each tick we now update fields in
signal_struct as well as task_struct. The run_posix_cpu_timers() function
uses those fields to make its decisions.

We define a new structure, "task_cputime," to contain user, system and
scheduler times and use these in appropriate places:

    struct task_cputime {
        cputime_t utime;
        cputime_t stime;
        unsigned long long sum_exec_runtime;
    };

This is included in the structure "thread_group_cputime," which is a new
substructure of signal_struct and which varies for uniprocessor versus
multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as
a simple substructure, while for multiprocessor kernels it is a pointer:

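    /* uniprocessor: the totals are embedded directly */
    struct thread_group_cputime {
        struct task_cputime totals;
    };

    /* multiprocessor: the totals point to per-cpu data */
    struct thread_group_cputime {
        struct task_cputime *totals;
    };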

We also add a new task_cputime substructure directly to signal_struct, to
cache the earliest expiration of process-wide timers, and task_cputime also
replaces the it_*_expires fields of task_struct (used for earliest expiration
of thread timers). The "thread_group_cputime" structure contains process-wide
timers that are updated via account_user_time() and friends. In the non-SMP
case the structure is a simple aggregator; unfortunately in the SMP case that
simplicity was not achievable due to cache-line contention between CPUs (in
one measured case performance was actually _worse_ on a 16-cpu system than
the same test on a 4-cpu system, due to this contention). For SMP, the
thread_group_cputime counters are maintained as a per-cpu structure allocated
using alloc_percpu(). The timer functions update only the timer field in
the structure corresponding to the running CPU, obtained using per_cpu_ptr().
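To make the per-cpu scheme concrete, here is a minimal user-space model of it. A plain array stands in for storage obtained with alloc_percpu(), the CPU number is passed explicitly instead of being looked up, and all `model_`/`MODEL_` names are hypothetical, so treat it as a sketch of the idea rather than the kernel code.

```c
#define MODEL_TG_NR_CPUS 4  /* hypothetical CPU count for the sketch */

typedef unsigned long model_cputime_t;  /* stand-in for cputime_t */

struct model_task_cputime {
    model_cputime_t utime;
    model_cputime_t stime;
    unsigned long long sum_exec_runtime;
};

/* SMP flavour: one slot per CPU, so each CPU only writes its own slot. */
struct model_thread_group_cputime {
    struct model_task_cputime totals[MODEL_TG_NR_CPUS];
};

static void model_account_group_user_time(struct model_thread_group_cputime *tg,
                                          unsigned int cpu, model_cputime_t delta)
{
    tg->totals[cpu].utime += delta;
}

static void model_account_group_system_time(struct model_thread_group_cputime *tg,
                                            unsigned int cpu, model_cputime_t delta)
{
    tg->totals[cpu].stime += delta;
}

static void model_account_group_exec_runtime(struct model_thread_group_cputime *tg,
                                             unsigned int cpu, unsigned long long ns)
{
    tg->totals[cpu].sum_exec_runtime += ns;
}

/* Readers pay the cost instead: a snapshot sums every CPU's slot. */
static void model_thread_group_cputime(const struct model_thread_group_cputime *tg,
                                       struct model_task_cputime *times)
{
    unsigned int cpu;

    times->utime = 0;
    times->stime = 0;
    times->sum_exec_runtime = 0;
    for (cpu = 0; cpu < MODEL_TG_NR_CPUS; cpu++) {
        times->utime += tg->totals[cpu].utime;
        times->stime += tg->totals[cpu].stime;
        times->sum_exec_runtime += tg->totals[cpu].sum_exec_runtime;
    }
}
```

The trade-off is that updates on the tick path stay local to one CPU, while the comparatively rare reads walk all slots.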

We define a set of inline functions in sched.h that we use to maintain the
thread_group_cputime structure and hide the differences between UP and SMP
implementations from the rest of the kernel. The thread_group_cputime_init()
function initializes the thread_group_cputime structure for the given task.
The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
in the per-cpu structures and fields. The thread_group_cputime_free()
function, also a no-op for UP, in SMP frees the per-cpu structures. The
thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
thread_group_cputime_alloc() if the per-cpu structures haven't yet been
allocated. The thread_group_cputime() function fills the task_cputime
structure it is passed with the contents of the thread_group_cputime fields;
in UP it's that simple but in SMP it must also safely check that tsk->signal
is non-NULL (if it is NULL, it just uses the appropriate fields of task_struct)
and, if so, sums the per-cpu values for each online CPU. Finally, the three
functions account_group_user_time(), account_group_system_time() and
account_group_exec_runtime() are used by timer functions to update the
respective fields of the thread_group_cputime structure.

Non-SMP operation is trivial and will not be mentioned further.

The per-cpu structure is always allocated when a task creates its first new
thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
It is freed at process exit via a call to thread_group_cputime_free() from
cleanup_signal().

All functions that formerly summed utime/stime/sum_sched_runtime values from
all threads in the thread group now use thread_group_cputime() to
snapshot the values in the thread_group_cputime structure or the values in
the task structure itself if the per-cpu structure hasn't been allocated.

Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
The run_posix_cpu_timers() function has been split into a fast path and a
slow path; the former safely checks whether there are any expired thread
timers and, if not, just returns, while the slow path does the heavy lifting.
With the dedicated thread group fields, timers are no longer "rebalanced" and
the process_timer_rebalance() function and related code has gone away. All
summing loops are gone and all code that used them now uses the
thread_group_cputime() inline. When process-wide timers are set, the new
task_cputime structure in signal_struct is used to cache the earliest
expiration; this is checked in the fast path.
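The fast-path/slow-path split can be sketched as follows. This is a simplified user-space model with hypothetical names: it assumes a zero expiry means "no timer armed" and compares each field only against its own cached expiry, which may differ from the exact comparisons the kernel makes.

```c
#include <stdbool.h>

struct model_fp_cputime {
    unsigned long utime;
    unsigned long stime;
    unsigned long long sum_exec_runtime;
};

static int model_slow_path_runs;  /* counts how often the slow path ran */

/* Fast path: a few comparisons against the cached earliest expirations. */
static bool model_fp_expired(const struct model_fp_cputime *sample,
                             const struct model_fp_cputime *expires)
{
    if (expires->utime != 0 && sample->utime >= expires->utime)
        return true;
    if (expires->stime != 0 && sample->stime >= expires->stime)
        return true;
    if (expires->sum_exec_runtime != 0 &&
        sample->sum_exec_runtime >= expires->sum_exec_runtime)
        return true;
    return false;
}

/* Slow path placeholder: the real one would walk and fire expired timers. */
static void model_slow_path(void)
{
    model_slow_path_runs++;
}

static void model_run_cpu_timers(const struct model_fp_cputime *sample,
                                 const struct model_fp_cputime *cached_expires)
{
    if (!model_fp_expired(sample, cached_expires))
        return;             /* common case: nothing expired, return early */
    model_slow_path();
}
```

On most ticks nothing has expired, so the function returns after the comparisons and the cost no longer depends on the number of threads.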

Performance

The fix appears not to add significant overhead to existing operations. It
generally performs the same as the current code except in two cases, one in
which it performs slightly worse (Case 5 below) and one in which it performs
very significantly better (Case 2 below). Overall it's a wash except in those
two cases.

I've since done somewhat more involved testing on a dual-core Opteron system.

Case 1: With no itimer running, for a test with 100,000 threads, the fixed
kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
all of which was spent in the system. There were twice as many
voluntary context switches with the fix as without it.

Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
an unmodified kernel can handle), the fixed kernel ran the test in
eight percent of the time (5.8 seconds as opposed to 70 seconds) and
had better tick accuracy (.012 seconds per tick as opposed to .023
seconds per tick).

Case 3: A 4000-thread test with an initial timer tick of .01 second and an
interval of 10,000 seconds (i.e. a timer that ticks only once) had
very nearly the same performance in both cases: 6.3 seconds elapsed
for the fixed kernel versus 5.5 seconds for the unfixed kernel.

With fewer threads (eight in these tests), the Case 1 test ran in essentially
the same time on both the modified and unmodified kernels (5.2 seconds versus
5.8 seconds). The Case 2 test ran in about the same time as well, 5.9 seconds
versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
tick versus .025 seconds per tick for the unmodified kernel.

Since the fix affected the rlimit code, I also tested soft and hard CPU limits.

Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
running), the modified kernel was very slightly favored in that while
it killed the process in 19.997 seconds of CPU time (5.002 seconds of
wall time), only .003 seconds of that was system time, the rest was
user time. The unmodified kernel killed the process in 20.001 seconds
of CPU (5.014 seconds of wall time) of which .016 seconds was system
time. Really, though, the results were too close to call. The results
were essentially the same with no itimer running.

Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
(where the hard limit would never be reached) and an itimer running,
the modified kernel exhibited worse tick accuracy than the unmodified
kernel: .050 seconds/tick versus .028 seconds/tick. Otherwise,
performance was almost indistinguishable. With no itimer running this
test exhibited virtually identical behavior and times in both cases.

In times past I did some limited performance testing. Those results are below.

On a four-cpu Opteron system without this fix, a sixteen-thread test executed
in 3569.991 seconds, of which user was 3568.435s and system was 1.556s. On
the same system with the fix, user and elapsed time were about the same, but
system time dropped to 0.007 seconds. Performance with eight, four and one
thread were comparable. Interestingly, the timer ticks with the fix seemed
more accurate: The sixteen-thread test with the fix received 149543 ticks
for 0.024 seconds per tick, while the same test without the fix received 58720
for 0.061 seconds per tick. Both cases were configured for an interval of
0.01 seconds. Again, the other tests were comparable. Each thread in this
test computed the primes up to 25,000,000.

I also did a test with a large number of threads, 100,000 threads, which is
impossible without the fix. In this case each thread computed the primes only
up to 10,000 (to make the runtime manageable). System time dominated, at
1546.968 seconds out of a total 2176.906 seconds (giving a user time of
629.938s). It received 147651 ticks for 0.015 seconds per tick, still quite
accurate. There is obviously no comparable test without the fix.

Signed-off-by: Frank Mayhar
Cc: Roland McGrath
Cc: Alexey Dobriyan
Cc: Andrew Morton
Signed-off-by: Ingo Molnar

Frank Mayhar
2008-09-14 22:25:35 +0800

11 Sep, 2008

1 commit

baf25731e sched: fix 2.6.27-rc5 couldn't boot on tulsa machine randomly

On my tulsa x86-64 machine, kernel 2.6.27-rc5 randomly failed to boot.

Basically, function __enable_runtime forgets to reset rt_rq->rt_throttled
to 0. When every cpu is up, a per-cpu migration_thread is created and it runs
very fast, sometimes marking the corresponding rt_rq->rt_throttled as 1 very
quickly. After all cpus are up, the following call chain runs:

sched_init_smp => arch_init_sched_domains => build_sched_domains => ...
=> cpu_attach_domain => rq_attach_root => set_rq_online => ...
=> __enable_runtime

__enable_runtime is called against every rt_rq again, so rt_rq->rt_time is
reset to 0, but rt_rq->rt_throttled might be still 1. Later on function
do_sched_rt_period_timer couldn't reset it, and all RT tasks couldn't be
scheduled to run on that cpu. One such RT task is migration_thread, which is
woken up when a task is migrated to another cpu.
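The stale-flag problem described above can be reduced to a tiny model. The names here are hypothetical and the real __enable_runtime does more than reset two fields; the sketch only shows why clearing rt_time without clearing rt_throttled leaves the runqueue stuck.

```c
#include <stdbool.h>

struct model_boot_rt_rq {
    unsigned long long rt_time;  /* runtime consumed in the current period */
    int rt_throttled;            /* non-zero: RT tasks may not run here */
};

/* Before the fix: the consumed time is cleared, the throttled flag is not. */
static void model_enable_runtime_before(struct model_boot_rt_rq *rt_rq)
{
    rt_rq->rt_time = 0;
}

/* After the fix: both are reset, so the runqueue starts from a clean state. */
static void model_enable_runtime_after(struct model_boot_rt_rq *rt_rq)
{
    rt_rq->rt_time = 0;
    rt_rq->rt_throttled = 0;
}

static bool model_rt_tasks_can_run(const struct model_boot_rt_rq *rt_rq)
{
    return !rt_rq->rt_throttled;
}
```

Starting from a throttled runqueue, the "before" variant still reports that RT tasks cannot run, while the "after" variant makes them runnable again.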

Below patch fixes it against 2.6.27-rc5.

Signed-off-by: Zhang Yanmin
Signed-off-by: Ingo Molnar

Zhang, Yanmin
2008-09-11 15:34:28 +0800

28 Aug, 2008

2 commits

cc2991cf1 sched: rt-bandwidth accounting fix

It fixes an accounting bug where we would continue accumulating runtime
even though the bandwidth control is disabled. This would lead to very long
throttle periods once bandwidth control gets turned on again.
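A minimal sketch of the accounting rule, assuming that "disabled" is represented by an all-ones runtime value. The names are hypothetical and the real update path also handles locking and throttling decisions, which are omitted here.

```c
#define MODEL_RUNTIME_INF (~0ULL)  /* models "bandwidth control disabled" */

struct model_bw_rt_rq {
    unsigned long long rt_time;     /* accumulated RT runtime */
    unsigned long long rt_runtime;  /* allowed runtime per period */
};

/* Only accumulate runtime while bandwidth control is actually enabled. */
static void model_update_curr_rt(struct model_bw_rt_rq *rt_rq,
                                 unsigned long long delta_exec)
{
    if (rt_rq->rt_runtime == MODEL_RUNTIME_INF)
        return;                     /* disabled: nothing to account */
    rt_rq->rt_time += delta_exec;
}
```

Because nothing is accumulated while control is off, turning it back on starts from the runtime actually consumed under control rather than from a large backlog.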

Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar

Peter Zijlstra
2008-08-28 19:42:38 +0800
f3ade8378 sched: fix sched_rt_rq_enqueue() resched idle

When sysctl_sched_rt_runtime is set to something other than -1 and the
CONFIG_RT_GROUP_SCHED kernel parameter is NOT enabled, we get into a state
where we see one or more CPUs idling forever even though there are
real-time tasks in their rt runqueue that are able to run (no longer
throttled).

The sequence is:

- A real-time task is running when the timer sets the rt runqueue
to throttled, and the rt task is resched_task()ed and switched
out, and idle is switched in since there are no non-rt tasks to
run on that cpu.

- Eventually the do_sched_rt_period_timer() runs and un-throttles
the rt runqueue, but we just exit the timer interrupt and go back
to executing the idle task in the idle loop forever.

If we change the sched_rt_rq_enqueue() routine to use some of the code
from the CONFIG_RT_GROUP_SCHED enabled version of this same routine and
resched_task() the currently executing task (idle in our case) if it is
a lower priority task than the higher rt task in the now un-throttled
runqueue, the problem is no longer observed.

Signed-off-by: John Blackwood
Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar

John Blackwood
2008-08-28 17:13:24 +0800

19 Aug, 2008

2 commits

0b148fa04 sched: rt-bandwidth group disable fixes

More extensive disable of bandwidth control. It allows sysctl_sched_rt_runtime
to disable full group bandwidth control.

Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar

Peter Zijlstra
2008-08-19 19:10:10 +0800
6f0d5c390 sched: rt-bandwidth accounting fix

It fixes an accounting bug where we would continue accumulating runtime
even though the bandwidth control is disabled. This would lead to very long
throttle periods once bandwidth control gets turned on again.

Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar

Peter Zijlstra
2008-08-19 19:10:09 +0800

14 Aug, 2008

1 commit

f1679d084 sched: fix rt-bandwidth hotplug race

When we hot-unplug a cpu and rebuild the sched-domain, all cpus will be
detached. Alex observed the case where a runqueue was stealing bandwidth
from an already disabled runqueue to satisfy its own needs.

Stop this by skipping over already disabled runqueues.

Reported-by: Alex Nixon
Signed-off-by: Peter Zijlstra
Tested-by: Alex Nixon
Signed-off-by: Ingo Molnar

Peter Zijlstra
2008-08-14 21:50:58 +0800

11 Aug, 2008

1 commit

1b12bbc74 lockdep: re-annotate scheduler runqueues

Instead of using a per-rq lock class, use the regular nesting operations.

However, take extra care with double_lock_balance() as it can release the
already held rq->lock (and therefore change its nesting class).

So what can happen is:

    spin_lock(rq->lock);                  // this rq subclass 0

    double_lock_balance(rq, other_rq);
        // release rq
        // acquire other_rq->lock subclass 0
        // acquire rq->lock subclass 1

    spin_unlock(other_rq->lock);

leaving you with rq->lock in subclass 1

So a subsequent double_lock_balance() call can try to nest a subclass 1
lock while already holding a subclass 1 lock.

Fix this by introducing double_unlock_balance() which releases the other
rq's lock, but also re-sets the subclass for this rq's lock to 0.
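The subclass bookkeeping can be followed in a small user-space model. Nothing here is real locking or lockdep code: the names are hypothetical, an integer `order` field stands in for the address comparison that decides lock ordering, and the trylock attempt is left out.

```c
#include <stdbool.h>

struct model_ld_rq {
    int order;     /* stands in for the address used to order acquisition */
    bool held;
    int subclass;  /* nesting subclass the lock was last acquired with */
};

static void model_lock(struct model_ld_rq *rq, int subclass)
{
    rq->held = true;
    rq->subclass = subclass;
}

static void model_unlock(struct model_ld_rq *rq)
{
    rq->held = false;
}

/* Caller holds this_rq; it may be dropped and retaken to respect ordering. */
static void model_double_lock_balance(struct model_ld_rq *this_rq,
                                      struct model_ld_rq *other)
{
    if (other->order < this_rq->order) {
        model_unlock(this_rq);
        model_lock(other, 0);
        model_lock(this_rq, 1);  /* this_rq now sits in subclass 1 */
    } else {
        model_lock(other, 1);
    }
}

/* Release the other lock and put this_rq back into subclass 0. */
static void model_double_unlock_balance(struct model_ld_rq *this_rq,
                                        struct model_ld_rq *other)
{
    model_unlock(other);
    this_rq->subclass = 0;
}
```

After the lock call in the reordering case, this_rq is held in subclass 1; a plain unlock of the other lock would leave it there, whereas the paired unlock helper returns it to subclass 0 so a later double lock nests correctly.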

Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar

Peter Zijlstra
2008-08-11 15:30:22 +0800

25 Jul, 2008

1 commit

8ffa5b659 Merge branch 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: clean up compiler warning
sched: fix hrtick & generic-ipi dependency

Linus Torvalds
2008-07-25 03:53:51 +0800

24 Jul, 2008

2 commits

58838cf3c sched: clean up compiler warning

Reported-by: Daniel Walker
Signed-off-by: Peter Zijlstra
Signed-off-by: Ingo Molnar

Peter Zijlstra
2008-07-24 19:24:57 +0800
7f9dce383 Merge branch 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

* 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
sched: hrtick_enabled() should use cpu_active()
sched, x86: clean up hrtick implementation
sched: fix build error, provide partition_sched_domains() unconditionally
sched: fix warning in inc_rt_tasks() to not declare variable 'rq' if it's not needed
cpu hotplug: Make cpu_active_map synchronization dependency clear
cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2)
sched: rework of "prioritize non-migratable tasks over migratable ones"
sched: reduce stack size in isolated_cpu_setup()
Revert parts of "ftrace: do not trace scheduler functions"

Fixed up conflicts in include/asm-x86/thread_info.h (due to the
TIF_SINGLESTEP unification vs TIF_HRTICK_RESCHED removal) and
kernel/sched_fair.c (due to cpu_active_map vs for_each_cpu_mask_nr()
introduction).

Linus Torvalds
2008-07-24 10:36:53 +0800

20 Jul, 2008

1 commit

d986434a7 Merge branch 'sched/urgent' into sched/devel

Ingo Molnar
2008-07-20 17:01:29 +0800

18 Jul, 2008

3 commits

577b4a58d sched: fix warning in inc_rt_tasks() to not declare variable 'rq' if it's not needed

Fix inc_rt_tasks() to not declare variable 'rq' if it's not needed. It is
declared if CONFIG_SMP or CONFIG_RT_GROUP_SCHED, but only used if CONFIG_SMP.

This is a consequence of patch 1f11eb6a8bc92536d9e93ead48fa3ffbd1478571 plus
patch 1100ac91b6af02d8639d518fad5b434b1bf44ed6.

Signed-off-by: David Howells
Signed-off-by: Ingo Molnar

David Howells
2008-07-18 19:56:03 +0800
e761b7725 cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2)

This is based on Linus' idea of creating cpu_active_map that prevents
scheduler load balancer from migrating tasks to the cpu that is going
down.

It allows us to simplify domain management code and avoid unnecessary
domain rebuilds during cpu hotplug event handling.

Please ignore the cpusets part for now. It needs some more work in order
to avoid crazy lock nesting. I did, however, simplify and unify the domain
reinitialization logic. We now simply call partition_sched_domains() in
all the cases. This means that we're using the exact same code paths as in
the cpusets case and hence the tests below cover cpusets too.
Cpuset changes to make rebuild_sched_domains() callable from various
contexts are in the separate patch (right next after this one).

This not only boots but also easily handles
    while true; do make clean; make -j 8; done
and
    while true; do on-off-cpu 1; done
at the same time.
(on-off-cpu 1 simply does the echo 0/1 > /sys/.../cpu1/online thing).

Surprisingly the box (dual-core Core2) is quite usable. In fact I'm typing
this on it right now in gnome-terminal and things are moving just fine.

Also this is running with most of the debug features enabled (lockdep,
mutex, etc), with no BUG_ONs or lockdep complaints so far.

I believe I addressed all of Dmitry's comments on Linus' original
version. I changed both the fair and rt balancers to mask out non-active cpus.
And replaced cpu_is_offline() with !cpu_active() in the main scheduler
code where it made sense (to me).
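Masking out non-active cpus amounts to intersecting the candidate set with the active set before choosing a target. The sketch below models that with single-word bitmasks; the function name is hypothetical and real cpumasks are wider and accessed through the cpumask API.

```c
#include <limits.h>

/*
 * Return the lowest-numbered cpu that is both allowed for the task and
 * currently active, or -1 if there is none. Bit N set means cpu N.
 */
static int model_pick_active_cpu(unsigned long allowed, unsigned long active)
{
    unsigned long candidates = allowed & active;
    unsigned int cpu;

    for (cpu = 0; cpu < CHAR_BIT * sizeof(unsigned long); cpu++) {
        if (candidates & (1UL << cpu))
            return (int)cpu;
    }
    return -1;
}
```

A cpu that is going down is cleared from the active set first, so the balancer stops picking it even while it is still online.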

Signed-off-by: Max Krasnyanskiy
Acked-by: Linus Torvalds
Acked-by: Peter Zijlstra
Acked-by: Gregory Haskins
Cc: dmitry.adamushko@gmail.com
Cc: pj@sgi.com
Signed-off-by: Ingo Molnar

Max Krasnyansky
2008-07-18 19:22:25 +0800
7ebefa8ce sched: rework of "prioritize non-migratable tasks over migratable ones"

(1) handle in a generic way all cases when a newly woken-up task is
not migratable (not just a corner case when "rt_se->nr_cpus_allowed ==
1")

(2) if current is to be preempted, then make sure "p" will be picked
up by pick_next_task_rt().
i.e. move the task's group to the head of its list as well.

Currently, this is not the case for group scheduling, as described
here: http://www.ussg.iu.edu/hypermail/linux/kernel/0807.0/0134.html

Signed-off-by: Dmitry Adamushko
Cc: Steven Rostedt
Cc: Gregory Haskins
Signed-off-by: Ingo Molnar

Dmitry Adamushko
2008-07-18 18:55:14 +0800

16 Jul, 2008

1 commit

82638844d Merge branch 'linus' into cpus4096

Conflicts:

arch/x86/xen/smp.c
kernel/sched_rt.c
net/iucv/iucv.c

Signed-off-by: Ingo Molnar

Ingo Molnar
2008-07-16 06:29:07 +0800

06 Jul, 2008

1 commit

68083e05d Merge commit 'v2.6.26-rc9' into cpus4096

Ingo Molnar
2008-07-06 20:23:39 +0800

27 Jun, 2008

1 commit

55e12e5e7 sched: make sched_{rt,fair}.c ifdefs more readable

Signed-off-by: Dhaval Giani
Cc: Srivatsa Vaddagiri
Cc: Peter Zijlstra
Signed-off-by: Ingo Molnar

Dhaval Giani
2008-06-27 20:32:05 +0800