26 Sep, 2006

1 commit

  • The scheduler will stop load balancing if the busiest processor contains
    processes pinned via processor affinity.

    The scheduler currently does only one search for the busiest cpu. If it
    cannot pull any tasks away from the busiest cpu because they were pinned,
    then the scheduler goes into a corner and sulks, leaving the idle
    processors idle.

    For example, if processor 0 is busy running four tasks pinned via taskset,
    processor 1 has none, and someone has just started two processes on
    processor 2, then the scheduler will not move one of those two processes
    away from processor 2.

    This patch fixes that issue by forcing the scheduler to come out of its
    corner and retry the load balancing, this time considering other
    processors.

    This patch was originally developed by John Hawkes and discussed at

    http://marc.theaimsgroup.com/?l=linux-kernel&m=113901368523205&w=2.

    I have removed extraneous material and gone back to equipping struct rq
    with the cpu the queue is associated with, since this makes the patch much
    easier and it is likely that others in the future will have the same
    difficulty figuring out which processor owns which runqueue.

    The overhead added by these patches is a single word on the stack if the
    kernel is configured to support 32 cpus or less (32 bit). For 32 bit
    environments the maximum number of cpus that can be configured is 255,
    which would result in an additional 32 bytes on the stack. On IA64 up to
    1k cpus can be configured, which will result in an additional 128 bytes
    on the stack. The maximum additional cache footprint is one cacheline.
    Typically memory use will be much less than a cacheline, and the
    additional cpumask will be placed on the stack in a cacheline that already
    contains other local variables.
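
    A minimal userspace sketch of the retry idea (illustrative names, not the
    kernel code): remember which cpus turned out to contain only pinned tasks
    in a small bitmask and repeat the "find busiest" search excluding them,
    instead of giving up after the first attempt.

    #include <stdio.h>

    #define NR_CPUS 4

    struct cpu {
        int nr_tasks;
        int nr_pinned;   /* tasks that may not leave this cpu */
    };

    /* find the busiest cpu that is not in the 'excluded' mask */
    static int find_busiest(const struct cpu *cpus, unsigned int excluded)
    {
        int i, busiest = -1, max = 0;

        for (i = 0; i < NR_CPUS; i++) {
            if (excluded & (1u << i))
                continue;
            if (cpus[i].nr_tasks > max) {
                max = cpus[i].nr_tasks;
                busiest = i;
            }
        }
        return busiest;
    }

    int main(void)
    {
        /* cpu0: four pinned tasks, cpu1: idle, cpu2: two movable tasks */
        struct cpu cpus[NR_CPUS] = { { 4, 4 }, { 0, 0 }, { 2, 0 }, { 0, 0 } };
        unsigned int excluded = 0;   /* the extra word on the stack */
        int busiest;

        while ((busiest = find_busiest(cpus, excluded)) >= 0) {
            if (cpus[busiest].nr_pinned == cpus[busiest].nr_tasks) {
                excluded |= 1u << busiest;   /* nothing movable: note it, retry */
                continue;
            }
            printf("pull one task from cpu%d\n", busiest);
            break;
        }
        return 0;
    }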

    Signed-off-by: Christoph Lameter
    Cc: John Hawkes
    Cc: "Siddha, Suresh B"
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Peter Williams
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

28 Aug, 2006

1 commit

  • sched_setscheduler() looks at ->signal->rlim[]. It is unsafe to
    dereference ->signal unless tasklist_lock or ->siglock is held (or p ==
    current). We pin the task structure, but that cannot prevent
    release_task()->__exit_signal(), which sets ->signal = NULL.

    Restore tasklist_lock across the setscheduler call.
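
    A hedged userspace analogy of the bug (pthreads, invented names): holding
    a reference keeps the task object itself alive, but it does not stop the
    exit path from setting the ->signal pointer to NULL, so the reader has to
    take the same lock the writer takes and re-check the pointer under it.

    /* build: cc -pthread example.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct signal { long rlim_cur; };

    struct task {
        pthread_mutex_t lock;    /* stands in for tasklist_lock / ->siglock */
        struct signal *signal;   /* may be set to NULL by the exit path     */
    };

    static void exit_signal(struct task *t)   /* runs concurrently with readers */
    {
        pthread_mutex_lock(&t->lock);
        free(t->signal);
        t->signal = NULL;
        pthread_mutex_unlock(&t->lock);
    }

    static long read_rlim(struct task *t)     /* a bare reference is not enough */
    {
        long val = -1;

        pthread_mutex_lock(&t->lock);
        if (t->signal)                        /* re-check under the lock */
            val = t->signal->rlim_cur;
        pthread_mutex_unlock(&t->lock);
        return val;
    }

    int main(void)
    {
        struct task t = { PTHREAD_MUTEX_INITIALIZER, malloc(sizeof(struct signal)) };

        t.signal->rlim_cur = 42;
        printf("rlim = %ld\n", read_rlim(&t));
        exit_signal(&t);
        printf("rlim after exit = %ld\n", read_rlim(&t));
        return 0;
    }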

    Signed-off-by: Oleg Nesterov
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

01 Aug, 2006

3 commits

  • Initialize init task's pi_waiters plist. Otherwise cpu hotplug of cpu 0
    might crash, since rt_mutex_getprio() accesses an uninitialized list head.

    call chain which led to crash:

    take_cpu_down
    sched_idle_next
    __setscheduler
    rt_mutex_getprio

    Unfortunately, using PLIST_HEAD_INIT in the INIT_TASK macro doesn't work,
    since the pi_waiters member is only conditionally present.
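
    A small illustrative sketch of why a run-time initialization is used
    (config name and fields invented for the example): when a member exists
    only under a config option, a static initializer in a shared macro breaks
    the non-configured build, so the list head is set up early at run time.

    #include <stdio.h>

    #define CONFIG_RT_MUTEXES 1          /* flip to 0: no pi_waiters member */

    struct list_head { struct list_head *next, *prev; };

    struct task {
        int prio;
    #if CONFIG_RT_MUTEXES
        struct list_head pi_waiters;     /* only conditionally present */
    #endif
    };

    /* cannot live in a one-size-fits-all INIT_TASK-style macro */
    static void init_task_rt(struct task *t)
    {
    #if CONFIG_RT_MUTEXES
        t->pi_waiters.next = &t->pi_waiters;
        t->pi_waiters.prev = &t->pi_waiters;
    #endif
    }

    int main(void)
    {
        struct task init_task = { .prio = 120 };

        init_task_rt(&init_task);        /* done before anyone walks the list */
        printf("init task prio %d\n", init_task.prio);
        return 0;
    }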

    Cc: Arjan van de Ven
    Cc: Thomas Gleixner
    Acked-by: Ingo Molnar
    Signed-off-by: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • cond_resched_lock() calls __resched_legal() before dropping the spin lock,
    so __resched_legal() will always find the preempt_count non-zero and will
    prevent the call to __cond_resched().

    The attached patch adds a parameter to __resched_legal() with the expected
    preempt_count value.
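
    A toy sketch of the idea (invented names): the legality check is told the
    nesting depth the caller expects at that point, so being called with the
    spin lock still held no longer disables rescheduling.

    #include <stdio.h>

    static int preempt_depth;          /* locks / preempt-disables currently held */
    static int need_resched_flag = 1;

    /* pass in the preempt count the caller expects at this point */
    static int resched_legal(int expected_preempt_count)
    {
        return preempt_depth == expected_preempt_count && need_resched_flag;
    }

    int main(void)
    {
        preempt_depth = 1;      /* as in cond_resched_lock(): one lock still held */
        printf("expect 0 -> %d\n", resched_legal(0));   /* old behaviour: never 1 */
        printf("expect 1 -> %d\n", resched_legal(1));   /* fixed: reschedule OK   */
        return 0;
    }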

    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jim Houston
     
  • Use the correct groups while initializing sched groups power for
    allnodes_domain. This fixes the crash observed while creating exclusive
    cpusets.

    Signed-off-by: Suresh Siddha
    Reported-and-tested-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddha, Suresh B
     

15 Jul, 2006

3 commits

  • Make the task-related schedstats functions callable by delay accounting even
    if schedstats collection isn't turned on. This removes the dependency of
    delay accounting on schedstats.

    Signed-off-by: Chandra Seetharaman
    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • Unlike earlier iterations of the delay accounting patches, delays are now
    collected only for actual I/O waits, rather than trying to cover the
    delays seen in I/O submission paths.

    Account separately for block I/O delays incurred as a result of swapin
    page faults, whose frequency can be affected by the task/process's rss
    limit. Hence swapin delays can act as feedback for rss limit changes
    independently of I/O priority changes.

    Signed-off-by: Shailabh Nagar
    Signed-off-by: Balbir Singh
    Cc: Jes Sorensen
    Cc: Peter Chubb
    Cc: Erich Focht
    Cc: Levent Serinol
    Cc: Jay Lan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shailabh Nagar
     
  • On platforms that have __ARCH_WANT_UNLOCKED_CTXSW set and want to implement
    lock validator support there's a bug in rq->lock handling: in this case we
    don't 'carry over' the runqueue lock into another task - but we still did a
    spinlock_release() of it. Fix this by making the spinlock_release() in
    context_switch() dependent on !__ARCH_WANT_UNLOCKED_CTXSW.

    (Reported by Ralf Baechle on MIPS, which has __ARCH_WANT_UNLOCKED_CTXSW.
    This fixes a lockdep-internal BUG message on such platforms.)

    Signed-off-by: Ingo Molnar
    Cc: Ralf Baechle
    Cc: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

11 Jul, 2006

2 commits

  • - constify and optimize stat_nam (thanks to Michael Tokarev!)
    - spelling and comment fixes

    Signed-off-by: Andreas Mohr
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Mohr
     
  • Problem:

    In the function __migrate_task(), deactivate_task() followed by
    activate_task() is used to move the task from one run queue to
    another. This has two undesirable effects:

    1. The task's priority is recalculated. (Nowhere else in the
    scheduler code is the priority recalculated for a change of CPU.)

    2. The task's time stamp is set to the current time. At the very least,
    this makes the adjustment of the time stamp before the call to
    deactivate_task() redundant, but I believe the problem is more serious,
    as the time stamp now holds the time of the queue change instead of
    the time at which the task was woken. In addition, unless dest_rq is
    the same queue that "current" is on, the time stamp could be inaccurate
    due to inter-CPU drift.

    Solution:

    Replace the call to activate_task() with one to __activate_task().
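
    An illustrative userspace sketch of the distinction (names invented): a
    full activation recalculates the priority and restamps the task with the
    current time, while the bare variant just links the task into the
    destination queue, which is what a cross-CPU migration wants.

    #include <stdio.h>
    #include <time.h>

    struct task  { int prio; long timestamp; };
    struct queue { struct task *tasks[16]; int nr; };

    /* bare enqueue: preserves prio and timestamp */
    void __activate_task(struct task *t, struct queue *q)
    {
        q->tasks[q->nr++] = t;
    }

    /* full activation: also recalculates prio and restamps the task */
    void activate_task(struct task *t, struct queue *q)
    {
        t->prio = 120;               /* stand-in for the priority recalculation */
        t->timestamp = time(NULL);   /* wake time would be overwritten here     */
        __activate_task(t, q);
    }

    void migrate_task(struct task *t, struct queue *src, struct queue *dst)
    {
        src->nr--;                   /* dequeue from the source (simplified)    */
        __activate_task(t, dst);     /* ...but do NOT re-run activate_task()    */
    }

    int main(void)
    {
        struct queue q0 = { { 0 }, 0 }, q1 = { { 0 }, 0 };
        struct task t = { .prio = 110, .timestamp = 1000 };   /* woken at 1000 */

        __activate_task(&t, &q0);
        migrate_task(&t, &q0, &q1);
        printf("prio %d, timestamp %ld -- both preserved across the move\n",
               t.prio, t.timestamp);
        return 0;
    }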

    Signed-off-by: Peter Williams
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Williams
     

04 Jul, 2006

7 commits

  • convert:

    - runqueue_t to 'struct rq'
    - prio_array_t to 'struct prio_array'
    - migration_req_t to 'struct migration_req'

    I was the one who added these, but they are against the kernel coding
    style and were also used inconsistently in places. So just get rid of them
    at once, now that we are flushing the scheduler patch-queue anyway.

    Conversion was mostly scripted, the result was reviewed and all secondary
    whitespace and style impact (if any) was fixed up by hand.
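
    For illustration, the general shape of that kind of conversion (a
    standalone toy, not the actual kernel diff):

    #include <stdio.h>

    /* before: the type was hidden behind a typedef */
    typedef struct runqueue { unsigned long nr_running; } runqueue_t;
    void old_style(runqueue_t *rq)  { rq->nr_running++; }

    /* after: the struct type is spelled out, no typedef */
    struct rq { unsigned long nr_running; };
    void new_style(struct rq *rq)   { rq->nr_running++; }

    int main(void)
    {
        runqueue_t before = { 0 };
        struct rq  after  = { 0 };

        old_style(&before);
        new_style(&after);
        printf("%lu %lu\n", before.nr_running, after.nr_running);
        return 0;
    }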

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • cleanup: remove task_t and convert all the uses to struct task_struct. I
    introduced it for the scheduler ages ago and it was a mistake.

    Conversion was mostly scripted, the result was reviewed and all
    secondary whitespace and style impact (if any) was fixed up by hand.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Clean up some of the impact of recent (and not so recent) scheduler
    changes:

    - turning macros into nice inline functions
    - sanitizing and unifying variable definitions
    - whitespace, style consistency, 80-lines, comment correctness, spelling
    and curly braces police

    Due to the macro hell and variable placement simplifications there's even 26
    bytes of .text saved:

     text    data   bss     dec    hex  filename
    25510    4153   192   29855   749f  sched.o.before
    25484    4153   192   29829   7485  sched.o.after

    [akpm@osdl.org: build fix]
    Signed-off-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Teach per-CPU runqueue locks and recursive locking code to the lock validator.
    Has no effect on non-lockdep kernels.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Use the lock validator framework to prove spinlock and rwlock locking
    correctness.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Accurate hard-IRQ-flags and softirq-flags state tracing.

    This allows us to attach extra functionality to IRQ flags on/off
    events (such as trace-on/off).

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Generic lock debugging:

    - generalized lock debugging framework. For example, a bug detected in one
    lock subsystem turns off debugging in all lock subsystems.

    - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from
    the mutex/rtmutex debugging code: it caused way too much prototype
    hackery, and lockdep will give the same information anyway.

    - ability to do silent tests

    - check lock freeing in vfree too.

    - more finegrained debugging options, to allow distributions to
    turn off more expensive debugging features.

    There's no separate 'held mutexes' list anymore - but there's a 'held locks'
    stack within lockdep, which unifies deadlock detection across all lock
    classes. (this is independent of the lockdep validation stuff - lockdep first
    checks whether we are holding a lock already)

    Here are the current debugging options:

    CONFIG_DEBUG_MUTEXES=y
    CONFIG_DEBUG_LOCK_ALLOC=y

    which do:

    config DEBUG_MUTEXES
    bool "Mutex debugging, basic checks"

    config DEBUG_LOCK_ALLOC
    bool "Detect incorrect freeing of live mutexes"

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

01 Jul, 2006

1 commit

  • Fix a bug identified by Zou Nan hai :

    If the system is in state SYSTEM_BOOTING, and need_resched() is true,
    cond_resched() returns true even though it didn't reschedule. Consequently
    need_resched() remains true and JBD locks up.

    Fix that by teaching cond_resched() to only return true if it really did call
    schedule().

    cond_resched_lock() and cond_resched_softirq() have a problem too. If we're
    in SYSTEM_BOOTING state and need_resched() is true, these functions will drop
    the lock and will then try to call schedule(), but the SYSTEM_BOOTING state
    will prevent schedule() from being called. So on return, need_resched() will
    still be true, but cond_resched_lock() has to return 1 to tell the caller that
    the lock was dropped. The caller will probably lock up.

    Bottom line: if these functions dropped the lock, they _must_ call schedule()
    to clear need_resched(). Make it so.

    Also, uninline __cond_resched(). It's largeish, and slowpath.
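
    A toy model of the fix (illustrative, not the kernel code): only report
    success when schedule() was actually called, so a caller that loops on the
    return value cannot spin forever while need_resched stays set.

    #include <stdio.h>

    static int system_booting = 1;      /* models SYSTEM_BOOTING */
    static int need_resched_flag = 1;   /* models need_resched() */

    static int do_schedule(void)
    {
        if (system_booting)
            return 0;                   /* scheduling suppressed during boot */
        need_resched_flag = 0;          /* a real reschedule clears the flag */
        return 1;
    }

    /* buggy: claims "I rescheduled" whenever the flag was set */
    static int cond_resched_old(void)
    {
        return need_resched_flag ? (do_schedule(), 1) : 0;
    }

    /* fixed: only return 1 if schedule() really ran */
    static int cond_resched_new(void)
    {
        return need_resched_flag ? do_schedule() : 0;
    }

    int main(void)
    {
        printf("old: %d (flag still %d)\n", cond_resched_old(), need_resched_flag);
        printf("new: %d (flag still %d, callers stop looping)\n",
               cond_resched_new(), need_resched_flag);
        return 0;
    }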

    Acked-by: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

28 Jun, 2006

20 commits

  • When the priority of a task which is blocked on a lock changes, we must
    propagate this change into the PI lock chain. Therefore the chain walk code
    is changed to get rid of the references to current, to avoid false
    positives in the deadlock detector, as setscheduler might be called by a
    task which holds the lock on which the task whose priority is changed is
    blocked.

    Also add some comments about the get/put_task_struct usage to avoid
    confusion.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • There is no need to hold tasklist_lock across the setscheduler call, when
    we pin the task structure with get_task_struct(). Interrupts are disabled
    in setscheduler anyway and the permission checks do not need interrupts
    disabled.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • Add framework to boost/unboost the priority of RT tasks.

    This consists of:

    - caching the 'normal' priority in ->normal_prio
    - providing functions to set/get the priority of the task
    - make sched_setscheduler() aware of boosting

    The effective_prio() cleanups also fix a priority-calculation bug pointed out
    by Andrey Gelman, in set_user_nice().

    has_rt_policy() fix: Peter Williams
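
    An illustrative sketch of the boosting idea (lower number means higher
    priority; all names invented here): the policy-derived priority is cached
    in normal_prio, and the effective priority is the more urgent of that and
    the priority of the top waiter blocked on a lock the task holds.

    #include <stdio.h>

    struct task {
        int normal_prio;   /* what the scheduling policy alone would give */
        int prio;          /* effective priority, possibly boosted        */
    };

    static void recompute_prio(struct task *t, int top_waiter_prio)
    {
        t->prio = top_waiter_prio < t->normal_prio ? top_waiter_prio
                                                   : t->normal_prio;
    }

    int main(void)
    {
        struct task holder = { .normal_prio = 120, .prio = 120 };

        recompute_prio(&holder, 50);    /* an RT waiter (prio 50) blocks on us */
        printf("boosted to %d\n", holder.prio);

        recompute_prio(&holder, 139);   /* no urgent waiters left: unboost     */
        printf("back to %d\n", holder.prio);
        return 0;
    }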

    Signed-off-by: Ingo Molnar
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Arjan van de Ven
    Cc: Andrey Gelman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • Thomas Gleixner is adding a call to an rtmutex function in setscheduler.
    This call grabs a spin_lock that is not always taken with interrupts
    disabled. So this means that setscheduler can't be called from interrupt
    context.

    To prevent this from happening in the future, this patch adds a
    BUG_ON(in_interrupt()) in that function. (Thanks to akpm for this suggestion).

    Signed-off-by: Steven Rostedt
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Saves 543 bytes from sched.o (gcc 3.3.3).

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Con Kolivas
    Cc: Peter Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in
    /sys/devices/system/cpu/ control the MC/SMT power savings policy for the
    scheduler.

    Based on the values (1 = enable, 0 = disable) of these controls, sched
    group cpu power will be determined for different domains. When the power
    savings policy is enabled, under light load conditions the scheduler will
    minimize the number of physical packages/cpu cores carrying the load, thus
    conserving power (with a performance impact that depends on the workload
    characteristics; see the OLS 2005 CMP kernel scheduler paper for more
    details).

    Signed-off-by: Suresh Siddha
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Con Kolivas
    Cc: "Chen, Kenneth W"
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Siddha, Suresh B
     
  • As explained here:
    http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2

    there is a problem with sharing sched_group structures between two
    separate sched_domains.

    The patch has been tested and found to avoid the kernel lockup problem
    described at the above URL.

    Signed-off-by: Srivatsa Vaddagiri
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: "Siddha, Suresh B"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa Vaddagiri
     
  • The sched group structures used to represent the various nodes need to be
    allocated from the respective nodes (as also suggested here:

    http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0603.3/0051.html)

    Signed-off-by: Srivatsa Vaddagiri
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: "Siddha, Suresh B"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa Vaddagiri
     
  • Replace GFP_ATOMIC allocation for sched_group_nodes with GFP_KERNEL based
    allocation.

    Signed-off-by: Srivatsa Vaddagiri
    Cc: Ingo Molnar
    Cc: "Siddha, Suresh B"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa Vaddagiri
     
  • Try to handle memory allocation failures in build_sched_domains by bailing
    out and cleaning up the memory allocated thus far. A direct consequence of
    this is that we disable load balancing completely (even at the sibling
    level) upon *any* memory allocation failure.
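
    The usual C pattern for this kind of bail-out, sketched in standalone form
    (names invented, not the actual build_sched_domains() code): each
    allocation failure jumps to a label that frees whatever was already set up.

    #include <stdio.h>
    #include <stdlib.h>

    struct domains { int *groups; int *power; };

    static int build_domains(struct domains *d, size_t n)
    {
        d->groups = malloc(n * sizeof(*d->groups));
        if (!d->groups)
            goto err;

        d->power = malloc(n * sizeof(*d->power));
        if (!d->power)
            goto err_free_groups;

        return 0;                 /* success: everything is allocated */

    err_free_groups:
        free(d->groups);
        d->groups = NULL;
    err:
        return -1;                /* caller falls back to "no load balancing" */
    }

    int main(void)
    {
        struct domains d = { NULL, NULL };

        if (build_domains(&d, 8))
            printf("allocation failed, partial setup torn down\n");
        else
            printf("domains built\n");
        free(d.groups);
        free(d.power);
        return 0;
    }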

    [Lee.Schermerhorn@hp.com: bugfix]
    Signed-off-by: Srivatsa Vaddagiri
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: "Siddha, Suresh B"
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srivatsa Vaddagiri
     
  • Problem:

    To help distribute high priority tasks evenly across the available CPUs,
    move_tasks() does not, under some circumstances, skip tasks whose load
    weight is bigger than the designated amount. Because the highest priority
    task on the busiest queue may be on the expired array, it may be moved as
    a result of this mechanism. Apart from not being the most desirable way to
    redistribute the high priority tasks (we'd rather move the second highest
    priority task), there is a risk that this could set up a loop with this
    task bouncing backwards and forwards between the two queues. (This latter
    possibility can be demonstrated by running a nice==-20 CPU bound task on
    an otherwise quiet 2 CPU system.)

    Solution:

    Modify the mechanism so that it does not override the skip for the highest
    priority task on the CPU. Of course, if there is more than one task at the
    highest priority then it will allow the override for one of them, as this
    is a desirable redistribution of high priority tasks.
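
    A hedged sketch of the selection rule (toy code, invented names): the
    override that ignores the load-weight limit is simply not applied to a
    task that is the only one at the top priority on its queue.

    #include <stdio.h>

    struct task { int prio; };           /* lower prio value = more important */

    /* may this candidate be moved even though its weight exceeds the limit? */
    static int may_override_skip(const struct task *t, const struct task *queue,
                                 int nr, int best_prio)
    {
        int i, at_best = 0;

        if (t->prio != best_prio)
            return 1;                    /* not the top task: override is fine */
        for (i = 0; i < nr; i++)
            if (queue[i].prio == best_prio)
                at_best++;
        return at_best > 1;              /* never move the sole top-prio task  */
    }

    int main(void)
    {
        struct task q[] = { { 100 }, { 120 }, { 120 } };

        printf("sole top task:   %d\n", may_override_skip(&q[0], q, 3, 100));
        printf("one of the rest: %d\n", may_override_skip(&q[1], q, 3, 100));
        return 0;
    }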

    Signed-off-by: Peter Williams
    Cc: Ingo Molnar
    Cc: "Siddha, Suresh B"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Williams
     
  • Problem:

    The move_tasks() function is designed to move UP TO the amount of load it
    is asked to move, and in doing this it skips over tasks looking for ones
    whose load weights are less than or equal to the remaining load to be
    moved. This is (in general) a good thing but it has the unfortunate result
    of breaking one of the original load balancer's good points: namely, that
    (within the limits imposed by the active/expired array model and the fact
    that the expired array is processed first) it moves high priority tasks
    before low priority ones. This means there's a good chance (see the
    active/expired problem for why it's only a chance) that the highest
    priority task on the queue, but not actually on the CPU, will be moved to
    the other CPU where (as a high priority task) it may preempt the current
    task.

    Solution:

    Modify move_tasks() so that high priority tasks are not skipped when moving
    them will make them the highest priority task on their new run queue.

    Signed-off-by: Peter Williams
    Cc: Ingo Molnar
    Cc: "Siddha, Suresh B"
    Cc: "Chen, Kenneth W"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Williams
     
  • Problem:

    The introduction of separate run queues per CPU has brought with it "nice"
    enforcement problems that are best described by a simple example.

    For the sake of argument, suppose that on a single CPU machine with a
    nice==19 hard spinner and a nice==0 hard spinner running, the nice==0
    task gets 95% of the CPU and the nice==19 task gets 5% of the CPU. Now
    suppose that there is a system with 2 CPUs, 2 nice==19 hard spinners and
    2 nice==0 hard spinners running. The user of this system would be entitled
    to expect that the nice==0 tasks each get 95% of a CPU and the nice==19
    tasks only get 5% each. However, whether this expectation is met is pretty
    much down to luck, as there are four equally likely distributions of the
    tasks to the CPUs that the load balancing code will consider to be
    balanced with loads of 2.0 for each CPU. Two of these distributions
    involve one nice==0 and one nice==19 task per CPU, and in these
    circumstances the user's expectations will be met. The other two
    distributions both involve both nice==0 tasks being on one CPU and both
    nice==19 tasks on the other CPU; in these cases each task will get 50% of
    a CPU and the user's expectations will not be met.

    Solution:

    The solution to this problem that is implemented in the attached patch is
    to use weighted loads when determining if the system is balanced and, when
    an imbalance is detected, to move an amount of weighted load between run
    queues (as opposed to a number of tasks) to restore the balance. Once
    again, the easiest way to explain why both of these measures are necessary
    is to use a simple example. Suppose (in a slight variation of the above
    example) that we have a two CPU system with 4 nice==0 and 4 nice==19 hard
    spinning tasks running, and that the 4 nice==0 tasks are on one CPU and
    the 4 nice==19 tasks are on the other CPU. The weighted loads for the two
    CPUs would be 4.0 and 0.2 respectively, and the load balancing code would
    move 2 tasks, resulting in one CPU with a load of 2.0 and the other with a
    load of 2.2. If this was considered to be a big enough imbalance to
    justify moving a task, and that task was moved using the current
    move_tasks(), then it would move the highest priority task that it found,
    and this would result in one CPU with a load of 3.0 and the other with a
    load of 1.2, which would result in the movement of a task in the opposite
    direction, and so on -- infinite loop. If, on the other hand, an amount of
    load to be moved is calculated from the imbalance (in this case 0.1) and
    move_tasks() skips tasks until it finds ones whose contributions to the
    weighted load are less than this amount, it would move two of the nice==19
    tasks, resulting in a system with 2 nice==0 and 2 nice==19 tasks on each
    CPU, with loads of 2.1 for each CPU.

    One of the advantages of this mechanism is that on a system where all tasks
    have nice==0 the load balancing calculations would be mathematically
    identical to the current load balancing code.
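
    A toy rendering of the arithmetic (illustrative weights, not the kernel's
    exact table): each task contributes a nice-dependent weight, a queue's
    load is the sum of the weights of its runnable tasks, and the balancer
    compares and moves weighted load rather than task counts.

    #include <stdio.h>

    /* illustrative mapping: nice 0 -> 1.0, nice 19 -> 0.05, as in the example */
    static double load_weight(int nice)
    {
        return nice == 0 ? 1.0 : 0.05;
    }

    static double queue_load(const int *nices, int nr)
    {
        double sum = 0;
        for (int i = 0; i < nr; i++)
            sum += load_weight(nices[i]);
        return sum;
    }

    int main(void)
    {
        int cpu0[] = { 0, 0, 0, 0 };       /* four nice==0 hard spinners  */
        int cpu1[] = { 19, 19, 19, 19 };   /* four nice==19 hard spinners */

        /* task counts are 4 vs 4 ("balanced"); the weighted loads are not */
        printf("weighted loads: %.1f vs %.2f\n",
               queue_load(cpu0, 4), queue_load(cpu1, 4));
        return 0;
    }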

    Notes:

    struct task_struct:

    has a new field, load_weight, which (in a trade-off of space for speed)
    stores the contribution that this task makes to a CPU's weighted load when
    it is runnable.

    struct runqueue:

    has a new field raw_weighted_load which is the sum of the load_weight
    values for the currently runnable tasks on this run queue. This field
    always needs to be updated when nr_running is updated so two new inline
    functions inc_nr_running() and dec_nr_running() have been created to make
    sure that this happens. This also offers a convenient way to optimize away
    this part of the smpnice mechanism when CONFIG_SMP is not defined.

    int try_to_wake_up():

    in this function the value SCHED_LOAD_SCALE is used to represent the load
    contribution of a single task in various calculations in the code that
    decides which CPU to put the waking task on. While this would be valid
    on a system where the nice values for the runnable tasks were distributed
    evenly around zero, it will lead to anomalous load balancing if the
    distribution is skewed in either direction. To overcome this problem
    SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
    or by the average load_weight per task for the queue in question (as
    appropriate).

    int move_tasks():

    The modifications to this function were complicated by the fact that
    active_load_balance() uses it to move exactly one task without checking
    whether an imbalance actually exists. This precluded the simple
    overloading of max_nr_move with max_load_move and necessitated the addition
    of the latter as an extra argument to the function. The internal
    implementation is then modified to move up to max_nr_move tasks and
    max_load_move of weighted load. This slightly complicates the code where
    move_tasks() is called and if ever active_load_balance() is changed to not
    use move_tasks() the implementation of move_tasks() should be simplified
    accordingly.

    struct sched_group *find_busiest_group():

    Similar to try_to_wake_up(), there are places in this function where
    SCHED_LOAD_SCALE is used to represent the load contribution of a single
    task and the same issues are created. A similar solution is adopted except
    that it is now the average per task contribution to a group's load (as
    opposed to a run queue) that is required. As this value is not directly
    available from the group it is calculated on the fly as the queues in the
    groups are visited when determining the busiest group.

    A key change to this function is that it no longer scales down *imbalance
    on exit, as move_tasks() uses the load in its scaled form.

    void set_user_nice():

    has been modified to update the task's load_weight field when its nice
    value changes, and also to ensure that its run queue's raw_weighted_load
    field is updated if the task was runnable.

    From: "Siddha, Suresh B"

    With smpnice, sched groups containing the highest priority tasks can mask
    the imbalance between the other sched groups within the same domain. This
    patch fixes some of the scenarios listed below by not considering sched
    groups which are lightly loaded.

    a) on a simple 4-way MP system, if we have one high priority and 4 normal
    priority tasks, with smpnice we would like to see the high priority task
    scheduled on one cpu, two other cpus getting one normal task each and the
    fourth cpu getting the remaining two normal tasks. But with the current
    smpnice the extra normal priority task keeps jumping from one cpu to
    another cpu that has a normal priority task. This is because of the
    busiest_has_loaded_cpus, nr_loaded_cpus logic: we are not including the
    cpu with the high priority task in the max_load calculations, but we do
    include it in the total and avg_load calculations, leading to max_load <
    avg_load, and the load balance between the cpus running normal priority
    tasks (2 vs 1) will always show an imbalance, so the extra normal priority
    task will keep moving from one cpu to another cpu that has a normal
    priority task.

    b) 4-way system with HT (8 logical processors). Package-P0: T0 has a high
    priority task, T1 is idle. Package-P1: both T0 and T1 have 1 normal
    priority task each. P2 and P3 are idle. With this patch, one of the
    normal priority tasks on P1 will be moved to P2 or P3.

    c) With the current weighted smp nice calculations, it doesn't always make
    sense to look at the highest weighted runqueue in the busy group.
    Consider a load balance scenario on a DP with HT system, with Package-0
    containing one high priority and one low priority task, and Package-1
    containing one low priority task (with the other thread being idle).
    Package-1 thinks that it needs to take the low priority thread from
    Package-0, and find_busiest_queue() returns the cpu thread with the
    highest priority task, so ultimately (with the help of active load
    balance) we move the high priority task to Package-1. The same then
    continues with Package-0, moving the high priority task from Package-1
    back to Package-0. Even without the presence of active load balance, load
    balance will fail to balance the above scenario. Fix find_busiest_queue
    to use "imbalance" when it is lightly loaded.

    [kernel@kolivas.org: sched: store weighted load on up]
    [kernel@kolivas.org: sched: add discrete weighted cpu load function]
    [suresh.b.siddha@intel.com: sched: remove dead code]
    Signed-off-by: Peter Williams
    Cc: "Siddha, Suresh B"
    Cc: "Chen, Kenneth W"
    Acked-by: Ingo Molnar
    Cc: Nick Piggin
    Signed-off-by: Con Kolivas
    Cc: John Hawkes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Williams
     
  • There is a race between set_cpus_allowed() and move_task_off_dead_cpu().
    __migrate_task() doesn't report any error code, so the task can be left on
    its runqueue if its cpus_allowed mask changed so that dest_cpu is no
    longer a possible target. Also, changing the cpus_allowed mask requires
    rq->lock to be held.

    Signed-off-by: Kirill Korotaev
    Acked-By: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Korotaev
     
  • Unless we expect to have more than 2G CPUs, there's no reason to have 'i'
    as a long long here.

    Signed-off-by: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect
    and not explicit enough. The sleep boost is not supposed to be any larger
    than without this code, and the comment is not clear enough about what
    exactly it does, just the reason it does it. Fix it.

    There is a priority ceiling that tasks which only ever sleep for very long
    periods cannot surpass. Fix it.

    Prevent the on-runqueue bonus logic from defeating the idle sleep logic.

    Opportunity to micro-optimise.

    Signed-off-by: Con Kolivas
    Signed-off-by: Mike Galbraith
    Acked-by: Ingo Molnar
    Signed-off-by: Ken Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Con Kolivas
     
  • Signed-off-by: Steven Rostedt
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Steven Rostedt
     
  • Initial report and lock contention fix from Chris Mason:

    Recent benchmarks showed some performance regressions between 2.6.16 and
    2.6.5. We tracked down one of the regressions to lock contention in
    schedule-heavy workloads (~70,000 context switches per second).

    kernel/sched.c:dependent_sleeper() was responsible for most of the lock
    contention, hammering on the run queue locks. The patch below is more of a
    discussion point than a suggested fix (although it does reduce lock
    contention significantly). The dependent_sleeper code looks very expensive
    to me, especially for using a spinlock to bounce control between two
    different siblings on the same cpu.

    It is further optimized:

    * perform dependent_sleeper check after next task is determined
    * convert wake_sleeping_dependent to use trylock
    * skip smt runqueue check if trylock fails
    * optimize double_rq_lock now that smt nice is converted to trylock
    * early exit in searching first SD_SHARE_CPUPOWER domain
    * speedup fast path of dependent_sleeper
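
    A userspace sketch of the trylock conversion (pthreads, invented names):
    instead of unconditionally spinning on the sibling's lock, the check is
    skipped when the lock cannot be taken immediately, since the SMT-nice
    check is only an optimisation.

    /* build: cc -pthread example.c */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t sibling_lock = PTHREAD_MUTEX_INITIALIZER;

    /* returns 1 if the dependent-sleeper check was performed, 0 if skipped */
    static int check_smt_sibling(void)
    {
        if (pthread_mutex_trylock(&sibling_lock) != 0)
            return 0;                /* contended: skip rather than wait */

        /* ... inspect the sibling runqueue here ... */
        pthread_mutex_unlock(&sibling_lock);
        return 1;
    }

    int main(void)
    {
        printf("checked: %d\n", check_smt_sibling());   /* 1: lock was free    */

        pthread_mutex_lock(&sibling_lock);              /* simulate contention */
        printf("checked: %d\n", check_smt_sibling());   /* 0: check skipped    */
        pthread_mutex_unlock(&sibling_lock);
        return 0;
    }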

    [akpm@osdl.org: cleanup]
    Signed-off-by: Ken Chen
    Acked-by: Ingo Molnar
    Acked-by: Con Kolivas
    Signed-off-by: Nick Piggin
    Acked-by: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen, Kenneth W
     
  • Mark the notifier_call functions associated with cpu_notifier as __cpuinit.

    __cpuinit makes sure that the function is init time only unless
    CONFIG_HOTPLUG_CPU is defined.

    [akpm@osdl.org: section fix]
    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     
  • This patch reverts the notifier_block changes made in 2.6.17.

    Signed-off-by: Chandra Seetharaman
    Cc: Ashok Raj
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chandra Seetharaman
     

27 Jun, 2006

2 commits

  • * x86-64: (83 commits)
    [PATCH] x86_64: x86_64 stack usage debugging
    [PATCH] x86_64: (resend) x86_64 stack overflow debugging
    [PATCH] x86_64: msi_apic.c build fix
    [PATCH] x86_64: i386/x86-64 Add nmi watchdog support for new Intel CPUs
    [PATCH] x86_64: Avoid broadcasting NMI IPIs
    [PATCH] x86_64: fix apic error on bootup
    [PATCH] x86_64: enlarge window for stack growth
    [PATCH] x86_64: Minor string functions optimizations
    [PATCH] x86_64: Move export symbols to their C functions
    [PATCH] x86_64: Standardize i386/x86_64 handling of NMI_VECTOR
    [PATCH] x86_64: Fix modular pc speaker
    [PATCH] x86_64: remove sys32_ni_syscall()
    [PATCH] x86_64: Do not use -ffunction-sections for modules
    [PATCH] x86_64: Add cpu_relax to apic_wait_icr_idle
    [PATCH] x86_64: adjust kstack_depth_to_print default
    [PATCH] i386/x86-64: adjust /proc/interrupts column headings
    [PATCH] x86_64: Fix race in cpu_local_* on preemptible kernels
    [PATCH] x86_64: Fix fast check in safe_smp_processor_id
    [PATCH] x86_64: x86_64 setup.c - printing cmp related boottime information
    [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
    ...

    Manual resolve of trivial conflict in arch/i386/kernel/Makefile

    Linus Torvalds
     
  • During some profiling I noticed that default_idle causes a lot of
    memory traffic. I think that is caused by the atomic operations
    to clear/set the polling flag in thread_info. There is actually
    no reason to make this atomic - only the idle thread does it
    to itself, other CPUs only read it. So I moved it into ti->status.

    Converted i386/x86-64/ia64 for now because that was the easiest
    way to fix ACPI which also manipulates these flags in its idle
    function.
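
    A hedged sketch of the point being made (C11 atomics, only for
    illustration): a flag that only its owner ever writes does not need a
    locked read-modify-write; a plain (or relaxed) store is enough, and other
    CPUs just load it.

    #include <stdatomic.h>
    #include <stdio.h>

    #define POLLING_FLAG 0x0004u         /* illustrative flag value */

    struct thread_info {
        atomic_uint flags;               /* old home: set via atomic RMW bitops */
        atomic_uint status;              /* new home: only the owner stores it  */
    };

    int main(void)
    {
        struct thread_info ti = { 0 };

        /* before: a locked read-modify-write just to set one private bit */
        atomic_fetch_or(&ti.flags, POLLING_FLAG);

        /* after: the owner simply stores the value; other CPUs only read */
        atomic_store_explicit(&ti.status, POLLING_FLAG, memory_order_relaxed);

        printf("flags=%#x status=%#x\n",
               atomic_load(&ti.flags),
               atomic_load_explicit(&ti.status, memory_order_relaxed));
        return 0;
    }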

    Cc: Nick Piggin
    Cc: Tony Luck
    Cc: Len Brown
    Signed-off-by: Andi Kleen
    Signed-off-by: Linus Torvalds

    Andi Kleen