18 Oct, 2007

1 commit


17 Oct, 2007

7 commits

  • Change migration_call(CPU_DEAD) to use a direct spin_lock_irq() instead of
    task_rq_lock(rq->idle); rq->idle can't change its task_rq().

    This makes the code a bit more symmetrical with migrate_dead_tasks()'s path,
    which uses spin_lock_irq/spin_unlock_irq (a rough before/after sketch follows
    this entry).

    Signed-off-by: Oleg Nesterov
    Cc: Cliff Wickman
    Cc: Gautham R Shenoy
    Cc: Ingo Molnar
    Cc: Srivatsa Vaddagiri
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
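    Roughly, the change described above has this shape in migration_call()'s
    CPU_DEAD handling. This is a hand-written sketch of the before/after
    locking, not the verbatim diff; the body of the critical section is elided
    and only indicated by comments.

    /* Before: take the runqueue lock via the task helper. */
    rq = task_rq_lock(rq->idle, &flags);
    /* ... reset rq->idle's scheduling state ... */
    task_rq_unlock(rq, &flags);

    /* After: rq->idle can never change its task_rq(), so the runqueue
     * lock can be taken directly, mirroring migrate_dead_tasks(). */
    spin_lock_irq(&rq->lock);
    /* ... same body ... */
    spin_unlock_irq(&rq->lock);
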
  • Currently move_task_off_dead_cpu() is called under
    write_lock_irq(tasklist). This means it can't use task_lock(), which is
    needed to improve migration to take the task's ->cpuset into account.

    Change the code to call move_task_off_dead_cpu() with irqs enabled, and
    change migrate_live_tasks() to use read_lock(tasklist) (a sketch of the
    resulting function follows this entry).

    This is all preparation for the further changes proposed by Cliff Wickman; see
    http://marc.info/?t=117327786100003

    Signed-off-by: Oleg Nesterov
    Cc: Cliff Wickman
    Cc: Gautham R Shenoy
    Cc: Ingo Molnar
    Cc: Srivatsa Vaddagiri
    Cc: Akinobu Mita
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
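    For reference, a sketch of what migrate_live_tasks() looks like after this
    change, assuming the usual do_each_thread/while_each_thread iteration over
    the task list; treat it as illustrative rather than the exact patched
    source.

    static void migrate_live_tasks(int src_cpu)
    {
        struct task_struct *p, *t;

        /* Was write_lock_irq(&tasklist_lock): irqs off, task_lock() unusable. */
        read_lock(&tasklist_lock);

        do_each_thread(t, p) {
            if (p == current)
                continue;
            if (task_cpu(p) == src_cpu)
                move_task_off_dead_cpu(src_cpu, p);
        } while_each_thread(t, p);

        read_unlock(&tasklist_lock);
    }
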
  • A child task may be added on a different cpu than the one on which the
    parent is running. In that case, task_new_fair() should check whether the
    new-born task's parent entity should be added to the cfs_rq as well.

    The patch below fixes the problem in task_new_fair() (a conceptual sketch
    follows this entry).

    This could fix the reported put_prev_task_fair() crashes.

    Reported-by: Kamalesh Babulal
    Reported-by: Andy Whitcroft
    Signed-off-by: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar

    Srivatsa Vaddagiri
     
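    A conceptual sketch of the idea, modelled on the hierarchy walk that
    enqueue_task_fair() performs for group scheduling. The helper name below is
    made up for illustration and this is not the verbatim patch.

    /*
     * Hypothetical helper: enqueue the new task's entity and keep walking up
     * the group hierarchy, so that a parent entity which is not yet on the
     * target CPU's cfs_rq gets enqueued as well instead of being skipped.
     */
    static void enqueue_new_task_hierarchy(struct rq *rq, struct task_struct *p)
    {
        struct sched_entity *se = &p->se;

        for_each_sched_entity(se) {
            if (se->on_rq)
                break;          /* ancestor already queued, stop here */
            enqueue_entity(cfs_rq_of(se), se, 0);
        }

        resched_task(rq->curr);
    }
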
  • We recently discovered a nasty performance bug in the kernel CPU load
    balancer, where we were hit by a 50% performance regression.

    When tasks are assigned, via cpu affinity, to a subset of CPUs that spans
    sched_domains (either a ccNUMA node or the new multi-core domain), the
    kernel fails to load balance properly at these domains, because several
    pieces of logic in find_busiest_group() misidentify the busiest sched
    group within a given domain. This leads to inadequate load balancing and
    causes the 50% performance hit.

    To give a concrete example: on a dual-core, 2-socket NUMA system, there
    are 4 logical CPUs, organized as:

    CPU0 attaching sched-domain:
    domain 0: span 0003 groups: 0001 0002
    domain 1: span 000f groups: 0003 000c
    CPU1 attaching sched-domain:
    domain 0: span 0003 groups: 0002 0001
    domain 1: span 000f groups: 0003 000c
    CPU2 attaching sched-domain:
    domain 0: span 000c groups: 0004 0008
    domain 1: span 000f groups: 000c 0003
    CPU3 attaching sched-domain:
    domain 0: span 000c groups: 0008 0004
    domain 1: span 000f groups: 000c 0003

    If I run 2 tasks with CPU affinity set to 0x5, there are situations
    where cpu0 has a run queue length of 2 while cpu2 sits idle. The
    kernel load balancer is unable to spread these two tasks over
    cpu0 and cpu2 because at least three pieces of logic in
    find_busiest_group() heavily bias load balancing towards power-saving
    mode. E.g. while determining the "busiest" variable, the kernel only
    sets it when "sum_nr_running > group_capacity". This test is flawed in
    that "sum_nr_running" is not necessarily the same as the number of tasks
    allowed to run within the sched group. The end result is that the kernel
    "thinks" everything is balanced, but in reality we have an imbalance,
    causing one CPU to be over-subscribed while the other stays idle. Two
    other pieces of logic in the same function cause a similar effect. The
    nastiness of this bug is that the kernel is unable to get unstuck from
    this unfortunate broken state. From what we've seen in our environment,
    the kernel stays stuck in the imbalanced state for extended periods of
    time, and it is also very easy for the kernel to get stuck in that state
    (it's pretty much 100% reproducible for us).

    So we propose the following fix: add additional logic in
    find_busiest_group() to detect intrinsic imbalance within the busiest
    group. When such a condition is detected, load balancing goes into spread
    mode instead of the default grouping mode. (A toy model of the scenario
    above follows this entry.)
    Signed-off-by: Ken Chen
    Signed-off-by: Ingo Molnar

    Ken Chen
     
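    The following toy userspace program reconstructs the scenario above (the
    domain-1 group masks from the dump, two tasks affine to 0x5, both queued
    on cpu0). It shares no code with find_busiest_group() and is purely
    illustrative, but it shows why a "nr_running > capacity" test misses the
    imbalance that an affinity-aware check catches.

    #include <stdio.h>

    int main(void)
    {
        /* Domain-1 groups from the dump above: {cpu0,cpu1}=0x3, {cpu2,cpu3}=0xc. */
        unsigned group_span[2] = { 0x3, 0xc };

        /* Two tasks, both currently queued on cpu0, both affine to 0x5 (cpu0|cpu2). */
        int      task_cpu[2]  = { 0, 0 };
        unsigned task_mask[2] = { 0x5, 0x5 };

        for (int g = 0; g < 2; g++) {
            int sum_nr_running = 0;     /* tasks queued on this group's CPUs */
            unsigned usable = 0;        /* group CPUs these tasks may actually use */
            int capacity = 2;           /* one task per CPU in a two-CPU group */

            for (int t = 0; t < 2; t++) {
                if (group_span[g] & (1u << task_cpu[t]))
                    sum_nr_running++;
                usable |= group_span[g] & task_mask[t];
            }

            int usable_cpus = __builtin_popcount(usable);

            printf("group 0x%x: nr_running=%d capacity=%d usable_cpus=%d\n",
                   group_span[g], sum_nr_running, capacity, usable_cpus);
            printf("  'nr_running > capacity' test: %s it busiest\n",
                   sum_nr_running > capacity ? "marks" : "does NOT mark");
            printf("  affinity-aware 'nr_running > usable_cpus' test: %s it busiest\n",
                   sum_nr_running > usable_cpus ? "marks" : "does NOT mark");
        }

        /*
         * Group 0x3 ends up with nr_running=2 and capacity=2, so the stock
         * test never declares it busiest, yet only cpu0 is usable for these
         * tasks: cpu0 runs both while cpu2 (in group 0xc) stays idle.
         */
        return 0;
    }
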
  • It occurred to me this morning that the procname field was dynamically
    allocated and needed to be freed. I started to put in break statements for
    when allocation failed, but it was approaching 50% error-handling code.

    I came up with this alternative of looping while entry->mode is set and
    checking proc_handler instead of ->table (a sketch follows this entry).
    Alternatively, the string versions of the domain name and cpu number could
    be stored in the structs.

    I verified by compiling with CONFIG_DEBUG_SLAB and checking the allocation
    counts after making a cpuset exclusive and back.

    Signed-off-by: Ingo Molnar

    Milton Miller
     
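    A sketch of the freeing loop described above, assuming the then-current
    struct ctl_table fields (procname, mode, child, proc_handler); the helper
    name and details are from memory and should be treated as illustrative.

    static void sd_free_ctl_entry(struct ctl_table **tablep)
    {
        struct ctl_table *entry;

        /*
         * In the intermediate directories the procname strings (and child
         * tables) are dynamically allocated, but ->mode is always set; in
         * the leaf tables the names are static strings and every entry has
         * a proc_handler, so those must not be freed.
         */
        for (entry = *tablep; entry->mode; entry++) {
            if (entry->child)
                sd_free_ctl_entry(&entry->child);
            if (entry->proc_handler == NULL)
                kfree(entry->procname);
        }

        kfree(*tablep);
        *tablep = NULL;
    }
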
  • Remove the cpuset hooks that defined sched domains depending on the setting
    of the 'cpu_exclusive' flag.

    The cpu_exclusive flag can only be set on a child if it is set on the
    parent.

    This made that flag painfully unsuitable for use as a flag defining a
    partitioning of a system.

    It was entirely unobvious to a cpuset user what partitioning of sched
    domains they would be causing when they set that one cpu_exclusive bit on
    one cpuset, because it depended on what CPUs were in the remainder of that
    cpuset's siblings and child cpusets, after subtracting out other
    cpu_exclusive cpusets.

    Furthermore, there was no way on production systems to query the
    result.

    Using the cpu_exclusive flag for this was simply wrong from the get go.

    Fortunately, it was sufficiently borked that, so far as I know, almost no
    successful use has been made of this. One real-time group did use it to
    effectively isolate CPUs from any load balancing efforts. They are willing
    to adapt to alternative mechanisms for this, such as some way to manipulate
    the list of isolated CPUs on a running system. They can do without the
    present cpu_exclusive-based mechanism while we develop an alternative.

    There is a real risk, to the best of my understanding, of users
    accidentally setting up partitioned scheduler domains, inhibiting desired
    load balancing across all their CPUs, due to the nonobvious (from the
    cpuset perspective) side effects of the cpu_exclusive flag.

    Furthermore, since there was no way on a running system to see what one was
    doing with sched domains, this change will be invisible to any code that
    was using it. Unless they have real insight into the scheduler's load
    balancing choices, users will be unable to detect that this change has been
    made to the kernel's behaviour.

    Initial discussion of this patch on lkml has generated much comment. My
    (probably controversial) take on that discussion is that it has reached a
    rough consensus that the current cpuset cpu_exclusive mechanism for
    defining sched domains is borked. There is no consensus on the
    replacement. But since we can remove this mechanism, and since its
    continued presence risks causing unwanted partitioning of the scheduler's
    load balancing, we should remove it while we can, as we proceed to work on
    the replacement scheduler domain mechanisms.

    Signed-off-by: Paul Jackson
    Cc: Ingo Molnar
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Dinakar Guniguntala
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Convert cpu_sibling_map from a static array sized by NR_CPUS to a per_cpu
    variable (a sketch of the conversion follows this entry). This saves
    sizeof(cpumask_t) for each unused CPU. Access is mostly from startup and
    CPU hotplug functions.

    Signed-off-by: Mike Travis
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: "Siddha, Suresh B"
    Cc: "David S. Miller"
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Travis
     
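    The shape of the conversion, roughly; the exact declarations vary per
    architecture and the accessor function below is made up for illustration.

    /* Before: statically sized by NR_CPUS, wasting space for absent CPUs. */
    cpumask_t cpu_sibling_map[NR_CPUS];

    /* After: one cpumask_t per possible CPU. */
    DEFINE_PER_CPU(cpumask_t, cpu_sibling_map);

    /* Readers switch from array indexing to the per-cpu accessor: */
    static cpumask_t sibling_mask(int cpu)
    {
        return per_cpu(cpu_sibling_map, cpu);   /* was cpu_sibling_map[cpu] */
    }
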

15 Oct, 2007

32 commits