08 Dec, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/sysctl-2.6: (43 commits)
    security/tomoyo: Remove now unnecessary handling of security_sysctl.
    security/tomoyo: Add a special case to handle accesses through the internal proc mount.
    sysctl: Drop & in front of every proc_handler.
    sysctl: Remove CTL_NONE and CTL_UNNUMBERED
    sysctl: kill dead ctl_handler definitions.
    sysctl: Remove the last of the generic binary sysctl support
    sysctl net: Remove unused binary sysctl code
    sysctl security/tomoyo: Don't look at ctl_name
    sysctl arm: Remove binary sysctl support
    sysctl x86: Remove dead binary sysctl support
    sysctl sh: Remove dead binary sysctl support
    sysctl powerpc: Remove dead binary sysctl support
    sysctl ia64: Remove dead binary sysctl support
    sysctl s390: Remove dead sysctl binary support
    sysctl frv: Remove dead binary sysctl support
    sysctl mips/lasat: Remove dead binary sysctl support
    sysctl drivers: Remove dead binary sysctl support
    sysctl crypto: Remove dead binary sysctl support
    sysctl security/keys: Remove dead binary sysctl support
    sysctl kernel: Remove binary sysctl logic
    ...

    Linus Torvalds
     

06 Dec, 2009

2 commits

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
    sched, cputime: Introduce thread_group_times()
    sched, cputime: Cleanups related to task_times()
    Revert "sched, x86: Optimize branch hint in __switch_to()"
    sched: Fix isolcpus boot option
    sched: Revert 498657a478c60be092208422fefa9c7b248729c2
    sched, time: Define nsecs_to_jiffies()
    sched: Remove task_{u,s,g}time()
    sched: Introduce task_times() to replace task_{u,s}time() pair
    sched: Limit the number of scheduler debug messages
    sched.c: Call debug_show_all_locks() when dumping all tasks
    sched, x86: Optimize branch hint in __switch_to()
    sched: Optimize branch hint in context_switch()
    sched: Optimize branch hint in pick_next_task_fair()
    sched_feat_write(): Update ppos instead of file->f_pos
    sched: Sched_rt_periodic_timer vs cpu hotplug
    sched, kvm: Fix race condition involving sched_in_preempt_notifers
    sched: More generic WAKE_AFFINE vs select_idle_sibling()
    sched: Cleanup select_task_rq_fair()
    sched: Fix granularity of task_u/stime()
    sched: Fix/add missing update_rq_clock() calls
    ...

    Linus Torvalds
     
  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
    rcu: Make RCU's CPU-stall detector be default
    rcu: Add expedited grace-period support for preemptible RCU
    rcu: Enable fourth level of TREE_RCU hierarchy
    rcu: Rename "quiet" functions
    rcu: Re-arrange code to reduce #ifdef pain
    rcu: Eliminate unneeded function wrapping
    rcu: Fix grace-period-stall bug on large systems with CPU hotplug
    rcu: Eliminate __rcu_pending() false positives
    rcu: Further cleanups of use of lastcomp
    rcu: Simplify association of forced quiescent states with grace periods
    rcu: Accelerate callback processing on CPUs not detecting GP end
    rcu: Mark init-time-only rcu_bootup_announce() as __init
    rcu: Simplify association of quiescent states with grace periods
    rcu: Rename dynticks_completed to completed_fqs
    rcu: Enable synchronize_sched_expedited() fastpath
    rcu: Remove inline from forward-referenced functions
    rcu: Fix note_new_gpnum() uses of ->gpnum
    rcu: Fix synchronization for rcu_process_gp_end() uses of ->completed counter
    rcu: Prepare for synchronization fixes: clean up for non-NO_HZ handling of ->completed counter
    rcu: Cleanup: balance rcu_irq_enter()/rcu_irq_exit() calls
    ...

    Linus Torvalds
     

03 Dec, 2009

3 commits

  • We don't need to build mutex_spin_on_owner() if we have
    CONFIG_DEBUG_MUTEXES or CONFIG_HAVE_DEFAULT_NO_SPIN_MUTEXES, as
    it won't be used under such configs.

    Use CONFIG_MUTEX_SPIN_ON_OWNER to guard its build, as that
    symbol already gathers all the necessary checks.
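
    A minimal sketch of the guard change; the declaration shape is
    assumed for illustration, not copied from the kernel diff:

        /* Before: the config checks are open-coded at the site. */
        #if !defined(CONFIG_DEBUG_MUTEXES) && \
            !defined(CONFIG_HAVE_DEFAULT_NO_SPIN_MUTEXES)
        int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
        #endif

        /* After: one symbol that already encodes both checks. */
        #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
        int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner);
        #endif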

    Signed-off-by: Frederic Weisbecker
    Acked-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar
    Cc: Peter Zijlstra

    Frederic Weisbecker
     
  • This is a real fix for the problem of utime/stime values
    decreasing, described in this thread:

    http://lkml.org/lkml/2009/11/3/522

    Now cputime is accounted in the following way:

    - {u,s}time in task_struct are increased every time the thread
    is interrupted by a tick (timer interrupt).

    - When a thread exits, its {u,s}time are added to signal->{u,s}time,
    after being adjusted by task_times().

    - When all threads in a thread_group exit, the accumulated {u,s}time
    (and also c{u,s}time) in the signal struct are added to c{u,s}time
    in the signal struct of the group's parent.

    So {u,s}time in the task struct are "raw" tick counts, while
    {u,s}time and c{u,s}time in the signal struct are "adjusted" values.

    And the accounted values are used by:

    - task_times(), to get the cputime of a thread:
    This function returns adjusted values that originate from the
    raw {u,s}time, scaled by the sum_exec_runtime accounted by CFS.

    - thread_group_cputime(), to get the cputime of a thread group:
    This function returns the sum of all {u,s}time of living threads
    in the group, plus the {u,s}time in the signal struct, which is
    the sum of the adjusted cputimes of all exited threads that
    belonged to the group.

    The problem is the return value of thread_group_cputime(),
    because it is a mixed sum of "raw" and "adjusted" values:

    group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

    This misbehavior can break {u,s}time monotonicity: if there is a
    thread whose raw values are greater than its adjusted values
    (e.g. it was interrupted by 1000Hz ticks 50 times but only ran
    for 45ms), the group's cputime will decrease when it exits (by
    5ms in this example).

    To fix this, we could do:

    group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

    But task_times() involves heavy divisions, so applying it to
    every thread should be avoided.

    This patch fixes the above problem in the following way:

    - Modify the thread exit path (= __exit_signal()) not to use
    task_times(). This means {u,s}time in the signal struct accumulate
    raw values instead of adjusted values. As a result,
    thread_group_cputime() returns a pure sum of "raw" values.

    - Introduce a new function thread_group_times(*task, *utime, *stime)
    that converts the "raw" values of thread_group_cputime() to
    "adjusted" values, using the same calculation procedure as
    task_times().

    - Modify the group exit path (= wait_task_zombie()) to use the
    introduced thread_group_times(). This keeps c{u,s}time in the
    signal struct adjusted, as before this patch.

    - Replace some thread_group_cputime() calls by thread_group_times().
    These replacements are applied only where the "adjusted" cputime
    is conveyed to users, and where task_times() is already used
    nearby (i.e. sys_times(), getrusage(), and /proc/<pid>/stat).

    This patch has a positive side effect:

    - Before this patch, if a group contained many short-lived threads
    (e.g. each runs 0.9ms and is never interrupted by a tick), the
    group's cputime could be invisible, since each thread's cputime
    was accumulated after adjustment: imagining the adjustment
    function as adj(ticks, runtime),
    {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
    After this patch this no longer happens, because the adjustment
    is applied after accumulation.

    v2:
    - remove if()s, put new variables into signal_struct.
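
    To make the monotonicity break concrete, here is a small runnable
    C model of the mixed raw/adjusted sum; the scaling function and
    all names are illustrative, not the kernel's code:

        #include <stdio.h>
        #include <stdint.h>

        /* Toy task_times()-style adjustment: rescale the raw tick-based
         * utime/stime so their sum equals the CFS-accounted runtime. */
        static void task_times_toy(uint64_t ut_raw, uint64_t st_raw,
                                   uint64_t rtime, uint64_t *ut, uint64_t *st)
        {
                uint64_t total = ut_raw + st_raw;
                *ut = total ? rtime * ut_raw / total : rtime;
                *st = rtime - *ut;
        }

        int main(void)
        {
                /* A thread hit by 1000Hz ticks 50 times (raw = 50ms)
                 * that actually ran for only 45ms. */
                uint64_t ut_raw = 50, st_raw = 0, rtime = 45, ut, st;

                printf("group cputime before exit: %llu ms\n",
                       (unsigned long long)(ut_raw + st_raw)); /* raw: 50 */
                task_times_toy(ut_raw, st_raw, rtime, &ut, &st);
                printf("group cputime after exit:  %llu ms\n",
                       (unsigned long long)(ut + st));         /* adjusted: 45 */
                return 0;
        }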

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Spencer Candland
    Cc: Americo Wang
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • - Remove the if({u,s}t)s because no one calls it with NULL now.
    - Use cputime_{add,sub}().
    - Add an #ifndef/#endif guard for prev_{u,s}time since they are
    used only when !VIRT_CPU_ACCOUNTING (see the sketch below).
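
    A sketch of that guard; the field placement is assumed for
    illustration:

        struct task_struct {
                /* ... */
        #ifndef CONFIG_VIRT_CPU_ACCOUNTING
                /* only the adjustment path uses these */
                cputime_t prev_utime, prev_stime;
        #endif
                /* ... */
        };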

    Signed-off-by: Hidetoshi Seto
    Cc: Peter Zijlstra
    Cc: Spencer Candland
    Cc: Americo Wang
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

02 Dec, 2009

2 commits

  • Anton Blanchard wrote:

    > We allocate and zero cpu_isolated_map after the isolcpus
    > __setup option has run. This means cpu_isolated_map always
    > ends up empty and if CPUMASK_OFFSTACK is enabled we write to a
    > cpumask that hasn't been allocated.

    I introduced this regression in 49557e620339cb13 (sched: Fix
    boot crash by zalloc()ing most of the cpu masks).

    Use the bootmem allocator if isolcpus= is set; otherwise,
    allocate and zero as normal.
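
    A sketch of the fix as described; the shape is assumed from the
    text above, not necessarily the exact diff:

        /* isolcpus= runs at __setup time, before the normal allocation
         * path, so allocate the mask from bootmem right here. */
        static int __init isolated_cpu_setup(char *str)
        {
                alloc_bootmem_cpumask_var(&cpu_isolated_map);
                cpulist_parse(str, cpu_isolated_map);
                return 1;
        }
        __setup("isolcpus=", isolated_cpu_setup);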

    Reported-by: Anton Blanchard
    Signed-off-by: Rusty Russell
    Cc: peterz@infradead.org
    Cc: Linus Torvalds
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar
    Tested-by: Anton Blanchard

    Rusty Russell
     
  • 498657a478c60be092208422fefa9c7b248729c2 incorrectly assumed
    that preempt wasn't disabled around context_switch() and thus
    was fixing an imaginary problem. It also broke KVM, because KVM
    depended on ->sched_in() being called with irqs enabled so that
    it could do SMP calls from there.

    Revert the incorrect commit and add a comment describing the
    different contexts under which the two callbacks are invoked.

    Avi spotted the transposed in/out in the added comment.

    Signed-off-by: Tejun Heo
    Acked-by: Avi Kivity
    Cc: peterz@infradead.org
    Cc: efault@gmx.de
    Cc: rusty@rustcorp.com.au
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Tejun Heo
     

26 Nov, 2009

5 commits

  • Use of msecs_to_jiffies() for nsecs_to_cputime() has some
    problems:

    - The type of msecs_to_jiffies()'s argument is unsigned int, so
    it cannot convert msecs greater than UINT_MAX = about 49.7 days.

    - msecs_to_jiffies() returns MAX_JIFFY_OFFSET if the MSB of the
    argument is set, assuming the input was a negative value. So it
    cannot convert msecs greater than INT_MAX = about 24.8 days
    either.

    This patch defines a new function nsecs_to_jiffies() that can
    handle greater values, and that treats all incoming values as
    unsigned.
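
    A runnable toy model of the limits and the new helper; HZ, the
    helper's name, and the direct division are illustrative
    simplifications, not the kernel implementation:

        #include <stdio.h>
        #include <stdint.h>

        #define HZ 1000
        #define NSEC_PER_SEC 1000000000ULL

        /* Toy nsecs_to_jiffies(): a u64 argument avoids both the
         * UINT_MAX and the INT_MAX/MSB limits described above. */
        static uint64_t nsecs_to_jiffies_toy(uint64_t n)
        {
                /* assumes HZ divides NSEC_PER_SEC evenly */
                return n / (NSEC_PER_SEC / HZ);
        }

        int main(void)
        {
                uint64_t thirty_days_ns = 30ULL * 24 * 3600 * NSEC_PER_SEC;

                /* 30 days in msecs exceeds INT_MAX, so msecs_to_jiffies()
                 * would saturate; the u64 nsecs path is fine. */
                printf("30 days = %llu jiffies at HZ=%d\n",
                       (unsigned long long)nsecs_to_jiffies_toy(thirty_days_ns),
                       HZ);
                return 0;
        }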

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    Cc: Thomas Gleixner
    Cc: John Stultz
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • Now all task_{u,s}time() pairs are replaced by task_times(), and
    task_gtime() is too simple to be an inline function.

    Clean them all up.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • The functions task_{u,s}time() are called in pairs in almost all
    cases. However, task_stime() is implemented to call task_utime()
    internally, so such paired calls run task_utime() twice.

    This means we do the heavy divisions (div_u64 + do_div) twice to
    get utime and stime, which could be obtained at the same time by
    one set of divisions.

    This patch introduces a function task_times(*tsk, *utime,
    *stime) to retrieve utime and stime at once, in a better,
    optimized way.
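
    A sketch of the before/after call pattern; the caller shape is
    assumed for illustration:

        /* Before: two calls, each doing the full set of divisions. */
        utime = task_utime(p);
        stime = task_stime(p);

        /* After: one call fills both outputs in a single pass. */
        task_times(p, &utime, &stime);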

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • Merge reason: Pick up fixes that did not make it into .32.0

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Remove the verbose scheduler debug messages unless the kernel
    parameter "sched_debug" is set. /proc/sched_debug is unchanged.

    Signed-off-by: Mike Travis
    Cc: Heiko Carstens
    Cc: Roland Dreier
    Cc: Randy Dunlap
    Cc: Tejun Heo
    Cc: Andi Kleen
    Cc: Greg Kroah-Hartman
    Cc: Yinghai Lu
    Cc: David Rientjes
    Cc: Steven Rostedt
    Cc: Rusty Russell
    Cc: Hidetoshi Seto
    Cc: Jack Steiner
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Travis
     

25 Nov, 2009

1 commit

  • In commit v2.6.21-691-g39bc89f ("make SysRq-T show all tasks
    again") the interface of show_state_filter() was changed: a
    zero-valued 'state_filter' now specifies "dump all tasks"
    (instead of -1).

    However, the condition for calling debug_show_all_locks() ("show
    locks if all tasks are dumped") was not updated accordingly.
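
    A sketch of the corrected condition; the surrounding function
    body is elided and the exact hunk is assumed:

        void show_state_filter(unsigned long state_filter)
        {
                /* ... dump tasks matching state_filter ... */

                /* 0 now means "all tasks were dumped", so test for 0,
                 * not the old -1 sentinel. */
                if (!state_filter)
                        debug_show_all_locks();
        }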

    Signed-off-by: Shmulik Ladkani
    Cc: peterz@infradead.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Shmulik Ladkani
     

24 Nov, 2009

2 commits

  • Branch hint profiling on my Nehalem machine showed over 90%
    incorrect branch hints:

    10420275 170645395 94 context_switch sched.c 3043
    10408421 171098521 94 context_switch sched.c 3050
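
    A sketch of the kind of change this implies; the `!mm` test is an
    assumed example site in context_switch(), not the verified hunk:

        /* Before: the hint claims the branch is rarely taken. */
        if (unlikely(!mm)) {
                /* ... lazy-TLB handling ... */
        }

        /* After: the profile shows it is taken ~94% of the time. */
        if (likely(!mm)) {
                /* ... lazy-TLB handling ... */
        }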

    Signed-off-by: Tim Blechmann
    Cc: Peter Zijlstra
    Cc: Mike Galbraith
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Tim Blechmann
     
  • sched_feat_write() should update ppos instead of file->f_pos.

    (This reduces some BKL dependencies of this code.)
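
    A sketch of the pattern, with the handler body elided; the
    signature follows 2.6.32-era file_operations write handlers:

        static ssize_t
        sched_feat_write(struct file *filp, const char __user *ubuf,
                         size_t cnt, loff_t *ppos)
        {
                /* ... parse and apply the feature string ... */

                *ppos += cnt;           /* was: filp->f_pos += cnt; */
                return cnt;
        }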

    Signed-off-by: Jan Blunck
    Cc: jkacur@redhat.com
    Cc: Arnd Bergmann
    Cc: Frederic Weisbecker
    Cc: Jamie Lokier
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Alan Cox
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Jan Blunck
     

16 Nov, 2009

1 commit

  • Heiko reported a case where a timer interrupt managed to
    reference a root_domain structure that was already freed by a
    concurrent hot-unplug operation.

    Solve this the same way the regular sched_domain stuff is
    synchronized: add a synchronize_sched() statement to the free
    path. This ensures that a root_domain stays present for any
    atomic section that could have observed it.
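
    A sketch of the free path with the added barrier; placing it
    inside free_rootdomain() is an assumption based on the
    description above:

        static void free_rootdomain(struct root_domain *rd)
        {
                /* Wait for all preempt-disabled (RCU-sched) sections;
                 * after this, no atomic context can still see rd. */
                synchronize_sched();

                free_cpumask_var(rd->rto_mask);
                free_cpumask_var(rd->online);
                free_cpumask_var(rd->span);
                kfree(rd);
        }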

    Reported-by: Heiko Carstens
    Signed-off-by: Peter Zijlstra
    Acked-by: Heiko Carstens
    Cc: Gregory Haskins
    Cc: Siddha Suresh B
    Cc: Martin Schwidefsky
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

15 Nov, 2009

1 commit

  • In finish_task_switch(), fire_sched_in_preempt_notifiers() is
    called after finish_lock_switch().

    However, depending on architecture, preemption can be enabled after
    finish_lock_switch() which breaks the semantics of preempt
    notifiers.

    So move it before finish_arch_switch(). This also makes the in-
    notifiers symmetric to the out- notifiers in terms of locking:
    now both are called under the rq lock.
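
    A sketch of the reordering in finish_task_switch(); surrounding
    code is elided and the exact hunk is assumed:

        static void finish_task_switch(struct rq *rq, struct task_struct *prev)
        {
                /* ... */
                fire_sched_in_preempt_notifiers(current); /* moved up: rq lock held */
                finish_arch_switch(prev);
                finish_lock_switch(rq, prev); /* preemption may be enabled from here */
                /* ... */
        }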

    Signed-off-by: Tejun Heo
    Acked-by: Avi Kivity
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Tejun Heo
     

12 Nov, 2009

3 commits

  • Originally task_s/utime() were designed to return clock_t, but
    were later changed to return cputime_t by the following commit:

    commit efe567fc8281661524ffa75477a7c4ca9b466c63
    Author: Christian Borntraeger
    Date: Thu Aug 23 15:18:02 2007 +0200

    It only changed the type of the return value, but not the
    implementation. As a result, the granularity of task_s/utime()
    is still that of clock_t, not that of cputime_t.

    So using task_s/utime() in __exit_signal() makes the values
    accumulated in the signal struct rounded and coarse-grained.

    This patch removes the casts to clock_t in task_u/stime(), to
    keep the granularity of cputime_t throughout the calculation.

    v2:
    Use div_u64() to avoid the error "undefined reference to
    `__udivdi3`" on some 32-bit systems.
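
    A sketch of the v2 point; the variable names are illustrative:

        /* A bare 64-bit division like "temp / total" is lowered to a
         * libgcc call (__udivdi3) on 32-bit, which the kernel doesn't
         * link; div_u64() uses the kernel's own helper instead. */
        u64 temp = (u64)rtime * utime;
        temp = div_u64(temp, total);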

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: xiyou.wangcong@gmail.com
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • kthread_bind(), migrate_task() and sched_fork() were missing
    update_rq_clock() calls, and try_to_wake_up() was updating after
    having already used the stale clock.

    Aside from preventing potential latency hits, there's a side
    benefit in that early boot printk timestamps become monotonic.
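
    A sketch of the fixed pattern; the lock/flags plumbing is the
    2.6.32-era shape and the call site is assumed:

        rq = task_rq_lock(p, &flags);
        update_rq_clock(rq);            /* refresh before reading rq->clock */
        /* ... code that reads rq->clock ... */
        task_rq_unlock(rq, &flags);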

    Signed-off-by: Mike Galbraith
    Acked-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     
  • Now that sys_sysctl is a generic wrapper around /proc/sys, the
    .ctl_name and .strategy members of sysctl tables are dead code.
    Remove them.
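
    A sketch of what a table entry looks like after the removal; the
    "sched_example" knob and its backing variable are hypothetical
    placeholders:

        static int example_value;

        static struct ctl_table example_table[] = {
                {
                        /* .ctl_name and .strategy are gone along with
                         * the binary sysctl interface */
                        .procname       = "sched_example",  /* hypothetical */
                        .data           = &example_value,
                        .maxlen         = sizeof(int),
                        .mode           = 0644,
                        .proc_handler   = proc_dointvec,
                },
                {}
        };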

    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: David Howells
    Signed-off-by: Eric W. Biederman

    Eric W. Biederman
     

11 Nov, 2009

2 commits

  • This patch adds a counter increment to enable tasks to actually
    take the synchronize_sched_expedited() function's fastpath.

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     
  • From the code in rt_mutex_setprio(), it is evident that the
    intention is that tasks which have an RT 'prio' value as a
    consequence of receiving a PI boost also have their 'sched_class'
    field set to '&rt_sched_class'.

    However, Peter noticed that the code in __setscheduler() could
    result in this intention being frustrated. Fix it.
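
    A sketch of the invariant being restored; the exact placement in
    __setscheduler() is assumed:

        /* A PI-boosted task ends up with an RT prio, so it must also
         * run under the RT scheduling class. */
        if (rt_prio(p->prio))
                p->sched_class = &rt_sched_class;
        else
                p->sched_class = &fair_sched_class;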

    Reported-by: Peter Williams
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

10 Nov, 2009

1 commit

    Commit 1b9508f ("Rate-limit newidle") has been confirmed to fix
    the netperf UDP loopback regression reported by Alex Shi.

    This is a cleanup and a fix:

    - the code is moved to a more out-of-the-way spot

    - a fix ensures that balancing doesn't try to balance
    runqueues which haven't gone online yet, which can
    mess up CPU enumeration during boot.

    Reported-by: Alex Shi
    Reported-by: Zhang, Yanmin
    Signed-off-by: Mike Galbraith
    Acked-by: Peter Zijlstra
    Cc: # .32.x: a1f84a3: sched: Check for an idle shared cache
    Cc: # .32.x: 1b9508f: sched: Rate-limit newidle
    Cc: # .32.x: fd21073: sched: Fix affinity logic
    Cc: # .32.x
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

04 Nov, 2009

2 commits

  • Currently partition_sched_domains() takes a 'struct cpumask
    *doms_new', which is a kmalloc'ed array of cpumask_t. You can't
    have such an array if 'struct cpumask' is undefined, as we plan
    for CONFIG_CPUMASK_OFFSTACK=y.

    So we make this an array of cpumask_var_t instead: this is the
    same for the CONFIG_CPUMASK_OFFSTACK=n case, but requires
    multiple allocations for the CONFIG_CPUMASK_OFFSTACK=y case.
    Hence we add the alloc_sched_domains() and free_sched_domains()
    functions.
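
    A sketch of the new helpers as described; the error-handling
    shape is assumed:

        cpumask_var_t *alloc_sched_domains(unsigned int ndoms)
        {
                int i;
                cpumask_var_t *doms;

                doms = kmalloc(sizeof(*doms) * ndoms, GFP_KERNEL);
                if (!doms)
                        return NULL;
                for (i = 0; i < ndoms; i++) {
                        if (!alloc_cpumask_var(&doms[i], GFP_KERNEL)) {
                                free_sched_domains(doms, i); /* roll back */
                                return NULL;
                        }
                }
                return doms;
        }

        void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms)
        {
                unsigned int i;

                for (i = 0; i < ndoms; i++)
                        free_cpumask_var(doms[i]);
                kfree(doms);
        }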

    Signed-off-by: Rusty Russell
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rusty Russell
     
  • cpu_nr_migrations() is not used, remove it.

    Signed-off-by: Hiroshi Shimamoto
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hiroshi Shimamoto
     

03 Nov, 2009

1 commit

  • Eric Paris reported that commit
    f685ceacab07d3f6c236f04803e2f2f0dbcc5afb causes boot time
    PREEMPT_DEBUG complaints.

    [ 4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
    [ 4.593043] caller is task_hot+0x86/0xd0

    Since kthread_bind() messes with scheduler internals, move the
    body to sched.c, and lock the runqueue.
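
    A sketch of the moved kthread_bind(), with the error checks
    elided; the shape is assumed from the description above:

        void kthread_bind(struct task_struct *p, unsigned int cpu)
        {
                struct rq *rq = cpu_rq(cpu);
                unsigned long flags;

                /* bind under the runqueue lock so task_hot() and
                 * friends see a consistent view */
                spin_lock_irqsave(&rq->lock, flags);
                set_task_cpu(p, cpu);
                p->cpus_allowed = cpumask_of_cpu(cpu);
                p->flags |= PF_THREAD_BOUND;
                spin_unlock_irqrestore(&rq->lock, flags);
        }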

    Reported-by: Eric Paris
    Signed-off-by: Mike Galbraith
    Tested-by: Eric Paris
    Cc: Peter Zijlstra
    LKML-Reference:
    [ v2: fix !SMP build and clean up ]
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

02 Nov, 2009

1 commit

  • I got a boot crash when forcing cpumasks offstack on 32-bit,
    because find_new_ilb() returned 3 on my UP system (nohz.cpu_mask
    wasn't zeroed).

    AFAICT the others need to be zeroed too: only
    nohz.ilb_grp_nohz_mask is initialized before use.
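
    A sketch of the zeroed allocation; which masks are converted is
    illustrative:

        /* zalloc_cpumask_var() allocates *and* zeroes the mask, so an
         * offstack nohz.cpu_mask can't report phantom idle CPUs. */
        zalloc_cpumask_var(&nohz.cpu_mask, GFP_NOWAIT);
        zalloc_cpumask_var(&nohz.ilb_grp_nohz_mask, GFP_NOWAIT);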

    Signed-off-by: Rusty Russell
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Rusty Russell
     

28 Oct, 2009

1 commit

  • Commit 34d76c41 introduced the percpu array update_shares_data,
    whose size is proportional to NR_CPUS. Unfortunately this blows
    up ia64 for large NR_CPUS configurations, as ia64 allows only
    64k for the .percpu section.

    Fix this by allocating the array dynamically and keeping only a
    percpu pointer to it (see the sketch below).
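
    A sketch of the shape of the change; both declarations are
    illustrative, not the exact diff:

        /* Before: a full NR_CPUS-sized array is placed in the .percpu
         * section, which ia64 caps at 64k. */
        static DEFINE_PER_CPU(unsigned long [NR_CPUS], update_shares_data);

        /* After: only a pointer stays percpu-addressed; the storage is
         * allocated dynamically at boot, e.g. via __alloc_percpu(). */
        static unsigned long *update_shares_data;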

    The per-cpu handling doesn't impose a significant performance
    penalty on the potentially contended path in tg_shares_up().

    Before:

    ...
    ffffffff8104337c: 65 48 8b 14 25 20 cd mov %gs:0xcd20,%rdx
    ffffffff81043383: 00 00
    ffffffff81043385: 48 c7 c0 00 e1 00 00 mov $0xe100,%rax
    ffffffff8104338c: 48 c7 45 a0 00 00 00 movq $0x0,-0x60(%rbp)
    ffffffff81043393: 00
    ffffffff81043394: 48 c7 45 a8 00 00 00 movq $0x0,-0x58(%rbp)
    ffffffff8104339b: 00
    ffffffff8104339c: 48 01 d0 add %rdx,%rax
    ffffffff8104339f: 49 8d 94 24 08 01 00 lea 0x108(%r12),%rdx
    ffffffff810433a6: 00
    ffffffff810433a7: b9 ff ff ff ff mov $0xffffffff,%ecx
    ffffffff810433ac: 48 89 45 b0 mov %rax,-0x50(%rbp)
    ffffffff810433b0: bb 00 04 00 00 mov $0x400,%ebx
    ffffffff810433b5: 48 89 55 c0 mov %rdx,-0x40(%rbp)
    ...

    After:

    ...
    ffffffff8104337c: 65 8b 04 25 28 cd 00 mov %gs:0xcd28,%eax
    ffffffff81043383: 00
    ffffffff81043384: 48 98 cltq
    ffffffff81043386: 49 8d bc 24 08 01 00 lea 0x108(%r12),%rdi
    ffffffff8104338d: 00
    ffffffff8104338e: 48 8b 15 d3 7f 76 00 mov 0x767fd3(%rip),%rdx # ffffffff817ab368
    ffffffff81043395: 48 8b 34 c5 00 ee 6d mov -0x7e921200(,%rax,8),%rsi
    ffffffff8104339c: 81
    ffffffff8104339d: 48 c7 45 a0 00 00 00 movq $0x0,-0x60(%rbp)
    ffffffff810433a4: 00
    ffffffff810433a5: b9 ff ff ff ff mov $0xffffffff,%ecx
    ffffffff810433aa: 48 89 7d c0 mov %rdi,-0x40(%rbp)
    ffffffff810433ae: 48 c7 45 a8 00 00 00 movq $0x0,-0x58(%rbp)
    ffffffff810433b5: 00
    ffffffff810433b6: bb 00 04 00 00 mov $0x400,%ebx
    ffffffff810433bb: 48 01 f2 add %rsi,%rdx
    ffffffff810433be: 48 89 55 b0 mov %rdx,-0x50(%rbp)
    ...

    Signed-off-by: Jiri Kosina
    Acked-by: Ingo Molnar
    Signed-off-by: Tejun Heo

    Jiri Kosina
     

26 Oct, 2009

2 commits

  • The CPU time of a guest is always accounted as 'user' time,
    without regard to the nice value of its counterpart process,
    even though the guest is scheduled under that nice value.

    This patch fixes the defect and accounts the cpu time of a
    niced guest as 'nice' time, the same as for a niced process.

    The patch also adds 'guest_nice' to cpuacct. The value provides
    the niced guest's cpu time, which is to 'guest' what 'nice' is
    to 'user'.

    The original discussions can be found here:

    http://www.mail-archive.com/kvm@vger.kernel.org/msg23982.html
    http://www.mail-archive.com/kvm@vger.kernel.org/msg23860.html
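
    A sketch of the accounting split described above; tmp is the
    cputime delta being charged, and the surrounding bookkeeping in
    account_guest_time() is elided:

        /* A niced guest's time goes to the nice buckets instead of
         * the user buckets. */
        if (task_nice(p) > 0) {
                cpustat->nice       = cputime64_add(cpustat->nice, tmp);
                cpustat->guest_nice = cputime64_add(cpustat->guest_nice, tmp);
        } else {
                cpustat->user  = cputime64_add(cpustat->user, tmp);
                cpustat->guest = cputime64_add(cpustat->guest, tmp);
        }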

    Signed-off-by: Ryota Ozaki
    Acked-by: Avi Kivity
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ryota Ozaki
     
  • Conflicts:
    fs/proc/array.c

    Merge reason: resolve conflict and queue up dependent patch.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

24 Oct, 2009

1 commit

  • This patch restores the effectiveness of LAST_BUDDY in preventing
    pgsql+oltp from collapsing due to wakeup preemption. It also
    switches LAST_BUDDY to exclusively do what it does best, namely
    mitigate the effects of aggressive wakeup preemption, which
    improves vmark throughput markedly, and restores mysql+oltp
    scalability.

    Since buddies are about scalability, enable them beginning at the
    point where we begin expanding sched_latency, namely
    sched_nr_latency. Previously, buddies were cleared aggressively,
    which seriously reduced their effectiveness. Not clearing
    aggressively however, produces a small drop in mysql+oltp
    throughput immediately after peak, indicating that LAST_BUDDY is
    actually doing some harm. This is right at the point where X on the
    desktop in competition with another load wants low latency service.
    Ergo, do not enable until we need to scale.

    To mitigate latency induced by buddies, or by a task just missing
    wakeup preemption, check latency at tick time.

    The last hunk prevents buddies from stymieing BALANCE_NEWIDLE
    via CACHE_HOT_BUDDY.
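
    A sketch of the gating idea; the exact hunk is assumed, not
    copied from the patch:

        /* Only engage buddies once the queue is wide enough that
         * sched_latency starts expanding. */
        int scale = cfs_rq->nr_running >= sched_nr_latency;

        if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
                set_last_buddy(se);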

    Supporting performance tests:

    tip = v2.6.32-rc5-1497-ga525b32
    tipx = NO_GENTLE_FAIR_SLEEPERS NEXT_BUDDY granularity knobs = 31 knobs + 31 buddies
    tip+x = NO_GENTLE_FAIR_SLEEPERS granularity knobs = 31 knobs

    (Three run averages except where noted.)

    vmark:
    ------
    tip 108466 messages per second
    tip+ 125307 messages per second
    tip+x 125335 messages per second
    tipx 117781 messages per second
    2.6.31.3 122729 messages per second

    mysql+oltp:
    -----------
    clients 1 2 4 8 16 32 64 128 256
    ..........................................................................................
    tip 9949.89 18690.20 34801.24 34460.04 32682.88 30765.97 28305.27 25059.64 19548.08
    tip+ 10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47
    tipx 9698.71 18002.70 34477.56 33420.01 32634.30 31657.27 29932.67 26827.52 21487.18
    2.6.31.3 8243.11 18784.20 34404.83 33148.38 31900.32 31161.90 29663.81 25995.94 18058.86

    pgsql+oltp:
    -----------
    clients 1 2 4 8 16 32 64 128 256
    ..........................................................................................
    tip 13686.37 26609.25 51934.28 51347.81 49479.51 45312.65 36691.91 26851.57 24145.35
    tip+ (1x) 13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94
    tip+x 13906.78 27065.81 52951.19 52542.59 52176.11 51815.94 50838.90 49439.46 46891.00
    tipx 13742.46 26769.81 52351.99 51891.73 51320.79 50938.98 50248.65 48908.70 46553.84
    2.6.31.3 13815.35 26906.46 52683.34 52061.31 51937.10 51376.80 50474.28 49394.47 47003.25

    Signed-off-by: Mike Galbraith
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
     

15 Oct, 2009

1 commit