12 Oct, 2012

2 commits

  • Pull scheduler fixes from Ingo Molnar:
    "A CPU hotplug related crash fix and a nohz accounting fixlet."

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Update sched_domains_numa_masks[][] when new cpus are onlined
    sched: Ensure 'sched_domains_numa_levels' is safe to use in other functions
    nohz: Fix one jiffy count too far in idle cputime

    Linus Torvalds
     
  • Pull pile 2 of execve and kernel_thread unification work from Al Viro:
    "Stuff in there: kernel_thread/kernel_execve/sys_execve conversions for
    several more architectures plus assorted signal fixes and cleanups.

    There'll be more (in particular, real fixes for the alpha
    do_notify_resume() irq mess)..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (43 commits)
    alpha: don't open-code trace_report_syscall_{enter,exit}
    Uninclude linux/freezer.h
    m32r: trim masks
    avr32: trim masks
    tile: don't bother with SIGTRAP in setup_frame
    microblaze: don't bother with SIGTRAP in setup_rt_frame()
    mn10300: don't bother with SIGTRAP in setup_frame()
    frv: no need to raise SIGTRAP in setup_frame()
    x86: get rid of duplicate code in case of CONFIG_VM86
    unicore32: remove pointless test
    h8300: trim _TIF_WORK_MASK
    parisc: decide whether to go to slow path (tracesys) based on thread flags
    parisc: don't bother looping in do_signal()
    parisc: fix double restarts
    bury the rest of TIF_IRET
    sanitize tsk_is_polling()
    bury _TIF_RESTORE_SIGMASK
    unicore32: unobfuscate _TIF_WORK_MASK
    mips: NOTIFY_RESUME is not needed in TIF masks
    mips: merge the identical "return from syscall" per-ABI code
    ...

    Conflicts:
    arch/arm/include/asm/thread_info.h

    Linus Torvalds
     

05 Oct, 2012

2 commits

  • Once the array sched_domains_numa_masks[][] is defined, it is never
    updated.

    When a new cpu on a new node is onlined, the corresponding member in
    sched_domains_numa_masks[][] is not initialized, and all the masks are 0.
    As a result, build_overlap_sched_groups() will initialize a NULL
    sched_group for the new cpu on the new node, which leads to a kernel panic:

    [ 3189.403280] Call Trace:
    [ 3189.403286] [] warn_slowpath_common+0x7f/0xc0
    [ 3189.403289] [] warn_slowpath_null+0x1a/0x20
    [ 3189.403292] [] build_sched_domains+0x467/0x470
    [ 3189.403296] [] partition_sched_domains+0x307/0x510
    [ 3189.403299] [] ? partition_sched_domains+0x142/0x510
    [ 3189.403305] [] cpuset_update_active_cpus+0x83/0x90
    [ 3189.403308] [] cpuset_cpu_active+0x38/0x70
    [ 3189.403316] [] notifier_call_chain+0x67/0x150
    [ 3189.403320] [] ? native_cpu_up+0x18a/0x1b5
    [ 3189.403328] [] __raw_notifier_call_chain+0xe/0x10
    [ 3189.403333] [] __cpu_notify+0x20/0x40
    [ 3189.403337] [] _cpu_up+0xe9/0x131
    [ 3189.403340] [] cpu_up+0xdb/0xee
    [ 3189.403348] [] store_online+0x9c/0xd0
    [ 3189.403355] [] dev_attr_store+0x20/0x30
    [ 3189.403361] [] sysfs_write_file+0xa3/0x100
    [ 3189.403368] [] vfs_write+0xd0/0x1a0
    [ 3189.403371] [] sys_write+0x54/0xa0
    [ 3189.403375] [] system_call_fastpath+0x16/0x1b
    [ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
    [ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018

    This patch registers a new notifier on the cpu hotplug notify chain and
    updates sched_domains_numa_masks[][] every time a cpu is onlined or
    offlined (a sketch follows this entry).

    Signed-off-by: Tang Chen
    Signed-off-by: Wen Congyang
    [ fixed compile warning ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1348578751-16904-3-git-send-email-tangchen@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Tang Chen
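
    A minimal sketch of the mechanism described above, not the literal patch:
    a callback on the cpu hotplug notifier chain that adds or removes the
    onlined/offlined cpu in sched_domains_numa_masks[][]. The set/clear helper
    names below follow the description and should be treated as assumptions.

      static int sched_domains_numa_masks_update(struct notifier_block *nfb,
                                                  unsigned long action, void *hcpu)
      {
          int cpu = (long)hcpu;

          switch (action & ~CPU_TASKS_FROZEN) {
          case CPU_ONLINE:
              /* add the new cpu to every mask whose node is close enough */
              sched_domains_numa_masks_set(cpu);
              break;
          case CPU_DEAD:
              /* drop the cpu from all masks so stale bits never linger */
              sched_domains_numa_masks_clear(cpu);
              break;
          default:
              return NOTIFY_DONE;
          }
          return NOTIFY_OK;
      }

      /* registered early (e.g. via hotcpu_notifier()) so later onlines are seen */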
     
  • Keep 'sched_domains_numa_levels' at 0 while sched_init_numa() is still
    building its data. If allocating memory for the array
    sched_domains_numa_masks[][] fails, the array will contain fewer than
    'level' members, which is dangerous when other functions use
    'sched_domains_numa_levels' to iterate over sched_domains_numa_masks[][].

    This patch sets sched_domains_numa_levels to 0 before initializing the
    sched_domains_numa_masks[][] array, and resets it to 'level' only when
    sched_domains_numa_masks[][] is fully initialized (see the ordering sketch
    after this entry).

    Signed-off-by: Tang Chen
    Signed-off-by: Wen Congyang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1348578751-16904-2-git-send-email-tangchen@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Tang Chen
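
    A sketch of the ordering inside sched_init_numa() after the fix (allocation
    details simplified and assumed): the published level count stays 0 until
    the whole masks array exists.

      sched_domains_numa_levels = 0;      /* nobody may iterate the array yet */

      sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
      if (!sched_domains_numa_masks)
          return;                         /* levels stays 0: iterators see nothing */

      for (i = 0; i < level; i++) {
          sched_domains_numa_masks[i] =
              kzalloc(nr_node_ids * sizeof(void *), GFP_KERNEL);
          if (!sched_domains_numa_masks[i])
              return;                     /* partially built: leave levels at 0 */
      }

      /* only now is it safe for other functions to walk all 'level' entries */
      sched_domains_numa_levels = level;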
     

02 Oct, 2012

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Continued quest to clean up and enhance the cputime code by Frederic
    Weisbecker, in preparation for future tickless kernel features.

    Other than that, smallish changes."

    Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    cputime: Make finegrained irqtime accounting generally available
    cputime: Gather time/stats accounting config options into a single menu
    ia64: Reuse system and user vtime accounting functions on task switch
    ia64: Consolidate user vtime accounting
    vtime: Consolidate system/idle context detection
    cputime: Use a proper subsystem naming for vtime related APIs
    sched: cpu_power: enable ARCH_POWER
    sched/nohz: Clean up select_nohz_load_balancer()
    sched: Fix load avg vs. cpu-hotplug
    sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: Fix nohz_idle_balance()
    sched: Remove useless code in yield_to()
    sched: Add time unit suffix to sched sysctl knobs
    sched/debug: Limit sd->*_idx range on sysctl
    sched: Remove AFFINE_WAKEUPS feature flag
    s390: Remove leftover account_tick_vtime() header
    cputime: Consolidate vtime handling on context switch
    sched: Move cputime code to its own file
    cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
    tile: Remove SD_PREFER_LOCAL leftover
    ...

    Linus Torvalds
     

01 Oct, 2012

1 commit

  • Make the default just return 0. The current default (checking
    TIF_POLLING_NRFLAG) is moved to the architectures that need it; ones
    that don't do polling in their idle threads don't need to define
    TIF_POLLING_NRFLAG at all (see the sketch after this entry).

    ia64 defined both TS_POLLING (used by its tsk_is_polling())
    and TIF_POLLING_NRFLAG (not used at all). Killed the latter...

    Signed-off-by: Al Viro

    Al Viro
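
    The default reduces to something like this sketch; architectures that
    really poll in their idle loop (via TIF_POLLING_NRFLAG or, on ia64,
    TS_POLLING) provide their own definition.

      /* generic fallback: the arch does not poll in its idle loop */
      #ifndef tsk_is_polling
      #define tsk_is_polling(t) 0
      #endif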
     

26 Sep, 2012

4 commits

  • When exceptions or irq are about to resume userspace, if
    the task needs to be rescheduled, the arch low level code
    calls schedule() directly.

    If we call it, it is because we have the TIF_RESCHED flag:

    - It can be set after random local calls to set_need_resched()
    (RCU, drm, ...)

    - A wake up happened and the CPU needs preemption. This can
    happen in several ways:

    * Remotely: the remote waking CPU has set TIF_RESCHED and sends the
    wakee an IPI to schedule the new task.
    * Remotely enqueued: the remote waking CPU sends an IPI to the target
    and the wake up is made by the target.
    * Locally: waking CPU == wakee CPU and the wakeup is done locally.
    set_need_resched() is called without IPI.

    In the case of local and remotely enqueued wake ups, the tick can
    be restarted when we enqueue the new task, and RCU can exit the
    extended quiescent state at the same time. Then by the time we reach
    the irq exit path and call schedule(), we are not in RCU user mode.

    But if we call schedule() only because something called set_need_resched(),
    RCU may still be in user mode when we reach schedule.

    Also, if a wake up is done remotely, the CPU might see the TIF_RESCHED
    flag and call schedule() while the IPI has not yet arrived to restart
    the tick and exit RCU user mode.

    We need to manually protect against these corner cases.

    Create a new API schedule_user() that calls schedule() inside an
    rcu_user_exit()-rcu_user_enter() pair in order to protect it (a sketch
    follows this entry). Archs will now need to rely on it to implement
    user preemption safely.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
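
    A sketch of the new helper, close to what the description calls for: wrap
    schedule() in an RCU user-mode exit/re-enter pair so the corner cases
    above cannot run the scheduler while RCU still thinks the CPU is in
    userspace.

      asmlinkage void __sched schedule_user(void)
      {
          /*
           * We may be here after a random set_need_resched(), or after a
           * remote wakeup whose IPI has not arrived yet: RCU may still be
           * in user (extended quiescent) mode, so leave it around schedule().
           */
          rcu_user_exit();
          schedule();
          rcu_user_enter();
      }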
     
  • When an exception or an irq exits, and we are going to resume into
    interrupted kernel code, the low level architecture code calls
    preempt_schedule_irq() if there is a need to reschedule.

    If the interrupt/exception occurred between a call to rcu_user_enter()
    (from syscall exit, exception exit, do_notify_resume exit, ...) and
    the real resume to userspace (iret, ...), preempt_schedule_irq() can be
    called while RCU thinks we are in userspace. But preempt_schedule_irq()
    is going to run kernel code, possibly including RCU read-side critical
    sections. We must exit the userspace extended quiescent state before
    we call it.

    To solve this, just call rcu_user_exit() at the beginning of
    preempt_schedule_irq() (a sketch follows this entry).

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
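
    A trimmed sketch of where the call lands (the body of
    preempt_schedule_irq() is abridged): exit the user extended quiescent
    state before any kernel code that might contain RCU read-side critical
    sections runs.

      asmlinkage void __sched preempt_schedule_irq(void)
      {
          struct thread_info *ti = current_thread_info();

          /* catch callers which need to be fixed */
          BUG_ON(ti->preempt_count || !irqs_disabled());

          rcu_user_exit();        /* we may still be in RCU user mode here */

          do {
              add_preempt_count(PREEMPT_ACTIVE);
              local_irq_enable();
              __schedule();
              local_irq_disable();
              sub_preempt_count(PREEMPT_ACTIVE);
              barrier();
          } while (need_resched());
      }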
     
  • Clear the syscalls hook of a task when it's scheduled out so that if
    the task migrates, it doesn't run the syscall slow path on a CPU
    that might not need it.

    Also set the syscalls hook on the next task if needed (a hypothetical
    sketch follows this entry).

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
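
    A purely hypothetical sketch of the hand-off; the flag and helper names
    below are illustrative and not the patch's. The idea: drop the per-task
    syscall slow-path flag from the outgoing task and arm it on the incoming
    one only if this CPU wants the hook.

      /* TIF_SYSCALL_HOOK and cpu_wants_syscall_hook() are made-up names */
      static inline void syscall_hook_task_switch(struct task_struct *prev,
                                                  struct task_struct *next)
      {
          clear_tsk_thread_flag(prev, TIF_SYSCALL_HOOK);

          if (cpu_wants_syscall_hook(smp_processor_id()))
              set_tsk_thread_flag(next, TIF_SYSCALL_HOOK);
      }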
     
  • Resolved conflict in kernel/sched/core.c using Peter Zijlstra's
    approach from https://lkml.org/lkml/2012/9/5/585.

    Paul E. McKenney
     

25 Sep, 2012

2 commits

  • Move the code that works out to which context we account the
    cputime into the generic layer.

    Archs that consider the whole time spent in the idle task as idle
    time (ia64, powerpc) can rely on the generic vtime_account()
    and implement vtime_account_system() and vtime_account_idle(),
    letting the generic code decide when to call which API (a sketch
    follows this entry).

    Archs that have their own meaning of idle time, such as s390,
    which only considers the time spent in CPU low power mode as idle
    time, can just override vtime_account().

    Signed-off-by: Frederic Weisbecker
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
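
    A sketch of the generic dispatcher this describes: the context decision
    (system vs. idle) lives in common code, while vtime_account_system() and
    vtime_account_idle() stay arch-provided, and an arch such as s390 can
    still override vtime_account() entirely.

      void vtime_account(struct task_struct *tsk)
      {
          unsigned long flags;

          local_irq_save(flags);

          if (in_interrupt() || !is_idle_task(tsk))
              vtime_account_system(tsk);      /* arch-provided */
          else
              vtime_account_idle(tsk);        /* arch-provided */

          local_irq_restore(flags);
      }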
     
  • Use a naming based on vtime as a prefix for virtual based
    cputime accounting APIs:

    - account_system_vtime() -> vtime_account()
    - account_switch_vtime() -> vtime_task_switch()

    This makes it easier to add further variants such
    as vtime_account_system(), vtime_account_idle(), ... if we
    want to work out, from generic code, the context we account to.

    It also makes it clearer which subsystem these APIs
    belong to.

    Signed-off-by: Frederic Weisbecker
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

23 Sep, 2012

1 commit

  • Rakib and Paul reported two different issues related to the same few
    lines of code.

    Rakib's issue is that the nr_uninterruptible migration code is wrong in
    that he sees artifacts due to this (Rakib, please do expand in more
    detail).

    Paul's issue is that this code as it stands relies on us using
    stop_machine() for unplug; we would all like to remove this assumption
    so that eventually we can remove this stop_machine() usage altogether.

    The only reason we'd have to migrate nr_uninterruptible is so that we
    could use for_each_online_cpu() loops in favour of
    for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
    only such loop and it is already using possible, let's not bother at all.

    The problem Rakib sees is (probably) caused by the fact that by
    migrating nr_uninterruptible we screw up rq->calc_load_active for both
    rqs involved.

    So don't bother with fancy migration schemes (meaning we now have to
    keep using for_each_possible_cpu()) and instead fold any nr_active delta
    after we migrate all tasks away, to make sure we don't have any skewed
    nr_active accounting (a sketch of the fold follows this entry).

    [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
    miscounting noted by Rakib. ]

    Reported-by: Rakib Mullick
    Reported-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Peter Zijlstra
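
    A sketch of the fold described above: once the dead CPU's tasks have been
    migrated away, any remaining nr_active delta is pushed into the global
    count instead of being migrated between runqueues.

      static void calc_load_migrate(struct rq *rq)
      {
          long delta = calc_load_fold_active(rq);

          if (delta)
              atomic_long_add(delta, &calc_load_tasks);
      }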
     

17 Sep, 2012

1 commit

  • This reverts commit 970e178985cadbca660feb02f4d2ee3a09f7fdda.

    Nikolay Ulyanitsky reported that the 3.6-rc5 kernel has a 15-20%
    performance drop on PostgreSQL 9.2 on his machine (running "pgbench").

    Borislav Petkov was able to reproduce this, and bisected it to this
    commit 970e178985ca ("sched: Improve scalability via 'CPU buddies' ...")
    apparently because the new single-idle-buddy model simply doesn't find
    idle CPUs to reschedule on aggressively enough.

    Mike Galbraith suspects that it is likely due to the user-mode spinlocks
    in PostgreSQL not reacting well to preemption, but we don't really know
    the details - I'll just revert the commit for now.

    There are hopefully other approaches to improve scheduler scalability
    without it causing these kinds of downsides.

    Reported-by: Nikolay Ulyanitsky
    Bisected-by: Borislav Petkov
    Acked-by: Mike Galbraith
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Sep, 2012

5 commits

  • Heterogeneous ARM platforms use the arch_scale_freq_power function
    to reflect the relative capacity of each core (a sketch follows this
    entry).

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1341826026-6504-6-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
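
    A hypothetical arch-side sketch (the per-cpu table name 'cpu_capacity' is
    an assumption, not taken from the patch): the hook returns each core's
    relative capacity, pre-scaled against SCHED_POWER_SCALE, so the load
    balancer weighs big cores more heavily than little ones.

      /* illustrative only: 'cpu_capacity' is a made-up per-cpu table */
      static DEFINE_PER_CPU(unsigned long, cpu_capacity) = SCHED_POWER_SCALE;

      unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
      {
          return per_cpu(cpu_capacity, cpu);
      }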
     
  • There is no load_balancer to be selected now. It just sets the
    state of the nohz tick to stop.

    So rename the function, pass the 'cpu' as a parameter and then
    remove the useless call from tick_nohz_restart_sched_tick().

    [ s/set_nohz_tick_stopped/nohz_balance_enter_idle/g
    s/clear_nohz_tick_stopped/nohz_balance_exit_idle/g ]
    Signed-off-by: Alex Shi
    Acked-by: Suresh Siddha
    Cc: Venkatesh Pallipadi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1347261059-24747-1-git-send-email-alex.shi@intel.com
    Signed-off-by: Ingo Molnar

    Alex Shi
     
  • Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
    incomplete fix:

    In particular, the problem is that at the point it calls
    calc_load_migrate(), nr_running := 1 (the stopper thread), so move the
    call to CPU_DEAD where we're sure that nr_running := 0 (see the notifier
    sketch after this entry).

    Also note that we can call calc_load_migrate() without serialization; we
    know the state of the rq is stable since its cpu is dead, and we modify
    the global state using appropriate atomic ops.

    Suggested-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
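
    A trimmed sketch of the placement in the migration notifier (heavily
    abridged; only the relevant case is shown): by CPU_DEAD even the stopper
    thread is gone, so folding the load average there is exact.

      static int migration_call(struct notifier_block *nfb,
                                unsigned long action, void *hcpu)
      {
          struct rq *rq = cpu_rq((long)hcpu);

          switch (action & ~CPU_TASKS_FROZEN) {
          case CPU_DEAD:
              calc_load_migrate(rq);  /* nr_running is really 0 here */
              break;
          }
          return NOTIFY_OK;
      }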
     
  • Now that the last architecture to use this has stopped doing so (ARM,
    thanks Catalin!) we can remove this complexity from the scheduler
    core.

    Signed-off-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Catalin Marinas
    Link: http://lkml.kernel.org/n/tip-g9p2a1w81xxbrze25v9zpzbf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • On tickless systems, one CPU runs load balance for all idle CPUs.

    The cpu_load of this CPU is updated before starting the load balance
    of each other idle CPU. We should instead update the cpu_load of
    the balance_cpu (a sketch follows this entry).

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra
    Cc: Venkatesh Pallipadi
    Cc: Suresh Siddha
    Link: http://lkml.kernel.org/r/1347509486-8688-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
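
    A sketch of the fix inside nohz_idle_balance(): refresh the clock and
    cpu_load of the CPU being balanced rather than of the CPU doing the
    balancing.

      struct rq *rq = cpu_rq(balance_cpu);

      raw_spin_lock_irq(&rq->lock);
      update_rq_clock(rq);
      update_idle_cpu_load(rq);       /* previously refreshed this_rq's load */
      raw_spin_unlock_irq(&rq->lock);

      rebalance_domains(balance_cpu, CPU_IDLE);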
     

04 Sep, 2012

7 commits

  • It's impossible to enter the else branch if we have set
    skip_clock_update in task_yield_fair(), as yield_to_task_fair()
    will directly return true after invoking task_yield_fair().

    Signed-off-by: Michael Wang
    Acked-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FF2925A.9060005@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Michael Wang
     
  • The various sd->*_idx values are used to refer to the rq's load average
    table when selecting a cpu to run. However, they can be set to any number
    via sysctl knobs, which can crash the kernel if a bad value is given.
    Fix it by limiting them to the valid range (see the sketch after this
    entry).

    Signed-off-by: Namhyung Kim
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1345104204-8317-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Ingo Molnar

    Namhyung Kim
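
    The patch bounds the values at the sysctl level; an equivalent guard, as a
    purely hypothetical sketch (the helper name is made up), would clamp any
    supplied index before it touches rq->cpu_load[]:

      /* hypothetical: keep any sysctl-supplied index inside the table */
      static inline int bounded_load_idx(int idx)
      {
          return clamp(idx, 0, CPU_LOAD_IDX_MAX - 1);
      }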
     
  • Commit beac4c7e4a1c ("sched: Remove AFFINE_WAKEUPS feature") removed
    use of the flag but left the definition. Get rid of it.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Peter Zijlstra
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/1345090865-20851-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Ingo Molnar

    Namhyung Kim
     
  • Merge in the current fixes branch, we are going to apply dependent patches.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Fix two kernel-doc warnings in kernel/sched/fair.c:

    Warning(kernel/sched/fair.c:3660): Excess function parameter 'cpus' description in 'update_sg_lb_stats'
    Warning(kernel/sched/fair.c:3806): Excess function parameter 'cpus' description in 'update_sd_lb_stats'

    Signed-off-by: Randy Dunlap
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/50303714.3090204@xenotime.net
    Signed-off-by: Ingo Molnar

    Randy Dunlap
     
  • migrate_tasks() uses _pick_next_task_rt() to get tasks from the
    real-time runqueues to be migrated. When the rt_rq is throttled,
    _pick_next_task_rt() won't return anything, in which case
    migrate_tasks() can't move all threads over and gets stuck in an
    infinite loop.

    Instead unthrottle rt runqueues before migrating tasks.

    Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()

    Signed-off-by: Peter Boonstoppel
    Signed-off-by: Peter Zijlstra
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com
    Signed-off-by: Ingo Molnar

    Peter Boonstoppel
     
  • Rakib and Paul reported two different issues related to the same few
    lines of code.

    Rakib's issue is that the nr_uninterruptible migration code is wrong in
    that he sees artifacts due to this (Rakib, please do expand in more
    detail).

    Paul's issue is that this code as it stands relies on us using
    stop_machine() for unplug; we would all like to remove this assumption
    so that eventually we can remove this stop_machine() usage altogether.

    The only reason we'd have to migrate nr_uninterruptible is so that we
    could use for_each_online_cpu() loops in favour of
    for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
    only such loop and it is already using possible, let's not bother at all.

    The problem Rakib sees is (probably) caused by the fact that by
    migrating nr_uninterruptible we screw up rq->calc_load_active for both
    rqs involved.

    So don't bother with fancy migration schemes (meaning we now have to
    keep using for_each_possible_cpu()) and instead fold any nr_active delta
    after we migrate all tasks away, to make sure we don't have any skewed
    nr_active accounting.

    Reported-by: Rakib Mullick
    Reported-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Aug, 2012

2 commits

  • The archs that implement virtual cputime accounting all
    flush the cputime of a task when it gets descheduled,
    and sometimes set up some initial state so that the
    next task's cputime can be accounted.

    These archs all put their own hooks in their context
    switch callbacks and handle the off-case themselves.

    Consolidate this by creating a new account_switch_vtime()
    callback, called from generic code right after a context switch,
    that these archs must implement to flush the prev task's
    cputime and initialize the next task's cputime-related state
    (a sketch follows this entry).

    Signed-off-by: Frederic Weisbecker
    Acked-by: Martin Schwidefsky
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
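
    A sketch of the hook's shape: archs with CONFIG_VIRT_CPU_ACCOUNTING supply
    the implementation, everyone else gets an empty stub, and generic code
    calls it right after the context switch (e.g. from finish_task_switch()).

      #ifdef CONFIG_VIRT_CPU_ACCOUNTING
      extern void account_switch_vtime(struct task_struct *prev);
      #else
      static inline void account_switch_vtime(struct task_struct *prev) { }
      #endif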
     
  • Extract the cputime code from the giant sched/core.c and
    put it in its own file. This makes it easier to deal with
    this particular area and de-bloats core.c a bit more.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Martin Schwidefsky
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

14 Aug, 2012

10 commits

  • Since the power-saving code has been removed from the scheduler, its
    implementation in this function is dead code and even pollutes other
    logic: for example, 'want_sd' never gets a chance to be set to 0, which
    removes the effect of SD_WAKE_AFFINE here.

    So, clean up the obsolete code, including SD_PREFER_LOCAL.

    Signed-off-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/5028F431.6000306@intel.com
    Signed-off-by: Thomas Gleixner

    Alex Shi
     
  • As we already have dst_rq in lb_env, using or changing "this_rq" does not
    make sense.

    This patch replaces "this_rq" with dst_rq in load_balance(), so we
    no longer need to change "this_rq" while processing LBF_SOME_PINNED.

    Signed-off-by: Michael Wang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/501F8357.3070102@linux.vnet.ibm.com
    Signed-off-by: Thomas Gleixner

    Michael Wang
     
  • This patch adds a comment on top of the schedule() function to explain
    to scheduler newbies how the main scheduler function is entered.

    Acked-by: Randy Dunlap
    Explained-by: Ingo Molnar
    Explained-by: Peter Zijlstra
    Signed-off-by: Pekka Enberg
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344070187-2420-1-git-send-email-penberg@kernel.org
    Signed-off-by: Thomas Gleixner

    Pekka Enberg
     
  • It should be sched_nr_latency so fix it before it annoys me more.

    Signed-off-by: Borislav Petkov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344435364-18632-1-git-send-email-bp@amd64.org
    Signed-off-by: Thomas Gleixner

    Borislav Petkov
     
  • Thomas Gleixner
     
  • Make the stop scheduler class do the same accounting as the other classes
    (a sketch follows this entry).

    Migration threads can be caught in the act while doing exec balancing,
    leading to the output below due to use of an unmaintained ->se.exec_start.
    The load that triggered this particular instance was an apparently
    out-of-control, heavily threaded application doing system monitoring in
    what amounted to an exec bomb, with one of the VERY frequently migrated
    tasks being ps.

    %CPU PID USER CMD
    99.3 45 root [migration/10]
    97.7 53 root [migration/12]
    97.0 57 root [migration/13]
    90.1 49 root [migration/11]
    89.6 65 root [migration/15]
    88.7 17 root [migration/3]
    80.4 37 root [migration/8]
    78.1 41 root [migration/9]
    44.2 13 root [migration/2]

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344051854.6739.19.camel@marge.simpson.net
    Signed-off-by: Thomas Gleixner

    Mike Galbraith
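
    A hedged sketch of the accounting the stop class gains, modeled on what
    the fair and rt classes already do when a task is switched out (simplified,
    not the literal patch):

      static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
      {
          u64 delta_exec = rq->clock_task - prev->se.exec_start;

          if (unlikely((s64)delta_exec < 0))
              delta_exec = 0;

          prev->se.sum_exec_runtime += delta_exec;
          account_group_exec_runtime(prev, delta_exec);
          prev->se.exec_start = rq->clock_task;   /* keep exec_start maintained */
      }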
     
  • Root task group bandwidth replenishment must service all CPUs, regardless of
    where the timer was last started, and regardless of the isolation mechanism,
    lest 'Quoth the Raven, "Nevermore"' become rt scheduling policy.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344326558.6968.25.camel@marge.simpson.net
    Signed-off-by: Thomas Gleixner

    Mike Galbraith
     
  • With multiple instances of task_groups, for_each_rt_rq() is a noop,
    no task groups having been added to the rt.c list instance. This
    renders __enable/disable_runtime() and print_rt_stats() noop, the
    user (non) visible effect being that rt task groups are missing in
    /proc/sched_debug.

    Signed-off-by: Mike Galbraith
    Cc: stable@kernel.org # v3.3+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
    Signed-off-by: Thomas Gleixner

    Mike Galbraith
     
  • On architectures where cputime_t is a 64-bit type, it is possible to
    trigger a divide by zero on the do_div(temp, (__force u32) total) line if
    total is non-zero but has its lower 32 bits zeroed. Removing the cast is
    not a good solution, since some do_div() implementations cast to u32
    internally; a sketch of a 64-bit-safe scaling helper follows this entry.

    This problem can be triggered in practice on very long-lived processes:

    PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin"
    #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
    #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
    #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
    #3 [ffff880472a51cd0] die at ffffffff8100f26b
    #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
    #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
    #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
    [exception RIP: thread_group_times+0x56]
    RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046
    RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad
    RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc
    RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800
    R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08
    R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
    #8 [ffff880472a51f40] sys_times at ffffffff81088524
    #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
    RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202
    RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000
    RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0
    RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b
    R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940
    R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940
    ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b

    Cc: stable@vger.kernel.org
    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
    Signed-off-by: Thomas Gleixner

    Stanislaw Gruszka
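
    A sketch of the 64-bit-safe scaling the description calls for: do_div()
    truncates the divisor to u32, so divide with full 64-bit math when
    cputime_t is 64 bits (div_u64()/div64_u64() are the stock kernel helpers).

      static cputime_t scale_utime(cputime_t utime, cputime_t rtime, cputime_t total)
      {
          u64 temp = (__force u64) rtime;

          temp *= (__force u64) utime;

          if (sizeof(cputime_t) == 4)
              /* 32-bit cputime_t: a u32 divisor cannot silently become 0 */
              temp = div_u64(temp, (__force u32) total);
          else
              /* 64-bit cputime_t: keep the full divisor */
              temp = div64_u64(temp, (__force u64) total);

          return (__force cputime_t) temp;
      }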
     
  • Peter Portante reported that for large cgroup hierarchies (and/or on
    large CPU counts) we get immense lock contention on rq->lock and stuff
    stops working properly.

    His workload was a ton of processes, each in their own cgroup,
    everybody idling except for a sporadic wakeup once every so often.

    It was found that:

    schedule()
      idle_balance()
        load_balance()
          local_irq_save()
          double_rq_lock()
          update_h_load()
            walk_tg_tree(tg_load_down)
              tg_load_down()

    Results in an entire cgroup hierarchy walk under rq->lock for every
    new-idle balance, and since new-idle balance isn't throttled this
    results in a lot of work while holding the rq->lock.

    This patch does two things: it removes the work from under rq->lock,
    based on the good principle of race-and-pray which is widely employed
    in the load balancer as a whole, and it throttles the
    update_h_load() calculation to at most once per jiffy (a sketch follows
    this entry).

    I considered excluding update_h_load() for new-idle balance
    all-together, but purely relying on regular balance passes to update
    this data might not work out under some rare circumstances where the
    new-idle busiest isn't the regular busiest for a while (unlikely, but
    a nightmare to debug if someone hits it and suffers).

    Cc: pjt@google.com
    Cc: Larry Woodman
    Cc: Mike Galbraith
    Reported-by: Peter Portante
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-aaarrzfpnaam7pqrekofu8a6@git.kernel.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
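
    A sketch of the once-per-jiffy throttle (the rq field name follows the
    description and should be treated as illustrative): the expensive
    walk_tg_tree() hierarchy walk is skipped if it already ran in the current
    jiffy.

      static void update_h_load(long cpu)
      {
          struct rq *rq = cpu_rq(cpu);
          unsigned long now = jiffies;

          if (rq->h_load_throttle == now)
              return;                 /* already refreshed this jiffy */

          rq->h_load_throttle = now;

          rcu_read_lock();
          walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
          rcu_read_unlock();
      }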
     

04 Aug, 2012

1 commit