12 Oct, 2012

2 commits

  • Pull scheduler fixes from Ingo Molnar:
    "A CPU hotplug related crash fix and a nohz accounting fixlet."

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Update sched_domains_numa_masks[][] when new cpus are onlined
    sched: Ensure 'sched_domains_numa_levels' is safe to use in other functions
    nohz: Fix one jiffy count too far in idle cputime

    Linus Torvalds
     
  • Pull pile 2 of execve and kernel_thread unification work from Al Viro:
    "Stuff in there: kernel_thread/kernel_execve/sys_execve conversions for
    several more architectures plus assorted signal fixes and cleanups.

    There'll be more (in particular, real fixes for the alpha
    do_notify_resume() irq mess)..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (43 commits)
    alpha: don't open-code trace_report_syscall_{enter,exit}
    Uninclude linux/freezer.h
    m32r: trim masks
    avr32: trim masks
    tile: don't bother with SIGTRAP in setup_frame
    microblaze: don't bother with SIGTRAP in setup_rt_frame()
    mn10300: don't bother with SIGTRAP in setup_frame()
    frv: no need to raise SIGTRAP in setup_frame()
    x86: get rid of duplicate code in case of CONFIG_VM86
    unicore32: remove pointless test
    h8300: trim _TIF_WORK_MASK
    parisc: decide whether to go to slow path (tracesys) based on thread flags
    parisc: don't bother looping in do_signal()
    parisc: fix double restarts
    bury the rest of TIF_IRET
    sanitize tsk_is_polling()
    bury _TIF_RESTORE_SIGMASK
    unicore32: unobfuscate _TIF_WORK_MASK
    mips: NOTIFY_RESUME is not needed in TIF masks
    mips: merge the identical "return from syscall" per-ABI code
    ...

    Conflicts:
    arch/arm/include/asm/thread_info.h

    Linus Torvalds
     

05 Oct, 2012

2 commits

  • Once array sched_domains_numa_masks[][] is defined, it is never updated.

    When a new cpu on a new node is onlined, the corresponding member in
    sched_domains_numa_masks[][] is not initialized, and all the masks are 0.
    As a result, build_overlap_sched_groups() will initialize a NULL
    sched_group for the new cpu on the new node, which leads to a kernel panic:

    [ 3189.403280] Call Trace:
    [ 3189.403286] [] warn_slowpath_common+0x7f/0xc0
    [ 3189.403289] [] warn_slowpath_null+0x1a/0x20
    [ 3189.403292] [] build_sched_domains+0x467/0x470
    [ 3189.403296] [] partition_sched_domains+0x307/0x510
    [ 3189.403299] [] ? partition_sched_domains+0x142/0x510
    [ 3189.403305] [] cpuset_update_active_cpus+0x83/0x90
    [ 3189.403308] [] cpuset_cpu_active+0x38/0x70
    [ 3189.403316] [] notifier_call_chain+0x67/0x150
    [ 3189.403320] [] ? native_cpu_up+0x18a/0x1b5
    [ 3189.403328] [] __raw_notifier_call_chain+0xe/0x10
    [ 3189.403333] [] __cpu_notify+0x20/0x40
    [ 3189.403337] [] _cpu_up+0xe9/0x131
    [ 3189.403340] [] cpu_up+0xdb/0xee
    [ 3189.403348] [] store_online+0x9c/0xd0
    [ 3189.403355] [] dev_attr_store+0x20/0x30
    [ 3189.403361] [] sysfs_write_file+0xa3/0x100
    [ 3189.403368] [] vfs_write+0xd0/0x1a0
    [ 3189.403371] [] sys_write+0x54/0xa0
    [ 3189.403375] [] system_call_fastpath+0x16/0x1b
    [ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
    [ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018

    This patch registers a new notifier for the cpu hotplug notify chain, and
    updates sched_domains_numa_masks every time a new cpu is onlined or
    offlined (see the sketch below).

    Signed-off-by: Tang Chen
    Signed-off-by: Wen Congyang
    [ fixed compile warning ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1348578751-16904-3-git-send-email-tangchen@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Tang Chen
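
    A minimal sketch of the notifier approach described in this commit
    (editor's illustration based on the existing sched_domains_numa_masks[][]
    and sched_domains_numa_distance[] arrays; not the literal patch):

    static void sched_domains_numa_masks_set(int cpu)
    {
            int i, j, node = cpu_to_node(cpu);

            for (i = 0; i < sched_domains_numa_levels; i++)
                    for (j = 0; j < nr_node_ids; j++)
                            /* the cpu joins every mask whose node is close enough */
                            if (node_distance(j, node) <= sched_domains_numa_distance[i])
                                    cpumask_set_cpu(cpu, sched_domains_numa_masks[i][j]);
    }

    static int sched_domains_numa_masks_update(struct notifier_block *nfb,
                                               unsigned long action, void *hcpu)
    {
            int cpu = (long)hcpu;

            switch (action & ~CPU_TASKS_FROZEN) {
            case CPU_ONLINE:
                    sched_domains_numa_masks_set(cpu);
                    break;
            case CPU_DEAD:
                    /* mirror image: clear this cpu from the masks again */
                    break;
            default:
                    return NOTIFY_DONE;
            }
            return NOTIFY_OK;
    }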
     
  • We should temporarily reset 'sched_domains_numa_levels' to 0 while
    sched_domains_numa_masks[][] is being built in sched_init_numa(). If
    allocating memory for the array fails, the array will contain fewer than
    'level' members, which is dangerous when other functions use the count
    to iterate over sched_domains_numa_masks[][].

    This patch sets sched_domains_numa_levels to 0 before initializing the
    sched_domains_numa_masks[][] array, and resets it to 'level' only once
    sched_domains_numa_masks[][] is fully initialized (see the sketch below).

    Signed-off-by: Tang Chen
    Signed-off-by: Wen Congyang
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1348578751-16904-2-git-send-email-tangchen@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Tang Chen
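
    A sketch of the intended ordering in sched_init_numa() (editor's
    illustration; nr_levels_detected() and alloc_node_masks() are hypothetical
    helpers standing in for the real allocation code):

    static void sched_init_numa(void)
    {
            int i, level = nr_levels_detected();

            /* Report 0 levels until the masks are fully built. */
            sched_domains_numa_levels = 0;

            sched_domains_numa_masks = kzalloc(sizeof(void *) * level, GFP_KERNEL);
            if (!sched_domains_numa_masks)
                    return;                 /* count stays 0, array never used */

            for (i = 0; i < level; i++) {
                    sched_domains_numa_masks[i] = alloc_node_masks(i);
                    if (!sched_domains_numa_masks[i])
                            return;         /* partially built, count still 0 */
            }

            /* Only now is it safe for other code to walk the array. */
            sched_domains_numa_levels = level;
    }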
     

02 Oct, 2012

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Continued quest to clean up and enhance the cputime code by Frederic
    Weisbecker, in preparation for future tickless kernel features.

    Other than that, smallish changes."

    Fix up trivial conflicts due to additions next to each other in arch/{x86/}Kconfig

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    cputime: Make finegrained irqtime accounting generally available
    cputime: Gather time/stats accounting config options into a single menu
    ia64: Reuse system and user vtime accounting functions on task switch
    ia64: Consolidate user vtime accounting
    vtime: Consolidate system/idle context detection
    cputime: Use a proper subsystem naming for vtime related APIs
    sched: cpu_power: enable ARCH_POWER
    sched/nohz: Clean up select_nohz_load_balancer()
    sched: Fix load avg vs. cpu-hotplug
    sched: Remove __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: Fix nohz_idle_balance()
    sched: Remove useless code in yield_to()
    sched: Add time unit suffix to sched sysctl knobs
    sched/debug: Limit sd->*_idx range on sysctl
    sched: Remove AFFINE_WAKEUPS feature flag
    s390: Remove leftover account_tick_vtime() header
    cputime: Consolidate vtime handling on context switch
    sched: Move cputime code to its own file
    cputime: Generalize CONFIG_VIRT_CPU_ACCOUNTING
    tile: Remove SD_PREFER_LOCAL leftover
    ...

    Linus Torvalds
     

01 Oct, 2012

1 commit

  • Make the default just return 0. The current default (checking
    TIF_POLLING_NRFLAG) is moved into the architectures that need it;
    architectures that don't do polling in their idle threads don't need
    to define TIF_POLLING_NRFLAG at all.

    ia64 defined both TS_POLLING (used by its tsk_is_polling())
    and TIF_POLLING_NRFLAG (not used at all). Killed the latter...

    Signed-off-by: Al Viro

    Al Viro
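
    The resulting default, roughly (editor's sketch; architectures that poll
    in their idle loop override this in their own headers):

    /* default in the scheduler's private header when the arch does not poll */
    #ifndef tsk_is_polling
    #define tsk_is_polling(t) 0
    #endif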
     

26 Sep, 2012

4 commits

  • When exceptions or irq are about to resume userspace, if
    the task needs to be rescheduled, the arch low level code
    calls schedule() directly.

    If we call it, it is because we have the TIF_RESCHED flag:

    - It can be set after random local calls to set_need_resched()
    (RCU, drm, ...)

    - A wake up happened and the CPU needs preemption. This can
    happen in several ways:

    * Remotely: the remote waking CPU has set TIF_RESCHED and sends the
    wakee an IPI to schedule the new task.
    * Remotely enqueued: the remote waking CPU sends an IPI to the target
    and the wake up is made by the target.
    * Locally: waking CPU == wakee CPU and the wakeup is done locally.
    set_need_resched() is called without IPI.

    In the case of local and remotely enqueued wake ups, the tick can
    be restarted when we enqueue the new task, and RCU can exit the
    extended quiescent state at the same time. Then by the time we reach
    the irq exit path and call schedule(), we are no longer in RCU user mode.

    But if we call schedule() only because something called set_need_resched(),
    RCU may still be in user mode when we reach schedule().

    Also, if a wake up is done remotely, the CPU might see the TIF_RESCHED
    flag and call schedule() while the IPI has not yet arrived to restart the
    tick and exit RCU user mode.

    We need to manually protect against these corner cases.

    Create a new API schedule_user() that calls schedule() inside an
    rcu_user_exit()-rcu_user_enter() pair in order to protect it (see the
    sketch below). Archs will need to rely on it now to implement user
    preemption safely.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
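
    A minimal sketch of the wrapper (editor's illustration):

    /*
     * Called from arch code on the resume-to-userspace path instead of
     * schedule(), so RCU knows we are back in the kernel before the
     * scheduler runs any kernel code.
     */
    asmlinkage void __sched schedule_user(void)
    {
            rcu_user_exit();        /* leave the userspace extended quiescent state */
            schedule();
            rcu_user_enter();       /* re-enter it before resuming userspace */
    }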
     
  • When an exception or an irq exits, and we are going to resume into
    interrupted kernel code, the low level architecture code calls
    preempt_schedule_irq() if there is a need to reschedule.

    If the interrupt/exception occurred between a call to rcu_user_enter()
    (from syscall exit, exception exit, do_notify_resume exit, ...) and
    a real resume to userspace (iret, ...), preempt_schedule_irq() can be
    called while RCU thinks we are in userspace. But preempt_schedule_irq()
    is going to run kernel code and may run RCU read-side critical
    sections. We must exit the userspace extended quiescent state before
    we call it.

    To solve this, just call rcu_user_exit() at the beginning of
    preempt_schedule_irq() (see the sketch below).

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
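
    A sketch of where the new call sits (editor's abbreviation of
    preempt_schedule_irq(); unrelated details omitted):

    asmlinkage void __sched preempt_schedule_irq(void)
    {
            /* Catch callers which need to be fixed */
            BUG_ON(preempt_count() || !irqs_disabled());

            rcu_user_exit();        /* kernel code (and RCU reads) follow */

            do {
                    add_preempt_count(PREEMPT_ACTIVE);
                    local_irq_enable();
                    __schedule();
                    local_irq_disable();
                    sub_preempt_count(PREEMPT_ACTIVE);

                    /* re-check in case we missed another preemption point */
                    barrier();
            } while (need_resched());
    }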
     
  • Clear the syscalls hook of a task when it's scheduled out so that if
    the task migrates, it doesn't run the syscall slow path on a CPU
    that might not need it.

    Also set the syscalls hook on the next task if needed.

    Signed-off-by: Frederic Weisbecker
    Cc: Alessio Igor Bogani
    Cc: Andrew Morton
    Cc: Avi Kivity
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Josh Triplett
    Cc: Kevin Hilman
    Cc: Max Krasnyansky
    Cc: Peter Zijlstra
    Cc: Stephen Hemminger
    Cc: Steven Rostedt
    Cc: Sven-Thorsten Dietrich
    Cc: Thomas Gleixner
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Frederic Weisbecker
     
  • Resolved conflict in kernel/sched/core.c using Peter Zijlstra's
    approach from https://lkml.org/lkml/2012/9/5/585.

    Paul E. McKenney
     

25 Sep, 2012

1 commit

  • Use a vtime-based naming as the prefix for the virtual cputime
    accounting APIs:

    - account_system_vtime() -> vtime_account()
    - account_switch_vtime() -> vtime_task_switch()

    This makes it easier to add further variants such as
    vtime_account_system(), vtime_account_idle(), ... if we want to find
    out, from generic code, which context we account to.

    It also makes it clearer which subsystem these APIs belong to (see the
    summary below).

    Signed-off-by: Frederic Weisbecker
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
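
    The renames, summarized as declarations (editor's sketch; signatures as
    the editor understands them):

    /* before */
    extern void account_system_vtime(struct task_struct *tsk);
    extern void account_switch_vtime(struct task_struct *prev);

    /* after: a common vtime_ prefix, leaving room for vtime_account_system(),
     * vtime_account_idle(), ... later on */
    extern void vtime_account(struct task_struct *tsk);
    extern void vtime_task_switch(struct task_struct *prev);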
     

23 Sep, 2012

1 commit

  • Rakib and Paul reported two different issues related to the same few
    lines of code.

    Rakib's issue is that the nr_uninterruptible migration code is wrong in
    that he sees artifacts due to it (Rakib, please do expand in more
    detail).

    Paul's issue is that this code as it stands relies on us using
    stop_machine() for unplug, we all would like to remove this assumption
    so that eventually we can remove this stop_machine() usage altogether.

    The only reason we'd have to migrate nr_uninterruptible is so that we
    could use for_each_online_cpu() loops instead of
    for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
    only such loop and it's using the possible mask, let's not bother at all.

    The problem Rakib sees is (probably) caused by the fact that by
    migrating nr_uninterruptible we screw up rq->calc_load_active for both
    rqs involved.

    So don't bother with fancy migration schemes (meaning we now have to
    keep using for_each_possible_cpu()) and instead fold any nr_active delta
    after we migrate all tasks away, to make sure we don't have any skewed
    nr_active accounting (see the sketch below).

    [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
    miscounting noted by Rakib. ]

    Reported-by: Rakib Mullick
    Reported-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney

    Peter Zijlstra
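
    A sketch of the fold-on-offline idea (editor's illustration, close to but
    not necessarily the exact patch):

    /*
     * After all tasks have been migrated away from the dead cpu, fold its
     * remaining nr_active delta into the global count instead of migrating
     * nr_uninterruptible between runqueues.
     */
    static void calc_load_migrate(struct rq *rq)
    {
            long delta = calc_load_fold_active(rq);

            if (delta)
                    atomic_long_add(delta, &calc_load_tasks);
    }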
     

17 Sep, 2012

1 commit

  • This reverts commit 970e178985cadbca660feb02f4d2ee3a09f7fdda.

    Nikolay Ulyanitsky reported that the 3.6-rc5 kernel has a 15-20%
    performance drop on PostgreSQL 9.2 on his machine (running "pgbench").

    Borislav Petkov was able to reproduce this, and bisected it to this
    commit 970e178985ca ("sched: Improve scalability via 'CPU buddies' ...")
    apparently because the new single-idle-buddy model simply doesn't find
    idle CPU's to reschedule on aggressively enough.

    Mike Galbraith suspects that it is likely due to the user-mode spinlocks
    in PostgreSQL not reacting well to preemption, but we don't really know
    the details - I'll just revert the commit for now.

    There are hopefully other approaches to improve scheduler scalability
    without it causing these kinds of downsides.

    Reported-by: Nikolay Ulyanitsky
    Bisected-by: Borislav Petkov
    Acked-by: Mike Galbraith
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Sep, 2012

2 commits

  • Commit f319da0c68 ("sched: Fix load avg vs cpu-hotplug") was an
    incomplete fix:

    In particular, the problem is that at the point it calls
    calc_load_migrate() nr_running := 1 (the stopper thread), so move the
    call to CPU_DEAD where we're sure that nr_running := 0.

    Also note that we can call calc_load_migrate() without serialization: we
    know the state of the rq is stable since its cpu is dead, and we modify
    the global state using appropriate atomic ops (see the sketch below).

    Suggested-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1346882630.2600.59.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
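
    A sketch of the hotplug-notifier placement (editor's abbreviation of
    migration_call(); the other hotplug cases are omitted):

    static int __cpuinit
    migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
    {
            int cpu = (long)hcpu;
            struct rq *rq = cpu_rq(cpu);

            switch (action & ~CPU_TASKS_FROZEN) {
            /* ... other hotplug states ... */
            case CPU_DEAD:
                    /* nr_running is 0 here: even the stopper thread is gone */
                    calc_load_migrate(rq);
                    break;
            }
            return NOTIFY_OK;
    }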
     
  • Now that the last architecture to use this has stopped doing so (ARM,
    thanks Catalin!) we can remove this complexity from the scheduler
    core.

    Signed-off-by: Peter Zijlstra
    Cc: Oleg Nesterov
    Cc: Catalin Marinas
    Link: http://lkml.kernel.org/n/tip-g9p2a1w81xxbrze25v9zpzbf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Sep, 2012

5 commits

  • It's impossible to enter the else branch if we have set
    skip_clock_update in task_yield_fair(), as yield_to_task_fair()
    will directly return true after invoking task_yield_fair().

    Signed-off-by: Michael Wang
    Acked-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4FF2925A.9060005@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Michael Wang
     
  • Various sd->*_idx values are used to index the rq's load average table
    when selecting a cpu to run. However, they can be set to any number
    through sysctl knobs, so a bad value can crash the kernel. Fix it by
    limiting them to the actual range (see the sketch below).

    Signed-off-by: Namhyung Kim
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1345104204-8317-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Ingo Molnar

    Namhyung Kim
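
    One way to express the clamping with the standard sysctl helpers
    (editor's sketch; the placeholder names are not from the patch):

    static int min_load_idx;                          /* 0 */
    static int max_load_idx = CPU_LOAD_IDX_MAX - 1;   /* last valid rq->cpu_load[] slot */
    static int demo_busy_idx;                         /* stand-in for a real sd->busy_idx */

    static struct ctl_table sd_idx_table[] = {
            {
                    .procname     = "busy_idx",
                    .data         = &demo_busy_idx,
                    .maxlen       = sizeof(int),
                    .mode         = 0644,
                    .proc_handler = proc_dointvec_minmax,  /* rejects out-of-range writes */
                    .extra1       = &min_load_idx,
                    .extra2       = &max_load_idx,
            },
            { }
    };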
     
  • Merge in the current fixes branch; we are going to apply dependent patches.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • migrate_tasks() uses _pick_next_task_rt() to get tasks from the
    real-time runqueues to be migrated. When rt_rq is throttled
    _pick_next_task_rt() won't return anything, in which case
    migrate_tasks() can't move all threads over and gets stuck in an
    infinite loop.

    Instead unthrottle rt runqueues before migrating tasks.

    Additionally: move unthrottle_offline_cfs_rqs() to rq_offline_fair()

    Signed-off-by: Peter Boonstoppel
    Signed-off-by: Peter Zijlstra
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/5FBF8E85CA34454794F0F7ECBA79798F379D3648B7@HQMAIL04.nvidia.com
    Signed-off-by: Ingo Molnar

    Peter Boonstoppel
     
  • Rakib and Paul reported two different issues related to the same few
    lines of code.

    Rakib's issue is that the nr_uninterruptible migration code is wrong in
    that he sees artifacts due to it (Rakib, please do expand in more
    detail).

    Paul's issue is that this code as it stands relies on us using
    stop_machine() for unplug, we all would like to remove this assumption
    so that eventually we can remove this stop_machine() usage altogether.

    The only reason we'd have to migrate nr_uninterruptible is so that we
    could use for_each_online_cpu() loops instead of
    for_each_possible_cpu() loops; however, since nr_uninterruptible() is the
    only such loop and it's using the possible mask, let's not bother at all.

    The problem Rakib sees is (probably) caused by the fact that by
    migrating nr_uninterruptible we screw up rq->calc_load_active for both
    rqs involved.

    So don't bother with fancy migration schemes (meaning we now have to
    keep using for_each_possible_cpu()) and instead fold any nr_active delta
    after we migrate all tasks away, to make sure we don't have any skewed
    nr_active accounting.

    Reported-by: Rakib Mullick
    Reported-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1345454817.23018.27.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Aug, 2012

2 commits

  • The archs that implement virtual cputime accounting all
    flush the cputime of a task when it gets descheduled
    and sometimes set up some ground initialization for the
    next task to account its cputime.

    These archs all put their own hooks in their context
    switch callbacks and handle the off-case themselves.

    Consolidate this by creating a new account_switch_vtime() callback,
    called in generic code right after a context switch, which these archs
    must implement to flush the prev task's cputime and initialize the next
    task's cputime-related state (see the sketch below).

    Signed-off-by: Frederic Weisbecker
    Acked-by: Martin Schwidefsky
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
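
    A sketch of the consolidation (editor's illustration; header and file
    placement assumed): the generic context-switch tail calls one hook, and
    architectures without virtual cputime accounting get an empty stub.

    #ifdef CONFIG_VIRT_CPU_ACCOUNTING
    extern void account_switch_vtime(struct task_struct *prev);
    #else
    static inline void account_switch_vtime(struct task_struct *prev) { }
    #endif

    /* in the scheduler core, right after a context switch */
    static void finish_task_switch(struct rq *rq, struct task_struct *prev)
    {
            /* ... */
            account_switch_vtime(prev);     /* flush prev, prime the next task */
            /* ... */
    }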
     
  • Extract the cputime code from the giant sched/core.c and
    put it in its own file. This makes it easier to deal with
    this particular area and de-bloats core.c a bit more.

    Signed-off-by: Frederic Weisbecker
    Acked-by: Martin Schwidefsky
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra

    Frederic Weisbecker
     

14 Aug, 2012

5 commits

  • Since the power-saving code has been removed from the scheduler, the
    related code in this function is dead and even pollutes other logic:
    for example, 'want_sd' never gets a chance to be set to 0, which
    defeats the effect of SD_WAKE_AFFINE here.

    So, clean up the obsolete code, including SD_PREFER_LOCAL.

    Signed-off-by: Alex Shi
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/5028F431.6000306@intel.com
    Signed-off-by: Thomas Gleixner

    Alex Shi
     
  • This patch adds a comment on top of the schedule() function to explain
    to scheduler newbies how the main scheduler function is entered.

    Acked-by: Randy Dunlap
    Explained-by: Ingo Molnar
    Explained-by: Peter Zijlstra
    Signed-off-by: Pekka Enberg
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344070187-2420-1-git-send-email-penberg@kernel.org
    Signed-off-by: Thomas Gleixner

    Pekka Enberg
     
  • Thomas Gleixner
     
  • With multiple instances of task_groups, for_each_rt_rq() is a noop,
    no task groups having been added to the rt.c list instance. This
    renders __enable/disable_runtime() and print_rt_stats() noop, the
    user (non) visible effect being that rt task groups are missing in
    /proc/sched_debug.

    Signed-off-by: Mike Galbraith
    Cc: stable@kernel.org # v3.3+
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1344308413.6846.7.camel@marge.simpson.net
    Signed-off-by: Thomas Gleixner

    Mike Galbraith
     
  • On architectures where cputime_t is a 64-bit type, it is possible to
    trigger a divide-by-zero on the do_div(temp, (__force u32) total) line
    if total is non-zero but has its lower 32 bits zeroed. Removing the cast
    is not a good solution, since some do_div() implementations cast to u32
    internally (see the demonstration below).

    This problem can be triggered in practice on very long-lived processes:

    PID: 2331 TASK: ffff880472814b00 CPU: 2 COMMAND: "oraagent.bin"
    #0 [ffff880472a51b70] machine_kexec at ffffffff8103214b
    #1 [ffff880472a51bd0] crash_kexec at ffffffff810b91c2
    #2 [ffff880472a51ca0] oops_end at ffffffff814f0b00
    #3 [ffff880472a51cd0] die at ffffffff8100f26b
    #4 [ffff880472a51d00] do_trap at ffffffff814f03f4
    #5 [ffff880472a51d60] do_divide_error at ffffffff8100cfff
    #6 [ffff880472a51e00] divide_error at ffffffff8100be7b
    [exception RIP: thread_group_times+0x56]
    RIP: ffffffff81056a16 RSP: ffff880472a51eb8 RFLAGS: 00010046
    RAX: bc3572c9fe12d194 RBX: ffff880874150800 RCX: 0000000110266fad
    RDX: 0000000000000000 RSI: ffff880472a51eb8 RDI: 001038ae7d9633dc
    RBP: ffff880472a51ef8 R8: 00000000b10a3a64 R9: ffff880874150800
    R10: 00007fcba27ab680 R11: 0000000000000202 R12: ffff880472a51f08
    R13: ffff880472a51f10 R14: 0000000000000000 R15: 0000000000000007
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    #7 [ffff880472a51f00] do_sys_times at ffffffff8108845d
    #8 [ffff880472a51f40] sys_times at ffffffff81088524
    #9 [ffff880472a51f80] system_call_fastpath at ffffffff8100b0f2
    RIP: 0000003808caac3a RSP: 00007fcba27ab6d8 RFLAGS: 00000202
    RAX: 0000000000000064 RBX: ffffffff8100b0f2 RCX: 0000000000000000
    RDX: 00007fcba27ab6e0 RSI: 000000000076d58e RDI: 00007fcba27ab6e0
    RBP: 00007fcba27ab700 R8: 0000000000000020 R9: 000000000000091b
    R10: 00007fcba27ab680 R11: 0000000000000202 R12: 00007fff9ca41940
    R13: 0000000000000000 R14: 00007fcba27ac9c0 R15: 00007fff9ca41940
    ORIG_RAX: 0000000000000064 CS: 0033 SS: 002b

    Cc: stable@vger.kernel.org
    Signed-off-by: Stanislaw Gruszka
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120808092714.GA3580@redhat.com
    Signed-off-by: Thomas Gleixner

    Stanislaw Gruszka
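
    A tiny userspace illustration of the trigger (editor's demo, not kernel
    code): a 64-bit total can be non-zero while its low 32 bits, the part
    do_div() would divide by after the cast, are all zero.

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            /* about 2^32 time units: non-zero, but the low 32 bits are zero */
            uint64_t total = 0x100000000ULL;
            uint32_t truncated = (uint32_t)total;

            printf("total = %llu, (u32)total = %u\n",
                   (unsigned long long)total, truncated);

            if (truncated == 0)
                    printf("do_div(temp, (u32)total) would divide by zero here\n");

            return 0;
    }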
     

01 Aug, 2012

1 commit

  • Pull perf updates from Ingo Molnar:
    "The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
    updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
    fixes) now that Arnaldo is back from vacation."

    * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
    uprobes: __replace_page() needs munlock_vma_page()
    uprobes: Rename vma_address() and make it return "unsigned long"
    uprobes: Fix register_for_each_vma()->vma_address() check
    uprobes: Introduce vaddr_to_offset(vma, vaddr)
    uprobes: Teach build_probe_list() to consider the range
    uprobes: Remove insert_vm_struct()->uprobe_mmap()
    uprobes: Remove copy_vma()->uprobe_mmap()
    uprobes: Fix overflow in vma_address()/find_active_uprobe()
    uprobes: Suppress uprobe_munmap() from mmput()
    uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
    uprobes: Clean up and document write_opcode()->lock_page(old_page)
    uprobes: Kill write_opcode()->lock_page(new_page)
    uprobes: __replace_page() should not use page_address_in_vma()
    uprobes: Don't recheck vma/f_mapping in write_opcode()
    perf/x86: Fix missing struct before structure name
    perf/x86: Fix format definition of SNB-EP uncore QPI box
    perf/x86: Make bitfield unsigned
    perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
    perf/x86: Add Intel Nehalem-EX uncore support
    perf/x86: Fix typo in format definition of uncore PCU filter
    ...

    Linus Torvalds
     

26 Jul, 2012

2 commits

  • Otherwise they can't be filtered for a defined task:

    perf record -e sched:sched_switch ./foo

    This command doesn't report any events without this patch.

    I think it isn't a security concern if someone knows who will
    be executed next - this can already be observed by polling /proc
    state. By default perf is disabled for non-root users in any case.

    I need these events for profiling sleep times. sched_switch is used for
    getting callchains and sched_stat_* is used for getting time periods.
    These events are combined in user space, then it can be analyzed by
    perf tools.

    Signed-off-by: Andrew Vagin
    Signed-off-by: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Arun Sharma
    Link: http://lkml.kernel.org/r/1342088069-1005148-1-git-send-email-avagin@openvz.org
    Signed-off-by: Ingo Molnar

    Andrew Vagin
     
  • It seems there's no specific reason to open-code it. I guess
    commit 0122ec5b02f76 ("sched: Add p->pi_lock to task_rq_lock()")
    simply missed it. Let's be consistent with others.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1341647342-6742-1-git-send-email-namhyung@kernel.org
    Signed-off-by: Ingo Molnar

    Namhyung Kim
     

24 Jul, 2012

4 commits

  • Stefan reported a crash on a kernel before a3e5d1091c1 ("sched:
    Don't call task_group() too many times in set_task_rq()"), he
    found the reason to be that the multiple task_group()
    invocations in set_task_rq() returned different values.

    Looking at all that I found a lack of serialization and plain
    wrong comments.

    The below tries to fix it using an extra pointer which is
    updated under the appropriate scheduler locks. It's not pretty,
    but I can't really see another way given how all the cgroup
    stuff works.

    Reported-and-tested-by: Stefan Bader
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1340364965.18025.71.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Traversing an entire package is not only expensive, it also leads to tasks
    bouncing all over a partially idle and possibly quite large package. Fix
    that up by assigning each CPU a 'buddy' CPU to try to motivate: each CPU
    may try to motivate that one other CPU; if it's busy, tough, it may then
    try its SMT sibling, but that's all this optimization is allowed to cost
    (see the sketch below).

    Sibling cache buddies are cross-wired to prevent bouncing.

    4 socket 40 core + SMT Westmere box, single 30 sec tbench runs, higher is better:

    clients       1      2      4      8     16     32     64    128
    ..................................................................
    pre          30     41    118    645   3769   6214  12233  14312
    post        299    603   1211   2418   4697   6847  11606  14557

    A nice increase in performance.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1339471112.7352.32.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Mike Galbraith
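
    A rough sketch of the buddy idea (editor's illustration; the field name
    'idle_buddy' and the surrounding details are assumptions, not a quote of
    the patch): instead of scanning every core in the package for an idle
    CPU, only the pre-assigned buddy at each domain level is checked.

    static int select_idle_sibling(struct task_struct *p, int target)
    {
            struct sched_domain *sd;

            if (idle_cpu(target))
                    return target;

            /*
             * Only poke the pre-wired buddy of each domain level (cache
             * buddy, then its SMT sibling) instead of traversing the
             * whole package.
             */
            for (sd = rcu_dereference(per_cpu(sd_llc, target)); sd; sd = sd->child) {
                    if (!cpumask_test_cpu(sd->idle_buddy, tsk_cpus_allowed(p)))
                            continue;
                    if (idle_cpu(sd->idle_buddy))
                            return sd->idle_buddy;
            }

            return target;
    }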
     
  • Separate out the cpuset related handling for CPU/Memory online/offline.
    This also helps us exploit the most obvious and basic level of optimization
    that any notification mechanism (CPU/Mem online/offline) has to offer us:
    "We *know* why we have been invoked. So stop pretending that we are lost,
    and do only the necessary amount of processing!".

    And while at it, rename scan_for_empty_cpusets() to
    scan_cpusets_upon_hotplug(), which is more appropriate considering how
    it is restructured.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141650.3692.48637.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     
  • In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
    masks as and when necessary to ensure that the tasks belonging to the cpusets
    have some place (online CPUs) to run on. And regular CPU hotplug is
    destructive in the sense that the kernel doesn't remember the original cpuset
    configurations set by the user, across hotplug operations.

    However, suspend/resume (which uses CPU hotplug) is a special case in which
    the kernel has the responsibility to restore the system (during resume), to
    exactly the same state it was in before suspend.

    In order to achieve that, do the following:

    1. Don't modify cpusets during suspend/resume. At all.
    In particular, don't move the tasks from one cpuset to another, and
    don't modify any cpuset's cpus_allowed mask. So, simply ignore cpusets
    during the CPU hotplug operations that are carried out in the
    suspend/resume path.

    2. However, cpusets and sched domains are related. We just want to avoid
    altering cpusets alone. So, to keep the sched domains updated, build
    a single sched domain (containing all active cpus) during each of the
    CPU hotplug operations carried out in s/r path, effectively ignoring
    the cpusets' cpus_allowed masks.

    (Since userspace is frozen while doing all this, it will go unnoticed.)

    3. During the last CPU online operation during resume, build the sched
    domains by looking up the (unaltered) cpusets' cpus_allowed masks.
    That will bring back the system to the same original state as it was in
    before suspend.

    Ultimately, this will not only solve the cpuset problem related to
    suspend/resume (i.e., it restores the cpusets to exactly what they were
    before suspend, by not touching them at all) but also speeds up
    suspend/resume, because we avoid running the cpuset update code for
    every CPU being offlined/onlined.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20120524141611.3692.20155.stgit@srivatsabhat.in.ibm.com
    Signed-off-by: Ingo Molnar

    Srivatsa S. Bhat
     

15 Jul, 2012

1 commit

  • …t-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull RCU, perf, and scheduler fixes from Ingo Molnar.

    The RCU fix is a revert for an optimization that could cause deadlocks.

    One of the scheduler commits (164c33c6adee "sched: Fix fork() error path
    to not crash") is correct but not complete (some architectures like Tile
    are not covered yet) - the resulting additional fixes are still WIP and
    Ingo did not want to delay these pending fixes. See this thread on
    lkml:

    [PATCH] fork: fix error handling in dup_task()

    The perf fixes are just trivial oneliners.

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "rcu: Move PREEMPT_RCU preemption to switch_to() invocation"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf kvm: Fix segfault with report and mixed guestmount use
    perf kvm: Fix regression with guest machine creation
    perf script: Fix format regression due to libtraceevent merge
    ring-buffer: Fix accounting of entries when removing pages
    ring-buffer: Fix crash due to uninitialized new_pages list head

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    MAINTAINERS/sched: Update scheduler file pattern
    sched/nohz: Rewrite and fix load-avg computation -- again
    sched: Fix fork() error path to not crash

    Linus Torvalds
     

06 Jul, 2012

1 commit

  • Thanks to Charles Wang for spotting the defects in the current code:

    - If we go idle during the sample window -- after sampling, we get a
    negative bias because we can negate our own sample.

    - If we wake up during the sample window we get a positive bias
    because we push the sample to a known active period.

    So rewrite the entire nohz load-avg muck once again, now adding
    copious documentation to the code.

    Reported-and-tested-by: Doug Smythies
    Reported-and-tested-by: Charles Wang
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: stable@kernel.org
    Link: http://lkml.kernel.org/r/1340373782.18025.74.camel@twins
    [ minor edits ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Jul, 2012

1 commit

  • This reverts commit 616c310e83b872024271c915c1b9ab505b9efad9.
    (Move PREEMPT_RCU preemption to switch_to() invocation).
    Testing by Sasha Levin showed that this
    can result in deadlock due to invoking the scheduler when one of
    the runqueue locks is held. Because this commit was simply a
    performance optimization, revert it.

    Reported-by: Sasha Levin
    Signed-off-by: Paul E. McKenney
    Tested-by: Sasha Levin

    Paul E. McKenney
     

06 Jun, 2012

1 commit

  • Fix sched_relax_domain_level in cpuset. The build_sched_domain() routine
    calls the set_domain_attribute() routine prior to setting sd->level;
    however, set_domain_attribute() relies on sd->level to decide whether
    idle load balancing will be off/on.

    The requested level does not get processed because sched_domain_level_max
    is 0 at the time that setup_relax_domain_level() is run.

    Simply accept the value as it is, as we don't know the value of
    sched_domain_level_max until sched domain construction is completed.

    Signed-off-by: Dimitri Sivanich
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20120605184436.GA15668@sgi.com
    Signed-off-by: Ingo Molnar

    Dimitri Sivanich