31 May, 2011

1 commit

  • Commit cc3ce5176d83 (rcu: Start RCU kthreads in TASK_INTERRUPTIBLE
    state) fudges a sleeping task's state, resulting in the scheduler seeing
    a TASK_UNINTERRUPTIBLE task going to sleep but a TASK_INTERRUPTIBLE
    task waking up. The result is an unbalanced load calculation.

    The problem that patch tried to address is that the RCU kthreads could
    stay in UNINTERRUPTIBLE state for quite a while, triggering the hung
    task detector, because they are only woken on demand.

    Cure the problem differently by always giving the tasks at least one
    wake-up once the CPU is fully up and running; this will kick them out of
    the initial UNINTERRUPTIBLE state and into the regular INTERRUPTIBLE
    wait state.

    [ The alternative would be teaching kthread_create() to start threads as
    INTERRUPTIBLE but that needs a tad more thought. ]
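
    A minimal sketch of the cure (the helper name here is hypothetical; the
    per-CPU kthread pointer follows the RCU naming of that era, and the
    actual patch hunks may differ):

        /* Once the CPU is fully online, poke each RCU kthread once so it
         * leaves the initial TASK_UNINTERRUPTIBLE state and settles into
         * its regular TASK_INTERRUPTIBLE wait loop. */
        static void rcu_kick_kthread_once_online(int cpu)
        {
                struct task_struct *t = per_cpu(rcu_cpu_kthread_task, cpu);

                if (t)
                        wake_up_process(t);     /* one wake-up suffices */
        }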

    Reported-by: Damien Wyart
    Signed-off-by: Peter Zijlstra
    Acked-by: Paul E. McKenney
    Link: http://lkml.kernel.org/r/1306755291.1200.2872.camel@twins
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

30 May, 2011

2 commits

  • Thomas Gleixner reports that we now have a boot crash triggered by
    CONFIG_CPUMASK_OFFSTACK=y:

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] find_next_bit+0x55/0xb0
    Call Trace:
    [] cpumask_any_but+0x2a/0x70
    [] flush_tlb_mm+0x2b/0x80
    [] pud_populate+0x35/0x50
    [] pgd_alloc+0x9a/0xf0
    [] mm_init+0xec/0x120
    [] mm_alloc+0x53/0xd0

    which was introduced by commit de03c72cfce5 ("mm: convert
    mm->cpu_vm_cpumask into cpumask_var_t"), and is due to wrong ordering of
    mm_init() vs mm_init_cpumask().

    Thomas wrote a patch to just fix the ordering of initialization, but I
    hate the new double allocation in the fork path, so I ended up instead
    doing some more radical surgery to clean it all up.

    Reported-by: Thomas Gleixner
    Reported-by: Ingo Molnar
    Cc: KOSAKI Motohiro
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • * 'idle-release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux-idle-2.6:
    x86 idle: deprecate mwait_idle() and "idle=mwait" cmdline param
    x86 idle: deprecate "no-hlt" cmdline param
    x86 idle APM: deprecate CONFIG_APM_CPU_IDLE
    x86 idle floppy: deprecate disable_hlt()
    x86 idle: EXPORT_SYMBOL(default_idle, pm_idle) only when APM demands it
    x86 idle: clarify AMD erratum 400 workaround
    idle governor: Avoid lock acquisition to read pm_qos before entering idle
    cpuidle: menu: fixed wrapping timers at 4.294 seconds

    Linus Torvalds
     

29 May, 2011

4 commits

  • Thanks for the reviews and comments by Rafael, James, Mark and Andi.
    Here's version 2 of the patch incorporating your comments and also some
    updates to my previous patch description.

    I noticed that before entering idle state, the menu idle governor will
    look up the current pm_qos target value according to the list of qos
    requests received. This lookup currently requires acquiring a lock to
    access the list of qos requests and find the qos target value, slowing
    down entry into idle state due to contention by multiple cpus accessing
    this list. The contention is severe when many cpus are waking and going
    into idle. For example, for a simple workload that has 32 pairs of
    processes ping-ponging messages to each other, with 64 cpu cores active
    in the test system, I see the following profile, with 37.82% of cpu
    cycles spent on contention for pm_qos_lock:

    - 37.82% swapper [kernel.kallsyms] [k] _raw_spin_lock_irqsave
       - _raw_spin_lock_irqsave
          - 95.65% pm_qos_request
               menu_select
               cpuidle_idle_call
             - cpu_idle
                  99.98% start_secondary

    A better approach is to cache the updated pm_qos target value so that
    reading it does not require lock acquisition, as in the patch below.
    With this patch the contention for pm_qos_lock is removed and I saw a
    2.2X increase in throughput for my message passing workload.
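
    The idea, in sketch form (names here are illustrative, not the exact
    patch): writers still update the value under pm_qos_lock, but the idle
    hot path reads a cached copy with no locking at all.

        static atomic_t pm_qos_cached_target = ATOMIC_INIT(PM_QOS_DEFAULT_VALUE);

        /* Update side: called with pm_qos_lock held, after the request
         * list changes. */
        static void pm_qos_set_cached_target(s32 value)
        {
                atomic_set(&pm_qos_cached_target, value);
        }

        /* Idle-entry hot path: a plain atomic read, no lock contention. */
        s32 pm_qos_read_cached_target(void)
        {
                return atomic_read(&pm_qos_cached_target);
        }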

    cc: stable@kernel.org
    Signed-off-by: Tim Chen
    Acked-by: Andi Kleen
    Acked-by: James Bottomley
    Acked-by: mark gross
    Signed-off-by: Len Brown

    Tim Chen
     
  • …el/git/tip/linux-2.6-tip

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    cpuset: Fix cpuset_cpus_allowed_fallback(), don't update tsk->rt.nr_cpus_allowed
    sched: Fix ->min_vruntime calculation in dequeue_entity()
    sched: Fix ttwu() for __ARCH_WANT_INTERRUPTS_ON_CTXSW
    sched: More sched_domain iterations fixes

    Linus Torvalds
     
  • …l/git/tip/linux-2.6-tip

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    rcu: Start RCU kthreads in TASK_INTERRUPTIBLE state
    rcu: Remove waitqueue usage for cpu, node, and boost kthreads
    rcu: Avoid acquiring rcu_node locks in timer functions
    atomic: Add atomic_or()
    Documentation: Add statistics about nested locks
    rcu: Decrease memory-barrier usage based on semi-formal proof
    rcu: Make rcu_enter_nohz() pay attention to nesting
    rcu: Don't do reschedule unless in irq
    rcu: Remove old memory barriers from rcu_process_callbacks()
    rcu: Add memory barriers
    rcu: Fix unpaired rcu_irq_enter() from locking selftests

    Linus Torvalds
     
  • …l/git/tip/linux-2.6-tip

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (25 commits)
    perf: Fix SIGIO handling
    perf top: Don't stop if no kernel symtab is found
    perf top: Handle kptr_restrict
    perf top: Remove unused macro
    perf events: initialize fd array to -1 instead of 0
    perf tools: Make sure kptr_restrict warnings fit 80 col terms
    perf tools: Fix build on older systems
    perf symbols: Handle /proc/sys/kernel/kptr_restrict
    perf: Remove duplicate headers
    ftrace: Add internal recursive checks
    tracing: Update btrfs's tracepoints to use u64 interface
    tracing: Add __print_symbolic_u64 to avoid warnings on 32bit machine
    ftrace: Set ops->flag to enabled even on static function tracing
    tracing: Have event with function tracer check error return
    ftrace: Have ftrace_startup() return failure code
    jump_label: Check entries limit in __jump_label_update
    ftrace/recordmcount: Avoid STT_FUNC symbols as base on ARM
    scripts/tags.sh: Add magic for trace-events for etags too
    scripts/tags.sh: Fix ctags for DEFINE_EVENT()
    x86/ftrace: Fix compiler warning in ftrace.c
    ...

    Linus Torvalds
     

28 May, 2011

11 commits

  • Upon creation, kthreads are in TASK_UNINTERRUPTIBLE state, which can
    result in softlockup warnings. Because some of RCU's kthreads can
    legitimately be idle indefinitely, start them in TASK_INTERRUPTIBLE
    state in order to avoid those warnings.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Tested-by: Yinghai Lu
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     
  • It is not necessary to use waitqueues for the RCU kthreads because
    we always know exactly which thread is to be awakened. In addition,
    wake_up() only issues an actual wakeup when there is a thread waiting on
    the queue, which was why there was an extra explicit wake_up_process()
    to get the RCU kthreads started.

    Eliminating the waitqueues (and wake_up()) in favor of wake_up_process()
    eliminates the need for the initial wake_up_process() and also shrinks
    the data structure size a bit. The wakeup logic is placed in a new
    rcu_wait() macro.
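
    The macro is the classic open-coded wait loop; a sketch of its likely
    shape (hedged, not necessarily the verbatim patch):

        /* Wait until cond is true; the waker knows exactly which task to
         * wake, so wake_up_process() replaces the waitqueue machinery. */
        #define rcu_wait(cond)                                          \
        do {                                                            \
                for (;;) {                                              \
                        set_current_state(TASK_INTERRUPTIBLE);          \
                        if (cond)                                       \
                                break;                                  \
                        schedule();                                     \
                }                                                       \
                __set_current_state(TASK_RUNNING);                      \
        } while (0)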

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This commit switches manipulations of the rcu_node ->wakemask field
    to atomic operations, which allows rcu_cpu_kthread_timer() to avoid
    acquiring the rcu_node lock. This should avoid the following lockdep
    splat reported by Valdis Kletnieks:

    [ 12.872150] usb 1-4: new high speed USB device number 3 using ehci_hcd
    [ 12.986667] usb 1-4: New USB device found, idVendor=413c, idProduct=2513
    [ 12.986679] usb 1-4: New USB device strings: Mfr=0, Product=0, SerialNumber=0
    [ 12.987691] hub 1-4:1.0: USB hub found
    [ 12.987877] hub 1-4:1.0: 3 ports detected
    [ 12.996372] input: PS/2 Generic Mouse as /devices/platform/i8042/serio1/input/input10
    [ 13.071471] udevadm used greatest stack depth: 3984 bytes left
    [ 13.172129]
    [ 13.172130] =======================================================
    [ 13.172425] [ INFO: possible circular locking dependency detected ]
    [ 13.172650] 2.6.39-rc6-mmotm0506 #1
    [ 13.172773] -------------------------------------------------------
    [ 13.172997] blkid/267 is trying to acquire lock:
    [ 13.173009] (&p->pi_lock){-.-.-.}, at: [] try_to_wake_up+0x29/0x1aa
    [ 13.173009]
    [ 13.173009] but task is already holding lock:
    [ 13.173009] (rcu_node_level_0){..-...}, at: [] rcu_cpu_kthread_timer+0x27/0x58
    [ 13.173009]
    [ 13.173009] which lock already depends on the new lock.
    [ 13.173009]
    [ 13.173009]
    [ 13.173009] the existing dependency chain (in reverse order) is:
    [ 13.173009]
    [ 13.173009] -> #2 (rcu_node_level_0){..-...}:
    [ 13.173009] [] check_prevs_add+0x8b/0x104
    [ 13.173009] [] validate_chain+0x36f/0x3ab
    [ 13.173009] [] __lock_acquire+0x369/0x3e2
    [ 13.173009] [] lock_acquire+0xfc/0x14c
    [ 13.173009] [] _raw_spin_lock+0x36/0x45
    [ 13.173009] [] rcu_read_unlock_special+0x8c/0x1d5
    [ 13.173009] [] __rcu_read_unlock+0x4f/0xd7
    [ 13.173009] [] rcu_read_unlock+0x21/0x23
    [ 13.173009] [] cpuacct_charge+0x6c/0x75
    [ 13.173009] [] update_curr+0x101/0x12e
    [ 13.173009] [] check_preempt_wakeup+0xf7/0x23b
    [ 13.173009] [] check_preempt_curr+0x2b/0x68
    [ 13.173009] [] ttwu_do_wakeup+0x76/0x128
    [ 13.173009] [] ttwu_do_activate.constprop.63+0x57/0x5c
    [ 13.173009] [] scheduler_ipi+0x48/0x5d
    [ 13.173009] [] smp_reschedule_interrupt+0x16/0x18
    [ 13.173009] [] reschedule_interrupt+0x13/0x20
    [ 13.173009] [] rcu_read_unlock+0x21/0x23
    [ 13.173009] [] find_get_page+0xa9/0xb9
    [ 13.173009] [] filemap_fault+0x6a/0x34d
    [ 13.173009] [] __do_fault+0x54/0x3e6
    [ 13.173009] [] handle_pte_fault+0x12c/0x1ed
    [ 13.173009] [] handle_mm_fault+0x1cd/0x1e0
    [ 13.173009] [] do_page_fault+0x42d/0x5de
    [ 13.173009] [] page_fault+0x1f/0x30
    [ 13.173009]
    [ 13.173009] -> #1 (&rq->lock){-.-.-.}:
    [ 13.173009] [] check_prevs_add+0x8b/0x104
    [ 13.173009] [] validate_chain+0x36f/0x3ab
    [ 13.173009] [] __lock_acquire+0x369/0x3e2
    [ 13.173009] [] lock_acquire+0xfc/0x14c
    [ 13.173009] [] _raw_spin_lock+0x36/0x45
    [ 13.173009] [] __task_rq_lock+0x8b/0xd3
    [ 13.173009] [] wake_up_new_task+0x41/0x108
    [ 13.173009] [] do_fork+0x265/0x33f
    [ 13.173009] [] kernel_thread+0x6b/0x6d
    [ 13.173009] [] rest_init+0x21/0xd2
    [ 13.173009] [] start_kernel+0x3bb/0x3c6
    [ 13.173009] [] x86_64_start_reservations+0xaf/0xb3
    [ 13.173009] [] x86_64_start_kernel+0xf0/0xf7
    [ 13.173009]
    [ 13.173009] -> #0 (&p->pi_lock){-.-.-.}:
    [ 13.173009] [] check_prev_add+0x68/0x20e
    [ 13.173009] [] check_prevs_add+0x8b/0x104
    [ 13.173009] [] validate_chain+0x36f/0x3ab
    [ 13.173009] [] __lock_acquire+0x369/0x3e2
    [ 13.173009] [] lock_acquire+0xfc/0x14c
    [ 13.173009] [] _raw_spin_lock_irqsave+0x44/0x57
    [ 13.173009] [] try_to_wake_up+0x29/0x1aa
    [ 13.173009] [] wake_up_process+0x10/0x12
    [ 13.173009] [] rcu_cpu_kthread_timer+0x44/0x58
    [ 13.173009] [] call_timer_fn+0xac/0x1e9
    [ 13.173009] [] run_timer_softirq+0x1aa/0x1f2
    [ 13.173009] [] __do_softirq+0x109/0x26a
    [ 13.173009] [] call_softirq+0x1c/0x30
    [ 13.173009] [] do_softirq+0x44/0xf1
    [ 13.173009] [] irq_exit+0x58/0xc8
    [ 13.173009] [] smp_apic_timer_interrupt+0x79/0x87
    [ 13.173009] [] apic_timer_interrupt+0x13/0x20
    [ 13.173009] [] get_page_from_freelist+0x2aa/0x310
    [ 13.173009] [] __alloc_pages_nodemask+0x178/0x243
    [ 13.173009] [] pte_alloc_one+0x1e/0x3a
    [ 13.173009] [] __pte_alloc+0x22/0x14b
    [ 13.173009] [] handle_mm_fault+0x17e/0x1e0
    [ 13.173009] [] do_page_fault+0x42d/0x5de
    [ 13.173009] [] page_fault+0x1f/0x30
    [ 13.173009]
    [ 13.173009] other info that might help us debug this:
    [ 13.173009]
    [ 13.173009] Chain exists of:
    [ 13.173009] &p->pi_lock --> &rq->lock --> rcu_node_level_0
    [ 13.173009]
    [ 13.173009] Possible unsafe locking scenario:
    [ 13.173009]
    [ 13.173009]        CPU0                    CPU1
    [ 13.173009]        ----                    ----
    [ 13.173009]   lock(rcu_node_level_0);
    [ 13.173009]                               lock(&rq->lock);
    [ 13.173009]                               lock(rcu_node_level_0);
    [ 13.173009]   lock(&p->pi_lock);
    [ 13.173009]
    [ 13.173009] *** DEADLOCK ***
    [ 13.173009]
    [ 13.173009] 3 locks held by blkid/267:
    [ 13.173009] #0: (&mm->mmap_sem){++++++}, at: [] do_page_fault+0x1f3/0x5de
    [ 13.173009] #1: (&yield_timer){+.-...}, at: [] call_timer_fn+0x0/0x1e9
    [ 13.173009] #2: (rcu_node_level_0){..-...}, at: [] rcu_cpu_kthread_timer+0x27/0x58
    [ 13.173009]
    [ 13.173009] stack backtrace:
    [ 13.173009] Pid: 267, comm: blkid Not tainted 2.6.39-rc6-mmotm0506 #1
    [ 13.173009] Call Trace:
    [ 13.173009] [] print_circular_bug+0xc8/0xd9
    [ 13.173009] [] check_prev_add+0x68/0x20e
    [ 13.173009] [] ? save_stack_trace+0x28/0x46
    [ 13.173009] [] check_prevs_add+0x8b/0x104
    [ 13.173009] [] validate_chain+0x36f/0x3ab
    [ 13.173009] [] __lock_acquire+0x369/0x3e2
    [ 13.173009] [] ? try_to_wake_up+0x29/0x1aa
    [ 13.173009] [] lock_acquire+0xfc/0x14c
    [ 13.173009] [] ? try_to_wake_up+0x29/0x1aa
    [ 13.173009] [] ? rcu_check_quiescent_state+0x82/0x82
    [ 13.173009] [] _raw_spin_lock_irqsave+0x44/0x57
    [ 13.173009] [] ? try_to_wake_up+0x29/0x1aa
    [ 13.173009] [] try_to_wake_up+0x29/0x1aa
    [ 13.173009] [] ? rcu_check_quiescent_state+0x82/0x82
    [ 13.173009] [] wake_up_process+0x10/0x12
    [ 13.173009] [] rcu_cpu_kthread_timer+0x44/0x58
    [ 13.173009] [] ? rcu_check_quiescent_state+0x82/0x82
    [ 13.173009] [] call_timer_fn+0xac/0x1e9
    [ 13.173009] [] ? del_timer+0x75/0x75
    [ 13.173009] [] ? rcu_check_quiescent_state+0x82/0x82
    [ 13.173009] [] run_timer_softirq+0x1aa/0x1f2
    [ 13.173009] [] __do_softirq+0x109/0x26a
    [ 13.173009] [] ? tick_dev_program_event+0x37/0xf6
    [ 13.173009] [] ? time_hardirqs_off+0x1b/0x2f
    [ 13.173009] [] call_softirq+0x1c/0x30
    [ 13.173009] [] do_softirq+0x44/0xf1
    [ 13.173009] [] irq_exit+0x58/0xc8
    [ 13.173009] [] smp_apic_timer_interrupt+0x79/0x87
    [ 13.173009] [] apic_timer_interrupt+0x13/0x20
    [ 13.173009] [] ? get_page_from_freelist+0x114/0x310
    [ 13.173009] [] ? get_page_from_freelist+0x2aa/0x310
    [ 13.173009] [] ? clear_page_c+0x7/0x10
    [ 13.173009] [] ? prep_new_page+0x14c/0x1cd
    [ 13.173009] [] get_page_from_freelist+0x2aa/0x310
    [ 13.173009] [] __alloc_pages_nodemask+0x178/0x243
    [ 13.173009] [] ? __pmd_alloc+0x87/0x99
    [ 13.173009] [] pte_alloc_one+0x1e/0x3a
    [ 13.173009] [] ? __pmd_alloc+0x87/0x99
    [ 13.173009] [] __pte_alloc+0x22/0x14b
    [ 13.173009] [] handle_mm_fault+0x17e/0x1e0
    [ 13.173009] [] do_page_fault+0x42d/0x5de
    [ 13.173009] [] ? sys_brk+0x32/0x10c
    [ 13.173009] [] ? time_hardirqs_off+0x1b/0x2f
    [ 13.173009] [] ? trace_hardirqs_off_caller+0x3f/0x9c
    [ 13.173009] [] ? trace_hardirqs_off_thunk+0x3a/0x3c
    [ 13.173009] [] page_fault+0x1f/0x30
    [ 14.010075] usb 5-1: new full speed USB device number 2 using uhci_hcd
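
    In sketch form, the change replaces the locked update of ->wakemask
    with an atomic read-modify-write (field and offset names here are
    illustrative, not the exact patch):

        /* Timer function: set this CPU's bit without taking rnp->lock. */
        atomic_or(1 << (cpu - rnp->grplo), &rnp->wakemask);
        /* ...then wake_up_process() the kthread, still without rnp->lock. */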

    Reported-by: Valdis Kletnieks
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     
  • …ck/linux-2.6-rcu into core/urgent

    Ingo Molnar
     
  • Vince noticed that unless we mmap() a buffer, SIGIO gets lost. So
    explicitly push the wakeup (including signals) when requested.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc:
    Link: http://lkml.kernel.org/n/tip-2euus3f3x3dyvdk52cjxw8zu@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The rule is, we have to update tsk->rt.nr_cpus_allowed whenever we
    change tsk->cpus_allowed. Otherwise the RT scheduler may get confused.
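
    In sketch form, the invariant reads (field names as in the scheduler;
    this is not the exact patch hunk):

        /* Whenever the affinity mask changes, the cached count must be
         * updated in the same step. */
        cpumask_copy(&p->cpus_allowed, new_mask);
        p->rt.nr_cpus_allowed = cpumask_weight(new_mask);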

    Signed-off-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/4DD4B3FA.5060901@jp.fujitsu.com
    Signed-off-by: Ingo Molnar

    KOSAKI Motohiro
     
  • Dima Zavin reported:

    "After pulling the thread off the run-queue during a cgroup change,
    the cfs_rq.min_vruntime gets recalculated. The dequeued thread's vruntime
    then gets normalized to this new value. This can then lead to the thread
    getting an unfair boost in the new group if the vruntime of the next
    task in the old run-queue was way further ahead."

    Reported-by: Dima Zavin
    Signed-off-by: John Stultz
    Recalls-having-tested-once-upon-a-time-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1305674470-23727-1-git-send-email-john.stultz@linaro.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Marc reported that e4a52bcb9 (sched: Remove rq->lock from the first
    half of ttwu()) broke his ARM-SMP machine. Now ARM is one of the few
    __ARCH_WANT_INTERRUPTS_ON_CTXSW users, so that exception in the ttwu()
    code was suspect.

    Yong found that the interrupt could hit after context_switch() changes
    current but before it clears p->on_cpu; if that interrupt were to
    attempt a wake-up of p, we would indeed find ourselves spinning in IRQ
    context.

    Fix this by reverting to the old behaviour for this situation and
    performing a full remote wake-up.

    Cc: Frank Rowand
    Cc: Yong Zhang
    Cc: Oleg Nesterov
    Reported-by: Marc Zyngier
    Tested-by: Marc Zyngier
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • sched_domain iterations need to be protected by rcu_read_lock() now.
    This patch adds another two places which need the rcu lock, spotted via
    the following suspicious rcu_dereference_check() usage warnings.

    kernel/sched_rt.c:1244 invoked rcu_dereference_check() without protection!
    kernel/sched_stats.h:41 invoked rcu_dereference_check() without protection!
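
    The fix applies the usual pattern (a sketch, not the exact hunks from
    the patch): the whole domain walk runs inside an RCU read-side critical
    section.

        rcu_read_lock();
        for_each_domain(cpu, sd) {
                /* ... read fields of sd safely here ... */
        }
        rcu_read_unlock();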

    Signed-off-by: Xiaotian Feng
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1303469634-11678-1-git-send-email-dfeng@redhat.com
    Signed-off-by: Ingo Molnar

    Xiaotian Feng
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/suspend-2.6:
    PM: Fix PM QOS's user mode interface to work with ASCII input
    PM / Hibernate: Update kerneldoc comments in hibernate.c
    PM / Hibernate: Remove arch_prepare_suspend()
    PM / Hibernate: Update some comments in core hibernate code

    Linus Torvalds
     
  • * 'docs-move' of git://git.kernel.org/pub/scm/linux/kernel/git/rdunlap/linux-docs:
    Create Documentation/security/, move LSM-, credentials-, and keys-related files from Documentation/ to Documentation/security/, add Documentation/security/00-INDEX, and update all occurrences of Documentation/ to Documentation/security/.

    Linus Torvalds
     

27 May, 2011

14 commits

  • …rostedt/linux-2.6-trace into perf/urgent

    Ingo Molnar
     
  • profile_hits() has a common check for prof_on and prof_buffer regardless
    of SMP or !SMP. So, remove some duplicate code by splitting profile_hits
    into two.

    [akpm@linux-foundation.org: make do_profile_hits static]
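
    A sketch of the resulting shape (close to the described patch, though
    not guaranteed verbatim): the common checks live in profile_hits(),
    while the SMP and !SMP accounting lives in a static do_profile_hits().

        void profile_hits(int type, void *__pc, unsigned int nr_hits)
        {
                if (prof_on != type || prof_buffer == NULL)
                        return;
                do_profile_hits(type, __pc, nr_hits);
        }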
    Signed-off-by: Rakib Mullick
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rakib Mullick
     
  • Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
    This was because exe_file was needed only for /proc/<pid>/exe. Since we
    will need the exe_file functionality also for core dumps (so the core
    name can contain the full binary path), build this functionality into
    the kernel unconditionally.

    To achieve that, move it out of procfs into kernel/, where in fact it
    should belong. By doing that we can make dup_mm_exe_file static. Also we
    can drop the linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
    leads to some problems:

    * cgroup creation is out-of-control
    * cgroup name can conflict when pids are looping
    * it is not possible to have a single process handle a lot of
    namespaces without incurring exponential creation time
    * we may want to create a namespace without creating a cgroup

    The ns_cgroup was replaced by a compatibility flag 'clone_children',
    where a newly created cgroup will copy the parent cgroup values.
    The userspace has to manually create a cgroup and add a task to
    the 'tasks' file.

    This patch removes the ns_cgroup as suggested in the following thread:

    https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html

    The 'cgroup_clone' function is removed because it is no longer used.

    This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify
    ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
    printk warning users that the feature was planned for removal. Since
    that time we have heard from XXX users who were affected by this.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Serge E. Hallyn
    Cc: Eric W. Biederman
    Cc: Jamal Hadi Salim
    Reviewed-by: Li Zefan
    Acked-by: Paul Menage
    Acked-by: Matt Helsley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Lezcano
     
  • Convert cgroup_attach_proc to use flex_array.

    The cgroup_attach_proc implementation requires a pre-allocated array to
    store task pointers to atomically move a thread-group, but asking for a
    monolithic array with kmalloc() may be unreliable for very large groups.
    Using flex_array provides the same functionality with less risk of
    failure.

    This is a post-patch for cgroup-procs-write.patch.
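
    A sketch of the flex_array usage (helper names from lib/flex_array.h;
    the sizes and exact calls here are illustrative): the backing store is
    page-sized chunks rather than one large contiguous allocation.

        struct flex_array *group;

        group = flex_array_alloc(sizeof(struct task_struct *), group_size,
                                 GFP_KERNEL);
        if (!group)
                return -ENOMEM;
        /* store a task pointer at slot i, then read it back: */
        flex_array_put(group, i, &tsk, GFP_ATOMIC);
        tsk = *(struct task_struct **)flex_array_get(group, i);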

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Make procs file writable to move all threads by tgid at once.

    Add functionality that enables users to move all threads in a threadgroup
    at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This
    current implementation makes use of a per-threadgroup rwsem that's taken
    for reading in the fork() path to prevent newly forking threads within the
    threadgroup from "escaping" while the move is in progress.

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

    Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
    for cgroups's subsystem interface. Unlike can_attach and attach, these
    are for per-thread operations, to be called potentially many times when
    attaching an entire threadgroup.

    Also, the old "bool threadgroup" interface is removed, as replaced by
    this. All subsystems are modified for the new interface - of note is
    cpuset, which requires from/to nodemasks for attach to be globally scoped
    (though per-cpuset would work too) to persist from its pre_attach to
    attach_task and attach.

    This is a pre-patch for cgroup-procs-writable.patch.
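
    In sketch form, the new per-thread callbacks look like this (hedged:
    signatures as described above, possibly not verbatim):

        struct cgroup_subsys {
                /* ... existing whole-group callbacks ... */
                int  (*can_attach_task)(struct cgroup *cgrp,
                                        struct task_struct *tsk);
                void (*pre_attach)(struct cgroup *cgrp);
                void (*attach_task)(struct cgroup *cgrp,
                                    struct task_struct *tsk);
        };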

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Adds functionality to read/write lock CLONE_THREAD fork()ing per-threadgroup

    Add an rwsem that lives in a threadgroup's signal_struct that's taken for
    reading in the fork path, under CONFIG_CGROUPS. If another part of the
    kernel later wants to use such a locking mechanism, the CONFIG_CGROUPS
    ifdefs should be changed to a higher-up flag that CGROUPS and the other
    system would both depend on.

    This is a pre-patch for cgroup-procs-write.patch.
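
    A sketch of the mechanism (field and helper names here are
    illustrative): the rwsem lives in signal_struct, compiled in only when
    cgroups are; fork takes it for reading, a whole-threadgroup attach
    takes it for writing.

        struct signal_struct {
                /* ... */
        #ifdef CONFIG_CGROUPS
                struct rw_semaphore threadgroup_fork_lock;
        #endif
        };

        static inline void threadgroup_fork_read_lock(struct task_struct *tsk)
        {
        #ifdef CONFIG_CGROUPS
                down_read(&tsk->signal->threadgroup_fork_lock);
        #endif
        }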

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • Make pm_qos_power_write() accept values passed to it in the ASCII hex
    format either with or without an ending newline.
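
    A sketch of the tolerant parse (not the verbatim patch): strip an
    optional trailing newline, then interpret the remainder as ASCII hex.

        static int pm_qos_parse_ascii_hex(const char *buf, size_t count,
                                          s32 *val)
        {
                char ascii[11];

                if (count == 0 || count > sizeof(ascii) - 1)
                        return -EINVAL;
                memcpy(ascii, buf, count);
                if (ascii[count - 1] == '\n')
                        count--;
                ascii[count] = '\0';
                return kstrtos32(ascii, 16, val);
        }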

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Mark Gross

    Rafael J. Wysocki
     
  • (Note: this was reverted, and is now being re-applied in pieces, with
    this being the fifth and final piece. See below for the reason that
    it is now felt to be safe to re-apply this.)

    Commit d09b62d fixed grace-period synchronization, but left some smp_mb()
    invocations in rcu_process_callbacks() that were no longer needed; sheer
    paranoia had prevented them from being removed. This commit removes
    them and provides a proof of correctness in their absence. It also adds
    a memory barrier to rcu_report_qs_rsp() immediately before the update to
    rsp->completed in order to handle the theoretical possibility that the
    compiler or CPU might move massive quantities of code into a lock-based
    critical section. This also proves that the sheer paranoia was not
    entirely unjustified, at least from a theoretical point of view.

    In addition, the old dyntick-idle synchronization depended on the fact
    that grace periods were many milliseconds in duration, so that it could
    be assumed that no dyntick-idle CPU could reorder a memory reference
    across an entire grace period. Unfortunately for this design, the
    addition of expedited grace periods breaks this assumption, which has
    the unfortunate side-effect of requiring atomic operations in the
    functions that track dyntick-idle state for RCU. (There is some hope
    that the algorithms used in user-level RCU might be applied here, but
    some work is required to handle the NMIs that user-space applications
    can happily ignore. For the short term, better safe than sorry.)

    This proof assumes that neither compiler nor CPU will allow a lock
    acquisition and release to be reordered, as doing so can result in
    deadlock. The proof is as follows:

    1. A given CPU declares a quiescent state under the protection of
    its leaf rcu_node's lock.

    2. If there is more than one level of rcu_node hierarchy, the
    last CPU to declare a quiescent state will also acquire the
    ->lock of the next rcu_node up in the hierarchy, but only
    after releasing the lower level's lock. The acquisition of this
    lock clearly cannot occur prior to the acquisition of the leaf
    node's lock.

    3. Step 2 repeats until we reach the root rcu_node structure.
    Please note again that only one lock is held at a time through
    this process. The acquisition of the root rcu_node's ->lock
    must occur after the release of that of the leaf rcu_node.

    4. At this point, we set the ->completed field in the rcu_state
    structure in rcu_report_qs_rsp(). However, if the rcu_node
    hierarchy contains only one rcu_node, then in theory the code
    preceding the quiescent state could leak into the critical
    section. We therefore precede the update of ->completed with a
    memory barrier (see the sketch after this proof). All CPUs will
    therefore agree that any updates preceding any report of a quiescent
    state will have happened before the update of ->completed.

    5. Regardless of whether a new grace period is needed, rcu_start_gp()
    will propagate the new value of ->completed to all of the leaf
    rcu_node structures, under the protection of each rcu_node's ->lock.
    If a new grace period is needed immediately, this propagation
    will occur in the same critical section that ->completed was
    set in, but courtesy of the memory barrier in #4 above, is still
    seen to follow any pre-quiescent-state activity.

    6. When a given CPU invokes __rcu_process_gp_end(), it becomes
    aware of the end of the old grace period and therefore makes
    any RCU callbacks that were waiting on that grace period eligible
    for invocation.

    If this CPU is the same one that detected the end of the grace
    period, and if there is but a single rcu_node in the hierarchy,
    we will still be in the single critical section. In this case,
    the memory barrier in step #4 guarantees that all callbacks will
    be seen to execute after each CPU's quiescent state.

    On the other hand, if this is a different CPU, it will acquire
    the leaf rcu_node's ->lock, and will again be serialized after
    each CPU's quiescent state for the old grace period.

    On the strength of this proof, this commit therefore removes the memory
    barriers from rcu_process_callbacks() and adds one to rcu_report_qs_rsp().
    The effect is to reduce the number of memory barriers by one and to
    reduce the frequency of execution from about once per scheduling tick
    per CPU to once per grace period.
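
    Step 4 above is the crux; in sketch form (not the verbatim patch):

        /* Order all pre-quiescent-state accesses before the update of
         * ->completed, so no CPU can see the new value without also
         * seeing everything that preceded the quiescent states. */
        smp_mb();
        rsp->completed = rsp->gpnum;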

    This was reverted due to hangs found during testing by Yinghai Lu and
    Ingo Molnar. Frederic Weisbecker supplied Yinghai with tracing that
    located the underlying problem, and Frederic also provided the fix.

    The underlying problem was that the HARDIRQ_ENTER() macro from
    lib/locking-selftest.c invoked irq_enter(), which in turn invokes
    rcu_irq_enter(), but HARDIRQ_EXIT() invoked __irq_exit(), which
    does not invoke rcu_irq_exit(). This situation resulted in calls
    to rcu_irq_enter() that were not balanced by the required calls to
    rcu_irq_exit(). Therefore, after these locking selftests completed,
    RCU's dyntick-idle nesting count was a large number (for example,
    72), which caused RCU to conclude that the affected CPU was not in
    dyntick-idle mode when in fact it was.

    RCU would therefore incorrectly wait for this dyntick-idle CPU, resulting
    in hangs.

    In contrast, with Frederic's patch, which replaces the irq_enter()
    in HARDIRQ_ENTER() with an __irq_enter(), these tests don't ever call
    either rcu_irq_enter() or rcu_irq_exit(), which works because the CPU
    running the test is already marked as not being in dyntick-idle mode.
    This keeps the rcu_irq_enter()/rcu_irq_exit() nesting balanced, so RCU
    then has no problem working out which CPUs are in dyntick-idle mode and
    which are not.

    The reason that the imbalance was not noticed before the barrier patch
    was applied is that the old implementation of rcu_enter_nohz() ignored
    the nesting depth. This could still result in delays, but much shorter
    ones. Whenever there was a delay, RCU would IPI the CPU with the
    unbalanced nesting level, which would eventually result in rcu_enter_nohz()
    being called, which in turn would force RCU to see that the CPU was in
    dyntick-idle mode.

    The reason that very few people noticed the problem is that the mismatched
    irq_enter() vs. __irq_exit() occurred only when the kernel was built with
    CONFIG_DEBUG_LOCKING_API_SELFTESTS.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The old version of rcu_enter_nohz() forced RCU into nohz mode even if
    the nesting count was non-zero. This change causes rcu_enter_nohz()
    to hold off for non-zero nesting counts.
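
    A sketch of the change (hedged, not the verbatim patch):

        void rcu_enter_nohz(void)
        {
                struct rcu_dynticks *rdtp = &__get_cpu_var(rcu_dynticks);

                if (--rdtp->dynticks_nesting)
                        return;         /* still nested: hold off */
                /* ... actually enter dyntick-idle mode ... */
        }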

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Condition the set_need_resched() in rcu_irq_exit() on in_irq(). This
    should be a no-op, because rcu_irq_exit() should only be called from irq.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Second step of partitioning of commit e59fb3120b.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • Add the memory barriers added by e59fb3120b.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

26 May, 2011

8 commits

  • Merge reason: Linus applied an overlapping commit:

    5f2e8e2b0bf0: kernel/watchdog.c: Use proper ANSI C prototypes

    So merge it in to make sure we can iterate the file without conflicts.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • commit 4b06042 ("bitmap, irq: add smp_affinity_list interface to
    /proc/irq") causes the following warning:

    [ 274.239500] WARNING: at fs/proc/generic.c:850 remove_proc_entry+0x24c/0x27a()
    [ 274.251761] remove_proc_entry: removing non-empty directory 'irq/184',
    leaking at least 'smp_affinity_list'

    Remove the new file in the exit path.
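
    The fix, in sketch form (the teardown site is the irq proc exit path;
    exact variable names may differ):

        remove_proc_entry("smp_affinity_list", desc->dir);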

    Signed-off-by: Yinghai Lu
    Cc: Mike Travis
    Link: http://lkml.kernel.org/r/4DDDE094.6050505@kernel.org
    Signed-off-by: Thomas Gleixner

    Yinghai Lu
     
  • Witold reported a reboot caused by the selftests of the dynamic function
    tracer. He sent me a config and I used ktest to do a config_bisect on it
    (as my config did not cause the crash). It pointed out that the problem
    config was CONFIG_PROVE_RCU.

    What happened was that if multiple callbacks are attached to the
    function tracer, we iterate a list of callbacks. Because the list is
    managed by synchronize_sched() and preempt_disable, the access to the
    pointers uses rcu_dereference_raw().

    When PROVE_RCU is enabled, the rcu_dereference_raw() calls some
    debugging functions, which happen to be traced. The tracing of the debug
    function would then call rcu_dereference_raw() which would then call the
    debug function and then... well you get the idea.

    I first wrote two different patches to solve this bug.

    1) add a __rcu_dereference_raw() that would not do any checks.
    2) add notrace to the offending debug functions.

    Both of these patches worked.

    Talking with Paul McKenney on IRC, he suggested adding recursion
    detection instead. This seemed to be a better solution, so I decided to
    implement it. As the task_struct already has a trace_recursion field to
    detect recursion in the ring buffer, and that uses only a small number
    of bits, I decided to use the same variable to add flags that can detect
    recursion inside the function tracer infrastructure.

    I plan to change it so that the task struct bit can be checked in
    mcount, but as that requires changes to all archs, I will hold that off
    to the next merge window.
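
    A sketch of the guard (the bit and helper names follow the patch
    description but may not be verbatim): a per-task flag is set around the
    callback-list walk, and a set flag on entry means we are recursing.

        if (trace_recursion_test(TRACE_INTERNAL_BIT))
                return;
        trace_recursion_set(TRACE_INTERNAL_BIT);
        /* ... iterate the callback list via rcu_dereference_raw() ... */
        trace_recursion_clear(TRACE_INTERNAL_BIT);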

    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Frederic Weisbecker
    Cc: Paul E. McKenney
    Link: http://lkml.kernel.org/r/1306348063.1465.116.camel@gandalf.stny.rr.com
    Reported-by: Witold Baryluk
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • Filesystems, like Btrfs, have some "ULL" macros, and when these macros
    are passed to a tracepoint's __print_symbolic(), there are 64->32
    truncation warnings when compiling on a 32-bit box.

    Signed-off-by: Liu Bo
    Link: http://lkml.kernel.org/r/4DACE6E0.7000507@cn.fujitsu.com
    Signed-off-by: Steven Rostedt

    liubo
     
  • When dynamic ftrace is not configured, the ops->flags still needs
    to have its FTRACE_OPS_FL_ENABLED bit set in ftrace_startup().

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • The self tests for the event tracer do not check whether function
    tracing was successfully activated. They need to before continuing the
    tests; otherwise the wrong errors may be reported.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • register_ftrace_function() returns an error code on failure, except
    when the call to ftrace_startup() fails. Add an error return to
    ftrace_startup() if it fails to start, allowing register_ftrace_function()
    to return a proper error value.

    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-2.6-nsfd:
    net: fix get_net_ns_by_fd for !CONFIG_NET_NS
    ns proc: Return -ENOENT for a nonexistent /proc/self/ns/ entry.
    ns: Declare sys_setns in syscalls.h
    net: Allow setting the network namespace by fd
    ns proc: Add support for the ipc namespace
    ns proc: Add support for the uts namespace
    ns proc: Add support for the network namespace.
    ns: Introduce the setns syscall
    ns: proc files for namespace naming policy.

    Linus Torvalds