10 Sep, 2010

9 commits

  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    tracing: t_start: reset FTRACE_ITER_HASH in case of seek/pread
    perf symbols: Fix multiple initialization of symbol system
    perf: Fix CPU hotplug
    perf, trace: Fix module leak
    tracing/kprobe: Fix handling of C-unlike argument names
    tracing/kprobes: Fix handling of argument names
    perf probe: Fix handling of argument names
    perf probe: Fix return probe support
    tracing/kprobe: Fix a memory leak in error case
    tracing: Do not allow llseek to set_ftrace_filter

    Linus Torvalds
     
  • Be sure to avoid entering t_show() with FTRACE_ITER_HASH set without
    having properly started the iterator to iterate the hash. This case
    is degenerate and, as discovered by Robert Swiecki, can cause
    t_hash_show() to misuse a pointer, resulting in a NULL pointer
    dereference with possible security implications. Tracked as
    CVE-2010-3079.
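
    The core of the fix is a single flag reset at the top of the
    iterator start path; a simplified sketch (not the verbatim diff,
    details elided):

    static void *t_start(struct seq_file *m, loff_t *pos)
    {
            struct ftrace_iterator *iter = m->private;

            /*
             * A seek/pread can land here with FTRACE_ITER_HASH still
             * set from an earlier read; clear it so t_show() never
             * sees the flag without the hash walk having been started.
             */
            iter->flags &= ~FTRACE_ITER_HASH;
            ...
    }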

    Cc: Robert Swiecki
    Cc: Eugene Teo
    Cc:
    Signed-off-by: Chris Wright
    Signed-off-by: Steven Rostedt

    Chris Wright
     
  • Please revert 2.6.36-rc commit d2997b1042ec150616c1963b5e5e919ffd0b0ebf
    "hibernation: freeze swap at hibernation". It complicated matters by
    adding a second swap allocation path, just for hibernation; without in any
    way fixing the issue that it was intended to address - page reclaim after
    fixing the hibernation image might free swap from a page already imaged as
    swapcache, letting its swap be reallocated to store a different page of
    the image: resulting in data corruption if the imaged page were freed as
    clean then swapped back in. Pages freed to si->swap_map were still in
    danger of being reallocated by the alternative allocation path.

    I guess it inadvertently fixed slow SSD swap allocation for hibernation,
    as reported by Nigel Cunningham: by missing out the discards that occur on
    the usual swap allocation path; but that was unintentional, and needs a
    separate fix.

    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: "Rafael J. Wysocki"
    Cc: Ondrej Zary
    Cc: Andrea Gelmini
    Cc: Balbir Singh
    Cc: Andrea Arcangeli
    Cc: Nigel Cunningham
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • gid_t is an unsigned int. If group_info contains a gid greater than
    INT_MAX, the groups_search() function may look on the wrong side of
    the search tree.

    This solves some unfair "permission denied" problems.
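
    The underlying pitfall is easy to reproduce in plain C: comparing
    two unsigned values via a signed difference goes wrong once the gids
    get large. A minimal stand-alone illustration (not the kernel code
    itself; values chosen to trigger the wrap):

    #include <stdio.h>

    int main(void)
    {
            unsigned int grp = 0x90000000u;  /* a gid above INT_MAX */
            unsigned int mid = 5u;

            /* broken: the subtraction wraps and the cast turns a large
             * positive difference into a negative number */
            int cmp = (int)(grp - mid);
            printf("signed diff:    %s\n",
                   cmp > 0 ? "grp > mid" : "grp <= mid");

            /* correct: compare the unsigned values directly */
            printf("direct compare: %s\n",
                   grp > mid ? "grp > mid" : "grp <= mid");
            return 0;
    }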

    Signed-off-by: Jerome Marchand
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Add cgroup_attach_task_all()

    The existing cgroup_attach_task_current_cg() API is called by a thread to
    attach another thread to all of its cgroups; this is unsuitable for cases
    where a privileged task wants to attach itself to the cgroups of a less
    privileged one, since the call must be made from the context of the target
    task.

    This patch adds a more generic cgroup_attach_task_all() API that allows
    both the source task and to-be-moved task to be specified.
    cgroup_attach_task_current_cg() becomes a specialization of the more
    generic new function.
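
    From the description, the new entry point looks roughly like this (a
    sketch, not a verbatim copy of the patch):

    /* attach @tsk to all cgroups that @from belongs to */
    int cgroup_attach_task_all(struct task_struct *from,
                               struct task_struct *tsk)
    {
            struct cgroupfs_root *root;
            int retval = 0;

            cgroup_lock();
            for_each_active_root(root) {
                    struct cgroup *cg = task_cgroup_from_root(from, root);

                    retval = cgroup_attach_task(cg, tsk);
                    if (retval)
                            break;
            }
            cgroup_unlock();
            return retval;
    }

    /* the old API becomes a thin wrapper */
    int cgroup_attach_task_current_cg(struct task_struct *tsk)
    {
            return cgroup_attach_task_all(current, tsk);
    }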

    [menage@google.com: rewrote changelog]
    [akpm@linux-foundation.org: address reviewer comments]
    Signed-off-by: Michael S. Tsirkin
    Tested-by: Alex Williamson
    Acked-by: Paul Menage
    Cc: Li Zefan
    Cc: Ben Blum
    Cc: Sridhar Samudrala
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
     
  • The gcov-kernel infrastructure expects that each object file is loaded
    only once. This may not be true, e.g. when loading multiple kernel
    modules which are linked to the same object file. As a result, loading
    such kernel modules will result in incorrect gcov results while unloading
    will cause a null-pointer dereference.

    This patch fixes these problems by changing the gcov-kernel infrastructure
    so that multiple profiling data sets can be associated with one debugfs
    entry. It applies to 2.6.36-rc1.
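
    Conceptually, each debugfs node goes from referencing a single
    gcov_info to referencing every loaded instance of the object file; a
    rough sketch of the data-structure change (field names illustrative,
    not necessarily the upstream ones):

    struct gcov_node {
            ...
            /* was: struct gcov_info *info; -- one data set only */
            struct gcov_info **loaded_info;  /* one per loaded instance */
            unsigned int num_loaded;
            ...
    };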

    Signed-off-by: Peter Oberparleiter
    Reported-by: Werner Spies
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Oberparleiter
     
  • Currently sched_avg_update() (which updates the rt_avg stats in the
    rq) is called from scale_rt_power() (in the load-balance context),
    which doesn't take rq->lock.

    Fix it by moving the sched_avg_update() call to the more appropriate
    update_cpu_load(), where the CFS load gets updated as well.
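
    The move is mechanical; in sketch form (not the verbatim diff):

    static void update_cpu_load(struct rq *this_rq)
    {
            /* runs with this_rq->lock held */
            ...
            /*
             * Update rt_avg here, under the lock, instead of from
             * scale_rt_power() in the lockless load-balance path.
             */
            sched_avg_update(this_rq);
    }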

    Signed-off-by: Suresh Siddha
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Suresh Siddha
     
  • Since we have UP_PREPARE, we should also have UP_CANCELED.
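
    In hotplug-notifier terms: whatever CPU_UP_PREPARE sets up must be
    torn down again if the bring-up is aborted. A sketch of the shape
    (simplified):

    static int perf_cpu_notify(struct notifier_block *self,
                               unsigned long action, void *hcpu)
    {
            unsigned int cpu = (long)hcpu;

            switch (action & ~CPU_TASKS_FROZEN) {
            case CPU_UP_PREPARE:
                    perf_event_init_cpu(cpu);
                    break;
            case CPU_UP_CANCELED:  /* the missing case: undo UP_PREPARE */
            case CPU_DOWN_PREPARE:
                    perf_event_exit_cpu(cpu);
                    break;
            }
            return NOTIFY_OK;
    }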

    Signed-off-by: Peter Zijlstra
    Cc: paulus
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Commit 1c024eca (perf, trace: Optimize tracepoints by using
    per-tracepoint-per-cpu hlist to track events) caused a module
    refcount leak.

    Reported-And-Tested-by: Avi Kivity
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Li Zefan
     

09 Sep, 2010

3 commits

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    gcc-4.6: kernel/*: Fix unused but set warnings
    mutex: Fix annotations to include it in kernel-locking docbook
    pid: make setpgid() system call use RCU read-side critical section
    MAINTAINERS: Add RCU's public git tree

    Linus Torvalds
     
  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    perf, x86: Try to handle unknown nmis with an enabled PMU
    perf, x86: Fix handle_irq return values
    perf, x86: Fix accidentally ack'ing a second event on intel perf counter
    oprofile, x86: fix init_sysfs() function stub
    lockup_detector: Sync touch_*_watchdog back to old semantics
    tracing: Fix a race in function profile
    oprofile, x86: fix init_sysfs error handling
    perf_events: Fix time tracking for events with pid != -1 and cpu != -1
    perf: Initialize callchain roots' children hits
    oprofile: fix crash when accessing freed task structs

    Linus Torvalds
     
  • Reading the file set_ftrace_filter does three things.

    1) shows whether or not filters are set for the function tracer
    2) shows what functions are set for the function tracer
    3) shows what triggers are set on any functions

    Item 3 is independent of items 1 and 2.

    This file currently works as a state machine and, as you read it, it
    may change state. But this assumption breaks when you use lseek() on
    the file: the state machine gets out of sync and t_show() may use
    the wrong pointer, causing a kernel oops.

    Luckily, this will only kill the app that does the lseek, but the app
    dies while holding a mutex. This prevents anyone else from using the
    set_ftrace_filter file (or any other function tracing file for that matter).

    A real fix for this is to rewrite the code, but that is too much for
    a -rc release or stable. This patch simply disables llseek on the
    set_ftrace_filter file for now; the proper fix can go into the next
    major release.
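
    The stopgap amounts to one line in the file_operations; roughly:

    static const struct file_operations ftrace_filter_fops = {
            .open    = ftrace_filter_open,
            .read    = seq_read,
            .write   = ftrace_filter_write,
            .llseek  = no_llseek,           /* was ftrace_regex_lseek */
            .release = ftrace_filter_release,
    };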

    Reported-by: Robert Swiecki
    Cc: Chris Wright
    Cc: Tavis Ormandy
    Cc: Eugene Teo
    Cc: vendor-sec@lst.de
    Cc:
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

08 Sep, 2010

4 commits

  • Check whether the argument name is invalid (i.e., not a C-like
    symbol name). This keeps the event format simple.
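
    Such a validator is a few lines of C; a sketch (close in spirit to
    the upstream helper):

    /* accept only [A-Za-z_][A-Za-z0-9_]* as an argument name */
    static int is_good_name(const char *name)
    {
            if (!isalpha(*name) && *name != '_')
                    return 0;
            while (*++name != '\0') {
                    if (!isalpha(*name) && !isdigit(*name) && *name != '_')
                            return 0;
            }
            return 1;
    }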

    Reported-by: Srikar Dronamraju
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    LKML-Reference:
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     
  • Set "argN" name for each argument automatically if it has no specified name.
    Since dynamic trace event(kprobe_events) accepts special characters for its
    argument, its format can show those special characters (e.g. '$', '%', '+').
    However, perf can't parse those format because of the character (especially
    '%') mess up the format. This sets "argX" name for those arguments if user
    omitted the argument names.

    E.g.
    # echo 'p do_fork %ax IP=%ip $stack' > tracing/kprobe_events
    # cat tracing/kprobe_events
    p:kprobes/p_do_fork_0 do_fork arg1=%ax IP=%ip arg3=$stack

    Reported-by: Srikar Dronamraju
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    LKML-Reference:
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     
  • Fix a memory leak which happens when a field name conflicts with
    another. In the error case, free_trace_probe() frees all arguments
    up to nr_args, so this increments nr_args at the beginning of the
    loop instead of at the end.
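
    The pattern generalizes: when an error path frees "everything up to
    count", the count must already cover the element being initialized,
    or a half-initialized element leaks. A stand-alone illustration of
    the same idiom (hypothetical names):

    #include <stdlib.h>
    #include <string.h>

    struct probe { int nr_args; char *args[8]; };

    static void free_probe(struct probe *p)
    {
            /* mirrors free_trace_probe(): frees all slots up to nr_args */
            for (int i = 0; i < p->nr_args; i++)
                    free(p->args[i]);       /* free(NULL) is a no-op */
    }

    static int parse_args(struct probe *p, char **names, int n)
    {
            for (int i = 0; i < n; i++) {
                    p->nr_args++;           /* count first ... */
                    p->args[i] = strdup(names[i]);
                    if (!p->args[i])
                            return -1;      /* ... so cleanup sees this slot */
            }
            return 0;
    }

    int main(void)
    {
            char *names[] = { "arg1", "arg2" };
            struct probe p = { 0 };

            if (parse_args(&p, names, 2) < 0)
                    ;       /* caller still cleans up the partial probe */
            free_probe(&p);
            return 0;
    }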

    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Mathieu Desnoyers
    LKML-Reference:
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Arnaldo Carvalho de Melo

    Masami Hiramatsu
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: use zalloc_cpumask_var() for gcwq->mayday_mask
    workqueue: fix GCWQ_DISASSOCIATED initialization
    workqueue: Add a workqueue chapter to the tracepoint docbook
    workqueue: fix cwq->nr_active underflow
    workqueue: improve destroy_workqueue() debuggability
    workqueue: mark lock acquisition on worker_maybe_bind_and_lock()
    workqueue: annotate lock context change
    workqueue: free rescuer on destroy_workqueue

    Linus Torvalds
     

01 Sep, 2010

3 commits

  • During my rewrite, the semantics of touch_nmi_watchdog and
    touch_softlockup_watchdog changed enough to break some drivers
    (mostly over preemptable regions).

    These are cases where long delays on one CPU (due to
    print_delay for example) can cause long delays on other
    CPUs - so we must 'touch' the nmi_watchdog flag of those
    other CPUs as well.

    This change brings those touch_*_watchdog() functions back in line
    with how they used to work.
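
    In sketch form, a touch now reaches every CPU's flag rather than
    only the local one (per-cpu variable names illustrative):

    void touch_nmi_watchdog(void)
    {
            unsigned int cpu;

            /*
             * A long delay on this CPU can stall the others too, so
             * clear the pending-warning state for all of them.
             */
            for_each_present_cpu(cpu)
                    per_cpu(watchdog_nmi_touch, cpu) = true;

            touch_softlockup_watchdog();
    }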

    Signed-off-by: Don Zickus
    Acked-by: Cyrill Gorcunov
    Cc: peterz@infradead.org
    Cc: fweisbec@gmail.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Don Zickus
     
  • [ 23.584719]
    [ 23.584720] ===================================================
    [ 23.585059] [ INFO: suspicious rcu_dereference_check() usage. ]
    [ 23.585176] ---------------------------------------------------
    [ 23.585176] kernel/pid.c:419 invoked rcu_dereference_check() without protection!
    [ 23.585176]
    [ 23.585176] other info that might help us debug this:
    [ 23.585176]
    [ 23.585176]
    [ 23.585176] rcu_scheduler_active = 1, debug_locks = 1
    [ 23.585176] 1 lock held by rc.sysinit/728:
    [ 23.585176] #0: (tasklist_lock){.+.+..}, at: [] sys_setpgid+0x5f/0x193
    [ 23.585176]
    [ 23.585176] stack backtrace:
    [ 23.585176] Pid: 728, comm: rc.sysinit Not tainted 2.6.36-rc2 #2
    [ 23.585176] Call Trace:
    [ 23.585176] [] lockdep_rcu_dereference+0x99/0xa2
    [ 23.585176] [] find_task_by_pid_ns+0x50/0x6a
    [ 23.585176] [] find_task_by_vpid+0x1d/0x1f
    [ 23.585176] [] sys_setpgid+0x67/0x193
    [ 23.585176] [] system_call_fastpath+0x16/0x1b
    [ 24.959669] type=1400 audit(1282938522.956:4): avc: denied { module_request } for pid=766 comm="hwclock" kmod="char-major-10-135" scontext=system_u:system_r:hwclock_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclas

    It turns out that the setpgid() system call fails to enter an RCU
    read-side critical section before doing a PID-to-task_struct translation.
    This commit therefore does rcu_read_lock() before the translation, and
    also does rcu_read_unlock() after the last use of the returned pointer.
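
    The shape of the fix, with most of the syscall body elided (a
    sketch):

    SYSCALL_DEFINE2(setpgid, pid_t, pid, pid_t, pgid)
    {
            struct task_struct *p;
            int err;
            ...
            rcu_read_lock();        /* protects the PID -> task lookup */
            p = find_task_by_vpid(pid);
            ...
            /* last use of p happens before the unlock */
            rcu_read_unlock();
            return err;
    }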

    Reported-by: Andrew Morton
    Signed-off-by: Paul E. McKenney
    Acked-by: David Howells

    Paul E. McKenney
     
  • If someone disables function_profile while we are reading
    trace_stat/functionX, we can trigger this:

    divide error: 0000 [#1] PREEMPT SMP
    ...
    EIP is at function_stat_show+0x90/0x230
    ...

    This fix just takes the ftrace_profile_lock and checks if
    rec->counter is 0. If it's 0, we know the profile buffer
    has been reset.
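
    Roughly, the guarded section in function_stat_show() becomes (a
    sketch of the relevant hunk):

            mutex_lock(&ftrace_profile_lock);

            /* we raced with function_profile_reset() */
            if (unlikely(rec->counter == 0)) {
                    ret = -EBUSY;
                    goto out;
            }

            avg = rec->time;
            do_div(avg, rec->counter);      /* safe: counter != 0 */
            ...
    out:
            mutex_unlock(&ftrace_profile_lock);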

    Signed-off-by: Li Zefan
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Steven Rostedt

    Li Zefan
     

31 Aug, 2010

2 commits

  • alloc_mayday_mask() was using alloc_cpumask_var(), leaving
    gcwq->mayday_mask containing garbage after initialization on
    CONFIG_CPUMASK_OFFSTACK=y configurations. Combined with the
    previously fixed GCWQ_DISASSOCIATED initialization bug, this could
    make rescuers fall into an infinite loop trying to bind to an
    offline cpu.
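
    Conceptually the fix is a one-word switch to the zeroing allocator,
    so that an offstack mask starts out all-clear:

    /* before: with CONFIG_CPUMASK_OFFSTACK=y the bits are uninitialized */
    alloc_cpumask_var(&gcwq->mayday_mask, GFP_KERNEL);

    /* after: the allocated mask is guaranteed to be zeroed */
    zalloc_cpumask_var(&gcwq->mayday_mask, GFP_KERNEL);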

    Signed-off-by: Tejun Heo
    Reported-by: CAI Qian

    Tejun Heo
     
  • init_workqueues() incorrectly marks the gcwqs of all possible CPUs
    as associated. Combined with the mayday_mask initialization bug,
    this can make rescuers keep trying to bind to an offline gcwq
    indefinitely. Fix init_workqueues() so that GCWQ_DISASSOCIATED is
    cleared only on the gcwqs of online CPUs.

    Signed-off-by: Tejun Heo
    Reported-by: CAI Qian

    Tejun Heo
     

30 Aug, 2010

1 commit

  • Per-thread events with a cpu filter, i.e., cpu != -1, were not
    reporting correct timings when the thread never ran on the
    monitored cpu. The time enabled was reported as a negative
    value.

    This patch fixes the problem by updating tstamp_stopped,
    tstamp_running in event_sched_out() for events with filters and
    which are marked as INACTIVE.

    The function group_sched_out() is modified to systematically call
    into event_sched_out(), avoiding duplication of the timing
    adjustment code.

    With the patch, I now get:

    $ task_cpu -i -e unhalted_core_cycles,unhalted_core_cycles
    noploop 2 noploop for 2 seconds
    CPU0 0 unhalted_core_cycles (ena=1,991,136,594, run=0)
    CPU0 0 unhalted_core_cycles (ena=1,991,136,594, run=0)

    CPU1 0 unhalted_core_cycles (ena=1,991,136,594, run=0)
    CPU1 0 unhalted_core_cycles (ena=1,991,136,594, run=0)

    CPU2 0 unhalted_core_cycles (ena=1,991,136,594, run=0)
    CPU2 0 unhalted_core_cycles (ena=1,991,136,594, run=0)

    CPU3 4,747,990,931 unhalted_core_cycles (ena=1,991,136,594, run=1,991,136,594)
    CPU3 4,747,990,931 unhalted_core_cycles (ena=1,991,136,594, run=1,991,136,594)

    Signed-off-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Cc: paulus@samba.org
    Cc: davem@davemloft.net
    Cc: fweisbec@gmail.com
    Cc: perfmon2-devel@lists.sf.net
    Cc: eranian@google.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stephane Eranian
     

25 Aug, 2010

7 commits

  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    x86, tsc, sched: Recompute cyc2ns_offset's during resume from sleep states
    sched: Fix rq->clock synchronization when migrating tasks

    Linus Torvalds
     
  • save_stack_trace() stores the instruction pointer, not the
    function descriptor. On ppc64 the trace stack code currently
    dereferences the instruction pointer and shows 8 bytes of
    instructions in our backtraces:

    # cat /sys/kernel/debug/tracing/stack_trace
    Depth Size Location (26 entries)
    ----- ---- --------
    0) 5424 112 0x6000000048000004
    1) 5312 160 0x60000000ebad01b0
    2) 5152 160 0x2c23000041c20030
    3) 4992 240 0x600000007c781b79
    4) 4752 160 0xe84100284800000c
    5) 4592 192 0x600000002fa30000
    6) 4400 256 0x7f1800347b7407e0
    7) 4144 208 0xe89f0108f87f0070
    8) 3936 272 0xe84100282fa30000

    Since we aren't dealing with function descriptors, use %pS
    instead of %pF to fix it:

    # cat /sys/kernel/debug/tracing/stack_trace
    Depth Size Location (26 entries)
    ----- ---- --------
    0) 5424 112 ftrace_call+0x4/0x8
    1) 5312 160 .current_io_context+0x28/0x74
    2) 5152 160 .get_io_context+0x48/0xa0
    3) 4992 240 .cfq_set_request+0x94/0x4c4
    4) 4752 160 .elv_set_request+0x60/0x84
    5) 4592 192 .get_request+0x2d4/0x468
    6) 4400 256 .get_request_wait+0x7c/0x258
    7) 4144 208 .__make_request+0x49c/0x610
    8) 3936 272 .generic_make_request+0x390/0x434

    Signed-off-by: Anton Blanchard
    Cc: rostedt@goodmis.org
    Cc: fweisbec@gmail.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Anton Blanchard
     
  • cwq->nr_active is used to keep track of how many work items are active
    for the cpu workqueue, where 'active' is defined as either pending on
    global worklist or executing. This is used to implement the
    max_active limit and workqueue freezing. If a work item is queued
    after nr_active has already reached max_active, the work item doesn't
    increment nr_active and is put on the delayed queue and gets activated
    later as previous active work items retire.

    try_to_grab_pending(), which is used in the cancellation path,
    unconditionally decremented nr_active whether the work item being
    cancelled was currently active or delayed, so cancelling a delayed
    work item made nr_active underflow. This breaks max_active
    enforcement and triggers a BUG_ON() in destroy_workqueue() later on.

    This patch fixes the bug by adding a flag, WORK_STRUCT_DELAYED,
    which is set while a work item is on the delayed list, and by making
    try_to_grab_pending() decrement nr_active only if the work item is
    currently active.

    The addition of the flag enlarges the cwq alignment to 256 bytes,
    which is getting a bit too large. It's scheduled to be reduced back
    to 128 bytes by merging WORK_STRUCT_PENDING and WORK_STRUCT_CWQ in
    the next devel cycle.
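
    Simplified, the cancellation path then decides like this (a sketch,
    not the verbatim diff):

    /* in try_to_grab_pending(), after stealing the pending work item */
    if (!(*work_data_bits(work) & WORK_STRUCT_DELAYED))
            cwq->nr_active--;       /* only active items were counted */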

    Signed-off-by: Tejun Heo
    Reported-by: Johannes Berg

    Tejun Heo
     
  • …/git/tip/linux-2.6-tip

    * 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    watchdog: Don't throttle the watchdog
    tracing: Fix timer tracing

    Linus Torvalds
     
  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    mutex: Improve the scalability of optimistic spinning

    Linus Torvalds
     
  • sparse spotted that the kzalloc() in pm_qos_power_open() in Linus'
    current git tree had its parameters swapped. Fix this.
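
    For reference, kzalloc() takes the size first and the GFP flags
    second; the broken and fixed calls look like this (variable name
    illustrative):

    /* wrong: arguments swapped */
    req = kzalloc(GFP_KERNEL, sizeof(*req));

    /* right: kzalloc(size, gfp_flags) */
    req = kzalloc(sizeof(*req), GFP_KERNEL);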

    Signed-off-by: David Alan Gilbert
    Acked-by: mark gross
    Signed-off-by: Rafael J. Wysocki

    David Alan Gilbert
     
  • Now that the worklist is global, having works pending after wq
    destruction can easily lead to an oops, and destroy_workqueue() has
    several BUG_ON()s to catch these cases. Unfortunately, BUG_ON()
    doesn't tell much about how the work became pending after the final
    flush_workqueue().

    This patch adds WQ_DYING which is set before the final flush begins.
    If a work is requested to be queued on a dying workqueue,
    WARN_ON_ONCE() is triggered and the request is ignored. This clearly
    indicates which caller is trying to queue a work on a dying workqueue
    and keeps the system working in most cases.

    The locking rule comment is updated so that the 'I' rule includes
    modifying the field from the destruction path.
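
    The mechanism in sketch form (condensed from the description):

    /* destruction path: mark the workqueue before the final flush */
    void destroy_workqueue(struct workqueue_struct *wq)
    {
            wq->flags |= WQ_DYING;
            flush_workqueue(wq);
            ...
    }

    /* queueing path: complain once, ignore the request */
    static void __queue_work(unsigned int cpu, struct workqueue_struct *wq,
                             struct work_struct *work)
    {
            if (WARN_ON_ONCE(wq->flags & WQ_DYING))
                    return;
            ...
    }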

    Signed-off-by: Tejun Heo

    Tejun Heo
     

23 Aug, 2010

4 commits

  • worker_maybe_bind_and_lock() actually grabs gcwq->lock but was
    missing the proper annotation. Add it; this removes the following
    sparse warnings:

    kernel/workqueue.c:1214:13: warning: context imbalance in 'worker_maybe_bind_and_lock' - wrong count at exit
    arch/x86/include/asm/irqflags.h:44:9: warning: context imbalance in 'worker_rebind_fn' - unexpected unlock
    kernel/workqueue.c:1991:17: warning: context imbalance in 'rescuer_thread' - unexpected unlock
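
    The annotation itself is the standard sparse context marker on a
    function that returns with the lock held; roughly:

    static bool worker_maybe_bind_and_lock(struct worker *worker)
    __acquires(&gcwq->lock)
    {
            ...
    }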

    Signed-off-by: Namhyung Kim
    Signed-off-by: Tejun Heo

    Namhyung Kim
     
  • Some internal functions called within gcwq->lock context release
    and re-grab the lock but were missing proper annotations. Add them.

    Signed-off-by: Namhyung Kim
    Signed-off-by: Tejun Heo

    Namhyung Kim
     
  • There is a scalability issue in the current implementation of
    optimistic mutex spinning in the kernel. It was found on an 8-node,
    64-core Nehalem-EX system (HT mode).

    The intention of the optimistic mutex spin is to busy-wait on a
    mutex while its owner is running, in the hope that the mutex will be
    released soon and can be acquired without the acquiring thread going
    to sleep. However, with a large number of threads contending for the
    mutex, the mutex can be grabbed by one thread, then another, and so
    on, while we keep spinning, wasting cpu cycles and adding to the
    contention. One possible fix is to quit spinning and put the current
    thread on the wait-list if the mutex switches to a new owner while
    we spin, indicating heavy contention (see the patch included).
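
    In sketch form, the heuristic looks like this (simplified from the
    description; owner_is_running() is a stand-in helper, not the
    kernel's name):

    /*
     * Spin only while the same task owns the mutex. If ownership
     * moves to another task, report failure so the caller stops
     * spinning and queues itself on the wait-list instead.
     */
    static int mutex_spin_on_owner(struct mutex *lock,
                                   struct task_struct *owner)
    {
            while (lock->owner == owner) {
                    if (!owner_is_running(owner))
                            return 0;
                    cpu_relax();
            }
            /* owner changed: only worth continuing if the lock is free */
            return lock->owner == NULL;
    }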

    I did some testing on an 8-socket Nehalem-EX system with a total of
    64 cores. Using Ingo's test-mutex program that creates/deletes files
    with 256 threads (http://lkml.org/lkml/2006/1/8/50), I see the
    following speed up after putting in the mutex spin fix:

    ./mutex-test V 256 10
    Ops/sec
    2.6.34 62864
    With fix 197200

    Repeating the test with the Aim7 fserver workload, again there is a
    speed up with the fix:

    Jobs/min
    2.6.34 91657
    With fix 149325

    To look at the impact on the distribution of mutex acquisition time,
    I collected the mutex acquisition time on the Aim7 fserver workload
    with some instrumentation. The average acquisition time is reduced
    by 48% and the number of contentions is reduced by 32%.

    #contentions Time to acquire mutex (cycles)
    2.6.34 72973 44765791
    With fix 49210 23067129

    The histogram of mutex acquisition time is listed below. The
    acquisition time is in 2^bin cycles. We see that without the fix,
    the acquisition time is mostly around 2^26 cycles. With the fix, the
    distribution spreads out a lot more towards the lower cycles,
    starting from 2^13. However, there is an increase in the tail of the
    distribution with the fix, at 2^28 and 2^29 cycles. That seems a
    small price to pay for the reduced average acquisition time and for
    getting the cpu to do useful work.

    Mutex acquisition time distribution (acq time = 2^bin cycles):
    2.6.34 With Fix
    bin #occurrence % #occurrence %
    11 2 0.00% 120 0.24%
    12 10 0.01% 790 1.61%
    13 14 0.02% 2058 4.18%
    14 86 0.12% 3378 6.86%
    15 393 0.54% 4831 9.82%
    16 710 0.97% 4893 9.94%
    17 815 1.12% 4667 9.48%
    18 790 1.08% 5147 10.46%
    19 580 0.80% 6250 12.70%
    20 429 0.59% 6870 13.96%
    21 311 0.43% 1809 3.68%
    22 255 0.35% 2305 4.68%
    23 317 0.44% 916 1.86%
    24 610 0.84% 233 0.47%
    25 3128 4.29% 95 0.19%
    26 63902 87.69% 122 0.25%
    27 619 0.85% 286 0.58%
    28 0 0.00% 3536 7.19%
    29 0 0.00% 903 1.83%
    30 0 0.00% 0 0.00%

    I've done similar experiments with the 2.6.35 kernel on smaller
    boxes as well. One is a dual-socket Westmere box (12 cores total,
    with HT). The other is an old dual-socket Core 2 box (4 cores total,
    no HT).

    On the 12-core Westmere box, I see a 250% increase for Ingo's mutex-test
    program with my mutex patch but no significant difference in aim7's
    fserver workload.

    On the 4-core Core 2 box, the differences with the patch for both
    mutex-test and the aim7 fserver workload are negligible.

    So far, it seems like the patch has not caused regression on smaller
    systems.

    Signed-off-by: Tim Chen
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Frederic Weisbecker
    Cc: # .35.x
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Tim Chen
     
  • Stephane reported that when the machine locks up, the regular ticks,
    which are responsible for resetting the throttle count, stop too.

    Hence the NMI watchdog can end up being throttled before it reports
    on the locked-up state, and we end up being sad.

    Cure this by having the watchdog overflow reset its own throttle count.
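
    The cure is a single line at the top of the overflow handler;
    roughly:

    static void watchdog_overflow_callback(struct perf_event *event,
                                           int nmi,
                                           struct perf_sample_data *data,
                                           struct pt_regs *regs)
    {
            /*
             * The ticks that would normally reset the throttle count
             * may themselves be stuck, so never let this event throttle.
             */
            event->hw.interrupts = 0;
            ...
    }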

    Reported-by: Stephane Eranian
    Tested-by: Stephane Eranian
    Cc: Don Zickus
    Cc: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra