16 Dec, 2008

1 commit

  • When a cgroup is removed, it's unlinked from its parent's children list,
    but not actually freed until the last dentry on it is released (at which
    point cgrp->root->number_of_cgroups is decremented).

    Currently rebind_subsystems checks for the top cgroup's child list being
    empty in order to rebind subsystems into or out of a hierarchy - this can
    result in the set of subsystems bound to a hierarchy being changed while
    the hierarchy still contains a removed-but-not-freed cgroup.

    The simplest fix for this is to forbid remounts that change the set of
    subsystems on a hierarchy that has removed-but-not-freed cgroups. This
    bug can be reproduced via:

    mkdir /mnt/cg
    mount -t cgroup -o ns,freezer cgroup /mnt/cg
    mkdir /mnt/cg/foo
    sleep 1h < /mnt/cg/foo &
    rmdir /mnt/cg/foo
    mount -t cgroup -o remount,ns,devices,freezer cgroup /mnt/cg
    kill $!

    The above will only cause an oops in -mm, not in mainline, but the bug
    can cause a memory leak in mainline (and even an oops).
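
    The committed check is not quoted in this log; a minimal sketch of the
    idea, using the number_of_cgroups field mentioned above (placement in
    rebind_subsystems() is an assumption):

        /* refuse to change the subsystem set while removed-but-not-freed
         * cgroups still hold the count above the root cgroup itself */
        if (root->number_of_cgroups > 1)
            return -EBUSY;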

    Signed-off-by: Paul Menage
    Reviewed-by: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     

15 Dec, 2008

1 commit

  • This reverts commit 5b7dba4ff834259a5623e03a565748704a8fe449, which
    caused a regression in hibernate, reported and bisected by Fabio
    Comolli.

    This revert fixes

    http://bugzilla.kernel.org/show_bug.cgi?id=12155
    http://bugzilla.kernel.org/show_bug.cgi?id=12149

    Bisected-by: Fabio Comolli
    Requested-by: Rafael J. Wysocki
    Acked-by: Dave Kleikamp
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

11 Dec, 2008

4 commits

  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: CPU remove deadlock fix

    Linus Torvalds
     
  • Lee Schermerhorn noticed yesterday that I broke the mapping_writably_mapped
    test in 2.6.7! Bad bad bug, good good find.

    The i_mmap_writable count must be incremented for VM_SHARED (just as
    i_writecount is for VM_DENYWRITE, but while holding the i_mmap_lock)
    when dup_mmap() copies the vma for fork: it has its own more optimal
    version of __vma_link_file(), and I missed this out. So the count
    was later going down to 0 (dangerous) when one end unmapped, then
    wrapping negative (inefficient) when the other end unmapped.

    The only impact on x86 would have been that setting a mandatory lock on
    a file which has at some time been opened O_RDWR and mapped MAP_SHARED
    (but not necessarily PROT_WRITE) across a fork, might fail with -EAGAIN
    when it should succeed, or succeed when it should fail.

    But those architectures which rely on flush_dcache_page() to flush
    userspace modifications back into the page before the kernel reads it,
    may in some cases have skipped the flush after such a fork - though any
    repetitive test will soon wrap the count negative, in which case it will
    flush_dcache_page() unnecessarily.

    The fix would be a two-liner, but a mapping variable is added and a
    comment moved along the way.
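
    A sketch of the core of that fix in dup_mmap()'s file path, as described
    above (the surrounding copy loop is elided):

        if (file) {
            struct address_space *mapping = file->f_mapping;

            if (tmp->vm_flags & VM_DENYWRITE)
                atomic_dec(&file->f_path.dentry->d_inode->i_writecount);
            spin_lock(&mapping->i_mmap_lock);
            /* the missing piece: count writably shared mappings */
            if (tmp->vm_flags & VM_SHARED)
                mapping->i_mmap_writable++;
            /* ... link tmp into mapping->i_mmap as before ... */
            spin_unlock(&mapping->i_mmap_lock);
        }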

    Reported-by: Lee Schermerhorn
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked
    to my commit 966c8c12dc9e77f931e2281ba25d2f0244b06949 ("sprint_symbol():
    use less stack"), which exposed a bug in slub's list_locations():
    kallsyms_lookup() writes a 0 to namebuf[KSYM_NAME_LEN-1], but that was
    beyond the end of the page provided.

    The 100-byte slop which list_locations() allows at the end of the page
    looks roughly enough for all the other stuff it might print after the
    symbol before it checks again: so break out of the loop KSYM_SYMBOL_LEN
    earlier than before.

    Latencytop and ftrace are using KSYM_NAME_LEN buffers where they need
    KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer where
    it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies them.
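
    For illustration, the sizing fix at such a call site looks roughly like
    this (a sketch, not one of the exact call sites):

        char sym[KSYM_SYMBOL_LEN];  /* was KSYM_NAME_LEN: too small, since
                                     * sprint_symbol() may also append the
                                     * module name, offset and size */

        sprint_symbol(sym, address);
        seq_printf(m, "%s\n", sym);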

    [akpm@linux-foundation.org: ftrace.h needs module.h]
    Signed-off-by: Hugh Dickins
    Cc: Christoph Lameter
    Cc: Miles Lane
    Acked-by: Pekka Enberg
    Acked-by: Steven Rostedt
    Acked-by: Frederic Weisbecker
    Cc: Rusty Russell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Running kmemtraced, which uses splice() on relayfs, causes a hard lock on
    x86-64 SMP. As described by Tom Zanussi:

    It looks like you hit the same problem as described here:

    commit 8191ecd1d14c6914c660dfa007154860a7908857

    splice: fix infinite loop in generic_file_splice_read()

    relay uses the same loop but it never got noticed or fixed.
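
    The pattern of that earlier fix, transplanted to relay's loop, looks
    roughly like this (a sketch; subbuf_splice_actor() is relay's existing
    helper, other details elided):

        while (len) {
            ret = subbuf_splice_actor(in, ppos, pipe, len, flags,
                                      &nonpad_ret);
            if (ret < 0)
                break;
            else if (!ret) {
                /* no progress: stop instead of spinning forever */
                if (flags & SPLICE_F_NONBLOCK)
                    ret = -EAGAIN;
                break;
            }
            /* ... advance *ppos, len and spliced as before ... */
        }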

    Cc: Mathieu Desnoyers
    Tested-by: Pekka Enberg
    Signed-off-by: Tom Zanussi
    Signed-off-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Zanussi
     

10 Dec, 2008

1 commit

  • Impact: fix possible deadlock in CPU hot-remove path

    This patch fixes a possible deadlock scenario in the CPU remove path.
    migration_call grabs rq->lock, then wakes up everything on rq->migration_queue
    with the lock held. Then one of the tasks on the migration queue ends up
    calling tg_shares_up which then also tries to acquire the same rq->lock.

    [c000000058eab2e0] c000000000502078 ._spin_lock_irqsave+0x98/0xf0
    [c000000058eab370] c00000000008011c .tg_shares_up+0x10c/0x20c
    [c000000058eab430] c00000000007867c .walk_tg_tree+0xc4/0xfc
    [c000000058eab4d0] c0000000000840c8 .try_to_wake_up+0xb0/0x3c4
    [c000000058eab590] c0000000000799a0 .__wake_up_common+0x6c/0xe0
    [c000000058eab640] c00000000007ada4 .complete+0x54/0x80
    [c000000058eab6e0] c000000000509fa8 .migration_call+0x5fc/0x6f8
    [c000000058eab7c0] c000000000504074 .notifier_call_chain+0x68/0xe0
    [c000000058eab860] c000000000506568 ._cpu_down+0x2b0/0x3f4
    [c000000058eaba60] c000000000506750 .cpu_down+0xa4/0x108
    [c000000058eabb10] c000000000507e54 .store_online+0x44/0xa8
    [c000000058eabba0] c000000000396260 .sysdev_store+0x3c/0x50
    [c000000058eabc10] c0000000001a39b8 .sysfs_write_file+0x124/0x18c
    [c000000058eabcd0] c00000000013061c .vfs_write+0xd0/0x1bc
    [c000000058eabd70] c0000000001308a4 .sys_write+0x68/0x114
    [c000000058eabe30] c0000000000086b4 syscall_exit+0x0/0x40
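
    The fix is to pull the waiters off the queue under rq->lock but issue
    the completions only after the lock is dropped; a sketch based on the
    description above (details elided):

        struct migration_req *req;
        LIST_HEAD(list);

        spin_lock_irq(&rq->lock);
        list_splice_init(&rq->migration_queue, &list);
        spin_unlock_irq(&rq->lock);

        /* safe now: complete() may wake tasks that re-take rq->lock */
        while (!list_empty(&list)) {
            req = list_entry(list.next, struct migration_req, list);
            list_del_init(&req->list);
            complete(&req->done);
        }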

    Signed-off-by: Brian King
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Brian King
     

09 Dec, 2008

4 commits


05 Dec, 2008

3 commits


04 Dec, 2008

2 commits

  • It had been put there to mark the call of blkdev_put() that needed the
    proper argument propagated to it; a later patch in the same series did
    just that.

    Signed-off-by: Al Viro

    Al Viro
     
  • Impact: fix time warp bug

    Alex Shi, along with Yanmin Zhang have been noticing occasional time
    inconsistencies recently. Through their great diagnosis, they found that
    the xtime_nsec value used in update_wall_time was occasionally going
    negative. After looking through the code for awhile, I realized we have
    the possibility for an underflow when three conditions are met in
    update_wall_time():

    1) We have accumulated a second's worth of nanoseconds, so we
    incremented xtime.tv_sec and appropriately decremented xtime_nsec.
    (This doesn't cause xtime_nsec to go negative, but it can cause it
    to be small.)

    2) The remaining offset value is large, but just slightly less than
    cycle_interval.

    3) clocksource_adjust() is speeding up the clock, causing a
    corrective amount (compensating for the increase in the multiplier
    being multiplied against the unaccumulated offset value) to be
    subtracted from xtime_nsec.

    This can cause xtime_nsec to underflow.

    Unfortunately, since we notify the NTP subsystem via second_overflow()
    whenever we accumulate a full second, and this affects the error
    accumulation that has already occurred, we cannot simply revert the
    accumulated second from xtime nor move the second accumulation to after
    the clocksource_adjust call without a change in behavior.

    This leaves us with (at least) two options:

    1) Simply return from clocksource_adjust() without making a change if we
    notice the adjustment would cause xtime_nsec to go negative.

    This would work, but I'm concerned that if a large adjustment was needed
    (due to the error being large), it may be possible to get stuck with an
    ever increasing error that becomes too large to correct (since it may
    always force xtime_nsec negative). This may just be paranoia on my part.

    2) Catch xtime_nsec if it is negative, then add back the amount by
    which it went negative to both xtime_nsec and the error.

    This second method is consistent with how we've handled earlier rounding
    issues, and also has the benefit that the error being added is always in
    the opposite direction and always equal to or smaller than the
    correction being applied. So the risk of a corner case where things get
    out of control is lessened.
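
    A sketch of option 2 (xtime_nsec as referenced above; the error and
    shift field names are assumptions based on the clocksource code):

        /* if xtime_nsec went negative, push it back up to zero and
         * charge the difference to the error term, keeping the
         * correction one-sided */
        if ((s64)clock->xtime_nsec < 0) {
            s64 neg = -(s64)clock->xtime_nsec;
            clock->xtime_nsec = 0;
            clock->error += neg << (NTP_SCALE_SHIFT - clock->shift);
        }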

    This patch fixes bug 11970, as tested by Yanmin Zhang
    http://bugzilla.kernel.org/show_bug.cgi?id=11970

    Reported-by: alex.shi@intel.com
    Signed-off-by: John Stultz
    Acked-by: "Zhang, Yanmin"
    Tested-by: "Zhang, Yanmin"
    Signed-off-by: Ingo Molnar

    john stultz
     

03 Dec, 2008

1 commit


02 Dec, 2008

2 commits

  • The description for 'D' was missing in the comment... (causing me a
    minute of WTF followed by looking at more of the code)

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arjan van de Ven
     
  • It has been thought that the per-user file descriptor limit would also
    limit the resources that a normal user can request via the epoll
    interface. Vegard Nossum reported a very simple program (a modified
    version attached) that lets a normal user request a pretty large amount
    of kernel memory, well within its maximum number of fds. To solve this
    problem, default limits are now imposed, and /proc based configuration
    has been introduced. A new directory has been created, named
    /proc/sys/fs/epoll/, and inside it there are two configuration points:

    max_user_instances = Maximum number of devices - per user

    max_user_watches = Maximum number of "watched" fds - per user

    The current default for "max_user_watches" limits the memory used by
    epoll to store "watches" to 1/32 of the amount of low RAM. As an
    example, a 256MB 32-bit machine will have "max_user_watches" set to
    roughly 90000. That should be enough to not break existing heavy epoll
    users. The default value for "max_user_instances" is set to 128, which
    should be enough too.

    This also changes the userspace-visible behavior, because a new error
    code can now come out of EPOLL_CTL_ADD (-ENOSPC). The EMFILE from
    epoll_create() was already listed, so that should be ok.
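
    A sketch of how the new watch limit is enforced at EPOLL_CTL_ADD time
    (field and variable names are assumptions):

        /* in ep_insert(): fail with -ENOSPC once the calling user has
         * used up the configured number of watches */
        if (unlikely(atomic_read(&ep->user->epoll_watches) >=
                     max_user_watches))
            return -ENOSPC;
        atomic_inc(&ep->user->epoll_watches);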

    [akpm@linux-foundation.org: use get_current_user()]
    Signed-off-by: Davide Libenzi
    Cc: Michael Kerrisk
    Cc:
    Cc: Cyrill Gorcunov
    Reported-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davide Libenzi
     

01 Dec, 2008

6 commits


30 Nov, 2008

2 commits

  • Regarding the bug addressed in:

    4cd4262: sched: prevent divide by zero error in cpu_avg_load_per_task

    Linus points out that the fix is not complete:

    > There's nothing that keeps gcc from deciding not to reload
    > rq->nr_running.
    >
    > Of course, in _practice_, I don't think gcc ever will (if it decides
    > that it will spill, gcc is likely going to decide that it will
    > literally spill the local variable to the stack rather than decide to
    > reload off the pointer), but it's a valid compiler optimization, and
    > it even has a name (rematerialization).
    >
    > So I suspect that your patch does fix the bug, but it still leaves the
    > fairly unlikely _potential_ for it to re-appear at some point.
    >
    > We have ACCESS_ONCE() as a macro to guarantee that the compiler
    > doesn't rematerialize a pointer access. That also would clarify
    > the fact that we access something unsafe outside a lock.

    So make sure our nr_running value is immutable and cannot change
    after we check it for nonzero.
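
    With that, the function reads roughly as follows (a sketch):

        static unsigned long cpu_avg_load_per_task(int cpu)
        {
            struct rq *rq = cpu_rq(cpu);
            /* read nr_running exactly once; the compiler may not
             * rematerialize the load from rq->nr_running */
            unsigned long nr_running = ACCESS_ONCE(rq->nr_running);

            if (nr_running)
                rq->avg_load_per_task = rq->load.weight / nr_running;
            else
                rq->avg_load_per_task = 0;

            return rq->avg_load_per_task;
        }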

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • This warning:

    kernel/cpuset.c: In function ‘generate_sched_domains’:
    kernel/cpuset.c:588: warning: ‘ndoms’ may be used uninitialized in this function

    triggers because GCC does not recognize that ndoms stays uninitialized
    only if doms is NULL - but that flow is covered at the end of
    generate_sched_domains().

    Help out GCC by initializing this variable to 0. (that's prudent anyway)
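
    In code, the change is just (a sketch):

        int ndoms = 0;  /* stays 0 only when doms == NULL, a path that is
                         * handled at the end of generate_sched_domains() */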

    Also, this function needs a splitup and code-flow simplification: at
    160 lines it is clearly too long.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

27 Nov, 2008

2 commits

  • Impact: fix divide by zero crash in scheduler rebalance irq

    While testing the branch profiler, I hit this crash:

    divide error: 0000 [#1] PREEMPT SMP
    [...]
    RIP: 0010:[] [] cpu_avg_load_per_task+0x50/0x7f
    [...]
    Call Trace:
    [] find_busiest_group+0x3e5/0xcaa
    [] rebalance_domains+0x2da/0xa21
    [] ? find_next_bit+0x1b2/0x1e6
    [] run_rebalance_domains+0x112/0x19f
    [] __do_softirq+0xa8/0x232
    [] call_softirq+0x1c/0x3e
    [] do_softirq+0x94/0x1cd
    [] irq_exit+0x6b/0x10e
    [] smp_apic_timer_interrupt+0xd3/0xff
    [] apic_timer_interrupt+0x13/0x20

    The code for cpu_avg_load_per_task has:

        if (rq->nr_running)
            rq->avg_load_per_task = rq->load.weight / rq->nr_running;

    The runqueue lock is not held here, and there is nothing that prevents
    rq->nr_running from going to zero after it passes the if condition.

    The branch profiler simply made the race window bigger.

    This patch saves off the rq->nr_running to a local variable and uses that
    for both the condition and the division.
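
    That is, roughly (a sketch; the ACCESS_ONCE() follow-up above hardens
    it further):

        unsigned long nr_running = rq->nr_running;  /* snapshot once */

        if (nr_running)
            rq->avg_load_per_task = rq->load.weight / nr_running;
        else
            rq->avg_load_per_task = 0;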

    Signed-off-by: Steven Rostedt
    Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     
  • Impact: prevent unnecessary stack recursion

    If the resched flag was set before we entered, then don't reschedule.

    Signed-off-by: Lai Jiangshan
    Acked-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Lai Jiangshan
     

24 Nov, 2008

2 commits

  • Since CLOCK_PROCESS_CPUTIME_ID is in fact translated to -6, the switch
    statement in cpu_clock_sample_group() must first mask off the irrelevant
    bits, similar to cpu_clock_sample().
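
    A sketch of the change (CPUCLOCK_WHICH() is the existing helper that
    masks a clockid down to its clock index):

        /* was: switch (which_clock), which never matched for the
         * already-encoded CLOCK_PROCESS_CPUTIME_ID value of -6 */
        switch (CPUCLOCK_WHICH(which_clock)) {
        default:
            return -EINVAL;
        case CPUCLOCK_PROF:
            /* ... as before ... */
            break;
        }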

    Signed-off-by: Petr Tesarik
    Signed-off-by: Thomas Gleixner

    --
    posix-cpu-timers.c | 2 +-
    1 file changed, 1 insertion(+), 1 deletion(-)

    Petr Tesarik
     
  • Impact: fix mmiotrace overrun tracing

    When ftrace framework moved to use the ring buffer facility, the buffer
    overrun detection was broken after 2.6.27 by commit

    | commit 3928a8a2d98081d1bc3c0a84a2d70e29b90ecf1c
    | Author: Steven Rostedt
    | Date: Mon Sep 29 23:02:41 2008 -0400
    |
    | ftrace: make work with new ring buffer
    |
    | This patch ports ftrace over to the new ring buffer.

    The detection is now fixed by using the ring buffer API.

    When mmiotrace detects a buffer overrun, it will report the number of
    lost events. People reading an mmiotrace log must know if something was
    missed, otherwise the data may not make sense.
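
    A sketch of the detection using the ring buffer API (assuming the
    ring_buffer_overruns() accessor; the surrounding mmiotrace bookkeeping
    is elided):

        unsigned long lost = 0;
        unsigned long over = ring_buffer_overruns(iter->tr->buffer);

        if (over > prev_overruns)
            lost = over - prev_overruns;  /* events dropped since the
                                           * last check: report these */
        prev_overruns = over;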

    Signed-off-by: Pekka Paalanen
    Acked-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Pekka Paalanen
     

23 Nov, 2008

1 commit


21 Nov, 2008

3 commits

  • Impact: prettify /proc/lockdep_info

    It just feels odd that not all lines of the lockdep info are aligned.

    Signed-off-by: Li Zefan
    Acked-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • Impact: make output of stack_trace complete if buffer overruns

    When the read buffer overruns, the output of stack_trace isn't
    complete.

    When printing records with seq_printf in t_show, if the read buffer is
    overrun by the current record, then this record won't be printed to
    user space through the read buffer; it is simply dropped by this
    printing pass.

    On the next printing, t_start should return the "*pos"th record, which
    is the one dropped by the previous pass, but it just returns the
    (m->private + *pos)th record.

    Here we use the more sane method of implementing seq_operations that
    can be found elsewhere in kernel code. Thus we needn't initialize
    m->private.
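
    A sketch of that conventional pattern against a plain array (names
    assumed): ->start() returns the *pos'th record itself, so a record
    dropped because of an overrun is simply re-emitted on the next read:

        static void *t_start(struct seq_file *m, loff_t *pos)
        {
            return (*pos < nr_entries) ? &entries[*pos] : NULL;
        }

        static void *t_next(struct seq_file *m, void *v, loff_t *pos)
        {
            (*pos)++;
            return t_start(m, pos);
        }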

    About testing: it's not easy to overrun the read buffer, but we can use
    seq_printf to print extra padding bytes in t_show; then it's easy to
    check whether or not records are lost.

    This commit has been tested under both the overrun and non-overrun
    conditions.

    Signed-off-by: Liming Wang
    Acked-by: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Liming Wang
     
  • …nel/git/tip/linux-2.6-tip

    * 'tracing-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    ftrace: fix dyn ftrace filter selection
    ftrace: make filtered functions effective on setting
    ftrace: fix set_ftrace_filter
    trace: introduce missing mutex_unlock()
    tracing: kernel/trace/trace.c: introduce missing kfree()

    Linus Torvalds
     

20 Nov, 2008

5 commits

  • Try this, and you'll get oops immediately:
    # cd Documentation/accounting/
    # gcc -o getdelays getdelays.c
    # mount -t cgroup -o debug xxx /mnt
    # ./getdelays -C /mnt/tasks

    This happens because a normal file's dentry->d_fsdata is a pointer to a
    struct cftype, not a struct cgroup.

    After the patch, it returns EINVAL if we try to get cgroupstats
    from a normal file.
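
    A minimal sketch of the validation (the exact check in the patch may
    differ):

        /* only cgroup directories carry a struct cgroup in d_fsdata;
         * for a control file it is a struct cftype, so reject those */
        if (!S_ISDIR(dentry->d_inode->i_mode))
            return -EINVAL;
        cgrp = dentry->d_fsdata;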

    Cc: Balbir Singh
    Signed-off-by: Li Zefan
    Acked-by: Paul Menage
    Cc: [2.6.25.x, 2.6.26.x, 2.6.27.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • sprint_symbol(), itself used when dumping stacks, has been wasting 128
    bytes of stack: lookup the symbol directly into the buffer supplied by the
    caller, instead of using a locally declared namebuf.

    I believe the name != buffer strcpy() is obsolete: the design here dates
    from when module symbol lookup pointed into a supposedly const but sadly
    volatile table; nowadays it copies, but an uncalled strcpy() looks better
    here than the risk of a recursive BUG_ON().
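
    The resulting function, roughly (a sketch following the description
    above):

        int sprint_symbol(char *buffer, unsigned long address)
        {
            char *modname;
            const char *name;
            unsigned long offset, size;
            int len;

            /* resolve straight into the caller's buffer: no more
             * 128-byte namebuf on the stack */
            name = kallsyms_lookup(address, &size, &offset, &modname,
                                   buffer);
            if (!name)
                return sprintf(buffer, "0x%lx", address);

            if (name != buffer)
                strcpy(buffer, name);
            len = strlen(buffer);

            if (modname)
                len += sprintf(buffer + len, "+%#lx/%#lx [%s]",
                               offset, size, modname);
            else
                len += sprintf(buffer + len, "+%#lx/%#lx", offset, size);
            return len;
        }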

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • As Balbir pointed out, memcg's pre_destroy handler has potential deadlock.

    It has the following lock sequence:

        cgroup_mutex (cgroup_rmdir)
          -> pre_destroy -> mem_cgroup_pre_destroy -> force_empty
            -> cpu_hotplug.lock (lru_add_drain_all ->
                                 schedule_work ->
                                 get_online_cpus)

    But cpuset has the following:

        cpu_hotplug.lock (call notifier)
          -> cgroup_mutex (within notifier)

    So this lock ordering must be fixed.

    Considering how pre_destroy works, it's not necessary to hold
    cgroup_mutex while calling it.

    As a side effect, we no longer have to wait on this mutex while memcg's
    force_empty runs (which can take a long time when there are tons of
    pages).
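
    A sketch of the reordering in the rmdir path (iteration helper names
    assumed):

        /* run each subsystem's ->pre_destroy() before cgroup_mutex is
         * taken, so it may sleep on cpu_hotplug.lock without inverting
         * the lock order described above */
        for_each_subsys(cgrp->root, ss)
            if (ss->pre_destroy)
                ss->pre_destroy(ss, cgrp);

        mutex_lock(&cgroup_mutex);
        /* ... the rest of cgroup_rmdir() ... */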

    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • After adding a node into the machine, top cpuset's mems isn't updated.

    By reviewing the code, we found that the update function

    cpuset_track_online_nodes()

    was invoked after node_states[N_ONLINE] changes. That is wrong because
    N_ONLINE just means the node has a pgdat; when a node has (or gains)
    memory, we use N_HIGH_MEMORY. So we should invoke the update function
    after node_states[N_HIGH_MEMORY] changes, just as the introducing
    commit intended.

    This patch fixes it, using a memory hotplug notifier instead of a
    direct call to cpuset_track_online_nodes(), as sketched below.
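
    A sketch of the notifier form (hotplug_memory_notifier() is the
    existing registration helper; the priority value and body details are
    assumptions):

        static int cpuset_track_online_nodes(struct notifier_block *self,
                                             unsigned long action,
                                             void *arg)
        {
            cgroup_lock();
            switch (action) {
            case MEM_ONLINE:
            case MEM_OFFLINE:
                /* follow the nodes that actually have memory */
                top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
                break;
            }
            cgroup_unlock();
            return NOTIFY_OK;
        }

        /* registered at init time: */
        hotplug_memory_notifier(cpuset_track_online_nodes, 10);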

    Signed-off-by: Miao Xie
    Acked-by: Yasunori Goto
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Linus Torvalds

    Miao Xie
     
  • Introduce a new accept4() system call. The addition of this system call
    matches analogous changes in 2.6.27 (dup3(), eventfd2(), signalfd4(),
    inotify_init1(), epoll_create1(), pipe2()) which added new system calls
    that differed from analogous traditional system calls in adding a flags
    argument that can be used to access additional functionality.

    The accept4() system call is exactly the same as accept(), except that
    it adds a flags bit-mask argument. Two flags are initially implemented.
    (Most of the new system calls in 2.6.27 also had both of these flags.)

    SOCK_CLOEXEC causes the close-on-exec (FD_CLOEXEC) flag to be enabled
    for the new file descriptor returned by accept4(). This is a useful
    security feature to avoid leaking information in a multithreaded
    program where one thread is doing an accept() at the same time as
    another thread is doing a fork() plus exec(). More details here:
    http://udrepper.livejournal.com/20407.html ("Secure File Descriptor
    Handling", Ulrich Drepper).

    The other flag is SOCK_NONBLOCK, which causes the O_NONBLOCK flag
    to be enabled on the new open file description created by accept4().
    (This flag is merely a convenience, saving the use of additional
    fcntl(F_GETFL) and fcntl(F_SETFL) calls to achieve the same result.)

    Here's a test program. Works on x86-32. Should work on x86-64, but
    I (mtk) don't have a system to hand to test with.

    It tests accept4() with each of the four possible combinations of
    SOCK_CLOEXEC and SOCK_NONBLOCK set/clear in 'flags', and verifies
    that the appropriate flags are set on the file descriptor/open file
    description returned by accept4().

    I tested Ulrich's patch in this thread by applying against 2.6.28-rc2,
    and it passes according to my test program.

    /* test_accept4.c

    Copyright (C) 2008, Linux Foundation, written by Michael Kerrisk

    Licensed under the GNU GPLv2 or later.
    */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/syscall.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    #define PORT_NUM 33333

    #define die(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)

    /**********************************************************************/

    /* The following is what we need until glibc gets a wrapper for
    accept4() */

    /* Flags for socket(), socketpair(), accept4() */
    #ifndef SOCK_CLOEXEC
    #define SOCK_CLOEXEC O_CLOEXEC
    #endif
    #ifndef SOCK_NONBLOCK
    #define SOCK_NONBLOCK O_NONBLOCK
    #endif

    #ifdef __x86_64__
    #define SYS_accept4 288
    #elif __i386__
    #define USE_SOCKETCALL 1
    #define SYS_ACCEPT4 18
    #else
    #error "Sorry -- don't know the syscall # on this architecture"
    #endif

    static int
    accept4(int fd, struct sockaddr *sockaddr, socklen_t *addrlen, int flags)
    {
    #if USE_SOCKETCALL
        long args[6];
    #endif

        printf("Calling accept4(): flags = %x", flags);
        if (flags != 0) {
            printf(" (");
            if (flags & SOCK_CLOEXEC)
                printf("SOCK_CLOEXEC");
            if ((flags & SOCK_CLOEXEC) && (flags & SOCK_NONBLOCK))
                printf(" ");
            if (flags & SOCK_NONBLOCK)
                printf("SOCK_NONBLOCK");
            printf(")");
        }
        printf("\n");

    #if USE_SOCKETCALL
        args[0] = fd;
        args[1] = (long) sockaddr;
        args[2] = (long) addrlen;
        args[3] = flags;

        return syscall(SYS_socketcall, SYS_ACCEPT4, args);
    #else
        return syscall(SYS_accept4, fd, sockaddr, addrlen, flags);
    #endif
    }

    /**********************************************************************/

    static int
    do_test(int lfd, struct sockaddr_in *conn_addr,
            int closeonexec_flag, int nonblock_flag)
    {
        int connfd, acceptfd;
        int fdf, flf, fdf_pass, flf_pass;
        struct sockaddr_in claddr;
        socklen_t addrlen;

        printf("=======================================\n");

        connfd = socket(AF_INET, SOCK_STREAM, 0);
        if (connfd == -1)
            die("socket");
        if (connect(connfd, (struct sockaddr *) conn_addr,
                    sizeof(struct sockaddr_in)) == -1)
            die("connect");

        addrlen = sizeof(struct sockaddr_in);
        acceptfd = accept4(lfd, (struct sockaddr *) &claddr, &addrlen,
                           closeonexec_flag | nonblock_flag);
        if (acceptfd == -1) {
            perror("accept4()");
            close(connfd);
            return 0;
        }

        fdf = fcntl(acceptfd, F_GETFD);
        if (fdf == -1)
            die("fcntl:F_GETFD");
        fdf_pass = ((fdf & FD_CLOEXEC) != 0) ==
                   ((closeonexec_flag & SOCK_CLOEXEC) != 0);
        printf("Close-on-exec flag is %sset (%s); ",
               (fdf & FD_CLOEXEC) ? "" : "not ",
               fdf_pass ? "OK" : "failed");

        flf = fcntl(acceptfd, F_GETFL);
        if (flf == -1)
            die("fcntl:F_GETFL");
        flf_pass = ((flf & O_NONBLOCK) != 0) ==
                   ((nonblock_flag & SOCK_NONBLOCK) != 0);
        printf("nonblock flag is %sset (%s)\n",
               (flf & O_NONBLOCK) ? "" : "not ",
               flf_pass ? "OK" : "failed");

        close(acceptfd);
        close(connfd);

        printf("Test result: %s\n", (fdf_pass && flf_pass) ? "PASS" : "FAIL");
        return fdf_pass && flf_pass;
    }

    static int
    create_listening_socket(int port_num)
    {
        struct sockaddr_in svaddr;
        int lfd;
        int optval;

        memset(&svaddr, 0, sizeof(struct sockaddr_in));
        svaddr.sin_family = AF_INET;
        svaddr.sin_addr.s_addr = htonl(INADDR_ANY);
        svaddr.sin_port = htons(port_num);

        lfd = socket(AF_INET, SOCK_STREAM, 0);
        if (lfd == -1)
            die("socket");

        optval = 1;
        if (setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &optval,
                       sizeof(optval)) == -1)
            die("setsockopt");

        if (bind(lfd, (struct sockaddr *) &svaddr,
                 sizeof(struct sockaddr_in)) == -1)
            die("bind");

        if (listen(lfd, 5) == -1)
            die("listen");

        return lfd;
    }

    int
    main(int argc, char *argv[])
    {
        struct sockaddr_in conn_addr;
        int lfd;
        int port_num;
        int passed;

        passed = 1;

        port_num = (argc > 1) ? atoi(argv[1]) : PORT_NUM;

        memset(&conn_addr, 0, sizeof(struct sockaddr_in));
        conn_addr.sin_family = AF_INET;
        conn_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        conn_addr.sin_port = htons(port_num);

        lfd = create_listening_socket(port_num);

        if (!do_test(lfd, &conn_addr, 0, 0))
            passed = 0;
        if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, 0))
            passed = 0;
        if (!do_test(lfd, &conn_addr, 0, SOCK_NONBLOCK))
            passed = 0;
        if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, SOCK_NONBLOCK))
            passed = 0;

        close(lfd);

        exit(passed ? EXIT_SUCCESS : EXIT_FAILURE);
    }

    [mtk.manpages@gmail.com: rewrote changelog, updated test program]
    Signed-off-by: Ulrich Drepper
    Tested-by: Michael Kerrisk
    Acked-by: Michael Kerrisk
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ulrich Drepper