13 Jan, 2012

5 commits

  • This patch adds a lightweight sync migrate mode, MIGRATE_SYNC_LIGHT,
    that avoids writing back pages to backing storage. Async compaction
    maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
    For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
    used.

    This avoids sync compaction stalling for an excessive length of time,
    particularly when copying files to a USB stick where there might be a
    large number of dirty pages backed by a filesystem that does not support
    ->writepages.
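
    For reference, the resulting three-way split looks roughly like the enum
    below (a sketch of what include/linux/migrate_mode.h carries; the
    comments are illustrative rather than the actual kernel comments):

    enum migrate_mode {
            MIGRATE_ASYNC,          /* async compaction: never block */
            MIGRATE_SYNC_LIGHT,     /* may block, but not on ->writepage */
            MIGRATE_SYNC,           /* may block and write back (hotplug) */
    };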

    [aarcange@redhat.com: This patch is heavily based on Andrea's work]
    [akpm@linux-foundation.org: fix fs/nfs/write.c build]
    [akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Asynchronous compaction is used when allocating transparent hugepages to
    avoid blocking for long periods of time. Due to reports of stalling,
    there was a debate on disabling synchronous compaction but this severely
    impacted allocation success rates. Part of the reason was that many dirty
    pages are skipped in asynchronous compaction by the following check:

    if (PageDirty(page) && !sync &&
        mapping->a_ops->migratepage != migrate_page)
            rc = -EBUSY;

    This skips over all mapping aops using buffer_migrate_page() even though
    it is possible to migrate some of these pages without blocking. This
    patch updates the ->migratepage callback with a "sync" parameter. It is
    the responsibility of the callback to fail gracefully if migration would
    block.
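
    A minimal sketch of what a mode-aware callback might look like under
    that contract; example_migratepage() and would_block_to_migrate() are
    illustrative names, not part of the actual patch:

    static int example_migratepage(struct address_space *mapping,
                                   struct page *newpage, struct page *page,
                                   bool sync)
    {
            /* If completing the migration would mean sleeping on buffer
             * locks or writeback and the caller is async, fail gracefully
             * so migrate_pages() simply moves on to the next page. */
            if (!sync && would_block_to_migrate(page))
                    return -EBUSY;

            /* ... otherwise copy page state and data as usual ... */
            return 0;
    }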

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Dave Jones
    Cc: Jan Kara
    Cc: Andy Isaacson
    Cc: Nai Xia
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The current epoll code can be tickled to run basically indefinitely in
    both the loop detection path check (on ep_insert()) and in the wakeup
    paths.
    The programs that tickle this behavior set up deeply linked networks of
    epoll file descriptors that cause the epoll algorithms to traverse them
    indefinitely. A couple of these sample programs have been previously
    posted in this thread: https://lkml.org/lkml/2011/2/25/297.

    To fix the loop detection path check algorithms, I simply keep track of
    the epoll nodes that have already been visited. Thus, the loop detection
    becomes proportional to the number of epoll file descriptors and links.
    This dramatically decreases the run-time of the loop check algorithm. In
    one diabolical case I tried, it reduced the run-time from 15 minutes (all
    in kernel time) to 0.3 seconds.

    Fixing the wakeup paths could be done at wakeup time in a similar manner
    by keeping track of nodes that have already been visited, but the
    complexity is greater, since there can be multiple wakeups on different
    CPUs. Thus, I've opted to limit the number of possible wakeup paths when
    the paths are created.

    This is accomplished by noting that the end-point file descriptors found
    during the loop detection pass (from the newly added link) are actually
    the sources for wakeup events. I keep a list of these file descriptors
    and limit the number and length of the paths that emanate from these
    'source file descriptors'. In the current implementation I allow 1000
    paths of length 1, 500 of length 2, 100 of length 3, 50 of length 4 and
    10 of length 5. Note that it is sufficient to check the 'source file
    descriptors' reachable from the newly added link, since no other 'source
    file descriptors' will have newly added links. This allows us to check
    only the wakeup paths that may have gotten too long, and not re-check
    all possible wakeup paths on the system.
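
    A sketch of how those limits can be encoded (fs/eventpoll.c uses
    something along these lines; the names are approximate):

    #define PATH_ARR_SIZE 5
    /* allowed paths of length 1..5 emanating from one 'source fd' */
    static const int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
    static int path_count[PATH_ARR_SIZE];

    static int path_count_inc(int nests)
    {
            if (++path_count[nests] > path_limits[nests])
                    return -1;      /* too many wakeup paths: reject link */
            return 0;
    }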

    In terms of the path limit selection, it is first worth noting that the
    most common case for epoll is probably the model where you have 1 epoll
    file descriptor that is monitoring n 'source file descriptors'. In this
    case, each 'source file descriptor' has a single path of length 1. Thus,
    I believe that the limits I'm proposing are quite reasonable and in fact
    may be too generous. I'm hoping that the proposed limits will not cause
    any workloads that currently work to fail.

    In terms of locking, I have extended the use of the 'epmutex' to all
    epoll_ctl add and remove operations. Currently it's only used in a subset
    of the add paths. I need to hold the epmutex so that we can correctly
    traverse a coherent graph to check the number of paths. I believe that
    this additional locking is probably ok, since it's in the setup/teardown
    paths and doesn't affect the running paths, but it certainly is going to
    add some extra overhead. Also worth noting is that the epmutex was
    recently added to the epoll_ctl add operations in the initial path loop
    detection code, using the argument that it was not on a critical path.

    Another thing to note here is the allowed length of epoll chains.
    Currently, eventpoll.c defines:

    /* Maximum number of nesting allowed inside epoll sets */
    #define EP_MAX_NESTS 4

    This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
    + 1). However, this limit is currently only enforced during the loop
    check detection code, and only when the epoll file descriptors are added
    in a certain order. Thus, this limit is currently easily bypassed. The
    newly added check for wakeup paths strictly limits the wakeup paths to a
    length of 5, regardless of the order in which ep's are linked together.
    Thus, a side-effect of the new code is a more consistent enforcement of
    the graph depth.

    Thus far, I've tested this using the sample programs previously
    mentioned, which now either return quickly or return -EINVAL. I've also
    tested using the piptest.c epoll tester, which showed no difference in
    performance. I've also created a number of different epoll networks and
    tested that they behave as expected.

    I believe this solves the original diabolical test cases, while still
    preserving the sane epoll nesting.

    Signed-off-by: Jason Baron
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Baron
     
  • When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe
    to a size bigger than kmalloc() can allocate, it spits out an ugly
    warning:

    ------------[ cut here ]------------
    WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
    Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
    Call Trace:
    warn_slowpath_common+0x75/0xb0
    warn_slowpath_null+0x15/0x20
    __alloc_pages_nodemask+0x5d3/0x7a0
    __get_free_pages+0x12/0x50
    __kmalloc+0x12b/0x150
    pipe_set_size+0x75/0x120
    pipe_fcntl+0xf8/0x140
    do_fcntl+0x2d4/0x410
    sys_fcntl+0x66/0xa0
    system_call_fastpath+0x16/0x1b
    ---[ end trace 432f702e6db7b5ee ]---

    Instead, make kcalloc() handle the overflow case and fail quietly.
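
    The allocation change in pipe_set_size() is small; a sketch from memory
    of the patch (treat the exact flags as an assumption):

    struct pipe_buffer *bufs;

    /* kcalloc() catches a nr_pages * sizeof(*bufs) overflow, and
     * __GFP_NOWARN keeps the page allocator quiet when the request is
     * simply too large to satisfy */
    bufs = kcalloc(nr_pages, sizeof(*bufs), GFP_KERNEL | __GFP_NOWARN);
    if (!bufs)
            return -ENOMEM;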

    [akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
    Signed-off-by: Sasha Levin
    Cc: Alexander Viro
    Acked-by: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • get_proc_task() can fail to find the task and return NULL;
    put_task_struct() will then bomb the kernel with the following oops:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
    IP: [] proc_pid_permission+0x64/0xe0
    PGD 112075067 PUD 112814067 PMD 0
    Oops: 0002 [#1] PREEMPT SMP

    This is a regression introduced by commit 0499680a ("procfs: add hidepid=
    and gid= mount options"). The kernel should return -ESRCH if
    get_proc_task() failed.
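
    The shape of the fix in proc_pid_permission() is a one-line early
    return; a sketch (helper names as in fs/proc/base.c at the time, from
    memory):

    task = get_proc_task(inode);
    if (!task)
            return -ESRCH;          /* task already gone */
    has_perms = has_pid_permissions(pid, task, hide_pid_min);
    put_task_struct(task);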

    Signed-off-by: Xiaotian Feng
    Cc: Al Viro
    Cc: Vasiliy Kulikov
    Cc: Stephen Wilson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
     

11 Jan, 2012

31 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    autofs4: deal with autofs4_write/autofs4_write races
    autofs4: catatonic_mode vs. notify_daemon race
    autofs4: autofs4_wait() vs. autofs4_catatonic_mode() race
    hfsplus: creation of hidden dir on mount can fail
    block_dev: Suppress bdev_cache_init() kmemleak warning
    fix shrink_dcache_parent() livelock
    coda: switch coda_cnode_make() to sane API as well, clean coda_lookup()
    coda: deal correctly with allocation failure from coda_cnode_makectl()
    securityfs: fix object creation races

    Linus Torvalds
     
  • Just serialize the actual writing of packets into the pipe on
    a new mutex, independent of everything else in the locking
    hierarchy. As soon as something has started feeding a piece
    of a packet into the pipe to the daemon, we *want* everything else
    about to try the same to wait until we are done.
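
    A sketch of the idea in autofs4_write(): a dedicated pipe_mutex taken
    only around the packet write loop (details abbreviated):

    mutex_lock(&sbi->pipe_mutex);
    while (bytes &&
           (wr = file->f_op->write(file, data, bytes, &file->f_pos)) > 0) {
            data += wr;
            bytes -= wr;
    }
    mutex_unlock(&sbi->pipe_mutex);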

    Acked-by: Ian Kent
    Signed-off-by: Al Viro

    Al Viro
     
  • we need to hold ->wq_mutex while we are forming the packet to send,
    lest we have autofs4_catatonic_mode() setting wq->name.name to NULL
    just as autofs4_notify_daemon() decides to memcpy() from it...

    We do have a check for catatonic mode immediately after that (under
    ->wq_mutex, as it ought to be) and the packet won't actually be sent,
    but it'll be too late for us if we oops on that memcpy() from NULL...

    Fix is obvious - just extend the area covered by ->wq_mutex over
    that switch and check whether it's catatonic *before* doing anything
    else.

    Acked-by: Ian Kent
    Signed-off-by: Al Viro

    Al Viro
     
  • We need to recheck ->catatonic after autofs4_wait() got ->wq_mutex
    for good, or we might end up with wq inserted into queue after
    autofs4_catatonic_mode() had done its thing. It will stick there
    forever, since there won't be anything to clear its ->name.name.

    A bit of a complication: validate_request() drops and regains ->wq_mutex.
    It actually ends up the most convenient place to stick the check into...

    Acked-by: Ian Kent
    Signed-off-by: Al Viro

    Al Viro
     
  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     
  • Andrew elucidates:
    - First installment of MM. We have a HUGE number of MM patches this
    time. It's crazy.
    - MAINTAINERS updates
    - backlight updates
    - leds
    - checkpatch updates
    - misc ELF stuff
    - rtc updates
    - reiserfs
    - procfs
    - some misc other bits

    * akpm: (124 commits)
    user namespace: make signal.c respect user namespaces
    workqueue: make alloc_workqueue() take printf fmt and args for name
    procfs: add hidepid= and gid= mount options
    procfs: parse mount options
    procfs: introduce the /proc/<pid>/map_files/ directory
    procfs: make proc_get_link to use dentry instead of inode
    signal: add block_sigmask() for adding sigmask to current->blocked
    sparc: make SA_NOMASK a synonym of SA_NODEFER
    reiserfs: don't lock root inode searching
    reiserfs: don't lock journal_init()
    reiserfs: delay reiserfs lock until journal initialization
    reiserfs: delete comments referring to the BKL
    drivers/rtc/interface.c: fix alarm rollover when day or month is out-of-range
    drivers/rtc/rtc-twl.c: add DT support for RTC inside twl4030/twl6030
    drivers/rtc/: remove redundant spi driver bus initialization
    drivers/rtc/rtc-jz4740.c: make jz4740_rtc_driver static
    drivers/rtc/rtc-mc13xxx.c: make mc13xxx_rtc_idtable static
    rtc: convert drivers/rtc/* to use module_platform_driver()
    drivers/rtc/rtc-wm831x.c: convert to devm_kzalloc()
    drivers/rtc/rtc-wm831x.c: remove unused period IRQ handler
    ...

    Linus Torvalds
     
  • Add support for mount options to restrict access to /proc/PID/
    directories. The default backward-compatible "relaxed" behaviour is left
    untouched.

    The first mount option is called "hidepid" and its value defines how much
    info about processes we want to be available for non-owners:

    hidepid=0 (default) means the old behavior - anybody may read all
    world-readable /proc/PID/* files.

    hidepid=1 means users may not access any /proc/<pid>/ directories but
    their own. Sensitive files like cmdline, sched*, status are now protected
    against other users. As permission checking is done in
    proc_pid_permission() and files' permissions are left untouched, programs
    expecting specific files' modes are not confused.

    hidepid=2 means hidepid=1 plus all /proc/<pid>/ directories will be
    invisible to other users. It doesn't mean that it hides whether a process
    exists (that can be learned by other means, e.g. by kill -0 $PID), but it
    hides a process' euid and egid. It complicates an intruder's task of
    gathering info about running processes: whether some daemon runs with
    elevated privileges, whether another user runs some sensitive program,
    whether other users run any program at all, etc.

    gid=XXX defines a group that will be able to gather all processes' info
    (as in hidepid=0 mode). This group should be used instead of putting a
    nonroot user in the sudoers file or similar. However, untrusted users
    (like daemons, etc.) which are not supposed to monitor the tasks in the
    whole system should not be added to the group.
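
    As a usage example, the options can be applied to an already-mounted
    /proc via remount; a minimal sketch (the gid value 1001 is arbitrary):

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
            /* equivalent to: mount -o remount,hidepid=2,gid=1001 /proc */
            if (mount("proc", "/proc", "proc", MS_REMOUNT,
                      "hidepid=2,gid=1001") != 0)
                    perror("remount /proc");
            return 0;
    }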

    hidepid=1 or higher is designed to restrict access to procfs files which
    might reveal some sensitive private information like precise keystroke
    timings:

    http://www.openwall.com/lists/oss-security/2011/11/05/3

    hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
    conky gracefully handle EPERM/ENOENT and behave as if the current user is
    the only user running processes. pstree shows the process subtree which
    contains "pstree" process.

    Note: the patch doesn't deal with setuid/setgid issues of keeping
    preopened descriptors of procfs files (like
    https://lkml.org/lkml/2011/2/7/368). We rely on the fact that the leaked
    information, like the scheduling counters of setuid apps, doesn't
    threaten anybody's privacy - only the user who started the setuid
    program may read the counters.

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • Add support for procfs mount options. Actual mount options are coming in
    the next patches.

    Signed-off-by: Vasiliy Kulikov
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Randy Dunlap
    Cc: "H. Peter Anvin"
    Cc: Greg KH
    Cc: Theodore Tso
    Cc: Alan Cox
    Cc: James Morris
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasiliy Kulikov
     
  • This one behaves similarly to the /proc/<pid>/fd/ one - it contains one
    symlink for each file-backed mapping; the name of a symlink is
    "vma->vm_start-vma->vm_end" and the target is the file. Opening a symlink
    results in a file that points to exactly the same inode as the vma's one.

    For example, the ls -l of some arbitrary /proc/<pid>/map_files/:

    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
    | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

    This *helps* checkpointing process in three ways:

    1. When dumping a task's mappings we know the exact file that is mapped
    by a particular region. We do this by opening the
    /proc/$pid/map_files/$address symlink the way we do with file
    descriptors.

    2. This also helps in determining which anonymous shared mappings are
    shared with each other, by comparing their inodes.

    3. When restoring a set of processes, in case two of them share a
    mapping, we map the memory by the 1st one and then open its
    /proc/$pid/map_files/$address file and map it by the 2nd task.

    Using /proc/$pid/maps for this is quite inconvenient since it brings
    repeated re-reading and re-parsing of this text file, which slows down
    the restore procedure significantly. Also, as pointed out in (3), it is
    way easier to use the top-level shared mapping in children as
    /proc/$pid/map_files/$address when needed.
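
    As a usage sketch, a checkpoint tool can resolve a mapping with a plain
    readlink(); the address range in the path below is purely illustrative:

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char target[4096];
            ssize_t n = readlink("/proc/self/map_files/400000-401000",
                                 target, sizeof(target) - 1);
            if (n > 0) {
                    target[n] = '\0';
                    printf("mapped file: %s\n", target);
            }
            return 0;
    }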

    [akpm@linux-foundation.org: coding-style fixes]
    [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Cyrill Gorcunov
    Reviewed-by: Vasiliy Kulikov
    Reviewed-by: "Kirill A. Shutemov"
    Cc: Tejun Heo
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Cc: Pavel Machek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Prepare the ground for the next "map_files" patch, which needs the name
    of a link file to analyse.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Tejun Heo
    Cc: Vasiliy Kulikov
    Cc: "Kirill A. Shutemov"
    Cc: Alexey Dobriyan
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • Nothing requires that we lock the filesystem until the root inode is
    provided.

    Also iget5_locked() triggers a warning because we are holding the
    filesystem lock while allocating the inode, which results in a lockdep
    suspicion that we have a lock inversion against the reclaim path:

    [ 1986.896979] =================================
    [ 1986.896990] [ INFO: inconsistent lock state ]
    [ 1986.896997] 3.1.1-main #8
    [ 1986.897001] ---------------------------------
    [ 1986.897007] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    [ 1986.897016] kswapd0/16 [HC0[0]:SC0[0]:HE1:SE1] takes:
    [ 1986.897023] (&REISERFS_SB(s)->lock){+.+.?.}, at: [] reiserfs_write_lock+0x20/0x2a
    [ 1986.897044] {RECLAIM_FS-ON-W} state was registered at:
    [ 1986.897050] [] mark_held_locks+0xae/0xd0
    [ 1986.897060] [] lockdep_trace_alloc+0x7d/0x91
    [ 1986.897068] [] kmem_cache_alloc+0x1a/0x93
    [ 1986.897078] [] reiserfs_alloc_inode+0x13/0x3d
    [ 1986.897088] [] alloc_inode+0x14/0x5f
    [ 1986.897097] [] iget5_locked+0x62/0x13a
    [ 1986.897106] [] reiserfs_fill_super+0x410/0x8b9
    [ 1986.897114] [] mount_bdev+0x10b/0x159
    [ 1986.897123] [] get_super_block+0x10/0x12
    [ 1986.897131] [] mount_fs+0x59/0x12d
    [ 1986.897138] [] vfs_kern_mount+0x45/0x7a
    [ 1986.897147] [] do_kern_mount+0x2f/0xb0
    [ 1986.897155] [] do_mount+0x5c2/0x612
    [ 1986.897163] [] sys_mount+0x61/0x8f
    [ 1986.897170] [] sysenter_do_call+0x12/0x32
    [ 1986.897181] irq event stamp: 7509691
    [ 1986.897186] hardirqs last enabled at (7509691): [] kmem_cache_alloc+0x6e/0x93
    [ 1986.897197] hardirqs last disabled at (7509690): [] kmem_cache_alloc+0x24/0x93
    [ 1986.897209] softirqs last enabled at (7508896): [] __do_softirq+0xee/0xfd
    [ 1986.897222] softirqs last disabled at (7508859): [] do_softirq+0x50/0x9d
    [ 1986.897234]
    [ 1986.897235] other info that might help us debug this:
    [ 1986.897242] Possible unsafe locking scenario:
    [ 1986.897244]
    [ 1986.897250] CPU0
    [ 1986.897254] ----
    [ 1986.897257] lock(&REISERFS_SB(s)->lock);
    [ 1986.897265]
    [ 1986.897269] lock(&REISERFS_SB(s)->lock);
    [ 1986.897276]
    [ 1986.897277] *** DEADLOCK ***
    [ 1986.897278]
    [ 1986.897286] no locks held by kswapd0/16.
    [ 1986.897291]
    [ 1986.897292] stack backtrace:
    [ 1986.897299] Pid: 16, comm: kswapd0 Not tainted 3.1.1-main #8
    [ 1986.897306] Call Trace:
    [ 1986.897314] [] ? printk+0xf/0x11
    [ 1986.897324] [] print_usage_bug+0x20e/0x21a
    [ 1986.897332] [] ? print_irq_inversion_bug+0x172/0x172
    [ 1986.897341] [] mark_lock+0x27f/0x483
    [ 1986.897349] [] __lock_acquire+0x628/0x1472
    [ 1986.897358] [] lock_acquire+0x47/0x5e
    [ 1986.897366] [] ? reiserfs_write_lock+0x20/0x2a
    [ 1986.897384] [] ? reiserfs_write_lock+0x20/0x2a
    [ 1986.897397] [] mutex_lock_nested+0x35/0x26f
    [ 1986.897409] [] ? reiserfs_write_lock+0x20/0x2a
    [ 1986.897421] [] reiserfs_write_lock+0x20/0x2a
    [ 1986.897433] [] map_block_for_writepage+0xc9/0x590
    [ 1986.897448] [] ? create_empty_buffers+0x33/0x8f
    [ 1986.897461] [] ? get_parent_ip+0xb/0x31
    [ 1986.897472] [] ? sub_preempt_count+0x81/0x8e
    [ 1986.897485] [] ? _raw_spin_unlock+0x27/0x3d
    [ 1986.897496] [] ? get_parent_ip+0xb/0x31
    [ 1986.897508] [] reiserfs_writepage+0x1b9/0x3e7
    [ 1986.897521] [] ? clear_page_dirty_for_io+0xcb/0xde
    [ 1986.897533] [] ? trace_hardirqs_on_caller+0x108/0x138
    [ 1986.897546] [] ? trace_hardirqs_on+0xb/0xd
    [ 1986.897559] [] shrink_page_list+0x34f/0x5e2
    [ 1986.897572] [] shrink_inactive_list+0x172/0x22c
    [ 1986.897585] [] shrink_zone+0x303/0x3b1
    [ 1986.897597] [] ? _raw_spin_unlock+0x27/0x3d
    [ 1986.897611] [] kswapd+0x3b7/0x5f2

    The deadlock shouldn't happen since we are doing that allocation in the
    mount path; the filesystem is not yet available for any reclaim. Still,
    the warning is annoying.

    To solve this, acquire the lock later, only where we need it, right
    before calling reiserfs_read_locked_inode(), which wants the lock to
    walk the tree.

    Reported-by: Knut Petersen
    Signed-off-by: Frederic Weisbecker
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Jeff Mahoney
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • journal_init() doesn't need the lock since no operation on the filesystem
    is involved there. journal_read() and get_list_bitmap() have yet to be
    reviewed carefully, though, before removing the lock there. Just keep it
    around these two calls for safety.

    Signed-off-by: Frederic Weisbecker
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Jeff Mahoney
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • In the mount path, transactions that are made before journal
    initialization don't involve the filesystem. We can delay the reiserfs
    lock until we play with the journal.

    Signed-off-by: Frederic Weisbecker
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Jeff Mahoney
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frederic Weisbecker
     
  • Signed-off-by: Davidlohr Bueso
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Randomization of PIE load address is hard coded in binfmt_elf.c for X86
    and ARM. Create a new Kconfig variable
    (CONFIG_ARCH_BINFMT_ELF_RANDOMIZE_PIE) for this and use it instead. Thus
    architecture specific policy is pushed out of the generic binfmt_elf.c and
    into the architecture Kconfig files.

    X86 and ARM Kconfigs are modified to select the new variable so there is
    no change in behavior. A follow on patch will select it for MIPS too.

    Signed-off-by: David Daney
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Acked-by: H. Peter Anvin
    Cc: Ralf Baechle
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Daney
     
  • oom_score_adj is used for guarding processes from the OOM killer. One
    problem is that it's inherited at fork(). When a daemon sets
    oom_score_adj and makes children, it's hard to know where the value is
    set.

    This patch adds 3 tracepoints useful for debugging:
    - creating a new task
    - renaming a task (exec)
    - setting oom_score_adj

    To debug, users need to enable some tracepoints. Filtering may be
    useful, as in:

    # EVENT=/sys/kernel/debug/tracing/events/task/
    # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
    # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
    # echo 1 > $EVENT/enable
    # EVENT=/sys/kernel/debug/tracing/events/oom/
    # echo 1 > $EVENT/enable

    The output will look like this:
    # grep oom /sys/kernel/debug/tracing/trace
    bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
    bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
    ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
    bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
    grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Tell the page allocator that pages allocated for a buffered write are
    expected to become dirty soon.
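
    The hint itself is just an extra gfp flag on the page cache allocation;
    a sketch of the relevant line in grab_cache_page_write_begin() (from
    memory, so treat the exact placement as an assumption):

    gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE;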

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Inode cache pruning indirectly reclaims page cache by invalidating mapping
    pages. Let's account them in reclaim_state so that this progress is
    noticed by the memory reclaimer.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Ext4 commits for 3.3 merge window

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (32 commits)
    ext4: fix undefined behavior in ext4_fill_flex_info()
    ext4: make more symbols static
    ext4: make local symbol ext4_initxattrs static
    jbd2: fix hung processes in jbd2_journal_lock_updates()
    ext4: reserve new feature flag codepoints
    ext4: Report max_batch_time option correctly
    ext4: add missing ext4_resize_end on error paths
    ext4: let ext4_group_add() use common code
    ext4: let ext4_group_extend() use common code
    ext4: add new online resize interface
    ext4: add a new function which adds a flex group to a fs
    ext4: add a new function which allocates bitmaps and inode tables
    ext4: pass verify_reserved_gdb() the number of group descriptors
    ext4: add a function which updates the super block during online resizing
    ext4: add a function which sets up a block group descriptors of a flex bg
    ext4: add a function which sets up group blocks of a flex bg
    ext4: add a structure which will be used by 64bit-resize interface
    ext4: add a function which adds a new group descriptors to a fs
    ext4: add a function which extends a group without checking parameters
    ext4: use proper little-endian bitops
    ...

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
    fs/9p: iattr_valid flags are kernel internal flags map them to 9p values.
    fs/9p: We should not allocate a new inode when creating hardlinks.
    fs/9p: v9fs_stat2inode should update suid/sgid bits.
    9p: Reduce object size with CONFIG_NET_9P_DEBUG
    fs/9p: check schedule_timeout_interruptible return value

    Fix up trivial conflicts in fs/9p/{vfs_inode.c,vfs_inode_dotl.c} due to
    debug messages having changed to use p9_debug() on one hand, and the
    changes for umode_t on the other.

    Linus Torvalds
     
  • * 'nfs-for-3.3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
    NFSv4: Change the default setting of the nfs4_disable_idmapping parameter
    NFSv4: Save the owner/group name string when doing open
    NFS: Remove pNFS bloat from the generic write path
    pnfs-obj: Must return layout on IO error
    pnfs-obj: pNFS errors are communicated on iodata->pnfs_error
    NFS: Cache state owners after files are closed
    NFS: Clean up nfs4_find_state_owners_locked()
    NFSv4: include bitmap in nfsv4 get acl data
    nfs: fix a minor do_div portability issue
    NFSv4.1: cleanup comment and debug printk
    NFSv4.1: change nfs4_free_slot parameters for dynamic slots
    NFSv4.1: cleanup init and reset of session slot tables
    NFSv4.1: fix backchannel slotid off-by-one bug
    nfs: fix regression in handling of context= option in NFSv4
    NFS - fix recent breakage to NFS error handling.
    NFS: Retry mounting NFSROOT
    SUNRPC: Clean up the RPCSEC_GSS service ticket requests

    Linus Torvalds
     
  • * 'linux-next' of git://git.infradead.org/ubifs-2.6:
    UBI: fix use-after-free on error path
    UBI: fix missing scrub when there is a bit-flip
    UBIFS: Use kmemdup rather than duplicating its implementation

    Linus Torvalds
     
  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
    dlm: add recovery callbacks
    dlm: add node slots and generation
    dlm: move recovery barrier calls
    dlm: convert rsb list to rb_tree

    Linus Torvalds
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • MTD pull for 3.3

    * tag 'for-linus-3.3' of git://git.infradead.org/mtd-2.6: (113 commits)
    mtd: Fix dependency for MTD_DOC200x
    mtd: do not use mtd->block_markbad directly
    logfs: do not use 'mtd->block_isbad' directly
    mtd: introduce mtd_can_have_bb helper
    mtd: do not use mtd->suspend and mtd->resume directly
    mtd: do not use mtd->lock, unlock and is_locked directly
    mtd: do not use mtd->sync directly
    mtd: harmonize mtd_writev usage
    mtd: do not use mtd->lock_user_prot_reg directly
    mtd: do not use mtd->write_user_prot_reg directly
    mtd: do not use mtd->read_*_prot_reg directly
    mtd: do not use mtd->get_*_prot_info directly
    mtd: do not use mtd->read_oob directly
    mtd: mtdoops: do not use mtd->panic_write directly
    romfs: do not use mtd->get_unmapped_area directly
    mtd: do not use mtd->get_unmapped_area directly
    mtd: do not use mtd->point directly
    mtd: introduce mtd_has_oob helper
    mtd: mtdcore: export symbols cleanup
    mtd: clean-up the default_mtd_writev function
    ...

    Fix up trivial edit/remove conflict in drivers/staging/spectra/lld_mtd.c

    Linus Torvalds
     
  • Kmemleak reports the following warning in bdev_cache_init()
    [ 0.003738] kmemleak: Object 0xffff880153035200 (size 256):
    [ 0.003823] kmemleak: comm "swapper/0", pid 0, jiffies 4294667299
    [ 0.003909] kmemleak: min_count = 1
    [ 0.003988] kmemleak: count = 0
    [ 0.004066] kmemleak: flags = 0x1
    [ 0.004144] kmemleak: checksum = 0
    [ 0.004224] kmemleak: backtrace:
    [ 0.004303] [] kmemleak_alloc+0x21/0x3e
    [ 0.004446] [] kmem_cache_alloc+0xca/0x1dc
    [ 0.004592] [] alloc_vfsmnt+0x1f/0x198
    [ 0.004736] [] vfs_kern_mount+0x36/0xd2
    [ 0.004879] [] kern_mount_data+0x18/0x32
    [ 0.005025] [] bdev_cache_init+0x51/0x81
    [ 0.005169] [] vfs_caches_init+0x101/0x10d
    [ 0.005313] [] start_kernel+0x344/0x383
    [ 0.005456] [] x86_64_start_reservations+0xae/0xb2
    [ 0.005602] [] x86_64_start_kernel+0x102/0x111
    [ 0.005747] [] 0xffffffffffffffff
    [ 0.008653] kmemleak: Trying to color unknown object at 0xffff880153035220 as Grey
    [ 0.008754] Pid: 0, comm: swapper/0 Not tainted 3.3.0-rc0-dbg-04200-g8180888-dirty #888
    [ 0.008856] Call Trace:
    [ 0.008934] [] ? find_and_get_object+0x44/0x118
    [ 0.009023] [] paint_ptr+0x57/0x8f
    [ 0.009109] [] kmemleak_not_leak+0x23/0x42
    [ 0.009195] [] bdev_cache_init+0x72/0x81
    [ 0.009282] [] vfs_caches_init+0x101/0x10d
    [ 0.009368] [] start_kernel+0x344/0x383
    [ 0.009466] [] x86_64_start_reservations+0xae/0xb2
    [ 0.009555] [] ? early_idt_handlers+0x140/0x140
    [ 0.009643] [] x86_64_start_kernel+0x102/0x111

    due to an attempt to mark a pointer to `struct vfsmount' as a gray
    object, which is embedded into `struct mount' returned from
    alloc_vfsmnt().

    Make `bd_mnt' static, avoiding the need to tell kmemleak to mark it
    gray, as suggested by Al Viro.

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Al Viro

    Sergey Senozhatsky
     
  • Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may
    cause shrink_dcache_parent() to loop forever.

    Here's what appears to happen:

    1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1

    2 - CPU1: select_parent(P) locks P->d_lock

    3 - CPU0: shrink_dentry_list() locks C->d_lock
    dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock

    4 - CPU1: select_parent(P) locks C->d_lock,
    moves C from dispose list being processed on CPU0 to the new
    dispose list, returns 1

    5 - CPU0: shrink_dentry_list() finds dispose list empty, returns

    6 - Goto 2 with CPU0 and CPU1 switched

    Basically select_parent() steals the dentry from shrink_dentry_list() and thinks
    it found a new one, causing shrink_dentry_list() to think it's making progress
    and loop over and over.

    One way to trigger this is to make udev call stat() on a sysfs file
    while it is going away.

    Having a file in /lib/udev/rules.d/ with only this one rule seems to do
    the trick:

    ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true"

    Then execute the following loop:

    while true; do
            echo -bond0 > /sys/class/net/bonding_masters
            echo +bond0 > /sys/class/net/bonding_masters
            echo -bond1 > /sys/class/net/bonding_masters
            echo +bond1 > /sys/class/net/bonding_masters
    done

    One fix would be to check all callers and prevent concurrent calls to
    shrink_dcache_parent(). But I think a better solution is to stop the
    stealing behavior.

    This patch adds a new dentry flag that is set when the dentry is added to the
    dispose list. The flag is cleared in dentry_lru_del() in case the dentry gets a
    new reference just before being pruned.

    If the dentry has this flag, select_parent() will skip it and let
    shrink_dentry_list() retry pruning it. With select_parent() skipping those
    dentries there will not be the appearance of progress (new dentries found) when
    there is none, hence shrink_dcache_parent() will not loop forever.

    The flag is also set in prune_dcache_sb() for consistency, as suggested
    by Linus.
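
    A sketch of the select_parent() side of this (the flag is called
    DCACHE_SHRINK_LIST in the patch, if memory serves):

    if (dentry->d_flags & DCACHE_SHRINK_LIST) {
            /* already claimed by a shrink_dentry_list() on another CPU;
             * leave it alone instead of stealing it */
    } else if (!dentry->d_count) {
            dentry_lru_move_list(dentry, dispose);
            dentry->d_flags |= DCACHE_SHRINK_LIST;
            found++;
    }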

    Signed-off-by: Miklos Szeredi
    CC: stable@vger.kernel.org
    Signed-off-by: Al Viro

    Miklos Szeredi
     
  • Conflicts:
    fs/ext4/ioctl.c

    Theodore Ts'o
     
  • Commit 503358ae01b70ce6909d19dd01287093f6b6271c ("ext4: avoid divide by
    zero when trying to mount a corrupted file system") fixes CVE-2009-4307
    by performing a sanity check on s_log_groups_per_flex, since it can be
    set to a bogus value by an attacker.

    sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
    groups_per_flex = 1 << sbi->s_log_groups_per_flex;

    if (groups_per_flex < 2) { ... }

    This patch fixes two potential issues in the previous commit.

    1) The sanity check might only work on architectures like PowerPC.
    On x86, 5 bits are used for the shifting amount. That means, given a
    large s_log_groups_per_flex value like 36, groups_per_flex = 1 << 36
    is essentially 1 << 4 = 16, rather than 0. This will bypass the check,
    leaving s_log_groups_per_flex and groups_per_flex inconsistent.

    2) The sanity check relies on undefined behavior, i.e., oversized shift.
    A standard-conforming C compiler could rewrite the check in unexpected
    ways. Consider the following equivalent form, assuming groups_per_flex
    is unsigned for simplicity.

    groups_per_flex = 1 << sbi->s_log_groups_per_flex;
    if (groups_per_flex == 0 || groups_per_flex == 1) {

    We compile the code snippet using Clang 3.0 and GCC 4.6. Clang will
    completely optimize away the check groups_per_flex == 0, leaving the
    patched code as vulnerable as the original. GCC keeps the check, but
    there is no guarantee that future versions will do the same.
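
    One undefined-behavior-free way to write the check is to validate the
    shift amount itself before shifting, e.g. (a sketch along the lines of
    the eventual fix):

    sbi->s_log_groups_per_flex = sbi->s_es->s_log_groups_per_flex;
    if (sbi->s_log_groups_per_flex < 1 || sbi->s_log_groups_per_flex > 31) {
            sbi->s_log_groups_per_flex = 0;
            return 1;
    }
    groups_per_flex = 1 << sbi->s_log_groups_per_flex;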

    Signed-off-by: Xi Wang
    Signed-off-by: "Theodore Ts'o"
    Cc: stable@vger.kernel.org

    Xi Wang
     
  • Signed-off-by: Al Viro

    Al Viro
     
  • lookup should fail with ENOMEM, not silently make the dentry negative.
    Switched to saner calling conventions, while we are at it.

    Signed-off-by: Al Viro

    Al Viro
     

10 Jan, 2012

4 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    vfs: new helper - d_make_root()
    dcache: use a dispose list in select_parent
    ceph: d_alloc_root() may fail
    ext4: fix failure exits
    isofs: inode leak on mount failure

    Linus Torvalds
     
  • d_alloc_root() with iput() in case of allocation failure...
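
    In other words, the new helper wraps that pattern so callers stop
    open-coding it; roughly (a sketch from memory):

    struct dentry *d_make_root(struct inode *root_inode)
    {
            struct dentry *res = NULL;

            if (root_inode) {
                    res = d_alloc_root(root_inode);
                    if (!res)
                            iput(root_inode);
            }
            return res;
    }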

    Signed-off-by: Al Viro

    Al Viro
     
  • select_parent currently abuses the dentry cache LRU to provide
    cleanup features for child dentries that need to be freed. It moves
    them to the tail of the LRU, then tells shrink_dcache_parent() to
    call __shrink_dcache_sb() to unconditionally move them to a dispose
    list (as DCACHE_REFERENCED is ignored). __shrink_dcache_sb() has to
    relock the dentries to move them off the LRU onto the dispose list,
    but otherwise does not touch the dentries that select_parent() moved
    to the tail of the LRU. It then passes the dispose list to
    shrink_dentry_list() which tries to free the dentries.

    IOWs, the use of __shrink_dcache_sb() is superfluous - we can build
    exactly the same list of dentries for disposal directly in
    select_parent() and call shrink_dentry_list() instead of calling
    __shrink_dcache_sb() to do that. This means that we avoid long holds
    on the lru lock walking the LRU moving dentries to the dispose list.
    We also avoid the need to relock each dentry just to move it off the
    LRU, reducing the number of times we lock each dentry to dispose of
    them in shrink_dcache_parent() from 3 to 2 times.

    Further, we remove one of the two callers of __shrink_dcache_sb().
    This also means that __shrink_dcache_sb() can be moved back into
    prune_dcache_sb() and we no longer have to handle referenced
    dentries conditionally, simplifying the code.
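
    The resulting shrink_dcache_parent() loop is then essentially (a
    sketch):

    void shrink_dcache_parent(struct dentry *parent)
    {
            LIST_HEAD(dispose);
            int found;

            while ((found = select_parent(parent, &dispose)) != 0)
                    shrink_dentry_list(&dispose);
    }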

    Signed-off-by: Dave Chinner
    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro

    Dave Chinner
     
  • ... and ceph_init_dentry(NULL) will oops

    Signed-off-by: Al Viro

    Al Viro