Eric Lee / smarc-fsl-linux-kernel

14 Jan, 2012

6 commits

69e4747ee Unused iocbs in a batch should not be accounted as active.

Since commit 080d676de095 ("aio: allocate kiocbs in batches") iocbs are
allocated in a batch during processing of the first iocbs. All iocbs in a
batch are automatically added to the ctx->active_reqs list and accounted in
ctx->reqs_active.

If one of the iocbs submitted by a user fails (and it is not the last one),
the remaining iocbs are not processed, but they are still present in
ctx->active_reqs and accounted in ctx->reqs_active. This causes the process
to get stuck in D state in wait_for_all_aios() on exit, since
ctx->reqs_active will never drop to zero. Furthermore, since
kiocb_batch_free() frees an iocb without removing it from the active_reqs
list, the list becomes corrupted, which may cause an oops.

Fix this by removing iocb from ctx->active_reqs and updating
ctx->reqs_active in kiocb_batch_free().

Signed-off-by: Gleb Natapov
Reviewed-by: Jeff Moyer
Cc: stable@kernel.org # 3.2
Signed-off-by: Linus Torvalds

Gleb Natapov
2012-01-14 12:39:44 +0800
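
The accounting rule behind the aio fix in 69e4747ee can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct req`, `struct ctx`, `ctx_init()`, `req_alloc_active()` and `req_free_unused()` are hypothetical names, not the kernel's kiocb code. It shows that a request which was put on the active list at allocation time has to be unlinked and un-accounted before it is freed.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; these names are not the kernel's. */
struct req {
    struct req *prev, *next;   /* links in the context's active list */
};

struct ctx {
    struct req active;         /* list head (circular, doubly linked) */
    int reqs_active;           /* number of entries on the active list */
};

void ctx_init(struct ctx *c)
{
    c->active.prev = c->active.next = &c->active;
    c->reqs_active = 0;
}

/* Allocate one request and account it as active right away. */
struct req *req_alloc_active(struct ctx *c)
{
    struct req *r = malloc(sizeof(*r));

    if (!r)
        return NULL;
    r->next = c->active.next;
    r->prev = &c->active;
    c->active.next->prev = r;
    c->active.next = r;
    c->reqs_active++;
    return r;
}

/* Free an unused request: unlink it and fix the counter before freeing. */
void req_free_unused(struct ctx *c, struct req *r)
{
    r->prev->next = r->next;
    r->next->prev = r->prev;
    c->reqs_active--;
    free(r);
}
```

Skipping the unlink or the decrement in `req_free_unused()` reproduces both symptoms from the commit message in this model: a counter that never reaches zero and a list that points at freed memory.
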
96e80a785 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next

* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
Squashfs: fix i_blocks calculation with extended regular files
Squashfs: fix mount time sanity check for corrupted superblock
Squashfs: optimise squashfs_cache_get entry search
Squashfs: Update documentation to include xattrs
Squashfs: add missing block release on error condition

Linus Torvalds
2012-01-14 02:34:57 +0800
57e6a7dde Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Fix nlink setting on inode creation
GFS2: fail mount if journal recovery fails
GFS2: let spectator mount do read only recovery
GFS2: Fix a use-after-free that coverity spotted
GFS2: dlm based recovery coordination

Linus Torvalds
2012-01-14 02:33:39 +0800
94b1984ab Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6

* 'linux-next' of git://git.infradead.org/ubifs-2.6:
UBIFS: fix key printing
UBIFS: use snprintf instead of sprintf when printing keys
UBIFS: fix debugging messages
UBIFS: make debugging messages light again
UBI: fix debugging messages
UBI: make vid_hdr non-static

Linus Torvalds
2012-01-14 02:31:33 +0800
1a52bb0b6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: ensure prealloc_blob is in place when removing xattr
rbd: initialize snap_rwsem in rbd_add()
ceph: enable/disable dentry complete flags via mount option
vfs: export symbol d_find_any_alias()
ceph: always initialize the dentry in open_root_dentry()
libceph: remove useless return value for osd_client __send_request()
ceph: avoid iput() while holding spinlock in ceph_dir_fsync
ceph: avoid useless dget/dput in encode_fh
ceph: dereference pointer after checking for NULL
crush: fix force for non-root TAKE
ceph: remove unnecessary d_fsdata conditional checks
ceph: Use kmemdup rather than duplicating its implementation

Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
always initialize the dentry in open_root_dentry)

Linus Torvalds
2012-01-14 02:29:21 +0800
8638094e9 autofs4 - fix deal with autofs4_write races

I don't know how I missed this obvious mistake when I
reviewed Al's patches, sorry.

[ Quoting Al:

Grr... Note to self: do git status *and* git stash show -p
before git push. Nothing like "WTF? I'd fixed that braino"
feeling ;-/

Al sent the same patch - it got broken in commit d668dc56631d:
"autofs4: deal with autofs4_write/autofs4_write races". ]

Reported-and-tested-by: Dave Airlie
Signed-off-by: Ian Kent
Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds

Ian Kent
2012-01-14 00:30:49 +0800

13 Jan, 2012

16 commits

515315a12 UBIFS: fix key printing

Before commit 56e46742e846e4de167dde0e1e1071ace1c882a5 we had locking
around all printing macros, so we could use static buffers for creating
key strings and printing them. However, now we do not have that locking and
we cannot use static buffers. This commit removes the old DBGKEY() macros
and introduces a few new helper macros for printing debugging messages plus
a key at the end. Thankfully, all the messages are already structured in
such a way that the key is printed at the end.

Signed-off-by: Artem Bityutskiy

Artem Bityutskiy
2012-01-13 18:50:42 +0800
beba00607 UBIFS: use snprintf instead of sprintf when printing keys

Switch to 'snprintf()' which is more secure and reliable. This is also
preparation for the subsequent key printing fixes.

Signed-off-by: Artem Bityutskiy

Artem Bityutskiy
2012-01-13 18:46:21 +0800
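
As a rough illustration of why the UBIFS key-printing commits prefer snprintf() and caller-supplied buffers, here is a minimal userspace sketch. The `struct demo_key` layout and the `demo_key_to_str()` helper are made up for this example and are not the UBIFS key format or API. The point is that the output is bounded by the buffer size and that truncation can be detected from the return value.

```c
#include <stdio.h>

/* Hypothetical key layout, for illustration only. */
struct demo_key {
    unsigned long inum;
    unsigned int type;
};

/*
 * Format a key into a caller-supplied buffer.
 * Returns 0 on success and -1 if the text did not fit; in the latter
 * case the buffer still holds a NUL-terminated prefix (when len > 0).
 */
int demo_key_to_str(const struct demo_key *k, char *buf, size_t len)
{
    int n = snprintf(buf, len, "(%lu, %u)", k->inum, k->type);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Because the buffer belongs to the caller, two threads formatting keys at the same time do not share state, which is the property that a static buffer without locking cannot provide.
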
099469502 Merge branch 'akpm' (aka "Andrew's patch-bomb, take two")

Andrew explains:

- various misc stuff

- Most of the rest of MM: memcg, threaded hugepages, others.

- cpumask

- kexec

- kdump

- some direct-io performance tweaking

- radix-tree optimisations

- new selftests code

A note on this: often people will develop a new userspace-visible
feature and will develop userspace code to exercise/test that
feature. Then they merge the patch and the selftest code dies.
Sometimes we paste it into the changelog. Sometimes the code gets
thrown into Documentation/(!).

This saddens me. So this patch creates a bare-bones framework which
will henceforth allow me to ask people to include their test apps in
the kernel tree so we can keep them alive. Then when people enhance
or fix the feature, I can ask them to update the test app too.

The infrastructure is terribly trivial at present - let's see how it
evolves.

- checkpoint/restart feature work.

A note on this: this is a project by various mad Russians to perform
c/r mainly from userspace, with various oddball helper code added
into the kernel where the need is demonstrated.

So rather than some large central lump of code, what we have is
little bits and pieces popping up in various places which either
expose something new or which permit something which is normally
kernel-private to be modified.

The overall project is an ongoing thing. I've judged that the size
and scope of the thing means that we're more likely to be successful
with it if we integrate the support into mainline piecemeal rather
than allowing it all to develop out-of-tree.

However I'm less confident than the developers that it will all
eventually work! So what I'm asking them to do is to wrap each piece
of new code inside CONFIG_CHECKPOINT_RESTORE. So if it all
eventually comes to tears and the project as a whole fails, it should
be a simple matter to go through and delete all trace of it.

This lot pretty much wraps up the -rc1 merge for me.

* akpm: (96 commits)
unlzo: fix input buffer free
ramoops: update parameters only after successful init
ramoops: fix use of rounddown_pow_of_two()
c/r: prctl: add PR_SET_MM codes to set up mm_struct entries
c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4
c/r: introduce CHECKPOINT_RESTORE symbol
selftests: new x86 breakpoints selftest
selftests: new very basic kernel selftests directory
radix_tree: take radix_tree_path off stack
radix_tree: remove radix_tree_indirect_to_ptr()
dio: optimize cache misses in the submission path
vfs: cache request_queue in struct block_device
fs/direct-io.c: calculate fs_count correctly in get_more_blocks()
drivers/parport/parport_pc.c: fix warnings
panic: don't print redundant backtraces on oops
sysctl: add the kernel.ns_last_pid control
kdump: add udev events for memory online/offline
include/linux/crash_dump.h needs elf.h
kdump: fix crash_kexec()/smp_send_stop() race in panic()
kdump: crashk_res init check for /sys/kernel/kexec_crash_size
...

Linus Torvalds
2012-01-13 12:42:54 +0800
b3f7f573a c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4

The mm->start_code/end_code, mm->start_data/end_data and mm->start_brk
values are involved in the calculation of program text/data segment sizes
(which can be seen in /proc/<pid>/statm) and of the final address of the
brk() call.

For restore we need to know all these values. While
mm->start_code/end_code are already present in /proc/$pid/stat, the
remaining members are not, so this patch brings them in.

The restore procedure of these members is addressed in another patch using
prctl().

Signed-off-by: Cyrill Gorcunov
Acked-by: Serge Hallyn
Reviewed-by: Kees Cook
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Alexey Dobriyan
Cc: Tejun Heo
Cc: Andrew Vagin
Cc: Vasiliy Kulikov
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-01-13 12:13:13 +0800
65dd2aa90 dio: optimize cache misses in the submission path

Some investigation of a transaction processing workload showed that a
major consumer of cycles in __blockdev_direct_IO is the cache miss while
accessing the block size. This is because it has to walk the chain from
block_dev to gendisk to queue.

The block size is needed early on to check alignment and sizes. It's only
done if the check for the inode block size fails. But the costly block
device state is unconditionally fetched.

- Reorganize the code to only fetch block dev state when actually
needed.

Then do a prefetch on the block dev early on in the direct IO path. This
is worth it, because there is substantial code run before we actually
touch the block dev now.

- I also added some unlikelies to make it clear to the compiler that the block
device fetch code is not normally executed.

This gave a small, but measurable improvement on a large database
benchmark (about 0.3%)

[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: using prefetch requires including prefetch.h]
Signed-off-by: Andi Kleen
Cc: Jeff Moyer
Cc: Jens Axboe
Cc: Christoph Hellwig
Signed-off-by: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andi Kleen
2012-01-13 12:13:12 +0800
87192a2a4 vfs: cache request_queue in struct block_device

This makes it possible to get from the inode to the request_queue with one
less cache miss. Used in a follow-on optimization.

The lifetime of the pointer is the same as that of the gendisk.

This assumes that the queue will always stay the same in the gendisk while
it's visible to block_devices. I think that's safe, correct?

Signed-off-by: Andi Kleen
Acked-by: Jeff Moyer
Cc: Jens Axboe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andi Kleen
2012-01-13 12:13:12 +0800
ae55e1aaa fs/direct-io.c: calculate fs_count correctly in get_more_blocks()

In get_more_blocks(), we use dio_count to calculate fs_count and do some
tricky things to increase fs_count if dio_count isn't aligned. But
it still has some corner cases that are not covered. See the
following example:

dio_write foo -s 1024 -w 4096

(direct write 4096 bytes at offset 1024). The same goes if the offset
isn't aligned to fs_blocksize.

In this case, the old calculation counts fs_count to be 1, but actually we
will write into 2 different blocks (if fs_blocksize=4096). The old code
just works, since it will call get_block twice (and may have to allocate
and create extents twice for filesystems like ext4). So we'd better call
get_block just once with the proper fs_count.

Signed-off-by: Tao Ma
Cc: "Theodore Ts'o"
Cc: Christoph Hellwig
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tao Ma
2012-01-13 12:13:12 +0800
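
The arithmetic in ae55e1aaa can be checked with a few lines of C. This is a sketch, not the code from fs/direct-io.c: `blocks_spanned()` is a hypothetical helper that counts the filesystem blocks touched by a byte range from the first and last block index, which is what gives 2 for the example in the commit message (4096 bytes at offset 1024 with 4096-byte blocks), where dividing the length by the block size alone gives 1.

```c
#include <stdint.h>

/*
 * Number of filesystem blocks touched by the byte range
 * [offset, offset + len). len must be greater than zero;
 * blkbits is log2 of the block size (12 for 4096-byte blocks).
 */
uint64_t blocks_spanned(uint64_t offset, uint64_t len, unsigned int blkbits)
{
    uint64_t first = offset >> blkbits;             /* block holding the first byte */
    uint64_t last = (offset + len - 1) >> blkbits;  /* block holding the last byte */

    return last - first + 1;
}
```

For the commit's example, `blocks_spanned(1024, 4096, 12)` evaluates to 2: the first byte lies in block 0 and the last byte (offset 5119) lies in block 1.
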
a6bc32b89 mm: compaction: introduce sync-light migration for use by compaction

This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.

This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.

[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Andrea Arcangeli
Cc: Minchan Kim
Cc: Dave Jones
Cc: Jan Kara
Cc: Andy Isaacson
Cc: Nai Xia
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-01-13 12:13:09 +0800
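
The three migration modes named in a6bc32b89 can be summarised as a small decision table. The enum constants are the ones given in the commit message; the helper functions `may_block()` and `may_writeback()` are hypothetical and exist only to make the distinction explicit. They are a reading of the description above, not kernel code.

```c
#include <stdbool.h>

/* Mode names as given in the commit message. */
enum migrate_mode {
    MIGRATE_ASYNC,       /* never block */
    MIGRATE_SYNC_LIGHT,  /* may block, but does not write pages back */
    MIGRATE_SYNC         /* may block and may write pages back */
};

/* Hypothetical helper: may the migration path sleep at all? */
bool may_block(enum migrate_mode mode)
{
    return mode != MIGRATE_ASYNC;
}

/* Hypothetical helper: may it write dirty pages to backing storage? */
bool may_writeback(enum migrate_mode mode)
{
    return mode == MIGRATE_SYNC;
}
```

In this model sync compaction (MIGRATE_SYNC_LIGHT) is allowed to block but is never allowed to start writeback, which is the stall the commit is trying to avoid.
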
b969c4ab9 mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage

Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
there was a debate on disabling synchronous compaction but this severely
impacted allocation success rates. Part of the reason was that many dirty
pages are skipped in asynchronous compaction by the following check:

	if (PageDirty(page) && !sync &&
	    mapping->a_ops->migratepage != migrate_page)
		rc = -EBUSY;

This skips over all mapping aops using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It is
the responsibility of the callback to fail gracefully if migration would
block.

Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Andrea Arcangeli
Cc: Minchan Kim
Cc: Dave Jones
Cc: Jan Kara
Cc: Andy Isaacson
Cc: Nai Xia
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-01-13 12:13:09 +0800
28d82dc1c epoll: limit paths

The current epoll code can be tickled to run basically indefinitely in
both loop detection path check (on ep_insert()), and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely. A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.

To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have been already visited. Thus, the loop detection
becomes proportional to the number of epoll file descriptor and links.
This dramatically decreases the run-time of the loop check algorithm. In
one diabolical case I tried it reduced the run-time from 15 minutes (all
in kernel time) to .3 seconds.

Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but the
complexity is harder, since there can be multiple wakeups on different
cpus...Thus, I've opted to limit the number of possible wakeup paths when
the paths are created.

This is accomplished, by noting that the end file descriptor points that
are found during the loop detection pass (from the newly added link), are
actually the sources for wakeup events. I keep a list of these file
descriptors and limit the number and length of these paths that emanate
from these 'source file descriptors'. In the current implementation I
allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
length 4 and 10 of length 5. Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links. This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.

In terms of the path limit selection, I think it's first worth noting that
the most common case for epoll is probably the model where you have 1
epoll file descriptor that is monitoring n 'source file
descriptors'. In this case, each 'source file descriptor' has 1 path of
length 1. Thus, I believe that the limits I'm proposing are quite
reasonable and in fact may be too generous. Thus, I'm hoping that the
proposed limits will not cause any workloads that currently work to
fail.

In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations. Currently it's only used in a subset
of the add paths. I need to hold the epmutex, so that we can correctly
traverse a coherent graph, to check the number of paths. I believe that
this additional locking is probably ok, since it's in the setup/teardown
paths, and doesn't affect the running paths, but it certainly is going to
add some extra overhead. Also worth noting is that the epmutex was
recently added to the ep_ctl add operations in the initial path loop
detection code using the argument that it was not on a critical path.

Another thing to note here, is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:

/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4

This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1). However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order. Thus, this limit is currently easily bypassed. The
newly added check for wakeup paths strictly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.

Thus far, I've tested this using the sample programs previously
mentioned, which now either return quickly or return -EINVAL. I've also
tested using the piptest.c epoll tester, which showed no difference in
performance. I've also created a number of different epoll networks and
tested that they behave as expected.

I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.

Signed-off-by: Jason Baron
Cc: Nelson Elhage
Cc: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2012-01-13 12:13:04 +0800
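
The per-length caps quoted in 28d82dc1c (1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of length 4 and 10 of length 5) can be modelled with a small counter table. This is a sketch only: `MAX_PATH_LEN`, `struct path_counts` and `path_count_add()` are names invented for the example, and the real patch does this bookkeeping while walking the epoll graph under the epmutex.

```c
#include <stddef.h>

#define MAX_PATH_LEN 5

/* Caps from the commit message: index i is the cap for paths of length i + 1. */
static const int path_limits[MAX_PATH_LEN] = { 1000, 500, 100, 50, 10 };

struct path_counts {
    int count[MAX_PATH_LEN];   /* paths seen so far, per length */
};

/*
 * Record one wakeup path of the given length.
 * Returns 0 if it is within the limits, or -1 if the path is too long
 * or the cap for that length is already used up.
 */
int path_count_add(struct path_counts *pc, int len)
{
    if (len < 1 || len > MAX_PATH_LEN)
        return -1;
    if (pc->count[len - 1] >= path_limits[len - 1])
        return -1;
    pc->count[len - 1]++;
    return 0;
}
```

A caller would zero a `struct path_counts`, feed it every path reachable from the newly added link, and reject the epoll_ctl operation as soon as `path_count_add()` fails. That mirrors the behaviour described above, where the sample programs now return -EINVAL.
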
2ccd4f4d4 pipe: fail cleanly when root tries F_SETPIPE_SZ with big size

When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe
with a size bigger than kmalloc() can allocate, it spits out an ugly warning:

------------[ cut here ]------------
WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
Call Trace:
warn_slowpath_common+0x75/0xb0
warn_slowpath_null+0x15/0x20
__alloc_pages_nodemask+0x5d3/0x7a0
__get_free_pages+0x12/0x50
__kmalloc+0x12b/0x150
pipe_set_size+0x75/0x120
pipe_fcntl+0xf8/0x140
do_fcntl+0x2d4/0x410
sys_fcntl+0x66/0xa0
system_call_fastpath+0x16/0x1b
---[ end trace 432f702e6db7b5ee ]---

Instead, make kcalloc() handle the overflow case and fail quietly.

[akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
Signed-off-by: Sasha Levin
Cc: Alexander Viro
Acked-by: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sasha Levin
2012-01-13 12:13:04 +0800
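
For readers unfamiliar with the kcalloc() idiom mentioned in 2ccd4f4d4, the following userspace sketch shows the overflow check that an array allocator performs before multiplying. `alloc_array()` is a hypothetical stand-in built on malloc(); it does not reproduce the kernel allocator or the way the patch silences the warning, it only shows the "fail quietly instead of computing a wrapped size" part.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a zeroed array of n elements of the given size.
 * Returns NULL, without attempting the allocation, when n * size
 * would overflow size_t.
 */
void *alloc_array(size_t n, size_t size)
{
    void *p;

    if (size != 0 && n > SIZE_MAX / size)
        return NULL;             /* n * size would wrap around */
    p = malloc(n * size);
    if (p)
        memset(p, 0, n * size);
    return p;
}
```

The caller only has to handle a NULL return, which is the same error path it already needs for an ordinary out-of-memory condition.
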
a2ef990ab proc: fix null pointer deref in proc_pid_permission()

get_proc_task() can fail to find the task and return NULL;
put_task_struct() will then crash the kernel with the following oops:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [] proc_pid_permission+0x64/0xe0
PGD 112075067 PUD 112814067 PMD 0
Oops: 0002 [#1] PREEMPT SMP

This is a regression introduced by commit 0499680a ("procfs: add hidepid=
and gid= mount options"). The kernel should return -ESRCH if
get_proc_task() failed.

Signed-off-by: Xiaotian Feng
Cc: Al Viro
Cc: Vasiliy Kulikov
Cc: Stephen Wilson
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xiaotian Feng
2012-01-13 12:13:02 +0800
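
The shape of the fix in a2ef990ab is the usual "check the lookup result before using or releasing it" pattern. The model below is an assumption-laden sketch: `struct task`, `get_task()`, `put_task()` and `check_permission()` are invented for the example and use a plain array and a uid comparison in place of the kernel's task lookup and permission logic. Only the -ESRCH early return corresponds to what the commit message states.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical task record with a reference count. */
struct task {
    int refcount;
    int uid;
};

/* Hypothetical lookup: takes a reference, or returns NULL if the task is gone. */
struct task *get_task(struct task *table, size_t n, size_t idx)
{
    if (idx >= n)
        return NULL;
    table[idx].refcount++;
    return &table[idx];
}

/* Drop the reference taken by get_task(). Must not be called with NULL. */
void put_task(struct task *t)
{
    t->refcount--;
}

int check_permission(struct task *table, size_t n, size_t idx, int uid)
{
    struct task *t = get_task(table, n, idx);
    int allowed;

    if (!t)
        return -ESRCH;           /* bail out before touching the task */
    allowed = (t->uid == uid);
    put_task(t);
    return allowed ? 0 : -EPERM;
}
```
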
6733e54b6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
FUSE: Notifying the kernel of deletion.
fuse: support ioctl on directories
fuse: Use kcalloc instead of kzalloc to allocate array
fuse: llseek optimize SEEK_CUR and SEEK_SET

Linus Torvalds
2012-01-13 04:39:21 +0800
83eb26af0 ceph: ensure prealloc_blob is in place when removing xattr

In __ceph_build_xattrs_blob(), if a ceph inode's extended attributes
are marked dirty, all attributes recorded in its rb_tree index are
formatted into a "blob" buffer. The target buffer is recorded in
ceph_inode->i_xattrs.prealloc_blob, and it is expected to exist and
be of sufficient size to hold the attributes.

The extended attributes are marked dirty in two cases: when a new
attribute is added to the inode; or when one is removed. In the
former case work is done to ensure the prealloc_blob buffer is
properly set up, but in the latter it is not.

Change the logic in ceph_removexattr() so it matches what is
done in ceph_setxattr(). Note that this is done in a way that
keeps the two blocks of code nearly identical, in anticipation
of a subsequent patch that encapsulates some of this logic into
one or more helper routines.

Signed-off-by: Alex Elder
Signed-off-by: Sage Weil

Alex Elder
2012-01-13 03:00:51 +0800
a40dc6cc2 ceph: enable/disable dentry complete flags via mount option

Enable/disable use of the dentry dir 'complete' flag via a mount option.
This lets the admin control whether ceph uses the dcache to satisfy
negative lookups or readdir when it has the entire directory contents in
its cache.

This is purely a performance optimization; correctness is guaranteed
whether it is enabled or not.

Reviewed-by: Christoph Hellwig
Signed-off-by: Sage Weil

Sage Weil
2012-01-13 03:00:40 +0800
46f72b349 vfs: export symbol d_find_any_alias()

Ceph needs this.

Reviewed-by: Christoph Hellwig
Signed-off-by: Sage Weil

Sage Weil
2012-01-13 03:00:28 +0800

12 Jan, 2012

3 commits

d46cfba53 ceph: always initialize the dentry in open_root_dentry()

When open_root_dentry() gets a dentry via d_obtain_alias() it does
not get initialized. If the dentry obtained came from the cache,
this is OK. But if not, the result is an improperly initialized
dentry.

To fix this, call ceph_init_dentry() regardless of which path
produced the dentry. That function returns immediately for a dentry
that is already initialized, so it is safe to use either way.

(Credit to Sage, who suggested this fix.)

Signed-off-by: Alex Elder

Alex Elder
2012-01-12 08:28:25 +0800
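
The reason d46cfba53 can call the init function unconditionally is that the function is idempotent. A minimal sketch of that property, using a made-up `struct dentry_model` and `init_dentry()` in place of the ceph structures, looks like this:

```c
#include <stdlib.h>

/* Hypothetical stand-in for a dentry with filesystem-private data. */
struct dentry_model {
    void *fsdata;
};

/*
 * Attach private data to the dentry if it has none yet.
 * Returns 0 on success (including "already initialized"), -1 on
 * allocation failure. Calling it twice is harmless.
 */
int init_dentry(struct dentry_model *d)
{
    if (d->fsdata)
        return 0;                /* came from the cache, already set up */
    d->fsdata = calloc(1, 16);   /* 16 bytes is an arbitrary placeholder size */
    return d->fsdata ? 0 : -1;
}
```

With an init routine of this shape the caller does not need to know whether the dentry came from the cache or was freshly created.
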
d34315da9 UBIFS: fix debugging messages

Patch 56e46742e846e4de167dde0e1e1071ace1c882a5 broke UBIFS debugging messages:
before that commit when UBIFS debugging was enabled, users saw a few useful
debugging messages after mount. However, that patch turned 'dbg_msg()' into
'pr_debug()', so to see the debugging messages users have to enable them
first via /sys/kernel/debug/dynamic_debug/control, which is very impractical.

This commit makes 'dbg_msg()' use 'printk()' instead of 'pr_debug()', just
as it was before the breakage.

Signed-off-by: Artem Bityutskiy
Cc: stable@kernel.org [3.0+]

Artem Bityutskiy
2012-01-12 00:44:53 +0800
1f5d78dc4 UBIFS: make debugging messages light again

We switched to dynamic debugging in commit
56e46742e846e4de167dde0e1e1071ace1c882a5 but did not take into account that
we no longer control whether a specific message is enabled or not.
So now we lock the "dbg_lock" and release it in every debugging macro, which
makes them not so light-weight.

This commit removes the "dbg_lock" protection from the debugging macros to
fix the issue.

The downside is that now our DBGKEY() stuff is broken, but this is not
critical at all and will be fixed later.

Signed-off-by: Artem Bityutskiy
Cc: stable@kernel.org [3.0+]

Artem Bityutskiy
2012-01-12 00:44:53 +0800

11 Jan, 2012

15 commits

66ad863b4 GFS2: Fix nlink setting on inode creation

Since the nlink count will be 0, we need to use set_nlink rather
than inc_nlink in order to avoid triggering the inc_nlink warning
which was added recently.

Signed-off-by: Steven Whitehouse

Steven Whitehouse
2012-01-11 20:35:05 +0800
376d37788 GFS2: fail mount if journal recovery fails

If the first mounter fails to recover one of the journals
during mount, the mount should fail.

Signed-off-by: David Teigland
Signed-off-by: Steven Whitehouse

David Teigland
2012-01-11 17:24:48 +0800
e8ca5cc57 GFS2: let spectator mount do read only recovery

Previously, a spectator mount would not even attempt to do
journal recovery for a failed node. This meant that if all
mounted nodes were spectators, everyone would be stuck after
a node failed, all waiting for recovery to be performed.
This is unnecessary since the failed node had a clean journal.

Instead, allow a spectator mount to do a partial "read only"
recovery, which means it will check if the failed journal is
clean, and if so, report a successful recovery. If the failed
journal is not clean, it reports that journal recovery failed.
This makes it work the same as a read only mount on a read only
block device.

Signed-off-by: David Teigland
Signed-off-by: Steven Whitehouse

David Teigland
2012-01-11 17:23:40 +0800
49528b4e4 GFS2: Fix a use-after-free that coverity spotted

In function gfs2_inplace_release it was trying to unlock a gfs2_holder
structure associated with a reservation, after said reservation was
freed. The problem is that the statements have the wrong order.
This patch corrects the order so that the reservation is freed after
the gfs2_holder is unlocked.

Signed-off-by: Bob Peterson
Signed-off-by: Steven Whitehouse

Bob Peterson
2012-01-11 17:23:26 +0800
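
The ordering problem fixed in 49528b4e4 is easy to show in miniature. The sketch below assumes, for illustration only, that the holder is embedded in the reservation; `struct holder`, `struct reservation`, `struct inode_model` and `inplace_release()` are hypothetical and do not mirror the GFS2 structures. What it demonstrates is the corrected order: finish with the holder first, free the reservation last.

```c
#include <stdlib.h>

/* Hypothetical model: the lock holder lives inside the reservation. */
struct holder {
    int locked;
};

struct reservation {
    struct holder gh;
};

struct inode_model {
    struct reservation *res;
    int unlock_calls;          /* counts unlocks, for demonstration */
};

void inplace_release(struct inode_model *ip)
{
    struct reservation *rs = ip->res;

    /* 1. Unlock the holder while the reservation still exists. */
    rs->gh.locked = 0;
    ip->unlock_calls++;

    /* 2. Only then free the reservation that contains it. */
    ip->res = NULL;
    free(rs);
}
```

Swapping the two steps would make step 1 write to memory that has already been freed, which is the kind of defect a static analyser such as Coverity reports.
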
e0c2a9aa1 GFS2: dlm based recovery coordination

This new method of managing recovery is an alternative to
the previous approach of using the userland gfs_controld.

- use dlm slot numbers to assign journal id's
- use dlm recovery callbacks to initiate journal recovery
- use a dlm lock to determine the first node to mount fs
- use a dlm lock to track journals that need recovery

Signed-off-by: David Teigland
Signed-off-by: Steven Whitehouse

David Teigland
2012-01-11 17:23:05 +0800
5cd9599bb Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
autofs4: deal with autofs4_write/autofs4_write races
autofs4: catatonic_mode vs. notify_daemon race
autofs4: autofs4_wait() vs. autofs4_catatonic_mode() race
hfsplus: creation of hidden dir on mount can fail
block_dev: Suppress bdev_cache_init() kmemleak warninig
fix shrink_dcache_parent() livelock
coda: switch coda_cnode_make() to sane API as well, clean coda_lookup()
coda: deal correctly with allocation failure from coda_cnode_makectl()
securityfs: fix object creation races

Linus Torvalds
2012-01-11 13:46:36 +0800
d668dc566 autofs4: deal with autofs4_write/autofs4_write races

Just serialize the actual writing of packets into pipe on
a new mutex, independent from everything else in the locking
hierarchy. As soon as something has started feeding a piece
of packet into the pipe to daemon, we *want* everything else
about to try the same to wait until we are done.

Acked-by: Ian Kent
Signed-off-by: Al Viro

Al Viro
2012-01-11 13:20:12 +0800
875333326 autofs4: catatonic_mode vs. notify_daemon race

we need to hold ->wq_mutex while we are forming the packet to send,
lest we have autofs4_catatonic_mode() setting wq->name.name to NULL
just as autofs4_notify_daemon() decides to memcpy() from it...

We do have check for catatonic mode immediately after that (under
->wq_mutex, as it ought to be) and packet won't be actually sent,
but it'll be too late for us if we oops on that memcpy() from NULL...

Fix is obvious - just extend the area covered by ->wq_mutex over
that switch and check whether it's catatonic *before* doing anything
else.

Acked-by: Ian Kent
Signed-off-by: Al Viro

Al Viro
2012-01-11 13:19:58 +0800
4041bcdc7 autofs4: autofs4_wait() vs. autofs4_catatonic_mode() race

We need to recheck ->catatonic after autofs4_wait() got ->wq_mutex
for good, or we might end up with wq inserted into queue after
autofs4_catatonic_mode() had done its thing. It will stick there
forever, since there won't be anything to clear its ->name.name.

A bit of a complication: validate_request() drops and regains ->wq_mutex.
It actually ends up the most convenient place to stick the check into...

Acked-by: Ian Kent
Signed-off-by: Al Viro

Al Viro
2012-01-11 13:19:12 +0800
001a541ea Merge branch 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux

* 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
writeback: balanced_rate cannot exceed write bandwidth
writeback: do strict bdi dirty_exceeded
writeback: avoid tiny dirty poll intervals
writeback: max, min and target dirty pause time
writeback: dirty ratelimit - think time compensation
btrfs: fix dirtied pages accounting on sub-page writes
writeback: fix dirtied pages accounting on redirty
writeback: fix dirtied pages accounting on sub-page writes
writeback: charge leaked page dirties to active tasks
writeback: Include all dirty inodes in background writeback

Linus Torvalds
2012-01-11 08:59:59 +0800
40ba58792 Merge branch 'akpm' (aka "Andrew's patch-bomb")

Andrew elucidates:
- First installment of MM. We have a HUGE number of MM patches this
time. It's crazy.
- MAINTAINERS updates
- backlight updates
- leds
- checkpatch updates
- misc ELF stuff
- rtc updates
- reiserfs
- procfs
- some misc other bits

* akpm: (124 commits)
user namespace: make signal.c respect user namespaces
workqueue: make alloc_workqueue() take printf fmt and args for name
procfs: add hidepid= and gid= mount options
procfs: parse mount options
procfs: introduce the /proc/<pid>/map_files/ directory
procfs: make proc_get_link to use dentry instead of inode
signal: add block_sigmask() for adding sigmask to current->blocked
sparc: make SA_NOMASK a synonym of SA_NODEFER
reiserfs: don't lock root inode searching
reiserfs: don't lock journal_init()
reiserfs: delay reiserfs lock until journal initialization
reiserfs: delete comments referring to the BKL
drivers/rtc/interface.c: fix alarm rollover when day or month is out-of-range
drivers/rtc/rtc-twl.c: add DT support for RTC inside twl4030/twl6030
drivers/rtc/: remove redundant spi driver bus initialization
drivers/rtc/rtc-jz4740.c: make jz4740_rtc_driver static
drivers/rtc/rtc-mc13xxx.c: make mc13xxx_rtc_idtable static
rtc: convert drivers/rtc/* to use module_platform_driver()
drivers/rtc/rtc-wm831x.c: convert to devm_kzalloc()
drivers/rtc/rtc-wm831x.c: remove unused period IRQ handler
...

Linus Torvalds
2012-01-11 08:42:48 +0800
0499680a4 procfs: add hidepid= and gid= mount options

Add support for mount options to restrict access to /proc/PID/
directories. The default backward-compatible "relaxed" behaviour is left
untouched.

The first mount option is called "hidepid" and its value defines how much
info about processes we want to be available for non-owners:

hidepid=0 (default) means the old behavior - anybody may read all
world-readable /proc/PID/* files.

hidepid=1 means users may not access any /proc/<pid>/ directories but
their own. Sensitive files like cmdline, sched*, status are now protected
against other users. As permission checking is done in proc_pid_permission()
and files' permissions are left untouched, programs expecting specific
files' modes are not confused.

hidepid=2 means hidepid=1 plus all /proc/PID/ will be invisible to other
users. It doesn't mean that it hides whether a process exists (it can be
learned by other means, e.g. by kill -0 $PID), but it hides process' euid
and egid. It complicates an intruder's task of gathering info about running
processes, whether some daemon runs with elevated privileges, whether
another user runs some sensitive program, whether other users run any
program at all, etc.

gid=XXX defines a group that will be able to gather all processes' info
(as in hidepid=0 mode). This group should be used instead of putting a
non-root user in the sudoers file or similar. However, untrusted users (like
daemons, etc.) which are not supposed to monitor the tasks in the whole
system should not be added to the group.
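
The rules above can be summarised as a small decision model. This is only
an illustrative sketch in Python, not the kernel code: the names
ProcMountOpts, can_access and is_listed are hypothetical, and privileged
users (for example root) are deliberately not modelled.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class ProcMountOpts:
    """Hypothetical container for the two mount options described above."""
    hidepid: int = 0            # 0, 1 or 2
    gid: Optional[int] = None   # group exempt from the restrictions


def can_access(opts: ProcMountOpts, requester_uid: int,
               requester_gids: Set[int], owner_uid: int) -> bool:
    """May the requester look inside the owner's /proc/<pid>/ directory?"""
    if opts.hidepid == 0:
        return True                      # old behaviour: no restriction
    if requester_uid == owner_uid:
        return True                      # users always see their own tasks
    # Members of the gid= group are treated as in hidepid=0 mode.
    return opts.gid is not None and opts.gid in requester_gids


def is_listed(opts: ProcMountOpts, requester_uid: int,
              requester_gids: Set[int], owner_uid: int) -> bool:
    """Is /proc/<pid>/ visible at all to the requester?"""
    if opts.hidepid < 2:
        return True                      # hidepid=1 restricts access only
    return can_access(opts, requester_uid, requester_gids, owner_uid)
```

With hidepid=1 a foreign task is still listed but cannot be entered; with
hidepid=2 it disappears from the listing unless the requester is in the
gid= group.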

hidepid=1 or higher is designed to restrict access to procfs files, which
might reveal some sensitive private information like precise keystroke
timings:

http://www.openwall.com/lists/oss-security/2011/11/05/3

hidepid=1/2 doesn't break monitoring userspace tools. ps, top, pgrep, and
conky gracefully handle EPERM/ENOENT and behave as if the current user is
the only user running processes. pstree shows the process subtree which
contains "pstree" process.
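
As an illustration of what "gracefully handle EPERM/ENOENT" can look like
in a monitoring tool, here is a minimal Python sketch. It is not taken from
ps or top; the function name visible_cmdlines and the proc_root parameter
are assumptions made for the example.

```python
import os


def visible_cmdlines(proc_root="/proc"):
    """Return {pid: argv} for every task whose cmdline we are allowed to read.

    Tasks that are hidden, restricted or already gone are skipped silently,
    so the caller simply sees fewer processes.
    """
    result = {}
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue                     # skip non-PID entries such as "self"
        path = os.path.join(proc_root, name, "cmdline")
        try:
            with open(path, "rb") as f:
                raw = f.read()
        except (PermissionError, FileNotFoundError):
            continue                     # EPERM/EACCES or ENOENT: not ours to see
        # cmdline holds NUL-separated arguments; drop the empty trailing field.
        result[int(name)] = [arg.decode(errors="replace")
                             for arg in raw.split(b"\0") if arg]
    return result
```

Under hidepid=1 or hidepid=2 such a loop returns only the caller's own
tasks, which matches the behaviour described above.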

Note: the patch doesn't deal with setuid/setgid issues of keeping
preopened descriptors of procfs files (like
https://lkml.org/lkml/2011/2/7/368). We rely on the fact that leaked
information like the scheduling counters of setuid apps doesn't threaten
anybody's privacy - only the user who started the setuid program may read
the counters.

Signed-off-by: Vasiliy Kulikov
Cc: Alexey Dobriyan
Cc: Al Viro
Cc: Randy Dunlap
Cc: "H. Peter Anvin"
Cc: Greg KH
Cc: Theodore Tso
Cc: Alan Cox
Cc: James Morris
Cc: Oleg Nesterov
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vasiliy Kulikov
2012-01-11 08:30:54 +0800
97412950b procfs: parse mount options

Add support for procfs mount options. Actual mount options are coming in
the next patches.

Signed-off-by: Vasiliy Kulikov
Cc: Alexey Dobriyan
Cc: Al Viro
Cc: Randy Dunlap
Cc: "H. Peter Anvin"
Cc: Greg KH
Cc: Theodore Tso
Cc: Alan Cox
Cc: James Morris
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vasiliy Kulikov
2012-01-11 08:30:54 +0800
640708a2c procfs: introduce the /proc/<pid>/map_files/ directory

This one behaves similarly to the /proc/<pid>/fd/ one - it contains
one symlink for each file-backed mapping. The name of a symlink is
"vma->vm_start-vma->vm_end" and the target is the file. Opening a symlink
results in a file that points to exactly the same inode as the vma's one.

For example, the ls -l output of some arbitrary /proc/<pid>/map_files/:

| lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
| lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
| lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
| lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
| lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so

This *helps* the checkpointing process in three ways:

1. When dumping a task's mappings we know the exact file that is mapped
by a particular region. We do this by opening the
/proc/$pid/map_files/$address symlink the way we do with file
descriptors.

2. This also helps in determining which anonymous shared mappings are
shared with each other by comparing their inodes.

3. When restoring a set of processes, if two of them share a mapping,
we map the memory in the 1st one and then open its
/proc/$pid/map_files/$address file and map it in the 2nd task.

Using /proc/$pid/maps for this is quite inconvenient since it requires
repeatedly re-reading and reparsing this text file, which slows down the
restore procedure significantly. Also, as pointed out in (3), it is way
easier to use a top-level shared mapping in children as
/proc/$pid/map_files/$address when needed.
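
To show how the entry names can be consumed from userspace without parsing
/proc/$pid/maps, here is a minimal Python sketch. It is not the
checkpoint/restore tooling itself; the helper names and the proc_root
parameter are assumptions, and the directory may be missing or unreadable
depending on kernel configuration and privileges.

```python
import os


def parse_map_files_name(name):
    """Split a "start-end" entry name into two integer addresses."""
    start_hex, sep, end_hex = name.partition("-")
    if not sep:
        raise ValueError("not a map_files entry name: %r" % name)
    start, end = int(start_hex, 16), int(end_hex, 16)
    if start >= end:
        raise ValueError("empty or inverted range: %r" % name)
    return start, end


def list_file_mappings(pid="self", proc_root="/proc"):
    """Return [(start, end, target)] for each symlink in map_files."""
    directory = os.path.join(proc_root, str(pid), "map_files")
    mappings = []
    for name in os.listdir(directory):
        start, end = parse_map_files_name(name)
        # The symlink target is the path of the mapped file.
        target = os.readlink(os.path.join(directory, name))
        mappings.append((start, end, target))
    mappings.sort()                      # order by start address
    return mappings
```

Opening an entry instead of reading the link yields a file for the mapped
object, which is how the dump and restore steps listed above would use it.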

[akpm@linux-foundation.org: coding-style fixes]
[gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
Signed-off-by: Pavel Emelyanov
Signed-off-by: Cyrill Gorcunov
Reviewed-by: Vasiliy Kulikov
Reviewed-by: "Kirill A. Shutemov"
Cc: Tejun Heo
Cc: Alexey Dobriyan
Cc: Al Viro
Cc: Pavel Machek
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pavel Emelyanov
2012-01-11 08:30:54 +0800
7773fbc54 procfs: make proc_get_link to use dentry instead of inode

Prepare the ground for the next "map_files" patch which needs a name of a
link file to analyse.

Signed-off-by: Cyrill Gorcunov
Cc: Pavel Emelyanov
Cc: Tejun Heo
Cc: Vasiliy Kulikov
Cc: "Kirill A. Shutemov"
Cc: Alexey Dobriyan
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-01-11 08:30:54 +0800