Eric Lee / smarc-fsl-linux-kernel

17 Jan, 2012

1 commit

a12587b00 Merge tag 'nfs-for-3.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

NFS client bugfixes and cleanups for Linux 3.3 (pull 2)

* tag 'nfs-for-3.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
pnfsblock: alloc short extent before submit bio
pnfsblock: remove rpc_call_ops from struct parallel_io
pnfsblock: move find lock page logic out of bl_write_pagelist
pnfsblock: cleanup bl_mark_sectors_init
pnfsblock: limit bio page count
pnfsblock: don't spinlock when freeing block_dev
pnfsblock: clean up _add_entry
pnfsblock: set read/write tk_status to pnfs_error
pnfsblock: acquire im_lock in _preload_range
NFS4: fix compile warnings in nfs4proc.c
nfs: check for integer overflow in decode_devicenotify_args()
NFS: cleanup endian type in decode_ds_addr()
NFS: add an endian notation

Linus Torvalds
2012-01-17 07:08:13 +0800

16 Jan, 2012

3 commits

122804ecb Merge branch 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

* 'v4l_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (655 commits)
[media] revert patch: HDIC HD29L2 DMB-TH USB2.0 reference design driver
mb86a20s: Add a few more register settings at the init seq
mb86a20s: Group registers into the same line
[media] [PATCH] don't reset the delivery system on DTV_CLEAR
[media] [BUG] it913x-fe fix typo error making SNR levels unstable
[media] cx23885: Query the CX25840 during enum_input for status
[media] cx25840: Add support for g_input_status
[media] rc-videomate-m1f.c Rename to match remote controler name
[media] drivers: media: au0828: Fix dependency for VIDEO_AU0828
[media] convert drivers/media/* to use module_platform_driver()
[media] drivers: video: cx231xx: Fix dependency for VIDEO_CX231XX_DVB
[media] Exynos4 JPEG codec v4l2 driver
[media] doc: v4l: selection: choose pixels as units for selection rectangles
[media] v4l: s5p-tv: mixer: fix setup of VP scaling
[media] v4l: s5p-tv: mixer: add support for selection API
[media] v4l: emulate old crop API using extended crop/compose API
[media] doc: v4l: add documentation for selection API
[media] doc: v4l: add binary images for selection API
[media] v4l: add support for selection api
[media] hd29l2: fix review findings
...

Linus Torvalds
2012-01-16 04:49:56 +0800

b3c9dd182 Merge branch 'for-3.3/core' of git://git.kernel.dk/linux-block

* 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
Revert "block: recursive merge requests"
block: Stop using macro stubs for the bio data integrity calls
blockdev: convert some macros to static inlines
fs: remove unneeded plug in mpage_readpages()
block: Add BLKROTATIONAL ioctl
block: Introduce blk_set_stacking_limits function
block: remove WARN_ON_ONCE() in exit_io_context()
block: an exiting task should be allowed to create io_context
block: ioc_cgroup_changed() needs to be exported
block: recursive merge requests
block, cfq: fix empty queue crash caused by request merge
block, cfq: move icq creation and rq->elv.icq association to block core
block, cfq: restructure io_cq creation path for io_context interface cleanup
block, cfq: move io_cq exit/release to blk-ioc.c
block, cfq: move icq cache management to block core
block, cfq: move io_cq lookup to blk-ioc.c
block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
block, cfq: reorganize cfq_io_context into generic and cfq specific parts
block: remove elevator_queue->ops
block: reorder elevator switch sequence
...

Fix up conflicts in:
 - block/blk-cgroup.c
	Switch from can_attach_task to can_attach
 - block/cfq-iosched.c
	conflict with now removed cic index changes (we now use q->id instead)

Linus Torvalds
2012-01-16 04:24:45 +0800

a520458fc Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6

* 'linux-next' of git://git.infradead.org/ubifs-2.6:
UBI: use own macros for the layout volume
UBI: fix nameless volumes handling
UBIFS: fix non-debug configuration build

Linus Torvalds
2012-01-16 03:25:41 +0800

15 Jan, 2012

5 commits

e234b5f20 UBIFS: fix non-debug configuration build

Fix a brown paperbag bug introduced by me in the previous commit. I was
in a hurry and forgot about the non-debug case completely.

Artem: amend the commit message and tweak the patch to preserve alignment.
This made the patch a bit less readable, though.

Signed-off-by: Artem Bityutskiy

Dominique Martinet
2012-01-15 19:46:02 +0800

c49c41a41 Merge branch 'for-linus' of git://selinuxproject.org/~jmorris/linux-security

* 'for-linus' of git://selinuxproject.org/~jmorris/linux-security:
capabilities: remove __cap_full_set definition
security: remove the security_netlink_recv hook as it is equivalent to capable()
ptrace: do not audit capability check when outputing /proc/pid/stat
capabilities: remove task_ns_* functions
capabitlies: ns_capable can use the cap helpers rather than lsm call
capabilities: style only - move capable below ns_capable
capabilites: introduce new has_ns_capabilities_noaudit
capabilities: call has_ns_capability from has_capability
capabilities: remove all _real_ interfaces
capabilities: introduce security_capable_noaudit
capabilities: reverse arguments to security_capable
capabilities: remove the task from capable LSM hook entirely
selinux: sparse fix: fix several warnings in the security server cod
selinux: sparse fix: fix warnings in netlink code
selinux: sparse fix: eliminate warnings for selinuxfs
selinux: sparse fix: declare selinux_disable() in security.h
selinux: sparse fix: move selinux_complete_init
selinux: sparse fix: make selinux_secmark_refcount static
SELinux: Fix RCU deref check warning in sel_netport_insert()

Manually fix up a semantic mis-merge wrt security_netlink_recv():

- the interface was removed in commit fd7784615248 ("security: remove
the security_netlink_recv hook as it is equivalent to capable()")

- a new user of it appeared in commit a38f7907b926 ("crypto: Add
userspace configuration API")

causing no automatic merge conflict, but Eric Paris pointed out the
issue.

Linus Torvalds
2012-01-15 10:36:33 +0800

fed474857 fsnotify: don't BUG in fsnotify_destroy_mark()

Removing the parent of a watched file results in "kernel BUG at
fs/notify/mark.c:139".

To reproduce

add "-w /tmp/audit/dir/watched_file" to audit.rules
rm -rf /tmp/audit/dir

This is caused by fsnotify_destroy_mark() being called without an
extra reference taken by the caller.

Reported by Francesco Cosoleto here:

https://bugzilla.novell.com/show_bug.cgi?id=689860

Fix by removing the BUG_ON and adding a comment about not accessing mark after
the iput.

Signed-off-by: Miklos Szeredi
CC: stable@vger.kernel.org
Signed-off-by: Linus Torvalds

Miklos Szeredi
2012-01-15 10:01:42 +0800

0a80939b3 Merge tag 'for-linus' of git://github.com/rustyrussell/linux

Autogenerated GPG tag for Rusty D1ADB8F1: 15EE 8D6C AB0E 7F0C F999 BFCB D920 0E6C D1AD B8F1

* tag 'for-linus' of git://github.com/rustyrussell/linux:
module_param: check that bool parameters really are bool.
intelfbdrv.c: bailearly is an int module_param
paride/pcd: fix bool verbose module parameter.
module_param: make bool parameters really bool (drivers & misc)
module_param: make bool parameters really bool (arch)
module_param: make bool parameters really bool (core code)
kernel/async: remove redundant declaration.
printk: fix unnecessary module_param_name.
lirc_parallel: fix module parameter description.
module_param: avoid bool abuse, add bint for special cases.
module_param: check type correctness for module_param_array
modpost: use linker section to generate table.
modpost: use a table rather than a giant if/else statement.
modules: sysfs - export: taint, coresize, initsize
kernel/params: replace DEBUGP with pr_debug
module: replace DEBUGP with pr_debug
module: struct module_ref should contains long fields
module: Fix performance regression on modules with large symbol tables
module: Add comments describing how the "strmap" logic works

Fix up conflicts in scripts/mod/file2alias.c due to the new linker-
generated table approach to adding __mod_*_device_table entries. The
ARM sa11x0 mcp bus needed to be converted to that too.

Linus Torvalds
2012-01-15 04:32:16 +0800

0b48d4223 Merge branch 'for-3.3' of git://linux-nfs.org/~bfields/linux

* 'for-3.3' of git://linux-nfs.org/~bfields/linux: (31 commits)
nfsd4: nfsd4_create_clid_dir return value is unused
NFSD: Change name of extended attribute containing junction
svcrpc: don't revert to SVC_POOL_DEFAULT on nfsd shutdown
svcrpc: fix double-free on shutdown of nfsd after changing pool mode
nfsd4: be forgiving in the absence of the recovery directory
nfsd4: fix spurious 4.1 post-reboot failures
NFSD: forget_delegations should use list_for_each_entry_safe
NFSD: Only reinitilize the recall_lru list under the recall lock
nfsd4: initialize special stateid's at compile time
NFSd: use network-namespace-aware cache registering routines
SUNRPC: create svc_xprt in proper network namespace
svcrpc: update outdated BKL comment
nfsd41: allow non-reclaim open-by-fh's in 4.1
svcrpc: avoid memory-corruption on pool shutdown
svcrpc: destroy server sockets all at once
svcrpc: make svc_delete_xprt static
nfsd: Fix oops when parsing a 0 length export
nfsd4: Use kmemdup rather than duplicating its implementation
nfsd4: add a separate (lockowner, inode) lookup
nfsd4: fix CONFIG_NFSD_FAULT_INJECTION compile error
...

Linus Torvalds
2012-01-15 04:26:41 +0800

14 Jan, 2012

6 commits

69e4747ee Unused iocbs in a batch should not be accounted as active.

Since commit 080d676de095 ("aio: allocate kiocbs in batches") iocbs are
allocated in a batch during processing of first iocbs. All iocbs in a
batch are automatically added to ctx->active_reqs list and accounted in
ctx->reqs_active.

If one of the iocbs submitted by a user fails (and it is not the last
one), further iocbs are not processed, but they are still present in
ctx->active_reqs and accounted in ctx->reqs_active. This causes the
process to get stuck in D state in wait_for_all_aios() on exit, since
ctx->reqs_active will never go down to zero. Furthermore, since
kiocb_batch_free() frees an iocb without removing it from the
active_reqs list, the list becomes corrupted, which may cause an oops.

Fix this by removing iocb from ctx->active_reqs and updating
ctx->reqs_active in kiocb_batch_free().
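
The fix described above can be pictured with a small userspace sketch. This is not the kernel's code: struct req, struct ctx, ctx_init(), ctx_add() and batch_free() are hypothetical stand-ins for the kiocb, the kioctx and kiocb_batch_free(), and all locking is left out. It only shows the idea of unlinking each unused request and adjusting the counter before the request is released.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's kiocb and kioctx. */
struct req {
	struct req *prev, *next;	/* links on the context's active list */
};

struct ctx {
	struct req active;		/* sentinel head of the active list */
	int reqs_active;		/* number of requests on that list */
};

static void ctx_init(struct ctx *c)
{
	c->active.prev = c->active.next = &c->active;
	c->reqs_active = 0;
}

/* Put a freshly allocated request on the active list and account for it. */
static void ctx_add(struct ctx *c, struct req *r)
{
	r->prev = &c->active;
	r->next = c->active.next;
	c->active.next->prev = r;
	c->active.next = r;
	c->reqs_active++;
}

/* Drop the unused requests of a batch: unlink each one and fix the count. */
static void batch_free(struct ctx *c, struct req **batch, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		struct req *r = batch[i];

		r->prev->next = r->next;
		r->next->prev = r->prev;
		r->prev = r->next = NULL;
		c->reqs_active--;
	}
}
```

With three requests added and two of them handed to batch_free(), the counter drops to one and the list holds only the request that is still in flight, so a waiter on the counter can eventually see zero.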

Signed-off-by: Gleb Natapov
Reviewed-by: Jeff Moyer
Cc: stable@kernel.org # 3.2
Signed-off-by: Linus Torvalds

Gleb Natapov
2012-01-14 12:39:44 +0800

96e80a785 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next

* git://git.kernel.org/pub/scm/linux/kernel/git/pkl/squashfs-next:
Squashfs: fix i_blocks calculation with extended regular files
Squashfs: fix mount time sanity check for corrupted superblock
Squashfs: optimise squashfs_cache_get entry search
Squashfs: Update documentation to include xattrs
Squashfs: add missing block release on error condition

Linus Torvalds
2012-01-14 02:34:57 +0800

57e6a7dde Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw

* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-3.0-nmw:
GFS2: Fix nlink setting on inode creation
GFS2: fail mount if journal recovery fails
GFS2: let spectator mount do read only recovery
GFS2: Fix a use-after-free that coverity spotted
GFS2: dlm based recovery coordination

Linus Torvalds
2012-01-14 02:33:39 +0800

94b1984ab Merge branch 'linux-next' of git://git.infradead.org/ubifs-2.6

* 'linux-next' of git://git.infradead.org/ubifs-2.6:
UBIFS: fix key printing
UBIFS: use snprintf instead of sprintf when printing keys
UBIFS: fix debugging messages
UBIFS: make debugging messages light again
UBI: fix debugging messages
UBI: make vid_hdr non-static

Linus Torvalds
2012-01-14 02:31:33 +0800

1a52bb0b6 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
ceph: ensure prealloc_blob is in place when removing xattr
rbd: initialize snap_rwsem in rbd_add()
ceph: enable/disable dentry complete flags via mount option
vfs: export symbol d_find_any_alias()
ceph: always initialize the dentry in open_root_dentry()
libceph: remove useless return value for osd_client __send_request()
ceph: avoid iput() while holding spinlock in ceph_dir_fsync
ceph: avoid useless dget/dput in encode_fh
ceph: dereference pointer after checking for NULL
crush: fix force for non-root TAKE
ceph: remove unnecessary d_fsdata conditional checks
ceph: Use kmemdup rather than duplicating its implementation

Fix up conflicts in fs/ceph/super.c (d_alloc_root() failure handling vs
always initialize the dentry in open_root_dentry)

Linus Torvalds
2012-01-14 02:29:21 +0800

8638094e9 autofs4 - fix deal with autofs4_write races

I don't know how I missed this obvious mistake when I
reviewed Al's patches, sorry.

[ Quoting Al:

Grr... Note to self: do git status *and* git stash show -p
before git push. Nothing like "WTF? I'd fixed that braino"
feeling ;-/

Al sent the same patch - it got broken in commit d668dc56631d:
"autofs4: deal with autofs4_write/autofs4_write races". ]

Reported-and-tested-by: Dave Airlie
Signed-off-by: Ian Kent
Signed-off-by: Al Viro
Signed-off-by: Linus Torvalds

Ian Kent
2012-01-14 00:30:49 +0800

13 Jan, 2012

25 commits

515315a12 UBIFS: fix key printing

Before commit 56e46742e846e4de167dde0e1e1071ace1c882a5 we had locking
around all printing macros and could use static buffers for creating
key strings and printing them. However, now we do not have that locking
and we cannot use static buffers. This commit removes the old DBGKEY()
macros and introduces a few new helper macros for printing debugging
messages plus a key at the end. Thankfully, all the messages are already
structured in a way that the key is printed at the end.

Signed-off-by: Artem Bityutskiy

Artem Bityutskiy
2012-01-13 18:50:42 +0800

beba00607 UBIFS: use snprintf instead of sprintf when printing keys

Switch to 'snprintf()', which is more secure and reliable. This also
prepares for the subsequent key printing fixes.
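
A minimal sketch of why the bounded call is safer, assuming a made-up key layout of an inode number plus a type: format_key() and its arguments are hypothetical and are not the UBIFS helpers. snprintf() never writes more than the buffer size, always terminates the string when the size is non-zero, and returns the length the full string would have needed.

```c
#include <stdio.h>

/*
 * Hypothetical key formatter. Returns what snprintf() returns: the number
 * of characters the complete string needs, excluding the terminating NUL.
 */
static int format_key(char *buf, size_t len, unsigned long inum, unsigned int type)
{
	return snprintf(buf, len, "(%lu, %u)", inum, type);
}
```

A caller can compare the return value with the buffer size to detect truncation instead of silently overrunning the buffer, which is what sprintf() would risk.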

Signed-off-by: Artem Bityutskiy

Artem Bityutskiy
2012-01-13 18:46:21 +0800

099469502 Merge branch 'akpm' (aka "Andrew's patch-bomb, take two")

Andrew explains:

- various misc stuff

- Most of the rest of MM: memcg, threaded hugepages, others.

- cpumask

- kexec

- kdump

- some direct-io performance tweaking

- radix-tree optimisations

- new selftests code

A note on this: often people will develop a new userspace-visible
feature and will develop userspace code to exercise/test that
feature. Then they merge the patch and the selftest code dies.
Sometimes we paste it into the changelog. Sometimes the code gets
thrown into Documentation/(!).

This saddens me. So this patch creates a bare-bones framework which
will henceforth allow me to ask people to include their test apps in
the kernel tree so we can keep them alive. Then when people enhance
or fix the feature, I can ask them to update the test app too.

The infrastructure is terribly trivial at present - let's see how it
evolves.

- checkpoint/restart feature work.

A note on this: this is a project by various mad Russians to perform
c/r mainly from userspace, with various oddball helper code added
into the kernel where the need is demonstrated.

So rather than some large central lump of code, what we have is
little bits and pieces popping up in various places which either
expose something new or which permit something which is normally
kernel-private to be modified.

The overall project is an ongoing thing. I've judged that the size
and scope of the thing means that we're more likely to be successful
with it if we integrate the support into mainline piecemeal rather
than allowing it all to develop out-of-tree.

However I'm less confident than the developers that it will all
eventually work! So what I'm asking them to do is to wrap each piece
of new code inside CONFIG_CHECKPOINT_RESTORE. So if it all
eventually comes to tears and the project as a whole fails, it should
be a simple matter to go through and delete all trace of it.

This lot pretty much wraps up the -rc1 merge for me.

* akpm: (96 commits)
unlzo: fix input buffer free
ramoops: update parameters only after successful init
ramoops: fix use of rounddown_pow_of_two()
c/r: prctl: add PR_SET_MM codes to set up mm_struct entries
c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4
c/r: introduce CHECKPOINT_RESTORE symbol
selftests: new x86 breakpoints selftest
selftests: new very basic kernel selftests directory
radix_tree: take radix_tree_path off stack
radix_tree: remove radix_tree_indirect_to_ptr()
dio: optimize cache misses in the submission path
vfs: cache request_queue in struct block_device
fs/direct-io.c: calculate fs_count correctly in get_more_blocks()
drivers/parport/parport_pc.c: fix warnings
panic: don't print redundant backtraces on oops
sysctl: add the kernel.ns_last_pid control
kdump: add udev events for memory online/offline
include/linux/crash_dump.h needs elf.h
kdump: fix crash_kexec()/smp_send_stop() race in panic()
kdump: crashk_res init check for /sys/kernel/kexec_crash_size
...

Linus Torvalds
2012-01-13 12:42:54 +0800

b3f7f573a c/r: procfs: add start_data, end_data, start_brk members to /proc/$pid/stat v4

The mm->start_code/end_code, mm->start_data/end_data and mm->start_brk
values are involved in the calculation of program text/data segment sizes
(which can be seen in /proc/$pid/statm) and in the final address of the
brk() call.

For restore we need to know all these values. While
mm->start_code/end_code are already present in /proc/$pid/stat, the
remaining members are not, so this patch brings them in.

The restore procedure of these members is addressed in another patch using
prctl().

Signed-off-by: Cyrill Gorcunov
Acked-by: Serge Hallyn
Reviewed-by: Kees Cook
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Alexey Dobriyan
Cc: Tejun Heo
Cc: Andrew Vagin
Cc: Vasiliy Kulikov
Cc: Alexey Dobriyan
Cc: "Eric W. Biederman"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Cyrill Gorcunov
2012-01-13 12:13:13 +0800

65dd2aa90 dio: optimize cache misses in the submission path

Some investigation of a transaction processing workload showed that a
major consumer of cycles in __blockdev_direct_IO is the cache miss while
accessing the block size. This is because it has to walk the chain from
block_dev to gendisk to queue.

The block size is needed early on to check alignment and sizes. It's only
done if the check for the inode block size fails. But the costly block
device state is unconditionally fetched.

- Reorganize the code to only fetch block dev state when actually
needed.

Then do a prefetch on the block dev early on in the direct IO path. This
is worth it, because there is substantial code run before we actually
touch the block dev now.

- I also added some unlikelies to make it clear to the compiler that the
block device fetch code is not normally executed.

This gave a small, but measurable improvement on a large database
benchmark (about 0.3%)

[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: using prefetch requires including prefetch.h]
Signed-off-by: Andi Kleen
Cc: Jeff Moyer
Cc: Jens Axboe
Cc: Christoph Hellwig
Signed-off-by: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andi Kleen
2012-01-13 12:13:12 +0800

87192a2a4 vfs: cache request_queue in struct block_device

This makes it possible to get from the inode to the request_queue with one
less cache miss. Used in a follow-on optimization.

The lifetime of the pointer is the same as the gendisk.

This assumes that the queue will always stay the same in the gendisk while
it's visible to block_devices. I think that's safe, correct?

Signed-off-by: Andi Kleen
Acked-by: Jeff Moyer
Cc: Jens Axboe
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andi Kleen
2012-01-13 12:13:12 +0800

ae55e1aaa fs/direct-io.c: calculate fs_count correctly in get_more_blocks()

In get_more_blocks(), we use dio_count to calculate fs_count and do some
tricky things to increase fs_count if dio_count isn't aligned. But
it still has some corner cases that aren't covered. See the
following example:

dio_write foo -s 1024 -w 4096

(direct write 4096 bytes at offset 1024). The same goes if the offset
isn't aligned to fs_blocksize.

In this case, the old calculation counts fs_count to be 1, but actually we
will write into 2 different blocks (if fs_blocksize=4096). The old code
still works, since it will call get_block twice (and may have to allocate
and create extents twice for filesystems like ext4). So we'd better call
get_block just once with the proper fs_count.
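
The arithmetic behind the example can be sketched as follows. This is an illustration, not the code in fs/direct-io.c: blocks_spanned() is a hypothetical helper and blkbits is assumed to be log2 of the filesystem block size (12 for 4096-byte blocks). The count is taken from the first and last block touched, so an unaligned start offset is accounted for.

```c
#include <stdint.h>

/*
 * Number of filesystem blocks touched by a write of 'len' bytes starting
 * at byte 'offset', for a block size of (1 << blkbits) bytes.
 */
static uint64_t blocks_spanned(uint64_t offset, uint64_t len, unsigned int blkbits)
{
	uint64_t first, last;

	if (len == 0)
		return 0;

	first = offset >> blkbits;		/* block holding the first byte */
	last = (offset + len - 1) >> blkbits;	/* block holding the last byte */
	return last - first + 1;
}
```

For the case in the commit message, blocks_spanned(1024, 4096, 12) gives 2, while an aligned write of the same size, blocks_spanned(0, 4096, 12), gives 1.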

Signed-off-by: Tao Ma
Cc: "Theodore Ts'o"
Cc: Christoph Hellwig
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tao Ma
2012-01-13 12:13:12 +0800

a6bc32b89 mm: compaction: introduce sync-light migration for use by compaction

This patch adds a lightweight sync migrate operation MIGRATE_SYNC_LIGHT
mode that avoids writing back pages to backing storage. Async compaction
maps to MIGRATE_ASYNC while sync compaction maps to MIGRATE_SYNC_LIGHT.
For other migrate_pages users such as memory hotplug, MIGRATE_SYNC is
used.

This avoids sync compaction stalling for an excessive length of time,
particularly when copying files to a USB stick where there might be a
large number of dirty pages backed by a filesystem that does not support
->writepages.
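
The mapping described above can be sketched like this. The mode names come from the commit text; the two helpers, compaction_migrate_mode() and may_write_back(), are hypothetical and only restate the message: sync compaction gets the light mode, and only full MIGRATE_SYNC is allowed to write pages back.

```c
#include <stdbool.h>

/* Mode names follow the commit text; the helpers below are hypothetical. */
enum migrate_mode {
	MIGRATE_ASYNC,		/* used by async compaction */
	MIGRATE_SYNC_LIGHT,	/* used by sync compaction, no writeback */
	MIGRATE_SYNC,		/* other callers, e.g. memory hotplug */
};

/* Pick the migration mode for a compaction run. */
static enum migrate_mode compaction_migrate_mode(bool sync)
{
	return sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC;
}

/* Only the full sync mode may write dirty pages to backing storage. */
static bool may_write_back(enum migrate_mode mode)
{
	return mode == MIGRATE_SYNC;
}
```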

[aarcange@redhat.com: This patch is heavily based on Andrea's work]
[akpm@linux-foundation.org: fix fs/nfs/write.c build]
[akpm@linux-foundation.org: fix fs/btrfs/disk-io.c build]
Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Andrea Arcangeli
Cc: Minchan Kim
Cc: Dave Jones
Cc: Jan Kara
Cc: Andy Isaacson
Cc: Nai Xia
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-01-13 12:13:09 +0800

b969c4ab9 mm: compaction: determine if dirty pages can be migrated without blocking within ->migratepage

Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
there was a debate on disabling synchronous compaction but this severely
impacted allocation success rates. Part of the reason was that many dirty
pages are skipped in asynchronous compaction by the following check;

	if (PageDirty(page) && !sync &&
	    mapping->a_ops->migratepage != migrate_page)
		rc = -EBUSY;

This skips over all mapping aops using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It is
the responsibility of the callback to fail gracefully if migration would
block.

Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Cc: Andrea Arcangeli
Cc: Minchan Kim
Cc: Dave Jones
Cc: Jan Kara
Cc: Andy Isaacson
Cc: Nai Xia
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2012-01-13 12:13:09 +0800

28d82dc1c epoll: limit paths

The current epoll code can be tickled to run basically indefinitely in
both loop detection path check (on ep_insert()), and in the wakeup paths.
The programs that tickle this behavior set up deeply linked networks of
epoll file descriptors that cause the epoll algorithms to traverse them
indefinitely. A couple of these sample programs have been previously
posted in this thread: https://lkml.org/lkml/2011/2/25/297.

To fix the loop detection path check algorithms, I simply keep track of
the epoll nodes that have been already visited. Thus, the loop detection
becomes proportional to the number of epoll file descriptors and links.
This dramatically decreases the run-time of the loop check algorithm. In
one diabolical case I tried it reduced the run-time from 15 minutes (all
in kernel time) to .3 seconds.
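
The visited-node idea can be sketched on a plain adjacency matrix. This is not the eventpoll.c implementation: struct graph, reaches() and would_create_loop() are hypothetical, and the graph is limited to MAX_NODES nodes for brevity. Because every node is expanded at most once, the check is bounded by the number of nodes and links rather than by the number of distinct paths.

```c
#include <stdbool.h>

#define MAX_NODES 16

/* Hypothetical graph of epoll nodes: edge[a][b] means "a watches b". */
struct graph {
	int n;
	bool edge[MAX_NODES][MAX_NODES];
};

/* Depth-first search that expands each node at most once. */
static bool reaches(const struct graph *g, int cur, int target, bool *visited)
{
	if (cur == target)
		return true;
	if (visited[cur])
		return false;	/* already explored, do not walk it again */
	visited[cur] = true;

	for (int next = 0; next < g->n; next++)
		if (g->edge[cur][next] && reaches(g, next, target, visited))
			return true;
	return false;
}

/* Adding the link from -> to closes a loop if 'to' can already reach 'from'. */
static bool would_create_loop(const struct graph *g, int from, int to)
{
	bool visited[MAX_NODES] = { false };

	return reaches(g, to, from, visited);
}
```

With links 0 -> 1 and 1 -> 2 in place, would_create_loop(&g, 2, 0) reports a loop, while adding 0 -> 2 does not.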

Fixing the wakeup paths could be done at wakeup time in a similar manner
by keeping track of nodes that have already been visited, but the
complexity is harder, since there can be multiple wakeups on different
cpus...Thus, I've opted to limit the number of possible wakeup paths when
the paths are created.

This is accomplished, by noting that the end file descriptor points that
are found during the loop detection pass (from the newly added link), are
actually the sources for wakeup events. I keep a list of these file
descriptors and limit the number and length of these paths that emanate
from these 'source file descriptors'. In the current implementation I
allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
length 4 and 10 of length 5. Note that it is sufficient to check the
'source file descriptors' reachable from the newly added link, since no
other 'source file descriptors' will have newly added links. This allows
us to check only the wakeup paths that may have gotten too long, and not
re-check all possible wakeup paths on the system.
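
The limits quoted above fit naturally in a small table indexed by path length. The sketch below is an assumption about how such a check could look, not a copy of the patch: struct path_counts and path_count_inc() are illustrative names, and the caller is assumed to reset the counters before walking the paths of each source file descriptor.

```c
#define MAX_PATH_LEN 5

/* Allowed number of wakeup paths of length 1..5, as listed in the text. */
static const int path_limits[MAX_PATH_LEN] = { 1000, 500, 100, 50, 10 };

/* Hypothetical per-check counters, one slot per path length. */
struct path_counts {
	int count[MAX_PATH_LEN];
};

/*
 * Record one more wakeup path of the given length (1-based).
 * Returns 0 if it is still within the limits, -1 if the path is too long
 * or there are already too many paths of that length.
 */
static int path_count_inc(struct path_counts *pc, int length)
{
	if (length < 1 || length > MAX_PATH_LEN)
		return -1;
	if (pc->count[length - 1] + 1 > path_limits[length - 1])
		return -1;
	pc->count[length - 1]++;
	return 0;
}
```

A caller would turn a -1 result into an error for the epoll_ctl() add that created the offending link.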

In terms of the path limit selection, I think it's first worth noting that
the most common case for epoll is probably the model where you have 1
epoll file descriptor that is monitoring n 'source file descriptors'.
In this case, each 'source file descriptor' has 1 path of
length 1. Thus, I believe that the limits I'm proposing are quite
reasonable and in fact may be too generous. Thus, I'm hoping that the
proposed limits will not cause any workloads that currently work to
fail.

In terms of locking, I have extended the use of the 'epmutex' to all
epoll_ctl add and remove operations. Currently it's only used in a subset
of the add paths. I need to hold the epmutex, so that we can correctly
traverse a coherent graph, to check the number of paths. I believe that
this additional locking is probably ok, since it's in the setup/teardown
paths, and doesn't affect the running paths, but it certainly is going to
add some extra overhead. Also worth noting is that the epmutex was
recently added to the ep_ctl add operations in the initial path loop
detection code using the argument that it was not on a critical path.

Another thing to note here, is the length of epoll chains that is allowed.
Currently, eventpoll.c defines:

/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4

This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
+ 1). However, this limit is currently only enforced during the loop
check detection code, and only when the epoll file descriptors are added
in a certain order. Thus, this limit is currently easily bypassed. The
newly added check for wakeup paths strictly limits the wakeup paths to a
length of 5, regardless of the order in which ep's are linked together.
Thus, a side-effect of the new code is a more consistent enforcement of
the graph depth.

Thus far, I've tested this using the sample programs previously
mentioned, which now either return quickly or return -EINVAL. I've also
tested using the piptest.c epoll tester, which showed no difference in
performance. I've also created a number of different epoll networks and
tested that they behave as expected.

I believe this solves the original diabolical test cases, while still
preserving the sane epoll nesting.

Signed-off-by: Jason Baron
Cc: Nelson Elhage
Cc: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason Baron
2012-01-13 12:13:04 +0800

2ccd4f4d4 pipe: fail cleanly when root tries F_SETPIPE_SZ with big size

When a user with the CAP_SYS_RESOURCE cap tries to F_SETPIPE_SZ a pipe
with size bigger than kmalloc() can alloc it spits out an ugly warning:

------------[ cut here ]------------
WARNING: at mm/page_alloc.c:2095 __alloc_pages_nodemask+0x5d3/0x7a0()
Pid: 733, comm: a.out Not tainted 3.2.0-rc1+ #4
Call Trace:
warn_slowpath_common+0x75/0xb0
warn_slowpath_null+0x15/0x20
__alloc_pages_nodemask+0x5d3/0x7a0
__get_free_pages+0x12/0x50
__kmalloc+0x12b/0x150
pipe_set_size+0x75/0x120
pipe_fcntl+0xf8/0x140
do_fcntl+0x2d4/0x410
sys_fcntl+0x66/0xa0
system_call_fastpath+0x16/0x1b
---[ end trace 432f702e6db7b5ee ]---

Instead, make kcalloc() handle the overflow case and fail quietly.
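
In userspace terms, the idea of an array allocation that fails quietly on overflow can be sketched as below. alloc_array() is a hypothetical helper, not the kernel's kcalloc(); it only illustrates checking the multiplication before asking the allocator for memory.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Allocate a zeroed array of n elements of 'size' bytes each.
 * Returns NULL, without calling the allocator, if n * size would overflow.
 */
static void *alloc_array(size_t n, size_t size)
{
	void *p;

	if (size != 0 && n > SIZE_MAX / size)
		return NULL;		/* product does not fit in size_t */

	p = malloc(n * size);
	if (p != NULL)
		memset(p, 0, n * size);
	return p;
}
```

The caller sees an ordinary allocation failure and can return an error code instead of triggering a warning deep inside the allocator.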

[akpm@linux-foundation.org: switch to sizeof(*bufs) for 80-column niceness]
Signed-off-by: Sasha Levin
Cc: Alexander Viro
Acked-by: Pekka Enberg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sasha Levin
2012-01-13 12:13:04 +0800

a2ef990ab proc: fix null pointer deref in proc_pid_permission()

get_proc_task() can fail to find the task and return NULL;
put_task_struct() will then bomb the kernel with the following oops:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
IP: [] proc_pid_permission+0x64/0xe0
PGD 112075067 PUD 112814067 PMD 0
Oops: 0002 [#1] PREEMPT SMP

This is a regression introduced by commit 0499680a ("procfs: add hidepid=
and gid= mount options"). The kernel should return -ESRCH if
get_proc_task() failed.

Signed-off-by: Xiaotian Feng
Cc: Al Viro
Cc: Vasiliy Kulikov
Cc: Stephen Wilson
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xiaotian Feng
2012-01-13 12:13:02 +0800

90ab5ee94 module_param: make bool parameters really bool (drivers & misc)

module_param(bool) used to counter-intuitively take an int. In
fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy
trick.

It's time to remove the int/unsigned int option. For this version
it'll simply give a warning, but it'll break next kernel version.

Acked-by: Mauro Carvalho Chehab
Signed-off-by: Rusty Russell

Rusty Russell
2012-01-13 07:02:20 +0800

69116f279 module_param: avoid bool abuse, add bint for special cases.

For historical reasons, we allow module_param(bool) to take an int (or
an unsigned int). That's going away.

A few drivers really want an int: they set it to -1 and a parameter
will set it to 0 or 1. This sucks: reading them from sysfs will give
'Y' for both -1 and 1, but if we change it to an int, then the users
might be broken (if they did "param" instead of "param=1").

Use a new 'bint' parser for them.

(ntfs has a different problem: it needs an int for debug_msgs because
it's also exposed via sysctl.)
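
A rough userspace sketch of what a 'bint' style parser has to do: accept the usual boolean spellings, treat a bare "param" as set, and store the result in an int so that a driver's -1 default stays representable until the parameter is given. parse_bint() is a hypothetical function and the accepted spellings are an assumption for illustration, not the kernel's exact rules.

```c
#include <stddef.h>

/*
 * Parse a boolean-like parameter value into an int.
 * val == NULL stands for a bare "param" with no "=value" and means "set".
 * Returns 0 on success, -1 if the value is not recognised (out untouched).
 */
static int parse_bint(const char *val, int *out)
{
	if (val == NULL)
		val = "1";

	switch (val[0]) {
	case 'y': case 'Y': case '1':
		*out = 1;
		return 0;
	case 'n': case 'N': case '0':
		*out = 0;
		return 0;
	default:
		return -1;
	}
}
```

A driver variable initialised to -1 therefore keeps that value when the parameter is absent, and becomes 0 or 1 once the user supplies it.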

Cc: Steve Glendinning
Cc: Jean Delvare
Cc: Guenter Roeck
Cc: Hoang-Nam Nguyen
Cc: Christoph Raisch
Cc: Roland Dreier
Cc: Sean Hefty
Cc: Hal Rosenstock
Cc: linux390@de.ibm.com
Cc: Anton Altaparmakov
Cc: Jaroslav Kysela
Cc: Takashi Iwai
Cc: lm-sensors@lm-sensors.org
Cc: linux-rdma@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-ntfs-dev@lists.sourceforge.net
Cc: alsa-devel@alsa-project.org
Acked-by: Takashi Iwai (For the sound part)
Acked-by: Guenter Roeck (For the hwmon driver)
Signed-off-by: Rusty Russell

Rusty Russell
2012-01-13 07:02:17 +0800

7c5465d6c pnfsblock: alloc short extent before submit bio

As discussed earlier, it is better for block client to allocate memory for
tracking extents state before submitting bio. So the patch does it by allocating
a short_extent for every INVALID extent touched by write pagelist and for
every zeroing page we created, saving them in layout header. Then in end_io we
can just use them to create commit list items and avoid memory allocation there.

Signed-off-by: Peng Tao
Signed-off-by: Benny Halevy
Signed-off-by: Trond Myklebust

Peng Tao
2012-01-13 05:52:10 +0800

c0411a94a pnfsblock: remove rpc_call_ops from struct parallel_io

block layout can just make use of generic read/write_done.

Signed-off-by: Peng Tao
Signed-off-by: Benny Halevy
Signed-off-by: Trond Myklebust

Peng Tao
2012-01-13 05:52:10 +0800

72c508879 pnfsblock: move find lock page logic out of bl_write_pagelist

Also avoid unnecessary lock_page if page is handled by others.

Signed-off-by: Peng Tao
Signed-off-by: Benny Halevy
Signed-off-by: Trond Myklebust

Peng Tao
2012-01-13 05:52:10 +0800

60c52e3a7 pnfsblock: cleanup bl_mark_sectors_init

It does not need to manipulate partially initialized blocks.
Writeback code takes care of it.

Signed-off-by: Peng Tao
Signed-off-by: Benny Halevy
Signed-off-by: Trond Myklebust

Peng Tao
2012-01-13 05:52:09 +0800

74a6eeb44 pnfsblock: limit bio page count

One bio can have at most BIO_MAX_PAGES pages. We should limit it because otherwise
bio_alloc will fail when there are many pages in one read/write_pagelist.

Cc: #3.1+
Signed-off-by: Peng Tao
Signed-off-by: Benny Halevy
Signed-off-by: Trond Myklebust

Peng Tao
2012-01-13 05:39:05 +0800

93a3844ee pnfsblock: don't spinlock when freeing block_dev

bl_free_block_dev() may sleep. We can not call it with spinlock held.
Besides, there is no need to take bm_lock as we are last user freeing bm_devlist.

Cc: #3.1+
Signed-off-by: Peng Tao
Signed-off-by: Benny Halevy
Signed-off-by: Trond Myklebust

Peng Tao
2012-01-13 05:39:04 +0800

57582b372 pnfsblock: clean up _add_entry

It is wrong to kmalloc in _add_entry() as it runs inside a
spinlock. Memory should already be allocated before _add_entry() is called.

Signed-off-by: Peng Tao
Signed-off-by: Benny Halevy
Signed-off-by: Trond Myklebust

Peng Tao
2012-01-13 05:38:55 +0800

82b906d65 pnfsblock: set read/write tk_status to pnfs_error

To pass the IO status to upper layer.

Signed-off-by: Peng Tao
Signed-off-by: Benny Halevy
Signed-off-by: Trond Myklebust

Peng Tao
2012-01-13 05:38:51 +0800

39e567ae3 pnfsblock: acquire im_lock in _preload_range

When calling _add_entry, we should take the im_lock to protect
against other modifiers.

Cc: #3.1+
Signed-off-by: Peng Tao
Signed-off-by: Benny Halevy
Signed-off-by: Trond Myklebust

Peng Tao
2012-01-13 05:38:49 +0800

de040becc NFS4: fix compile warnings in nfs4proc.c

Compiling the nfs-for-3.3 branch shows the following warnings. Fix them here.

fs/nfs/nfs4proc.c: In function ‘__nfs4_get_acl_uncached’:
fs/nfs/nfs4proc.c:3589: warning: format ‘%ld’ expects type ‘long int’, but argument 4 has type ‘size_t’
fs/nfs/nfs4proc.c:3589: warning: format ‘%ld’ expects type ‘long int’, but argument 6 has type ‘size_t’
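
Warnings of this kind usually go away when size_t values are printed with the 'z' length modifier instead of '%ld'. The snippet below is a generic illustration in standard C, with a hypothetical format_sizes() helper and made-up field names; it is not the line that was changed in nfs4proc.c.

```c
#include <stdio.h>
#include <stddef.h>

/* Format two size_t values using %zu, which matches size_t on every ABI. */
static int format_sizes(char *buf, size_t buflen, size_t acl_len, size_t res_len)
{
	return snprintf(buf, buflen, "acl_len=%zu res_len=%zu", acl_len, res_len);
}
```

Using '%zu' avoids casts and stays correct whether size_t is 32 or 64 bits wide.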

Signed-off-by: Peng Tao
Signed-off-by: Trond Myklebust

Peng Tao
2012-01-13 05:31:51 +0800

363e0df05 nfs: check for integer overflow in decode_devicenotify_args()

On 32 bit, if n is too large then "n * sizeof(*args->devs)" could
overflow and args->devs would be smaller than expected.
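
The wraparound can be reproduced in a portable way by doing the multiplication in 32-bit unsigned arithmetic. checked_mul_u32() below is a hypothetical helper that shows one common form of the guard; the element size of 8 in the usage note is made up for illustration and is not the real sizeof(*args->devs).

```c
#include <stdint.h>

/*
 * Multiply two 32-bit sizes.
 * Returns 0 and stores the product in *out, or -1 if it would overflow.
 */
static int checked_mul_u32(uint32_t n, uint32_t size, uint32_t *out)
{
	if (size != 0 && n > UINT32_MAX / size)
		return -1;
	*out = n * size;
	return 0;
}
```

With n = 0x20000000 and an element size of 8, the unchecked 32-bit product wraps to 0, so an allocation sized by it would be far too small; the checked version rejects the request instead.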

Signed-off-by: Dan Carpenter
Signed-off-by: Trond Myklebust

Dan Carpenter
2012-01-13 05:30:07 +0800