13 Mar, 2012

4 commits

  • commit 5bccda0ebc7c0331b81ac47d39e4b920b198b2cd upstream.

    The cifs code will attempt to open files on lookup under certain
    circumstances. What happens though if we find that the file we opened
    was actually a FIFO or other special file?

    Currently, the open filehandle just ends up being leaked, leading to
    a dentry refcount mismatch and an oops on umount. Fix this by having the
    code close the filehandle on the server if it turns out not to be a
    regular file. While we're at it, change this spaghetti if statement
    into a switch too.
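
    A minimal sketch of the not-a-regular-file case (variable names here
    are illustrative; CIFSSMBClose is the existing close call):

    if (newinode && !S_ISREG(newinode->i_mode)) {
            /* The server let us open a FIFO or other special file, but
             * the client can't use that handle; close it on the server
             * so the reference isn't leaked. */
            CIFSSMBClose(xid, tcon, fileid);
    }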

    Reported-by: CAI Qian
    Tested-by: CAI Qian
    Reviewed-by: Shirish Pargaonkar
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 880641bb9da2473e9ecf6c708d993b29928c1b3c upstream.

    Bart Van Assche reported a hung fio process when either hot-removing
    storage or when interrupting the fio process itself. The (pruned) call
    trace for the latter looks like so:

    fio D 0000000000000001 0 6849 6848 0x00000004
    ffff880092541b88 0000000000000046 ffff880000000000 ffff88012fa11dc0
    ffff88012404be70 ffff880092541fd8 ffff880092541fd8 ffff880092541fd8
    ffff880128b894d0 ffff88012404be70 ffff880092541b88 000000018106f24d
    Call Trace:
    schedule+0x3f/0x60
    io_schedule+0x8f/0xd0
    wait_for_all_aios+0xc0/0x100
    exit_aio+0x55/0xc0
    mmput+0x2d/0x110
    exit_mm+0x10d/0x130
    do_exit+0x671/0x860
    do_group_exit+0x44/0xb0
    get_signal_to_deliver+0x218/0x5a0
    do_signal+0x65/0x700
    do_notify_resume+0x65/0x80
    int_signal+0x12/0x17

    The problem lies with the allocation batching code. It will
    opportunistically allocate kiocbs, and then trim back the list of iocbs
    when there is not enough room in the completion ring to hold all of the
    events.

    In the case above, what happens is that the pruning back of events ends
    up freeing the last active request; the context is marked as dead, so
    the pruner is then responsible for waking up waiters. Unfortunately, the
    code does not check for this condition, so we end up with a hung task.
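
    A sketch of the missing check in the batch-pruning path (field names
    assumed, modeled on the description above):

    /* After trimming the batch: if we just freed the last active request
     * of a dead context, we inherited the duty to wake anyone sleeping
     * in wait_for_all_aios(). */
    if (!ctx->reqs_active && ctx->dead && waitqueue_active(&ctx->wait))
            wake_up_all(&ctx->wait);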

    Signed-off-by: Jeff Moyer
    Reported-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jeff Moyer
     
  • commit c8e252586f8d5de906385d8cf6385fee289a825e upstream.

    The regset common infrastructure assumed that regsets would always
    have .get and .set methods, but not necessarily .active methods.
    Unfortunately people have since written regsets without .set methods.

    Rather than putting in stub functions everywhere, handle regsets with
    null .get or .set methods explicitly.
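
    A sketch of the explicit handling (hypothetical call site; the hook
    signature and error code are assumptions):

    /* check that the hook exists before invoking it */
    if (!regset->get)
            return -EOPNOTSUPP;
    return regset->get(task, regset, 0, size, kbuf, ubuf);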

    Signed-off-by: H. Peter Anvin
    Reviewed-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    H. Peter Anvin
     
  • commit a32744d4abae24572eff7269bc17895c41bd0085 upstream.

    When the autofs protocol version 5 packet type was added in commit
    5c0a32fc2cd0 ("autofs4: add new packet type for v5 communications"), it
    obviously tried quite hard to be word-size agnostic, and uses explicitly
    sized fields that are all correctly aligned.

    However, with the final "char name[NAME_MAX+1]" array at the end, the
    actual size of the structure ends up being not very well defined:
    because the struct isn't marked 'packed', doing a "sizeof()" on it will
    align the size of the struct up to the biggest alignment of the members
    it has.

    And despite all the members being the same, the alignment of them is
    different: a "__u64" has 4-byte alignment on x86-32, but native 8-byte
    alignment on x86-64. And while 'NAME_MAX+1' ends up being a nice round
    number (256), the name[] array starts out only 4-byte aligned.

    End result: the "packed" size of the structure is 300 bytes: 4-byte
    aligned, but not 8-byte aligned.

    As a result, despite all the fields being in the same place on all
    architectures, sizeof() will round up that size to 304 bytes on
    architectures that have 8-byte alignment for u64.

    Note that this is *not* a problem for 32-bit compat mode on POWER, since
    there __u64 is 8-byte aligned even in 32-bit mode. But on x86, 32-bit
    and 64-bit alignment is different for 64-bit entities, and as a result
    the structure that has exactly the same layout has different sizes.

    So on x86-64, but no other architecture, we will just subtract 4 from
    the size of the structure when running in a compat task. That way we
    will write the properly sized packet that user mode expects.

    Not pretty. Sadly, this very subtle, and unnecessary, size difference
    has been encoded in user space that wants to read packets of *exactly*
    the right size, and will refuse to touch anything else.
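
    A simplified stand-in (not the real autofs_v5_packet layout)
    reproduces the 300-vs-304 arithmetic:

    struct demo_packet {                /* same field offsets everywhere */
            unsigned long long token;   /* __u64: 4-byte aligned on x86-32,
                                           8-byte aligned on x86-64 */
            unsigned int words[9];      /* 36 bytes of 32-bit fields */
            char name[256];             /* NAME_MAX+1, ends at offset 300 */
    };
    /* sizeof() rounds 300 up to the struct's alignment:
     *   x86-32: alignof(__u64) == 4  ->  sizeof == 300
     *   x86-64: alignof(__u64) == 8  ->  sizeof == 304
     * hence the compat fix: send sizeof() - 4 bytes on x86-64. */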

    Reported-and-tested-by: Thomas Meyer
    Signed-off-by: Ian Kent
    Signed-off-by: Linus Torvalds
    Cc: Jonathan Nieder
    Signed-off-by: Greg Kroah-Hartman

    Ian Kent
     

01 Mar, 2012

8 commits

  • commit 28d82dc1c4edbc352129f97f4ca22624d1fe61de upstream.

    The current epoll code can be tickled to run basically indefinitely in
    both the loop detection path check (on ep_insert()) and in the wakeup
    paths. The programs that tickle this behavior set up deeply linked
    networks of epoll file descriptors that cause the epoll algorithms to
    traverse them indefinitely. A couple of these sample programs have been
    previously posted in this thread: https://lkml.org/lkml/2011/2/25/297.

    To fix the loop detection path check algorithms, I simply keep track of
    the epoll nodes that have already been visited. Thus, the loop detection
    becomes proportional to the number of epoll file descriptors and links.
    This dramatically decreases the run-time of the loop check algorithm. In
    one diabolical case I tried, it reduced the run-time from 15 minutes (all
    in kernel time) to 0.3 seconds.

    Fixing the wakeup paths could be done at wakeup time in a similar manner
    by keeping track of nodes that have already been visited, but the
    complexity is harder, since there can be multiple wakeups on different
    CPUs. Thus, I've opted to limit the number of possible wakeup paths when
    the paths are created.

    This is accomplished by noting that the endpoint file descriptors found
    during the loop detection pass (from the newly added link) are actually
    the sources for wakeup events. I keep a list of these file
    descriptors and limit the number and length of the paths that emanate
    from these 'source file descriptors'. In the current implementation I
    allow 1000 paths of length 1, 500 of length 2, 100 of length 3, 50 of
    length 4 and 10 of length 5. Note that it is sufficient to check the
    'source file descriptors' reachable from the newly added link, since no
    other 'source file descriptors' will have newly added links. This allows
    us to check only the wakeup paths that may have gotten too long, and not
    re-check all possible wakeup paths on the system.
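
    A sketch of how such per-depth accounting might look (names assumed;
    the limits are the ones quoted above):

    static const int path_limits[5] = { 1000, 500, 100, 50, 10 };
    static int path_count[5];

    static int path_count_inc(int path_length)      /* 1..5 */
    {
            if (++path_count[path_length - 1] > path_limits[path_length - 1])
                    return -1;  /* caller fails the add with -EINVAL */
            return 0;
    }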

    In terms of the path limit selection, I think it's first worth noting
    that the most common case for epoll is probably the model where you have
    one epoll file descriptor that is monitoring n 'source file
    descriptors'. In this case, each 'source file descriptor' has one path
    of length 1. Thus, I believe that the limits I'm proposing are quite
    reasonable and in fact may be too generous. I'm hoping that the
    proposed limits will not cause any workloads that currently work to
    fail.

    In terms of locking, I have extended the use of the 'epmutex' to all
    epoll_ctl add and remove operations. Currently it's only used in a subset
    of the add paths. I need to hold the epmutex so that we can correctly
    traverse a coherent graph to check the number of paths. I believe that
    this additional locking is probably ok, since it's in the setup/teardown
    paths and doesn't affect the running paths, but it certainly is going to
    add some extra overhead. Also worth noting is that the epmutex was
    recently added to the epoll_ctl add operations in the initial path loop
    detection code, using the argument that it was not on a critical path.

    Another thing to note here is the maximum epoll chain length that is allowed.
    Currently, eventpoll.c defines:

    /* Maximum number of nesting allowed inside epoll sets */
    #define EP_MAX_NESTS 4

    This basically means that I am limited to a graph depth of 5 (EP_MAX_NESTS
    + 1). However, this limit is currently only enforced during the loop
    detection check, and only when the epoll file descriptors are added
    in a certain order. Thus, this limit is currently easily bypassed. The
    newly added check for wakeup paths strictly limits the wakeup paths to a
    length of 5, regardless of the order in which ep's are linked together.
    Thus, a side-effect of the new code is a more consistent enforcement of
    the graph depth.

    Thus far, I've tested this using the sample programs previously
    mentioned, which now either return quickly or return -EINVAL. I've also
    tested using the piptest.c epoll tester, which showed no difference in
    performance. I've also created a number of different epoll networks and
    tested that they behave as expected.

    I believe this solves the original diabolical test cases, while still
    preserving the sane epoll nesting.

    Signed-off-by: Jason Baron
    Cc: Nelson Elhage
    Cc: Davide Libenzi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jason Baron
     
  • commit 971316f0503a5c50633d07b83b6db2f15a3a5b00 upstream.

    signalfd_cleanup() ensures that ->signalfd_wqh is not used, but
    this is not enough. eppoll_entry->whead still points to the memory
    we are going to free, so ep_unregister_pollwait()->remove_wait_queue()
    is obviously unsafe.

    Change ep_poll_callback(POLLFREE) to set eppoll_entry->whead = NULL,
    change ep_unregister_pollwait() to check pwq->whead != NULL under
    rcu_read_lock() before remove_wait_queue(). We add the new helper,
    ep_remove_wait_queue(), for this.

    This works because sighand_cachep is SLAB_DESTROY_BY_RCU and because
    ->signalfd_wqh is initialized in sighand_ctor(), not in copy_sighand.
    ep_unregister_pollwait()->remove_wait_queue() can play with already
    freed and potentially reused ->sighand, but this is fine. This memory
    must have a valid ->signalfd_wqh until rcu_read_unlock().
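
    A sketch of the new helper described above (field names assumed):

    static void ep_remove_wait_queue(struct eppoll_entry *pwq)
    {
            wait_queue_head_t *whead;

            rcu_read_lock();
            /* ep_poll_callback(POLLFREE) may clear ->whead under us */
            whead = rcu_dereference(pwq->whead);
            if (whead)
                    remove_wait_queue(whead, &pwq->wait);
            rcu_read_unlock();
    }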

    Reported-by: Maxime Bizon
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit d80e731ecab420ddcb79ee9d0ac427acbc187b4b upstream.

    This patch is intentionally incomplete to simplify the review.
    It ignores ep_unregister_pollwait() which plays with the same wqh.
    See the next change.

    epoll assumes that the EPOLL_CTL_ADD'ed file controls everything
    f_op->poll() needs. In particular it assumes that the wait queue
    can't go away until eventpoll_release(). This is not true in case
    of signalfd, the task which does EPOLL_CTL_ADD uses its ->sighand
    which is not connected to the file.

    This patch adds the special event, POLLFREE, currently only for
    epoll. It expects that init_poll_funcptr()'ed hook should do the
    necessary cleanup. Perhaps it should be defined as EPOLLFREE in
    eventpoll.

    __cleanup_sighand() is changed to do wake_up_poll(POLLFREE) if
    ->signalfd_wqh is not empty; we add the new signalfd_cleanup()
    helper.
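
    A sketch of that helper (assumed shape, following the description):

    void signalfd_cleanup(struct sighand_struct *sighand)
    {
            wait_queue_head_t *wqh = &sighand->signalfd_wqh;

            if (likely(!waitqueue_active(wqh)))
                    return;

            /* POLLFREE tells the epoll hook to disconnect itself;
             * POLLHUP covers ordinary poll/select waiters. */
            wake_up_poll(wqh, POLLHUP | POLLFREE);
    }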

    ep_poll_callback(POLLFREE) simply does list_del_init(task_list).
    This makes the poll entry inconsistent, but we don't care. If you
    share an epoll fd which contains our sigfd with another process, you
    should blame yourself. signalfd is "really special". I simply do
    not know how we can define the "right" semantics if it used with
    epoll.

    The main problem is, epoll calls signalfd_poll() once to establish
    the connection with the wait queue; after that, signalfd_poll(NULL)
    returns different/inconsistent results depending on who does
    EPOLL_CTL_MOD/signalfd_read/etc. IOW: apart from sigmask, signalfd
    has nothing to do with the file, it works with the current thread.

    In short: this patch is the hack which tries to fix the symptoms.
    It also assumes that nobody can take tasklist_lock under epoll
    locks; this seems to be true.

    Note:

    - we do not have wake_up_all_poll() but wake_up_poll()
    is fine, poll/epoll doesn't use WQ_FLAG_EXCLUSIVE.

    - signalfd_cleanup() uses POLLHUP along with POLLFREE,
    we need a couple of simple changes in eventpoll.c to
    make sure it can't be "lost".

    Reported-by: Maxime Bizon
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit abe9a6d57b4544ac208401f9c0a4262814db2be4 upstream.

    server_scope would never be freed if nfs4_check_cl_exchange_flags()
    returned non-zero.

    Signed-off-by: Weston Andros Adamson
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Weston Andros Adamson
     
  • commit b9f9a03150969e4bd9967c20bce67c4de769058f upstream.

    To ensure that we don't just reuse the bad delegation when we attempt to
    recover the nfs4_state that received the bad stateid error.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 331818f1c468a24e581aedcbe52af799366a9dfe upstream.

    Commit bf118a342f10dafe44b14451a1392c3254629a1f (NFSv4: include bitmap
    in nfsv4 get acl data) introduces the 'acl_scratch' page for the case
    where we may need to decode multi-page data. However it fails to take
    into account the fact that the variable may be NULL (for the case where
    we're not doing multi-page decode), and it also attaches it to the
    encoding xdr_stream rather than the decoding one.

    The immediate result is an Oops in nfs4_xdr_enc_getacl due to the
    call to page_address() with a NULL page pointer.

    Signed-off-by: Trond Myklebust
    Cc: Andy Adamson
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit e188dc02d3a9c911be56eca5aa114fe7e9822d53 upstream.

    d_inode_lookup() leaks a dentry reference on IS_DEADDIR().

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit 545d680938be1e86a6c5250701ce9abaf360c495 upstream.

    After passing through a ->setxattr() call, eCryptfs needs to copy the
    inode attributes from the lower inode to the eCryptfs inode, as they
    may have changed in the lower filesystem's ->setxattr() path.

    One example is if an extended attribute containing a POSIX Access
    Control List is being set. The new ACL may cause the lower filesystem to
    modify the mode of the lower inode and the eCryptfs inode would need to
    be updated to reflect the new mode.
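
    A sketch of the fix (fsstack_copy_attr_all is the usual stacking
    helper; its use here is an assumption):

    rc = vfs_setxattr(lower_dentry, name, value, size, flags);
    if (!rc)
            /* pick up any mode change the lower ->setxattr() made,
             * e.g. from a POSIX ACL */
            fsstack_copy_attr_all(dentry->d_inode, lower_dentry->d_inode);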

    https://launchpad.net/bugs/926292

    Signed-off-by: Tyler Hicks
    Reported-by: Sebastien Bacher
    Cc: John Johansen
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     

21 Feb, 2012

3 commits

  • commit ff4fa4a25a33f92b5653bb43add0c63bea98d464 upstream.

    standard_receive3 will check the validity of the response from the
    server (via checkSMB). It'll pass the result of that check to handle_mid
    which will dequeue it and mark it with a status of
    MID_RESPONSE_MALFORMED if checkSMB returned an error. At that point,
    standard_receive3 will also return an error, which will make the
    demultiplex thread skip doing the callback for the mid.

    This is wrong -- if we were able to identify the request and the
    response is marked malformed, then we want the demultiplex thread to do
    the callback. Fix this by making standard_receive3 return 0 in this
    situation.

    Reported-and-Tested-by: Mark Moseley
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 8b0192a5f478da1c1ae906bf3ffff53f26204f56 upstream.

    Currently, it's always set to 0 (no oplock requested).

    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit 15eb77a07c714ac80201abd0a9568888bcee6276 upstream.

    bdi_prune_sb() resets sb->s_bdi to default_backing_dev_info when
    tearing down the original bdi. Fix trace_writeback_single_inode to
    use sb->s_bdi=default_backing_dev_info rather than bdi->dev=NULL for a
    torn-down bdi.

    Reported-by: Rabin Vincent
    Tested-by: Rabin Vincent
    Signed-off-by: Wu Fengguang
    Signed-off-by: Greg Kroah-Hartman

    Wu Fengguang
     

14 Feb, 2012

6 commits

  • commit de47a4176c532ef5961b8a46a2d541a3517412d3 upstream.

    For null user mounts, do not invoke string length function
    during session setup.
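
    A sketch of the guard (the field name ses->user_name is an
    assumption):

    /* a NULL user_name means an anonymous ("null user") session */
    if (ses->user_name)
            len = strlen(ses->user_name);
    else
            len = 0;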

    Reported-and-Tested-by: Chris Clayton
    Acked-by: Jeff Layton
    Signed-off-by: Shirish Pargaonkar
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Shirish Pargaonkar
     
  • commit 684a3ff7e69acc7c678d1a1394fe9e757993fd34 upstream.

    ecryptfs_write() can enter an infinite loop when truncating a file to a
    size larger than 4G. This only happens on architectures where size_t is
    represented by 32 bits.

    This was caused by a size_t overflow due to it incorrectly being used to
    store the result of a calculation which uses potentially large values of
    type loff_t.
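
    A userspace illustration of this overflow class (hypothetical, not
    the eCryptfs code):

    #include <stdio.h>

    int main(void)
    {
            long long target = 5LL << 30;   /* truncate target: 5 GiB */
            long long pos = 0;
            /* plays the role of a 32-bit size_t */
            unsigned int remaining = (unsigned int)(target - pos);

            /* prints 1073741824, not 5368709120: the loop's notion of
             * "bytes left" wrapped, so it can never reach the target */
            printf("remaining = %u (expected %lld)\n",
                   remaining, target - pos);
            return 0;
    }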

    [tyhicks@canonical.com: rewrite subject and commit message]
    Signed-off-by: Li Wang
    Signed-off-by: Yunchuan Wen
    Reviewed-by: Cong Wang
    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Li Wang
     
  • commit 853a0c25baf96b028de1654bea1e0c8857eadf3d upstream.

    When we hit EIO while writing LVID, the buffer uptodate bit is cleared.
    This then results in an annoying warning from mark_buffer_dirty() when we
    write the buffer again. So just set the uptodate flag unconditionally.
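
    The fix is essentially a one-liner before the buffer is redirtied
    (sketch):

    set_buffer_uptodate(bh);    /* EIO may have cleared it */
    mark_buffer_dirty(bh);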

    Reviewed-by: Namjae Jeon
    Signed-off-by: Jan Kara
    Cc: Dave Jones
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 6d08f2c7139790c268820a2e590795cb8333181a upstream.

    Once /proc/pid/mem is opened, the memory can't be released until
    mem_release() even if its owner exits.

    Change mem_open() to do atomic_inc(mm_count) + mmput(); this only
    pins the mm_struct. Change mem_rw() to do atomic_inc_not_zero(mm_users)
    before access_remote_vm(); this verifies that the mm is still alive.
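
    A sketch of the two halves (assuming the mm is stashed in
    file->private_data):

    /* mem_open(): keep the mm_struct allocated, but not its memory */
    atomic_inc(&mm->mm_count);      /* pin the struct itself */
    mmput(mm);                      /* drop the address space */
    file->private_data = mm;

    /* mem_rw(): revalidate before touching the address space */
    if (!atomic_inc_not_zero(&mm->mm_users))
            return 0;               /* owner exited: treat as mm == NULL */
    copied = access_remote_vm(mm, addr, page, this_len, write);
    mmput(mm);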

    I am not sure what mem_rw() should return if atomic_inc_not_zero()
    fails. With this patch it returns zero to match the "mm == NULL" case;
    maybe it should return -EINVAL like it did before e268337d.

    Perhaps it makes sense to add the additional fatal_signal_pending()
    check into the main loop, to ensure we do not hold this memory if
    the target task was oom-killed.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit 572d34b946bae070debd42db1143034d9687e13f upstream.

    No functional changes, cleanup and preparation.

    mem_read() and mem_write() are very similar. Move this code into the
    new common helper, mem_rw(), which takes the additional "int write"
    argument.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit 71879d3cb3dd8f2dfdefb252775c1b3ea04a3dd4 upstream.

    mem_release() can hit mm == NULL, add the necessary check.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     

04 Feb, 2012

8 commits

  • commit ce597919361dcec97341151690e780eade2a9cf4 upstream.

    Recently an OOPS was observed from the usb serial io_ti driver when it tried to remove
    sysfs directories. Upon investigation it turns out this driver was always buggy
    and that a recent sysfs change had stopped guarding itself against removing attributes
    from sysfs directories that had already been removed. :(

    Historically we have been silent about attempting to remove files from nonexistent sysfs
    directories and have politely returned error codes. That has resulted in people writing
    broken code that ignores the error codes.

    Issue a kernel WARNING and a stack backtrace to make it clear in no uncertain
    terms that abusing sysfs is not ok, and the callers need to fix their code.

    This change transforms the io_ti OOPS into a more comprehensible error message
    and stack backtrace.

    Signed-off-by: Eric W. Biederman
    Reported-by: Wolfgang Frisch
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 353b67d8ced4dc53281c88150ad295e24bc4b4c5 upstream.

    When we reach cleanup_journal_tail(), there is no guarantee that
    checkpointed buffers are on stable storage - especially if buffers were
    written out by log_do_checkpoint(), they are likely to be only in the
    disk's caches. Thus when we update the journal superblock, effectively
    removing the old transaction from the journal, this write of the
    superblock can get to stable storage before those checkpointed buffers,
    which can result in filesystem corruption after a crash.

    A similar problem can happen if we replay the journal and wipe it before
    flushing disk's caches.

    Thus we must unconditionally issue a cache flush before we update journal
    superblock in these cases. The fix is slightly complicated by the fact that we
    have to get the log tail before we issue the cache flush, but we can store it in
    the journal superblock only after the cache flush. Otherwise we risk races where
    the new tail is written before the corresponding cache flush is finished.
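
    The resulting ordering, sketched with assumed helper names:

    tail = journal_get_log_tail(journal);       /* 1. sample the new tail */
    blkdev_issue_flush(journal->j_fs_dev, GFP_KERNEL, NULL);
                                                /* 2. checkpointed data out */
    journal_update_sb_log_tail(journal, tail);  /* 3. only now publish it */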

    I managed to reproduce the corruption using somewhat tweaked Chris Mason's
    barrier-test scheduler. This should also fix occasional reports of 'Bit already
    freed' filesystem errors, which are totally unreproducible, but inspection of
    several fs images I've gathered over time points to a problem like this.

    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 9b025eb3a89e041bab6698e3858706be2385d692 upstream.

    Commit b52a360b forgot to call xfs_iunlock() when it detected a corrupted
    symlink and bailed out. Fix it by jumping to 'out' instead of returning directly.

    CC: Carlos Maiolino
    Signed-off-by: Jan Kara
    Reviewed-by: Alex Elder
    Reviewed-by: Dave Chinner
    Signed-off-by: Ben Myers
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 58ded24f0fcb85bddb665baba75892f6ad0f4b8a upstream.

    If pages passed to the eCryptfs extent-based crypto functions are not
    mapped and the module parameter ecryptfs_verbosity=1 was specified at
    loading time, a NULL pointer dereference will occur.

    Note that this wouldn't happen on a production system, as you wouldn't
    pass ecryptfs_verbosity=1 on a production system. It leaks private
    information to the system logs and is for debugging only.

    The debugging info printed in these messages is no longer very useful
    and rather than doing a kmap() in these debugging paths, it will be
    better to simply remove the debugging paths completely.

    https://launchpad.net/bugs/913651

    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit a261a03904849c3df50bd0300efb7fb3f865137d upstream.

    Most filesystems call inode_change_ok() very early in ->setattr(), but
    eCryptfs didn't call it at all. It allowed the lower filesystem to make
    the call in its ->setattr() function. Then, eCryptfs would copy the
    appropriate inode attributes from the lower inode to the eCryptfs inode.

    This patch changes that and actually calls inode_change_ok() on the
    eCryptfs inode, fairly early in ecryptfs_setattr(). Ideally, the call
    would happen earlier in ecryptfs_setattr(), but there are some possible
    inode initialization steps that must happen first.

    Since the call was already being made on the lower inode, the change in
    functionality should be minimal, except for the case of a file extending
    truncate call. In that case, inode_newsize_ok() was never being
    called on the eCryptfs inode. Rather than inode_newsize_ok() catching
    maximum file size errors early on, eCryptfs would encrypt zeroed pages
    and write them to the lower filesystem until the lower filesystem's
    write path caught the error in generic_write_checks(). This patch
    introduces a new function, called ecryptfs_inode_newsize_ok(), which
    checks if the new lower file size is within the appropriate limits when
    the truncate operation will be growing the lower file.

    In summary this change prevents eCryptfs truncate operations (and the
    resulting page encryptions), which would exceed the lower filesystem
    limits or FSIZE rlimits, from ever starting.
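
    A sketch of the new check (upper_size_to_lower_size is eCryptfs's
    existing size-conversion helper; the rest of the shape is assumed):

    int ecryptfs_inode_newsize_ok(struct inode *inode, loff_t offset)
    {
            struct ecryptfs_crypt_stat *crypt_stat =
                    &ecryptfs_inode_to_private(inode)->crypt_stat;
            loff_t lower_oldsize, lower_newsize;

            lower_oldsize = upper_size_to_lower_size(crypt_stat,
                                                     i_size_read(inode));
            lower_newsize = upper_size_to_lower_size(crypt_stat, offset);
            if (lower_newsize > lower_oldsize)
                    /* growing: enforce limits before encrypting pages */
                    return inode_newsize_ok(inode, lower_newsize);
            return 0;
    }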

    Signed-off-by: Tyler Hicks
    Reviewed-by: Li Wang
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit 5e6f0d769017cc49207ef56996e42363ec26c1f0 upstream.

    ecryptfs_write() handles the truncation of eCryptfs inodes. It grabs a
    page, zeroes out the appropriate portions, and then encrypts the page
    before writing it to the lower filesystem. It was unkillable and due to
    the lack of sparse file support could result in tying up a large portion
    of system resources, while encrypting pages of zeros, with no way for
    the truncate operation to be stopped from userspace.

    This patch adds the ability for ecryptfs_write() to detect a pending
    fatal signal and return as gracefully as possible. The intent is to
    leave the lower file in a useable state, while still allowing a user to
    break out of the encryption loop. If a pending fatal signal is detected,
    the eCryptfs inode size is updated to reflect the modified inode size
    and then -EINTR is returned.
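
    A sketch of the loop-exit check (the surrounding loop shape is
    assumed):

    while (pos < new_end_pos) {
            if (fatal_signal_pending(current)) {
                    /* record how far we got, then bail out */
                    i_size_write(ecryptfs_inode, pos);
                    rc = -EINTR;
                    break;
            }
            /* zero, encrypt and write the next page ... */
    }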

    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     
  • commit 30373dc0c87ffef68d5628e77d56ffb1fa22e1ee upstream.

    Print the inode on metadata read failure. The only real
    way of dealing with metadata read failures is to delete
    the underlying file system file. Having the inode
    allows one to 'find . -inum INODE'.

    [tyhicks@canonical.com: Removed some minor not-for-stable parts]
    Signed-off-by: Tim Gardner
    Reviewed-by: Kees Cook
    Signed-off-by: Tyler Hicks
    Signed-off-by: Greg Kroah-Hartman

    Tim Gardner
     
  • commit db10e556518eb9d21ee92ff944530d84349684f4 upstream.

    A malicious count value specified when writing to /dev/ecryptfs may
    result in a very large kernel memory allocation.

    This patch peeks at the specified packet payload size, adds that to the
    size of the packet headers and compares the result with the write count
    value. The resulting maximum memory allocation size is approximately 532
    bytes.

    Signed-off-by: Tyler Hicks
    Reported-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Tyler Hicks
     

26 Jan, 2012

11 commits

  • commit 85e72aa5384b1a614563ad63257ded0e91d1a620 upstream.

    /proc/pid/clear_refs is used to clear the Referenced and YOUNG bits for
    pages and corresponding page table entries of the task with PID pid, which
    includes any special mappings inserted into the page tables in order to
    provide things like vDSOs and user helper functions.

    On ARM this causes a problem because the vectors page is mapped as a
    global mapping and since ec706dab ("ARM: add a vma entry for the user
    accessible vector page"), a VMA is also inserted into each task for this
    page to aid unwinding through signals and syscall restarts. Since the
    vectors page is required for handling faults, clearing the YOUNG bit (and
    subsequently writing a faulting pte) means that we lose the vectors page
    *globally* and cannot fault it back in. This results in a system deadlock
    on the next exception.

    To see this problem in action, just run:

    $ echo 1 > /proc/self/clear_refs

    on an ARM platform (as any user) and watch your system hang. I think this
    has been the case since 2.6.37.

    This patch avoids clearing the aforementioned bits for reserved pages,
    therefore leaving the vectors page intact on ARM. Since reserved pages
    are not candidates for swap, this change should not have any impact on the
    usefulness of clear_refs.

    Signed-off-by: Will Deacon
    Reported-by: Moussa Ba
    Acked-by: Hugh Dickins
    Cc: David Rientjes
    Cc: Russell King
    Acked-by: Nicolas Pitre
    Cc: Matt Mackall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     
  • commit ce91acb3acae26f4163c5a6f1f695d1a1e8d9009 upstream.

    We've had some reports of servers (namely, the Solaris in-kernel CIFS
    server) that don't deal properly with writes that are "too large" even
    though they set CAP_LARGE_WRITE_ANDX. Change the default to better
    mirror what windows clients do.

    Cc: Pavel Shilovsky
    Reported-by: Nick Davis
    Signed-off-by: Jeff Layton
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Jeff Layton
     
  • commit b1c770c273a4787069306fc82aab245e9ac72e9d upstream.

    When finding the longest extent in an AG, we read the value directly
    out of the AGF buffer without endian conversion. This will give an
    incorrect length, resulting in FITRIM operations potentially not
    trimming everything that they should.
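
    The essence of the fix, sketched (agf_longest is the on-disk
    big-endian AGF field):

    /* wrong: raw big-endian value used as if it were CPU-endian */
    longest = agf->agf_longest;
    /* right: convert from the on-disk format first */
    longest = be32_to_cpu(agf->agf_longest);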

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ben Myers
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit e268337dfe26dfc7efd422a804dbb27977a3cccc upstream.

    Jüri Aedla reported that the /proc/<pid>/mem handling really isn't very
    robust, and it also doesn't match the permission checking of any of the
    other related files.

    This changes it to do the permission checks at open time, and instead of
    tracking the process, it tracks the VM at the time of the open. That
    simplifies the code a lot, but does mean that if you hold the file
    descriptor open over an execve(), you'll continue to read from the _old_
    VM.

    That is different from our previous behavior, but much simpler. If
    somebody actually finds a load where this matters, we'll need to revert
    this commit.

    I suspect that nobody will ever notice - because the process mapping
    addresses will also have changed as part of the execve. So you cannot
    actually usefully access the fd across a VM change simply because all
    the offsets for IO would have changed too.

    Reported-by: Jüri Aedla
    Cc: Al Viro
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit c3e0ef9a298e028a82ada28101ccd5cf64d209ee upstream.

    For 32-bit architectures using standard jiffies the idletime calculation
    in uptime_proc_show will quickly overflow. It takes (2^32 / HZ) seconds
    of idle-time, or e.g. 12.45 days with no load on a quad-core with HZ=1000.
    Switch to 64-bit calculations.
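
    The arithmetic: with HZ=1000, a 32-bit sum of jiffies wraps after
    2^32 / 1000 ≈ 4,294,967 seconds ≈ 49.7 days. /proc/uptime adds up the
    idle time of every CPU, so a fully idle quad-core accumulates four
    idle ticks per jiffy and wraps in roughly 49.7 / 4 ≈ 12.4 days,
    matching the figure above.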

    Cc: Michael Abbott
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Greg Kroah-Hartman

    Martin Schwidefsky
     
  • commit 74a6eeb44ca6174d9cc93b9b8b4d58211c57bc80 upstream.

    One bio can have at most BIO_MAX_PAGES pages. We should limit it because otherwise
    bio_alloc will fail when there are many pages in one read/write_pagelist.
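
    A sketch of the clamp (variable names assumed):

    npg = min_t(int, npg, BIO_MAX_PAGES);   /* bio_alloc() fails above this */
    bio = bio_alloc(GFP_NOIO, npg);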

    Signed-off-by: Peng Tao
    Signed-off-by: Benny Halevy
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit 93a3844ee0f843b05a1df4b52e1a19ff26b98d24 upstream.

    bl_free_block_dev() may sleep. We can not call it with a spinlock held.
    Besides, there is no need to take bm_lock, as we are the last user freeing bm_devlist.

    Signed-off-by: Peng Tao
    Signed-off-by: Benny Halevy
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit 39e567ae36fe03c2b446e1b83ee3d39bea08f90b upstream.

    When calling _add_entry, we should take the im_lock to protect
    against other modifiers.

    Signed-off-by: Peng Tao
    Signed-off-by: Benny Halevy
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Peng Tao
     
  • commit eaf5f9073533cde21c7121c136f1c3f072d9cf59 upstream.

    Two (or more) concurrent calls of shrink_dcache_parent() on the same dentry may
    cause shrink_dcache_parent() to loop forever.

    Here's what appears to happen:

    1 - CPU0: select_parent(P) finds C and puts it on dispose list, returns 1

    2 - CPU1: select_parent(P) locks P->d_lock

    3 - CPU0: shrink_dentry_list() locks C->d_lock
    dentry_kill(C) tries to lock P->d_lock but fails, unlocks C->d_lock

    4 - CPU1: select_parent(P) locks C->d_lock,
    moves C from dispose list being processed on CPU0 to the new
    dispose list, returns 1

    5 - CPU0: shrink_dentry_list() finds dispose list empty, returns

    6 - Goto 2 with CPU0 and CPU1 switched

    Basically select_parent() steals the dentry from shrink_dentry_list() and thinks
    it found a new one, causing shrink_dentry_list() to think it's making progress
    and loop over and over.

    One way to trigger this is to make udev call stat() on a sysfs file while it
    is going away.

    Having a file in /lib/udev/rules.d/ with only this one rule seems to do the trick:

    ATTR{vendor}=="0x8086", ATTR{device}=="0x10ca", ENV{PCI_SLOT_NAME}="%k", ENV{MATCHADDR}="$attr{address}", RUN+="/bin/true"

    Then execute the following loop:

    while true; do
    echo -bond0 > /sys/class/net/bonding_masters
    echo +bond0 > /sys/class/net/bonding_masters
    echo -bond1 > /sys/class/net/bonding_masters
    echo +bond1 > /sys/class/net/bonding_masters
    done

    One fix would be to check all callers and prevent concurrent calls to
    shrink_dcache_parent(). But I think a better solution is to stop the
    stealing behavior.

    This patch adds a new dentry flag that is set when the dentry is added to the
    dispose list. The flag is cleared in dentry_lru_del() in case the dentry gets a
    new reference just before being pruned.

    If the dentry has this flag, select_parent() will skip it and let
    shrink_dentry_list() retry pruning it. With select_parent() skipping those
    dentries there will not be the appearance of progress (new dentries found) when
    there is none, hence shrink_dcache_parent() will not loop forever.
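
    In select_parent(), the skip might look like this (sketch; the flag
    name follows the description above):

    if (dentry->d_flags & DCACHE_SHRINK_LIST)
            continue;   /* already on another CPU's dispose list:
                         * leave it for shrink_dentry_list() to retry */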

    The flag is also set in prune_dcache_sb() for consistency, as suggested
    by Linus.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit b48f03b319ba78f3abf9a7044d1f436d8d90f4f9 upstream.

    select_parent currently abuses the dentry cache LRU to provide
    cleanup features for child dentries that need to be freed. It moves
    them to the tail of the LRU, then tells shrink_dcache_parent() to
    call __shrink_dcache_sb to unconditionally move them to a dispose
    list (as DCACHE_REFERENCED is ignored). __shrink_dcache_sb() has to
    relock the dentries to move them off the LRU onto the dispose list,
    but otherwise does not touch the dentries that select_parent() moved
    to the tail of the LRU. It then passes the dispose list to
    shrink_dentry_list() which tries to free the dentries.

    IOWs, the use of __shrink_dcache_sb() is superfluous - we can build
    exactly the same list of dentries for disposal directly in
    select_parent() and call shrink_dentry_list() instead of calling
    __shrink_dcache_sb() to do that. This means that we avoid long holds
    on the lru lock walking the LRU moving dentries to the dispose list.
    We also avoid the need to relock each dentry just to move it off the
    LRU, reducing the number of times we lock each dentry to dispose of
    them in shrink_dcache_parent() from 3 to 2 times.

    Further, we remove one of the two callers of __shrink_dcache_sb().
    This also means that __shrink_dcache_sb can be moved back into
    prune_dcache_sb() and we no longer have to handle referenced
    dentries conditionally, simplifying the code.

    Signed-off-by: Dave Chinner
    Signed-off-by: Linus Torvalds
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit fed474857efbed79cd390d0aee224231ca718f63 upstream.

    Removing the parent of a watched file results in "kernel BUG at
    fs/notify/mark.c:139".

    To reproduce

    add "-w /tmp/audit/dir/watched_file" to audit.rules
    rm -rf /tmp/audit/dir

    This is caused by fsnotify_destroy_mark() being called without an
    extra reference taken by the caller.

    Reported by Francesco Cosoleto here:

    https://bugzilla.novell.com/show_bug.cgi?id=689860

    Fix by removing the BUG_ON and adding a comment about not accessing mark after
    the iput.

    Signed-off-by: Miklos Szeredi
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi