Eric Lee / smarc-ti-linux-kernel | Embedian Git Server

06 Feb, 2015

3 commits

0c217359a NFSv4.1: Fix an Oops in nfs41_walk_client_list

commit 3175e1dcec40fab1a444c010087f2068b6b04732 upstream.

If we start state recovery on a client that failed to initialise correctly,
then we are very likely to Oops.

Reported-by: "Mkrtchyan, Tigran"
Link: http://lkml.kernel.org/r/130621862.279655.1421851650684.JavaMail.zimbra@desy.de
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2015-02-06 14:36:05 +0800
19387f4d8 nfs: fix dio deadlock when O_DIRECT flag is flipped

commit ee8a1a8b160a87dc3a9c81a86796aa4db85ea815 upstream.

We only support swap file calling nfs_direct_IO. However, application
might be able to get to nfs_direct_IO if it toggles O_DIRECT flag
during IO and it can deadlock because we grab inode->i_mutex in
nfs_file_direct_write(). So return 0 in that case; the generic
layer will then fall back to buffered IO.

Signed-off-by: Peng Tao
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Peng Tao
2015-02-06 14:36:05 +0800
f9525833c udf: Release preallocation on last writeable close

commit b07ef35244424cbeda9844198607c7077099c82c upstream.

Commit 6fb1ca92a640 "udf: Fix race between write(2) and close(2)"
changed the condition when preallocation is released. The idea was that
we don't want to release the preallocation for an inode on close when
there are other writeable file descriptors for the inode. However the
condition was written in the opposite way so we released preallocation
only if there were other writeable file descriptors. Fix the problem by
changing the condition properly.
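
As a rough illustration of the inverted condition described above, here is a minimal sketch. The helper names and the `writers` parameter are hypothetical and are not taken from the UDF code; `writers` is assumed to count the writeable opens of the inode, including the descriptor being closed.

```c
#include <stdbool.h>

/*
 * Hypothetical helpers, not the kernel's code. `writers` is assumed to be
 * the number of writeable opens of the inode, including the one that is
 * being closed right now.
 */

/* Inverted test: releases only while other writers still exist. */
static bool release_prealloc_buggy(int writers)
{
    return writers > 1;
}

/* Intended test: release only on the last writeable close. */
static bool release_prealloc_fixed(int writers)
{
    return writers == 1;
}
```

With two writers open, the inverted test releases the preallocation on the first close and never on the last one, which is the behaviour the message describes.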

Fixes: 6fb1ca92a6409a9d5b0696447cd4997bc9aaf5a2
Reported-by: Fabian Frederick
Signed-off-by: Jan Kara
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2015-02-06 14:36:03 +0800

30 Jan, 2015

1 commit

24f1a316c fix deadlock in cifs_ioctl_clone()

commit 378ff1a53b5724f3ac97b0aba3c9ecac072f6fcd upstream.

It really needs to check that src is non-directory *and* use
{un,}lock_two_nodirectories(). As it is, it's trivial to cause
double-lock (ioctl(fd, CIFS_IOC_COPYCHUNK_FILE, fd)) and if the
last argument is an fd of directory, we are asking for trouble
by violating the locking order - all directories go before all
non-directories. If the last argument is an fd of parent
directory, it has 50% odds of locking child before parent,
which will cause AB-BA deadlock if we race with unlink().
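
The two hazards named above (locking the same object twice, and taking two locks in an inconsistent order) can be sketched in user space as follows. This is only an illustration: `order_pair` is a hypothetical helper, and ordering by address is an assumption of the sketch, not a statement about how {un,}lock_two_nodirectories() is implemented.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decide the order in which two locks should be taken.
 * Returns the number of distinct locks (1 if a and b are the same object,
 * 2 otherwise) and stores them lowest address first, so that every caller
 * agrees on the order and an AB-BA deadlock cannot arise between them.
 */
static int order_pair(void *a, void *b, void **first, void **second)
{
    if (a == b) {
        /* Same object passed twice: lock it once, never twice. */
        *first = a;
        *second = NULL;
        return 1;
    }
    if ((uintptr_t)a > (uintptr_t)b) {
        void *tmp = a;
        a = b;
        b = tmp;
    }
    *first = a;
    *second = b;
    return 2;
}
```

A caller would lock `*first`, then `*second` if it is non-NULL, and unlock in the reverse order. The directory check the message asks for is a separate test that this sketch does not model.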

Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Al Viro
2015-01-30 09:40:45 +0800

28 Jan, 2015

4 commits

e7b203315 LOCKD: Fix a race when initialising nlmsvc_timeout

commit 06bed7d18c2c07b3e3eeadf4bd357f6e806618cc upstream.

This commit fixes a race whereby nlmclnt_init() first starts the lockd
daemon, and then calls nlm_bind_host() with the expectation that
nlmsvc_timeout has already been initialised. Unfortunately, there is no
synchronisation between lockd() and lockd_up() to guarantee that this
is the case.

The fix is to move the initialisation of nlmsvc_timeout into lockd_create_svc.

Fixes: 9a1b6bf818e74 ("LOCKD: Don't call utsname()->nodename...")
Cc: Bruce Fields
Cc: stable@vger.kernel.org # 3.10.x
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2015-01-28 00:29:45 +0800
60ff54016 NFSv4.1: Fix client id trunking on Linux

commit 1fc0703af3143914a389bfa081c7acb09502ed5d upstream.

Currently, our trunking code will check for session trunking, but will
fail to detect client id trunking. This is a problem, because it means
that the client will fail to recognise that the two connections represent
shared state, even if they do not permit a shared session.
By removing the check for the server minor id, and only checking the
major id, we will end up doing the right thing in both cases: we close
down the new nfs_client and fall back to using the existing one.

Fixes: 05f4c350ee02e ("NFS: Discover NFSv4 server trunking when mounting")
Cc: Chuck Lever
Tested-by: Chuck Lever
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2015-01-28 00:29:41 +0800
d5471f896 locks: fix NULL-deref in generic_delete_lease

commit 52d304eb4eaced9ad04b64ba7cd6ceb5153bbf18 upstream.

commit 0efaa7e82f02fe69c05ad28e905f31fc86e6f08e
locks: generic_delete_lease doesn't need a file_lock at all

moves the call to fl->fl_lmops->lm_change() to a place in the
code where fl might be a non-lease lock.
When that happens, fl_lmops is NULL and an Oops ensues.

So add an extra test to restore correct functioning.

Reported-by: Linda Walsh
Link: https://bugzilla.suse.com/show_bug.cgi?id=912569
Fixes: 0efaa7e82f02fe69c05ad28e905f31fc86e6f08e
Signed-off-by: NeilBrown
Signed-off-by: Jeff Layton
Signed-off-by: Greg Kroah-Hartman

NeilBrown
2015-01-28 00:29:40 +0800
a27d8a231 genirq: Prevent proc race against freeing of irq descriptors

commit c291ee622165cb2c8d4e7af63fffd499354a23be upstream.

Since the rework of the sparse interrupt code to actually free the
unused interrupt descriptors there exists a race between the /proc
interfaces to the irq subsystem and the code which frees the interrupt
descriptor.

CPU0                            CPU1
                                show_interrupts()
                                  desc = irq_to_desc(X);
free_desc(desc)
  remove_from_radix_tree();
  kfree(desc);
                                  raw_spinlock_irq(&desc->lock);

/proc/interrupts is the only interface which can actively corrupt
kernel memory via the lock access. /proc/stat can only read from freed
memory. Extremely hard to trigger, but possible.

The interfaces in /proc/irq/N/ are not affected by this because the
removal of the proc file is serialized in procfs against concurrent
readers/writers. The removal happens before the descriptor is freed.

For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
as the descriptor is never freed. It's merely cleared out with the irq
descriptor lock held. So any concurrent proc access will either see
the old correct value or the cleared out ones.

Protect the lookup and access to the irq descriptor in
show_interrupts() with the sparse_irq_lock.

Provide kstat_irqs_usr() which is protecting the lookup and access
with sparse_irq_lock and switch /proc/stat to use it.

Document the existing kstat_irqs interfaces so it's clear that the
caller needs to take care about protection. The users of these
interfaces are either not affected due to SPARSE_IRQ=n or already
protected against removal.

Fixes: 1f5a5b87f78f "genirq: Implement a sane sparse_irq allocator"
Signed-off-by: Thomas Gleixner
Signed-off-by: Greg Kroah-Hartman

Thomas Gleixner
2015-01-28 00:29:37 +0800

16 Jan, 2015

13 commits

30e4fb85e Btrfs: don't delay inode ref updates during log replay

commit 6f8960541b1eb6054a642da48daae2320fddba93 upstream.

Commit 1d52c78afbb (Btrfs: try not to ENOSPC on log replay) added a
check to skip delayed inode updates during log replay because it
confuses the enospc code. But the delayed processing will end up
ignoring delayed refs from log replay because the inode itself wasn't
put through the delayed code.

This can end up triggering a warning at commit time:

WARNING: CPU: 2 PID: 778 at fs/btrfs/delayed-inode.c:1410 btrfs_assert_delayed_root_empty+0x32/0x34()

Which is repeated for each commit because we never process the delayed
inode ref update.

The fix used here is to change btrfs_delayed_delete_inode_ref to return
an error if we're currently in log replay. The caller will do the ref
deletion immediately and everything will work properly.

Signed-off-by: Chris Mason
Signed-off-by: Greg Kroah-Hartman

Chris Mason
2015-01-16 22:59:56 +0800
83d17827a nilfs2: fix the nilfs_iget() vs. nilfs_new_inode() races

commit 705304a863cc41585508c0f476f6d3ec28cf7e00 upstream.

Same story as in commit 41080b5a2401 ("nfsd race fixes: ext2") (similar
ext2 fix) except that nilfs2 needs to use insert_inode_locked4() instead
of insert_inode_locked() and a bug of a check for dead inodes needs to
be fixed.

If nilfs_iget() is called from nfsd after nilfs_new_inode() calls
insert_inode_locked4(), nilfs_iget() will wait for unlock_new_inode() at
the end of nilfs_mkdir()/nilfs_create()/etc to unlock the inode.

If nilfs_iget() is called before nilfs_new_inode() calls
insert_inode_locked4(), it will create an in-core inode and read its
data from the on-disk inode. But, nilfs_iget() will find i_nlink equals
zero and fail at nilfs_read_inode_common(), which will lead it to call
iget_failed() and cleanly fail.

However, this sanity check doesn't work as expected for reused on-disk
inodes because they leave a non-zero value in i_mode field and it
hinders the test of i_nlink. This patch also fixes the issue by
removing the test on i_mode that nilfs2 doesn't need.

Signed-off-by: Ryusuke Konishi
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Ryusuke Konishi
2015-01-16 22:59:54 +0800
d5f902afa ceph: do_sync is never initialized

commit 021b77bee210843bed1ea91b5cad58235ff9c8e5 upstream.

Probably this code was syncing a lot more often than intended because
the do_sync variable wasn't set to zero.

Fixes: c62988ec0910 ('ceph: avoid meaningless calling ceph_caps_revoking if sync_mode == WB_SYNC_ALL.')
Signed-off-by: Dan Carpenter
Signed-off-by: Ilya Dryomov
Signed-off-by: Greg Kroah-Hartman

Dan Carpenter
2015-01-16 22:59:53 +0800
bff467fab nfsd: fix fi_delegees leak when fi_had_conflict returns true

commit 94ae1db226a5bcbb48372d81161f084c9e283fd8 upstream.

Currently, nfs4_set_delegation takes a reference to an existing
delegation and then checks to see if there is a conflict. If there is
one, then it doesn't release that reference.

Change the code to take the reference after the check and only if there
is no conflict.

Signed-off-by: Jeff Layton
Signed-off-by: J. Bruce Fields
Signed-off-by: Greg Kroah-Hartman

Jeff Layton
2015-01-16 22:59:53 +0800
7599f17f0 nfsd4: fix xdr4 count of server in fs_location4

commit bf7491f1be5e125eece2ec67e0f79d513caa6c7e upstream.

Fix a bug where nfsd4_encode_components_esc() incorrectly calculates the
length of server array in fs_location4--note that it is a count of the
number of array elements, not a length in bytes.

Signed-off-by: Benjamin Coddington
Fixes: 082d4bd72a45 (nfsd4: "backfill" using write_bytes_to_xdr_buf)
Signed-off-by: J. Bruce Fields
Signed-off-by: Greg Kroah-Hartman

Benjamin Coddington
2015-01-16 22:59:53 +0800
6a4edf54d nfsd4: fix xdr4 inclusion of escaped char

commit 5a64e56976f1ba98743e1678c0029a98e9034c81 upstream.

Fix a bug where nfsd4_encode_components_esc() includes the esc_end char as
an additional string encoding.

Signed-off-by: Benjamin Coddington
Fixes: e7a0444aef4a "nfsd: add IPv6 addr escaping to fs_location hosts"
Signed-off-by: J. Bruce Fields
Signed-off-by: Greg Kroah-Hartman

Benjamin Coddington
2015-01-16 22:59:53 +0800
fb0a85574 fs: nfsd: Fix signedness bug in compare_blob

commit ef17af2a817db97d42dd2ec0a425231748e23dbc upstream.

Bugs similar to the one in acbbe6fbb240 (kcmp: fix standard comparison
bug) are in rich supply.

In this variant, the problem is that struct xdr_netobj::len has type
unsigned int, so the expression o1->len - o2->len _also_ has type
unsigned int; it has completely well-defined semantics, and the result
is some non-negative integer, which is always representable in a long
long. But this means that if the conditional triggers, we are
guaranteed to return a positive value from compare_blob.

In this case it could be fixed by

- res = o1->len - o2->len;
+ res = (long long)o1->len - (long long)o2->len;

but I'd rather eliminate the usually broken 'return a - b;' idiom.
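
To make the pitfall concrete, here is a small user-space sketch. The `struct blob` type and both function names are hypothetical stand-ins (only an unsigned length field is modelled), and the comparison-based version is one common alternative to 'return a - b;', not necessarily the form the actual patch uses.

```c
/* Hypothetical stand-in for a length-carrying object; only len matters. */
struct blob {
    unsigned int len;
};

/*
 * The broken idiom: the subtraction is done in unsigned int, wraps around
 * when a->len < b->len, and only then is widened to long long. The result
 * is therefore never negative.
 */
static long long cmp_sub(const struct blob *a, const struct blob *b)
{
    long long res = a->len - b->len;
    return res;
}

/* Comparison-based alternative: returns -1, 0 or 1 and cannot wrap. */
static int cmp_safe(const struct blob *a, const struct blob *b)
{
    return (a->len > b->len) - (a->len < b->len);
}
```

For lengths 1 and 2, `cmp_sub` yields a large positive value even though the first length is smaller, while `cmp_safe` yields a negative one.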

Reviewed-by: Jeff Layton
Signed-off-by: Rasmus Villemoes
Signed-off-by: J. Bruce Fields
Signed-off-by: Greg Kroah-Hartman

Rasmus Villemoes
2015-01-16 22:59:53 +0800
66a2cbe36 reiserfs: destroy allocated commit workqueue

commit fa0c5540739320258c3e3a45aaae9dae467b2504 upstream.

When reiserfs is trying to mount a partition, it creates a commit
workqueue (sbi->commit_wq). But when mount fails later, the workqueue
is not freed.

Signed-off-by: Jiri Slaby
Reported-by: auxsvr@gmail.com
Reported-by: Benoît Monin
Cc: Jan Kara
Cc: reiserfs-devel@vger.kernel.org
Fixes: 797d9016ceca69879bb273218810fa0beef46aac
Signed-off-by: Jan Kara
Signed-off-by: Greg Kroah-Hartman

Jiri Slaby
2015-01-16 22:59:53 +0800
02a59e291 writeback: fix a subtle race condition in I_DIRTY clearing

commit 9c6ac78eb3521c5937b2dd8a7d1b300f41092f45 upstream.

After invoking ->dirty_inode(), __mark_inode_dirty() does smp_mb() and
tests inode->i_state locklessly to see whether it already has all the
necessary I_DIRTY bits set. The comment above the barrier doesn't
contain any useful information - memory barriers can't ensure "changes
are seen by all cpus" by itself.

And it sure enough was broken. Please consider the following
scenario.

CPU 0                                   CPU 1
-------------------------------------------------------------------------------

                                        enters __writeback_single_inode()
                                        grabs inode->i_lock
                                        tests PAGECACHE_TAG_DIRTY which is clear
enters __set_page_dirty()
grabs mapping->tree_lock
sets PAGECACHE_TAG_DIRTY
releases mapping->tree_lock
leaves __set_page_dirty()

enters __mark_inode_dirty()
smp_mb()
sees I_DIRTY_PAGES set
leaves __mark_inode_dirty()
                                        clears I_DIRTY_PAGES
                                        releases inode->i_lock

Now @inode has dirty pages w/ I_DIRTY_PAGES clear. This doesn't seem
to lead to an immediately critical problem because requeue_inode()
later checks PAGECACHE_TAG_DIRTY instead of I_DIRTY_PAGES when
deciding whether the inode needs to be requeued for IO and there are
enough unintentional memory barriers in between, so while the inode
ends up with inconsistent I_DIRTY_PAGES flag, it doesn't fall off the
IO list.

The lack of explicit barrier may also theoretically affect the other
I_DIRTY bits which deal with metadata dirtiness. There is no
guarantee that a strong enough barrier exists between
I_DIRTY_[DATA]SYNC clearing and write_inode() writing out the dirtied
inode. Filesystem inode writeout path likely has enough stuff which
can behave as full barrier but it's theoretically possible that the
writeout may not see all the updates from ->dirty_inode().

Fix it by adding an explicit smp_mb() after I_DIRTY clearing. Note
that I_DIRTY_PAGES needs a special treatment as it always needs to be
cleared to be interlocked with the lockless test on
__mark_inode_dirty() side. It's cleared unconditionally and
reinstated after smp_mb() if the mapping still has dirty pages.
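
The pairing described above can be sketched in user space with C11 atomics. This is only a model under stated assumptions: `struct toy_inode`, `mapping_dirty`, `DIRTY_PAGES` and both function names are hypothetical, and `atomic_thread_fence(memory_order_seq_cst)` stands in for smp_mb(). It is not the kernel code.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define DIRTY_PAGES 0x1u   /* hypothetical stand-in for I_DIRTY_PAGES */

/* Toy model: a state word plus a flag playing the role of the page tag. */
struct toy_inode {
    atomic_uint state;
    atomic_bool mapping_dirty;
};

/* Writeback side: clear unconditionally, full barrier, then re-check. */
static void writeback_clear_dirty(struct toy_inode *in)
{
    atomic_fetch_and_explicit(&in->state, ~DIRTY_PAGES, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (atomic_load_explicit(&in->mapping_dirty, memory_order_relaxed))
        atomic_fetch_or_explicit(&in->state, DIRTY_PAGES,
                                 memory_order_relaxed);
}

/* Dirtying side: publish the dirty page, full barrier, then test state. */
static void mark_dirty(struct toy_inode *in)
{
    atomic_store_explicit(&in->mapping_dirty, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    if (!(atomic_load_explicit(&in->state, memory_order_relaxed) & DIRTY_PAGES))
        atomic_fetch_or_explicit(&in->state, DIRTY_PAGES,
                                 memory_order_relaxed);
}
```

Each side stores, fences, then loads the other side's variable, so at least one of them observes the other's store. Either the writeback side sees the dirty mapping and reinstates the bit, or the dirtying side sees the bit cleared and sets it again.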

Also add comments explaining how and why the barriers are paired.

Lightly tested.

Signed-off-by: Tejun Heo
Cc: Jan Kara
Cc: Mikulas Patocka
Cc: Jens Axboe
Cc: Al Viro
Reviewed-by: Jan Kara
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

Tejun Heo
2015-01-16 22:59:52 +0800
c7ba2d794 pstore-ram: Allow optional mapping with pgprot_noncached

commit 027bc8b08242c59e19356b4b2c189f2d849ab660 upstream.

On some ARMs the memory can be mapped pgprot_noncached() and still
be working for atomic operations. As pointed out by Colin Cross,
in some cases you do want to use
pgprot_noncached() if the SoC supports it to see a debug printk
just before a write hanging the system.

On ARMs, the atomic operations on strongly ordered memory are
implementation defined. So let's provide an optional kernel parameter
for configuring pgprot_noncached(), and use pgprot_writecombine() by
default.

Cc: Arnd Bergmann
Cc: Rob Herring
Cc: Randy Dunlap
Cc: Anton Vorontsov
Cc: Colin Cross
Cc: Olof Johansson
Cc: Russell King
Acked-by: Kees Cook
Signed-off-by: Tony Lindgren
Signed-off-by: Tony Luck
Signed-off-by: Greg Kroah-Hartman

Tony Lindgren
2015-01-16 22:59:47 +0800
dedfc0ec1 pstore-ram: Fix hangs by using write-combine mappings

commit 7ae9cb81933515dc7db1aa3c47ef7653717e3090 upstream.

Currently trying to use pstore on at least ARMs can hang as we're
mapping the persistent RAM with pgprot_noncached().

On ARMs, pgprot_noncached() will actually make the memory strongly
ordered, and as the atomic operations pstore uses are implementation
defined for strongly ordered memory, they may not work. So basically
atomic operations have undefined behavior on ARM for device or strongly
ordered memory types.

Let's fix the issue by using write-combine variants for mappings. This
corresponds to normal, non-cacheable memory on ARM. For many other
architectures, this change does not change the mapping type as by
default we have:

#define pgprot_writecombine pgprot_noncached

The reason why pgprot_noncached() was originally used for pstore
is because Colin Cross had observed lost debug prints right before
a device hanging write operation on some systems. For the platforms
supporting pgprot_noncached(), we can add an optional configuration
option to support that. But let's get pstore working first before
adding new features.

Cc: Arnd Bergmann
Cc: Anton Vorontsov
Cc: Colin Cross
Cc: Olof Johansson
Cc: linux-kernel@vger.kernel.org
Acked-by: Kees Cook
Signed-off-by: Rob Herring
[tony@atomide.com: updated description]
Signed-off-by: Tony Lindgren
Signed-off-by: Tony Luck
Signed-off-by: Greg Kroah-Hartman

Rob Herring
2015-01-16 22:59:47 +0800
234b25a3a ocfs2: fix the wrong directory passed to ocfs2_lookup_ino_from_name() when link file

commit 53dc20b9a3d928b0744dad5aee65b610de1cc85d upstream.

In ocfs2_link(), the parent directory inode passed to function
ocfs2_lookup_ino_from_name() is wrong. Parameter dir is the parent of
new_dentry not old_dentry. We should get old_dir from old_dentry and
lookup old_dentry in old_dir in case another node removes the old dentry.

With this change, hard linking works again, when paths are relative with
at least one subdirectory. This is how the problem was reproducible:

# mkdir a
# mkdir b
# touch a/test
# ln a/test b/test
ln: failed to create hard link `b/test' => `a/test': No such file or directory

However when creating links in the same dir, it worked well.

Now the link gets created.

Fixes: 0e048316ff57 ("ocfs2: check existence of old dentry in ocfs2_link()")
Signed-off-by: joyce.xue
Reported-by: Szabo Aron - UBIT
Cc: Mark Fasheh
Cc: Joel Becker
Tested-by: Aron Szabo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Xue jiufei
2015-01-16 22:59:44 +0800
8f0613ea5 ocfs2: fix journal commit deadlock

commit 136f49b9171074872f2a14ad0ab10486d1ba13ca upstream.

For buffered writes, the page lock is taken in write_begin and released in
write_end. In ocfs2_write_end_nolock(), before the page is unlocked in
ocfs2_free_write_ctxt(), ocfs2_run_deallocs() is called, which asks for the
read lock of journal->j_trans_barrier. Holding the page lock while asking
for journal->j_trans_barrier breaks the locking order.

This can deadlock with the journal commit threads: ocfs2cmt takes the write
lock of journal->j_trans_barrier first, then wakes up kjournald2 to do the
commit work, and finally waits until it is done. To commit the journal,
kjournald2 needs to flush data first, which requires the cache page lock.

Since some ocfs2 cluster locks are held by the writing process, this
deadlock may hang the whole cluster.

Unlocking the pages before ocfs2_run_deallocs() fixes the locking order.
Also move the unlock before ocfs2_commit_trans() so that the page lock is
released before j_trans_barrier, preserving the unlocking order.

Signed-off-by: Junxiao Bi
Reviewed-by: Wengang Wang
Reviewed-by: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Junxiao Bi
2015-01-16 22:59:44 +0800

09 Jan, 2015

19 commits

3c6babf55 Btrfs: fix fs corruption on transaction abort if device supports discard

commit 678886bdc6378c1cbd5072da2c5a3035000214e3 upstream.

When we abort a transaction we iterate over all the ranges marked as dirty
in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
from those trees, add them back (unpin) to the free space caches and, if
the fs was mounted with "-o discard", perform a discard on those regions.
Also, after adding the regions to the free space caches, a fitrim ioctl call
can see those ranges in a block group's free space cache and perform a discard
on the ranges, so the same issue can happen without "-o discard" as well.

This causes corruption, affecting one or multiple btree nodes (in the worst
case leaving the fs unmountable) because some of those ranges (the ones in
the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
referred by the last committed super block - breaking the rule that anything
that was committed by a transaction is untouched until the next transaction
commits successfully.

I ran into this while running in a loop (for several hours) the fstest that
I recently submitted:

[PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim

The corruption always happened when a transaction aborted and then fsck complained
like this:

_check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
*** fsck.btrfs output ***
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
Check tree block failed, want=94945280, have=0
read block failed check_tree_block
Couldn't open file system

In this case 94945280 corresponded to the root of a tree.
Using ftrace, what I observed was that the following sequence of steps happened:

1) transaction N started, fs_info->pinned_extents pointed to
fs_info->freed_extents[0];

2) node/eb 94945280 is created;

3) eb is persisted to disk;

4) transaction N commit starts, fs_info->pinned_extents now points to
fs_info->freed_extents[1], and transaction N completes;

5) transaction N + 1 starts;

6) eb is COWed, and btrfs_free_tree_block() called for this eb;

7) eb range (94945280 to 94945280 + 16Kb) is added to
fs_info->pinned_extents (fs_info->freed_extents[1]);

8) Something goes wrong in transaction N + 1, like hitting ENOSPC
for example, and the transaction is aborted, turning the fs into
readonly mode. The stack trace I got for example:

[112065.253935] [] dump_stack+0x4d/0x66
[112065.254271] [] warn_slowpath_common+0x7f/0x98
[112065.254567] [] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.261674] [] warn_slowpath_fmt+0x48/0x50
[112065.261922] [] ? btrfs_free_path+0x26/0x29 [btrfs]
[112065.262211] [] __btrfs_abort_transaction+0x50/0x10b [btrfs]
[112065.262545] [] btrfs_remove_chunk+0x537/0x58b [btrfs]
[112065.262771] [] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
[112065.263105] [] cleaner_kthread+0x100/0x12f [btrfs]
(...)
[112065.264493] ---[ end trace dd7903a975a31a08 ]---
[112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
[112065.264997] BTRFS info (device sdc): forced readonly

9) The cleaner kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
turn calls btrfs_destroy_pinned_extent();

10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
marked as dirty in fs_info->freed_extents[], and for each one
it calls discard, if the fs was mounted with "-o discard", and
adds the range to the free space cache of the respective block
group;

11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
sees the free space entries and performs a discard;

12) After an umount and mount (or fsck), our eb's location on disk was full
of zeroes, and it should have been untouched, because it was marked as
dirty in the fs_info->pinned_extents tree, and therefore used by the
trees that the last committed superblock points to.

Fix this by not performing a discard and not adding the ranges to the free space
caches - it's useless from this point since the fs is now in readonly mode and
we won't write free space caches to disk anymore (otherwise we would leak space)
nor any new superblock. By not adding the ranges to the free space caches, it
prevents other code paths from allocating that space and writing to it as well,
therefore being safer and simpler.

This isn't a new problem, as it's been present since 2011 (git commit
acce952b0263825da32cf10489413dec78053347).

Signed-off-by: Filipe Manana
Signed-off-by: Chris Mason
Signed-off-by: Greg Kroah-Hartman

Filipe Manana
2015-01-09 02:30:30 +0800
9d15399d7 Btrfs: make sure logged extents complete in the current transaction V3

commit 50d9aa99bd35c77200e0e3dd7a72274f8304701f upstream.

Liu Bo pointed out that my previous fix would lose the generation update in the
scenario I described. It is actually much worse than that, we could lose the
entire extent if we lose power right after the transaction commits. Consider
the following

write extent 0-4k
log extent in log tree
commit transaction
< power fail happens here
ordered extent completes

We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
the transaction commit will reset the log so it isn't replayed. If we lose
power before the transaction commit we are safe, otherwise we are not.

Fix this by keeping track of all extents we logged in this transaction. Then
when we go to commit the transaction make sure we wait for all of those ordered
extents to complete before proceeding. This will make sure that if we lose
power after the transaction commit we still have our data. This also fixes the
problem of the improperly updated extent generation. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason
Signed-off-by: Greg Kroah-Hartman

Josef Bacik
2015-01-09 02:30:30 +0800
115f9146f Btrfs: do not move em to modified list when unpinning

commit a28046956c71985046474283fa3bcd256915fb72 upstream.

We use the modified list to keep track of which extents have been modified so we
know which ones are candidates for logging at fsync() time. Newly modified
extents are added to the list at modification time, around the same time the
ordered extent is created. We do this so that we don't have to wait for ordered
extents to complete before we know what we need to log. The problem is when
something like this happens

log extent 0-4k on inode 1
copy csum for 0-4k from ordered extent into log
sync log
commit transaction
log some other extent on inode 1
ordered extent for 0-4k completes and adds itself onto modified list again
log changed extents
see ordered extent for 0-4k has already been logged
at this point we assume the csum has been copied
sync log
crash

On replay we will see the extent 0-4k in the log, drop the original 0-4k extent
which is the same one that we are replaying which also drops the csum, and then
we won't find the csum in the log for that bytenr. This of course causes us to
have errors about not having csums for certain ranges of our inode. So remove
the modified list manipulation in unpin_extent_cache, any modified extents
should have been added well before now, and we don't want them re-logged. This
fixes my test that I could reliably reproduce this problem with. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason
Signed-off-by: Greg Kroah-Hartman

Josef Bacik
2015-01-09 02:30:30 +0800
37ea7a1f6 btrfs: fix wrong accounting of raid1 data profile in statfs

commit 0d95c1bec906dd1ad951c9c001e798ca52baeb0f upstream.

The sizes that are obtained from space infos are in raw units and have
to be adjusted according to the raid factor. This was missing for
f_bavail and df reported doubled size for raid1.
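
As a worked example of the adjustment described above, here is a minimal sketch. The helper name and its parameters are hypothetical; it assumes the raid factor is simply the number of copies kept of each byte (2 for raid1), and it does not reproduce the actual btrfs statfs code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: convert a raw byte count from a space info into the
 * bytes a user can actually store, given how many copies the profile keeps.
 * A copies value of 0 is treated like 1 to avoid dividing by zero.
 */
static uint64_t usable_bytes(uint64_t raw_bytes, unsigned int copies)
{
    return copies ? raw_bytes / copies : raw_bytes;
}
```

With 200 raw bytes free on a two-copy profile, the usable figure is 100; reporting the raw 200 instead is the doubling that df showed.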

Reported-by: Martin Steigerwald
Fixes: ba7b6e62f420 ("btrfs: adjust statfs calculations according to raid profiles")
Signed-off-by: David Sterba
Signed-off-by: Chris Mason
Signed-off-by: Greg Kroah-Hartman

David Sterba
2015-01-09 02:30:30 +0800
d77fe802a Btrfs: make sure we wait on logged extents when fsyncing two subvols

commit 9dba8cf128ef98257ca719722280c9634e7e9dc7 upstream.

If we have two fsync()s racing on different subvols, one will do all of its work
to get into the log_tree, wait on its outstanding IO, and then allow the
log_tree to finish its commit. The problem is we were just freeing that
subvol's logged extents instead of waiting on them, so whoever lost the race
wouldn't really have their data on disk. Fix this by waiting properly instead
of freeing the logged extents. Thanks,

Signed-off-by: Josef Bacik
Signed-off-by: Chris Mason
Signed-off-by: Greg Kroah-Hartman

Josef Bacik
2015-01-09 02:30:29 +0800
d7fad547c eCryptfs: Remove buggy and unnecessary write in file name decode routine

commit 942080643bce061c3dd9d5718d3b745dcb39a8bc upstream.

Dmitry Chernenkov used KASAN to discover that eCryptfs writes past the
end of the allocated buffer during encrypted filename decoding. This
fix corrects the issue by getting rid of the unnecessary 0 write when
the current bit offset is 2.

Signed-off-by: Michael Halcrow
Reported-by: Dmitry Chernenkov
Suggested-by: Kees Cook
Signed-off-by: Tyler Hicks
Signed-off-by: Greg Kroah-Hartman

Michael Halcrow
2015-01-09 02:30:29 +0800
bbeb37ea1 eCryptfs: Force RO mount when encrypted view is enabled

commit 332b122d39c9cbff8b799007a825d94b2e7c12f2 upstream.

The ecryptfs_encrypted_view mount option greatly changes the
functionality of an eCryptfs mount. Instead of encrypting and decrypting
lower files, it provides a unified view of the encrypted files in the
lower filesystem. The presence of the ecryptfs_encrypted_view mount
option is intended to force a read-only mount and modifying files is not
supported when the feature is in use. See the following commit for more
information:

e77a56d [PATCH] eCryptfs: Encrypted passthrough

This patch forces the mount to be read-only when the
ecryptfs_encrypted_view mount option is specified by setting the
MS_RDONLY flag on the superblock. Additionally, this patch removes some
broken logic in ecryptfs_open() that attempted to prevent modifications
of files when the encrypted view feature was in use. The check in
ecryptfs_open() was not sufficient to prevent file modifications using
system calls that do not operate on a file descriptor.

Signed-off-by: Tyler Hicks
Reported-by: Priya Bansal
Signed-off-by: Greg Kroah-Hartman

Tyler Hicks
2015-01-09 02:30:29 +0800
41ba2abbb udf: Check component length before reading it

commit e237ec37ec154564f8690c5bd1795339955eeef9 upstream.

Check that length specified in a component of a symlink fits in the
input buffer we are reading. Also properly ignore component length for
component types that do not use it. Otherwise we read memory after end
of buffer for corrupted udf image.
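
The kind of bounds check described above can be sketched as follows. The layout is a simplified assumption (a 4-byte header whose second byte is the payload length), and `count_components` and `COMP_HDR_LEN` are hypothetical names; this is not the UDF driver's parser and it does not model the per-type handling of the length field.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed, simplified header: type, payload length, two further bytes. */
#define COMP_HDR_LEN 4u

/*
 * Walk a buffer of components and return how many were seen, or -1 if a
 * header or a declared payload would run past the end of the buffer.
 */
static int count_components(const uint8_t *buf, size_t size)
{
    size_t off = 0;
    int n = 0;

    while (off < size) {
        if (size - off < COMP_HDR_LEN)
            return -1;               /* header does not fit */
        size_t len = buf[off + 1];   /* declared payload length */
        off += COMP_HDR_LEN;
        if (len > size - off)
            return -1;               /* payload does not fit */
        off += len;
        n++;
    }
    return n;
}
```

The point of the check is that the declared length is validated against the remaining buffer before any payload byte is touched, so a corrupted image cannot steer the reader past the end.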

Reported-by: Carl Henrik Lunde
Signed-off-by: Jan Kara
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2015-01-09 02:30:29 +0800
53fbe4cb7 udf: Verify symlink size before loading it

commit a1d47b262952a45aae62bd49cfaf33dd76c11a2c upstream.

UDF specification allows arbitrarily large symlinks. However we support
only symlinks at most one block large. Check the length of the symlink
so that we don't access memory beyond end of the symlink block.

Reported-by: Carl Henrik Lunde
Signed-off-by: Jan Kara
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2015-01-09 02:30:29 +0800
a6a4afa5c udf: Verify i_size when loading inode

commit e159332b9af4b04d882dbcfe1bb0117f0a6d4b58 upstream.

Verify that inode size is sane when loading inode with data stored in
ICB. Otherwise we may get confused later when working with the inode and
inode size is too big.

Reported-by: Carl Henrik Lunde
Signed-off-by: Jan Kara
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2015-01-09 02:30:28 +0800
1a927faa5 udf: Check path length when reading symlink

commit 0e5cc9a40ada6046e6bc3bdfcd0c0d7e4b706b14 upstream.

Symlink reading code does not check whether the resulting path fits into
the page provided by the generic code. This isn't as easy as just
checking the symlink size because of various encoding conversions we
perform on path. So we have to check whether there is still enough space
in the buffer on the fly.

Reported-by: Carl Henrik Lunde
Signed-off-by: Jan Kara
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2015-01-09 02:30:28 +0800
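
Checking the remaining buffer space "on the fly", as the path-length commit above describes, might look something like this in userspace. The helper append_component() and its parameters are hypothetical and only illustrate the idea of re-checking the space left before each write to the output page.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Append srclen bytes of src plus a trailing '/' to out, keeping out
 * NUL-terminated. *used is the number of bytes already written and cap is
 * the total size of out. Nothing is written if the result would not fit.
 */
int append_component(char *out, size_t *used, size_t cap,
                     const char *src, size_t srclen)
{
    if (*used > cap)
        return -EINVAL;

    size_t room = cap - *used;

    /* Need srclen bytes, one '/', and one byte for the final NUL. */
    if (room < 2 || srclen > room - 2)
        return -ENAMETOOLONG;

    memcpy(out + *used, src, srclen);
    *used += srclen;
    out[(*used)++] = '/';
    out[*used] = '\0';
    return 0;
}
```

Because the check runs per component, it still works when the converted form of a component has a different length from its on-disk form, which is why a single up-front size check is not enough.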
522a8162a ncpfs: return proper error from NCP_IOC_SETROOT ioctl

commit a682e9c28cac152e6e54c39efcf046e0c8cfcf63 upstream.

If some error happens in NCP_IOC_SETROOT ioctl, the appropriate error
return value is then (in most cases) just overwritten before we return.
This can result in reporting success to userspace although an error happened.

This bug was introduced by commit 2e54eb96e2c8 ("BKL: Remove BKL from
ncpfs"). Propagate the errors correctly.

Coverity id: 1226925.

Fixes: 2e54eb96e2c80 ("BKL: Remove BKL from ncpfs")
Signed-off-by: Jan Kara
Cc: Petr Vandrovec
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2015-01-09 02:30:28 +0800
4a7215f13 userns: Add a knob to disable setgroups on a per user namespace basis

commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 upstream.

- Expose the knob to user space through a proc file /proc/<pid>/setgroups

A value of "deny" means the setgroups system call is disabled in the
current process's user namespace and cannot be enabled in the
future in this user namespace.

A value of "allow" means the setgroups system call is enabled.

- Descendant user namespaces inherit the value of setgroups from
their parents.

- A proc file is used (instead of a sysctl) as sysctls currently do
not allow checking the permissions at open time.

- Writing to the proc file is restricted to before the gid_map
for the user namespace is set.

This ensures that disabling setgroups at a user namespace
level will never remove the ability to call setgroups
from a process that already has that ability.

A process may opt in to the setgroups disable for itself by
creating, entering and configuring a user namespace or by calling
setns on an existing user namespace with setgroups disabled.
Processes without privileges already cannot call setgroups so this
is a noop. Processes with privilege become processes without
privilege when entering a user namespace and as with any other path
to dropping privilege they would not have the ability to call
setgroups. So this remains within the bounds of what is possible
without a knob to disable setgroups permanently in a user namespace.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2015-01-09 02:30:26 +0800
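
As an illustration of how user space might drive the knob from the setgroups commit above, here is a minimal sketch. The function write_setgroups_policy() is a hypothetical helper, not part of any kernel or libc API; it only assumes that the proc file accepts the strings "allow" and "deny" as the commit message states.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Write a setgroups policy string to the file at path.
 * Returns 0 on success and -1 on an invalid policy or any I/O failure.
 */
int write_setgroups_policy(const char *path, const char *policy)
{
    /* Only the two values named in the commit message are accepted. */
    if (strcmp(policy, "allow") != 0 && strcmp(policy, "deny") != 0)
        return -1;

    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    size_t len = strlen(policy);
    ssize_t n = write(fd, policy, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A process that has just created a user namespace would presumably call this with "/proc/self/setgroups" and "deny" before the gid_map is written, since the commit restricts writes to that window.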
462c8c0b3 umount: Disallow unprivileged mount force

commit b2f5d4dc38e034eecb7987e513255265ff9aa1cf upstream.

Forced unmount affects not just the mount namespace but the underlying
superblock as well. Restrict forced unmount to the global root user
for now. Otherwise it becomes possible for a user in a less privileged
mount namespace to force the shutdown of a superblock of a filesystem
in a more privileged mount namespace, allowing a DOS attack on root.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2015-01-09 02:30:24 +0800
80d4d8397 mnt: Implicitly add MNT_NODEV on remount when it was implicitly added by mount

commit 3e1866410f11356a9fd869beb3e95983dc79c067 upstream.

Now that remount is properly enforcing the rule that you can't remove
nodev, at least sandstorm.io is breaking when performing a remount.

It turns out that there is an easy, intuitive solution: implicitly
add nodev on remount when nodev was implicitly added on mount.

Tested-by: Cedric Bosdonnat
Tested-by: Richard Weinberger
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2015-01-09 02:30:24 +0800
877c27dba mnt: Fix a memory stomp in umount

commit c297abfdf15b4480704d6b566ca5ca9438b12456 upstream.

While reviewing the code of umount_tree I realized that when we append
to a preexisting unmounted list we do not change pprev of the former
first item in the list.

Which means later in namespace_unlock hlist_del_init(&mnt->mnt_hash) on
the former first item of the list will stomp unmounted.first leaving
it set to some random mount point which we are likely to free soon.

This isn't likely to hit, but if it does I don't know how anyone could
track it down.

[ This happened because we don't have all the same operations for
hlist's as we do for normal doubly-linked lists. In particular,
list_splice() is easy on our standard doubly-linked lists, while
hlist_splice() doesn't exist and needs both start/end entries of the
hlist. And commit 38129a13e6e7 incorrectly open-coded that missing
hlist_splice().

We should think about making these kinds of "mindless" conversions
easier to get right by adding the missing hlist helpers - Linus ]

Fixes: 38129a13e6e71f666e0468e99fdd932a687b4d7e switch mnt_hash to hlist
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2015-01-09 02:30:24 +0800
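
The pprev problem in the umount commit above is easier to see with a small userspace model of an hlist. The types hnode and hhead and the functions below are illustrative re-implementations written for this sketch, not the kernel's helpers; hsplice_front() stands in for the hlist_splice() that the message notes does not exist.

```c
#include <stddef.h>

/* Minimal hlist model: each node stores the address of the pointer to it. */
struct hnode { struct hnode *next; struct hnode **pprev; };
struct hhead { struct hnode *first; };

void hadd_head(struct hnode *n, struct hhead *h)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

void hdel_init(struct hnode *n)
{
    *n->pprev = n->next;  /* writes through whatever pprev points at */
    if (n->next)
        n->next->pprev = n->pprev;
    n->next = NULL;
    n->pprev = NULL;
}

/* Move all of src to the front of dst; last must be src's final node. */
void hsplice_front(struct hhead *src, struct hnode *last, struct hhead *dst)
{
    if (!src->first)
        return;

    last->next = dst->first;
    if (dst->first)
        /* The step the commit says was missed: the former first item
         * of dst must now point back at last->next, not at dst->first. */
        dst->first->pprev = &last->next;

    dst->first = src->first;
    dst->first->pprev = &dst->first;
    src->first = NULL;
}
```

Without the marked assignment, a later hdel_init() on the former first node would write through its stale pprev and overwrite dst->first, which matches the stomp on unmounted.first described in the message.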
9c0f8266e isofs: Fix unchecked printing of ER records

commit 4e2024624e678f0ebb916e6192bd23c1f9fdf696 upstream.

We didn't check the length of rock ridge ER records before printing them.
Thus a corrupted isofs image can cause us to access and print some memory
behind the buffer with obvious consequences.

Reported-and-tested-by: Carl Henrik Lunde
Signed-off-by: Jan Kara
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2015-01-09 02:30:21 +0800
6fac18d0a dcache: fix kmemcheck warning in switch_names

commit 08d4f7722268755ee34ed1c9e8afee7dfff022bb upstream.

This patch fixes kmemcheck warning in switch_names. The function
switch_names swaps inline names of two dentries. It swaps full arrays
d_iname, no matter how many bytes are really used by the strings. Reading
data beyond string ends results in kmemcheck warning.

We fix the bug by marking both arrays as fully initialized.

Signed-off-by: Mikulas Patocka
Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Mikulas Patocka
2015-01-09 02:30:18 +0800
a8897f267 nfs41: fix nfs4_proc_layoutget error handling

commit 4bd5a980de87d2b5af417485bde97b8eb3d6cf6a upstream.

nfs4_layoutget_release() drops layout hdr refcnt. Grab the refcnt
early so that it is safe to call .release in case nfs4_alloc_pages
fails.

Signed-off-by: Peng Tao
Fixes: a47970ff78147 ("NFSv4.1: Hold reference to layout hdr in layoutget")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Peng Tao
2015-01-09 02:30:18 +0800