Eric Lee / smarc-fsl-linux-kernel

15 Jan, 2016

1 commit

4eeb44518 MLK-12091 fs: ext3: add kludge to avoid an oops after the disk disappears

The idea of this patch is borrowed from ext4 commit
7c2e70879fc0949b4220ee61b7c4553f6976a94d, which is kept upstream
as a kludge for the ext4 driver. The ext3 driver has the same problem
but is obsolete in upstream 4.3, so we port the kludge to ext3 for the
internal 4.1 tree.

Signed-off-by: Li Jun

Li Jun
2016-01-15 01:03:04 +0800

15 Dec, 2015

15 commits

3b16e6415 ceph: fix message length computation

commit 777d738a5e58ba3b6f3932ab1543ce93703f4873 upstream.

create_request_message() computes the maximum length of a message,
but uses the wrong type for the time stamp: sizeof(struct timespec)
may be 8 or 16 depending on the architecture, while sizeof(struct
ceph_timespec) is always 8, and that is what gets put into the
message.
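The size mismatch can be sketched as follows. This is only an illustration: `struct wire_timespec` is a hypothetical stand-in for the fixed-size on-wire time stamp, and `max_message_len()` is an invented helper, not the kernel's create_request_message().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the on-wire time stamp: two fixed-width
 * 32-bit fields, so its size does not depend on the architecture. */
struct wire_timespec {
	uint32_t tv_sec;
	uint32_t tv_nsec;
};

/* Illustrative helper: size the buffer from the type that is actually
 * encoded into the message, not from the host's struct timespec. */
size_t max_message_len(size_t header_len, size_t payload_len)
{
	assert(header_len > 0);	/* a message always has a header */
	return header_len + sizeof(struct wire_timespec) + payload_len;
}
```

Using the host's `struct timespec` in the same sum would make the result vary between architectures even though the encoded bytes do not.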

Found while auditing the uses of timespec for y2038 problems.

Fixes: b8e69066d8af ("ceph: include time stamp in every MDS request")
Signed-off-by: Arnd Bergmann
Signed-off-by: Yan, Zheng
Signed-off-by: Greg Kroah-Hartman

Arnd Bergmann
2015-12-15 13:24:37 +0800

273ef1334 ocfs2: fix umask ignored issue

commit 8f1eb48758aacf6c1ffce18179295adbf3bd7640 upstream.

A newly created file's mode is not masked with the umask, so the umask
has no effect on ocfs2 volumes.
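As a reminder of what masking with the umask means, here is a minimal sketch. `apply_umask()` is a hypothetical helper written for illustration; it is not the ocfs2 code touched by this commit.

```c
#include <assert.h>
#include <sys/types.h>

/* Illustrative helper (not the ocfs2 code): clear the permission bits
 * that the umask forbids, leaving all other mode bits untouched. */
mode_t apply_umask(mode_t mode, mode_t mask)
{
	/* a umask only carries permission bits */
	assert((mask & ~(mode_t)0777) == 0);
	return mode & ~mask;
}
```

With a umask of 022, a requested mode of 0666 becomes 0644.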

Fixes: 702e5bc ("ocfs2: use generic posix ACL infrastructure")
Signed-off-by: Junxiao Bi
Cc: Gang He
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Junxiao Bi
2015-12-15 13:24:37 +0800

2cc11ce90 nfs: if we have no valid attrs, then don't declare the attribute cache valid

commit c812012f9ca7cf89c9e1a1cd512e6c3b5be04b85 upstream.

If we pass in an empty nfs_fattr struct to nfs_update_inode, it will
(correctly) not update any of the attributes, but it then clears the
NFS_INO_INVALID_ATTR flag, which indicates that the attributes are
up to date. Don't clear the flag if the fattr struct has no valid
attrs to apply.

Reviewed-by: Steve French
Signed-off-by: Jeff Layton
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Jeff Layton
2015-12-15 13:24:37 +0800

254cbeb13 nfs4: start callback_ident at idr 1

commit c68a027c05709330fe5b2f50c50d5fa02124b5d8 upstream.

If clp->cl_cb_ident is zero, then nfs_cb_idr_remove_locked() skips removing
it when the nfs_client is freed. A decoding or server bug can then find
and try to put that first nfs_client which would lead to a crash.
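The effect of starting the allocation at 1 can be modelled with a toy ID table. Everything below (`struct id_table`, `id_alloc()`, `id_remove()`) is hypothetical and only mimics the behaviour described above; it is not the kernel idr API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IDS 8

/* Toy ID table standing in for the callback idr. */
struct id_table {
	void *slot[MAX_IDS];
};

/* Allocate the lowest free ID that is >= start, or -1 if none is free. */
int id_alloc(struct id_table *t, void *ptr, int start)
{
	assert(start >= 0 && ptr != NULL);
	for (int id = start; id < MAX_IDS; id++) {
		if (t->slot[id] == NULL) {
			t->slot[id] = ptr;
			return id;
		}
	}
	return -1;
}

/* Mirrors the removal logic described above: an ident of 0 is treated
 * as "never allocated", so nothing is removed. */
void id_remove(struct id_table *t, int ident)
{
	if (ident)
		t->slot[ident] = NULL;
}
```

If allocation starts at 0, the first entry can never be removed and a stale pointer stays in the table; starting at 1 keeps 0 free to mean "no ident".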

Signed-off-by: Benjamin Coddington
Fixes: d6870312659d ("nfs4client: convert to idr_alloc()")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Benjamin Coddington
2015-12-15 13:24:37 +0800

03ecddec2 debugfs: fix refcount imbalance in start_creating

commit 0ee9608c89e81a1ccee52ecb58a7ff040e2522d9 upstream.

In debugfs' start_creating(), we pin the file system to safely access
its root. When we fail to create a file, we unpin the file system via
failed_creating() to release the mount count and eventually the reference
of the vfsmount.

However, when we run into an error during lookup_one_len() while still
in start_creating(), we only release the parent's mutex but not the
reference on the mount. It looks like this was done in the past, but after
splitting portions of __create_file() into start_creating() and
end_creating() via 190afd81e4a5 ("debugfs: split the beginning and the
end of __create_file() off"), it seems to have been missed. Noticed during
code review.
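The imbalance can be pictured with a toy counter model. The names below (`struct fs_state`, `pin_fs()`, `release_fs()`, `start_creating_sketch()`) are hypothetical and only mirror the pin/unpin pairing described above; they are not the debugfs functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: mount_count stands in for the pinned-filesystem reference. */
struct fs_state {
	int mount_count;
};

void pin_fs(struct fs_state *fs)
{
	fs->mount_count++;
}

void release_fs(struct fs_state *fs)
{
	assert(fs->mount_count > 0);	/* never release more than was pinned */
	fs->mount_count--;
}

/* Returns true on success, in which case the filesystem stays pinned for
 * the caller to release later. Returns false if the lookup failed. */
bool start_creating_sketch(struct fs_state *fs, bool lookup_fails)
{
	pin_fs(fs);
	if (lookup_fails) {
		release_fs(fs);	/* the release the error path was missing */
		return false;
	}
	return true;
}
```

Without the release on the failure path, every failed lookup would leave the count one higher than before the call.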

Fixes: 190afd81e4a5 ("debugfs: split the beginning and the end of __create_file() off")
Signed-off-by: Daniel Borkmann
Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Daniel Borkmann
2015-12-15 13:24:36 +0800

3c10560e5 nfsd: eliminate sending duplicate and repeated delegations

commit 34ed9872e745fa56f10e9bef2cf3d2336c6c8816 upstream.

We've observed the nfsd server in a state where there are
multiple delegations on the same nfs4_file for the same client.
The nfs client does attempt to DELEGRETURN these when they are presented to
it - but apparently under some (unknown) circumstances the client does not
manage to return all of them. This leads to the eventual
attempt to CB_RECALL more than one delegation with the same nfs
filehandle to the same client. The first recall will succeed, but the
next recall will fail with NFS4ERR_BADHANDLE. This leads to the server
having delegations on cl_revoked that the client has no way to FREE
or DELEGRETURN, with resulting inability to recover. The state manager
on the server will continually assert SEQ4_STATUS_RECALLABLE_STATE_REVOKED,
and the state manager on the client will be looping unable to satisfy
the server.

List discussion also reports a race between OPEN and DELEGRETURN that
will be avoided by only sending the delegation once to the
client. This is also logically in accordance with RFC 5661 9.1.1 and 10.2.

So, let's:

1.) Not hand out duplicate delegations.
2.) Only send them to the client once.

RFC 5661:

9.1.1:
"Delegations and layouts, on the other hand, are not associated with a
specific owner but are associated with the client as a whole
(identified by a client ID)."

10.2:
"...the stateid for a delegation is associated with a client ID and may be
used on behalf of all the open-owners for the given client. A
delegation is made to the client as a whole and not to any specific
process or thread of control within it."

Reported-by: Eric Meddaugh
Cc: Trond Myklebust
Cc: Olga Kornievskaia
Signed-off-by: Andrew Elble
Signed-off-by: J. Bruce Fields
Signed-off-by: Greg Kroah-Hartman

Andrew Elble
2015-12-15 13:24:36 +0800

35b2e295e nfsd: serialize state seqid morphing operations

commit 35a92fe8770ce54c5eb275cd76128645bea2d200 upstream.

Andrew was seeing a race occur when an OPEN and OPEN_DOWNGRADE were
running in parallel. The server would receive the OPEN_DOWNGRADE first
and check its seqid, but then an OPEN would race in and bump it. The
OPEN_DOWNGRADE would then complete and bump the seqid again. The result
was that the OPEN_DOWNGRADE would be applied after the OPEN, even though
it should have been rejected since the seqid changed.

The only recourse we have here I think is to serialize operations that
bump the seqid in a stateid, particularly when we're given a seqid in
the call. To address this, we add a new rw_semaphore to the
nfs4_ol_stateid struct. We do a down_write prior to checking the seqid
after looking up the stateid to ensure that nothing else is going to
bump it while we're operating on it.

In the case of OPEN, we do a down_read, as the call doesn't contain a
seqid. Those can run in parallel -- we just need to serialize them when
there is a concurrent OPEN_DOWNGRADE or CLOSE.

LOCK and LOCKU however always take the write lock as there is no
opportunity for parallelizing those.

Reported-and-Tested-by: Andrew W Elble
Signed-off-by: Jeff Layton
Signed-off-by: J. Bruce Fields
Signed-off-by: Greg Kroah-Hartman

Jeff Layton
2015-12-15 13:24:36 +0800

3e447b618 ext4, jbd2: ensure entering into panic after recording an error in superblock

commit 4327ba52afd03fc4b5afa0ee1d774c9c5b0e85c5 upstream.

If an ext4 filesystem uses JBD2 journaling and an error occurs, the
journal is aborted first, the error number is recorded in the JBD2
superblock and, finally, the system panics if the "errors=panic" option
is set. In rare cases, however, this sequence is reordered as shown in
the figure below, and the system panics (which means a system reset in
a mobile environment) before the error has been recorded in the journal
superblock. In this case, e2fsck cannot recognize that the filesystem
failure occurred in the previous run and the corruption is not fixed.

Task A                        Task B
ext4_handle_error()
-> jbd2_journal_abort()
  -> __journal_abort_soft()
    -> __jbd2_journal_abort_hard()
    | -> journal->j_flags |= JBD2_ABORT;
    |
    |                         __ext4_abort()
    |                         -> jbd2_journal_abort()
    |                         | -> __journal_abort_soft()
    |                         |   -> if (journal->j_flags & JBD2_ABORT)
    |                         |           return;
    |                         -> panic()
    |
    -> jbd2_journal_update_sb_errno()
Tested-by: Hobin Woo
Signed-off-by: Daeho Jeong
Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman

Daeho Jeong
2015-12-15 13:24:35 +0800

456dd91e0 ext4: fix potential use after free in __ext4_journal_stop

commit 6934da9238da947628be83635e365df41064b09b upstream.

There is a possible use-after-free in __ext4_journal_stop(): if the
handle is freed by the first jbd2_journal_stop(), we still reference
handle->h_err afterwards. This was introduced in
9705acd63b125dee8b15c705216d7186daea4625 and it is wrong. Fix it by
storing the handle->h_err value beforehand, which avoids referencing a
potentially freed handle.
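The pattern of the fix, reading the field before the call that may free the object, can be sketched as follows. `struct handle_sketch`, `journal_stop_sketch()` and `stop_and_report()` are hypothetical names for illustration only; this is not the ext4 code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a journal handle carrying an error code. */
struct handle_sketch {
	int h_err;
};

/* Stand-in for the stop call: returns its own status and frees the handle. */
int journal_stop_sketch(struct handle_sketch *h)
{
	free(h);
	return 0;
}

int stop_and_report(struct handle_sketch *h)
{
	assert(h != NULL);

	int err = h->h_err;	/* read before the handle can be freed */
	int rc = journal_stop_sketch(h);

	/* h must not be dereferenced past this point */
	return err ? err : rc;
}
```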

Fixes: 9705acd63b125dee8b15c705216d7186daea4625
Signed-off-by: Lukas Czerner
Reviewed-by: Andreas Dilger
Signed-off-by: Greg Kroah-Hartman

Lukas Czerner
2015-12-15 13:24:35 +0800

b8a7a3010 ext4 crypto: fix memory leak in ext4_bio_write_page()

commit 937d7b84dca58f2565715f2c8e52f14c3d65fb22 upstream.

There are times when ext4_bio_write_page() is called even though we
don't actually need to do any I/O. This happens when ext4_writepage()
gets called by the jbd2 commit path when an inode needs to force its
pages written out in order to provide data=ordered guarantees --- and
a page is backed by an unwritten (e.g., uninitialized) block on disk,
or if delayed allocation means the page's backing store hasn't been
allocated yet. In that case, we need to skip the call to
ext4_encrypt_page(), since in addition to wasting CPU, it leads to a
bounce page and an ext4 crypto context getting leaked.

Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman

Theodore Ts'o
2015-12-15 13:24:35 +0800

c39df1cf3 Btrfs: fix race when listing an inode's xattrs

commit f1cd1f0b7d1b5d4aaa5711e8f4e4898b0045cb6d upstream.

When listing an inode's xattrs we have a time window where we race against
a concurrent operation for adding a new hard link for our inode that makes
us not return any xattr to user space. In order for this to happen, the
first xattr of our inode needs to be at slot 0 of a leaf and the previous
leaf must still have room for an inode ref (or extref) item, and this can
happen because an inode's listxattrs callback does not lock the inode's
i_mutex (nor does the VFS do it for us), but adding a hard link to an
inode makes the VFS lock the inode's i_mutex before calling the inode's
link callback.

If we have the following leafs:

         Leaf X (has N items)                    Leaf Y

 [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 XATTR_ITEM 12345), ... ]
           slot N - 2         slot N - 1              slot 0

The race illustrated by the following sequence diagram is possible:

             CPU 1                                     CPU 2

  btrfs_listxattr()

    searches for key (257 XATTR_ITEM 0)

    gets path with path->nodes[0] == leaf X
    and path->slots[0] == N

    because path->slots[0] is >=
    btrfs_header_nritems(leaf X), it calls
    btrfs_next_leaf()

    btrfs_next_leaf()
      releases the path

                                           adds key (257 INODE_REF 666)
                                           to the end of leaf X (slot N),
                                           and leaf X now has N + 1 items

      searches for the key (257 INODE_REF 256),
      with path->keep_locks == 1, because that
      is the last key it saw in leaf X before
      releasing the path

      ends up at leaf X again and it verifies
      that the key (257 INODE_REF 256) is no
      longer the last key in leaf X, so it
      returns with path->nodes[0] == leaf X
      and path->slots[0] == N, pointing to
      the new item with key (257 INODE_REF 666)

    btrfs_listxattr's loop iteration sees that
    the type of the key pointed by the path is
    different from the type BTRFS_XATTR_ITEM_KEY
    and so it breaks the loop and stops looking
    for more xattr items
      --> the application doesn't get any xattr
          listed for our inode

So fix this by breaking the loop only if the key's type is greater than
BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.
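The corrected loop condition can be sketched as a toy walker over a sorted key array. `struct key_sketch` and `count_xattrs()` are hypothetical; the numeric key types are shown only for their ordering (inode ref < xattr item < extent data), and the objectid check is an assumption added so the sketch stops at the next inode.

```c
#include <assert.h>
#include <stddef.h>

/* Key types; only their relative ordering matters for this sketch. */
enum {
	INODE_REF_KEY = 12,
	XATTR_ITEM_KEY = 24,
	EXTENT_DATA_KEY = 108
};

struct key_sketch {
	unsigned long long objectid;
	int type;
	unsigned long long offset;
};

/* Count the xattr items of inode 'ino' in a key-sorted array. */
size_t count_xattrs(const struct key_sketch *keys, size_t n,
		    unsigned long long ino)
{
	size_t found = 0;

	assert(keys != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (keys[i].objectid != ino)
			break;		/* assumption: stop at another inode */
		if (keys[i].type > XATTR_ITEM_KEY)
			break;		/* past all xattr items */
		if (keys[i].type < XATTR_ITEM_KEY)
			continue;	/* e.g. a newly added inode ref: skip */
		found++;
	}
	return found;
}
```

With the old behaviour (break on any type other than the xattr type), an inode ref that lands in front of the first xattr item would end the walk before any xattr was seen.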

Signed-off-by: Filipe Manana
Signed-off-by: Greg Kroah-Hartman

Filipe Manana
2015-12-15 13:24:34 +0800

d22b28439 Btrfs: fix race leading to BUG_ON when running delalloc for nodatacow

commit 1d512cb77bdbda80f0dd0620a3b260d697fd581d upstream.

If we are using the NO_HOLES feature, we have a tiny time window when
running delalloc for a nodatacow inode where we can race with a concurrent
link or xattr add operation leading to a BUG_ON.

This happens because at run_delalloc_nocow() we end up casting a leaf item
of type BTRFS_INODE_[REF|EXTREF]_KEY or of type BTRFS_XATTR_ITEM_KEY to a
file extent item (struct btrfs_file_extent_item) and then analyse its
extent type field, which won't match any of the expected extent types
(values BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]) and therefore trigger an
explicit BUG_ON(1).

The following sequence diagram shows how the race happens when running a
no-cow delalloc range [4K, 8K[ for inode 257 and we have the following
neighbour leafs:

         Leaf X (has N items)                    Leaf Y

 [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
           slot N - 2         slot N - 1              slot 0

(Note the implicit hole for inode 257 regarding the [0, 8K[ range)

             CPU 1                                     CPU 2

  run_delalloc_nocow()
    btrfs_lookup_file_extent()
      --> searches for a key with value
          (257 EXTENT_DATA 4096) in the
          fs/subvol tree
      --> returns us a path with
          path->nodes[0] == leaf X and
          path->slots[0] == N

    because path->slots[0] is >=
    btrfs_header_nritems(leaf X), it
    calls btrfs_next_leaf()

    btrfs_next_leaf()
      --> releases the path

                                           hard link added to our inode,
                                           with key (257 INODE_REF 500)
                                           added to the end of leaf X,
                                           so leaf X now has N + 1 keys

      --> searches for the key
          (257 INODE_REF 256), because
          it was the last key in leaf X
          before it released the path,
          with path->keep_locks set to 1

      --> ends up at leaf X again and
          it verifies that the key
          (257 INODE_REF 256) is no longer
          the last key in the leaf, so it
          returns with path->nodes[0] ==
          leaf X and path->slots[0] == N,
          pointing to the new item with
          key (257 INODE_REF 500)

    the loop iteration of run_delalloc_nocow()
    does not break out of the loop and continues
    because the key referenced in the path
    at path->nodes[0] and path->slots[0] is
    for inode 257, its type is < BTRFS_EXTENT_DATA_KEY
    and its offset (500) is less than our delalloc
    range's end (8192)

    the item pointed by the path, an inode reference item,
    is (incorrectly) interpreted as a file extent item and
    we get an invalid extent type, leading to the BUG_ON(1):

    if (extent_type == BTRFS_FILE_EXTENT_REG ||
        extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
            (...)
    } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
            (...)
    } else {
            BUG_ON(1);
    }

The same can happen if a xattr is added concurrently and ends up having
a key with an offset smaller than the delalloc's range end.

So fix this by skipping keys with a type smaller than
BTRFS_EXTENT_DATA_KEY.

Signed-off-by: Filipe Manana
Signed-off-by: Greg Kroah-Hartman

Filipe Manana
2015-12-15 13:24:34 +0800

51ac5edb6 Btrfs: fix race leading to incorrect item deletion when dropping extents

commit aeafbf8486c9e2bd53f5cc3c10c0b7fd7149d69c upstream.

While running a stress test I got the following warning triggered:

[191627.672810] ------------[ cut here ]------------
[191627.673949] WARNING: CPU: 8 PID: 8447 at fs/btrfs/file.c:779 __btrfs_drop_extents+0x391/0xa50 [btrfs]()
(...)
[191627.701485] Call Trace:
[191627.702037] [] dump_stack+0x4f/0x7b
[191627.702992] [] ? console_unlock+0x356/0x3a2
[191627.704091] [] warn_slowpath_common+0xa1/0xbb
[191627.705380] [] ? __btrfs_drop_extents+0x391/0xa50 [btrfs]
[191627.706637] [] warn_slowpath_null+0x1a/0x1c
[191627.707789] [] __btrfs_drop_extents+0x391/0xa50 [btrfs]
[191627.709155] [] ? cache_alloc_debugcheck_after.isra.32+0x171/0x1d0
[191627.712444] [] ? kmemleak_alloc_recursive.constprop.40+0x16/0x18
[191627.714162] [] insert_reserved_file_extent.constprop.40+0x83/0x24e [btrfs]
[191627.715887] [] ? start_transaction+0x3bb/0x610 [btrfs]
[191627.717287] [] btrfs_finish_ordered_io+0x273/0x4e2 [btrfs]
[191627.728865] [] finish_ordered_fn+0x15/0x17 [btrfs]
[191627.730045] [] normal_work_helper+0x14c/0x32c [btrfs]
[191627.731256] [] btrfs_endio_write_helper+0x12/0x14 [btrfs]
[191627.732661] [] process_one_work+0x24c/0x4ae
[191627.733822] [] worker_thread+0x206/0x2c2
[191627.734857] [] ? process_scheduled_works+0x2f/0x2f
[191627.736052] [] ? process_scheduled_works+0x2f/0x2f
[191627.737349] [] kthread+0xef/0xf7
[191627.738267] [] ? time_hardirqs_on+0x15/0x28
[191627.739330] [] ? __kthread_parkme+0xad/0xad
[191627.741976] [] ret_from_fork+0x42/0x70
[191627.743080] [] ? __kthread_parkme+0xad/0xad
[191627.744206] ---[ end trace bbfddacb7aaada8d ]---

$ cat -n fs/btrfs/file.c
691 int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
(...)
758 btrfs_item_key_to_cpu(leaf, &key, path->slots[0]);
759 if (key.objectid > ino ||
760 key.type > BTRFS_EXTENT_DATA_KEY || key.offset >= end)
761 break;
762
763 fi = btrfs_item_ptr(leaf, path->slots[0],
764 struct btrfs_file_extent_item);
765 extent_type = btrfs_file_extent_type(leaf, fi);
766
767 if (extent_type == BTRFS_FILE_EXTENT_REG ||
768 extent_type == BTRFS_FILE_EXTENT_PREALLOC) {
(...)
774 } else if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
(...)
778 } else {
779 WARN_ON(1);
780 extent_end = search_start;
781 }
(...)

This happened because the item we were processing was not a file
extent item (its key type != BTRFS_EXTENT_DATA_KEY), and even in this
case we cast the item to a struct btrfs_file_extent_item pointer and
then find a type field value that does not match any of the expected
values (BTRFS_FILE_EXTENT_[REG|PREALLOC|INLINE]). This scenario happens
due to a tiny time window where a race can happen, as exemplified below.
Consider the following scenario where we're using the NO_HOLES feature
and we have the following two neighbour leafs:

         Leaf X (has N items)                    Leaf Y

 [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 EXTENT_DATA 8192), ... ]
           slot N - 2         slot N - 1              slot 0

Our inode 257 has an implicit hole in the range [0, 8K[ (implicit rather
than explicit because NO_HOLES is enabled). Now if our inode has an
ordered extent for the range [4K, 8K[ that is finishing, the following
can happen:

             CPU 1                                     CPU 2

  btrfs_finish_ordered_io()
    insert_reserved_file_extent()
      __btrfs_drop_extents()
        Searches for the key
        (257 EXTENT_DATA 4096) through
        btrfs_lookup_file_extent()

        Key not found and we get a path where
        path->nodes[0] == leaf X and
        path->slots[0] == N

        Because path->slots[0] is >=
        btrfs_header_nritems(leaf X), we call
        btrfs_next_leaf()

        btrfs_next_leaf() releases the path

                                           inserts key
                                           (257 INODE_REF 4096)
                                           at the end of leaf X,
                                           leaf X now has N + 1 keys,
                                           and the new key is at
                                           slot N

        btrfs_next_leaf() searches for
        key (257 INODE_REF 256), with
        path->keep_locks set to 1,
        because it was the last key it
        saw in leaf X

          finds it in leaf X again and
          notices it's no longer the last
          key of the leaf, so it returns 0
          with path->nodes[0] == leaf X and
          path->slots[0] == N (which is now
          < btrfs_header_nritems(leaf X)),
          pointing to the new key
          (257 INODE_REF 4096)

        __btrfs_drop_extents() casts the
        item at path->nodes[0], slot
        path->slots[0], to a struct
        btrfs_file_extent_item - it does
        not skip keys for the target
        inode with a type less than
        BTRFS_EXTENT_DATA_KEY
        (BTRFS_INODE_REF_KEY < BTRFS_EXTENT_DATA_KEY)

        sees a bogus value for the type
        field triggering the WARN_ON in
        the trace shown above, and sets
        extent_end = search_start (4096)

        does the if-then-else logic to
        fixup 0 length extent items created
        by a past bug from hole punching:

          if (extent_end == key.offset &&
              extent_end >= search_start)
                  goto delete_extent_item;

        that evaluates to true and it ends
        up deleting the key pointed to by
        path->slots[0], (257 INODE_REF 4096),
        from leaf X

The same could happen for example for a xattr that ends up having a key
with an offset value that matches search_start (very unlikely but not
impossible).

So fix this by ensuring that keys smaller than BTRFS_EXTENT_DATA_KEY are
skipped, never cast to struct btrfs_file_extent_item and never deleted
by accident. Also protect against the unexpected case of getting a key
for a lower inode number by skipping that key and issuing a warning.
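The per-key decision described above can be written as a small classifier. This is a sketch under stated assumptions: `struct drop_key`, `enum key_action` and `classify_key()` are hypothetical names, and the warning is a plain message on stderr rather than the kernel's warning machinery.

```c
#include <assert.h>
#include <stdio.h>

#define EXTENT_DATA_KEY_TYPE 108	/* only the ordering of types matters here */

struct drop_key {
	unsigned long long objectid;
	int type;
	unsigned long long offset;
};

enum key_action { KEY_STOP, KEY_SKIP, KEY_PROCESS };

/* Decide what to do with the key at the current slot while dropping
 * extents of inode 'ino' in the range that ends at 'end'. */
enum key_action classify_key(const struct drop_key *key,
			     unsigned long long ino, unsigned long long end)
{
	assert(key != NULL);

	if (key->objectid > ino)
		return KEY_STOP;
	if (key->objectid < ino) {
		/* unexpected: warn and move on to the next slot */
		fprintf(stderr, "unexpected key for inode %llu\n",
			key->objectid);
		return KEY_SKIP;
	}
	if (key->type < EXTENT_DATA_KEY_TYPE)
		return KEY_SKIP;	/* never treat it as a file extent item */
	if (key->type > EXTENT_DATA_KEY_TYPE || key->offset >= end)
		return KEY_STOP;
	return KEY_PROCESS;
}
```

Only keys classified as `KEY_PROCESS` would be interpreted as file extent items, so an inode ref such as (257 INODE_REF 4096) can no longer reach the deletion path.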

Signed-off-by: Filipe Manana
Signed-off-by: Greg Kroah-Hartman

Filipe Manana
2015-12-15 13:24:34 +0800

f1008f6d2 Btrfs: fix truncation of compressed and inlined extents

commit 0305cd5f7fca85dae392b9ba85b116896eb7c1c7 upstream.

When truncating a file that consists of a compressed inline extent to a
smaller size, we did not discard (or make unusable) the data between the
new file size and the old file size, wasting metadata space and allowing
the truncated data to be leaked, with the data corruption/loss mentioned
below.
We were also not correctly decrementing the number of bytes used by the
inode, we were setting it to zero, giving a wrong report for callers of
the stat(2) syscall. The fsck tool also reported an error about a mismatch
between the nbytes of the file versus the real space used by the file.

Now because we weren't discarding the truncated region of the file, it
was possible for a caller of the clone ioctl to actually read the data
that was truncated, allowing for a security breach without requiring root
access to the system, using only standard filesystem operations. The
scenario is the following:

1) User A creates a file which consists of an inline and compressed
extent with a size of 2000 bytes - the file is not accessible to
any other users (no read, write or execution permission for anyone
else);

2) The user truncates the file to a size of 1000 bytes;

3) User A makes the file world readable;

4) User B creates a file consisting of an inline extent of 2000 bytes;

5) User B issues a clone operation from user A's file into its own
file (using a length argument of 0, clone the whole range);

6) User B now gets to see the 1000 bytes that user A truncated from
its file before it made its file world readable. User B also lost
the bytes in the range [1000, 2000[ bytes from its own file, but
that might be ok if his/her intention was reading stale data from
user A that was never supposed to be public.

Note that this contrasts with the case where we truncate a file from 2000
bytes to 1000 bytes and then truncate it back from 1000 to 2000 bytes. In
this case reading any byte from the range [1000, 2000[ will return a value
of 0x00, instead of the original data.

This problem exists since the clone ioctl was added and happens both with
and without my recent data loss and file corruption fixes for the clone
ioctl (patch "Btrfs: fix file corruption and data loss after cloning
inline extents").

So fix this by truncating the compressed inline extents as we do for the
non-compressed case, which involves decompressing (if the data isn't
already in the page cache), compressing the truncated version of the
extent, writing the compressed content into the inline extent and then
truncating it.

The following test case for fstests reproduces the problem. In order for
the test to pass both this fix and my previous fix for the clone ioctl
that forbids cloning a smaller inline extent into a larger one,
which is titled "Btrfs: fix file corruption and data loss after cloning
inline extents", are needed. Without that other fix the test fails in a
different way that does not leak the truncated data, instead part of
destination file gets replaced with zeroes (because the destination file
has a larger inline extent than the source).

seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15

_cleanup()
{
rm -f $tmp.*
}

# get standard environment, filters and checks
. ./common/rc
. ./common/filter

# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner

rm -f $seqres.full

_scratch_mkfs >>$seqres.full 2>&1
_scratch_mount "-o compress"

# Create our test files. File foo is going to be the source of a clone operation
# and consists of a single inline extent with an uncompressed size of 512 bytes,
# while file bar consists of a single inline extent with an uncompressed size of
# 256 bytes. For our test's purpose, it's important that file bar has an inline
# extent with a size smaller than foo's inline extent.
$XFS_IO_PROG -f -c "pwrite -S 0xa1 0 128" \
-c "pwrite -S 0x2a 128 384" \
$SCRATCH_MNT/foo | _filter_xfs_io
$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 256" $SCRATCH_MNT/bar | _filter_xfs_io

# Now durably persist all metadata and data. We do this to make sure that we get
# on disk an inline extent with a size of 512 bytes for file foo.
sync

# Now truncate our file foo to a smaller size. Because it consists of a
# compressed and inline extent, btrfs did not shrink the inline extent to the
# new size (if the extent was not compressed, btrfs would shrink it to 128
# bytes), it only updates the inode's i_size to 128 bytes.
$XFS_IO_PROG -c "truncate 128" $SCRATCH_MNT/foo

# Now clone foo's inline extent into bar.
# This clone operation should fail with errno EOPNOTSUPP because the source
# file consists only of an inline extent and the file's size is smaller than
# the inline extent of the destination (128 bytes < 256 bytes). However the
# clone ioctl was not prepared to deal with a file that has a size smaller
# than the size of its inline extent (something that happens only for compressed
# inline extents), resulting in copying the full inline extent from the source
# file into the destination file.
#
# Note that btrfs' clone operation for inline extents consists of removing the
# inline extent from the destination inode and copy the inline extent from the
# source inode into the destination inode, meaning that if the destination
# inode's inline extent is larger (N bytes) than the source inode's inline
# extent (M bytes), some bytes (N - M bytes) will be lost from the destination
# file. Btrfs could copy the source inline extent's data into the destination's
# inline extent so that we would not lose any data, but that's currently not
# done due to the complexity that would be needed to deal with such cases
# (especially when one or both extents are compressed), returning EOPNOTSUPP, as
# it's normally not a very common case to clone very small files (only case
# where we get inline extents) and copying inline extents does not save any
# space (unlike for normal, non-inlined extents).
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/foo $SCRATCH_MNT/bar

# Now because the above clone operation used to succeed, and due to foo's inline
# extent not being shrunk by the truncate operation, our file bar got the whole
# inline extent copied from foo, making us lose the last 128 bytes from bar
# which got replaced by the bytes in range [128, 256[ from foo before foo was
# truncated - in other words, data loss from bar and being able to read old and
# stale data from foo that should not be possible to read anymore through normal
# filesystem operations. Contrast with the case where we truncate a file from a
# size N to a smaller size M, truncate it back to size N and then read the range
# [M, N[, we should always get the value 0x00 for all the bytes in that range.

# We expected the clone operation to fail with errno EOPNOTSUPP and therefore
# not modify file bar's data/metadata. So its content should be 256 bytes
# long with all bytes having the value 0xbb.
#
# Without the btrfs bug fix, the clone operation succeeded and resulted in
# leaking truncated data from foo, the bytes that belonged to its range
# [128, 256[, and losing data from bar in that same range. So reading the
# file gave us the following content:
#
# 0000000 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1 a1
# *
# 0000200 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a 2a
# *
# 0000400
echo "File bar's content after the clone operation:"
od -t x1 $SCRATCH_MNT/bar

# Also because foo's inline extent was not shrunk by the truncate
# operation, btrfs' fsck, which is run by the fstests framework every time a
# test completes, failed reporting the following error:
#
# root 5 inode 257 errors 400, nbytes wrong

status=0
exit

Signed-off-by: Filipe Manana
Signed-off-by: Greg Kroah-Hartman

Filipe Manana
2015-12-15 13:24:33 +0800

83881c15e Btrfs: fix file corruption and data loss after cloning inline extents

commit 8039d87d9e473aeb740d4fdbd59b9d2f89b2ced9 upstream.

Currently the clone ioctl allows cloning an inline extent from one file
to another that already has other (non-inlined) extents. This is a problem
because btrfs is not designed to deal with files having inline and regular
extents: if a file has an inline extent then it must be the only extent
in the file and must start at file offset 0. Having a file with an inline
extent followed by regular extents results in EIO errors when doing reads
or writes against the first 4K of the file.

Also, the clone ioctl allows one to lose data if the source file consists
of a single inline extent, with a size of N bytes, and the destination
file consists of a single inline extent with a size of M bytes, where we
have M > N. In this case the clone operation removes the inline extent
from the destination file and then copies the inline extent from the
source file into the destination file - we lose the M - N bytes from the
destination file, a read operation will get the value 0x00 for any bytes
in the range [N, M] (the destination inode's i_size remained as M,
that's why we can read past N bytes).

So fix this by not allowing such destructive operations to happen and
return errno EOPNOTSUPP to user space.
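The two rejected cases can be summarised in a small validation sketch. `check_inline_clone()` and its parameters are hypothetical and only restate the rules from this message; this is not the btrfs clone ioctl code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns 0 if cloning an inline extent of src_inline_len bytes into the
 * destination is allowed, or -EOPNOTSUPP for the destructive cases. */
int check_inline_clone(size_t src_inline_len, size_t dst_inline_len,
		       bool dst_has_other_extents)
{
	assert(src_inline_len > 0);

	/* An inline extent must be the only extent in a file. */
	if (dst_has_other_extents)
		return -EOPNOTSUPP;

	/* Replacing a larger inline extent would lose its tail bytes. */
	if (dst_inline_len > src_inline_len)
		return -EOPNOTSUPP;

	return 0;
}
```

Cloning a 50 byte inline extent into a file whose inline extent has 40 bytes is allowed, while a destination with 60 bytes, or one with regular extents, is rejected.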

Currently the fstest btrfs/035 tests the data loss case but it totally
ignores this - i.e. expects the operation to succeed and does not check
that we got data loss.

The following test case for fstests exercises all these cases that result
in file corruption and data loss:

seq=`basename $0`
seqres=$RESULT_DIR/$seq
echo "QA output created by $seq"
tmp=/tmp/$$
status=1 # failure is the default!
trap "_cleanup; exit \$status" 0 1 2 3 15

_cleanup()
{
rm -f $tmp.*
}

# get standard environment, filters and checks
. ./common/rc
. ./common/filter

# real QA test starts here
_need_to_be_root
_supported_fs btrfs
_supported_os Linux
_require_scratch
_require_cloner
_require_btrfs_fs_feature "no_holes"
_require_btrfs_mkfs_feature "no-holes"

rm -f $seqres.full

test_cloning_inline_extents()
{
local mkfs_opts=$1
local mount_opts=$2

_scratch_mkfs $mkfs_opts >>$seqres.full 2>&1
_scratch_mount $mount_opts

# File bar, the source for all the following clone operations, consists
# of a single inline extent (50 bytes).
$XFS_IO_PROG -f -c "pwrite -S 0xbb 0 50" $SCRATCH_MNT/bar \
| _filter_xfs_io

# Test cloning into a file with an extent (non-inlined) where the
# destination offset overlaps that extent. It should not be possible to
# clone the inline extent from file bar into this file.
$XFS_IO_PROG -f -c "pwrite -S 0xaa 0K 16K" $SCRATCH_MNT/foo \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo

# Doing IO against any range in the first 4K of the file should work.
# Due to a past clone ioctl bug which allowed cloning the inline extent,
# these operations resulted in EIO errors.
echo "File foo data after clone operation:"
# All bytes should have the value 0xaa (clone operation failed and did
# not modify our file).
od -t x1 $SCRATCH_MNT/foo
$XFS_IO_PROG -c "pwrite -S 0xcc 0 100" $SCRATCH_MNT/foo | _filter_xfs_io

# Test cloning the inline extent against a file which has a hole in its
# first 4K followed by a non-inlined extent. It should not be possible
# as well to clone the inline extent from file bar into this file.
$XFS_IO_PROG -f -c "pwrite -S 0xdd 4K 12K" $SCRATCH_MNT/foo2 \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo2

# Doing IO against any range in the first 4K of the file should work.
# Due to a past clone ioctl bug which allowed cloning the inline extent,
# these operations resulted in EIO errors.
echo "File foo2 data after clone operation:"
# All bytes should have the value 0x00 (clone operation failed and did
# not modify our file).
od -t x1 $SCRATCH_MNT/foo2
$XFS_IO_PROG -c "pwrite -S 0xee 0 90" $SCRATCH_MNT/foo2 | _filter_xfs_io

# Test cloning the inline extent against a file which has a size of zero
# but has a prealloc extent. It should not be possible as well to clone
# the inline extent from file bar into this file.
$XFS_IO_PROG -f -c "falloc -k 0 1M" $SCRATCH_MNT/foo3 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo3

# Doing IO against any range in the first 4K of the file should work.
# Due to a past clone ioctl bug which allowed cloning the inline extent,
# these operations resulted in EIO errors.
echo "First 50 bytes of foo3 after clone operation:"
# Should not be able to read any bytes, file has 0 bytes i_size (the
# clone operation failed and did not modify our file).
od -t x1 $SCRATCH_MNT/foo3
$XFS_IO_PROG -c "pwrite -S 0xff 0 90" $SCRATCH_MNT/foo3 | _filter_xfs_io

# Test cloning the inline extent against a file which consists of a
# single inline extent that has a size not greater than the size of
# bar's inline extent (40 < 50).
# It should be possible to do the extent cloning from bar to this file.
$XFS_IO_PROG -f -c "pwrite -S 0x01 0 40" $SCRATCH_MNT/foo4 \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo4

# Doing IO against any range in the first 4K of the file should work.
echo "File foo4 data after clone operation:"
# Must match file bar's content.
od -t x1 $SCRATCH_MNT/foo4
$XFS_IO_PROG -c "pwrite -S 0x02 0 90" $SCRATCH_MNT/foo4 | _filter_xfs_io

# Test cloning the inline extent against a file which consists of a
# single inline extent that has a size greater than the size of bar's
# inline extent (60 > 50).
# It should not be possible to clone the inline extent from file bar
# into this file.
$XFS_IO_PROG -f -c "pwrite -S 0x03 0 60" $SCRATCH_MNT/foo5 \
| _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo5

# Reading the file should not fail.
echo "File foo5 data after clone operation:"
# Must have a size of 60 bytes, with all bytes having a value of 0x03
# (the clone operation failed and did not modify our file).
od -t x1 $SCRATCH_MNT/foo5

# Test cloning the inline extent against a file which has no extents but
# has a size greater than bar's inline extent (16K > 50).
# It should not be possible to clone the inline extent from file bar
# into this file.
$XFS_IO_PROG -f -c "truncate 16K" $SCRATCH_MNT/foo6 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo6

# Reading the file should not fail.
echo "File foo6 data after clone operation:"
# Must have a size of 16K, with all bytes having a value of 0x00 (the
# clone operation failed and did not modify our file).
od -t x1 $SCRATCH_MNT/foo6

# Test cloning the inline extent against a file which has no extents but
# has a size not greater than bar's inline extent (30 < 50).
# It should be possible to clone the inline extent from file bar into
# this file.
$XFS_IO_PROG -f -c "truncate 30" $SCRATCH_MNT/foo7 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo7

# Reading the file should not fail.
echo "File foo7 data after clone operation:"
# Must have a size of 50 bytes, with all bytes having a value of 0xbb.
od -t x1 $SCRATCH_MNT/foo7

# Test cloning the inline extent against a file which has a size not
# greater than the size of bar's inline extent (20 < 50) but has
# a prealloc extent that goes beyond the file's size. It should not be
# possible to clone the inline extent from bar into this file.
$XFS_IO_PROG -f -c "falloc -k 0 1M" \
-c "pwrite -S 0x88 0 20" \
$SCRATCH_MNT/foo8 | _filter_xfs_io
$CLONER_PROG -s 0 -d 0 -l 0 $SCRATCH_MNT/bar $SCRATCH_MNT/foo8

echo "File foo8 data after clone operation:"
# Must have a size of 20 bytes, with all bytes having a value of 0x88
# (the clone operation did not modify our file).
od -t x1 $SCRATCH_MNT/foo8

_scratch_unmount
}

echo -e "\nTesting without compression and without the no-holes feature...\n"
test_cloning_inline_extents

echo -e "\nTesting with compression and without the no-holes feature...\n"
test_cloning_inline_extents "" "-o compress"

echo -e "\nTesting without compression and with the no-holes feature...\n"
test_cloning_inline_extents "-O no-holes" ""

echo -e "\nTesting with compression and with the no-holes feature...\n"
test_cloning_inline_extents "-O no-holes" "-o compress"

status=0
exit

Signed-off-by: Filipe Manana
Signed-off-by: Greg Kroah-Hartman

Filipe Manana
2015-12-15 13:24:33 +0800

10 Dec, 2015

1 commit

669b3319d fs/proc, core/debug: Don't expose absolute kernel addresses via wchan

commit b2f73922d119686323f14fbbe46587f863852328 upstream.

So the /proc/PID/stat 'wchan' field (the 35th field, which contains
the absolute kernel address of the kernel function a task is blocked in)
leaks absolute kernel addresses to unprivileged user-space:

seq_put_decimal_ull(m, ' ', wchan);

The absolute address might also leak via /proc/PID/wchan as well, if
KALLSYMS is turned off or if the symbol lookup fails for some reason:

static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
			  struct pid *pid, struct task_struct *task)
{
	unsigned long wchan;
	char symname[KSYM_NAME_LEN];

	wchan = get_wchan(task);

	if (lookup_symbol_name(wchan, symname) < 0) {
		if (!ptrace_may_access(task, PTRACE_MODE_READ))
			return 0;
		seq_printf(m, "%lu", wchan);
	} else {
		seq_printf(m, "%s", symname);
	}

	return 0;
}

This isn't ideal, because for example it trivially leaks the KASLR offset
to any local attacker:

fomalhaut:~> printf "%016lx\n" $(cat /proc/$$/stat | cut -d' ' -f35)
ffffffff8123b380

Most real-life uses of wchan are symbolic:

ps -eo pid:10,tid:10,wchan:30,comm

and procps uses /proc/PID/wchan, not the absolute address in /proc/PID/stat:

triton:~/tip> strace -f ps -eo pid:10,tid:10,wchan:30,comm 2>&1 | grep wchan | tail -1
open("/proc/30833/wchan", O_RDONLY) = 6

There's one compatibility quirk here: procps relies on whether the
absolute value is non-zero - and we can provide that functionality
by outputting "0" or "1" depending on whether the task is blocked
(whether there's a wchan address).

These days there appears to be very little legitimate reason
user-space would be interested in the absolute address. The
absolute address is mostly historic: from the days when we
didn't have kallsyms and user-space procps had to do the
decoding itself via the System.map.

So this patch sets all numeric output to "0" or "1" and keeps only
symbolic output, in /proc/PID/wchan.

( The absolute sleep address can generally still be profiled via
perf, by tasks with sufficient privileges. )
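
A minimal user-space sketch, not part of the patch, of how a tool might read the wchan value from a /proc/PID/stat line. The helper name stat_wchan_field is hypothetical. It assumes wchan is field 35, matching the cut -f35 example above, and it skips past comm by searching for the last ')' because comm may itself contain spaces.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: store field 35 (wchan) of a /proc/PID/stat line
 * in *out. Returns 0 on success, -1 if the line is too short or malformed.
 */
int stat_wchan_field(const char *line, unsigned long long *out)
{
	/* comm (field 2) is parenthesised and may contain spaces. */
	const char *p = strrchr(line, ')');
	char *end;
	int field = 2;

	if (!p)
		return -1;
	p++;

	/* Step over one space-separated field at a time up to field 35. */
	while (field < 35) {
		p = strchr(p, ' ');
		if (!p)
			return -1;
		p++;
		field++;
	}

	*out = strtoull(p, &end, 10);
	return end == p ? -1 : 0;
}
```

On a kernel with this patch applied, the value read this way would be expected to be 0 or 1 rather than an address.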

Reviewed-by: Thomas Gleixner
Acked-by: Kees Cook
Acked-by: Linus Torvalds
Cc: Al Viro
Cc: Alexander Potapenko
Cc: Andrey Konovalov
Cc: Andrey Ryabinin
Cc: Andy Lutomirski
Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Denys Vlasenko
Cc: Dmitry Vyukov
Cc: Kostya Serebryany
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Peter Zijlstra
Cc: Sasha Levin
Cc: kasan-dev
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930135917.GA3285@gmail.com
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman

Ingo Molnar
2015-12-10 03:03:20 +0800

10 Nov, 2015

5 commits

ee03d02eb btrfs: fix possible leak in btrfs_ioctl_balance()

commit 0f89abf56abbd0e1c6e3cef9813e6d9f05383c1e upstream.

Commit 8eb934591f8b ("btrfs: check unsupported filters in balance
arguments") adds a jump to exit label out_bargs in case the argument
check fails. At this point in addition to the bargs memory, the
memory for struct btrfs_balance_control has already been allocated.
Ownership of bctl is passed to btrfs_balance() in the good case,
thus the memory is not freed due to the introduced jump. Make sure
that the memory gets freed in any case as necessary. Detected by
Coverity CID 1328378.
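
The leak pattern described above can be sketched in user space as follows. This is only an illustration under stated assumptions: every demo_ name is hypothetical, the flag mask is invented, and the counter demo_live exists purely to make the leak observable. The point is that an error exit taken after the allocation, but before ownership is handed over, must free the object itself.

```c
#include <errno.h>
#include <stdlib.h>

static int demo_live;	/* outstanding allocations, for demonstration only */

struct demo_ctl {
	unsigned int flags;
};

static struct demo_ctl *demo_alloc(void)
{
	struct demo_ctl *ctl = calloc(1, sizeof(*ctl));

	if (ctl)
		demo_live++;
	return ctl;
}

static void demo_free(struct demo_ctl *ctl)
{
	if (ctl) {
		free(ctl);
		demo_live--;
	}
}

/* Hypothetical consumer that takes ownership of ctl and releases it. */
static int demo_consume(struct demo_ctl *ctl)
{
	demo_free(ctl);
	return 0;
}

int demo_ioctl(unsigned int flags)
{
	struct demo_ctl *ctl = demo_alloc();
	int ret;

	if (!ctl)
		return -ENOMEM;
	ctl->flags = flags;

	/* Argument check that fails after ctl was already allocated. */
	if (flags & ~0x3u) {
		ret = -EINVAL;
		goto out_ctl;
	}

	/* Good case: ownership of ctl passes to the consumer. */
	return demo_consume(ctl);

out_ctl:
	/* Error case: nobody else owns ctl yet, so free it here. */
	demo_free(ctl);
	return ret;
}
```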

Signed-off-by: Christian Engelmayer
Reviewed-by: David Sterba
Signed-off-by: Chris Mason
Signed-off-by: Greg Kroah-Hartman

Christian Engelmayer
2015-11-10 06:33:39 +0800
7fd58acc9 ovl: fix dentry reference leak

commit ab79efab0a0ba01a74df782eb7fa44b044dae8b5 upstream.

In ovl_copy_up_locked(), newdentry is leaked if the function exits through
out_cleanup as this just jumps to out after calling ovl_cleanup() - which doesn't
actually release the ref on newdentry.

The out_cleanup segment should instead exit through out2 as certainly
newdentry leaks - and possibly upper does also, though this isn't caught
given the catch of newdentry.

Without this fix, something like the following is seen:

BUG: Dentry ffff880023e9eb20{i=f861,n=#ffff880023e82d90} still in use (1) [unmount of tmpfs tmpfs]
BUG: Dentry ffff880023ece640{i=0,n=bigfile} still in use (1) [unmount of tmpfs tmpfs]

when unmounting the upper layer after an error occurred in copyup.

An error can be induced by creating a big file in a lower layer with
something like:

dd if=/dev/zero of=/lower/a/bigfile bs=65536 count=1 seek=$((0xf000))

to create a large file (4.1G). Overlay an upper layer that is too small
(one on tmpfs might do) and then induce a copy up by opening it writably.

Reported-by: Ulrich Obergfell
Signed-off-by: David Howells
Signed-off-by: Miklos Szeredi
Signed-off-by: Greg Kroah-Hartman

David Howells
2015-11-10 06:33:38 +0800
aa637cda1 ovl: use O_LARGEFILE in ovl_copy_up()

commit 0480334fa60488d12ae101a02d7d9e1a3d03d7dd upstream.

Open the lower file with O_LARGEFILE in ovl_copy_up().

Pass O_LARGEFILE unconditionally in ovl_copy_up_data() as it's purely for
catching 32-bit userspace dealing with a file large enough that it'll be
mishandled if the application isn't aware that there might be an integer
overflow. Inside the kernel, there shouldn't be any problems.

Reported-by: Ulrich Obergfell
Signed-off-by: David Howells
Signed-off-by: Miklos Szeredi
Signed-off-by: Greg Kroah-Hartman

David Howells
2015-11-10 06:33:38 +0800
5c418f1bd ovl: free lower_mnt array in ovl_put_super

commit 5ffdbe8bf1e485026e1c7e4714d2841553cf0b40 upstream.

This fixes a memory leak after umount.

Kmemleak report:

unreferenced object 0xffff8800ba791010 (size 8):
comm "mount", pid 2394, jiffies 4294996294 (age 53.920s)
hex dump (first 8 bytes):
20 1c 13 02 00 88 ff ff .......
backtrace:
[] create_object+0x124/0x2c0
[] kmemleak_alloc+0x7b/0xc0
[] __kmalloc+0x106/0x340
[] ovl_fill_super+0x55c/0x9b0 [overlay]
[] mount_nodev+0x54/0xa0
[] ovl_mount+0x18/0x20 [overlay]
[] mount_fs+0x43/0x170
[] vfs_kern_mount+0x74/0x170
[] do_mount+0x22d/0xdf0
[] SyS_mount+0x7b/0xc0
[] entry_SYSCALL_64_fastpath+0x12/0x76
[] 0xffffffffffffffff

Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Miklos Szeredi
Fixes: dd662667e6d3 ("ovl: add mutli-layer infrastructure")
Signed-off-by: Greg Kroah-Hartman

Konstantin Khlebnikov
2015-11-10 06:33:38 +0800
a03bd0e03 ovl: free stack of paths in ovl_fill_super

commit 0f95502ad84874b3c05fc7cdd9d4d9d5cddf7859 upstream.

This fixes a small memory leak after mount.

Kmemleak report:

unreferenced object 0xffff88003683fe00 (size 16):
comm "mount", pid 2029, jiffies 4294909563 (age 33.380s)
hex dump (first 16 bytes):
20 27 1f bb 00 88 ff ff 40 4b 0f 36 02 88 ff ff '......@K.6....
backtrace:
[] create_object+0x124/0x2c0
[] kmemleak_alloc+0x7b/0xc0
[] __kmalloc+0x106/0x340
[] ovl_fill_super+0x389/0x9a0 [overlay]
[] mount_nodev+0x54/0xa0
[] ovl_mount+0x18/0x20 [overlay]
[] mount_fs+0x43/0x170
[] vfs_kern_mount+0x74/0x170
[] do_mount+0x22d/0xdf0
[] SyS_mount+0x7b/0xc0
[] entry_SYSCALL_64_fastpath+0x12/0x76
[] 0xffffffffffffffff

Signed-off-by: Konstantin Khlebnikov
Signed-off-by: Miklos Szeredi
Fixes: a78d9f0d5d5c ("ovl: support multiple lower layers")
Signed-off-by: Greg Kroah-Hartman

Konstantin Khlebnikov
2015-11-10 06:33:37 +0800

27 Oct, 2015

7 commits

41c4e0825 nfs4: have do_vfs_lock take an inode pointer

commit 83bfff23e9ed19f37c4ef0bba84e75bd88e5cf21 upstream.

Now that we have file locking helpers that can deal with an inode
instead of a filp, we can change the NFSv4 locking code to use that
instead.

This should fix the case where we have a filp that is closed while flock
or OFD locks are set on it, and the task is signaled so that it doesn't
wait for the LOCKU reply to come in before the filp is freed. At that
point we can end up with a use-after-free with the current code, which
relies on dereferencing the fl_file in the lock request.

Signed-off-by: Jeff Layton
Reviewed-by: "J. Bruce Fields"
Tested-by: "J. Bruce Fields"
Cc: William Dauchy
Signed-off-by: Greg Kroah-Hartman

Jeff Layton
2015-10-27 08:52:00 +0800
c7fc0d838 locks: inline posix_lock_file_wait and flock_lock_file_wait

commit ee296d7c5709440f8abd36b5b65c6b3e388538d9 upstream.

They just call file_inode and then the corresponding *_inode_file_wait
function. Just make them static inlines instead.

Signed-off-by: Jeff Layton
Cc: William Dauchy
Signed-off-by: Greg Kroah-Hartman

Jeff Layton
2015-10-27 08:52:00 +0800
b2540f146 locks: new helpers - flock_lock_inode_wait and posix_lock_inode_wait

commit 29d01b22eaa18d8b46091d3c98c6001c49f78e4a upstream.

Allow callers to pass in an inode instead of a filp.

Signed-off-by: Jeff Layton
Reviewed-by: "J. Bruce Fields"
Tested-by: "J. Bruce Fields"
Cc: William Dauchy
Signed-off-by: Greg Kroah-Hartman

Jeff Layton
2015-10-27 08:52:00 +0800
0bdb53e1b locks: have flock_lock_file take an inode pointer instead of a filp

commit bcd7f78d078ff6197715c1ed070c92aca57ec12c upstream.

...and rename it to better describe how it works.

In order to fix a use-after-free in NFS, we need to be able to remove
locks from an inode after the filp associated with them may have already
been freed. flock_lock_file already only dereferences the filp to get to
the inode, so just change it so the callers do that.

All of the callers already pass in a lock request that has the fl_file
set properly, so we don't need to pass it in individually. With that
change it now only dereferences the filp to get to the inode, so just
push that out to the callers.

Signed-off-by: Jeff Layton
Reviewed-by: "J. Bruce Fields"
Tested-by: "J. Bruce Fields"
Cc: William Dauchy
Signed-off-by: Greg Kroah-Hartman

Jeff Layton
2015-10-27 08:51:59 +0800
271759afb nfsd/blocklayout: accept any minlength

commit 8c3ad9cb7343dc5f61b8cf3cdbe1016c5e7c2c8b upstream.

Recent Linux clients have started to send GETLAYOUT requests with
minlength less than blocksize.

Servers aren't really allowed to impose this kind of restriction on
layouts; see RFC 5661 section 18.43.3 for details.

This has been observed to cause indefinite hangs on fsx runs on some
clients.

Signed-off-by: Christoph Hellwig
Signed-off-by: J. Bruce Fields
Signed-off-by: Greg Kroah-Hartman

Christoph Hellwig
2015-10-27 08:51:54 +0800
6780e0d1b btrfs: fix use after free iterating extrefs

commit dc6c5fb3b514221f2e9d21ee626a9d95d3418dff upstream.

The code for btrfs inode-resolve has never worked properly for
files with enough hard links to trigger extrefs. It was trying to
get the leaf out of a path after freeing the path:

btrfs_release_path(path);
leaf = path->nodes[0];
item_size = btrfs_item_size_nr(leaf, slot);

The fix here is to use the extent buffer we cloned just a little higher
up to avoid deadlocks caused by using the leaf in the path.

Signed-off-by: Chris Mason
cc: Mark Fasheh
Reviewed-by: Filipe Manana
Reviewed-by: Mark Fasheh
Signed-off-by: Greg Kroah-Hartman

Chris Mason
2015-10-27 08:51:54 +0800
2382a147e btrfs: check unsupported filters in balance arguments

commit 8eb934591f8bf584969454a658f629cd06e59f3a upstream.

We don't verify that all the balance filter arguments supplemented by
the flags are actually known to the kernel. Thus we let it silently pass
and do nothing.

At the moment this means only the 'limit' filter, but we're going to add
a few more soon so it's better to have that fixed. This should also go to
older stable kernels so that they work with newer userspace tools.
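
The kind of check described here can be sketched as a mask test: any requested bit outside the set the kernel knows about is rejected instead of silently ignored. The flag names and bit positions below are hypothetical and are not the btrfs definitions, and the choice of -EINVAL as the error is an assumption made for illustration.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical filter flags, for illustration only. */
#define DEMO_FILTER_PROFILES	(1ULL << 0)
#define DEMO_FILTER_USAGE	(1ULL << 1)
#define DEMO_FILTER_LIMIT	(1ULL << 5)

#define DEMO_FILTER_KNOWN \
	(DEMO_FILTER_PROFILES | DEMO_FILTER_USAGE | DEMO_FILTER_LIMIT)

/* Return 0 if every requested bit is known, -EINVAL otherwise. */
int demo_check_filters(uint64_t flags)
{
	if (flags & ~DEMO_FILTER_KNOWN)
		return -EINVAL;
	return 0;
}
```

With a check of this shape, newer userspace that passes a filter the running kernel does not implement gets an error back rather than a balance that quietly skips the filter.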

Signed-off-by: David Sterba
Signed-off-by: Chris Mason
Signed-off-by: Greg Kroah-Hartman

David Sterba
2015-10-27 08:51:54 +0800

23 Oct, 2015

11 commits

2058efbcb namei: results of d_is_negative() should be checked after dentry revalidation

commit daf3761c9fcde0f4ca64321cbed6c1c86d304193 upstream.

Leandro Awa writes:
"After switching to version 4.1.6, our parallelized and distributed
workflows now fail consistently with errors of the form:

T34: ./regex.c:39:22: error: config.h: No such file or directory

From our 'git bisect' testing, the following commit appears to be the
possible cause of the behavior we've been seeing: commit 766c4cbfacd8"

Al Viro says:
"What happens is that 766c4cbfacd8 got the things subtly wrong.

We used to treat d_is_negative() after lookup_fast() as "fall with
ENOENT". That was wrong - checking ->d_flags outside of ->d_seq
protection is unreliable and failing with hard error on what should've
fallen back to non-RCU pathname resolution is a bug.

Unfortunately, we'd pulled the test too far up and ran afoul of
another kind of staleness. The dentry might have been absolutely
stable from the RCU point of view (and we might be on UP, etc), but
stale from the remote fs point of view. If ->d_revalidate() returns
"it's actually stale", dentry gets thrown away and the original code
wouldn't even have looked at its ->d_flags.

What we need is to check ->d_flags where 766c4cbfacd8 does (prior to
->d_seq validation) but only use the result in cases where we do not
discard this dentry outright"

Reported-by: Leandro Awa
Link: https://bugzilla.kernel.org/show_bug.cgi?id=104911
Fixes: 766c4cbfacd8 ("namei: d_is_negative() should be checked...")
Tested-by: Leandro Awa
Signed-off-by: Trond Myklebust
Acked-by: Al Viro
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2015-10-23 05:43:26 +0800
863e9b4f5 nfs/filelayout: Fix NULL reference caused by double freeing of fh_array

commit 3ec0c97959abff33a42db9081c22132bcff5b4f2 upstream.

If filelayout_decode_layout fails, _filelayout_free_lseg will cause
a double free of fh_array.

[ 1179.279800] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 1179.280198] IP: [] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files]
[ 1179.281010] PGD 0
[ 1179.281443] Oops: 0000 [#1]
[ 1179.281831] Modules linked in: nfs_layout_nfsv41_files(OE) nfsv4(OE) nfs(OE) fscache(E) xfs libcrc32c coretemp nfsd crct10dif_pclmul ppdev crc32_pclmul crc32c_intel auth_rpcgss ghash_clmulni_intel nfs_acl lockd vmw_balloon grace sunrpc parport_pc vmw_vmci parport shpchp i2c_piix4 vmwgfx drm_kms_helper ttm drm serio_raw mptspi scsi_transport_spi mptscsih e1000 mptbase ata_generic pata_acpi [last unloaded: fscache]
[ 1179.283891] CPU: 0 PID: 13336 Comm: cat Tainted: G OE 4.3.0-rc1-pnfs+ #244
[ 1179.284323] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 05/20/2014
[ 1179.285206] task: ffff8800501d48c0 ti: ffff88003e3c4000 task.ti: ffff88003e3c4000
[ 1179.285668] RIP: 0010:[] [] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files]
[ 1179.286612] RSP: 0018:ffff88003e3c77f8 EFLAGS: 00010202
[ 1179.287092] RAX: 0000000000000000 RBX: ffff88001fe78900 RCX: 0000000000000000
[ 1179.287731] RDX: ffffea0000f40760 RSI: ffff88001fe789c8 RDI: ffff88001fe789c0
[ 1179.288383] RBP: ffff88003e3c7810 R08: ffffea0000f40760 R09: 0000000000000000
[ 1179.289170] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88001fe789c8
[ 1179.289959] R13: ffff88001fe789c0 R14: ffff88004ec05a80 R15: ffff88004f935b88
[ 1179.290791] FS: 00007f4e66bb5700(0000) GS:ffffffff81c29000(0000) knlGS:0000000000000000
[ 1179.291580] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1179.292209] CR2: 0000000000000000 CR3: 00000000203f8000 CR4: 00000000001406f0
[ 1179.292731] Stack:
[ 1179.293195] ffff88001fe78900 00000000000000d0 ffff88001fe78178 ffff88003e3c7868
[ 1179.293676] ffffffffa0272737 0000000000000001 0000000000000001 ffff88001fe78800
[ 1179.294151] 00000000614fffce ffffffff81727671 ffff88001fe78100 ffff88001fe78100
[ 1179.294623] Call Trace:
[ 1179.295092] [] filelayout_alloc_lseg+0xa7/0x2d0 [nfs_layout_nfsv41_files]
[ 1179.295625] [] ? out_of_line_wait_on_bit+0x81/0xb0
[ 1179.296133] [] pnfs_layout_process+0xae/0x320 [nfsv4]
[ 1179.296632] [] nfs4_proc_layoutget+0x2b1/0x360 [nfsv4]
[ 1179.297134] [] pnfs_update_layout+0x853/0xb30 [nfsv4]
[ 1179.297632] [] ? nfs_get_lock_context+0x74/0x170 [nfs]
[ 1179.298158] [] filelayout_pg_init_read+0x37/0x50 [nfs_layout_nfsv41_files]
[ 1179.298834] [] __nfs_pageio_add_request+0x119/0x460 [nfs]
[ 1179.299385] [] ? nfs_create_request.part.9+0x37/0x2e0 [nfs]
[ 1179.299872] [] nfs_pageio_add_request+0xa3/0x1b0 [nfs]
[ 1179.300362] [] readpage_async_filler+0x85/0x260 [nfs]
[ 1179.300907] [] read_cache_pages+0x91/0xd0
[ 1179.301391] [] ? nfs_read_completion+0x220/0x220 [nfs]
[ 1179.301867] [] nfs_readpages+0x128/0x200 [nfs]
[ 1179.302330] [] __do_page_cache_readahead+0x203/0x280
[ 1179.302784] [] ? __do_page_cache_readahead+0xd8/0x280
[ 1179.303413] [] ondemand_readahead+0x1a6/0x2f0
[ 1179.303855] [] page_cache_sync_readahead+0x31/0x50
[ 1179.304286] [] generic_file_read_iter+0x4a6/0x5c0
[ 1179.304711] [] ? __nfs_revalidate_mapping+0x1f6/0x240 [nfs]
[ 1179.305132] [] nfs_file_read+0x52/0xa0 [nfs]
[ 1179.305540] [] __vfs_read+0xcc/0x100
[ 1179.305936] [] vfs_read+0x85/0x130
[ 1179.306326] [] SyS_read+0x58/0xd0
[ 1179.306708] [] entry_SYSCALL_64_fastpath+0x12/0x76
[ 1179.307094] Code: c4 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 8b 07 49 89 f4 85 c0 74 47 48 8b 06 49 89 fd 8b 38 48 85 ff 74 22 31 db eb 0c 48 63 d3 48 8b 3c d0 48 85
[ 1179.308357] RIP [] filelayout_free_fh_array.isra.11+0x1d/0x70 [nfs_layout_nfsv41_files]
[ 1179.309177] RSP
[ 1179.309582] CR2: 0000000000000000

Signed-off-by: Kinglong Mee
Signed-off-by: Trond Myklebust
Cc: William Dauchy
Signed-off-by: Greg Kroah-Hartman

Kinglong Mee
2015-10-23 05:43:26 +0800
aaf19f122 fix a braino in ovl_d_select_inode()

commit 9391dd00d13c853ab4f2a85435288ae2202e0e43 upstream.

when opening a directory we want the overlayfs inode, not one from
the topmost layer.

Reported-By: Andrey Jr. Melnikov
Tested-By: Andrey Jr. Melnikov
Signed-off-by: Al Viro
Cc: "Kamata, Munehisa"
Signed-off-by: Greg Kroah-Hartman

Al Viro
2015-10-23 05:43:26 +0800
9abb3b810 overlayfs: Make f_path always point to the overlay and f_inode to the underlay

commit 4bacc9c9234c7c8eec44f5ed4e960d9f96fa0f01 upstream.

Make file->f_path always point to the overlay dentry so that the path in
/proc/pid/fd is correct and to ensure that label-based LSMs have access to the
overlay as well as the underlay (path-based LSMs probably don't need it).

Using my union testsuite to set things up, before the patch I see:

[root@andromeda union-testsuite]# bash 5 /a/foo107
[root@andromeda union-testsuite]# stat /mnt/a/foo107
...
Device: 23h/35d Inode: 13381 Links: 1
...
[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
...
Device: 23h/35d Inode: 13381 Links: 1
...

After the patch:

[root@andromeda union-testsuite]# bash 5 /mnt/a/foo107
[root@andromeda union-testsuite]# stat /mnt/a/foo107
...
Device: 23h/35d Inode: 40346 Links: 1
...
[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
...
Device: 23h/35d Inode: 40346 Links: 1
...

Note the change in where /proc/$$/fd/5 points to in the ls command. It was
pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107
(which is correct).

The inode accessed, however, is the lower layer. The union layer is on device
25h/37d and the upper layer on 24h/36d.

Signed-off-by: David Howells
Signed-off-by: Al Viro
Cc: "Kamata, Munehisa"
Signed-off-by: Greg Kroah-Hartman

David Howells
2015-10-23 05:43:26 +0800
0d2ea357d overlay: Call ovl_drop_write() earlier in ovl_dentry_open()

commit f25801ee4680ef1db21e15c112e6e5fe3ffe8da5 upstream.

Call ovl_drop_write() earlier in ovl_dentry_open() before we call vfs_open()
as we've done the copy up for which we needed the freeze-write lock by that
point.

Signed-off-by: David Howells
Signed-off-by: Al Viro
Cc: "Kamata, Munehisa"
Signed-off-by: Greg Kroah-Hartman

David Howells
2015-10-23 05:43:26 +0800
eed13ce27 vfs: Test for and handle paths that are unreachable from their mnt_root

commit 397d425dc26da728396e66d392d5dcb8dac30c37 upstream.

In rare cases a directory can be renamed out from under a bind mount.
In those cases without special handling it becomes possible to walk up
the directory tree to the root dentry of the filesystem and down
from the root dentry to every other file or directory on the filesystem.

Like division by zero, ".." from an unconnected path cannot be given
a useful semantic, as there is no predicting at which path component
the code will realize it is unconnected. We certainly cannot match
the current behavior, as the current behavior is a security hole.

Therefore, when encountering ".." while following an unconnected path,
return -ENOENT.

- Add a function path_connected to verify path->dentry is reachable
from path->mnt.mnt_root. AKA to validate that rename did not do
something nasty to the bind mount.

To avoid races, path_connected must be called after following a path
component to its next path component.
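
As a toy model of the reachability test described above (not the kernel's implementation), the check can be pictured as walking parent links upward from a node and asking whether the mount's root is met before the filesystem root. The struct and function names are hypothetical, and the model assumes the filesystem root is its own parent.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a dentry: only the parent link matters here. */
struct demo_node {
	struct demo_node *parent;	/* the filesystem root points to itself */
};

/* Walk parent links from node; true if mnt_root is reached on the way up. */
bool demo_path_connected(const struct demo_node *mnt_root,
			 const struct demo_node *node)
{
	while (node != mnt_root) {
		if (node->parent == node)	/* reached the filesystem root */
			return false;
		node = node->parent;
	}
	return true;
}
```

In this model, renaming a directory out from under the bind-mounted subtree changes its parent link, after which the walk no longer meets mnt_root and a ".." lookup would be refused.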

Signed-off-by: "Eric W. Biederman"
Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2015-10-23 05:43:25 +0800
6f4e45e35 dcache: Handle escaped paths in prepend_path

commit cde93be45a8a90d8c264c776fab63487b5038a65 upstream.

A rename can result in a dentry that by walking up d_parent
will never reach its mnt_root. For lack of a better term
I call this an escaped path.

prepend_path is called by four different functions __d_path,
d_absolute_path, d_path, and getcwd.

__d_path only wants to see paths that are connected to the root it passes
in. So __d_path needs prepend_path to return an error.

d_absolute_path similarly wants to see paths that are connected to
some root. Escaped paths are not connected to any mnt_root so
d_absolute_path needs prepend_path to return an error greater
than 1. So escaped paths will be treated like paths on lazily
unmounted mounts.

getcwd needs to prepend "(unreachable)" so getcwd also needs
prepend_path to return an error.

d_path is the interesting hold out. d_path just wants to print
something, and does not care about the weird cases. Which raises
the question what should be printed?

Given that <escaped_path>/<anything> should result in -ENOENT I
believe it is desirable for escaped paths to be printed as empty
paths, as there are not really any meaningful path components when
considered from the perspective of a mount tree.

So tweak prepend_path to return an empty path with a new error
code of 3 when it encounters an escaped path.

Signed-off-by: "Eric W. Biederman"
Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2015-10-23 05:43:25 +0800
207663ca0 UBIFS: Kill unneeded locking in ubifs_init_security

commit cf6f54e3f133229f02a90c04fe0ff9dd9d3264b4 upstream.

Fixes the following lockdep splat:
[ 1.244527] =============================================
[ 1.245193] [ INFO: possible recursive locking detected ]
[ 1.245193] 4.2.0-rc1+ #37 Not tainted
[ 1.245193] ---------------------------------------------
[ 1.245193] cp/742 is trying to acquire lock:
[ 1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] ubifs_init_security+0x29/0xb0
[ 1.245193]
[ 1.245193] but task is already holding lock:
[ 1.245193] (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] path_openat+0x3af/0x1280
[ 1.245193]
[ 1.245193] other info that might help us debug this:
[ 1.245193] Possible unsafe locking scenario:
[ 1.245193]
[ 1.245193] CPU0
[ 1.245193] ----
[ 1.245193] lock(&sb->s_type->i_mutex_key#9);
[ 1.245193] lock(&sb->s_type->i_mutex_key#9);
[ 1.245193]
[ 1.245193] *** DEADLOCK ***
[ 1.245193]
[ 1.245193] May be due to missing lock nesting notation
[ 1.245193]
[ 1.245193] 2 locks held by cp/742:
[ 1.245193] #0: (sb_writers#5){.+.+.+}, at: [] mnt_want_write+0x1f/0x50
[ 1.245193] #1: (&sb->s_type->i_mutex_key#9){+.+.+.}, at: [] path_openat+0x3af/0x1280
[ 1.245193]
[ 1.245193] stack backtrace:
[ 1.245193] CPU: 2 PID: 742 Comm: cp Not tainted 4.2.0-rc1+ #37
[ 1.245193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140816_022509-build35 04/01/2014
[ 1.245193] ffffffff8252d530 ffff88007b023a38 ffffffff814f6f49 ffffffff810b56c5
[ 1.245193] ffff88007c30cc80 ffff88007b023af8 ffffffff810a150d ffff88007b023a68
[ 1.245193] 000000008101302a ffff880000000000 00000008f447e23f ffffffff8252d500
[ 1.245193] Call Trace:
[ 1.245193] [] dump_stack+0x4c/0x65
[ 1.245193] [] ? console_unlock+0x1c5/0x510
[ 1.245193] [] __lock_acquire+0x1a6d/0x1ea0
[ 1.245193] [] ? __lock_is_held+0x58/0x80
[ 1.245193] [] lock_acquire+0xd3/0x270
[ 1.245193] [] ? ubifs_init_security+0x29/0xb0
[ 1.245193] [] mutex_lock_nested+0x6b/0x3a0
[ 1.245193] [] ? ubifs_init_security+0x29/0xb0
[ 1.245193] [] ? ubifs_init_security+0x29/0xb0
[ 1.245193] [] ubifs_init_security+0x29/0xb0
[ 1.245193] [] ubifs_create+0xa6/0x1f0
[ 1.245193] [] ? path_openat+0x3af/0x1280
[ 1.245193] [] vfs_create+0x95/0xc0
[ 1.245193] [] path_openat+0x7cc/0x1280
[ 1.245193] [] ? __lock_acquire+0x543/0x1ea0
[ 1.245193] [] ? sched_clock_cpu+0x90/0xc0
[ 1.245193] [] ? calc_global_load_tick+0x60/0x90
[ 1.245193] [] ? sched_clock_cpu+0x90/0xc0
[ 1.245193] [] ? __alloc_fd+0xaf/0x180
[ 1.245193] [] do_filp_open+0x75/0xd0
[ 1.245193] [] ? _raw_spin_unlock+0x26/0x40
[ 1.245193] [] ? __alloc_fd+0xaf/0x180
[ 1.245193] [] do_sys_open+0x129/0x200
[ 1.245193] [] SyS_open+0x19/0x20
[ 1.245193] [] entry_SYSCALL_64_fastpath+0x12/0x6f

While the lockdep splat is a false positive, because path_openat holds i_mutex
of the parent directory and ubifs_init_security() tries to acquire i_mutex
of a new inode, it reveals that taking i_mutex in ubifs_init_security() is
in vain because it is only being called in the inode allocation path
and therefore nobody else can see the inode yet.

Reported-and-tested-by: Boris Brezillon
Reviewed-and-tested-by: Dongsheng Yang
Signed-off-by: Richard Weinberger
Signed-off-by: dedekind1@gmail.com
Signed-off-by: Greg Kroah-Hartman

Richard Weinberger
2015-10-23 05:43:24 +0800
22ee53c3b cifs: use server timestamp for ntlmv2 authentication

commit 98ce94c8df762d413b3ecb849e2b966b21606d04 upstream.

Linux cifs mount with ntlmssp against a Mac OS X (Yosemite
10.10.5) share fails in case the clocks differ more than +/-2h:

digest-service: digest-request: od failed with 2 proto=ntlmv2
digest-service: digest-request: kdc failed with -1561745592 proto=ntlmv2

Fix this by (re-)using the given server timestamp for the
ntlmv2 authentication (as Windows 7 does).

A related problem was also reported earlier by Namjae Jeon (see below):

Windows machines have an extended security feature which refuses to allow
authentication when there is a time difference between server time and
client time when ntlmv2 negotiation is used. This problem is prevalent
in embedded environments where system time is set to default 1970.

Modern servers send the server timestamp in the TargetInfo Av_Pair
structure in the challenge message [see MS-NLMP 2.2.2.1].
In [MS-NLMP 3.1.5.1.2] it is explicitly mentioned that the client must
use the server provided timestamp if present OR current time if it is
not.
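
The selection rule quoted above can be sketched as follows. This is an illustration, not the cifs code: the demo_ names are hypothetical, and the sketch assumes the timestamp is carried as NT time, i.e. a count of 100 ns intervals since 1601-01-01 UTC, so the local fallback has to be converted from Unix seconds.

```c
#include <stdbool.h>
#include <stdint.h>

/* Seconds between 1601-01-01 (NT epoch) and 1970-01-01 (Unix epoch). */
#define DEMO_NT_EPOCH_OFFSET	11644473600ULL

/* Convert Unix seconds to NT time (100 ns ticks since 1601-01-01 UTC). */
uint64_t demo_unix_to_nt(uint64_t unix_secs)
{
	return (unix_secs + DEMO_NT_EPOCH_OFFSET) * 10000000ULL;
}

/*
 * Prefer the timestamp from the server's challenge when one was sent;
 * otherwise fall back to the local clock.
 */
uint64_t demo_ntlmv2_time(bool have_server_time, uint64_t server_nt_time,
			  uint64_t local_unix_secs)
{
	if (have_server_time)
		return server_nt_time;
	return demo_unix_to_nt(local_unix_secs);
}
```

Using the server's own value means a client whose clock is far off, such as an embedded board still at 1970, presents a timestamp the server will accept.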

Reported-by: Namjae Jeon
Signed-off-by: Peter Seiderer
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Peter Seiderer
2015-10-23 05:43:21 +0800
5ceb4f5dd Do not fall back to SMBWriteX in set_file_size error cases

commit 646200a041203f440fb6fcf9cacd9efeda9de74c upstream.

The error paths in set_file_size for cifs and smb3 are incorrect.

In the unlikely event that a server did not support set file info
of the file size, the code incorrectly falls back to trying SMBWriteX
(note that only the original core SMB Write, used for example by DOS,
can set the file size this way - this actually does not work for the more
recent SMBWriteX). The idea was since the old DOS SMB Write could set
the file size if you write zero bytes at that offset then use that if
server rejects the normal set file info call.

Fortunately the SMBWriteX will never be sent on the wire (except when
file size is zero) since the length and offset fields were reversed
in the two places in this function that call SMBWriteX causing
the fall back path to return an error. It is also important to never call
an SMB request from an SMB2/SMB3 session (which theoretically would
be possible, and can cause a brief session drop, although the client
recovers) so this should be fixed. In practice this path does not happen
with modern servers but the error fall back to SMBWriteX is clearly wrong.

Remove the calls to SMBWriteX in the error paths in cifs_set_file_size.

Pointed out by PaX/grsecurity team

Signed-off-by: Steve French
Reported-by: PaX Team
CC: Emese Revfy
CC: Brad Spengler
Signed-off-by: Greg Kroah-Hartman

Steve French
2015-10-23 05:43:18 +0800
52213f9e4 disabling oplocks/leases via module parm enable_oplocks broken for SMB3

commit e0ddde9d44e37fbc21ce893553094ecf1a633ab5 upstream.

Leases (oplocks) were always requested for SMB2/SMB3 even when oplocks
were disabled in the cifs.ko module.

Signed-off-by: Steve French
Reviewed-by: Chandrika Srinivasan
Signed-off-by: Greg Kroah-Hartman

Steve French
2015-10-23 05:43:18 +0800