Eric Lee / smarc-fsl-linux-kernel

20 Mar, 2018

1 commit

2e2af2902 block: better op and flags encoding

Now that we don't need the common flags to overflow outside the range
of a 32-bit type we can encode them the same way for both the bio and
request fields. This in addition allows us to place the operation
first (and make some room for more ops while we're at it) and to
stop having to shift around the operation values.

In addition this allows passing around only one value in the block layer
instead of two (and eventually also in the file systems, but we can do
that later) and thus clean up a lot of code.

Last but not least this allows decreasing the size of the cmd_flags
field in struct request to 32 bits. Various functions passing this
value could also be updated, but I'd like to avoid the churn for now.
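
A minimal sketch of the encoding idea described above, not the kernel's
actual definitions: the operation sits in the low bits of a single 32-bit
value and the modifier flags occupy the bits above it, so no shifting of
operation values is needed. The 8-bit operation width and all demo_* and
DEMO_* names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: low 8 bits = operation, upper 24 bits = flags. */
#define DEMO_OP_BITS 8u
#define DEMO_OP_MASK ((1u << DEMO_OP_BITS) - 1u)

/* Hypothetical operations, stored unshifted in the low bits. */
enum demo_op { DEMO_OP_READ = 0, DEMO_OP_WRITE = 1, DEMO_OP_FLUSH = 2 };

/* Hypothetical flags, allocated from the first bit above the op field. */
#define DEMO_FLAG_SYNC (1u << (DEMO_OP_BITS + 0))
#define DEMO_FLAG_META (1u << (DEMO_OP_BITS + 1))

/* Combine an operation and its flags into one 32-bit value. */
static inline uint32_t demo_encode(enum demo_op op, uint32_t flags)
{
    assert(((uint32_t)op & ~DEMO_OP_MASK) == 0); /* op fits in its field */
    assert((flags & DEMO_OP_MASK) == 0);         /* flags stay out of it */
    return (uint32_t)op | flags;
}

/* Extract the operation: a mask, no shift. */
static inline enum demo_op demo_op_of(uint32_t v)
{
    return (enum demo_op)(v & DEMO_OP_MASK);
}

/* Extract the flags: everything above the operation field. */
static inline uint32_t demo_flags_of(uint32_t v)
{
    return v & ~DEMO_OP_MASK;
}
```

With this layout a caller passes one value around and splits it only where
the operation or the flags are actually inspected.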

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe
(cherry picked from commit ef295ecf090d3e86e5b742fc6ab34f1122a43773)

Conflicts:
block/blk-mq.c
include/linux/blk_types.h
include/linux/blkdev.h

Christoph Hellwig
2018-03-20 04:37:01 +0800

18 Mar, 2018

3 commits

2a28923bb NFS: Fix unstable write completion

commit c4f24df942a181699c5bab01b8e5e82b925f77f3 upstream.

We do want to respect the FLUSH_SYNC argument to nfs_commit_inode() to
ensure that all outstanding COMMIT requests to the inode in question are
complete. Currently we may exit early from both nfs_commit_inode() and
nfs_write_inode() even if there are COMMIT requests in flight, or unstable
writes on the commit list.

In order to get the right semantics w.r.t. sync_inode(), we don't need
to have nfs_commit_inode() reset the inode dirty flags when called from
nfs_wb_page() and/or nfs_wb_all(). We just need to ensure that
nfs_write_inode() leaves them in the right state if there are outstanding
commits, or stable pages.

Reported-by: Scott Mayhew
Fixes: dc4fd9ab01ab ("nfs: don't wait on commit in nfs_commit_inode()...")
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2018-03-18 18:18:54 +0800
fb1f410da NFS: Fix an incorrect type in struct nfs_direct_req

commit d9ee65539d3eabd9ade46cca1780e3309ad0f907 upstream.

The start offset needs to be of type loff_t.

Fixes: 5fadeb47dcc5c ("nfs: count DIO good bytes correctly with mirroring")
Cc: stable@vger.kernel.org # v4.0+
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2018-03-18 18:18:54 +0800
b496b24aa ext4: inplace xattr block update fails to deduplicate blocks

commit ec00022030da5761518476096626338bd67df57a upstream.

When an xattr block has a single reference, the block is updated in place
and reinserted into the cache. Later, a cache lookup is performed
to see whether an existing block has the same contents. Most of the time
this lookup returns the just-inserted entry, so deduplication is not
achieved.

Running the following test script will produce two xattr blocks, which
can be observed in the "File ACL: " line of debugfs output:

mke2fs -b 1024 -I 128 -F -O extent /dev/sdb 1G
mount /dev/sdb /mnt/sdb

touch /mnt/sdb/{x,y}

setfattr -n user.1 -v aaa /mnt/sdb/x
setfattr -n user.2 -v bbb /mnt/sdb/x

setfattr -n user.1 -v aaa /mnt/sdb/y
setfattr -n user.2 -v bbb /mnt/sdb/y

debugfs -R 'stat x' /dev/sdb | cat
debugfs -R 'stat y' /dev/sdb | cat

This patch defers the reinsertion to the cache so that we can locate
other blocks with the same contents.

Signed-off-by: Tahsin Erdogan
Signed-off-by: Theodore Ts'o
Reviewed-by: Andreas Dilger
Signed-off-by: Tommi Rantala
Signed-off-by: Greg Kroah-Hartman

Tahsin Erdogan
2018-03-18 18:18:54 +0800

11 Mar, 2018

1 commit

931dde83a btrfs: preserve i_mode if __btrfs_set_acl() fails

commit d7d824966530acfe32b94d1ed672e6fe1638cd68 upstream.

When changing a file's acl mask, btrfs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.

Prevent this by restoring the original mode bits if __btrfs_set_acl
fails.
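
The save-and-restore pattern described above can be sketched in plain C as
follows. This is an illustration under stated assumptions, not the btrfs
code: struct demo_inode, demo_store_acl() and demo_set_acl() are
hypothetical names, and the failure is simulated with a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for an inode; only the mode bits matter here. */
struct demo_inode {
    unsigned int mode;
};

/* Hypothetical stand-in for writing the ACL xattr; fails on request. */
static int demo_store_acl(struct demo_inode *inode, int simulate_failure)
{
    (void)inode;
    return simulate_failure ? -ENOSPC : 0;
}

/*
 * Update the mode first, then store the ACL. If storing fails, put the
 * original mode back so the mask bits are never mistaken for real group
 * permission bits.
 */
static int demo_set_acl(struct demo_inode *inode, unsigned int new_mode,
                        int simulate_failure)
{
    unsigned int old_mode;
    int ret;

    assert(inode != NULL);
    old_mode = inode->mode;

    inode->mode = new_mode;
    ret = demo_store_acl(inode, simulate_failure);
    if (ret)
        inode->mode = old_mode; /* roll back on failure */
    return ret;
}
```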

Signed-off-by: Ernesto A. Fernández
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Nikolay Borisov
Signed-off-by: Greg Kroah-Hartman

Ernesto A. Fernández
2018-03-11 23:21:35 +0800

03 Mar, 2018

4 commits

c33d49420 xfs: quota: check result of register_shrinker()

[ Upstream commit 3a3882ff26fbdbaf5f7e13f6a0bccfbf7121041d ]

xfs_qm_init_quotainfo() does not check result of register_shrinker()
which was tagged as __must_check recently, reported by sparse.

Signed-off-by: Aliaksei Karaliou
[darrick: move xfs_qm_destroy_quotainos nearer xfs_qm_init_quotainos]
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong

Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Aliaksei Karaliou
2018-03-03 17:23:25 +0800
799948750 xfs: quota: fix missed destroy of qi_tree_lock

[ Upstream commit 2196881566225f3c3428d1a5f847a992944daa5b ]

xfs_qm_destroy_quotainfo() destroys quotainfo->qi_quotaofflock but does
not destroy quotainfo->qi_tree_lock.

Signed-off-by: Aliaksei Karaliou
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Aliaksei Karaliou
2018-03-03 17:23:25 +0800
fd7cbb5ad sget(): handle failures of register_shrinker()

[ Upstream commit 9ee332d99e4d5a97548943b81c54668450ce641b ]

Signed-off-by: Al Viro
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Al Viro
2018-03-03 17:23:21 +0800
4a97b2d09 f2fs: fix a bug caused by NULL extent tree

commit dad48e73127ba10279ea33e6dbc8d3905c4d31c0 upstream.

Thread A: Thread B:

-f2fs_remount
-sbi->mount_opt.opt = 0;
extent_tree is NULL
-default_options && parse_options
-remount return

Signed-off-by: Jaegeuk Kim
Signed-off-by: Nikolay Borisov
Signed-off-by: Greg Kroah-Hartman

Yunlei He
2018-03-03 17:23:20 +0800

28 Feb, 2018

1 commit

f06c2c659 fs/dax.c: fix inefficiency in dax_writeback_mapping_range()

commit 1eb643d02b21412e603b42cdd96010a2ac31c05f upstream.

dax_writeback_mapping_range() fails to update iteration index when
searching radix tree for entries needing cache flushing. Thus each
pagevec worth of entries is searched starting from the start which is
inefficient and prone to livelocks. Update index properly.
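
A minimal sketch of the batched-scan pattern this fix concerns, assuming a
sorted array in place of the radix tree. demo_lookup() and
demo_writeback() are hypothetical helpers, not kernel functions; the point
is the line that advances the index past the last entry of each batch.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BATCH 4

/*
 * Hypothetical lookup: copy up to 'max' entries >= start from a sorted
 * array into 'out' and return how many were copied.
 */
static size_t demo_lookup(const unsigned long *tagged, size_t n,
                          unsigned long start, unsigned long *out,
                          size_t max)
{
    size_t found = 0;

    for (size_t i = 0; i < n && found < max; i++)
        if (tagged[i] >= start)
            out[found++] = tagged[i];
    return found;
}

/*
 * Visit every tagged index in [start, end] in batches. Returns the number
 * of lookups issued and stores the number of visited entries in *visited.
 * This sketch ignores wraparound at ULONG_MAX.
 */
static size_t demo_writeback(const unsigned long *tagged, size_t n,
                             unsigned long start, unsigned long end,
                             size_t *visited)
{
    unsigned long batch[DEMO_BATCH];
    unsigned long index = start;
    size_t lookups = 0;

    assert(tagged != NULL && visited != NULL);
    *visited = 0;

    for (;;) {
        size_t got = demo_lookup(tagged, n, index, batch, DEMO_BATCH);
        int done = 0;

        lookups++;
        if (got == 0)
            break;
        for (size_t i = 0; i < got; i++) {
            if (batch[i] > end) {
                done = 1;
                break;
            }
            (*visited)++; /* flush this entry */
        }
        if (done)
            break;
        /* Without this update every batch would restart at 'start'. */
        index = batch[got - 1] + 1;
    }
    return lookups;
}
```

If the index is never advanced, each lookup returns the same first batch
again, which is the inefficiency and livelock risk the message describes.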

Link: http://lkml.kernel.org/r/20170619124531.21491-1-jack@suse.cz
Fixes: 9973c98ecfda3 ("dax: add support for fsync/sync")
Signed-off-by: Jan Kara
Reviewed-by: Ross Zwisler
Cc: Dan Williams
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2018-02-28 17:18:33 +0800

25 Feb, 2018

3 commits

890c52ab3 binfmt_elf: compat: avoid unused function warning

When CONFIG_ELF_CORE is disabled, we get a harmless warning in the compat
version of binfmt_elf:

fs/compat_binfmt_elf.c:58:13: error: 'cputime_to_compat_timeval' defined but not used [-Werror=unused-function]

This was addressed in mainline Linux as part of a larger rework with commit
cd19c364b313 ("fs/binfmt: Convert obsolete cputime type to nsecs").

For 4.9 and earlier, this just shuts up the warning by adding an #ifdef
around the function definition.

Signed-off-by: Arnd Bergmann
Signed-off-by: Greg Kroah-Hartman

Arnd Bergmann
2018-02-25 18:05:55 +0800
445e8f85d reiserfs: avoid a -Wmaybe-uninitialized warning

commit ab4949640d6674b617b314ad3c2c00353304bab9 upstream.

The latest gcc-7.0.1 snapshot warns about an uninitialized variable use:

In file included from fs/reiserfs/lbalance.c:8:0:
fs/reiserfs/lbalance.c: In function 'leaf_item_bottle.isra.3':
fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized]
v2->v = (v2->v & cpu_to_le64(15ULL << 60)) | cpu_to_le64(offset);
~~^~~
fs/reiserfs/reiserfs.h:1279:13: error: '*((void *)&n_ih+8).v' may be used uninitialized in this function [-Werror=maybe-uninitialized]
v2->v = (v2->v & cpu_to_le64(15ULL << 60)) | cpu_to_le64(offset);

This happens because the offset/type pair that is stored in
ih.key.u.k_offset_v2 is actually uninitialized when we call
set_le_ih_k_offset() and set_le_ih_k_type(). After we have called both,
all data is correct, but the first of the two reads uninitialized data
for the type field and writes it back before it gets overwritten.

This works around the warning by initializing the k_offset_v2 through
the slightly larger memcpy().

[JK: Remove now unused define and make it obvious we initialize the key]

Signed-off-by: Arnd Bergmann
Signed-off-by: Jan Kara
Signed-off-by: Greg Kroah-Hartman

Arnd Bergmann
2018-02-25 18:05:53 +0800
1c3aae50c btrfs: Fix possible off-by-one in btrfs_search_path_in_tree

[ Upstream commit c8bcbfbd239ed60a6562964b58034ac8a25f4c31 ]

The name char array passed to btrfs_search_path_in_tree is of size
BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
are in the range of [0, 4079]. Currently the code uses the define but this
represents an off-by-one.

Implications:

Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be
written to extra space, not some padding that could be provided by the
allocator.

btrfs-progs store the arguments on stack, but kernel does own copy of
the ioctl buffer and the off-by-one overwrite does not affect userspace,
but the ending 0 might be lost.

Kernel ioctl buffer is allocated dynamically so we're overwriting
somebody else's memory, and the ioctl is privileged if args.objectid is
not 256. Which is in most cases, but resolving a subvolume stored in
another directory will trigger that path.

Before this patch the buffer was one byte larger, but then the -1 was
not added.
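
The indexing issue can be illustrated with a small buffer that is filled
from the end. This is a sketch under assumptions, not the btrfs code:
DEMO_PATH_MAX and demo_fill_from_end() are hypothetical, and the buffer
size is arbitrary. The terminator belongs at index size - 1, the last
valid element, not at index size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PATH_MAX 16 /* valid indexes are [0, DEMO_PATH_MAX - 1] */

/*
 * Place 's' right-aligned in 'buf', NUL-terminated, and return a pointer
 * to its first character, or NULL if it does not fit.
 */
static char *demo_fill_from_end(char buf[DEMO_PATH_MAX], const char *s)
{
    size_t len = strlen(s);
    /* Last valid element; &buf[DEMO_PATH_MAX] would be one past the end. */
    char *ptr = &buf[DEMO_PATH_MAX - 1];

    if (len > DEMO_PATH_MAX - 1)
        return NULL;

    *ptr = '\0';
    ptr -= len;
    memcpy(ptr, s, len);
    assert(ptr >= buf);
    return ptr;
}
```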

Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls")
Signed-off-by: Nikolay Borisov
Reviewed-by: David Sterba
[ added implications ]
Signed-off-by: David Sterba

Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Nikolay Borisov
2018-02-25 18:05:48 +0800

22 Feb, 2018

11 commits

012e79b98 vfs: don't do RCU lookup of empty pathnames

commit c0eb027e5aef70b71e5a38ee3e264dc0b497f343 upstream.

Normal pathname lookup doesn't allow empty pathnames, but using
AT_EMPTY_PATH (with name_to_handle_at() or fstatat(), for example) you
can trigger an empty pathname lookup.

And not only is the RCU lookup in that case entirely unnecessary
(because we'll obviously immediately finalize the end result), it is
actively wrong.

Why? An empty path is a special case that will return the original
'dirfd' dentry - and that dentry may not actually be RCU-free'd,
resulting in a potential use-after-free if we were to initialize the
path lazily under the RCU read lock and depend on complete_walk()
finalizing the dentry.

Found by syzkaller and KASAN.

Reported-by: Dmitry Vyukov
Reported-by: Vegard Nossum
Acked-by: Al Viro
Signed-off-by: Linus Torvalds
Cc: Eric Biggers
Signed-off-by: Greg Kroah-Hartman

Linus Torvalds
2018-02-22 22:43:55 +0800
9cb167400 ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

commit ff26cc10aec128c3f86b5611fd5f59c71d49c0e3 upstream.

If we can't get the inode lock immediately in
ocfs2_inode_lock_with_page() when reading a page, we should not return
directly, since this leads to a softlockup when the kernel is built
without CONFIG_PREEMPT. Instead, take a blocking lock and unlock it
immediately before returning. This avoids wasting CPU on repeated
retries, improves fairness in acquiring the lock among multiple nodes,
and increases efficiency when the same file is modified frequently from
multiple nodes.

The softlockup crash (with /proc/sys/kernel/softlockup_panic set to 1)
looks like:

Kernel panic - not syncing: softlockup: hung tasks
CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:

dump_stack+0x5c/0x82
panic+0xd5/0x21e
watchdog_timer_fn+0x208/0x210
__hrtimer_run_queues+0xcc/0x200
hrtimer_interrupt+0xa6/0x1f0
smp_apic_timer_interrupt+0x34/0x50
apic_timer_interrupt+0x96/0xa0

RIP: 0010:unlock_page+0x17/0x30
RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
ocfs2_readpage+0x41/0x2d0 [ocfs2]
filemap_fault+0x12b/0x5c0
ocfs2_fault+0x29/0xb0 [ocfs2]
__do_fault+0x1a/0xa0
__handle_mm_fault+0xbe8/0x1090
handle_mm_fault+0xaa/0x1f0
__do_page_fault+0x235/0x4b0
trace_do_page_fault+0x3c/0x110
async_page_fault+0x28/0x30
RIP: 0033:0x7fa75ded638e
RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000

As for the performance improvement, the test time is reduced and CPU
utilization decreases; the detailed data follows. I ran the multi_mmap
test case from the ocfs2-test package in a three-node cluster.

Before applying this patch:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2754 ocfs2te+ 20 0 170248 6980 4856 D 80.73 0.341 0:18.71 multi_mmap
1505 root rt 0 222236 123060 97224 S 2.658 6.015 0:01.44 corosync
5 root 20 0 0 0 0 S 1.329 0.000 0:00.19 kworker/u8:0
95 root 20 0 0 0 0 S 1.329 0.000 0:00.25 kworker/u8:1
2728 root 20 0 0 0 0 S 0.997 0.000 0:00.24 jbd2/sda1-33
2721 root 20 0 0 0 0 S 0.664 0.000 0:00.07 ocfs2dc-3C8CFD4
2750 ocfs2te+ 20 0 142976 4652 3532 S 0.664 0.227 0:00.28 mpirun

ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
Tests with "-b 4096 -C 32768"
Thu Dec 28 14:44:52 CST 2017
multi_mmap..................................................Passed.
Runtime 783 seconds.

After applying this patch:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2508 ocfs2te+ 20 0 170248 6804 4680 R 54.00 0.333 0:55.37 multi_mmap
155 root 20 0 0 0 0 S 2.667 0.000 0:01.20 kworker/u8:3
95 root 20 0 0 0 0 S 2.000 0.000 0:01.58 kworker/u8:1
2504 ocfs2te+ 20 0 142976 4604 3480 R 1.667 0.225 0:01.65 mpirun
5 root 20 0 0 0 0 S 1.000 0.000 0:01.36 kworker/u8:0
2482 root 20 0 0 0 0 S 1.000 0.000 0:00.86 jbd2/sda1-33
299 root 0 -20 0 0 0 S 0.333 0.000 0:00.13 kworker/2:1H
335 root 0 -20 0 0 0 S 0.333 0.000 0:00.17 kworker/1:1H
535 root 20 0 12140 7268 1456 S 0.333 0.355 0:00.34 haveged
1282 root rt 0 222284 123108 97224 S 0.333 6.017 0:01.33 corosync

ocfs2test@tb-node2:~>multiple_run.sh -i ens3 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C hacluster -s pcmk -n tb-node2,tb-node1,tb-node3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap /mnt/shared
Tests with "-b 4096 -C 32768"
Thu Dec 28 15:04:12 CST 2017
multi_mmap..................................................Passed.
Runtime 487 seconds.

Link: http://lkml.kernel.org/r/1514447305-30814-1-git-send-email-ghe@suse.com
Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
Signed-off-by: Gang He
Reviewed-by: Eric Ren
Acked-by: alex chen
Acked-by: piaojun
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Cc: Joseph Qi
Cc: Changwei Ge
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Gang He
2018-02-22 22:43:52 +0800
efb1cbc22 Btrfs: fix unexpected -EEXIST when creating new inode

commit 900c9981680067573671ecc5cbfa7c5770be3a40 upstream.

The highest objectid, which is assigned to a new inode, is decided at
the time of initializing fs roots. However, in cases where log replay
gets processed, the btree which the fs root owns might be changed, so we
have to search it again for the highest objectid; otherwise creating
a new inode would end up with -EEXIST.

cc: v4.4-rc6+
Fixes: f32e48e92596 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots")
Signed-off-by: Liu Bo
Reviewed-by: Josef Bacik
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Liu Bo
2018-02-22 22:43:50 +0800
b48edd6d7 Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly

commit e8f1bc1493855e32b7a2a019decc3c353d94daf6 upstream.

This regression is introduced in
commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction").

There are two problems,

a) it is ->destroy_inode() that does the final free on inode, not
->evict_inode(),
b) clear_inode() must be called before ->evict_inode() returns.

This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
in evict() because I_CLEAR is set in clear_inode().

Fixes: commit 3d48d9810de4 ("btrfs: Handle uninitialised inode eviction")
Cc: # v4.7-rc6+
Signed-off-by: Liu Bo
Reviewed-by: Nikolay Borisov
Reviewed-by: Josef Bacik
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Liu Bo
2018-02-22 22:43:50 +0800
bc0d431e7 Btrfs: fix extent state leak from tree log

commit 55237a5f2431a72435e3ed39e4306e973c0446b7 upstream.

It's possible that btrfs_sync_log() bails out after one of the two
btrfs_write_marked_extents() calls, which convert the extent state bits
from EXTENT_DIRTY/EXTENT_NEW to EXTENT_NEED_WAIT. However, free_log_tree()
only searches for EXTENT_DIRTY and EXTENT_NEW, so extent states with
EXTENT_NEED_WAIT are leaked.

cc:
Signed-off-by: Liu Bo
Reviewed-by: Josef Bacik
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Liu Bo
2018-02-22 22:43:49 +0800
0f4adc146 Btrfs: fix crash due to not cleaning up tree log block's dirty bits

commit 1846430c24d66e85cc58286b3319c82cd54debb2 upstream.

If the whole fs flips into read-only status due to failures in
critical sections, the log tree's blocks are still dirty, and this leads
to a use-after-free crash at umount time:

umount
-> close_ctree
-> stop workers
-> iput(btree_inode)
-> iput_final
-> write_inode_now
-> ...
-> queue job on stop'd workers

cc: v3.12+
Fixes: 681ae50917df ("Btrfs: cleanup reserved space when freeing tree log on error")
Signed-off-by: Liu Bo
Reviewed-by: Josef Bacik
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Liu Bo
2018-02-22 22:43:49 +0800
ecd72fd60 Btrfs: fix deadlock in run_delalloc_nocow

commit e89166990f11c3f21e1649d760dd35f9e410321c upstream.

@cur_offset is not set back to what it should be (@cow_start) if
btrfs_next_leaf() returns something wrong, and the range [cow_start,
cur_offset) remains locked forever.

cc:
Signed-off-by: Liu Bo
Reviewed-by: Josef Bacik
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Liu Bo
2018-02-22 22:43:49 +0800
4a36f437b ext4: save error to disk in __ext4_grp_locked_error()

commit 06f29cc81f0350261f59643a505010531130eea0 upstream.

In the function __ext4_grp_locked_error(), __save_error_info()
is called to save error info in the superblock, but that information
is not synced to disk to inform the subsequent fsck after reboot.

This patch writes the error information to disk. After this patch,
I think there are no obvious ext4 error handling branches that lead to
"Remounting filesystem read-only" and leave the disk partition without
the subsequent fsck.

Signed-off-by: Zhouyi Zhou
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman

Zhouyi Zhou
2018-02-22 22:43:48 +0800
539deabfc ext4: fix a race in the ext4 shutdown path

commit abbc3f9395c76d554a9ed27d4b1ebfb5d9b0e4ca upstream.

This patch fixes a race between the shutdown path and bio completion
handling. In the ext4 direct io path with async io, after submitting a
bio to the block layer, if journal starting fails,
ext4_direct_IO_write() would bail out pretending that the IO
failed. The caller would have had no way of knowing whether or not the
IO was successfully submitted. So instead, we return -EIOCBQUEUED in
this case. Now, the caller knows that the IO was submitted. The bio
completion handler takes care of the error.

Tested: Ran the shutdown xfstest test 461 in loop for over 2 hours across
4 machines resulting in over 400 runs. Verified that the race didn't
occur. Usually the race was seen in about 20-30 iterations.

Signed-off-by: Harshad Shirwadkar
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman

Harshad Shirwadkar
2018-02-22 22:43:48 +0800
99a89d8fb jbd2: fix sphinx kernel-doc build warnings

commit f69120ce6c024aa634a8fc25787205e42f0ccbe6 upstream.

Sphinx emits various (26) warnings when building make target 'htmldocs'.
Currently struct definitions contain duplicate documentation, some as
kernel-docs and some as standard c89 comments. We can reduce
duplication while cleaning up the kernel docs.

Move all kernel-docs to right above each struct member. Use the set of
all existing comments (kernel-doc and c89). Add documentation for
missing struct members and function arguments.

Signed-off-by: Tobin C. Harding
Signed-off-by: Theodore Ts'o
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman

Tobin C. Harding
2018-02-22 22:43:48 +0800
9cb2d0bc2 mbcache: initialize entry->e_referenced in mb_cache_entry_create()

commit 3876bbe27d04b848750d5310a37d6b76b593f648 upstream.

KMSAN reported use of uninitialized |entry->e_referenced| in a condition
in mb_cache_shrink():

==================================================================
BUG: KMSAN: use of uninitialized memory in mb_cache_shrink+0x3b4/0xc50 fs/mbcache.c:287
CPU: 2 PID: 816 Comm: kswapd1 Not tainted 4.11.0-rc5+ #2877
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs
01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:16 [inline]
dump_stack+0x172/0x1c0 lib/dump_stack.c:52
kmsan_report+0x12a/0x180 mm/kmsan/kmsan.c:927
__msan_warning_32+0x61/0xb0 mm/kmsan/kmsan_instr.c:469
mb_cache_shrink+0x3b4/0xc50 fs/mbcache.c:287
mb_cache_scan+0x67/0x80 fs/mbcache.c:321
do_shrink_slab mm/vmscan.c:397 [inline]
shrink_slab+0xc3d/0x12d0 mm/vmscan.c:500
shrink_node+0x208f/0x2fd0 mm/vmscan.c:2603
kswapd_shrink_node mm/vmscan.c:3172 [inline]
balance_pgdat mm/vmscan.c:3289 [inline]
kswapd+0x160f/0x2850 mm/vmscan.c:3478
kthread+0x46c/0x5f0 kernel/kthread.c:230
ret_from_fork+0x29/0x40 arch/x86/entry/entry_64.S:430
chained origin:
save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302 [inline]
kmsan_save_stack mm/kmsan/kmsan.c:317 [inline]
kmsan_internal_chain_origin+0x12a/0x1f0 mm/kmsan/kmsan.c:547
__msan_store_shadow_origin_1+0xac/0x110 mm/kmsan/kmsan_instr.c:257
mb_cache_entry_create+0x3b3/0xc60 fs/mbcache.c:95
ext4_xattr_cache_insert fs/ext4/xattr.c:1647 [inline]
ext4_xattr_block_set+0x4c82/0x5530 fs/ext4/xattr.c:1022
ext4_xattr_set_handle+0x1332/0x20a0 fs/ext4/xattr.c:1252
ext4_xattr_set+0x4d2/0x680 fs/ext4/xattr.c:1306
ext4_xattr_trusted_set+0x8d/0xa0 fs/ext4/xattr_trusted.c:36
__vfs_setxattr+0x703/0x790 fs/xattr.c:149
__vfs_setxattr_noperm+0x27a/0x6f0 fs/xattr.c:180
vfs_setxattr fs/xattr.c:223 [inline]
setxattr+0x6ae/0x790 fs/xattr.c:449
path_setxattr+0x1eb/0x380 fs/xattr.c:468
SYSC_lsetxattr+0x8d/0xb0 fs/xattr.c:490
SyS_lsetxattr+0x77/0xa0 fs/xattr.c:486
entry_SYSCALL_64_fastpath+0x13/0x94
origin:
save_stack_trace+0x37/0x40 arch/x86/kernel/stacktrace.c:59
kmsan_save_stack_with_flags mm/kmsan/kmsan.c:302 [inline]
kmsan_internal_poison_shadow+0xb1/0x1a0 mm/kmsan/kmsan.c:198
kmsan_kmalloc+0x7f/0xe0 mm/kmsan/kmsan.c:337
kmem_cache_alloc+0x1c2/0x1e0 mm/slub.c:2766
mb_cache_entry_create+0x283/0xc60 fs/mbcache.c:86
ext4_xattr_cache_insert fs/ext4/xattr.c:1647 [inline]
ext4_xattr_block_set+0x4c82/0x5530 fs/ext4/xattr.c:1022
ext4_xattr_set_handle+0x1332/0x20a0 fs/ext4/xattr.c:1252
ext4_xattr_set+0x4d2/0x680 fs/ext4/xattr.c:1306
ext4_xattr_trusted_set+0x8d/0xa0 fs/ext4/xattr_trusted.c:36
__vfs_setxattr+0x703/0x790 fs/xattr.c:149
__vfs_setxattr_noperm+0x27a/0x6f0 fs/xattr.c:180
vfs_setxattr fs/xattr.c:223 [inline]
setxattr+0x6ae/0x790 fs/xattr.c:449
path_setxattr+0x1eb/0x380 fs/xattr.c:468
SYSC_lsetxattr+0x8d/0xb0 fs/xattr.c:490
SyS_lsetxattr+0x77/0xa0 fs/xattr.c:486
entry_SYSCALL_64_fastpath+0x13/0x94
==================================================================

Signed-off-by: Alexander Potapenko
Signed-off-by: Eric Biggers
Cc: stable@vger.kernel.org # v4.6
Signed-off-by: Greg Kroah-Hartman

Alexander Potapenko
2018-02-22 22:43:48 +0800

17 Feb, 2018

16 commits

38e3bc59e ovl: fix failure to fsync lower dir

commit d796e77f1dd541fe34481af2eee6454688d13982 upstream.

As a writable mount, it is not expected for overlayfs to return
EINVAL/EROFS for fsync, even if dir/file is not changed.

This commit fixes the case of fsync of directory, which is easier to
address, because overlayfs already implements fsync file operation for
directories.

The problem reported by Raphael is that the new PostgreSQL 10.0, with a
database in overlayfs where the lower layer is squashfs, fails to start.
The failure is due to an fsync error: PostgreSQL does fsync on all
existing db directories on startup, and a specific directory exists in
the lower layer with no changes.

Reported-by: Raphael Hertzog
Signed-off-by: Amir Goldstein
Tested-by: Raphaël Hertzog
Signed-off-by: Miklos Szeredi
Signed-off-by: Greg Kroah-Hartman

Amir Goldstein
2018-02-17 20:21:20 +0800
8fe7ceaf8 btrfs: Handle btrfs_set_extent_delalloc failure in fixup worker

commit f3038ee3a3f1017a1cbe9907e31fa12d366c5dcb upstream.

This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers
to deal with pages that have been improperly dirtied") and it didn't do
any error handling then. This function might very well fail in an ENOMEM
situation, yet the failure is not handled, which could lead to an
inconsistent state. So let's handle the failure by setting the mapping
error bit.

Signed-off-by: Nikolay Borisov
Reviewed-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Nikolay Borisov
2018-02-17 20:21:20 +0800
71baf27d8 pipe: fix off-by-one error when checking buffer limits

commit 9903a91c763ecdae333a04a9d89d79d2b8966503 upstream.

With pipe-user-pages-hard set to 'N', users were actually only allowed up
to 'N - 1' buffers; and likewise for pipe-user-pages-soft.

Fix this to allow up to 'N' buffers, as would be expected.
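
The off-by-one can be sketched as a single comparison. The helpers below
are hypothetical illustrations, not the kernel functions; they assume that
user_bufs already includes the buffers being requested and that a limit of
0 means "no limit".

```c
#include <assert.h>
#include <stdbool.h>

/* Corrected check: a user may hold exactly 'limit' buffers. */
static bool demo_too_many_buffers(unsigned long limit,
                                  unsigned long user_bufs)
{
    return limit != 0 && user_bufs > limit;
}

/* Off-by-one variant for comparison: rejects the N-th buffer. */
static bool demo_too_many_buffers_old(unsigned long limit,
                                      unsigned long user_bufs)
{
    return limit != 0 && user_bufs >= limit;
}

/* Usage example: with a limit of 16, the 16th buffer must be allowed. */
void demo_limit_example(void)
{
    assert(!demo_too_many_buffers(16, 16));
    assert(demo_too_many_buffers_old(16, 16));
}
```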

Link: http://lkml.kernel.org/r/20180111052902.14409-5-ebiggers3@gmail.com
Fixes: b0b91d18e2e9 ("pipe: fix limit checking in pipe_set_size()")
Signed-off-by: Eric Biggers
Acked-by: Willy Tarreau
Acked-by: Kees Cook
Acked-by: Joe Lawrence
Cc: Alexander Viro
Cc: "Luis R . Rodriguez"
Cc: Michael Kerrisk
Cc: Mikulas Patocka
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Eric Biggers
2018-02-17 20:21:18 +0800
a705c24b5 pipe: actually allow root to exceed the pipe buffer limits

commit 85c2dd5473b2718b4b63e74bfeb1ca876868e11f upstream.

pipe-user-pages-hard and pipe-user-pages-soft are only supposed to apply
to unprivileged users, as documented in both Documentation/sysctl/fs.txt
and the pipe(7) man page.

However, the capabilities are actually only checked when increasing a
pipe's size using F_SETPIPE_SZ, not when creating a new pipe. Therefore,
if pipe-user-pages-hard has been set, the root user can run into it and be
unable to create pipes. Similarly, if pipe-user-pages-soft has been set,
the root user can run into it and have their pipes limited to 1 page each.

Fix this by allowing the privileged override in both cases.
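
A minimal sketch of the privileged override, assuming the limit applies
only when the caller lacks both of two capabilities. struct demo_user and
the demo_* helpers are hypothetical; the capability names in the fields
are an assumption about which capabilities are checked, not taken from the
text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical caller description; field names are assumptions. */
struct demo_user {
    bool cap_sys_resource;
    bool cap_sys_admin;
};

/* A caller is unprivileged only if it holds neither capability. */
static bool demo_is_unprivileged(const struct demo_user *u)
{
    assert(u != NULL);
    return !u->cap_sys_resource && !u->cap_sys_admin;
}

/*
 * Decide whether pipe creation is refused by the hard limit. The limit is
 * consulted on creation too, but only for unprivileged callers.
 */
static bool demo_create_blocked(const struct demo_user *u,
                                unsigned long hard_limit,
                                unsigned long user_bufs)
{
    bool over = hard_limit != 0 && user_bufs > hard_limit;

    return over && demo_is_unprivileged(u);
}
```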

Link: http://lkml.kernel.org/r/20180111052902.14409-4-ebiggers3@gmail.com
Fixes: 759c01142a5d ("pipe: limit the per-user amount of pages allocated in pipes")
Signed-off-by: Eric Biggers
Acked-by: Kees Cook
Acked-by: Joe Lawrence
Cc: Alexander Viro
Cc: "Luis R . Rodriguez"
Cc: Michael Kerrisk
Cc: Mikulas Patocka
Cc: Willy Tarreau
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Eric Biggers
2018-02-17 20:21:18 +0800
da3b22465 fs/proc/kcore.c: use probe_kernel_read() instead of memcpy()

commit d0290bc20d4739b7a900ae37eb5d4cc3be2b393f upstream.

Commit df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext
data") added a bounce buffer to avoid hardened usercopy checks. Copying
to the bounce buffer was implemented with a simple memcpy() assuming
that it is always valid to read from kernel memory iff the
kern_addr_valid() check passed.

A simple, but pointless, test case like "dd if=/proc/kcore of=/dev/null"
now can easily crash the kernel, since the former exception handling on
invalid kernel addresses now doesn't work anymore.

Also adding a kern_addr_valid() implementation wouldn't help here. Most
architectures simply return 1 here, while a couple implemented a page
table walk to figure out if something is mapped at the address in
question.

With DEBUG_PAGEALLOC active mappings are established and removed all the
time, so that relying on the result of kern_addr_valid() before
executing the memcpy() also doesn't work.

Therefore simply use probe_kernel_read() to copy to the bounce buffer.
This also allows read_kcore() to be simplified.

At least on s390 this fixes the observed crashes and doesn't introduce
warnings that were removed with df04abfd181a ("fs/proc/kcore.c: Add
bounce buffer for ktext data"), even though the generic
probe_kernel_read() implementation uses uaccess functions.

While looking into this I'm also wondering if kern_addr_valid() could be
completely removed...(?)

Link: http://lkml.kernel.org/r/20171202132739.99971-1-heiko.carstens@de.ibm.com
Fixes: df04abfd181a ("fs/proc/kcore.c: Add bounce buffer for ktext data")
Fixes: f5509cc18daa ("mm: Hardened usercopy")
Signed-off-by: Heiko Carstens
Acked-by: Kees Cook
Cc: Jiri Olsa
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Heiko Carstens
2018-02-17 20:21:18 +0800
daaa81c48 nsfs: mark dentry with DCACHE_RCUACCESS

commit 073c516ff73557a8f7315066856c04b50383ac34 upstream.

Andrey reported a use-after-free in __ns_get_path():

spin_lock include/linux/spinlock.h:299 [inline]
lockref_get_not_dead+0x19/0x80 lib/lockref.c:179
__ns_get_path+0x197/0x860 fs/nsfs.c:66
open_related_ns+0xda/0x200 fs/nsfs.c:143
sock_ioctl+0x39d/0x440 net/socket.c:1001
vfs_ioctl fs/ioctl.c:45 [inline]
do_vfs_ioctl+0x1bf/0x1780 fs/ioctl.c:685
SYSC_ioctl fs/ioctl.c:700 [inline]
SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691

We are under rcu read lock protection at that point:

rcu_read_lock();
d = atomic_long_read(&ns->stashed);
if (!d)
goto slow;
dentry = (struct dentry *)d;
if (!lockref_get_not_dead(&dentry->d_lockref))
goto slow;
rcu_read_unlock();

but don't use a proper RCU API on the free path, therefore a parallel
__d_free() could free it at the same time. We need to mark the stashed
dentry with DCACHE_RCUACCESS so that __d_free() will be called after all
readers leave RCU.

Fixes: e149ed2b805f ("take the targets of /proc/*/ns/* symlinks to separate fs")
Cc: Alexander Viro
Cc: Andrew Morton
Reported-by: Andrey Konovalov
Signed-off-by: Cong Wang
Signed-off-by: Linus Torvalds
Cc: Eric Biggers
Signed-off-by: Greg Kroah-Hartman

Cong Wang
2018-02-17 20:21:15 +0800
058d13f85 kernfs: fix regression in kernfs_fop_write caused by wrong type

commit ba87977a49913129962af8ac35b0e13e0fa4382d upstream.

Commit b7ce40cff0b9 ("kernfs: cache atomic_write_len in
kernfs_open_file") changed the type of the local variable 'len' from
ssize_t to size_t. As a result, the *ppos value is also updated when
the preceding write callback fails.

Mentioned snippet:
...
len = ops->write(...);
...
if (len > 0)
	*ppos += len;
...
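
The effect of the type change can be reproduced in user space. The following is a minimal sketch, not the kernfs code: failing_write(), advance_unsigned() and advance_signed() are hypothetical stand-ins, and the callback is assumed to return a negative error code on failure.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

/* Hypothetical write callback: always fails with a negative error code. */
int failing_write(void)
{
	return -5;
}

/* Buggy variant: the int result is stored in an unsigned size_t. */
void advance_unsigned(long long *ppos)
{
	size_t len = (size_t)failing_write();	/* -5 wraps to a huge value */

	assert(ppos != NULL);
	if (len > 0)		/* true although the write failed */
		*ppos += len;	/* position is modified on error */
}

/* Fixed variant: a signed type keeps the error code negative. */
void advance_signed(long long *ppos)
{
	ssize_t len = failing_write();

	assert(ppos != NULL);
	if (len > 0)		/* false, so the position is left alone */
		*ppos += len;
}
```

With the unsigned variable the position changes even though nothing was written; with the signed variable it stays where it was.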
Signed-off-by: Ivan Vecera
Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Ivan Vecera
2018-02-17 20:21:15 +0800
b79d8854e NFS: Fix a race between mmap() and O_DIRECT

commit e231c6879cfd44e4fffd384bb6dd7d313249a523 upstream.

When locking the file in order to do O_DIRECT on it, we must unmap
any mmapped ranges on the pagecache so that we can flush out the
dirty data.

Fixes: a5864c999de67 ("NFS: Do not serialise O_DIRECT reads and writes")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2018-02-17 20:21:14 +0800
967f650f8 NFS: reject request for id_legacy key without auxdata

commit 49686cbbb3ebafe42e63868222f269d8053ead00 upstream.

nfs_idmap_legacy_upcall() is supposed to be called with 'aux' pointing
to a 'struct idmap', via the call to request_key_with_auxdata() in
nfs_idmap_request_key().

However it can also be reached via the request_key() system call in
which case 'aux' will be NULL, causing a NULL pointer dereference in
nfs_idmap_prepare_pipe_upcall(), assuming that the key description is
valid enough to get that far.

Fix this by making nfs_idmap_legacy_upcall() negate the key if no
auxdata is provided.

As usual, this bug was found by syzkaller. A simple reproducer using
the command-line keyctl program is:

keyctl request2 id_legacy uid:0 '' @s

Fixes: 57e62324e469 ("NFS: Store the legacy idmapper result in the keyring")
Reported-by: syzbot+5dfdbcf7b3eb5912abbb@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Eric Biggers
2018-02-17 20:21:14 +0800
ca2c316f7 NFS: commit direct writes even if they fail partially

commit 1b8d97b0a837beaf48a8449955b52c650a7114b4 upstream.

If some of the WRITE calls making up an O_DIRECT write syscall fail,
we neglect to commit, even if some of the WRITEs succeed.

We also depend on the commit code to free the reference count on the
nfs_page taken in the "if (request_commit)" case at the end of
nfs_direct_write_completion(). The problem was originally noticed
because ENOSPC's encountered partway through a write would result in a
closed file being sillyrenamed when it should have been unlinked.

Signed-off-by: J. Bruce Fields
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

J. Bruce Fields
2018-02-17 20:21:14 +0800
d1840343f NFS: Add a cond_resched() to nfs_commit_release_pages()

commit 7f1bda447c9bd48b415acedba6b830f61591601f upstream.

The commit list can get very large, and so we need a cond_resched()
in nfs_commit_release_pages() in order to ensure we don't hog the CPU
for excessive periods of time.

Reported-by: Mike Galbraith
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2018-02-17 20:21:14 +0800
e1df8c682 nfs/pnfs: fix nfs_direct_req ref leak when i/o falls back to the mds

commit ba4a76f703ab7eb72941fdaac848502073d6e9ee upstream.

Currently when falling back to doing I/O through the MDS (via
pnfs_{read|write}_through_mds), the client frees the nfs_pgio_header
without releasing the reference taken on the dreq
via pnfs_generic_pg_{read|write}pages -> nfs_pgheader_init ->
nfs_direct_pgio_init. It then takes another reference on the dreq via
nfs_generic_pg_pgios -> nfs_pgheader_init -> nfs_direct_pgio_init and
as a result the requester will become stuck in inode_dio_wait. Once
that happens, other processes accessing the inode will become stuck as
well.

Ensure that pnfs_read_through_mds() and pnfs_write_through_mds() clean
up correctly by calling hdr->completion_ops->completion() instead of
calling hdr->release() directly.

This can be reproduced (sometimes) by performing "storage failover
takeover" commands on a NetApp filer while doing direct I/O from a client.

This can also be reproduced using SystemTap to simulate a failure while
doing direct I/O from a client (from Dave Wysochanski):

stap -v -g -e 'probe module("nfs_layout_nfsv41_files").function("nfs4_fl_prepare_ds").return { $return=NULL; exit(); }'

Suggested-by: Trond Myklebust
Signed-off-by: Scott Mayhew
Fixes: 1ca018d28d ("pNFS: Fix a memory leak when attempted pnfs fails")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Scott Mayhew
2018-02-17 20:21:14 +0800
298dc6c66 ubifs: Massage assert in ubifs_xattr_set() wrt. init_xattrs

commit d8db5b1ca9d4c57e49893d0f78e6d5ce81450cc8 upstream.

The inode is not locked in init_xattrs when creating a new inode.

Without this patch, an assertion fails when booting or creating a new
file if the kernel config option CONFIG_SECURITY_SMACK is enabled.

The log looks like:

UBIFS assert failed in ubifs_xattr_set at 298 (pid 1156)
CPU: 1 PID: 1156 Comm: ldconfig Tainted: G S 4.12.0-rc1-207440-g1e70b02 #2
Hardware name: MediaTek MT2712 evaluation board (DT)
Call trace:
dump_backtrace+0x0/0x238
show_stack+0x14/0x20
dump_stack+0x9c/0xc0
ubifs_xattr_set+0x374/0x5e0
init_xattrs+0x5c/0xb8
security_inode_init_security+0x110/0x190
ubifs_init_security+0x30/0x68
ubifs_mkdir+0x100/0x200
vfs_mkdir+0x11c/0x1b8
SyS_mkdirat+0x74/0xd0
__sys_trace_return+0x0/0x4

Signed-off-by: Xiaolei Li
Signed-off-by: Richard Weinberger
Cc: stable@vger.kernel.org
(julia: massaged to apply to 4.9.y, which doesn't contain fscrypto support)
Signed-off-by: Julia Cartwright
Signed-off-by: Greg Kroah-Hartman

Xiaolei Li
2018-02-17 20:21:14 +0800
7e68916c3 CIFS: zero sensitive data when freeing

commit 97f4b7276b829a8927ac903a119bef2f963ccc58 upstream.

Also replace memset()+kfree() with kzfree().

Signed-off-by: Aurelien Aptel
Signed-off-by: Steve French
Reviewed-by: Pavel Shilovsky
Signed-off-by: Greg Kroah-Hartman

Aurelien Aptel
2018-02-17 20:21:12 +0800
f59eda166 cifs: Fix autonegotiate security settings mismatch

commit 9aca7e454415f7878b28524e76bebe1170911a88 upstream.

Autonegotiation gives a security settings mismatch error if the SMB
server selects an SMBv3 dialect that isn't SMB3.02. The exact error is
"protocol revalidation - security settings mismatch".
This can be tested using Samba v4.2 or by setting the global Samba
setting max protocol = SMB3_00.

The check that fails in smb3_validate_negotiate is the dialect
verification of the negotiate info response. This is because it tries
to verify against the protocol_id in the global smbdefault_values. The
protocol_id in smbdefault_values is SMB3.02.
In SMB2_negotiate the protocol_id in smbdefault_values isn't updated
(it is global, so it probably shouldn't be), but server->dialect is.

This patch changes the check in smb3_validate_negotiate to use
server->dialect instead of server->vals->protocol_id. The patch works
with autonegotiate and when using a specific version in the vers mount
option.
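
The difference between the two checks can be sketched in plain C. This is only an illustration under stated assumptions: the struct layouts, the DIALECT_* macro names and both validate_*() helpers are hypothetical and do not reproduce the cifs code; only the dialect revision numbers 0x0300 and 0x0302 are taken from the SMB2 protocol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DIALECT_SMB30	0x0300	/* SMB 3.0 dialect revision */
#define DIALECT_SMB302	0x0302	/* SMB 3.0.2 dialect revision */

/* Hypothetical global defaults shared by all autonegotiated mounts. */
struct version_values {
	uint16_t protocol_id;
};

/* Hypothetical per-connection state. */
struct server_info {
	const struct version_values *vals;	/* points at global defaults */
	uint16_t dialect;			/* dialect actually negotiated */
};

/* Old check: compares the response against the global default. */
bool validate_against_default(const struct server_info *s, uint16_t rsp_dialect)
{
	assert(s != NULL && s->vals != NULL);
	return rsp_dialect == s->vals->protocol_id;
}

/* New check: compares the response against the negotiated dialect. */
bool validate_against_negotiated(const struct server_info *s, uint16_t rsp_dialect)
{
	assert(s != NULL);
	return rsp_dialect == s->dialect;
}
```

When the defaults say SMB3.02 but the server picked SMB3.0, the first helper rejects a response carrying 0x0300 while the second accepts it, which mirrors the mismatch described above.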

Signed-off-by: Daniel N Pettersson
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Daniel N Pettersson
2018-02-17 20:21:12 +0800
ee6858f72 cifs: Fix missing put_xid in cifs_file_strict_mmap

commit f04a703c3d613845ae3141bfaf223489de8ab3eb upstream.

If cifs_zap_mapping() returned an error, we would return without putting
the xid that we got earlier. Restructure cifs_file_strict_mmap() and
cifs_file_mmap() to be more similar to each other and have a single
point of return that always puts the xid.
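
The single-exit shape can be sketched as follows. This is a user-space illustration, not the cifs functions: get_xid_stub(), put_xid_stub(), xid_outstanding() and strict_mmap_sketch() are hypothetical names, and the two integer parameters merely stand in for the results of the invalidation step and the generic mmap step.

```c
#include <assert.h>

/* Hypothetical stand-ins for the xid helpers; they only track a balance. */
static int xid_balance;

unsigned int get_xid_stub(void)
{
	xid_balance++;
	return 1;
}

void put_xid_stub(unsigned int xid)
{
	(void)xid;
	assert(xid_balance > 0);	/* every put must match an earlier get */
	xid_balance--;
}

int xid_outstanding(void)
{
	return xid_balance;
}

/*
 * Each step runs only if the previous one succeeded, and every path
 * falls through to the one place where the xid is released.
 */
int strict_mmap_sketch(int zap_rc, int mmap_rc)
{
	unsigned int xid = get_xid_stub();
	int rc;

	rc = zap_rc;		/* stands in for the cache invalidation step */
	if (!rc)
		rc = mmap_rc;	/* stands in for the generic mmap step */

	put_xid_stub(xid);	/* single point of return */
	return rc;
}
```

Because there is no early return, a failure in the first step can no longer skip the release.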

Signed-off-by: Matthew Wilcox
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Matthew Wilcox
2018-02-17 20:21:12 +0800