07 Aug, 2017

11 commits

  • [ Upstream commit c2931667c83ded6504b3857e99cc45b21fa496fb ]

    Currently btrfs dio does not handle split dio writes well: if a dio
    write is split into several segments due to the lack of contiguous
    space, a large dio write like 'dd bs=1G count=1' can end up with an
    incorrect outstanding_extents counter, and endio will complain
    loudly with an assertion.

    This fixes the problem by compensating the outstanding_extents
    counter in the inode if a large dio write gets split.

    Reported-by: Anand Jain
    Tested-by: Anand Jain
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 781feef7e6befafd4d9787d1f7ada1f9ccd504e4 ]

    While checking INODE_REF/INODE_EXTREF for a corner case, we may acquire a
    different inode's log_mutex while holding the current inode's log_mutex, and
    lockdep has complained about this with a possible deadlock warning.

    Fix this by using mutex_lock_nested() when processing the other inode's
    log_mutex.

    Reviewed-by: Filipe Manana
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit e321f8a801d7b4c40da8005257b05b9c2b51b072 ]

    If @block_group is not @used_bg, it'll try to get @used_bg's lock
    without dropping @block_group's lock, and lockdep has thrown a scary
    deadlock warning about it.
    Fix it by using down_read_nested().

    Signed-off-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • commit e9a330c4289f2ba1ca4bf98c2b430ab165a8931b upstream.

    The per-prz spinlock should be using the dynamic initializer so that
    lockdep can correctly track it. Without this, under lockdep, we get a
    warning at boot that the lock is in non-static memory.

    Fixes: 109704492ef6 ("pstore: Make spinlock per zone instead of global")
    Fixes: 76d5692a5803 ("pstore: Correctly initialize spinlock and flags")
    Signed-off-by: Kees Cook
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit 76d5692a58031696e282384cbd893832bc92bd76 upstream.

    The ram backend wasn't always initializing its spinlock correctly. Since
    it was coming from kzalloc memory, though, it was harmless on
    architectures that initialize unlocked spinlocks to 0 (at least x86 and
    ARM). This also fixes a possibly ignored flag setting too.

    When running under CONFIG_DEBUG_SPINLOCK, the following Oops was visible:

    [ 0.760836] persistent_ram: found existing buffer, size 29988, start 29988
    [ 0.765112] persistent_ram: found existing buffer, size 30105, start 30105
    [ 0.769435] persistent_ram: found existing buffer, size 118542, start 118542
    [ 0.785960] persistent_ram: found existing buffer, size 0, start 0
    [ 0.786098] persistent_ram: found existing buffer, size 0, start 0
    [ 0.786131] pstore: using zlib compression
    [ 0.790716] BUG: spinlock bad magic on CPU#0, swapper/0/1
    [ 0.790729] lock: 0xffffffc0d1ca9bb0, .magic: 00000000, .owner: /-1, .owner_cpu: 0
    [ 0.790742] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.10.0-rc2+ #913
    [ 0.790747] Hardware name: Google Kevin (DT)
    [ 0.790750] Call trace:
    [ 0.790768] [] dump_backtrace+0x0/0x2bc
    [ 0.790780] [] show_stack+0x20/0x28
    [ 0.790794] [] dump_stack+0xa4/0xcc
    [ 0.790809] [] spin_dump+0xe0/0xf0
    [ 0.790821] [] spin_bug+0x30/0x3c
    [ 0.790834] [] do_raw_spin_lock+0x50/0x1b8
    [ 0.790846] [] _raw_spin_lock_irqsave+0x54/0x6c
    [ 0.790862] [] buffer_size_add+0x48/0xcc
    [ 0.790875] [] persistent_ram_write+0x60/0x11c
    [ 0.790888] [] ramoops_pstore_write_buf+0xd4/0x2a4
    [ 0.790900] [] pstore_console_write+0xf0/0x134
    [ 0.790912] [] console_unlock+0x48c/0x5e8
    [ 0.790923] [] register_console+0x3b0/0x4d4
    [ 0.790935] [] pstore_register+0x1a8/0x234
    [ 0.790947] [] ramoops_probe+0x6b8/0x7d4
    [ 0.790961] [] platform_drv_probe+0x7c/0xd0
    [ 0.790972] [] driver_probe_device+0x1b4/0x3bc
    [ 0.790982] [] __device_attach_driver+0xc8/0xf4
    [ 0.790996] [] bus_for_each_drv+0xb4/0xe4
    [ 0.791006] [] __device_attach+0xd0/0x158
    [ 0.791016] [] device_initial_probe+0x24/0x30
    [ 0.791026] [] bus_probe_device+0x50/0xe4
    [ 0.791038] [] device_add+0x3a4/0x76c
    [ 0.791051] [] of_device_add+0x74/0x84
    [ 0.791062] [] of_platform_device_create_pdata+0xc0/0x100
    [ 0.791073] [] of_platform_device_create+0x34/0x40
    [ 0.791086] [] of_platform_default_populate_init+0x58/0x78
    [ 0.791097] [] do_one_initcall+0x88/0x160
    [ 0.791109] [] kernel_init_freeable+0x264/0x31c
    [ 0.791123] [] kernel_init+0x18/0x11c
    [ 0.791133] [] ret_from_fork+0x10/0x50
    [ 0.793717] console [pstore-1] enabled
    [ 0.797845] pstore: Registered ramoops as persistent store backend
    [ 0.804647] ramoops: attached 0x100000@0xf7edc000, ecc: 0/0

    Fixes: 663deb47880f ("pstore: Allow prz to control need for locking")
    Fixes: 109704492ef6 ("pstore: Make spinlock per zone instead of global")
    Reported-by: Brian Norris
    Signed-off-by: Kees Cook
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit 663deb47880f2283809669563c5a52ac7c6aef1a upstream.

    In preparation of not locking at all for certain buffers depending on if
    there's contention, make locking optional depending on the initialization
    of the prz.

    Signed-off-by: Joel Fernandes
    [kees: moved locking flag into prz instead of via caller arguments]
    Signed-off-by: Kees Cook
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes
     
  • commit 49d31c2f389acfe83417083e1208422b4091cd9e upstream.

    take_dentry_name_snapshot() takes a safe snapshot of dentry name;
    if the name is a short one, it gets copied into caller-supplied
    structure, otherwise an extra reference to external name is grabbed
    (those are never modified). In either case the pointer to stable
    string is stored into the same structure.

    dentry must be held by the caller of take_dentry_name_snapshot(),
    but may be freely dropped afterwards - the snapshot will stay
    until destroyed by release_dentry_name_snapshot().

    Intended use:

        struct name_snapshot s;

        take_dentry_name_snapshot(&s, dentry);
        ...
        access s.name
        ...
        release_dentry_name_snapshot(&s);

    Replaces fsnotify_oldname_...(), gets used in fsnotify to obtain the name
    to pass down with event.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit b7dbcc0e433f0f61acb89ed9861ec996be4f2b38 upstream.

    nfs4_retry_setlk() sets the task's state to TASK_INTERRUPTIBLE within the
    same region protected by the wait_queue's lock after checking for a
    notification from CB_NOTIFY_LOCK callback. However, after releasing that
    lock, a wakeup for that task may race in before the call to
    freezable_schedule_timeout_interruptible() and set TASK_WAKING, then
    freezable_schedule_timeout_interruptible() will set the state back to
    TASK_INTERRUPTIBLE before the task sleeps. The result is that the
    task will sleep for the entire duration of the timeout.

    Since we've already set TASK_INTERRUPTIBLE in the locked section, just use
    freezable_schedule_timeout() instead.

    Fixes: a1d617d8f134 ("nfs: allow blocking locks to be awoken by lock callbacks")
    Signed-off-by: Benjamin Coddington
    Reviewed-by: Jeff Layton
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Benjamin Coddington
     
  • commit 442ce0499c0535f8972b68fa1fda357357a5c953 upstream.

    Prior to commit ca0daa277aca ("NFS: Cache aggressively when file is open
    for writing"), NFS would revalidate, or invalidate, the file size when
    taking a lock. Since that commit it only invalidates the file content.

    If the file size is changed on the server while waiting for the lock,
    the client will have an incorrect understanding of the file size and
    could corrupt data. This particularly happens when writing beyond the
    (supposed) end of file and can easily be demonstrated with
    posix_fallocate().

    If an application opens an empty file, waits for a write lock, and then
    calls posix_fallocate(), glibc will determine that the underlying
    filesystem doesn't support fallocate (assuming version 4.1 or earlier)
    and will write out a '0' byte at the end of each 4K page in the region
    being fallocated that is after the end of the file.
    NFS will (usually) detect that these writes are beyond EOF and will
    expand them to cover the whole page, and then will merge the pages.
    Consequently, NFS will write out large blocks of zeroes beyond where it
    thought EOF was. If EOF had moved, the pre-existing part of the file
    will be over-written. Locking should have protected against this,
    but it doesn't.

    This patch restores the use of nfs_zap_caches() which invalidated the
    cached attributes. When posix_fallocate() asks for the file size, the
    request will go to the server and get a correct answer.

    Fixes: ca0daa277aca ("NFS: Cache aggressively when file is open for writing")
    Signed-off-by: NeilBrown
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit 9bcf66c72d726322441ec82962994e69157613e4 upstream.

    When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID
    bit on 'DIR1' being cleared if the user is not a member of the owning
    group.

    Fix the problem by moving posix_acl_update_mode() out of
    __jfs_set_acl() into jfs_set_acl(). That way the function will not be
    called when inheriting ACLs which is what we want as it prevents SGID
    bit clearing and the mode has been properly set by posix_acl_create()
    anyway.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    CC: jfs-discussion@lists.sourceforge.net
    Signed-off-by: Jan Kara
    Signed-off-by: Dave Kleikamp
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 109704492ef637956265ec2eb72ae7b3b39eb6f4 upstream.

    Currently pstore has a global spinlock for all zones. Since the zones
    are independent and modify different areas of memory, there's no need
    to have a global lock, so we should use a per-zone lock as introduced
    here. Also, when ramoops's ftrace use-case has a FTRACE_PER_CPU flag
    introduced later, which splits the ftrace memory area into a single zone
    per CPU, it will eliminate the need for locking. In preparation for this,
    make the locking optional.

    Signed-off-by: Joel Fernandes
    [kees: updated commit message]
    Signed-off-by: Kees Cook
    Cc: Leo Yan
    Signed-off-by: Greg Kroah-Hartman

    Joel Fernandes
     

28 Jul, 2017

13 commits

  • commit 6883cd7f68245e43e91e5ee583b7550abf14523f upstream.

    When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID
    bit on 'DIR1' being cleared if the user is not a member of the owning
    group.

    Fix the problem by moving posix_acl_update_mode() out of
    __reiserfs_set_acl() into reiserfs_set_acl(). That way the function will
    not be called when inheriting ACLs which is what we want as it prevents
    SGID bit clearing and the mode has been properly set by
    posix_acl_create() anyway.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    CC: reiserfs-devel@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 8fc646b44385ff0a9853f6590497e43049eeb311 upstream.

    On failure to prepare_creds(), mount fails with a random
    return value, as err was last set to an integer cast of
    a valid lower mnt pointer, or set to 0 if the inodes index feature
    is enabled.

    Reported-by: Dan Carpenter
    Fixes: 3fe6e52f0626 ("ovl: override creds with the ones from ...")
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • commit 84969465ddc4f8aeb3b993123b571aa01c5f2683 upstream.

    When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID
    bit on 'DIR1' being cleared if the user is not a member of the owning
    group.

    Fix the problem by creating __hfsplus_set_posix_acl() function that does
    not call posix_acl_update_mode() and use it when inheriting ACLs. That
    prevents SGID bit clearing and the mode has been properly set by
    posix_acl_create() anyway.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 84583cfb973c4313955c6231cc9cb3772d280b15 upstream.

    For a large directory, a program needs to issue multiple readdir
    syscalls to get all dentries. When multiple programs read the
    directory concurrently, the following sequence of events can
    happen.

    - program calls readdir with pos = 2. ceph sends readdir request
    to mds. The reply contains N1 entries. ceph adds these N1 entries
    to readdir cache.
    - program calls readdir with pos = N1+2. The readdir is satisfied
    by the readdir cache, N2 entries are returned. (Other program
    calls readdir in the middle, which fills the cache)
    - program calls readdir with pos = N1+N2+2. ceph sends readdir
    request to mds. The reply contains N3 entries and it reaches
    directory end. ceph adds these N3 entries to the readdir cache
    and marks directory complete.

    The second readdir call does not update fi->readdir_cache_idx.
    ceph adds the last N3 entries to the wrong places.

    Signed-off-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Greg Kroah-Hartman

    Yan, Zheng
     
  • commit f2e95355891153f66d4156bf3a142c6489cd78c6 upstream.

    udf_setsize() called truncate_setsize() with i_data_sem held. Thus
    truncate_pagecache() called from truncate_setsize() could lock a page
    under i_data_sem which can deadlock as page lock ranks below
    i_data_sem - e.g. writeback can hold the page lock and try to acquire
    i_data_sem to map a block.

    Fix the problem by moving truncate_setsize() calls from under
    i_data_sem. It is safe for us to change i_size without holding
    i_data_sem as all the places that depend on i_size being stable already
    hold inode_lock.

    Fixes: 7e49b6f2480cb9a9e7322a91592e56a5c85361f5
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit cc89684c9a265828ce061037f1f79f4a68ccd3f7 upstream.

    Since commit bafc9b754f75 ("vfs: More precise tests in d_invalidate")
    in v3.18, a return of '0' from ->d_revalidate() will cause the dentry
    to be invalidated even if it has filesystems mounted on it or on a
    descendant. The mounted filesystem is unmounted.

    This means we need to be careful not to return 0 unless the directory
    referred to truly is invalid. So -ESTALE or -ENOENT should invalidate
    the directory. Other errors such as -EPERM or -ERESTARTSYS should be
    returned from ->d_revalidate() so they are propagated to the caller.

    A particular problem can be demonstrated by:

    1/ mount an NFS filesystem using NFSv3 on /mnt
    2/ mount any other filesystem on /mnt/foo
    3/ ls /mnt/foo
    4/ turn off network, or otherwise make the server unable to respond
    5/ ls /mnt/foo &
    6/ cat /proc/$!/stack # note that nfs_lookup_revalidate is in the call stack
    7/ kill -9 $! # this results in -ERESTARTSYS being returned
    8/ observe that /mnt/foo has been unmounted.

    This patch changes nfs_lookup_revalidate() to only treat
    -ESTALE from nfs_lookup_verify_inode() and
    -ESTALE or -ENOENT from ->lookup()
    as indicating an invalid inode. Other errors are returned.

    Also nfs_check_inode_attributes() is changed to return -ESTALE rather
    than -EIO. This is consistent with the error returned in similar
    circumstances from nfs_update_inode().

    As this bug allows any user to unmount a filesystem mounted on an NFS
    filesystem, this fix is suitable for stable kernels.

    Fixes: bafc9b754f75 ("vfs: More precise tests in d_invalidate")
    Signed-off-by: NeilBrown
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    NeilBrown
     
  • commit 4acadda74ff8b949c448c0282765ae747e088c87 upstream.

    When UBIFS prepares data structures which will be written to the MTD
    it ensures that their lengths are a multiple of 8. Since it uses
    kmalloc() the padded bytes are left uninitialized and we leak a few
    bytes of kernel memory to the MTD.
    To make sure that all bytes are initialized, let's switch to kzalloc().
    kzalloc() is fine in this case because the buffers are not huge, and in
    the IO path the performance bottleneck is the MTD anyway.

    Fixes: 1e51764a3c2a ("UBIFS: add new flash file system")
    Signed-off-by: Richard Weinberger
    Reviewed-by: Boris Brezillon
    Signed-off-by: Richard Weinberger
    Signed-off-by: Greg Kroah-Hartman

    Richard Weinberger
     
  • commit 51f8f3c4e22535933ef9aecc00e9a6069e051b57 upstream.

    If overlay was mounted by root then quota set for the upper layer does
    not work, because overlay now always uses the mounter's credentials
    for operations. Overlay might also deplete reserved space and inodes
    in ext4.

    This patch drops the capability SYS_RESOURCE from the saved credentials.
    This affects creation of new files, whiteouts, and copy-up operations.

    Signed-off-by: Konstantin Khlebnikov
    Fixes: 1175b6b8d963 ("ovl: do operations on underlying file system in mounter's context")
    Cc: Vivek Goyal
    Signed-off-by: Miklos Szeredi
    Cc: Amir Goldstein
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov
     
  • commit c925dc162f770578ff4a65ec9b08270382dba9e6 upstream.

    This patch copies commit b7f8a09f80:
    "btrfs: Don't clear SGID when inheriting ACLs" written by Jan.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    Signed-off-by: Jan Kara
    Reviewed-by: Chao Yu
    Reviewed-by: Jan Kara
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Greg Kroah-Hartman

    Jaegeuk Kim
     
  • commit 21d3f8e1c3b7996ce239ab6fa82e9f7a8c47d84d upstream.

    Make sure the number of entries doesn't exceed the max journal size.

    Signed-off-by: Jin Qian
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Greg Kroah-Hartman

    Jin Qian
     
  • commit 8ba358756aa08414fa9e65a1a41d28304ed6fd7f upstream.

    When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID
    bit on 'DIR1' being cleared if the user is not a member of the owning
    group.

    Fix the problem by calling __xfs_set_acl() instead of xfs_set_acl() when
    setting up inode in xfs_generic_create(). That prevents SGID bit
    clearing and mode is properly set by posix_acl_create() anyway. We also
    reorder arguments of __xfs_set_acl() to match the ordering of
    xfs_set_acl() to make things consistent.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    CC: Darrick J. Wong
    CC: linux-xfs@vger.kernel.org
    Signed-off-by: Jan Kara
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit a992f2d38e4ce17b8c7d1f7f67b2de0eebdea069 upstream.

    When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID
    bit on 'DIR1' being cleared if the user is not a member of the owning
    group.

    Fix the problem by creating __ext2_set_acl() function that does not call
    posix_acl_update_mode() and use it when inheriting ACLs. That prevents
    SGID bit clearing and the mode has been properly set by
    posix_acl_create() anyway.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    CC: linux-ext4@vger.kernel.org
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit b7f8a09f8097db776b8d160862540e4fc1f51296 upstream.

    When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
    set, DIR1 is expected to have SGID bit set (and owning group equal to
    the owning group of 'DIR0'). However when 'DIR0' also has some default
    ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID
    bit on 'DIR1' being cleared if the user is not a member of the owning
    group.

    Fix the problem by moving posix_acl_update_mode() out of
    __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
    called when inheriting ACLs which is what we want as it prevents SGID
    bit clearing and the mode has been properly set by posix_acl_create()
    anyway.

    Fixes: 073931017b49d9458aa351605b43a7e34598caef
    CC: linux-btrfs@vger.kernel.org
    CC: David Sterba
    Signed-off-by: Jan Kara
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

21 Jul, 2017

6 commits

  • commit 296990deb389c7da21c78030376ba244dc1badf5 upstream.

    Andrei Vagin pointed out that time to executue propagate_umount can go
    non-linear (and take a ludicrious amount of time) when the mount
    propogation trees of the mounts to be unmunted by a lazy unmount
    overlap.

    Make the walk of the mount propagation trees nearly linear by
    remembering which mounts have already been visited, allowing
    subsequent walks to detect when walking a mount propagation tree or a
    subtree of a mount propagation tree would be duplicate work and to skip
    them entirely.

    Walk the list of mounts whose propagation trees need to be traversed
    from the mount highest in the mount tree to mounts lower in the mount
    tree so that odds are higher that the code will walk the largest trees
    first, allowing later tree walks to be skipped entirely.

    Add cleanup_umount_visitation to remove the code's memory of which
    mounts have been visited.

    Add the functions last_slave and skip_propagation_subtree to allow
    skipping appropriate parts of the mount propagation tree without
    needing to change the logic of the rest of the code.

    A script to generate overlapping mount propagation trees:

    $ cat run.sh
    set -e
    mount -t tmpfs zdtm /mnt
    mkdir -p /mnt/1 /mnt/2
    mount -t tmpfs zdtm /mnt/1
    mount --make-shared /mnt/1
    mkdir /mnt/1/1

    iteration=10
    if [ -n "$1" ] ; then
    iteration=$1
    fi

    for i in $(seq $iteration); do
    mount --bind /mnt/1/1 /mnt/1/1
    done

    mount --rbind /mnt/1 /mnt/2

    TIMEFORMAT='%Rs'
    nr=$(( ( 2 ** ( $iteration + 1 ) ) + 1 ))
    echo -n "umount -l /mnt/1 -> $nr "
    time umount -l /mnt/1

    nr=$(cat /proc/self/mountinfo | grep zdtm | wc -l )
    time umount -l /mnt/2

    $ for i in $(seq 9 19); do echo $i; unshare -Urm bash ./run.sh $i; done

    Here are the performance numbers with and without the patch:

         mhash |    8192 |   8192 | 1048576 | 1048576
        mounts |  before |  after |  before |   after
    -------------------------------------------------
          1025 |  0.040s | 0.016s |  0.038s |  0.019s
          2049 |  0.094s | 0.017s |  0.080s |  0.018s
          4097 |  0.243s | 0.019s |  0.206s |  0.023s
          8193 |  1.202s | 0.028s |  1.562s |  0.032s
         16385 |  9.635s | 0.036s |  9.952s |  0.041s
         32769 | 60.928s | 0.063s | 44.321s |  0.064s
         65537 |         | 0.097s |         |  0.097s
        131073 |         | 0.233s |         |  0.176s
        262145 |         | 0.653s |         |  0.344s
        524289 |         | 2.305s |         |  0.735s
       1048577 |         | 7.107s |         |  2.603s

    Andrei Vagin reports that fixing the performance problem is part of
    the work to fix CVE-2016-6213.

    Fixes: a05964f3917c ("[PATCH] shared mounts handling: umount")
    Reported-by: Andrei Vagin
    Reviewed-by: Andrei Vagin
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 99b19d16471e9c3faa85cad38abc9cbbe04c6d55 upstream.

    While investigating some poor umount performance I realized that in
    the case of overlapping mount trees where some of the mounts are locked
    the code has been failing to unmount all of the mounts it should
    have been unmounting.

    This failure to unmount all of the necessary
    mounts can be reproduced with:

    $ cat locked_mounts_test.sh

    mount -t tmpfs test-base /mnt
    mount --make-shared /mnt
    mkdir -p /mnt/b

    mount -t tmpfs test1 /mnt/b
    mount --make-shared /mnt/b
    mkdir -p /mnt/b/10

    mount -t tmpfs test2 /mnt/b/10
    mount --make-shared /mnt/b/10
    mkdir -p /mnt/b/10/20

    mount --rbind /mnt/b /mnt/b/10/20

    unshare -Urm --propagation unchanged /bin/sh -c 'sleep 5; if [ $(grep test /proc/self/mountinfo | wc -l) -eq 1 ] ; then echo SUCCESS ; else echo FAILURE ; fi' &
    sleep 1
    umount -l /mnt/b
    wait %%

    $ unshare -Urm ./locked_mounts_test.sh

    This failure is corrected by removing the prepass that marks mounts
    that may be umounted.

    A first pass is added that umounts mounts if possible and, if not,
    sets the mount mark if they could be unmounted were they not locked,
    and adds them to a list of umount possibilities. This first pass
    reconsiders a mount's parent if it is on the list of umount
    possibilities, ensuring that information about umountability will
    pass from child to parent.

    A second pass then walks through all mounts that are umounted and processes
    their children unmounting them or marking them for reparenting.

    A last pass cleans up the state on the mounts that could not be umounted
    and if applicable reparents them to their first parent that remained
    mounted.

    While a bit longer than the old code, this code is much more robust,
    as it allows information to flow up from the leaves and down from the
    trunk, making the order in which mounts are encountered in the umount
    propagation tree irrelevant.

    Fixes: 0c56fe31420c ("mnt: Don't propagate unmounts to locked mounts")
    Reviewed-by: Andrei Vagin
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit 570487d3faf2a1d8a220e6ee10f472163123d7da upstream.

    It was observed that in some pathological cases the current code
    does not unmount everything it should. After investigation it
    was determined that the issue is that mnt_change_mountpoint can
    change which mounts are available to be unmounted during mount
    propagation, which is wrong.

    The trivial reproducer is:
    $ cat ./pathological.sh

    mount -t tmpfs test-base /mnt
    cd /mnt
    mkdir 1 2 1/1
    mount --bind 1 1
    mount --make-shared 1
    mount --bind 1 2
    mount --bind 1/1 1/1
    mount --bind 1/1 1/1
    echo
    grep test-base /proc/self/mountinfo
    umount 1/1
    echo
    grep test-base /proc/self/mountinfo

    $ unshare -Urm ./pathological.sh

    The expected output looks like:
    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    The output without the fix looks like:
    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    49 54 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    50 53 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    51 49 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    54 47 0:25 /1/1 /mnt/1/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    53 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    52 50 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    46 31 0:25 / /mnt rw,relatime - tmpfs test-base rw,uid=1000,gid=1000
    47 46 0:25 /1 /mnt/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    48 46 0:25 /1 /mnt/2 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000
    52 48 0:25 /1/1 /mnt/2/1 rw,relatime shared:1 - tmpfs test-base rw,uid=1000,gid=1000

    That last mount in the output was in the propagation tree to be
    unmounted, but was missed because mnt_change_mountpoint changed its
    parent before the walk through the mount propagation tree observed it.

    Fixes: 1064f874abc0 ("mnt: Tuck mounts under others instead of creating shadow/side mounts.")
    Acked-by: Andrei Vagin
    Reviewed-by: Ram Pai
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • commit da029c11e6b12f321f36dac8771e833b65cec962 upstream.

    To avoid pathological stack usage or the need to special-case setuid
    execs, just limit all arg stack usage to at most 75% of _STK_LIM (6MB).

    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit eab09532d40090698b05a07c1c87f39fdbc5fab5 upstream.

    The ELF_ET_DYN_BASE position was originally intended to keep loaders
    away from ET_EXEC binaries. (For example, running "/lib/ld-linux.so.2
    /bin/cat" might cause the subsequent load of /bin/cat into where the
    loader had been loaded.)

    With the advent of PIE (ET_DYN binaries with an INTERP Program Header),
    ELF_ET_DYN_BASE continued to be used since the kernel was only looking
    at ET_DYN. However, since ELF_ET_DYN_BASE is traditionally set at the
    top 1/3rd of the TASK_SIZE, a substantial portion of the address space
    is unused.

    For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs are
    loaded above the mmap region. This means they can be made to collide
    (CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
    pathological stack regions.

    Lowering ELF_ET_DYN_BASE solves both by moving programs below the mmap
    region in all cases, and will now additionally avoid programs falling
    back to the mmap region by enforcing MAP_FIXED for program loads (i.e.
    if it would have collided with the stack, now it will fail to load
    instead of falling back to the mmap region).

    To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
    are loaded into the mmap region, leaving space available for either an
    ET_EXEC binary with a fixed location or PIE being loaded into mmap by
    the loader. Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
    which means architectures can now safely lower their values without risk
    of loaders colliding with their subsequently loaded programs.

    For 64-bit, ELF_ET_DYN_BASE is best set to 4GB to allow runtimes to use
    the entire 32-bit address space for 32-bit pointers.

    Thanks to PaX Team, Daniel Micay, and Rik van Riel for inspiration and
    suggestions on how to implement this solution.

    Fixes: d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR")
    Link: http://lkml.kernel.org/r/20170621173201.GA114489@beast
    Signed-off-by: Kees Cook
    Acked-by: Rik van Riel
    Cc: Daniel Micay
    Cc: Qualys Security Advisory
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Alexander Viro
    Cc: Dmitry Safonov
    Cc: Andy Lutomirski
    Cc: Grzegorz Andrejczuk
    Cc: Masahiro Yamada
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: James Hogan
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Pratyush Anand
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     
  • commit b17c070fb624cf10162cf92ea5e1ec25cd8ac176 upstream.

    __list_lru_walk_one() holds the nlru spin lock (nlru->lock) for as long as
    it takes to walk the lru list, and under the current code it can hold it
    for up to UINT_MAX entries at a time. So when the lru list contains many
    items, "BUG: spinlock lockup suspected" is observed in the below path:

    spin_bug+0x90
    do_raw_spin_lock+0xfc
    _raw_spin_lock+0x28
    list_lru_add+0x28
    dput+0x1c8
    path_put+0x20
    terminate_walk+0x3c
    path_lookupat+0x100
    filename_lookup+0x6c
    user_path_at_empty+0x54
    SyS_faccessat+0xd0
    el0_svc_naked+0x24

    This nlru->lock is acquired by another CPU in this path -

    d_lru_shrink_move+0x34
    dentry_lru_isolate_shrink+0x48
    __list_lru_walk_one.isra.10+0x94
    list_lru_walk_node+0x40
    shrink_dcache_sb+0x60
    do_remount_sb+0xbc
    do_emergency_remount+0xb0
    process_one_work+0x228
    worker_thread+0x2e0
    kthread+0xf4
    ret_from_fork+0x10

    Fix this lockup by reducing the number of entries shrunk from the lru list
    to 1024 at once. Also, add cond_resched() before processing the lru list
    again.

    Link: http://marc.info/?t=149722864900001&r=1&w=2
    Link: http://lkml.kernel.org/r/1498707575-2472-1-git-send-email-stummala@codeaurora.org
    Signed-off-by: Sahitya Tummala
    Suggested-by: Jan Kara
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Alexander Polakov
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Sahitya Tummala
     

15 Jul, 2017

1 commit

  • commit 1ea1516fbbab2b30bf98c534ecaacba579a35208 upstream.

    kstrtoull returns 0 on success; however, reserved_clusters_store returned
    -EINVAL when kstrtoull returned 0, so updating the reserved_clusters value
    through sysfs always failed.

    Fixes: 76d33bca5581b1dd5c3157fa168db849a784ada4
    Signed-off-by: Chao Yu
    Signed-off-by: Miao Xie
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     

12 Jul, 2017

4 commits

  • commit 961ae1d83d055a4b9ebbfb4cc8ca62ec1a7a3b74 upstream.

    Before commit 88ffbf3e03 "GFS2: Use resizable hash table for glocks",
    glocks were freed via call_rcu to allow reading the glock hashtable
    locklessly using rcu. That commit changed this to free glocks immediately,
    which made reading the glock hashtable unsafe. Bring back the original
    code for freeing glocks via call_rcu.

    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Bob Peterson
    Signed-off-by: Greg Kroah-Hartman

    Andreas Gruenbacher
     
  • commit b50c2de51e611da90cf3cf04c058f7e9bbe79e93 upstream.

    The dirfragtree is lazily updated, so it's not always accurate. An
    infinite loop happens in the following circumstance:

    - client sends a request to read frag A
    - frag A has been fragmented into frags B and C, so the mds fills the
    reply with the contents of frag B
    - client wants to read the next frag, C, but ceph_choose_frag(frag value
    of C) returns frag A

    The fix is to use the previous readdir reply to calculate the next readdir
    frag when possible.

    Signed-off-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Greg Kroah-Hartman

    Yan, Zheng
     
  • commit 629e014bb8349fcf7c1e4df19a842652ece1c945 upstream.

    Currently we just stash anything we got into file->f_flags, and then
    report it in fcntl(F_GETFL). This patch clears out all unknown flags so
    that we neither pass them to the fs nor report them.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     
  • commit 80f18379a7c350c011d30332658aa15fe49a8fa5 upstream.

    Add a central define for all valid open flags, and use it in the uniqueness
    check.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     

05 Jul, 2017

5 commits

  • commit 33496c3c3d7b88dcbe5e55aa01288b05646c6aca upstream.

    Configfs is the interface through which ocfs2-tools passes configuration
    to the kernel, and $configfs_dir/cluster/$clustername/heartbeat/dead_threshold
    is the attribute used to configure the heartbeat dead threshold. The
    kernel has a default value for it, but users can set
    O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb to override it.

    Commit 45b997737a80 ("ocfs2/cluster: use per-attribute show and store
    methods") changed the heartbeat dead threshold attribute name while
    ocfs2-tools did not, so ocfs2-tools can no longer set this configurable
    and the default value is always used. So revert it.

    Fixes: 45b997737a80 ("ocfs2/cluster: use per-attribute show and store methods")
    Link: http://lkml.kernel.org/r/1490665245-15374-1-git-send-email-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Acked-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Junxiao Bi
     
  • [ Upstream commit 4d22c75d4c7b5c5f4bd31054f09103ee490878fd ]

    If the last section of a core file ends with an unmapped or zero page,
    the size of the file does not correspond with the last dump_skip() call.
    gdb complains that the file is truncated and can be confusing to users.

    After all of the vma sections are written, make sure that the file size
    is no smaller than the current file position.

    This problem can be demonstrated with gdb's bigcore testcase on the
    sparc architecture.

    Signed-off-by: Dave Kleikamp
    Cc: Alexander Viro
    Cc: linux-fsdevel@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Dave Kleikamp
     
  • [ Upstream commit a12f1ae61c489076a9aeb90bddca7722bf330df3 ]

    lockdep reports a warning. file_start_write()/file_end_write() only
    acquire/release the lock for regular files, so check the file type on the
    aio side too.

    [ 453.532141] ------------[ cut here ]------------
    [ 453.533011] WARNING: CPU: 1 PID: 1298 at ../kernel/locking/lockdep.c:3514 lock_release+0x434/0x670
    [ 453.533011] DEBUG_LOCKS_WARN_ON(depth <= 0)
    [ 453.533011] [] dump_stack+0x67/0x9c
    [ 453.533011] [] __warn+0x111/0x130
    [ 453.533011] [] warn_slowpath_fmt+0x97/0xb0
    [ 453.533011] [] ? __warn+0x130/0x130
    [ 453.533011] [] ? blk_finish_plug+0x29/0x60
    [ 453.533011] [] lock_release+0x434/0x670
    [ 453.533011] [] ? import_single_range+0xd4/0x110
    [ 453.533011] [] ? rw_verify_area+0x65/0x140
    [ 453.533011] [] ? aio_write+0x1f6/0x280
    [ 453.533011] [] aio_write+0x229/0x280
    [ 453.533011] [] ? aio_complete+0x640/0x640
    [ 453.533011] [] ? debug_check_no_locks_freed+0x1a0/0x1a0
    [ 453.533011] [] ? debug_lockdep_rcu_enabled.part.2+0x1a/0x30
    [ 453.533011] [] ? debug_lockdep_rcu_enabled+0x35/0x40
    [ 453.533011] [] ? __might_fault+0x7e/0xf0
    [ 453.533011] [] do_io_submit+0x94c/0xb10
    [ 453.533011] [] ? do_io_submit+0x23e/0xb10
    [ 453.533011] [] ? SyS_io_destroy+0x270/0x270
    [ 453.533011] [] ? mark_held_locks+0x23/0xc0
    [ 453.533011] [] ? trace_hardirqs_on_thunk+0x1a/0x1c
    [ 453.533011] [] SyS_io_submit+0x10/0x20
    [ 453.533011] [] entry_SYSCALL_64_fastpath+0x18/0xad
    [ 453.533011] [] ? trace_hardirqs_off_caller+0xc0/0x110
    [ 453.533011] ---[ end trace b2fbe664d1cc0082 ]---

    Cc: Dmitry Monakhov
    Cc: Jan Kara
    Cc: Christoph Hellwig
    Cc: Al Viro
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     
  • [ Upstream commit 91298eec05cd8d4e828cf7ee5d4a6334f70cf69a ]

    For such a file mapping,

    [0-4k][hole][8k-12k]

    In NO_HOLES mode, we don't have the [hole] extent any more.
    Commit c1aa45759e90 ("Btrfs: fix shrinking truncate when the no_holes feature is enabled")
    fixed disk isize not being updated in NO_HOLES mode when data is not flushed.

    However, even if the data has been flushed, we can still end up with a
    wrong disk isize, since we updated it to the 'start' offset of the last
    evicted extent.

    Reviewed-by: Chris Mason
    Signed-off-by: Liu Bo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Liu Bo
     
  • [ Upstream commit 97dcdea076ecef41ea4aaa23d4397c2f622e4265 ]

    The following deadlock is seen when executing generic/113 test,

    ---------------------------------------------------------+-----------------------------------------------------
     Direct I/O task                                           Fast fsync task
    ---------------------------------------------------------+-----------------------------------------------------
    btrfs_direct_IO
      __blockdev_direct_IO
        do_blockdev_direct_IO
          do_direct_IO
            btrfs_get_blocks_direct
            while (blocks need to be written)
              get_more_blocks (first iteration)
                btrfs_get_blocks_direct
                  btrfs_create_dio_extent
                    down_read(&BTRFS_I(inode)->dio_sem)
                    Create and add extent map and ordered extent
                    up_read(&BTRFS_I(inode)->dio_sem)
                                                               btrfs_sync_file
                                                                 btrfs_log_dentry_safe
                                                                   btrfs_log_inode_parent
                                                                     btrfs_log_inode
                                                                       btrfs_log_changed_extents
                                                                         down_write(&BTRFS_I(inode)->dio_sem)
                                                                         Collect new extent maps and ordered extents
                                                                         wait for ordered extent completion
              get_more_blocks (second iteration)
                btrfs_get_blocks_direct
                  btrfs_create_dio_extent
                    down_read(&BTRFS_I(inode)->dio_sem)
    --------------------------------------------------------------------------------------------------------------

    In the above description, the Btrfs direct I/O code path has not yet
    started submitting bios for the file range covered by the initial ordered
    extent. Meanwhile, the fast fsync task obtains the write semaphore and
    waits for I/O on the ordered extent to complete. However, the direct I/O
    task is now blocked on obtaining the read semaphore.

    To resolve the deadlock, this commit modifies the Direct I/O code path
    to obtain the read semaphore before invoking
    __blockdev_direct_IO(). The semaphore is then given up after
    __blockdev_direct_IO() returns. This allows the Direct I/O code to
    complete I/O on all the ordered extents it creates.

    Signed-off-by: Chandan Rajendra
    Reviewed-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chandan Rajendra