Eric Lee / smarc-fsl-linux-kernel

01 Feb, 2017

6 commits

5637949ed NFSv4.0: always send mode in SETATTR after EXCLUSIVE4 ... Browse Code »

commit a430607b2ef7c3be090f88c71cfcb1b3988aa7c0 upstream.

Some nfsv4.0 servers may return a mode for the verifier following an open
with EXCLUSIVE4 createmode, but this does not mean the client should skip
setting the mode in the following SETATTR. It should only do that for
EXCLUSIVE4_1 or UNGAURDED createmode.

Fixes: 5334c5bdac92 ("NFS: Send attributes in OPEN request for NFS4_CREATE_EXCLUSIVE4_1")
Signed-off-by: Benjamin Coddington
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Benjamin Coddington
2017-02-01 15:33:09 +0800
0a7023506 NFSv4.1: Fix a deadlock in layoutget ... Browse Code »

commit 8ac092519ad91931c96d306c4bfae2c6587c325f upstream.

We cannot call nfs4_handle_exception() without first ensuring that the
slot has been freed. If not, we end up deadlocking with the process
waiting for recovery to complete, and recovery waiting for the slot
table to drain.

Fixes: 2e80dbe7ac51 ("NFSv4.1: Close callback races for OPEN, LAYOUTGET...")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2017-02-01 15:33:09 +0800
ffb97c11d Btrfs: remove ->{get, set}_acl() from btrfs_dir_ro_inode_operations ... Browse Code »

commit 57b59ed2e5b91e958843609c7884794e29e6c4cb upstream.

Subvolume directory inodes can't have ACLs.

Signed-off-by: Omar Sandoval
Reviewed-by: David Sterba
Signed-off-by: Chris Mason
Signed-off-by: Greg Kroah-Hartman

Omar Sandoval
2017-02-01 15:33:06 +0800
ad80fada9 Btrfs: disable xattr operations on subvolume directories ... Browse Code »

commit 1fdf41941b8010691679638f8d0c8d08cfee7726 upstream.

When you snapshot a subvolume containing a subvolume, you get a
placeholder directory where the subvolume would be. These directory
inodes have ->i_ops set to btrfs_dir_ro_inode_operations. Previously,
these i_ops didn't include the xattr operation callbacks. The conversion
to xattr_handlers missed this case, leading to bogus attempts to set
xattrs on these inodes. This manifested itself as failures when running
delayed inodes.

To fix this, clear IOP_XATTR in ->i_opflags on these inodes.

Fixes: 6c6ef9f26e59 ("xattr: Stop calling {get,set,remove}xattr inode operations")
Cc: Andreas Gruenbacher
Reported-by: Chris Murphy
Tested-by: Chris Murphy
Signed-off-by: Omar Sandoval
Reviewed-by: David Sterba
Signed-off-by: Chris Mason
Signed-off-by: Greg Kroah-Hartman

Omar Sandoval
2017-02-01 15:33:06 +0800
79babd4a6 Btrfs: remove old tree_root case in btrfs_read_locked_inode() ... Browse Code »

commit 67ade058ef2c65a3e56878af9c293ec76722a2e5 upstream.

As Jeff explained in c2951f32d36c ("btrfs: remove old tree_root dirent
processing in btrfs_real_readdir()"), supporting this old format is no
longer necessary since the Btrfs magic number has been updated since we
changed to the current format. There are other places where we still
handle this old format, but since this is part of a fix that is going to
stable, I'm only removing this one for now.

Signed-off-by: Omar Sandoval
Reviewed-by: David Sterba
Signed-off-by: Chris Mason
Signed-off-by: Greg Kroah-Hartman

Omar Sandoval
2017-02-01 15:33:06 +0800
485952414 xfs: prevent quotacheck from overloading inode lru ... Browse Code »

commit e0d76fa4475ef2cf4b52d18588b8ce95153d021b upstream.

Quotacheck runs at mount time in situations where quota accounting must
be recalculated. In doing so, it uses bulkstat to visit every inode in
the filesystem. Historically, every inode processed during quotacheck
was released and immediately tagged for reclaim because quotacheck runs
before the superblock is marked active by the VFS. In other words,
the final iput() lead to an immediate ->destroy_inode() call, which
allowed the XFS background reclaim worker to start reclaiming inodes.

Commit 17c12bcd3 ("xfs: when replaying bmap operations, don't let
unlinked inodes get reaped") marks the XFS superblock active sooner as
part of the mount process to support caching inodes processed during log
recovery. This occurs before quotacheck and thus means all inodes
processed by quotacheck are inserted to the LRU on release. The
s_umount lock is held until the mount has completed and thus prevents
the shrinkers from operating on the sb. This means that quotacheck can
excessively populate the inode LRU and lead to OOM conditions on systems
without sufficient RAM.

Update the quotacheck bulkstat handler to set XFS_IGET_DONTCACHE on
inodes processed by quotacheck. This causes ->drop_inode() to return 1
and in turn causes iput_final() to evict the inode. This preserves the
original quotacheck behavior and prevents it from overloading the LRU
and running out of memory.

Reported-by: Martin Svec
Signed-off-by: Brian Foster
Reviewed-by: Eric Sandeen
Reviewed-by: Darrick J. Wong
Signed-off-by: Darrick J. Wong
Signed-off-by: Greg Kroah-Hartman

Brian Foster
2017-02-01 15:33:05 +0800

26 Jan, 2017

9 commits

6e9fa67c5 ceph: fix endianness bug in frag_tree_split_cmp ... Browse Code »

commit fe2ed42517533068ac03eed5630fffafff27eacf upstream.

sparse says:

fs/ceph/inode.c:308:36: warning: incorrect type in argument 1 (different base types)
fs/ceph/inode.c:308:36: expected unsigned int [unsigned] [usertype] a
fs/ceph/inode.c:308:36: got restricted __le32 [usertype] frag
fs/ceph/inode.c:308:46: warning: incorrect type in argument 2 (different base types)
fs/ceph/inode.c:308:46: expected unsigned int [unsigned] [usertype] b
fs/ceph/inode.c:308:46: got restricted __le32 [usertype] frag

We need to convert these values to host-endian before calling the
comparator.

Fixes: a407846ef7c6 ("ceph: don't assume frag tree splits in mds reply are sorted")
Signed-off-by: Jeff Layton
Reviewed-by: Sage Weil
Signed-off-by: Ilya Dryomov
Signed-off-by: Greg Kroah-Hartman

Jeff Layton
2017-01-26 15:24:43 +0800
2e4f2131b ceph: fix endianness of getattr mask in ceph_d_revalidate ... Browse Code »

commit 1097680d759918ce4a8705381c0ab2ed7bd60cf1 upstream.

sparse says:

fs/ceph/dir.c:1248:50: warning: incorrect type in assignment (different base types)
fs/ceph/dir.c:1248:50: expected restricted __le32 [usertype] mask
fs/ceph/dir.c:1248:50: got int [signed] [assigned] mask

Fixes: 200fd27c8fa2 ("ceph: use lookup request to revalidate dentry")
Signed-off-by: Jeff Layton
Reviewed-by: Sage Weil
Signed-off-by: Ilya Dryomov
Signed-off-by: Greg Kroah-Hartman

Jeff Layton
2017-01-26 15:24:43 +0800
8934e0696 ceph: fix ceph_get_caps() interruption ... Browse Code »

commit 6e09d0fb64402cec579f029ca4c7f39f5c48fc60 upstream.

Commit 5c341ee32881 ("ceph: fix scheduler warning due to nested
blocking") causes infinite loop when process is interrupted. Fix it.

Signed-off-by: Yan, Zheng
Signed-off-by: Ilya Dryomov
Signed-off-by: Greg Kroah-Hartman

Yan, Zheng
2017-01-26 15:24:43 +0800
48baa9241 ceph: fix scheduler warning due to nested blocking ... Browse Code »

commit 5c341ee32881c554727ec14b71ec3e8832f01989 upstream.

try_get_cap_refs can be used as a condition in a wait_event* calls.
This is all fine until it has to call __ceph_do_pending_vmtruncate,
which in turn acquires the i_truncate_mutex. This leads to a situation
in which a task's state is !TASK_RUNNING and at the same time it's
trying to acquire a sleeping primitive. In essence a nested sleeping
primitives are being used. This causes the following warning:

WARNING: CPU: 22 PID: 11064 at kernel/sched/core.c:7631 __might_sleep+0x9f/0xb0()
do not call blocking ops when !TASK_RUNNING; state=1 set at [] prepare_to_wait_event+0x5d/0x110
ipmi_msghandler tcp_scalable ib_qib dca ib_mad ib_core ib_addr ipv6
CPU: 22 PID: 11064 Comm: fs_checker.pl Tainted: G O 4.4.20-clouder2 #6
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1a 10/16/2015
0000000000000000 ffff8838b416fa88 ffffffff812f4409 ffff8838b416fad0
ffffffff81a034f2 ffff8838b416fac0 ffffffff81052b46 ffffffff81a0432c
0000000000000061 0000000000000000 0000000000000000 ffff88167bda54a0
Call Trace:
[] dump_stack+0x67/0x9e
[] warn_slowpath_common+0x86/0xc0
[] warn_slowpath_fmt+0x4c/0x50
[] ? prepare_to_wait_event+0x5d/0x110
[] ? prepare_to_wait_event+0x5d/0x110
[] __might_sleep+0x9f/0xb0
[] mutex_lock+0x20/0x40
[] __ceph_do_pending_vmtruncate+0x44/0x1a0 [ceph]
[] try_get_cap_refs+0xa2/0x320 [ceph]
[] ceph_get_caps+0x255/0x2b0 [ceph]
[] ? wait_woken+0xb0/0xb0
[] ceph_write_iter+0x2b1/0xde0 [ceph]
[] ? schedule_timeout+0x202/0x260
[] ? kmem_cache_free+0x1ea/0x200
[] ? iput+0x9e/0x230
[] ? __might_sleep+0x52/0xb0
[] ? __might_fault+0x37/0x40
[] ? cp_new_stat+0x153/0x170
[] __vfs_write+0xaa/0xe0
[] vfs_write+0xa9/0x190
[] ? set_close_on_exec+0x31/0x70
[] SyS_write+0x46/0xa0

This happens since wait_event_interruptible can interfere with the
mutex locking code, since they both fiddle with the task state.

Fix the issue by using the newly-added nested blocking infrastructure
in 61ada528dea0 ("sched/wait: Provide infrastructure to deal with
nested blocking")

Link: https://lwn.net/Articles/628628/
Signed-off-by: Nikolay Borisov
Signed-off-by: Yan, Zheng
Signed-off-by: Greg Kroah-Hartman

Nikolay Borisov
2017-01-26 15:24:43 +0800
1f75575ac ceph: fix bad endianness handling in parse_reply_info_extra ... Browse Code »

commit 6df8c9d80a27cb587f61b4f06b57e248d8bc3f86 upstream.

sparse says:

fs/ceph/mds_client.c:291:23: warning: restricted __le32 degrades to integer
fs/ceph/mds_client.c:293:28: warning: restricted __le32 degrades to integer
fs/ceph/mds_client.c:294:28: warning: restricted __le32 degrades to integer
fs/ceph/mds_client.c:296:28: warning: restricted __le32 degrades to integer

The op value is __le32, so we need to convert it before comparing it.

Signed-off-by: Jeff Layton
Reviewed-by: Sage Weil
Signed-off-by: Ilya Dryomov
Signed-off-by: Greg Kroah-Hartman

Jeff Layton
2017-01-26 15:24:40 +0800
ce5c52f03 ubifs: Fix journal replay wrt. xattr nodes ... Browse Code »

commit 1cb51a15b576ee325d527726afff40947218fd5e upstream.

When replaying the journal it can happen that a journal entry points to
a garbage collected node.
This is the case when a power-cut occurred between a garbage collect run
and a commit. In such a case nodes have to be read using the failable
read functions to detect whether the found node matches what we expect.

One corner case was forgotten, when the journal contains an entry to
remove an inode all xattrs have to be removed too. UBIFS models xattr
like directory entries, so the TNC code iterates over
all xattrs of the inode and removes them too. This code re-uses the
functions for walking directories and calls ubifs_tnc_next_ent().
ubifs_tnc_next_ent() expects to be used only after the journal and
aborts when a node does not match the expected result. This behavior can
render an UBIFS volume unmountable after a power-cut when xattrs are
used.

Fix this issue by using failable read functions in ubifs_tnc_next_ent()
too when replaying the journal.
Fixes: 1e51764a3c2ac05a ("UBIFS: add new flash file system")
Reported-by: Rock Lee
Reviewed-by: David Gstir
Signed-off-by: Richard Weinberger
Signed-off-by: Greg Kroah-Hartman

Richard Weinberger
2017-01-26 15:24:40 +0800
07f026756 fuse: fix time_to_jiffies nsec sanity check ... Browse Code »

commit 210675270caa33253e4c33f3c5e657e7d6060812 upstream.

Commit bcb6f6d2b9c2 ("fuse: use timespec64") introduced clamped nsec values
in time_to_jiffies but used the max of nsec and NSEC_PER_SEC - 1 instead of
the min. Because of this, dentries would stay in the cache longer than
requested and go stale in scenarios that relied on their timely eviction.

Fixes: bcb6f6d2b9c2 ("fuse: use timespec64")
Signed-off-by: David Sheets
Signed-off-by: Miklos Szeredi
Signed-off-by: Greg Kroah-Hartman

David Sheets
2017-01-26 15:24:38 +0800
0181b3603 fuse: clear FR_PENDING flag when moving requests out of pending queue ... Browse Code »

commit a8a86d78d673b1c99fe9b0064739fde9e9774184 upstream.

fuse_abort_conn() moves requests from pending list to a temporary list
before canceling them. This operation races with request_wait_answer()
which also tries to remove the request after it gets a fatal signal. It
checks FR_PENDING flag to determine whether the request is still in the
pending list.

Make fuse_abort_conn() clear FR_PENDING flag so that request_wait_answer()
does not remove the request from temporary list.

This bug causes an Oops when trying to delete an already deleted list entry
in end_requests().

Fixes: ee314a870e40 ("fuse: abort: no fc->lock needed for request ending")
Signed-off-by: Tahsin Erdogan
Signed-off-by: Miklos Szeredi
Signed-off-by: Greg Kroah-Hartman

Tahsin Erdogan
2017-01-26 15:24:38 +0800
782b361c9 tmpfs: clear S_ISGID when setting posix ACLs ... Browse Code »

commit 497de07d89c1410d76a15bec2bb41f24a2a89f31 upstream.

This change was missed the tmpfs modification in In CVE-2016-7097
commit 073931017b49 ("posix_acl: Clear SGID bit when setting
file permissions")
It can test by xfstest generic/375, which failed to clear
setgid bit in the following test case on tmpfs:

touch $testfile
chown 100:100 $testfile
chmod 2755 $testfile
_runas -u 100 -g 101 -- setfacl -m u::rwx,g::rwx,o::rwx $testfile

Signed-off-by: Gu Zheng
Signed-off-by: Al Viro
Cc: Brad Spengler
Signed-off-by: Greg Kroah-Hartman

Gu Zheng
2017-01-26 15:24:37 +0800

20 Jan, 2017

11 commits

396b25173 NFSv4.1: nfs4_fl_prepare_ds must be careful about reporting success. ... Browse Code »

commit cfd278c280f997cf2fe4662e0acab0fe465f637b upstream.

Various places assume that if nfs4_fl_prepare_ds() turns a non-NULL 'ds',
then ds->ds_clp will also be non-NULL.

This is not necessasrily true in the case when the process received a fatal signal
while nfs4_pnfs_ds_connect is waiting in nfs4_wait_ds_connect().
In that case ->ds_clp may not be set, and the devid may not recently have been marked
unavailable.

So add a test for ds_clp == NULL and return NULL in that case.

Fixes: c23266d532b4 ("NFS4.1 Fix data server connection race")
Signed-off-by: NeilBrown
Acked-by: Olga Kornievskaia
Acked-by: Adamson, Andy
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

NeilBrown
2017-01-20 03:18:05 +0800
e331f2f2b NFS: Fix a performance regression in readdir ... Browse Code »

commit 79f687a3de9e3ba2518b4ea33f38ca6cbe9133eb upstream.

Ben Coddington reports that commit 311324ad1713, by adding the function
nfs_dir_mapping_need_revalidate() that checks page cache validity on
each call to nfs_readdir() causes a performance regression when
the directory is being modified.

If the directory is changing while we're iterating through the directory,
POSIX does not require us to invalidate the page cache unless the user
calls rewinddir(). However, we still do want to ensure that we use
readdirplus in order to avoid a load of stat() calls when the user
is doing an 'ls -l' workload.

The fix should be to invalidate the page cache immediately when we're
setting the NFS_INO_ADVISE_RDPLUS bit.

Reported-by: Benjamin Coddington
Fixes: 311324ad1713 ("NFS: Be more aggressive in using readdirplus...")
Reviewed-by: Benjamin Coddington
Tested-by: Benjamin Coddington
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2017-01-20 03:18:05 +0800
4c4d4bec6 pNFS: Fix race in pnfs_wait_on_layoutreturn ... Browse Code »

commit ee284e35d8c71bf5d4d807eaff6f67a17134b359 upstream.

We must put the task to sleep while holding the inode->i_lock in order
to ensure atomicity with the test for NFS_LAYOUT_RETURN.

Fixes: 500d701f336b ("NFS41: make close wait for layoutreturn")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2017-01-20 03:18:05 +0800
633b57037 NFS: fix typo in parameter description ... Browse Code »

commit f36ab161bebe464d33b998294eff29b17a9c8918 upstream.

Fix typo in parameter description.

Fixes: 5405fc44c337 ("NFSv4.x: Add kernel parameter to control the
callback server")
Signed-off-by: Wei Yongjun
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Wei Yongjun
2017-01-20 03:18:05 +0800
7a1dcd92f btrfs: fix error handling when run_delayed_extent_op fails ... Browse Code »

commit aa7c8da35d1905d80e840d075f07d26ec90144b5 upstream.

In __btrfs_run_delayed_refs, the error path when run_delayed_extent_op
fails sets locked_ref->processing = 0 but doesn't re-increment
delayed_refs->num_heads_ready. As a result, we end up triggering
the WARN_ON in btrfs_select_ref_head.

Fixes: d7df2c796d7 (Btrfs: attach delayed ref updates to delayed ref heads)
Reported-by: Jon Nelson
Signed-off-by: Jeff Mahoney
Reviewed-by: Liu Bo
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Jeff Mahoney
2017-01-20 03:18:05 +0800
003e3163f btrfs: fix locking when we put back a delayed ref that's too new ... Browse Code »

commit d0280996437081dd12ed1e982ac8aeaa62835ec4 upstream.

In __btrfs_run_delayed_refs, when we put back a delayed ref that's too
new, we have already dropped the lock on locked_ref when we set
->processing = 0.

This patch keeps the lock to cover that assignment.

Fixes: d7df2c796d7 (Btrfs: attach delayed ref updates to delayed ref heads)
Signed-off-by: Jeff Mahoney
Reviewed-by: Liu Bo
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Jeff Mahoney
2017-01-20 03:18:05 +0800
00cf64fba sysctl: Drop reference added by grab_header in proc_sys_readdir ... Browse Code »

commit 93362fa47fe98b62e4a34ab408c4a418432e7939 upstream.

Fixes CVE-2016-9191, proc_sys_readdir doesn't drop reference
added by grab_header when return from !dir_emit_dots path.
It can cause any path called unregister_sysctl_table will
wait forever.

The calltrace of CVE-2016-9191:

[ 5535.960522] Call Trace:
[ 5535.963265] [] schedule+0x3f/0xa0
[ 5535.968817] [] schedule_timeout+0x3db/0x6f0
[ 5535.975346] [] ? wait_for_completion+0x45/0x130
[ 5535.982256] [] wait_for_completion+0xc3/0x130
[ 5535.988972] [] ? wake_up_q+0x80/0x80
[ 5535.994804] [] drop_sysctl_table+0xc4/0xe0
[ 5536.001227] [] drop_sysctl_table+0x77/0xe0
[ 5536.007648] [] unregister_sysctl_table+0x4d/0xa0
[ 5536.014654] [] unregister_sysctl_table+0x7f/0xa0
[ 5536.021657] [] unregister_sched_domain_sysctl+0x15/0x40
[ 5536.029344] [] partition_sched_domains+0x44/0x450
[ 5536.036447] [] ? __mutex_unlock_slowpath+0x111/0x1f0
[ 5536.043844] [] rebuild_sched_domains_locked+0x64/0xb0
[ 5536.051336] [] update_flag+0x11d/0x210
[ 5536.057373] [] ? mutex_lock_nested+0x2df/0x450
[ 5536.064186] [] ? cpuset_css_offline+0x1b/0x60
[ 5536.070899] [] ? trace_hardirqs_on+0xd/0x10
[ 5536.077420] [] ? mutex_lock_nested+0x2df/0x450
[ 5536.084234] [] ? css_killed_work_fn+0x25/0x220
[ 5536.091049] [] cpuset_css_offline+0x35/0x60
[ 5536.097571] [] css_killed_work_fn+0x5c/0x220
[ 5536.104207] [] process_one_work+0x1df/0x710
[ 5536.110736] [] ? process_one_work+0x160/0x710
[ 5536.117461] [] worker_thread+0x12b/0x4a0
[ 5536.123697] [] ? process_one_work+0x710/0x710
[ 5536.130426] [] kthread+0xfe/0x120
[ 5536.135991] [] ret_from_fork+0x1f/0x40
[ 5536.142041] [] ? kthread_create_on_node+0x230/0x230

One cgroup maintainer mentioned that "cgroup is trying to offline
a cpuset css, which takes place under cgroup_mutex. The offlining
ends up trying to drain active usages of a sysctl table which apprently
is not happening."
The real reason is that proc_sys_readdir doesn't drop reference added
by grab_header when return from !dir_emit_dots path. So this cpuset
offline path will wait here forever.

See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13

Fixes: f0c3b5093add ("[readdir] convert procfs")
Reported-by: CAI Qian
Tested-by: Yang Shukui
Signed-off-by: Zhou Chengming
Acked-by: Al Viro
Signed-off-by: Eric W. Biederman
Signed-off-by: Greg Kroah-Hartman

Zhou Chengming
2017-01-20 03:18:04 +0800
1a62a0f76 mnt: Protect the mountpoint hashtable with mount_lock ... Browse Code »

commit 3895dbf8985f656675b5bde610723a29cbce3fa7 upstream.

Protecting the mountpoint hashtable with namespace_sem was sufficient
until a call to umount_mnt was added to mntput_no_expire. At which
point it became possible for multiple calls of put_mountpoint on
the same hash chain to happen on the same time.

Kristen Johansen reported:
> This can cause a panic when simultaneous callers of put_mountpoint
> attempt to free the same mountpoint. This occurs because some callers
> hold the mount_hash_lock, while others hold the namespace lock. Some
> even hold both.
>
> In this submitter's case, the panic manifested itself as a GP fault in
> put_mountpoint() when it called hlist_del() and attempted to dereference
> a m_hash.pprev that had been poisioned by another thread.

Al Viro observed that the simple fix is to switch from using the namespace_sem
to the mount_lock to protect the mountpoint hash table.

I have taken Al's suggested patch moved put_mountpoint in pivot_root
(instead of taking mount_lock an additional time), and have replaced
new_mountpoint with get_mountpoint a function that does the hash table
lookup and addition under the mount_lock. The introduction of get_mounptoint
ensures that only the mount_lock is needed to manipulate the mountpoint
hashtable.

d_set_mounted is modified to only set DCACHE_MOUNTED if it is not
already set. This allows get_mountpoint to use the setting of
DCACHE_MOUNTED to ensure adding a struct mountpoint for a dentry
happens exactly once.

Fixes: ce07d891a089 ("mnt: Honor MNT_LOCKED when detaching mounts")
Reported-by: Krister Johansen
Suggested-by: Al Viro
Acked-by: Al Viro
Signed-off-by: "Eric W. Biederman"
Signed-off-by: Greg Kroah-Hartman

Eric W. Biederman
2017-01-20 03:18:03 +0800
28dad9aa9 btrfs: fix crash when tracepoint arguments are freed by wq callbacks ... Browse Code »

commit ac0c7cf8be00f269f82964cf7b144ca3edc5dbc4 upstream.

Enabling btrfs tracepoints leads to instant crash, as reported. The wq
callbacks could free the memory and the tracepoints started to
dereference the members to get to fs_info.

The proposed fix https://marc.info/?l=linux-btrfs&m=148172436722606&w=2
removed the tracepoints but we could preserve them by passing only the
required data in a safe way.

Fixes: bc074524e123 ("btrfs: prefix fsid to all trace events")
Reported-by: Sebastian Andrzej Siewior
Reviewed-by: Qu Wenruo
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

David Sterba
2017-01-20 03:18:02 +0800
6ba35da69 xfs: Timely free truncated dirty pages ... Browse Code »

commit 0a417b8dc1f10b03e8f558b8a831f07ec4c23795 upstream.

Commit 99579ccec4e2 "xfs: skip dirty pages in ->releasepage()" started
to skip dirty pages in xfs_vm_releasepage() which also has the effect
that if a dirty page is truncated, it does not get freed by
block_invalidatepage() and is lingering in LRU list waiting for reclaim.
So a simple loop like:

while true; do
dd if=/dev/zero of=file bs=1M count=100
rm file
done

will keep using more and more memory until we hit low watermarks and
start pagecache reclaim which will eventually reclaim also the truncate
pages. Keeping these truncated (and thus never usable) pages in memory
is just a waste of memory, is unnecessarily stressing page cache
reclaim, and reportedly also leads to anonymous mmap(2) returning ENOMEM
prematurely.

So instead of just skipping dirty pages in xfs_vm_releasepage(), return
to old behavior of skipping them only if they have delalloc or unwritten
buffers and fix the spurious warnings by warning only if the page is
clean.

CC: Brian Foster
CC: Vlastimil Babka
Reported-by: Petr Tůma
Fixes: 99579ccec4e271c3d4d4e7c946058766812afdab
Signed-off-by: Jan Kara
Reviewed-by: Brian Foster
Signed-off-by: Darrick J. Wong
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2017-01-20 03:18:00 +0800
6c9bd81cb ocfs2: fix crash caused by stale lvb with fsdlm plugin ... Browse Code »

commit e7ee2c089e94067d68475990bdeed211c8852917 upstream.

The crash happens rather often when we reset some cluster nodes while
nodes contend fiercely to do truncate and append.

The crash backtrace is below:

dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover_grant 1 locks on 971 resources
dlm: C21CBDA5E0774F4BA5A9D4F317717495: dlm_recover 9 generation 5 done: 4 ms
ocfs2: Begin replay journal (node 318952601, slot 2) on device (253,18)
ocfs2: End replay journal (node 318952601, slot 2) on device (253,18)
ocfs2: Beginning quota recovery on device (253,18) for slot 2
ocfs2: Finishing quota recovery on device (253,18) for slot 2
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
(truncate,30154,1):ocfs2_truncate_file:470 ERROR: Inode 290321, inode i_size = 732 != di i_size = 937, i_flags = 0x1
------------[ cut here ]------------
kernel BUG at /usr/src/linux/fs/ocfs2/file.c:470!
invalid opcode: 0000 [#1] SMP
Modules linked in: ocfs2_stack_user(OEN) ocfs2(OEN) ocfs2_nodemanager ocfs2_stackglue(OEN) quota_tree dlm(OEN) configfs fuse sd_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi af_packet iscsi_ibft iscsi_boot_sysfs softdog xfs libcrc32c ppdev parport_pc pcspkr parport joydev virtio_balloon virtio_net i2c_piix4 acpi_cpufreq button processor ext4 crc16 jbd2 mbcache ata_generic cirrus virtio_blk ata_piix drm_kms_helper ahci syscopyarea libahci sysfillrect sysimgblt fb_sys_fops ttm floppy libata drm virtio_pci virtio_ring uhci_hcd virtio ehci_hcd usbcore serio_raw usb_common sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4
Supported: No, Unsupported modules are loaded
CPU: 1 PID: 30154 Comm: truncate Tainted: G OE N 4.4.21-69-default #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.1-0-g4adadbd-20151112_172657-sheep25 04/01/2014
task: ffff88004ff6d240 ti: ffff880074e68000 task.ti: ffff880074e68000
RIP: 0010:[] [] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]
RSP: 0018:ffff880074e6bd50 EFLAGS: 00010282
RAX: 0000000000000074 RBX: 000000000000029e RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
RBP: ffff880074e6bda8 R08: 000000003675dc7a R09: ffffffff82013414
R10: 0000000000034c50 R11: 0000000000000000 R12: ffff88003aab3448
R13: 00000000000002dc R14: 0000000000046e11 R15: 0000000000000020
FS: 00007f839f965700(0000) GS:ffff88007fc80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007f839f97e000 CR3: 0000000036723000 CR4: 00000000000006e0
Call Trace:
ocfs2_setattr+0x698/0xa90 [ocfs2]
notify_change+0x1ae/0x380
do_truncate+0x5e/0x90
do_sys_ftruncate.constprop.11+0x108/0x160
entry_SYSCALL_64_fastpath+0x12/0x6d
Code: 24 28 ba d6 01 00 00 48 c7 c6 30 43 62 a0 8b 41 2c 89 44 24 08 48 8b 41 20 48 c7 c1 78 a3 62 a0 48 89 04 24 31 c0 e8 a0 97 f9 ff 0b 3d 00 fe ff ff 0f 84 ab fd ff ff 83 f8 fc 0f 84 a2 fd ff
RIP [] ocfs2_truncate_file+0x640/0x6c0 [ocfs2]

It's because ocfs2_inode_lock() get us stale LVB in which the i_size is
not equal to the disk i_size. We mistakenly trust the LVB because the
underlaying fsdlm dlm_lock() doesn't set lkb_sbflags with
DLM_SBF_VALNOTVALID properly for us. But, why?

The current code tries to downconvert lock without DLM_LKF_VALBLK flag
to tell o2cb don't update RSB's LVB if it's a PR->NULL conversion, even
if the lock resource type needs LVB. This is not the right way for
fsdlm.

The fsdlm plugin behaves different on DLM_LKF_VALBLK, it depends on
DLM_LKF_VALBLK to decide if we care about the LVB in the LKB. If
DLM_LKF_VALBLK is not set, fsdlm will skip recovering RSB's LVB from
this lkb and set the right DLM_SBF_VALNOTVALID appropriately when node
failure happens.

The following diagram briefly illustrates how this crash happens:

RSB1 is inode metadata lock resource with LOCK_TYPE_USES_LVB;

The 1st round:

Node1 Node2
RSB1: PR
RSB1(master): NULL->EX
ocfs2_downconvert_lock(PR->NULL, set_lvb==0)
ocfs2_dlm_lock(no DLM_LKF_VALBLK)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

dlm_lock(no DLM_LKF_VALBLK)
convert_lock(overwrite lkb->lkb_exflags
with no DLM_LKF_VALBLK)

RSB1: NULL RSB1: EX
reset Node2
dlm_recover_rsbs()
recover_lvb()

/* The LVB is not trustable if the node with EX fails and
* no lock >= PR is left. We should set RSB_VALNOTVALID for RSB1.
*/

if(!(kb_exflags & DLM_LKF_VALBLK)) /* This means we miss the chance to
return; * to invalid the LVB here.
*/

The 2nd round:

Node 1 Node2
RSB1(become master from recovery)

ocfs2_setattr()
ocfs2_inode_lock(NULL->EX)
/* dlm_lock() return the stale lvb without setting DLM_SBF_VALNOTVALID */
ocfs2_meta_lvb_is_trustable() return 1 /* so we don't refresh inode from disk */
ocfs2_truncate_file()
mlog_bug_on_msg(disk isize != i_size_read(inode)) /* crash! */

The fix is quite straightforward. We keep to set DLM_LKF_VALBLK flag
for dlm_lock() if the lock resource type needs LVB and the fsdlm plugin
is uesed.

Link: http://lkml.kernel.org/r/1481275846-6604-1-git-send-email-zren@suse.com
Signed-off-by: Eric Ren
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman

Eric Ren
2017-01-20 03:17:59 +0800

12 Jan, 2017

14 commits

1b9c25568 xfs: fix max_retries _show and _store functions ... Browse Code »

commit ff97f2399edac1e0fb3fa7851d5fbcbdf04717cf upstream.

max_retries _show and _store functions should test against cfg->max_retries,
not cfg->retry_timeout

Signed-off-by: Carlos Maiolino
Reviewed-by: Eric Sandeen
Signed-off-by: Darrick J. Wong
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Carlos Maiolino
2017-01-12 18:39:45 +0800
91192ae41 xfs: fix crash and data corruption due to removal of busy COW extents ... Browse Code »

commit a1b7a4dea6166cf46be895bce4aac67ea5160fe8 upstream.

There is a race window between write_cache_pages calling
clear_page_dirty_for_io and XFS calling set_page_writeback, in which
the mapping for an inode is tagged neither as dirty, nor as writeback.

If the COW shrinker hits in exactly that window we'll remove the delayed
COW extents and writepages trying to write it back, which in release
kernels will manifest as corruption of the bmap btree, and in debug
kernels will trip the ASSERT about now calling xfs_bmapi_write with the
COWFORK flag for holes. A complex customer load manages to hit this
window fairly reliably, probably by always having COW writeback in flight
while the cow shrinker runs.

This patch adds another check for having the I_DIRTY_PAGES flag set,
which is still set during this race window. While this fixes the problem
I'm still not overly happy about the way the COW shrinker works as it
still seems a bit fragile.

Signed-off-by: Christoph Hellwig
Signed-off-by: Darrick J. Wong
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Christoph Hellwig
2017-01-12 18:39:45 +0800
b96e4e87d xfs: use the actual AG length when reserving blocks ... Browse Code »

commit 20e73b000bcded44a91b79429d8fa743247602ad upstream.

We need to use the actual AG length when making per-AG reservations,
since we could otherwise end up reserving more blocks out of the last
AG than there are actual blocks.

Complained-about-by: Brian Foster
Signed-off-by: Darrick J. Wong
Reviewed-by: Christoph Hellwig
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:44 +0800
d9c7c9fa6 xfs: fix double-cleanup when CUI recovery fails ... Browse Code »

commit 7a21272b088894070391a94fdd1c67014020fa1d upstream.

Dan Carpenter reported a double-free of rcur if _defer_finish fails
while we're recovering CUI items. Fix the error recovery to prevent
this.

Reported-by: Dan Carpenter
Signed-off-by: Darrick J. Wong
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:44 +0800
aa38f370b xfs: use GPF_NOFS when allocating btree cursors ... Browse Code »

commit b24a978c377be5f14e798cb41238e66fe51aab2f upstream.

Use NOFS for allocating btree cursors, since they can be called
under the ilock.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:44 +0800
3c382dda4 xfs: ignore leaf attr ichdr.count in verifier during log replay ... Browse Code »

commit 2e1d23370e75d7d89350d41b4ab58c7f6a0e26b2 upstream.

When we create a new attribute, we first create a shortform
attribute, and try to fit the new attribute into it.
If that fails, we copy the (empty) attribute into a leaf attribute,
and do the copy again. Thus there can be a transient state where
we have an empty leaf attribute.

If we encounter this during log replay, the verifier will fail.
So add a test to ignore this part of the leaf attr verification
during log replay.

Thanks as usual to dchinner for spotting the problem.

Signed-off-by: Eric Sandeen
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Eric Sandeen
2017-01-12 18:39:44 +0800
c00203386 xfs: don't cap maximum dedupe request length ... Browse Code »

commit 1bb33a98702d8360947f18a44349df75ba555d5d upstream.

After various discussions on linux-fsdevel, it has been decided that it
is not necessary to cap the length of a dedupe request, and that
correctly-written userspace client programs will be able to absorb the
change. Therefore, remove the length clamping behavior.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:44 +0800
f8b20705a xfs: don't allow di_size with high bit set ... Browse Code »

commit ef388e2054feedaeb05399ed654bdb06f385d294 upstream.

The on-disk field di_size is used to set i_size, which is a signed
integer of loff_t. If the high bit of di_size is set, we'll end up with
a negative i_size, which will cause all sorts of problems. Since the
VFS won't let us create a file with such length, we should catch them
here in the verifier too.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:43 +0800
12815dd15 xfs: error out if trying to add attrs and anextents > 0 ... Browse Code »

commit 0f352f8ee8412bd9d34fb2a6411241da61175c0e upstream.

We shouldn't assert if somehow we end up trying to add an attr fork to
an inode that apparently already has attr extents because this is an
indication of on-disk corruption. Instead, return an error code to
userspace.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:43 +0800
cd4bf1d41 xfs: don't crash if reading a directory results in an unexpected hole ... Browse Code »

commit 96a3aefb8ffde23180130460b0b2407b328eb727 upstream.

In xfs_dir3_data_read, we can encounter the situation where err == 0 and
*bpp == NULL if the given bno offset happens to be a hole; this leads to
a crash if we try to set the buffer type after the _da_read_buf call.
Holes can happen due to corrupt or malicious entries in the bmbt data,
so be a little more careful when we're handling buffers.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:43 +0800
b88398de1 xfs: complain if we don't get nextents bmap records ... Browse Code »

commit 356a3225222e5bc4df88aef3419fb6424f18ab69 upstream.

When reading into memory all extents of a btree-format inode fork,
complain if the number of extents we find is not the same as the number
of extents reported in the inode core. This is needed to stop an IO
action from accessing the garbage areas of the in-core fork.

[dchinner: removed redundant assert]

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:43 +0800
4bb31bcce xfs: check for bogus values in btree block headers ... Browse Code »

commit bb3be7e7c1c18e1b141d4cadeb98cc89ecf78099 upstream.

When we're reading a btree block, make sure that what we retrieved
matches the owner and level; and has a plausible number of records.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:43 +0800
b85f32481 xfs: forbid AG btrees with level == 0 ... Browse Code »

commit d2a047f31e86941fa896e0e3271536d50aba415e upstream.

There is no such thing as a zero-level AG btree since even a single-node
zero-records btree has one level. Btree cursor constructors read
cur_nlevels straight from disk and then access things like
cur_bufs[cur_nlevels - 1] which is /really/ bad if cur_nlevels is zero!
Therefore, strengthen the verifiers to prevent this possibility.

Signed-off-by: Darrick J. Wong
Reviewed-by: Dave Chinner
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Darrick J. Wong
2017-01-12 18:39:42 +0800
4081d4a79 xfs: handle cow fork in xfs_bmap_trace_exlist ... Browse Code »

commit c44a1f22626c153976289e1cd67bdcdfefc16e1f upstream.

By inspection, xfs_bmap_trace_exlist isn't handling cow forks,
and will trace the data fork instead.

Fix this by setting state appropriately if whichfork
== XFS_COW_FORK.

()___()
< @ @ >
| |
{o_o}
(|)

Signed-off-by: Eric Sandeen
Reviewed-by: Christoph Hellwig
Signed-off-by: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Greg Kroah-Hartman

Eric Sandeen
2017-01-12 18:39:42 +0800