Eric Lee / smarc-fsl-linux-kernel

22 Sep, 2016

3 commits

87709e28d fs/locks: Use percpu_down_read_preempt_disable() ... Browse Code »

Avoid spurious preemption.

Signed-off-by: Peter Zijlstra (Intel)
Cc: Al Viro
Cc: Linus Torvalds
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dave@stgolabs.net
Cc: der.herr@hofr.at
Cc: paulmck@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: tj@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
2016-09-22 21:25:54 +0800
7c3f654d8 fs/locks: Replace lg_local with a per-cpu spinlock ... Browse Code »

As Oleg suggested, replace file_lock_list with a structure containing
the hlist head and a spinlock.

This completely removes the lglock from fs/locks.

Suggested-by: Oleg Nesterov
Signed-off-by: Peter Zijlstra (Intel)
Cc: Al Viro
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dave@stgolabs.net
Cc: der.herr@hofr.at
Cc: paulmck@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: tj@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
2016-09-22 21:25:53 +0800
aba376607 fs/locks: Replace lg_global with a percpu-rwsem ... Browse Code »

Replace the global part of the lglock with a percpu-rwsem.

Since fcl_lock is a spinlock and itself nests under i_lock, which too
is a spinlock we cannot acquire sleeping locks at
locks_{insert,remove}_global_locks().

We can however wrap all fcl_lock acquisitions with percpu_down_read
such that all invocations of locks_{insert,remove}_global_locks() have
that read lock held.

This allows us to replace the lg_global part of the lglock with the
write side of the rwsem.

In the absense of writers, percpu_{down,up}_read() are free of atomic
instructions. This further avoids the very long preempt-disable
regions caused by lglock on larger machines.

Signed-off-by: Peter Zijlstra (Intel)
Cc: Al Viro
Cc: Linus Torvalds
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dave@stgolabs.net
Cc: der.herr@hofr.at
Cc: paulmck@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: tj@kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Peter Zijlstra
2016-09-22 21:25:53 +0800

21 Sep, 2016

2 commits

df04abfd1 fs/proc/kcore.c: Add bounce buffer for ktext data ... Browse Code »

We hit hardened usercopy feature check for kernel text access by reading
kcore file:

usercopy: kernel memory exposure attempt detected from ffffffff8179a01f () (4065 bytes)
kernel BUG at mm/usercopy.c:75!

Bypassing this check for kcore by adding bounce buffer for ktext data.

Reported-by: Steve Best
Fixes: f5509cc18daa ("mm: Hardened usercopy")
Suggested-by: Kees Cook
Signed-off-by: Jiri Olsa
Acked-by: Kees Cook
Signed-off-by: Linus Torvalds

Jiri Olsa
2016-09-21 04:32:49 +0800
f5beeb185 fs/proc/kcore.c: Make bounce buffer global for read ... Browse Code »

Next patch adds bounce buffer for ktext area, so it's
convenient to have single bounce buffer for both
vmalloc/module and ktext cases.

Suggested-by: Linus Torvalds
Signed-off-by: Jiri Olsa
Acked-by: Kees Cook
Signed-off-by: Linus Torvalds

Jiri Olsa
2016-09-21 04:32:49 +0800

20 Sep, 2016

10 commits

63b52c493 Revert "ocfs2: bump up o2cb network protocol version" ... Browse Code »

This reverts commit 38b52efd218b ("ocfs2: bump up o2cb network protocol
version").

This commit made rolling upgrade fail. When one node is upgraded to new
version with this commit, the remaining nodes will fail to establish
connections to it, then the application like VMs on the remaining nodes
can't be live migrated to the upgraded one. This will cause an outage.
Since negotiate hb timeout behavior didn't change without this commit,
so revert it.

Fixes: 38b52efd218bf ("ocfs2: bump up o2cb network protocol version")
Link: http://lkml.kernel.org/r/1471396924-10375-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Joseph Qi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2016-09-20 06:36:17 +0800
d21c353d5 ocfs2: fix start offset to ocfs2_zero_range_for_truncate() ... Browse Code »

If we punch a hole on a reflink such that following conditions are met:

1. start offset is on a cluster boundary
2. end offset is not on a cluster boundary
3. (end offset is somewhere in another extent) or
(hole range > MAX_CONTIG_BYTES(1MB)),

we dont COW the first cluster starting at the start offset. But in this
case, we were wrongly passing this cluster to
ocfs2_zero_range_for_truncate() to zero out. This will modify the
cluster in place and zero it in the source too.

Fix this by skipping this cluster in such a scenario.

To reproduce:

1. Create a random file of say 10 MB
xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
2. Reflink it
reflink -f 10MBfile reflnktest
3. Punch a hole at starting at cluster boundary with range greater that
1MB. You can also use a range that will put the end offset in another
extent.
fallocate -p -o 0 -l 1048615 reflnktest
4. sync
5. Check the first cluster in the source file. (It will be zeroed out).
dd if=10MBfile iflag=direct bs= count=1 | hexdump -C

Link: http://lkml.kernel.org/r/1470957147-14185-1-git-send-email-ashish.samant@oracle.com
Signed-off-by: Ashish Samant
Reported-by: Saar Maoz
Reviewed-by: Srinivas Eeda
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Cc: Joseph Qi
Cc: Eric Ren
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ashish Samant
2016-09-20 06:36:17 +0800
3bb8b653c ocfs2: fix double unlock in case retry after free truncate log ... Browse Code »

If ocfs2_reserve_cluster_bitmap_bits() fails with ENOSPC, it will try to
free truncate log and then retry. Since ocfs2_try_to_free_truncate_log
will lock/unlock global bitmap inode, we have to unlock it before
calling this function. But when retry reserve and it fails with no
global bitmap inode lock taken, it will unlock again in error handling
branch and BUG.

This issue also exists if no need retry and then ocfs2_inode_lock fails.
So fix it.

Fixes: 2070ad1aebff ("ocfs2: retry on ENOSPC if sufficient space in truncate log")
Link: http://lkml.kernel.org/r/57D91939.6030809@huawei.com
Signed-off-by: Joseph Qi
Signed-off-by: Jiufei Xue
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joseph Qi
2016-09-20 06:36:17 +0800
96d41019e fanotify: fix list corruption in fanotify_get_response() ... Browse Code »

fanotify_get_response() calls fsnotify_remove_event() when it finds that
group is being released from fanotify_release() (bypass_perm is set).

However the event it removes need not be only in the group's notification
queue but it can have already moved to access_list (userspace read the
event before closing the fanotify instance fd) which is protected by a
different lock. Thus when fsnotify_remove_event() races with
fanotify_release() operating on access_list, the list can get corrupted.

Fix the problem by moving all the logic removing permission events from
the lists to one place - fanotify_release().

Fixes: 5838d4442bd5 ("fanotify: fix double free of pending permission events")
Link: http://lkml.kernel.org/r/1473797711-14111-3-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara
Reported-by: Miklos Szeredi
Tested-by: Miklos Szeredi
Reviewed-by: Miklos Szeredi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2016-09-20 06:36:17 +0800
12703dbfe fsnotify: add a way to stop queueing events on group shutdown ... Browse Code »

Implement a function that can be called when a group is being shutdown
to stop queueing new events to the group. Fanotify will use this.

Fixes: 5838d4442bd5 ("fanotify: fix double free of pending permission events")
Link: http://lkml.kernel.org/r/1473797711-14111-2-git-send-email-jack@suse.cz
Signed-off-by: Jan Kara
Reviewed-by: Miklos Szeredi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Kara
2016-09-20 06:36:17 +0800
d5bf14189 ocfs2: fix trans extend while free cached blocks ... Browse Code »

The root cause of this issue is the same with the one fixed by the last
patch, but this time credits for allocator inode and group descriptor
may not be consumed before trans extend.

The following error was caught:

WARNING: CPU: 0 PID: 2037 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]()
Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront fb_sys_fops sysimgblt sysfillrect syscopyarea xen_netfront parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 2037 Comm: rm Tainted: G W 4.1.12-37.6.3.el6uek.bug24573128v2.x86_64 #2
Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
Call Trace:
dump_stack+0x48/0x5c
warn_slowpath_common+0x95/0xe0
warn_slowpath_null+0x1a/0x20
start_this_handle+0x4c3/0x510 [jbd2]
jbd2__journal_restart+0x161/0x1b0 [jbd2]
jbd2_journal_restart+0x13/0x20 [jbd2]
ocfs2_extend_trans+0x74/0x220 [ocfs2]
ocfs2_free_cached_blocks+0x16b/0x4e0 [ocfs2]
ocfs2_run_deallocs+0x70/0x270 [ocfs2]
ocfs2_commit_truncate+0x474/0x6f0 [ocfs2]
ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2]
ocfs2_wipe_inode+0x136/0x6a0 [ocfs2]
ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2]
ocfs2_evict_inode+0x28/0x60 [ocfs2]
evict+0xab/0x1a0
iput_final+0xf6/0x190
iput+0xc8/0xe0
do_unlinkat+0x1b7/0x310
SyS_unlinkat+0x22/0x40
system_call_fastpath+0x12/0x71
---[ end trace a62437cb060baa71 ]---
JBD2: rm wants too many credits (149 > 128)

Link: http://lkml.kernel.org/r/1473674623-11810-2-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2016-09-20 06:36:17 +0800
2b0ad0085 ocfs2: fix trans extend while flush truncate log ... Browse Code »

Every time, ocfs2_extend_trans() included a credit for truncate log
inode, but as that inode had been managed by jbd2 running transaction
first time, it will not consume that credit until
jbd2_journal_restart().

Since total credits to extend always included the un-consumed ones,
there will be more and more un-consumed credit, at last
jbd2_journal_restart() will fail due to credit number over the half of
max transction credit.

The following error was caught when unlinking a large file with many
extents:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 13626 at fs/jbd2/transaction.c:269 start_this_handle+0x4c3/0x510 [jbd2]()
Modules linked in: ocfs2 nfsd lockd grace nfs_acl auth_rpcgss sunrpc autofs4 ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue configfs sd_mod sg ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ppdev xen_kbdfront xen_netfront fb_sys_fops sysimgblt sysfillrect syscopyarea parport_pc parport pcspkr i2c_piix4 i2c_core acpi_cpufreq ext4 jbd2 mbcache xen_blkfront floppy pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 13626 Comm: unlink Tainted: G W 4.1.12-37.6.3.el6uek.x86_64 #2
Hardware name: Xen HVM domU, BIOS 4.4.4OVM 02/11/2016
Call Trace:
dump_stack+0x48/0x5c
warn_slowpath_common+0x95/0xe0
warn_slowpath_null+0x1a/0x20
start_this_handle+0x4c3/0x510 [jbd2]
jbd2__journal_restart+0x161/0x1b0 [jbd2]
jbd2_journal_restart+0x13/0x20 [jbd2]
ocfs2_extend_trans+0x74/0x220 [ocfs2]
ocfs2_replay_truncate_records+0x93/0x360 [ocfs2]
__ocfs2_flush_truncate_log+0x13e/0x3a0 [ocfs2]
ocfs2_remove_btree_range+0x458/0x7f0 [ocfs2]
ocfs2_commit_truncate+0x1b3/0x6f0 [ocfs2]
ocfs2_truncate_for_delete+0xbd/0x380 [ocfs2]
ocfs2_wipe_inode+0x136/0x6a0 [ocfs2]
ocfs2_delete_inode+0x2a2/0x3e0 [ocfs2]
ocfs2_evict_inode+0x28/0x60 [ocfs2]
evict+0xab/0x1a0
iput_final+0xf6/0x190
iput+0xc8/0xe0
do_unlinkat+0x1b7/0x310
SyS_unlink+0x16/0x20
system_call_fastpath+0x12/0x71
---[ end trace 28aa7410e69369cf ]---
JBD2: unlink wants too many credits (251 > 128)

Link: http://lkml.kernel.org/r/1473674623-11810-1-git-send-email-junxiao.bi@oracle.com
Signed-off-by: Junxiao Bi
Reviewed-by: Joseph Qi
Cc: Mark Fasheh
Cc: Joel Becker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Junxiao Bi
2016-09-20 06:36:17 +0800
31b4beb47 ipc/shm: fix crash if CONFIG_SHMEM is not set ... Browse Code »

Commit c01d5b300774 ("shmem: get_unmapped_area align huge page") makes
use of shm_get_unmapped_area() in shm_file_operations() unconditional to
CONFIG_MMU.

As Tony Battersby pointed this can lead NULL-pointer dereference on
machine with CONFIG_MMU=y and CONFIG_SHMEM=n. In this case ipc/shm is
backed by ramfs which doesn't provide f_op->get_unmapped_area for
configurations with MMU.

The solution is to provide dummy f_op->get_unmapped_area for ramfs when
CONFIG_MMU=y, which just call current->mm->get_unmapped_area().

Fixes: c01d5b300774 ("shmem: get_unmapped_area align huge page")
Link: http://lkml.kernel.org/r/20160912102704.140442-1-kirill.shutemov@linux.intel.com
Signed-off-by: Kirill A. Shutemov
Reported-by: Tony Battersby
Tested-by: Tony Battersby
Cc: Hugh Dickins
Cc: [4.7.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2016-09-20 06:36:17 +0800
7cbdb4a28 autofs: use dentry flags to block walks during expire ... Browse Code »

Somewhere along the way the autofs expire operation has changed to hold
a spin lock over expired dentry selection. The autofs indirect mount
expired dentry selection is complicated and quite lengthy so it isn't
appropriate to hold a spin lock over the operation.

Commit 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()") added a
might_sleep() to dput() causing a WARN_ONCE() about this usage to be
issued.

But the spin lock doesn't need to be held over this check, the autofs
dentry info. flags are enough to block walks into dentrys during the
expire.

I've left the direct mount expire as it is (for now) because it is much
simpler and quicker than the indirect mount expire and adding spin lock
release and re-aquires would do nothing more than add overhead.

Fixes: 47be61845c77 ("fs/dcache.c: avoid soft-lockup in dput()")
Link: http://lkml.kernel.org/r/20160912014017.1773.73060.stgit@pluto.themaw.net
Signed-off-by: Ian Kent
Reported-by: Takashi Iwai
Tested-by: Takashi Iwai
Cc: Takashi Iwai
Cc: NeilBrown
Cc: Al Viro
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Ian Kent
2016-09-20 06:36:17 +0800
e6f0c6e61 ocfs2/dlm: fix race between convert and migration ... Browse Code »

Commit ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
checks if lockres master has changed to identify whether new master has
finished recovery or not. This will introduce a race that right after
old master does umount ( means master will change), a new convert
request comes.

In this case, it will reset lockres state to DLM_RECOVERING and then
retry convert, and then fail with lockres->l_action being set to
OCFS2_AST_INVALID, which will cause inconsistent lock level between
ocfs2 and dlm, and then finally BUG.

Since dlm recovery will clear lock->convert_pending in
dlm_move_lockres_to_recovery_list, we can use it to correctly identify
the race case between convert and recovery. So fix it.

Fixes: ac7cf246dfdb ("ocfs2/dlm: fix race between convert and recovery")
Link: http://lkml.kernel.org/r/57CE1569.8010704@huawei.com
Signed-off-by: Joseph Qi
Signed-off-by: Jun Piao
Cc: Mark Fasheh
Cc: Joel Becker
Cc: Junxiao Bi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joseph Qi
2016-09-20 06:36:16 +0800

17 Sep, 2016

1 commit

4d2899d73 Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6 ... Browse Code »

Pull cifs fixes from Steve French:
"Small set of cifs fixes"

* 'for-next' of git://git.samba.org/sfrench/cifs-2.6:
Move check for prefix path to within cifs_get_root()
Compare prepaths when comparing superblocks
Fix memory leaks in cifs_do_mount()

Linus Torvalds
2016-09-17 08:09:48 +0800

16 Sep, 2016

3 commits

22f6b4d34 aio: mark AIO pseudo-fs noexec ... Browse Code »

This ensures that do_mmap() won't implicitly make AIO memory mappings
executable if the READ_IMPLIES_EXEC personality flag is set. Such
behavior is problematic because the security_mmap_file LSM hook doesn't
catch this case, potentially permitting an attacker to bypass a W^X
policy enforced by SELinux.

I have tested the patch on my machine.

To test the behavior, compile and run this:

#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include

int main(void) {
personality(READ_IMPLIES_EXEC);
aio_context_t ctx = 0;
if (syscall(__NR_io_setup, 1, &ctx))
err(1, "io_setup");

char cmd[1000];
sprintf(cmd, "cat /proc/%d/maps | grep -F '/[aio]'",
(int)getpid());
system(cmd);
return 0;
}

In the output, "rw-s" is good, "rwxs" is bad.

Signed-off-by: Jann Horn
Signed-off-by: Linus Torvalds

Jann Horn
2016-09-16 06:49:28 +0800
b71dbf103 vfs: cap dedupe request structure size at PAGE_SIZE ... Browse Code »

Kirill A Shutemov reports that the kernel doesn't try to cap dest_count
in any way, and uses the number to allocate kernel memory. This causes
high order allocation warnings in the kernel log if someone passes in a
big enough value. We should clamp the allocation at PAGE_SIZE to avoid
stressing the VM.

The two existing users of the dedupe ioctl never send more than 120
requests, so we can safely clamp dest_range at PAGE_SIZE, because with
4k pages we can handle up to 127 dedupe candidates. Given the max
extent length of 16MB, we can end up doing 2GB of IO which is plenty.

[ Note: the "offsetof()" can't overflow, because 'count' is just a
16-bit integer. That's not obvious in the limited context of the
patch, so I'm noting it here because it made me go look. - Linus ]

Reported-by: "Kirill A. Shutemov"
Signed-off-by: Darrick J. Wong
Signed-off-by: Linus Torvalds

Darrick J. Wong
2016-09-16 04:29:52 +0800
5297e0f0f vfs: fix return type of ioctl_file_dedupe_range ... Browse Code »

All the VFS functions in the dedupe ioctl path return int status, so
the ioctl handler ought to as well.

Found by Coverity, CID 1350952.

Signed-off-by: Darrick J. Wong
Signed-off-by: Linus Torvalds

Darrick J. Wong
2016-09-16 04:29:52 +0800

13 Sep, 2016

1 commit

2c937eb4d Merge tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs ... Browse Code »

Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:

Stable patches:
- We must serialise LAYOUTGET and LAYOUTRETURN to ensure correct
state accounting
- Fix the CREATE_SESSION slot number

Bugfixes:
- sunrpc: fix a UDP memory accounting regression
- NFS: Fix an error reporting regression in nfs_file_write()
- pNFS: Fix further layout stateid issues
- RPC/rdma: Revert 3d4cf35bd4fa ("xprtrdma: Reply buffer
exhaustion...")
- RPC/rdma: Fix receive buffer accounting"

* tag 'nfs-for-4.8-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
NFSv4.1: Fix the CREATE_SESSION slot number accounting
xprtrdma: Fix receive buffer accounting
xprtrdma: Revert 3d4cf35bd4fa ("xprtrdma: Reply buffer exhaustion...")
pNFS: Don't forget the layout stateid if there are outstanding LAYOUTGETs
pNFS: Clear out all layout segments if the server unsets lrp->res.lrs_present
pNFS: Fix pnfs_set_layout_stateid() to clear NFS_LAYOUT_INVALID_STID
pNFS: Ensure LAYOUTGET and LAYOUTRETURN are properly serialised
NFS: Fix error reporting in nfs_file_write()
sunrpc: fix UDP memory accounting

Linus Torvalds
2016-09-13 05:13:45 +0800

12 Sep, 2016

1 commit

b519d408e NFSv4.1: Fix the CREATE_SESSION slot number accounting ... Browse Code »

Ensure that we conform to the algorithm described in RFC5661, section
18.36.4 for when to bump the sequence id. In essence we do it for all
cases except when the RPC call timed out, or in case of the server returning
NFS4ERR_DELAY or NFS4ERR_STALE_CLIENTID.

Signed-off-by: Trond Myklebust
Cc: stable@vger.kernel.org

Trond Myklebust
2016-09-12 02:56:44 +0800

11 Sep, 2016

2 commits

98ac9a608 Merge branch 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm ... Browse Code »

Pull libnvdimm fixes from Dan Williams:
"nvdimm fixes for v4.8, two of them are tagged for -stable:

- Fix devm_memremap_pages() to use track_pfn_insert(). Otherwise,
DAX pmd mappings end up with an uncached pgprot, and unusable
performance for the device-dax interface. The device-dax interface
appeared in 4.7 so this is tagged for -stable.

- Fix a couple VM_BUG_ON() checks in the show_smaps() path to
understand DAX pmd entries. This fix is tagged for -stable.

- Fix a mis-merge of the nfit machine-check handler to flip the
polarity of an if() to match the final version of the patch that
Vishal sent for 4.8-rc1. Without this the nfit machine check
handler never detects / inserts new 'badblocks' entries which
applications use to identify lost portions of files.

- For test purposes, fix the nvdimm_clear_poison() path to operate on
legacy / simulated nvdimm memory ranges. Without this fix a test
can set badblocks, but never clear them on these ranges.

- Fix the range checking done by dax_dev_pmd_fault(). This is not
tagged for -stable since this problem is mitigated by specifying
aligned resources at device-dax setup time.

These patches have appeared in a next release over the past week. The
recent rebase you can see in the timestamps was to drop an invalid fix
as identified by the updated device-dax unit tests [1]. The -mm
touches have an ack from Andrew"

[1]: "[ndctl PATCH 0/3] device-dax test for recent kernel bugs"
https://lists.01.org/pipermail/linux-nvdimm/2016-September/006855.html

* 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm: allow legacy (e820) pmem region to clear bad blocks
nfit, mce: Fix SPA matching logic in MCE handler
mm: fix cache mode of dax pmd mappings
mm: fix show_smap() for zone_device-pmd ranges
dax: fix mapping size check

Linus Torvalds
2016-09-11 00:58:52 +0800
6905732c8 Merge tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 ... Browse Code »

Pull fscrypto fixes fromTed Ts'o:
"Fix some brown-paper-bag bugs for fscrypto, including one one which
allows a malicious user to set an encryption policy on an empty
directory which they do not own"

* tag 'for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
fscrypto: require write access to mount to set encryption policy
fscrypto: only allow setting encryption policy on directories
fscrypto: add authorization check for setting encryption policy

Linus Torvalds
2016-09-11 00:18:33 +0800

10 Sep, 2016

10 commits

ba63f23d6 fscrypto: require write access to mount to set encryption policy ... Browse Code »

Since setting an encryption policy requires writing metadata to the
filesystem, it should be guarded by mnt_want_write/mnt_drop_write.
Otherwise, a user could cause a write to a frozen or readonly
filesystem. This was handled correctly by f2fs but not by ext4. Make
fscrypt_process_policy() handle it rather than relying on the filesystem
to get it right.

Signed-off-by: Eric Biggers
Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
Signed-off-by: Theodore Ts'o
Acked-by: Jaegeuk Kim

Eric Biggers
2016-09-10 13:18:57 +0800
348c1bfa8 Move check for prefix path to within cifs_get_root() ... Browse Code »

Signed-off-by: Sachin Prabhu
Tested-by: Aurelien Aptel
Signed-off-by: Steve French

Sachin Prabhu
2016-09-10 12:58:07 +0800
c1d8b24d1 Compare prepaths when comparing superblocks ... Browse Code »

The patch
fs/cifs: make share unaccessible at root level mountable
makes use of prepaths when any component of the underlying path is
inaccessible.

When mounting 2 separate shares having different prepaths but are other
wise similar in other respects, we end up sharing superblocks when we
shouldn't be doing so.

Signed-off-by: Sachin Prabhu
Tested-by: Aurelien Aptel
Signed-off-by: Steve French

Sachin Prabhu
2016-09-10 12:58:06 +0800
4214ebf46 Fix memory leaks in cifs_do_mount() ... Browse Code »

Fix memory leaks introduced by the patch
fs/cifs: make share unaccessible at root level mountable

Also move allocation of cifs_sb->prepath to cifs_setup_cifs_sb().

Signed-off-by: Sachin Prabhu
Tested-by: Aurelien Aptel
Signed-off-by: Steve French

Sachin Prabhu
2016-09-10 12:58:06 +0800
002ced4be fscrypto: only allow setting encryption policy on directories ... Browse Code »

The FS_IOC_SET_ENCRYPTION_POLICY ioctl allowed setting an encryption
policy on nondirectory files. This was unintentional, and in the case
of nonempty regular files did not behave as expected because existing
data was not actually encrypted by the ioctl.

In the case of ext4, the user could also trigger filesystem errors in
->empty_dir(), e.g. due to mismatched "directory" checksums when the
kernel incorrectly tried to interpret a regular file as a directory.

This bug affected ext4 with kernels v4.8-rc1 or later and f2fs with
kernels v4.6 and later. It appears that older kernels only permitted
directories and that the check was accidentally lost during the
refactoring to share the file encryption code between ext4 and f2fs.

This patch restores the !S_ISDIR() check that was present in older
kernels.

Signed-off-by: Eric Biggers
Cc: stable@vger.kernel.org
Signed-off-by: Theodore Ts'o

Eric Biggers
2016-09-10 11:38:12 +0800
163ae1c6a fscrypto: add authorization check for setting encryption policy ... Browse Code »

On an ext4 or f2fs filesystem with file encryption supported, a user
could set an encryption policy on any empty directory(*) to which they
had readonly access. This is obviously problematic, since such a
directory might be owned by another user and the new encryption policy
would prevent that other user from creating files in their own directory
(for example).

Fix this by requiring inode_owner_or_capable() permission to set an
encryption policy. This means that either the caller must own the file,
or the caller must have the capability CAP_FOWNER.

(*) Or also on any regular file, for f2fs v4.6 and later and ext4
v4.8-rc1 and later; a separate bug fix is coming for that.

Signed-off-by: Eric Biggers
Cc: stable@vger.kernel.org # 4.1+; check fs/{ext4,f2fs}
Signed-off-by: Theodore Ts'o

Eric Biggers
2016-09-10 11:37:14 +0800
ca120cf68 mm: fix show_smap() for zone_device-pmd ranges ... Browse Code »

Attempting to dump /proc//smaps for a process with pmd dax mappings
currently results in the following VM_BUG_ONs:

kernel BUG at mm/huge_memory.c:1105!
task: ffff88045f16b140 task.stack: ffff88045be14000
RIP: 0010:[] [] follow_trans_huge_pmd+0x2cb/0x340
[..]
Call Trace:
[] smaps_pte_range+0xa0/0x4b0
[] ? vsnprintf+0x255/0x4c0
[] __walk_page_range+0x1fe/0x4d0
[] walk_page_vma+0x62/0x80
[] show_smap+0xa6/0x2b0

kernel BUG at fs/proc/task_mmu.c:585!
RIP: 0010:[] [] smaps_pte_range+0x499/0x4b0
Call Trace:
[] ? vsnprintf+0x255/0x4c0
[] __walk_page_range+0x1fe/0x4d0
[] walk_page_vma+0x62/0x80
[] show_smap+0xa6/0x2b0

These locations are sanity checking page flags that must be set for an
anonymous transparent huge page, but are not set for the zone_device
pages associated with dax mappings.

Cc: Ross Zwisler
Cc: Kirill A. Shutemov
Acked-by: Andrew Morton
Signed-off-by: Dan Williams

Dan Williams
2016-09-10 08:34:45 +0800
6dc728ccd Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse ... Browse Code »

Pull fuse fix from Miklos Szeredi:
"This fixes a deadlock when fuse, direct I/O and loop device are
combined"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: direct-io: don't dirty ITER_BVEC pages

Linus Torvalds
2016-09-10 04:00:41 +0800
5c44ad6a3 Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs ... Browse Code »

Pull overlayfs fix from Miklos Szeredi:
"This fixes a regression caused by the last pull request"

* 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
ovl: fix workdir creation

Linus Torvalds
2016-09-10 03:56:28 +0800
f4a9c169c Merge branch 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs ... Browse Code »

Pull btrfs fixes from Chris Mason:
"I'm not proud of how long it took me to track down that one liner in
btrfs_sync_log(), but the good news is the patches I was trying to
blame for these problems were actually fine (sorry Filipe)"

* 'for-linus-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs:
btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress
btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns
btrfs: do not decrease bytes_may_use when replaying extents

Linus Torvalds
2016-09-10 03:52:31 +0800

08 Sep, 2016

1 commit

b7f3c7d34 Merge branch 'for-chris' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/… ... Browse Code »

…linux into for-linus-4.8

Chris Mason
2016-09-08 03:55:36 +0800

06 Sep, 2016

2 commits

ce129655c btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress ... Browse Code »

In btrfs_async_reclaim_metadata_space(), we use ticket's address to
determine whether asynchronous metadata reclaim work is making progress.

ticket = list_first_entry(&space_info->tickets,
struct reserve_ticket, list);
if (last_ticket == ticket) {
flush_state++;
} else {
last_ticket = ticket;
flush_state = FLUSH_DELAYED_ITEMS_NR;
if (commit_cycles)
commit_cycles--;
}

But indeed it's wrong, we should not rely on local variable's address to
do this check, because addresses may be same. In my test environment, I
dd one 168MB file in a 256MB fs, found that for this file, every time
wait_reserve_ticket() called, local variable ticket's address is same,

For above codes, assume a previous ticket's address is addrA, last_ticket
is addrA. Btrfs_async_reclaim_metadata_space() finished this ticket and
wake up it, then another ticket is added, but with the same address addrA,
now last_ticket will be same to current ticket, then current ticket's flush
work will start from current flush_state, not initial FLUSH_DELAYED_ITEMS_NR,
which may result in some enospc issues(I have seen this in my test machine).

Signed-off-by: Wang Xiaoguang
Reviewed-by: Josef Bacik
Signed-off-by: David Sterba

Wang Xiaoguang
2016-09-06 22:31:43 +0800
cbd60aa7c Btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns ... Browse Code »

We use a btrfs_log_ctx structure to pass information into the
tree log commit, and get error values out. It gets added to a per
log-transaction list which we walk when things go bad.

Commit d1433debe added an optimization to skip waiting for the log
commit, but didn't take root_log_ctx out of the list. This
patch makes sure we remove things before exiting.

Signed-off-by: Chris Mason
Fixes: d1433debe7f4346cf9fc0dafc71c3137d2a97bc4
cc: stable@vger.kernel.org # 3.15+

Chris Mason
2016-09-06 20:57:25 +0800

05 Sep, 2016

4 commits

ed7a69483 btrfs: do not decrease bytes_may_use when replaying extents ... Browse Code »

When replaying extents, there is no need to update bytes_may_use
in btrfs_alloc_logged_file_extent(), otherwise it'll trigger a
WARN_ON about bytes_may_use.

Fixes: ("btrfs: update btrfs_space_info's bytes_may_use timely")
Signed-off-by: Wang Xiaoguang
Reviewed-by: Josef Bacik
Signed-off-by: David Sterba

Wang Xiaoguang
2016-09-05 23:40:41 +0800
0f5aa88a7 ceph: do not modify fi->frag in need_reset_readdir() ... Browse Code »

Commit f3c4ebe65ea1 ("ceph: using hash value to compose dentry offset")
modified "if (fpos_frag(new_pos) != fi->frag)" to "if (fi->frag |=
fpos_frag(new_pos))" in need_reset_readdir(), thus replacing a
comparison operator with an assignment one.

This looks like a typo which is reported by clang when building the
kernel with some warning flags:

fs/ceph/dir.c:600:22: error: using the result of an assignment as a
condition without parentheses [-Werror,-Wparentheses]
} else if (fi->frag |= fpos_frag(new_pos)) {
~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
fs/ceph/dir.c:600:22: note: place parentheses around the assignment
to silence this warning
} else if (fi->frag |= fpos_frag(new_pos)) {
^
( )
fs/ceph/dir.c:600:22: note: use '!=' to turn this compound
assignment into an inequality comparison
} else if (fi->frag |= fpos_frag(new_pos)) {
^~
!=

Fixes: f3c4ebe65ea1 ("ceph: using hash value to compose dentry offset")
Signed-off-by: Nicolas Iooss
Signed-off-by: Ilya Dryomov

Nicolas Iooss
2016-09-05 20:30:35 +0800
e1ff3dd1a ovl: fix workdir creation ... Browse Code »

Workdir creation fails in latest kernel.

Fix by allowing EOPNOTSUPP as a valid return value from
vfs_removexattr(XATTR_NAME_POSIX_ACL_*). Upper filesystem may not support
ACL and still be perfectly able to support overlayfs.

Reported-by: Martin Ziegler
Signed-off-by: Miklos Szeredi
Fixes: c11b9fdd6a61 ("ovl: remove posix_acl_default from workdir")
Cc:

Miklos Szeredi
2016-09-05 19:55:20 +0800
334a8f371 pNFS: Don't forget the layout stateid if there are outstanding LAYOUTGETs ... Browse Code »

If there are outstanding LAYOUTGET rpc calls, then we want to ensure that
we keep the layout stateid around so we that don't inadvertently pick up
an old/misordered sequence id.
The race is as follows:

Client Server
====== ======
LAYOUTGET(seqid)
LAYOUTGET(seqid)
return LAYOUTGET(seqid+1)
return LAYOUTGET(seqid+2)
process LAYOUTGET(seqid+2)
forget layout
process LAYOUTGET(seqid+1)

If it forgets the layout stateid before processing seqid+1, then
the client will not check the layout->plh_barrier, and so will set
the stateid with seqid+1.

Signed-off-by: Trond Myklebust

Trond Myklebust
2016-09-05 00:59:00 +0800