26 Jan, 2018

2 commits


25 Jan, 2018

1 commit

  • In fixing the readdir+pagefault deadlock I accidentally introduced a
    stale entry regression in readdir. If we get close to full for the
    temporary buffer, and then skip a few delayed deletions, and then try to
    add another entry that won't fit, we will emit the entries we found and
    retry. Unfortunately we delete entries from our del_list as we find
    them, assuming we won't need them. However our pos will be with
    whatever our last entry was, which could be before the delayed deletions
    we skipped, so the next search will add the deleted entries back into
    our readdir buffer. So instead don't delete entries we find in our
    del_list so we can make sure we always find our delayed deletions. This
    is a slight perf hit for readdir with lots of pending deletions, but
    hopefully this isn't a common occurrence. If it is we can revisit this
    and optimize it.

    cc: stable@vger.kernel.org
    Fixes: 23b5ec74943f ("btrfs: fix readdir deadlock with pagefault")
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba

    Josef Bacik
     

23 Jan, 2018

3 commits

  • Commit bdcf0a423ea1 ("kernel: make groups_sort calling a responsibility
    group_info allocators") appears to break nfsd rootsquash in a pretty
    major way.

    It adds a call to groups_sort() inside the loop that copies/squashes
    gids, which means the valid gids are sorted along with the following
    garbage. The net result is that the highest numbered valid gids are
    replaced with any lower-valued garbage gids, possibly including 0.

    We should sort only once, after filling in all the gids.
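
The "sort only once" point can be illustrated with a small userspace sketch. It is not the nfsd code: gid_sketch_t, cmp_gid() and squash_and_sort() are hypothetical names, and plain unsigned integers stand in for kernel gids.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned int gid_sketch_t;   /* stand-in for a kernel gid (assumption) */

static int cmp_gid(const void *a, const void *b)
{
    gid_sketch_t x = *(const gid_sketch_t *)a;
    gid_sketch_t y = *(const gid_sketch_t *)b;

    return (x > y) - (x < y);
}

/*
 * Copy n gids from src to dst, replacing gid 0 with anon_gid, then sort
 * once after the loop so that only initialized entries are ever sorted.
 */
static void squash_and_sort(gid_sketch_t *dst, const gid_sketch_t *src,
                            size_t n, gid_sketch_t anon_gid)
{
    assert(n == 0 || (dst != NULL && src != NULL));

    for (size_t i = 0; i < n; i++)
        dst[i] = (src[i] == 0) ? anon_gid : src[i];

    /* Sorting inside the loop would also move dst[i+1..n-1], which is
     * still uninitialized at that point. */
    qsort(dst, n, sizeof(dst[0]), cmp_gid);
}
```

Calling squash_and_sort() on {1000, 0, 50, 7} with an anonymous gid of 65534 leaves the destination sorted with the squashed entry last.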

    Fixes: bdcf0a423ea1 ("kernel: make groups_sort calling a responsibility ...")
    Signed-off-by: Ben Hutchings
    Acked-by: J. Bruce Fields
    Signed-off-by: Linus Torvalds

    Ben Hutchings
     
  • In orangefs_devreq_read, there is a loop which picks an op off the list
    of pending ops. If the loop fails to find an op, there is nothing to
    read, and it returns EAGAIN. If the op has been given up on, the loop
    is restarted via a goto. The bug is that the variable which the found
    op is written to is not reinitialized, so if there are no more eligible
    ops on the list, the code runs again on the already handled op.

    This is triggered by interrupting a process while the op is being copied
    to the client-core. It's a fairly small window, but it's there.
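
A minimal sketch of the pattern, assuming a simplified array of ops rather than the real orangefs list; struct op_sketch and next_op() are hypothetical. The point is only that the "found" pointer has to be reset each time the search restarts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified op; not the orangefs structure. */
struct op_sketch {
    int pending;    /* still waiting to be read */
    int given_up;   /* the caller abandoned it */
};

/*
 * Return the next op to hand out, or NULL when there is nothing to read
 * (the case the commit message describes as returning EAGAIN).
 */
static struct op_sketch *next_op(struct op_sketch *ops, size_t n)
{
    struct op_sketch *cur;

    assert(ops != NULL || n == 0);
restart:
    cur = NULL;                 /* reset on every pass, including restarts */
    for (size_t i = 0; i < n; i++) {
        if (ops[i].pending) {
            cur = &ops[i];
            break;
        }
    }
    if (cur == NULL)
        return NULL;

    cur->pending = 0;           /* take it off the pending set */
    if (cur->given_up)
        goto restart;           /* without the reset, a stale cur survives */
    return cur;
}
```

If the assignment after the restart label is moved above the label, a restart that finds no eligible op carries on with the op it already handled, which is the bug described above.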

    Signed-off-by: Martin Brandenburg
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Martin Brandenburg
     
  • set_op_state_purged can delete the op.

    Signed-off-by: Martin Brandenburg
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Martin Brandenburg
     

20 Jan, 2018

1 commit

  • do_task_stat() accesses IP and SP of a task without bumping reference
    count of a stack (which became an entity with independent lifetime at
    some point).

    Steps to reproduce:

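    #include <sys/types.h>
    #include <sys/stat.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <sys/wait.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* Disable core dumps so the crashing children exit quickly. */
        setrlimit(RLIMIT_CORE, &(struct rlimit){});

        while (1) {
            char buf[64];
            char buf2[4096];
            pid_t pid;
            int fd;

            pid = fork();
            if (pid == 0) {
                /* Child: crash immediately with a NULL write. */
                *(volatile int *)0 = 0;
            }

            /* Parent: race a read of /proc/PID/stat against the exit. */
            snprintf(buf, sizeof(buf), "/proc/%u/stat", pid);
            fd = open(buf, O_RDONLY);
            read(fd, buf2, sizeof(buf2));
            close(fd);

            waitpid(pid, NULL, 0);
        }
        return 0;
    }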

    BUG: unable to handle kernel paging request at 0000000000003fd8
    IP: do_task_stat+0x8b4/0xaf0
    PGD 800000003d73e067 P4D 800000003d73e067 PUD 3d558067 PMD 0
    Oops: 0000 [#1] PREEMPT SMP PTI
    CPU: 0 PID: 1417 Comm: a.out Not tainted 4.15.0-rc8-dirty #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc27 04/01/2014
    RIP: 0010:do_task_stat+0x8b4/0xaf0
    Call Trace:
    proc_single_show+0x43/0x70
    seq_read+0xe6/0x3b0
    __vfs_read+0x1e/0x120
    vfs_read+0x84/0x110
    SyS_read+0x3d/0xa0
    entry_SYSCALL_64_fastpath+0x13/0x6c
    RIP: 0033:0x7f4d7928cba0
    RSP: 002b:00007ffddb245158 EFLAGS: 00000246
    Code: 03 b7 a0 01 00 00 4c 8b 4c 24 70 4c 8b 44 24 78 4c 89 74 24 18 e9 91 f9 ff ff f6 45 4d 02 0f 84 fd f7 ff ff 48 8b 45 40 48 89 ef 8b 80 d8 3f 00 00 48 89 44 24 20 e8 9b 97 eb ff 48 89 44 24
    RIP: do_task_stat+0x8b4/0xaf0 RSP: ffffc90000607cc8
    CR2: 0000000000003fd8

    John Ogness said: for my tests I added an else case to verify that the
    race is hit and correctly mitigated.

    Link: http://lkml.kernel.org/r/20180116175054.GA11513@avx2
    Signed-off-by: Alexey Dobriyan
    Reported-by: "Kohli, Gaurav"
    Tested-by: John Ogness
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Oleg Nesterov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

07 Jan, 2018

1 commit


06 Jan, 2018

2 commits

  • Pull btrfs fixes from David Sterba:
    "We have two more fixes for 4.15, both aimed for stable.

    The leak fix is obvious, the second patch fixes a bug revealed by the
    refcount API, when it behaves differently than previous atomic_t and
    reports refs going from 0 to 1 in one case"

    * tag 'for-4.15-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
    btrfs: Fix flush bio leak

    Linus Torvalds
     
  • Pull XFS fixes from Darrick Wong:
    "I have just a few fixes for bugs and resource cleanup problems this
    week:

    - Fix resource cleanup of failed quota initialization

    - Fix integer overflow problems wrt s_maxbytes"

    * tag 'xfs-4.15-fixes-10' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: fix s_maxbytes overflow problems
    xfs: quota: check result of register_shrinker()
    xfs: quota: fix missed destroy of qi_tree_lock

    Linus Torvalds
     

05 Jan, 2018

1 commit

  • The previous fix in commit 384632e67e08 ("userfaultfd: non-cooperative:
    fix fork use after free") corrected the refcounting in case of
    UFFD_EVENT_FORK failure for the fork userfault paths.

    That still didn't clear the vma->vm_userfaultfd_ctx of the vmas that
    were set to point to the aborted new uffd ctx earlier in
    dup_userfaultfd.

    Link: http://lkml.kernel.org/r/20171223002505.593-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: syzbot
    Reviewed-by: Mike Rapoport
    Cc: Eric Biggers
    Cc: Dmitry Vyukov
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

04 Jan, 2018

2 commits

  • Pull afs/fscache fixes from David Howells:

    - Fix the default return of fscache_maybe_release_page() when a cache
    isn't in use - it prevents a filesystem from releasing pages. This
    can cause a system to OOM.

    - Fix a potential uninitialised variable in AFS.

    - Fix AFS unlink's handling of the nlink count. It needs to use the
    nlink manipulation functions so that inode structs of deleted inodes
    actually get scheduled for destruction.

    - Fix error handling in afs_write_end() so that the page gets unlocked
    and put if we can't fill the unwritten portion.

    * 'afs-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    afs: Fix missing error handling in afs_write_end()
    afs: Fix unlink
    afs: Potential uninitialized variable in afs_extract_data()
    fscache: Fix the default for fscache_maybe_release_page()

    Linus Torvalds
     
  • This is a logical revert of commit e37fdb785a5f ("exec: Use secureexec
    for setting dumpability")

    This weakens dumpability back to checking only for uid/gid changes in
    current (which is useless), but userspace depends on dumpability not
    being tied to secureexec.

    https://bugzilla.redhat.com/show_bug.cgi?id=1528633

    Reported-by: Tom Horsley
    Fixes: e37fdb785a5f ("exec: Use secureexec for setting dumpability")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     

03 Jan, 2018

5 commits

  • Fix some integer overflow problems if offset + count happen to be large
    enough to cause an integer overflow.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • xfs_qm_init_quotainfo() does not check result of register_shrinker()
    which was tagged as __must_check recently, reported by sparse.

    Signed-off-by: Aliaksei Karaliou
    [darrick: move xfs_qm_destroy_quotainos nearer xfs_qm_init_quotainos]
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Aliaksei Karaliou
     
  • xfs_qm_destroy_quotainfo() does not destroy quotainfo->qi_tree_lock
    while it does destroy quotainfo->qi_quotaofflock.

    Signed-off-by: Aliaksei Karaliou
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Aliaksei Karaliou
     
  • refcounts have a generic implementation and an asm optimized one. The
    generic version has extra debugging to make sure that once a refcount
    goes to zero, refcount_inc won't increase it.

    The btrfs delayed inode code wasn't expecting this, and we're tripping
    over the warnings when the generic refcounts are used. We ended up with
    this race:

    Process A                             Process B
    btrfs_get_delayed_node()
      spin_lock(root->inode_lock)
      radix_tree_lookup()
                                          __btrfs_release_delayed_node()
                                          refcount_dec_and_test(&delayed_node->refs)
                                          our refcount is now zero
      refcount_add(2) <-- warning here,
      refcount unchanged
                                          spin_lock(root->inode_lock)
                                          radix_tree_delete()

    With the generic refcounts, we actually warn again when process B above
    tries to release his refcount because refcount_add() turned into a
    no-op.

    We saw this in production on older kernels without the asm optimized
    refcounts.

    The fix used here is to use refcount_inc_not_zero() to detect when the
    object is in the middle of being freed and return NULL. This is almost
    always the right answer anyway, since we usually end up pitching the
    delayed_node if it didn't have fresh data in it.

    This also changes __btrfs_release_delayed_node() to remove the extra
    check for zero refcounts before radix tree deletion.
    btrfs_get_delayed_node() was the only path that was allowing refcounts
    to go from zero to one.

    Fixes: 6de5f18e7b0da ("btrfs: fix refcount_t usage when deleting btrfs_delayed_node")
    CC: # 4.12+
    Signed-off-by: Chris Mason
    Reviewed-by: Liu Bo
    Signed-off-by: David Sterba

    Chris Mason
     
  • Commit e0ae99941423 ("btrfs: preallocate device flush bio") reworked
    the way the flush bio is allocated and used. Concretely it allocates
    the bio in __alloc_device and then re-uses it multiple times with a
    very simple endio routine that just calls complete() without consuming
    a reference. Allocated bios by default come with a ref count of 1,
    which is then consumed by the endio routine (or not, in which case they
    should be bio_put by the caller). The way the implementation works now
    is that the flush bio has a refcount of 2 and we only ever bio_put it
    once, leaving it to hang indefinitely. Fix this by removing the extra
    bio_get in __alloc_device.

    Fixes: e0ae99941423 ("btrfs: preallocate device flush bio")
    Signed-off-by: Nikolay Borisov
    Reviewed-by: Liu Bo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Nikolay Borisov
     

02 Jan, 2018

3 commits

  • afs_write_end() is missing page unlock and put if afs_fill_page() fails.

    Reported-by: Al Viro
    Signed-off-by: David Howells

    David Howells
     
  • Repeating creation and deletion of a file on an afs mount will run the box
    out of memory, e.g.:

    dd if=/dev/zero of=/afs/scratch/m0 bs=$((1024*1024)) count=512
    rm /afs/scratch/m0

    The problem seems to be that it's not properly decrementing the nlink count
    so that the inode can be scrapped.

    Note that this doesn't fix local creation followed by remote deletion.
    That's harder to handle and will require a separate patch as we're not told
    that the file has been deleted - only that the directory has changed.

    Reported-by: Marc Dionne
    Signed-off-by: David Howells

    David Howells
     
  • Smatch warns that:

    fs/afs/rxrpc.c:922 afs_extract_data()
    error: uninitialized symbol 'remote_abort'.

    Smatch is right that "remote_abort" might be uninitialized when we pass
    it to afs_set_call_complete(). I don't know if that function uses the
    uninitialized variable. Anyway, the comment for rxrpc_kernel_recv_data(),
    says that "*_abort should also be initialised to 0." and this patch does
    that.

    Signed-off-by: Dan Carpenter
    Signed-off-by: David Howells

    Dan Carpenter
     

23 Dec, 2017

1 commit

  • Pull xfs fixes from Darrick Wong:
    "Here are some XFS fixes for 4.15-rc5. Apologies for the unusually
    large number of patches this late, but I wanted to make sure the
    corruption fixes were really ready to go.

    Changes since last update:

    - Fix a locking problem during xattr block conversion that could lead
    the log checkpointing thread to try to write an incomplete
    buffer to disk, which leads to a corruption shutdown

    - Fix a null pointer dereference when removing delayed allocation
    extents

    - Remove post-eof speculative allocations when reflinking a block
    past current inode size so that we don't just leave them there and
    assert on inode reclaim

    - Relax an assert which didn't accurately reflect the way locking
    works and would trigger under heavy io load

    - Avoid infinite loop when cancelling copy on write extents after a
    writeback failure

    - Try to avoid copy on write transaction reservation overflows when
    remapping after a successful write

    - Fix various problems with the copy-on-write reservation automatic
    garbage collection not being cleaned up properly during a ro
    remount

    - Fix problems with rmap log items being processed in the wrong
    order, leading to corruption shutdowns

    - Fix problems with EFI recovery wherein the "remove any rmapping if
    present" mechanism wasn't actually doing anything, which would lead
    to corruption problems later when the extent is reallocated,
    leading to multiple rmaps for the same extent"

    * tag 'xfs-4.15-fixes-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: only skip rmap owner checks for unknown-owner rmap removal
    xfs: always honor OWN_UNKNOWN rmap removal requests
    xfs: queue deferred rmap ops for cow staging extent alloc/free in the right order
    xfs: set cowblocks tag for direct cow writes too
    xfs: remove leftover CoW reservations when remounting ro
    xfs: don't be so eager to clear the cowblocks tag on truncate
    xfs: track cowblocks separately in i_flags
    xfs: allow CoW remap transactions to use reserve blocks
    xfs: avoid infinite loop when cancelling CoW blocks after writeback failure
    xfs: relax is_reflink_inode assert in xfs_reflink_find_cow_mapping
    xfs: remove dest file's post-eof preallocations before reflinking
    xfs: move xfs_iext_insert tracepoint to report useful information
    xfs: account for null transactions in bunmapi
    xfs: hold xfs_buf locked between shortform->leaf conversion and the addition of an attribute
    xfs: add the ability to join a held buffer to a defer_ops

    Linus Torvalds
     

22 Dec, 2017

6 commits

  • For rmap removal, refactor the rmap owner checks into a separate
    function, then skip the checks if we are performing an unknown-owner
    removal.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Calling xfs_rmap_free with an unknown owner is supposed to remove any
    rmaps covering that range regardless of owner. This is used by the EFI
    recovery code to say "we're freeing this, it mustn't be owned by
    anything anymore", but for whatever reason xfs_free_ag_extent filters
    them out.

    Therefore, remove the filter and make xfs_rmap_unmap actually treat it
    as a wildcard owner -- free anything that's already there, and if
    there's no owner at all then that's fine too.

    There are two existing callers of bmap_add_free that take care of the rmap
    deferred ops themselves and use OWN_UNKNOWN to skip the EFI-based rmap
    cleanup; convert these to use OWN_NULL (via helpers), and now we really
    require that an RUI (if any) gets added to the defer ops before any EFI.

    Lastly, now that xfs_free_extent filters out OWN_NULL rmap free requests,
    growfs will have to consult directly with the rmap to ensure that there
    aren't any rmaps in the grown region.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Under the deferred rmap operation scheme, there's a certain order in
    which the rmap deferred ops have to be queued to maintain integrity
    during log replay. For alloc/map operations that order is cui -> rui;
    for free/unmap operations that order is cui -> rui -> efi. However, the
    initial refcount code got the ordering wrong in the free side of things
    because it queued a refcount free op and an EFI, and the refcount free op
    queued an rmap free op, resulting in the order cui -> efi -> rui.

    If we fail before the efd finishes, the efi recovery will try to do a
    wildcard rmap removal and the subsequent rui will fail to find the rmap
    and blow up. This didn't ever happen due to other screw ups in handling
    unknown owner rmap removals, but those other screw ups broke recovery in
    other ways, so fix the ordering to follow the intended rules.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • If a user performs a direct CoW write, we end up loading the CoW fork
    with preallocated extents. Therefore, we must set the cowblocks tag so
    that they can be cleared out if we run low on space.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • When we're remounting the filesystem readonly, remove all CoW
    preallocations prior to going ro. If the fs goes down after the ro
    remount, we never clean up the staging extents, which means xfs_check
    will trip over them on a subsequent run. Practically speaking, the next
    mount will clean them up too, so this is unlikely to be seen. Since we
    shut down the cowblocks cleaner on remount-ro, we also have to make sure
    we start it back up if/when we remount-rw.

    Found by adding clonerange to fsstress and running xfs/017.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     
  • Currently, xfs_itruncate_extents clears the cowblocks tag if i_cnextents
    is zero. This is wrong, since i_cnextents only tracks real extents in
    the CoW fork, which means that we could have some delayed CoW
    reservations still in there that will now never get cleaned.

    Fix a further bug where we /don't/ clear the reflink iflag if there are
    any attribute blocks -- really, it's only safe to clear the reflink flag
    if there are no data fork extents and no cow fork extents.

    Found by adding clonerange to fsstress in xfs/017.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

21 Dec, 2017

1 commit

  • The EOFBLOCKS/COWBLOCKS tags are totally separate things, so track them
    with separate i_flags. Right now we're abusing IEOFBLOCKS for both,
    which is totally bogus because we won't tag the inode with COWBLOCKS if
    IEOFBLOCKS was set by a previous tagging of the inode with EOFBLOCKS.
    Found by wiring up clonerange to fsstress in xfs/017.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Reviewed-by: Christoph Hellwig

    Darrick J. Wong
     

19 Dec, 2017

1 commit


18 Dec, 2017

4 commits

  • This reverts commit 04e35f4495dd560db30c25efca4eecae8ec8c375.

    SELinux runs with secureexec for all non-"noatsecure" domain transitions,
    which means lots of processes end up hitting the stack hard-limit change
    that was introduced in order to fix a race with prlimit(). That race fix
    will need to be redesigned.

    Reported-by: Laura Abbott
    Reported-by: Tomáš Trnka
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • With CONFIG_MTD=m and CONFIG_CRAMFS=y, we now get a link failure:

    fs/cramfs/inode.o: In function `cramfs_mount':
    inode.c:(.text+0x220): undefined reference to `mount_mtd'
    fs/cramfs/inode.o: In function `cramfs_mtd_fill_super':
    inode.c:(.text+0x6d8): undefined reference to `mtd_point'
    inode.c:(.text+0xae4): undefined reference to `mtd_unpoint'

    This adds a more specific Kconfig dependency to avoid the broken
    configuration.

    Alternatively we could make CRAMFS itself depend on "MTD || !MTD" with a
    similar result.

    Fixes: 99c18ce580c6 ("cramfs: direct memory access support")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Nicolas Pitre
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     
  • Pull vfs fixes from Al Viro:
    "The alloc_super() one is a regression in this merge window, lazytime
    thing is older..."

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    VFS: Handle lazytime in do_mount()
    alloc_super(): do ->s_umount initialization earlier

    Linus Torvalds
     
  • Pull ext4 fixes from Ted Ts'o:
    "Fix a regression which caused us to fail to interpret symlinks in very
    ancient ext3 file system images.

    Also fix two xfstests failures, one of which could cause an OOPS, plus
    an additional bug fix caught by fuzz testing"

    * tag 'ext4_for_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: fix crash when a directory's i_size is too small
    ext4: add missing error check in __ext4_new_inode()
    ext4: fix fdatasync(2) after fallocate(2) operation
    ext4: support fast symlinks from ext3 file systems

    Linus Torvalds
     

17 Dec, 2017

1 commit

  • Pull NFS client fixes from Anna Schumaker:
    "This has two stable bugfixes, one to fix a BUG_ON() when
    nfs_commit_inode() is called with no outstanding commit requests and
    another to fix a race in the SUNRPC receive codepath.

    Additionally, there are also fixes for an NFS client deadlock and an
    xprtrdma performance regression.

    Summary:

    Stable bugfixes:
    - NFS: Avoid a BUG_ON() in nfs_commit_inode() by not waiting for a
    commit in the case that there were no commit requests.
    - SUNRPC: Fix a race in the receive code path

    Other fixes:
    - NFS: Fix a deadlock in nfs client initialization
    - xprtrdma: Fix a performance regression for small IOs"

    * tag 'nfs-for-4.15-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
    SUNRPC: Fix a race in the receive code path
    nfs: don't wait on commit in nfs_commit_inode() if there were no commit requests
    xprtrdma: Spread reply processing over more CPUs
    nfs: fix a deadlock in nfs client initialization

    Linus Torvalds
     

16 Dec, 2017

5 commits

  • This reverts commits 5c9d2d5c269c, c7da82b894e9, and e7fe7b5cae90.

    We'll probably need to revisit this, but basically we should not
    complicate the get_user_pages_fast() case, and checking the actual page
    table protection key bits will require more care anyway, since the
    protection keys depend on the exact state of the VM in question.

    Particularly when doing a "remote" page lookup (ie in somebody else's VM,
    not your own), you need to be much more careful than this was. Dave
    Hansen says:

    "So, the underlying bug here is that we now do a get_user_pages_remote()
    and then go ahead and do the p*_access_permitted() checks against the
    current PKRU. This was introduced recently with the addition of the
    new p??_access_permitted() calls.

    We have checks in the VMA path for the "remote" gups and we avoid
    consulting PKRU for them. This got missed in the pkeys selftests
    because I did a ptrace read, but not a *write*. I also didn't
    explicitly test it against something where a COW needed to be done"

    It's also not entirely clear that it makes sense to check the protection
    key bits at this level at all. But one possible eventual solution is to
    make the get_user_pages_fast() case just abort if it sees protection key
    bits set, which makes us fall back to the regular get_user_pages() case,
    which then has a vma and can do the check there if we want to.

    We'll see.

    Somewhat related to this all: what we _do_ want to do some day is to
    check the PAGE_USER bit - it should obviously always be set for user
    pages, but it would be a good check to have back. Because we have no
    generic way to test for it, we lost it as part of moving over from the
    architecture-specific x86 GUP implementation to the generic one in
    commit e585513b76f7 ("x86/mm/gup: Switch GUP to the generic
    get_user_page_fast() implementation").

    Cc: Peter Zijlstra
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Cc: Andrew Morton
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Pull ceph fix from Ilya Dryomov:
    "CephFS inode trimming fix from Zheng, marked for stable"

    * tag 'ceph-for-4.15-rc4' of git://github.com/ceph/ceph-client:
    ceph: drop negative child dentries before try pruning inode's alias

    Linus Torvalds
     
  • Pull overlayfs fixes from Miklos Szeredi:

    - fix incomplete syncing of filesystem

    - fix regression in readdir on ovl over 9p

    - only follow redirects when needed

    - misc fixes and cleanups

    * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs:
    ovl: fix overlay: warning prefix
    ovl: Use PTR_ERR_OR_ZERO()
    ovl: Sync upper dirty data when syncing overlayfs
    ovl: update ctx->pos on impure dir iteration
    ovl: Pass ovl_get_nlink() parameters in right order
    ovl: don't follow redirects if redirect_dir=off

    Linus Torvalds
     
  • If there were no commit requests, then nfs_commit_inode() should not
    wait on the commit or mark the inode dirty, otherwise the following
    BUG_ON can be triggered:

    [ 1917.130762] kernel BUG at fs/inode.c:578!
    [ 1917.130766] Oops: Exception in kernel mode, sig: 5 [#1]
    [ 1917.130768] SMP NR_CPUS=2048 NUMA pSeries
    [ 1917.130772] Modules linked in: iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi blocklayoutdriver rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache sunrpc sg nx_crypto pseries_rng ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic crct10dif_common ibmvscsi scsi_transport_srp ibmveth scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
    [ 1917.130805] CPU: 2 PID: 14923 Comm: umount.nfs4 Tainted: G ------------ T 3.10.0-768.el7.ppc64 #1
    [ 1917.130810] task: c0000005ecd88040 ti: c00000004cea0000 task.ti: c00000004cea0000
    [ 1917.130813] NIP: c000000000354178 LR: c000000000354160 CTR: c00000000012db80
    [ 1917.130816] REGS: c00000004cea3720 TRAP: 0700 Tainted: G ------------ T (3.10.0-768.el7.ppc64)
    [ 1917.130820] MSR: 8000000100029032 CR: 22002822 XER: 20000000
    [ 1917.130828] CFAR: c00000000011f594 SOFTE: 1
    GPR00: c000000000354160 c00000004cea39a0 c0000000014c4700 c0000000018cc750
    GPR04: 000000000000c750 80c0000000000000 0600000000000000 04eeb76bea749a03
    GPR08: 0000000000000034 c0000000018cc758 0000000000000001 d000000005e619e8
    GPR12: c00000000012db80 c000000007b31200 0000000000000000 0000000000000000
    GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR24: 0000000000000000 c000000000dfc3ec 0000000000000000 c0000005eefc02c0
    GPR28: d0000000079dbd50 c0000005b94a02c0 c0000005b94a0250 c0000005b94a01c8
    [ 1917.130867] NIP [c000000000354178] .evict+0x1c8/0x350
    [ 1917.130871] LR [c000000000354160] .evict+0x1b0/0x350
    [ 1917.130873] Call Trace:
    [ 1917.130876] [c00000004cea39a0] [c000000000354160] .evict+0x1b0/0x350 (unreliable)
    [ 1917.130880] [c00000004cea3a30] [c0000000003558cc] .evict_inodes+0x13c/0x270
    [ 1917.130884] [c00000004cea3af0] [c000000000327d20] .kill_anon_super+0x70/0x1e0
    [ 1917.130896] [c00000004cea3b80] [d000000005e43e30] .nfs_kill_super+0x20/0x60 [nfs]
    [ 1917.130900] [c00000004cea3c00] [c000000000328a20] .deactivate_locked_super+0xa0/0x1b0
    [ 1917.130903] [c00000004cea3c80] [c00000000035ba54] .cleanup_mnt+0xd4/0x180
    [ 1917.130907] [c00000004cea3d10] [c000000000119034] .task_work_run+0x114/0x150
    [ 1917.130912] [c00000004cea3db0] [c00000000001ba6c] .do_notify_resume+0xcc/0x100
    [ 1917.130916] [c00000004cea3e30] [c00000000000a7b0] .ret_from_except_lite+0x5c/0x60
    [ 1917.130919] Instruction dump:
    [ 1917.130921] 7fc3f378 486734b5 60000000 387f00a0 38800003 4bdcb365 60000000 e95f00a0
    [ 1917.130927] 694a0060 7d4a0074 794ad182 694a0001 892d02a4 2f890000 40de0134

    Signed-off-by: Scott Mayhew
    Cc: stable@vger.kernel.org # 4.5+
    Signed-off-by: Anna Schumaker

    Scott Mayhew
     
  • The following deadlock can occur between a process waiting for a client
    to initialize while walking the client list during nfsv4 server trunking
    detection and another process waiting for the nfs_clid_init_mutex so it
    can initialize that client:

    Process 1                               Process 2
    ---------                               ---------
    spin_lock(&nn->nfs_client_lock);
    list_add_tail(&CLIENTA->cl_share_link,
            &nn->nfs_client_list);
    spin_unlock(&nn->nfs_client_lock);
                                            spin_lock(&nn->nfs_client_lock);
                                            list_add_tail(&CLIENTB->cl_share_link,
                                                    &nn->nfs_client_list);
                                            spin_unlock(&nn->nfs_client_lock);
                                            mutex_lock(&nfs_clid_init_mutex);
                                            nfs41_walk_client_list(clp, result, cred);
                                            nfs_wait_client_init_complete(CLIENTA);
    (waiting for nfs_clid_init_mutex)

    Make sure nfs_match_client() only evaluates clients that have completed
    initialization in order to prevent that deadlock.

    This patch also fixes v4.0 trunking behavior by not marking the client
    NFS_CS_READY until the clientid has been confirmed.

    Signed-off-by: Scott Mayhew
    Signed-off-by: Anna Schumaker

    Scott Mayhew