16 Aug, 2018

4 commits

  • commit 119e1ef80ecfe0d1deb6378d4ab41f5b71519de1 upstream.

    __legitimize_mnt() has two problems - one is that in case of success
    the check of mount_lock is not ordered wrt preceding increment of
    refcount, making it possible to have successful __legitimize_mnt()
    on one CPU just before the otherwise final mntput() on another,
    with __legitimize_mnt() not seeing mntput() taking the lock and
    mntput() not seeing the increment done by __legitimize_mnt().
    Solved by a pair of barriers.

    Another is that failure of __legitimize_mnt() on the second
    read_seqretry() leaves us with reference that'll need to be
    dropped by caller; however, if that races with final mntput()
    we can end up with caller dropping rcu_read_lock() and doing
    mntput() to release that reference - with the first mntput()
    having freed the damn thing just as rcu_read_lock() had been
    dropped. Solution: in "do mntput() yourself" failure case
    grab mount_lock, check if MNT_DOOMED has been set by racing
    final mntput() that has missed our increment and if it has -
    undo the increment and treat that as "failure, caller doesn't
    need to drop anything" case.

    It's not easy to hit - the final mntput() has to come right
    after the first read_seqretry() in __legitimize_mnt() *and*
    manage to miss the increment done by __legitimize_mnt() before
    the second read_seqretry() in there. The things that are almost
    impossible to hit on bare hardware are not impossible on SMP
    KVM, though...

    Reported-by: Oleg Nesterov
    Fixes: 48a066e72d97 ("RCU'd vfsmounts")
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 9ea0a46ca2c318fcc449c1e6b62a7230a17888f1 upstream.

    mntput_no_expire() does the calculation of total refcount under mount_lock;
    unfortunately, the decrement (as well as all increments) are done outside
    of it, leading to false positives in the "are we dropping the last reference"
    test. Consider the following situation:
    * mnt is a lazy-umounted mount, kept alive by two opened files. One
    of those files gets closed. Total refcount of mnt is 2. On CPU 42
    mntput(mnt) (called from __fput()) drops one reference, decrementing component #42.
    * After it has looked at component #0, the process on CPU 0 does
    mntget(), incrementing component #0, gets preempted and gets to run again -
    on CPU 69. There it does mntput(), which drops the reference (component #69)
    and proceeds to spin on mount_lock.
    * On CPU 42 our first mntput() finishes counting. It observes the
    decrement of component #69, but not the increment of component #0. As a
    result, the total it gets is not 1 as it should've been - it's 0. At which
    point we decide that vfsmount needs to be killed and proceed to free it and
    shut the filesystem down. However, there's still another opened file
    on that filesystem, with reference to (now freed) vfsmount, etc. and we are
    screwed.

    It's not a wide race, but it can be reproduced with artificial slowdown of
    the mnt_get_count() loop, and it should be easier to hit on SMP KVM setups.

    Fix consists of moving the refcount decrement under mount_lock; the tricky
    part is that we want (and can) keep the fast case (i.e. mount that still
    has non-NULL ->mnt_ns) entirely out of mount_lock. All places that zero
    mnt->mnt_ns are dropping some reference to mnt and they call synchronize_rcu()
    before that mntput(). IOW, if mntput() observes (under rcu_read_lock())
    a non-NULL ->mnt_ns, it is guaranteed that there is another reference yet to
    be dropped.

    Reported-by: Jann Horn
    Tested-by: Jann Horn
    Fixes: 48a066e72d97 ("RCU'd vfsmounts")
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 4c0d7cd5c8416b1ef41534d19163cb07ffaa03ab upstream.

    RCU pathwalk relies upon the assumption that anything that changes
    ->d_inode of a dentry will invalidate its ->d_seq. That's almost
    true - the one exception is that the final dput() of already unhashed
    dentry does *not* touch ->d_seq at all. Unhashing does, though,
    so for anything we'd found by RCU dcache lookup we are fine.
    Unfortunately, we can *start* with an unhashed dentry or jump into
    it.

    We could try and be careful in the (few) places where that could
    happen. Or we could just make the final dput() invalidate the damn
    thing, unhashed or not. The latter is much simpler and easier to
    backport, so let's do it that way.

    Reported-by: "Dae R. Jeong"
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 90bad5e05bcdb0308cfa3d3a60f5c0b9c8e2efb3 upstream.

    Since mountpoint crossing can happen without leaving lazy mode,
    root dentries do need the same protection against having their
    memory freed without RCU delay as everything else in the tree.

    It's partially hidden by RCU delay between detaching from the
    mount tree and dropping the vfsmount reference, but the starting
    point of pathwalk can be on an already detached mount, in which
    case umount-caused RCU delay has already passed by the time the
    lazy pathwalk grabs rcu_read_lock(). If the starting point
    happens to be at the root of that vfsmount *and* that vfsmount
    covers the entire filesystem, we get trouble.

    Fixes: 48a066e72d97 ("RCU'd vfsmounts")
    Cc: stable@vger.kernel.org
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

09 Aug, 2018

6 commits

  • commit 92d34134193e5b129dc24f8d79cb9196626e8d7a upstream.

    The code assumes the buffer is max_size bytes long, but we weren't
    allocating enough space for it.

    Signed-off-by: Shankara Pailoor
    Signed-off-by: Dave Kleikamp
    Cc: Guenter Roeck
    Signed-off-by: Greg Kroah-Hartman

    Shankara Pailoor
     
  • commit bb3d48dcf86a97dc25fe9fc2c11938e19cb4399a upstream.

    xfs_attr3_leaf_create may have errored out before instantiating a buffer,
    for example if the blkno is out of range. In that case there is no work
    to do to remove it, and in fact xfs_da_shrink_inode will lead to an oops
    if we try.

    This also seems to fix a flaw where the original error from
    xfs_attr3_leaf_create gets overwritten in the cleanup case, and it
    removes a pointless assignment to bp which isn't used after this.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=199969
    Reported-by: Xu, Wen
    Tested-by: Xu, Wen
    Signed-off-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Cc: Eduardo Valentin
    Signed-off-by: Greg Kroah-Hartman

    Eric Sandeen
     
  • commit afca6c5b2595fc44383919fba740c194b0b76aff upstream.

    A recent fuzzed filesystem image caused random dcache corruption
    when the reproducer was run. This often showed up as panics in
    lookup_slow() on a null inode->i_ops pointer when doing pathwalks.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
    ....
    Call Trace:
    lookup_slow+0x44/0x60
    walk_component+0x3dd/0x9f0
    link_path_walk+0x4a7/0x830
    path_lookupat+0xc1/0x470
    filename_lookup+0x129/0x270
    user_path_at_empty+0x36/0x40
    path_listxattr+0x98/0x110
    SyS_listxattr+0x13/0x20
    do_syscall_64+0xf5/0x280
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    but had many different failure modes including deadlocks trying to
    lock the inode that was just allocated or KASAN reports of
    use-after-free violations.

    The cause of the problem was a corrupt INOBT on a v4 fs where the
    root inode was marked as free in the inobt record. Hence when we
    allocated an inode, it chose the root inode to allocate, found it in
    the cache and re-initialised it.

    We recently fixed a similar inode allocation issue caused by inobt
    record corruption problem in xfs_iget_cache_miss() in commit
    ee457001ed6c ("xfs: catch inode allocation state mismatch
    corruption"). This change adds similar checks to the cache-hit path
    to catch it, and turns the reproducer into a corruption shutdown
    situation.

    Reported-by: Wen Xu
    Signed-Off-By: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Carlos Maiolino
    Reviewed-by: Darrick J. Wong
    [darrick: fix typos in comment]
    Signed-off-by: Darrick J. Wong
    Cc: Eduardo Valentin
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit ee457001ed6c6f31ddad69c24c1da8f377d8472d upstream.

    We recently came across a V4 filesystem causing memory corruption
    due to a newly allocated inode being set up twice and being added to
    the superblock inode list twice. From code inspection, the only way
    this could happen is if a newly allocated inode was not marked as
    free on disk (i.e. di_mode wasn't zero).

    Running the metadump on an upstream debug kernel fails during inode
    allocation like so:

    XFS: Assertion failed: ip->i_d.di_nblocks == 0, file: fs/xfs/xfs_inode.c, line: 838
    ------------[ cut here ]------------
    kernel BUG at fs/xfs/xfs_message.c:114!
    invalid opcode: 0000 [#1] PREEMPT SMP
    CPU: 11 PID: 3496 Comm: mkdir Not tainted 4.16.0-rc5-dgc #442
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    RIP: 0010:assfail+0x28/0x30
    RSP: 0018:ffffc9000236fc80 EFLAGS: 00010202
    RAX: 00000000ffffffea RBX: 0000000000004000 RCX: 0000000000000000
    RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffff8227211b
    RBP: ffffc9000236fce8 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000bec R11: f000000000000000 R12: ffffc9000236fd30
    R13: ffff8805c76bab80 R14: ffff8805c77ac800 R15: ffff88083fb12e10
    FS: 00007fac8cbff040(0000) GS:ffff88083fd00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fffa6783ff8 CR3: 00000005c6e2b003 CR4: 00000000000606e0
    Call Trace:
    xfs_ialloc+0x383/0x570
    xfs_dir_ialloc+0x6a/0x2a0
    xfs_create+0x412/0x670
    xfs_generic_create+0x1f7/0x2c0
    ? capable_wrt_inode_uidgid+0x3f/0x50
    vfs_mkdir+0xfb/0x1b0
    SyS_mkdir+0xcf/0xf0
    do_syscall_64+0x73/0x1a0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    Extracting the inode number we crashed on from an event trace and
    looking at it with xfs_db:

    xfs_db> inode 184452204
    xfs_db> p
    core.magic = 0x494e
    core.mode = 0100644
    core.version = 2
    core.format = 2 (extents)
    core.nlinkv2 = 1
    core.onlink = 0
    .....

    Confirms that it is not a free inode on disk. xfs_repair
    also trips over this inode:

    .....
    zero length extent (off = 0, fsbno = 0) in ino 184452204
    correcting nextents for inode 184452204
    bad attribute fork in inode 184452204, would clear attr fork
    bad nblocks 1 for inode 184452204, would reset to 0
    bad anextents 1 for inode 184452204, would reset to 0
    imap claims in-use inode 184452204 is free, would correct imap
    would have cleared inode 184452204
    .....
    disconnected inode 184452204, would move to lost+found

    And so we have a situation where the directory structure and the
    inobt thinks the inode is free, but the inode on disk thinks it is
    still in use. Where this corruption came from is not possible to
    diagnose, but we can detect it and prevent the kernel from oopsing
    on lookup. The reproducer now results in:

    $ sudo mkdir /mnt/scratch/{0,1,2,3,4,5}{0,1,2,3,4,5}
    mkdir: cannot create directory ‘/mnt/scratch/00’: File exists
    mkdir: cannot create directory ‘/mnt/scratch/01’: File exists
    mkdir: cannot create directory ‘/mnt/scratch/03’: Structure needs cleaning
    mkdir: cannot create directory ‘/mnt/scratch/04’: Input/output error
    mkdir: cannot create directory ‘/mnt/scratch/05’: Input/output error
    ....

    And this corruption shutdown:

    [ 54.843517] XFS (loop0): Corruption detected! Free inode 0xafe846c not marked free on disk
    [ 54.845885] XFS (loop0): Internal error xfs_trans_cancel at line 1023 of file fs/xfs/xfs_trans.c. Caller xfs_create+0x425/0x670
    [ 54.848994] CPU: 10 PID: 3541 Comm: mkdir Not tainted 4.16.0-rc5-dgc #443
    [ 54.850753] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [ 54.852859] Call Trace:
    [ 54.853531] dump_stack+0x85/0xc5
    [ 54.854385] xfs_trans_cancel+0x197/0x1c0
    [ 54.855421] xfs_create+0x425/0x670
    [ 54.856314] xfs_generic_create+0x1f7/0x2c0
    [ 54.857390] ? capable_wrt_inode_uidgid+0x3f/0x50
    [ 54.858586] vfs_mkdir+0xfb/0x1b0
    [ 54.859458] SyS_mkdir+0xcf/0xf0
    [ 54.860254] do_syscall_64+0x73/0x1a0
    [ 54.861193] entry_SYSCALL_64_after_hwframe+0x42/0xb7
    [ 54.862492] RIP: 0033:0x7fb73bddf547
    [ 54.863358] RSP: 002b:00007ffdaa553338 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
    [ 54.865133] RAX: ffffffffffffffda RBX: 00007ffdaa55449a RCX: 00007fb73bddf547
    [ 54.866766] RDX: 0000000000000001 RSI: 00000000000001ff RDI: 00007ffdaa55449a
    [ 54.868432] RBP: 00007ffdaa55449a R08: 00000000000001ff R09: 00005623a8670dd0
    [ 54.870110] R10: 00007fb73be72d5b R11: 0000000000000246 R12: 00000000000001ff
    [ 54.871752] R13: 00007ffdaa5534b0 R14: 0000000000000000 R15: 00007ffdaa553500
    [ 54.873429] XFS (loop0): xfs_do_force_shutdown(0x8) called from line 1024 of file fs/xfs/xfs_trans.c. Return address = ffffffff814cd050
    [ 54.882790] XFS (loop0): Corruption of in-memory data detected. Shutting down filesystem
    [ 54.884597] XFS (loop0): Please umount the filesystem and rectify the problem(s)

    Note that this crash is only possible on v4 filesystems, or v5
    filesystems mounted with the ikeep mount option. For all other v5
    filesystems, this problem cannot occur because we don't read inodes
    we are allocating from disk - we simply overwrite them with the new
    inode information.

    Signed-Off-By: Dave Chinner
    Reviewed-by: Carlos Maiolino
    Tested-by: Carlos Maiolino
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Cc: Eduardo Valentin
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit bd3599a0e142cd73edd3b6801068ac3f48ac771a upstream.

    When we clone a range into a file we can end up dropping existing
    extent maps (or trimming them) and replacing them with new ones if the
    range to be cloned overlaps with a range in the destination inode.
    When that happens we add the new extent maps to the list of modified
    extents in the inode's extent map tree, so that a "fast" fsync (the flag
    BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
    and log corresponding extent items. However, at the end of the range
    cloning operation we truncate all the pages in the affected range (in
    order to ensure future reads will not get stale data). Sometimes this truncation
    will release the corresponding extent maps besides the pages from the page
    cache. If this happens, then a "fast" fsync operation will miss logging
    some extent items, because it relies exclusively on the extent maps being
    present in the inode's extent tree, leading to data loss/corruption if
    the fsync ends up using the same transaction used by the clone operation
    (that transaction was not committed in the meanwhile). An extent map is
    released through the callback btrfs_invalidatepage(), which gets called by
    truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
    latter ends up calling try_release_extent_mapping(), which will release the
    extent map if some conditions are met, like the file size being greater
    than 16Mb, the gfp flags allowing blocking, and the range not being locked
    (which is the case during the clone operation) nor the extent map being
    flagged as pinned (also the case for cloning).

    The following example, turned into a test for fstests, reproduces the
    issue:

    $ mkfs.btrfs -f /dev/sdb
    $ mount /dev/sdb /mnt

    $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
    $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar

    $ xfs_io -c "fsync" /mnt/bar
    # reflink destination offset corresponds to the size of file bar,
    # 2728Kb minus 4Kb.
    $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
    $ xfs_io -c "fsync" /mnt/bar

    $ md5sum /mnt/bar
    95a95813a8c2abc9aa75a6c2914a077e /mnt/bar

    $ mount /dev/sdb /mnt
    $ md5sum /mnt/bar
    207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
    # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
    # power failure

    In the above example, the destination offset of the clone operation
    corresponds to the size of the "bar" file minus 4Kb. So during the clone
    operation, the extent map covering the range from 2572Kb to 2728Kb gets
    trimmed so that it ends at offset 2724Kb, and a new extent map covering
    the range from 2724Kb to 11724Kb is created. So at the end of the clone
    operation when we ask to truncate the pages in the range from 2724Kb to
    2724Kb + 15908Kb, the page invalidation callback ends up removing the new
    extent map (through try_release_extent_mapping()) when the page at offset
    2724Kb is passed to that callback.

    Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
    map is removed at try_release_extent_mapping(), forcing the next fsync to
    search for modified extents in the fs/subvolume tree instead of relying on
    the presence of extent maps in memory. This way we can continue doing a
    "fast" fsync if the destination range of a clone operation does not
    overlap with an existing range or if any of the criteria necessary to
    remove an extent map at try_release_extent_mapping() is not met (file
    size not bigger than 16Mb or the gfp flags do not allow blocking).

    CC: stable@vger.kernel.org # 3.16+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 44de022c4382541cebdd6de4465d1f4f465ff1dd upstream.

    Ext4_check_descriptors() was getting called before s_gdb_count was
    initialized. So for file systems w/o the meta_bg feature, allocation
    bitmaps could overlap the block group descriptors and ext4 wouldn't
    notice.

    For file systems with the meta_bg feature enabled, there was a
    fencepost error which would cause the ext4_check_descriptors() to
    incorrectly believe that the block allocation bitmap overlaps with the
    block group descriptor blocks, and it would reject the mount.

    Fix both of these problems.

    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org
    Signed-off-by: Benjamin Gilbert
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     

06 Aug, 2018

3 commits

  • commit 31e810aa1033a7db50a2746cd34a2432237f6420 upstream.

    The fix in commit 0cbb4b4f4c44 ("userfaultfd: clear the
    vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails") cleared the
    vma->vm_userfaultfd_ctx but kept userfaultfd flags in vma->vm_flags
    that were copied from the parent process VMA.

    As a result, there is an inconsistency between the values of
    vma->vm_userfaultfd_ctx.ctx and vma->vm_flags which triggers BUG_ON
    in userfaultfd_release().

    Clearing the uffd flags from vma->vm_flags in case of UFFD_EVENT_FORK
    failure resolves the issue.

    Link: http://lkml.kernel.org/r/1532931975-25473-1-git-send-email-rppt@linux.vnet.ibm.com
    Fixes: 0cbb4b4f4c44 ("userfaultfd: clear the vma->vm_userfaultfd_ctx if UFFD_EVENT_FORK fails")
    Signed-off-by: Mike Rapoport
    Reported-by: syzbot+121be635a7a35ddb7dcb@syzkaller.appspotmail.com
    Cc: Andrea Arcangeli
    Cc: Eric Biggers
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Rapoport
     
  • commit 71755ee5350b63fb1f283de8561cdb61b47f4d1d upstream.

    The squashfs fragment reading code doesn't actually verify that the
    fragment is inside the fragment table. The end result _is_ verified to
    be inside the image when actually reading the fragment data, but before
    that is done, we may end up taking a page fault because the fragment
    table itself might not even exist.

    Another report from Anatoly and his endless squashfs image fuzzing.

    Reported-by: Анатолий Тросиненко
    Acked-by: Phillip Lougher
    Cc: Willy Tarreau
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit d512584780d3e6a7cacb2f482834849453d444a1 upstream.

    Anatoly reports another squashfs fuzzing issue, where the decompression
    parameters themselves are in a compressed block.

    This causes squashfs_read_data() to be called in order to read the
    decompression options before the decompression stream has been set
    up, making squashfs go sideways.

    Reported-by: Anatoly Trosinenko
    Acked-by: Phillip Lougher
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

03 Aug, 2018

26 commits

  • commit e8d4bfe3a71537284a90561f77c85dea6c154369 upstream.

    When executing filesystem sync or umount on overlayfs,
    dirty data does not get synced as expected on upper filesystem.
    This patch fixes sync filesystem method to keep data consistency
    for overlayfs.

    Signed-off-by: Chengguang Xu
    Fixes: e593b2bf513d ("ovl: properly implement sync_filesystem()")
    Cc: #4.11
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Chengguang Xu
     
  • commit 5012284700775a4e6e3fbe7eac4c543c4874b559 upstream.

    Commit 8844618d8aa7: "ext4: only look at the bg_flags field if it is
    valid" will complain if block group zero does not have the
    EXT4_BG_INODE_ZEROED flag set. Unfortunately, this is not correct,
    since a freshly created file system has this flag cleared. It gets set
    almost immediately after the file system is mounted read-write --- but
    the following somewhat unlikely sequence will end up triggering a
    false positive report of a corrupted file system:

    mkfs.ext4 /dev/vdc
    mount -o ro /dev/vdc /vdc
    mount -o remount,rw /dev/vdc

    Instead, when initializing the inode table for block group zero, test
    to make sure that itable_unused count is not too large, since that is
    the case that will result in some or all of the reserved inodes
    getting cleared.

    This fixes the failures reported by Eric Whiteney when running
    generic/230 and generic/231 in the nojournal test case.

    Fixes: 8844618d8aa7 ("ext4: only look at the bg_flags field if it is valid")
    Reported-by: Eric Whitney
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit 8d5a803c6a6ce4ec258e31f76059ea5153ba46ef upstream.

    With commit 044e6e3d74a3: "ext4: don't update checksum of new
    initialized bitmaps" the buffer valid bit will get set without
    actually setting up the checksum for the allocation bitmap, since the
    checksum will get calculated once we actually allocate an inode or
    block.

    If we are doing this, then we need to (re-)check the verified bit
    after we take the block group lock. Otherwise, we could race with
    another process reading and verifying the bitmap, which would then
    complain about the checksum being invalid.

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137

    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit 362eca70b53389bddf3143fe20f53dcce2cfdf61 upstream.

    The inline data code was updating the raw inode directly; this is
    problematic since if metadata checksums are enabled,
    ext4_mark_inode_dirty() must be called to update the inode's checksum.
    In addition, the jbd2 layer requires that get_write_access() be called
    before the metadata buffer is modified. Fix both of these problems.

    https://bugzilla.kernel.org/show_bug.cgi?id=200443

    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit 01cfb7937a9af2abb1136c7e89fbf3fd92952956 upstream.

    Anatoly Trosinenko reports that a corrupted squashfs image can cause a
    kernel oops. It turns out that squashfs can end up being confused about
    negative fragment lengths.

    The regular squashfs_read_data() does check for negative lengths, but
    squashfs_read_metadata() did not, and the fragment size code just
    blindly trusted the on-disk value. Fix both the fragment parsing and
    the metadata reading code.

    Reported-by: Anatoly Trosinenko
    Cc: Al Viro
    Cc: Phillip Lougher
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit 9362dd1109f87a9d0a798fbc890cb339c171ed35 upstream.

    Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin Wilck
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Martin Wilck
     
  • [ Upstream commit 5b19d284f5195a925dd015a6397bfce184097378 ]

    pageout() in MM translates EAGAIN, so it calls handle_write_error()
    -> mapping_set_error() -> set_bit(AS_EIO, ...).
    file_write_and_wait_range() will then see the EIO error, which is
    critical for reporting an atomic_write failure to the user via the
    return value of the following fsync().

    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jaegeuk Kim
     
  • [ Upstream commit 36dd26e0c8d42699eeba87431246c07c28075bae ]

    Improve fscrypt read performance by switching the decryption workqueue
    from bound to unbound. With the bound workqueue, when multiple bios
    completed on the same CPU, they were decrypted on that same CPU. But
    with the unbound queue, they are now decrypted in parallel on any CPU.

    Although fscrypt read performance can be tough to measure due to the
    many sources of variation, this change is most beneficial when
    decryption is slow, e.g. on CPUs without AES instructions. For example,
    I timed tarring up encrypted directories on f2fs. On x86 with AES-NI
    instructions disabled, the unbound workqueue improved performance by
    about 25-35%, using 1 to NUM_CPUs jobs with 4 or 8 CPUs available. But
    with AES-NI enabled, performance was unchanged to within ~2%.

    I also did the same test on a quad-core ARM CPU using xts-speck128-neon
    encryption. There performance was usually about 10% better with the
    unbound workqueue, bringing it closer to the unencrypted speed.

    The unbound workqueue may be worse in some cases due to worse locality,
    but I think it's still the better default. dm-crypt uses an unbound
    workqueue by default too, so this change makes fscrypt match.

    Signed-off-by: Eric Biggers
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • [ Upstream commit ff3d27a048d926b3920ccdb75d98788c567cae0d ]

    Under the following case, qgroup rescan can double account cowed tree
    blocks:

    In this case, extent tree only has one tree block.

    -
    | transid=5 last committed=4
    | btrfs_qgroup_rescan_worker()
    | |- btrfs_start_transaction()
    | | transid = 5
    | |- qgroup_rescan_leaf()
    | |- btrfs_search_slot_for_read() on extent tree
    | Get the only extent tree block from commit root (transid = 4).
    | Scan it, set qgroup_rescan_progress to the last
    | EXTENT/META_ITEM + 1
    | now qgroup_rescan_progress = A + 1.
    |
    | fs tree get CoWed, new tree block is at A + 16K
    | transid 5 get committed
    -
    | transid=6 last committed=5
    | btrfs_qgroup_rescan_worker()
    | |- btrfs_start_transaction()
    | | transid = 6
    | |- qgroup_rescan_leaf()
    | |- btrfs_search_slot_for_read() on extent tree
    | Get the only extent tree block from commit root (transid = 5).
    | scan it using qgroup_rescan_progress (A + 1).
    | found new tree block beyond A, and it's a fs tree block,
    | account it to increase qgroup numbers.
    -

    In the above case, tree block A and tree block A + 16K get accounted twice,
    while qgroup rescan should stop when it has already reached the last leaf,
    rather than continuing from its qgroup_rescan_progress.

    Such a case can happen by just looping btrfs/017; with some probability
    it will hit this double qgroup accounting problem.

    Fix it by checking the path to determine whether we should finish the
    qgroup rescan, rather than relying on the next loop iteration to exit.

    Reported-by: Nikolay Borisov
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • [ Upstream commit 3d3a2e610ea5e7c6d4f9481ecce5d8e2d8317843 ]

    Currently the code assumes that there's an implied barrier by the
    sequence of code preceding the wakeup, namely the mutex unlock.

    As Nikolay pointed out:

    I think this is wrong (not your code) but the original assumption that
    the RELEASE semantics provided by mutex_unlock is sufficient.
    According to memory-barriers.txt:

    Section 'LOCK ACQUISITION FUNCTIONS' states:

    (2) RELEASE operation implication:

    Memory operations issued before the RELEASE will be completed before the
    RELEASE operation has completed.

    Memory operations issued after the RELEASE *may* be completed before the
    RELEASE operation has completed.

    (I've bolded the may portion)

    The example given there:

    As an example, consider the following:

    *A = a;
    *B = b;
    ACQUIRE
    *C = c;
    *D = d;
    RELEASE
    *E = e;
    *F = f;

    The following sequence of events is acceptable:

    ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE

    So if we assume that *C is modifying the flag which the waitqueue is checking,
    and *E is the actual wakeup, then those accesses can be re-ordered...

    IMHO this code should be considered broken...
    Signed-off-by: Greg Kroah-Hartman

    David Sterba
     
  • [ Upstream commit 0552210997badb6a60740a26ff9d976a416510f0 ]

    btrfs_free_extent() can fail because of ENOMEM. There's no reason to
    panic here, we can just abort the transaction.

    Fixes: f4b9aa8d3b87 ("btrfs_truncate")
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Omar Sandoval
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     
  • [ Upstream commit c08db7d8d295a4f3a10faaca376de011afff7950 ]

    In btrfs_evict_inode(), if btrfs_truncate_inode_items() fails, the inode
    item will still be in the tree but we still return the ino to the ino
    cache. That will blow up later when someone tries to allocate that ino,
    so don't return it to the cache.

    Fixes: 581bb050941b ("Btrfs: Cache free inode numbers in memory")
    Reviewed-by: Josef Bacik
    Signed-off-by: Omar Sandoval
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     
  • [ Upstream commit e73e81b6d0114d4a303205a952ab2e87c44bd279 ]

    [Problem description and how we fix it]
    We should balance dirty metadata pages at the end of
    btrfs_finish_ordered_io, since a small, unmergeable random write can
    potentially produce dirty metadata which is multiple times larger than
    the data itself. For example, a small, unmergeable 4KiB write may
    produce:

    16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
    16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree

    Although we do call balance_dirty_pages() on the write side, in the
    buffered write path most metadata is dirtied only after we reach the
    dirty background limit (which by far only counts dirty data pages) and
    wake up the flusher thread. If many small, unmergeable random writes
    are spread across a large btree, we'll see a burst of dirty pages
    exceed the dirty_bytes limit after the flusher thread wakes up - which
    is not what we expect. On our machine it caused an out-of-memory
    problem, since a page cannot be dropped while it is marked dirty.

    One may worry that we could sleep in btrfs_btree_balance_dirty_nodelay,
    but since btrfs_finish_ordered_io runs in a separate worker, it will
    not stop the flusher from consuming dirty pages. Also, since we use a
    different worker for metadata writeback endio, sleeping in
    btrfs_finish_ordered_io helps us throttle the amount of dirty metadata
    pages.

    [Reproduce steps]
    To reproduce the problem, we need to do 4KiB write randomly spread in a
    large btree. In our 2GiB RAM machine:

    1) Create 4 subvolumes.
    2) Run fio on each subvolume:

    [global]
    direct=0
    rw=randwrite
    ioengine=libaio
    bs=4k
    iodepth=16
    numjobs=1
    group_reporting
    size=128G
    runtime=1800
    norandommap
    time_based
    randrepeat=0

    3) Take snapshot on each subvolume and repeat fio on existing files.
    4) Repeat step (3) until we get large btrees.
    In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
    metadata in each subvolume tree and 12GiB of metadata in extent tree.
    5) Stop all fio, take snapshot again, and wait until all delayed work is
    completed.
    6) Start all fio. A few seconds later we hit OOM when the flusher
    starts to work.

    It can be reproduced even when using nocow write.

    Signed-off-by: Ethan Lien
    Reviewed-by: David Sterba
    [ add comment ]
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ethan Lien
     
  • [ Upstream commit 27319ba4044c0c67d62ae39e53c0118c89f0a029 ]

    Thread GC thread
    - f2fs_ioc_start_atomic_write
    - get_dirty_pages
    - filemap_write_and_wait_range
    - f2fs_gc
    - do_garbage_collect
    - gc_data_segment
    - move_data_page
    - f2fs_is_atomic_file
    - set_page_dirty
    - set_inode_flag(, FI_ATOMIC_FILE)

    A dirty data page can still be generated by GC in the race condition
    shown in the call stacks above.

    This patch takes fi->dio_rwsem[WRITE] in f2fs_ioc_start_atomic_write
    to avoid this race.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     
  • [ Upstream commit c22aecd75919511abea872b201751e0be1add898 ]

    dquot_initialize() can fail due to any exception inside the quota
    subsystem; f2fs needs to be aware of it and return the correct value
    to the caller.
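
The pattern, as a minimal sketch with hypothetical names: check the quota hook's return value and propagate it, rather than silently dropping it.

```c
#include <errno.h>
#include <stdbool.h>

static int dquot_init(bool simulate_failure)
{
    return simulate_failure ? -ENOMEM : 0;
}

int fs_op(bool simulate_failure)
{
    int err = dquot_init(simulate_failure);
    if (err)
        return err;  /* before the fix, this error was silently dropped */
    /* ... rest of the operation ... */
    return 0;
}
```
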

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     
  • [ Upstream commit 60b2b4ee2bc01dd052f99fa9d65da2232102ef8e ]

    f2fs_ioc_shutdown() ioctl gets stuck in the below path
    when issued with F2FS_GOING_DOWN_FULLSYNC option.

    __switch_to+0x90/0xc4
    percpu_down_write+0x8c/0xc0
    freeze_super+0xec/0x1e4
    freeze_bdev+0xc4/0xcc
    f2fs_ioctl+0xc0c/0x1ce0
    f2fs_compat_ioctl+0x98/0x1f0

    Signed-off-by: Sahitya Tummala
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Sahitya Tummala
     
  • [ Upstream commit e5e5732d8120654159254c16834bc8663d8be124 ]

    After an atomic write is revoked, the related LBA can be reused by
    others, so we need to wait for page writeback before reusing the LBA,
    in order to avoid interference between the old in-flight IO from the
    atomic write and new IO.

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     
  • [ Upstream commit 64c74a7ab505ea40d1b3e5d02735ecab08ae1b14 ]

    - f2fs_fill_super
    - recover_fsync_data
    - recover_data
    - del_fsync_inode
    - iput
    - iput_final
    - write_inode_now
    - f2fs_write_inode
    - f2fs_balance_fs
    - f2fs_balance_fs_bg
    - sync_dirty_inodes

    With the data_flush mount option, in order to avoid entering the above
    writeback flow during recovery, detect the recovery status and skip
    the flush in f2fs_balance_fs_bg.

    Signed-off-by: Chao Yu
    Signed-off-by: Yunlei He
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     
  • [ Upstream commit 14a28559f43ac7c0b98dd1b0e73ec9ec8ab4fc45 ]

    This patch fixes the error path of move_data_page:
    - clear the cold data flag if it fails to write the page.
    - redirty the page for the non-ENOMEM case.
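
The two bullets above can be sketched as follows (illustrative names, not the real f2fs code): on a write failure, drop the cold-data hint, and redirty the page unless the failure was -ENOMEM.

```c
#include <errno.h>
#include <stdbool.h>

struct page_state { bool cold; bool dirty; };

int move_data_page(struct page_state *p, int write_err)
{
    p->cold = true;    /* set_cold_data() analogue */
    p->dirty = false;  /* page cleaned for writeback */
    if (write_err) {
        p->cold = false;          /* clear cold data flag on failure */
        if (write_err != -ENOMEM)
            p->dirty = true;      /* redirty so the data isn't lost */
    }
    return write_err;
}

int demo_move_error(void)
{
    struct page_state p;
    bool ok;
    move_data_page(&p, -EIO);     /* generic failure: redirtied */
    ok = (!p.cold && p.dirty);
    move_data_page(&p, -ENOMEM);  /* ENOMEM: not redirtied */
    ok = ok && (!p.cold && !p.dirty);
    return ok;
}
```
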

    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     
  • [ Upstream commit 4071e67cffcc5c2a007116a02437471351f550eb ]

    The following patch disables loading of the f2fs module on
    architectures which have PAGE_SIZE > 4096, since it is impossible to
    mount f2fs on such architectures; the log messages are:

    mount: /mnt: wrong fs type, bad option, bad superblock on
    /dev/vdiskb1, missing codepage or helper program, or other error.
    /dev/vdiskb1: F2FS filesystem,
    UUID=1d8b9ca4-2389-4910-af3b-10998969f09c, volume name ""

    May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid
    page_cache_size (8192), supports only 4KB
    May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Can't find valid F2FS
    filesystem in 1th superblock
    May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid
    page_cache_size (8192), supports only 4KB
    May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Can't find valid F2FS
    filesystem in 2th superblock
    May 15 18:03:13 ttip kernel: F2FS-fs (vdiskb1): Invalid
    page_cache_size (8192), supports only 4KB

    which was introduced by git commit 5c9b469295fb6b10d98923eab5e79c4edb80ed20

    tested on git kernel 4.17.0-rc6-00309-gec30dcf7f425

    with patch applied:

    modprobe: ERROR: could not insert 'f2fs': Invalid argument
    May 28 01:40:28 v215 kernel: F2FS not supported on PAGE_SIZE(8192) != 4096
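
The guard can be sketched as a module-init check (illustrative names; the real check lives in the f2fs init path): refuse to load when the kernel page size does not match the 4KiB block size the on-disk format assumes.

```c
#include <errno.h>

#define F2FS_BLKSIZE 4096L

int init_fs(long page_size)
{
    if (page_size != F2FS_BLKSIZE)
        return -EINVAL;  /* "F2FS not supported on PAGE_SIZE(...) != 4096" */
    return 0;
}
```
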

    Signed-off-by: Anatoly Pugachev
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Anatoly Pugachev
     
  • [ Upstream commit ae55e59da0e401893b3c52b575fc18a00623d0a1 ]

    If the server recalls the layout that was just handed out, we risk hitting
    a race as described in RFC5661 Section 2.10.6.3 unless we ensure that we
    release the sequence slot after processing the LAYOUTGET operation that
    was sent as part of the OPEN compound.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • [ Upstream commit c36ed50de2ad1649ce0369a4a6fc2cc11b20dfb7 ]

    With the current logic:
    when I specify rasize=0~1 then it will be 4096.
    when I specify rasize=2~4097 then it will be 8192.

    Make it the same as rsize & wsize.
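
A hedged sketch of the rounding (the upstream code presumably uses the kernel's ALIGN() helper; exact handling of edge values may differ): align the readahead size up to a page-size multiple, so rasize=2..4096 yields 4096 rather than 8192.

```c
#define PAGE_SZ 4096UL

/* round up to the next PAGE_SZ multiple; PAGE_SZ must be a power of two */
unsigned long align_rasize(unsigned long rasize)
{
    return (rasize + PAGE_SZ - 1) & ~(PAGE_SZ - 1);
}
```
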

    Signed-off-by: Chengguang Xu
    Reviewed-by: "Yan, Zheng"
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Chengguang Xu
     
  • [ Upstream commit ab6ecf247a9321e3180e021a6a60164dee53ab2e ]

    In commit ab676b7d6fbf ("pagemap: do not leak physical addresses to
    non-privileged userspace"), the /proc/PID/pagemap is restricted to be
    readable only by CAP_SYS_ADMIN to address some security issue.

    In commit 1c90308e7a77 ("pagemap: hide physical addresses from
    non-privileged users"), the restriction is relieved to make
    /proc/PID/pagemap readable, but hide the physical addresses for
    non-privileged users.

    But the swap entries are readable by non-privileged users too, which
    has security implications. For example, for a page under migration,
    the swap entry carries physical address information. So, in this
    patch, the swap entries are hidden from non-privileged users too.
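
The masking can be sketched as below (the field layout here is illustrative, not the real pagemap ABI): for non-privileged readers, keep only the status bits and zero out the swap type/offset, just as the PFN is already zeroed for present pages.

```c
#include <stdint.h>
#include <stdbool.h>

#define PM_SWAP        (1ULL << 62)       /* illustrative status bit */
#define PM_STATUS_MASK (3ULL << 62)       /* status bits to preserve */

uint64_t pagemap_entry(uint64_t raw, bool privileged)
{
    if (privileged)
        return raw;                        /* CAP_SYS_ADMIN sees everything */
    return raw & PM_STATUS_MASK;           /* hide swap offset/type and PFN */
}
```
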

    Link: http://lkml.kernel.org/r/20180508012745.7238-1-ying.huang@intel.com
    Fixes: 1c90308e7a77 ("pagemap: hide physical addresses from non-privileged users")
    Signed-off-by: "Huang, Ying"
    Suggested-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Cc: Konstantin Khlebnikov
    Cc: Andrei Vagin
    Cc: Jerome Glisse
    Cc: Daniel Colascione
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Huang Ying
     
  • [ Upstream commit 3171822fdcdd6e6d536047c425af6dc7a92dc585 ]

    When running a fuzz tester against a KASAN-enabled kernel, the following
    splat periodically occurs.

    The problem occurs when the test sends a GETDEVICEINFO request with a
    malformed xdr array (size but no data) for gdia_notify_types and the
    array size is > 0x3fffffff, which results in an overflow in the value of
    nbytes which is passed to read_buf().

    If the array size is 0x40000000, 0x80000000, or 0xc0000000, then after
    the overflow occurs, the value of nbytes is 0, and when that happens
    the pointer returned by read_buf() points to the end of the xdr data
    (i.e. argp->end) when really it should be returning NULL.

    Fix this by returning NFS4ERR_BAD_XDR if the array size is > 1000
    (this value is arbitrary, but it's the same threshold used by
    nfsd4_decode_bitmap()... it could really be any value >= 1 since we
    expect to get at most a single bitmap in gdia_notify_types).
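
The overflow itself is easy to demonstrate: nbytes is computed as a 32-bit element count times 4, so counts like 0x40000000 wrap to 0 and slip past any size check done after the multiply. Bounding the count *before* multiplying avoids the wrap. The sketch below uses illustrative names, not the real nfsd decoder.

```c
#include <stdint.h>
#include <errno.h>

/* unchecked version: shows the 32-bit wrap */
uint32_t wrapped_nbytes(uint32_t num)
{
    return num * 4;
}

/* fixed version: bound the count before computing nbytes */
int decode_notify_types(uint32_t num)
{
    uint32_t nbytes;
    if (num > 1000)
        return -EINVAL;   /* NFS4ERR_BAD_XDR analogue */
    nbytes = num * 4;     /* safe: num <= 1000, cannot overflow */
    (void)nbytes;
    return 0;
}
```
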

    [ 119.256854] ==================================================================
    [ 119.257611] BUG: KASAN: use-after-free in nfsd4_decode_getdeviceinfo+0x5a4/0x5b0 [nfsd]
    [ 119.258422] Read of size 4 at addr ffff880113ada000 by task nfsd/538

    [ 119.259146] CPU: 0 PID: 538 Comm: nfsd Not tainted 4.17.0+ #1
    [ 119.259662] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
    [ 119.261202] Call Trace:
    [ 119.262265] dump_stack+0x71/0xab
    [ 119.263371] print_address_description+0x6a/0x270
    [ 119.264609] kasan_report+0x258/0x380
    [ 119.265854] ? nfsd4_decode_getdeviceinfo+0x5a4/0x5b0 [nfsd]
    [ 119.267291] nfsd4_decode_getdeviceinfo+0x5a4/0x5b0 [nfsd]
    [ 119.268549] ? nfs4svc_decode_compoundargs+0xa5b/0x13c0 [nfsd]
    [ 119.269873] ? nfsd4_decode_sequence+0x490/0x490 [nfsd]
    [ 119.271095] nfs4svc_decode_compoundargs+0xa5b/0x13c0 [nfsd]
    [ 119.272393] ? nfsd4_release_compoundargs+0x1b0/0x1b0 [nfsd]
    [ 119.273658] nfsd_dispatch+0x183/0x850 [nfsd]
    [ 119.274918] svc_process+0x161c/0x31a0 [sunrpc]
    [ 119.276172] ? svc_printk+0x190/0x190 [sunrpc]
    [ 119.277386] ? svc_xprt_release+0x451/0x680 [sunrpc]
    [ 119.278622] nfsd+0x2b9/0x430 [nfsd]
    [ 119.279771] ? nfsd_destroy+0x1c0/0x1c0 [nfsd]
    [ 119.281157] kthread+0x2db/0x390
    [ 119.282347] ? kthread_create_worker_on_cpu+0xc0/0xc0
    [ 119.283756] ret_from_fork+0x35/0x40

    [ 119.286041] Allocated by task 436:
    [ 119.287525] kasan_kmalloc+0xa0/0xd0
    [ 119.288685] kmem_cache_alloc+0xe9/0x1f0
    [ 119.289900] get_empty_filp+0x7b/0x410
    [ 119.291037] path_openat+0xca/0x4220
    [ 119.292242] do_filp_open+0x182/0x280
    [ 119.293411] do_sys_open+0x216/0x360
    [ 119.294555] do_syscall_64+0xa0/0x2f0
    [ 119.295721] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    [ 119.298068] Freed by task 436:
    [ 119.299271] __kasan_slab_free+0x130/0x180
    [ 119.300557] kmem_cache_free+0x78/0x210
    [ 119.301823] rcu_process_callbacks+0x35b/0xbd0
    [ 119.303162] __do_softirq+0x192/0x5ea

    [ 119.305443] The buggy address belongs to the object at ffff880113ada000
    which belongs to the cache filp of size 256
    [ 119.308556] The buggy address is located 0 bytes inside of
    256-byte region [ffff880113ada000, ffff880113ada100)
    [ 119.311376] The buggy address belongs to the page:
    [ 119.312728] page:ffffea00044eb680 count:1 mapcount:0 mapping:0000000000000000 index:0xffff880113ada780
    [ 119.314428] flags: 0x17ffe000000100(slab)
    [ 119.315740] raw: 0017ffe000000100 0000000000000000 ffff880113ada780 00000001000c0001
    [ 119.317379] raw: ffffea0004553c60 ffffea00045c11e0 ffff88011b167e00 0000000000000000
    [ 119.319050] page dumped because: kasan: bad access detected

    [ 119.321652] Memory state around the buggy address:
    [ 119.322993] ffff880113ad9f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ 119.324515] ffff880113ad9f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    [ 119.326087] >ffff880113ada000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 119.327547] ^
    [ 119.328730] ffff880113ada080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [ 119.330218] ffff880113ada100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    [ 119.331740] ==================================================================

    Signed-off-by: Scott Mayhew
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Scott Mayhew
     
  • [ Upstream commit f9312a541050007ec59eb0106273a0a10718cd83 ]

    If the server returns NFS4ERR_SEQ_FALSE_RETRY or NFS4ERR_RETRY_UNCACHED_REP,
    then it thinks we're trying to replay an existing request. If so, then
    let's just bump the sequence ID and retry the operation.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • [ Upstream commit 93b7f7ad2018d2037559b1d0892417864c78b371 ]

    Currently, when IO to the DS fails, the client returns the layout and
    retries against the MDS. However, on umount (inode eviction) it then
    returns the layout again.

    This is because pnfs_return_layout() was changed in
    commit d78471d32bb6 ("pnfs/blocklayout: set PNFS_LAYOUTRETURN_ON_ERROR")
    to always set NFS_LAYOUT_RETURN_REQUESTED so even if we returned
    the layout, it will be returned again. Instead, let's also check
    if we have already marked the layout invalid.
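
The check can be sketched as follows (flags are illustrative, not the real pnfs structures): only send LAYOUTRETURN when a return was requested *and* the layout has not already been marked invalid, so eviction after an earlier error-path return becomes a no-op.

```c
#include <stdbool.h>

struct layout { bool return_requested; bool invalid; };

bool should_return_layout(const struct layout *lo)
{
    return lo->return_requested && !lo->invalid;
}

int demo_layout(void)
{
    struct layout after_error = { true, true };   /* already returned */
    struct layout live = { true, false };         /* still needs a return */
    return should_return_layout(&after_error) == false
        && should_return_layout(&live) == true;
}
```
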

    Signed-off-by: Olga Kornievskaia
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Olga Kornievskaia
     

28 Jul, 2018

1 commit

  • This reverts commit 748144f35514aef14c4fdef5bcaa0db99cb9367a which is
    commit f46ecbd97f508e68a7806291a139499794874f3d upstream.

    Philip reports:
    seems adding "cifs: Fix slab-out-of-bounds in send_set_info() on SMB2
    ACE setting" (commit 748144f) [1] created a regression within linux
    v4.14 kernel series. Writing to a mounted cifs either freezes on writing
    or crashes the PC. A more detailed explanation you may find in our
    forums [2]. Reverting the patch, seems to "fix" it. Thoughts?

    [2] https://forum.manjaro.org/t/53250

    Reported-by: Philip Müller
    Cc: Jianhong Yin
    Cc: Stefano Brivio
    Cc: Aurelien Aptel
    Cc: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman