31 Dec, 2019

11 commits

  • commit 6609fee8897ac475378388238456c84298bff802 upstream.

    When a tree mod log user no longer needs to use the tree it calls
    btrfs_put_tree_mod_seq() to remove itself from the list of users and
    delete all no longer used elements of the tree's red black tree, which
    should be all elements with a sequence number less than or equal to
    the caller's sequence number. However the logic is broken because it
    can delete and free elements from the red black tree that have a
    sequence number greater than the caller's sequence number:

    1) At a point in time we have sequence numbers 1, 2, 3 and 4 in the
    tree mod log;

    2) The task which got assigned the sequence number 1 calls
    btrfs_put_tree_mod_seq();

    3) Sequence number 1 is deleted from the list of sequence numbers;

    4) The current minimum sequence number is computed to be the sequence
    number 2;

    5) A task using sequence number 2 is at tree_mod_log_rewind() and gets
    a pointer to one of its elements from the red black tree through
    a call to tree_mod_log_search();

    6) The task with sequence number 1 iterates the red black tree of tree
    modification elements and deletes (and frees) all elements with a
    sequence number less than or equal to 2 (the computed minimum sequence
    number) - it ends up only leaving elements with sequence numbers of 3
    and 4;

    7) The task with sequence number 2 now uses the pointer to its element,
    already freed by the other task, at __tree_mod_log_rewind(), resulting
    in a use-after-free issue. When CONFIG_DEBUG_PAGEALLOC=y it produces
    a trace like the following:

    [16804.546854] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    [16804.547451] CPU: 0 PID: 28257 Comm: pool Tainted: G W 5.4.0-rc8-btrfs-next-51 #1
    [16804.548059] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
    [16804.548666] RIP: 0010:rb_next+0x16/0x50
    (...)
    [16804.550581] RSP: 0018:ffffb948418ef9b0 EFLAGS: 00010202
    [16804.551227] RAX: 6b6b6b6b6b6b6b6b RBX: ffff90e0247f6600 RCX: 6b6b6b6b6b6b6b6b
    [16804.551873] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff90e0247f6600
    [16804.552504] RBP: ffff90dffe0d4688 R08: 0000000000000001 R09: 0000000000000000
    [16804.553136] R10: ffff90dffa4a0040 R11: 0000000000000000 R12: 000000000000002e
    [16804.553768] R13: ffff90e0247f6600 R14: 0000000000001663 R15: ffff90dff77862b8
    [16804.554399] FS: 00007f4b197ae700(0000) GS:ffff90e036a00000(0000) knlGS:0000000000000000
    [16804.555039] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [16804.555683] CR2: 00007f4b10022000 CR3: 00000002060e2004 CR4: 00000000003606f0
    [16804.556336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [16804.556968] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [16804.557583] Call Trace:
    [16804.558207] __tree_mod_log_rewind+0xbf/0x280 [btrfs]
    [16804.558835] btrfs_search_old_slot+0x105/0xd00 [btrfs]
    [16804.559468] resolve_indirect_refs+0x1eb/0xc70 [btrfs]
    [16804.560087] ? free_extent_buffer.part.19+0x5a/0xc0 [btrfs]
    [16804.560700] find_parent_nodes+0x388/0x1120 [btrfs]
    [16804.561310] btrfs_check_shared+0x115/0x1c0 [btrfs]
    [16804.561916] ? extent_fiemap+0x59d/0x6d0 [btrfs]
    [16804.562518] extent_fiemap+0x59d/0x6d0 [btrfs]
    [16804.563112] ? __might_fault+0x11/0x90
    [16804.563706] do_vfs_ioctl+0x45a/0x700
    [16804.564299] ksys_ioctl+0x70/0x80
    [16804.564885] ? trace_hardirqs_off_thunk+0x1a/0x20
    [16804.565461] __x64_sys_ioctl+0x16/0x20
    [16804.566020] do_syscall_64+0x5c/0x250
    [16804.566580] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [16804.567153] RIP: 0033:0x7f4b1ba2add7
    (...)
    [16804.568907] RSP: 002b:00007f4b197adc88 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    [16804.569513] RAX: ffffffffffffffda RBX: 00007f4b100210d8 RCX: 00007f4b1ba2add7
    [16804.570133] RDX: 00007f4b100210d8 RSI: 00000000c020660b RDI: 0000000000000003
    [16804.570726] RBP: 000055de05a6cfe0 R08: 0000000000000000 R09: 00007f4b197add44
    [16804.571314] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f4b197add48
    [16804.571905] R13: 00007f4b197add40 R14: 00007f4b100210d0 R15: 00007f4b197add50
    (...)
    [16804.575623] ---[ end trace 87317359aad4ba50 ]---

    Fix this by making btrfs_put_tree_mod_seq() skip deletion of elements that
    have a sequence number equal to the computed minimum sequence number, and
    not just elements with a sequence number greater than that minimum.

    Fixes: bd989ba359f2ac ("Btrfs: add tree modification log functions")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
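
    The corrected pruning rule can be sketched as follows. This is a minimal
    illustration, not the kernel code: prune_mod_log() is a hypothetical
    helper and a plain array stands in for the red black tree. It keeps every
    element whose sequence number is greater than or equal to the computed
    minimum, so the element still in use by the task holding that minimum
    survives.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the tree mod log: an array of sequence numbers
 * instead of the kernel's red black tree. Drops every element with a
 * sequence number below min_seq and returns the number of elements kept.
 */
size_t prune_mod_log(uint64_t *seqs, size_t count, uint64_t min_seq)
{
    size_t kept = 0;

    for (size_t i = 0; i < count; i++) {
        /*
         * The broken rule kept only seq > min_seq. An element with
         * seq == min_seq belongs to a user that is still active, so it
         * must be kept as well.
         */
        if (seqs[i] >= min_seq)
            seqs[kept++] = seqs[i];
    }
    return kept;
}
```

    With the sequence numbers 1, 2, 3 and 4 from the scenario above and a
    computed minimum of 2, only the element with sequence number 1 is dropped.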
     
  • commit 714cd3e8cba6841220dce9063a7388a81de03825 upstream.

    If we get an -ENOENT back from btrfs_uuid_iter_rem when iterating the
    uuid tree we'll just continue and do btrfs_next_item(). However we've
    done a btrfs_release_path() at this point and no longer have a valid
    path. So increment the key and go back and do a normal search.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit ca1aa2818a53875cfdd175fb5e9a2984e997cce9 upstream.

    If we fail to read the fs root corresponding with a reloc root we'll
    just break out and free the reloc roots. But we remove our current
    reloc_root from this list higher up, which means we'll leak this
    reloc_root. Fix this by adding ourselves back to the reloc_roots list
    so we are properly cleaned up.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 9bc574de590510eff899c3ca8dbaf013566b5efe upstream.

    My fsstress modifications coupled with generic/475 uncovered a failure
    to mount and replay the log if we hit an orphaned root. We do not want
    to replay the log for an orphan root, but it's completely legitimate to
    have an orphaned root with a log attached. Fix this by simply skipping
    replaying the log. We still need to pin its root node so that we do
    not overwrite it while replaying other logs, as we re-read the log root
    at every stage of the replay.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit c7e54b5102bf3614cadb9ca32d7be73bad6cecf0 upstream.

    We can just abort the transaction here, and in fact do that for every
    other failure in this function except these two cases.

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit fbd542971aa1e9ec33212afe1d9b4f1106cd85a1 upstream.

    We log a warning if root::orphan_cleanup_state is not set to
    ORPHAN_CLEANUP_DONE in btrfs_ioctl_send(). However if the filesystem is
    mounted as readonly we skip the orphan item cleanup during the lookup
    and root::orphan_cleanup_state remains at the init state 0 instead of
    ORPHAN_CLEANUP_DONE (2). So during send in btrfs_ioctl_send() we hit the
    warning as below.

    WARN_ON(send_root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE);

    WARNING: CPU: 0 PID: 2616 at /Volumes/ws/btrfs-devel/fs/btrfs/send.c:7090 btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
    ::
    RIP: 0010:btrfs_ioctl_send+0xb2f/0x18c0 [btrfs]
    ::
    Call Trace:
    ::
    _btrfs_ioctl_send+0x7b/0x110 [btrfs]
    btrfs_ioctl+0x150a/0x2b00 [btrfs]
    ::
    do_vfs_ioctl+0xa9/0x620
    ? __fget+0xac/0xe0
    ksys_ioctl+0x60/0x90
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x49/0x130
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reproducer:
    mkfs.btrfs -fq /dev/sdb
    mount /dev/sdb /btrfs
    btrfs subvolume create /btrfs/sv1
    btrfs subvolume snapshot -r /btrfs/sv1 /btrfs/ss1
    umount /btrfs
    mount -o ro /dev/sdb /btrfs
    btrfs send /btrfs/ss1 -f /tmp/f

    The warning exists because having orphan inodes could confuse send and
    cause it to fail or produce incorrect streams. The two cases that would
    cause such send failures, which are already fixed are:

    1) Inodes that were unlinked - these are orphanized and remain with a
    link count of 0. These caused send operations to fail because it
    expected to always find at least one path for an inode. However this
    is no longer a problem since send is now able to deal with such
    inodes since commit 46b2f4590aab ("Btrfs: fix send failure when root
    has deleted files still open") and treats them as having been
    completely removed (the state after an orphan cleanup is performed).

    2) Inodes that were in the process of being truncated. These resulted in
    send not knowing about the truncation and potentially issuing write
    operations full of zeroes for the range from the new file size to the
    old file size. This is no longer a problem because we no longer
    create orphan items for truncation since commit f7e9e8fc792f ("Btrfs:
    stop creating orphan items for truncate").

    As such before these commits, the WARN_ON here provided a clue in case
    something went wrong. Instead of being a warning against the
    root::orphan_cleanup_state value, it could have been more accurate by
    checking if there were actually any orphan items, and then issue a
    warning only if any exists, but that would be more expensive to check.
    Since orphanized inodes no longer cause problems for send, just remove
    the warning.

    Reported-by: Christoph Anton Mitterer
    Link: https://lore.kernel.org/linux-btrfs/21cb5e8d059f6e1496a903fa7bfc0a297e2f5370.camel@scientia.net/
    CC: stable@vger.kernel.org # 4.19+
    Suggested-by: Filipe Manana
    Reviewed-by: Filipe Manana
    Signed-off-by: Anand Jain
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Anand Jain
     
  • commit 40e046acbd2f369cfbf93c3413639c66514cec2d upstream.

    When logging a file that has shared extents (reflinked with other files or
    with itself), we can end up logging multiple checksum items that cover
    overlapping ranges. This confuses the search for checksums at log replay
    time causing some checksums to never be added to the fs/subvolume tree.

    Consider the following example of a file that shares the same extent at
    offsets 0 and 256Kb:

    [ bytenr 13893632, offset 64Kb, len 64Kb  ]
    0                                         64Kb

    [ bytenr 13631488, offset 64Kb, len 192Kb ]
    64Kb                                      256Kb

    [ bytenr 13893632, offset 0, len 256Kb    ]
    256Kb                                     512Kb

    When logging the inode, at tree-log.c:copy_items(), when processing the
    file extent item at offset 0, we log a checksum item covering the range
    13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 +
    64Kb + 64Kb, respectively.

    Later when processing the extent item at offset 256K, we log the checksums
    for the range from 13893632 to 14155776 (which corresponds to 13893632 +
    256Kb). These checksums get merged with the checksum item for the range
    from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync.
    So after this we get the two following checksum items in the log tree:

    (...)
    item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512
    range start 13631488 end 14155776 length 524288
    item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64
    range start 13959168 end 14024704 length 65536

    The first one covers the range of the second one, so they overlap.

    So far this does not cause a problem after replaying the log, because
    when replaying the file extent item for offset 256K, we copy all the
    checksums for the extent 13893632 from the log tree to the fs/subvolume
    tree, since searching for a checksum item for bytenr 13893632 leaves us
    at the first checksum item, which covers the whole range of the extent.

    However if we write 64Kb to file offset 256Kb for example, we will
    not be able to find and copy the checksums for the last 128Kb of the
    extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb.

    After writing 64Kb into file offset 256Kb we get the following extent
    layout for our file:

    [ bytenr 13893632, offset 64K, len 64Kb   ]
    0                                         64Kb

    [ bytenr 13631488, offset 64Kb, len 192Kb ]
    64Kb                                      256Kb

    [ bytenr 14155776, offset 0, len 64Kb     ]
    256Kb                                     320Kb

    [ bytenr 13893632, offset 64Kb, len 192Kb ]
    320Kb                                     512Kb

    After fsync'ing the file, if we have a power failure and then mount
    the filesystem to replay the log, the following happens:

    1) When replaying the file extent item for file offset 320Kb, we
    look up the checksums for the extent range from 13959168
    (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call
    to btrfs_lookup_csums_range();

    2) btrfs_lookup_csums_range() finds the checksum item that starts
    precisely at offset 13959168 (item 7 in the log tree, shown before);

    3) However that checksum item only covers 64Kb of data, and not 192Kb
    of data;

    4) As a result only the checksums for the first 64Kb of data referenced
    by the file extent item are found and copied to the fs/subvolume tree.
    The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get
    the corresponding data checksums found and copied to the fs/subvolume
    tree.

    5) After replaying the log userspace will not be able to read the file
    range from 384Kb to 512Kb, because the checksums are missing, which
    results in an -EIO error.

    The following steps reproduce this scenario:

    $ mkfs.btrfs -f /dev/sdc
    $ mount /dev/sdc /mnt/sdc

    $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar
    $ xfs_io -c "fsync" /mnt/sdc/foobar
    $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar

    $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar
    $ xfs_io -c "fsync" /mnt/sdc/foobar

    $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar
    $ xfs_io -c "fsync" /mnt/sdc/foobar

    $ mount /dev/sdc /mnt/sdc
    $ md5sum /mnt/sdc/foobar
    md5sum: /mnt/sdc/foobar: Input/output error

    $ dmesg | tail
    [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408
    [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504
    [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600
    [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696
    [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792
    [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888
    [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984
    [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080
    [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1
    [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1

    Fix this simply by deleting first any checksums, from the log tree, for the
    range of the extent we are logging at copy_items(). This ensures we do not
    get checksum items in the log tree that have overlapping ranges.

    This is a long-standing issue that has been present since we have the clone
    (and deduplication) ioctl, and can happen both when an extent is shared
    between different files and within the same file.

    A test case for fstests follows soon.

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
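
    The arithmetic behind steps 1) to 4) can be checked with a small sketch.
    It is a simplified model and not btrfs_lookup_csums_range(): struct
    csum_range and covered_bytes() are hypothetical names, and the lookup is
    assumed to stop at the single item that starts exactly at the requested
    offset.

```c
#include <stdint.h>

/* Hypothetical model of a checksum item covering the bytes [start, end). */
struct csum_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Number of bytes of [want_start, want_end) covered by a lookup that stops
 * at one item beginning exactly at want_start. Returns 0 if the item does
 * not start there.
 */
uint64_t covered_bytes(struct csum_range item, uint64_t want_start,
                       uint64_t want_end)
{
    uint64_t end = item.end < want_end ? item.end : want_end;

    if (item.start != want_start || end <= want_start)
        return 0;
    return end - want_start;
}
```

    For item 7 (13959168 to 14024704) and the requested range 13959168 to
    14155776, this yields 64Kb covered out of 192Kb, which leaves the 128Kb
    that end up without checksums.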
     
  • commit b6293c821ea8fa2a631a2112cd86cd435effeb8b upstream.

    Callers of alloc_test_extent_buffer have not correctly interpreted the
    return value as error pointer, as alloc_test_extent_buffer should behave
    as alloc_extent_buffer. The self-tests were unaffected but
    btrfs_find_create_tree_block could call both functions and that would
    cause problems up in the call chain.

    Fixes: faa2dbf004e8 ("Btrfs: add sanity tests for new qgroup accounting code")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Dan Carpenter
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     
  • commit ad1d8c439978ede77cbf73cbdd11bafe810421a5 upstream.

    Having checksum items, either on the checksums tree or in a log tree, that
    represent ranges that overlap each other is a sign of a corruption. Such
    case confuses the checksum lookup code and can result in not being able to
    find checksums or find stale checksums.

    So add a check for such case.

    This is motivated by a recent fix for a case where a log tree had checksum
    items covering ranges that overlap each other due to extent cloning, and
    resulted in missing checksums after replaying the log tree. It also helps
    detect past issues such as stale and outdated checksums due to overlapping,
    commit 27b9a8122ff71a ("Btrfs: fix csum tree corruption, duplicate and
    outdated checksums").

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
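
    A minimal sketch of such an overlap check, assuming each checksum item
    covers item_size / csum_size sectors starting at its key offset. The
    helper names are hypothetical, and the 4-byte checksum and 4096-byte
    sector used in the example are assumptions chosen to match the log tree
    dump shown in the earlier commit.

```c
#include <stdbool.h>
#include <stdint.h>

/* End of the byte range covered by a checksum item that starts at 'start'. */
uint64_t csum_item_end(uint64_t start, uint32_t item_size,
                       uint32_t csum_size, uint32_t sector_size)
{
    return start + (uint64_t)(item_size / csum_size) * sector_size;
}

/*
 * True if the previous checksum item extends past the start of the next
 * one, i.e. the two items describe overlapping ranges.
 */
bool csum_items_overlap(uint64_t prev_start, uint32_t prev_item_size,
                        uint64_t next_start, uint32_t csum_size,
                        uint32_t sector_size)
{
    return csum_item_end(prev_start, prev_item_size, csum_size,
                         sector_size) > next_start;
}
```

    Applied to items 6 and 7 from that dump (start 13631488 with itemsize 512,
    followed by start 13959168), the first item ends at 14155776, so the pair
    is reported as overlapping.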
     
  • commit f72ff01df9cf5db25c76674cac16605992d15467 upstream.

    Testing with the new fsstress uncovered a pretty nasty deadlock with
    lookup and snapshot deletion.

    Process A
    unlink
      -> final iput
        -> inode_tree_del
          -> synchronize_srcu(subvol_srcu)

    Process B
    btrfs_lookup
      -> btrfs_iget
        -> find inode that has I_FREEING set
          -> __wait_on_freeing_inode()

    We're holding the srcu_read_lock() while doing the iget in order to make
    sure our fs root doesn't go away, and then we are waiting for the inode
    to finish freeing. However because the free'ing process is doing a
    synchronize_srcu() we deadlock.

    Fix this by dropping the synchronize_srcu() in inode_tree_del(). We
    don't need people to stop accessing the fs root at this point, we're
    only adding our empty root to the dead roots list.

    A larger much more invasive fix is forthcoming to address how we deal
    with fs roots, but this fixes the immediate problem.

    Fixes: 76dda93c6ae2 ("Btrfs: add snapshot/subvolume destroy ioctl")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 943eb3bf25f4a7b745dd799e031be276aa104d82 upstream.

    If we're rename exchanging two subvols we'll try to lock this lock
    twice, which is bad. Just lock once if either of the ino's are subvols.

    Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
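
    The locking decision can be sketched as a single predicate. This is an
    illustration only: needs_subvol_lock() is a hypothetical helper, and it
    assumes a subvolume root is identified by inode number 256 (the value of
    BTRFS_FIRST_FREE_OBJECTID).

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: subvolume roots use this inode number in btrfs. */
#define SUBVOL_ROOT_INO 256ULL

/*
 * Decide once whether the shared lock is needed for a rename exchange.
 * Evaluating both inodes in one condition means the lock is taken at most
 * once, even when both sides are subvolumes.
 */
bool needs_subvol_lock(uint64_t old_ino, uint64_t new_ino)
{
    return old_ino == SUBVOL_ROOT_INO || new_ino == SUBVOL_ROOT_INO;
}
```

    The caller would take the lock once when this returns true and release it
    once on the way out, instead of locking per inode.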
     

21 Dec, 2019

13 commits

  • commit 5bb30a4dd60e2a10a4de9932daff23e503f1dd2b upstream.

    Make sure that DFS referrals are sent to newly resolved root targets
    as in a multi tier DFS setup.

    Signed-off-by: Paulo Alcantara (SUSE)
    Link: https://lkml.kernel.org/r/05aa2995-e85e-0ff4-d003-5bb08bd17a22@canonical.com
    Cc: stable@vger.kernel.org
    Tested-by: Matthew Ruffell
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Paulo Alcantara (SUSE)
     
  • commit 86a7964be7afaf3df6b64faaa10a7032d2444e51 upstream.

    There is a race between a system call processing thread
    and the demultiplex thread when mid->resp_buf becomes NULL
    and later is being accessed to get credits. It happens when
    the 1st thread wakes up before a mid callback is called in
    the 2nd one but the mid state has already been set to
    MID_RESPONSE_RECEIVED. This causes NULL pointer dereference
    in mid callback.

    Fix this by saving credits from the response before we
    update the mid state and then use this value in the mid
    callback rather than accessing a response buffer.

    Cc: Stable
    Fixes: ee258d79159afed5 ("CIFS: Move credit processing to mid callbacks for SMB3")
    Tested-by: Frank Sorenson
    Reviewed-by: Ronnie Sahlberg
    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Pavel Shilovsky
     
  • commit 7b71843fa7028475b052107664cbe120156a2cfc upstream.

    When an OPEN command is cancelled we mark a mid as
    cancelled and let the demultiplex thread process it
    by closing an open handle. The problem is there is
    a race between a system call thread and the demultiplex
    thread and there may be a situation when the mid has
    been already processed before it is set as cancelled.

    Fix this by processing cancelled requests when mids
    are being destroyed which means that there is only
    one thread referencing a particular mid. Also set
    mids as cancelled unconditionally on their state.

    Cc: Stable
    Tested-by: Frank Sorenson
    Reviewed-by: Ronnie Sahlberg
    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Pavel Shilovsky
     
  • commit 9150c3adbf24d77cfba37f03639d4a908ca4ac25 upstream.

    If Close command is interrupted before sending a request
    to the server the client ends up leaking an open file
    handle. This wastes server resources and can potentially
    block applications that try to remove the file or any
    directory containing this file.

    Fix this by putting the close command into a worker queue,
    so another thread retries it later.

    Cc: Stable
    Tested-by: Frank Sorenson
    Reviewed-by: Ronnie Sahlberg
    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Pavel Shilovsky
     
  • commit 44805b0e62f15e90d233485420e1847133716bdc upstream.

    Currently the client translates O_SYNC and O_DIRECT flags
    into corresponding SMB create options when opening a file.
    The problem is that on reconnect when the file is being
    re-opened the client doesn't set those flags and it causes
    a server to reject re-open requests because create options
    don't match. The latter means that any subsequent system
    call against that open file fails until a share is re-mounted.

    Fix this by properly setting SMB create options when
    re-opening files after reconnects.

    Fixes: 1013e760d10e6: ("SMB3: Don't ignore O_SYNC/O_DSYNC and O_DIRECT flags")
    Cc: Stable
    Signed-off-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Pavel Shilovsky
     
  • commit 14cc639c17ab0b6671526a7459087352507609e4 upstream.

    On reconnect, the transport data structure is NULL and its information is not
    available.

    Signed-off-by: Long Li
    Cc: stable@vger.kernel.org
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Long Li
     
  • commit acd4680e2bef2405a0e1ef2149fbb01cce7e116c upstream.

    The transport should return this error so the upper layer will reconnect.

    Signed-off-by: Long Li
    Cc: stable@vger.kernel.org
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Long Li
     
  • commit 37941ea17d3f8eb2f5ac2f59346fab9e8439271a upstream.

    While it's not friendly to fail user processes that issue more iovs
    than we support, at least we should return the correct error code so the
    user process gets a chance to retry with smaller number of iovs.

    Signed-off-by: Long Li
    Cc: stable@vger.kernel.org
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Long Li
     
  • commit d63cdbae60ac6fbb2864bd3d8df7404f12b7407d upstream.

    Log these activities to help production support.

    Signed-off-by: Long Li
    Cc: stable@vger.kernel.org
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Long Li
     
  • commit c21ce58eab1eda4c66507897207e20c82e62a5ac upstream.

    It's not necessary to queue invalidated memory registration to work queue, as
    all we need to do is to unmap the SG and make it usable again. This can save
    CPU cycles in normal data paths as memory registration errors are rare and
    normally only happen during reconnection.

    Signed-off-by: Long Li
    Cc: stable@vger.kernel.org
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Long Li
     
  • commit 4357d45f50e58672e1d17648d792f27df01dfccd upstream.

    During reconnecting, the transport may have already been destroyed and is in
    the process of being reconnected. In this case, return -EAGAIN to not fail and
    to retry this I/O.

    Signed-off-by: Long Li
    Cc: stable@vger.kernel.org
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Long Li
     
  • commit fe5e7ba11fcf1d75af8173836309e8562aefedef upstream.

    Commit 9287c6452d2b fixed a situation in which gfs2 could use a glock
    after it had been freed. To do that, it temporarily added a new glock
    reference by calling gfs2_glock_hold in function gfs2_add_revoke.
    However, if the bd element was removed by gfs2_trans_remove_revoke, it
    failed to drop the additional reference.

    This patch adds logic to gfs2_trans_remove_revoke to properly drop the
    additional glock reference.

    Fixes: 9287c6452d2b ("gfs2: Fix occasional glock use-after-free")
    Cc: stable@vger.kernel.org # v5.2+
    Signed-off-by: Bob Peterson
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Greg Kroah-Hartman

    Bob Peterson
     
  • commit f53056c43063257ae4159d83c425eaeb772bcd71 upstream.

    In gfs2_page_mkwrite's gfs2_allocate_page_backing helper, try to
    allocate as many blocks at once as we need. Pass in the size of the
    requested allocation.

    Fixes: 35af80aef99b ("gfs2: don't use buffer_heads in gfs2_allocate_page_backing")
    Cc: stable@vger.kernel.org # v5.3+
    Signed-off-by: Andreas Gruenbacher
    Signed-off-by: Greg Kroah-Hartman

    Andreas Gruenbacher
     

18 Dec, 2019

16 commits

  • commit f4c2d372b89a1e504ebb7b7eb3e29b8306479366 upstream.

    Commit 8fcc3a580651 ("ext4: rework reserved cluster accounting when
    invalidating pages") moved freeing of delayed allocation reservations
    from dirty page invalidation time to time when we evict corresponding
    status extent from extent status tree. For inodes which don't have any
    blocks allocated this may actually happen only in ext4_clear_blocks()
    which is after we've dropped references to quota structures from the
    inode. Thus reservation of quota leaked. Fix the problem by clearing
    quota information from the inode only after evicting extent status tree
    in ext4_clear_inode().

    Link: https://lore.kernel.org/r/20191108115420.GI20863@quack2.suse.cz
    Reported-by: Konstantin Khlebnikov
    Fixes: 8fcc3a580651 ("ext4: rework reserved cluster accounting when invalidating pages")
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 565333a1554d704789e74205989305c811fd9c7a upstream.

    No need to wait for any commit once the page is fully truncated.
    Besides, it may confuse e.g. a concurrent ext4_writepage() when the page
    is still dirty (it will be cleared by truncate_pagecache() in
    ext4_setattr()) but its buffers have been freed, and then trigger a bug
    as shown below:

    [ 26.057508] ------------[ cut here ]------------
    [ 26.058531] kernel BUG at fs/ext4/inode.c:2134!
    ...
    [ 26.088130] Call trace:
    [ 26.088695] ext4_writepage+0x914/0xb28
    [ 26.089541] writeout.isra.4+0x1b4/0x2b8
    [ 26.090409] move_to_new_page+0x3b0/0x568
    [ 26.091338] __unmap_and_move+0x648/0x988
    [ 26.092241] unmap_and_move+0x48c/0xbb8
    [ 26.093096] migrate_pages+0x220/0xb28
    [ 26.093945] kernel_mbind+0x828/0xa18
    [ 26.094791] __arm64_sys_mbind+0xc8/0x138
    [ 26.095716] el0_svc_common+0x190/0x490
    [ 26.096571] el0_svc_handler+0x60/0xd0
    [ 26.097423] el0_svc+0x8/0xc

    Run the following procedure (generated by syzkaller) in parallel on ext3.

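    #define _GNU_SOURCE             /* for O_NOATIME */
    #include <assert.h>
    #include <fcntl.h>
    #include <numaif.h>             /* mbind(), MPOL_MF_MOVE; link with -lnuma */
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd, fd1, ret;
            void *addr;
            size_t length = 4096;
            int flags;
            off_t offset = 0;
            char *str = "12345";

            /* O_CREAT needs a mode argument; 0644 is an arbitrary choice */
            fd = open("a", O_RDWR | O_CREAT, 0644);
            assert(fd >= 0);

            /* Truncate to 4k */
            ret = ftruncate(fd, length);
            assert(ret == 0);

            /* Journal data mode */
            flags = 0xc00f;
            ret = ioctl(fd, _IOW('f', 2, long), &flags);
            assert(ret == 0);

            /* Truncate to 0 */
            fd1 = open("a", O_TRUNC | O_NOATIME);
            assert(fd1 >= 0);

            addr = mmap(NULL, length, PROT_WRITE | PROT_READ,
                        MAP_SHARED, fd, offset);
            assert(addr != (void *)-1);

            /* Dirty the page, then ask for it to be migrated */
            memcpy(addr, str, 5);
            mbind(addr, length, 0, 0, 0, MPOL_MF_MOVE);
            return 0;
    }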

    And the bug will be triggered once we see the order below.

        reproduce1                         reproduce2

    ...                            |   ...
    truncate to 4k                 |
    change to journal data mode    |
                                   |   memcpy(set page dirty)
    truncate to 0:                 |
    ext4_setattr:                  |
    ...                            |
    ext4_wait_for_tail_page_commit |
                                   |   mbind(trigger bug)
    truncate_pagecache(clean dirty)|   ...
    ...                            |

    mbind will call ext4_writepage() since the page is still dirty, and then
    report the bug since the buffers have been freed. Fix it by returning
    directly once offset equals 0, which means the page has been fully
    truncated.

    Reported-by: Hulk Robot
    Signed-off-by: yangerkun
    Link: https://lore.kernel.org/r/20190919063508.1045-1-yangerkun@huawei.com
    Reviewed-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    yangerkun
     
  • commit 3253d9d093376d62b4a56e609f15d2ec5085ac73 upstream.

    Andreas Grünbacher reports that on the two filesystems that support
    iomap directio, it's possible for splice() to return -EAGAIN (instead of
    a short splice) if the pipe being written to has less space available in
    its pipe buffers than the length supplied by the calling process.

    Months ago we fixed splice_direct_to_actor to clamp the length of the
    read request to the size of the splice pipe. Do the same to do_splice.

    Fixes: 17614445576b6 ("splice: don't read more than available pipe space")
    Reported-by: syzbot+3c01db6025f26530cf8d@syzkaller.appspotmail.com
    Reported-by: Andreas Grünbacher
    Reviewed-by: Andreas Grünbacher
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Darrick J. Wong
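
    The clamping step can be sketched as below. This is an assumption-laden
    illustration rather than the do_splice() code: clamp_to_pipe_space() and
    its parameters are hypothetical, and a full pipe simply yields 0 here,
    whereas the kernel handles that case separately.

```c
#include <stddef.h>

/*
 * Hypothetical helper: limit a requested splice length to the space left
 * in the pipe, measured in free pipe buffers of page_size bytes each.
 */
size_t clamp_to_pipe_space(size_t len, unsigned int total_bufs,
                           unsigned int used_bufs, size_t page_size)
{
    size_t space;

    if (used_bufs >= total_bufs)
        return 0;   /* no free buffer left */

    space = (size_t)(total_bufs - used_bufs) * page_size;
    return len < space ? len : space;
}
```

    A request larger than the free space is shortened, so the caller sees a
    short splice instead of -EAGAIN.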
     
  • commit c7df4a1ecb8579838ec8c56b2bb6a6716e974f37 upstream.

    If the file system is corrupted such that a file's i_links_count is
    too small, then it's possible that when unlinking that file, i_nlink
    will already be zero. Previously we were working around this kind of
    corruption by forcing i_nlink to one; but we were doing this before
    trying to delete the directory entry --- and if the file system is
    corrupted enough that ext4_delete_entry() fails, then we exit with
    i_nlink elevated, and this causes the orphan inode list handling to be
    FUBAR'ed, such that when we unmount the file system, the orphan inode
    list can get corrupted.

    A better way to fix this is to simply skip trying to call drop_nlink()
    if i_nlink is already zero, thus moving the check to the place where
    it makes the most sense.
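
    The check described above can be sketched roughly as follows. This is a
    standalone model, not the ext4 code: struct model_inode and
    model_drop_link are hypothetical names standing in for the inode and for
    the guarded call to drop_nlink().

```c
/* Hypothetical stand-in for an inode's link count; not the kernel struct. */
struct model_inode {
    unsigned int nlink;
};

/*
 * Drop one link, but leave an already-zero count alone instead of forcing
 * it to one first. Returns 1 if the count was decremented, 0 if skipped.
 */
int model_drop_link(struct model_inode *inode)
{
    if (inode->nlink == 0)
        return 0;          /* corrupted count: nothing to drop */
    inode->nlink--;
    return 1;
}
```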

    https://bugzilla.kernel.org/show_bug.cgi?id=205433

    Link: https://lore.kernel.org/r/20191112032903.8828-1-tytso@mit.edu
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Reviewed-by: Andreas Dilger
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit 60e4cf67a582d64f07713eda5fcc8ccdaf7833e6 upstream.

    Since commit d0a5b995a308 (vfs: Add IOP_XATTR inode operations flag)
    extended attributes haven't worked on the root directory in reiserfs.

    This is due to reiserfs conditionally setting the sb->s_xattrs handler
    array depending on whether it located or created the internal privroot
    directory. It necessarily does this after the root inode is already
    read in. The IOP_XATTR flag is set during inode initialization, so
    it never gets set on the root directory.

    This commit unconditionally assigns sb->s_xattrs and clears IOP_XATTR on
    internal inodes. The old return values due to the conditional assignment
    are handled via open_xa_root, which now returns EOPNOTSUPP as the VFS
    would have done.

    Link: https://lore.kernel.org/r/20191024143127.17509-1-jeffm@suse.com
    CC: stable@vger.kernel.org
    Fixes: d0a5b995a308 ("vfs: Add IOP_XATTR inode operations flag")
    Signed-off-by: Jeff Mahoney
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Jeff Mahoney
     
  • commit 65db869c754e7c271691dd5feabf884347e694f5 upstream.

    Estimate for the number of credits needed for final freeing of inode in
    ext4_evict_inode() was too small. We may modify 4 blocks (inode & sb for
    orphan deletion, bitmap & group descriptor for inode freeing) and not
    just 3.

    [ Fixed minor whitespace nit. -- TYT ]

    Fixes: e50e5129f384 ("ext4: xattr-in-inode support")
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-6-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 6ff33d99fc5c96797103b48b7b0902c296f09c05 upstream.

    Write only quotas which are dirty at entry.

    XFSTEST: https://github.com/dmonakhov/xfstests/commit/b10ad23566a5bf75832a6f500e1236084083cddc

    Link: https://lore.kernel.org/r/20191031103920.3919-1-dmonakhov@openvz.org
    CC: stable@vger.kernel.org
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Monakhov
     
  • commit e705f4b8aa27a59f8933e8f384e9752f052c469c upstream.

    Checking err when partial == NULL is meaningless because
    partial == NULL means the branch was obtained successfully without
    error.

    CC: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20191105045100.7104-1-cgxu519@mykernel.net
    Signed-off-by: Chengguang Xu
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Chengguang Xu
     
  • commit df4bb5d128e2c44848aeb36b7ceceba3ac85080d upstream.

    There is a race window where the quota can be redirtied after we drop
    dq_list_lock inside dqput(), but before we grab dquot->dq_lock inside
    dquot_release():

    TASK1                                           TASK2 (chowner)
    ->dqput()
      we_slept:
        spin_lock(&dq_list_lock)
        if (dquot_dirty(dquot)) {
          spin_unlock(&dq_list_lock);
          dquot->dq_sb->dq_op->write_dquot(dquot);
          goto we_slept
        if (test_bit(DQ_ACTIVE_B, &dquot->dq_flags)) {
          spin_unlock(&dq_list_lock);
          dquot->dq_sb->dq_op->release_dquot(dquot);
                                                    dqget()
                                                    mark_dquot_dirty()
                                                    dqput()
          goto we_slept;
        }

    So the dirty dquot will be released by TASK1, but on the next we_slept loop
    we detect this and call ->write_dquot() for it.

    XFSTEST: https://github.com/dmonakhov/xfstests/commit/440a80d4cbb39e9234df4d7240aee1d551c36107

    Link: https://lore.kernel.org/r/20191031103920.3919-2-dmonakhov@openvz.org
    CC: stable@vger.kernel.org
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Dmitry Monakhov
     
  • commit 926d1650176448d7684b991fbe1a5b1a8289e97c upstream.

    As David reported [1], ENODATA is returned when attempting
    to modify files by using EROFS as an overlayfs lower layer.

    The root cause is that listxattr could return unexpected
    -ENODATA by mistake for inodes without xattr. That breaks
    listxattr return value convention and it can cause copy
    up failure when used with overlayfs.

    Resolve this by returning zero from listxattr if no xattr is found.
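
    As a rough illustration of the return value convention, assuming that
    listxattr reports the number of bytes of attribute names and therefore 0
    for an inode without xattrs. The helper model_listxattr_result is
    hypothetical and is not the EROFS function.

```c
#include <errno.h>

/*
 * Hypothetical model of the listxattr return convention: an inode with no
 * xattrs yields 0 rather than -ENODATA. Other values pass through unchanged.
 */
long model_listxattr_result(long ret)
{
    if (ret == -ENODATA)
        return 0;          /* "no xattr" is an empty list, not an error */
    return ret;
}
```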

    [1] https://lore.kernel.org/r/CAEvUa7nxnby+rxK-KRMA46=exeOMApkDMAV08AjMkkPnTPV4CQ@mail.gmail.com
    Link: https://lore.kernel.org/r/20191201084040.29275-1-hsiangkao@aol.com
    Fixes: cadf1ccf1b00 ("staging: erofs: add error handling for xattr submodule")
    Cc: # 4.19+
    Reviewed-by: Chao Yu
    Signed-off-by: Gao Xiang
    Signed-off-by: Greg Kroah-Hartman

    Gao Xiang
     
  • commit 6889ee5a53b8d969aa542047f5ac8acdc0e79a91 upstream.

    In ovl_rename(), if new upper is hardlinked to old upper underneath
    overlayfs before upper dirs are locked, user will get an ESTALE error
    and a WARN_ON will be printed.

    Changes to underlying layers while overlayfs is mounted may result in
    unexpected behavior, but it shouldn't crash the kernel and it shouldn't
    trigger WARN_ON() either, so relax this WARN_ON().

    Reported-by: syzbot+bb1836a212e69f8e201a@syzkaller.appspotmail.com
    Fixes: 804032fabb3b ("ovl: don't check rename to self")
    Cc: # v4.9+
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • commit 9c6d8f13e9da10a26ad7f0a020ef86e8ef142835 upstream.

    On non-samefs overlay without xino, non pure upper inodes should use a
    pseudo_dev assigned to each unique lower fs and pure upper inodes use the
    real upper st_dev.

    It is fine for an overlay pure upper inode to use the same st_dev;st_ino
    values as the real upper inode, because the content of those two different
    filesystem objects is always the same.

    In this case, however:
    - two filesystems, A and B
    - upper layer is on A
    - lower layer 1 is also on A
    - lower layer 2 is on B

    A non pure upper overlay inode whose origin is in layer 1 will have the same
    st_dev;st_ino values as the real lower inode. This may result in false
    positive results of 'diff' between the real lower and copied up overlay
    inode.

    Fix this by using the upper st_dev;st_ino values in this case. This breaks
    the property of constant st_dev;st_ino across copy up in this case. This
    breakage will be fixed by a later patch.

    Fixes: 5148626b806a ("ovl: allocate anon bdev per unique lower fs")
    Cc: stable@vger.kernel.org # v4.17+
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • commit 7e63c87fc2dcf3be9d3aab82d4a0ea085880bdca upstream.

    In the past, overlayfs required that lower fs have non null uuid in
    order to support nfs export and decode copy up origin file handles.

    Commit 9df085f3c9a2 ("ovl: relax requirement for non null uuid of
    lower fs") relaxed this requirement for nfs export support, as long
    as uuid (even if null) is unique among all lower fs.

    However, said commit unintentionally also relaxed the non null uuid
    requirement for decoding copy up origin file handles, regardless of
    the unique uuid requirement.

    Amend this mistake by disabling decoding of copy up origin file handle
    from lower fs with a conflicting uuid.

    We still encode copy up origin file handles from those fs, because
    file handles like those already exist in the wild and because they
    might provide useful information in the future.

    There is an unhandled corner case described by Miklos this way:
    - two filesystems, A and B, both have null uuid
    - upper layer is on A
    - lower layer 1 is also on A
    - lower layer 2 is on B

    In this case bad_uuid won't be set for B, because the check only
    involves the list of lower fs. Hence we'll try to decode a layer 2
    origin on layer 1 and fail.

    We will deal with this corner case later.

    Reported-by: Colin Ian King
    Tested-by: Colin Ian King
    Link: https://lore.kernel.org/lkml/20191106234301.283006-1-colin.king@canonical.com/
    Fixes: 9df085f3c9a2 ("ovl: relax requirement for non null uuid ...")
    Cc: stable@vger.kernel.org # v4.20+
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • commit 3e1740993e43116b3bc71b0aad1e6872f6ccf341 upstream.

    Testing with the new fsstress support for subvolumes uncovered a pretty
    bad problem with rename exchange on subvolumes. We're modifying two
    different subvolumes, but we only start the transaction on one of them,
    so the other one is not added to the dirty root list. This is caught by
    btrfs_cow_block() with a warning because the root has not been updated,
    however if we do not modify this root again we'll end up pointing at an
    invalid root because the root item is never updated.

    Fix this by making sure we add the destination root to the trans list,
    the same as we do with normal renames. This fixes the corruption.

    Fixes: cdd1fedf8261 ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
    CC: stable@vger.kernel.org # 4.9+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit fd0ddbe2509568b00df364156f47561e9f469f15 upstream.

    Backreference walking, which is used by send to figure out if it can issue
    clone operations instead of write operations, can be very slow and use
    too much memory when extents have many references. This change simply
    skips backreference walking when an extent has more than 64 references,
    in which case we fall back to a write operation instead of a clone
    operation. This limit is conservative and in practice I observed no
    significant slowdown with up to 100 references and still low memory usage
    up to that limit.

    This is a temporary workaround until there are speedups in the backref
    walking code, and as such it does not attempt to add extra interfaces or
    knobs to tweak the threshold.
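
    The threshold check can be pictured with the small standalone model below.
    Only the limit of 64 comes from the text above; the macro
    MODEL_MAX_EXTENT_REFS and the function model_try_clone are hypothetical
    names, not identifiers from the btrfs send code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Threshold taken from the commit text; the macro name is hypothetical. */
#define MODEL_MAX_EXTENT_REFS 64

/*
 * Return true when backreference walking (and therefore a clone operation)
 * should be attempted, false when the caller should fall back to a write.
 */
bool model_try_clone(uint64_t extent_refs)
{
    return extent_refs <= MODEL_MAX_EXTENT_REFS;
}
```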

    Reported-by: Atemu
    Link: https://lore.kernel.org/linux-btrfs/CAE4GHgkvqVADtS4AzcQJxo0Q1jKQgKaW3JGp3SGdoinVo=C9eQ@mail.gmail.com/T/#me55dc0987f9cc2acaa54372ce0492c65782be3fa
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Qu Wenruo
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 34b127aecd4fe8e6a3903e10f204a7b7ffddca22 upstream.

    The last user of btrfs_bio::flags was removed in commit 326e1dbb5736
    ("block: remove management of bi_remaining when restoring original
    bi_end_io"), remove it.

    (Tagged for stable as the structure is heavily used and space savings
    are desirable.)

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo