14 Jan, 2020

1 commit


11 Jan, 2020

3 commits

  • Pull char/misc fix from Greg KH:
    "Here is a single fix, for the chrdev core, for 5.5-rc6

    There's been a long-standing race condition triggered by syzbot, and
    occasionally real people, in the chrdev open() path. Will finally took
    the time to track it down and fix it for real before the holidays.

    Here's that one patch, it's been in linux-next for a while with no
    reported issues and it does fix the reported problem"

    * tag 'char-misc-5.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
    chardev: Avoid potential use-after-free in 'chrdev_open()'

    Linus Torvalds
     
  • Pull block fixes from Jens Axboe:
    "A few fixes that should go into this round.

    This pull request contains two NVMe fixes via Keith, removal of a dead
    function, and a fix for the bio op for read truncates (Ming)"

    * tag 'block-5.5-2020-01-10' of git://git.kernel.dk/linux-block:
    nvmet: fix per feat data len for get_feature
    nvme: Translate more status codes to blk_status_t
    fs: move guard_bio_eod() after bio_set_op_attrs
    block: remove unused mp_bvec_last_segment

    Linus Torvalds
     
  • Pull io_uring fix from Jens Axboe:
    "Single fix for this series, fixing a regression with the short read
    handling.

    This just removes it, as it cannot safely be done for all cases"

    * tag 'io_uring-5.5-2020-01-10' of git://git.kernel.dk/linux-block:
    io_uring: remove punt of short reads to async context

    Linus Torvalds
     

10 Jan, 2020

1 commit


09 Jan, 2020

2 commits

  • Commit 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
    adds bio_truncate() for handling bio EOD. However, bio_truncate()
    doesn't use the passed 'op' parameter from guard_bio_eod's callers.

    So bio_truncate() may retrieve the wrong 'op', and zeroing pages may
    not be done for READ bio.

    Fix this issue by moving guard_bio_eod() after bio_set_op_attrs()
    in submit_bh_wbc() so that bio_truncate() can always retrieve correct
    op info.

    Meanwhile, remove the 'op' parameter from guard_bio_eod() because it is
    no longer used.

    Cc: Carlos Maiolino
    Cc: linux-fsdevel@vger.kernel.org
    Fixes: 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
    Signed-off-by: Ming Lei

    Fold in kerneldoc and bio_op() change.

    Signed-off-by: Jens Axboe

    Ming Lei
     
  • In my attempt to fix a memory leak, I introduced a double-free in the
    pstore error path. Instead of trying to manage the allocation lifetime
    between persistent_ram_new() and its callers, adjust the logic so
    persistent_ram_new() always takes a kstrdup() copy, and leaves the
    caller's allocation lifetime up to the caller. Therefore callers are
    _always_ responsible for freeing their label. Before, it only needed
    freeing when the prz itself failed to allocate, and not in any of the
    other prz failure cases, which callers would have no visibility into,
    which is the root design problem that led to both the leak and now
    double-free bugs.
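
    The ownership rule described above can be sketched in user-space C.
    This is only an illustration under stated assumptions: 'struct zone',
    zone_new(), zone_free(), setup_zone() and dup_string() are hypothetical
    names, not the pstore API, and malloc()/free() stand in for the kernel
    allocators. The callee always takes its own copy of the label, so the
    caller frees its buffer unconditionally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a persistent RAM zone; not the kernel struct. */
struct zone {
    char *label; /* private copy, owned by the zone */
};

/* Portable string duplicate (plays the role of kstrdup() in this sketch). */
static char *dup_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (copy)
        memcpy(copy, s, len);
    return copy;
}

/*
 * Create a zone. The callee copies 'label' and never takes ownership of
 * the caller's buffer. 'fail_late' simulates a failure that happens after
 * the zone itself has been allocated.
 */
struct zone *zone_new(const char *label, int fail_late)
{
    struct zone *z = calloc(1, sizeof(*z));

    if (!z)
        return NULL;
    z->label = dup_string(label);
    if (!z->label || fail_late) {
        free(z->label); /* free(NULL) is a no-op */
        free(z);
        return NULL;
    }
    return z;
}

void zone_free(struct zone *z)
{
    if (!z)
        return;
    free(z->label);
    free(z);
}

/* Caller pattern: allocate a label, call zone_new(), always free the label. */
int setup_zone(const char *name, int fail_late, struct zone **out)
{
    char *label;

    assert(out != NULL);
    *out = NULL;
    label = dup_string(name);
    if (!label)
        return -1;
    *out = zone_new(label, fail_late);
    free(label); /* unconditional: the zone holds its own copy */
    return *out ? 0 : -1;
}
```

    With this split the caller never needs to know which internal step
    failed, so neither a leak nor a double free can come from the label.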

    Reported-by: Cengiz Can
    Link: https://lore.kernel.org/lkml/d4ec59002ede4aaf9928c7f7526da87c@kernel.wtf
    Fixes: 8df955a32a73 ("pstore/ram: Fix error-path memory leak in persistent_ram_new() callers")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook

    Kees Cook
     

08 Jan, 2020

1 commit

  • We currently punt any short read on a regular file to async context,
    but this fails if the short read is due to running into EOF. This is
    especially problematic since we only do the single prep for commands
    now, as we don't reset kiocb->ki_pos. This can result in a 4k read on
    a 1k file returning zero, as we detect the short read and then retry
    from async context. At the time of retry, the position is now 1k, and
    we end up reading nothing, and hence return 0.

    Instead of trying to patch around the fact that short reads can be
    legitimate and won't succeed in case of retry, remove the logic to punt
    a short read to async context. Simply return it.
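
    The failure mode above can be reproduced with a small user-space model.
    This is a sketch, not io_uring code: 'struct read_req', model_read(),
    read_with_retry() and read_no_retry() are hypothetical names, and 'pos'
    only plays the role of kiocb->ki_pos.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a read request; 'pos' stands in for kiocb->ki_pos. */
struct read_req {
    size_t pos; /* current file position */
    size_t len; /* number of bytes requested */
};

/*
 * Read up to req->len bytes from a file of 'file_size' bytes and advance
 * the position. Returns the number of bytes read; 0 means EOF.
 */
size_t model_read(struct read_req *req, size_t file_size)
{
    size_t avail, n;

    assert(req != NULL);
    avail = req->pos < file_size ? file_size - req->pos : 0;
    n = req->len < avail ? req->len : avail;
    req->pos += n;
    return n;
}

/*
 * Old behaviour as described above: a short read is retried without
 * resetting the position, and the result of the retry is returned.
 */
size_t read_with_retry(struct read_req *req, size_t file_size)
{
    size_t n = model_read(req, file_size);

    if (n < req->len)
        n = model_read(req, file_size); /* position already advanced */
    return n;
}

/* New behaviour: the short read is simply returned. */
size_t read_no_retry(struct read_req *req, size_t file_size)
{
    return model_read(req, file_size);
}
```

    For a 4096-byte request on a 1024-byte file, read_with_retry() yields 0
    in this model, while read_no_retry() yields 1024.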

    Signed-off-by: Jens Axboe

    Jens Axboe
     

07 Jan, 2020

1 commit

  • 'chrdev_open()' calls 'cdev_get()' to obtain a reference to the
    'struct cdev *' stashed in the 'i_cdev' field of the target inode
    structure. If the pointer is NULL, then it is initialised lazily by
    looking up the kobject in the 'cdev_map' and so the whole procedure is
    protected by the 'cdev_lock' spinlock to serialise initialisation of
    the shared pointer.

    Unfortunately, it is possible for the initialising thread to fail *after*
    installing the new pointer, for example if the subsequent '->open()' call
    on the file fails. In this case, 'cdev_put()' is called, the reference
    count on the kobject is dropped and, if nobody else has taken a reference,
    the release function is called which finally clears 'inode->i_cdev' from
    'cdev_purge()' before potentially freeing the object. The problem here
    is that a racing thread can happily take the 'cdev_lock' and see the
    non-NULL pointer in the inode, which can result in a refcount increment
    from zero and a warning:

    | ------------[ cut here ]------------
    | refcount_t: addition on 0; use-after-free.
    | WARNING: CPU: 2 PID: 6385 at lib/refcount.c:25 refcount_warn_saturate+0x6d/0xf0
    | Modules linked in:
    | CPU: 2 PID: 6385 Comm: repro Not tainted 5.5.0-rc2+ #22
    | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    | RIP: 0010:refcount_warn_saturate+0x6d/0xf0
    | Code: 05 55 9a 15 01 01 e8 9d aa c8 ff 0f 0b c3 80 3d 45 9a 15 01 00 75 ce 48 c7 c7 00 9c 62 b3 c6 08
    | RSP: 0018:ffffb524c1b9bc70 EFLAGS: 00010282
    | RAX: 0000000000000000 RBX: ffff9e9da1f71390 RCX: 0000000000000000
    | RDX: ffff9e9dbbd27618 RSI: ffff9e9dbbd18798 RDI: ffff9e9dbbd18798
    | RBP: 0000000000000000 R08: 000000000000095f R09: 0000000000000039
    | R10: 0000000000000000 R11: ffffb524c1b9bb20 R12: ffff9e9da1e8c700
    | R13: ffffffffb25ee8b0 R14: 0000000000000000 R15: ffff9e9da1e8c700
    | FS: 00007f3b87d26700(0000) GS:ffff9e9dbbd00000(0000) knlGS:0000000000000000
    | CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    | CR2: 00007fc16909c000 CR3: 000000012df9c000 CR4: 00000000000006e0
    | DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    | DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    | Call Trace:
    | kobject_get+0x5c/0x60
    | cdev_get+0x2b/0x60
    | chrdev_open+0x55/0x220
    | ? cdev_put.part.3+0x20/0x20
    | do_dentry_open+0x13a/0x390
    | path_openat+0x2c8/0x1470
    | do_filp_open+0x93/0x100
    | ? selinux_file_ioctl+0x17f/0x220
    | do_sys_open+0x186/0x220
    | do_syscall_64+0x48/0x150
    | entry_SYSCALL_64_after_hwframe+0x44/0xa9
    | RIP: 0033:0x7f3b87efcd0e
    | Code: 89 54 24 08 e8 a3 f4 ff ff 8b 74 24 0c 48 8b 3c 24 41 89 c0 44 8b 54 24 08 b8 01 01 00 00 89 f4
    | RSP: 002b:00007f3b87d259f0 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
    | RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3b87efcd0e
    | RDX: 0000000000000000 RSI: 00007f3b87d25a80 RDI: 00000000ffffff9c
    | RBP: 00007f3b87d25e90 R08: 0000000000000000 R09: 0000000000000000
    | R10: 0000000000000000 R11: 0000000000000293 R12: 00007ffe188f504e
    | R13: 00007ffe188f504f R14: 00007f3b87d26700 R15: 0000000000000000
    | ---[ end trace 24f53ca58db8180a ]---

    Since 'cdev_get()' can already fail to obtain a reference, simply move
    it over to use 'kobject_get_unless_zero()' instead of 'kobject_get()',
    which will cause the racing thread to return -ENXIO if the initialising
    thread fails unexpectedly.
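
    The difference between an unconditional get and a "get unless zero" can
    be sketched with C11 atomics. This is a user-space illustration only:
    'struct object', obj_get(), obj_get_unless_zero() and obj_put() are
    hypothetical names and are not the kobject implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; not the kernel's struct kobject. */
struct object {
    atomic_int refcount;
};

/*
 * Unconditional get: increments even when the count is already zero,
 * which is the "addition on 0" situation the warning above reports.
 */
void obj_get(struct object *o)
{
    atomic_fetch_add(&o->refcount, 1);
}

/*
 * Take a reference only if the object is still live. Returns false when
 * the count has already dropped to zero, so the caller can bail out.
 */
bool obj_get_unless_zero(struct object *o)
{
    int old = atomic_load(&o->refcount);

    while (old != 0) {
        /* On failure 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&o->refcount, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference. Returns true if this was the last one. */
bool obj_put(struct object *o)
{
    int old = atomic_fetch_sub(&o->refcount, 1);

    assert(old > 0);
    return old == 1;
}
```

    A caller that sees obj_get_unless_zero() return false treats the object
    as gone, which corresponds to the racing thread returning -ENXIO above.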

    Cc: Hillf Danton
    Cc: Andrew Morton
    Cc: Al Viro
    Reported-by: syzbot+82defefbbd8527e1c2cb@syzkaller.appspotmail.com
    Signed-off-by: Will Deacon
    Cc: stable
    Link: https://lore.kernel.org/r/20191219120203.32691-1-will@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     

05 Jan, 2020

6 commits

    Because ocfs2_get_dlm_debug() is called one time fewer than needed
    here, the ocfs2 file system triggers a system crash, usually after the
    file system is unmounted.

    This system crash is caused by generic memory corruption, so the crash
    backtraces are not always the same. For example:

    ocfs2: Unmounting device (253,16) on (node 172167785)
    general protection fault: 0000 [#1] SMP PTI
    CPU: 3 PID: 14107 Comm: fence_legacy Kdump:
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
    RIP: 0010:__kmalloc+0xa5/0x2a0
    Code: 00 00 4d 8b 07 65 4d 8b
    RSP: 0018:ffffaa1fc094bbe8 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: d310a8800d7a3faf RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 0000000000000dc0 RDI: ffff96e68fc036c0
    RBP: d310a8800d7a3faf R08: ffff96e6ffdb10a0 R09: 00000000752e7079
    R10: 000000000001c513 R11: 0000000004091041 R12: 0000000000000dc0
    R13: 0000000000000039 R14: ffff96e68fc036c0 R15: ffff96e68fc036c0
    FS: 00007f699dfba540(0000) GS:ffff96e6ffd80000(0000) knlGS:00000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055f3a9d9b768 CR3: 000000002cd1c000 CR4: 00000000000006e0
    Call Trace:
    ext4_htree_store_dirent+0x35/0x100 [ext4]
    htree_dirblock_to_tree+0xea/0x290 [ext4]
    ext4_htree_fill_tree+0x1c1/0x2d0 [ext4]
    ext4_readdir+0x67c/0x9d0 [ext4]
    iterate_dir+0x8d/0x1a0
    __x64_sys_getdents+0xab/0x130
    do_syscall_64+0x60/0x1f0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7f699d33a9fb

    This regression problem was introduced by commit e581595ea29c ("ocfs: no
    need to check return value of debugfs_create functions").

    Link: http://lkml.kernel.org/r/20191225061501.13587-1-ghe@suse.com
    Fixes: e581595ea29c ("ocfs: no need to check return value of debugfs_create functions")
    Signed-off-by: Gang He
    Acked-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Cc: [5.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gang He
     
    If the journal is dirty at mount time, it will be replayed, but the
    jbd2 sb log tail cannot be updated to mark a new start because
    journal->j_flag has already been set with JBD2_ABORT first in
    journal_init_common.

    When a new transaction is committed, it will be recorded in block 1
    first (journal->j_tail is set to 1 in journal_reset). If an emergency
    restart happens again before the journal super block is updated, the
    newly recorded transaction will not be replayed on the next mount.

    The following steps describe this procedure in detail.
    1. mount and touch some files
    2. these transactions are committed to journal area but not checkpointed
    3. emergency restart
    4. mount again and its journals are replayed
    5. journal super block's first s_start is 1, but its s_seq is not updated
    6. touch a new file and its trans is committed but not checkpointed
    7. emergency restart again
    8. mount and journal is dirty, but trans committed in 6 will not be
    replayed.

    This problem happens easily when the LUN is used by only one node.
    If it is used by multiple nodes, another node will replay its journal,
    and its journal super block will be updated after recovery, as this
    patch does.

    ocfs2_recover_node->ocfs2_replay_journal.

    The following jbd2 journal can be generated by touching a new file after
    journal is replayed, and seq 15 is the first valid commit, but first seq
    is 13 in journal super block.

    logdump:
    Block 0: Journal Superblock
    Seq: 0 Type: 4 (JBD2_SUPERBLOCK_V2)
    Blocksize: 4096 Total Blocks: 32768 First Block: 1
    First Commit ID: 13 Start Log Blknum: 1
    Error: 0
    Feature Compat: 0
    Feature Incompat: 2 block64
    Feature RO compat: 0
    Journal UUID: 4ED3822C54294467A4F8E87D2BA4BC36
    FS Share Cnt: 1 Dynamic Superblk Blknum: 0
    Per Txn Block Limit Journal: 0 Data: 0

    Block 1: Journal Commit Block
    Seq: 14 Type: 2 (JBD2_COMMIT_BLOCK)

    Block 2: Journal Descriptor
    Seq: 15 Type: 1 (JBD2_DESCRIPTOR_BLOCK)
    No. Blocknum Flags
    0. 587 none
    UUID: 00000000000000000000000000000000
    1. 8257792 JBD2_FLAG_SAME_UUID
    2. 619 JBD2_FLAG_SAME_UUID
    3. 24772864 JBD2_FLAG_SAME_UUID
    4. 8257802 JBD2_FLAG_SAME_UUID
    5. 513 JBD2_FLAG_SAME_UUID JBD2_FLAG_LAST_TAG
    ...
    Block 7: Inode
    Inode: 8257802 Mode: 0640 Generation: 57157641 (0x3682809)
    FS Generation: 2839773110 (0xa9437fb6)
    CRC32: 00000000 ECC: 0000
    Type: Regular Attr: 0x0 Flags: Valid
    Dynamic Features: (0x1) InlineData
    User: 0 (root) Group: 0 (root) Size: 7
    Links: 1 Clusters: 0
    ctime: 0x5de5d870 0x11104c61 -- Tue Dec 3 11:37:20.286280801 2019
    atime: 0x5de5d870 0x113181a1 -- Tue Dec 3 11:37:20.288457121 2019
    mtime: 0x5de5d870 0x11104c61 -- Tue Dec 3 11:37:20.286280801 2019
    dtime: 0x0 -- Thu Jan 1 08:00:00 1970
    ...
    Block 9: Journal Commit Block
    Seq: 15 Type: 2 (JBD2_COMMIT_BLOCK)

    The following is the journal recovery log when recovering the above jbd2
    journal on the next mount.

    syslog:
    ocfs2: File system on device (252,1) was not unmounted cleanly, recovering it.
    fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 0
    fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 1
    fs/jbd2/recovery.c:(do_one_pass, 449): Starting recovery pass 2
    fs/jbd2/recovery.c:(jbd2_journal_recover, 278): JBD2: recovery, exit status 0, recovered transactions 13 to 13

    Because the first commit seq 13 recorded in the journal super block is
    not consistent with the value recorded in block 1 (seq is 14), journal
    recovery is terminated before seq 15 even though it is an unbroken
    commit. Inode 8257802 is a new file and it will be lost.

    Link: http://lkml.kernel.org/r/20191217020140.2197-1-li.kai4@h3c.com
    Signed-off-by: Kai Li
    Reviewed-by: Joseph Qi
    Reviewed-by: Changwei Ge
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Junxiao Bi
    Cc: Gang He
    Cc: Jun Piao
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kai Li
     
  • Fix kernel-doc warnings in fs/posix_acl.c.
    Also fix one typo (setgit -> setgid).

    fs/posix_acl.c:647: warning: Function parameter or member 'inode' not described in 'posix_acl_update_mode'
    fs/posix_acl.c:647: warning: Function parameter or member 'mode_p' not described in 'posix_acl_update_mode'
    fs/posix_acl.c:647: warning: Function parameter or member 'acl' not described in 'posix_acl_update_mode'

    Link: http://lkml.kernel.org/r/29b0dc46-1f28-a4e5-b1d0-ba2b65629779@infradead.org
    Fixes: 073931017b49d ("posix_acl: Clear SGID bit when setting file permissions")

    Signed-off-by: Randy Dunlap
    Acked-by: Andreas Gruenbacher
    Reviewed-by: Jan Kara
    Cc: Jan Kara
    Cc: Andreas Gruenbacher
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Make to_mnt_ns() static to address the following 'sparse' warning:

    fs/namespace.c:1731:22: warning: symbol 'to_mnt_ns' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20191209234830.156260-1-ebiggers@kernel.org
    Signed-off-by: Eric Biggers
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • Include linux/proc_fs.h and fs/internal.h to address the following
    'sparse' warnings:

    fs/nsfs.c:41:32: warning: symbol 'ns_dentry_operations' was not declared. Should it be static?
    fs/nsfs.c:145:5: warning: symbol 'open_related_ns' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20191209234822.156179-1-ebiggers@kernel.org
    Signed-off-by: Eric Biggers
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     
  • Include fs/internal.h to address the following 'sparse' warning:

    fs/direct-io.c:591:5: warning: symbol 'sb_init_dio_done_wq' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20191209234544.128302-1-ebiggers@kernel.org
    Signed-off-by: Eric Biggers
    Reviewed-by: Jan Kara
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Biggers
     

04 Jan, 2020

3 commits

  • Pull btrfs fixes from David Sterba:
    "A few fixes for btrfs:

    - blkcg accounting problem with compression that could stall writes

    - setting up blkcg bio for compression crashes due to NULL bdev
    pointer

    - fix possible infinite loop in writeback for nocow files (here
    possible means almost impossible, 13 things that need to happen to
    trigger it)"

    * tag 'for-5.5-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
    Btrfs: fix infinite loop during nocow writeback due to race
    btrfs: fix compressed write bio blkcg attribution
    btrfs: punt all bios created in btrfs_submit_compressed_write()

    Linus Torvalds
     
  • Pull block fixes from Jens Axboe:
    "Three fixes in here:

    - Fix for a missing split on default memory boundary mask (4G) (Ming)

    - Fix for multi-page read bio truncate (Ming)

    - Fix for null_blk zone close request handling (Damien)"

    * tag 'block-5.5-20200103' of git://git.kernel.dk/linux-block:
    null_blk: Fix REQ_OP_ZONE_CLOSE handling
    block: fix splitting segments on boundary masks
    block: add bio_truncate to fix guard_bio_eod

    Linus Torvalds
     
  • LTP memfd_create04 started failing for some huge page sizes
    after v5.4-10135-gc3bfc5dd73c6.

    The problem is the check introduced to the for_each_hstate() loop that
    should skip default_hstate_idx. Since it doesn't update the 'i' counter,
    all subsequent huge page sizes are skipped as well.
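
    The counter problem can be shown in isolation. The sketch below is a
    user-space model, not the hugetlbfs code: visit_buggy() and
    visit_fixed() are hypothetical names, the outer loop variable 'n'
    stands in for the iterator, and 'i' is the separately maintained index
    that is compared against the entry to skip.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Buggy form: when the skipped index is reached, 'continue' leaves 'i'
 * unchanged, so every later iteration also matches and is skipped.
 * Returns the number of entries actually visited.
 */
size_t visit_buggy(size_t count, size_t skip_idx)
{
    size_t visited = 0;
    size_t i = 0;

    assert(skip_idx < count);
    for (size_t n = 0; n < count; n++) {
        if (i == skip_idx)
            continue; /* 'i' is not advanced here */
        visited++;
        i++;
    }
    return visited;
}

/* Fixed form: advance 'i' before skipping, so only one entry is skipped. */
size_t visit_fixed(size_t count, size_t skip_idx)
{
    size_t visited = 0;
    size_t i = 0;

    assert(skip_idx < count);
    for (size_t n = 0; n < count; n++) {
        if (i == skip_idx) {
            i++;
            continue;
        }
        visited++;
        i++;
    }
    return visited;
}
```

    With four entries and index 1 skipped, the buggy form visits one entry
    and the fixed form visits three.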

    Fixes: 8fc312b32b25 ("mm/hugetlbfs: fix error handling when setting up mounts")
    Signed-off-by: Jan Stancek
    Reviewed-by: Mike Kravetz
    Signed-off-by: Linus Torvalds

    Jan Stancek
     

03 Jan, 2020

4 commits

  • Pull pstore bug fixes from Kees Cook:

    - always reset circular buffer state when writing new dump (Aleksandr
    Yashkin)

    - fix rare error-path memory leak (Kees Cook)

    * tag 'pstore-v5.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    pstore/ram: Write new dumps to start of recycled zones
    pstore/ram: Fix error-path memory leak in persistent_ram_new() callers

    Linus Torvalds
     
  • This reverts commit 8243186f0cc7 ("fs: remove ksys_dup()") and the
    subsequent fix for it in commit 2d3145f8d280 ("early init: fix error
    handling when opening /dev/console").

    Trying to use filp_open() and f_dupfd() instead of pseudo-syscalls
    caused more trouble than it is worth: it requires accessing vfs
    internals and it turns out there were other bugs in it too.

    In particular, the file reference counting was wrong - because unlike
    the original "open+2*dup" sequence it used "filp_open+3*f_dupfd" and
    thus had an extra leaked file reference.

    That in turn then caused odd problems with Androidx86 long after boot
    because of how the extra reference to the console kept the session active
    even after all file descriptors had been closed.
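
    The reference arithmetic described above can be written out as a tiny
    model. This is a sketch under assumptions, not VFS code: 'struct
    file_model' and the model_*() helpers are hypothetical, and the only
    thing modelled is who holds each reference.

```c
#include <assert.h>

/* Hypothetical model: one open file and the references pointing at it. */
struct file_model {
    int refs;    /* total references to the file */
    int fd_refs; /* references held by file descriptors */
};

/* open(): the only reference belongs to the new descriptor. */
void model_open(struct file_model *f)
{
    f->refs = 1;
    f->fd_refs = 1;
}

/* dup(): the new descriptor takes its own reference. */
void model_dup(struct file_model *f)
{
    f->refs++;
    f->fd_refs++;
}

/* filp_open() style: the reference is held by the caller, not by an fd. */
void model_filp_open(struct file_model *f)
{
    f->refs = 1;
    f->fd_refs = 0;
}

/* f_dupfd() style: installs a new descriptor with its own reference. */
void model_f_dupfd(struct file_model *f)
{
    f->refs++;
    f->fd_refs++;
}

/* Close every descriptor and return the references that remain. */
int model_close_all(struct file_model *f)
{
    assert(f->refs >= f->fd_refs);
    f->refs -= f->fd_refs;
    f->fd_refs = 0;
    return f->refs;
}
```

    In this model "open+2*dup" leaves no reference after all descriptors
    are closed, while "filp_open+3*f_dupfd" leaves one, which matches the
    leaked reference described above.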

    Reported-by: youling 257
    Cc: Arvind Sankar
    Cc: Al Viro
    Signed-off-by: Dominik Brodowski
    Signed-off-by: Linus Torvalds

    Dominik Brodowski
     
  • The ram_core.c routines treat przs as circular buffers. When writing a
    new crash dump, the old buffer needs to be cleared so that the new dump
    doesn't end up in the wrong place (i.e. at the end).

    The solution to this problem is to reset the circular buffer state before
    writing a new Oops dump.
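
    A minimal circular-buffer model shows why the reset matters. The names
    'struct zone_buf', zone_reset() and zone_write() and the 16-byte size
    are assumptions made for this sketch; they are not the ram_core.c
    routines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ZONE_SIZE 16 /* arbitrary size chosen for the illustration */

/* Hypothetical recycled zone treated as a circular buffer. */
struct zone_buf {
    char data[ZONE_SIZE];
    size_t start; /* next write position */
    size_t size;  /* number of valid bytes */
};

/* Forget the old contents so the next write begins at offset 0. */
void zone_reset(struct zone_buf *z)
{
    z->start = 0;
    z->size = 0;
}

/* Append 'len' bytes at the current write position, wrapping at the end. */
void zone_write(struct zone_buf *z, const char *s, size_t len)
{
    assert(z->start < ZONE_SIZE);
    for (size_t i = 0; i < len; i++) {
        z->data[z->start] = s[i];
        z->start = (z->start + 1) % ZONE_SIZE;
        if (z->size < ZONE_SIZE)
            z->size++;
    }
}
```

    Writing a new dump into a recycled zone without zone_reset() appends it
    after the old bytes; calling zone_reset() first places it at the start.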

    Signed-off-by: Aleksandr Yashkin
    Signed-off-by: Nikolay Merinov
    Signed-off-by: Ariel Gilman
    Link: https://lore.kernel.org/r/20191223133816.28155-1-n.merinov@inango-systems.com
    Fixes: 896fc1f0c4c6 ("pstore/ram: Switch to persistent_ram routines")
    Cc: stable@vger.kernel.org
    Signed-off-by: Kees Cook

    Aleksandr Yashkin
     
  • For callers that allocated a label for persistent_ram_new(), if the call
    fails, they must clean up the allocation.

    Suggested-by: Navid Emamdoost
    Fixes: 1227daa43bce ("pstore/ram: Clarify resource reservation labels")
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/lkml/20191211191353.14385-1-navid.emamdoost@gmail.com
    Signed-off-by: Kees Cook

    Kees Cook
     

30 Dec, 2019

5 commits

  • When starting writeback for a range that covers part of a preallocated
    extent, due to a race with writeback for another range that also covers
    another part of the same preallocated extent, we can end up in an infinite
    loop.

    Consider the following example where for inode 280 we have two dirty
    ranges:

    range A, from 294912 to 303103, 8192 bytes
    range B, from 348160 to 438271, 90112 bytes

    and we have the following file extent item layout for our inode:

    leaf 38895616 gen 24544 total ptrs 29 free space 13820 owner 5
    (...)
    item 27 key (280 108 200704) itemoff 14598 itemsize 53
    extent data disk bytenr 0 nr 0 type 1 (regular)
    extent data offset 0 nr 94208 ram 94208
    item 28 key (280 108 294912) itemoff 14545 itemsize 53
    extent data disk bytenr 10433052672 nr 81920 type 2 (prealloc)
    extent data offset 0 nr 81920 ram 81920

    Then the following happens:

    1) Writeback starts for range B (from 348160 to 438271), execution of
    run_delalloc_nocow() starts;

    2) The first iteration of run_delalloc_nocow()'s while loop leaves us at
    the extent item at slot 28, pointing to the prealloc extent item
    covering the range from 294912 to 376831. This extent covers part of
    our range;

    3) An ordered extent is created against that extent, covering the file
    range from 348160 to 376831 (28672 bytes);

    4) We adjust 'cur_offset' to 376832 and move on to the next iteration of
    the while loop;

    5) The call to btrfs_lookup_file_extent() leaves us at the same leaf,
    pointing to slot 29, 1 slot after the last item (the extent item
    we processed in the previous iteration);

    6) Because we are a slot beyond the last item, we call btrfs_next_leaf(),
    which releases the search path before doing another search for the
    last key of the leaf (280 108 294912);

    7) Right after btrfs_next_leaf() released the path, and before it did
    another search for the last key of the leaf, writeback for the range
    A (from 294912 to 303103) completes (it was previously started at
    some point);

    8) Upon completion of the ordered extent for range A, the prealloc extent
    we previously found got split into two extent items, one covering the
    range from 294912 to 303103 (8192 bytes), with a type of regular extent
    (and no longer prealloc) and another covering the range from 303104 to
    376831 (73728 bytes), with a type of prealloc and an offset of 8192
    bytes. So our leaf now has the following layout:

    leaf 38895616 gen 24544 total ptrs 31 free space 13664 owner 5
    (...)
    item 27 key (280 108 200704) itemoff 14598 itemsize 53
    extent data disk bytenr 0 nr 0 type 1
    extent data offset 0 nr 8192 ram 94208
    item 28 key (280 108 208896) itemoff 14545 itemsize 53
    extent data disk bytenr 10433142784 nr 86016 type 1
    extent data offset 0 nr 86016 ram 86016
    item 29 key (280 108 294912) itemoff 14492 itemsize 53
    extent data disk bytenr 10433052672 nr 81920 type 1
    extent data offset 0 nr 8192 ram 81920
    item 30 key (280 108 303104) itemoff 14439 itemsize 53
    extent data disk bytenr 10433052672 nr 81920 type 2
    extent data offset 8192 nr 73728 ram 81920

    9) After btrfs_next_leaf() returns, we have our path pointing to that same
    leaf and at slot 30, since it has a key we didn't have before and it's
    the first key greater than the key that was previously the last key of
    the leaf (key (280 108 294912));

    10) The extent item at slot 30 covers the range from 303104 to 376831
    which is in our target range, so we process it, despite having already
    created an ordered extent against this extent for the file range from
    348160 to 376831. This is because we skip to the next extent item only
    if its end is less than or equal to the start of our delalloc range,
    and not less than or equal to the current offset ('cur_offset');

    11) As a result we compute 'num_bytes' as:

    num_bytes = min(end + 1, extent_end) - cur_offset;
    = min(438271 + 1, 376832) - 376832 = 0

    12) We then call create_io_em() for a 0 bytes range starting at offset
    376832;

    13) Then create_io_em() enters an infinite loop because its calls to
    btrfs_drop_extent_cache() do nothing due to the 0 length range
    passed to it. So no existing extent maps that cover the offset
    376832 get removed, and therefore calls to add_extent_mapping()
    return -EEXIST, resulting in an infinite loop. This loop from
    create_io_em() is the following:

    do {
    btrfs_drop_extent_cache(BTRFS_I(inode), em->start,
    em->start + em->len - 1, 0);
    write_lock(&em_tree->lock);
    ret = add_extent_mapping(em_tree, em, 1);
    write_unlock(&em_tree->lock);
    /*
    * The caller has taken lock_extent(), who could race with us
    * to add em?
    */
    } while (ret == -EEXIST);

    Also, each call to btrfs_drop_extent_cache() triggers a warning because
    the start offset passed to it (376832) is greater than the end offset
    (376832 - 1) passed to it by 1, due to the 0 length:

    [258532.052621] ------------[ cut here ]------------
    [258532.052643] WARNING: CPU: 0 PID: 9987 at fs/btrfs/file.c:602 btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
    (...)
    [258532.052672] CPU: 0 PID: 9987 Comm: fsx Tainted: G W 5.4.0-rc7-btrfs-next-64 #1
    [258532.052673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
    [258532.052691] RIP: 0010:btrfs_drop_extent_cache+0x3f4/0x590 [btrfs]
    (...)
    [258532.052695] RSP: 0018:ffffb4be0153f860 EFLAGS: 00010287
    [258532.052700] RAX: ffff975b445ee360 RBX: ffff975b44eb3e08 RCX: 0000000000000000
    [258532.052700] RDX: 0000000000038fff RSI: 0000000000039000 RDI: ffff975b445ee308
    [258532.052700] RBP: 0000000000038fff R08: 0000000000000000 R09: 0000000000000001
    [258532.052701] R10: ffff975b513c5c10 R11: 00000000e3c0cfa9 R12: 0000000000039000
    [258532.052703] R13: ffff975b445ee360 R14: 00000000ffffffef R15: ffff975b445ee308
    [258532.052705] FS: 00007f86a821de80(0000) GS:ffff975b76a00000(0000) knlGS:0000000000000000
    [258532.052707] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [258532.052708] CR2: 00007fdacf0f3ab4 CR3: 00000001f9d26002 CR4: 00000000003606f0
    [258532.052712] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [258532.052717] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [258532.052717] Call Trace:
    [258532.052718] ? preempt_schedule_common+0x32/0x70
    [258532.052722] ? ___preempt_schedule+0x16/0x20
    [258532.052741] create_io_em+0xff/0x180 [btrfs]
    [258532.052767] run_delalloc_nocow+0x942/0xb10 [btrfs]
    [258532.052791] btrfs_run_delalloc_range+0x30b/0x520 [btrfs]
    [258532.052812] ? find_lock_delalloc_range+0x221/0x250 [btrfs]
    [258532.052834] writepage_delalloc+0xe4/0x140 [btrfs]
    [258532.052855] __extent_writepage+0x110/0x4e0 [btrfs]
    [258532.052876] extent_write_cache_pages+0x21c/0x480 [btrfs]
    [258532.052906] extent_writepages+0x52/0xb0 [btrfs]
    [258532.052911] do_writepages+0x23/0x80
    [258532.052915] __filemap_fdatawrite_range+0xd2/0x110
    [258532.052938] btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
    [258532.052954] start_ordered_ops+0x57/0xa0 [btrfs]
    [258532.052973] ? btrfs_sync_file+0x225/0x490 [btrfs]
    [258532.052988] btrfs_sync_file+0x225/0x490 [btrfs]
    [258532.052997] __x64_sys_msync+0x199/0x200
    [258532.053004] do_syscall_64+0x5c/0x250
    [258532.053007] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [258532.053010] RIP: 0033:0x7f86a7dfd760
    (...)
    [258532.053014] RSP: 002b:00007ffd99af0368 EFLAGS: 00000246 ORIG_RAX: 000000000000001a
    [258532.053016] RAX: ffffffffffffffda RBX: 0000000000000ec9 RCX: 00007f86a7dfd760
    [258532.053017] RDX: 0000000000000004 RSI: 000000000000836c RDI: 00007f86a8221000
    [258532.053019] RBP: 0000000000021ec9 R08: 0000000000000003 R09: 00007f86a812037c
    [258532.053020] R10: 0000000000000001 R11: 0000000000000246 R12: 00000000000074a3
    [258532.053021] R13: 00007f86a8221000 R14: 000000000000836c R15: 0000000000000001
    [258532.053032] irq event stamp: 1653450494
    [258532.053035] hardirqs last enabled at (1653450493): [] _raw_spin_unlock_irq+0x29/0x50
    [258532.053037] hardirqs last disabled at (1653450494): [] trace_hardirqs_off_thunk+0x1a/0x20
    [258532.053039] softirqs last enabled at (1653449852): [] __do_softirq+0x466/0x6bd
    [258532.053042] softirqs last disabled at (1653449845): [] irq_exit+0xec/0x120
    [258532.053043] ---[ end trace 8476fce13d9ce20a ]---

    Which results in flooding dmesg/syslog since btrfs_drop_extent_cache()
    uses WARN_ON() and not WARN_ON_ONCE().

    So fix this issue by changing run_delalloc_nocow()'s loop to move to the
    next extent item when the current extent item ends at an offset less than
    or equal to the current offset instead of the start offset.
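
    The arithmetic from step 11 and the changed skip condition can be
    checked with the numbers from the example. This is a user-space sketch:
    nocow_num_bytes() and extent_ends_at_or_before() are hypothetical
    helpers written for illustration, not functions from fs/btrfs.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t min_u64(uint64_t a, uint64_t b)
{
    return a < b ? a : b;
}

/* num_bytes = min(end + 1, extent_end) - cur_offset, as in step 11. */
uint64_t nocow_num_bytes(uint64_t end, uint64_t extent_end,
                         uint64_t cur_offset)
{
    uint64_t limit = min_u64(end + 1, extent_end);

    assert(cur_offset <= limit); /* guard against unsigned underflow */
    return limit - cur_offset;
}

/*
 * Skip test for an extent item. The old loop compared the extent end
 * against the start of the delalloc range; the fix compares it against
 * the current offset instead.
 */
int extent_ends_at_or_before(uint64_t extent_end, uint64_t offset)
{
    return extent_end <= offset;
}
```

    With start 348160, end 438271, extent_end 376832 and cur_offset 376832,
    the comparison against the start does not skip the item and
    nocow_num_bytes() evaluates to 0, while the comparison against
    cur_offset skips it.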

    Fixes: 80ff385665b7fc ("Btrfs: update nodatacow code v2")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba

    Filipe Manana
     
  • Bio attribution is handled at bio_set_dev() as once we have a device, we
    have a corresponding request_queue and then can derive the current css.
    In special cases, we want to attribute the bio to someone else. This can
    be done by calling bio_associate_blkg_from_css() or
    kthread_associate_blkcg() depending on the scenario. Btrfs does this for
    compressed writeback as they are handled by kworkers, so the latter can
    be done here.

    Commit 1a41802701ec ("btrfs: drop bio_set_dev where not needed") removes
    early bio_set_dev() calls prior to submit_stripe_bio(). This breaks the
    above assumption that we'll have a request_queue when we are doing
    association. To fix this, switch to using kthread_associate_blkcg().

    Without this, we crash in btrfs/024:

    [ 3052.093088] BUG: kernel NULL pointer dereference, address: 0000000000000510
    [ 3052.107013] #PF: supervisor read access in kernel mode
    [ 3052.107014] #PF: error_code(0x0000) - not-present page
    [ 3052.107015] PGD 0 P4D 0
    [ 3052.107021] Oops: 0000 [#1] SMP
    [ 3052.138904] CPU: 42 PID: 201270 Comm: kworker/u161:0 Kdump: loaded Not tainted 5.5.0-rc1-00062-g4852d8ac90a9 #712
    [ 3052.138905] Hardware name: Quanta Tioga Pass Single Side 01-0032211004/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018
    [ 3052.138912] Workqueue: btrfs-delalloc btrfs_work_helper
    [ 3052.191375] RIP: 0010:bio_associate_blkg_from_css+0x1e/0x3c0
    [ 3052.191379] RSP: 0018:ffffc900210cfc90 EFLAGS: 00010282
    [ 3052.191380] RAX: 0000000000000000 RBX: ffff88bfe5573c00 RCX: 0000000000000000
    [ 3052.191382] RDX: ffff889db48ec2f0 RSI: ffff88bfe5573c00 RDI: ffff889db48ec2f0
    [ 3052.191386] RBP: 0000000000000800 R08: 0000000000203bb0 R09: ffff889db16b2400
    [ 3052.293364] R10: 0000000000000000 R11: ffff88a07fffde80 R12: ffff889db48ec2f0
    [ 3052.293365] R13: 0000000000001000 R14: ffff889de82bc000 R15: ffff889e2b7bdcc8
    [ 3052.293367] FS: 0000000000000000(0000) GS:ffff889ffba00000(0000) knlGS:0000000000000000
    [ 3052.293368] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 3052.293369] CR2: 0000000000000510 CR3: 0000000002611001 CR4: 00000000007606e0
    [ 3052.293370] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 3052.293371] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 3052.293372] PKRU: 55555554
    [ 3052.293376] Call Trace:
    [ 3052.402552] btrfs_submit_compressed_write+0x137/0x390
    [ 3052.402558] submit_compressed_extents+0x40f/0x4c0
    [ 3052.422401] btrfs_work_helper+0x246/0x5a0
    [ 3052.422408] process_one_work+0x200/0x570
    [ 3052.438601] ? process_one_work+0x180/0x570
    [ 3052.438605] worker_thread+0x4c/0x3e0
    [ 3052.438614] kthread+0x103/0x140
    [ 3052.460735] ? process_one_work+0x570/0x570
    [ 3052.460737] ? kthread_mod_delayed_work+0xc0/0xc0
    [ 3052.460744] ret_from_fork+0x24/0x30

    Fixes: 1a41802701ec ("btrfs: drop bio_set_dev where not needed")
    Reported-by: Chris Murphy
    Signed-off-by: Dennis Zhou
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Dennis Zhou
     
  • Compressed writes happen in the background via kworkers. However, this
    causes bios to be attributed to root bypassing any cgroup limits from
    the actual writer. We tag the first bio with REQ_CGROUP_PUNT, which will
    punt the bio to an appropriate cgroup specific workqueue and attribute
    the IO properly. However, if btrfs_submit_compressed_write() creates a
    new bio, we don't tag it the same way. Add the appropriate tagging for
    subsequent bios.

    Fixes: ec39f7696ccfa ("Btrfs: use REQ_CGROUP_PUNT for worker thread submitted bios")
    Reviewed-by: Chris Mason
    Signed-off-by: Dennis Zhou
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba

    Dennis Zhou
     
  • Pull /proc/locks formatting fix from Jeff Layton:
    "This is a trivial fix for a _very_ long standing bug in /proc/locks
    formatting. Ordinarily, I'd wait for the merge window for something
    like this, but it is making it difficult to validate some overlayfs
    fixes.

    I've also gone ahead and marked this for stable"

    * tag 'locks-v5.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux:
    locks: print unsigned ino in /proc/locks

    Linus Torvalds
     
  • Pull cifs fixes from Steve French:
    "One performance fix for large directory searches, and one minor style
    cleanup noticed by Clang"

    * tag '5.5-rc3-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
    cifs: Optimize readdir on reparse points
    cifs: Adjust indentation in smb2_open_file

    Linus Torvalds
     

29 Dec, 2019

2 commits

  • An ino is unsigned, so display it as such in /proc/locks.

    Cc: stable@vger.kernel.org
    Signed-off-by: Amir Goldstein
    Signed-off-by: Jeff Layton
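
    A minimal sketch of why the format specifier matters, assuming a 64-bit
    unsigned long. The helper names format_ino_unsigned() and
    format_ino_signed() are hypothetical and are not the code in fs/locks.c;
    they only contrast printing the same inode bits as unsigned and as signed.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: print an inode number as unsigned, as the fix describes. */
static int format_ino_unsigned(char *buf, size_t len, unsigned long ino)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%lu", ino);
}

/* For comparison only: the same bits printed through a signed specifier. */
static int format_ino_signed(char *buf, size_t len, unsigned long ino)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%ld", (long)ino);
}
```

    With an inode number whose top bit is set, the signed variant produces a
    leading minus sign, which is the kind of output that confuses tools
    parsing /proc/locks.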

    Amir Goldstein
     
  • Some filesystems, such as vfat, may send a bio that crosses the device
    boundary, and worse, an IO request that starts within the device boundary
    can contain more than one segment past EOD.

    Commit dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors")
    tries to fix this issue by returning -EIO for this situation. However,
    this takes away the fs user code's chance to handle -EIO, and
    sync_inodes_sb() may then hang forever.

    Also, the current truncation of the last segment is dangerous because it
    updates the last bvec: the bvec table is then no longer immutable, and fs
    bio users may not be able to retrieve the truncated pages via
    bio_for_each_segment_all() in their .end_io callback.

    Fix this issue by supporting multi-segment truncation. The approach is
    also simpler:

    - just update the bio size, since the block layer can build the correct
    bvec from the updated bio size. The bvec table then becomes truly immutable.

    - zero all truncated segments for read bios

    Cc: Carlos Maiolino
    Cc: linux-fsdevel@vger.kernel.org
    Fixed-by: dce30ca9e3b6 ("fs: fix guard_bio_eod to check for real EOD errors")
    Reported-by: syzbot+2b9e54155c8c25d8d165@syzkaller.appspotmail.com
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
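
    The two steps described in this commit can be sketched in plain userspace
    C. This is an illustration under stated assumptions, not the kernel's
    bio_truncate(): struct seg, struct toy_bio and toy_bio_truncate() are
    hypothetical stand-ins for the bvec table, the bio and the truncation
    helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one bvec: a buffer and its length. */
struct seg {
    unsigned char *buf;
    size_t len;
};

/* Hypothetical stand-in for a bio. The segment table is const: it is never edited. */
struct toy_bio {
    const struct seg *segs;
    size_t nr_segs;
    size_t size;      /* number of bytes in the table that are still valid */
    int is_read;
};

/* Shrink the bio to new_size bytes; for reads, zero every byte that is cut off. */
static void toy_bio_truncate(struct toy_bio *bio, size_t new_size)
{
    size_t offset = 0;   /* byte offset at which the current segment starts */
    size_t i;

    assert(bio != NULL);
    if (new_size >= bio->size)
        return;

    if (bio->is_read) {
        for (i = 0; i < bio->nr_segs && offset < bio->size; i++) {
            const struct seg *s = &bio->segs[i];
            size_t end = offset + s->len;

            if (end > bio->size)
                end = bio->size;
            if (end > new_size) {
                /* Zero the part of this segment that lies past new_size. */
                size_t from = new_size > offset ? new_size - offset : 0;

                memset(s->buf + from, 0, (end - offset) - from);
            }
            offset += s->len;
        }
    }

    /* Only the size changes; segment lengths stay untouched. */
    bio->size = new_size;
}
```

    The truncation may span several segments, and the table itself is left
    unmodified, which is the property the commit message calls "really
    immutable".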

    Ming Lei
     

28 Dec, 2019

1 commit

  • Pull io_uring fixes from Jens Axboe:

    - Removal of now unused busy wqe list (Hillf)

    - Add cond_resched() to io-wq work processing (Hillf)

    - And then the series that I hinted at from last week, which removes
    the sqe from the io_kiocb and keeps all sqe handling on the prep
    side. This guarantees that an opcode can't do the wrong thing and
    read the sqe more than once. This is unchanged from last week, no
    issues have been observed with this in testing. Hence I really think
    we should fold this into 5.5.

    * tag 'io_uring-5.5-20191226' of git://git.kernel.dk/linux-block:
    io-wq: add cond_resched() to worker thread
    io-wq: remove unused busy list from io_sqe
    io_uring: pass in 'sqe' to the prep handlers
    io_uring: standardize the prep methods
    io_uring: read 'count' for IORING_OP_TIMEOUT in prep handler
    io_uring: move all prep state for IORING_OP_{SEND,RECV}_MGS to prep handler
    io_uring: move all prep state for IORING_OP_CONNECT to prep handler
    io_uring: add and use struct io_rw for read/writes
    io_uring: use u64_to_user_ptr() consistently

    Linus Torvalds
     

25 Dec, 2019

1 commit


23 Dec, 2019

7 commits

  • Commit e61df66c69b1 ("io-wq: ensure free/busy list browsing see all
    items") added a list for io workers in addition to the free and busy
    lists, not only making worker walk cleaner, but leaving the busy list
    unused. Let's remove it.

    Signed-off-by: Hillf Danton
    Signed-off-by: Jens Axboe

    Hillf Danton
     
  • When listing a directory with thousands of files, most of which are
    reparse points, we simply marked all those dentries for revalidation
    and then sent additional (compounded) create/getinfo/close requests
    for each of them.

    Instead, upon receiving a response from an SMB2_QUERY_DIRECTORY
    (FileIdFullDirectoryInformation) command, the directory entries that
    have a file attribute of FILE_ATTRIBUTE_REPARSE_POINT will contain an
    EaSize field with a reparse tag in it, so we parse it and mark the
    dentry for revalidation only if it is a DFS or a symlink.

    Signed-off-by: Paulo Alcantara (SUSE)
    Reviewed-by: Pavel Shilovsky
    Signed-off-by: Steve French
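
    The decision described above could be sketched as follows. The helper
    entry_needs_revalidation() is hypothetical and is not the cifs code; the
    constants use the values from the Windows reparse tag definitions, and the
    sketch only models the reparse-point decision, not any other reason a
    dentry might be revalidated.

```c
#include <stdbool.h>
#include <stdint.h>

#define FILE_ATTRIBUTE_REPARSE_POINT 0x00000400u
#define IO_REPARSE_TAG_DFS           0x8000000Au
#define IO_REPARSE_TAG_SYMLINK       0xA000000Cu

/*
 * Hypothetical helper: for a reparse point, the EaSize field of the
 * directory entry carries the reparse tag. Revalidate only DFS links
 * and symlinks, as the commit message describes.
 */
static bool entry_needs_revalidation(uint32_t attrs, uint32_t ea_size)
{
    if (!(attrs & FILE_ATTRIBUTE_REPARSE_POINT))
        return false;   /* not a reparse point: EaSize is not a tag here */

    return ea_size == IO_REPARSE_TAG_DFS ||
           ea_size == IO_REPARSE_TAG_SYMLINK;
}
```

    Any other reparse tag is left alone, which avoids the extra
    create/getinfo/close round trips for those entries.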

    Paulo Alcantara (SUSE)
     
  • Clang warns:

    ../fs/cifs/smb2file.c:70:3: warning: misleading indentation; statement
    is not part of the previous 'if' [-Wmisleading-indentation]
    if (oparms->tcon->use_resilient) {
    ^
    ../fs/cifs/smb2file.c:66:2: note: previous statement is here
    if (rc)
    ^
    1 warning generated.

    This warning occurs because there is a space after the tab on this line.
    Remove it so that the indentation is consistent with the Linux kernel
    coding style and clang no longer warns.

    Fixes: 592fafe644bf ("Add resilienthandles mount parm")
    Link: https://github.com/ClangBuiltLinux/linux/issues/826
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Steve French

    Nathan Chancellor
     
  • Pull vfs fixes from Al Viro:
    "Eric's s_inodes softlockup fixes + Jan's fix for recent regression
    from pipe rework"

    * 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: call fsnotify_sb_delete after evict_inodes
    fs: avoid softlockups in s_inodes iterators
    pipe: Fix bogus dereference in iov_iter_alignment()

    Linus Torvalds
     
  • Pull xfs fixes from Darrick Wong:
    "Fix a few bugs that could lead to corrupt files, fsck complaints, and
    filesystem crashes:

    - Minor documentation fixes

    - Fix a file corruption due to read racing with an insert range
    operation.

    - Fix log reservation overflows when allocating large rt extents

    - Fix a buffer log item flags check

    - Don't allow administrators to mount with sunit= options that will
    cause later xfs_repair complaints about the root directory being
    suspicious because the fs geometry appeared inconsistent

    - Fix a non-static helper that should have been static"

    * tag 'xfs-5.5-fixes-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    xfs: Make the symbol 'xfs_rtalloc_log_count' static
    xfs: don't commit sunit/swidth updates to disk if that would cause repair failures
    xfs: split the sunit parameter update into two parts
    xfs: refactor agfl length computation function
    libxfs: resync with the userspace libxfs
    xfs: use bitops interface for buf log item AIL flag check
    xfs: fix log reservation overflows when allocating large rt extents
    xfs: stabilize insert range start boundary to avoid COW writeback race
    xfs: fix Sphinx documentation warning

    Linus Torvalds
     
  • Pull ext4 bug fixes from Ted Ts'o:
    "Ext4 bug fixes, including a regression fix"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: clarify impact of 'commit' mount option
    ext4: fix unused-but-set-variable warning in ext4_add_entry()
    jbd2: fix kernel-doc notation warning
    ext4: use RCU API in debug_print_tree
    ext4: validate the debug_want_extra_isize mount option at parse time
    ext4: reserve revoke credits in __ext4_new_inode
    ext4: unlock on error in ext4_expand_extra_isize()
    ext4: optimize __ext4_check_dir_entry()
    ext4: check for directory entries too close to block end
    ext4: fix ext4_empty_dir() for directories with holes

    Linus Torvalds
     
  • LTP pipeio_1 test is hanging with v5.5-rc2-385-gb8e382a185eb,
    with read side observing empty pipe and sleeping and write
    side running out of space and then sleeping as well. In this
    scenario there are 5 writers and 1 reader.

    Problem is that after pipe_write() reacquires pipe lock, it
    re-checks for empty pipe with potentially stale 'head' and
    doesn't wake up read side anymore. pipe->tail can advance
    beyond 'head', because there are multiple writers.

    Use pipe->head for empty pipe check after reacquiring lock
    to observe current state.

    Testing: With the patch, LTP pipeio_1 ran successfully in a loop for 1 hour.
    Without the patch it hung within a minute.

    Fixes: 1b6b26ae7053 ("pipe: fix and clarify pipe write wakeup logic")
    Reported-by: Rachel Sibley
    Signed-off-by: Jan Stancek
    Signed-off-by: Linus Torvalds
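
    A toy model of the stale-head problem, assuming a ring where "empty" means
    head == tail. struct toy_pipe and the helpers below are hypothetical and
    are not the code in fs/pipe.c; they only show how a head value sampled
    before the lock was dropped can disagree with the current state.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the pipe ring indices. */
struct toy_pipe {
    unsigned int head;   /* advanced by writers */
    unsigned int tail;   /* advanced by the reader */
};

static bool toy_pipe_empty(unsigned int head, unsigned int tail)
{
    return head == tail;
}

/* Problematic check: uses a head value sampled before the lock was dropped. */
static bool was_empty_stale(const struct toy_pipe *p, unsigned int stale_head)
{
    return toy_pipe_empty(stale_head, p->tail);
}

/* Check as described in the fix: read the current head after relocking. */
static bool was_empty_current(const struct toy_pipe *p)
{
    return toy_pipe_empty(p->head, p->tail);
}
```

    For example, if a writer sampled head == 3, other writers then advanced
    head to 5, and the reader drained everything so that tail == 5, the stale
    check reports "not empty" while the pipe is in fact empty. In that case
    the writer would skip the reader wakeup.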

    Jan Stancek
     

22 Dec, 2019

1 commit

  • A warning is reported when compiling with "-Wunused-but-set-variable":

    fs/ext4/namei.c: In function ‘ext4_add_entry’:
    fs/ext4/namei.c:2167:23: warning: variable ‘sbi’ set but not used
    [-Wunused-but-set-variable]
    struct ext4_sb_info *sbi;
    ^~~
    Fix this by moving the variable @sbi under CONFIG_UNICODE.

    Signed-off-by: Yunfeng Ye
    Reviewed-by: Ritesh Harjani
    Link: https://lore.kernel.org/r/cb5eb904-224a-9701-c38f-cb23514b1fff@huawei.com
    Signed-off-by: Theodore Ts'o
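
    The pattern of the fix can be sketched as below. This is an illustration
    only: struct toy_sb_info and toy_add_entry() are hypothetical and do not
    reproduce ext4_add_entry(). The point is that the variable is declared
    under the same config guard as its only user.

```c
#include <stddef.h>

/* Hypothetical stand-in for the superblock info. */
struct toy_sb_info {
    int encoding;
};

static int toy_add_entry(const struct toy_sb_info *info, const char *name)
{
#ifdef CONFIG_UNICODE
    /* Declared only when the code that reads it is compiled in. */
    const struct toy_sb_info *sbi = info;
#endif

    if (name == NULL)
        return -1;

#ifdef CONFIG_UNICODE
    if (sbi->encoding && name[0] == '\0')
        return -1;
#else
    (void)info;   /* parameter is intentionally unused in this configuration */
#endif

    return 0;
}
```

    Built without CONFIG_UNICODE, the variable no longer exists, so there is
    nothing for the compiler to flag as set but unused.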

    Yunfeng Ye
     

21 Dec, 2019

1 commit

  • Pull io_uring fixes from Jens Axboe:
    "Here's a set of fixes that should go into 5.5-rc3 for io_uring.

    This is bigger than I'd like it to be, mainly because we're fixing the
    case where an application reuses sqe data right after issue. This
    really must work, or it's confusing. With 5.5 we're flagging us as
    submit stable for the actual data, this must also be the case for
    SQEs.

    Honestly, I'd really like to add another series on top of this, since
    it cleans it up considerably and prevents any SQE reuse by design. I
    posted that here:

    https://lore.kernel.org/io-uring/20191220174742.7449-1-axboe@kernel.dk/T/#u

    and may still send it your way early next week once it's been looked
    at and had some more soak time (does pass all regression tests). With
    that series, we've unified the prep+issue handling, and only the prep
    phase even has access to the SQE.

    Anyway, outside of that, fixes in here for a few other issues that
    have been hit in testing or production"

    * tag 'io_uring-5.5-20191220' of git://git.kernel.dk/linux-block:
    io_uring: io_wq_submit_work() should not touch req->rw
    io_uring: don't wait when under-submitting
    io_uring: warn about unhandled opcode
    io_uring: read opcode and user_data from SQE exactly once
    io_uring: make IORING_OP_TIMEOUT_REMOVE deferrable
    io_uring: make IORING_OP_CANCEL_ASYNC deferrable
    io_uring: make IORING_POLL_ADD and IORING_POLL_REMOVE deferrable
    io_uring: make HARDLINK imply LINK
    io_uring: any deferred command must have stable sqe data
    io_uring: remove 'sqe' parameter to the OP helpers that take it
    io_uring: fix pre-prepped issue with force_nonblock == true
    io-wq: re-add io_wq_current_is_worker()
    io_uring: fix sporadic -EFAULT from IORING_OP_RECVMSG
    io_uring: fix stale comment and a few typos

    Linus Torvalds