13 Dec, 2019

2 commits

  • commit cba22d86e0a10b7070d2e6a7379dbea51aa0883c upstream.

    Currently, block device size is not updated on the second and subsequent
    opens for block devices where partition scan is disabled. This is particularly
    annoying for example for DVD drives as that means block device size does
    not get updated once the media is inserted into a drive if the device is
    already open when inserting the media. This is actually always the case
    for example when pktcdvd is in use.

    Fix the problem by revalidating block device size on every open even for
    devices with partition scan disabled.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe
    Cc: Laura Abbott
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
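
    The effect of revalidating on every open can be illustrated with a toy
    user-space model in Python. This is a sketch under stated assumptions, not
    kernel code: ToyDisk, ToyBdev, open_bdev and the capacity value are all
    hypothetical, and the model only covers a device with partition scan
    disabled.

```python
class ToyDisk:
    """Stand-in for a drive; capacity is 0 while no media is inserted."""

    def __init__(self):
        self.capacity = 0


class ToyBdev:
    """Stand-in for a block device with a cached size and an open count."""

    def __init__(self, disk):
        self.disk = disk
        self.size = 0
        self.openers = 0


def open_bdev(bdev, revalidate_every_open):
    """Open the device, refreshing the cached size according to the policy."""
    first_open = bdev.openers == 0
    bdev.openers += 1
    if first_open or revalidate_every_open:
        bdev.size = bdev.disk.capacity  # revalidate against the hardware
    return bdev.size


def size_after_media_insert(revalidate_every_open):
    """Open an empty drive, insert media, then open the device again."""
    disk = ToyDisk()
    bdev = ToyBdev(disk)
    open_bdev(bdev, revalidate_every_open)   # first opener keeps it open
    disk.capacity = 4_700_000_000            # hypothetical media size
    return open_bdev(bdev, revalidate_every_open)


print(size_after_media_insert(False))  # 0: stale size from the first open
print(size_after_media_insert(True))   # 4700000000: refreshed on reopen
```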
     
  • commit 731dc4868311ee097757b8746eaa1b4f8b2b4f1c upstream.

    Factor out code handling revalidation of bdev on disk change into a
    common helper.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe
    Cc: Laura Abbott
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

19 Sep, 2019

1 commit


20 Aug, 2019

1 commit


16 Aug, 2019

1 commit

  • We had a few issues with this code, and there's still a problem around
    how we deal with error handling for chained/split bios. For now, just
    revert the code and we'll try again with a thorough solution. This
    reverts commits:

    e15c2ffa1091 ("block: fix O_DIRECT error handling for bio fragments")
    0eb6ddfb865c ("block: Fix __blkdev_direct_IO() for bio fragments")
    6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
    893a1c97205a ("blk-mq: allow REQ_NOWAIT to return an error inline")

    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Aug, 2019

2 commits

  • Commit 89e524c04fa9 ("loop: Fix mount(2) failure due to race with
    LOOP_SET_FD") converted blkdev_get() to use the new helpers for
    finishing claiming of a block device. However the conversion botched the
    error handling in blkdev_get() and thus the bdev has been marked as held
    even in case __blkdev_get() returned error. This led to occasional
    warnings with block/001 test from blktests like:

    kernel: WARNING: CPU: 5 PID: 907 at fs/block_dev.c:1899 __blkdev_put+0x396/0x3a0

    Correct the error handling.

    CC: stable@vger.kernel.org
    Fixes: 89e524c04fa9 ("loop: Fix mount(2) failure due to race with LOOP_SET_FD")
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
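
    The corrected flow is a start/finish-or-abort pattern: the device may be
    marked as held only after the open itself succeeded. A minimal Python
    sketch of that pattern follows; ToyBdev, open_exclusive and failing_open
    are hypothetical names for illustration and do not mirror the kernel
    helpers' signatures.

```python
class ToyBdev:
    """Stand-in for a block device with exclusive-claim bookkeeping."""

    def __init__(self):
        self.claiming = None  # opener currently in the middle of claiming
        self.holder = None    # opener that holds the device exclusively


def open_exclusive(bdev, holder, do_open):
    """Claim the device, run do_open(), then finish or abort the claim."""
    if bdev.holder is not None or bdev.claiming is not None:
        raise BlockingIOError("device is busy")
    bdev.claiming = holder        # start claiming
    try:
        do_open()                 # plays the role of __blkdev_get()
    except OSError:
        bdev.claiming = None      # abort: the device must not end up held
        raise
    bdev.holder = holder          # finish claiming only after success
    bdev.claiming = None


def failing_open():
    """Simulate an open that returns an error."""
    raise OSError("open failed")


bdev = ToyBdev()
try:
    open_exclusive(bdev, "mount", failing_open)
except OSError:
    pass
print(bdev.holder)    # None: a failed open leaves no holder behind

open_exclusive(bdev, "mount", lambda: None)
print(bdev.holder)    # mount
```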
     
  • 0eb6ddfb865c tried to fix this up, but introduced a use-after-free
    of dio. Additionally, we still had an issue with error handling,
    as reported by Darrick:

    "I noticed a regression in xfs/747 (an unreleased xfstest for the
    xfs_scrub media scanning feature) on 5.3-rc3. I'll condense that down
    to a simpler reproducer:

    error-test: 0 209 linear 8:48 0
    error-test: 209 1 error
    error-test: 210 6446894 linear 8:48 210

    Basically we have a ~3G /dev/sdd and we set up device mapper to fail IO
    for sector 209 and to pass the io to the scsi device everywhere else.

    On 5.3-rc3, performing a directio pread of this range with a < 1M buffer
    (in other words, a request for fewer than MAX_BIO_PAGES bytes) yields
    EIO like you'd expect:

    pread64(3, 0x7f880e1c7000, 1048576, 0) = -1 EIO (Input/output error)
    pread: Input/output error
    +++ exited with 0 +++

    But doing it with a larger buffer succeeds(!):

    pread64(3, "XFSB\0\0\20\0\0\0\0\0\0\fL\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 1146880, 0) = 1146880
    read 1146880/1146880 bytes at offset 0
    1 MiB, 1 ops; 0.0009 sec (1.124 GiB/sec and 1052.6316 ops/sec)
    +++ exited with 0 +++

    (Note that the part of the buffer corresponding to the dm-error area is
    uninitialized)

    On 5.3-rc2, both commands would fail with EIO like you'd expect. The
    only change between rc2 and rc3 is commit 0eb6ddfb865c ("block: Fix
    __blkdev_direct_IO() for bio fragments").

    AFAICT we end up in __blkdev_direct_IO with a 1120K buffer, which gets
    split into two bios: one for the first BIO_MAX_PAGES worth of data (1MB)
    and a second one for the 96k after that."

    Fix this by noting that it's always safe to dereference dio if we get
    BLK_QC_T_EAGAIN returned, as end_io hasn't been run for that case. So
    we can safely increment the dio size before calling submit_bio(), and
    then decrement it on failure (not that it really matters, as the bio
    and dio are going away).

    For error handling, return to the original method of just using 'ret'
    for tracking the error, and the size tracking in dio->size.

    Fixes: 0eb6ddfb865c ("block: Fix __blkdev_direct_IO() for bio fragments")
    Fixes: 6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
    Reported-by: Darrick J. Wong
    Signed-off-by: Jens Axboe

    Jens Axboe
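
    The sizes quoted in the report can be checked with a little arithmetic.
    The Python sketch below assumes 4096-byte pages, 512-byte sectors and
    BIO_MAX_PAGES = 256 (consistent with the "1MB" figure in the report, but
    still an assumption about the reporter's system), and it assumes the
    request starts at offset 0 with a page-aligned buffer. split_into_bios
    and bio_index_for_sector are hypothetical helpers, not kernel functions.

```python
PAGE_SIZE = 4096        # assumption: 4 KiB pages
SECTOR_SIZE = 512       # assumption: 512-byte sectors
BIO_MAX_PAGES = 256     # assumption: matches the "1MB" per-bio figure


def split_into_bios(length):
    """Split a request of 'length' bytes into per-bio byte counts."""
    max_bytes = BIO_MAX_PAGES * PAGE_SIZE
    sizes = []
    while length > 0:
        chunk = min(length, max_bytes)
        sizes.append(chunk)
        length -= chunk
    return sizes


def bio_index_for_sector(sector, sizes):
    """Return the index of the bio that covers the given sector."""
    offset = sector * SECTOR_SIZE
    start = 0
    for index, size in enumerate(sizes):
        if start <= offset < start + size:
            return index
        start += size
    raise ValueError("sector lies outside the request")


sizes = split_into_bios(1146880)
print(sizes)                              # [1048576, 98304]
print(sizes[1] // 1024)                   # 96
print(bio_index_for_sector(209, sizes))   # 0
```

    Under these assumptions the failing sector 209 falls inside the first
    bio, while the trailing 96k bio completes normally.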
     

02 Aug, 2019

1 commit

  • The recent fix to properly handle IOCB_NOWAIT for async O_DIRECT IO
    (patch 6a43074e2f46) introduced two problems with BIO fragment handling
    for direct IOs:
    1) The dio size processed is calculated by incrementing the ret variable
    by the size of the bio fragment issued for the dio. However, this size
    is obtained directly from bio->bi_iter.bi_size AFTER the bio submission,
    so bi_size may be read after the bio has completed, resulting in the
    use of an incorrect value.
    2) The ret variable is not incremented by the size of the last bio
    fragment issued for the bio, leading to an invalid IO size being
    returned to the user.

    Fix both problems by using dio->size (which is incremented before the bio
    submission) to update the value of ret after bio submissions, including
    for the last bio fragment issued.

    Fixes: 6a43074e2f46 ("block: properly handle IOCB_NOWAIT for async O_DIRECT IO")
    Reported-by: Masato Suzuki
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
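
    The first problem is an ordering hazard: a field read from the bio after
    submission may already have been changed by completion. A toy Python
    model shows why the size has to be recorded before submission. It is a
    sketch, not kernel code: ToyBio, submit and both accounting functions are
    hypothetical, and the immediate completion that zeroes the size is a
    deliberately pessimistic assumption.

```python
class ToyBio:
    """Stand-in for a bio: 'size' plays the role of bio->bi_iter.bi_size."""

    def __init__(self, size):
        self.size = size


def submit(bio):
    """Pretend the device completes the bio immediately.

    Completion consumes the iterator, so the remaining size drops to zero,
    which is the worst case for code that reads the size afterwards.
    """
    bio.size = 0


def total_read_after_submit(fragment_sizes):
    """Buggy accounting: read the size only after submission."""
    total = 0
    for size in fragment_sizes:
        bio = ToyBio(size)
        submit(bio)
        total += bio.size  # too late: completion may have changed it
    return total


def total_recorded_before_submit(fragment_sizes):
    """Fixed accounting: add the size to a running total before submission."""
    dio_size = 0  # plays the role of dio->size
    for size in fragment_sizes:
        bio = ToyBio(size)
        dio_size += bio.size  # recorded while the bio is still ours
        submit(bio)
    return dio_size


fragments = [1048576, 98304]
print(total_read_after_submit(fragments))       # 0
print(total_recorded_before_submit(fragments))  # 1146880
```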
     

31 Jul, 2019

1 commit

  • Commit 33ec3e53e7b1 ("loop: Don't change loop device under exclusive
    opener") made LOOP_SET_FD ioctl acquire exclusive block device reference
    while it updates loop device binding. However this can make perfectly
    valid mount(2) fail with EBUSY due to racing LOOP_SET_FD holding
    temporarily the exclusive bdev reference in cases like this:

    for i in {a..z}{a..z}; do
        dd if=/dev/zero of=$i.image bs=1k count=0 seek=1024
        mkfs.ext2 $i.image
        mkdir mnt$i
    done

    echo "Run"
    for i in {a..z}{a..z}; do
        mount -o loop -t ext2 $i.image mnt$i &
    done

    Fix the problem by not getting full exclusive bdev reference in
    LOOP_SET_FD but instead just mark the bdev as being claimed while we
    update the binding information. This just blocks new exclusive openers
    instead of failing them with EBUSY thus fixing the problem.

    Fixes: 33ec3e53e7b1 ("loop: Don't change loop device under exclusive opener")
    Cc: stable@vger.kernel.org
    Tested-by: Kai-Heng Feng
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

27 Jul, 2019

1 commit

  • Pull block fixes from Jens Axboe:

    - Several io_uring fixes/improvements:
        - Blocking fix for O_DIRECT (me)
        - Latter page slowness for registered buffers (me)
        - Fix poll hang under certain conditions (me)
        - Defer sequence check fix for wrapped rings (Zhengyuan)
        - Mismatch in async inc/dec accounting (Zhengyuan)
        - Memory ordering issue that could cause stall (Zhengyuan)
        - Track sequential defer in bytes, not pages (Zhengyuan)

    - NVMe pull request from Christoph

    - Set of hang fixes for wbt (Josef)

    - Redundant error message kill for libahci (Ding)

    - Remove unused blk_mq_sched_started_request() and related ops (Marcos)

    - drbd dynamic alloc shash descriptor to reduce stack use (Arnd)

    - blkcg ->pd_stat() non-debug print (Tejun)

    - bcache memory leak fix (Wei)

    - Comment fix (Akinobu)

    - BFQ perf regression fix (Paolo)

    * tag 'for-linus-20190726' of git://git.kernel.dk/linux-block: (24 commits)
    io_uring: ensure ->list is initialized for poll commands
    Revert "nvme-pci: don't create a read hctx mapping without read queues"
    nvme: fix multipath crash when ANA is deactivated
    nvme: fix memory leak caused by incorrect subsystem free
    nvme: ignore subnqn for ADATA SX6000LNP
    drbd: dynamically allocate shash descriptor
    block: blk-mq: Remove blk_mq_sched_started_request and started_request
    bcache: fix possible memory leak in bch_cached_dev_run()
    io_uring: track io length in async_list based on bytes
    io_uring: don't use iov_iter_advance() for fixed buffers
    block: properly handle IOCB_NOWAIT for async O_DIRECT IO
    blk-mq: allow REQ_NOWAIT to return an error inline
    io_uring: add a memory barrier before atomic_read
    rq-qos: use a mb for got_token
    rq-qos: set ourself TASK_UNINTERRUPTIBLE after we schedule
    rq-qos: don't reset has_sleepers on spurious wakeups
    rq-qos: fix missed wake-ups in rq_qos_throttle
    wait: add wq_has_single_sleeper helper
    block, bfq: check also in-flight I/O in dispatch plugging
    block: fix sysfs module parameters directory path in comment
    ...

    Linus Torvalds
     

22 Jul, 2019

1 commit

  • A caller is supposed to pass in REQ_NOWAIT if we can't block for any
    given operation, but O_DIRECT for block devices just ignores this. Hence
    we'll block for various resource shortages on the block layer side,
    like having to wait for requests.

    Use the new REQ_NOWAIT_INLINE to ask for this error to be returned
    inline, so we can handle it appropriately and return -EAGAIN to the
    caller.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

20 Jul, 2019

1 commit

  • Pull vfs mount updates from Al Viro:
    "The first part of mount updates.

    Convert filesystems to use the new mount API"

    * 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits)
    mnt_init(): call shmem_init() unconditionally
    constify ksys_mount() string arguments
    don't bother with registering rootfs
    init_rootfs(): don't bother with init_ramfs_fs()
    vfs: Convert smackfs to use the new mount API
    vfs: Convert selinuxfs to use the new mount API
    vfs: Convert securityfs to use the new mount API
    vfs: Convert apparmorfs to use the new mount API
    vfs: Convert openpromfs to use the new mount API
    vfs: Convert xenfs to use the new mount API
    vfs: Convert gadgetfs to use the new mount API
    vfs: Convert oprofilefs to use the new mount API
    vfs: Convert ibmasmfs to use the new mount API
    vfs: Convert qib_fs/ipathfs to use the new mount API
    vfs: Convert efivarfs to use the new mount API
    vfs: Convert configfs to use the new mount API
    vfs: Convert binfmt_misc to use the new mount API
    convenience helper: get_tree_single()
    convenience helper get_tree_nodev()
    vfs: Kill sget_userns()
    ...

    Linus Torvalds
     

29 Jun, 2019

2 commits


27 May, 2019

1 commit

  • When hidden gendisk is revalidated, there's no point in revalidating
    associated block device as there's none. We would thus just create new
    bdev inode, report "detected capacity change from 0 to XXX" message and
    evict the bdev inode again. Avoid this pointless dance and confusing
    message in the kernel log.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

26 May, 2019

2 commits

  • Convert the bdev filesystem to the new internal mount API as the old
    one will be obsoleted and removed. This allows greater flexibility in
    communication of mount parameters between userspace, the VFS and the
    filesystem.

    See Documentation/filesystems/mount_api.txt for more information.

    Signed-off-by: David Howells
    cc: Jens Axboe
    cc: linux-fsdevel@vger.kernel.org
    Signed-off-by: Al Viro

    David Howells
     
  • Once upon a time we used to set ->d_name of e.g. pipefs root
    so that d_path() on pipes would work. These days it's
    completely pointless - dentries of pipes are not even connected
    to pipefs root. However, mount_pseudo() had set the root
    dentry name (passed as the second argument) and callers
    kept inventing names to pass to it. Including those that
    didn't *have* any non-root dentries to start with...

    All of that had been pointless for about 8 years now; it's
    time to get rid of that cargo-culting...

    Signed-off-by: Al Viro

    Al Viro
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit


08 May, 2019

3 commits

  • Pull misc vfs updates from Al Viro:
    "Assorted stuff, with no common topic whatsoever..."

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    libfs: document simple_get_link()
    Documentation/filesystems/Locking: fix ->get_link() prototype
    Documentation/filesystems/vfs.txt: document how ->i_link works
    Documentation/filesystems/vfs.txt: remove bogus "Last updated" date
    fs: use timespec64 in relatime_need_update
    fs/block_dev.c: remove unused include

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:
    "Nothing major in this series, just fixes and improvements all over the
    map. This contains:

    - Series of fixes for sed-opal (David, Jonas)

    - Fixes and performance tweaks for BFQ (via Paolo)

    - Set of fixes for bcache (via Coly)

    - Set of fixes for md (via Song)

    - Enabling multi-page for passthrough requests (Ming)

    - Queue release fix series (Ming)

    - Device notification improvements (Martin)

    - Propagate underlying device rotational status in loop (Holger)

    - Removal of mtip32xx trim support, which has been disabled for years
    (Christoph)

    - Improvement and cleanup of nvme command handling (Christoph)

    - Add block SPDX tags (Christoph)

    - Cleanup/hardening of bio/bvec iteration (Christoph)

    - A few NVMe pull requests (Christoph)

    - Removal of CONFIG_LBDAF (Christoph)

    - Various little fixes here and there"

    * tag 'for-5.2/block-20190507' of git://git.kernel.dk/linux-block: (164 commits)
    block: fix mismerge in bvec_advance
    block: don't drain in-progress dispatch in blk_cleanup_queue()
    blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release
    blk-mq: always free hctx after request queue is freed
    blk-mq: split blk_mq_alloc_and_init_hctx into two parts
    blk-mq: free hw queue's resource in hctx's release handler
    blk-mq: move cancel of requeue_work into blk_mq_release
    blk-mq: grab .q_usage_counter when queuing request from plug code path
    block: fix function name in comment
    nvmet: protect discovery change log event list iteration
    nvme: mark nvme_core_init and nvme_core_exit static
    nvme: move command size checks to the core
    nvme-fabrics: check more command sizes
    nvme-pci: check more command sizes
    nvme-pci: remove an unneeded variable initialization
    nvme-pci: unquiesce admin queue on shutdown
    nvme-pci: shutdown on timeout during deletion
    nvme-pci: fix psdt field for single segment sgls
    nvme-multipath: don't print ANA group state by default
    nvme-multipath: split bios with the ns_head bio_set before submitting
    ...

    Linus Torvalds
     
  • Pull vfs inode freeing updates from Al Viro:
    "Introduction of separate method for RCU-delayed part of
    ->destroy_inode() (if any).

    Pretty much as posted, except that destroy_inode() stashes
    ->free_inode into the victim (anon-unioned with ->i_fops) before
    scheduling i_callback() and the last two patches (sockfs conversion
    and folding struct socket_wq into struct socket) are excluded - that
    pair should go through netdev once davem reopens his tree"

    * 'work.icache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (58 commits)
    orangefs: make use of ->free_inode()
    shmem: make use of ->free_inode()
    hugetlb: make use of ->free_inode()
    overlayfs: make use of ->free_inode()
    jfs: switch to ->free_inode()
    fuse: switch to ->free_inode()
    ext4: make use of ->free_inode()
    ecryptfs: make use of ->free_inode()
    ceph: use ->free_inode()
    btrfs: use ->free_inode()
    afs: switch to use of ->free_inode()
    dax: make use of ->free_inode()
    ntfs: switch to ->free_inode()
    securityfs: switch to ->free_inode()
    apparmor: switch to ->free_inode()
    rpcpipe: switch to ->free_inode()
    bpf: switch to ->free_inode()
    mqueue: switch to ->free_inode()
    ufs: switch to ->free_inode()
    coda: switch to ->free_inode()
    ...

    Linus Torvalds
     

02 May, 2019

1 commit


01 May, 2019

1 commit

  • Commit 399254aaf489211 ("block: add BIO_NO_PAGE_REF flag") introduces
    BIO_NO_PAGE_REF, and once this flag is set for one bio, all pages
    in the bio won't be get/put during IO.

    However, if one bio is submitted via __blkdev_direct_IO_simple(),
    even though BIO_NO_PAGE_REF is set, pages still may be put.

    Fix this issue by not putting pages if BIO_NO_PAGE_REF is
    set.

    Fixes: 399254aaf489211 ("block: add BIO_NO_PAGE_REF flag")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

30 Apr, 2019

1 commit


12 Apr, 2019

1 commit

    If the last bio returned is not dio->bio, the status of the bio will
    not be assigned to dio->bio if it is an error. This makes the status of
    the whole IO wrong.

    ksoftirqd/21-117 [021] ..s. 4017.966090: 8,0 C N 4883648 [0]
    <idle>-0 [018] ..s. 4017.970888: 8,0 C WS 4924800 + 1024 [0]
    <idle>-0 [018] ..s. 4017.970909: 8,0 D WS 4935424 + 1024 [<idle>]
    <idle>-0 [018] ..s. 4017.970924: 8,0 D WS 4936448 + 321 [<idle>]
    ksoftirqd/21-117 [021] ..s. 4017.995033: 8,0 C R 4883648 + 336 [65475]
    ksoftirqd/21-117 [021] d.s. 4018.001988: myprobe1: (blkdev_bio_end_io+0x0/0x168) bi_status=7
    ksoftirqd/21-117 [021] d.s. 4018.001992: myprobe: (aio_complete_rw+0x0/0x148) x0=0xffff802f2595ad80 res=0x12a000 res2=0x0

    We always have to assign bio->bi_status to dio->bio.bi_status because we
    will only check dio->bio.bi_status when we return the whole IO to
    the upper layer.

    Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
    Cc: stable@vger.kernel.org
    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Reviewed-by: Ming Lei
    Signed-off-by: Jason Yan
    Signed-off-by: Jens Axboe

    Jason Yan
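
    The lost-error case can be reproduced in a toy Python model. It is a
    sketch only: ToyDio, complete_fragment and final_status are hypothetical
    names, the always_record switch selects between the old and the fixed
    behaviour, and the model assumes that the last fragment to complete is
    not the dio's own embedded bio.

```python
class ToyDio:
    """Stand-in for a dio that was split into several bio fragments."""

    def __init__(self, fragments):
        self.pending = fragments  # fragments still in flight
        self.status = 0           # plays the role of dio->bio.bi_status


def complete_fragment(dio, status, always_record):
    """Complete one fragment; return True when the whole I/O is finished."""
    dio.pending -= 1
    is_last = dio.pending == 0
    # Old behaviour (always_record=False): an error carried by the last
    # fragment to complete is never copied to the dio.
    if status != 0 and dio.status == 0 and (always_record or not is_last):
        dio.status = status
    return is_last


def final_status(statuses, always_record):
    """Complete fragments in order and return the status reported upward."""
    dio = ToyDio(len(statuses))
    for status in statuses:
        complete_fragment(dio, status, always_record)
    return dio.status


print(final_status([0, 7], always_record=False))  # 0: error lost
print(final_status([0, 7], always_record=True))   # 7: error reported
print(final_status([7, 0], always_record=False))  # 7: earlier errors were kept
```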
     

10 Apr, 2019

1 commit


19 Mar, 2019

1 commit

  • If bio_iov_iter_get_pages() is called on an iov_iter that is flagged
    with NO_REF, then we don't need to add a page reference for the pages
    that we add.

    Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows
    not to drop a reference to these pages.

    Signed-off-by: Jens Axboe

    Jens Axboe
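
    The invariant is that completion may only drop references that submission
    actually took. A small Python sketch of that bookkeeping follows; ToyPage,
    ToyBio, add_pages and complete are hypothetical, and honor_flag exists
    only to show what goes wrong when completion ignores the flag.

```python
class ToyPage:
    """Stand-in for a page; refs starts at 1 for the buffer's owner."""

    def __init__(self):
        self.refs = 1


class ToyBio:
    """Stand-in for a bio carrying pages and a no-page-ref flag."""

    def __init__(self):
        self.pages = []
        self.no_page_ref = False  # plays the role of BIO_NO_PAGE_REF


def add_pages(bio, pages, no_ref):
    """Attach pages, taking a reference on each unless no_ref is set."""
    bio.no_page_ref = no_ref
    for page in pages:
        if not no_ref:
            page.refs += 1
        bio.pages.append(page)


def complete(bio, honor_flag=True):
    """Finish the I/O, dropping only the references that were taken."""
    if honor_flag and bio.no_page_ref:
        return
    for page in bio.pages:
        page.refs -= 1


def refs_after_io(no_ref, honor_flag=True):
    """Run one submit/complete cycle and return the page's final refcount."""
    page = ToyPage()
    bio = ToyBio()
    add_pages(bio, [page], no_ref)
    complete(bio, honor_flag)
    return page.refs


print(refs_after_io(no_ref=False))                   # 1
print(refs_after_io(no_ref=True))                    # 1
print(refs_after_io(no_ref=True, honor_flag=False))  # 0: never-taken ref dropped
```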
     

24 Feb, 2019

2 commits

  • For the upcoming async polled IO, we can't sleep allocating requests.
    If we do, then we introduce a deadlock where the submitter already
    has async polled IO in-flight, but can't wait for them to complete
    since polled requests must be actively found and reaped.

    Utilize the helper in the blockdev DIRECT_IO code.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Just call blk_poll on the iocb cookie, we can derive the block device
    from the inode trivially.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

15 Feb, 2019

1 commit

    This patch introduces one extra iterator variable to bio_for_each_segment_all(),
    so that bio_for_each_segment_all() can iterate over multi-page bvecs.

    Since this is just one mechanical and simple change to all bio_for_each_segment_all()
    users, this patch makes the tree-wide change in a single patch, which
    avoids the need for a temporary helper for this conversion.

    Reviewed-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

15 Jan, 2019

1 commit

    bd_set_size() also updates the block device's block size. This is somewhat
    unexpected from its name and at this point, only blkdev_open() uses this
    functionality. Furthermore, this can result in changing block size under
    a filesystem mounted on a loop device which leads to livelocks inside
    __getblk_gfp() like:

    Sending NMI from CPU 0 to CPUs 1:
    NMI backtrace for cpu 1
    CPU: 1 PID: 10863 Comm: syz-executor0 Not tainted 4.18.0-rc5+ #151
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google
    01/01/2011
    RIP: 0010:__sanitizer_cov_trace_pc+0x3f/0x50 kernel/kcov.c:106
    ...
    Call Trace:
    init_page_buffers+0x3e2/0x530 fs/buffer.c:904
    grow_dev_page fs/buffer.c:947 [inline]
    grow_buffers fs/buffer.c:1009 [inline]
    __getblk_slow fs/buffer.c:1036 [inline]
    __getblk_gfp+0x906/0xb10 fs/buffer.c:1313
    __bread_gfp+0x2d/0x310 fs/buffer.c:1347
    sb_bread include/linux/buffer_head.h:307 [inline]
    fat12_ent_bread+0x14e/0x3d0 fs/fat/fatent.c:75
    fat_ent_read_block fs/fat/fatent.c:441 [inline]
    fat_alloc_clusters+0x8ce/0x16e0 fs/fat/fatent.c:489
    fat_add_cluster+0x7a/0x150 fs/fat/inode.c:101
    __fat_get_block fs/fat/inode.c:148 [inline]
    ...

    Trivial reproducer for the problem looks like:

    truncate -s 1G /tmp/image
    losetup /dev/loop0 /tmp/image
    mkfs.ext4 -b 1024 /dev/loop0
    mount -t ext4 /dev/loop0 /mnt
    losetup -c /dev/loop0
    l /mnt

    Fix the problem by moving initialization of a block device block size
    into a separate function and call it when needed.

    Thanks to Tetsuo Handa for help with
    debugging the problem.

    Reported-by: syzbot+9933e4476f365f5d5a1b@syzkaller.appspotmail.com
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
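
    The hazard can be sketched in Python: if every size update also
    re-derives the block size, a capacity re-read silently undoes the block
    size that a mounted filesystem selected. Everything here is an
    assumption made for illustration: the names are hypothetical, pages are
    taken to be 4096 bytes, and initial_block_size is only a guess at the
    kind of heuristic involved (grow from the logical block size while the
    device size stays a multiple of the next power of two).

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def initial_block_size(size, logical_block_size=512):
    """Assumed heuristic: largest power-of-two block size dividing 'size'."""
    bsize = logical_block_size
    while bsize < PAGE_SIZE:
        if size & bsize:
            break
        bsize <<= 1
    return bsize


class ToyBdev:
    """Stand-in for a block device with a size and a block size."""

    def __init__(self, size):
        self.size = size
        self.block_size = initial_block_size(size)


def resize_old(bdev, new_size):
    """Old behaviour: a size update also resets the block size."""
    bdev.size = new_size
    bdev.block_size = initial_block_size(new_size)


def resize_new(bdev, new_size):
    """New behaviour: a size update leaves the block size alone."""
    bdev.size = new_size


GIB = 1 << 30
for resize in (resize_old, resize_new):
    bdev = ToyBdev(GIB)
    bdev.block_size = 1024        # a mounted filesystem picked 1 KiB blocks
    resize(bdev, GIB)             # e.g. a capacity re-read on a loop device
    print(resize.__name__, bdev.block_size)
# resize_old 4096
# resize_new 1024
```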
     

03 Jan, 2019

1 commit

  • This mostly reverts commit 849a370016a5 ("block: avoid ordered task
    state change for polled IO"). It was wrongly claiming that the ordering
    wasn't necessary. The memory barrier _is_ necessary.

    If something is truly polling and not going to sleep, it's the whole
    state setting that is unnecessary, not the memory barrier. Whenever you
    set your state to a sleeping state, you absolutely need the memory
    barrier.

    Note that sometimes the memory barrier can be elsewhere. For example,
    the ordering might be provided by an external lock, or by setting the
    process state to sleeping before adding yourself to the wait queue list
    that is used for waking up (where the wait queue lock itself will
    guarantee that any wakeup will correctly see the sleeping state).

    But none of those cases were true here.

    NOTE! Some of the polling paths may indeed be able to drop the state
    setting entirely, at which point the memory barrier also goes away.

    (Also note that this doesn't revert the TASK_RUNNING cases: there is no
    race between a wakeup and setting the process state to TASK_RUNNING,
    since the end result doesn't depend on ordering).

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

29 Dec, 2018

2 commits

  • Merge misc updates from Andrew Morton:

    - large KASAN update to use arm's "software tag-based mode"

    - a few misc things

    - sh updates

    - ocfs2 updates

    - just about all of MM

    * emailed patches from Andrew Morton: (167 commits)
    kernel/fork.c: mark 'stack_vm_area' with __maybe_unused
    memcg, oom: notify on oom killer invocation from the charge path
    mm, swap: fix swapoff with KSM pages
    include/linux/gfp.h: fix typo
    mm/hmm: fix memremap.h, move dev_page_fault_t callback to hmm
    hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
    hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
    memory_hotplug: add missing newlines to debugging output
    mm: remove __hugepage_set_anon_rmap()
    include/linux/vmstat.h: remove unused page state adjustment macro
    mm/page_alloc.c: allow error injection
    mm: migrate: drop unused argument of migrate_page_move_mapping()
    blkdev: avoid migration stalls for blkdev pages
    mm: migrate: provide buffer_migrate_page_norefs()
    mm: migrate: move migrate_page_lock_buffers()
    mm: migrate: lock buffers before migrate_page_move_mapping()
    mm: migration: factor out code to compute expected number of page references
    mm, page_alloc: enable pcpu_drain with zone capability
    kmemleak: add config to select auto scan
    mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init
    ...

    Linus Torvalds
     
  • Currently, block device pages don't provide a ->migratepage callback and
    thus fallback_migrate_page() is used for them. This handler cannot deal
    with dirty pages in async mode, nor with the case where a buffer head is
    in the LRU buffer head cache (as it has elevated b_count). Thus such a
    page can block memory offlining.

    Fix the problem by using buffer_migrate_page_norefs() for migrating block
    device pages. That function takes care of dropping bh LRU in case
    migration would fail due to elevated buffer refcount to avoid stalls and
    can also migrate dirty pages without writing them.

    Link: http://lkml.kernel.org/r/20181211172143.7358-6-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

30 Nov, 2018

1 commit

  • The bio referencing has a trick that doesn't do any actual atomic
    inc/dec on the reference count until we have to elevate it to > 1. For the
    async IO O_DIRECT case, we can't use the simple DIO variants, so we use
    __blkdev_direct_IO(). It always grabs an extra reference to the bio
    after allocation, which means we then enter the slower path of actually
    having to do atomic_inc/dec on the count.

    We don't need to do that for the async case, unless we end up going
    multi-bio, in which case we're already doing huge amounts of IO. For the
    smaller IO case (< BIO_MAX_PAGES), we can do without the extra ref.

    Based on an earlier patch (and commit log) from Jens Axboe.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
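
    The reference-count trick can be modelled in a few lines of Python. The
    sketch is loosely patterned on a get/put pair and is not the kernel
    implementation: ToyBio and its fields are hypothetical, and atomic_ops is
    an invented counter that only tallies how many atomic operations a real
    implementation would have needed.

```python
class ToyBio:
    """Stand-in for a bio with a lazily activated reference count."""

    def __init__(self):
        self.count = 1        # implicit reference held by the allocator
        self.reffed = False   # set once anyone takes an extra reference
        self.freed = False
        self.atomic_ops = 0   # bookkeeping for this illustration only


def bio_get(bio):
    """Take an extra reference; from now on the count must be maintained."""
    bio.reffed = True
    bio.count += 1
    bio.atomic_ops += 1


def bio_put(bio):
    """Drop a reference, skipping the atomic path for single-owner bios."""
    if not bio.reffed:
        bio.freed = True      # fast path: nobody else ever held a reference
        return
    bio.count -= 1
    bio.atomic_ops += 1
    if bio.count == 0:
        bio.freed = True


single = ToyBio()
bio_put(single)
print(single.freed, single.atomic_ops)   # True 0

shared = ToyBio()
bio_get(shared)
bio_put(shared)
print(shared.freed)                      # False
bio_put(shared)
print(shared.freed, shared.atomic_ops)   # True 3
```

    Skipping the extra bio_get() for single-bio async I/O keeps such requests
    on the fast path, which is the saving the commit describes.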
     

26 Nov, 2018

1 commit

  • blk_poll() has always kept spinning until it found an IO. This is
    fine for SYNC polling, since we need to find one request we have
    pending, but in preparation for ASYNC polling it can be beneficial
    to just check if we have any entries available or not.

    Existing callers are converted to pass in 'spin == true', to retain
    the old behavior.

    Signed-off-by: Jens Axboe

    Jens Axboe
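
    The difference between the two polling modes can be sketched in Python.
    toy_poll, completes_on and max_iterations are hypothetical; in particular
    max_iterations merely stands in for whatever condition ends a real
    spinning poll, and nothing here mirrors the actual blk_poll() signature.

```python
def toy_poll(check_completions, spin, max_iterations=1000):
    """Look for completed requests.

    check_completions() returns how many completions one pass found.
    With spin=True, keep looking until something completes; with
    spin=False, look once and report whatever was found.
    """
    for _ in range(max_iterations):
        found = check_completions()
        if found > 0:
            return found
        if not spin:
            return 0
    return 0


def completes_on(nth):
    """Build a completion source whose request finishes on the nth check."""
    calls = {"n": 0}

    def check():
        calls["n"] += 1
        return 1 if calls["n"] >= nth else 0

    return check, calls


check, calls = completes_on(3)
print(toy_poll(check, spin=True), calls["n"])    # 1 3

check, calls = completes_on(3)
print(toy_poll(check, spin=False), calls["n"])   # 0 1
```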
     

19 Nov, 2018

1 commit

    For the core poll helper, the task state setting doesn't need to imply any
    atomics, as it's the current task itself that is being modified and
    we're not going to sleep.

    For IRQ driven, the wakeup path has the necessary barriers, so we don't
    need to use the heavy handed version of the task state setting.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

16 Nov, 2018

2 commits