17 Apr, 2019

1 commit

  • commit a3761c3c91209b58b6f33bf69dd8bb8ec0c9d925 upstream.

    When bio_add_pc_page() fails in bio_copy_user_iov() we should free
    the page we just allocated; otherwise we are leaking it.

    Cc: linux-block@vger.kernel.org
    Cc: Linus Torvalds
    Cc: stable@vger.kernel.org
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jérôme Glisse
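
    A minimal sketch of the pattern behind the fix, assuming the usual
    alloc_page()/bio_add_pc_page() interfaces; this is illustrative only,
    not the literal bio_copy_user_iov() hunk:

        page = alloc_page(GFP_KERNEL);
        if (!page)
            goto cleanup;

        if (bio_add_pc_page(q, bio, page, bytes, offset) < bytes) {
            /* the page was never attached to the bio, so it must be
             * released here or it is leaked */
            __free_page(page);
            break;
        }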
     

20 Dec, 2018

1 commit

  • commit f55adad601c6a97c8c9628195453e0fb23b4a0ae upstream.

    We don't need to zero-fill the bio if we are not using
    kernel-allocated pages.

    Fixes: f3587d76da05 ("block: Clear kernel memory before copying to user") # v4.20-rc2
    Reported-by: Todd Aiken
    Cc: Laurence Oberman
    Cc: stable@vger.kernel.org
    Cc: Bart Van Assche
    Tested-by: Laurence Oberman
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Keith Busch
     

01 Dec, 2018

1 commit

  • [ Upstream commit ca474b73896bf6e0c1eb8787eb217b0f80221610 ]

    We need to copy the io priority, too; otherwise the clone will run
    with a different priority than the original one.

    Fixes: 43b62ce3ff0a ("block: move bio io prio to a new field")
    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jean Delvare

    Fixed up subject, and ordered stores.

    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Hannes Reinecke
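
    The change is essentially one extra assignment in the fast-clone path;
    a hedged sketch of the relevant lines (field layout approximate, the
    bi_ioprio field is the one introduced by 43b62ce3ff0a):

        /* alongside the other fields copied from the source bio */
        bio->bi_opf    = bio_src->bi_opf;
        bio->bi_ioprio = bio_src->bi_ioprio;   /* previously not copied */
        bio->bi_iter   = bio_src->bi_iter;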
     

27 Nov, 2018

1 commit

  • [ Upstream commit f3587d76da05f68098ddb1cb3c98cc6a9e8a402c ]

    If the kernel allocates a bounce buffer for user read data, this memory
    needs to be cleared before copying it to the user, otherwise it may leak
    kernel memory to user space.

    Laurence Oberman
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Keith Busch
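
    A self-contained user-space illustration of the failure mode (plain C,
    not kernel code): if the "device" fills only part of a non-zeroed
    bounce buffer, copying the whole buffer back to the caller exposes
    whatever was in that memory before.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        /* Pretend device that only fills half of what was asked for. */
        static size_t device_read(char *buf, size_t len)
        {
            memset(buf, 'D', len / 2);
            return len / 2;
        }

        int main(void)
        {
            size_t len = 64;
            char *bounce = malloc(len);
            if (!bounce)
                return 1;
            memset(bounce, 'S', len);   /* stands in for stale kernel data */

            /* The fix corresponds to clearing the buffer here, before the
             * partial read, so the unread tail copies out as zeroes. */
            device_read(bounce, len);

            printf("last byte handed to the caller: %c\n", bounce[len - 1]);
            free(bounce);
            return 0;
        }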
     

22 Sep, 2018

1 commit

  • Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
    updating properly on 4.18. This is because we started using ktime to
    track elapsed time, and we convert nanoseconds to jiffies when we update
    the partition counter. However, this gets rounded down, so any I/Os that
    take less than a jiffy are not accounted for. Previously in this case,
    the value of jiffies would sometimes increment while we were doing I/O,
    so at least some I/Os were accounted for.

    Let's convert the stats to use nanoseconds internally. We still report
    milliseconds as before, now more accurately than ever. The value is
    still truncated to 32 bits for backwards compatibility.

    Fixes: 522a777566f5 ("block: consolidate struct request timestamp fields")
    Cc: stable@vger.kernel.org
    Reported-by: Klaus Kusche
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
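
    The rounding problem is easy to reproduce outside the kernel. A small
    stand-alone C demo (HZ and the 1.5 ms I/O time are made-up example
    values): converting each sub-jiffy duration to jiffies before
    accumulating loses it entirely, while accumulating nanoseconds and
    converting to milliseconds only at reporting time does not.

        #include <stdio.h>
        #include <stdint.h>

        #define HZ            250            /* 1 jiffy = 4 ms */
        #define NSEC_PER_SEC  1000000000ULL
        #define NSEC_PER_MSEC 1000000ULL

        int main(void)
        {
            uint64_t io_ns = 1500000;        /* each I/O takes 1.5 ms */
            uint64_t jiffies_total = 0, ns_total = 0;

            for (int i = 0; i < 1000; i++) {
                jiffies_total += io_ns * HZ / NSEC_PER_SEC;  /* rounds to 0 */
                ns_total += io_ns;
            }

            printf("jiffies-based busy time: %llu ms\n",
                   (unsigned long long)(jiffies_total * (1000 / HZ)));
            printf("ns-based busy time:      %llu ms\n",
                   (unsigned long long)(ns_total / NSEC_PER_MSEC));
            return 0;
        }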
     

01 Sep, 2018

1 commit

  • There is a very small chance a bio gets caught up in a really
    unfortunate race between a task migration, a cgroup exiting, and itself
    trying to associate with a blkg. This is because css offlining is
    performed after the css->refcnt is killed, which triggers removal of
    blkgs that reach a blkg->refcnt of 0.

    To avoid this, association with a blkg should use tryget and fallback to
    using the root_blkg.

    Fixes: 08e18eab0c579 ("block: add bi_blkg to the bio for cgroups")
    Reviewed-by: Josef Bacik
    Signed-off-by: Dennis Zhou
    Cc: Jiufei Xue
    Cc: Joseph Qi
    Cc: Tejun Heo
    Cc: Josef Bacik
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
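
    A hedged sketch of the association pattern described above — only keep
    the looked-up blkg if a reference can still be taken, otherwise fall
    back to the root blkg; the names here are approximate, not the exact
    upstream diff:

        blkg = blkg_lookup(blkcg, q);
        if (!blkg || !blkg_try_get(blkg))
            blkg = q->root_blkg;    /* css/blkg is already being torn down */
        bio->bi_blkg = blkg;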
     

15 Aug, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     

09 Aug, 2018

1 commit

  • In commit ed996a52c868 ("block: simplify and cleanup bvec pool
    handling"), the value of the slab index is incremented by one in
    bvec_alloc() after the allocation is done to indicate an index value of
    0 does not need to be later freed.

    bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
    value. Decrement idx before performing the lookup.

    Fixes: ed996a52c868 ("block: simplify and cleanup bvec pool handling")
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe

    Greg Edwards
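
    Given the +1 bias stored by bvec_alloc(), the fix is essentially to
    undo that bias before indexing the slab table; roughly (not necessarily
    the exact upstream hunk):

        unsigned int bvec_nr_vecs(unsigned short idx)
        {
            /* idx was stored biased by one; step back before the lookup */
            return bvec_slabs[--idx].nr_vecs;
        }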
     

28 Jul, 2018

1 commit

  • Pull block fixes from Jens Axboe:
    "Bigger than usual at this time, mostly due to the O_DIRECT corruption
    issue and the fact that I was on vacation last week. This contains:

    - NVMe pull request with two fixes for the FC code, and two target
    fixes (Christoph)

    - a DIF bio reset iteration fix (Greg Edwards)

    - two nbd reply and requeue fixes (Josef)

    - SCSI timeout fixup (Keith)

    - a small series that fixes an issue with bio_iov_iter_get_pages(),
    which ended up causing corruption for larger sized O_DIRECT writes
    that ended up racing with buffered writes (Martin Wilck)"

    * tag 'for-linus-20180727' of git://git.kernel.dk/linux-block:
    block: reset bi_iter.bi_done after splitting bio
    block: bio_iov_iter_get_pages: pin more pages for multi-segment IOs
    blkdev: __blkdev_direct_IO_simple: fix leak in error case
    block: bio_iov_iter_get_pages: fix size of last iovec
    nvmet: only check for filebacking on -ENOTBLK
    nvmet: fixup crash on NULL device path
    scsi: set timed out out mq requests to complete
    blk-mq: export setting request completion state
    nvme: if_ready checks to fail io to deleting controller
    nvmet-fc: fix target sgl list on large transfers
    nbd: handle unexpected replies better
    nbd: don't requeue the same request twice.

    Linus Torvalds
     

27 Jul, 2018

3 commits

  • After the bio has been updated to represent the remaining sectors, reset
    bi_done so bio_rewind_iter() does not rewind further than it should.

    This resolves a bio_integrity_process() failure on reads where the
    original request was split.

    Fixes: 63573e359d05 ("bio-integrity: Restore original iterator on verify stage")
    Signed-off-by: Greg Edwards
    Signed-off-by: Jens Axboe

    Greg Edwards
     
  • bio_iov_iter_get_pages() currently only adds pages for the next non-zero
    segment from the iov_iter to the bio. That's suboptimal for callers,
    which typically try to pin as many pages as fit into the bio. This patch
    converts the current bio_iov_iter_get_pages() into a static helper, and
    introduces a new helper that allocates as many pages as

    1) fit into the bio,
    2) are present in the iov_iter,
    3) and can be pinned by MM.

    Error is returned only if zero pages could be pinned. Because of 3), a
    zero return value doesn't necessarily mean all pages have been pinned.
    Callers that have to pin every page in the iov_iter must still call this
    function in a loop (this is currently the case).

    This change matters most for __blkdev_direct_IO_simple(), which calls
    bio_iov_iter_get_pages() only once. If it obtains fewer pages than
    requested, it returns a "short write" or "short read", and
    __generic_file_write_iter() falls back to buffered writes, which may
    lead to data corruption.

    Fixes: 72ecad22d9f1 ("block: support a full bio worth of IO for simplified bdev direct-io")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin Wilck
    Signed-off-by: Jens Axboe

    Martin Wilck
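
    The reworked entry point is essentially a loop around the old
    single-segment code, now the static helper the description calls
    __bio_iov_iter_get_pages(); a hedged sketch of its shape (details
    approximate):

        int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
        {
            unsigned short orig_vcnt = bio->bi_vcnt;

            do {
                int ret = __bio_iov_iter_get_pages(bio, iter);

                if (unlikely(ret))
                    /* succeed if at least some pages were pinned */
                    return bio->bi_vcnt > orig_vcnt ? 0 : ret;
            } while (iov_iter_count(iter) && !bio_full(bio));

            return 0;
        }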
     
  • If the last page of the bio is not "full", the length of the last
    vector slot needs to be corrected. This slot has the index
    (bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
    array, which is shifted by the value of bio->bi_vcnt at function
    invocation, the correct index is (nr_pages - 1).

    v2: improved readability following suggestions from Ming Lei.
    v3: followed a formatting suggestion from Christoph Hellwig.

    Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Martin Wilck
    Signed-off-by: Jens Axboe

    Martin Wilck
     

25 Jul, 2018

3 commits

  • Now only used by the bounce code, so move it there and mark the function
    static.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • So don't bother handling it.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • bio_check_pages_dirty currently violates the invariant that bv_page of
    a bio_vec inside bi_vcnt shouldn't be NULL, and that is going to become
    really annoying with multipage biovecs. Fortunately there isn't a
    particularly good reason for it: once we decide to defer freeing the
    bio to a workqueue, holding onto a few additional pages isn't really an
    issue anymore. So just check in a first pass whether there is a clean
    page that needs dirtying, and do a second pass to free the pages if
    there was none, while the cache is still hot.

    Also use the chance to micro-optimize bio_dirty_fn a bit by not saving
    irq state - we know we are called from a workqueue.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Jul, 2018

1 commit

  • Add and use a new op_stat_group() function for indexing partition stat
    fields rather than indexing them by rq_data_dir() or bio_data_dir().
    This function works similarly to op_is_sync() in that it takes the
    request::cmd_flags or bio::bi_opf flags and determines which stats
    should be updated.

    In addition, the second parameter to generic_start_io_acct() and
    generic_end_io_acct() is now a REQ_OP rather than simply a read or
    write bit and it uses op_stat_group() on the parameter to determine
    the stat group.

    Note that the partition in_flight counts are not part of the per-cpu
    statistics and as such are not indexed via this function; they are now
    indexed by op_is_write().

    tj: Refreshed on top of v4.17. Updated to pass around REQ_OP.

    Signed-off-by: Michael Callahan
    Signed-off-by: Tejun Heo
    Cc: Minchan Kim
    Cc: Dan Williams
    Cc: Joshua Morris
    Cc: Philipp Reisner
    Cc: Matias Bjorling
    Cc: Kent Overstreet
    Cc: Alasdair Kergon
    Signed-off-by: Jens Axboe

    Michael Callahan
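
    A minimal sketch of how accounting code would use the new helper; the
    part_stat_*() calls and STAT_* row returned are illustrative of the
    idea, not a quote of the upstream accounting functions:

        const int sgrp = op_stat_group(bio_op(bio));

        part_stat_inc(cpu, part, ios[sgrp]);
        part_stat_add(cpu, part, sectors[sgrp], bio_sectors(bio));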
     

09 Jul, 2018

3 commits

  • wbt cares only about request completion time, but controllers may need
    information that is on the bio itself. Add a done_bio callback to
    rq-qos so that things like blk-iolatency can see the bio when it
    completes.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • For backcharging we need to know who the page belongs to when swapping
    it out. We don't worry about things that do ->rw_page (zram etc) at the
    moment, we're only worried about pages that actually go to a block
    device.

    Signed-off-by: Tejun Heo
    Signed-off-by: Josef Bacik
    Acked-by: Johannes Weiner
    Acked-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently io.low uses a bi_cg_private to stash its private data for the
    blkg, however other blkcg policies may want to use this as well. Since
    we can get the private data out of the blkg, move this to bi_blkg in the
    bio and make it generic, then we can use bio_associate_blkg() to attach
    the blkg to the bio.

    Theoretically we could simply replace the bi_css with this since we can
    get to all the same information from the blkg, however you have to
    lookup the blkg, so for example wbc_init_bio() would have to lookup and
    possibly allocate the blkg for the css it was trying to attach to the
    bio. This could be problematic and result in us either not attaching
    the css at all to the bio, or falling back to the root blkcg if we are
    unable to allocate the corresponding blkg.

    So for now do this, and in the future if possible we could just replace
    the bi_css with bi_blkg and update the helpers to do the correct
    translation.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     

24 Jun, 2018

1 commit

  • Pull block fixes from Jens Axboe:

    - Further timeout fixes. We aren't quite there yet, so expect another
    round of fixes for that to completely close some of the IRQ vs
    completion races. (Christoph/Bart)

    - Set of NVMe fixes from the usual suspects, mostly error handling

    - Two off-by-one fixes (Dan)

    - Another bdi race fix (Jan)

    - Fix nbd reconfigure with NBD_DISCONNECT_ON_CLOSE (Doron)

    * tag 'for-linus-20180623' of git://git.kernel.dk/linux-block:
    blk-mq: Fix timeout handling in case the timeout handler returns BLK_EH_DONE
    bdi: Fix another oops in wb_workfn()
    lightnvm: Remove depends on HAS_DMA in case of platform dependency
    nvme-pci: limit max IO size and segments to avoid high order allocations
    nvme-pci: move nvme_kill_queues to nvme_remove_dead_ctrl
    nvme-fc: release io queues to allow fast fail
    nbd: Add the nbd NBD_DISCONNECT_ON_CLOSE config flag.
    block: sed-opal: Fix a couple off by one bugs
    blk-mq-debugfs: Off by one in blk_mq_rq_state_name()
    nvmet: reset keep alive timer in controller enable
    nvme-rdma: don't override opts->queue_size
    nvme-rdma: Fix command completion race at error recovery
    nvme-rdma: fix possible free of a non-allocated async event buffer
    nvme-rdma: fix possible double free condition when failing to create a controller
    Revert "block: Add warning for bi_next not NULL in bio_endio()"
    block: fix timeout changes for legacy request drivers

    Linus Torvalds
     

20 Jun, 2018

1 commit

  • Commit 0ba99ca4838b ("block: Add warning for bi_next not NULL in
    bio_endio()") breaks the dm driver. end_clone_bio() detects whether
    or not a bio is the last bio associated with a request by checking
    the .bi_next field. Commit 0ba99ca4838b clears that field before
    end_clone_bio() has had a chance to inspect that field. Hence revert
    commit 0ba99ca4838b.

    This patch avoids the following KASAN complaint when running the
    srp-test software (srp-test/run_tests -c -d -r 10 -t 02-mq):

    ==================================================================
    BUG: KASAN: use-after-free in bio_advance+0x11b/0x1d0
    Read of size 4 at addr ffff8801300e06d0 by task ksoftirqd/0/9

    CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 4.18.0-rc1-dbg+ #1
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    Call Trace:
    dump_stack+0xa4/0xf5
    print_address_description+0x6f/0x270
    kasan_report+0x241/0x360
    __asan_load4+0x78/0x80
    bio_advance+0x11b/0x1d0
    blk_update_request+0xa7/0x5b0
    scsi_end_request+0x56/0x320 [scsi_mod]
    scsi_io_completion+0x7d6/0xb20 [scsi_mod]
    scsi_finish_command+0x1c0/0x280 [scsi_mod]
    scsi_softirq_done+0x19a/0x230 [scsi_mod]
    blk_mq_complete_request+0x160/0x240
    scsi_mq_done+0x50/0x1a0 [scsi_mod]
    srp_recv_done+0x515/0x1330 [ib_srp]
    __ib_process_cq+0xa0/0xf0 [ib_core]
    ib_poll_handler+0x38/0xa0 [ib_core]
    irq_poll_softirq+0xe8/0x1f0
    __do_softirq+0x128/0x60d
    run_ksoftirqd+0x3f/0x60
    smpboot_thread_fn+0x352/0x460
    kthread+0x1c1/0x1e0
    ret_from_fork+0x24/0x30

    Allocated by task 1918:
    save_stack+0x43/0xd0
    kasan_kmalloc+0xad/0xe0
    kasan_slab_alloc+0x11/0x20
    kmem_cache_alloc+0xfe/0x350
    mempool_alloc_slab+0x15/0x20
    mempool_alloc+0xfb/0x270
    bio_alloc_bioset+0x244/0x350
    submit_bh_wbc+0x9c/0x2f0
    __block_write_full_page+0x299/0x5a0
    block_write_full_page+0x16b/0x180
    blkdev_writepage+0x18/0x20
    __writepage+0x42/0x80
    write_cache_pages+0x376/0x8a0
    generic_writepages+0xbe/0x110
    blkdev_writepages+0xe/0x10
    do_writepages+0x9b/0x180
    __filemap_fdatawrite_range+0x178/0x1c0
    file_write_and_wait_range+0x59/0xc0
    blkdev_fsync+0x46/0x80
    vfs_fsync_range+0x66/0x100
    do_fsync+0x3d/0x70
    __x64_sys_fsync+0x21/0x30
    do_syscall_64+0x77/0x230
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Freed by task 9:
    save_stack+0x43/0xd0
    __kasan_slab_free+0x137/0x190
    kasan_slab_free+0xe/0x10
    kmem_cache_free+0xd3/0x380
    mempool_free_slab+0x17/0x20
    mempool_free+0x63/0x160
    bio_free+0x81/0xa0
    bio_put+0x59/0x60
    end_bio_bh_io_sync+0x5d/0x70
    bio_endio+0x1a7/0x360
    blk_update_request+0xd0/0x5b0
    end_clone_bio+0xa3/0xd0 [dm_mod]
    bio_endio+0x1a7/0x360
    blk_update_request+0xd0/0x5b0
    scsi_end_request+0x56/0x320 [scsi_mod]
    scsi_io_completion+0x7d6/0xb20 [scsi_mod]
    scsi_finish_command+0x1c0/0x280 [scsi_mod]
    scsi_softirq_done+0x19a/0x230 [scsi_mod]
    blk_mq_complete_request+0x160/0x240
    scsi_mq_done+0x50/0x1a0 [scsi_mod]
    srp_recv_done+0x515/0x1330 [ib_srp]
    __ib_process_cq+0xa0/0xf0 [ib_core]
    ib_poll_handler+0x38/0xa0 [ib_core]
    irq_poll_softirq+0xe8/0x1f0
    __do_softirq+0x128/0x60d

    The buggy address belongs to the object at ffff8801300e0640
    which belongs to the cache bio-0 of size 200
    The buggy address is located 144 bytes inside of
    200-byte region [ffff8801300e0640, ffff8801300e0708)
    The buggy address belongs to the page:
    page:ffffea0004c03800 count:1 mapcount:0 mapping:ffff88015a563a00 index:0x0 compound_mapcount: 0
    flags: 0x8000000000008100(slab|head)
    raw: 8000000000008100 dead000000000100 dead000000000200 ffff88015a563a00
    raw: 0000000000000000 0000000000330033 00000001ffffffff 0000000000000000
    page dumped because: kasan: bad access detected

    Memory state around the buggy address:
    ffff8801300e0580: fb fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc
    ffff8801300e0600: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
    >ffff8801300e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    ffff8801300e0700: fb fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
    ffff8801300e0780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================

    Cc: Kent Overstreet
    Fixes: 0ba99ca4838b ("block: Add warning for bi_next not NULL in bio_endio()")
    Acked-by: Mike Snitzer
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

13 Jun, 2018

1 commit

  • The kzalloc() function has a 2-factor argument form, kcalloc(). This
    patch replaces cases of:

    kzalloc(a * b, gfp)

    with:
    kcalloc(a * b, gfp)

    as well as handling cases of:

    kzalloc(a * b * c, gfp)

    with:

    kzalloc(array3_size(a, b, c), gfp)

    as it's slightly less ugly than:

    kzalloc_array(array_size(a, b), c, gfp)

    This does, however, attempt to ignore constant size factors like:

    kzalloc(4 * 1024, gfp)

    though any constants defined via macros get caught up in the conversion.

    Any factors with a sizeof() of "unsigned char", "char", and "u8" were
    dropped, since they're redundant.

    The Coccinelle script used for this was:

    // Fix redundant parens around sizeof().
    @@
    type TYPE;
    expression THING, E;
    @@

    (
    kzalloc(
    - (sizeof(TYPE)) * E
    + sizeof(TYPE) * E
    , ...)
    |
    kzalloc(
    - (sizeof(THING)) * E
    + sizeof(THING) * E
    , ...)
    )

    // Drop single-byte sizes and redundant parens.
    @@
    expression COUNT;
    typedef u8;
    typedef __u8;
    @@

    (
    kzalloc(
    - sizeof(u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * (COUNT)
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(__u8) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(char) * COUNT
    + COUNT
    , ...)
    |
    kzalloc(
    - sizeof(unsigned char) * COUNT
    + COUNT
    , ...)
    )

    // 2-factor product with sizeof(type/expression) and identifier or constant.
    @@
    type TYPE;
    expression THING;
    identifier COUNT_ID;
    constant COUNT_CONST;
    @@

    (
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_ID)
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_ID
    + COUNT_ID, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (COUNT_CONST)
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * COUNT_CONST
    + COUNT_CONST, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_ID)
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_ID
    + COUNT_ID, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (COUNT_CONST)
    + COUNT_CONST, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * COUNT_CONST
    + COUNT_CONST, sizeof(THING)
    , ...)
    )

    // 2-factor product, only identifiers.
    @@
    identifier SIZE, COUNT;
    @@

    - kzalloc
    + kcalloc
    (
    - SIZE * COUNT
    + COUNT, SIZE
    , ...)

    // 3-factor product with 1 sizeof(type) or sizeof(expression), with
    // redundant parens removed.
    @@
    expression THING;
    identifier STRIDE, COUNT;
    type TYPE;
    @@

    (
    kzalloc(
    - sizeof(TYPE) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(TYPE))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * (COUNT) * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * (STRIDE)
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    |
    kzalloc(
    - sizeof(THING) * COUNT * STRIDE
    + array3_size(COUNT, STRIDE, sizeof(THING))
    , ...)
    )

    // 3-factor product with 2 sizeof(variable), with redundant parens removed.
    @@
    expression THING1, THING2;
    identifier COUNT;
    type TYPE1, TYPE2;
    @@

    (
    kzalloc(
    - sizeof(TYPE1) * sizeof(TYPE2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(THING1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(THING1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * COUNT
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    |
    kzalloc(
    - sizeof(TYPE1) * sizeof(THING2) * (COUNT)
    + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
    , ...)
    )

    // 3-factor product, only identifiers, with redundant parens removed.
    @@
    identifier STRIDE, SIZE, COUNT;
    @@

    (
    kzalloc(
    - (COUNT) * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * STRIDE * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - (COUNT) * (STRIDE) * (SIZE)
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    |
    kzalloc(
    - COUNT * STRIDE * SIZE
    + array3_size(COUNT, STRIDE, SIZE)
    , ...)
    )

    // Any remaining multi-factor products, first at least 3-factor products,
    // when they're not all constants...
    @@
    expression E1, E2, E3;
    constant C1, C2, C3;
    @@

    (
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(
    - (E1) * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * E3
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - (E1) * (E2) * (E3)
    + array3_size(E1, E2, E3)
    , ...)
    |
    kzalloc(
    - E1 * E2 * E3
    + array3_size(E1, E2, E3)
    , ...)
    )

    // And then all remaining 2 factors products when they're not all constants,
    // keeping sizeof() as the second factor argument.
    @@
    expression THING, E1, E2;
    type TYPE;
    constant C1, C2, C3;
    @@

    (
    kzalloc(sizeof(THING) * C2, ...)
    |
    kzalloc(sizeof(TYPE) * C2, ...)
    |
    kzalloc(C1 * C2 * C3, ...)
    |
    kzalloc(C1 * C2, ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * (E2)
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(TYPE) * E2
    + E2, sizeof(TYPE)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * (E2)
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - sizeof(THING) * E2
    + E2, sizeof(THING)
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * E2
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - (E1) * (E2)
    + E1, E2
    , ...)
    |
    - kzalloc
    + kcalloc
    (
    - E1 * E2
    + E1, E2
    , ...)
    )

    Signed-off-by: Kees Cook

    Kees Cook
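
    For reference, the point of the 2-factor form is that kcalloc() (and
    array3_size()) check the count/size multiplication for overflow and
    fail the allocation instead of silently wrapping. The typical
    before/after (variable names here are just an example):

        /* before: n * sizeof(*objs) can overflow and under-allocate */
        objs = kzalloc(n * sizeof(*objs), GFP_KERNEL);

        /* after: the product is overflow-checked, NULL on overflow */
        objs = kcalloc(n, sizeof(*objs), GFP_KERNEL);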
     

09 Jun, 2018

1 commit

  • Pull block fixes from Jens Axboe:
    "A few fixes for this merge window, where some of them should go in
    sooner rather than later, hence a new pull this week. This pull
    request contains:

    - Set of NVMe fixes, mostly follow up cleanups/fixes to the queue
    changes, but also teardown/removal and misc changes (Christop/Dan/
    Johannes/Sagi/Steve).

    - Two lightnvm fixes for issues that showed up in this window
    (Colin/Wei).

    - Failfast/driver flags inheritance for flush requests (Hannes).

    - The md device put sanitization and fix (Kent).

    - dm bio_set inheritance fix (me).

    - nbd discard granularity fix (Josef).

    - nbd consistency in command printing (Kevin).

    - Loop recursion validation fix (Ted).

    - Partition overlap check (Wang)"

    [ .. and now my build is warning-free again thanks to the md fix - Linus ]

    * tag 'for-linus-20180608' of git://git.kernel.dk/linux-block: (22 commits)
    nvme: cleanup double shift issue
    nvme-pci: make CMB SQ mod-param read-only
    nvme-pci: unquiesce dead controller queues
    nvme-pci: remove HMB teardown on reset
    nvme-pci: queue creation fixes
    nvme-pci: remove unnecessary completion doorbell check
    nvme-pci: remove unnecessary nested locking
    nvmet: filter newlines from user input
    nvme-rdma: correctly check for target keyed sgl support
    nvme: don't hold nvmf_transports_rwsem for more than transport lookups
    nvmet: return all zeroed buffer when we can't find an active namespace
    md: Unify mddev destruction paths
    dm: use bioset_init_from_src() to copy bio_set
    block: add bioset_init_from_src() helper
    block: always set partition number to '0' in blk_partition_remap()
    block: pass failfast and driver-specific flags to flush requests
    nbd: set discard_alignment to the granularity
    nbd: Consistently use request pointer in debug messages.
    block: add verifier for cmdline partition
    lightnvm: pblk: fix resource leak of invalid_bitmap
    ...

    Linus Torvalds
     

06 Jun, 2018

1 commit

  • Pull xfs updates from Darrick Wong:
    "New features this cycle include the ability to relabel mounted
    filesystems, support for fallocated swapfiles, and using FUA for pure
    data O_DSYNC directio writes. With this cycle we begin to integrate
    online filesystem repair and refactor the growfs code in preparation
    for eventual subvolume support, though the road ahead for both
    features is quite long.

    There are also numerous refactorings of the iomap code to remove
    unnecessary log overhead, to disentangle some of the quota code, and
    to prepare for buffer head removal in a future upstream kernel.

    Metadata validation continues to improve, both in the hot path
    verifiers and the online filesystem check code. I anticipate sending a
    second pull request in a few days with more metadata validation
    improvements.

    This series has been run through a full xfstests run over the weekend
    and through a quick xfstests run against this morning's master, with
    no major failures reported.

    Summary:

    - Strengthen inode number and structure validation when allocating
    inodes.

    - Reduce pointless buffer allocations during cache miss

    - Use FUA for pure data O_DSYNC directio writes

    - Various iomap refactorings

    - Strengthen quota metadata verification to avoid unfixable broken
    quota

    - Make AGFL block freeing a deferred operation to avoid blowing out
    transaction reservations when running complex operations

    - Get rid of the log item descriptors to reduce log overhead

    - Fix various reflink bugs where inodes were double-joined to
    transactions

    - Don't issue discards when trimming unwritten extents

    - Refactor incore dquot initialization and retrieval interfaces

    - Fix some locking problems in the quota scrub code

    - Strengthen btree structure checks in scrub code

    - Rewrite swapfile activation to use iomap and support unwritten
    extents

    - Make scrub exit to userspace sooner when corruptions or
    cross-referencing problems are found

    - Make scrub invoke the data fork scrubber directly on metadata
    inodes

    - Don't do background reclamation of post-eof and cow blocks when the
    fs is suspended

    - Fix secondary superblock buffer lifespan hinting

    - Refactor growfs to use table-dispatched functions instead of long
    stringy functions

    - Move growfs code to libxfs

    - Implement online fs label getting and setting

    - Introduce online filesystem repair (in a very limited capacity)

    - Fix unit conversion problems in the realtime freemap iteration
    functions

    - Various refactorings and cleanups in preparation to remove buffer
    heads in a future release

    - Reimplement the old bmap call with iomap

    - Remove direct buffer head accesses from seek hole/data

    - Various bug fixes"

    * tag 'xfs-4.18-merge-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (121 commits)
    fs: use ->is_partially_uptodate in page_cache_seek_hole_data
    fs: remove the buffer_unwritten check in page_seek_hole_data
    fs: move page_cache_seek_hole_data to iomap.c
    xfs: use iomap_bmap
    iomap: add an iomap-based bmap implementation
    iomap: add a iomap_sector helper
    iomap: use __bio_add_page in iomap_dio_zero
    iomap: move IOMAP_F_BOUNDARY to gfs2
    iomap: fix the comment describing IOMAP_NOWAIT
    iomap: inline data should be an iomap type, not a flag
    mm: split ->readpages calls to avoid non-contiguous pages lists
    mm: return an unsigned int from __do_page_cache_readahead
    mm: give the 'ret' variable a better name __do_page_cache_readahead
    block: add a lower-level bio_add_page interface
    xfs: fix error handling in xfs_refcount_insert()
    xfs: fix xfs_rtalloc_rec units
    xfs: strengthen rtalloc query range checks
    xfs: xfs_rtbuf_get should check the bmapi_read results
    xfs: xfs_rtword_t should be unsigned, not signed
    dax: change bdev_dax_supported() to support boolean returns
    ...

    Linus Torvalds
     

02 Jun, 2018

1 commit

  • For the upcoming removal of buffer heads in XFS we need to keep track of
    the number of outstanding writeback requests per page. For this we need
    to know if bio_add_page merged a region with the previous bvec or not.
    Instead of adding additional arguments this refactors bio_add_page to
    be implemented using three lower level helpers which users like XFS can
    use directly if they care about the merge decisions.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Reviewed-by: Ming Lei
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Christoph Hellwig
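
    With this refactoring, bio_add_page() becomes a thin wrapper over the
    two new helpers, and callers that care about the merge decision can use
    them directly; roughly (shape of the result, not a verbatim quote):

        int bio_add_page(struct bio *bio, struct page *page,
                         unsigned int len, unsigned int off)
        {
            if (!__bio_try_merge_page(bio, page, len, off)) {
                if (bio_full(bio))
                    return 0;
                __bio_add_page(bio, page, len, off);
            }
            return len;
        }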
     

30 Jan, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main pull request for block IO related changes for the
    4.16 kernel. Nothing major in this pull request, but a good amount of
    improvements and fixes all over the map. This contains:

    - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
    Paolo.

    - Support for SMR zones for deadline and mq-deadline from Damien and
    Christoph.

    - Set of fixes for bcache by way of Michael Lyle, including fixes
    from himself, Kent, Rui, Tang, and Coly.

    - Series from Matias for lightnvm with fixes from Hans Holmberg,
    Javier, and Matias. Mostly centered around pblk, and removing
    rrpc 1.2 in preparation for supporting 2.0.

    - A couple of NVMe pull requests from Christoph. Nothing major in
    here, just fixes and cleanups, and support for command tracing from
    Johannes.

    - Support for blk-throttle for tracking reads and writes separately.
    From Joseph Qi. A few cleanups/fixes also for blk-throttle from
    Weiping.

    - Series from Mike Snitzer that enables dm to register its queue more
    logically, something that's always been problematic on dm since
    it's a stacked device.

    - Series from Ming cleaning up some of the bio accessor use, in
    preparation for supporting multipage bvecs.

    - Various fixes from Ming closing up holes around queue mapping and
    quiescing.

    - BSD partition fix from Richard Narron, fixing a problem where we
    can't mount newer (10/11) FreeBSD partitions.

    - Series from Tejun reworking blk-mq timeout handling. The previous
    scheme relied on atomic bits, but it had races where we would think
    a request had timed out if it was reused at the wrong time.

    - null_blk now supports faking timeouts, to enable us to better
    exercise and test that functionality separately. From me.

    - Kill the separate atomic poll bit in the request struct. After
    this, we don't use the atomic bits on blk-mq anymore at all. From
    me.

    - sgl_alloc/free helpers from Bart.

    - Heavily contended tag case scalability improvement from me.

    - Various little fixes and cleanups from Arnd, Bart, Corentin,
    Douglas, Eryu, Goldwyn, and myself"

    * 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
    block: remove smart1,2.h
    nvme: add tracepoint for nvme_complete_rq
    nvme: add tracepoint for nvme_setup_cmd
    nvme-pci: introduce RECONNECTING state to mark initializing procedure
    nvme-rdma: remove redundant boolean for inline_data
    nvme: don't free uuid pointer before printing it
    nvme-pci: Suspend queues after deleting them
    bsg: use pr_debug instead of hand crafted macros
    blk-mq-debugfs: don't allow write on attributes with seq_operations set
    nvme-pci: Fix queue double allocations
    block: Set BIO_TRACE_COMPLETION on new bio during split
    blk-throttle: use queue_is_rq_based
    block: Remove kblockd_schedule_delayed_work{,_on}()
    blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
    blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
    lib/scatterlist: Fix chaining support in sgl_alloc_order()
    blk-throttle: track read and write request individually
    block: add bdev_read_only() checks to common helpers
    block: fail op_is_write() requests to read-only partitions
    blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
    ...

    Linus Torvalds
     

07 Jan, 2018

1 commit

  • bcache is the only user of bio_alloc_pages(), so move this function into
    bcache, and avoid it being misused in the future.

    Also rename it to bch_bio_alloc_pages() since it is bcache only.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

21 Dec, 2017

1 commit

  • If a bio is throttled and split after throttling, the bio could be
    resubmitted and enter throttling again. This will cause part of the
    bio to be charged multiple times. If the cgroup has an IO limit, the
    double charge will significantly harm the performance. The bio split
    becomes quite common after arbitrary bio size change.

    To fix this, we always set the BIO_THROTTLED flag if a bio is throttled.
    If the bio is cloned/split, we copy the flag to new bio too to avoid a
    double charge. However, a cloned bio could be directed to a new disk,
    so keeping the flag would be a problem. The observation is that we
    always set a new disk for the bio in this case, so we can clear the
    flag in bio_set_dev().

    This issue exists for a long time, arbitrary bio size change just makes
    it worse, so this should go into stable at least since v4.2.

    V1-> V2: Not add extra field in bio based on discussion with Tejun

    Cc: Vivek Goyal
    Cc: stable@vger.kernel.org
    Acked-by: Tejun Heo
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
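
    A hedged sketch of the resulting behaviour: throttling marks the bio,
    clones inherit the mark, and redirecting the bio to a different disk in
    bio_set_dev() drops the mark again so the new queue's limits still
    apply. This is the shape of the change, not the literal macro:

        #define bio_set_dev(bio, bdev)                          \
        do {                                                    \
            if ((bio)->bi_disk != (bdev)->bd_disk)              \
                bio_clear_flag(bio, BIO_THROTTLED);             \
            (bio)->bi_disk = (bdev)->bd_disk;                   \
            (bio)->bi_partno = (bdev)->bd_partno;               \
        } while (0)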