01 May, 2019

1 commit


30 Apr, 2019

4 commits


24 Apr, 2019

1 commit

  • The refcount has already been increased for pages retrieved from a
    non-bvec iov_iter via __bio_iov_iter_get_pages(), so there is no need
    to do that again.

    Otherwise, IO pages are easily leaked.

    Cc: Christoph Hellwig
    Reviewed-by: Chaitanya Kulkarni
    Fixes: 7321ecbfc7cf ("block: change how we get page references in bio_iov_iter_get_pages")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

23 Apr, 2019

1 commit

  • bio_add_page() and __bio_add_page() are capable of adding multiple
    pages into a bio, and we already have at least two such users:

    - __bio_iov_bvec_add_pages()
    - nvmet_bdev_execute_rw()

    So update the comments on these two helpers.

    __bio_try_merge_page() is a bit special: the caller needs to know
    whether the newly added page is the same as the last added page, so
    it isn't safe to pass multiple pages when 'same_page' is true. Add a
    warning on potential misuse and update the comment on
    __bio_try_merge_page().

    Cc: linux-xfs@vger.kernel.org
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

22 Apr, 2019

1 commit

  • Pull in v5.1-rc6 to resolve two conflicts. One is in BFQ, in just a
    comment, and is trivial. The other one is a conflict due to a later fix
    in the bio multi-page work, and needs a bit more care.

    * tag 'v5.1-rc6': (770 commits)
    Linux 5.1-rc6
    block: make sure that bvec length can't be overflow
    block: kill all_q_node in request_queue
    x86/cpu/intel: Lower the "ENERGY_PERF_BIAS: Set to normal" message's log priority
    coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping
    mm/kmemleak.c: fix unused-function warning
    init: initialize jump labels before command line option parsing
    kernel/watchdog_hld.c: hard lockup message should end with a newline
    kcov: improve CONFIG_ARCH_HAS_KCOV help text
    mm: fix inactive list balancing between NUMA nodes and cgroups
    mm/hotplug: treat CMA pages as unmovable
    proc: fixup proc-pid-vm test
    proc: fix map_files test on F29
    mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
    mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
    mm: swapoff: shmem_unuse() stop eviction without igrab()
    mm: swapoff: take notice of completion sooner
    mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
    mm: swapoff: shmem_find_swap_entries() filter out other types
    slab: store tagged freelist for off-slab slabmgmt
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     

12 Apr, 2019

4 commits


11 Apr, 2019

1 commit

  • When bio_add_pc_page() fails in bio_copy_user_iov() we should free
    the page we just allocated otherwise we are leaking it.

    Cc: linux-block@vger.kernel.org
    Cc: Linus Torvalds
    Cc: stable@vger.kernel.org
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Jens Axboe

    Jérôme Glisse
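
    The error-path fix described above can be sketched in userspace C.
    This is only a model under stated assumptions: demo_bio,
    demo_add_page() and demo_fill() are hypothetical stand-ins (not kernel
    APIs), malloc()/free() stand in for page allocation, and the 'live'
    counter exists purely so the leak can be observed.

```c
#include <assert.h>
#include <stdlib.h>

#define DEMO_MAX_PAGES 2      /* hypothetical capacity of the container */
#define DEMO_PAGE_SIZE 4096

struct demo_bio {
    void *pages[DEMO_MAX_PAGES];
    int   nr_pages;
};

/* Stand-in for an "add page" helper: returns 0 when the bio is full. */
static inline int demo_add_page(struct demo_bio *bio, void *page)
{
    if (bio->nr_pages >= DEMO_MAX_PAGES)
        return 0;
    bio->pages[bio->nr_pages++] = page;
    return 1;
}

/*
 * Allocate up to 'want' pages and add them to the bio.
 * '*live' tracks allocations that have not been freed yet.
 * Returns the number of pages actually added.
 */
static inline int demo_fill(struct demo_bio *bio, int want, int *live)
{
    int added = 0;

    assert(want >= 0);
    while (added < want) {
        void *page = malloc(DEMO_PAGE_SIZE);
        if (!page)
            break;
        (*live)++;
        if (!demo_add_page(bio, page)) {
            /* The page never made it into the bio, so release it here. */
            free(page);
            (*live)--;
            break;
        }
        added++;
    }
    return added;
}

/* Free every page the bio owns. */
static inline void demo_release(struct demo_bio *bio, int *live)
{
    for (int i = 0; i < bio->nr_pages; i++) {
        free(bio->pages[i]);
        (*live)--;
    }
    bio->nr_pages = 0;
}
```

    Without the free() in the failure branch, 'live' would stay one higher
    than the number of pages the bio owns, which is the leak this commit
    closes.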
     

04 Apr, 2019

1 commit

  • With the introduction of BIO_NO_PAGE_REF we've used up all available bits
    in bio::bi_flags.

    Convert the defines of the flags to an enum and add a BUILD_BUG_ON() call
    to make sure no-one adds a new one and thus overrides the BVEC_POOL_IDX
    causing crashes.

    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
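
    A minimal sketch of the enum-plus-build-time-check pattern described
    above. The names (DEMO_*), the 16-bit flags word and the 3-bit pool
    index are assumptions made for illustration, not the kernel's actual
    definitions; C11 _Static_assert stands in for BUILD_BUG_ON().

```c
#include <assert.h>

/* Hypothetical layout: 16-bit flags word, top 3 bits hold a pool index. */
#define DEMO_FLAG_BITS   16
#define DEMO_POOL_BITS   3
#define DEMO_POOL_OFFSET (DEMO_FLAG_BITS - DEMO_POOL_BITS)   /* 13 */

/* Flags as an enum: the last enumerator counts the flag bits in use. */
enum demo_flag {
    DEMO_NO_PAGE_REF,   /* bit 0 */
    DEMO_SEG_VALID,     /* bit 1 */
    DEMO_CLONED,        /* bit 2 */
    DEMO_FLAG_LAST      /* not a flag: number of flag bits defined */
};

/* Adding a flag that would reach into the pool index breaks the build. */
_Static_assert(DEMO_FLAG_LAST <= DEMO_POOL_OFFSET,
               "flag bits overlap the pool index");

static inline unsigned short demo_set_flag(unsigned short flags,
                                           enum demo_flag f)
{
    return (unsigned short)(flags | (1u << f));
}

static inline int demo_test_flag(unsigned short flags, enum demo_flag f)
{
    return (int)((flags >> f) & 1u);
}

/* Store 'idx' in the top DEMO_POOL_BITS bits, leaving the flags alone. */
static inline unsigned short demo_set_pool_idx(unsigned short flags,
                                               unsigned idx)
{
    unsigned mask = ((1u << DEMO_POOL_BITS) - 1u) << DEMO_POOL_OFFSET;

    assert(idx < (1u << DEMO_POOL_BITS));
    return (unsigned short)((flags & ~mask) | (idx << DEMO_POOL_OFFSET));
}

static inline unsigned demo_pool_idx(unsigned short flags)
{
    return (unsigned)flags >> DEMO_POOL_OFFSET;
}
```

    Because the enumerators are numbered automatically, appending a new
    flag before DEMO_FLAG_LAST is enough to have the compile-time check
    re-evaluated.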
     

02 Apr, 2019

5 commits

  • The block IO stack is now basically ready to support multi-page bvecs,
    but they aren't enabled for passthrough IO yet.

    One reason is that passthrough IO is dispatched to the LLD directly and
    bio splitting is bypassed, so the bio has to be built correctly for
    dispatch to the LLD from the beginning.

    Implement multi-page support for passthrough IO by limiting each bvec
    to a single block device segment and applying all queue limits in
    blk_add_pc_page(). We then no longer need to calculate segments for
    passthrough IO, which simplifies the code considerably.

    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • When the added page is merged into the last page of the bio (the same
    page) in bio_add_pc_page(), the caller may need to put this page to
    avoid a page leak.

    bio_map_user_iov() needs this kind of handling, and currently deals
    with it itself in an ad hoc way.

    Move the put-page handling into __bio_add_pc_page(), so that
    bio_map_user_iov() can be simplified a bit and other users may
    benefit from the change as well.

    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The check for deciding whether a page is mergeable into the current
    bvec has become a bit complicated, and we need to reuse the code
    before adding a pc page.

    So move the check into a dedicated helper.

    No functional change.

    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
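
    As a rough illustration of what such a mergeability helper decides,
    here is a userspace model. It assumes pages are identified by page
    frame number and that "mergeable" means physically contiguous with the
    end of the current bvec; demo_bvec and demo_page_is_mergeable() are
    hypothetical names, and the real helper also applies checks (such as
    the XEN one) that are not modelled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* Simplified bvec: a byte range that may span several contiguous pages. */
struct demo_bvec {
    uint64_t pfn;     /* page frame number of the first page */
    uint32_t offset;  /* byte offset into the first page */
    uint32_t len;     /* total length in bytes */
};

static inline uint64_t demo_phys(uint64_t pfn, uint32_t off)
{
    return pfn * DEMO_PAGE_SIZE + off;
}

/*
 * Returns true when (pfn, off) starts exactly where 'bv' ends.
 * On success, *same_page reports whether the new data lands in the
 * page that already holds the last byte of 'bv'.
 */
static inline bool demo_page_is_mergeable(const struct demo_bvec *bv,
                                          uint64_t pfn, uint32_t off,
                                          bool *same_page)
{
    assert(bv->len > 0);

    uint64_t end = demo_phys(bv->pfn, bv->offset) + bv->len; /* one past */
    uint64_t start = demo_phys(pfn, off);

    if (end != start)
        return false;
    *same_page = ((end - 1) / DEMO_PAGE_SIZE) == pfn;
    return true;
}
```

    The same_page result matters to callers that took a reference on the
    page before trying the merge: when it is true, they already hold a
    reference through the existing bvec entry.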
     
  • REQ_PC is out of date, so replace it with passthrough IO.

    Also remove the local variable 'prev', since we can reuse the
    top-level local variable 'bvec'.

    No functional change.

    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • XEN has a special page merge requirement, see xen_biovec_phys_mergeable().
    We can't simply merge pages into one bvec for XEN.

    So move the XEN-specific page merge check into __bio_try_merge_page(),
    to avoid breaking XEN with multi-page bvecs.

    Cc: Boris Ostrovsky
    Cc: xen-devel@lists.xenproject.org
    Cc: Omar Sandoval
    Cc: Christoph Hellwig
    Reviewed-by: Juergen Gross
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

19 Mar, 2019

1 commit

  • If bio_iov_iter_get_pages() is called on an iov_iter that is flagged
    with NO_REF, then we don't need to add a page reference for the pages
    that we add.

    Add BIO_NO_PAGE_REF to track this in the bio, so IO completion knows
    not to drop a reference to these pages.

    Signed-off-by: Jens Axboe

    Jens Axboe
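
    The reference-counting rule above can be modelled in a few lines of
    userspace C. This is a sketch under assumptions: demo_page, demo_bio
    and the no_page_ref field are illustrative stand-ins for struct page,
    struct bio and BIO_NO_PAGE_REF, and plain integers replace the
    kernel's atomic page refcounts.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_MAX_PAGES 4

struct demo_page {
    int refcount;
};

struct demo_bio {
    struct demo_page *pages[DEMO_MAX_PAGES];
    int  nr_pages;
    bool no_page_ref;   /* models BIO_NO_PAGE_REF */
};

/* Add a page; take a reference only if the bio owns page references. */
static inline void demo_bio_add_page(struct demo_bio *bio,
                                     struct demo_page *pg)
{
    assert(bio->nr_pages < DEMO_MAX_PAGES);
    if (!bio->no_page_ref)
        pg->refcount++;
    bio->pages[bio->nr_pages++] = pg;
}

/* Completion: drop exactly the references that were taken on add. */
static inline void demo_bio_complete(struct demo_bio *bio)
{
    for (int i = 0; i < bio->nr_pages; i++) {
        if (!bio->no_page_ref)
            bio->pages[i]->refcount--;
    }
    bio->nr_pages = 0;
}
```

    The point of recording the flag in the bio is that the add path and
    the completion path make the same decision, so the page refcount ends
    where it started in both modes.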
     

28 Feb, 2019

1 commit

  • For an ITER_BVEC, we can just iterate the iov and add the pages
    to the bio directly. For now, we grab a reference to those pages,
    and release them normally on IO completion. This isn't really needed
    for the normal case of O_DIRECT from/to a file, but some of the more
    esoteric use cases (like splice(2)) will unconditionally put the
    pipe buffer pages when the buffers are released. Until we can manage
    that case properly, ITER_BVEC pages are treated like normal pages
    in terms of reference counting.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Feb, 2019

2 commits

  • This patch pulls the trigger for multi-page bvecs.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • This patch introduces one extra iterator variable to bio_for_each_segment_all(),
    so that bio_for_each_segment_all() can iterate over multi-page bvecs.

    Since it is just a simple, mechanical change to all bio_for_each_segment_all()
    users, this patch makes the tree-wide change in a single patch, which
    avoids the need for a temporary helper during the conversion.

    Reviewed-by: Omar Sandoval
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
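
    To show why an extra iterator variable is needed, here is a userspace
    sketch that walks one multi-page bvec and yields single-page segments.
    The types and function names (demo_bvec, demo_iter,
    demo_next_segment()) are hypothetical; the real macro iterates over
    every bvec in a bio, whereas this model handles a single bvec.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096u

/* Simplified bvec: may cover several physically contiguous pages. */
struct demo_bvec {
    uint64_t pfn;     /* page frame number of the first page */
    uint32_t offset;  /* byte offset into the first page */
    uint32_t len;     /* total length in bytes */
};

/* The extra iterator state: how far we are inside the current bvec. */
struct demo_iter {
    uint32_t done;
};

/*
 * Fill 'seg' with the next single-page piece of 'bv'.
 * Returns 1 if a segment was produced, 0 when 'bv' is exhausted.
 */
static inline int demo_next_segment(const struct demo_bvec *bv,
                                    struct demo_iter *it,
                                    struct demo_bvec *seg)
{
    if (it->done >= bv->len)
        return 0;

    uint32_t pos = bv->offset + it->done;      /* bytes from first page */
    uint32_t in_page = pos % DEMO_PAGE_SIZE;
    uint32_t room = DEMO_PAGE_SIZE - in_page;  /* bytes left in this page */
    uint32_t left = bv->len - it->done;

    seg->pfn = bv->pfn + pos / DEMO_PAGE_SIZE;
    seg->offset = in_page;
    seg->len = left < room ? left : room;
    it->done += seg->len;
    return 1;
}

/* Count the single-page segments contained in one bvec. */
static inline int demo_count_segments(const struct demo_bvec *bv)
{
    struct demo_iter it = { 0 };
    struct demo_bvec seg;
    int n = 0;

    while (demo_next_segment(bv, &it, &seg)) {
        assert(seg.offset + seg.len <= DEMO_PAGE_SIZE);
        n++;
    }
    return n;
}
```

    With single-page bvecs the loop index alone identifies the current
    page; once a bvec can span pages, the position inside it has to be
    carried separately, which is the role of demo_iter here.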
     

03 Jan, 2019

1 commit

  • Pull more block updates from Jens Axboe:

    - Dead code removal for loop/sunvdc (Chengguang)

    - Mark BIDI support for bsg as deprecated, logging a single dmesg
    warning if anyone is actually using it (Christoph)

    - blkcg cleanup, killing a dead function and making the tryget_closest
    variant easier to read (Dennis)

    - Floppy fixes, one fixing a regression in swim3 (Finn)

    - lightnvm use-after-free fix (Gustavo)

    - gdrom leak fix (Wenwen)

    - a set of drbd updates (Lars, Luc, Nathan, Roland)

    * tag 'for-4.21/block-20190102' of git://git.kernel.dk/linux-block: (28 commits)
    block/swim3: Fix regression on PowerBook G3
    block/swim3: Fix -EBUSY error when re-opening device after unmount
    block/swim3: Remove dead return statement
    block/amiflop: Don't log error message on invalid ioctl
    gdrom: fix a memory leak bug
    lightnvm: pblk: fix use-after-free bug
    block: sunvdc: remove redundant code
    block: loop: remove redundant code
    bsg: deprecate BIDI support in bsg
    blkcg: remove unused __blkg_release_rcu()
    blkcg: clean up blkg_tryget_closest()
    drbd: Change drbd_request_detach_interruptible's return type to int
    drbd: Avoid Clang warning about pointless switch statment
    drbd: introduce P_ZEROES (REQ_OP_WRITE_ZEROES on the "wire")
    drbd: skip spurious timeout (ping-timeo) when failing promote
    drbd: don't retry connection if peers do not agree on "authentication" settings
    drbd: fix print_st_err()'s prototype to match the definition
    drbd: avoid spurious self-outdating with concurrent disconnect / down
    drbd: do not block when adjusting "disk-options" while IO is frozen
    drbd: fix comment typos
    ...

    Linus Torvalds
     

29 Dec, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main pull request for block/storage for 4.21.

    Larger than usual, it was a busy round with lots of goodies queued up.
    Most notable is the removal of the old IO stack, which has been a long
    time coming. No new features for a while, everything coming in this
    week has all been fixes for things that were previously merged.

    This contains:

    - Use atomic counters instead of semaphores for mtip32xx (Arnd)

    - Cleanup of the mtip32xx request setup (Christoph)

    - Fix for circular locking dependency in loop (Jan, Tetsuo)

    - bcache (Coly, Guoju, Shenghui)
    * Optimizations for writeback caching
    * Various fixes and improvements

    - nvme (Chaitanya, Christoph, Sagi, Jay, me, Keith)
    * host and target support for NVMe over TCP
    * Error log page support
    * Support for separate read/write/poll queues
    * Much improved polling
    * discard OOM fallback
    * Tracepoint improvements

    - lightnvm (Hans, Hua, Igor, Matias, Javier)
    * Igor added packed metadata to pblk. Now drives without metadata
    per LBA can be used as well.
    * Fix from Geert on uninitialized value on chunk metadata reads.
    * Fixes from Hans and Javier to pblk recovery and write path.
    * Fix from Hua Su to fix a race condition in the pblk recovery
    code.
    * Scan optimization added to pblk recovery from Zhoujie.
    * Small geometry cleanup from me.

    - Conversion of the last few drivers that used the legacy path to
    blk-mq (me)

    - Removal of legacy IO path in SCSI (me, Christoph)

    - Removal of legacy IO stack and schedulers (me)

    - Support for much better polling, now without interrupts at all.
    blk-mq adds support for multiple queue maps, which enables us to
    have a map per type. This in turn enables nvme to have separate
    completion queues for polling, which can then be interrupt-less.
    Also means we're ready for async polled IO, which is hopefully
    coming in the next release.

    - Killing of (now) unused block exports (Christoph)

    - Unification of the blk-rq-qos and blk-wbt wait handling (Josef)

    - Support for zoned testing with null_blk (Masato)

    - sx8 conversion to per-host tag sets (Christoph)

    - IO priority improvements (Damien)

    - mq-deadline zoned fix (Damien)

    - Ref count blkcg series (Dennis)

    - Lots of blk-mq improvements and speedups (me)

    - sbitmap scalability improvements (me)

    - Make core inflight IO accounting per-cpu (Mikulas)

    - Export timeout setting in sysfs (Weiping)

    - Cleanup the direct issue path (Jianchao)

    - Export blk-wbt internals in block debugfs for easier debugging
    (Ming)

    - Lots of other fixes and improvements"

    * tag 'for-4.21/block-20181221' of git://git.kernel.dk/linux-block: (364 commits)
    kyber: use sbitmap add_wait_queue/list_del wait helpers
    sbitmap: add helpers for add/del wait queue handling
    block: save irq state in blkg_lookup_create()
    dm: don't reuse bio for flushes
    nvme-pci: trace SQ status on completions
    nvme-rdma: implement polling queue map
    nvme-fabrics: allow user to pass in nr_poll_queues
    nvme-fabrics: allow nvmf_connect_io_queue to poll
    nvme-core: optionally poll sync commands
    block: make request_to_qc_t public
    nvme-tcp: fix spelling mistake "attepmpt" -> "attempt"
    nvme-tcp: fix endianess annotations
    nvmet-tcp: fix endianess annotations
    nvme-pci: refactor nvme_poll_irqdisable to make sparse happy
    nvme-pci: only set nr_maps to 2 if poll queues are supported
    nvmet: use a macro for default error location
    nvmet: fix comparison of a u16 with -1
    blk-mq: enable IO poll if .nr_queues of type poll > 0
    blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()
    blk-mq: skip zero-queue maps in blk_mq_map_swqueue
    ...

    Linus Torvalds
     

21 Dec, 2018

1 commit

  • The implementation of blkg_tryget_closest() wasn't super obvious and
    became a point of suspicion when debugging [1]. So let's clean it up so
    it's obviously not the problem.

    Also add missing RCU read locking to bio_clone_blkg_association(), which
    got exposed by adding the RCU read lock held check in
    blkg_tryget_closest().

    [1] https://lore.kernel.org/linux-block/a7e97e4b-0dd8-3a54-23b7-a0f27b17fde8@kernel.dk/

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

14 Dec, 2018

3 commits


11 Dec, 2018

1 commit

  • We don't need to zero fill the bio if not using kernel allocated pages.

    Fixes: f3587d76da05 ("block: Clear kernel memory before copying to user") # v4.20-rc2
    Reported-by: Todd Aiken
    Cc: Laurence Oberman
    Cc: stable@vger.kernel.org
    Cc: Bart Van Assche
    Tested-by: Laurence Oberman
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

10 Dec, 2018

2 commits

  • We want to convert to per-cpu in_flight counters.

    The function part_round_stats needs the in_flight counter every jiffy.
    It would be too costly to sum all the percpu variables every jiffy, so
    part_round_stats must be deleted. part_round_stats is used to calculate
    two counters: time_in_queue and io_ticks.

    time_in_queue can be calculated without part_round_stats, by adding the
    duration of the I/O when the I/O ends (the value is almost as exact as the
    previously calculated value, except that time for in-progress I/Os is not
    counted).

    io_ticks can be approximated by increasing the value when I/O is started
    or ended and the jiffies value has changed. If the I/Os take less than a
    jiffy, the value is as exact as the previously calculated value. If the
    I/Os take more than a jiffy, io_ticks can drift behind the previously
    calculated value.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
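
    The two approximations described above can be sketched as a
    single-threaded userspace model. demo_part and the demo_io_*()
    helpers are hypothetical names, 'now' stands in for the jiffies
    counter, and the per-cpu and atomic details of the real code are
    deliberately left out.

```c
#include <assert.h>

struct demo_part {
    unsigned long stamp;          /* "jiffies" value at the last update */
    unsigned long io_ticks;       /* approximate busy time */
    unsigned long time_in_queue;  /* summed I/O durations */
};

/* Bump io_ticks at most once per distinct jiffy value. */
static inline void demo_update_io_ticks(struct demo_part *p,
                                        unsigned long now)
{
    if (p->stamp != now) {
        p->stamp = now;
        p->io_ticks++;
    }
}

/* Called when an I/O is started. */
static inline void demo_io_start(struct demo_part *p, unsigned long now)
{
    demo_update_io_ticks(p, now);
}

/* Called when an I/O ends: account its whole duration in one step. */
static inline void demo_io_end(struct demo_part *p, unsigned long start,
                               unsigned long now)
{
    assert(now >= start);
    demo_update_io_ticks(p, now);
    p->time_in_queue += now - start;
}
```

    In this model an I/O that starts and ends within one jiffy adds a
    single tick, while an I/O lasting five jiffies adds only two (one at
    start, one at end), which shows how io_ticks can drift behind the
    previously calculated value for long I/Os.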
     
  • All of part_stat_* and related methods are used with preempt disabled,
    so there is no need to pass cpu around to all of them. Just call
    smp_processor_id() as needed.

    Suggested-by: Jens Axboe
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

08 Dec, 2018

8 commits

  • blkg reference counting now uses percpu_ref rather than atomic_t. Let's
    make this consistent with css_tryget. This renames blkg_try_get to
    blkg_tryget and now returns a bool rather than the blkg or %NULL.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • Now that a bio only holds a blkg reference, cleanup is simply
    putting back that reference. Remove bio_disassociate_task() as it just
    calls bio_disassociate_blkg() and call the latter directly.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • The previous patch in this series removed carrying around a pointer to
    the css in blkg. However, the blkg association logic still relied on
    taking a reference on the css to ensure we wouldn't fail in getting a
    reference for the blkg.

    Here the implicit dependency on the css is removed. The association
    continues to rely on the tryget logic walking up the blkg tree. This
    streamlines the three ways that association can happen: normal, swap,
    and writeback.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
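
    The "tryget walking up the tree" idea can be illustrated with a small
    single-threaded model. demo_blkg, demo_tryget() and
    demo_tryget_closest() are hypothetical names; a plain int replaces
    the percpu_ref used by the kernel, and RCU protection is not
    modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_blkg {
    int refcnt;                 /* 0 means the group is going away */
    struct demo_blkg *parent;   /* NULL for the root */
};

/* Take a reference only if the group is still alive; report success. */
static inline bool demo_tryget(struct demo_blkg *g)
{
    assert(g->refcnt >= 0);
    if (g->refcnt == 0)
        return false;
    g->refcnt++;
    return true;
}

/*
 * Walk from 'g' toward the root until a reference can be taken.
 * Returns the group that was pinned, or NULL if none could be.
 */
static inline struct demo_blkg *demo_tryget_closest(struct demo_blkg *g)
{
    while (g && !demo_tryget(g))
        g = g->parent;
    return g;
}
```

    Returning a bool from the tryget keeps it usable directly as a loop
    condition, and falling back to an ancestor means association does not
    have to fail just because the closest group is being torn down.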
     
  • Prior patches ensured that any bio that interacts with a request_queue
    is properly associated with a blkg. This makes bio->bi_css unnecessary
    as blkg maintains a reference to blkcg already.

    This removes the bio field bi_css and transfers corresponding uses to
    access via bi_blkg.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • One of the goals of this series is to remove a separate reference to
    the css of the bio. This can and should be accessed via bio_blkcg(). In
    this patch, wbc_init_bio() now requires a bio to have a device
    associated with it.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • A prior patch in this series added blkg association to bios issued by
    cgroups. There are two other paths where we want to attribute work back
    to the appropriate cgroup: swap and writeback. Here we modify the way
    swap tags bios to include the blkg. Writeback will be tackled in the
    next patch.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • bio_issue_init among other things initializes the timestamp for an IO.
    Rather than have this logic handled by policies, this consolidates it to
    be on the init paths (normal, clone, bounce clone).

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou
     
  • Previously, blkg association was handled by controller specific code in
    blk-throttle and blk-iolatency. However, because a blkg represents a
    relationship between a blkcg and a request_queue, it makes sense to keep
    the blkg->q and bio->bi_disk->queue consistent.

    This patch moves association into the bio_set_dev() macro. This should
    cover the majority of cases where the device is set/changed keeping the
    two pointers consistent. Fallback code is added to
    blkcg_bio_issue_check() to catch any missing paths.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Dennis Zhou