10 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:

    - Two fixes for a ubd regression: one for missing locking, and one for
    a missing initialization of a field. The latter was an old latent
    bug, but it's now visible and triggers (Me, Anton Ivanov)

    - Set of NVMe fixes via Christoph, but applied manually due to a git
    tree mixup (Christoph, Sagi)

    - Fix for a discard split regression, in three patches (Ming)

    - Update libata git trees (Geert)

    - SPDX identifier for sata_rcar (Kuninori Morimoto)

    - Virtual boundary merge fix (Johannes)

    - Preemptively clear memory we are going to pass to userspace, in case
    the driver does a short read (Keith)

    * tag 'for-linus-20181109' of git://git.kernel.dk/linux-block:
    block: make sure writesame bio is aligned with logical block size
    block: cleanup __blkdev_issue_discard()
    block: make sure discard bio is aligned with logical block size
    Revert "nvmet-rdma: use a private workqueue for delete"
    nvme: make sure ns head inherits underlying device limits
    nvmet: don't try to add ns to p2p map unless it actually uses it
    sata_rcar: convert to SPDX identifiers
    ubd: fix missing initialization of io_req
    block: Clear kernel memory before copying to user
    MAINTAINERS: Fix remaining pointers to obsolete libata.git
    ubd: fix missing lock around request issue
    block: respect virtual boundary mask in bvecs

    Linus Torvalds
     

09 Nov, 2018

3 commits

  • Obviously the created writesame bio has to be aligned with the logical
    block size; use bio_allowed_max_sectors() to retrieve this limit.

    Cc: stable@vger.kernel.org
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Fixes: b49a0871be31a745b2ef ("block: remove split code in blkdev_issue_{discard,write_same}")
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Clean up __blkdev_issue_discard() a bit:

    - remove the local variable 'end_sect'
    - remove the 'fail' code block

    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Obviously the created discard bio has to be aligned with the logical
    block size.

    This patch introduces the helper bio_allowed_max_sectors() for
    this purpose.

    Cc: stable@vger.kernel.org
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Fixes: 744889b7cbb56a6 ("block: don't deal with discard limit in blkdev_issue_discard()")
    Fixes: a22c4d7e34402cc ("block: re-add discard_granularity and alignment checks")
    Reported-by: Rui Salvaterra
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
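
    For reference, a minimal sketch of such a helper, assuming the limit
    is derived from queue_logical_block_size() (the actual patch may
    differ in detail):

        /* Sketch: largest sector count that keeps a bio aligned to the
         * queue's logical block size. */
        static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
        {
                return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
        }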
     

08 Nov, 2018

2 commits

  • If the kernel allocates a bounce buffer for user read data, this memory
    needs to be cleared before copying it to the user, otherwise it may leak
    kernel memory to user space.

    Reported-by: Laurence Oberman
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
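
    A sketch of the idea, not the exact patch; the zeroing happens when
    the bounce page is allocated, so anything the driver fails to
    overwrite reads back as zeros rather than stale kernel memory:

        /* Sketch: zero a freshly allocated bounce page destined for a
         * read, so a short driver read cannot leak kernel memory. */
        struct page *page = alloc_page(q->bounce_gfp | gfp_mask);
        if (page)
                memset(page_address(page), 0, PAGE_SIZE);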
     
  • With drivers that are setting a virtual boundary constraint, we are
    seeing a lot of bio splitting and smaller I/Os being submitted to the
    driver.

    This happens because the bio gap detection code does not account for
    cases where PAGE_SIZE - 1 is bigger than queue_virt_boundary() and
    thus will split the bio unnecessarily.

    Cc: Jan Kara
    Cc: Bart Van Assche
    Cc: Ming Lei
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Johannes Thumshirn
    Acked-by: Keith Busch
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
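
    A rough sketch of the corrected gap check, which tests the bvec
    offsets against the virtual boundary mask instead of assuming
    PAGE_SIZE alignment (details may differ from the actual patch):

        /* Sketch: a gap exists only if the new segment does not start at
         * offset 0 within the boundary, or the previous segment does not
         * end on the virtual boundary. */
        static inline bool bvec_gap_to_prev(struct request_queue *q,
                        struct bio_vec *bprv, unsigned int offset)
        {
                if (!queue_virt_boundary(q))
                        return false;
                return offset ||
                        ((bprv->bv_offset + bprv->bv_len) & queue_virt_boundary(q));
        }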
     

03 Nov, 2018

1 commit

  • Pull block layer fixes from Jens Axboe:
    "The biggest part of this pull request is the revert of the blkcg
    cleanup series. It had one fix earlier for a stacked device issue, but
    another one was reported. Rather than play whack-a-mole with this,
    revert the entire series and try again for the next kernel release.

    Apart from that, only small fixes/changes.

    Summary:

    - Indentation fixup for mtip32xx (Colin Ian King)

    - The blkcg cleanup series revert (Dennis Zhou)

    - Two NVMe fixes. One fixing a regression in the nvme request
    initialization in this merge window, causing nvme-fc to not work.
    The other is a suspend/resume p2p resource issue (James, Keith)

    - Fix sg discard merge, allowing us to merge in cases where we didn't
    before (Jianchao Wang)

    - Call rq_qos_exit() after the queue is frozen, preventing a hang
    (Ming)

    - Fix brd queue setup, fixing an oops if we fail setting up all
    devices (Ming)"

    * tag 'for-linus-20181102' of git://git.kernel.dk/linux-block:
    nvme-pci: fix conflicting p2p resource adds
    nvme-fc: fix request private initialization
    blkcg: revert blkcg cleanups series
    block: brd: associate with queue until adding disk
    block: call rq_qos_exit() after queue is frozen
    mtip32xx: clean an indentation issue, remove extraneous tabs
    block: fix the DISCARD request merge

    Linus Torvalds
     

02 Nov, 2018

2 commits

  • Pull AFS updates from Al Viro:
    "AFS series, with some iov_iter bits included"

    * 'work.afs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
    missing bits of "iov_iter: Separate type from direction and use accessor functions"
    afs: Probe multiple fileservers simultaneously
    afs: Fix callback handling
    afs: Eliminate the address pointer from the address list cursor
    afs: Allow dumping of server cursor on operation failure
    afs: Implement YFS support in the fs client
    afs: Expand data structure fields to support YFS
    afs: Get the target vnode in afs_rmdir() and get a callback on it
    afs: Calc callback expiry in op reply delivery
    afs: Fix FS.FetchStatus delivery from updating wrong vnode
    afs: Implement the YFS cache manager service
    afs: Remove callback details from afs_callback_break struct
    afs: Commit the status on a new file/dir/symlink
    afs: Increase to 64-bit volume ID and 96-bit vnode ID for YFS
    afs: Don't invoke the server to read data beyond EOF
    afs: Add a couple of tracepoints to log I/O errors
    afs: Handle EIO from delivery function
    afs: Fix TTL on VL server and address lists
    afs: Implement VL server rotation
    afs: Improve FS server rotation error handling
    ...

    Linus Torvalds
     
  • This reverts a series committed earlier due to a null pointer exception
    bug report in [1]. It seems there are edge case interactions that I did
    not consider and will need some time to understand what causes the
    adverse interactions.

    The original series can be found in [2] with a follow up series in [3].

    [1] https://www.spinics.net/lists/cgroups/msg20719.html
    [2] https://lore.kernel.org/lkml/20180911184137.35897-1-dennisszhou@gmail.com/
    [3] https://lore.kernel.org/lkml/20181020185612.51587-1-dennis@kernel.org/

    This reverts the following commits:
    d459d853c2ed, b2c3fa546705, 101246ec02b5, b3b9f24f5fcc, e2b0989954ae,
    f0fcb3ec89f3, c839e7a03f92, bdc2491708c4, 74b7c02a9bc1, 5bf9a1f3b4ef,
    a7b39b4e961c, 07b05bcc3213, 49f4c2dc2b50, 27e6fa996c53

    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

31 Oct, 2018

2 commits

  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below and then
    semi-automated removal of duplicated '#include <linux/memblock.h>':

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • rq_qos_exit() removes the current q->rq_qos; this has to be done
    after the queue is frozen, otherwise the IO queue path may never be
    woken up and an IO hang results.

    So fix this issue by moving rq_qos_exit() to after the queue is
    frozen.

    Cc: Josef Bacik
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
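
    The resulting ordering in blk_cleanup_queue() looks roughly like this
    (simplified sketch):

        blk_set_queue_dying(q);
        blk_freeze_queue(q);    /* drain in-flight IO first */
        rq_qos_exit(q);         /* safe now: nothing can be sleeping in rq-qos */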
     

29 Oct, 2018

1 commit

  • There are two cases when handling DISCARD merge.
    If max_discard_segments == 1, the bios/requests need to be contiguous
    to merge. If max_discard_segments > 1, every bio is taken as a range
    and different ranges need not be contiguous.

    But now attempt_merge screws this up. It always considers contiguity
    for DISCARD in the case max_discard_segments > 1 and cannot merge
    contiguous DISCARDs in the case max_discard_segments == 1, because
    rq_attempt_discard_merge always returns false in that case.
    This patch fixes both of the two cases above.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
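
    The distinction can be captured in a small predicate; a sketch of the
    logic (the naming here is illustrative, the patch itself open-codes
    this in blk-merge.c):

        /* Sketch: multi-range discards merge as separate ranges and do
         * not require contiguity; single-range discards fall back to the
         * normal contiguity check. */
        static inline bool blk_discard_mergable(struct request *req)
        {
                return req_op(req) == REQ_OP_DISCARD &&
                       queue_max_discard_segments(req->q) > 1;
        }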
     

27 Oct, 2018

2 commits

  • Merge updates from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - most of MM

    * emailed patches from Andrew Morton: (132 commits)
    hugetlbfs: dirty pages as they are added to pagecache
    mm: export add_swap_extent()
    mm: split SWP_FILE into SWP_ACTIVATED and SWP_FS
    tools/testing/selftests/vm/map_fixed_noreplace.c: add test for MAP_FIXED_NOREPLACE
    mm: thp: relocate flush_cache_range() in migrate_misplaced_transhuge_page()
    mm: thp: fix mmu_notifier in migrate_misplaced_transhuge_page()
    mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition
    mm/kasan/quarantine.c: make quarantine_lock a raw_spinlock_t
    mm/gup: cache dev_pagemap while pinning pages
    Revert "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"
    mm: return zero_resv_unavail optimization
    mm: zero remaining unavailable struct pages
    tools/testing/selftests/vm/gup_benchmark.c: add MAP_HUGETLB option
    tools/testing/selftests/vm/gup_benchmark.c: add MAP_SHARED option
    tools/testing/selftests/vm/gup_benchmark.c: allow user specified file
    tools/testing/selftests/vm/gup_benchmark.c: fix 'write' flag usage
    mm/gup_benchmark.c: add additional pinning methods
    mm/gup_benchmark.c: time put_page()
    mm: don't raise MEMCG_OOM event due to failed high-order allocation
    mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock
    ...

    Linus Torvalds
     
  • There are several definitions of those functions/macros in places that
    mess with fixed-point load averages. Provide an official version.

    [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
    Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Suren Baghdasaryan
    Tested-by: Daniel Drake
    Cc: Christopher Lameter
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Mike Galbraith
    Cc: Peter Enderborg
    Cc: Randy Dunlap
    Cc: Shakeel Butt
    Cc: Tejun Heo
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
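
    For reference, these are the classic fixed-point load-average helpers
    being consolidated (11 bits of fractional precision):

        #define FSHIFT   11                  /* nr of bits of precision */
        #define FIXED_1  (1 << FSHIFT)       /* 1.0 as fixed-point */
        #define LOAD_INT(x)  ((x) >> FSHIFT)
        #define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)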
     

26 Oct, 2018

10 commits

  • Since commit 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios
    detection"), a scenario is defined asymmetric when one of the
    following conditions holds:
    - active bfq_queues have different weights
    - one or more groups of entities (bfq_queues or other groups of
    entities) are active
    bfq grants fairness and low latency also in such asymmetric scenarios,
    by plugging the dispatching of I/O if the bfq_queue in service happens
    to be temporarily idle. This plugging may lower throughput, so it is
    important to do it only when strictly needed.

    By mistake, in commit 2d29c9f89fcd ("block, bfq: improve asymmetric
    scenarios detection") the num_active_groups counter was first
    incremented and subsequently decremented at any entity (group or
    bfq_queue) weight change.

    This is useless, because only transitions from active to inactive and
    vice versa matter for that counter. Unfortunately this is also
    incorrect in the following case: the entity at issue is a bfq_queue
    and it is under weight raising. In fact in this case there is a
    spurious increment of the num_active_groups counter.

    This spurious increment may cause scenarios to be wrongly detected as
    asymmetric, thus causing useless plugging and loss of throughput.

    This commit fixes this issue by simply removing the above useless and
    wrong increments and decrements.

    Fixes: 2d29c9f89fcd ("block, bfq: improve asymmetric scenarios detection")
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Federico Motta
     
  • trace_block_getrq() is meant to indicate that a request struct has
    been allocated for the queue, so put it in the right place.

    Reviewed-by: Jianchao Wang
    Signed-off-by: Xiaoguang Wang
    Signed-off-by: Jens Axboe

    Xiaoguang Wang
     
  • Drivers exposing zoned block devices have to initialize and maintain
    correctness (i.e. revalidate) of the device zone bitmaps attached to
    the device request queue (seq_zones_bitmap and seq_zones_wlock).

    To simplify coding this, introduce the generic helper function
    blk_revalidate_disk_zones(), suitable for most (and likely all) cases.
    This new function always updates the seq_zones_bitmap and seq_zones_wlock
    bitmaps as well as the queue nr_zones field when called for a disk
    using a request-based queue. For a disk using a BIO-based queue, only
    the number of zones is updated since these queues do not have
    schedulers and so do not need the zone bitmaps.

    With this change, the zone bitmap initialization code in sd_zbc.c can be
    replaced with a call to this function in sd_zbc_read_zones(), which is
    called from the disk revalidate block operation method.

    A call to blk_revalidate_disk_zones() is also added to the null_blk
    driver for devices created with the zoned mode enabled.

    Finally, to ensure that zoned devices created with dm-linear or
    dm-flakey expose the correct number of zones through sysfs, a call to
    blk_revalidate_disk_zones() is added to dm_table_set_restrictions().

    The zone bitmaps allocated and initialized with
    blk_revalidate_disk_zones() are freed automatically from
    __blk_release_queue() using the block internal function
    blk_queue_free_zone_bitmaps().

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
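
    A driver-side usage sketch, assuming the helper takes the gendisk and
    returns an errno:

        /* Sketch: revalidate zone bitmaps after a zone-configuration or
         * capacity change. */
        ret = blk_revalidate_disk_zones(disk);
        if (ret)
                pr_warn("%s: failed to revalidate zones (%d)\n",
                        disk->disk_name, ret);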
     
  • Dispatching a report zones command through the request queue is a major
    pain due to the necessary rewriting of the command reply payload. Given
    that blkdev_report_zones() executes everything synchronously, implement
    report zones as a block device file operation instead, allowing major
    simplification of the code in many places.

    sd, null-blk, dm-linear and dm-flakey, being the only block device
    drivers supporting zoned block devices, are modified to provide the
    device-side implementation of the report_zones() block device file
    operation.

    For device mappers, a new report_zones() target type operation is
    defined so that calls to blkdev_report_zones() from the upper block
    layer can be propagated down to the underlying devices of the dm
    targets. Implementation of this new operation is added to the
    dm-linear and dm-flakey targets.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Christoph Hellwig
    [Damien]
    * Changed method block_device argument to gendisk
    * Various bug fixes and improvements
    * Added support for null_blk, dm-linear and dm-flakey.
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Mike Snitzer
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Christoph Hellwig
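
    A sketch of how a driver wires up the new file operation; the driver
    name and handler below are hypothetical:

        /* Sketch: expose a report_zones handler via block_device_operations. */
        static const struct block_device_operations mydrv_fops = {
                .owner        = THIS_MODULE,
                .report_zones = mydrv_report_zones,  /* hypothetical handler */
        };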
     
  • Expose through sysfs the nr_zones field of struct request_queue.
    Exposing this value helps in debugging disk issues as well as
    facilitating scripts based use of the disk (e.g. blktests).

    For zoned block devices, the nr_zones field indicates the total number
    of zones of the device calculated using the known disk capacity and
    zone size. This number of zones is always 0 for regular block devices.

    Since nr_zones is defined conditionally with CONFIG_BLK_DEV_ZONED,
    introduce the blk_queue_nr_zones() function to return the correct value
    for any device, regardless of whether CONFIG_BLK_DEV_ZONED is set.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
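
    A minimal sketch of such an accessor, assuming nr_zones only exists
    under CONFIG_BLK_DEV_ZONED:

        /* Sketch: always-safe accessor for the queue's zone count. */
        static inline unsigned int blk_queue_nr_zones(struct request_queue *q)
        {
        #ifdef CONFIG_BLK_DEV_ZONED
                return blk_queue_is_zoned(q) ? q->nr_zones : 0;
        #else
                return 0;
        #endif
        }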
     
  • There is no need to synchronously execute all REQ_OP_ZONE_RESET BIOs
    necessary to reset a range of zones. Similarly to what is done for
    discard BIOs in blk-lib.c, all zone reset BIOs can be chained and
    executed asynchronously and a synchronous call done only for the last
    BIO of the chain.

    Modify blkdev_reset_zones() to operate similarly to
    blkdev_issue_discard() using the next_bio() helper for chaining BIOs. To
    avoid code duplication of that function in blk_zoned.c, rename
    next_bio() into blk_next_bio() and declare it as a block internal
    function in blk.h.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
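
    The chaining helper is small; roughly:

        /* Sketch: allocate the next bio of a chain, chaining and
         * submitting the previous one asynchronously. */
        struct bio *blk_next_bio(struct bio *bio, unsigned int nr_pages, gfp_t gfp)
        {
                struct bio *new = bio_alloc(gfp, nr_pages);

                if (bio) {
                        bio_chain(bio, new);
                        submit_bio(bio);
                }
                return new;
        }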
     
  • Get a zoned block device total number of zones. The device can be a
    partition of the whole device. The number of zones is always 0 for
    regular block devices.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Get a zoned block device zone size in number of 512 B sectors.
    The zone size is always 0 for regular block devices.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
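
    From userspace, this value and the total zone count from the previous
    entry can be queried with the new ioctls; a minimal sketch (error
    handling omitted, fd assumed open on the block device):

        #include <linux/blkzoned.h>
        #include <sys/ioctl.h>

        __u32 zone_sectors = 0, nr_zones = 0;
        ioctl(fd, BLKGETZONESZ, &zone_sectors); /* zone size in 512 B sectors */
        ioctl(fd, BLKGETNZONES, &nr_zones);     /* total number of zones */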
     
  • There is no point in allocating more zone descriptors than the number
    of zones a block device has for doing a zone report. Avoid doing that
    in blkdev_report_zones_ioctl() by limiting the number of zone
    descriptors allocated internally to process the user request.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Introduce the blkdev_nr_zones() helper function to get the total
    number of zones of a zoned block device. This number is always 0 for a
    regular block device (q->limits.zoned == BLK_ZONED_NONE case).

    Replace hard-coded number of zones calculation in dmz_get_zoned_device()
    with a call to this helper.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
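
    A sketch of the helper, assuming the zone size is a power of two and
    rounding up to account for an eventual smaller last zone:

        /* Sketch: number of zones covering the (partition) device. */
        unsigned int blkdev_nr_zones(struct block_device *bdev)
        {
                struct request_queue *q = bdev_get_queue(bdev);
                sector_t zone_sectors = blk_queue_zone_sectors(q);

                if (!blk_queue_is_zoned(q))
                        return 0;
                return (bdev->bd_part->nr_sects + zone_sectors - 1)
                        >> ilog2(zone_sectors);
        }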
     

24 Oct, 2018

1 commit

  • Use accessor functions to access an iterator's type and direction. This
    allows for the possibility of using some other method of determining the
    type of iterator than if-chains with bitwise-AND conditions.

    Signed-off-by: David Howells

    David Howells
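
    The accessors look roughly like this, splitting the packed type field
    into its type and direction parts:

        /* Sketch: read the iterator type and data direction through
         * accessors instead of open-coded bitwise tests. */
        static inline enum iter_type iov_iter_type(const struct iov_iter *i)
        {
                return i->type & ~(READ | WRITE);
        }

        static inline unsigned char iov_iter_rw(const struct iov_iter *i)
        {
                return i->type & (READ | WRITE);
        }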
     

23 Oct, 2018

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main pull request for block changes for 4.20. This
    contains:

    - Series enabling runtime PM for blk-mq (Bart).

    - Two pull requests from Christoph for NVMe, with items such as:
    - Better AEN tracking
    - Multipath improvements
    - RDMA fixes
    - Rework of FC for target removal
    - Fixes for issues identified by static checkers
    - Fabric cleanups, as prep for TCP transport
    - Various cleanups and bug fixes

    - Block merging cleanups (Christoph)

    - Conversion of drivers to generic DMA mapping API (Christoph)

    - Series fixing ref count issues with blkcg (Dennis)

    - Series improving BFQ heuristics (Paolo, et al)

    - Series improving heuristics for the Kyber IO scheduler (Omar)

    - Removal of dangerous bio_rewind_iter() API (Ming)

    - Apply single queue IPI redirection logic to blk-mq (Ming)

    - Set of fixes and improvements for bcache (Coly et al)

    - Series closing a hotplug race with sysfs group attributes (Hannes)

    - Set of patches for lightnvm:
    - pblk trace support (Hans)
    - SPDX license header update (Javier)
    - Tons of refactoring patches to cleanly abstract the 1.2 and 2.0
    specs behind a common core interface. (Javier, Matias)
    - Enable pblk to use a common interface to retrieve chunk metadata
    (Matias)
    - Bug fixes (Various)

    - Set of fixes and updates to the blk IO latency target (Josef)

    - blk-mq queue number updates fixes (Jianchao)

    - Convert a bunch of drivers from the old legacy IO interface to
    blk-mq. This will conclude with the removal of the legacy IO
    interface itself in 4.21, with the rest of the drivers (me, Omar)

    - Removal of the DAC960 driver. The SCSI tree will introduce two
    replacement drivers for this (Hannes)"

    * tag 'for-4.20/block-20181021' of git://git.kernel.dk/linux-block: (204 commits)
    block: setup bounce bio_sets properly
    blkcg: reassociate bios when make_request() is called recursively
    blkcg: fix edge case for blk_get_rl() under memory pressure
    nvme-fabrics: move controller options matching to fabrics
    nvme-rdma: always have a valid trsvcid
    mtip32xx: fully switch to the generic DMA API
    rsxx: switch to the generic DMA API
    umem: switch to the generic DMA API
    sx8: switch to the generic DMA API
    sx8: remove dead IF_64BIT_DMA_IS_POSSIBLE code
    skd: switch to the generic DMA API
    ubd: remove use of blk_rq_map_sg
    nvme-pci: remove duplicate check
    drivers/block: Remove DAC960 driver
    nvme-pci: fix hot removal during error handling
    nvmet-fcloop: suppress a compiler warning
    nvme-core: make implicit seed truncation explicit
    nvmet-fc: fix kernel-doc headers
    nvme-fc: rework the request initialization code
    nvme-fc: introduce struct nvme_fcp_op_w_sgl
    ...

    Linus Torvalds
     

22 Oct, 2018

1 commit

  • We're only setting up the bounce bio sets if we happen
    to need bouncing for regular HIGHMEM, not if we only need
    it for ISA devices.

    Protect the ISA bounce setup with a mutex, since it's
    being invoked from driver init functions and can thus be
    called in parallel.

    Cc: stable@vger.kernel.org
    Reported-by: Ondrej Zary
    Tested-by: Ondrej Zary
    Signed-off-by: Jens Axboe

    Jens Axboe
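
    A sketch of the mutex-protected lazy setup (the pool allocation
    details are elided):

        /* Sketch: serialize ISA pool setup, which drivers can reach
         * concurrently from their init paths. */
        static DEFINE_MUTEX(isa_mutex);

        int init_emergency_isa_pool(void)
        {
                mutex_lock(&isa_mutex);
                if (mempool_initialized(&isa_page_pool)) {
                        mutex_unlock(&isa_mutex);
                        return 0;
                }
                /* ... allocate isa_page_pool and the bounce bio_sets ... */
                mutex_unlock(&isa_mutex);
                return 0;
        }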
     

21 Oct, 2018

1 commit

  • When submitting a bio, multiple recursive calls to make_request() may
    occur. This causes the initial association done in
    blkcg_bio_issue_check() to be incorrect, referencing the prior
    request_queue. This introduces a helper to do reassociation when
    make_request() is recursively called.

    Fixes: a7b39b4e961c ("blkcg: always associate a bio with a blkg")
    Reported-by: Valdis Kletnieks
    Signed-off-by: Dennis Zhou
    Tested-by: Valdis Kletnieks
    Signed-off-by: Jens Axboe

    Dennis Zhou
     

18 Oct, 2018

1 commit

  • blk_queue_split() does respect this limit via bio splitting, so there
    is no need to do that in blkdev_issue_discard(); we can then align
    with the normal bio submission path (bio_add_page() & submit_bio()).

    More importantly, this patch fixes an issue introduced in a22c4d7e34402cc
    ("block: re-add discard_granularity and alignment checks"), in which
    a zero-length discard bio may be generated in case of zero alignment.

    Fixes: a22c4d7e34402ccdf3 ("block: re-add discard_granularity and alignment checks")
    Cc: stable@vger.kernel.org
    Cc: Ming Lin
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Tested-by: Mariusz Dabrowski
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
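
    The per-bio cap in __blkdev_issue_discard() then only needs to respect
    the maximum bio size, roughly:

        /* Sketch: cap by bio size only; discard granularity and alignment
         * are handled later by blk_queue_split(). */
        unsigned int req_sects = min_t(sector_t, nr_sects, UINT_MAX >> 9);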
     

14 Oct, 2018

5 commits

  • When we try to increase nr_hw_queues, we may fail due to a shortage
    of memory or some other reason; blk_mq_realloc_hw_ctxs then stops and
    some entries in q->queue_hw_ctx are left NULL. However, because the
    queue map has been updated with the new nr_hw_queues, some CPUs may be
    mapped to a hw queue that just encountered an allocation failure, so
    blk_mq_map_queue could return NULL. This will cause a panic in the
    following blk_mq_map_swqueue.

    To fix it, when increasing nr_hw_queues fails, fall back to the
    previous nr_hw_queues and post a warning. At the same time, a driver's
    .map_queues usually uses completion irq affinity to map hw queues to
    CPUs; after falling back, some CPUs would be left unmapped, so use the
    default blk_mq_map_queues to do that.

    Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
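
    A sketch of the fallback, with illustrative naming:

        /* Sketch: if growing nr_hw_queues failed partway, restore the old
         * count and rebuild a default map that covers every CPU. */
        if (q->nr_hw_queues != set->nr_hw_queues) {
                pr_warn("blk-mq: increasing nr_hw_queues failed, falling back\n");
                set->nr_hw_queues = prev_nr_hw_queues;
                blk_mq_map_queues(set);
        }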
     
  • When the hw queues and mq_map are updated, a hctx could be mapped to
    a different NUMA node. At that point, we need to realloc the hctx. If
    that fails, keep using the previous hctx.

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     
  • blk_mq_realloc_hw_ctxs could be invoked while updating the hw queues.
    At that moment, IO is blocked. Change the gfp flags from GFP_KERNEL to
    GFP_NOIO to avoid a forever hang during memory allocation in
    blk_mq_realloc_hw_ctxs.

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
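
    The allocations in question end up looking roughly like:

        /* Sketch: IO is blocked while hw queues are updated, so the
         * allocation must not recurse into the IO path. */
        hctx = kzalloc_node(sizeof(*hctx), GFP_NOIO, node);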
     
  • blk-mq debugfs and sysfs entries need to be removed before updating
    the queue map, otherwise we get wrong results there. This patch fixes
    that and removes the redundant debugfs and sysfs register/unregister
    operations during __blk_mq_update_nr_hw_queues.

    Signed-off-by: Jianchao Wang
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jianchao Wang
     
  • bfq defines as asymmetric a scenario where an active entity, say E
    (representing either a single bfq_queue or a group of other entities),
    has a higher weight than some other entities. If the entity E does sync
    I/O in such a scenario, then bfq plugs the dispatch of the I/O of the
    other entities in the following situation: E is in service but
    temporarily has no pending I/O request. In fact, without this plugging,
    every time E stops being temporarily idle, it may find the
    internal queues of the storage device already filled with an
    out-of-control number of extra requests from other entities. So E may
    have to wait for the service of these extra requests, before finally
    having its own requests served. This may easily break service
    guarantees, with E getting less than its fair share of the device
    throughput. Usually, the end result is that E gets the same fraction of
    the throughput as the other entities, instead of getting more, according
    to its higher weight.

    Yet there are two other more subtle cases where E, even if its weight is
    actually equal to or even lower than the weight of any other active
    entities, may get less than its fair share of the throughput in case the
    above I/O plugging is not performed:
    1. other entities issue larger requests than E;
    2. other entities contain more active child entities than E (or in
    general tend to have more backlog than E).

    In the first case, other entities may get more service than E because
    they get larger requests than those of E served during the temporary
    idle periods of E. In the second case, other entities get more service
    because, by having many child entities, they have many requests ready
    for dispatching while E is temporarily idle.

    This commit addresses this issue by extending the definition of
    asymmetric scenario: a scenario is asymmetric when
    - active entities representing bfq_queues have differentiated weights,
    as in the original definition
    or (inclusive)
    - one or more entities representing groups of entities are active.

    This broader definition makes sure that I/O plugging will be performed
    in all the above cases, provided that there is at least one active
    group. Of course, this definition is very coarse, so it will trigger
    I/O plugging also in cases where it is not needed, such as, e.g.,
    multiple active entities with just one child each, and all with the same
    I/O-request size. The reason for this coarse definition is just that a
    finer-grained definition would be rather heavy to compute.

    On the opposite end, even this new definition does not trigger I/O
    plugging in all cases where there is no active group, and all bfq_queues
    have the same weight. So, in these cases some unfairness may occur if
    there are asymmetries in I/O-request sizes. We made this choice because
    I/O plugging may lower throughput, and probably a user that has not
    created any group cares more about throughput than about perfect
    fairness. At any rate, as for possible applications that may care about
    service guarantees, bfq already guarantees a high responsiveness and a
    low latency to soft real-time applications automatically.

    Signed-off-by: Federico Motta
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Federico Motta
     

12 Oct, 2018

2 commits

  • Tetsuo brought to my attention that I screwed up the scale_up/scale_down
    helpers when I factored out the rq-qos code. We need to wake up all the
    waiters when we add slots for requests, not when we shrink the slots.
    Otherwise we'll end up with things waiting forever. This was a mistake
    and simply puts everything back the way it was.

    cc: stable@vger.kernel.org
    Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt")
    Reported-by: Tetsuo Handa
    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • BFQ is already doing a similar thing in its .pd_offline_fn() method
    implementation.

    While it seems that, after commit 4c6994806f70
    ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
    was reverted, leaving these pointers intact no longer causes crashes,
    clearing them is still a sensible thing to do to make the code more
    robust.

    Signed-off-by: Maciej S. Szmigiero
    Signed-off-by: Jens Axboe

    Maciej S. Szmigiero
     

11 Oct, 2018

1 commit

  • 'default n' is the default value for any bool or tristate Kconfig
    setting, so there is no need to write it explicitly.

    Also since commit f467c5640c29 ("kconfig: only write '# CONFIG_FOO
    is not set' for visible symbols") the Kconfig behavior is the same
    regardless of 'default n' being present or not:

    ...
    One side effect of (and the main motivation for) this change is making
    the following two definitions behave exactly the same:

    config FOO
    bool

    config FOO
    bool
    default n

    With this change, neither of these will generate a
    '# CONFIG_FOO is not set' line (assuming FOO isn't selected/implied).
    That might make it clearer to people that a bare 'default n' is
    redundant.
    ...

    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Jens Axboe

    Bartlomiej Zolnierkiewicz
     

09 Oct, 2018

1 commit

  • Lots of controllers may have only one irq vector for completing IO
    requests. And usually the affinity of that only irq vector is all
    possible CPUs; however, on most architectures, there may be only one
    specific CPU for handling this interrupt.

    So if all IOs are completed in hardirq context, it is inevitable to
    degrade IO performance because of increased irq latency.

    This patch tries to address this issue by allowing requests to be
    completed in softirq context, like the legacy IO path.

    IOPS is observed to improve by ~13% in the following randread test on
    raid0 over virtio-scsi.

    mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi

    fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k

    Cc: Dongli Zhang
    Cc: Zach Marano
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Jianchao Wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei