21 Apr, 2020

1 commit

  • Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]"
    argument of the iocost_ioc_vrate_adj trace entry defined in
    include/trace/events/iocost.h leading to the following error:

    /tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8:
    error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token
    , u32[]* __tracepoint_arg_missed_ppm

    That argument type is indeed rather complex and hard to read. Looking
    at block/blk-iocost.c. It is just a 2-entry u32 array. By simplifying
    the argument to a simple "u32 *missed_ppm" and adjusting the trace
    entry accordingly, the compilation error was gone.

    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Acked-by: Steven Rostedt (VMware)
    Acked-by: Tejun Heo
    Signed-off-by: Waiman Long
    Signed-off-by: Jens Axboe

    Waiman Long
     

17 Apr, 2020

1 commit


16 Apr, 2020

1 commit


11 Apr, 2020

1 commit

  • Pull block fixes from Jens Axboe:
    "Here's a set of fixes that should go into this merge window. This
    contains:

    - NVMe pull request from Christoph with various fixes

    - Better discard support for loop (Evan)

    - Only call ->commit_rqs() if we have queued IO (Keith)

    - blkcg offlining fixes (Tejun)

    - fix (and fix the fix) for busy partitions"

    * tag 'block-5.7-2020-04-10' of git://git.kernel.dk/linux-block:
    block: fix busy device checking in blk_drop_partitions again
    block: fix busy device checking in blk_drop_partitions
    nvmet-rdma: fix double free of rdma queue
    blk-mq: don't commit_rqs() if none were queued
    nvme-fc: Revert "add module to ops template to allow module references"
    nvme: fix deadlock caused by ANA update wrong locking
    nvmet-rdma: fix bonding failover possible NULL deref
    loop: Better discard support for block devices
    loop: Report EOPNOTSUPP properly
    nvmet: fix NULL dereference when removing a referral
    nvme: inherit stable pages constraint in the mpath stack device
    blkcg: don't offline parent blkcg first
    blkcg: rename blkcg->cgwb_refcnt to ->online_pin and always use it
    nvme-tcp: fix possible crash in recv error flow
    nvme-tcp: don't poll a non-live queue
    nvme-tcp: fix possible crash in write_zeroes processing
    nvmet-fc: fix typo in comment
    nvme-rdma: Replace comma with a semicolon
    nvme-fcloop: fix deallocation of working context
    nvme: fix compat address handling in several ioctls

    Linus Torvalds
     

10 Apr, 2020

1 commit


08 Apr, 2020

1 commit

  • bd_super is only set by get_tree_bdev and mount_bdev, and thus not by
    other openers like btrfs or the XFS realtime and log devices, as well as
    block devices directly opened from user space. Check bd_openers
    instead.

    Fixes: 77032ca66f86 ("Return EBUSY from BLKRRPART for mounted whole-dev fs")
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 Apr, 2020

1 commit


03 Apr, 2020

1 commit

  • Pull SCSI updates from James Bottomley:
    "This series has a huge amount of churn because it pulls in Mauro's doc
    update changing all our txt files to rst ones.

    Excluding that, we have the usual driver updates (qla2xxx, ufs, lpfc,
    zfcp, ibmvfc, pm80xx, aacraid), a treewide update for scnprintf and
    some other minor updates.

    The major core change is Hannes moving functions out of the aacraid
    driver and into the core"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (223 commits)
    scsi: aic7xxx: aic97xx: Remove FreeBSD-specific code
    scsi: ufs: Do not rely on prefetched data
    scsi: dc395x: remove dc395x_bios_param
    scsi: libiscsi: Fix error count for active session
    scsi: hpsa: correct race condition in offload enabled
    scsi: message: fusion: Replace zero-length array with flexible-array member
    scsi: qedi: Add PCI shutdown handler support
    scsi: qedi: Add MFW error recovery process
    scsi: ufs: Enable block layer runtime PM for well-known logical units
    scsi: ufs-qcom: Override devfreq parameters
    scsi: ufshcd: Let vendor override devfreq parameters
    scsi: ufshcd: Update the set frequency to devfreq
    scsi: ufs: Resume ufs host before accessing ufs device
    scsi: ufs-mediatek: customize the delay for enabling host
    scsi: ufs: make HCE polling more compact to improve initialization latency
    scsi: ufs: allow custom delay prior to host enabling
    scsi: ufs-mediatek: use common delay function
    scsi: ufs: introduce common and flexible delay function
    scsi: ufs: use an enum for host capabilities
    scsi: ufs: fix uninitialized tx_lanes in ufshcd_disable_tx_lcc()
    ...

    Linus Torvalds
     

02 Apr, 2020

3 commits

  • Pull trivial tree updates from Jiri Kosina:
    "My attempt to revitalize trivial queue I've been neglecting for years
    (what a disaster that was for this world, right? :) ) with patches
    collected from backlog that were still relevant and not applied
    elsewhere in the meantime"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    err.h: remove deprecated PTR_RET for good
    blk-mq: Fix typo in comment
    x86/boot: Fix comment spelling
    sh: mach-highlander: Fix comment spelling
    s390/dasd: Fix comment spelling
    mfd: wm8994: Fix comment spelling
    docs: Add reference in binfmt-misc.rst
    genirq: fix kerneldoc comment for irq_desc
    drm/amdgpu: fix two documentation mismatch issues
    HID: fix Kconfig word ordering
    list/hashtable: minor documentation corrections.

    Linus Torvalds
     
  • blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs
    don't get offlined while there are active cgwbs on them. However, it
    ends up making offlining unordered sometimes causing parents to be
    offlined before children.

    Let's fix this by making child blkcgs pin the parents' online states.

    Note that pin/unpin names are chosen over get/put intentionally
    because css uses get/put online for something different.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg->cgwb_refcnt is used to delay blkcg offlining so that blkgs
    don't get offlined while there are active cgwbs on them. However, it
    ends up making offlining unordered sometimes causing parents to be
    offlined before children.

    To fix it, we want child blkcgs to pin the parents' online states
    turning the refcnt into a more generic online pinning mechanism.

    In prepartion,

    * blkcg->cgwb_refcnt -> blkcg->online_pin
    * blkcg_cgwb_get/put() -> blkcg_pin/unpin_online()
    * Take them out of CONFIG_CGROUP_WRITEBACK

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

31 Mar, 2020

3 commits

  • Pull EFI updates from Ingo Molnar:
    "The EFI changes in this cycle are much larger than usual, for two
    (positive) reasons:

    - The GRUB project is showing signs of life again, resulting in the
    introduction of the generic Linux/UEFI boot protocol, instead of
    x86 specific hacks which are increasingly difficult to maintain.
    There's hope that all future extensions will now go through that
    boot protocol.

    - Preparatory work for RISC-V EFI support.

    The main changes are:

    - Boot time GDT handling changes

    - Simplify handling of EFI properties table on arm64

    - Generic EFI stub cleanups, to improve command line handling, file
    I/O, memory allocation, etc.

    - Introduce a generic initrd loading method based on calling back
    into the firmware, instead of relying on the x86 EFI handover
    protocol or device tree.

    - Introduce a mixed mode boot method that does not rely on the x86
    EFI handover protocol either, and could potentially be adopted by
    other architectures (if another one ever surfaces where one
    execution mode is a superset of another)

    - Clean up the contents of 'struct efi', and move out everything that
    doesn't need to be stored there.

    - Incorporate support for UEFI spec v2.8A changes that permit
    firmware implementations to return EFI_UNSUPPORTED from UEFI
    runtime services at OS runtime, and expose a mask of which ones are
    supported or unsupported via a configuration table.

    - Partial fix for the lack of by-VA cache maintenance in the
    decompressor on 32-bit ARM.

    - Changes to load device firmware from EFI boot service memory
    regions

    - Various documentation updates and minor code cleanups and fixes"

    * 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    efi/libstub/arm: Fix spurious message that an initrd was loaded
    efi/libstub/arm64: Avoid image_base value from efi_loaded_image
    partitions/efi: Fix partition name parsing in GUID partition entry
    efi/x86: Fix cast of image argument
    efi/libstub/x86: Use ULONG_MAX as upper bound for all allocations
    efi: Fix a mistype in comments mentioning efivar_entry_iter_begin()
    efi/libstub: Avoid linking libstub/lib-ksyms.o into vmlinux
    efi/x86: Preserve %ebx correctly in efi_set_virtual_address_map()
    efi/x86: Ignore the memory attributes table on i386
    efi/x86: Don't relocate the kernel unless necessary
    efi/x86: Remove extra headroom for setup block
    efi/x86: Add kernel preferred address to PE header
    efi/x86: Decompress at start of PE image load address
    x86/boot/compressed/32: Save the output address instead of recalculating it
    efi/libstub/x86: Deal with exit() boot service returning
    x86/boot: Use unsigned comparison for addresses
    efi/x86: Avoid using code32_start
    efi/x86: Make efi32_pe_entry() more readable
    efi/x86: Respect 32-bit ABI in efi32_pe_entry()
    efi/x86: Annotate the LOADED_IMAGE_PROTOCOL_GUID with SYM_DATA
    ...

    Linus Torvalds
     
  • Pull block driver updates from Jens Axboe:

    - floppy driver cleanup series from Willy

    - NVMe updates and fixes (Various)

    - null_blk trace improvements (Chaitanya)

    - bcache fixes (Coly)

    - md fixes (via Song)

    - loop block size change optimizations (Martijn)

    - scnprintf() use (Takashi)

    * tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block: (81 commits)
    null_blk: add trace in null_blk_zoned.c
    null_blk: add tracepoint helpers for zoned mode
    block: add a zone condition debug helper
    nvme: cleanup namespace identifier reporting in nvme_init_ns_head
    nvme: rename __nvme_find_ns_head to nvme_find_ns_head
    nvme: refactor nvme_identify_ns_descs error handling
    nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl
    nvme-rdma: Add warning on state change failure at nvme_rdma_setup_ctrl
    nvme: Fix controller creation races with teardown flow
    nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl
    nvme: Fix ctrl use-after-free during sysfs deletion
    nvme-pci: Re-order nvme_pci_free_ctrl
    nvme: Remove unused return code from nvme_delete_ctrl_sync
    nvme: Use nvme_state_terminal helper
    nvme: release ida resources
    nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO
    nvmet-tcp: optimize tcp stack TX when data digest is used
    nvme-fabrics: Use scnprintf() for avoiding potential buffer overflow
    nvme-multipath: do not reset on unknown status
    nvmet-rdma: allocate RW ctxs according to mdts
    ...

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:

    - Online capacity resizing (Balbir)

    - Number of hardware queue change fixes (Bart)

    - null_blk fault injection addition (Bart)

    - Cleanup of queue allocation, unifying the node/no-node API
    (Christoph)

    - Cleanup of genhd, moving code to where it makes sense (Christoph)

    - Cleanup of the partition handling code (Christoph)

    - disk stat fixes/improvements (Konstantin)

    - BFQ improvements (Paolo)

    - Various fixes and improvements

    * tag 'for-5.7/block-2020-03-29' of git://git.kernel.dk/linux-block: (72 commits)
    block: return NULL in blk_alloc_queue() on error
    block: move bio_map_* to blk-map.c
    Revert "blkdev: check for valid request queue before issuing flush"
    block: simplify queue allocation
    bcache: pass the make_request methods to blk_queue_make_request
    null_blk: use blk_mq_init_queue_data
    block: add a blk_mq_init_queue_data helper
    block: move the ->devnode callback to struct block_device_operations
    block: move the part_stat* helpers from genhd.h to a new header
    block: move block layer internals out of include/linux/genhd.h
    block: move guard_bio_eod to bio.c
    block: unexport get_gendisk
    block: unexport disk_map_sector_rcu
    block: unexport disk_get_part
    block: mark part_in_flight and part_in_flight_rw static
    block: mark block_depr static
    block: factor out requeue handling from dispatch code
    block/diskstats: replace time_in_queue with sum of request times
    block/diskstats: accumulate all per-cpu counters in one pass
    block/diskstats: more accurate approximation of io_ticks for slow disks
    ...

    Linus Torvalds
     

30 Mar, 2020

1 commit

  • This patch fixes follwoing warning:

    block/blk-core.c: In function ‘blk_alloc_queue’:
    block/blk-core.c:558:10: warning: returning ‘int’ from a function with return type ‘struct request_queue *’ makes pointer from integer without a cast [-Wint-conversion]
    return -EINVAL;

    Fixes: 3d745ea5b095a ("block: simplify queue allocation")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     

28 Mar, 2020

5 commits

  • Add a helper to stringify the zone conditions. We use this helper in the
    next patch to track zone conditions in tracepoints.

    Reviewed-by: Damien Le Moal
    Signed-off-by: Chaitanya Kulkarni
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     
  • The bio_map_* helpers are just the low-level helpers for the
    blk_rq_map_* APIs. Move them together for better logical grouping,
    as no there isn't much overlap with other code in bio.c.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This reverts commit f10d9f617a65905c556c3b37c9b9646ae7d04ed7.

    We can't have queues without a make_request_fn any more (and the
    loop device uses blk-mq these days anyway..).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Current make_request based drivers use either blk_alloc_queue_node or
    blk_alloc_queue to allocate a queue, and then set up the make_request_fn
    function pointer and a few parameters using the blk_queue_make_request
    helper. Simplify this by passing the make_request pointer to
    blk_alloc_queue, and while at it merge the _node variant into the main
    helper by always passing a node_id, and remove the superfluous gfp_mask
    parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This allows a driver to pass a queuedata member before ->init_hctx is
    called. null_blk currently open codes this logic, but I'd rather have
    it in the core to ease future maintainance.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

27 Mar, 2020

1 commit


25 Mar, 2020

12 commits

  • These macros are just used by a few files. Move them out of genhd.h,
    which is included everywhere into a new standalone header.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • None of this needs to be exposed to drivers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This is bio layer functionality and not related to buffer heads.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • get_gendisk is not used by any modular code.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • disk_map_sector_rcu is not used by any modular code.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • disk_get_part is not used by any modular code.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Factor out the requeue handling from the dispatch code, this will make
    subsequent addition of different requeueing schemes easier.

    Signed-off-by: Johannes Thumshirn
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Johannes Thumshirn
     
  • Column "time_in_queue" in diskstats is supposed to show total waiting time
    of all requests. I.e. value should be equal to the sum of times from other
    columns. But this is not true, because column "time_in_queue" is counted
    separately in jiffies rather than in nanoseconds as other times.

    This patch removes redundant counter for "time_in_queue" and shows total
    time of read, write, discard and flush requests.

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     
  • Reading /proc/diskstats iterates over all cpus for summing each field.
    It's faster to sum all fields in one pass.

    Hammering /proc/diskstats with fio shows 2x performance improvement:

    fio --name=test --numjobs=$JOBS --filename=/proc/diskstats \
    --size=1k --bs=1k --fallocate=none --create_on_open=1 \
    --time_based=1 --runtime=10 --invalidate=0 --group_report

    JOBS=1 JOBS=10
    Before: 7k iops 64k iops
    After: 18k iops 120k iops

    Also this way code is more compact:

    add/remove: 1/0 grow/shrink: 0/2 up/down: 194/-1540 (-1346)
    Function old new delta
    part_stat_read_all - 194 +194
    diskstats_show 1344 631 -713
    part_stat_show 1219 392 -827
    Total: Before=14966947, After=14965601, chg -0.01%

    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     
  • Currently io_ticks is approximated by adding one at each start and end of
    requests if jiffies counter has changed. This works perfectly for requests
    shorter than a jiffy or if one of requests starts/ends at each jiffy.

    If disk executes just one request at a time and they are longer than two
    jiffies then only first and last jiffies will be accounted.

    Fix is simple: at the end of request add up into io_ticks jiffies passed
    since last update rather than just one jiffy.

    Example: common HDD executes random read 4k requests around 12ms.

    fio --name=test --filename=/dev/sdb --rw=randread --direct=1 --runtime=30 &
    iostat -x 10 sdb

    Note changes of iostat's "%util" 8,43% -> 99,99% before/after patch:

    Before:

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
    sdb 0,00 0,00 82,60 0,00 330,40 0,00 8,00 0,96 12,09 12,09 0,00 1,02 8,43

    After:

    Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
    sdb 0,00 0,00 82,50 0,00 330,00 0,00 8,00 1,00 12,10 12,10 0,00 12,12 99,99

    Now io_ticks does not loose time between start and end of requests, but
    for queue-depth > 1 some I/O time between adjacent starts might be lost.

    For load estimation "%util" is not as useful as average queue length,
    but it clearly shows how often disk queue is completely empty.

    Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
    Signed-off-by: Konstantin Khlebnikov
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Konstantin Khlebnikov
     

24 Mar, 2020

7 commits