25 Jul, 2017

1 commit

  • We already do this for PCI mappings, and the higher level code now
    expects that CPU on/offlining doesn't have an effect on the queue
    mappings.

    Signed-off-by: Christoph Hellwig
    Tested-by: Max Gurtovoy
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 Jul, 2017

1 commit

  • The blk-mq code lacks support for looking at the rpm_status field, tracking
    active requests and the RQF_PM flag.

    Due to the default switch to blk-mq for SCSI, people have started to run
    into suspend / resume issues because of this, so make sure we disable the
    runtime PM functionality until it is properly implemented.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

12 Jul, 2017

3 commits

  • There are mq devices (e.g., virtio-blk, nbd and loopback) which don't
    invoke blk_mq_run_hw_queues() after the completion of a request.
    If bfq is enabled on these devices and the slice_idle attribute or
    strict_guarantees attribute is set to zero, it is possible that
    after a request completion the remaining requests of a busy bfq queue
    will be stalled in the bfq scheduler until a new request arrives.

    To fix the scheduler latency problem, we need to check whether or not
    all issued requests have completed and dispatch more requests to the
    driver if there is no request in the driver.

    The problem can be reproduced by running the following script
    on a virtio-blk device with nr_hw_queues as 1:

    #!/bin/sh

    dev=vdb
    # mount point for dev
    mp=/tmp/mnt
    cd $mp

    job=strict.job
    cat <<EOF > $job
    [global]
    direct=1
    bs=4k
    size=256M
    rw=write
    ioengine=libaio
    iodepth=128
    runtime=5
    time_based

    [1]
    filename=1.data

    [2]
    new_group
    filename=2.data
    EOF

    echo bfq > /sys/block/$dev/queue/scheduler
    echo 1 > /sys/block/$dev/queue/iosched/strict_guarantees
    fio $job

    Signed-off-by: Hou Tao
    Reviewed-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Hou Tao
     
  • The start time of an eligible entity should be less than or equal to
    the current virtual time, and an entity in the idle tree has a finish
    time greater than the current virtual time.

    Signed-off-by: Hou Tao
    Reviewed-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Hou Tao
     
  • Pull more block updates from Jens Axboe:
    "This is a followup for block changes, that didn't make the initial
    pull request. It's a bit of a mixed bag, this contains:

    - A followup pull request from Sagi for NVMe. Outside of fixups for
    NVMe, it also includes a series for ensuring that we properly
    quiesce hardware queues when browsing live tags.

    - Set of integrity fixes from Dmitry (mostly), fixing various issues
    for folks using DIF/DIX.

    - Fix for a bug introduced in cciss, with the req init changes. From
    Christoph.

    - Fix for a bug in BFQ, from Paolo.

    - Two followup fixes for lightnvm/pblk from Javier.

    - Depth fix from Ming for blk-mq-sched.

    - Also from Ming, performance fix for mtip32xx that was introduced
    with the dynamic initialization of commands"

    * 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
    block: call bio_uninit in bio_endio
    nvmet: avoid unneeded assignment of submit_bio return value
    nvme-pci: add module parameter for io queue depth
    nvme-pci: compile warnings in nvme_alloc_host_mem()
    nvmet_fc: Accept variable pad lengths on Create Association LS
    nvme_fc/nvmet_fc: revise Create Association descriptor length
    lightnvm: pblk: remove unnecessary checks
    lightnvm: pblk: control I/O flow also on tear down
    cciss: initialize struct scsi_req
    null_blk: fix error flow for shared tags during module_init
    block: Fix __blkdev_issue_zeroout loop
    nvme-rdma: unconditionally recycle the request mr
    nvme: split nvme_uninit_ctrl into stop and uninit
    virtio_blk: quiesce/unquiesce live IO when entering PM states
    mtip32xx: quiesce request queues to make sure no submissions are inflight
    nbd: quiesce request queues to make sure no submissions are inflight
    nvme: kick requeue list when requeueing a request instead of when starting the queues
    nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
    nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
    nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
    ...

    Linus Torvalds
     

11 Jul, 2017

1 commit

  • bio_free isn't a good place to free cgroup info. There are a
    lot of cases where a bio is allocated in a special way (for example, on
    the stack) and never gets freed by bio_put, hence bio_free, so we are
    leaking memory. This patch moves the free to bio endio, which should be
    called anyway. The bio_uninit call in bio_free is kept, in case bio endio
    never gets called for the bio.

    This assumes ->bi_end_io() doesn't access cgroup info, which seems true
    in my audit.

    This along with Christoph's integrity patch should fix the memory leak
    issue.

    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

07 Jul, 2017

1 commit

  • Pull misc compat stuff updates from Al Viro:
    "This part is basically untangling various compat stuff. Compat
    syscalls moved to their native counterparts, getting rid of quite a
    bit of double-copying and/or set_fs() uses. A lot of field-by-field
    copyin/copyout killed off.

    - kernel/compat.c is much closer to containing just the
    copyin/copyout of compat structs. Not all compat syscalls are gone
    from it yet, but it's getting there.

    - ipc/compat_mq.c killed off completely.

    - block/compat_ioctl.c cleaned up; floppy compat ioctls moved to
    drivers/block/floppy.c where they belong. Yes, there are several
    drivers that implement some of the same ioctls. Some are m68k and
    one is 32bit-only pmac. drivers/block/floppy.c is the only one in
    that bunch that can be built on biarch"

    * 'misc.compat' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    mqueue: move compat syscalls to native ones
    usbdevfs: get rid of field-by-field copyin
    compat_hdio_ioctl: get rid of set_fs()
    take floppy compat ioctls to sodding floppy.c
    ipmi: get rid of field-by-field __get_user()
    ipmi: get COMPAT_IPMICTL_RECEIVE_MSG in sync with the native one
    rt_sigtimedwait(): move compat to native
    select: switch compat_{get,put}_fd_set() to compat_{get,put}_bitmap()
    put_compat_rusage(): switch to copy_to_user()
    sigpending(): move compat to native
    getrlimit()/setrlimit(): move compat to native
    times(2): move compat to native
    compat_{get,put}_bitmap(): use unsafe_{get,put}_user()
    fb_get_fscreeninfo(): don't bother with do_fb_ioctl()
    do_sigaltstack(): lift copying to/from userland into callers
    take compat_sys_old_getrlimit() to native syscall
    trim __ARCH_WANT_SYS_OLD_GETRLIMIT

    Linus Torvalds
     

06 Jul, 2017

1 commit

  • The BIO issuing loop in __blkdev_issue_zeroout() is allocating BIOs
    with a maximum number of bvec (pages) equal to

    min(nr_sects, (sector_t)BIO_MAX_PAGES)

    This works since the requested number of bvecs will always be limited
    to the absolute maximum number supported (BIO_MAX_PAGES), but this is
    inefficient as too many bvec entries may be requested due to the
    different units being used in the min() operation (number of sectors vs
    number of pages).
    To fix this, introduce the helper __blkdev_sectors_to_bio_pages() to
    correctly calculate the number of bvecs for zeroout BIOs as the issuing
    loop progresses. The calculation is done using consistent units and
    makes sure that the number of pages returned is at least 1 (for cases
    where the number of sectors is less than the number of sectors in
    a page).

    Also remove a trailing space after the bit shift in the internal loop
    min() call.

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Damien Le Moal
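
To make the unit mismatch described in the zeroout fix concrete, here is a minimal Python sketch of the calculation. It is not the kernel's C helper: the function names are hypothetical, and the constants (512-byte sectors, 4 KiB pages, a BIO_MAX_PAGES of 256) are assumptions chosen for illustration.

```python
SECTOR_SIZE = 512      # assumed sector size in bytes
PAGE_SIZE = 4096       # assumed page size in bytes
BIO_MAX_PAGES = 256    # assumed upper bound on bvecs per bio

SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE


def old_bvec_count(nr_sects: int) -> int:
    """Previous behaviour: compares a sector count against a page count."""
    return min(nr_sects, BIO_MAX_PAGES)


def sectors_to_bio_pages(nr_sects: int) -> int:
    """Round sectors up to whole pages, then cap at BIO_MAX_PAGES."""
    pages = (nr_sects + SECTORS_PER_PAGE - 1) // SECTORS_PER_PAGE
    return min(pages, BIO_MAX_PAGES)


# Print: sectors, bvecs requested before the fix, bvecs requested after.
for nr_sects in (1, 8, 16, 4096):
    print(nr_sects, old_bvec_count(nr_sects), sectors_to_bio_pages(nr_sects))
```

Under these assumptions, a 16-sector zeroout would request 16 bvecs with the mixed-unit min() but only 2 with the page-based calculation, while large requests are capped at the same maximum either way.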
     

05 Jul, 2017

1 commit

  • block/bio-integrity.c:318:10-11: WARNING: return of 0/1 in function 'bio_integrity_prep' with return type bool

    Return statements in functions returning bool should use
    true/false instead of 1/0.
    Generated by: scripts/coccinelle/misc/boolreturn.cocci

    Fixes: e23947bd76f0 ("bio-integrity: fold bio_integrity_enabled to bio_integrity_prep")
    CC: Dmitry Monakhov
    Signed-off-by: Fengguang Wu
    Signed-off-by: Jens Axboe

    kbuild test robot
     

04 Jul, 2017

13 commits

  • Pull irq updates from Thomas Gleixner:
    "The irq department delivers:

    - Expand the generic infrastructure handling the irq migration on CPU
    hotplug and convert X86 over to it. (Thomas Gleixner)

    Aside of consolidating code this is a preparatory change for:

    - Finalizing the affinity management for multi-queue devices. The
    main change here is to shut down interrupts which are affine to an
    outgoing CPU and to reenable them when the CPU comes online again.
    That avoids moving interrupts pointlessly around and breaking and
    reestablishing affinities for no value. (Christoph Hellwig)

    Note: This contains also the BLOCK-MQ and NVME changes which depend
    on the rework of the irq core infrastructure. Jens acked them and
    agreed that they should go with the irq changes.

    - Consolidation of irq domain code (Marc Zyngier)

    - State tracking consolidation in the core code (Jeffy Chen)

    - Add debug infrastructure for hierarchical irq domains (Thomas
    Gleixner)

    - Infrastructure enhancement for managing generic interrupt chips via
    devmem (Bartosz Golaszewski)

    - Constification work all over the place (Tobias Klauser)

    - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

    - The usual set of fixes, updates and enhancements all over the
    place"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    irqchip/or1k-pic: Fix interrupt acknowledgement
    irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
    irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
    nvme: Allocate queues for all possible CPUs
    blk-mq: Create hctx for each present CPU
    blk-mq: Include all present CPUs in the default queue mapping
    genirq: Avoid unnecessary low level irq function calls
    genirq: Set irq masked state when initializing irq_desc
    genirq/timings: Add infrastructure for estimating the next interrupt arrival time
    genirq/timings: Add infrastructure to track the interrupt timings
    genirq/debugfs: Remove pointless NULL pointer check
    irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
    irqchip/gic-v3-its: Add ACPI NUMA node mapping
    irqchip/gic-v3-its-platform-msi: Make of_device_ids const
    irqchip/gic-v3-its: Make of_device_ids const
    irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
    irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
    dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
    genirq/irqdomain: Remove auto-recursive hierarchy support
    irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
    ...

    Linus Torvalds
     
  • And instead call directly into the integrity code from bio_end_io.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Currently ->verify_fn does not work at all because at the moment it is
    called, bio->bi_iter.bi_size == 0, so we do not iterate integrity bvecs
    at all.

    In order to perform verification we need to know the original data vector;
    with the new bvec rewind API this is trivial.

    testcase: https://github.com/dmonakhov/xfstests/commit/3c6509eaa83b9c17cd0bc95d73fcdd76e1c54a85

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dmitry Monakhov
    [hch: adopted for new status values]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     
  • Signed-off-by: Dmitry Monakhov
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     
  • Currently all integrity prep hooks are open-coded, and if prepare fails
    we ignore its code and fail the bio with EIO. Let's return the real error
    to the upper layer, so the caller may react accordingly.

    In fact no one wants to use bio_integrity_prep() w/o bio_integrity_enabled,
    so it is reasonable to fold it into one function.

    Signed-off-by: Dmitry Monakhov
    Reviewed-by: Martin K. Petersen
    [hch: merged with the latest block tree,
    return bool from bio_integrity_prep]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     
  • bio_integrity_trim inherited its interface from bio_trim and accepts
    offset and size, but this API is error prone because the data offset
    must always be in sync with the bio's data offset. That is why we have
    the integrity update hook in bio_advance().

    So the only meaningful values are: offset == 0, sectors == bio_sectors(bio).
    Let's just remove them completely.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     
  • SCSI drivers do care about bip_seed so we must update it accordingly.

    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     
  • Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Dmitry Monakhov
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     
  • When mq-deadline is used, IOPS of sequential read and
    sequential write is observed to drop by more than 20% on sata(scsi-mq)
    devices, compared with using the 'none' scheduler.

    The reason is that the default nr_requests for the scheduler is
    too big for small queue depth devices, and latency is increased
    a lot.

    Since the principle of taking 256 requests for the mq scheduler
    is based on a queue depth of 128, this patch changes it to
    double the size of min(hw queue_depth, 128).

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
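
The sizing rule from the mq-deadline commit above can be written out as a small Python sketch. The function and constant names are hypothetical, and the example depths (31, 128, 1024) are illustrative inputs rather than values taken from the commit.

```python
BASE_QUEUE_DEPTH = 128  # the queue depth the 256-request default was based on


def default_sched_nr_requests(hw_queue_depth: int) -> int:
    """Scheduler request count: double min(hw queue depth, 128)."""
    return 2 * min(hw_queue_depth, BASE_QUEUE_DEPTH)


# Print the scheduler depth for a shallow, a matching and a deep hardware queue.
for depth in (31, 128, 1024):
    print(depth, default_sched_nr_requests(depth))
```

A device with a shallow hardware queue therefore gets a proportionally smaller scheduler queue instead of a fixed 256, while deep queues keep the previous value.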
     
  • On each deactivation or re-scheduling (after being served) of a
    bfq_queue, BFQ invokes the function __bfq_entity_update_weight_prio(),
    to perform pending updates of ioprio, weight and ioprio class for the
    bfq_queue. BFQ also invokes this function on I/O-request dispatches,
    to raise or lower weights more quickly when needed, thereby improving
    latency. However, the entity representing the bfq_queue may be on the
    active (sub)tree of a service tree when this happens, and, although
    with a very low probability, the bfq_queue may happen to also have a
    pending change of its ioprio class. If both conditions hold when
    __bfq_entity_update_weight_prio() is invoked, then the entity moves to
    a sort of hybrid state: the new service tree for the entity, as
    returned by bfq_entity_service_tree(), differs from service tree on
    which the entity still is. The functions that handle activations and
    deactivations of entities do not cope with such a hybrid state (and
    would need to become more complex to cope).

    This commit addresses this issue by just making
    __bfq_entity_update_weight_prio() not perform also a possible pending
    change of ioprio class, when invoked on an I/O-request dispatch for a
    bfq_queue. Such a change is thus postponed to when
    __bfq_entity_update_weight_prio() is invoked on deactivation or
    re-scheduling of the bfq_queue.

    Reported-by: Marco Piazza
    Reported-by: Laurentiu Nicola
    Signed-off-by: Paolo Valente
    Tested-by: Marco Piazza
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 years of incremental scheduler code
    history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds
     
  • Pull core block/IO updates from Jens Axboe:
    "This is the main pull request for the block layer for 4.13. Not a huge
    round in terms of features, but there's a lot of churn related to some
    core cleanups.

    Note this depends on the UUID tree pull request, that Christoph
    already sent out.

    This pull request contains:

    - A series from Christoph, unifying the error/stats codes in the
    block layer. We now use blk_status_t everywhere, instead of using
    different schemes for different places.

    - Also from Christoph, some cleanups around request allocation and IO
    scheduler interactions in blk-mq.

    - And yet another series from Christoph, cleaning up how we handle
    and do bounce buffering in the block layer.

    - A blk-mq debugfs series from Bart, further improving on the support
    we have for exporting internal information to aid debugging IO
    hangs or stalls.

    - Also from Bart, a series that cleans up the request initialization
    differences across types of devices.

    - A series from Goldwyn Rodrigues, allowing the block layer to return
    failure if we will block and the user asked for non-blocking.

    - Patch from Hannes for supporting setting loop devices block size to
    that of the underlying device.

    - Two series of patches from Javier, fixing various issues with
    lightnvm, particularly around pblk.

    - A series from me, adding support for write hints. This comes with
    NVMe support as well, so applications can help guide data placement
    on flash to improve performance, latencies, and write
    amplification.

    - A series from Ming, improving and hardening blk-mq support for
    stopping/starting and quiescing hardware queues.

    - Two pull requests for NVMe updates. Nothing major on the feature
    side, but lots of cleanups and bug fixes. From the usual crew.

    - A series from Neil Brown, greatly improving the bio rescue set
    support. Most notably, this kills the bio rescue work queues, if we
    don't really need them.

    - Lots of other little bug fixes that are all over the place"

    * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
    lightnvm: pblk: set line bitmap check under debug
    lightnvm: pblk: verify that cache read is still valid
    lightnvm: pblk: add initialization check
    lightnvm: pblk: remove target using async. I/Os
    lightnvm: pblk: use vmalloc for GC data buffer
    lightnvm: pblk: use right metadata buffer for recovery
    lightnvm: pblk: schedule if data is not ready
    lightnvm: pblk: remove unused return variable
    lightnvm: pblk: fix double-free on pblk init
    lightnvm: pblk: fix bad le64 assignations
    nvme: Makefile: remove dead build rule
    blk-mq: map all HWQ also in hyperthreaded system
    nvmet-rdma: register ib_client to not deadlock in device removal
    nvme_fc: fix error recovery on link down.
    nvmet_fc: fix crashes on bad opcodes
    nvme_fc: Fix crash when nvme controller connection fails.
    nvme_fc: replace ioabort msleep loop with completion
    nvme_fc: fix double calls to nvme_cleanup_cmd()
    nvme-fabrics: verify that a controller returns the correct NQN
    nvme: simplify nvme_dev_attrs_are_visible
    ...

    Linus Torvalds
     
  • Pull uuid subsystem from Christoph Hellwig:
    "This is the new uuid subsystem, in which Amir, Andy and I have started
    consolidating our uuid/guid helpers and improving the types used for
    them. Note that various other subsystems have pulled in this tree, so
    I'd like it to go in early.

    UUID/GUID summary:

    - introduce the new uuid_t/guid_t types that are going to replace the
    somewhat confusing uuid_be/uuid_le types and make the terminology
    fit the various specs, as well as the userspace libuuid library.
    (me, based on a previous version from Amir)

    - consolidated generic uuid/guid helper functions lifted from XFS and
    libnvdimm (Amir and me)

    - conversions to the new types and helpers (Amir, Andy and me)"

    * tag 'uuid-for-4.13' of git://git.infradead.org/users/hch/uuid: (34 commits)
    ACPI: hns_dsaf_acpi_dsm_guid can be static
    mmc: sdhci-pci: make guid intel_dsm_guid static
    uuid: Take const on input of uuid_is_null() and guid_is_null()
    thermal: int340x_thermal: fix compile after the UUID API switch
    thermal: int340x_thermal: Switch to use new generic UUID API
    acpi: always include uuid.h
    ACPI: Switch to use generic guid_t in acpi_evaluate_dsm()
    ACPI / extlog: Switch to use new generic UUID API
    ACPI / bus: Switch to use new generic UUID API
    ACPI / APEI: Switch to use new generic UUID API
    acpi, nfit: Switch to use new generic UUID API
    MAINTAINERS: add uuid entry
    tmpfs: generate random sb->s_uuid
    scsi_debug: switch to uuid_t
    nvme: switch to uuid_t
    sysctl: switch to use uuid_t
    partitions/ldm: switch to use uuid_t
    overlayfs: use uuid_t instead of uuid_be
    fs: switch ->s_uuid to uuid_t
    ima/policy: switch to use uuid_t
    ...

    Linus Torvalds
     

30 Jun, 2017

2 commits


29 Jun, 2017

4 commits

  • This patch performs sequential mapping between CPUs and queues.
    In case the system has more CPUs than HWQs then there are still
    CPUs to map to HWQs. In a hyperthreaded system, map the unmapped CPUs
    and their siblings to the same HWQ.
    This actually fixes a bug that left HWQs unmapped in a system with
    2 sockets, 18 cores per socket, 2 threads per core (total 72 CPUs)
    running NVMEoF (opens up to a maximum of 64 HWQs).

    Performance results running fio (72 jobs, 128 iodepth)
    using null_blk (with/without patch):

    bs     IOPS(read submit_queues=72)  IOPS(write submit_queues=72)  IOPS(read submit_queues=24)  IOPS(write submit_queues=24)
    -----  ---------------------------  ----------------------------  ---------------------------  ----------------------------
    512    4890.4K/4723.5K              4524.7K/4324.2K               4280.2K/4264.3K              3902.4K/3909.5K
    1k     4910.1K/4715.2K              4535.8K/4309.6K               4296.7K/4269.1K              3906.8K/3914.9K
    2k     4906.3K/4739.7K              4526.7K/4330.6K               4301.1K/4262.4K              3890.8K/3900.1K
    4k     4918.6K/4730.7K              4556.1K/4343.6K               4297.6K/4264.5K              3886.9K/3893.9K
    8k     4906.4K/4748.9K              4550.9K/4346.7K               4283.2K/4268.8K              3863.4K/3858.2K
    16k    4903.8K/4782.6K              4501.5K/4233.9K               4292.3K/4282.3K              3773.1K/3773.5K
    32k    4885.8K/4782.4K              4365.9K/4184.2K               4307.5K/4289.4K              3780.3K/3687.3K
    64k    4822.5K/4762.7K              2752.8K/2675.1K               4308.8K/4312.3K              2651.5K/2655.7K
    128k   2388.5K/2313.8K              1391.9K/1375.7K               2142.8K/2152.2K              1395.5K/1374.2K

    Signed-off-by: Max Gurtovoy
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Max Gurtovoy
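
The mapping policy described in the commit above can be sketched in Python as follows. This is an illustration of the description, not the kernel code: the function names are hypothetical, and the sibling numbering (CPU n and CPU n + 36 being threads of the same core) is an assumed topology for the 72-CPU example.

```python
def map_queues(nr_cpus, nr_queues, first_sibling):
    """Return a list mapping each CPU index to a hardware queue index."""
    cpu_to_queue = [0] * nr_cpus
    for cpu in range(nr_cpus):
        if cpu < nr_queues:
            # Sequential mapping: CPU n gets queue n.
            cpu_to_queue[cpu] = cpu % nr_queues
        else:
            sibling = first_sibling(cpu)
            if sibling == cpu:
                # No earlier sibling thread, so wrap around the queues.
                cpu_to_queue[cpu] = cpu % nr_queues
            else:
                # Share the queue already assigned to the sibling thread.
                cpu_to_queue[cpu] = cpu_to_queue[sibling]
    return cpu_to_queue


NR_CORES = 36  # 2 sockets x 18 cores


def first_sibling(cpu):
    # Assumed topology: CPU n and CPU n + 36 are threads of the same core.
    return cpu % NR_CORES


mapping = map_queues(nr_cpus=72, nr_queues=64, first_sibling=first_sibling)
unmapped_queues = set(range(64)) - set(mapping)
print(len(unmapped_queues), mapping[64], mapping[71])
```

With this topology every one of the 64 queues has at least one CPU, and the eight leftover CPUs share a queue with their sibling threads.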
     
  • Wen reports significant memory leaks with DIF and O_DIRECT:

    "With nvme device + T10 enabled, On a system it has 256GB and started
    logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
    it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
    leaking.

    /proc/meminfo | grep SUnreclaim...

    SUnreclaim: 6752128 kB
    SUnreclaim: 6874880 kB
    SUnreclaim: 7238080 kB
    ....
    SUnreclaim: 22307264 kB
    SUnreclaim: 22485888 kB
    SUnreclaim: 22720256 kB

    When testcases with T10 enabled call into __blkdev_direct_IO_simple,
    code doesn't free memory allocated by bio_integrity_alloc. The patch
    fixes the issue. HTX has been run with +60 hours without failure."

    Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
    doesn't go through the regular bio free. This means that any ancillary
    data allocated with the bio through the stack is not freed. Hence, we
    can leak the integrity data associated with the bio, if the device is
    using DIF/DIX.

    Fix this by providing a bio_uninit() and export it, so that we can use
    it to free this data. Note that this is a minimal fix for this issue.
    Any current user of bio's that are allocated outside of
    bio_alloc_bioset() suffers from this issue, most notably some drivers.
    We will fix those in a more comprehensive patch for 4.13. This also
    means that the commit marked as being fixed by this isn't the real
    culprit, it's just the most obvious one out there.

    Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
    Reported-by: Wen Xiong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
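
The figures quoted in the leak report above can be cross-checked with a few lines of Python. This sketch assumes that the first and last SUnreclaim samples shown bracket the one-hour window mentioned in the report; the variable names are illustrative.

```python
first_kb = 6752128    # first SUnreclaim sample quoted in the report
last_kb = 22720256    # last SUnreclaim sample quoted in the report

growth_kb = last_kb - first_kb
growth_gib = growth_kb / (1024 * 1024)
# Assumes the two samples were taken one hour (60 minutes) apart.
rate_mib_per_min = growth_kb / 1024 / 60

print(growth_kb, round(growth_gib, 1), round(rate_mib_per_min))
```

The difference matches the 15968128 kB stated in the report, and the per-minute rate comes out close to the quoted figure of roughly 256 MB per minute.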
     
  • Currently we only create hctx for online CPUs, which can lead to a lot
    of churn due to frequent soft offline / online operations. Instead
    allocate one for each present CPU to avoid this and dramatically simplify
    the code.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     
  • This way we get a nice distribution independent of the current cpu
    online / offline state.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-2-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     

28 Jun, 2017

10 commits

  • This commit fixes a bug triggered by a non-trivial sequence of
    events. These events are briefly described in the next two
    paragraphs. The impatient, or those who are familiar with queue
    merging and splitting, can jump directly to the last paragraph.

    On each I/O-request arrival for a shared bfq_queue, i.e., for a
    bfq_queue that is the result of the merge of two or more bfq_queues,
    BFQ checks whether the shared bfq_queue has become seeky (i.e., if too
    many random I/O requests have arrived for the bfq_queue; if the device
    is non rotational, then random requests must be also small for the
    bfq_queue to be tagged as seeky). If the shared bfq_queue is actually
    detected as seeky, then a split occurs: the bfq I/O context of the
    process that has issued the request is redirected from the shared
    bfq_queue to a new non-shared bfq_queue. As a degenerate case, if the
    shared bfq_queue actually happens to be shared only by one process
    (because of previous splits), then no new bfq_queue is created: the
    state of the shared bfq_queue is just changed from shared to non
    shared.

    Regardless of whether a brand new non-shared bfq_queue is created, or
    the pre-existing shared bfq_queue is just turned into a non-shared
    bfq_queue, several parameters of the non-shared bfq_queue are set
    (restored) to the original values they had when the bfq_queue
    associated with the bfq I/O context of the process (that has just
    issued an I/O request) was merged with the shared bfq_queue. One of
    these parameters is the weight-raising state.

    If, on the split of a shared bfq_queue,
    1) a pre-existing shared bfq_queue is turned into a non-shared
    bfq_queue;
    2) the previously shared bfq_queue happens to be busy;
    3) the weight-raising state of the previously shared bfq_queue happens
    to change;
    the number of weight-raised busy queues changes. The field
    wr_busy_queues must then be updated accordingly, but such an update
    was missing. This commit adds the missing update.

    Reported-by: Luca Miccio
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
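
The missing counter update described in the last paragraph of the commit above can be modelled with a short Python sketch. The class and function names are hypothetical, and treating a coefficient greater than 1 as "weight-raised" is an assumption made for this illustration; the real BFQ data structures are considerably richer.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    busy: bool
    wr_coeff: int = 1  # assumed convention: > 1 means weight-raised


@dataclass
class Scheduler:
    wr_busy_queues: int = 0  # number of busy, weight-raised queues


def restore_wr_state(sched: Scheduler, queue: Queue, saved_wr_coeff: int) -> None:
    """Restore the saved weight-raising state and keep the counter in sync."""
    was_raised = queue.wr_coeff > 1
    queue.wr_coeff = saved_wr_coeff
    is_raised = queue.wr_coeff > 1
    if not queue.busy:
        return  # only busy queues are counted
    if not was_raised and is_raised:
        sched.wr_busy_queues += 1
    elif was_raised and not is_raised:
        sched.wr_busy_queues -= 1


sched = Scheduler(wr_busy_queues=1)
shared = Queue(busy=True, wr_coeff=30)
restore_wr_state(sched, shared, saved_wr_coeff=1)
print(sched.wr_busy_queues)
```

In this model, a busy queue that stops being weight-raised on a split decrements the counter, which corresponds to the update the commit says was missing.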
     
  • Instead move it to the callers. Those that either don't use bio_data() or
    page_address() or are specific to architectures that do not support highmem
    are skipped.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • And just move it into scsi_transport_sas which needs it due to low-level
    drivers directly dereferencing bio_data, and into blk_init_queue_node,
    which will need a further push into the callers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • For historical reasons we default to bouncing highmem pages for all block
    queues. But the blk-mq drivers are easy to audit to ensure that we don't
    need this - scsi and mtip32xx set explicit limits and everyone else doesn't
    have any particular ones.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • We only call blk_queue_bounce for request-based drivers, so stop messing
    with it for make_request based drivers.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Only used inside the bounce code, and opencoding it makes it more obvious
    what is going on.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This moves the knowledge about bouncing out of the callers into the
    block core (just like we do for the normal I/O path), and allows us to
    unexport blk_queue_bounce.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Useful to verify that things are working the way they should.
    Reading the file will return number of kb written with each
    write hint. Writing the file will reset the statistics. No care
    is taken to ensure that we don't race on updates.

    Drivers will write to q->write_hints[] if they handle a given
    write hint.

    Reviewed-by: Andreas Dilger
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • No functional changes in this patch, we just use up some holes
    in the bio and request structures to define a write hint that
    we pass down the stack.

    Ensure that we don't merge requests that have different life time
    hints assigned to them, and that we inherit the write hint when
    cloning a bio.

    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Jun, 2017

1 commit


23 Jun, 2017

1 commit