22 Nov, 2016

3 commits

  • If a ZBC device is partitioned and operations are performed on the
    partition, the zone information is rebased to the partition; the zone
    reset operation, however, is not mapped from the partition to the device
    the way other operations are.

    This leaves the API (report zones / reset zone) unbalanced in this
    regard. Checking for the zone reset op code explicitly balances the
    API.

    Signed-off-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Shaun Tancheff
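
    A minimal sketch of the kind of check described above. The placement and
    the variable names are assumptions for illustration, not the literal
    patch: a zone reset bio carries no payload, so a remap helper keyed only
    on bio_sectors() would skip it unless the op code is tested explicitly.

    /*
     * Remap partition-relative sectors to the whole device.  Zone reset
     * bios have no data, so bio_sectors() alone would miss them; check
     * the op code as well (sketch only).
     */
    if (bio_sectors(bio) || bio_op(bio) == REQ_OP_ZONE_RESET) {
            bio->bi_iter.bi_sector += part_start_sect;  /* rebase to device */
            bio->bi_bdev = whole_bdev;                  /* illustrative names */
    }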
     
  • Since commit 87374179 ("block: add a proper block layer data direction
    encoding") we only OR the new op and flags into bi_opf in bio_set_op_attrs
    instead of clearing the old value first. I've not seen any breakage with
    the new behavior, but it seems dangerous.

    Also convert it to an inline function to make the argument passing
    safer.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
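
    A sketch of the safer helper shape described here: an inline function
    that assigns bi_opf outright instead of OR-ing into whatever was there
    before (illustrative, not necessarily the exact upstream definition):

    static inline void bio_set_op_attrs(struct bio *bio, unsigned op,
                                        unsigned op_flags)
    {
            /* overwrite, so stale op/flag bits from an earlier value
             * cannot leak into the new setting */
            bio->bi_opf = op | op_flags;
    }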
     
  • This driver is both orphaned, and not really useful anymore. Mark
    it as such, and remove it in a future kernel after a release or
    two.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Nov, 2016

10 commits

  • With compilers that follow the C99 standard (like modern versions of
    gcc and clang), "extern inline" does the opposite of what it did with
    older versions of gcc: it emits code for an externally linkable version
    of the inline function.

    "static inline" gives the intended behavior in all cases.

    Description taken from commit 6d91857d4826 ("staging, rtl8192e,
    LLVMLinux: Change extern inline to static inline").

    This also fixes the following GCC warning when building with CONFIG_PM
    disabled:

    ./include/linux/blkdev.h:1143:20: warning: no previous prototype for 'blk_set_runtime_active' [-Wmissing-prototypes]

    Fixes: d07ab6d11477 ("block: Add blk_set_runtime_active()")
    Reviewed-by: Mika Westerberg
    Signed-off-by: Tobias Klauser
    Signed-off-by: Jens Axboe

    Tobias Klauser
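
    A small stand-alone illustration of the difference; the header and
    function below are hypothetical, not the blkdev.h stub itself:

    /* example.h -- included from two .c files that are linked together */

    /*
     * Under C99 semantics (modern gcc, clang) this form emits an external
     * definition in every translation unit that includes the header, so
     * the link fails with a duplicate symbol error:
     *
     *     extern inline int twice(int x) { return 2 * x; }
     */

    /* This form gives each translation unit its own private copy and
     * emits no external symbol, so it is safe everywhere: */
    static inline int twice(int x)
    {
            return 2 * x;
    }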
     
  • Drop duplicate header scatterlist.h from skd_main.c.

    Signed-off-by: Geliang Tang
    Signed-off-by: Jens Axboe

    Geliang Tang
     
  • This was documented in the original commit, 64f1c21e86f7, but it
    never made it into the proper location for queue sysfs files.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Similar to the simple fast path, but we now need a dio structure to
    track multiple-bio completions. It's basically a cut-down version
    of the new iomap-based direct I/O code for filesystems, but without
    all the logic to call into the filesystem for extent lookup or
    allocation, and without the complex I/O completion workqueue handler
    for AIO - instead we just use the FUA bit on the bios to ensure
    data is flushed to stable storage.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
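
    An illustrative, cut-down shape of the multi-bio completion tracking
    described above; the struct and function names are assumptions for the
    sketch, not the actual block device direct I/O code:

    struct blkdev_dio_sketch {
            struct kiocb    *iocb;      /* NULL for synchronous requests */
            atomic_t        ref;        /* one per in-flight bio, plus the submitter */
            int             error;
            bool            is_sync;
    };

    static void blkdev_dio_sketch_end_io(struct bio *bio)
    {
            struct blkdev_dio_sketch *dio = bio->bi_private;

            if (bio->bi_error)
                    dio->error = bio->bi_error;
            if (atomic_dec_and_test(&dio->ref)) {
                    /* last bio in: complete the aio here, or wake the
                     * synchronous submitter */
            }
            bio_put(bio);
    }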
     
  • Split the op setting code into a helper, use it in both places.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Just alloc the bio_vec array if we exceed the inline limit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The previous commit introduced the hybrid sleep/poll mode. Take
    that one step further, and use the completion latencies to
    automatically sleep for half the mean completion time. This is
    a good approximation.

    This changes the 'io_poll_delay' sysfs file a bit to expose the
    various options. Depending on the value, the polling code will
    behave differently:

    -1 Never enter hybrid sleep mode
    0 Use half of the completion mean for the sleep delay
    >0 Use this specific value as the sleep delay

    Signed-off-by: Jens Axboe
    Tested-By: Stephen Bates
    Reviewed-By: Stephen Bates

    Jens Axboe
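
    A compact sketch of how the three settings map to a sleep time before
    polling; the function and variable names are illustrative:

    static u64 hybrid_sleep_ns(long io_poll_delay, u64 mean_completion_ns)
    {
            if (io_poll_delay < 0)
                    return 0;                       /* -1: never sleep, pure polling */
            if (io_poll_delay == 0)
                    return mean_completion_ns / 2;  /*  0: half the observed mean */
            return io_poll_delay * NSEC_PER_USEC;   /* >0: fixed delay, in usecs */
    }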
     
  • This patch enables a hybrid polling mode. Instead of polling after IO
    submission, we can induce an artificial delay, and then poll after that.
    For example, if the IO is presumed to complete in 8 usecs from now, we
    can sleep for 4 usecs, wake up, and then do our polling. This still puts
    a sleep/wakeup cycle in the IO path, but instead of the wakeup happening
    after the IO has completed, it'll happen before. With this hybrid
    scheme, we can achieve big latency reductions while still using the same
    (or less) amount of CPU.

    Signed-off-by: Jens Axboe
    Tested-By: Stephen Bates
    Reviewed-By: Stephen Bates

    Jens Axboe
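
    The same flow, reduced to an illustrative sequence (the helpers are
    stand-ins, not blk-mq functions): sleep through roughly half of the
    expected completion window, then busy-poll for the remainder.

    /* IO expected to complete in ~8 usecs: sleep ~4 usecs, then poll */
    sleep_ns(expected_ns / 2);          /* illustrative sleep helper */
    while (!io_completed(cookie))       /* illustrative poll check   */
            cpu_relax();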
     
  • This patch adds a small and simple fast path for small direct I/O
    requests on block devices that don't use AIO. Between the neat
    bio_iov_iter_get_pages helper that avoids allocating a page array
    for get_user_pages and the on-stack bio and biovec, this avoids memory
    allocations and atomic operations entirely in the direct I/O code
    (lower levels might still do memory allocations and will usually
    have at least some atomic operations, though).

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Tested-By: Stephen Bates
    Reviewed-By: Stephen Bates

    Christoph Hellwig
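
    A trimmed sketch of the on-stack idea; error handling is omitted and
    the constants are illustrative, so this shows the technique rather than
    the exact upstream fast path:

    struct bio_vec inline_vecs[8];      /* small array, lives on the stack */
    struct bio bio;

    bio_init(&bio);
    bio.bi_max_vecs = ARRAY_SIZE(inline_vecs);
    bio.bi_io_vec = inline_vecs;

    /* pin the user pages straight into the on-stack biovec,
     * no page array allocation needed */
    ret = bio_iov_iter_get_pages(&bio, iter);

    submit_bio(&bio);
    /* ... wait for completion in place; there is no AIO path here ... */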
     
  • For writes, a completion can come in while we're still iterating
    the request and bio chain. If that happens, we're reading freed
    memory and we can crash.

    Break out after the last segment and avoid having the iterator
    read freed memory.

    Reviewed-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Jens Axboe
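
    The pattern the fix relies on, in illustrative form (the send helper is
    an assumption): note whether this is the final segment before handing
    it off, because the handoff may complete the request and free the chain
    the iterator is walking.

    rq_for_each_segment(bvec, req, iter) {
            bool is_last = !iter.bio->bi_next &&
                           bio_iter_last(bvec, iter.iter);

            send_segment(&bvec);        /* may trigger completion and free */
            if (is_last)
                    break;              /* don't let the iterator touch freed memory */
    }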
     

16 Nov, 2016

5 commits

  • The newly added writeback throttling code causes a harmless warning in
    some configurations:

    block/blk-wbt.c:250:1: error: ‘inline’ is not at beginning of declaration [-Werror=old-style-declaration]
    static bool inline stat_sample_valid(struct blk_rq_stat *stat)

    This makes it use the expected format for the declaration.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
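
    For reference, the expected form simply moves 'inline' before the
    return type:

    /* warned about: */
    static bool inline stat_sample_valid(struct blk_rq_stat *stat);

    /* expected form: */
    static inline bool stat_sample_valid(struct blk_rq_stat *stat);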
     
  • If CONFIG_NVM is disabled, loading the null_blk module with use_lightnvm=1
    fails, but nothing is logged and the behavior is not documented, so the
    cause of the failure is not obvious.

    Add an appropriate error message.

    Signed-off-by: Yasuaki Ishimatsu

    Massaged the text a bit.

    Signed-off-by: Jens Axboe

    Yasuaki Ishimatsu
     
  • In both the legacy and mq paths, the request count of the plug list is
    computed before allocating a request, so the number can be stale if we
    fall back to a sleeping allocation; the newly introduced wbt can sleep
    too.

    This patch handles that case by checking whether the plug list has
    become empty, and fixes the 'BUG: KASAN: stack-out-of-bounds' report
    introduced by Shaohua's patches for dispatching big requests.

    Fixes: 600271d900002 ("blk-mq: immediately dispatch big size request")
    Fixes: 50d24c34403c6 ("block: immediately dispatch big size request")
    Cc: Shaohua Li
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
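
    The shape of the recheck described above, as an illustrative fragment
    with the surrounding submission code elided:

    request_count = blk_plug_queued_count(q);   /* counted up front */

    /* ... allocate the request: this may sleep, and wbt may sleep too ... */

    /* the plug may have been flushed while we slept, so the earlier count
     * can be stale -- recheck the list itself */
    if (request_count && list_empty(&plug->mq_list))
            request_count = 0;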
     
  • Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
    specific values. No functional change.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Let's not depend on any of the BLK_MQ_RQ_QUEUE_* constants having
    specific values. No functional change.

    Signed-off-by: Omar Sandoval
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

15 Nov, 2016

3 commits

  • ->queue_rq() should return one of the BLK_MQ_RQ_QUEUE_* constants, not
    an errno.

    Fixes: f4aa4c7bbac6 ("block: loop: convert to per-device workqueue")
    Signed-off-by: Omar Sandoval
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Omar Sandoval
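
    The contract in one illustrative line: map internal failures to the
    named constants rather than returning a raw errno.

    /* wrong: leaks an errno where blk-mq expects a BLK_MQ_RQ_QUEUE_* value */
    return -EIO;

    /* right: */
    return BLK_MQ_RQ_QUEUE_ERROR;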
     
  • Normally, sd_read_capacity sets sdp->use_16_for_rw to 1 based on the
    disk capacity so that READ16/WRITE16 are used for large drives.
    However, for a zoned disk with RC_BASIS set to 0, the capacity reported
    through READ_CAPACITY may be very small, leading to use_16_for_rw not being
    set and READ10/WRITE10 commands being used, even after the actual zoned disk
    capacity is corrected in sd_zbc_read_zones. This causes LBA offset overflow for
    accesses beyond 2TB.

    As the ZBC standard makes it mandatory for ZBC drives to support
    the READ16/WRITE16 commands anyway, make sure that use_16_for_rw is set.

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Damien Le Moal
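
    A sketch of the kind of override described above; the exact statements
    and their placement are assumptions for illustration:

    /* ZBC mandates READ16/WRITE16 support, so force them regardless of
     * the (possibly RC_BASIS=0) capacity reported by READ CAPACITY */
    sdkp->device->use_16_for_rw = 1;
    sdkp->device->use_10_for_rw = 0;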
     
  • Avoid that sparse complains about unbalanced lock actions.

    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

12 Nov, 2016

6 commits


11 Nov, 2016

6 commits

  • Enable throttling of buffered writeback to make it a lot smoother,
    with far less impact on other system activity.
    Background writeback should be, by definition, background
    activity. The fact that we flush huge bundles of it at the time
    means that it potentially has heavy impacts on foreground workloads,
    which isn't ideal. We can't easily limit the sizes of writes that
    we do, since that would impact file system layout in the presence
    of delayed allocation. So just throttle back buffered writeback,
    unless someone is waiting for it.

    The algorithm for when to throttle takes its inspiration from the
    CoDel network scheduling algorithm. Like CoDel, blk-wb monitors
    the minimum latencies of requests over a window of time. In that
    window of time, if the minimum latency of any request exceeds a
    given target, then a scale count is incremented and the queue depth
    is shrunk. The next monitoring window is shrunk accordingly. Unlike
    CoDel, if we hit a window that exhibits good behavior, then we
    simply increment the scale count and re-calculate the limits for that
    scale value. This prevents us from oscillating between a
    close-to-ideal value and max all the time, instead remaining in the
    windows where we get good behavior.

    Unlike CoDel, blk-wb allows the scale count to go negative. This
    happens if we primarily have writes going on. Unlike positive
    scale counts, this doesn't change the size of the monitoring window.
    When the heavy writers finish, blk-wb quickly snaps back to its
    stable state of a zero scale count.

    The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
    target to be met. It defaults to 2 msec for non-rotational storage, and
    75 msec for rotational storage. Setting this value to '0' disables
    blk-wb. Generally, a user would not have to touch this setting.

    We don't enable WBT on devices that are managed with CFQ, and have
    a non-root block cgroup attached. If we have a proportional share setup
    on this particular disk, then the wbt throttling will interfere with
    that. We don't have a strong need for wbt for that case, since we will
    rely on CFQ doing that for us.

    Signed-off-by: Jens Axboe

    Jens Axboe
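
    A rough, self-contained sketch of the per-window evaluation described
    above; the names and the shift-based depth scaling are assumptions, not
    blk-wbt internals:

    struct wb_sketch {
            int             scale_step;     /* may go negative for write-heavy periods */
            unsigned int    depth;          /* currently allowed writeback depth */
            unsigned int    max_depth;
    };

    static void wb_window_expired(struct wb_sketch *wb, u64 min_lat_ns,
                                  u64 target_ns)
    {
            if (min_lat_ns > target_ns)
                    wb->scale_step++;       /* target missed: throttle harder */

            /* re-derive the allowed depth for the current scale; positive
             * steps shrink it, non-positive steps leave it at maximum */
            if (wb->scale_step > 0) {
                    wb->depth = wb->max_depth >> wb->scale_step;
                    if (!wb->depth)
                            wb->depth = 1;
            } else {
                    wb->depth = wb->max_depth;
            }
    }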
     
  • We can hook this up to the block layer, to help throttle buffered
    writes.

    wbt registers a few trace points that can be used to track what is
    happening in the system:

    wbt_lat: 259:0: latency 2446318
    wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
    wmean=518866, wmin=15522, wmax=5330353, wsamples=57
    wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32

    This shows a sync issue event (wbt_lat) that exceeded its time. wbt_stat
    dumps the current read/write stats for that window, and wbt_step shows a
    step down event where we now scale back writes. Each trace includes the
    device, 259:0 in this case.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • For legacy block, we simply track them in the request queue. For
    blk-mq, we track them on a per-sw queue basis, which we can then
    sum up through the hardware queues and finally to a per device
    state.

    The stats are tracked in, roughly, 0.1s interval windows.

    Add sysfs files to display the stats.

    The feature is off by default, to avoid any extra overhead. In-kernel
    users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
    flags. We currently don't turn it on if someone just reads any of
    the stats files, that is something we could add as well.

    Signed-off-by: Jens Axboe

    Jens Axboe
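
    An illustrative accumulator for one such ~0.1s window; this is the shape
    of the idea, not the kernel's struct blk_rq_stat:

    struct stat_window_sketch {
            u64 min_ns, max_ns, sum_ns;
            u32 nr_samples;
    };

    static void window_add_sample(struct stat_window_sketch *s, u64 lat_ns)
    {
            if (!s->nr_samples || lat_ns < s->min_ns)
                    s->min_ns = lat_ns;
            if (lat_ns > s->max_ns)
                    s->max_ns = lat_ns;
            s->sum_ns += lat_ns;
            s->nr_samples++;    /* mean = sum_ns / nr_samples at window end */
    }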
     
  • cfq_cpd_alloc() which is the cpd_alloc_fn implementation for cfq was
    incorrectly hard coding GFP_KERNEL instead of using the mask specified
    through the @gfp parameter. This currently doesn't cause any actual
    issues because all current callers specify GFP_KERNEL. Fix it.

    Signed-off-by: Tejun Heo
    Reported-by: Dan Carpenter
    Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
    Signed-off-by: Jens Axboe

    Tejun Heo
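
    The fix boils down to honoring the caller-supplied mask; the variable
    name below is illustrative:

    /* before: the @gfp argument was ignored */
    cpd = kzalloc(sizeof(*cpd), GFP_KERNEL);

    /* after: respect whatever the blkcg core passed in */
    cpd = kzalloc(sizeof(*cpd), gfp);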
     
  • We only need the status and result fields, and passing them explicitly
    makes life a lot easier for the Fibre Channel transport which doesn't
    have a full CQE for the fast path case.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This adds a shared per-request structure for all NVMe I/O. The structure
    is embedded as the first member of every NVMe transport driver's
    request-private data and allows common functionality to be implemented
    across the drivers.

    The first use is to replace the current abuse of the SCSI command
    passthrough fields in struct request for the NVMe command passthrough,
    but it will grow more fields over time to allow implementing things
    like common abort handlers.

    The passthrough commands are handled by having a pointer to the SQE
    (struct nvme_command) in struct nvme_request, and the union of the
    possible result fields, which had to be turned from an anonymous
    into a named union for that purpose. This avoids having to pass
    a reference to a full CQE around and thus makes checking the result
    a lot more lightweight.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
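
    The description corresponds roughly to a structure of this shape; the
    layout below is a sketch, and the actual commit is authoritative:

    union nvme_result_sketch {
            __le16  u16;
            __le32  u32;
            __le64  u64;
    };

    struct nvme_request_sketch {
            struct nvme_command         *cmd;       /* SQE for passthrough commands */
            union nvme_result_sketch    result;     /* completion result, no full CQE */
    };

    /* each transport embeds this first in its per-request private data, so
     * common code can reach it without knowing the transport's layout */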
     

10 Nov, 2016

2 commits

  • Building with W=1 shows a harmless warning for the skd driver:

    drivers/block/skd_main.c:2959:1: error: ‘static’ is not at beginning of declaration [-Werror=old-style-declaration]

    This changes the prototype to the expected formatting.

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     
  • As reported by gcc -Wmaybe-uninitialized, the cleanup path for
    skd_acquire_msix tries to free the already allocated msi-x vectors
    in reverse order, but the index variable may not have been
    used yet:

    drivers/block/skd_main.c: In function ‘skd_acquire_irq’:
    drivers/block/skd_main.c:3890:8: error: ‘i’ may be used uninitialized in this function [-Werror=maybe-uninitialized]

    This changes the failure path to skip releasing the interrupts
    if we have not started requesting them yet.

    Fixes: 180b0ae77d49 ("skd: use pci_alloc_irq_vectors")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Arnd Bergmann
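
    The general shape of such an unwind fix, in illustrative form (not the
    skd code itself): only release interrupts that were actually requested,
    so the loop index is always well defined when the error path reads it.

    int i, rc;

    for (i = 0; i < n_vectors; i++) {
            rc = request_one_vector(i);     /* illustrative helper */
            if (rc)
                    goto err_undo;          /* i == vectors requested so far */
    }
    return 0;

    err_undo:
            while (--i >= 0)                /* no-op if nothing was requested yet */
                    free_one_vector(i);
            return rc;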
     

09 Nov, 2016

1 commit

  • If we insert a flush request, we clear REQ_PREFLUSH and/or REQ_FUA,
    depending on flush settings. Since op_is_sync() factors those flags
    in for deciding whether this request is sync or not, we should
    set REQ_SYNC to avoid screwing up this accounting.

    This should be less fragile.

    Reported-by: Logan Gunthorpe
    Fixes: b685d3d65ac ("block: treat REQ_FUA and REQ_PREFLUSH as synchronous")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Nov, 2016

3 commits


07 Nov, 2016

1 commit

  • Hi Peter, hi Jens,

    I've been looking over the multi page bio vec work again recently, and
    one of the stumbling blocks is raw biovec access in the pktcdvd.

    The first issue is that it directly sets up the page and offset pointers
    in the biovec just before calling bio_add_page. As bio_add_page already
    does the setup it's trivial to just switch it to stack variables for the
    arguments.

    The second issue is the copy code in pkt_make_local_copy, which
    is effectively an open-coded version of bio_copy_data, except that it
    skips pages that are already the same in the source and destination.
    But looking at the only caller, we just set up the bio using bio_add_page
    to point exactly at the page array that pkt_make_local_copy compares,
    so the pages will always be the same and we can simply remove this function.

    Note that all this is done based on code inspection, I don't have any
    packet writing hardware myself.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig