09 Oct, 2018

1 commit

  • Lots of controllers may have only one irq vector for completing IO
    requests. Usually the affinity of that single irq vector spans all
    possible CPUs, yet on most architectures only one specific CPU actually
    handles the interrupt.

    So if all IOs are completed in hardirq context, IO performance
    inevitably degrades because of the increased irq latency.

    This patch addresses the issue by allowing requests to be completed in
    softirq context, as the legacy IO path does.

    IOPS improves by roughly 13% in the following randread test on raid0
    over virtio-scsi.

    mdadm --create --verbose /dev/md0 --level=0 --chunk=1024 --raid-devices=8 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi

    fio --time_based --name=benchmark --runtime=30 --filename=/dev/md0 \
        --nrfiles=1 --ioengine=libaio --iodepth=32 --direct=1 --invalidate=1 \
        --verify=0 --verify_fatal=0 --numjobs=32 --rw=randread --blocksize=4k

    Cc: Dongli Zhang
    Cc: Zach Marano
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Jianchao Wang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

01 Oct, 2018

1 commit

  • Merge -rc6 in, for two reasons:

    1) Resolve a trivial conflict in the blk-mq-tag.c documentation
    2) A few important regression fixes went into upstream directly, so
    they aren't in the 4.20 branch.

    Signed-off-by: Jens Axboe

    * tag 'v4.19-rc6': (780 commits)
    Linux 4.19-rc6
    MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
    cpufreq: qcom-kryo: Fix section annotations
    perf/core: Add sanity check to deal with pinned event failure
    xen/blkfront: correct purging of persistent grants
    Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
    selftests/powerpc: Fix Makefiles for headers_install change
    blk-mq: I/O and timer unplugs are inverted in blktrace
    dax: Fix deadlock in dax_lock_mapping_entry()
    x86/boot: Fix kexec booting failure in the SEV bit detection code
    bcache: add separate workqueue for journal_write to avoid deadlock
    drm/amd/display: Fix Edid emulation for linux
    drm/amd/display: Fix Vega10 lightup on S3 resume
    drm/amdgpu: Fix vce work queue was not cancelled when suspend
    Revert "drm/panel: Add device_link from panel device to DRM device"
    xen/blkfront: When purging persistent grants, keep them in the buffer
    clocksource/drivers/timer-atmel-pit: Properly handle error cases
    block: fix deadline elevator drain for zoned block devices
    ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
    drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Sep, 2018

6 commits

  • We apply smoothing to the scale changes in order to keep sawtooth
    behavior from occurring. However, our window for checking whether we've
    missed our target can sometimes be shorter than the smoothing interval
    (500ms), especially on faster drives like ssds. In order to deal with
    this, keep a running tally of the previous intervals that we threw away
    because we had already done a scale event recently.

    This is needed for the ssd case, as these low-latency drives will have
    bursts of latency, and if latency happens to be fine in the window that
    directly follows the opening of the scale window, we could unthrottle
    even though we were missing our target in the previous windows.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • We use an average latency approach for determining whether we're missing
    our latency target. This works well for rotational storage, where
    latencies are generally consistent, but ssds and other low-latency
    devices show spikier behavior, which means we often won't throttle
    misbehaving groups because a lot of IO completes drastically faster than
    our latency target. Instead, keep track of how many IOs miss our target
    and how many IOs are done in our time window; if the p(90) latency is
    above our target, then we know we need to throttle. With this change in
    place we see the same throttling behavior with our testcase on ssds as
    we see with rotational drives.
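
    As a rough illustration of the counting approach described above (plain
    C; the struct and function names here are invented, not the actual
    blk-iolatency members): the p(90) latency exceeds the target exactly
    when more than 10% of the IOs in the window missed it.

      #include <stdbool.h>
      #include <stdint.h>

      struct lat_window {
              uint64_t total;   /* IOs completed in this time window */
              uint64_t missed;  /* IOs that exceeded the latency target */
      };

      /* True iff the window's p(90) latency is above the target, i.e. more
       * than 10% of the IOs in the window missed the target. */
      static bool window_over_target(const struct lat_window *w)
      {
              return w->total != 0 && w->missed * 10 > w->total;
      }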

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • There is logic to keep cgroups that haven't done a lot of IO in the most
    recent scale window from being punished for over-active higher-priority
    groups. However, for things like ssds, where the windows are pretty
    short, we'll end up with a small number of samples, so 5% of the samples
    comes out to 0 if there aren't enough of them. Make the floor 1 sample
    to keep us from improperly bailing out of scaling down.
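
    A tiny sketch of the floor computation (illustrative only, not the
    actual iolatency helper):

      #include <stdint.h>

      /* 5% of the window's samples, but never fewer than one sample. */
      static uint64_t scale_down_sample_floor(uint64_t total_samples)
      {
              uint64_t thresh = total_samples / 20;   /* 5% */

              return thresh ? thresh : 1;
      }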

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Hitting the case where blk_queue_depth() returned 1 uncovered the fact
    that iolatency doesn't actually handle this case properly: it simply
    doesn't scale anybody down. For this case we should go straight into
    applying the time delay, which we weren't doing. Since we already limit
    the floor to 1 request, this if statement is not needed, and removing it
    lets us set our depth to 1, which in turn lets us apply the delay when
    needed.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • We were using blk_queue_depth() assuming that it would return
    nr_requests, but we hit a case in production on drives that had to have
    NCQ turned off in order to behave reliably, which resulted in a qd of 1
    even though nr_requests was much larger. iolatency really only cares
    about the requests we are allowed to queue up, as any IO that gets onto
    the request list is going to be serviced soonish, so we want to be
    throttling before the bio gets onto the request list. To make iolatency
    work as expected, simply use q->nr_requests instead of
    blk_queue_depth(), as that is what we actually care about.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • NSEC_PER_SEC has type long, so 5 * NSEC_PER_SEC is calculated as a long.
    However, 5 seconds is 5,000,000,000 nanoseconds, which overflows a
    32-bit long. Make sure all of the targets are calculated as 64-bit
    values.
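
    A minimal user-space illustration of the overflow (NSEC_PER_SEC is
    defined here the same way the kernel defines it): with a 32-bit long,
    5 * NSEC_PER_SEC wraps, while forcing 64-bit arithmetic, as the fix
    does, keeps the intended 5,000,000,000.

      #include <stdint.h>
      #include <stdio.h>

      #define NSEC_PER_SEC 1000000000L

      int main(void)
      {
              /* On an ILP32 target this overflows a 32-bit long, wrapping
               * to 705032704 in practice instead of 5000000000. */
              long wrong = 5 * NSEC_PER_SEC;

              /* Promoting to 64 bits before multiplying keeps the value. */
              int64_t right = 5 * (int64_t)NSEC_PER_SEC;

              printf("long: %ld  64-bit: %lld\n", wrong, (long long)right);
              return 0;
      }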

    Fixes: 6e25cb01ea20 ("kyber: implement improved heuristics")
    Reported-by: Stephen Rothwell
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

28 Sep, 2018

7 commits

  • Update device_add_disk() to take a 'groups' argument so that individual
    drivers can register a device with additional sysfs attributes.
    This avoids the race condition the driver would otherwise have if these
    groups were created afterwards with sysfs_add_groups().
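
    A hedged sketch of how a driver might use the three-argument form
    introduced by this change (the 'frob' attribute and the mydrv_* /
    parent_dev names are invented for illustration; this is driver-probe
    context, not the actual patch):

      static ssize_t frob_show(struct device *dev,
                               struct device_attribute *attr, char *buf)
      {
              return sprintf(buf, "%d\n", 42);
      }
      static DEVICE_ATTR_RO(frob);

      static struct attribute *mydrv_disk_attrs[] = {
              &dev_attr_frob.attr,
              NULL,
      };

      static const struct attribute_group mydrv_disk_group = {
              .attrs = mydrv_disk_attrs,
      };

      static const struct attribute_group *mydrv_disk_groups[] = {
              &mydrv_disk_group,
              NULL,
      };

      /* The groups are registered together with the disk, so userspace
       * never sees the device without them and no later sysfs call is
       * needed. */
      device_add_disk(parent_dev, disk, mydrv_disk_groups);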

    Signed-off-by: Martin Wilck
    Signed-off-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     
  • When debugging Kyber, it's really useful to know what latencies we've
    been having, how the domain depths have been adjusted, and if we've
    actually been throttling. Add three tracepoints, kyber_latency,
    kyber_adjust, and kyber_throttled, to record that.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Kyber's current heuristics have a few flaws:

    - It's based on the mean latency, but p99 latency tends to be more
    meaningful to anyone who cares about latency. The mean can also be
    skewed by rare outliers that the scheduler can't do anything about.
    - The statistics calculations are purely time-based with a short window.
    This works for steady, high load, but is more sensitive to outliers
    with bursty workloads.
    - It only considers the latency once an I/O has been submitted to the
    device, but the user cares about the time spent in the kernel, as
    well.

    These are shortcomings of the generic blk-stat code which doesn't quite
    fit the ideal use case for Kyber. So, this replaces the statistics with
    a histogram used to calculate percentiles of total latency and I/O
    latency, which we then use to adjust depths in a slightly more
    intelligent manner:

    - Sync and async writes are now the same domain.
    - Discards are a separate domain.
    - Domain queue depths are scaled by the ratio of the p99 total latency
    to the target latency (e.g., if the p99 latency is double the target
    latency, we will double the queue depth; if the p99 latency is half of
    the target latency, we can halve the queue depth).
    - We use the I/O latency to determine whether we should scale queue
    depths down: we will only scale down if any domain's I/O latency
    exceeds the target latency, which is an indicator of congestion in the
    device.

    These new heuristics are just as scalable as the heuristics they
    replace.
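
    A sketch of the depth-scaling rule described above, in plain C (names
    are illustrative; the real Kyber code operates on per-domain sbitmap
    depths, and the scale-down gate on device-side I/O latency is applied
    separately):

      #include <stdint.h>

      /*
       * New depth = old depth scaled by the ratio of the measured p99 total
       * latency to the target latency, clamped to [1, max_depth].
       * Assumes target_ns != 0.
       */
      static unsigned int scale_domain_depth(unsigned int depth,
                                             uint64_t p99_ns,
                                             uint64_t target_ns,
                                             unsigned int max_depth)
      {
              uint64_t scaled = ((uint64_t)depth * p99_ns) / target_ns;

              if (scaled < 1)
                      scaled = 1;
              if (scaled > max_depth)
                      scaled = max_depth;
              return (unsigned int)scaled;
      }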

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • The domain token sbitmaps are currently initialized to the device queue
    depth or 256, whichever is larger, and immediately resized to the
    maximum depth for that domain (256, 128, or 64 for read, write, and
    other, respectively). The sbitmap is never resized larger than that, so
    it's unnecessary to allocate a bitmap larger than the maximum depth.
    Let's just allocate it to the maximum depth to begin with. This will use
    marginally less memory, and more importantly, give us a more appropriate
    number of bits per sbitmap word.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Kyber will need this in a future change if it is built as a module.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Commit 4bc6339a583c ("block: move blk_stat_add() to
    __blk_mq_end_request()") consolidated some calls using ktime_get() so
    we'd only need to call it once. Kyber's ->completed_request() hook also
    calls ktime_get(), so let's move it to the same place, too.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • trace_block_unplug() takes true for explicit unplugs and false for
    implicit unplugs. schedule() unplugs are implicit and should be
    reported as timer unplugs. While correct in the legacy code, this has
    been inverted in blk-mq since 4.11.
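
    The gist of the fix, sketched (the wrapper name below is invented; the
    real change is in the blk-mq plug flush path): a flush triggered from
    schedule() is an implicit unplug, so the "explicit" flag passed to
    trace_block_unplug() must be the negation of from_schedule.

      static void report_unplug(struct request_queue *q, unsigned int depth,
                                bool from_schedule)
      {
              /* true = explicit unplug; schedule()-driven flushes are not */
              trace_block_unplug(q, depth, !from_schedule);
      }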

    Cc: stable@vger.kernel.org
    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     

27 Sep, 2018

8 commits

  • When the deadline scheduler is used with a zoned block device, writes
    to a zone will be dispatched one at a time. This causes the warning
    message:

    deadline: forced dispatching is broken (nr_sorted=X), please report this

    to be displayed when switching to another elevator with the legacy I/O
    path while write requests to a zone are being retained in the scheduler
    queue.

    Prevent this message from being displayed when executing
    elv_drain_elevator() for a zoned block device. __blk_drain_queue() will
    loop until all writes are dispatched and completed, resulting in the
    desired elevator queue drain without extensive modifications to the
    deadline code itself to handle forced-dispatch calls.
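
    A sketch of the kind of check involved (the exact condition and message
    live in the legacy elv_drain_elevator(); details here are approximate):
    the warning is simply skipped for zoned devices, where retained zone
    writes are expected and __blk_drain_queue() will finish the drain.

      if (q->nr_sorted && !blk_queue_is_zoned(q) && printed++ < 10)
              printk(KERN_ERR "%s: forced dispatching is broken "
                     "(nr_sorted=%u), please report this\n",
                     q->elevator->type->elevator_name, q->nr_sorted);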

    Signed-off-by: Damien Le Moal
    Fixes: 8dc8146f9c92 ("deadline-iosched: Introduce zone locking support")
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Damien Le Moal
     
  • Now that the blk-mq core processes power management requests (marked
    with RQF_PREEMPT) in states other than RPM_ACTIVE, enable runtime power
    management for blk-mq.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of allowing requests that are not power management requests
    to enter the queue in runtime suspended status (RPM_SUSPENDED), make
    the blk_get_request() caller block. This change fixes a starvation
    issue: it is now guaranteed that power management requests will be
    executed no matter how many blk_get_request() callers are waiting.
    For blk-mq, instead of maintaining the q->nr_pending counter, rely
    on q->q_usage_counter. Call pm_runtime_mark_last_busy() every time a
    request finishes instead of only if the queue depth drops to zero.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • A later patch will call blk_freeze_queue_start() followed by
    blk_mq_unfreeze_queue() without waiting for q_usage_counter to drop
    to zero. Make sure that this doesn't cause a kernel warning to appear
    by switching from percpu_ref_reinit() to percpu_ref_resurrect(). The
    former requires that the refcount it operates on be zero; the latter
    does not.
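
    Sketched usage of the pattern this enables (illustrative):

      /* Begin freezing the queue: q_usage_counter is killed ... */
      blk_freeze_queue_start(q);

      /*
       * ... then undo the freeze without waiting for in-flight requests to
       * drain. Internally this now resurrects q_usage_counter via
       * percpu_ref_resurrect(), which tolerates a non-zero count, whereas
       * percpu_ref_reinit() would warn here because it requires the count
       * to have reached zero.
       */
      blk_mq_unfreeze_queue(q);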

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of scheduling runtime resume of a request queue after a
    request has been queued, schedule asynchronous resume during request
    allocation. The new pm_request_resume() calls occur after
    blk_queue_enter() has increased the q_usage_counter request queue
    member. This change is needed for a later patch that will make request
    allocation block while the queue status is not RPM_ACTIVE.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Move the pm_request_resume() and pm_runtime_mark_last_busy() calls into
    two new functions and thereby separate legacy block layer code from code
    that works for both the legacy block layer and blk-mq. A later patch will
    add calls to the new functions in the blk-mq code.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Martin K. Petersen
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • The RQF_PREEMPT flag is used for three purposes:
    - In the SCSI core, for making sure that power management requests
    are executed even if a device is in the "quiesced" state.
    - For domain validation by SCSI drivers that use the parallel port.
    - In the IDE driver, for IDE preempt requests.
    Rename "preempt-only" to "pm-only" because the primary purpose of
    this mode is power management. Since the power management core may, but
    does not have to, resume a runtime-suspended device before performing a
    system-wide suspend, and since a later patch will set "pm-only" mode for
    as long as a block device is runtime suspended, make it possible to set
    "pm-only" mode from more than one context. Since with this change
    scsi_device_quiesce() is no longer idempotent, make that function return
    early if it is called for a quiesced queue.
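
    A simplified sketch of the counter-based mode this describes (it mirrors
    the idea behind blk_set_pm_only()/blk_clear_pm_only(), not the exact
    kernel code): "pm-only" can be entered from several contexts and is only
    left when the last caller clears it.

      void blk_set_pm_only(struct request_queue *q)
      {
              atomic_inc(&q->pm_only);
      }

      void blk_clear_pm_only(struct request_queue *q)
      {
              /* Wake blocked request allocators when the last user leaves. */
              if (atomic_dec_return(&q->pm_only) == 0)
                      wake_up_all(&q->mq_freeze_wq);
      }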

    Signed-off-by: Bart Van Assche
    Acked-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Cc: Jianchao Wang
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Move the code for runtime power management from blk-core.c into the
    new source file blk-pm.c. Move the corresponding declarations from
    <linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out
    the declarations of the functions that are not used in that mode.
    This patch not only reduces the number of #ifdefs in the block layer
    core code but also reduces the size of the <linux/blkdev.h> header file
    and hence should help to reduce the build time of the Linux kernel
    when CONFIG_PM is not defined.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

26 Sep, 2018

2 commits

  • Take the Xen check into the core code instead of delegating it to
    the architectures.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • A recent commit runs tag iterator callbacks under the rcu read lock,
    but existing callbacks do not satisfy the non-blocking requirement.
    The commit intended to prevent an iterator from accessing a queue that's
    being modified. This patch fixes the original issue by taking a queue
    reference instead of relying on the rcu read lock, which allows the
    callbacks to make blocking calls.
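
    The shape of the fix, sketched (illustrative; the real code sits in the
    tag busy-iteration path): pin the queue with a reference so the
    callbacks may block, instead of wrapping the walk in an RCU read-side
    section.

      if (!percpu_ref_tryget(&q->q_usage_counter))
              return;

      /* ... walk the tags and invoke the (possibly blocking) callbacks ... */

      blk_queue_exit(q);      /* drops the q_usage_counter reference */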

    Fixes: f5bbbbe4d6357 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
    Acked-by: Jianchao Wang
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

22 Sep, 2018

7 commits

  • Make it easier to understand what the functions that iterate over
    requests do by documenting their purpose. Also fix several minor
    spelling and grammar mistakes in comments in these functions.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Johannes Thumshirn
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • blkg reference counting now uses percpu_ref rather than atomic_t. Let's
    make this consistent with css_tryget. This renames blkg_try_get to
    blkg_tryget, which now returns a bool rather than the blkg or NULL.
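
    Conceptually, a sketch of the resulting helper (the real one lives in
    the blk-cgroup headers after this series; shown here to illustrate the
    bool-returning, css_tryget-like shape):

      static inline bool blkg_tryget(struct blkcg_gq *blkg)
      {
              /* refcnt is now a percpu_ref, so tryget maps directly onto
               * percpu_ref_tryget() and reports success as a bool. */
              return percpu_ref_tryget(&blkg->refcnt);
      }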

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • Now that every bio is associated with a blkg, this puts the use of
    blkg_get, blkg_try_get, and blkg_put on the hot path. This switches over
    the refcnt in blkg to use percpu_ref.

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • The previous patch in this series removed carrying around a pointer to
    the css in blkg. However, the blkg association logic still relied on
    taking a reference on the css to ensure we wouldn't fail in getting a
    reference for the blkg.

    Here the implicit dependency on the css is removed. The association
    continues to rely on the tryget logic walking up the blkg tree. This
    streamlines the three ways that association can happen: normal, swap,
    and writeback.

    Acked-by: Tejun Heo
    Signed-off-by: Dennis Zhou
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • Prior patches ensured that all bios are now associated with some blkg.
    This makes bio->bi_css unnecessary, as the blkg already maintains a
    reference to the blkcg.

    This patch removes the bi_css field and converts its users to access the
    css via bi_blkg.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • A prior patch in this series added blkg association to bios issued by
    cgroups. There are two other paths whose work we want to attribute back
    to the appropriate cgroup: swap and writeback. Here we modify the way
    swap tags bios to include the blkg. Writeback will be tackled in the
    next patch.

    Signed-off-by: Dennis Zhou
    Reviewed-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)
     
  • bio_issue_init, among other things, initializes the timestamp for an IO.
    Rather than having this logic handled by the individual policies, this
    consolidates it onto the init paths (normal, clone, bounce clone).

    Signed-off-by: Dennis Zhou
    Acked-by: Tejun Heo
    Reviewed-by: Liu Bo
    Signed-off-by: Jens Axboe

    Dennis Zhou (Facebook)