14 Nov, 2018

1 commit

  • commit 1adfc5e4136f5967d591c399aff95b3b035f16b7 upstream.

    Obviously the created discard bio has to be aligned with the logical
    block size.

    This patch introduces the helper bio_allowed_max_sectors() for
    this purpose.
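
    The exact upstream definition may differ, but a minimal sketch of such
    a helper just rounds the absolute bio sector limit down to a whole
    number of logical blocks:

    static inline unsigned int bio_allowed_max_sectors(struct request_queue *q)
    {
            /* UINT_MAX >> 9 is the largest sector count a bio can
             * describe; trim it to a logical-block boundary */
            return round_down(UINT_MAX, queue_logical_block_size(q)) >> 9;
    }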

    Cc: stable@vger.kernel.org
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Fixes: 744889b7cbb56a6 ("block: don't deal with discard limit in blkdev_issue_discard()")
    Fixes: a22c4d7e34402cc ("block: re-add discard_granularity and alignment checks")
    Reported-by: Rui Salvaterra
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

21 Aug, 2018

1 commit

  • Currently, when updating nr_hw_queues, the IO scheduler's init_hctx is
    invoked before the mapping between ctx and hctx has been adapted
    correctly by blk_mq_map_swqueue. An IO scheduler's init_hctx (e.g.
    kyber's) may depend on this mapping, get a wrong result, and finally
    panic. A simple way to fix this is to switch the IO scheduler to 'none'
    before updating nr_hw_queues, and then switch it back afterwards.
    blk_mq_sched_init_/exit_hctx are removed since nobody uses them any
    more.
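
    A minimal sketch of the resulting update sequence (helper names and
    signatures are assumed here; locking and error paths are omitted):

    /* sketch only: not the literal upstream code */
    static void update_nr_hw_queues_sketch(struct blk_mq_tag_set *set,
                                           int nr_hw_queues)
    {
            struct request_queue *q;

            list_for_each_entry(q, &set->tag_list, tag_set_list)
                    blk_mq_elv_switch_none(q);      /* scheduler -> "none" */

            set->nr_hw_queues = nr_hw_queues;
            list_for_each_entry(q, &set->tag_list, tag_set_list) {
                    blk_mq_realloc_hw_ctxs(set, q);
                    blk_mq_map_swqueue(q);  /* rebuild ctx<->hctx mapping */
            }

            list_for_each_entry(q, &set->tag_list, tag_set_list)
                    blk_mq_elv_switch_back(q);      /* restore scheduler */
    }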

    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

09 Jul, 2018

1 commit

  • Current IO controllers for the block layer are less than ideal for our
    use case. The io.max controller is great at hard limiting, but it is
    not work conserving. This patch introduces io.latency. You provide a
    latency target for your group and we monitor the IO in short windows to
    make sure we are not exceeding those latency targets. This makes use of
    the rq-qos infrastructure and works much like the wbt code. There are a
    few differences from wbt:

    - It's bio based, so the latency covers the whole block layer in
    addition to the actual IO.
    - We will throttle all IO types that come in here if we need to.
    - We use the mean latency over the 100ms window (see the sketch after
    this list). This is because writes can be particularly fast, which
    could give us a false sense of the impact of other workloads on our
    protected workload.
    - By default there's no throttling; we set the queue_depth to INT_MAX
    so that we can have as many outstanding bios as we're allowed to. Only
    at throttle time do we pay attention to the actual queue depth.
    - We backcharge cgroups for root-cg-issued IO and induce artificial
    delays in order to deal with cases like metadata-only or swap-heavy
    workloads.
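
    A hedged sketch of the per-window check; the struct and helper names
    here are assumed for illustration, not the actual blk-iolatency
    internals:

    #include <linux/math64.h>

    struct lat_window {
            u64 total_ns;           /* sum of completion latencies */
            u64 nr_samples;         /* IOs completed in this window */
    };

    static bool lat_window_exceeds_target(struct lat_window *w,
                                          u64 target_ns)
    {
            if (!w->nr_samples)
                    return false;
            /* compare the window's mean latency to the cgroup target */
            return div64_u64(w->total_ns, w->nr_samples) > target_ns;
    }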

    In testing this has worked out relatively well. Protected workloads
    will throttle noisy workloads down to 1 IO at a time if they are doing
    normal IO on their own, or induce up to a 1 second delay per syscall if
    they are doing a lot of root-issued IO (metadata/swap IO).

    Our testing has revolved mostly around our production web servers where
    we have hhvm (the web server application) in a protected group and
    everything else in another group. We see slightly higher requests per
    second (RPS) on the test tier vs the control tier, and much more stable
    RPS across all machines in the test tier vs the control tier.

    Another test we run is a slow memory allocator in the unprotected group.
    Before this would eventually push us into swap and cause the whole box
    to die and not recover at all. With these patches we see slight RPS
    drops (usually 10-15%) before the memory consumer is properly killed and
    things recover within seconds.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     

09 May, 2018

1 commit

  • Currently, struct request has four timestamp fields:

    - A start time, set at get_request time, in jiffies, used for iostats
    - An I/O start time, set at start_request time, in ktime nanoseconds,
    used for blk-stats (i.e., wbt, kyber, hybrid polling)
    - Another start time and another I/O start time, used for cfq and bfq

    These can all be consolidated into one start time and one I/O start
    time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
    request depending on the kernel config.
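
    A sketch of the consolidated fields (assuming they end up named as
    below; surrounding fields omitted):

    struct request {
            /* ... */
            u64 start_time_ns;      /* set at get_request time,
                                       used for iostats */
            u64 io_start_time_ns;   /* set at start_request time,
                                       used for blk-stats */
            /* ... */
    };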

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

30 Jan, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "This is the main pull request for block IO related changes for the
    4.16 kernel. Nothing major in this pull request, but a good amount of
    improvements and fixes all over the map. This contains:

    - BFQ improvements, fixes, and cleanups from Angelo, Chiara, and
    Paolo.

    - Support for SMR zones for deadline and mq-deadline from Damien and
    Christoph.

    - Set of fixes for bcache by way of Michael Lyle, including fixes
    from himself, Kent, Rui, Tang, and Coly.

    - Series from Matias for lightnvm with fixes from Hans Holmberg,
    Javier, and Matias. Mostly centered around pblk, and removing rrpc
    1.2 in preparation for supporting 2.0.

    - A couple of NVMe pull requests from Christoph. Nothing major in
    here, just fixes and cleanups, and support for command tracing from
    Johannes.

    - Support for blk-throttle for tracking reads and writes separately.
    From Joseph Qi. A few cleanups/fixes also for blk-throttle from
    Weiping.

    - Series from Mike Snitzer that enables dm to register its queue more
    logically, something that's always been problematic on dm since
    it's a stacked device.

    - Series from Ming cleaning up some of the bio accessor use, in
    preparation for supporting multipage bvecs.

    - Various fixes from Ming closing up holes around queue mapping and
    quiescing.

    - BSD partition fix from Richard Narron, fixing a problem where we
    can't mount newer (10/11) FreeBSD partitions.

    - Series from Tejun reworking blk-mq timeout handling. The previous
    scheme relied on atomic bits, but it had races where we would think
    a request had timed out if it got reused at the wrong time.

    - null_blk now supports faking timeouts, to enable us to better
    exercise and test that functionality separately. From me.

    - Kill the separate atomic poll bit in the request struct. After
    this, we don't use the atomic bits on blk-mq anymore at all. From
    me.

    - sgl_alloc/free helpers from Bart.

    - Heavily contended tag case scalability improvement from me.

    - Various little fixes and cleanups from Arnd, Bart, Corentin,
    Douglas, Eryu, Goldwyn, and myself"

    * 'for-4.16/block' of git://git.kernel.dk/linux-block: (186 commits)
    block: remove smart1,2.h
    nvme: add tracepoint for nvme_complete_rq
    nvme: add tracepoint for nvme_setup_cmd
    nvme-pci: introduce RECONNECTING state to mark initializing procedure
    nvme-rdma: remove redundant boolean for inline_data
    nvme: don't free uuid pointer before printing it
    nvme-pci: Suspend queues after deleting them
    bsg: use pr_debug instead of hand crafted macros
    blk-mq-debugfs: don't allow write on attributes with seq_operations set
    nvme-pci: Fix queue double allocations
    block: Set BIO_TRACE_COMPLETION on new bio during split
    blk-throttle: use queue_is_rq_based
    block: Remove kblockd_schedule_delayed_work{,_on}()
    blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces unintended delays
    blk-mq: Rename blk_mq_request_direct_issue() into blk_mq_request_issue_directly()
    lib/scatterlist: Fix chaining support in sgl_alloc_order()
    blk-throttle: track read and write request individually
    block: add bdev_read_only() checks to common helpers
    block: fail op_is_write() requests to read-only partitions
    blk-throttle: export io_serviced_recursive, io_service_bytes_recursive
    ...

    Linus Torvalds
     

11 Jan, 2018

3 commits

  • We only have one atomic flag left. Instead of using an entire
    unsigned long for that, steal the bottom bit of the deadline
    field that we already reserved.

    Remove ->atomic_flags, since it's now unused.
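
    A minimal sketch of the packing (field and helper names assumed):

    /* bit 0 of ->__deadline carries the flag, the upper bits the
     * jiffies deadline */
    static inline void blk_rq_set_deadline(struct request *rq,
                                           unsigned long time)
    {
            rq->__deadline = time & ~0x1UL;
    }

    static inline unsigned long blk_rq_deadline(struct request *rq)
    {
            return rq->__deadline & ~0x1UL;
    }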

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We reduce the resolution of request expiry, but since we're already
    using jiffies for this where resolution depends on the kernel
    configuration and since the timeout resolution is coarse anyway,
    that should be fine.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't need this to be an atomic flag, it can be a regular
    flag. We either end up on the same CPU for the polling, in which
    case the state is sane, or we did the sleep which would imply
    the needed barrier to ensure we see the right state.

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

10 Jan, 2018

2 commits

  • After the recent updates to use generation number and state based
    synchronization, we can easily replace REQ_ATOM_STARTED usages by
    adding an extra state to distinguish completed but not yet freed
    state.

    Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
    blk_mq_rq_state() tests. REQ_ATOM_STARTED no longer has any users
    left and is removed.
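
    A sketch of the resulting state enum, with STARTED tests turning into
    blk_mq_rq_state() comparisons (exact values assumed):

    enum mq_rq_state {
            MQ_RQ_IDLE      = 0,
            MQ_RQ_IN_FLIGHT = 1,
            MQ_RQ_COMPLETE  = 2,    /* completed but not yet freed */
    };

    /* replaces an old REQ_ATOM_STARTED test */
    static inline bool rq_started_sketch(struct request *rq)
    {
            return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
    }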

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blk-mq timeout path synchronizes against the usual
    issue/completion path using a complex scheme involving atomic
    bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
    rules. Unfortunately, it contains quite a few holes.

    There's complex dancing around REQ_ATOM_STARTED and REQ_ATOM_COMPLETE
    between the issue/completion and timeout paths; however, they don't
    have a synchronization point across request recycle instances and it
    isn't clear what the barriers add. blk_mq_check_expired() can easily
    read STARTED from the N-2'th iteration, the deadline from the N-1'th,
    and run blk_mark_rq_complete() against the Nth instance.

    In fact, it's pretty easy to make blk_mq_check_expired() terminate a
    later instance of a request. If we induce a 5 second delay before the
    time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
    2s, and issue back-to-back large IOs, blk-mq starts timing out requests
    spuriously pretty quickly. Nothing actually timed out; the code just
    made the call on a recycled instance of a request and then terminated
    a later instance long after the original instance finished. The
    scenario isn't theoretical either.

    This patch replaces the broken synchronization mechanism with an RCU
    and generation number based one; a sketch of the timeout-side check
    follows the numbered list below.

    1. Each request has a u64 generation + state value, which can be
    updated only by the request owner. Whenever a request becomes
    in-flight, the generation number gets bumped up too. This provides
    the basis for the timeout path to distinguish different recycle
    instances of the request.

    Also, marking a request in-flight and setting its deadline are
    protected with a seqcount so that the timeout path can fetch both
    values coherently.

    2. The timeout path fetches the generation, state and deadline. If
    the verdict is timeout, it records the generation into a dedicated
    request abortion field and does RCU wait.

    3. The completion path is also protected by RCU (from the previous
    patch) and checks whether the current generation number and state
    match the abortion field. If so, it skips completion.

    4. The timeout path, after RCU wait, scans requests again and
    terminates the ones whose generation and state still match the ones
    requested for abortion.

    By now, the timeout path knows that either the generation number
    and state changed if it lost the race or the completion will yield
    to it and can safely timeout the request.
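
    A hedged sketch of the timeout-side check described in steps 1 and 2
    (field, mask, and helper names assumed):

    static void check_expired_sketch(struct request *rq)
    {
            unsigned long deadline;
            unsigned int seq;
            u64 gstate;

            /* the seqcount lets us read state and deadline coherently */
            do {
                    seq = read_seqcount_begin(&rq->gstate_seq);
                    gstate = rq->gstate;
                    deadline = blk_rq_deadline(rq);
            } while (read_seqcount_retry(&rq->gstate_seq, seq));

            if ((gstate & MQ_RQ_STATE_MASK) == MQ_RQ_IN_FLIGHT &&
                time_after_eq(jiffies, deadline))
                    rq->aborted_gstate = gstate; /* then RCU wait, re-scan */
    }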

    While it's more lines of code, it's conceptually simpler, doesn't
    depend on direct use of subtle memory ordering or coherence, and
    hopefully doesn't terminate the wrong instance.

    While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
    between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
    removed yet as it's still used in other places. Future patches will
    move all state tracking to the new mechanism and remove all bitops in
    the hot paths.

    Note that this patch adds a comment explaining a race condition in
    BLK_EH_RESET_TIMER path. The race has always been there and this
    patch doesn't change it. It's just documenting the existing race.

    v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
    - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
    - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

    v3: - Fixed possible extended seqcount / u64_stats_sync read looping
    spotted by Peter.
    - MQ_RQ_IDLE was incorrectly being set in complete_request instead
    of free_request. Fixed.

    v4: - Rebased on top of hctx_lock() refactoring patch.
    - Added comment explaining the use of hctx_lock() in completion path.

    v5: - Added comments requested by Bart.
    - Note the addition of BLK_EH_RESET_TIMER race condition in the
    commit message.

    Signed-off-by: Tejun Heo
    Cc: "jianchao.wang"
    Cc: Peter Zijlstra
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Signed-off-by: Jens Axboe

    Tejun Heo
     

06 Jan, 2018

1 commit

  • We now track legacy requests with .q_usage_counter, as of commit
    055f6e18e08f ("block: Make q_usage_counter also track legacy requests"),
    but that commit never runs and drains the legacy queue before waiting
    for this counter to become zero, so an IO hang is triggered when a disk
    is pulled during IO.

    This patch fixes the issue by draining requests before waiting for
    q_usage_counter to become zero. Both Mauricio and chenxiang reported
    the issue and observed that it is fixed by this patch.
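
    A hedged sketch of the resulting ordering in blk_cleanup_queue()
    (helper names assumed, surrounding code abbreviated):

    static void cleanup_queue_sketch(struct request_queue *q)
    {
            spinlock_t *lock = q->queue_lock;

            spin_lock_irq(lock);
            if (!q->mq_ops)
                    __blk_drain_queue(q, true); /* drain legacy requests */
            queue_flag_set(QUEUE_FLAG_DEAD, q);
            spin_unlock_irq(lock);

            /* only now is it safe to wait for q_usage_counter == 0 */
            blk_freeze_queue(q);
    }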

    Link: https://marc.info/?l=linux-block&m=151192424731797&w=2
    Fixes: 055f6e18e08f ("block: Make q_usage_counter also track legacy requests")
    Cc: Wen Xiong
    Tested-by: "chenxiang (M)"
    Tested-by: Mauricio Faria de Oliveira
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fixes the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have alias
    names. This means you can use 'deadline' on both !mq and on mq
    (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.
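
    For a typical C source file, the added line is a single comment at the
    very top of the file:

    // SPDX-License-Identifier: GPL-2.0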

    This patch is based on work done by Thomas Gleixner, Kate Stewart, and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information.

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied
    to a file was done in a spreadsheet of side-by-side results of the
    output of two independent scanners (ScanCode & Windriver) producing
    SPDX tag:value files, created by Philippe Ombredanne. Philippe prepared
    the base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained
    >5 lines of source.
    - File already had some variant of a license header in it (even if <5
    lines).

    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

21 Jun, 2017

1 commit

  • Some functions in block/blk-core.c must only be used on blk-sq queues
    while others are safe to use against any queue type. Document which
    functions are intended for blk-sq queues and issue a warning if the
    blk-sq API is misused. This not only helps block driver authors but
    will also make it easier to remove the blk-sq code once that code is
    declared obsolete.
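
    A sketch of the pattern on one legacy-only entry point (the concrete
    set of annotated functions is in the patch itself):

    void blk_start_queue(struct request_queue *q)
    {
            lockdep_assert_held(q->queue_lock);
            WARN_ON_ONCE(q->mq_ops);        /* blk-sq (legacy) only */

            queue_flag_clear(QUEUE_FLAG_STOPPED, q);
            __blk_run_queue(q);
    }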

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

02 Jun, 2017

1 commit

  • Since the introduction of .init_rq_fn() and .exit_rq_fn() it is
    essential that the memory allocated for struct request_queue
    stays around until all blk_exit_rl() calls have finished. Hence
    make blk_init_rl() take a reference on struct request_queue.
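
    A sketch based on the description above (exact signatures and ordering
    may differ):

    int blk_init_rl(struct request_list *rl, struct request_queue *q,
                    gfp_t gfp_mask)
    {
            if (!blk_get_queue(q))  /* pin the queue for rl's lifetime */
                    return -ENODEV;
            rl->q = q;
            /* ... mempool setup unchanged ... */
            return 0;
    }

    void blk_exit_rl(struct request_queue *q, struct request_list *rl)
    {
            mempool_destroy(rl->rq_pool);
            blk_put_queue(q);       /* reference taken in blk_init_rl() */
    }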

    This patch fixes the following crash:

    general protection fault: 0000 [#2] SMP
    CPU: 3 PID: 28 Comm: ksoftirqd/3 Tainted: G D 4.12.0-rc2-dbg+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    task: ffff88013a108040 task.stack: ffffc9000071c000
    RIP: 0010:free_request_size+0x1a/0x30
    RSP: 0018:ffffc9000071fd38 EFLAGS: 00010202
    RAX: 6b6b6b6b6b6b6b6b RBX: ffff880067362a88 RCX: 0000000000000003
    RDX: ffff880067464178 RSI: ffff880067362a88 RDI: ffff880135ea4418
    RBP: ffffc9000071fd40 R08: 0000000000000000 R09: 0000000100180009
    R10: ffffc9000071fd38 R11: ffffffff81110800 R12: ffff88006752d3d8
    R13: ffff88006752d3d8 R14: ffff88013a108040 R15: 000000000000000a
    FS: 0000000000000000(0000) GS:ffff88013fd80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fa8ec1edb00 CR3: 0000000138ee8000 CR4: 00000000001406e0
    Call Trace:
    mempool_destroy.part.10+0x21/0x40
    mempool_destroy+0xe/0x10
    blk_exit_rl+0x12/0x20
    blkg_free+0x4d/0xa0
    __blkg_release_rcu+0x59/0x170
    rcu_process_callbacks+0x260/0x4e0
    __do_softirq+0x116/0x250
    smpboot_thread_fn+0x123/0x1e0
    kthread+0x109/0x140
    ret_from_fork+0x31/0x40

    Fixes: e9c787e65c0c ("scsi: allocate scsi_cmnd structures as part of struct request")
    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Cc: Jan Kara
    Cc: stable@vger.kernel.org # v4.11+
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

28 Mar, 2017

4 commits

  • The user configures a latency target, but the latency threshold for
    each request size isn't fixed. For an SSD, the IO latency highly
    depends on request size. To calculate the latency threshold, we sample
    some data, e.g., average latency for request sizes 4k, 8k, 16k,
    32k .. 1M. The latency threshold of each request size will be the
    sampled latency (I'll call it base latency) plus the latency target.
    For example, if the base latency for request size 4k is 80us and the
    user configures a latency target of 60us, the 4k latency threshold
    will be 80 + 60 = 140us.

    To sample data, we calculate the order base 2 of the rounded-up IO
    sectors. If the IO size is bigger than 1M, it is accounted as 1M. Since
    the calculation rounds up, the base latency will be slightly smaller
    than the actual value. Also, if there isn't any IO dispatched for a
    specific IO size, we will use the base latency of a smaller IO size
    for this IO size.

    But we shouldn't sample data at just any time. The base latency is
    supposed to be the latency when the disk isn't congested, because we
    use the latency threshold to schedule IOs between cgroups. If the disk
    is congested, the latency is higher and using it for scheduling is
    meaningless. Hence we only do the sampling when block throttling is in
    the LOW limit, on the assumption that the disk isn't congested in that
    state. If the assumption isn't true, e.g., the low limit is too high,
    the calculated latency threshold will be higher.

    Hard disks are completely different: latency depends on spindle seeks
    instead of request size. Currently this feature is SSD only; we could
    probably use a fixed threshold like 4ms for hard disks, though.
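
    A hedged sketch of the size bucketing (names assumed); the per-size
    threshold is then base_latency[bucket] + latency_target:

    #define LATENCY_BUCKET_SIZE 9   /* 4k, 8k, ..., 1M */

    static int request_bucket_index(sector_t sectors)
    {
            /* 4k = 8 sectors -> order 3 -> bucket 0; >= 1M -> last */
            return clamp(order_base_2(sectors) - 3, 0,
                         LATENCY_BUCKET_SIZE - 1);
    }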

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • A cgroup gets assigned a low limit, but the cgroup could never dispatch
    enough IO to cross the low limit. In such a case, the queue state
    machine will remain in LIMIT_LOW state and all other cgroups will be
    throttled according to the low limit. This is unfair to the other
    cgroups. We should treat the cgroup as idle and upgrade the state
    machine to the upper state.

    We also have downgrade logic. If the state machine upgrades because of
    cgroup idleness (real idle), the state machine will downgrade soon
    because the cgroup is below its low limit. This isn't what we want. A
    more complicated case is a cgroup that isn't idle while the queue is in
    LIMIT_LOW; when the queue gets upgraded to the upper state, other
    cgroups could dispatch more IO and this cgroup can't dispatch enough,
    so the cgroup falls below its low limit and looks idle (fake idle). In
    this case, the queue should downgrade soon. The key to deciding whether
    to downgrade is detecting whether the cgroup is truly idle.

    Unfortunately it's very hard to determine if a cgroup is really idle.
    This patch uses the 'think time' idea from CFQ for the purpose. Please
    note, the idea doesn't work for all workloads. For example, a workload
    with IO depth 8 has disk utilization 100%, hence think time is 0 and it
    doesn't look idle. But the workload could run at higher bandwidth with
    IO depth 16; compared to IO depth 16, the IO depth 8 workload is idle.
    We use the idea to roughly determine if a cgroup is idle.

    We treat a cgroup as idle if its think time is above a threshold (by
    default 1ms for SSD and 100ms for HD). The idea is that think time
    above the threshold will start to harm performance. HD is much slower,
    so a longer think time is ok.

    The patch (and the later patches) uses 'unsigned long' to track time.
    We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
    precision; that should not be a big deal.
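
    A hedged sketch of the bookkeeping (field and helper names assumed);
    last_finish_time is recorded at IO completion, in 'ns >> 10' units as
    described above:

    static void update_think_time_sketch(struct throtl_grp *tg)
    {
            unsigned long now = ktime_get_ns() >> 10;
            unsigned long think = now - tg->last_finish_time;

            /* exponentially weighted moving average: 7/8 old, 1/8 new */
            tg->avg_idletime = (tg->avg_idletime * 7 + think) >> 3;
    }

    static bool tg_is_idle_sketch(struct throtl_grp *tg)
    {
            /* threshold defaults: ~1ms for SSD, ~100ms for HD */
            return tg->avg_idletime > tg->idletime_threshold;
    }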

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • The throtl_slice is 100ms by default. This is a long time for an SSD;
    a lot of IO can run in it. To give cgroups smoother throughput, we
    choose a smaller value (20ms) for SSD.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • throtl_slice is important for blk-throttling. It's called a slice
    internally, but it really is the time window over which blk-throttling
    samples data and makes decisions. An example is bandwidth measurement:
    a cgroup's bandwidth is measured over a throtl_slice interval.

    A small throtl_slice means cgroups have smoother throughput but burn
    more CPU. The default value of 100ms is not appropriate for all disks;
    a fast SSD can dispatch a lot of IOs in 100ms. This patch makes it
    tunable.

    Since throtl_slice isn't a time slice, the sysfs name
    'throttle_sample_time' reflects its character better.
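
    The resulting defaults might look like this (macro names assumed):

    #define DFL_THROTL_SLICE_HD     (HZ / 10)   /* 100ms, rotational */
    #define DFL_THROTL_SLICE_SSD    (HZ / 50)   /* 20ms, SSD */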

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

09 Feb, 2017

3 commits

  • Add a new merge strategy that merges discard bios into a request until
    the maximum number of discard ranges (or the maximum discard size) is
    reached, wired up from the plug merging code. I/O scheduler merging is
    not wired up yet but might also be useful, although not for fast
    devices like NVMe, which are the only user for now.

    Note that for now we don't support limiting the size of each discard
    range, but if needed that can be added later.
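
    A hedged sketch of the gate applied at plug-merge time (treat the
    helper usage as illustrative rather than the literal upstream code):

    static bool can_merge_discard_sketch(struct request_queue *q,
                                         struct request *req,
                                         struct bio *bio)
    {
            unsigned short segments = blk_rq_nr_discard_segments(req);

            /* stop at the device's range-count and size limits */
            if (segments >= queue_max_discard_segments(q))
                    return false;
            if (blk_rq_sectors(req) + bio_sectors(bio) >
                blk_rq_get_max_sectors(req, blk_rq_pos(req)))
                    return false;
            return true;
    }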

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Switch these constants to an enum, and let the compiler ensure that
    all callers of blk_try_merge and elv_merge handle all potential values.
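
    The constants as an enum look roughly like this (the discard variant
    comes from the series above):

    enum elv_merge {
            ELEVATOR_NO_MERGE       = 0,
            ELEVATOR_FRONT_MERGE    = 1,
            ELEVATOR_BACK_MERGE     = 2,
            ELEVATOR_DISCARD_MERGE  = 3,
    };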

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This makes it available outside of blk-merge.c, and inlining such a trivial
    helper seems pretty useful to start with.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Feb, 2017

1 commit

  • When we attempt to merge request-to-request, we currently return 0/1
    depending on whether we merged. Change that to return a pointer to the
    request that we freed. We will use this to move the freeing of that
    request out of the merge logic, so that callers can drop locks before
    freeing the request.

    There should be no functional changes in this patch.
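
    A hedged sketch of what this enables at a call site (illustrative
    only; the actual freeing is moved in a later patch):

    static void merge_then_free_sketch(struct request_queue *q,
                                       struct request *rq)
    {
            struct request *free;

            spin_lock_irq(q->queue_lock);
            free = attempt_back_merge(q, rq);       /* was: int 0/1 */
            spin_unlock_irq(q->queue_lock);

            if (free)
                    blk_put_request(free);  /* freed outside the lock */
    }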

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

01 Feb, 2017

1 commit

  • This can be used to check for fs vs non-fs requests and basically
    removes all BLOCK_PC-specific knowledge from the block layer, as well
    as preparing for the removal of the cmd_type field in struct request.
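
    The commit text doesn't name the helper, but a check along these lines
    is presumably what's meant (a sketch; REQ_TYPE_FS still exists at this
    point):

    static inline bool blk_rq_is_passthrough(struct request *rq)
    {
            return rq->cmd_type != REQ_TYPE_FS;
    }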

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig