23 Feb, 2017

1 commit

  • errata:
    When a read command returns less data than specified in the PRDs (for
    example, there are two PRDs for this command, but the device returns
    fewer bytes than the first PRD specifies), the second PRD of this
    command is not read out of the PRD FIFO, causing the next command to
    use this PRD erroneously.

    Workaround:
    - force sg_tablesize = 1
    - modify the sg_io function in block/scsi_ioctl.c to use a 64k buffer
    allocated with dma_alloc_coherent during the probe in ahci_imx
    - to fix the SCSI/SATA hang seen when the CD-ROM and HDD are accessed
    simultaneously after the workaround is applied, do not sleep in
    scsi_eh_handler when the host has failed commands
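
    As a rough sketch of the buffer part of this workaround (the sg_buf and
    sg_buf_dma fields are illustrative, not necessarily the driver's real
    names), the 64k bounce buffer can be set up at probe time like this:

    /* Illustrative only: allocate one 64 KiB DMA-coherent bounce buffer at
     * probe time so every sg_io() transfer fits in a single PRD. */
    #define IMX_AHCI_SG_BUF_SIZE    (64 * 1024)

    static int imx_ahci_alloc_sg_buf(struct device *dev,
                                     struct imx_ahci_priv *imxpriv)
    {
            imxpriv->sg_buf = dma_alloc_coherent(dev, IMX_AHCI_SG_BUF_SIZE,
                                                 &imxpriv->sg_buf_dma,
                                                 GFP_KERNEL);
            if (!imxpriv->sg_buf)
                    return -ENOMEM;
            return 0;
    }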

    Signed-off-by: Richard Zhu

    Richard Zhu
     

20 Jan, 2017

2 commits

  • commit c02ebfdddbafa9a6a0f52fbd715e6bfa229af9d3 upstream.

    Commit 0e87e58bf60e ("blk-mq: improve warning for running a queue on the
    wrong CPU") attempts to avoid triggering the WARN_ON in
    __blk_mq_run_hw_queue when the expected CPU is dead. Problem is, in the
    last batch execution before round robin, blk_mq_hctx_next_cpu can
    schedule a dead CPU and also update next_cpu to the next alive CPU in
    the mask, which will trigger the WARN_ON despite the previous
    workaround.

    The following patch fixes this scenario by always scheduling the value
    in hctx->next_cpu. This changes the moment when we round-robin the CPU
    running the hctx, but it really doesn't matter, since it still executes
    BLK_MQ_CPU_WORK_BATCH times in a row before switching to another CPU.
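
    A self-contained model of the scheduling rule described above
    (illustrative only, not the kernel code): the stored next_cpu is always
    the CPU we run on, and the round-robin step merely picks the value for
    the next batch.

    /* Illustrative model of the batched round-robin: always run on the
     * stored next_cpu, and only advance it after WORK_BATCH runs in a row.
     * batch_left starts at WORK_BATCH; assumes at least one alive CPU. */
    #define WORK_BATCH 8

    struct hctx_model {
            int next_cpu;
            int batch_left;
    };

    static int pick_run_cpu(struct hctx_model *h, const int *alive, int ncpus)
    {
            int cpu = h->next_cpu;          /* always schedule the stored value */

            if (--h->batch_left <= 0) {
                    int next = (cpu + 1) % ncpus;

                    while (!alive[next])    /* skip dead CPUs in the mask */
                            next = (next + 1) % ncpus;
                    h->next_cpu = next;
                    h->batch_left = WORK_BATCH;
            }
            return cpu;
    }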

    Fixes: 0e87e58bf60e ("blk-mq: improve warning for running a queue on the wrong CPU")
    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Gabriel Krisman Bertazi
     
  • commit ebc4ff661fbe76781c6b16dfb7b754a5d5073f8e upstream.

    cfq_cpd_alloc(), which is the cpd_alloc_fn implementation for cfq, was
    incorrectly hard-coding GFP_KERNEL instead of using the mask specified
    through the @gfp parameter. This currently doesn't cause any actual
    issues because all current callers specify GFP_KERNEL. Fix it.
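
    The shape of the fix, roughly (abridged from the description above;
    field names shortened): honor the caller's @gfp mask instead of
    hard-coding GFP_KERNEL.

    static struct blkcg_policy_data *cfq_cpd_alloc(gfp_t gfp)
    {
            struct cfq_group_data *cgd;

            cgd = kzalloc(sizeof(*cgd), gfp);       /* was: GFP_KERNEL */
            if (!cgd)
                    return NULL;
            return &cgd->cpd;
    }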

    Signed-off-by: Tejun Heo
    Reported-by: Dan Carpenter
    Fixes: e4a9bde9589f ("blkcg: replace blkcg_policy->cpd_size with ->cpd_alloc/free_fn() methods")
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

09 Jan, 2017

1 commit

  • commit 128394eff343fc6d2f32172f03e24829539c5835 upstream.

    Both damn things interpret userland pointers embedded into the payload;
    worse, they are actually traversing those. Leaving aside the bad
    API design, this is very much _not_ safe to call with KERNEL_DS.
    Bail out early if that happens.
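
    A hedged sketch of the kind of early bail-out described (the exact
    upstream check may differ slightly):

    /* Sketch: refuse to run when the caller has switched to a kernel
     * address space, since the payload embeds userland pointers. */
    if (unlikely(segment_eq(get_fs(), KERNEL_DS)))
            return -EINVAL;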

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     

06 Jan, 2017

1 commit

  • commit bc27c01b5c46d3bfec42c96537c7a3fae0bb2cc4 upstream.

    The meaning of the BLK_MQ_S_STOPPED flag is "do not call
    .queue_rq()". Hence modify blk_mq_make_request() such that requests
    are queued instead of issued if a queue has been stopped.
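
    Roughly, the behavioral change amounts to a check like the following in
    the submission path (a simplified sketch, not the exact diff):

    /* Sketch: if the hw queue is stopped, park the request on the software
     * queue instead of calling ->queue_rq() on it. */
    if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state))) {
            spin_lock(&ctx->lock);
            list_add_tail(&rq->queuelist, &ctx->rq_list);
            spin_unlock(&ctx->lock);
            return;
    }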

    Reported-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

27 Oct, 2016

1 commit

  • If we end up sleeping due to running out of requests, we should
    update the hardware and software queues in the map ctx structure.
    Otherwise we could end up having rq->mq_ctx point to the pre-sleep
    context, and risk corrupting ctx->rq_list since we'll be
    grabbing the wrong lock when inserting the request.
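
    In other words, after a blocking allocation the mapping has to be
    refreshed, roughly like this (helper names as used internally by blk-mq
    around this time; treat this as a sketch):

    /* Sketch: re-load the software and hardware queue for the CPU we woke
     * up on, so rq->mq_ctx matches the lock we will take on insert. */
    data->ctx  = blk_mq_get_ctx(q);
    data->hctx = blk_mq_map_queue(q, data->ctx->cpu);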

    Reported-by: Dave Jones
    Reported-by: Chris Mason
    Tested-by: Chris Mason
    Fixes: 63581af3f31e ("blk-mq: remove non-blocking pass in blk_mq_map_request")
    Signed-off-by: Jens Axboe

    Jens Axboe
     

22 Oct, 2016

2 commits

  • When badblocks_set acknowledges a range or badblocks_clear clears a
    range, it's possible that all badblocks end up acknowledged. We should
    update unacked_exist if this occurs.
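
    A sketch of the bookkeeping this implies, using the BB_ACK() accessor on
    the packed entries (the helper name here is illustrative):

    /* Illustrative helper: recompute unacked_exist after an ack or clear. */
    static void badblocks_update_acked_sketch(struct badblocks *bb)
    {
            u64 *p = bb->page;
            bool unacked = false;
            int i;

            for (i = 0; i < bb->count; i++) {
                    if (!BB_ACK(p[i])) {
                            unacked = true;
                            break;
                    }
            }
            bb->unacked_exist = unacked;
    }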

    Signed-off-by: Shaohua Li
    Reviewed-by: Tomasz Majchrzak
    Tested-by: Tomasz Majchrzak
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Pull block fixes from Jens Axboe:
    "A set of fixes that missed the merge window, mostly due to me being
    away around that time.

    Nothing major here, a mix of nvme cleanups and fixes, and one fix for
    the badblocks handling"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    nvmet: use symbolic constants for CNS values
    nvme: use symbolic constants for CNS values
    nvme.h: add an enum for cns values
    nvme.h: don't use uuid_be
    nvme.h: resync with nvme-cli
    nvme: Add tertiary number to NVME_VS
    nvme : Add sysfs entry for NVMe CMBs when appropriate
    nvme: don't schedule multiple resets
    nvme: Delete created IO queues on reset
    nvme: Stop probing a removed device
    badblocks: fix overlapping check for clearing

    Linus Torvalds
     

16 Oct, 2016

1 commit

  • Pull gcc plugins update from Kees Cook:
    "This adds a new gcc plugin named "latent_entropy". It is designed to
    extract as much uncertainty from a running system at boot time as
    possible, hoping to capitalize on any possible variation in
    CPU operation (due to runtime data differences, hardware differences,
    SMP ordering, thermal timing variation, cache behavior, etc).

    At the very least, this plugin is a much more comprehensive example
    for how to manipulate kernel code using the gcc plugin internals"

    * tag 'gcc-plugins-v4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
    latent_entropy: Mark functions with __latent_entropy
    gcc-plugins: Add latent_entropy plugin

    Linus Torvalds
     

15 Oct, 2016

1 commit

  • Pull cgroup updates from Tejun Heo:

    - tracepoints for basic cgroup management operations added

    - kernfs and cgroup path formatting functions updated to behave in the
    style of strlcpy()

    - non-critical bug fixes

    * 'for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    blkcg: Unlock blkcg_pol_mutex only once when cpd == NULL
    cgroup: fix error handling regressions in proc_cgroup_show() and cgroup_release_agent()
    cpuset: fix error handling regression in proc_cpuset_show()
    cgroup: add tracepoints for basic operations
    cgroup: make cgroup_path() and friends behave in the style of strlcpy()
    kernfs: remove kernfs_path_len()
    kernfs: make kernfs_path*() behave in the style of strlcpy()
    kernfs: add dummy implementation of kernfs_path_from_node()

    Linus Torvalds
     

12 Oct, 2016

3 commits

    The current bad block clear implementation assumes the range to clear
    overlaps with at least one bad block already stored. If the given range
    to clear precedes the first bad block in the list, the first entry is
    incorrectly updated.

    Check not only if stored block end is past clear block end but also if
    stored block start is before clear block end.
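
    The corrected overlap test, as a self-contained sketch (half-open
    [start, end) ranges used purely for illustration):

    /* A stored bad range and a clear request overlap only if the stored
     * range starts before the clear range ends AND ends after it starts. */
    static int ranges_overlap(unsigned long long stored_start,
                              unsigned long long stored_end,
                              unsigned long long clear_start,
                              unsigned long long clear_end)
    {
            return stored_start < clear_end && stored_end > clear_start;
    }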

    Signed-off-by: Tomasz Majchrzak
    Acked-by: NeilBrown
    Signed-off-by: Jens Axboe

    Tomasz Majchrzak
     
  • Make sure that the offset and length arguments that we're using to
    construct WRITE SAME and DISCARD requests are actually aligned to the
    logical block size. Failure to do this causes errors elsewhere in the
    block layer or the SCSI layer, because disks don't support partial
    logical block writes.
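
    A sketch of the alignment test being added (assumes the logical block
    size is a power of two, which holds for disks):

    /* Reject offsets/lengths that are not multiples of the logical block
     * size; 512-byte alignment alone is not enough on 4k-LBA devices. */
    static int check_lba_alignment(unsigned long long offset,
                                   unsigned long long len,
                                   unsigned int logical_block_size)
    {
            if ((offset | len) & (logical_block_size - 1))
                    return -EINVAL;         /* misaligned */
            return 0;
    }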

    Link: http://lkml.kernel.org/r/147518379026.22791.4437508871355153928.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Cc: Theodore Ts'o
    Cc: Mike Snitzer # tweaked header
    Cc: Brian Foster
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     
  • Patch series "fallocate for block devices", v11.

    This is a patchset to fix page cache coherency with BLKZEROOUT and
    implement fallocate for block devices.

    The first patch is a fix to the existing BLKZEROOUT ioctl to invalidate
    the page cache if the zeroing command to the underlying device succeeds.
    Without this patch we still have the pagecache coherence bug that's been
    in the kernel forever.

    The second patch changes the internal block device functions to reject
    attempts to discard or zeroout that are not aligned to the logical block
    size. Previously, we only checked that the start/len parameters were
    512-byte aligned, which caused kernel BUG_ONs for unaligned IOs to 4k-LBA
    devices.

    The third patch creates an fallocate handler for block devices, wires up
    the FALLOC_FL_PUNCH_HOLE flag to zeroing-discard, and connects
    FALLOC_FL_ZERO_RANGE to write-same so that we can have a consistent
    fallocate interface between files and block devices. It also allows the
    combination of PUNCH_HOLE and NO_HIDE_STALE to invoke non-zeroing discard.
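
    For example, with this series applied userspace can punch a hole in a
    block device just as it would in a file (a sketch; the device path is a
    placeholder):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/dev/sdX", O_RDWR);  /* placeholder device */

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            /* Discard (and zero) the first 1 MiB; KEEP_SIZE is required
             * together with PUNCH_HOLE. */
            if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          0, 1 << 20) < 0)
                    perror("fallocate");
            close(fd);
            return 0;
    }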

    Test cases for the new block device fallocate are now in xfstests as
    generic/349-351.

    This patch (of 3):

    Invalidate the page cache (as a regular O_DIRECT write would do) to avoid
    returning stale cache contents at a later time.

    Link: http://lkml.kernel.org/r/147518378313.22791.16649519283678515021.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Cc: Theodore Ts'o
    Cc: Mike Snitzer
    Cc: Brian Foster
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

11 Oct, 2016

1 commit

  • The __latent_entropy gcc attribute can be used only on functions and
    variables. If it is on a function then the plugin will instrument it for
    gathering control-flow entropy. If the attribute is on a variable then
    the plugin will initialize it with random contents. The variable must
    be an integer, an integer array type or a structure with integer fields.
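
    For illustration, the attribute is applied like this (the names here are
    made up; the real annotations are added in the companion patch):

    /* On a variable: the plugin initializes it with random contents at
     * build time. */
    static unsigned long example_pool[4] __latent_entropy;

    /* On a function: the plugin instruments its control flow to stir
     * additional entropy into the pool as it runs. */
    static int __init __latent_entropy example_init(void)
    {
            return 0;
    }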

    These specific functions have been selected because they are init
    functions (to help gather boot-time entropy), are called at unpredictable
    times, or they have variable loops, each of which provides some level of
    latent entropy.

    Signed-off-by: Emese Revfy
    [kees: expanded commit message]
    Signed-off-by: Kees Cook

    Emese Revfy
     

10 Oct, 2016

2 commits

  • Pull blk-mq CPU hotplug update from Jens Axboe:
    "This is the conversion of blk-mq to the new hotplug state machine"

    * 'for-4.9/block-smp' of git://git.kernel.dk/linux-block:
    blk-mq: fixup "Convert to new hotplug state machine"
    blk-mq: Convert to new hotplug state machine
    blk-mq/cpu-notif: Convert to new hotplug state machine

    Linus Torvalds
     
  • Pull blk-mq irq/cpu mapping updates from Jens Axboe:
    "This is the block-irq topic branch for 4.9-rc. It's mostly from
    Christoph, and it allows drivers to specify their own mappings, and
    more importantly, to share the blk-mq mappings with the IRQ affinity
    mappings. It's a good step towards making this work better out of the
    box"

    * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
    blk_mq: linux/blk-mq.h does not include all the headers it depends on
    blk-mq: kill unused blk_mq_create_mq_map()
    blk-mq: get rid of the cpumask in struct blk_mq_tags
    nvme: remove the post_scan callout
    nvme: switch to use pci_alloc_irq_vectors
    blk-mq: provide a default queue mapping for PCI device
    blk-mq: allow the driver to pass in a queue mapping
    blk-mq: remove ->map_queue
    blk-mq: only allocate a single mq_map per tag_set
    blk-mq: don't redistribute hardware queues on a CPU hotplug event

    Linus Torvalds
     

08 Oct, 2016

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main pull request for block layer changes in 4.9.

    As mentioned at the last merge window, I've changed things up and now
    do just one branch for core block layer changes, and driver changes.
    This avoids dependencies between the two branches. Outside of this
    main pull request, there are two topical branches coming as well.

    This pull request contains:

    - A set of fixes, and a conversion to blk-mq, of nbd. From Josef.

    - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
    Followup dependency fix from Geert.

    - General fixes from Bart, Baoyou, Guoqing, and Linus W.

    - CFQ async write starvation fix from Glauber.

    - Add support for delayed kick of the requeue list, from Mike.

    - Pull out the scalable bitmap code from blk-mq-tag.c and make it
    generally available under the name of sbitmap. Only blk-mq-tag uses
    it for now, but the blk-mq scheduling bits will use it as well.
    From Omar.

    - bdev thaw error propagation from Pierre.

    - Improve the blk polling statistics, and allow the user to clear
    them. From Stephen.

    - Set of minor cleanups from Christoph in block/blk-mq.

    - Set of cleanups and optimizations from me for block/blk-mq.

    - Various nvme/nvmet/nvmeof fixes from the various folks"

    * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
    fs/block_dev.c: return the right error in thaw_bdev()
    nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
    nvme/scsi: Remove power management support
    nvmet: Make dsm number of ranges zero based
    nvmet: Use direct IO for writes
    admin-cmd: Added smart-log command support.
    nvme-fabrics: Add host_traddr options field to host infrastructure
    nvme-fabrics: revise host transport option descriptions
    nvme-fabrics: rework nvmf_get_address() for variable options
    nbd: use BLK_MQ_F_BLOCKING
    blkcg: Annotate blkg_hint correctly
    cfq: fix starvation of asynchronous writes
    blk-mq: add flag for drivers wanting blocking ->queue_rq()
    blk-mq: remove non-blocking pass in blk_mq_map_request
    blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
    block: export bio_free_pages to other modules
    lightnvm: propagate device_add() error code
    lightnvm: expose device geometry through sysfs
    lightnvm: control life of nvm_dev in driver
    blk-mq: register device instead of disk
    ...

    Linus Torvalds
     

04 Oct, 2016

1 commit

  • Pull CPU hotplug updates from Thomas Gleixner:
    "Yet another batch of cpu hotplug core updates and conversions:

    - Provide core infrastructure for multi instance drivers so the
    drivers do not have to keep custom lists.

    - Convert custom lists to the new infrastructure. The block-mq custom
    list conversion comes through the block tree and makes the diffstat
    tip over to more lines removed than added.

    - Handle unbalanced hotplug enable/disable calls more gracefully.

    - Remove the obsolete CPU_STARTING/DYING notifier support.

    - Convert another batch of notifier users.

    The relayfs changes which conflicted with the conversion have been
    shipped to me by Andrew.

    The remaining lot is targeted for 4.10 so that we finally can remove
    the rest of the notifiers"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    cpufreq: Fix up conversion to hotplug state machine
    blk/mq: Reserve hotplug states for block multiqueue
    x86/apic/uv: Convert to hotplug state machine
    s390/mm/pfault: Convert to hotplug state machine
    mips/loongson/smp: Convert to hotplug state machine
    mips/octeon/smp: Convert to hotplug state machine
    fault-injection/cpu: Convert to hotplug state machine
    padata: Convert to hotplug state machine
    cpufreq: Convert to hotplug state machine
    ACPI/processor: Convert to hotplug state machine
    virtio scsi: Convert to hotplug state machine
    oprofile/timer: Convert to hotplug state machine
    block/softirq: Convert to hotplug state machine
    lib/irq_poll: Convert to hotplug state machine
    x86/microcode: Convert to hotplug state machine
    sh/SH-X3 SMP: Convert to hotplug state machine
    ia64/mca: Convert to hotplug state machine
    ARM/OMAP/wakeupgen: Convert to hotplug state machine
    ARM/shmobile: Convert to hotplug state machine
    arm64/FP/SIMD: Convert to hotplug state machine
    ...

    Linus Torvalds
     

30 Sep, 2016

1 commit

  • Unlocking a mutex twice is wrong. Hence modify blkcg_policy_register()
    such that blkcg_pol_mutex is unlocked once if cpd == NULL. This patch
    prevents smatch from reporting the following error:

    block/blk-cgroup.c:1378: blkcg_policy_register() error: double unlock 'mutex:&blkcg_pol_mutex'
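
    The hazard reduces to the usual goto-cleanup pattern; a simplified
    sketch of the corrected flow (labels abridged):

    /* Simplified shape of the corrected error path: every failure branch
     * reaches exactly one mutex_unlock(). */
    mutex_lock(&blkcg_pol_mutex);

    cpd = pol->cpd_alloc_fn(GFP_KERNEL);
    if (!cpd) {
            ret = -ENOMEM;
            goto err_free_cpds;             /* falls through to the unlock */
    }

    /* ... register the policy ... */
    mutex_unlock(&blkcg_pol_mutex);
    return 0;

    err_free_cpds:
            /* free any cpds allocated for earlier blkcgs */
    err_unlock:
            mutex_unlock(&blkcg_pol_mutex);
            return ret;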

    Fixes: 06b285bd1125 ("blkcg: fix blkcg_policy_data allocation bug")
    Signed-off-by: Bart Van Assche
    Cc: Tejun Heo
    Cc: # v4.2+
    Signed-off-by: Tejun Heo

    Bart Van Assche
     

24 Sep, 2016

2 commits

  • This provides the caller with feedback that a given hctx is not mapped,
    and thus no command can be sent on it.

    Signed-off-by: Christoph Hellwig
    Tested-by: Steve Wise
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • While debugging timeouts in my application workload (ScyllaDB), I have
    observed calls to open() taking a long time, ranging from 2 seconds (the
    first ones already enough to time out my application) to more than 30
    seconds.

    The problem seems to happen because XFS may block on pending metadata
    updates under certain circumstances, and that's confirmed with the
    following backtrace taken by the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

    Inspecting my run with blktrace, I can see that the xfsaild kthread
    exhibits very high "Dispatch wait" times, in the tens-of-seconds range,
    consistent with the open() times I saw in that run.

    Still from the blktrace output, after searching a bit we can identify
    the request that wasn't dispatched:

    8,0 11 152 81.092472813 804 A WM 141698288 + 8
    8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]

    As we can see above, in this particular example CFQ took 15 seconds to dispatch
    this request. Going back to the full trace, we can see that the xfsaild queue
    had plenty of opportunity to run, and it was selected as the active queue many
    times. It would just always be preempted by something else (example):

    8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
    8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
    8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
    8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
    8,0 1 0 81.117914755 0 m N cfq767A / resid=40
    8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
    8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0

    where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
    IO dispatchers.

    The requests preempting the xfsaild queue are synchronous requests. That's a
    characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
    While it can be argued that preempting ASYNC requests in favor of SYNC is part
    of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
    goal.

    Moreover, unless I am misunderstanding something, that breaks the expectation
    set by the "fifo_expire_async" tunable, which in my system is set to the
    default.

    Looking at the code, it seems to me that the issue is that after we make
    an async queue active, there is no guarantee that it will execute any request.

    When the queue itself tests cfq_may_dispatch(), it can bail if it sees
    SYNC requests in flight. An incoming request from another queue can also
    preempt it in such a situation before we have the chance to execute
    anything (as seen in the trace above).

    This patch sets the must_dispatch flag if we notice that we have requests
    that are already fifo_expired. This flag is always cleared after
    cfq_dispatch_request() returns from cfq_dispatch_requests(), so it won't
    pin the queue for subsequent requests (unless they are themselves expired).

    Care is taken during preempt to still allow rt requests to preempt us
    regardless.
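
    The core of the change, sketched (cfq_check_fifo() returns a request
    that has already fifo-expired, if any; the mark helper is the usual CFQ
    queue-flag accessor):

    /* Sketch: while selecting the active queue, if it already holds a
     * fifo-expired request, mark it so later preemption checks let it
     * dispatch at least that request. */
    rq = cfq_check_fifo(cfqq);
    if (rq)
            cfq_mark_cfqq_must_dispatch(cfqq);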

    Testing my workload with this patch applied produces much better results.
    From the application side I see no timeouts, and the open() latency histogram
    generated by systemtap looks much better, with the worst outlier at 131ms:

    Latency histogram of xfs_buf_lock acquisition (microseconds):
    value |-------------------------------------------------- count
    0 | 11
    1 |@@@@ 161
    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
    4 |@ 54
    8 | 36
    16 | 7
    32 | 0
    64 | 0
    ~
    1024 | 0
    2048 | 0
    4096 | 1
    8192 | 1
    16384 | 2
    32768 | 0
    65536 | 0
    131072 | 1
    262144 | 0
    524288 | 0

    Signed-off-by: Glauber Costa
    CC: Jens Axboe
    CC: linux-block@vger.kernel.org
    CC: linux-kernel@vger.kernel.org

    Signed-off-by: Glauber Costa
    Signed-off-by: Jens Axboe

    Glauber Costa
     

22 Sep, 2016

4 commits

  • Two cases:

    1) blk_mq_alloc_request() needlessly re-runs the queue, after
    calling into the tag allocation without NOWAIT set. We don't
    need to do that.

    2) blk_mq_map_request() should just use blk_mq_run_hw_queue() with
    the async flag set to false.

    Signed-off-by: Jens Axboe
    Reviewed-by: Christoph Hellwig

    Jens Axboe
     
  • Install the callbacks via the state machine so we can phase out the cpu
    hotplug notifiers mess.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: linux-block@vger.kernel.org
    Cc: rt@linutronix.de
    Cc: Christoph Hellwig
    Link: http://lkml.kernel.org/r/20160919212601.180033814@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jens Axboe

    Sebastian Andrzej Siewior
     
  • Replace the block-mq notifier list management with the multi instance
    facility in the cpu hotplug state machine.
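
    For reference, the multi-instance pattern looks roughly like this (the
    blk-mq state and callback names shown are approximate):

    /* Sketch: register the per-subsystem callback once at init time ... */
    cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead",
                            NULL, blk_mq_hctx_notify_dead);

    /* ... then add one instance per hardware queue instead of keeping a
     * private notifier list. */
    cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD, &hctx->cpuhp_dead);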

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: linux-block@vger.kernel.org
    Cc: rt@linutronix.de
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Thomas Gleixner
     
  • bio_free_pages was introduced in commit 1dfa0f68c040
    ("block: add a helper to free bio bounce buffer pages");
    we can reuse the function in other modules once it is
    exported.
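
    Typical use once the symbol is exported (a hedged example; the
    surrounding driver context is made up):

    /* Free every page attached to a bio we allocated ourselves, then drop
     * the bio reference. */
    bio_free_pages(bio);
    bio_put(bio);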

    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Mike Snitzer
    Cc: Shaohua Li
    Signed-off-by: Guoqing Jiang
    Acked-by: Kent Overstreet
    Signed-off-by: Jens Axboe

    Guoqing Jiang
     

20 Sep, 2016

2 commits

  • Right now, if the slice is expired, we start a new slice. If a bio is
    queued, we keep extending the slice by the throtl_slice interval (100ms).

    This worked well as long as the pending timer function got executed
    within a few milliseconds of its scheduled time. But it looks like with
    recent changes in the timer subsystem, the slack can be much longer,
    depending on the expiry time of the scheduled timer.

    commit 500462a9de65 ("timers: Switch to a non-cascading wheel")

    This means that by the time the timer function gets executed, the delay
    from the scheduled time can be more than 100ms. The current code will
    then conclude that the existing slice has expired and a new one needs to
    be started. The new slice will be 100ms by default, which will not be
    sufficient to meet the rate requirement of the group given the bio size,
    so the bio will not be dispatched and we will start a new timer to wait.
    When that timer expires, the same process repeats and we wait again;
    this can easily become an infinite loop.

    Solve this issue by starting a new slice only if the throttle group is
    empty. If it is not empty, there should be an active slice going on.
    Ideally it should not be expired, but given the slack it is possible
    that it has expired.
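
    A self-contained sketch of the decision being changed (names are
    illustrative, not the blk-throttle code):

    #define THROTL_SLICE_MS 100

    struct tg_model {
            unsigned long slice_start;
            unsigned long slice_end;
            unsigned int  nr_queued;
    };

    /* Start a fresh slice only when nothing is queued; otherwise extend the
     * slice that the queued bio is already waiting on, even if the timer
     * fired late and the slice looks expired. */
    static void update_slice(struct tg_model *tg, unsigned long now)
    {
            if (tg->nr_queued == 0) {
                    tg->slice_start = now;
                    tg->slice_end   = now + THROTL_SLICE_MS;
            } else if (tg->slice_end < now + THROTL_SLICE_MS) {
                    tg->slice_end   = now + THROTL_SLICE_MS;
            }
    }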

    Reported-by: Hou Tao
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Install the callbacks via the state machine.

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Jens Axboe
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160906170457.32393-9-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

17 Sep, 2016

5 commits

  • In order to get good cache behavior from a sbitmap, we want each CPU to
    stick to its own cacheline(s) as much as possible. This might happen
    naturally as the bitmap gets filled up and the alloc_hint values spread
    out, but we really want this behavior from the start. blk-mq apparently
    intended to do this, but the code to do this was never wired up. Get rid
    of the dead code and make it part of the sbitmap library.
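
    A self-contained illustration of the per-CPU hint idea (not the sbitmap
    code itself): each CPU resumes its search where it last succeeded, so
    different CPUs tend to stay on different words of the bitmap.

    /* Toy word-granular bitmap: start scanning at this CPU's hint so CPUs
     * keep hammering different words instead of all starting at word 0. */
    static int alloc_bit(unsigned long *words, int nr_words,
                         unsigned int *hint /* per-CPU */)
    {
            int i;

            for (i = 0; i < nr_words; i++) {
                    int w = (*hint + i) % nr_words;

                    if (~words[w]) {                /* a zero bit exists */
                            int b = __builtin_ctzl(~words[w]);

                            words[w] |= 1UL << b;
                            *hint = w;              /* stay on this word next time */
                            return w * (int)(8 * sizeof(long)) + b;
                    }
            }
            return -1;                              /* bitmap is full */
    }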

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Again, there's no point in passing this in every time. Make it part of
    struct sbitmap_queue and clean up the API.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Allocating your own per-cpu allocation hint separately makes for an
    awkward API. Instead, allocate the per-cpu hint as part of the struct
    sbitmap_queue. There's no point in a struct sbitmap_queue without the
    cache, but you can still use a bare struct sbitmap.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • This is a generally useful data structure, so make it available to
    anyone else who might want to use it. It's also a nice cleanup
    separating the allocation logic from the rest of the tag handling logic.

    The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
    selected by CONFIG_BLOCK for now.

    This should be a complete noop functionality-wise.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • We currently account a '0' dispatch, and anything above that still falls
    below the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more,
    we don't account it.

    Change the last bucket to be inclusive of anything above the range we
    track, and have the sysfs file reflect that by including a '+' in the
    output:

    $ cat /sys/block/nvme0n1/mq/0/dispatched
    0 1006
    1 20229
    2 1
    4 0
    8 0
    16 0
    32+ 0
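
    The bucketing rule behind the output above, as an illustrative sketch
    (the constant is approximate): anything at or above the top of the
    tracked range is folded into the last, now inclusive, bucket.

    #define MAX_DISPATCH_ORDER 7    /* buckets: 0, 1, 2, 4, ..., 32+ */

    /* Map a per-run dispatch count to a histogram bucket; counts beyond the
     * tracked range all land in the final "32+" bucket. */
    static int dispatch_bucket(unsigned int dispatched)
    {
            int bucket = 0;

            while (bucket < MAX_DISPATCH_ORDER - 1 &&
                   dispatched >= (1u << bucket))
                    bucket++;
            return bucket;
    }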

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

15 Sep, 2016

1 commit