10 Oct, 2016

2 commits

  • Pull blk-mq CPU hotplug update from Jens Axboe:
    "This is the conversion of blk-mq to the new hotplug state machine"

    * 'for-4.9/block-smp' of git://git.kernel.dk/linux-block:
    blk-mq: fixup "Convert to new hotplug state machine"
    blk-mq: Convert to new hotplug state machine
    blk-mq/cpu-notif: Convert to new hotplug state machine

    Linus Torvalds
     
  • Pull blk-mq irq/cpu mapping updates from Jens Axboe:
    "This is the block-irq topic branch for 4.9-rc. It's mostly from
    Christoph, and it allows drivers to specify their own mappings, and
    more importantly, to share the blk-mq mappings with the IRQ affinity
    mappings. It's a good step towards making this work better out of the
    box"

    * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
    blk_mq: linux/blk-mq.h does not include all the headers it depends on
    blk-mq: kill unused blk_mq_create_mq_map()
    blk-mq: get rid of the cpumask in struct blk_mq_tags
    nvme: remove the post_scan callout
    nvme: switch to use pci_alloc_irq_vectors
    blk-mq: provide a default queue mapping for PCI device
    blk-mq: allow the driver to pass in a queue mapping
    blk-mq: remove ->map_queue
    blk-mq: only allocate a single mq_map per tag_set
    blk-mq: don't redistribute hardware queues on a CPU hotplug event

    Linus Torvalds
     

08 Oct, 2016

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the main pull request for block layer changes in 4.9.

    As mentioned at the last merge window, I've changed things up and now
    do just one branch for both core block layer changes and driver
    changes.
    This avoids dependencies between the two branches. Outside of this
    main pull request, there are two topical branches coming as well.

    This pull request contains:

    - A set of fixes for nbd, and its conversion to blk-mq. From Josef.

    - Set of fixes and updates for lightnvm from Matias, Simon, and Arnd.
    Followup dependency fix from Geert.

    - General fixes from Bart, Baoyou, Guoqing, and Linus W.

    - CFQ async write starvation fix from Glauber.

    - Add support for delayed kick of the requeue list, from Mike.

    - Pull out the scalable bitmap code from blk-mq-tag.c and make it
    generally available under the name of sbitmap. Only blk-mq-tag uses
    it for now, but the blk-mq scheduling bits will use it as well.
    From Omar.

    - bdev thaw error propagation from Pierre.

    - Improve the blk polling statistics, and allow the user to clear
    them. From Stephen.

    - Set of minor cleanups from Christoph in block/blk-mq.

    - Set of cleanups and optimizations from me for block/blk-mq.

    - Various nvme/nvmet/nvmeof fixes from the various folks"

    * 'for-4.9/block' of git://git.kernel.dk/linux-block: (54 commits)
    fs/block_dev.c: return the right error in thaw_bdev()
    nvme: Pass pointers, not dma addresses, to nvme_get/set_features()
    nvme/scsi: Remove power management support
    nvmet: Make dsm number of ranges zero based
    nvmet: Use direct IO for writes
    admin-cmd: Added smart-log command support.
    nvme-fabrics: Add host_traddr options field to host infrastructure
    nvme-fabrics: revise host transport option descriptions
    nvme-fabrics: rework nvmf_get_address() for variable options
    nbd: use BLK_MQ_F_BLOCKING
    blkcg: Annotate blkg_hint correctly
    cfq: fix starvation of asynchronous writes
    blk-mq: add flag for drivers wanting blocking ->queue_rq()
    blk-mq: remove non-blocking pass in blk_mq_map_request
    blk-mq: get rid of manual run of queue with __blk_mq_run_hw_queue()
    block: export bio_free_pages to other modules
    lightnvm: propagate device_add() error code
    lightnvm: expose device geometry through sysfs
    lightnvm: control life of nvm_dev in driver
    blk-mq: register device instead of disk
    ...

    Linus Torvalds
     

04 Oct, 2016

1 commit

  • Pull CPU hotplug updates from Thomas Gleixner:
    "Yet another batch of cpu hotplug core updates and conversions:

    - Provide core infrastructure for multi instance drivers so the
    drivers do not have to keep custom lists.

    - Convert custom lists to the new infrastructure. The block-mq custom
    list conversion comes through the block tree and makes the diffstat
    tip over to more lines removed than added.

    - Handle unbalanced hotplug enable/disable calls more gracefully.

    - Remove the obsolete CPU_STARTING/DYING notifier support.

    - Convert another batch of notifier users.

    The relayfs changes which conflicted with the conversion have been
    shipped to me by Andrew.

    The remaining lot is targeted for 4.10 so that we finally can remove
    the rest of the notifiers"

    * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
    cpufreq: Fix up conversion to hotplug state machine
    blk/mq: Reserve hotplug states for block multiqueue
    x86/apic/uv: Convert to hotplug state machine
    s390/mm/pfault: Convert to hotplug state machine
    mips/loongson/smp: Convert to hotplug state machine
    mips/octeon/smp: Convert to hotplug state machine
    fault-injection/cpu: Convert to hotplug state machine
    padata: Convert to hotplug state machine
    cpufreq: Convert to hotplug state machine
    ACPI/processor: Convert to hotplug state machine
    virtio scsi: Convert to hotplug state machine
    oprofile/timer: Convert to hotplug state machine
    block/softirq: Convert to hotplug state machine
    lib/irq_poll: Convert to hotplug state machine
    x86/microcode: Convert to hotplug state machine
    sh/SH-X3 SMP: Convert to hotplug state machine
    ia64/mca: Convert to hotplug state machine
    ARM/OMAP/wakeupgen: Convert to hotplug state machine
    ARM/shmobile: Convert to hotplug state machine
    arm64/FP/SIMD: Convert to hotplug state machine
    ...

    Linus Torvalds
     

24 Sep, 2016

2 commits

  • This gives the caller feedback that a given hctx is not mapped, and
    thus that no command can be sent on it.

    Signed-off-by: Christoph Hellwig
    Tested-by: Steve Wise
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • While debugging timeouts happening in my application workload
    (ScyllaDB), I observed calls to open() taking a long time, ranging
    from 2 seconds - the first ones already enough to time out my
    application - to more than 30 seconds.

    The problem seems to happen because XFS may block on pending metadata
    updates under certain circumstances, and that's confirmed by the
    following backtrace taken with the offcputime tool (iovisor/bcc):

    ffffffffb90c57b1 finish_task_switch
    ffffffffb97dffb5 schedule
    ffffffffb97e310c schedule_timeout
    ffffffffb97e1f12 __down
    ffffffffb90ea821 down
    ffffffffc046a9dc xfs_buf_lock
    ffffffffc046abfb _xfs_buf_find
    ffffffffc046ae4a xfs_buf_get_map
    ffffffffc046babd xfs_buf_read_map
    ffffffffc0499931 xfs_trans_read_buf_map
    ffffffffc044a561 xfs_da_read_buf
    ffffffffc0451390 xfs_dir3_leaf_read.constprop.16
    ffffffffc0452b90 xfs_dir2_leaf_lookup_int
    ffffffffc0452e0f xfs_dir2_leaf_lookup
    ffffffffc044d9d3 xfs_dir_lookup
    ffffffffc047d1d9 xfs_lookup
    ffffffffc0479e53 xfs_vn_lookup
    ffffffffb925347a path_openat
    ffffffffb9254a71 do_filp_open
    ffffffffb9242a94 do_sys_open
    ffffffffb9242b9e sys_open
    ffffffffb97e42b2 entry_SYSCALL_64_fastpath
    00007fb0698162ed [unknown]

    Inspecting my run with blktrace, I can see that the xfsaild kthread
    exhibits very high "Dispatch wait" times, in the tens-of-seconds
    range, consistent with the open() times I saw in that run.

    Still in the blktrace output, after searching a bit we can identify
    the request that wasn't dispatched:

    8,0 11 152 81.092472813 804 A WM 141698288 + 8
    8,0 0 289372 96.718761435 0 D WM 141698288 + 8 (15626265317) [swapper/0]

    As we can see above, in this particular example CFQ took 15 seconds to dispatch
    this request. Going back to the full trace, we can see that the xfsaild queue
    had plenty of opportunity to run, and it was selected as the active queue many
    times. It would just always be preempted by something else (example):

    8,0 1 0 81.117912979 0 m N cfq1618SN / insert_request
    8,0 1 0 81.117913419 0 m N cfq1618SN / add_to_rr
    8,0 1 0 81.117914044 0 m N cfq1618SN / preempt
    8,0 1 0 81.117914398 0 m N cfq767A / slice expired t=1
    8,0 1 0 81.117914755 0 m N cfq767A / resid=40
    8,0 1 0 81.117915340 0 m N / served: vt=1948520448 min_vt=1948520448
    8,0 1 0 81.117915858 0 m N cfq767A / sl_used=1 disp=0 charge=0 iops=1 sect=0

    where cfq767 is the xfsaild queue and cfq1618 corresponds to one of the ScyllaDB
    IO dispatchers.

    The requests preempting the xfsaild queue are synchronous requests. That's a
    characteristic of ScyllaDB workloads, as we only ever issue O_DIRECT requests.
    While it can be argued that preempting ASYNC requests in favor of SYNC is part
    of the CFQ logic, I don't believe that doing so for 15+ seconds is anyone's
    goal.

    Moreover, unless I am misunderstanding something, that breaks the expectation
    set by the "fifo_expire_async" tunable, which in my system is set to the
    default.

    Looking at the code, it seems to me that the issue is that after we make
    an async queue active, there is no guarantee that it will execute any request.

    When the queue itself tests cfq_may_dispatch(), it can bail if it sees
    SYNC requests in flight. An incoming request from another queue can
    also preempt it in such a situation before we have the chance to
    execute anything (as seen in the trace above).

    This patch sets the must_dispatch flag if we notice that we have
    requests that are already fifo_expired. This flag is always cleared
    after cfq_dispatch_request() returns from cfq_dispatch_requests(), so
    it won't pin the queue for subsequent requests (unless they are
    themselves expired).

    Care is taken during preempt to still allow rt requests to preempt us
    regardless.
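
    In code, the shape of the change is roughly the following (a sketch
    reconstructed from the description above; the exact hunks in
    cfq-iosched.c may differ):

    /* In cfq_select_queue(): if the head-of-fifo request has already
     * expired, mark the queue so it cannot be skipped or preempted. */
    rq = cfq_check_fifo(cfqq);
    if (rq)
            cfq_mark_cfqq_must_dispatch(cfqq);

    /* In cfq_may_dispatch(): an async queue normally yields to SYNC IO
     * in flight, but not while it holds an expired request. */
    if (cfqd->rq_in_flight[BLK_RW_SYNC] && !cfq_cfqq_sync(cfqq) &&
        !cfq_cfqq_must_dispatch(cfqq))
            return false;

    /* In cfq_dispatch_requests(): drop the flag right after a single
     * dispatch, so it never pins the queue. The rt class check sits
     * above the must_dispatch check in cfq_should_preempt(), so rt
     * requests can still preempt us. */
    cfq_clear_cfqq_must_dispatch(cfqq);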

    Testing my workload with this patch applied produces much better results.
    From the application side I see no timeouts, and the open() latency histogram
    generated by systemtap looks much better, with the worst outlier at 131ms:

    Latency histogram of xfs_buf_lock acquisition (microseconds):
    value |-------------------------------------------------- count
    0 | 11
    1 |@@@@ 161
    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1966
    4 |@ 54
    8 | 36
    16 | 7
    32 | 0
    64 | 0
    ~
    1024 | 0
    2048 | 0
    4096 | 1
    8192 | 1
    16384 | 2
    32768 | 0
    65536 | 0
    131072 | 1
    262144 | 0
    524288 | 0

    Signed-off-by: Glauber Costa
    CC: Jens Axboe
    CC: linux-block@vger.kernel.org
    CC: linux-kernel@vger.kernel.org

    Signed-off-by: Glauber Costa
    Signed-off-by: Jens Axboe

    Glauber Costa
     

23 Sep, 2016

3 commits


22 Sep, 2016

4 commits

  • Two cases:

    1) blk_mq_alloc_request() needlessly re-runs the queue, after
    calling into the tag allocation without NOWAIT set. We don't
    need to do that.

    2) blk_mq_map_request() should just use blk_mq_run_hw_queue() with
    the async flag set to false.

    Signed-off-by: Jens Axboe
    Reviewed-by: Christoph Hellwig

    Jens Axboe
     
  • Install the callbacks via the state machine so we can phase out the
    cpu hotplug notifier mess.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: linux-block@vger.kernel.org
    Cc: rt@linutronix.de
    Cc: Christoph Hellwig
    Link: http://lkml.kernel.org/r/20160919212601.180033814@linutronix.de
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Jens Axboe

    Sebastian Andrzej Siewior
     
  • Replace the block-mq notifier list management with the multi instance
    facility in the cpu hotplug state machine.
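
    For reference, the multi instance pattern in blk-mq looks roughly
    like this (a sketch based on the 4.9-era API; the state and callback
    names follow the merged code):

    #include <linux/cpuhotplug.h>

    /* Once, at init time: register the state with per-instance
     * callbacks. blk-mq only cares about a CPU going away, so the
     * startup callback is NULL. */
    cpuhp_setup_state_multi(CPUHP_BLK_MQ_DEAD, "block/mq:dead", NULL,
                            blk_mq_hctx_notify_dead);

    /* Per hardware context: link an hlist_node into the state's
     * instance list instead of keeping a custom notifier list. */
    cpuhp_state_add_instance_nocalls(CPUHP_BLK_MQ_DEAD,
                                     &hctx->cpuhp_dead);
    /* ... and unlink it again on teardown: */
    cpuhp_state_remove_instance_nocalls(CPUHP_BLK_MQ_DEAD,
                                        &hctx->cpuhp_dead);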

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: linux-block@vger.kernel.org
    Cc: rt@linutronix.de
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Thomas Gleixner
     
  • bio_free_pages was introduced in commit 1dfa0f68c040
    ("block: add a helper to free bio bounce buffer pages");
    we can reuse the function in other modules once it is
    exported.
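
    For reference, the helper being exported is tiny (as it reads around
    this series):

    void bio_free_pages(struct bio *bio)
    {
            struct bio_vec *bvec;
            int i;

            /* Free every page attached to the bio's segments. */
            bio_for_each_segment_all(bvec, bio, i)
                    __free_page(bvec->bv_page);
    }
    EXPORT_SYMBOL(bio_free_pages);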

    Cc: Christoph Hellwig
    Cc: Jens Axboe
    Cc: Mike Snitzer
    Cc: Shaohua Li
    Signed-off-by: Guoqing Jiang
    Acked-by: Kent Overstreet
    Signed-off-by: Jens Axboe

    Guoqing Jiang
     

21 Sep, 2016

1 commit


20 Sep, 2016

2 commits

  • Right now, if the slice has expired, we start a new slice. If a bio
    is queued, we keep extending the slice by the throtl_slice interval
    (100ms).
    This worked well as long as the pending timer function got executed
    within a few milliseconds of its scheduled time. But with the recent
    changes in the timer subsystem, the slack can be much longer,
    depending on the expiry time of the scheduled timer.

    commit 500462a9de65 ("timers: Switch to a non-cascading wheel")

    This means that by the time the timer function gets executed, the
    delay from the scheduled time can be more than 100ms. The current code
    will then conclude that the existing slice has expired and a new one
    needs to be started. The new slice will be 100ms by default, which
    will not be sufficient to meet the rate requirement of the group given
    the bio size, so the bio will not be dispatched and we will start a
    new timer function to wait. When that timer expires, the same process
    repeats, and we can easily end up in an infinite loop.

    Solve this issue by starting a new slice only if the throttle group
    is empty. If it is not empty, there should be an active slice going
    on. Ideally it should not have expired, but given the slack, it is
    possible that it has.
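
    The gist of the fix, as a sketch of the check in tg_may_dispatch()
    (the exact hunk may differ in detail):

    /* Only start a fresh slice when the group has nothing queued in
     * this direction; otherwise extend the (possibly lapsed) slice so
     * that a late-firing timer does not reset accounting to a fresh
     * 100ms window. */
    if (throtl_slice_used(tg, rw) && !tg->service_queue.nr_queued[rw])
            throtl_start_new_slice(tg, rw);
    else if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
            throtl_extend_slice(tg, rw, jiffies + throtl_slice);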

    Reported-by: Hou Tao
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Install the callbacks via the state machine.

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Jens Axboe
    Cc: rt@linutronix.de
    Link: http://lkml.kernel.org/r/20160906170457.32393-9-bigeasy@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

19 Sep, 2016

1 commit


17 Sep, 2016

5 commits

  • In order to get good cache behavior from a sbitmap, we want each CPU to
    stick to its own cacheline(s) as much as possible. This might happen
    naturally as the bitmap gets filled up and the alloc_hint values spread
    out, but we really want this behavior from the start. blk-mq
    apparently intended to do this, but the code was never wired up. Get
    rid of the dead code and make it part of the sbitmap library.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Again, there's no point in passing this in every time. Make it part of
    struct sbitmap_queue and clean up the API.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Allocating your own per-cpu allocation hint separately makes for an
    awkward API. Instead, allocate the per-cpu hint as part of the struct
    sbitmap_queue. There's no point in a struct sbitmap_queue without the
    cache, but you can still use a bare struct sbitmap.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • This is a generally useful data structure, so make it available to
    anyone else who might want to use it. It's also a nice cleanup
    separating the allocation logic from the rest of the tag handling logic.

    The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
    selected by CONFIG_BLOCK for now.

    This should be a complete noop functionality-wise.
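
    A minimal usage sketch, assuming the sbitmap API as merged in this
    series:

    #include <linux/sbitmap.h>

    struct sbitmap sb;
    int nr;

    /* 128 bits; a negative shift lets the library pick the per-word
     * granularity; allocate with GFP_KERNEL on any NUMA node. */
    if (sbitmap_init_node(&sb, 128, -1, GFP_KERNEL, NUMA_NO_NODE))
            return -ENOMEM;

    /* Grab a free bit (returns -1 if none are free), release it. */
    nr = sbitmap_get(&sb, 0, false);
    if (nr >= 0)
            sbitmap_clear_bit(&sb, nr);

    sbitmap_free(&sb);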

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • We currently account a '0' dispatch, and anything above that still falls
    below the range set by BLK_MQ_MAX_DISPATCH_ORDER. If we dispatch more,
    we don't account it.

    Change the last bucket to be inclusive of anything above the range we
    track, and have the sysfs file reflect that by including a '+' in the
    output:

    $ cat /sys/block/nvme0n1/mq/0/dispatched
    0 1006
    1 20229
    2 1
    4 0
    8 0
    16 0
    32+ 0
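
    With the change, the bucket mapping boils down to something like:

    /* Map a dispatch batch size to its histogram bucket: zero gets its
     * own bucket, and anything at or above the top power of two lands
     * in the last, now inclusive, "32+" bucket. */
    static inline unsigned int queued_to_index(unsigned int queued)
    {
            if (!queued)
                    return 0;

            return min(BLK_MQ_MAX_DISPATCH_ORDER - 1, ilog2(queued) + 1);
    }

    hctx->dispatched[queued_to_index(queued)]++;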

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

15 Sep, 2016

8 commits

  • Fixes: 1b157939f92a ("blk-mq: get rid of the cpumask in struct blk_mq_tags")
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Unused now that NVMe sets up irq affinity before calling into blk-mq.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • This allows drivers to specify their own queue mapping by overriding the
    setup-time function that builds the mq_map. This can be used for
    example to build the map based on the MSI-X vector mapping provided
    by the core interrupt layer for PCI devices.
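
    A hypothetical driver (the foo_* names are illustrative) would wire
    this up roughly as follows, using the PCI helper added by the
    companion patch in this series:

    #include <linux/blk-mq.h>
    #include <linux/blk-mq-pci.h>

    /* Derive the queue mapping from the PCI IRQ affinity instead of
     * relying on the blk-mq default spread. */
    static int foo_map_queues(struct blk_mq_tag_set *set)
    {
            struct foo_dev *foo = set->driver_data;

            return blk_mq_pci_map_queues(set, foo->pdev);
    }

    static struct blk_mq_ops foo_mq_ops = {
            .queue_rq       = foo_queue_rq,
            .map_queues     = foo_map_queues, /* NULL means the default */
    };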

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • All drivers use the default, so provide an inline version of it. If
    we ever need another queue mapping we can add an optional method back,
    although supporting it will also require major changes to the queue
    setup code.

    This provides better code generation, and better debuggability as
    well.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • The mapping is identical for all queues in a tag_set, so stop wasting
    memory building multiple. Note that for now I've kept the mq_map
    pointer in the request_queue, but we'll need to investigate if we can
    remove it without suffering too much from the additional pointer chasing.
    The same would apply to the mq_ops pointer as well.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Currently blk-mq will totally remap hardware contexts when a CPU
    hotplug event happens, which causes major havoc for drivers, as they
    are never
    told about this remapping. E.g. any carefully sorted out CPU affinity
    will just be completely messed up.

    The rebuild also doesn't really help for the common case of cpu
    hotplug, which is soft onlining / offlining of cpus - in this case we
    should just leave the queue and irq mapping as is. If it actually
    worked it would have helped in the case of physical cpu hotplug,
    although for that we'd need a way to actually notify the driver.
    Note that drivers may already be able to accommodate such a topology
    change on their own, e.g. using the reset_controller sysfs file in NVMe
    will cause the driver to get things right for this case.

    With the rebuild removed we will simply retain the queue mapping for
    a soft offlined CPU that will work when it comes back online, and will
    map any newly onlined CPU to queue 0 until the driver initiates
    a rebuild of the queue map.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • blk_mq_delay_kick_requeue_list() provides the ability to kick the
    q->requeue_list after a specified time. To do this the request_queue's
    'requeue_work' member was changed to a delayed_work.

    blk_mq_delay_kick_requeue_list() allows DM to defer processing requeued
    requests while it doesn't make sense to immediately requeue them
    (e.g. when all paths in a DM multipath have failed).
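
    Usage is a one-liner; a sketch of the pattern described above,
    assuming the 4.9-era single-argument blk_mq_requeue_request():

    /* Park the request on q->requeue_list, then have the list
     * processed again after ~5 seconds instead of immediately. */
    blk_mq_requeue_request(rq);
    blk_mq_delay_kick_requeue_list(rq->q, 5000);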

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

14 Sep, 2016

3 commits

  • commit e1defc4ff0cf57aca6c5e3ff99fa503f5943c1f1
    "block: Do away with the notion of hardsect_size"
    removed the notion of "hardware sector size" from
    the kernel in favor of logical block size, but
    references remain in comments and documentation.

    Update the remaining sites mentioning hardsect.

    Signed-off-by: Linus Walleij
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Linus Walleij
     
  • Allow the io_poll statistics to be zeroed to make for easier logging
    of polling events.

    Signed-off-by: Stephen Bates
    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Stephen Bates
     
  • In order to help determine the effectiveness of polling in a running
    system, it is useful to determine the ratio of how often the poll
    function is called vs how often the completion is checked. For this
    reason we add a poll_considered variable and add it to the sysfs entry
    for io_poll.
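
    A sketch of the counters in the poll path (placement and field
    names per this description; treat the details as an assumption):

    hctx->poll_considered++;        /* polling was considered */
    ...
    hctx->poll_invoked++;           /* ->poll actually called */
    ret = q->mq_ops->poll(hctx, blk_qc_t_to_tag(cookie));
    if (ret > 0)
            hctx->poll_success++;   /* completion found by polling */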

    Signed-off-by: Stephen Bates
    Acked-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Stephen Bates
     

29 Aug, 2016

3 commits


25 Aug, 2016

2 commits

  • __blk_mq_run_hw_queue() currently warns if we are running the queue on a
    CPU that isn't set in its mask. However, this can happen if a CPU is
    being offlined, and the workqueue handling will place the work on CPU0
    instead. Improve the warning so that it only triggers if the batch
    cpu in the hardware queue is currently online; if it still triggers
    then, it's indicative of a flow problem in blk-mq, so we want to
    retain it for that case.
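
    The reworked check boils down to something like this (a sketch,
    taking hctx->next_cpu as the batch cpu):

    /* Only warn when we run on a CPU outside the hctx mask while the
     * hctx's batch cpu is online; work punted to CPU0 because its CPU
     * went offline is expected and stays silent. */
    WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
            cpu_online(hctx->next_cpu));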

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We do this in a few places when the CPU is offline. This isn't
    allowed, though, since on multiqueue hardware we can't just move a
    request from one software queue to another if they map to different
    hardware queues. The request and tag aren't valid on another hardware
    queue.

    This can happen if plugging races with CPU offlining. But it does no
    harm, since it can only happen in the window where we are currently
    busy freezing the queue and flushing IO, in preparation for redoing
    the software to hardware queue mappings.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Aug, 2016

1 commit

  • After arbitrary bio sizes were introduced, the incoming bio may be
    very big. We have to split the bio into smaller bios so that each
    holds at most BIO_MAX_PAGES bvecs, for safety reasons such as
    bio_clone() (see the sketch after the reproduction steps below).

    This patch fixes the following kernel crash:

    > [ 172.660142] BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
    > [ 172.660229] IP: [] bio_trim+0xf/0x2a
    > [ 172.660289] PGD 7faf3e067 PUD 7f9279067 PMD 0
    > [ 172.660399] Oops: 0000 [#1] SMP
    > [...]
    > [ 172.664780] Call Trace:
    > [ 172.664813] [] ? raid1_make_request+0x2e8/0xad7 [raid1]
    > [ 172.664846] [] ? blk_queue_split+0x377/0x3d4
    > [ 172.664880] [] ? md_make_request+0xf6/0x1e9 [md_mod]
    > [ 172.664912] [] ? generic_make_request+0xb5/0x155
    > [ 172.664947] [] ? prio_io+0x85/0x95 [bcache]
    > [ 172.664981] [] ? register_cache_set+0x355/0x8d0 [bcache]
    > [ 172.665016] [] ? register_bcache+0x1006/0x1174 [bcache]

    The issue can be reproduced by the following steps:
    - create one raid1 over two virtio-blk
    - build bcache device over the above raid1 and another cache device
    and bucket size is set as 2Mbytes
    - set cache mode as writeback
    - run random write over ext4 on the bcache device
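
    The core of the fix is a cap along these lines in the split loop of
    blk_bio_segment_split() (a sketch; the merged hunk may differ in
    detail):

    bio_for_each_segment(bv, bio, iter) {
            /* Cap the number of bvecs per split-off bio so a later
             * bio_clone() can always allocate its bvec table. */
            if (bvecs++ >= BIO_MAX_PAGES)
                    goto split;
            ...
    }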

    Fixes: 54efd50 ("block: make generic_make_request handle arbitrarily sized bios")
    Reported-by: Sebastian Roesner
    Reported-by: Eric Wheeler
    Cc: stable@vger.kernel.org (4.3+)
    Cc: Shaohua Li
    Acked-by: Kent Overstreet
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

17 Aug, 2016

1 commit

  • blk_set_queue_dying() can be called while another thread is
    submitting I/O or changing queue flags, e.g. through dm_stop_queue().
    Hence protect the QUEUE_FLAG_DYING flag change with locking.
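
    After the change the flag update is done under the queue lock; a
    sketch of the resulting shape:

    void blk_set_queue_dying(struct request_queue *q)
    {
            /* Take queue_lock so the flag change cannot race with
             * other queue-flag updates, e.g. from dm_stop_queue(). */
            spin_lock_irq(q->queue_lock);
            queue_flag_set(QUEUE_FLAG_DYING, q);
            spin_unlock_irq(q->queue_lock);
            /* ... existing queue teardown/wakeup logic continues ... */
    }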

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Mike Snitzer
    Cc: stable
    Signed-off-by: Jens Axboe

    Bart Van Assche