24 Jun, 2020

1 commit


14 May, 2020

3 commits

  • Blk-crypto delegates crypto operations to inline encryption hardware
    when available. The separately configurable blk-crypto-fallback contains
    a software fallback to the kernel crypto API - when enabled, blk-crypto
    will use this fallback for en/decryption when inline encryption hardware
    is not available.

    This means upper layers do not have to worry about whether or not the
    underlying device has support for inline encryption before deciding to
    specify an encryption context for a bio. It also allows for testing
    without actual inline encryption hardware - in particular, it makes it
    possible to test the inline encryption code in ext4 and f2fs simply by
    running xfstests with the inlinecrypt mount option, which in turn allows
    for things like the regular upstream regression testing of ext4 to cover
    the inline encryption code paths.

    For more details, refer to Documentation/block/inline-encryption.rst.
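
    The selection between hardware and fallback described above can be
    sketched roughly as follows. This is only an illustration, not the
    kernel code: Device, hw_modes and choose_crypto_path are hypothetical
    names, and the mode string is just an example value.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical stand-in for a block device and its crypto capabilities."""
    name: str
    hw_modes: set = field(default_factory=set)  # modes the inline hardware supports


def choose_crypto_path(dev: Device, mode: str, fallback_enabled: bool) -> str:
    """Return which path would en/decrypt a bio that carries an encryption context."""
    if mode in dev.hw_modes:
        # Inline encryption hardware is available for this mode.
        return "inline-hardware"
    if fallback_enabled:
        # No suitable hardware: use the software fallback (kernel crypto API).
        return "crypto-api-fallback"
    raise RuntimeError(f"{dev.name}: no inline encryption support for {mode}")
```

    With the fallback enabled, the caller gets a working path on any
    device, which is what makes testing without real hardware possible.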

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • We must have some way of letting a storage device driver know what
    encryption context it should use for en/decrypting a request. However,
    it's the upper layers (like the filesystem/fscrypt) that know about and
    manage encryption contexts. As such, when the upper layer submits a bio
    to the block layer, and this bio eventually reaches a device driver with
    support for inline encryption, the device driver will need to have been
    told the encryption context for that bio.

    We want to communicate the encryption context from the upper layer to the
    storage device along with the bio, when the bio is submitted to the block
    layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
    represent an encryption context (note that we can't use the bi_private
    field in struct bio to do this because that field is not meant to pass
    information across layers in the storage stack). We also introduce various
    functions to manipulate the bio_crypt_ctx and make the bio/request merging
    logic aware of the bio_crypt_ctx.

    We also make changes to blk-mq to make it handle bios with encryption
    contexts. blk-mq can merge many bios into the same request. These bios need
    to have contiguous data unit numbers (the necessary changes to blk-merge
    are also made to ensure this) - as such, it suffices to keep the data unit
    number of just the first bio, since that's all a storage driver needs to
    infer the data unit number to use for each data block in each bio in a
    request. blk-mq keeps track of the encryption context to be used for all
    the bios in a request with the request's rq_crypt_ctx. When the first bio
    is added to an empty request, blk-mq will program the encryption context
    of that bio into the request_queue's keyslot manager, and store the
    returned keyslot in the request's rq_crypt_ctx. All the functions to
    operate on encryption contexts are in blk-crypto.c.
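
    A minimal sketch of why the first bio's data unit number (DUN) is
    enough once merging enforces contiguity. The Bio class and its fields
    are hypothetical simplifications, not the kernel's struct bio or
    struct bio_crypt_ctx.

```python
from dataclasses import dataclass


@dataclass
class Bio:
    key_id: str          # identifies the encryption key (hypothetical field)
    first_dun: int       # data unit number of the bio's first data unit
    num_data_units: int  # length of the bio, counted in data units


def can_back_merge(tail: Bio, nxt: Bio) -> bool:
    """A bio may follow another in one request only if the key matches
    and its first DUN continues exactly where the previous bio ended."""
    return (tail.key_id == nxt.key_id
            and nxt.first_dun == tail.first_dun + tail.num_data_units)


def duns_in_request(bios: list) -> list:
    """Derive the DUN of every data unit from the first bio's DUN alone."""
    total = sum(b.num_data_units for b in bios)
    return [bios[0].first_dun + i for i in range(total)]
```

    Because can_back_merge rejects any gap, duns_in_request never needs
    to look at the DUN stored in the second or later bios.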

    Upper layers only need to call bio_crypt_set_ctx with the encryption key,
    algorithm and data_unit_num; they don't have to worry about getting a
    keyslot for each encryption context, as blk-mq/blk-crypto handles that.
    Blk-crypto also makes it possible for request-based layered devices like
    dm-rq to make use of inline encryption hardware by cloning the
    rq_crypt_ctx and programming a keyslot in the new request_queue when
    necessary.

    Note that any user of the block layer can submit bios with an
    encryption context, such as filesystems, device-mapper targets, etc.

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • Inline Encryption hardware allows software to specify an encryption context
    (an encryption key, crypto algorithm, data unit num, data unit size) along
    with a data transfer request to a storage device, and the inline encryption
    hardware will use that context to en/decrypt the data. The inline
    encryption hardware is part of the storage device, and it conceptually sits
    on the data path between system memory and the storage device.

    Inline Encryption hardware implementations often function around the
    concept of "keyslots". These implementations often have a limited number
    of "keyslots", each of which can hold a key (we say that a key can be
    "programmed" into a keyslot). Requests made to the storage device may have
    a keyslot and a data unit number associated with them, and the inline
    encryption hardware will en/decrypt the data in the requests using the key
    programmed into that associated keyslot and the data unit number specified
    with the request.

    As keyslots are limited, and programming keys may be expensive in many
    implementations, and multiple requests may use exactly the same encryption
    contexts, we introduce a Keyslot Manager to efficiently manage keyslots.

    We also introduce a blk_crypto_key, which will represent the key that's
    programmed into keyslots managed by keyslot managers. The keyslot manager
    also functions as the interface that upper layers will use to program keys
    into inline encryption hardware. For more information on the Keyslot
    Manager, refer to documentation found in block/keyslot-manager.c and
    linux/keyslot-manager.h.
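
    To illustrate the idea of sharing a limited set of keyslots, here is
    a toy model. It assumes reference counting plus eviction of the
    longest-idle key; that policy, and every name in the sketch, is an
    assumption made for illustration rather than a description of
    block/keyslot-manager.c.

```python
from collections import OrderedDict


class KeyslotManager:
    """Toy keyslot manager: reuse a slot when the key is already programmed,
    otherwise take a free slot or evict the longest-idle key."""

    def __init__(self, num_slots):
        self.free = list(range(num_slots))
        self.slot_of = {}            # key -> slot index
        self.refs = {}               # key -> number of in-flight users
        self.idle = OrderedDict()    # keys with zero users, oldest first
        self.programs = 0            # how many times a key was programmed

    def get_slot(self, key):
        if key in self.slot_of:
            self.idle.pop(key, None)         # key is in use again
        else:
            if self.free:
                slot = self.free.pop()
            elif self.idle:
                victim, _ = self.idle.popitem(last=False)
                slot = self.slot_of.pop(victim)
                del self.refs[victim]
            else:
                return None                  # every slot is busy
            self.slot_of[key] = slot
            self.refs[key] = 0
            self.programs += 1               # the expensive hardware step
        self.refs[key] += 1
        return self.slot_of[key]

    def put_slot(self, key):
        self.refs[key] -= 1
        if self.refs[key] == 0:
            self.idle[key] = None            # keep programmed, but evictable
```

    Requests that share a key hit the first branch of get_slot, so the
    key is programmed once however many requests use it.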

    Co-developed-by: Eric Biggers
    Signed-off-by: Eric Biggers
    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
     

24 Mar, 2020

1 commit


30 Jan, 2020

1 commit

  • Pull SCSI updates from James Bottomley:
    "This series is slightly unusual because it includes Arnd's compat
    ioctl tree here:

    1c46a2cf2dbd Merge tag 'block-ioctl-cleanup-5.6' into 5.6/scsi-queue

    Excluding Arnd's changes, this is mostly an update of the usual
    drivers: megaraid_sas, mpt3sas, qla2xxx, ufs, lpfc, hisi_sas.

    There are a couple of core and base updates around error propagation
    and atomicity in the attribute container base we use for the SCSI
    transport classes.

    The rest is minor changes and updates"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (149 commits)
    scsi: hisi_sas: Rename hisi_sas_cq.pci_irq_mask
    scsi: hisi_sas: Add prints for v3 hw interrupt converge and automatic affinity
    scsi: hisi_sas: Modify the file permissions of trigger_dump to write only
    scsi: hisi_sas: Replace magic number when handle channel interrupt
    scsi: hisi_sas: replace spin_lock_irqsave/spin_unlock_restore with spin_lock/spin_unlock
    scsi: hisi_sas: use threaded irq to process CQ interrupts
    scsi: ufs: Use UFS device indicated maximum LU number
    scsi: ufs: Add max_lu_supported in struct ufs_dev_info
    scsi: ufs: Delete is_init_prefetch from struct ufs_hba
    scsi: ufs: Inline two functions into their callers
    scsi: ufs: Move ufshcd_get_max_pwr_mode() to ufshcd_device_params_init()
    scsi: ufs: Split ufshcd_probe_hba() based on its called flow
    scsi: ufs: Delete struct ufs_dev_desc
    scsi: ufs: Fix ufshcd_probe_hba() reture value in case ufshcd_scsi_add_wlus() fails
    scsi: ufs-mediatek: enable low-power mode for hibern8 state
    scsi: ufs: export some functions for vendor usage
    scsi: ufs-mediatek: add dbg_register_dump implementation
    scsi: qla2xxx: Fix a NULL pointer dereference in an error path
    scsi: qla1280: Make checking for 64bit support consistent
    scsi: megaraid_sas: Update driver version to 07.713.01.00-rc1
    ...

    Linus Torvalds
     

07 Jan, 2020

1 commit

  • Currently t10-pi can only be built into the block layer which via
    crc-t10dif pulls in a whole chunk of the Crypto API. In fact all
    users of t10-pi work as modules and there is no reason for it to
    always be built-in.

    This patch adds a new hidden option for t10-pi that is selected
    automatically based on BLK_DEV_INTEGRITY and whether the users
    of t10-pi are built-in or not.

    Signed-off-by: Herbert Xu
    Signed-off-by: Jens Axboe

    Herbert Xu
     

03 Jan, 2020

1 commit


08 Nov, 2019

1 commit


29 Aug, 2019

1 commit

  • This patchset implements an IO cost model based, work-conserving
    proportional controller.

    While io.latency provides the capability to comprehensively prioritize
    and protect IOs depending on the cgroups, its protection is binary -
    the lowest latency target cgroup which is suffering is protected at
    the cost of all others. In many use cases including stacking multiple
    workload containers in a single system, it's necessary to distribute
    IO capacity with better granularity.

    One challenge of controlling IO resources is the lack of trivially
    observable cost metric. The most common metrics - bandwidth and iops
    - can be off by orders of magnitude depending on the device type and
    IO pattern. However, the cost isn't a complete mystery. Given
    several key attributes, we can make fairly reliable predictions on how
    expensive a given stream of IOs would be, at least compared to other
    IO patterns.

    The function which determines the cost of a given IO is the IO cost
    model for the device. This controller distributes IO capacity based
    on the costs estimated by such a model. The more accurate the cost
    model the better, but the controller adapts based on IO completion
    latency and, as long as the relative costs across different IO
    patterns are consistent and sensible, it'll adapt to the actual
    performance of the device.

    Currently, the only implemented cost model is a simple linear one with
    a few sets of default parameters for different classes of device.
    This covers most common devices reasonably well. All the
    infrastructure to tune and add different cost models is already in
    place and a later patch will also allow using bpf progs for cost
    models.
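
    As a rough illustration of what a linear cost model and proportional
    distribution could look like: the parameter names and the split into
    a base cost plus a per-page cost are assumptions for this sketch, not
    the actual coefficients or structure used in blk-iocost.c.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCostModel:
    seq_base: float   # fixed cost of a sequential IO (hypothetical unit)
    rand_base: float  # fixed cost of a random IO
    per_page: float   # additional cost per page transferred

    def cost(self, nr_pages: int, sequential: bool) -> float:
        """Estimated cost of one IO: base cost plus a size-dependent part."""
        base = self.seq_base if sequential else self.rand_base
        return base + self.per_page * nr_pages


def share_of_capacity(weights: dict) -> dict:
    """Fraction of the device's cost budget each cgroup is entitled to."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

    Only the relative costs matter here: if random IOs are consistently
    estimated as more expensive than sequential ones, a group issuing
    them uses up its share faster.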

    Please see the top comment in blk-iocost.c and documentation for
    more details.

    v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
    for a divide-by-zero bug in current_hweight() triggered by zero
    inuse_sum.

    Signed-off-by: Tejun Heo
    Cc: Andy Newell
    Cc: Josef Bacik
    Cc: Rik van Riel
    Signed-off-by: Jens Axboe

    Tejun Heo
     

08 Nov, 2018

2 commits


27 Sep, 2018

1 commit

  • Move the code for runtime power management from blk-core.c into the
    new source file blk-pm.c. Move the corresponding declarations from
    <linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out
    the declarations of the functions that are not used in that mode.
    This patch not only reduces the number of #ifdefs in the block layer
    core code but also reduces the size of header file <linux/blkdev.h>
    and hence should help to reduce the build time of the Linux kernel
    if CONFIG_PM is not defined.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Jul, 2018

3 commits

  • Current IO controllers for the block layer are less than ideal for our
    use case. The io.max controller is great at hard limiting, but it is
    not work conserving. This patch introduces io.latency. You provide a
    latency target for your group and we monitor the io in short windows to
    make sure we are not exceeding those latency targets. This makes use of
    the rq-qos infrastructure and works much like the wbt stuff. There are
    a few differences from wbt:

    - It's bio based, so the latency covers the whole block layer in addition to
    the actual io.
    - We will throttle all IO types that come in here if we need to.
    - We use the mean latency over the 100ms window. This is because writes can
    be particularly fast, which could give us a false sense of the impact of
    other workloads on our protected workload.
    - By default there's no throttling; we set the queue_depth to INT_MAX so that
    we can have as many outstanding bios as we're allowed to. Only at
    throttle time do we pay attention to the actual queue depth.
    - We backcharge cgroups for root cg issued IO and induce artificial
    delays in order to deal with cases like metadata only or swap heavy
    workloads.
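
    The window-based check in the list above might be sketched like
    this. The halving step, the immediate reset to INT_MAX and all
    function names are assumptions chosen to keep the example short;
    the real controller's scaling policy is not described here.

```python
INT_MAX = 2**31 - 1  # "no throttling" queue depth


def mean_latency(samples_ns):
    """Mean latency of the IOs completed in one window (0.0 if none)."""
    return sum(samples_ns) / len(samples_ns) if samples_ns else 0.0


def next_queue_depth(current_depth, window_samples_ns, target_ns, inflight):
    """Queue depth to apply to the unprotected groups after one window."""
    if mean_latency(window_samples_ns) <= target_ns:
        return INT_MAX                       # target met: do not throttle
    # Only at throttle time does the actual number of in-flight IOs matter.
    start = min(current_depth, max(inflight, 1))
    return max(start // 2, 1)                # never go below 1 io at a time
```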

    In testing this has worked out relatively well. Protected workloads
    will throttle noisy workloads down to 1 io at a time if they are doing
    normal IO on their own, or induce up to a 1 second delay per syscall if
    they are doing a lot of root issued IO (metadata/swap IO).

    Our testing has revolved mostly around our production web servers where
    we have hhvm (the web server application) in a protected group and
    everything else in another group. We see slightly higher requests per
    second (RPS) on the test tier vs the control tier, and much more stable
    RPS across all machines in the test tier vs the control tier.

    Another test we run is a slow memory allocator in the unprotected group.
    Before this would eventually push us into swap and cause the whole box
    to die and not recover at all. With these patches we see slight RPS
    drops (usually 10-15%) before the memory consumer is properly killed and
    things recover within seconds.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • blkcg-qos is going to do essentially what wbt does, only on a cgroup
    basis. Break out the common code that will be shared between blkcg-qos
    and wbt into blk-rq-qos.* so they can both utilize the same
    infrastructure.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Exclude zoned block device members from struct request_queue for
    CONFIG_BLK_DEV_ZONED == n. Avoid breaking the build by only building
    the code that uses these struct request_queue members if
    CONFIG_BLK_DEV_ZONED != n.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Damien Le Moal
    Cc: Matias Bjorling
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Aug, 2017

1 commit

  • Like pci and virtio, we add an rdma helper for affinity
    spreading. This achieves optimal mq affinity assignments
    according to the underlying rdma device affinity maps.

    Reviewed-by: Jens Axboe
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Doug Ledford

    Sagi Grimberg
     

19 Apr, 2017

2 commits

  • The BFQ I/O scheduler features an optimal fair-queuing
    (proportional-share) scheduling algorithm, enriched with several
    mechanisms to boost throughput and reduce latency for interactive and
    real-time applications. This makes BFQ a large and complex piece of
    code. This commit addresses this issue by splitting BFQ into three
    main, independent components, and by moving each component into a
    separate source file:
    1. Main algorithm: handles the interaction with the kernel, and
    decides which requests to dispatch; it uses the following two further
    components to achieve its goals.
    2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm):
    computes the schedule, using weights and budgets provided by the above
    component.
    3. cgroups support: handles group operations (creation, destruction,
    move, ...).

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • We tag as v0 the version of BFQ containing only BFQ's engine plus
    hierarchical support. BFQ's engine is introduced by this commit, while
    hierarchical support is added by the next commit. We use the v0 tag to
    distinguish this minimal version of BFQ from the versions that also
    contain the features and improvements added by later commits. BFQ-v0
    coincides with the version of BFQ submitted a few years ago [1], apart
    from the introduction of preemption, described below.

    BFQ is a proportional-share I/O scheduler, whose general structure,
    plus a lot of code, are borrowed from CFQ.

    - Each process doing I/O on a device is associated with a weight and a
    (bfq_)queue.

    - BFQ grants exclusive access to the device, for a while, to one queue
    (process) at a time, and implements this service model by
    associating every queue with a budget, measured in number of
    sectors.

    - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

    - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
    holding the device for too long and dramatically reducing
    throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
    sync requests may not be expired immediately when it empties. In
    contrast, BFQ may idle the device for a short time interval,
    giving the process the chance to go on being served if it issues
    a new request in time. Device idling typically boosts the
    throughput on rotational devices, if processes do synchronous
    and sequential I/O. In addition, under BFQ, device idling is
    also instrumental in guaranteeing the desired throughput
    fraction to processes issuing sync requests (see [2] for
    details).

    - With respect to idling for service guarantees, if several
    processes are competing for the device at the same time, but
    all processes (and groups, after the following commit) have
    the same weight, then BFQ guarantees the expected throughput
    distribution without ever idling the device. Throughput is
    thus as high as possible in this common scenario.

    - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling. However, for a cleaner
    logical breakdown, the code that enables and completes
    hierarchical support is provided in the next commit, which focuses
    exactly on this feature.

    - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

    - The last, budget-independence, property (although probably
    counterintuitive in the first place) is definitely beneficial, for
    the following reasons:

    - First, with any proportional-share scheduler, the maximum
    deviation with respect to an ideal service is proportional to
    the maximum budget (slice) assigned to queues. As a consequence,
    BFQ can keep this deviation tight not only because of the
    accurate service of B-WF2Q+, but also because BFQ *does not*
    need to assign a larger budget to a queue to let the queue
    receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
    budget that best fits the needs of the process, or best
    leverages the I/O pattern of the process. In particular, BFQ
    updates queue budgets with a simple feedback-loop algorithm that
    allows a high throughput to be achieved, while still providing
    tight latency guarantees to time-sensitive applications. When
    the in-service queue expires, this algorithm computes the next
    budget of the queue so as to:

    - Let large budgets be eventually assigned to the queues
    associated with I/O-bound applications performing sequential
    I/O: in fact, the longer these applications are served once
    they get access to the device, the higher the throughput is.

    - Let small budgets be eventually assigned to the queues
    associated with time-sensitive applications (which typically
    perform sporadic and short I/O), because, the smaller the
    budget assigned to a queue waiting for service is, the sooner
    B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).

    - Weights can be assigned to processes only indirectly, through I/O
    priorities, and according to the relation:
    weight = 10 * (IOPRIO_BE_NR - ioprio).
    The next patch provides, instead, a cgroups interface through which
    weights can be assigned explicitly.

    - If several processes are competing for the device at the same time,
    but all processes and groups have the same weight, then BFQ
    guarantees the expected throughput distribution without ever idling
    the device. It uses preemption instead. Throughput is then much
    higher in this common scenario.

    - ioprio classes are served in strict priority order, i.e.,
    lower-priority queues are not served as long as there are
    higher-priority queues. Among queues in the same class, the
    bandwidth is distributed in proportion to the weight of each
    queue. A very thin extra bandwidth is however guaranteed to the Idle
    class, to prevent it from starving.

    - If the strict_guarantees parameter is set (default: unset), then BFQ
    - always performs idling when the in-service queue becomes empty;
    - forces the device to serve one I/O request at a time, by
    dispatching a new request only if there is no outstanding
    request.
    In the presence of differentiated weights or I/O-request sizes,
    both the above conditions are needed to guarantee that every
    queue receives its allotted share of the bandwidth (see
    Documentation/block/bfq-iosched.txt for more details). Setting
    strict_guarantees may evidently affect throughput.
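
    Two of the mechanisms listed above, the ioprio-to-weight relation
    and budget-based expiration, can be sketched as follows. IOPRIO_BE_NR
    is taken to be 8, as in the kernel's ioprio header; the Queue class,
    the serve loop and the rule that a queue expires when its next
    request no longer fits the remaining budget are simplifications made
    for this example.

```python
IOPRIO_BE_NR = 8  # number of best-effort priority levels (assumed value)


def weight_from_ioprio(ioprio: int) -> int:
    """weight = 10 * (IOPRIO_BE_NR - ioprio), as stated in the list above."""
    if not 0 <= ioprio < IOPRIO_BE_NR:
        raise ValueError("ioprio out of range")
    return 10 * (IOPRIO_BE_NR - ioprio)


class Queue:
    def __init__(self, budget_sectors, pending):
        self.budget = budget_sectors   # remaining budget, in sectors
        self.pending = list(pending)   # sizes of queued requests, in sectors


def serve(queue, timed_out=lambda: False):
    """Dispatch requests until one of the three expiration events occurs.
    Returns (sizes dispatched, reason for expiration)."""
    dispatched = []
    while True:
        if not queue.pending:
            return dispatched, "queue empty"
        if timed_out():
            return dispatched, "budget timeout"
        if queue.pending[0] > queue.budget:
            return dispatched, "budget exhausted"
        size = queue.pending.pop(0)
        queue.budget -= size           # charge the request to the budget
        dispatched.append(size)
```

    The sketch leaves out idling and B-WF2Q+ itself; it only shows how a
    budget measured in sectors bounds one service slot.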

    [1] https://lkml.org/lkml/2008/4/1/234
    https://lkml.org/lkml/2008/11/11/148

    [2] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf

    Signed-off-by: Fabio Checconi
    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
     

15 Apr, 2017

1 commit

  • The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
    scale to multiple queues. Users configure only two knobs, the target
    read and synchronous write latencies, and the scheduler tunes itself to
    achieve that latency goal.

    The implementation is based on "tokens", built on top of the scalable
    bitmap library. Tokens serve as a mechanism for limiting requests. There
    are two tiers of tokens: queueing tokens and dispatch tokens.

    A queueing token is required to allocate a request. In fact, these
    tokens are actually the blk-mq internal scheduler tags, but the
    scheduler manages the allocation directly in order to implement its
    policy.

    Dispatch tokens are device-wide and split up into two scheduling
    domains: reads vs. writes. Each hardware queue dispatches batches
    round-robin between the scheduling domains as long as tokens are
    available for that domain.

    These tokens can be used as the mechanism to enable various policies.
    The policy Kyber uses is inspired by active queue management techniques
    for network routing, similar to blk-wbt. The scheduler monitors
    latencies and scales the number of dispatch tokens accordingly. Queueing
    tokens are used to prevent starvation of synchronous requests by
    asynchronous requests.
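
    A very small model of dispatch tokens per scheduling domain. The
    halve-on-miss, grow-by-one rule in rescale is an assumption picked
    for illustration, as are all class and function names; the text above
    only says that the number of dispatch tokens is scaled according to
    observed latencies.

```python
class TokenPool:
    """Dispatch tokens for one scheduling domain (reads or writes)."""

    def __init__(self, depth):
        self.depth = depth    # how many tokens may be held at once
        self.in_use = 0

    def try_get(self):
        if self.in_use >= self.depth:
            return False
        self.in_use += 1
        return True

    def put(self):
        self.in_use -= 1


def rescale(pool, observed_ns, target_ns, max_depth):
    """Shrink the pool when latency misses the target, grow it otherwise."""
    if observed_ns > target_ns:
        pool.depth = max(pool.depth // 2, 1)
    else:
        pool.depth = min(pool.depth + 1, max_depth)


def dispatch_pass(pools, queues):
    """One round-robin pass over the domains, one request per domain,
    skipping a domain that has nothing queued or no token available."""
    issued = []
    for domain in ("read", "write"):
        if queues[domain] and pools[domain].try_get():
            issued.append(queues[domain].pop(0))
    return issued
```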

    Various extensions are possible, including better heuristics and ionice
    support. The new scheduler isn't set as the default yet.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

03 Mar, 2017

1 commit

  • Pull vhost updates from Michael Tsirkin:
    "virtio, vhost: optimizations, fixes

    Looks like a quiet cycle for vhost/virtio, just a couple of minor
    tweaks. Most notable is automatic interrupt affinity for blk and scsi.
    Hopefully other devices are not far behind"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    virtio-console: avoid DMA from stack
    vhost: introduce O(1) vq metadata cache
    virtio_scsi: use virtio IRQ affinity
    virtio_blk: use virtio IRQ affinity
    blk-mq: provide a default queue mapping for virtio device
    virtio: provide a method to get the IRQ affinity mask for a virtqueue
    virtio: allow drivers to request IRQ affinity when creating VQs
    virtio_pci: simplify MSI-X setup
    virtio_pci: don't duplicate the msix_enable flag in struct pci_dev
    virtio_pci: use shared interrupts for virtqueues
    virtio_pci: remove struct virtio_pci_vq_info
    vhost: try avoiding avail index access when getting descriptor
    virtio_mmio: expose header to userspace

    Linus Torvalds
     

28 Feb, 2017

1 commit


18 Feb, 2017

1 commit


07 Feb, 2017

1 commit

  • This patch implements the necessary logic to bring an Opal
    enabled drive out of a factory-enabled into a working
    Opal state.

    This patch set also enables logic to save a password to
    be replayed during a resume from suspend.

    Signed-off-by: Scott Bauer
    Signed-off-by: Rafael Antognolli
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Scott Bauer
     

01 Feb, 2017

1 commit


28 Jan, 2017

1 commit

  • This fixes a couple of problems:

    1. In the !CONFIG_DEBUG_FS case, the stub definitions were bogus.
    2. In the !CONFIG_BLOCK case, blk-mq-debugfs.c shouldn't be compiled at
    all.

    Fix the stub definitions and add a CONFIG_BLK_DEBUG_FS Kconfig option.

    Fixes: 07e4fead45e6 ("blk-mq: create debugfs directory tree")
    Signed-off-by: Omar Sandoval

    Augment Kconfig description.

    Signed-off-by: Jens Axboe

    Omar Sandoval
     

27 Jan, 2017

1 commit

  • In preparation for putting blk-mq debugging information in debugfs,
    create a directory tree mirroring the one in sysfs:

    # tree -d /sys/kernel/debug/block
    /sys/kernel/debug/block
    |-- nvme0n1
    |   `-- mq
    |       |-- 0
    |       |   `-- cpu0
    |       |-- 1
    |       |   `-- cpu1
    |       |-- 2
    |       |   `-- cpu2
    |       `-- 3
    |           `-- cpu3
    `-- vda
        `-- mq
            `-- 0
                |-- cpu0
                |-- cpu1
                |-- cpu2
                `-- cpu3

    Also add the scaffolding for the actual files that will go in here,
    either under the hardware queue or software queue directories.

    Reviewed-by: Hannes Reinecke
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

18 Jan, 2017

2 commits

  • This is basically identical to deadline-iosched, except it registers
    as a MQ capable scheduler. This is still a single queue design.

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     
  • This adds a set of hooks that intercepts the blk-mq path of
    allocating/inserting/issuing/completing requests, allowing
    us to develop a scheduler within that framework.

    We reuse the existing elevator scheduler API on the registration
    side, but augment that with the scheduler flagging support for
    the blk-mq interface, and with a separate set of ops hooks for MQ
    devices.

    We split driver and scheduler tags, so we can run the scheduling
    independently of device queue depth.

    Signed-off-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

11 Nov, 2016

2 commits

  • We can hook this up to the block layer, to help throttle buffered
    writes.

    wbt registers a few trace points that can be used to track what is
    happening in the system:

    wbt_lat: 259:0: latency 2446318
    wbt_stat: 259:0: rmean=2446318, rmin=2446318, rmax=2446318, rsamples=1,
    wmean=518866, wmin=15522, wmax=5330353, wsamples=57
    wbt_step: 259:0: step down: step=1, window=72727272, background=8, normal=16, max=32

    This shows a sync issue event (wbt_lat) that exceeded its time. wbt_stat
    dumps the current read/write stats for that window, and wbt_step shows a
    step down event where we now scale back writes. Each trace includes the
    device, 259:0 in this case.
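
    For post-processing, the trace lines shown above could be parsed
    along these lines. The parser is a hypothetical helper that only
    handles the exact "event: major:minor: ..." form quoted in this
    message, with no ftrace prefix (task, CPU, timestamp) in front.

```python
import re


def parse_wbt_event(line):
    """Split a wbt trace line into event name, device and key=value fields."""
    m = re.match(r"(wbt_\w+): (\d+):(\d+): (.*)", line)
    if m is None:
        raise ValueError("not a wbt trace line: " + line)
    name, major, minor, rest = m.groups()
    # Collect every integer key=value pair; free text such as "step down"
    # or "latency 2446318" is ignored by this simple sketch.
    fields = {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)", rest)}
    return {"event": name, "dev": (int(major), int(minor)), "fields": fields}
```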

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • For legacy block, we simply track them in the request queue. For
    blk-mq, we track them on a per-sw queue basis, which we can then
    sum up through the hardware queues and finally to a per device
    state.

    The stats are tracked in, roughly, 0.1s interval windows.
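
    Summing per-software-queue stats up to a per-device view can be
    pictured as below. The Stat class and its field names are
    hypothetical; the sketch only shows that samples, totals, minimum
    and maximum combine without keeping individual samples.

```python
from dataclasses import dataclass


@dataclass
class Stat:
    samples: int = 0
    total: int = 0
    vmin: int = 0
    vmax: int = 0

    def add(self, value):
        """Record one completed request in this window."""
        self.vmin = value if self.samples == 0 else min(self.vmin, value)
        self.vmax = max(self.vmax, value)
        self.samples += 1
        self.total += value

    @property
    def mean(self):
        return self.total / self.samples if self.samples else 0.0


def merge(stats):
    """Combine per-queue stats into one per-device stat."""
    out = Stat()
    for s in stats:
        if s.samples == 0:
            continue                      # empty queues contribute nothing
        out.vmin = s.vmin if out.samples == 0 else min(out.vmin, s.vmin)
        out.vmax = max(out.vmax, s.vmax)
        out.samples += s.samples
        out.total += s.total
    return out
```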

    Add sysfs files to display the stats.

    The feature is off by default, to avoid any extra overhead. In-kernel
    users of it can turn it on by setting QUEUE_FLAG_STATS in the queue
    flags. We currently don't turn it on if someone just reads any of
    the stats files; that is something we could add as well.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 Oct, 2016

1 commit

  • Implement zoned block device zone information reporting and reset.
    Zone information is reported as struct blk_zone. This implementation
    does not differentiate between host-aware and host-managed device
    models and is valid for both. Two functions are provided:
    blkdev_report_zones for discovering the zone configuration of a
    zoned block device, and blkdev_reset_zones for resetting the write
    pointer of sequential zones. The helper functions blk_queue_zone_size
    and bdev_zone_size are also provided for, as the names suggest,
    obtaining the zone size (in 512B sectors) of the zones of the device.
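
    As a rough illustration of working with a zone size expressed in
    512B sectors, the sketch below maps a sector to its zone and models
    a write pointer reset. All names here are hypothetical, the struct
    is a simplified stand-in rather than the real struct blk_zone, and
    a uniform zone size across the device is assumed.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT_SKETCH 9 /* 512-byte sectors */

/* Simplified stand-in for a zone descriptor (not the kernel layout). */
struct zone_sketch {
    uint64_t start; /* first sector of the zone */
    uint64_t len;   /* zone length, in sectors */
    uint64_t wp;    /* write pointer, in sectors */
};

/* Index of the zone containing 'sector', given the zone size in sectors. */
static uint64_t zone_no(uint64_t sector, uint64_t zone_sectors)
{
    assert(zone_sectors != 0);
    return sector / zone_sectors;
}

/* First sector of the zone containing 'sector'. */
static uint64_t zone_start(uint64_t sector, uint64_t zone_sectors)
{
    return zone_no(sector, zone_sectors) * zone_sectors;
}

/* Zone size converted from 512B sectors to bytes. */
static uint64_t zone_bytes(uint64_t zone_sectors)
{
    return zone_sectors << SECTOR_SHIFT_SKETCH;
}

/* Model of a reset: the write pointer returns to the start of the zone. */
static void zone_reset(struct zone_sketch *z)
{
    z->wp = z->start;
}
```

    For example, with a zone size of 524288 sectors (256 MiB), sector
    1572881 falls in zone 3, which starts at sector 1572864.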

    Signed-off-by: Hannes Reinecke

    [Damien: * Removed the zone cache
    * Implement report zones operation based on earlier proposal
    by Shaun Tancheff ]
    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Shaun Tancheff
    Tested-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

10 Oct, 2016

1 commit


22 Sep, 2016

1 commit


19 Sep, 2016

1 commit


15 Sep, 2016

1 commit


24 Jan, 2016

1 commit

  • Pull rdma updates from Doug Ledford:
    "Initial roundup of 4.5 merge window patches

    - Remove usage of ib_query_device and instead store attributes in
    ib_device struct

    - Move iopoll out of block and into lib, rename to irqpoll, and use
    in several places in the rdma stack as our new completion queue
    polling library mechanism. Update the other block drivers that
    already used iopoll to use the new mechanism too.

    - Replace the per-entry GID table locks with a single GID table lock

    - IPoIB multicast cleanup

    - Cleanups to the IB MR facility

    - Add support for 64bit extended IB counters

    - Fix for netlink oops while parsing RDMA nl messages

    - RoCEv2 support for the core IB code

    - mlx4 RoCEv2 support

    - mlx5 RoCEv2 support

    - Cross Channel support for mlx5

    - Timestamp support for mlx5

    - Atomic support for mlx5

    - Raw QP support for mlx5

    - MAINTAINERS update for mlx4/mlx5

    - Misc ocrdma, qib, nes, usNIC, cxgb3, cxgb4, mlx4, mlx5 updates

    - Add support for remote invalidate to the iSER driver (pushed
    through the RDMA tree due to dependencies, acknowledged by nab)

    - Update to NFSoRDMA (pushed through the RDMA tree due to
    dependencies, acknowledged by Bruce)"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma: (169 commits)
    IB/mlx5: Unify CQ create flags check
    IB/mlx5: Expose Raw Packet QP to user space consumers
    {IB, net}/mlx5: Move the modify QP operation table to mlx5_ib
    IB/mlx5: Support setting Ethernet priority for Raw Packet QPs
    IB/mlx5: Add Raw Packet QP query functionality
    IB/mlx5: Add create and destroy functionality for Raw Packet QP
    IB/mlx5: Refactor mlx5_ib_qp to accommodate other QP types
    IB/mlx5: Allocate a Transport Domain for each ucontext
    net/mlx5_core: Warn on unsupported events of QP/RQ/SQ
    net/mlx5_core: Add RQ and SQ event handling
    net/mlx5_core: Export transport objects
    IB/mlx5: Expose CQE version to user-space
    IB/mlx5: Add CQE version 1 support to user QPs and SRQs
    IB/mlx5: Fix data validation in mlx5_ib_alloc_ucontext
    IB/sa: Fix netlink local service GFP crash
    IB/srpt: Remove redundant wc array
    IB/qib: Improve ipoib UD performance
    IB/mlx4: Advertise RoCE v2 support
    IB/mlx4: Create and use another QP1 for RoCEv2
    IB/mlx4: Enable send of RoCE QP1 packets with IP/UDP headers
    ...

    Linus Torvalds
     

09 Jan, 2016

1 commit

  • Take the core badblocks implementation from md, and make it generally
    available. This follows the same style as kernel implementations of
    linked lists, rb-trees etc, where you can have a structure that can be
    embedded anywhere, and accessor functions to manipulate the data.

    The only changes in this copy of the code are ones to generalize
    function/variable names from md-specific ones. Also add init and free
    functions.
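
    The "embeddable structure plus accessor functions" style mentioned
    above can be sketched as below. This is a deliberately simplified
    model with hypothetical names: it stores plain start/length pairs
    in a fixed array and does no merging or sorting, which is not how
    the md-derived implementation encodes its table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BB_SKETCH_MAX 16

struct bb_range_sketch {
    uint64_t start; /* first bad sector */
    uint64_t len;   /* number of bad sectors */
};

/* Embeddable bad-block table (hypothetical, simplified). */
struct badblocks_sketch {
    int count;
    struct bb_range_sketch r[BB_SKETCH_MAX];
};

/* Any owner can embed the table directly, as with a list head. */
struct disk_sketch {
    const char *name;
    struct badblocks_sketch bb;
};

static void bb_init(struct badblocks_sketch *bb)
{
    bb->count = 0;
}

/* Record a bad range; returns 0 on success, -1 if empty range or table full. */
static int bb_set(struct badblocks_sketch *bb, uint64_t start, uint64_t len)
{
    if (len == 0 || bb->count >= BB_SKETCH_MAX)
        return -1;
    bb->r[bb->count].start = start;
    bb->r[bb->count].len = len;
    bb->count++;
    return 0;
}

/*
 * Returns 1 and reports the first overlapping entry (in table order)
 * if [start, start + len) touches a recorded bad range, else 0.
 */
static int bb_check(const struct badblocks_sketch *bb, uint64_t start,
                    uint64_t len, uint64_t *first_bad, uint64_t *bad_len)
{
    assert(first_bad != NULL && bad_len != NULL);
    for (int i = 0; i < bb->count; i++) {
        const struct bb_range_sketch *r = &bb->r[i];

        if (r->start < start + len && start < r->start + r->len) {
            *first_bad = r->start;
            *bad_len = r->len;
            return 1;
        }
    }
    return 0;
}
```

    Because the table lives inside the owning structure, no separate
    allocation is needed in this sketch; the init function only has to
    reset the count.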

    Signed-off-by: Vishal Verma
    Signed-off-by: Dan Williams

    Vishal Verma
     

12 Dec, 2015

1 commit