12 Sep, 2021

1 commit

  • Pull block fixes from Jens Axboe:

    - NVMe pull request from Christoph:
    - fix nvmet command set reporting for passthrough controllers (Adam Manzanares)
    - update a MAINTAINERS email address (Chaitanya Kulkarni)
    - set QUEUE_FLAG_NOWAIT for nvme-multipath (me)
    - handle errors from add_disk() (Luis Chamberlain)
    - update the keep alive interval when kato is modified (Tatsuya Sasaki)
    - fix a buffer overrun in nvmet_subsys_attr_serial (Hannes Reinecke)
    - do not reset transport on data digest errors in nvme-tcp (Daniel Wagner)
    - only call synchronize_srcu when clearing current path (Daniel Wagner)
    - revalidate paths during rescan (Hannes Reinecke)

    - Split out fs/block_dev.c into block/fops.c and block/bdev.c, which
    has been long overdue. Do this now before -rc1, to avoid annoying
    conflicts due to this (Christoph)

    - blk-throtl use-after-free fix (Li)

    - Improve plug depth for multi-device plugs, greatly increasing md
    resync performance (Song)

    - blkdev_show() locking fix (Tetsuo)

    - n64cart error check fix (Yang)

    * tag 'block-5.15-2021-09-11' of git://git.kernel.dk/linux-block:
    n64cart: fix return value check in n64cart_probe()
    blk-mq: allow 4x BLK_MAX_REQUEST_COUNT at blk_plug for multiple_queues
    block: move fs/block_dev.c to block/bdev.c
    block: split out operations on block special files
    blk-throttle: fix UAF by deleteing timer in blk_throtl_exit()
    block: genhd: don't call blkdev_show() with major_names_lock held
    nvme: update MAINTAINERS email address
    nvme: add error handling support for add_disk()
    nvme: only call synchronize_srcu when clearing current path
    nvme: update keep alive interval when kato is modified
    nvme-tcp: Do not reset transport on data digest errors
    nvmet: fixup buffer overrun in nvmet_subsys_attr_serial()
    nvmet: return bool from nvmet_passthru_ctrl and nvmet_is_passthru_req
    nvmet: looks at the passthrough controller when initializing CAP
    nvme: move nvme_multi_css into nvme.h
    nvme-multipath: revalidate paths during rescan
    nvme-multipath: set QUEUE_FLAG_NOWAIT

    Linus Torvalds
     


03 Sep, 2021

1 commit

  • Pull SCSI updates from James Bottomley:
    "This series consists of the usual driver updates (ufs, qla2xxx,
    target, smartpqi, lpfc, mpt3sas).

    The core change causing the most churn was replacing the command's
    request field with a macro, allowing us to offset map to it
    and remove the redundant field; the same was also done for the tag
    field.

    The most impactful change is the final removal of scsi_ioctl, which
    has been deprecated for over a decade"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (293 commits)
    scsi: ufs: Fix ufshcd_request_sense_async() for Samsung KLUFG8RHDA-B2D1
    scsi: ufs: ufs-exynos: Fix static checker warning
    scsi: mpt3sas: Use the proper SCSI midlayer interfaces for PI
    scsi: lpfc: Use the proper SCSI midlayer interfaces for PI
    scsi: lpfc: Copyright updates for 14.0.0.1 patches
    scsi: lpfc: Update lpfc version to 14.0.0.1
    scsi: lpfc: Add bsg support for retrieving adapter cmf data
    scsi: lpfc: Add cmf_info sysfs entry
    scsi: lpfc: Add debugfs support for cm framework buffers
    scsi: lpfc: Add support for maintaining the cm statistics buffer
    scsi: lpfc: Add rx monitoring statistics
    scsi: lpfc: Add support for the CM framework
    scsi: lpfc: Add cmfsync WQE support
    scsi: lpfc: Add support for cm enablement buffer
    scsi: lpfc: Add cm statistics buffer support
    scsi: lpfc: Add EDC ELS support
    scsi: lpfc: Expand FPIN and RDF receive logging
    scsi: lpfc: Add MIB feature enablement support
    scsi: lpfc: Add SET_HOST_DATA mbox cmd to pass date/time info to firmware
    scsi: fc: Add EDC ELS definition
    ...

    Linus Torvalds
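
    The "offset map" mentioned above works by address arithmetic rather than a
    stored back-pointer: blk-mq allocates the SCSI command as the driver-private
    payload sitting directly behind its request, so the request can be recomputed
    from the command's address. A self-contained toy sketch of the pattern (the
    structs are stand-ins; in mainline the helpers are scsi_cmd_to_rq() and
    blk_mq_rq_from_pdu(), whose exact definitions may differ from this):

    #include <stdio.h>

    /* Toy layout mirroring blk-mq: each request is allocated together with a
     * driver-private payload (here, the SCSI command) placed right after it. */
    struct request   { int tag; };
    struct scsi_cmnd { unsigned char opcode; };

    struct rq_with_pdu {
        struct request   rq;
        struct scsi_cmnd cmd;    /* payload starts sizeof(struct request) in */
    };

    /* The "offset map": recover the owning request from the payload address
     * instead of keeping a ->request back-pointer in the command itself. */
    static struct request *cmd_to_rq(struct scsi_cmnd *cmd)
    {
        return (struct request *)((char *)cmd - sizeof(struct request));
    }

    int main(void)
    {
        struct rq_with_pdu a = { .rq = { .tag = 7 }, .cmd = { .opcode = 0x2a } };

        printf("tag = %d\n", cmd_to_rq(&a.cmd)->tag);   /* prints: tag = 7 */
        return 0;
    }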
     

31 Aug, 2021

1 commit

  • Pull block updates from Jens Axboe:
    "Nothing major in here - lots of good cleanups and tech debt handling,
    which is also evident in the diffstats. In particular:

    - Add disk sequence numbers (Matteo)

    - Discard merge fix (Ming)

    - Relax disk zoned reporting restrictions (Niklas)

    - Bio error handling zoned leak fix (Pavel)

    - Start of proper add_disk() error handling (Luis, Christoph)

    - blk crypto fix (Eric)

    - Non-standard GPT location support (Dmitry)

    - IO priority improvements and cleanups (Damien)

    - blk-throtl improvements (Chunguang)

    - diskstats_show() stack reduction (Abd-Alrhman)

    - Loop scheduler selection (Bart)

    - Switch block layer to use kmap_local_page() (Christoph)

    - Remove obsolete disk_name helper (Christoph)

    - block_device refcounting improvements (Christoph)

    - Ensure gendisk always has a request queue reference (Christoph)

    - Misc fixes/cleanups (Shaokun, Oliver, Guoqing)"

    * tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits)
    sg: pass the device name to blk_trace_setup
    block, bfq: cleanup the repeated declaration
    blk-crypto: fix check for too-large dun_bytes
    blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
    blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
    block: mark blkdev_fsync static
    block: refine the disk_live check in del_gendisk
    mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA
    mmc: block: Support alternative_gpt_sector() operation
    partitions/efi: Support non-standard GPT location
    block: Add alternative_gpt_sector() operation
    bio: fix page leak bio_add_hw_page failure
    block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT
    block: remove a pointless call to MINOR() in device_add_disk
    null_blk: add error handling support for add_disk()
    virtio_blk: add error handling support for add_disk()
    block: add error handling for device_add_disk / add_disk
    block: return errors from disk_alloc_events
    block: return errors from blk_integrity_add
    block: call blk_register_queue earlier in device_add_disk
    ...

    Linus Torvalds
     

12 Aug, 2021

1 commit

  • This reverts commit 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
    and a follow-up commit c06bc5a3fb42 ("block/mq-deadline: Remove a
    WARN_ON_ONCE() call"). The added cgroup support has the following issues:

    * It breaks cgroup interface file format rule by adding custom elements to a
    nested key-value file.

    * It registers mq-deadline as a cgroup-aware policy even though all it's
    doing is collecting per-cgroup stats. Even if we need these stats, this
    isn't the right way to add them.

    * It hasn't been reviewed from cgroup side.

    Cc: Bart Van Assche
    Cc: Jens Axboe
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

10 Aug, 2021

1 commit

  • Move the block holder code into a separate file as it is not in any way
    related to the other block_dev.c code, and add a new selectable config
    option for it so that it is not built when no remapped drivers are
    selected.

    The Kconfig symbol contains a _DEPRECATED suffix to match the comments
    added in commit 49731baa41df
    ("block: restore multiple bd_link_disk_holder() support").

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Mike Snitzer
    Link: https://lore.kernel.org/r/20210804094147.459763-2-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

03 Aug, 2021

1 commit

  • cmdline-parser.c is only used by the cmdline faux partition format,
    so merge the code into that and avoid an indirect call.

    Signed-off-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210728053756.409654-1-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     


25 Jun, 2021

1 commit

  • Move the code for handling disk events from genhd.c into a new file
    as it isn't very related to the rest of the file while at the same
    time requiring lots of forward declarations.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Link: https://lore.kernel.org/r/20210624073843.251178-2-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

22 Jun, 2021

2 commits

  • Maintain statistics per cgroup and export these to user space. These
    statistics are essential for verifying whether the proper I/O priorities
    have been assigned to requests. An example of the statistics data with
    this patch applied:

    $ cat /sys/fs/cgroup/io.stat
    11:2 rbytes=0 wbytes=0 rios=3 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0
    8:32 rbytes=2142720 wbytes=0 rios=105 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0

    Cc: Damien Le Moal
    Cc: Hannes Reinecke
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Johannes Thumshirn
    Cc: Himanshu Madhani
    Signed-off-by: Bart Van Assche
    Link: https://lore.kernel.org/r/20210618004456.7280-16-bvanassche@acm.org
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Introduce an rq-qos policy that assigns an I/O priority to requests based
    on blk-cgroup configuration settings. This policy has the following
    advantages over the ioprio_set() system call:
    - This policy is cgroup based so it has all the advantages of cgroups.
    - While ioprio_set() does not affect page cache writeback I/O, this rq-qos
    controller affects page cache writeback I/O for filesystems that support
    associating a cgroup with writeback I/O. See also
    Documentation/admin-guide/cgroup-v2.rst.

    Cc: Damien Le Moal
    Cc: Hannes Reinecke
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Johannes Thumshirn
    Cc: Himanshu Madhani
    Signed-off-by: Bart Van Assche
    Link: https://lore.kernel.org/r/20210618004456.7280-5-bvanassche@acm.org
    Signed-off-by: Jens Axboe

    Bart Van Assche
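
    A usage sketch for the cgroup side of this: the policy is configured through
    a per-cgroup file under /sys/fs/cgroup. The file name (io.prio.class) and the
    value ("idle") below are taken from Documentation/admin-guide/cgroup-v2.rst
    rather than from this entry, so treat them as assumptions:

    #include <stdio.h>
    #include <stdlib.h>

    /* Assumed interface: writing a class name to <cgroup>/io.prio.class makes
     * the rq-qos policy apply that I/O priority class to the cgroup's I/O. */
    static void write_file(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");

        if (!f || fputs(val, f) == EOF) {
            perror(path);
            exit(EXIT_FAILURE);
        }
        fclose(f);
    }

    int main(void)
    {
        /* Demote all I/O issued from a (pre-created) "backup" cgroup, including
         * its page cache writeback, to the idle class. */
        write_file("/sys/fs/cgroup/backup/io.prio.class", "idle");
        return 0;
    }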
     


14 May, 2020

3 commits

  • Blk-crypto delegates crypto operations to inline encryption hardware
    when available. The separately configurable blk-crypto-fallback contains
    a software fallback to the kernel crypto API - when enabled, blk-crypto
    will use this fallback for en/decryption when inline encryption hardware
    is not available.

    This lets upper layers not have to worry about whether or not the
    underlying device has support for inline encryption before deciding to
    specify an encryption context for a bio. It also allows for testing
    without actual inline encryption hardware - in particular, it makes it
    possible to test the inline encryption code in ext4 and f2fs simply by
    running xfstests with the inlinecrypt mount option, which in turn allows
    for things like the regular upstream regression testing of ext4 to cover
    the inline encryption code paths.

    For more details, refer to Documentation/block/inline-encryption.rst.

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Signed-off-by: Jens Axboe

    Satya Tangirala
     
  • We must have some way of letting a storage device driver know what
    encryption context it should use for en/decrypting a request. However,
    it's the upper layers (like the filesystem/fscrypt) that know about and
    manage encryption contexts. As such, when the upper layer submits a bio
    to the block layer, and this bio eventually reaches a device driver with
    support for inline encryption, the device driver will need to have been
    told the encryption context for that bio.

    We want to communicate the encryption context from the upper layer to the
    storage device along with the bio, when the bio is submitted to the block
    layer. To do this, we add a struct bio_crypt_ctx to struct bio, which can
    represent an encryption context (note that we can't use the bi_private
    field in struct bio to do this because that field does not function to pass
    information across layers in the storage stack). We also introduce various
    functions to manipulate the bio_crypt_ctx and make the bio/request merging
    logic aware of the bio_crypt_ctx.

    We also make changes to blk-mq to make it handle bios with encryption
    contexts. blk-mq can merge many bios into the same request. These bios need
    to have contiguous data unit numbers (the necessary changes to blk-merge
    are also made to ensure this) - as such, it suffices to keep the data unit
    number of just the first bio, since that's all a storage driver needs to
    infer the data unit number to use for each data block in each bio in a
    request. blk-mq keeps track of the encryption context to be used for all
    the bios in a request with the request's rq_crypt_ctx. When the first bio
    is added to an empty request, blk-mq will program the encryption context
    of that bio into the request_queue's keyslot manager, and store the
    returned keyslot in the request's rq_crypt_ctx. All the functions to
    operate on encryption contexts are in blk-crypto.c.

    Upper layers only need to call bio_crypt_set_ctx with the encryption key,
    algorithm and data_unit_num; they don't have to worry about getting a
    keyslot for each encryption context, as blk-mq/blk-crypto handles that.
    Blk-crypto also makes it possible for request-based layered devices like
    dm-rq to make use of inline encryption hardware by cloning the
    rq_crypt_ctx and programming a keyslot in the new request_queue when
    necessary.

    Note that any user of the block layer can submit bios with an
    encryption context, such as filesystems, device-mapper targets, etc.

    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
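
    A toy model (not kernel code) of the merging rule described above: bios may
    share a request only if their data unit numbers are contiguous, so storing
    the DUN of the first bio is enough to derive the DUN of every data block in
    the request:

    #include <stdio.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Simplified bio: its encryption context is reduced to a starting data
     * unit number (DUN) plus the number of data units the bio covers. */
    struct toy_bio {
        uint64_t dun;       /* DUN of the bio's first data block */
        uint64_t nr_units;  /* how many data units the bio spans */
    };

    /* A back merge is allowed only if the new bio continues the DUN sequence
     * of the bios already queued in the request. */
    static bool dun_contiguous(const struct toy_bio *last, const struct toy_bio *next)
    {
        return next->dun == last->dun + last->nr_units;
    }

    int main(void)
    {
        struct toy_bio a = { .dun = 100, .nr_units = 8 };
        struct toy_bio b = { .dun = 108, .nr_units = 4 };  /* contiguous     */
        struct toy_bio c = { .dun = 200, .nr_units = 4 };  /* not contiguous */

        printf("merge a+b: %d\n", dun_contiguous(&a, &b)); /* 1 */
        printf("merge a+c: %d\n", dun_contiguous(&a, &c)); /* 0 */
        return 0;
    }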
     
  • Inline Encryption hardware allows software to specify an encryption context
    (an encryption key, crypto algorithm, data unit num, data unit size) along
    with a data transfer request to a storage device, and the inline encryption
    hardware will use that context to en/decrypt the data. The inline
    encryption hardware is part of the storage device, and it conceptually sits
    on the data path between system memory and the storage device.

    Inline Encryption hardware implementations often function around the
    concept of "keyslots". These implementations often have a limited number
    of "keyslots", each of which can hold a key (we say that a key can be
    "programmed" into a keyslot). Requests made to the storage device may have
    a keyslot and a data unit number associated with them, and the inline
    encryption hardware will en/decrypt the data in the requests using the key
    programmed into that associated keyslot and the data unit number specified
    with the request.

    As keyslots are limited, and programming keys may be expensive in many
    implementations, and multiple requests may use exactly the same encryption
    contexts, we introduce a Keyslot Manager to efficiently manage keyslots.

    We also introduce a blk_crypto_key, which will represent the key that's
    programmed into keyslots managed by keyslot managers. The keyslot manager
    also functions as the interface that upper layers will use to program keys
    into inline encryption hardware. For more information on the Keyslot
    Manager, refer to documentation found in block/keyslot-manager.c and
    linux/keyslot-manager.h.

    Co-developed-by: Eric Biggers
    Signed-off-by: Eric Biggers
    Signed-off-by: Satya Tangirala
    Reviewed-by: Eric Biggers
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Satya Tangirala
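
    A deliberately simplified, hypothetical model of the keyslot reuse idea (the
    names below are not the kernel's keyslot-manager API): requests using the
    same key share an already-programmed slot via a reference count, and a key
    is programmed into hardware only when no slot currently holds it:

    #include <stdio.h>
    #include <string.h>

    #define NR_KEYSLOTS 4

    /* Hypothetical keyslot table: each slot remembers which key it holds and
     * how many in-flight requests are currently using it. */
    struct keyslot {
        char key[32];
        int  refcount;
        int  programmed;   /* has a key been programmed into this slot? */
    };

    static struct keyslot slots[NR_KEYSLOTS];

    /* Return a slot holding @key, programming the key into a free slot only
     * if no slot already has it; -1 means all slots are busy (caller waits). */
    static int keyslot_get(const char *key)
    {
        int i, free_slot = -1;

        for (i = 0; i < NR_KEYSLOTS; i++) {
            if (slots[i].programmed && strcmp(slots[i].key, key) == 0) {
                slots[i].refcount++;        /* reuse: no reprogramming cost */
                return i;
            }
            if (!slots[i].programmed && free_slot < 0)
                free_slot = i;
        }
        if (free_slot < 0)
            return -1;

        strcpy(slots[free_slot].key, key);  /* "program" the key */
        slots[free_slot].programmed = 1;
        slots[free_slot].refcount = 1;
        return free_slot;
    }

    static void keyslot_put(int slot)
    {
        slots[slot].refcount--;             /* slot stays programmed for reuse */
    }

    int main(void)
    {
        int a = keyslot_get("key-A");
        int b = keyslot_get("key-A");       /* same key: same slot, no reprogram */
        int c = keyslot_get("key-B");

        printf("A=%d A'=%d B=%d\n", a, b, c);   /* A=0 A'=0 B=1 */
        keyslot_put(a);
        keyslot_put(b);
        keyslot_put(c);
        return 0;
    }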
     


30 Jan, 2020

1 commit

  • Pull SCSI updates from James Bottomley:
    "This series is slightly unusual because it includes Arnd's compat
    ioctl tree here:

    1c46a2cf2dbd Merge tag 'block-ioctl-cleanup-5.6' into 5.6/scsi-queue

    Excluding Arnd's changes, this is mostly an update of the usual
    drivers: megaraid_sas, mpt3sas, qla2xxx, ufs, lpfc, hisi_sas.

    There are a couple of core and base updates around error propagation
    and atomicity in the attribute container base we use for the SCSI
    transport classes.

    The rest is minor changes and updates"

    * tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (149 commits)
    scsi: hisi_sas: Rename hisi_sas_cq.pci_irq_mask
    scsi: hisi_sas: Add prints for v3 hw interrupt converge and automatic affinity
    scsi: hisi_sas: Modify the file permissions of trigger_dump to write only
    scsi: hisi_sas: Replace magic number when handle channel interrupt
    scsi: hisi_sas: replace spin_lock_irqsave/spin_unlock_restore with spin_lock/spin_unlock
    scsi: hisi_sas: use threaded irq to process CQ interrupts
    scsi: ufs: Use UFS device indicated maximum LU number
    scsi: ufs: Add max_lu_supported in struct ufs_dev_info
    scsi: ufs: Delete is_init_prefetch from struct ufs_hba
    scsi: ufs: Inline two functions into their callers
    scsi: ufs: Move ufshcd_get_max_pwr_mode() to ufshcd_device_params_init()
    scsi: ufs: Split ufshcd_probe_hba() based on its called flow
    scsi: ufs: Delete struct ufs_dev_desc
    scsi: ufs: Fix ufshcd_probe_hba() reture value in case ufshcd_scsi_add_wlus() fails
    scsi: ufs-mediatek: enable low-power mode for hibern8 state
    scsi: ufs: export some functions for vendor usage
    scsi: ufs-mediatek: add dbg_register_dump implementation
    scsi: qla2xxx: Fix a NULL pointer dereference in an error path
    scsi: qla1280: Make checking for 64bit support consistent
    scsi: megaraid_sas: Update driver version to 07.713.01.00-rc1
    ...

    Linus Torvalds
     

07 Jan, 2020

1 commit

  • Currently t10-pi can only be built into the block layer which via
    crc-t10dif pulls in a whole chunk of the Crypto API. In fact all
    users of t10-pi work as modules and there is no reason for it to
    always be built-in.

    This patch adds a new hidden option for t10-pi that is selected
    automatically based on BLK_DEV_INTEGRITY and whether the users
    of t10-pi are built-in or not.

    Signed-off-by: Herbert Xu
    Signed-off-by: Jens Axboe

    Herbert Xu
     


29 Aug, 2019

1 commit

  • This patchset implements IO cost model based work-conserving
    proportional controller.

    While io.latency provides the capability to comprehensively prioritize
    and protect IOs depending on the cgroups, its protection is binary -
    the lowest latency target cgroup which is suffering is protected at
    the cost of all others. In many use cases including stacking multiple
    workload containers in a single system, it's necessary to distribute
    IO capacity with better granularity.

    One challenge of controlling IO resources is the lack of trivially
    observable cost metric. The most common metrics - bandwidth and iops
    - can be off by orders of magnitude depending on the device type and
    IO pattern. However, the cost isn't a complete mystery. Given
    several key attributes, we can make fairly reliable predictions on how
    expensive a given stream of IOs would be, at least compared to other
    IO patterns.

    The function which determines the cost of a given IO is the IO cost
    model for the device. This controller distributes IO capacity based
    on the costs estimated by such model. The more accurate the cost
    model the better but the controller adapts based on IO completion
    latency and as long as the relative costs across different IO
    patterns are consistent and sensible, it'll adapt to the actual
    performance of the device.

    Currently, the only implemented cost model is a simple linear one with
    a few sets of default parameters for different classes of device.
    This covers most common devices reasonably well. All the
    infrastructure to tune and add different cost models is already in
    place and a later patch will also allow using bpf progs for cost
    models.

    Please see the top comment in blk-iocost.c and documentation for
    more details.

    v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
    for a divide-by-zero bug in current_hweight() triggered by zero
    inuse_sum.

    Signed-off-by: Tejun Heo
    Cc: Andy Newell
    Cc: Josef Bacik
    Cc: Rik van Riel
    Signed-off-by: Jens Axboe

    Tejun Heo
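
    A hypothetical sketch of what a "simple linear" cost model looks like. The
    coefficient names and numbers are made up for illustration; the real
    parameters and the adaptation logic live in blk-iocost.c:

    #include <stdio.h>
    #include <stdbool.h>

    /* Made-up linear model: the absolute numbers do not matter much, only the
     * relative cost of different I/O patterns has to be roughly right. */
    struct linear_model {
        double cost_per_seq_io;   /* fixed cost of one sequential I/O */
        double cost_per_rand_io;  /* fixed cost of one random I/O     */
        double cost_per_byte;     /* size-proportional component      */
    };

    static double io_cost(const struct linear_model *m, bool random, double bytes)
    {
        double fixed = random ? m->cost_per_rand_io : m->cost_per_seq_io;

        return fixed + m->cost_per_byte * bytes;
    }

    int main(void)
    {
        /* Illustrative rotational-disk-like parameters: seeks dominate. */
        struct linear_model hdd = { 1.0, 40.0, 0.0005 };

        printf("64k sequential: %.1f\n", io_cost(&hdd, false, 65536)); /* ~33.8 */
        printf("4k random     : %.1f\n", io_cost(&hdd, true, 4096));   /* ~42.0 */
        return 0;
    }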
     


27 Sep, 2018

1 commit

  • Move the code for runtime power management from blk-core.c into the
    new source file blk-pm.c. Move the corresponding declarations from
    <linux/blkdev.h> into <linux/blk-pm.h>. For CONFIG_PM=n, leave out
    the declarations of the functions that are not used in that mode.
    This patch not only reduces the number of #ifdefs in the block layer
    core code but also reduces the size of the <linux/blkdev.h> header file
    and hence should help to reduce the build time of the Linux kernel
    if CONFIG_PM is not defined.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Jul, 2018

3 commits

  • Current IO controllers for the block layer are less than ideal for our
    use case. The io.max controller is great at hard limiting, but it is
    not work conserving. This patch introduces io.latency. You provide a
    latency target for your group and we monitor the io in short windows to
    make sure we are not exceeding those latency targets. This makes use of
    the rq-qos infrastructure and works much like the wbt stuff. There are
    a few differences from wbt

    - It's bio based, so the latency covers the whole block layer in addition to
    the actual io.
    - We will throttle all IO types that come in here if we need to.
    - We use the mean latency over the 100ms window. This is because writes can
    be particularly fast, which could give us a false sense of the impact of
    other workloads on our protected workload.
    - By default there's no throttling, we set the queue_depth to INT_MAX so that
    we can have as many outstanding bio's as we're allowed to. Only at
    throttle time do we pay attention to the actual queue depth.
    - We backcharge cgroups for root cg issued IO and induce artificial
    delays in order to deal with cases like metadata only or swap heavy
    workloads.

    In testing this has worked out relatively well. Protected workloads
    will throttle noisy workloads down to 1 io at a time if they are doing
    normal IO on their own, or induce up to a 1 second delay per syscall if
    they are doing a lot of root issued IO (metadata/swap IO).

    Our testing has revolved mostly around our production web servers where
    we have hhvm (the web server application) in a protected group and
    everything else in another group. We see slightly higher requests per
    second (RPS) on the test tier vs the control tier, and much more stable
    RPS across all machines in the test tier vs the control tier.

    Another test we run is a slow memory allocator in the unprotected group.
    Before this would eventually push us into swap and cause the whole box
    to die and not recover at all. With these patches we see slight RPS
    drops (usually 10-15%) before the memory consumer is properly killed and
    things recover within seconds.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
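
    A usage sketch for the interface described above: io.latency takes per-device
    lines of the form "MAJ:MIN target=<latency>". The key name and the unit used
    below (microseconds) are assumptions based on
    Documentation/admin-guide/cgroup-v2.rst, not on this entry:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Assumed format: "MAJOR:MINOR target=<usec>". Give the (pre-created)
         * "web" cgroup a 10ms completion-latency target on device 8:0. */
        FILE *f = fopen("/sys/fs/cgroup/web/io.latency", "w");

        if (!f || fprintf(f, "8:0 target=10000\n") < 0) {
            perror("io.latency");
            return EXIT_FAILURE;
        }
        fclose(f);
        return 0;
    }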
     
  • blkcg-qos is going to do essentially what wbt does, only on a cgroup
    basis. Break out the common code that will be shared between blkcg-qos
    and wbt into blk-rq-qos.* so they can both utilize the same
    infrastructure.

    Signed-off-by: Josef Bacik
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Exclude zoned block device members from struct request_queue for
    CONFIG_BLK_DEV_ZONED == n. Avoid breaking the build by only building
    the code that uses these struct request_queue members if
    CONFIG_BLK_DEV_ZONED != n.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Damien Le Moal
    Cc: Matias Bjorling
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from of the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if <5 lines).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Aug, 2017

1 commit

  • Like pci and virtio, we add an RDMA helper for affinity
    spreading. This achieves optimal mq affinity assignments
    according to the underlying rdma device affinity maps.

    Reviewed-by: Jens Axboe
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Doug Ledford

    Sagi Grimberg
     

19 Apr, 2017

2 commits

  • The BFQ I/O scheduler features an optimal fair-queuing
    (proportional-share) scheduling algorithm, enriched with several
    mechanisms to boost throughput and reduce latency for interactive and
    real-time applications. This makes BFQ a large and complex piece of
    code. This commit addresses this issue by splitting BFQ into three
    main, independent components, and by moving each component into a
    separate source file:
    1. Main algorithm: handles the interaction with the kernel, and
    decides which requests to dispatch; it uses the following two further
    components to achieve its goals.
    2. Scheduling engine (Hierarchical B-WF2Q+ scheduling algorithm):
    computes the schedule, using weights and budgets provided by the above
    component.
    3. cgroups support: handles group operations (creation, destruction,
    move, ...).

    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Paolo Valente
     
  • We tag as v0 the version of BFQ containing only BFQ's engine plus
    hierarchical support. BFQ's engine is introduced by this commit, while
    hierarchical support is added by next commit. We use the v0 tag to
    distinguish this minimal version of BFQ from the versions containing
    also the features and the improvements added by next commits. BFQ-v0
    coincides with the version of BFQ submitted a few years ago [1], apart
    from the introduction of preemption, described below.

    BFQ is a proportional-share I/O scheduler, whose general structure,
    plus a lot of code, are borrowed from CFQ.

    - Each process doing I/O on a device is associated with a weight and a
    (bfq_)queue.

    - BFQ grants exclusive access to the device, for a while, to one queue
    (process) at a time, and implements this service model by
    associating every queue with a budget, measured in number of
    sectors.

    - After a queue is granted access to the device, the budget of the
    queue is decremented, on each request dispatch, by the size of the
    request.

    - The in-service queue is expired, i.e., its service is suspended,
    only if one of the following events occurs: 1) the queue finishes
    its budget, 2) the queue empties, 3) a "budget timeout" fires.

    - The budget timeout prevents processes doing random I/O from
    holding the device for too long and dramatically reducing
    throughput.

    - Actually, as in CFQ, a queue associated with a process issuing
    sync requests may not be expired immediately when it empties. In
    contrast, BFQ may idle the device for a short time interval,
    giving the process the chance to go on being served if it issues
    a new request in time. Device idling typically boosts the
    throughput on rotational devices, if processes do synchronous
    and sequential I/O. In addition, under BFQ, device idling is
    also instrumental in guaranteeing the desired throughput
    fraction to processes issuing sync requests (see [2] for
    details).

    - With respect to idling for service guarantees, if several
    processes are competing for the device at the same time, but
    all processes (and groups, after the following commit) have
    the same weight, then BFQ guarantees the expected throughput
    distribution without ever idling the device. Throughput is
    thus as high as possible in this common scenario.

    - Queues are scheduled according to a variant of WF2Q+, named
    B-WF2Q+, and implemented using an augmented rb-tree to preserve an
    O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
    also ready for hierarchical scheduling. However, for a cleaner
    logical breakdown, the code that enables and completes
    hierarchical support is provided in the next commit, which focuses
    exactly on this feature.

    - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
    perfectly fair, and smooth service. In particular, B-WF2Q+
    guarantees that each queue receives a fraction of the device
    throughput proportional to its weight, even if the throughput
    fluctuates, and regardless of: the device parameters, the current
    workload and the budgets assigned to the queue.

    - The last, budget-independence, property (although probably
    counterintuitive in the first place) is definitely beneficial, for
    the following reasons:

    - First, with any proportional-share scheduler, the maximum
    deviation with respect to an ideal service is proportional to
    the maximum budget (slice) assigned to queues. As a consequence,
    BFQ can keep this deviation tight not only because of the
    accurate service of B-WF2Q+, but also because BFQ *does not*
    need to assign a larger budget to a queue to let the queue
    receive a higher fraction of the device throughput.

    - Second, BFQ is free to choose, for every process (queue), the
    budget that best fits the needs of the process, or best
    leverages the I/O pattern of the process. In particular, BFQ
    updates queue budgets with a simple feedback-loop algorithm that
    allows a high throughput to be achieved, while still providing
    tight latency guarantees to time-sensitive applications. When
    the in-service queue expires, this algorithm computes the next
    budget of the queue so as to:

    - Let large budgets be eventually assigned to the queues
    associated with I/O-bound applications performing sequential
    I/O: in fact, the longer these applications are served once
    got access to the device, the higher the throughput is.

    - Let small budgets be eventually assigned to the queues
    associated with time-sensitive applications (which typically
    perform sporadic and short I/O), because, the smaller the
    budget assigned to a queue waiting for service is, the sooner
    B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).

    - Weights can be assigned to processes only indirectly, through I/O
    priorities, and according to the relation:
    weight = 10 * (IOPRIO_BE_NR - ioprio).
    The next patch provides, instead, a cgroups interface through which
    weights can be assigned explicitly.

    - If several processes are competing for the device at the same time,
    but all processes and groups have the same weight, then BFQ
    guarantees the expected throughput distribution without ever idling
    the device. It uses preemption instead. Throughput is then much
    higher in this common scenario.

    - ioprio classes are served in strict priority order, i.e.,
    lower-priority queues are not served as long as there are
    higher-priority queues. Among queues in the same class, the
    bandwidth is distributed in proportion to the weight of each
    queue. A very thin extra bandwidth is however guaranteed to the Idle
    class, to prevent it from starving.

    - If the strict_guarantees parameter is set (default: unset), then BFQ
    - always performs idling when the in-service queue becomes empty;
    - forces the device to serve one I/O request at a time, by
    dispatching a new request only if there is no outstanding
    request.
    In the presence of differentiated weights or I/O-request sizes,
    both the above conditions are needed to guarantee that every
    queue receives its allotted share of the bandwidth (see
    Documentation/block/bfq-iosched.txt for more details). Setting
    strict_guarantees may evidently affect throughput.

    [1] https://lkml.org/lkml/2008/4/1/234
    https://lkml.org/lkml/2008/11/11/148

    [2] P. Valente and M. Andreolini, "Improving Application
    Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
    the 5th Annual International Systems and Storage Conference
    (SYSTOR '12), June 2012.
    Slightly extended version:
    http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-results.pdf

    Signed-off-by: Fabio Checconi
    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Paolo Valente
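
    A quick check of the weight relation quoted above, assuming IOPRIO_BE_NR is 8
    as in include/linux/ioprio.h: best-effort priority 0 (highest) maps to weight
    80 and priority 7 (lowest) to weight 10.

    #include <stdio.h>

    #define IOPRIO_BE_NR 8   /* number of best-effort priority levels */

    int main(void)
    {
        /* weight = 10 * (IOPRIO_BE_NR - ioprio), i.e. 80, 70, ..., 10 */
        for (int ioprio = 0; ioprio < IOPRIO_BE_NR; ioprio++)
            printf("ioprio %d -> weight %d\n",
                   ioprio, 10 * (IOPRIO_BE_NR - ioprio));
        return 0;
    }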
     

15 Apr, 2017

1 commit

  • The Kyber I/O scheduler is an I/O scheduler for fast devices designed to
    scale to multiple queues. Users configure only two knobs, the target
    read and synchronous write latencies, and the scheduler tunes itself to
    achieve that latency goal.

    The implementation is based on "tokens", built on top of the scalable
    bitmap library. Tokens serve as a mechanism for limiting requests. There
    are two tiers of tokens: queueing tokens and dispatch tokens.

    A queueing token is required to allocate a request. In fact, these
    tokens are actually the blk-mq internal scheduler tags, but the
    scheduler manages the allocation directly in order to implement its
    policy.

    Dispatch tokens are device-wide and split up into two scheduling
    domains: reads vs. writes. Each hardware queue dispatches batches
    round-robin between the scheduling domains as long as tokens are
    available for that domain.

    These tokens can be used as the mechanism to enable various policies.
    The policy Kyber uses is inspired by active queue management techniques
    for network routing, similar to blk-wbt. The scheduler monitors
    latencies and scales the number of dispatch tokens accordingly. Queueing
    tokens are used to prevent starvation of synchronous requests by
    asynchronous requests.

    Various extensions are possible, including better heuristics and ionice
    support. The new scheduler isn't set as the default yet.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
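
    A hypothetical, much-simplified model of the two-tier token idea (plain
    counters instead of the scalable bitmaps Kyber actually uses): a request
    needs a queueing token to be allocated at all and a per-domain dispatch
    token to be sent to the device, and shrinking a domain's dispatch-token
    pool is how the scheduler reins that domain in when latencies rise:

    #include <stdio.h>
    #include <stdbool.h>

    enum domain { DOMAIN_READ, DOMAIN_WRITE, NR_DOMAINS };

    /* Toy token pools: queueing tokens gate request allocation, dispatch
     * tokens cap how many requests each domain may have on the device. */
    static int queueing_tokens = 256;
    static int dispatch_tokens[NR_DOMAINS] = { 128, 32 };

    static bool try_get(int *pool)
    {
        if (*pool <= 0)
            return false;
        (*pool)--;
        return true;
    }

    static bool dispatch(enum domain d)
    {
        /* Both tiers must grant a token before the request may go out. */
        return try_get(&queueing_tokens) && try_get(&dispatch_tokens[d]);
    }

    int main(void)
    {
        int sent = 0;

        /* Writes exhaust their smaller dispatch pool long before reads would,
         * which is how read latency is protected from a flood of writes. */
        while (dispatch(DOMAIN_WRITE))
            sent++;
        printf("writes admitted before throttling: %d\n", sent);  /* 32 */
        return 0;
    }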
     

03 Mar, 2017

1 commit

  • Pull vhost updates from Michael Tsirkin:
    "virtio, vhost: optimizations, fixes

    Looks like a quiet cycle for vhost/virtio, just a couple of minor
    tweaks. Most notable is automatic interrupt affinity for blk and scsi.
    Hopefully other devices are not far behind"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    virtio-console: avoid DMA from stack
    vhost: introduce O(1) vq metadata cache
    virtio_scsi: use virtio IRQ affinity
    virtio_blk: use virtio IRQ affinity
    blk-mq: provide a default queue mapping for virtio device
    virtio: provide a method to get the IRQ affinity mask for a virtqueue
    virtio: allow drivers to request IRQ affinity when creating VQs
    virtio_pci: simplify MSI-X setup
    virtio_pci: don't duplicate the msix_enable flag in struct pci_dev
    virtio_pci: use shared interrupts for virtqueues
    virtio_pci: remove struct virtio_pci_vq_info
    vhost: try avoiding avail index access when getting descriptor
    virtio_mmio: expose header to userspace

    Linus Torvalds
     


07 Feb, 2017

1 commit

  • This patch implements the necessary logic to bring an Opal
    enabled drive out of a factory-enabled into a working
    Opal state.

    This patch set also enables logic to save a password to
    be replayed during a resume from suspend.

    Signed-off-by: Scott Bauer
    Signed-off-by: Rafael Antognolli
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Scott Bauer
     


28 Jan, 2017

1 commit

  • This fixes a couple of problems:

    1. In the !CONFIG_DEBUG_FS case, the stub definitions were bogus.
    2. In the !CONFIG_BLOCK case, blk-mq-debugfs.c shouldn't be compiled at
    all.

    Fix the stub definitions and add a CONFIG_BLK_DEBUG_FS Kconfig option.

    Fixes: 07e4fead45e6 ("blk-mq: create debugfs directory tree")
    Signed-off-by: Omar Sandoval

    Augment Kconfig description.

    Signed-off-by: Jens Axboe

    Omar Sandoval