29 Aug, 2019

2 commits

  • This patchset implements an IO cost model based work-conserving
    proportional controller.

    While io.latency provides the capability to comprehensively prioritize
    and protect IOs depending on the cgroups, its protection is binary -
    the lowest latency target cgroup which is suffering is protected at
    the cost of all others. In many use cases including stacking multiple
    workload containers in a single system, it's necessary to distribute
    IO capacity with better granularity.

    One challenge of controlling IO resources is the lack of a trivially
    observable cost metric. The most common metrics - bandwidth and iops
    - can be off by orders of magnitude depending on the device type and
    IO pattern. However, the cost isn't a complete mystery. Given
    several key attributes, we can make fairly reliable predictions on how
    expensive a given stream of IOs would be, at least compared to other
    IO patterns.

    The function which determines the cost of a given IO is the IO cost
    model for the device. This controller distributes IO capacity based
    on the costs estimated by such a model. The more accurate the cost
    model the better but the controller adapts based on IO completion
    latency and as long as the relative costs across different IO
    patterns are consistent and sensible, it'll adapt to the actual
    performance of the device.

    Currently, the only implemented cost model is a simple linear one with
    a few sets of default parameters for different classes of device.
    This covers most common devices reasonably well. All the
    infrastructure to tune and add different cost models is already in
    place and a later patch will also allow using bpf progs for cost
    models.
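
    A minimal sketch of what a linear cost model could look like, written
    in Python for brevity. The class name, the coefficient values and the
    4 KiB page size are illustrative assumptions, not the parameters used
    in blk-iocost.c:

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # bytes; assumed page size for this sketch


@dataclass(frozen=True)
class LinearCostModel:
    """Hypothetical linear model: per-IO base cost plus per-page cost."""

    seq_io_cost: float   # base cost of an IO that continues the previous one
    rand_io_cost: float  # base cost of an IO that lands somewhere else
    page_cost: float     # additional cost for every page transferred

    def cost(self, nr_bytes: int, sequential: bool) -> float:
        pages = -(-nr_bytes // PAGE_SIZE)  # round up to whole pages
        base = self.seq_io_cost if sequential else self.rand_io_cost
        return base + pages * self.page_cost


# Made-up coefficients for a device where random access is expensive.
hdd_like = LinearCostModel(seq_io_cost=1.0, rand_io_cost=50.0, page_cost=0.5)
```

    With these made-up numbers a random 4 KiB IO costs far more than a
    sequential one of the same size, which is the kind of relative
    difference the controller relies on.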

    Please see the top comment in blk-iocost.c and documentation for
    more details.

    v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
    for a divide-by-zero bug in current_hweight() triggered by zero
    inuse_sum.

    Signed-off-by: Tejun Heo
    Cc: Andy Newell
    Cc: Josef Bacik
    Cc: Rik van Riel
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There are currently two start time timestamps - start_time_ns and
    io_start_time_ns. The former marks the request allocation and the
    latter the issue-to-device time. The planned io.weight controller needs
    to measure the total time bios take to execute after they leave rq_qos,
    including the time spent waiting for a request to become available,
    which can easily dominate on saturated devices.

    This patch adds request->alloc_time_ns which records when the request
    allocation attempt started. As it isn't used for the usual stats,
    make it optional behind CONFIG_BLK_RQ_ALLOC_TIME and
    QUEUE_FLAG_RQ_ALLOC_TIME so that it can be compiled out when there are
    no users and it's active only on queues which need it even when
    compiled in.
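
    The relationship between the three timestamps can be sketched as
    follows (Python, for illustration only). The field names mirror the
    ones above; complete_time_ns and the helper methods are hypothetical
    and exist only for this example:

```python
from dataclasses import dataclass


@dataclass
class RequestTimes:
    alloc_time_ns: int     # when the request allocation attempt started
    start_time_ns: int     # when the request was actually allocated
    io_start_time_ns: int  # when the request was issued to the device
    complete_time_ns: int  # hypothetical completion time for this sketch

    def alloc_wait_ns(self) -> int:
        """Time spent waiting for a request to become available."""
        return self.start_time_ns - self.alloc_time_ns

    def total_ns(self) -> int:
        """Total time from the allocation attempt to completion."""
        return self.complete_time_ns - self.alloc_time_ns


# Made-up numbers for a saturated device: most of the time is spent
# waiting for a free request rather than on the device itself.
r = RequestTimes(alloc_time_ns=100, start_time_ns=900,
                 io_start_time_ns=1000, complete_time_ns=1500)
```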

    v2: s/pre_start_time/alloc_time/ and add CONFIG_BLK_RQ_ALLOC_TIME
    gating as suggested by Jens.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jul, 2019

2 commits


09 Jul, 2019

1 commit


15 Jun, 2019

1 commit

  • Convert the cgroup-v1 files to ReST format, in order to
    allow a later addition to the admin-guide.

    The conversion is actually:
    - add blank lines and indentation in order to identify paragraphs;
    - fix tables markups;
    - add some lists markups;
    - mark literal blocks;
    - adjust title markups.

    At its new index.rst, let's add a :orphan: while this is not linked to
    the main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Tejun Heo
    Signed-off-by: Tejun Heo

    Mauro Carvalho Chehab
     

13 Jun, 2019

1 commit

  • In most use cases of zoned block devices (aka SMR disks), the
    mq-deadline scheduler is mandatory as it implements sequential write
    command processing guarantees with zone write locking. So make sure that
    this scheduler is always enabled if CONFIG_BLK_DEV_ZONED is selected.

    Tested-by: Chaitanya Kulkarni
    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Damien Le Moal
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

07 Apr, 2019

1 commit

  • Currently support for 64-bit sector_t and blkcnt_t is optional on 32-bit
    architectures. These types are required to support block device and/or
    file sizes larger than 2 TiB, and have generally defaulted to on for
    a long time. Enabling the option only increases the i386 tinyconfig
    size by 145 bytes, and many data structures already always use
    64-bit values for their in-core and on-disk data structures anyway,
    so there should not be a large change in dynamic memory usage either.
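
    The 2 TiB figure follows from counting 512-byte sectors with a 32-bit
    type. A short Python check of that arithmetic (the helper name is
    made up for this example):

```python
SECTOR_SIZE = 512  # bytes per sector counted by sector_t
TIB = 1 << 40      # bytes in one TiB


def max_addressable_bytes(sector_bits: int) -> int:
    """Largest size addressable with a sector count of the given width."""
    return (1 << sector_bits) * SECTOR_SIZE


limit_32 = max_addressable_bytes(32)
limit_64 = max_addressable_bytes(64)
```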

    Dropping this option removes a somewhat weird non-default config that
    has caused various bugs or compiler warnings when actually used.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

30 Dec, 2018

1 commit

  • Pull Kconfig updates from Masahiro Yamada:

    - support -y option for merge_config.sh to avoid downgrading =y to =m

    - remove S_OTHER symbol type, and touch include/config/*.h files correctly

    - fix file name and line number in lexer warnings

    - fix memory leak when EOF is encountered in quotation

    - resolve all shift/reduce conflicts of the parser

    - warn no new line at end of file

    - make 'source' statement more strict to take only string literal

    - rewrite the lexer and remove the keyword lookup table

    - convert to SPDX License Identifier

    - compile C files independently instead of including them from zconf.y

    - fix various warnings of gconfig

    - misc cleanups

    * tag 'kconfig-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits)
    kconfig: surround dbg_sym_flags with #ifdef DEBUG to fix gconf warning
    kconfig: split images.c out of qconf.cc/gconf.c to fix gconf warnings
    kconfig: add static qualifiers to fix gconf warnings
    kconfig: split the lexer out of zconf.y
    kconfig: split some C files out of zconf.y
    kconfig: convert to SPDX License Identifier
    kconfig: remove keyword lookup table entirely
    kconfig: update current_pos in the second lexer
    kconfig: switch to ASSIGN_VAL state in the second lexer
    kconfig: stop associating kconf_id with yylval
    kconfig: refactor end token rules
    kconfig: stop supporting '.' and '/' in unquoted words
    treewide: surround Kconfig file paths with double quotes
    microblaze: surround string default in Kconfig with double quotes
    kconfig: use T_WORD instead of T_VARIABLE for variables
    kconfig: use specific tokens instead of T_ASSIGN for assignments
    kconfig: refactor scanning and parsing "option" properties
    kconfig: use distinct tokens for type and default properties
    kconfig: remove redundant token defines
    kconfig: rename depends_list to comment_option_list
    ...

    Linus Torvalds
     

21 Dec, 2018

1 commit

  • The Kconfig lexer supports special characters such as '.' and '/' in
    the parameter context. In my understanding, the reason is just to
    support bare file paths in the source statement.

    I do not see a good reason to complicate Kconfig for the room of
    ambiguity.

    The majority of code already surrounds file paths with double quotes,
    and it makes sense since file paths are constant string literals.

    Make it treewide consistent now.

    Signed-off-by: Masahiro Yamada
    Acked-by: Wolfram Sang
    Acked-by: Geert Uytterhoeven
    Acked-by: Ingo Molnar

    Masahiro Yamada
     

08 Nov, 2018

1 commit


11 Oct, 2018

1 commit

  • 'default n' is the default value for any bool or tristate Kconfig
    setting so there is no need to write it explicitly.

    Also since commit f467c5640c29 ("kconfig: only write '# CONFIG_FOO
    is not set' for visible symbols") the Kconfig behavior is the same
    regardless of 'default n' being present or not:

    ...
    One side effect of (and the main motivation for) this change is making
    the following two definitions behave exactly the same:

    config FOO
            bool

    config FOO
            bool
            default n

    With this change, neither of these will generate a
    '# CONFIG_FOO is not set' line (assuming FOO isn't selected/implied).
    That might make it clearer to people that a bare 'default n' is
    redundant.
    ...

    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Jens Axboe

    Bartlomiej Zolnierkiewicz
     

27 Sep, 2018

1 commit

  • Move the code for runtime power management from blk-core.c into the
    new source file blk-pm.c. Move the corresponding declarations from
    into . For CONFIG_PM=n, leave out
    the declarations of the functions that are not used in that mode.
    This patch not only reduces the number of #ifdefs in the block layer
    core code but also reduces the size of header file
    and hence should help to reduce the build time of the Linux kernel
    if CONFIG_PM is not defined.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Cc: Alan Stern
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

09 Jul, 2018

2 commits

  • Current IO controllers for the block layer are less than ideal for our
    use case. The io.max controller is great at hard limiting, but it is
    not work conserving. This patch introduces io.latency. You provide a
    latency target for your group and we monitor the io in short windows to
    make sure we are not exceeding those latency targets. This makes use of
    the rq-qos infrastructure and works much like the wbt stuff. There are
    a few differences from wbt:

    - It's bio based, so the latency covers the whole block layer in addition to
    the actual io.
    - We will throttle all IO types that come in here if we need to.
    - We use the mean latency over the 100ms window. This is because writes can
    be particularly fast, which could give us a false sense of the impact of
    other workloads on our protected workload.
    - By default there's no throttling, we set the queue_depth to INT_MAX so that
    we can have as many outstanding bio's as we're allowed to. Only at
    throttle time do we pay attention to the actual queue depth.
    - We backcharge cgroups for root cg issued IO and induce artificial
    delays in order to deal with cases like metadata only or swap heavy
    workloads.
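
    To illustrate why the mean is used, here is a small Python sketch
    with made-up latencies. The function names and the choice to treat an
    empty window as "no miss" are assumptions of this example, not taken
    from the patch:

```python
def mean_latency_us(samples_us):
    """Mean completion latency of the bios seen in one window."""
    return sum(samples_us) / len(samples_us)


def window_missed_target(samples_us, target_us):
    """True if the window's mean latency exceeds the group's target."""
    if not samples_us:
        return False  # assumption: an idle window never counts as a miss
    return mean_latency_us(samples_us) > target_us


# One very fast write next to two slow reads, in microseconds.
window = [50, 20_000, 30_000]
target_us = 10_000
```

    Judged by the fastest sample alone this window would look healthy,
    while the mean shows that the group is missing its target.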

    In testing this has worked out relatively well. Protected workloads
    will throttle noisy workloads down to 1 io at a time if they are doing
    normal IO on their own, or induce up to a 1 second delay per syscall if
    they are doing a lot of root issued IO (metadata/swap IO).

    Our testing has revolved mostly around our production web servers where
    we have hhvm (the web server application) in a protected group and
    everything else in another group. We see slightly higher requests per
    second (RPS) on the test tier vs the control tier, and much more stable
    RPS across all machines in the test tier vs the control tier.

    Another test we run is a slow memory allocator in the unprotected group.
    Before this would eventually push us into swap and cause the whole box
    to die and not recover at all. With these patches we see slight RPS
    drops (usually 10-15%) before the memory consumer is properly killed and
    things recover within seconds.

    Signed-off-by: Josef Bacik
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Josef Bacik
     
  • Exclude zoned block device members from struct request_queue for
    CONFIG_BLK_DEV_ZONED == n. Avoid breaking the build by only building
    the code that uses these struct request_queue members if
    CONFIG_BLK_DEV_ZONED != n.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Damien Le Moal
    Cc: Matias Bjorling
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

16 Jun, 2018

1 commit

  • As we move stuff around, some doc references are broken. Fix some of
    them via this script:
    ./scripts/documentation-file-ref-check --fix

    Manually checked if the produced result is valid, removing a few
    false-positives.

    Acked-by: Takashi Iwai
    Acked-by: Masami Hiramatsu
    Acked-by: Stephen Boyd
    Acked-by: Charles Keepax
    Acked-by: Mathieu Poirier
    Reviewed-by: Coly Li
    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

09 Aug, 2017

1 commit

  • Like pci and virtio, we add a rdma helper for affinity
    spreading. This achieves optimal mq affinity assignments
    according to the underlying rdma device affinity maps.

    Reviewed-by: Jens Axboe
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Max Gurtovoy
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Doug Ledford

    Sagi Grimberg
     

13 May, 2017

1 commit

  • Pull libnvdimm fixes from Dan Williams:
    "Incremental fixes and a small feature addition on top of the main
    libnvdimm 4.12 pull request:

    - Geert noticed that tinyconfig was bloated by BLOCK selecting DAX.
    The size regression is fixed by moving all dax helpers into the
    dax-core and only specifying "select DAX" for FS_DAX and
    dax-capable drivers. He also asked for clarification of the
    NR_DEV_DAX config option which, on closer look, does not need to be
    a config option at all. Mike also throws in a DEV_DAX_PMEM fixup
    for good measure.

    - Ben's attention to detail on -stable patch submissions caught a
    case where the recent fixes to arch_copy_from_iter_pmem() missed a
    condition where we strand dirty data in the cache. This is tagged
    for -stable and will also be included in the rework of the pmem api
    to a proposed {memcpy,copy_user}_flushcache() interface for 4.13.

    - Vishal adds a feature that missed the initial pull due to pending
    review feedback. It allows the kernel to clear media errors when
    initializing a BTT (atomic sector update driver) instance on a pmem
    namespace.

    - Ross noticed that the dax_device + dax_operations conversion broke
    __dax_zero_page_range(). The nvdimm unit tests fail to check this
    path, but xfstests immediately trips over it. No excuse for missing
    this before submitting the 4.12 pull request.

    These all pass the nvdimm unit tests and an xfstests spot check. The
    set has received a build success notification from the kbuild robot"

    * 'libnvdimm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    filesystem-dax: fix broken __dax_zero_page_range() conversion
    libnvdimm, btt: ensure that initializing metadata clears poison
    libnvdimm: add an atomic vs process context flag to rw_bytes
    x86, pmem: Fix cache flushing for iovec write < 8 bytes
    device-dax: kill NR_DEV_DAX
    block, dax: move "select DAX" from BLOCK to FS_DAX
    device-dax: Tell kbuild DEV_DAX_PMEM depends on DEV_DAX

    Linus Torvalds
     

09 May, 2017

1 commit

  • For configurations that do not enable DAX filesystems or drivers, do not
    require the DAX core to be built.

    Given that the 'direct_access' method has been removed from
    'block_device_operations', we can also go ahead and remove the
    block-related dax helper functions from fs/block_dev.c to
    drivers/dax/super.c. This keeps dax details out of the block layer and
    lets the DAX core be built as a module in the FS_DAX=n case.

    Filesystems need to include dax.h to call bdev_dax_supported().

    Cc: linux-xfs@vger.kernel.org
    Cc: Jens Axboe
    Cc: "Theodore Ts'o"
    Cc: Matthew Wilcox
    Cc: Alexander Viro
    Cc: "Darrick J. Wong"
    Cc: Ross Zwisler
    Reviewed-by: Jan Kara
    Reported-by: Geert Uytterhoeven
    Signed-off-by: Dan Williams

    Dan Williams
     

06 May, 2017

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has been in multiple -next releases. There were a few
    late breaking fixes and small features that got added in the last
    couple days, but the whole set has received a build success
    notification from the kbuild robot.

    Change summary:

    - Region media error reporting: A libnvdimm region device is the
    parent to one or more namespaces. To date, media errors have been
    reported via the "badblocks" attribute attached to pmem block
    devices for namespaces in "raw" or "memory" mode. Given that
    namespaces can be in "device-dax" or "btt-sector" mode this new
    interface reports media errors generically, i.e. independent of
    namespace modes or state.

    This subsequently allows userspace tooling to craft "ACPI 6.1
    Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
    requests and submit them via the ioctl path for NVDIMM root bus
    devices.

    - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
    by a request from Linus and feedback from Christoph this allows for
    dax capable drivers to publish their own custom dax operations.
    This fixes the broken assumption that all dax operations are
    related to a persistent memory device, and makes it easier for
    other architectures and platforms to add customized persistent
    memory support.

    - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
    available for storage appliance applications to manually trigger
    memory controllers to drain write-pending buffers that would
    otherwise be flushed automatically by the platform ADR
    (asynchronous-DRAM-refresh) mechanism at a power loss event.
    Support for "locked" DIMMs is included to prevent namespaces from
    surfacing when the namespace label data area is locked. Finally,
    fixes for various reported deadlocks and crashes, also tagged for
    -stable.

    - ACPI / nfit driver updates: General updates of the nfit driver to
    add DSM command overrides, ACPI 6.1 health state flags support, DSM
    payload debug available by default, and various fixes.

    Acknowledgements that came after the branch was pushed:

    - commit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
    Tested-by: Yi Zhang

    - commit 23f498448362 "libnvdimm: rework region badblocks clearing"
    Tested-by: Toshi Kani "

    * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
    libnvdimm, pfn: fix 'npfns' vs section alignment
    libnvdimm: handle locked label storage areas
    libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
    brd: fix uninitialized use of brd->dax_dev
    block, dax: use correct format string in bdev_dax_supported
    device-dax: fix sysfs attribute deadlock
    libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
    libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
    libnvdimm: rework region badblocks clearing
    acpi, nfit: kill ACPI_NFIT_DEBUG
    libnvdimm: fix clear length of nvdimm_forget_poison()
    libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
    libnvdimm, region: sysfs trigger for nvdimm_flush()
    libnvdimm: fix phys_addr for nvdimm_clear_poison
    x86, dax, pmem: remove indirection around memcpy_from_pmem()
    block: remove block_device_operations ->direct_access()
    block, dax: convert bdev_dax_supported() to dax_direct_access()
    filesystem-dax: convert to dax_direct_access()
    Revert "block: use DAX for partition table reads"
    ext2, ext4, xfs: retrieve dax_device for iomap operations
    ...

    Linus Torvalds
     

21 Apr, 2017

1 commit

  • Replace bdev_direct_access() with dax_direct_access() that uses
    dax_device and dax_operations instead of a block_device and
    block_device_operations for dax. Once all consumers of the old api have
    been converted bdev_direct_access() will be deleted.

    Given that block device partitioning decisions can cause dax page
    alignment constraints to be violated this also introduces the
    bdev_dax_pgoff() helper. It handles calculating a logical pgoff relative
    to the dax_device and also checks for page alignment.
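
    A rough Python sketch of the calculation such a helper has to do. The
    function name, the argument names and the 4 KiB page size are
    assumptions of this example; the sketch only shows the offset
    arithmetic and the alignment check described above:

```python
PAGE_SIZE = 4096   # bytes; assumed page size
SECTOR_SIZE = 512  # bytes per sector


def dax_pgoff(part_start_sector: int, sector: int, size: int) -> int:
    """Page offset of a partition-relative sector within the whole device.

    Raises ValueError if the resulting byte offset or the size is not
    page aligned, e.g. because the partition starts at an odd sector.
    """
    phys_off = (part_start_sector + sector) * SECTOR_SIZE
    if phys_off % PAGE_SIZE or size % PAGE_SIZE:
        raise ValueError("range is not page aligned")
    return phys_off // PAGE_SIZE
```

    A partition starting at sector 2048 keeps every eighth sector page
    aligned, while one starting at sector 63 does not.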

    Signed-off-by: Dan Williams

    Dan Williams
     

28 Mar, 2017

1 commit


03 Mar, 2017

1 commit

  • Pull vhost updates from Michael Tsirkin:
    "virtio, vhost: optimizations, fixes

    Looks like a quiet cycle for vhost/virtio, just a couple of minor
    tweaks. Most notable is automatic interrupt affinity for blk and scsi.
    Hopefully other devices are not far behind"

    * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
    virtio-console: avoid DMA from stack
    vhost: introduce O(1) vq metadata cache
    virtio_scsi: use virtio IRQ affinity
    virtio_blk: use virtio IRQ affinity
    blk-mq: provide a default queue mapping for virtio device
    virtio: provide a method to get the IRQ affinity mask for a virtqueue
    virtio: allow drivers to request IRQ affinity when creating VQs
    virtio_pci: simplify MSI-X setup
    virtio_pci: don't duplicate the msix_enable flag in struct pci_dev
    virtio_pci: use shared interrupts for virtqueues
    virtio_pci: remove struct virtio_pci_vq_info
    vhost: try avoiding avail index access when getting descriptor
    virtio_mmio: expose header to userspace

    Linus Torvalds
     

28 Feb, 2017

1 commit


18 Feb, 2017

1 commit


07 Feb, 2017

1 commit

  • This patch implements the necessary logic to bring an Opal
    enabled drive out of a factory-enabled into a working
    Opal state.

    This patch set also enables logic to save a password to
    be replayed during a resume from suspend.

    Signed-off-by: Scott Bauer
    Signed-off-by: Rafael Antognolli
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Scott Bauer
     

01 Feb, 2017

1 commit


28 Jan, 2017

1 commit

  • This fixes a couple of problems:

    1. In the !CONFIG_DEBUG_FS case, the stub definitions were bogus.
    2. In the !CONFIG_BLOCK case, blk-mq-debugfs.c shouldn't be compiled at
    all.

    Fix the stub definitions and add a CONFIG_BLK_DEBUG_FS Kconfig option.

    Fixes: 07e4fead45e6 ("blk-mq: create debugfs directory tree")
    Signed-off-by: Omar Sandoval

    Augment Kconfig description.

    Signed-off-by: Jens Axboe

    Omar Sandoval
     

11 Nov, 2016

1 commit

  • Enable throttling of buffered writeback to make it a lot
    smoother, with far less impact on other system activity.
    Background writeback should be, by definition, background
    activity. The fact that we flush huge bundles of it at a time
    means that it potentially has heavy impacts on foreground workloads,
    which isn't ideal. We can't easily limit the sizes of writes that
    we do, since that would impact file system layout in the presence
    of delayed allocation. So just throttle back buffered writeback,
    unless someone is waiting for it.

    The algorithm for when to throttle takes its inspiration in the
    CoDel networking scheduling algorithm. Like CoDel, blk-wb monitors
    the minimum latencies of requests over a window of time. In that
    window of time, if the minimum latency of any request exceeds a
    given target, then a scale count is incremented and the queue depth
    is shrunk. The next monitoring window is shrunk accordingly. Unlike
    CoDel, if we hit a window that exhibits good behavior, then we
    simply increment the scale count and re-calculate the limits for that
    scale value. This prevents us from oscillating between a
    close-to-ideal value and max all the time, instead remaining in the
    windows where we get good behavior.

    Unlike CoDel, blk-wb allows the scale count to go negative. This
    happens if we primarily have writes going on. Unlike positive
    scale counts, this doesn't change the size of the monitoring window.
    When the heavy writers finish, blk-wb quickly snaps back to its
    stable state of a zero scale count.
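
    How a scale count might map to a queue depth and a monitoring window
    can be sketched in Python as below. The halving/doubling of the depth
    and the inverse square root window (borrowed from CoDel) are
    assumptions made for illustration; the text above only states that
    positive counts shrink both and that negative counts leave the window
    unchanged:

```python
import math


def queue_depth(base_depth: int, scale_step: int) -> int:
    """Allowed queue depth for a given scale count (illustrative)."""
    if scale_step > 0:
        # Positive count: shrink, but always allow at least one request.
        return max(1, base_depth >> scale_step)
    if scale_step < 0:
        # Negative count: mostly writes going on, allow deeper queues.
        return base_depth << -scale_step
    return base_depth


def window_ns(base_window_ns: int, scale_step: int) -> int:
    """Monitoring window length for a given scale count (illustrative)."""
    if scale_step > 0:
        return int(base_window_ns / math.sqrt(scale_step + 1))
    # Zero and negative counts keep the base window.
    return base_window_ns
```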

    The patch registers a sysfs entry, 'wb_lat_usec'. This sets the latency
    target to be met. It defaults to 2 msec for non-rotational storage, and
    75 msec for rotational storage. Setting this value to '0' disables
    blk-wb. Generally, a user would not have to touch this setting.

    We don't enable WBT on devices that are managed with CFQ, and have
    a non-root block cgroup attached. If we have a proportional share setup
    on this particular disk, then the wbt throttling will interfere with
    that. We don't have a strong need for wbt for that case, since we will
    rely on CFQ doing that for us.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Nov, 2016

1 commit

  • blk_mq_quiesce_queue() waits until ongoing .queue_rq() invocations
    have finished. This function does *not* wait until all outstanding
    requests have finished (this means invocation of request.end_io()).
    The algorithm used by blk_mq_quiesce_queue() is as follows:
    * Hold either an RCU read lock or an SRCU read lock around
    .queue_rq() calls. The former is used if .queue_rq() does not
    block and the latter if .queue_rq() may block.
    * blk_mq_quiesce_queue() first calls blk_mq_stop_hw_queues()
    followed by synchronize_srcu() or synchronize_rcu(). The latter
    call waits for .queue_rq() invocations that started before
    blk_mq_quiesce_queue() was called.
    * The blk_mq_hctx_stopped() calls that control whether or not
    .queue_rq() will be called are called with the (S)RCU read lock
    held. This is necessary to avoid race conditions against
    blk_mq_quiesce_queue().

    Signed-off-by: Bart Van Assche
    Cc: Hannes Reinecke
    Cc: Johannes Thumshirn
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

19 Oct, 2016

1 commit

  • Implement zoned block device zone information reporting and reset.
    Zone information is reported as struct blk_zone. This implementation
    does not differentiate between host-aware and host-managed device
    models and is valid for both. Two functions are provided:
    blkdev_report_zones for discovering the zone configuration of a
    zoned block device, and blkdev_reset_zones for resetting the write
    pointer of sequential zones. The helper functions blk_queue_zone_size
    and bdev_zone_size are also provided for, as the names suggest,
    obtaining the zone size (in 512B sectors) of the zones of the device.

    Signed-off-by: Hannes Reinecke

    [Damien: * Removed the zone cache
    * Implement report zones operation based on earlier proposal
    by Shaun Tancheff ]
    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Shaun Tancheff
    Tested-by: Shaun Tancheff
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

10 Oct, 2016

1 commit

  • Pull blk-mq irq/cpu mapping updates from Jens Axboe:
    "This is the block-irq topic branch for 4.9-rc. It's mostly from
    Christoph, and it allows drivers to specify their own mappings, and
    more importantly, to share the blk-mq mappings with the IRQ affinity
    mappings. It's a good step towards making this work better out of the
    box"

    * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
    blk_mq: linux/blk-mq.h does not include all the headers it depends on
    blk-mq: kill unused blk_mq_create_mq_map()
    blk-mq: get rid of the cpumask in struct blk_mq_tags
    nvme: remove the post_scan callout
    nvme: switch to use pci_alloc_irq_vectors
    blk-mq: provide a default queue mapping for PCI device
    blk-mq: allow the driver to pass in a queue mapping
    blk-mq: remove ->map_queue
    blk-mq: only allocate a single mq_map per tag_set
    blk-mq: don't redistribute hardware queues on a CPU hotplug event

    Linus Torvalds
     

19 Sep, 2016

1 commit


17 Sep, 2016

1 commit

  • This is a generally useful data structure, so make it available to
    anyone else who might want to use it. It's also a nice cleanup
    separating the allocation logic from the rest of the tag handling logic.

    The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
    selected by CONFIG_BLOCK for now.

    This should be a complete noop functionality-wise.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

04 Aug, 2016

1 commit

  • The functionality for block device DAX was already removed with commit
    acc93d30d7d4 ("Revert "block: enable dax for raw block devices"")

    However, we still had a config option hanging around that was always
    disabled because it depended on CONFIG_BROKEN. This config option was
    introduced in commit 03cdadb04077 ("block: disable block device DAX by
    default")

    This change reverts that commit, removing the dead config option.

    Link: http://lkml.kernel.org/r/20160729182314.6368-1-ross.zwisler@linux.intel.com
    Signed-off-by: Ross Zwisler
    Cc: Dave Hansen
    Acked-by: Dan Williams
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ross Zwisler
     

28 Feb, 2016

1 commit

  • The recent *sync enabling discovered that we are inserting into the
    block_device pagecache counter to the expectations of the dirty data
    tracking for dax mappings. This can lead to data corruption.

    We want to support DAX for block devices eventually, but it requires
    wider changes to properly manage the pagecache.

    dump_stack+0x85/0xc2
    dax_writeback_mapping_range+0x60/0xe0
    blkdev_writepages+0x3f/0x50
    do_writepages+0x21/0x30
    __filemap_fdatawrite_range+0xc6/0x100
    filemap_write_and_wait+0x4a/0xa0
    set_blocksize+0x70/0xd0
    sb_set_blocksize+0x1d/0x50
    ext4_fill_super+0x75b/0x3360
    mount_bdev+0x180/0x1b0
    ext4_mount+0x15/0x20
    mount_fs+0x38/0x170

    Mark the support broken so it's disabled by default, but otherwise still
    available for testing.

    Signed-off-by: Dan Williams
    Signed-off-by: Ross Zwisler
    Reported-by: Ross Zwisler
    Suggested-by: Dave Chinner
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Matthew Wilcox
    Cc: Al Viro
    Cc: Theodore Ts'o
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

27 Sep, 2014

1 commit

  • The T10 Protection Information format is also used by some devices that
    do not go through the SCSI layer (virtual block devices, NVMe). Relocate
    the relevant functions to a block layer library that can be used without
    involving SCSI.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

01 Oct, 2013

1 commit

  • Recently commit bab55417b10c ("block: support embedded device command
    line partition") introduced CONFIG_CMDLINE_PARSER. However, that name
    is too generic and sounds like it enables/disables generic kernel boot
    arg processing, when it really is block specific.

    Before this option becomes a part of a full/final release, add the BLK_
    prefix to it so that it is clear in absence of any other context that it
    is block specific.

    In addition, fix up the following less critical items:
    - help text was not really at all helpful.
    - index file for Documentation was not updated
    - add the new arg to Documentation/kernel-parameters.txt
    - clarify wording in source comments

    Signed-off-by: Paul Gortmaker
    Cc: Jens Axboe
    Cc: Cai Zhiyong
    Cc: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     

12 Sep, 2013

1 commit

  • Read the block device partition table from the command line. This is
    intended for fixed block devices (eMMC) in embedded systems. No MBR is
    needed, which saves storage space. The bootloader can be easily
    accessed by the absolute address of data on the block device. Users
    can easily change the partitions.
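
    As an illustration of what parsing such a command line involves, here
    is a small Python sketch. It assumes a grammar of the form
    <blkdev-id>:<size>[@<offset>](part-name)[,...][;...] with optional
    k/m/g size suffixes and "-" meaning "the rest of the device"; the
    function names and the returned structure are made up for this
    example and do not mirror the kernel code:

```python
import re

_UNITS = {"k": 1 << 10, "m": 1 << 20, "g": 1 << 30}
_PARTDEF = re.compile(
    r"^(-|\d+[kKmMgG]?)"        # size, or "-" for the remaining space
    r"(?:@(\d+[kKmMgG]?))?"     # optional offset
    r"(?:\(([^)]*)\))?$"        # optional partition name
)


def _to_bytes(text: str) -> int:
    if text[-1].isdigit():
        return int(text)
    return int(text[:-1]) * _UNITS[text[-1].lower()]


def parse_blkdevparts(spec: str) -> dict:
    """Return {device: [(size, offset, name), ...]}; None means unset."""
    result = {}
    for blkdev_def in spec.split(";"):
        dev, _, partdefs = blkdev_def.partition(":")
        parts = []
        for partdef in partdefs.split(","):
            m = _PARTDEF.match(partdef)
            if m is None:
                raise ValueError(f"bad partition definition: {partdef!r}")
            size, offset, name = m.groups()
            parts.append((
                None if size == "-" else _to_bytes(size),
                None if offset is None else _to_bytes(offset),
                name,
            ))
        result[dev] = parts
    return result


layout = parse_blkdevparts(
    "mmcblk0:1G(data0),1G(data1),-;mmcblk0boot0:1m(boot),-(kernel)"
)
```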

    This code is modeled on the MTD partition parser, see
    "drivers/mtd/cmdlinepart.c". For details on the partition syntax, see
    "Documentation/block/cmdline-partition.txt"

    [akpm@linux-foundation.org: fix printk text]
    [yongjun_wei@trendmicro.com.cn: fix error return code in parse_parts()]
    Signed-off-by: Cai Zhiyong
    Cc: Karel Zak
    Cc: "Wanglin (Albert)"
    Cc: Marius Groeger
    Cc: David Woodhouse
    Cc: Jens Axboe
    Cc: Brian Norris
    Cc: Artem Bityutskiy
    Signed-off-by: Wei Yongjun
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cai Zhiyong