05 Apr, 2019

1 commit

  • Some storage devices report support for discard commands such as
    WRITE_SAME_16 with unmap, but then reject the discard commands that are
    actually sent to them. This is clearly a storage firmware bug, but it
    doesn't change the fact that, should a program cause discards to be sent
    to a multipath device layered on such buggy storage, all paths can end
    up failed at the same time by the discards, causing possible I/O loss.

    The first discard to a path will fail with Illegal Request, Invalid
    field in cdb, e.g.:
    kernel: sd 8:0:8:19: [sdfn] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    kernel: sd 8:0:8:19: [sdfn] tag#0 Sense Key : Illegal Request [current]
    kernel: sd 8:0:8:19: [sdfn] tag#0 Add. Sense: Invalid field in cdb
    kernel: sd 8:0:8:19: [sdfn] tag#0 CDB: Write same(16) 93 08 00 00 00 00 00 a0 08 00 00 00 80 00 00 00
    kernel: blk_update_request: critical target error, dev sdfn, sector 10487808

    The SCSI layer converts this to the BLK_STS_TARGET error number, the sd
    device disables its support for discard on this path, and because of the
    BLK_STS_TARGET error multipath fails the discard without failing any
    path or retrying down a different path. But subsequent discards can
    cause path failures. Any discard sent to a path which has already failed
    a discard ends up failing with EIO from blk_cloned_rq_check_limits with
    an "over max size limit" error, since the discard limit was set to 0 by
    the sd driver for the path. As the error is EIO, this now fails the
    path and multipath tries to send the discard down the next path. This
    cycle continues as discards are sent until all paths fail.

    Fix this by training DM core to disable DISCARD if the underlying
    storage already did so.

    Also, fix branching in dm_done() and clone_endio() to reflect the
    mutually exclusive nature of the IO operations in question.
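
    A hedged sketch of the "disable DISCARD on BLK_STS_TARGET" idea described
    above (an approximation, not the verbatim patch); the helpers and fields
    follow the usual block-layer/DM request-based APIs:

    /* Drop discard support on the DM device once a path reports that it
     * doesn't really support discards. */
    static void disable_discard(struct mapped_device *md)
    {
            struct queue_limits *limits = dm_get_queue_limits(md);

            limits->max_discard_sectors = 0;
            blk_queue_flag_clear(QUEUE_FLAG_DISCARD, md->queue);
    }

    /* Called from the clone completion path (dm_done()) with the status
     * returned by the underlying path. */
    static void check_discard_supported(struct dm_rq_target_io *tio,
                                        struct request *clone,
                                        blk_status_t error)
    {
            if (unlikely(error == BLK_STS_TARGET) &&
                req_op(clone) == REQ_OP_DISCARD &&
                !clone->q->limits.max_discard_sectors)
                    disable_discard(tio->md);
    }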

    Cc: stable@vger.kernel.org
    Reported-by: David Jeffery
    Signed-off-by: Mike Snitzer

    Mike Snitzer
     

20 Dec, 2018

1 commit

  • DM currently has a statically allocated bio that it uses to issue empty
    flushes. It doesn't submit this bio; it just uses it for maintaining
    state while setting up clones. Multiple users can access this bio at the
    same time. This wasn't previously an issue, even if it was a bit iffy,
    but with the blkg associations it can become one.

    We set up the blkg association, then clone bios and submit them, then
    remove the blkg association again. But since we can have multiple tasks
    doing this at the same time, against multiple blkgs, we can either lose
    references to a blkg or put it twice. The latter causes complaints about
    the percpu ref being <= 0 when it is released.
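
    A minimal sketch of the on-stack-bio approach that avoids the shared
    state, assuming the template bio is set up in DM's __send_empty_flush()
    path; treat the body as an approximation of the idea, not the verbatim
    patch.

    static int __send_empty_flush(struct clone_info *ci)
    {
            unsigned int target_nr = 0;
            struct dm_target *ti;
            struct bio flush_bio;

            /* Per-call template bio: never submitted itself, only used
             * while setting up the per-target clones, so concurrent
             * flushes no longer share (or race on) one static bio. */
            bio_init(&flush_bio, NULL, 0);
            flush_bio.bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC;

            ci->bio = &flush_bio;
            ci->sector_count = 0;

            while ((ti = dm_table_get_target(ci->map, target_nr++)))
                    __send_duplicate_bios(ci, ti, ti->num_flush_bios, NULL);

            bio_uninit(ci->bio);
            return 0;
    }
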
    Tested-by: Ming Lei
    Reviewed-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
     

11 Nov, 2017

1 commit

  • The structure srcu_struct can be very big; its size is proportional to
    the value of CONFIG_NR_CPUS. The Fedora kernel has CONFIG_NR_CPUS set to
    8192, so the io_barrier field in struct mapped_device takes 84kB in the
    debugging kernel and 50kB in the non-debugging kernel. This large size
    may cause kzalloc_node to fail.

    In order to avoid the allocation failure, we use kvzalloc_node, which
    falls back to vmalloc if a large contiguous chunk of memory is not
    available. This patch also moves the field
    io_barrier to the last position of struct mapped_device - the reason is
    that on many processor architectures, short memory offsets result in
    smaller code than long memory offsets - on x86-64 it reduces code size by
    320 bytes.

    Note to stable kernel maintainers - the kernels 4.11 and older don't have
    the function kvzalloc_node, you can use the function vzalloc_node instead.
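
    A minimal sketch of the allocation change, using the alloc_dev()/
    free_dev() names from dm.c; the bodies are an approximation rather than
    the exact patch.

    static struct mapped_device *alloc_dev(int minor)
    {
            int numa_node_id = dm_get_numa_node();
            struct mapped_device *md;

            /* kvzalloc_node falls back to vmalloc when a large physically
             * contiguous allocation is unavailable. */
            md = kvzalloc_node(sizeof(*md), GFP_KERNEL, numa_node_id);
            if (!md)
                    return NULL;
            /* ... remaining initialization unchanged ... */
            return md;
    }

    static void free_dev(struct mapped_device *md)
    {
            /* ... teardown unchanged ... */
            kvfree(md);     /* pairs with kvzalloc_node() above */
    }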

    Signed-off-by: Mikulas Patocka
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     

25 Sep, 2017

1 commit

  • The size of struct dm_name_list is different on 32-bit and 64-bit
    kernels (so "(nl + 1)" differs between 32-bit and 64-bit kernels).

    This mismatch caused a harmless difference in padding when using a 32-bit
    or 64-bit kernel. Commit 23d70c5e52dd ("dm ioctl: report event number in
    DM_LIST_DEVICES") added reporting of the event number in the output of
    DM_LIST_DEVICES_CMD. This difference in padding makes it impossible for
    userspace to determine the location of the event number (the location
    would be different when running on 32-bit and 64-bit kernels).

    Fix the padding by using offsetof(struct dm_name_list, name) instead of
    sizeof(struct dm_name_list) to determine the location of entries.
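
    As a hedged, userspace-side illustration of the resulting layout (the
    helper names here are hypothetical, and the 8-byte record alignment and
    the 32-bit event counter placed after the NUL-terminated name are
    assumptions based on the description above):

    #include <linux/dm-ioctl.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static inline uintptr_t align8(uintptr_t n)
    {
            return (n + 7) & ~(uintptr_t)7;
    }

    /* Event number (ioctl version >= 37): stored after the name, aligned. */
    static inline uint32_t *dm_name_list_event(struct dm_name_list *nl)
    {
            uintptr_t off = offsetof(struct dm_name_list, name) +
                            strlen(nl->name) + 1;

            return (uint32_t *)((char *)nl + align8(off));
    }

    /* Next entry: nl->next is a byte offset from the start of this entry. */
    static inline struct dm_name_list *dm_name_list_next(struct dm_name_list *nl)
    {
            return nl->next ? (struct dm_name_list *)((char *)nl + nl->next)
                            : NULL;
    }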

    Also, the ioctl version number is incremented to 37 so that userspace
    can use the version number to determine that the event number is present
    and correctly located.

    In addition, a global event is now raised when a DM device is created,
    removed, or renamed, or when a table is swapped, so that the user can monitor
    for device changes.

    Reported-by: Eugene Syromiatnikov
    Fixes: 23d70c5e52dd ("dm ioctl: report event number in DM_LIST_DEVICES")
    Cc: stable@vger.kernel.org # 4.13
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     

19 Jun, 2017

1 commit

  • Add the ability to poll on the /dev/mapper/control device. The select
    or poll function waits until any event happens on any dm device since
    opening the /dev/mapper/control device. When select or poll returns the
    device as readable, we must close and reopen the device to wait for new
    dm events.

    Usage:
    1. open the /dev/mapper/control device
    2. scan the event numbers of all devices we are interested in and process
    them
    3. call select, poll or epoll on the handle (it waits until some new event
    happens since opening the device)
    4. close the /dev/mapper/control handle
    5. go to step 1

    The next commit allows the polling to be re-armed without closing and
    reopening the device.
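
    A hedged userspace sketch of that loop (error handling kept minimal;
    step 2's per-device event-number scan is only indicated by a comment):

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            for (;;) {
                    int fd = open("/dev/mapper/control", O_RDWR);
                    if (fd < 0) {
                            perror("open /dev/mapper/control");
                            return 1;
                    }

                    /* Step 2: scan and process the event numbers of the dm
                     * devices of interest (e.g. via the DM_LIST_DEVICES
                     * ioctl) before going to sleep. */

                    struct pollfd pfd = { .fd = fd, .events = POLLIN };

                    /* Wait until any dm event happens after the open(). */
                    if (poll(&pfd, 1, -1) < 0) {
                            perror("poll");
                            close(fd);
                            return 1;
                    }

                    /* Readable: close and reopen before waiting again. */
                    close(fd);
            }
    }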

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Andy Grover
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     

06 May, 2017

1 commit

  • Pull libnvdimm updates from Dan Williams:
    "The bulk of this has been in multiple -next releases. There were a few
    late breaking fixes and small features that got added in the last
    couple days, but the whole set has received a build success
    notification from the kbuild robot.

    Change summary:

    - Region media error reporting: A libnvdimm region device is the
    parent to one or more namespaces. To date, media errors have been
    reported via the "badblocks" attribute attached to pmem block
    devices for namespaces in "raw" or "memory" mode. Given that
    namespaces can be in "device-dax" or "btt-sector" mode this new
    interface reports media errors generically, i.e. independent of
    namespace modes or state.

    This subsequently allows userspace tooling to craft "ACPI 6.1
    Section 9.20.7.6 Function Index 4 - Clear Uncorrectable Error"
    requests and submit them via the ioctl path for NVDIMM root bus
    devices.

    - Introduce 'struct dax_device' and 'struct dax_operations': Prompted
    by a request from Linus and feedback from Christoph, this allows for
    dax capable drivers to publish their own custom dax operations.
    This fixes the broken assumption that all dax operations are
    related to a persistent memory device, and makes it easier for
    other architectures and platforms to add customized persistent
    memory support.

    - 'libnvdimm' core updates: A new "deep_flush" sysfs attribute is
    available for storage appliance applications to manually trigger
    memory controllers to drain write-pending buffers that would
    otherwise be flushed automatically by the platform ADR
    (asynchronous-DRAM-refresh) mechanism at a power loss event.
    Support for "locked" DIMMs is included to prevent namespaces from
    surfacing when the namespace label data area is locked. Finally,
    fixes for various reported deadlocks and crashes, also tagged for
    -stable.

    - ACPI / nfit driver updates: General updates of the nfit driver to
    add DSM command overrides, ACPI 6.1 health state flags support, DSM
    payload debug available by default, and various fixes.

    Acknowledgements that came after the branch was pushed:

    - commit 565851c972b5 "device-dax: fix sysfs attribute deadlock":
    Tested-by: Yi Zhang

    - commit 23f498448362 "libnvdimm: rework region badblocks clearing"
    Tested-by: Toshi Kani "

    * tag 'libnvdimm-for-4.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (52 commits)
    libnvdimm, pfn: fix 'npfns' vs section alignment
    libnvdimm: handle locked label storage areas
    libnvdimm: convert NDD_ flags to use bitops, introduce NDD_LOCKED
    brd: fix uninitialized use of brd->dax_dev
    block, dax: use correct format string in bdev_dax_supported
    device-dax: fix sysfs attribute deadlock
    libnvdimm: restore "libnvdimm: band aid btt vs clear poison locking"
    libnvdimm: fix nvdimm_bus_lock() vs device_lock() ordering
    libnvdimm: rework region badblocks clearing
    acpi, nfit: kill ACPI_NFIT_DEBUG
    libnvdimm: fix clear length of nvdimm_forget_poison()
    libnvdimm, pmem: fix a NULL pointer BUG in nd_pmem_notify
    libnvdimm, region: sysfs trigger for nvdimm_flush()
    libnvdimm: fix phys_addr for nvdimm_clear_poison
    x86, dax, pmem: remove indirection around memcpy_from_pmem()
    block: remove block_device_operations ->direct_access()
    block, dax: convert bdev_dax_supported() to dax_direct_access()
    filesystem-dax: convert to dax_direct_access()
    Revert "block: use DAX for partition table reads"
    ext2, ext4, xfs: retrieve dax_device for iomap operations
    ...

    Linus Torvalds
     

04 May, 2017

1 commit

  • …/device-mapper/linux-dm

    Pull device mapper updates from Mike Snitzer:

    - A major update for DM cache that reduces the latency for deciding
    whether blocks should migrate to/from the cache. The bio-prison-v2
    interface supports this improvement by enabling direct dispatch of
    work to workqueues rather than having to delay the actual work
    dispatch to the DM cache core. So the dm-cache policies are much more
    nimble by being able to drive IO as they see fit. One immediate
    benefit from the improved latency is a cache that should be much more
    adaptive to changing workloads.

    - Add a new DM integrity target that emulates a block device that has
    additional per-sector tags that can be used for storing integrity
    information.

    - Add a new authenticated encryption feature to the DM crypt target
    that builds on the capabilities provided by the DM integrity target.

    - Add an MD interface for switching the raid4/5/6 journal mode and update
    the DM raid target to use it to enable raid4/5/6 journal write-back
    support.

    - Switch the DM verity target over to using the asynchronous hash
    crypto API (this helps work better with architectures that have
    access to off-CPU algorithm providers, which should reduce CPU
    utilization).

    - Various request-based DM and DM multipath fixes and improvements from
    Bart and Christoph.

    - A DM thinp target fix for a bio structure leak that occurs for each
    discard IFF discard passdown is enabled.

    - A fix for a possible deadlock in DM bufio and a fix to re-check the
    new buffer allocation watermark in the face of competing admin
    changes to the 'max_cache_size_bytes' tunable.

    - A couple DM core cleanups.

    * tag 'for-4.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (50 commits)
    dm bufio: check new buffer allocation watermark every 30 seconds
    dm bufio: avoid a possible ABBA deadlock
    dm mpath: make it easier to detect unintended I/O request flushes
    dm mpath: cleanup QUEUE_IF_NO_PATH bit manipulation by introducing assign_bit()
    dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH
    dm: introduce enum dm_queue_mode to cleanup related code
    dm mpath: verify __pg_init_all_paths locking assumptions at runtime
    dm: verify suspend_locking assumptions at runtime
    dm block manager: remove an unused argument from dm_block_manager_create()
    dm rq: check blk_mq_register_dev() return value in dm_mq_init_request_queue()
    dm mpath: delay requeuing while path initialization is in progress
    dm mpath: avoid that path removal can trigger an infinite loop
    dm mpath: split and rename activate_path() to prepare for its expanded use
    dm ioctl: prevent stack leak in dm ioctl call
    dm integrity: use previously calculated log2 of sectors_per_block
    dm integrity: use hex2bin instead of open-coded variant
    dm crypt: replace custom implementation of hex2bin()
    dm crypt: remove obsolete references to per-CPU state
    dm verity: switch to using asynchronous hash crypto API
    dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
    ...

    Linus Torvalds
     

21 Apr, 2017

1 commit

  • Allocate a dax_device to represent the capacity of a device-mapper
    instance. Provide a ->direct_access() method via the new dax_operations
    indirection that mirrors the functionality of the current direct_access
    support via block_device_operations. Once fs/dax.c has been converted
    to use dax_operations the old dm_blk_direct_access() will be removed.

    A new helper dm_dax_get_live_target() is introduced to separate some of
    the dm-specifics from the direct_access implementation.

    This enabling is only for the top-level dm representation to upper
    layers. Converting target direct_access implementations is deferred to a
    separate patch.
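
    A hedged sketch of the hookup this describes; alloc_dax() and the
    dax_operations wiring follow the dax core API of that series, while the
    bridge to each target's existing direct_access implementation is elided
    since targets are converted in a later patch.

    static long dm_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
                                     long nr_pages, void **kaddr, pfn_t *pfn)
    {
            struct mapped_device *md = dax_get_private(dax_dev);
            sector_t sector = (sector_t)pgoff << (PAGE_SHIFT - 9);
            struct dm_target *ti;
            long ret = -EIO;
            int srcu_idx;

            ti = dm_dax_get_live_target(md, sector, &srcu_idx);
            if (!ti || !ti->type->direct_access)
                    goto out;
            /* Defer to the target's direct_access here; the per-target hook
             * keeps its pre-existing signature until the follow-up
             * conversion, so the call itself is elided in this sketch. */
    out:
            dm_put_live_table(md, srcu_idx);
            return ret;
    }

    static const struct dax_operations dm_dax_ops = {
            .direct_access = dm_dax_direct_access,
    };

    /* At device-creation time the dax_device is allocated alongside the
     * mapped_device, roughly: md->dax_dev = alloc_dax(md, name, &dm_dax_ops); */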

    Cc: Toshi Kani
    Reviewed-by: Mike Snitzer
    Signed-off-by: Dan Williams

    Dan Williams
     

28 Jan, 2017

1 commit

  • DM already calls blk_mq_alloc_request on the request_queue of the
    underlying device if it is a blk-mq device. But now that we allow drivers
    to allocate additional data and initialize it ahead of time we need to do
    the same for all drivers. Doing so and using the new cmd_size
    infrastructure in the block layer greatly simplifies the dm-rq and mpath
    code, and should also make arbitrary combinations of SQ and MQ devices
    with SQ or MQ device mapper tables easily possible as a further step.
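
    A brief hedged illustration of the cmd_size mechanism referred to above
    (generic example names, not DM's actual structures): a driver reserves
    per-request space in its tag set and later retrieves it with
    blk_mq_rq_to_pdu().

    struct example_per_rq_data {
            int retries;    /* whatever the driver needs per request */
    };

    static int example_init_tag_set(struct blk_mq_tag_set *set,
                                    const struct blk_mq_ops *ops)
    {
            memset(set, 0, sizeof(*set));
            set->ops = ops;
            set->nr_hw_queues = 1;
            set->queue_depth = 128;
            set->numa_node = NUMA_NO_NODE;
            /* Extra space allocated with every request, up front. */
            set->cmd_size = sizeof(struct example_per_rq_data);

            return blk_mq_alloc_tag_set(set);
    }

    static struct example_per_rq_data *example_rq_data(struct request *rq)
    {
            /* The per-request data lives immediately after struct request. */
            return blk_mq_rq_to_pdu(rq);
    }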

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

11 Jun, 2016

1 commit

  • Add some separation between bio-based and request-based DM core code.

    'struct mapped_device' and other DM-core-only structures and functions
    have been moved to dm-core.h, and all relevant DM core .c files have been
    updated to include dm-core.h rather than dm.h.

    DM targets should _never_ include dm-core.h!

    [block core merge conflict resolution from Stephen Rothwell]
    Signed-off-by: Mike Snitzer
    Signed-off-by: Stephen Rothwell

    Mike Snitzer