31 Jan, 2019

1 commit

  • commit d445bd9cec1a850c2100fcf53684c13b3fd934f2 upstream.

    Commit 00a0ea33b495 ("dm thin: do not queue freed thin mapping for next
    stage processing") changed process_prepared_discard_passdown_pt1() to
    increment all the blocks being discarded until after the passdown had
    completed to avoid them being prematurely reused.

    IO issued to a thin device that breaks sharing with a snapshot, followed
    by a discard issued to snapshot(s) that previously shared the block(s),
    results in passdown_double_checking_shared_status() being called to
    iterate through the blocks double checking their reference count is zero
    and issuing the passdown if so. So a side effect of commit 00a0ea33b495
    is that passdown_double_checking_shared_status() was broken.

    Fix this by checking if the block reference count is greater than 1.
    Also, rename dm_pool_block_is_used() to dm_pool_block_is_shared().

    Fixes: 00a0ea33b495 ("dm thin: do not queue freed thin mapping for next stage processing")
    Cc: stable@vger.kernel.org # 4.9+
    Reported-by: ryan.p.norwood@gmail.com
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Joe Thornber
     

21 Dec, 2018

1 commit

  • commit f6c367585d0d851349d3a9e607c43e5bea993fa1 upstream.

    Sending a DM event before a thin-pool state change is about to happen is
    a bug. It wasn't realized until it became clear that userspace response
    to the event raced with the actual state change that the event was
    meant to notify about.

    Fix this by first updating internal thin-pool state to reflect what the
    DM event is being issued about. This fixes a long-standing racy/buggy
    userspace device-mapper-test-suite 'resize_io' test that would get an
    event but not find the state it was looking for -- so it would just go
    on to hang because no other events caused the test to reevaluate the
    thin-pool's state.

    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

10 Oct, 2018

1 commit

  • [ Upstream commit 3ab91828166895600efd9cdc3a0eb32001f7204a ]

    Committing a transaction can consume some metadata of its own, so we now
    reserve a small amount of metadata to cover this. Free metadata
    reported by the kernel will not include this reserve.

    If any of the reserve has been used after a commit we enter a new
    internal state PM_OUT_OF_METADATA_SPACE. This is reported as
    PM_READ_ONLY, so no userland changes are needed. If the metadata
    device is resized the pool will move back to PM_WRITE.

    These changes mean we never need to abort and rollback a transaction due
    to running out of metadata space. This is particularly important
    because there have been a handful of reports of data corruption against
    DM thin-provisioning that can all be attributed to the thin-pool having
    run out of metadata space.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Joe Thornber
     

10 Sep, 2018

1 commit

  • commit 75294442d896f2767be34f75aca7cc2b0d01301f upstream.

    Now both check_for_space() and do_no_space_timeout() will read & write
    pool->pf.error_if_no_space. If these functions run concurrently, as
    shown in the following case, the default setting of "queue_if_no_space"
    can get lost.

    precondition:
    * error_if_no_space = false (aka "queue_if_no_space")
    * pool is in Out-of-Data-Space (OODS) mode
    * no_space_timeout worker has been queued

    CPU 0:                                    CPU 1:
    // delete a thin device
    process_delete_mesg()
    // check_for_space() invoked by commit()
    set_pool_mode(pool, PM_WRITE)
        pool->pf.error_if_no_space = \
            pt->requested_pf.error_if_no_space

                                              // timeout, pool is still in OODS mode
                                              do_no_space_timeout
                                                  // "queue_if_no_space" config is lost
                                                  pool->pf.error_if_no_space = true
    pool->pf.mode = new_mode

    Fix it by stopping no_space_timeout worker when switching to write mode.

    Fixes: bcc696fac11f ("dm thin: stay in out-of-data-space mode once no_space_timeout expires")
    Cc: stable@vger.kernel.org
    Signed-off-by: Hou Tao
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Hou Tao
     

03 Jul, 2018

1 commit

  • commit a685557fbbc3122ed11e8ad3fa63a11ebc5de8c3 upstream.

    Discards issued to a DM thin device can complete to userspace (via
    fstrim) _before_ the metadata changes associated with the discards are
    reflected in the thinp superblock (e.g. free blocks). As such, if a
    user constructs a test that loops repeatedly over these steps, block
    allocation can fail due to discards not having completed yet:
    1) fill thin device via filesystem file
    2) remove file
    3) fstrim

    From initial report, here:
    https://www.redhat.com/archives/dm-devel/2018-April/msg00022.html

    "The root cause of this issue is that dm-thin will first remove
    mapping and increase corresponding blocks' reference count to prevent
    them from being reused before DISCARD bios get processed by the
    underlying layers. However, increasing blocks' reference count could
    also increase the nr_allocated_this_transaction in struct sm_disk
    which makes smd->old_ll.nr_allocated +
    smd->nr_allocated_this_transaction bigger than smd->old_ll.nr_blocks.
    In this case, alloc_data_block() will never commit metadata to reset
    the begin pointer of struct sm_disk, because sm_disk_get_nr_free()
    always return an underflow value."

    While there is room for improvement to the space-map accounting that
    thinp is making use of: the reality is this test is inherently racy and
    will result in the previous iteration's fstrim's discard(s) completing
    vs concurrent block allocation, via dd, in the next iteration of the
    loop.

    No amount of space map accounting improvements will be able to allow
    users to use a block before a discard of that block has completed.

    So the best we can really do is allow DM thinp to gracefully handle such
    aggressive use of all the pool's data by degrading the pool into
    out-of-data-space (OODS) mode. We _should_ get that behaviour already
    (if space map accounting didn't falsely cause alloc_data_block() to
    believe free space was available)... but short of that we handle the
    current reality that dm_pool_alloc_data_block() can return -ENOSPC.

    Reported-by: Dennis Yang
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

20 Dec, 2017

1 commit

  • commit 7e6358d244e4706fe612a77b9c36519a33600ac0 upstream.

    A NULL pointer is seen if two concurrent "vgchange -ay -K "
    processes race to load the dm-thin-pool module:

    PID: 25992  TASK: ffff883cd7d23500  CPU: 4  COMMAND: "vgchange"
     #0 [ffff883cd743d600] machine_kexec at ffffffff81038fa9
     #1 [ffff883cd743d660] crash_kexec at ffffffff810c5992
     #2 [ffff883cd743d730] oops_end at ffffffff81515c90
     #3 [ffff883cd743d760] no_context at ffffffff81049f1b
     #4 [ffff883cd743d7b0] __bad_area_nosemaphore at ffffffff8104a1a5
     #5 [ffff883cd743d800] bad_area at ffffffff8104a2ce
     #6 [ffff883cd743d830] __do_page_fault at ffffffff8104aa6f
     #7 [ffff883cd743d950] do_page_fault at ffffffff81517bae
     #8 [ffff883cd743d980] page_fault at ffffffff81514f95
        [exception RIP: kmem_cache_alloc+108]
        RIP: ffffffff8116ef3c  RSP: ffff883cd743da38  RFLAGS: 00010046
        RAX: 0000000000000004  RBX: ffffffff81121b90  RCX: ffff881bf1e78cc0
        RDX: 0000000000000000  RSI: 00000000000000d0  RDI: 0000000000000000
        RBP: ffff883cd743da68  R8: ffff881bf1a4eb00  R9: 0000000080042000
        R10: 0000000000002000  R11: 0000000000000000  R12: 00000000000000d0
        R13: 0000000000000000  R14: 00000000000000d0  R15: 0000000000000246
        ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
     #9 [ffff883cd743da70] mempool_alloc_slab at ffffffff81121ba5
    #10 [ffff883cd743da80] mempool_create_node at ffffffff81122083
    #11 [ffff883cd743dad0] mempool_create at ffffffff811220f4
    #12 [ffff883cd743dae0] pool_ctr at ffffffffa08de049 [dm_thin_pool]
    #13 [ffff883cd743dbd0] dm_table_add_target at ffffffffa0005f2f [dm_mod]
    #14 [ffff883cd743dc30] table_load at ffffffffa0008ba9 [dm_mod]
    #15 [ffff883cd743dc90] ctl_ioctl at ffffffffa0009dc4 [dm_mod]

    The race results in a NULL pointer because:

    Process A (vgchange -ay -K):
    a. send DM_LIST_VERSIONS_CMD ioctl;
    b. pool_target not registered;
    c. modprobe dm_thin_pool and wait until end.

    Process B (vgchange -ay -K):
    a. send DM_LIST_VERSIONS_CMD ioctl;
    b. pool_target registered;
    c. table_load->dm_table_add_target->pool_ctr;
    d. _new_mapping_cache is NULL and panic.
    Note:
    1. process A and process B are two concurrent processes.
    2. pool_target can be detected by process B but
    _new_mapping_cache initialization has not ended.

    To fix dm-thin-pool, and other targets (cache, multipath, and snapshot)
    with the same problem, simply call dm_register_target() after all
    resources created during module init (as labelled with __init) have been
    set up.

    Signed-off-by: monty
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    monty_pavel@sina.com
     

15 Sep, 2017

1 commit

  • …/device-mapper/linux-dm

    Pull device mapper updates from Mike Snitzer:

    - Some request-based DM core and DM multipath fixes and cleanups

    - Constify a few variables in DM core and DM integrity

    - Add bufio optimization and checksum failure accounting to DM
    integrity

    - Fix DM integrity to avoid checking integrity of failed reads

    - Fix DM integrity to use init_completion

    - A couple DM log-writes target fixes

    - Simplify DAX flushing by eliminating the unnecessary flush
    abstraction that was stood up for DM's use.

    * tag 'for-4.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dax: remove the pmem_dax_ops->flush abstraction
    dm integrity: use init_completion instead of COMPLETION_INITIALIZER_ONSTACK
    dm integrity: make blk_integrity_profile structure const
    dm integrity: do not check integrity for failed read operations
    dm log writes: fix >512b sectorsize support
    dm log writes: don't use all the cpu while waiting to log blocks
    dm ioctl: constify ioctl lookup table
    dm: constify argument arrays
    dm integrity: count and display checksum failures
    dm integrity: optimize writing dm-bufio buffers that are partially changed
    dm rq: do not update rq partially in each ending bio
    dm rq: make dm-sq requeuing behavior consistent with dm-mq behavior
    dm mpath: complain about unsupported __multipath_map_bio() return values
    dm mpath: avoid that building with W=1 causes gcc 7 to complain about fall-through

    Linus Torvalds
     

28 Aug, 2017

1 commit

  • The arrays of 'struct dm_arg' are never modified by the device-mapper
    core, so constify them so that they are placed in .rodata.

    (Exception: the args array in dm-raid cannot be constified because it is
    allocated on the stack and modified.)

    Signed-off-by: Eric Biggers
    Signed-off-by: Mike Snitzer

    Eric Biggers
     

24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different lifetime rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Jul, 2017

1 commit

  • Pull core block/IO updates from Jens Axboe:
    "This is the main pull request for the block layer for 4.13. Not a huge
    round in terms of features, but there's a lot of churn related to some
    core cleanups.

    Note this depends on the UUID tree pull request, that Christoph
    already sent out.

    This pull request contains:

    - A series from Christoph, unifying the error/stats codes in the
    block layer. We now use blk_status_t everywhere, instead of using
    different schemes for different places.

    - Also from Christoph, some cleanups around request allocation and IO
    scheduler interactions in blk-mq.

    - And yet another series from Christoph, cleaning up how we handle
    and do bounce buffering in the block layer.

    - A blk-mq debugfs series from Bart, further improving on the support
    we have for exporting internal information to aid debugging IO
    hangs or stalls.

    - Also from Bart, a series that cleans up the request initialization
    differences across types of devices.

    - A series from Goldwyn Rodrigues, allowing the block layer to return
    failure if we will block and the user asked for non-blocking.

    - Patch from Hannes for supporting setting loop devices block size to
    that of the underlying device.

    - Two series of patches from Javier, fixing various issues with
    lightnvm, particular around pblk.

    - A series from me, adding support for write hints. This comes with
    NVMe support as well, so applications can help guide data placement
    on flash to improve performance, latencies, and write
    amplification.

    - A series from Ming, improving and hardening blk-mq support for
    stopping/starting and quiescing hardware queues.

    - Two pull requests for NVMe updates. Nothing major on the feature
    side, but lots of cleanups and bug fixes. From the usual crew.

    - A series from Neil Brown, greatly improving the bio rescue set
    support. Most notably, this kills the bio rescue work queues, if we
    don't really need them.

    - Lots of other little bug fixes that are all over the place"

    * 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
    lightnvm: pblk: set line bitmap check under debug
    lightnvm: pblk: verify that cache read is still valid
    lightnvm: pblk: add initialization check
    lightnvm: pblk: remove target using async. I/Os
    lightnvm: pblk: use vmalloc for GC data buffer
    lightnvm: pblk: use right metadata buffer for recovery
    lightnvm: pblk: schedule if data is not ready
    lightnvm: pblk: remove unused return variable
    lightnvm: pblk: fix double-free on pblk init
    lightnvm: pblk: fix bad le64 assignations
    nvme: Makefile: remove dead build rule
    blk-mq: map all HWQ also in hyperthreaded system
    nvmet-rdma: register ib_client to not deadlock in device removal
    nvme_fc: fix error recovery on link down.
    nvmet_fc: fix crashes on bad opcodes
    nvme_fc: Fix crash when nvme controller connection fails.
    nvme_fc: replace ioabort msleep loop with completion
    nvme_fc: fix double calls to nvme_cleanup_cmd()
    nvme-fabrics: verify that a controller returns the correct NQN
    nvme: simplify nvme_dev_attrs_are_visible
    ...

    Linus Torvalds
     

28 Jun, 2017

1 commit

  • process_prepared_discard_passdown_pt1() should cleanup
    dm_thin_new_mapping in cases of error.

    dm_pool_inc_data_range() can fail trying to get a block reference:

    metadata operation 'dm_pool_inc_data_range' failed: error = -61

    When dm_pool_inc_data_range() fails, dm thin aborts the current metadata
    transaction and marks the pool as PM_READ_ONLY. Memory for the thin
    mapping is released as well. However, the current thin mapping will be
    queued onto the next stage as part of queue_passdown_pt2() or
    passdown_endio(). This dangling thin mapping memory, when processed and
    accessed in the next stage, will lead to device mapper crashing.

    Code flow without fix:
    -> process_prepared_discard_passdown_pt1(m)
    -> dm_thin_remove_range()
    -> discard passdown
    --> passdown_endio(m) queues m onto next stage
    -> dm_pool_inc_data_range() fails, frees memory m
    but does not remove it from next stage queue

    -> process_prepared_discard_passdown_pt2(m)
    -> processes freed memory m and crashes

    One such stack:

    Call Trace:
    [] dm_cell_release_no_holder+0x2f/0x70 [dm_bio_prison]
    [] cell_defer_no_holder+0x3c/0x80 [dm_thin_pool]
    [] process_prepared_discard_passdown_pt2+0x4b/0x90 [dm_thin_pool]
    [] process_prepared+0x81/0xa0 [dm_thin_pool]
    [] do_worker+0xc5/0x820 [dm_thin_pool]
    [] ? __schedule+0x244/0x680
    [] ? pwq_activate_delayed_work+0x42/0xb0
    [] process_one_work+0x153/0x3f0
    [] worker_thread+0x12b/0x4b0
    [] ? rescuer_thread+0x350/0x350
    [] kthread+0xca/0xe0
    [] ? kthread_park+0x60/0x60
    [] ret_from_fork+0x25/0x30

    The fix is to first take the block ref count for the discarded block and
    then do a passdown discard of this block. If taking the block ref count
    fails, then bail out, aborting the current metadata transaction, mark the
    pool as PM_READ_ONLY and also free the current thin mapping memory
    (existing error handling code) without queueing this thin mapping onto
    the next stage of processing. If taking the block ref count succeeds,
    then pass down the discard of this block. The discard callback of
    passdown_endio() will queue this thin mapping onto the next stage of
    processing.

    Code flow with fix:
    -> process_prepared_discard_passdown_pt1(m)
    -> dm_thin_remove_range()
    -> dm_pool_inc_data_range()
    --> if fails, free memory m and bail out
    -> discard passdown
    --> passdown_endio(m) queues m onto next stage

    Cc: stable # v4.9+
    Reviewed-by: Eduardo Valentin
    Reviewed-by: Cristian Gafton
    Reviewed-by: Anchal Agarwal
    Signed-off-by: Vallish Vaidyeshwara
    Reviewed-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Vallish Vaidyeshwara
     

09 Jun, 2017

2 commits

  • Replace bi_error with a new bi_status to allow for a clear conversion.
    Note that device mapper overloaded bi_error with a private value, which
    we'll have to keep around at least for now and thus propagate to a
    proper blk_status_t value.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Turn the error parameter into a pointer so that target drivers can change
    the value, and make sure only DM_ENDIO_* values are returned from the
    methods.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 May, 2017

1 commit

  • …/device-mapper/linux-dm

    Pull device mapper updates from Mike Snitzer:

    - A major update for DM cache that reduces the latency for deciding
    whether blocks should migrate to/from the cache. The bio-prison-v2
    interface supports this improvement by enabling direct dispatch of
    work to workqueues rather than having to delay the actual work
    dispatch to the DM cache core. So the dm-cache policies are much more
    nimble by being able to drive IO as they see fit. One immediate
    benefit from the improved latency is a cache that should be much more
    adaptive to changing workloads.

    - Add a new DM integrity target that emulates a block device that has
    additional per-sector tags that can be used for storing integrity
    information.

    - Add a new authenticated encryption feature to the DM crypt target
    that builds on the capabilities provided by the DM integrity target.

    - Add MD interface for switching the raid4/5/6 journal mode and update
    the DM raid target to use it to enable raid4/5/6 journal write-back
    support.

    - Switch the DM verity target over to using the asynchronous hash
    crypto API (this helps work better with architectures that have
    access to off-CPU algorithm providers, which should reduce CPU
    utilization).

    - Various request-based DM and DM multipath fixes and improvements from
    Bart and Christoph.

    - A DM thinp target fix for a bio structure leak that occurs for each
    discard IFF discard passdown is enabled.

    - A fix for a possible deadlock in DM bufio and a fix to re-check the
    new buffer allocation watermark in the face of competing admin
    changes to the 'max_cache_size_bytes' tunable.

    - A couple DM core cleanups.

    * tag 'for-4.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (50 commits)
    dm bufio: check new buffer allocation watermark every 30 seconds
    dm bufio: avoid a possible ABBA deadlock
    dm mpath: make it easier to detect unintended I/O request flushes
    dm mpath: cleanup QUEUE_IF_NO_PATH bit manipulation by introducing assign_bit()
    dm mpath: micro-optimize the hot path relative to MPATHF_QUEUE_IF_NO_PATH
    dm: introduce enum dm_queue_mode to cleanup related code
    dm mpath: verify __pg_init_all_paths locking assumptions at runtime
    dm: verify suspend_locking assumptions at runtime
    dm block manager: remove an unused argument from dm_block_manager_create()
    dm rq: check blk_mq_register_dev() return value in dm_mq_init_request_queue()
    dm mpath: delay requeuing while path initialization is in progress
    dm mpath: avoid that path removal can trigger an infinite loop
    dm mpath: split and rename activate_path() to prepare for its expanded use
    dm ioctl: prevent stack leak in dm ioctl call
    dm integrity: use previously calculated log2 of sectors_per_block
    dm integrity: use hex2bin instead of open-coded variant
    dm crypt: replace custom implementation of hex2bin()
    dm crypt: remove obsolete references to per-CPU state
    dm verity: switch to using asynchronous hash crypto API
    dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
    ...

    Linus Torvalds
     

25 Apr, 2017

1 commit

  • dm-thin does not free the discard_parent bio after all chained sub
    bios have finished. The following kmemleak report could be observed after
    a pool with the discard_passdown option processes discard bios on
    linux v4.11-rc7. To fix this, we drop the discard_parent bio reference
    when its endio (passdown_endio) is called.

    unreferenced object 0xffff8803d6b29700 (size 256):
    comm "kworker/u8:0", pid 30349, jiffies 4379504020 (age 143002.776s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    01 00 00 00 00 00 00 f0 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x49/0xa0
    [] kmem_cache_alloc+0xb4/0x100
    [] mempool_alloc_slab+0x10/0x20
    [] mempool_alloc+0x55/0x150
    [] bio_alloc_bioset+0xb9/0x260
    [] process_prepared_discard_passdown_pt1+0x40/0x1c0 [dm_thin_pool]
    [] break_up_discard_bio+0x1a9/0x200 [dm_thin_pool]
    [] process_discard_cell_passdown+0x24/0x40 [dm_thin_pool]
    [] process_discard_bio+0xdd/0xf0 [dm_thin_pool]
    [] do_worker+0xa76/0xd50 [dm_thin_pool]
    [] process_one_work+0x139/0x370
    [] worker_thread+0x61/0x450
    [] kthread+0xd6/0xf0
    [] ret_from_fork+0x3f/0x70
    [] 0xffffffffffffffff

    Cc: stable@vger.kernel.org
    Signed-off-by: Dennis Yang
    Signed-off-by: Mike Snitzer

    Dennis Yang
     

02 Feb, 2017

1 commit

  • We will want to have struct backing_dev_info allocated separately from
    struct request_queue. As the first step add pointer to backing_dev_info
    to request_queue and convert all users touching it. No functional
    changes in this patch.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member, to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jul, 2016

1 commit

  • The discard passdown was being issued after the block was unmapped,
    which meant the block could be reprovisioned whilst the passdown discard
    was still in flight.

    We can only identify unshared blocks (safe to pass a discard down to)
    once they're unmapped and their ref count hits zero. Block ref counts are
    now used to guard against concurrent allocation of these blocks that are
    being discarded. So now we unmap the block, issue passdown discards, and
    then immediately increment ref counts for the regions that have been
    discarded via passdown (this is safe because allocation occurs within
    the same thread). We then decrement ref counts
    once the passdown discard IO is complete -- signaling these blocks may
    now be allocated.

    This fixes the potential for corruption that was reported here:
    https://www.redhat.com/archives/dm-devel/2016-June/msg00311.html

    Reported-by: Dennis Yang
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     

13 May, 2016

3 commits

  • There is little benefit to doing this but it does structure DM thinp's
    code to more cleanly use the __blkdev_issue_discard() interface --
    particularly in passdown_double_checking_shared_status().

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer

    Joe Thornber
     
  • With commit 38f25255330 ("block: add __blkdev_issue_discard") DM thinp
    no longer needs to carry its own async discard method.

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber
    Reviewed-by: Christoph Hellwig

    Mike Snitzer
     
  • DM thinp's use of bio_inc_remaining() is critical to ensure the original
    parent discard bio isn't completed before its sub-discards have
    completed. DM thinp
    needs this due to the extra quiescing that occurs, via multiple DM thinp
    mappings, while processing large discards. As such DM thinp must build
    the async discard bio chain after some delay -- so bio_inc_remaining()
    is used to enable DM thinp to take a reference on the original parent
    discard bio for each mapping. This allows the immediate use of
    bio_endio() on that discard bio; but with the understanding that the
    actual completion won't occur until each of the sub-discards'
    per-mapping references are dropped.

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber

    Mike Snitzer
     

12 Mar, 2016

1 commit

  • Commit 0a927c2f02 ("dm thin: return -ENOSPC when erroring retry list due
    to out of data space") was a step in the right direction but didn't go
    far enough.

    Add a new 'out_of_data_space' flag to 'struct pool' and set it if/when
    the pool runs out of data space. This fixes cell_error() and
    error_retry_list() to not blindly return -EIO.

    We cannot rely on the 'error_if_no_space' feature flag since it is
    transient (in that it can be reset once space is added, plus it only
    controls whether errors are issued, it doesn't reflect whether the
    pool is actually out of space).

    Signed-off-by: Mike Snitzer

    Mike Snitzer
     

18 Dec, 2015

1 commit

  • When a thin pool is being destroyed, delayed work items are
    cancelled using cancel_delayed_work(), which doesn't guarantee that on
    return the delayed item isn't running. This can cause the work item to
    requeue itself on an already destroyed workqueue. Fix this by using
    cancel_delayed_work_sync() which guarantees that on return the work item
    is not running anymore.

    Fixes: 905e51b39a555 ("dm thin: commit outstanding data every second")
    Fixes: 85ad643b7e7e5 ("dm thin: add timeout to stop out-of-data-space mode holding IO forever")
    Signed-off-by: Nikolay Borisov
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org

    Nikolay Borisov
     

24 Nov, 2015

1 commit

  • When establishing a thin device's discard limits we cannot rely on the
    underlying thin-pool device's discard capabilities (which are inherited
    from the thin-pool's underlying data device) given that DM thin devices
    must provide discard support even when the thin-pool's underlying data
    device doesn't support discards.

    Users were exposed to this thin device discard limits regression if
    their thin-pool's underlying data device does _not_ support discards.
    This regression caused all upper-layers that called the
    blkdev_issue_discard() interface to not be able to issue discards to
    thin devices (because discard_granularity was 0). This regression
    wasn't caught earlier because the device-mapper-test-suite's extensive
    'thin-provisioning' discard tests are only ever performed against
    thin-pools with data devices that support discards.

    Fix is to have thin_io_hints() test the pool's 'discard_enabled' feature
    rather than inferring whether or not a thin device's discard support
    should be enabled by looking at the thin-pool's discard_granularity.

    Fixes: 216076705 ("dm thin: disable discard support for thin devices if pool's is disabled")
    Reported-by: Mike Gerber
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org # 4.1+

    Mike Snitzer
     

16 Nov, 2015

1 commit

  • A thin-pool that is in out-of-data-space (OODS) mode may transition back
    to write mode -- without the admin adding more space to the thin-pool --
    if/when blocks are released (either by deleting thin devices or
    discarding provisioned blocks).

    But as part of the thin-pool's earlier transition to out-of-data-space
    mode the thin-pool may have set the 'error_if_no_space' flag to true if
    the no_space_timeout expires without more space having been made
    available. That implementation detail, of changing the pool's
    error_if_no_space setting, needs to be reset back to the default that
    the user specified when the thin-pool's table was loaded.

    Otherwise we'll drop the user-requested behaviour on the floor when this
    out-of-data-space to write mode transition occurs.

    Reported-by: Vivek Goyal
    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber
    Fixes: 2c43fd26e4 ("dm thin: fix missing out-of-data-space to write mode transition if blocks are released")
    Cc: stable@vger.kernel.org

    Mike Snitzer
     

03 Sep, 2015

2 commits

  • Pull device mapper update from Mike Snitzer:

    - a couple small cleanups in dm-cache, dm-verity, persistent-data's
    dm-btree, and DM core.

    - a 4.1-stable fix for dm-cache that fixes the leaking of deferred bio
    prison cells

    - a 4.2-stable fix that adds feature reporting for the dm-stats
    features added in 4.2

    - improve DM-snapshot to not invalidate the on-disk snapshot if
    snapshot device write overflow occurs; but a write overflow triggered
    through the origin device will still invalidate the snapshot.

    - optimize DM-thinp's async discard submission a bit now that late bio
    splitting has been included in block core.

    - switch DM-cache's SMQ policy lock from using a mutex to a spinlock;
    improves performance on very low latency devices (eg. NVMe SSD).

    - document DM RAID 4/5/6's discard support

    [ I did not pull the slab changes, which weren't appropriate for this
    tree, and weren't obviously the right thing to do anyway. At the very
    least they need some discussion and explanation before getting merged.

    Because this is a partial pull rather than a pull of the actual
    tagged commit, this merge commit is also, obviously, missing the git
    signature from the original tag ]

    * tag 'dm-4.3-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm cache: fix use after freeing migrations
    dm cache: small cleanups related to deferred prison cell cleanup
    dm cache: fix leaking of deferred bio prison cells
    dm raid: document RAID 4/5/6 discard support
    dm stats: report precise_timestamps and histogram in @stats_list output
    dm thin: optimize async discard submission
    dm snapshot: don't invalidate on-disk image on snapshot write overflow
    dm: remove unlikely() before IS_ERR()
    dm: do not override error code returned from dm_get_device()
    dm: test return value for DM_MAPIO_SUBMITTED
    dm verity: remove unused mempool
    dm cache: move wake_waker() from free_migrations() to where it is needed
    dm btree remove: remove unused function get_nr_entries()
    dm btree: remove unused "dm_block_t root" parameter in btree_split_sibling()
    dm cache policy smq: change the mutex to a spinlock

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:
    "This first core part of the block IO changes contains:

    - Cleanup of the bio IO error signaling from Christoph. We used to
    rely on the uptodate bit and passing around of an error, now we
    store the error in the bio itself.

    - Improvement of the above from myself, by shrinking the bio size
    down again to fit in two cachelines on x86-64.

    - Revert of the max_hw_sectors cap removal from a revision again,
    from Jeff Moyer. This caused performance regressions in various
    tests. Reinstate the limit, bump it to a more reasonable size
    instead.

    - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
    Most devices have huge trim limits, which can cause nasty latencies
    when deleting files. Enable the admin to configure the size down.
    We will look into having a more sane default instead of UINT_MAX
    sectors.

    - Improvement of the SGP gaps logic from Keith Busch.

    - Enable the block core to handle arbitrarily sized bios, which
    enables a nice simplification of bio_add_page() (which is an IO hot
    path). From Kent.

    - Improvements to the partition io stats accounting, making it
    faster. From Ming Lei.

    - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
    file in blk-mq, as well as a fix for a blk-mq timeout race
    condition.

    - Ming Lin has been carrying Kent's above-mentioned patches forward
    for a while, and testing them. Ming also did a few fixes around
    that.

    - Sasha Levin found and fixed a use-after-free problem introduced by
    the bio->bi_error changes from Christoph.

    - Small blk cgroup cleanup from Viresh Kumar"
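    The bi_error change in the first bullet can be modeled in isolation:
    instead of threading an error code alongside the bio through every
    completion call, the error is stored in the bio itself and the endio
    callback reads it from there. The struct and functions below are a
    hypothetical standalone sketch that borrows the kernel's names, not
    the kernel's actual definitions.

```c
#include <assert.h>

/* Minimal model of the new-style completion path: bio_endio() takes no
 * error argument because the error already lives in bio->bi_error
 * (0 on success, negative errno on failure). */

struct bio {
	int bi_error;
	void (*bi_end_io)(struct bio *);
};

static void bio_endio(struct bio *bio)
{
	if (bio->bi_end_io)
		bio->bi_end_io(bio); /* error travels inside the bio */
}

static int seen_error;

static void my_end_io(struct bio *bio)
{
	seen_error = bio->bi_error; /* read the error from the bio */
}
```

    With the error carried in the bio, every intermediate layer that
    forwards a completion has one less parameter to get wrong.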

    * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
    blk: Fix bio_io_vec index when checking bvec gaps
    block: Replace SG_GAPS with new queue limits mask
    block: bump BLK_DEF_MAX_SECTORS to 2560
    Revert "block: remove artifical max_hw_sectors cap"
    blk-mq: fix race between timeout and freeing request
    blk-mq: fix buffer overflow when reading sysfs file of 'pending'
    Documentation: update notes in biovecs about arbitrarily sized bios
    block: remove bio_get_nr_vecs()
    fs: use helper bio_add_page() instead of open coding on bi_io_vec
    block: kill merge_bvec_fn() completely
    md/raid5: get rid of bio_fits_rdev()
    md/raid5: split bio for chunk_aligned_read
    block: remove split code in blkdev_issue_{discard,write_same}
    btrfs: remove bio splitting and merge_bvec_fn() calls
    bcache: remove driver private bio splitting code
    block: simplify bio_add_page()
    block: make generic_make_request handle arbitrarily sized bios
    blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
    block: don't access bio->bi_error after bio_put()
    block: shrink struct bio down to 2 cache lines again
    ...

    Linus Torvalds
     

18 Aug, 2015

1 commit

  • __blkdev_issue_discard_async() doesn't need to worry about further
    splitting because the upper layer, blkdev_issue_discard(), will have
    already split the bios such that bi_size isn't overflowed.
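    The upper-layer splitting referred to above exists because bi_size is
    a 32-bit byte count, so a discard covering more sectors than fit in
    32 bits of bytes has to be issued as multiple bios. The chunking loop
    below is a hypothetical standalone sketch of that idea; issue_one()
    stands in for submitting a single bio, and the 8-sector rounding is
    an assumed alignment choice, not the kernel's exact constant.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

/* Largest sector count whose byte size still fits in a 32-bit bi_size,
 * rounded down to an 8-sector boundary to keep chunks aligned. */
#define MAX_BIO_SECTORS ((UINT32_MAX >> SECTOR_SHIFT) & ~7u)

static unsigned issued_bios;

/* Stand-in for building and submitting one discard bio. */
static void issue_one(uint64_t sector, uint32_t nr_sects)
{
	(void)sector;
	/* Each chunk's byte count must fit in a 32-bit bi_size. */
	assert(((uint64_t)nr_sects << SECTOR_SHIFT) <= UINT32_MAX);
	issued_bios++;
}

/* Split an arbitrarily large discard into bi_size-safe chunks. */
static void issue_discard(uint64_t sector, uint64_t nr_sects)
{
	while (nr_sects) {
		uint32_t n = nr_sects > MAX_BIO_SECTORS ?
			     MAX_BIO_SECTORS : (uint32_t)nr_sects;
		issue_one(sector, n);
		sector += n;
		nr_sects -= n;
	}
}
```

    Once this splitting is done in one place at the top of the stack, a
    lower-level helper such as the async variant can trust every bio it
    receives to be within range.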

    Signed-off-by: Mike Snitzer
    Acked-by: Joe Thornber

    Mike Snitzer