07 Feb, 2019

1 commit

  • commit 483cbbeddd5fe2c80fd4141ff0748fa06c4ff146 upstream.

    This fixes the case where md array assembly fails because raid cache recovery
    is unable to allocate a stripe, despite attempts to replay stripes and increase
    the cache size. This happens because stripes released by r5c_recovery_replay_stripes
    and raid5_set_cache_size don't become available for allocation immediately.
    Released stripes are first placed on the conf->released_stripes list and require
    the md thread to merge them onto conf->inactive_list before they can be allocated.

    Allow the final allocation attempt during cache recovery to wait for
    new stripes to become available for allocation.
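
    A minimal sketch of the idea (function names follow the upstream raid5 code,
    but the helper and exact signatures below are illustrative, not a verbatim
    copy of the patch):

    /*
     * Illustrative sketch: early allocation attempts pass noblock=1 so the
     * recovery code can fall back to replaying stripes and growing the cache;
     * the final attempt passes noblock=0 so it sleeps until the md thread has
     * moved released stripes back onto conf->inactive_list.
     */
    static struct stripe_head *recovery_alloc_stripe(struct r5conf *conf,
                                                     sector_t sect, int noblock)
    {
            /* raid5_get_active_stripe() may only return NULL when noblock != 0 */
            return raid5_get_active_stripe(conf, sect, 0, noblock, 0);
    }

    sh = recovery_alloc_stripe(conf, stripe_sect, 1);          /* fail fast */
    if (!sh)
            sh = recovery_alloc_stripe(conf, stripe_sect, 0);  /* last try: wait */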

    Cc: linux-raid@vger.kernel.org
    Cc: Shaohua Li
    Cc: linux-stable # 4.10+
    Fixes: b4c625c67362 ("md/r5cache: r5cache recovery: part 1")
    Signed-off-by: Alexei Naberezhnov
    Signed-off-by: Song Liu
    Signed-off-by: Greg Kroah-Hartman

    Alexei Naberezhnov
     

31 Jan, 2019

2 commits

  • commit 1856b9f7bcc8e9bdcccc360aabb56fbd4dd6c565 upstream.

    The dm-crypt cipher specification in a mapping table is defined as:
    cipher[:keycount]-chainmode-ivmode[:ivopts]
    or (new crypt API format):
    capi:cipher_api_spec-ivmode[:ivopts]

    For ESSIV, the parameter includes hash specification, for example:
    aes-cbc-essiv:sha256

    The implementation expected that the additional IV option would never
    include another dash '-' character.

    But, with SHA3, there are names like sha3-256; so the mapping table
    parser fails:

    dmsetup create test --table "0 8 crypt aes-cbc-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0"
    or (new crypt API format)
    dmsetup create test --table "0 8 crypt capi:cbc(aes)-essiv:sha3-256 9c1185a5c5e9fc54612808977ee8f5b9e 0 /dev/sdb 0"

    device-mapper: crypt: Ignoring unexpected additional cipher options
    device-mapper: table: 253:0: crypt: Error creating IV
    device-mapper: ioctl: error adding target to table

    Fix the dm-crypt constructor to ignore additional dash in IV options and
    also remove a bogus warning (that is ignored anyway).
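
    A hedged sketch of the parsing idea (the real constructor is more involved;
    the helper name below is illustrative and 'iv_field' stands for the already
    isolated "ivmode[:ivopts]" part of the cipher string):

    /*
     * Sketch only: split on the first ':' and keep everything after it as the
     * IV options verbatim, so "essiv:sha3-256" keeps the dash inside the hash
     * name instead of being rejected.
     */
    static void split_ivmode(char *iv_field, char **ivmode, char **ivopts)
    {
            *ivmode = iv_field;
            *ivopts = strchr(iv_field, ':');
            if (*ivopts) {
                    **ivopts = '\0';        /* terminate "essiv" */
                    (*ivopts)++;            /* "sha3-256", dashes and all */
            }
    }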

    Cc: stable@vger.kernel.org # 4.12+
    Signed-off-by: Milan Broz
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Milan Broz
     
  • commit d445bd9cec1a850c2100fcf53684c13b3fd934f2 upstream.

    Commit 00a0ea33b495 ("dm thin: do not queue freed thin mapping for next
    stage processing") changed process_prepared_discard_passdown_pt1() to
    increment all the blocks being discarded until after the passdown had
    completed to avoid them being prematurely reused.

    IO issued to a thin device that breaks sharing with a snapshot, followed
    by a discard issued to snapshot(s) that previously shared the block(s),
    results in passdown_double_checking_shared_status() being called to
    iterate through the blocks double checking their reference count is zero
    and issuing the passdown if so. So a side effect of commit 00a0ea33b495
    is that passdown_double_checking_shared_status() was broken.

    Fix this by checking if the block reference count is greater than 1.
    Also, rename dm_pool_block_is_used() to dm_pool_block_is_shared().
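
    A sketch of what the renamed helper ends up doing (simplified; this assumes
    the pool's data space map exposes a reference count via dm_sm_get_count(),
    and omits the locking and error handling of the real code):

    /*
     * Sketch: a block is "shared" only if more than one thin device still
     * references it, so passdown may proceed when the count is 1 (just the
     * mapping being discarded) or 0.
     */
    int dm_pool_block_is_shared(struct dm_pool_metadata *pmd,
                                dm_block_t b, bool *result)
    {
            uint32_t ref_count;
            int r = dm_sm_get_count(pmd->data_sm, b, &ref_count);

            if (!r)
                    *result = (ref_count > 1);

            return r;
    }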

    Fixes: 00a0ea33b495 ("dm thin: do not queue freed thin mapping for next stage processing")
    Cc: stable@vger.kernel.org # 4.9+
    Reported-by: ryan.p.norwood@gmail.com
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Joe Thornber
     

26 Jan, 2019

3 commits

  • [ Upstream commit 721b1d98fb517ae99ab3b757021cf81db41e67be ]

    kcopyd has no upper limit to the number of jobs one can allocate and
    issue. Under certain workloads this can lead to excessive memory usage
    and workqueue stalls. For example, when creating multiple dm-snapshot
    targets with a 4K chunk size and then writing to the origin through the
    page cache. Syncing the page cache causes a large number of BIOs to be
    issued to the dm-snapshot origin target, which itself issues an even
    larger (because of the BIO splitting taking place) number of kcopyd
    jobs.

    Running the following test, from the device mapper test suite [1],

    dmtest run --suite snapshot -n many_snapshots_of_same_volume_N

    , with 8 active snapshots, results in the kcopyd job slab cache growing
    to 10G. Depending on the available system RAM this can lead to the OOM
    killer killing user processes:

    [463.492878] kthreadd invoked oom-killer: gfp_mask=0x6040c0(GFP_KERNEL|__GFP_COMP),
    nodemask=(null), order=1, oom_score_adj=0
    [463.492894] kthreadd cpuset=/ mems_allowed=0
    [463.492948] CPU: 7 PID: 2 Comm: kthreadd Not tainted 4.19.0-rc7 #3
    [463.492950] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
    [463.492952] Call Trace:
    [463.492964] dump_stack+0x7d/0xbb
    [463.492973] dump_header+0x6b/0x2fc
    [463.492987] ? lockdep_hardirqs_on+0xee/0x190
    [463.493012] oom_kill_process+0x302/0x370
    [463.493021] out_of_memory+0x113/0x560
    [463.493030] __alloc_pages_slowpath+0xf40/0x1020
    [463.493055] __alloc_pages_nodemask+0x348/0x3c0
    [463.493067] cache_grow_begin+0x81/0x8b0
    [463.493072] ? cache_grow_begin+0x874/0x8b0
    [463.493078] fallback_alloc+0x1e4/0x280
    [463.493092] kmem_cache_alloc_node+0xd6/0x370
    [463.493098] ? copy_process.part.31+0x1c5/0x20d0
    [463.493105] copy_process.part.31+0x1c5/0x20d0
    [463.493115] ? __lock_acquire+0x3cc/0x1550
    [463.493121] ? __switch_to_asm+0x34/0x70
    [463.493129] ? kthread_create_worker_on_cpu+0x70/0x70
    [463.493135] ? finish_task_switch+0x90/0x280
    [463.493165] _do_fork+0xe0/0x6d0
    [463.493191] ? kthreadd+0x19f/0x220
    [463.493233] kernel_thread+0x25/0x30
    [463.493235] kthreadd+0x1bf/0x220
    [463.493242] ? kthread_create_on_cpu+0x90/0x90
    [463.493248] ret_from_fork+0x3a/0x50
    [463.493279] Mem-Info:
    [463.493285] active_anon:20631 inactive_anon:4831 isolated_anon:0
    [463.493285] active_file:80216 inactive_file:80107 isolated_file:435
    [463.493285] unevictable:0 dirty:51266 writeback:109372 unstable:0
    [463.493285] slab_reclaimable:31191 slab_unreclaimable:3483521
    [463.493285] mapped:526 shmem:4903 pagetables:1759 bounce:0
    [463.493285] free:33623 free_pcp:2392 free_cma:0
    ...
    [463.493489] Unreclaimable slab info:
    [463.493513] Name Used Total
    [463.493522] bio-6 1028KB 1028KB
    [463.493525] bio-5 1028KB 1028KB
    [463.493528] dm_snap_pending_exception 236783KB 243789KB
    [463.493531] dm_exception 41KB 42KB
    [463.493534] bio-4 1216KB 1216KB
    [463.493537] bio-3 439396KB 439396KB
    [463.493539] kcopyd_job 6973427KB 6973427KB
    ...
    [463.494340] Out of memory: Kill process 1298 (ruby2.3) score 1 or sacrifice child
    [463.494673] Killed process 1298 (ruby2.3) total-vm:435740kB, anon-rss:20180kB, file-rss:4kB, shmem-rss:0kB
    [463.506437] oom_reaper: reaped process 1298 (ruby2.3), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    Moreover, issuing a large number of kcopyd jobs results in kcopyd
    hogging the CPU, while processing them. As a result, processing of work
    items, queued for execution on the same CPU as the currently running
    kcopyd thread, is stalled for long periods of time, hurting performance.
    Running the aforementioned test we get, in dmesg, messages like the
    following:

    [67501.194592] BUG: workqueue lockup - pool cpus=4 node=0 flags=0x0 nice=0 stuck for 27s!
    [67501.195586] Showing busy workqueues and worker pools:
    [67501.195591] workqueue events: flags=0x0
    [67501.195597] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195611] pending: cache_reap
    [67501.195641] workqueue mm_percpu_wq: flags=0x8
    [67501.195645] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195656] pending: vmstat_update
    [67501.195682] workqueue kblockd: flags=0x18
    [67501.195687] pwq 5: cpus=2 node=0 flags=0x0 nice=-20 active=1/256
    [67501.195698] pending: blk_timeout_work
    [67501.195753] workqueue kcopyd: flags=0x8
    [67501.195757] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195768] pending: do_work [dm_mod]
    [67501.195802] workqueue kcopyd: flags=0x8
    [67501.195806] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195817] pending: do_work [dm_mod]
    [67501.195834] workqueue kcopyd: flags=0x8
    [67501.195838] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195848] pending: do_work [dm_mod]
    [67501.195881] workqueue kcopyd: flags=0x8
    [67501.195885] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=1/256
    [67501.195896] pending: do_work [dm_mod]
    [67501.195920] workqueue kcopyd: flags=0x8
    [67501.195924] pwq 8: cpus=4 node=0 flags=0x0 nice=0 active=2/256
    [67501.195935] in-flight: 67:do_work [dm_mod]
    [67501.195945] pending: do_work [dm_mod]
    [67501.195961] pool 8: cpus=4 node=0 flags=0x0 nice=0 hung=27s workers=3 idle: 129 23765

    The root cause of these issues is the way dm-snapshot uses kcopyd: in
    particular, the lack of an explicit or implicit limit on the maximum
    number of in-flight COW jobs. The merging path is not affected because
    it implicitly limits the in-flight kcopyd jobs to one.

    Fix these issues by using a semaphore to limit the maximum number of
    in-flight kcopyd jobs. We grab the semaphore before allocating a new
    kcopyd job in start_copy() and start_full_bio() and release it after the
    job finishes in copy_callback().

    The initial semaphore value is configurable through a module parameter,
    to allow fine tuning the maximum number of in-flight COW jobs. Setting
    this parameter to zero initializes the semaphore to INT_MAX.

    A default value of 2048 maximum in-flight kcopyd jobs was chosen. This
    value was decided experimentally as a trade-off between memory
    consumption, stalling the kernel's workqueues and maintaining a high
    enough throughput.
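
    A minimal sketch of the throttling scheme (the semaphore field, module
    parameter and snapshot-context layout below are illustrative, not
    necessarily the exact upstream names):

    static unsigned cow_limit = 2048;       /* module parameter; 0 means INT_MAX */

    /* at snapshot construction time */
    sema_init(&s->cow_count, cow_limit ? cow_limit : INT_MAX);

    static void start_copy(struct dm_snap_pending_exception *pe)
    {
            struct dm_snapshot *s = pe->snap;
            struct dm_io_region src, dest;

            /* ... fill src/dest from the chunk being copied ... */

            down(&s->cow_count);            /* sleeps if too many jobs in flight */
            dm_kcopyd_copy(s->kcopyd_client, &src, 1, &dest, 0,
                           copy_callback, pe);
    }

    static void copy_callback(int read_err, unsigned long write_err, void *context)
    {
            struct dm_snap_pending_exception *pe = context;

            up(&pe->snap->cow_count);       /* job finished, release a slot */
            /* ... complete the pending exception as before ... */
    }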

    Re-running the aforementioned test:

    * Workqueue stalls are eliminated
    * kcopyd's job slab cache uses a maximum of 130MB
    * The time taken by the test to write to the snapshot-origin target is
    reduced from 05m20.48s to 03m26.38s

    [1] https://github.com/jthornber/device-mapper-test-suite

    Signed-off-by: Nikos Tsironis
    Signed-off-by: Ilias Tsitsimpis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Nikos Tsironis
     
  • [ Upstream commit d7e6b8dfc7bcb3f4f3a18313581f67486a725b52 ]

    When using kcopyd to run callbacks through dm_kcopyd_do_callback() or
    submitting copy jobs with a source size of 0, the jobs are pushed
    directly to the complete_jobs list, which could be under processing by
    the kcopyd thread. As a result, the kcopyd thread can continue running
    completed jobs indefinitely, without releasing the CPU, as long as
    someone keeps submitting new completed jobs through the aforementioned
    paths. Processing of work items, queued for execution on the same CPU as
    the currently running kcopyd thread, is thus stalled for excessive
    amounts of time, hurting performance.

    Running the following test, from the device mapper test suite [1],

    dmtest run --suite snapshot -n parallel_io_to_many_snaps_N

    , with 8 active snapshots, we get, in dmesg, messages like the
    following:

    [68899.948523] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 95s!
    [68899.949282] Showing busy workqueues and worker pools:
    [68899.949288] workqueue events: flags=0x0
    [68899.949295] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
    [68899.949306] pending: vmstat_shepherd, cache_reap
    [68899.949331] workqueue mm_percpu_wq: flags=0x8
    [68899.949337] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949345] pending: vmstat_update
    [68899.949387] workqueue dm_bufio_cache: flags=0x8
    [68899.949392] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
    [68899.949400] pending: work_fn [dm_bufio]
    [68899.949423] workqueue kcopyd: flags=0x8
    [68899.949429] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949437] pending: do_work [dm_mod]
    [68899.949452] workqueue kcopyd: flags=0x8
    [68899.949458] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
    [68899.949466] in-flight: 13:do_work [dm_mod]
    [68899.949474] pending: do_work [dm_mod]
    [68899.949487] workqueue kcopyd: flags=0x8
    [68899.949493] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949501] pending: do_work [dm_mod]
    [68899.949515] workqueue kcopyd: flags=0x8
    [68899.949521] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949529] pending: do_work [dm_mod]
    [68899.949541] workqueue kcopyd: flags=0x8
    [68899.949547] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
    [68899.949555] pending: do_work [dm_mod]
    [68899.949568] pool 0: cpus=0 node=0 flags=0x0 nice=0 hung=95s workers=4 idle: 27130 27223 1084

    Fix this by splitting the complete_jobs list into two parts: A user
    facing part, named callback_jobs, and one used internally by kcopyd,
    retaining the name complete_jobs. dm_kcopyd_do_callback() and
    dispatch_job() now push their jobs to the callback_jobs list, which is
    spliced to the complete_jobs list once, every time the kcopyd thread
    wakes up. This prevents kcopyd from hogging the CPU indefinitely and
    causing workqueue stalls.
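
    A sketch of the split (simplified; the splice is guarded by the client's
    job lock and happens once per wakeup of the kcopyd work function):

    /* dm_kcopyd_do_callback() and dispatch_job() now queue here ... */
    push(&kc->callback_jobs, job);

    /* ... and the worker drains it exactly once per wakeup */
    static void do_work(struct work_struct *work)
    {
            struct dm_kcopyd_client *kc = container_of(work,
                            struct dm_kcopyd_client, kcopyd_work);
            unsigned long flags;

            spin_lock_irqsave(&kc->job_lock, flags);
            list_splice_tail_init(&kc->callback_jobs, &kc->complete_jobs);
            spin_unlock_irqrestore(&kc->job_lock, flags);

            process_jobs(&kc->complete_jobs, kc, run_complete_job);
            process_jobs(&kc->pages_jobs, kc, run_pages_job);
            process_jobs(&kc->io_jobs, kc, run_io_job);
    }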

    Re-running the aforementioned test:

    * Workqueue stalls are eliminated
    * The maximum writing time among all targets is reduced from 09m37.10s
    to 06m04.85s and the total run time of the test is reduced from
    10m43.591s to 7m19.199s

    [1] https://github.com/jthornber/device-mapper-test-suite

    Signed-off-by: Nikos Tsironis
    Signed-off-by: Ilias Tsitsimpis
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Nikos Tsironis
     
  • [ Upstream commit 8d683dcd65c037efc9fb38c696ec9b65b306e573 ]

    The iv_offset in the mapping table of crypt target is a 64bit number when
    IV algorithm is plain64, plain64be, essiv or benbi. It will be assigned to
    iv_offset of struct crypt_config, cc_sector of struct convert_context and
    iv_sector of struct dm_crypt_request. These structures members are defined
    as a sector_t. But sector_t is 32bit when CONFIG_LBDAF is not set in 32bit
    kernel. In this situation sector_t is not big enough to store the 64bit
    iv_offset.

    Here is a reproducer.
    Prepare test image and device (loop is automatically allocated by cryptsetup):

    # dd if=/dev/zero of=tst.img bs=1M count=1
    # echo "tst"|cryptsetup open --type plain -c aes-xts-plain64 \
    --skip 500000000000000000 tst.img test

    On a 32bit system with CONFIG_LBDAF off, the IV offset overflows sector_t
    and is truncated in the table, and the device checksum is wrong:

    # dmsetup table test --showkeys
    0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 3551657984 7:0 0

    # sha256sum /dev/mapper/test
    533e25c09176632b3794f35303488c4a8f3f965dffffa6ec2df347c168cb6c19 /dev/mapper/test

    On a 64bit system (and on a 32bit system with this patch), the table and checksum are correct:

    # dmsetup table test --showkeys
    0 2048 crypt aes-xts-plain64 dfa7cfe3c481f2239155739c42e539ae8f2d38f304dcc89d20b26f69daaf0933 500000000000000000 7:0 0

    # sha256sum /dev/mapper/test
    5d16160f9d5f8c33d8051e65fdb4f003cc31cd652b5abb08f03aa6fce0df75fc /dev/mapper/test

    Signed-off-by: AliOS system security
    Tested-and-Reviewed-by: Milan Broz
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    AliOS system security
     

13 Jan, 2019

3 commits

  • commit d57f9da890696af1484f4a47f7f123560197865a upstream.

    struct bioctx includes the ref refcount_t to track the number of I/O
    fragments used to process a target BIO as well as ensure that the zone
    of the BIO is kept in the active state throughout the lifetime of the
    BIO. However, since decrementing of this reference count is done in the
    target .end_io method, the function bio_endio() must be called multiple
    times for read and write target BIOs, which causes problems with the
    value of the __bi_remaining struct bio field for chained BIOs (e.g. the
    clone BIO passed by dm core is large and splits into fragments by the
    block layer), resulting in incorrect values and inconsistencies with the
    BIO_CHAIN flag setting. This in turn triggers the BUG_ON() call:

    BUG_ON(atomic_read(&bio->__bi_remaining) <= 0);
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit e4b069e0945fa14c71cf8b5b89f8b1b2aa68dbc2 upstream.

    Since commit d1ac3ff008fb ("dm verity: switch to using asynchronous hash
    crypto API") dm-verity uses asynchronous crypto calls for verification,
    so that it can use hardware with asynchronous processing of crypto
    operations.

    These asynchronous calls don't support vmalloc memory, but the buffer data
    can be allocated with vmalloc if dm-bufio is short of memory and uses a
    reserved buffer that was preallocated in dm_bufio_client_create().

    Fix verity_hash_update() so that it deals with vmalloc'd memory
    correctly.
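
    A hedged sketch of the vmalloc handling (the real verity_hash_update() also
    carries the async-hash request plumbing; only the page-walk idea is shown,
    and the surrounding variables are illustrative):

    if (likely(!is_vmalloc_addr(data))) {
            sg_init_one(&sg, data, len);
            /* ... hash update over &sg as before ... */
    } else {
            /*
             * vmalloc'd memory is not physically contiguous, so feed it to
             * the scatterlist one page at a time via vmalloc_to_page().
             */
            while (len) {
                    unsigned this_step = min_t(unsigned, len,
                                    PAGE_SIZE - offset_in_page(data));

                    sg_init_table(&sg, 1);
                    sg_set_page(&sg, vmalloc_to_page(data),
                                this_step, offset_in_page(data));
                    /* ... hash update over &sg ... */
                    data += this_step;
                    len -= this_step;
            }
    }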

    Reported-by: "Xiao, Jin"
    Signed-off-by: Mikulas Patocka
    Fixes: d1ac3ff008fb ("dm verity: switch to using asynchronous hash crypto API")
    Cc: stable@vger.kernel.org # 4.12+
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sudip Mukherjee
    Acked-by: Mikulas Patocka
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 584ed9fa9532f8b9d5955628ff87ee3b2ab9f5a9 upstream.

    The raid10 driver can't be built with clang since it uses a variable
    length array in a structure (VLAIS):

    drivers/md/raid10.c:4583:17: error: fields must have a constant size:
    'variable length array in structure' extension will never be supported

    Allocate the r10bio struct with kmalloc instead of using the VLAIS
    construct.

    Shaohua: set the MD_RECOVERY_INTR bit
    Neil Brown: use GFP_NOIO
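
    A sketch of the replacement allocation (the on-stack VLAIS struct becomes a
    heap allocation; GFP_NOIO per Neil Brown's note, and MD_RECOVERY_INTR is set
    on failure per Shaohua's note -- surrounding code is elided):

    struct r10bio *r10b;

    r10b = kmalloc(sizeof(*r10b) + sizeof(struct r10dev) * conf->copies,
                   GFP_NOIO);
    if (!r10b) {
            set_bit(MD_RECOVERY_INTR, &mddev->recovery);
            return -ENOMEM;
    }

    /* ... use r10b->devs[0 .. conf->copies - 1] as before ... */

    kfree(r10b);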

    Signed-off-by: Matthias Kaehlcke
    Reviewed-by: Guenter Roeck
    Signed-off-by: Shaohua Li
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Greg Kroah-Hartman

    Matthias Kaehlcke
     

21 Dec, 2018

2 commits

  • commit 687cf4412a343a63928a5c9d91bdc0f522939d43 upstream.

    Otherwise dm_bitset_cursor_begin() returns -ENODATA. Other calls to
    dm_bitset_cursor_begin() have similar negative checks.

    Fixes inability to create a cache in passthrough mode (even though doing
    so makes no sense).

    Fixes: 0d963b6e65 ("dm cache metadata: fix metadata2 format's blocks_are_clean_separate_dirty")
    Cc: stable@vger.kernel.org
    Reported-by: David Teigland
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • commit f6c367585d0d851349d3a9e607c43e5bea993fa1 upstream.

    Sending a DM event before a thin-pool state change is about to happen is
    a bug. It wasn't realized until it became clear that userspace response
    to the event raced with the actual state change that the event was
    meant to notify about.

    Fix this by first updating internal thin-pool state to reflect what the
    DM event is being issued about. This fixes a long-standing racy/buggy
    userspace device-mapper-test-suite 'resize_io' test that would get an
    event but not find the state it was looking for -- so it would just go
    on to hang because no other events caused the test to reevaluate the
    thin-pool's state.

    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

14 Nov, 2018

7 commits

  • commit 9e753ba9b9b405e3902d9f08aec5f2ea58a0c317 upstream.

    Commit d595567dc4f0 (MD: fix invalid stored role for a disk) broke linear
    hotadd. Let's only fix the role for disks in raid1/10.
    Based on Guoqing's original patch.

    Reported-by: kernel test robot
    Cc: Gioh Kim
    Cc: Guoqing Jiang
    Signed-off-by: Shaohua Li
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     
  • commit 3d4e738311327bc4ba1d55fbe2f1da3de4a475f9 upstream.

    dmz_fetch_mblock() called from dmz_get_mblock() has a race since the
    allocation of the new metadata block descriptor and its insertion in
    the cache rbtree with the READING state is not atomic. Two different
    contexts requesting the same block may end up each adding two different
    descriptors of the same block to the cache.

    Another problem for this function is that the BIO for processing the
    block read is allocated after the metadata block descriptor is inserted
    in the cache rbtree. If the BIO allocation fails, the metadata block
    descriptor is freed without first being removed from the rbtree.

    Fix the first problem by checking again if the requested block is not in
    the cache right before inserting the newly allocated descriptor,
    atomically under the mblk_lock spinlock. The second problem is fixed by
    simply allocating the BIO before inserting the new block in the cache.

    Finally, since dmz_fetch_mblock() also increments a block reference
    counter, rename the function to dmz_get_mblock_slow(). To be symmetric
    and clear, also rename dmz_lookup_mblock() to dmz_get_mblock_fast() and
    increment the block reference counter directly in that function rather
    than in dmz_get_mblock().

    Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
    Cc: stable@vger.kernel.org
    Signed-off-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit 33c2865f8d011a2ca9f67124ddab9dc89382e9f1 upstream.

    Since the ref field of struct dmz_mblock is always used with the
    spinlock of struct dmz_metadata locked, there is no need to use an
    atomic_t type. Change the type of the ref field to an unsigned
    integer.

    Fixes: 3b1a94c88b79 ("dm zoned: drive-managed zoned block device target")
    Cc: stable@vger.kernel.org
    Signed-off-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit 800a7340ab7dd667edf95e74d8e4f23a17e87076 upstream.

    In copy_params(), the struct 'dm_ioctl' is first copied from the user
    space buffer 'user' to 'param_kernel' and the field 'data_size' is
    checked against 'minimum_data_size' (size of 'struct dm_ioctl' payload
    up to its 'data' member). If the check fails, an error code EINVAL will be
    returned. Otherwise, param_kernel->data_size is used to do a second copy,
    which copies from the same user-space buffer to 'dmi'. After the second
    copy, only 'dmi->data_size' is checked against 'param_kernel->data_size'.
    Given that the buffer 'user' resides in the user space, a malicious
    user-space process can race to change the content in the buffer between
    the two copies. This way, the attacker can inject inconsistent data
    into 'dmi' (versus previously validated 'param_kernel').

    Fix redundant copying of 'minimum_data_size' from user-space buffer by
    using the first copy stored in 'param_kernel'. Also remove the
    'data_size' check after the second copy because it is now unnecessary.
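
    A hedged sketch of the single-fetch pattern described above (variable names
    follow the description; allocation and the other validation checks are
    omitted):

    /*
     * The header was already fetched and validated into param_kernel, so
     * reuse that copy instead of re-reading it from the user buffer, and
     * only copy the remaining payload from userspace.
     */
    memcpy(dmi, param_kernel, minimum_data_size);

    if (copy_from_user((char *)dmi + minimum_data_size,
                       (char __user *)user + minimum_data_size,
                       param_kernel->data_size - minimum_data_size))
            goto bad;

    /*
     * No second dmi->data_size check is needed: dmi->data_size now comes
     * from the already-validated param_kernel copy.
     */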

    Cc: stable@vger.kernel.org
    Signed-off-by: Wenwen Wang
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Wenwen Wang
     
  • [ Upstream commit d595567dc4f0c1d90685ec1e2e296e2cad2643ac ]

    If we change the number of the array's devices after a device has been removed
    from the array and then add the device back, the device is added with an active
    role instead of the spare role we expected.

    Please see the below link for details:
    https://marc.info/?l=linux-raid&m=153736982015076&w=2

    This is caused by preferring the device's previous role, which is
    recorded in saved_raid_disk; instead we should respect the new value of
    conf->raid_disks, since it may have changed after the device was removed.

    Reported-by: Gioh Kim
    Tested-by: Gioh Kim
    Acked-by: Guoqing Jiang
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     
  • commit 2d6cb6edd2c7fb4f40998895bda45006281b1ac5 upstream.

    refill->end records the last key of a writeback pass. For example, on the
    first pass keys (1,128K) to (1,1024K) are flushed to the backing device,
    but the end key (1,1024K) is not included, because of the code below:

    if (bkey_cmp(k, refill->end) >= 0) {
            ret = MAP_DONE;
            goto out;
    }

    The next time we refill the writeback keybuf, the search starts from
    (1,1024K) and returns a key bigger than it, so the key (1,1024K) is missed.
    This patch modifies the above code so that the end key is included in the
    writeback key buffer.
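
    The change boils down to making the cut-off exclusive of the end key,
    roughly:

    if (bkey_cmp(k, refill->end) > 0) {     /* was: >= 0 */
            ret = MAP_DONE;
            goto out;
    }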

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tang Junhui
     
  • commit 502b291568fc7faf1ebdb2c2590f12851db0ff76 upstream.

    Missed read IOs are identified by s->cache_missed, not s->cache_miss, so
    trace_bcache_read() should be passed s->cache_missed to report whether the
    IO missed the cache or not.

    Signed-off-by: Tang Junhui
    Cc: stable@vger.kernel.org
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tang Junhui
     

04 Nov, 2018

1 commit

  • [ Upstream commit e16b4f99f0f79682a7efe191a8ce694d87ca9fc8 ]

    Since crypto API commit 9fa68f62004 ("crypto: hash - prevent using keyed
    hashes without setting key") dm-integrity cannot use keyed algorithms
    without the key being set.

    dm-integrity recognizes this too late (during use of the HMAC), so it
    allows creation and formatting of the superblock, but the device is in fact
    unusable.

    Fix it by detecting the key requirement in integrity table constructor.
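
    A hedged sketch of the constructor-time check (this relies on the generic
    CRYPTO_TFM_NEED_KEY flag that marks keyed algorithms; the error string is
    illustrative):

    /*
     * After allocating the shash used for the internal MAC, refuse to build
     * the target if the algorithm needs a key and none was supplied in the
     * table line.
     */
    if ((crypto_shash_get_flags(tfm) & CRYPTO_TFM_NEED_KEY) && !key) {
            ti->error = "Keyed hash algorithm requires a key";
            return -ENOKEY;
    }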

    Signed-off-by: Milan Broz
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin

    Milan Broz
     

18 Oct, 2018

4 commits

  • commit 118aa47c7072bce05fc39bd40a1c0a90caed72ab upstream.

    The dm-linear target is independent of the dm-zoned target. For code
    requiring support for zoned block devices, use CONFIG_BLK_DEV_ZONED
    instead of CONFIG_DM_ZONED.

    While at it, similarly to dm linear, also enable the DM_TARGET_ZONED_HM
    feature in dm-flakey only if CONFIG_BLK_DEV_ZONED is defined.

    Fixes: beb9caac211c1 ("dm linear: eliminate linear_end_io call if CONFIG_DM_ZONED disabled")
    Fixes: 0be12c1c7fce7 ("dm linear: add support for zoned block devices")
    Cc: stable@vger.kernel.org
    Signed-off-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit beb9caac211c1be1bc118bb62d5cf09c4107e6a5 upstream.

    It is best to avoid any extra overhead associated with bio completion.
    DM core will indirectly call a DM target's .end_io if it is defined.
    In the case of DM linear, there is no need to do so (for every bio that
    completes) if CONFIG_DM_ZONED is not enabled.

    Avoiding an extra indirect call for every bio completion is very
    important for ensuring DM linear doesn't incur more overhead that
    further widens the performance gap between dm-linear and raw block
    devices.

    Fixes: 0be12c1c7fce7 ("dm linear: add support for zoned block devices")
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • commit 9864cd5dc54cade89fd4b0954c2e522841aa247c upstream.

    If dm-linear or dm-flakey are layered on top of a partition of a zoned
    block device, remapping of the start sector and write pointer position
    of the zones reported by a report zones BIO must be modified to account
    for the target table entry mapping (start offset within the device and
    entry mapping with the dm device). If the target's backing device is a
    partition of a whole disk, the start sector on the physical device of
    the partition must also be accounted for when modifying the zone
    information. However, dm_remap_zone_report() was not considering this
    last case, resulting in incorrect zone information remapping with
    targets using disk partitions.

    Fix this by calculating the target backing device start sector using
    the position of the completed report zones BIO and the unchanged
    position and size of the original report zone BIO. With this value
    calculated, the start sector and write pointer position of the target
    zones can be correctly remapped.

    Fixes: 10999307c14e ("dm: introduce dm_remap_zone_report()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Damien Le Moal
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Damien Le Moal
     
  • commit c7cd55504a5b0fc826a2cd9540845979d24ae542 upstream.

    Commit 7e6358d244e47 ("dm: fix various targets to dm_register_target
    after module __init resources created") inadvertently introduced this
    bug when it moved dm_register_target() after the call to KMEM_CACHE().

    Fixes: 7e6358d244e47 ("dm: fix various targets to dm_register_target after module __init resources created")
    Cc: stable@vger.kernel.org
    Signed-off-by: Shenghui Wang
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Shenghui Wang
     

13 Oct, 2018

2 commits

  • commit 5d07384a666d4b2f781dc056bfeec2c27fbdf383 upstream.

    A reload of the cache's DM table is needed during resize because
    otherwise a crash will occur when attempting to access smq policy
    entries associated with the portion of the cache that was recently
    extended.

    The reason is that cache-size-based data structures in the policy will not
    be resized; the only way to safely extend the cache is to allow for a
    proper cache policy initialization, which occurs when the cache table is
    loaded. For example, the smq policy's space_init(), init_allocator() and
    calc_hotspot_params() must be sized based on the extended cache size.

    The fix for this is to disallow cache resizes of this pattern:
    1) suspend "cache" target's device
    2) resize the fast device used for the cache
    3) resume "cache" target's device

    Instead, the last step must be a full reload of the cache's DM table.

    Fixes: 66a636356 ("dm cache: add stochastic-multi-queue (smq) policy")
    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • commit 4561ffca88c546f96367f94b8f1e4715a9c62314 upstream.

    Commit fd2fa9541 ("dm cache metadata: save in-core policy_hint_size to
    on-disk superblock") enabled previously written policy hints to be
    used after a cache is reactivated. But in doing so the cache
    metadata's hint array was left exposed to out of bounds access because
    on resize the metadata's on-disk hint array wasn't ever extended.

    Fix this by ignoring that there are no on-disk hints associated with the
    newly added cache blocks. An expanded on-disk hint array is later
    rewritten upon the next clean shutdown of the cache.

    Fixes: fd2fa9541 ("dm cache metadata: save in-core policy_hint_size to on-disk superblock")
    Cc: stable@vger.kernel.org
    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Joe Thornber
     

10 Oct, 2018

5 commits

  • commit 013ad043906b2befd4a9bfb06219ed9fedd92716 upstream.

    sector_div() is only viable for use with sector_t.
    dm_block_t is typedef'd to uint64_t -- so use div_u64() instead.
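
    The distinction in a couple of lines (sector_div() requires a sector_t
    lvalue, divides it in place and returns the remainder, while div_u64()
    simply returns the 64-bit quotient; variable names are illustrative):

    sector_t sectors = capacity;
    sector_div(sectors, chunk_sectors);            /* ok: sectors is a sector_t */

    dm_block_t blocks = nr_blocks;                 /* dm_block_t is uint64_t */
    blocks = div_u64(blocks, blocks_per_chunk);    /* use div_u64() here instead */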

    Fixes: 3ab918281 ("dm thin metadata: try to avoid ever aborting transactions")
    Signed-off-by: Mike Snitzer
    Cc: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • [ Upstream commit 3ab91828166895600efd9cdc3a0eb32001f7204a ]

    Committing a transaction can consume some metadata of its own, so we now
    reserve a small amount of metadata to cover this. Free metadata reported
    by the kernel will not include this reserve.

    If any of the reserve has been used after a commit we enter a new
    internal state PM_OUT_OF_METADATA_SPACE. This is reported as
    PM_READ_ONLY, so no userland changes are needed. If the metadata
    device is resized the pool will move back to PM_WRITE.

    These changes mean we never need to abort and rollback a transaction due
    to running out of metadata space. This is particularly important
    because there have been a handful of reports of data corruption against
    DM thin-provisioning that can all be attributed to the thin-pool having
    run out of metadata space.

    Signed-off-by: Joe Thornber
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Joe Thornber
     
  • [ Upstream commit c44a5ee803d2b7ed8c2e6ce24a5c4dd60778886e ]

    Update superblock when particular devices are requested via rebuild
    (e.g. lvconvert --replace ...) to avoid spurious failure with the "New
    device injected into existing raid set without 'delta_disks' or
    'rebuild' parameter specified" error message.

    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Heinz Mauelshagen
     
  • [ Upstream commit 1d0ffd264204eba1861865560f1f7f7a92919384 ]

    In raid10 reshape_request, max_sectors is obtained from read_balance. If the
    underlying disks have bad blocks, max_sectors is less than last, so the code
    takes the goto read_more path many times. Each time it calls
    raise_barrier(conf, sectors_done != 0), and in this condition sectors_done is
    not 0, so the value passed as the force argument of raise_barrier is true.

    In raise_barrier, when force is true it checks conf->barrier; if force is true
    and conf->barrier is 0, it hits the BUG_ON. In this case reshape_request submits
    bios to the underlying disks, and the callback function of those bios calls
    lower_barrier. If a bio finishes before raise_barrier is called again, this can
    trigger the BUG_ON.

    Add one pair of raise_barrier/lower_barrier to fix this bug.

    Signed-off-by: Xiao Ni
    Suggested-by: Neil Brown
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Xiao Ni
     
  • [ Upstream commit e254de6bcf3f5b6e78a92ac95fb91acef8adfe1a ]

    We don't support reshape yet if an array supports a log device. Previously we
    determined this by checking ->log. However, ->log could be NULL after a log
    device is removed, while the array is still marked as supporting a log device.
    Don't allow reshape in this case either. Users can disable log device support
    by setting 'consistency_policy' to 'resync' and then do the reshape.

    Reported-by: Xiao Ni
    Tested-by: Xiao Ni
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Shaohua Li
     

04 Oct, 2018

1 commit

  • [ Upstream commit 010228e4a932ca1e8365e3b58c8e1e44c16ff793 ]

    When one node leaves the cluster or stops the resyncing (resync or recovery)
    array, the other nodes need to call recover_bitmaps to continue the
    unfinished task.

    But we need to clear suspend_area later, after the other nodes have copied
    the resync information to their bitmap (by calling bitmap_copy_from_slot).
    Otherwise, all nodes could write to the suspend_area even though the
    suspend_area is not handled by any node, because area_resyncing returns 0
    at the beginning of raid1_write_request. That means one node could write to
    the suspend_area while another node is resyncing the same area, so data
    could become inconsistent.

    So let's clear suspend_area later to avoid the above issue, under the
    protection of the bm lock. It is also straightforward to clear suspend_area
    after the nodes have copied the resync info to their bitmap.

    Signed-off-by: Guoqing Jiang
    Reviewed-by: NeilBrown
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Guoqing Jiang
     

20 Sep, 2018

2 commits

  • [ Upstream commit af9313c32c0fa2a0ac3b113669273833d60cc9de ]

    More than one io_mode feature can be requested when creating a dm cache
    device (as is, the last one wins). The io_mode selections are incompatible
    with one another, so we should force them to be selected exclusively. Add
    a counter to check for more than one io_mode selection.
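
    A sketch of the counter inside the feature-argument parsing loop (structure
    and error string are illustrative; only the io_mode branches are shown):

    unsigned mode_ctr = 0;

    while (argc--) {
            const char *arg = dm_shift_arg(as);

            if (!strcasecmp(arg, "writeback")) {
                    cf->io_mode = CM_IO_WRITEBACK;
                    mode_ctr++;
            } else if (!strcasecmp(arg, "writethrough")) {
                    cf->io_mode = CM_IO_WRITETHROUGH;
                    mode_ctr++;
            } else if (!strcasecmp(arg, "passthrough")) {
                    cf->io_mode = CM_IO_PASSTHROUGH;
                    mode_ctr++;
            }
            /* ... other feature arguments ... */
    }

    if (mode_ctr > 1) {
            *error = "Duplicate cache io_mode features requested";
            return -EINVAL;
    }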

    Fixes: 629d0a8a1a10 ("dm cache metadata: add "metadata2" feature")
    Signed-off-by: John Pittman
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    John Pittman
     
  • [ Upstream commit d63e2fc804c46e50eee825c5d3a7228e07048b47 ]

    During raid5 replacement, the stripes can be marked with the R5_NeedReplace
    flag. Data can be read from being-replaced devices and written to the
    replacing spares without reading all other devices. (This is 'replace'
    mode: s.replacing = 1.) If a being-replaced device is dropped, the
    replacement progress will be interrupted and resumed in pure recovery
    mode. However, existing stripes from before the interruption can no longer
    read from the dropped device. This prints lots of WARN_ON messages and
    results in data corruption, because the existing stripes write problematic
    data into their replacement devices and update the progress.

    # Erase disks (1MB + 2GB)
    dd if=/dev/zero of=/dev/sda bs=1MB count=2049
    dd if=/dev/zero of=/dev/sdb bs=1MB count=2049
    dd if=/dev/zero of=/dev/sdc bs=1MB count=2049
    dd if=/dev/zero of=/dev/sdd bs=1MB count=2049
    mdadm -C /dev/md0 -amd -R -l5 -n3 -x0 /dev/sd[abc] -z 2097152
    # Ensure array stores non-zero data
    dd if=/root/data_4GB.iso of=/dev/md0 bs=1MB
    # Start replacement
    mdadm /dev/md0 -a /dev/sdd
    mdadm /dev/md0 --replace /dev/sda

    Then hot-plug out /dev/sda during recovery and wait for the recovery to finish.
    echo check > /sys/block/md0/md/sync_action
    cat /sys/block/md0/md/mismatch_cnt # it will be greater than 0.

    Soon after you hot-plug out /dev/sda, you will see many WARN_ON
    messages. The replacement recovery will be interrupted shortly. After
    the recovery finishes, it will result in data corruption.

    Actually, it's just an unhandled case of replacement. According to the commit
    (md/raid5: fix interaction of 'replace' and 'recovery'.),
    if a NeedReplace device is not UPTODATE then that is an error; the commit
    simply prints a WARN_ON but also marks these corrupted stripes with
    R5_WantReplace (meaning they are ready for writes).

    To fix this case, we can leverage 'sync and replace' mode mentioned in
    commit (md/raid5: detect and handle replacements during
    recovery.). We can add logic to detect and use 'sync and replace' mode
    for these stripes.

    Reported-by: Alex Chen
    Reviewed-by: Alex Wu
    Reviewed-by: Chung-Chiang Cheng
    Signed-off-by: BingJing Chang
    Signed-off-by: Shaohua Li
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    BingJing Chang
     

15 Sep, 2018

1 commit

  • [ Upstream commit 784c9a29e99eb40b842c29ecf1cc3a79e00fb629 ]

    It was reported that softlockups occur when using dm-snapshot on top of
    slow (rbd) storage. E.g.:

    [ 4047.990647] watchdog: BUG: soft lockup - CPU#10 stuck for 22s! [kworker/10:23:26177]
    ...
    [ 4048.034151] Workqueue: kcopyd do_work [dm_mod]
    [ 4048.034156] RIP: 0010:copy_callback+0x41/0x160 [dm_snapshot]
    ...
    [ 4048.034190] Call Trace:
    [ 4048.034196] ? __chunk_is_tracked+0x70/0x70 [dm_snapshot]
    [ 4048.034200] run_complete_job+0x5f/0xb0 [dm_mod]
    [ 4048.034205] process_jobs+0x91/0x220 [dm_mod]
    [ 4048.034210] ? kcopyd_put_pages+0x40/0x40 [dm_mod]
    [ 4048.034214] do_work+0x46/0xa0 [dm_mod]
    [ 4048.034219] process_one_work+0x171/0x370
    [ 4048.034221] worker_thread+0x1fc/0x3f0
    [ 4048.034224] kthread+0xf8/0x130
    [ 4048.034226] ? max_active_store+0x80/0x80
    [ 4048.034227] ? kthread_bind+0x10/0x10
    [ 4048.034231] ret_from_fork+0x35/0x40
    [ 4048.034233] Kernel panic - not syncing: softlockup: hung tasks

    Fix this by calling cond_resched() after run_complete_job()'s callout to
    the dm_kcopyd_notify_fn (which is dm-snap.c:copy_callback in the above
    trace).
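
    The change is essentially a single cond_resched() after the notify callout,
    roughly (the surrounding cleanup in run_complete_job() is elided):

    static int run_complete_job(struct kcopyd_job *job)
    {
            void *context = job->context;
            int read_err = job->read_err;
            unsigned long write_err = job->write_err;
            dm_kcopyd_notify_fn fn = job->fn;

            /* ... free the job's pages and the job itself ... */

            fn(read_err, write_err, context);   /* e.g. dm-snap.c:copy_callback */

            /*
             * The callback may queue more completed jobs; yield so other work
             * items pinned to this CPU get a chance to run.
             */
            cond_resched();

            return 0;
    }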

    Signed-off-by: John Pittman
    Signed-off-by: Mike Snitzer
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    John Pittman
     

10 Sep, 2018

6 commits

  • commit 3943b040f11ed0cc6d4585fd286a623ca8634547 upstream.

    The writeback thread would exit with a lock held when the cache device
    is detached via sysfs interface, fix it by releasing the held lock
    before exiting the while-loop.

    Fixes: fadd94e05c02 (bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set)
    Signed-off-by: Shan Hai
    Signed-off-by: Coly Li
    Tested-by: Shenghui Wang
    Cc: stable@vger.kernel.org #4.17+
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Shan Hai
     
  • commit bc9e9cf0401f18e33b78d4c8a518661b8346baf7 upstream.

    dm-crypt should only increase device limits, it should not decrease them.

    This fixes a bug where the user could create a crypt device with a 1024-byte
    sector size on top of a SCSI device that had a 4096-byte logical block size.
    The 4096-byte limit would be lost and the user could incorrectly send
    1024-byte I/Os to the crypt device.
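
    A sketch of the corrected .io_hints behaviour (only ever raising the limits
    toward the crypt sector size; the exact set of fields adjusted may differ):

    static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
    {
            struct crypt_config *cc = ti->private;

            /* never lower limits inherited from the underlying device */
            limits->logical_block_size =
                    max_t(unsigned, limits->logical_block_size, cc->sector_size);
            limits->physical_block_size =
                    max_t(unsigned, limits->physical_block_size, cc->sector_size);
            limits->io_min = max_t(unsigned, limits->io_min, cc->sector_size);
    }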

    Cc: stable@vger.kernel.org
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     
  • commit 5b1fe7bec8a8d0cc547a22e7ddc2bd59acd67de4 upstream.

    Quoting Documentation/device-mapper/cache.txt:

    The 'dirty' state for a cache block changes far too frequently for us
    to keep updating it on the fly. So we treat it as a hint. In normal
    operation it will be written when the dm device is suspended. If the
    system crashes all cache blocks will be assumed dirty when restarted.

    This got broken in commit f177940a8091 ("dm cache metadata: switch to
    using the new cursor api for loading metadata") in 4.9, which removed
    the code that consulted cmd->clean_when_opened (CLEAN_SHUTDOWN on-disk
    flag) when loading cache blocks. This results in data corruption on an
    unclean shutdown with dirty cache blocks on the fast device. After the
    crash those blocks are considered clean and may get evicted from the
    cache at any time. This can be demonstrated by doing a lot of reads
    to trigger individual evictions, but uncache is more predictable:

    ### Disable auto-activation in lvm.conf to be able to do uncache in
    ### time (i.e. see uncache doing flushing) when the fix is applied.

    # xfs_io -d -c 'pwrite -b 4M -S 0xaa 0 1G' /dev/vdb
    # vgcreate vg_cache /dev/vdb /dev/vdc
    # lvcreate -L 1G -n lv_slowdev vg_cache /dev/vdb
    # lvcreate -L 512M -n lv_cachedev vg_cache /dev/vdc
    # lvcreate -L 256M -n lv_metadev vg_cache /dev/vdc
    # lvconvert --type cache-pool --cachemode writeback vg_cache/lv_cachedev --poolmetadata vg_cache/lv_metadev
    # lvconvert --type cache vg_cache/lv_slowdev --cachepool vg_cache/lv_cachedev
    # xfs_io -d -c 'pwrite -b 4M -S 0xbb 0 512M' /dev/mapper/vg_cache-lv_slowdev
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    # dmsetup status vg_cache-lv_slowdev
    0 2097152 cache 8 27/65536 128 8192/8192 1 100 0 0 0 8192 7065 2 metadata2 writeback 2 migration_threshold 2048 smq 0 rw -
    ^^^^
    7065 * 64k = 441M yet to be written to the slow device
    # echo b >/proc/sysrq-trigger

    # vgchange -ay vg_cache
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    # lvconvert --uncache vg_cache/lv_slowdev
    Flushing 0 blocks for cache vg_cache/lv_slowdev.
    Logical volume "lv_cachedev" successfully removed
    Logical volume vg_cache/lv_slowdev is not cached.
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................
    0fe00010: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa ................

    This is the case with both v1 and v2 cache pool metadata formats.

    After applying this patch:

    # vgchange -ay vg_cache
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    # lvconvert --uncache vg_cache/lv_slowdev
    Flushing 3724 blocks for cache vg_cache/lv_slowdev.
    ...
    Flushing 71 blocks for cache vg_cache/lv_slowdev.
    Logical volume "lv_cachedev" successfully removed
    Logical volume vg_cache/lv_slowdev is not cached.
    # xfs_io -d -c 'pread -v 254M 512' /dev/mapper/vg_cache-lv_slowdev | head -n 2
    0fe00000: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
    0fe00010: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................

    Cc: stable@vger.kernel.org
    Fixes: f177940a8091 ("dm cache metadata: switch to using the new cursor api for loading metadata")
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     
  • commit fd2fa95416188a767a63979296fa3e169a9ef5ec upstream.

    policy_hint_size starts as 0 during __write_initial_superblock(). It
    isn't until the policy is loaded that policy_hint_size is set in-core
    (cmd->policy_hint_size). But it never got recorded in the on-disk
    superblock because __commit_transaction() didn't deal with transferring
    the in-core cmd->policy_hint_size to the on-disk superblock.

    The in-core cmd->policy_hint_size gets initialized by metadata_open()'s
    __begin_transaction_flags() which re-reads all superblock fields.
    Because the superblock's policy_hint_size was never properly stored when
    the cache was created, hints_array_available() would always return false
    when re-activating a previously created cache. This means
    __load_mappings() always considered the hints invalid and never made use
    of them (the hints serve as an optimization).

    Another detrimental side-effect of this oversight is that the cache_check
    utility would fail with: "invalid hint width: 0"

    Cc: stable@vger.kernel.org
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     
  • commit 75294442d896f2767be34f75aca7cc2b0d01301f upstream.

    Now both check_for_space() and do_no_space_timeout() will read & write
    pool->pf.error_if_no_space. If these functions run concurrently, as
    shown in the following case, the default setting of "queue_if_no_space"
    can get lost.

    precondition:
    * error_if_no_space = false (aka "queue_if_no_space")
    * pool is in Out-of-Data-Space (OODS) mode
    * no_space_timeout worker has been queued

    CPU 0:                              CPU 1:
                                        // delete a thin device
                                        process_delete_mesg()
                                        // check_for_space() invoked by commit()
                                        set_pool_mode(pool, PM_WRITE)
                                          pool->pf.error_if_no_space = \
                                            pt->requested_pf.error_if_no_space

    // timeout, pool is still in OODS mode
    do_no_space_timeout
      // "queue_if_no_space" config is lost
      pool->pf.error_if_no_space = true
                                          pool->pf.mode = new_mode

    Fix it by stopping no_space_timeout worker when switching to write mode.
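
    A sketch of the fix in set_pool_mode() when entering PM_WRITE (simplified;
    only the new cancellation is shown and the rest of the transition is elided):

    case PM_WRITE:
            /*
             * Stop the OODS timeout worker before applying the user-requested
             * error_if_no_space setting, so the worker cannot race in
             * afterwards and force it back to true.
             */
            if (old_mode == PM_OUT_OF_DATA_SPACE)
                    cancel_delayed_work_sync(&pool->no_space_timeout);

            /* ... rest of the PM_WRITE transition ... */
            break;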

    Fixes: bcc696fac11f ("dm thin: stay in out-of-data-space mode once no_space_timeout expires")
    Cc: stable@vger.kernel.org
    Signed-off-by: Hou Tao
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Hou Tao
     
  • commit c21b16392701543d61e366dca84e15fe7f0cf0cf upstream.

    Early alpha processors can't write a byte or short atomically - they
    read 8 bytes, modify the byte or two bytes in registers and write back
    8 bytes.

    The modification of the variable "suspending" may race with
    modification of the variable "failed". Fix this by changing
    "suspending" to an int.

    Cc: stable@vger.kernel.org
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka