31 Mar, 2020

1 commit

  • Pull block driver updates from Jens Axboe:

    - floppy driver cleanup series from Willy

    - NVMe updates and fixes (Various)

    - null_blk trace improvements (Chaitanya)

    - bcache fixes (Coly)

    - md fixes (via Song)

    - loop block size change optimizations (Martijn)

    - scnprintf() use (Takashi)

    * tag 'for-5.7/drivers-2020-03-29' of git://git.kernel.dk/linux-block: (81 commits)
    null_blk: add trace in null_blk_zoned.c
    null_blk: add tracepoint helpers for zoned mode
    block: add a zone condition debug helper
    nvme: cleanup namespace identifier reporting in nvme_init_ns_head
    nvme: rename __nvme_find_ns_head to nvme_find_ns_head
    nvme: refactor nvme_identify_ns_descs error handling
    nvme-tcp: Add warning on state change failure at nvme_tcp_setup_ctrl
    nvme-rdma: Add warning on state change failure at nvme_rdma_setup_ctrl
    nvme: Fix controller creation races with teardown flow
    nvme: Make nvme_uninit_ctrl symmetric to nvme_init_ctrl
    nvme: Fix ctrl use-after-free during sysfs deletion
    nvme-pci: Re-order nvme_pci_free_ctrl
    nvme: Remove unused return code from nvme_delete_ctrl_sync
    nvme: Use nvme_state_terminal helper
    nvme: release ida resources
    nvme: Add compat_ioctl handler for NVME_IOCTL_SUBMIT_IO
    nvmet-tcp: optimize tcp stack TX when data digest is used
    nvme-fabrics: Use scnprintf() for avoiding potential buffer overflow
    nvme-multipath: do not reset on unknown status
    nvmet-rdma: allocate RW ctxs according to mdts
    ...

    Linus Torvalds
     

28 Mar, 2020

2 commits

  • Current make_request based drivers use either blk_alloc_queue_node or
    blk_alloc_queue to allocate a queue, and then set up the make_request_fn
    function pointer and a few parameters using the blk_queue_make_request
    helper. Simplify this by passing the make_request pointer to
    blk_alloc_queue, and while at it merge the _node variant into the main
    helper by always passing a node_id, and remove the superfluous gfp_mask
    parameter. A lower-level __blk_alloc_queue is kept for the blk-mq case.
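
    For illustration, a minimal sketch of what the conversion looks like
    for a make_request based driver (the example_make_request name is
    hypothetical; only blk_alloc_queue and the removed helpers come from
    the change described above):

    /* Before: allocate the queue, then wire up the make_request_fn. */
    q = blk_alloc_queue_node(GFP_KERNEL, NUMA_NO_NODE);
    if (!q)
            return -ENOMEM;
    blk_queue_make_request(q, example_make_request);

    /* After: pass the make_request_fn and the NUMA node id directly. */
    q = blk_alloc_queue(example_make_request, NUMA_NO_NODE);
    if (!q)
            return -ENOMEM;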

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • bcache is the only driver that does not actually pass its make_request
    methods to blk_queue_make_request, but instead just sets them up
    manually a little later. Make bcache follow the common way of
    setting up make_request based queues.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

25 Mar, 2020

2 commits

  • These macros are just used by a few files. Move them out of genhd.h,
    which is included everywhere, and into a new standalone header.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Commit 253a99d95d5b ("bcache: move macro btree() and btree_root()
    into btree.h") introduced two duplicated declarations into btree.h:
    typedef int (btree_map_keys_fn)();
    int bch_btree_map_keys();

    The kbuild test robot detected and reported this problem, and this
    patch fixes it by removing the duplicated declarations.

    Fixes: 253a99d95d5b ("bcache: move macro btree() and btree_root() into btree.h")
    Reported-by: kbuild test robot
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     

24 Mar, 2020

2 commits

  • Add a new include/linux/raid/detect.h header to declare the
    md_autodetect_dev prototype which can be shared between md and
    the partition code. Then use IS_BUILTIN to call it instead of the
    ifdef magic.
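
    The IS_BUILTIN pattern mentioned above looks roughly like this sketch
    (the config symbol and the devt variable are shown as assumptions, not
    the exact partition code):

    #include <linux/raid/detect.h>

    /* Before, the call was wrapped in preprocessor conditionals:
     *   #ifdef CONFIG_BLK_DEV_MD
     *           md_autodetect_dev(devt);
     *   #endif
     *
     * After, IS_BUILTIN() is a compile-time constant, so the compiler
     * simply drops the call when md is not built in. */
    if (IS_BUILTIN(CONFIG_BLK_DEV_MD))
            md_autodetect_dev(devt);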

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • There is no good reason for __bdevname to exist. Just open code
    printing the string in the callers. For three of them the format
    string can be trivially merged into existing printk statements,
    and in init/do_mounts.c we can at least do the scnprintf once at
    the start of the function, unconditionally of CONFIG_BLOCK, to
    make the output for tiny configs a little more helpful.

    Acked-by: Theodore Ts'o # for ext4
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

23 Mar, 2020

7 commits

  • The idea of this patch is from Davidlohr Bueso; he posted a patch
    for bcache to optimize barrier usage for read-modify-write atomic
    bitops. Such optimization can also be applied in other locations
    where smp_mb() is used before or after an atomic operation.

    This patch replaces smp_mb() with smp_mb__before_atomic() or
    smp_mb__after_atomic() in btree.c and writeback.c, where it is used
    only to synchronize the memory cache with other cores. Although these
    locations are not on a hot code path, it never hurts to make
    things a little better.
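
    As a rough sketch of the substitution (the flag name here is made up
    for illustration):

    /* Before: a full barrier next to an already-atomic RMW bit op. */
    smp_mb();
    set_bit(EXAMPLE_FLAG, &flags);

    /* After: the lighter variant, which is a no-op on architectures whose
     * atomic RMW operations already imply a full barrier (e.g. x86). */
    smp_mb__before_atomic();
    set_bit(EXAMPLE_FLAG, &flags);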

    Signed-off-by: Coly Li
    Cc: Davidlohr Bueso
    Signed-off-by: Jens Axboe

    Coly Li
     
  • We can avoid the unnecessary barrier on non-LL/SC architectures,
    such as x86. Instead, use smp_mb__after_atomic().

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Davidlohr Bueso
     
  • Since snprintf() returns the would-be-output size instead of the
    actual output size, the succeeding calls may go beyond the given
    buffer limit. Fix it by replacing with scnprintf().
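
    As a hedged illustration of why this matters (the buffer and values
    here are made up), successive snprintf() calls that advance a write
    offset can run past the buffer, while scnprintf() cannot:

    char buf[16];
    size_t off = 0;

    /* snprintf() returns the would-be length, so a truncated first call
     * can push 'off' past sizeof(buf), and the size argument of the next
     * call then underflows. */
    off += snprintf(buf + off, sizeof(buf) - off, "%s", long_name);
    off += snprintf(buf + off, sizeof(buf) - off, "/%d", index);

    /* scnprintf() returns the bytes actually written, so 'off' can never
     * exceed sizeof(buf) - 1. */
    off = 0;
    off += scnprintf(buf + off, sizeof(buf) - off, "%s", long_name);
    off += scnprintf(buf + off, sizeof(buf) - off, "/%d", index);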

    Signed-off-by: Takashi Iwai
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Takashi Iwai
     
  • When attaching a cached device (a.k.a. backing device) to a cache
    device, bch_sectors_dirty_init() is called to count dirty sectors
    and stripes (see what bcache_dev_sectors_dirty_add() does) on the
    cache device.

    The counting is done by the single-threaded recursive function
    bch_btree_map_keys(), which iterates over all the bcache btree nodes.
    If the btree has a huge number of nodes, bch_sectors_dirty_init()
    will take quite a long time. In my testing, if the registering cache
    set has an existing UUID which matches an already registered cached
    device, the automatic attachment during registration may take more
    than 55 minutes. This is too long to wait for bcache to become usable
    in a real deployment.

    Fortunately, when bch_sectors_dirty_init() is called, no other thread
    accesses the btree yet, so it is safe to do a read-only, parallelized
    dirty sector count with multiple threads.

    This patch creates multiple threads; each thread counts, one sub-tree
    at a time, the dirty sectors of the sub-tree indexed by a root node
    key that the thread fetched. After a sub-tree is counted, the counting
    thread fetches another root node key, until the fetched key is NULL.
    How many threads run in parallel depends on the number of keys in the
    btree root node and the number of online CPU cores; the thread count
    is the smaller of the two, but no more than BCH_DIRTY_INIT_THRD_MAX.
    If there are only 2 keys in the root node, this patch can only make it
    2x faster; but if there are 10 keys in the root node, it can be 10x
    faster.
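
    A much simplified sketch of the scheme (the struct and helper names
    below are illustrative, not the actual bcache code):

    static int dirty_init_thread(void *arg)
    {
            struct dirty_init_state *state = arg;
            struct bkey *k;

            for (;;) {
                    /* Claim the next root-node key under a lock; NULL means
                     * every sub-tree has already been handed out. */
                    spin_lock(&state->idx_lock);
                    k = next_root_key(state);
                    spin_unlock(&state->idx_lock);
                    if (!k)
                            break;
                    count_dirty_sectors_in_subtree(state, k);
            }
            return 0;
    }

    /* Thread count: min(root-node keys, online CPUs), capped at
     * BCH_DIRTY_INIT_THRD_MAX. */
    n = min_t(int, root_keys, num_online_cpus());
    n = min_t(int, n, BCH_DIRTY_INIT_THRD_MAX);
    for (i = 0; i < n; i++)
            kthread_run(dirty_init_thread, state, "bch_dirty_init[%d]", i);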

    Signed-off-by: Coly Li
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Coly Li
     
  • When registering a cache device, bch_btree_check() is called to check
    all btree nodes, to make sure the btree is consistent and not
    corrupted.

    bch_btree_check() runs recursively in a single thread; when a lot of
    data is cached and the btree is huge, it may take a very long time to
    check all the btree nodes. In my testing, I observed that it took
    around 50 minutes to finish bch_btree_check().

    When checking the bcache btree nodes, the cache set is not running
    yet, and indeed the whole tree is in a read-only state, so it is safe
    to create multiple threads to check the btree in parallel.

    This patch creates multiple threads; each thread checks, one at a
    time, the sub-trees indexed by keys from the btree root node. The
    number of parallel threads depends on how many keys are in the btree
    root node. At most BCH_BTR_CHKTHREAD_MAX (64) threads can be created,
    but in practice it should be min(cpu-number/2, root-node-keys-number).

    Signed-off-by: Coly Li
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Coly Li
     
  • This patch changes macro btree_root() and btree() to bcache_btree_root()
    and bcache_btree(), to avoid potential generic name clash in future.

    NOTE: for product kernel maintainers, this patch can be skipped if
    you feel the rename stuffs introduce inconvenince to patch backport.

    Suggested-by: Christoph Hellwig
    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     
  • In order to accelerate bcache registration speed, the macro btree()
    and btree_root() will be referenced out of btree.c. This patch moves
    them from btree.c into btree.h with other relative function declaration
    in btree.h, for the following changes.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     

18 Mar, 2020

1 commit

  • Don't call quiesce(1) and quiesce(0) if the array is already suspended;
    otherwise, in level_store(), the array is writable right after
    mddev_detach() in the sequence below, even though the intention is to
    make the array writable only after resume.

    mddev_suspend(mddev);
    mddev_detach(mddev);
    ...
    mddev_resume(mddev);

    And it also causes calltrace as follows in [1].

    [48005.653834] WARNING: CPU: 1 PID: 45380 at kernel/kthread.c:510 kthread_park+0x77/0x90
    [...]
    [48005.653976] CPU: 1 PID: 45380 Comm: mdadm Tainted: G OE 5.4.10-arch1-1 #1
    [48005.653979] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4105-ITX, BIOS P1.40 08/06/2018
    [48005.653984] RIP: 0010:kthread_park+0x77/0x90
    [48005.654015] Call Trace:
    [48005.654039] r5l_quiesce+0x3c/0x70 [raid456]
    [48005.654052] raid5_quiesce+0x228/0x2e0 [raid456]
    [48005.654073] mddev_detach+0x30/0x70 [md_mod]
    [48005.654090] level_store+0x202/0x670 [md_mod]
    [48005.654099] ? security_capable+0x40/0x60
    [48005.654114] md_attr_store+0x7b/0xc0 [md_mod]
    [48005.654123] kernfs_fop_write+0xce/0x1b0
    [48005.654132] vfs_write+0xb6/0x1a0
    [48005.654138] ksys_write+0x67/0xe0
    [48005.654146] do_syscall_64+0x4e/0x140
    [48005.654155] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [48005.654161] RIP: 0033:0x7fa0c8737497

    [1]: https://bugzilla.kernel.org/show_bug.cgi?id=206161

    Signed-off-by: Guoqing Jiang
    Signed-off-by: Song Liu

    Guoqing Jiang
     

08 Mar, 2020

1 commit

  • Pull block fixes from Jens Axboe:
    "Here are a few fixes that should go into this release. This contains:

    - Revert of a bad bcache patch from this merge window

    - Removed unused function (Daniel)

    - Fixup for the blktrace fix from Jan from this release (Cengiz)

    - Fix of deeper level bfqq overwrite in BFQ (Carlo)"

    * tag 'block-5.6-2020-03-07' of git://git.kernel.dk/linux-block:
    block, bfq: fix overwrite of bfq_group pointer in bfq_find_set_group()
    blktrace: fix dereference after null check
    Revert "bcache: ignore pending signals when creating gc and allocator thread"
    block: Remove used kblockd_schedule_work_on()

    Linus Torvalds
     

05 Mar, 2020

1 commit

  • Pull device mapper fixes from Mike Snitzer:

    - Fix request-based DM's congestion_fn and actually wire it up to the
    bdi.

    - Extend dm-bio-record to track additional struct bio members needed by
    DM integrity target.

    - Fix DM core to properly advertise that a device is suspended during
    unload (between the presuspend and postsuspend hooks). This change is
    a prereq for related DM integrity and DM writecache fixes. It
    elevates DM integrity's 'suspending' state tracking to DM core.

    - Four stable fixes for DM integrity target.

    - Fix crash in DM cache target due to incorrect work item cancelling.

    - Fix DM thin metadata lockdep warning that was introduced during 5.6
    merge window.

    - Fix DM zoned target's chunk work refcounting that regressed during
    recent conversion to refcount_t.

    - Bump the minor version for DM core and all target versions that have
    seen interface changes or important fixes during the 5.6 cycle.

    * tag 'for-5.6/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm: bump version of core and various targets
    dm: fix congested_fn for request-based device
    dm integrity: use dm_bio_record and dm_bio_restore
    dm bio record: save/restore bi_end_io and bi_integrity
    dm zoned: Fix reference counter initial value of chunk works
    dm writecache: verify watermark during resume
    dm: report suspended device during destroy
    dm thin metadata: fix lockdep complaint
    dm cache: fix a crash due to incorrect work item cancelling
    dm integrity: fix invalid table returned due to argument count mismatch
    dm integrity: fix a deadlock due to offloading to an incorrect workqueue
    dm integrity: fix recalculation when moving from journal mode to bitmap mode

    Linus Torvalds
     

04 Mar, 2020

2 commits

  • Changes made during the 5.6 cycle warrant bumping the version number
    for DM core and the targets modified by this commit.

    It should be noted that dm-thin, dm-crypt and dm-raid already had
    their target version bumped during the 5.6 merge window.

    Signed-off-by: Mike Snitzer

    Mike Snitzer
     
  • We neither assign congested_fn for request-based blk-mq devices nor
    implement it correctly. So fix both.

    Also, remove incorrect comment from dm_init_normal_md_queue and rename
    it to dm_init_congested_fn.

    Fixes: 4aa9c692e052 ("bdi: separate out congested state into a separate struct")
    Cc: stable@vger.kernel.org
    Signed-off-by: Hou Tao
    Signed-off-by: Mike Snitzer

    Hou Tao
     

03 Mar, 2020

3 commits

  • In cases where dec_in_flight() has to requeue the integrity_bio_wait
    work to transfer the rest of the data, the bio's __bi_remaining might
    already have been decremented to 0, e.g. if the bio passed to the
    underlying data device was split via blk_queue_split().

    Use dm_bio_{record,restore} rather than effectively open-coding them in
    dm-integrity -- these methods now manage __bi_remaining too.

    Depends-on: f7f0b057a9c1 ("dm bio record: save/restore bi_end_io and bi_integrity")
    Reported-by: Daniel Glöckner
    Suggested-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mike Snitzer
     
  • Also, save/restore __bi_remaining in case the bio was used in a
    BIO_CHAIN (e.g. due to blk_queue_split).

    Suggested-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mike Snitzer
     
  • This reverts commit 0b96da639a4874311e9b5156405f69ef9fc3bef8.

    We can't just go flushing random signals, under the assumption that the
    OOM killer will just do something else. It's not safe from the OOM
    perspective, and it could also cause other signals to get randomly lost.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Feb, 2020

5 commits

  • Dm-zoned initializes reference counters of new chunk works with zero
    value and refcount_inc() is called to increment the counter. However, the
    refcount_inc() function handles the addition to zero value as an error
    and triggers the warning as follows:

    refcount_t: addition on 0; use-after-free.
    WARNING: CPU: 7 PID: 1506 at lib/refcount.c:25 refcount_warn_saturate+0x68/0xf0
    ...
    CPU: 7 PID: 1506 Comm: systemd-udevd Not tainted 5.4.0+ #134
    ...
    Call Trace:
    dmz_map+0x2d2/0x350 [dm_zoned]
    __map_bio+0x42/0x1a0
    __split_and_process_non_flush+0x14a/0x1b0
    __split_and_process_bio+0x83/0x240
    ? kmem_cache_alloc+0x165/0x220
    dm_process_bio+0x90/0x230
    ? generic_make_request_checks+0x2e7/0x680
    dm_make_request+0x3e/0xb0
    generic_make_request+0xcf/0x320
    ? memcg_drain_all_list_lrus+0x1c0/0x1c0
    submit_bio+0x3c/0x160
    ? guard_bio_eod+0x2c/0x130
    mpage_readpages+0x182/0x1d0
    ? bdev_evict_inode+0xf0/0xf0
    read_pages+0x6b/0x1b0
    __do_page_cache_readahead+0x1ba/0x1d0
    force_page_cache_readahead+0x93/0x100
    generic_file_read_iter+0x83a/0xe40
    ? __seccomp_filter+0x7b/0x670
    new_sync_read+0x12a/0x1c0
    vfs_read+0x9d/0x150
    ksys_read+0x5f/0xe0
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    ...

    After this warning, following refcount API calls for the counter all fail
    to change the counter value.

    Fix this by initializing the reference counter of a new chunk work to
    one instead of zero, and do not call refcount_inc() via
    dmz_get_chunk_work() for newly created chunk works.
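
    A minimal sketch of the fix pattern (the cw variable stands for a
    chunk work and is illustrative, not the exact dm-zoned code):

    /* Before: the counter started at 0 and the first user bumped it,
     * which refcount_t rejects as a possible use-after-free:
     *   refcount_set(&cw->refcount, 0);
     *   ...
     *   refcount_inc(&cw->refcount);   // triggers "addition on 0"
     *
     * After: the creator owns the first reference, so a new chunk work
     * starts at 1 and refcount_inc() is only used for additional users. */
    refcount_set(&cw->refcount, 1);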

    The failure was observed with linux version 5.4 with CONFIG_REFCOUNT_FULL
    enabled. Refcount rework was merged to linux version 5.5 by the
    commit 168829ad09ca ("Merge branch 'locking-core-for-linus' of
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip"). After this
    commit, CONFIG_REFCOUNT_FULL was removed and the failure was observed
    regardless of kernel configuration.

    Linux version 4.20 merged the commit 092b5648760a ("dm zoned: target: use
    refcount_t for dm zoned reference counters"). Before this commit, dm
    zoned used atomic_t APIs, which do not check addition to zero, so this
    fix is not necessary there.

    Fixes: 092b5648760a ("dm zoned: target: use refcount_t for dm zoned reference counters")
    Cc: stable@vger.kernel.org # 5.4+
    Signed-off-by: Shin'ichiro Kawasaki
    Reviewed-by: Damien Le Moal
    Signed-off-by: Mike Snitzer

    Shin'ichiro Kawasaki
     
  • Verify the watermark upon resume - so that if the target is reloaded
    with lower watermark, it will start the cleanup process immediately.

    Fixes: 48debafe4f2f ("dm: add writecache target")
    Cc: stable@vger.kernel.org # 4.18+
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     
  • The function dm_suspended returns true if the target is suspended.
    However, when the target is being suspended during unload, it returns
    false.

    An example where this is a problem: the test "!dm_suspended(wc->ti)" in
    writecache_writeback is not sufficient, because dm_suspended returns
    zero while writecache_suspend is in progress. As is, without an
    enhanced dm_suspended, simply switching from flush_workqueue to
    drain_workqueue still emits warnings:
    workqueue writecache-writeback: drain_workqueue() isn't complete after 10 tries
    workqueue writecache-writeback: drain_workqueue() isn't complete after 100 tries
    workqueue writecache-writeback: drain_workqueue() isn't complete after 200 tries
    workqueue writecache-writeback: drain_workqueue() isn't complete after 300 tries
    workqueue writecache-writeback: drain_workqueue() isn't complete after 400 tries

    writecache_suspend calls flush_workqueue(wc->writeback_wq) - this function
    flushes the current work. However, the workqueue may re-queue itself and
    flush_workqueue doesn't wait for re-queued works to finish. Because of
    this - the function writecache_writeback continues execution after the
    device was suspended and then concurrently with writecache_dtr, causing
    a crash in writecache_writeback.

    We must use drain_workqueue - that waits until the work and all re-queued
    works finish.

    As a prereq for switching to drain_workqueue, this commit fixes
    dm_suspended to return true after the presuspend hook and before the
    postsuspend hook - just like during a normal suspend. It allows
    simplifying the dm-integrity and dm-writecache targets so that they
    don't have to maintain suspended flags on their own.

    With this change, drain_workqueue() can be used effectively. This
    change was tested with the lvm2 testsuite and the cryptsetup testsuite
    and there are no regressions.

    Fixes: 48debafe4f2f ("dm: add writecache target")
    Cc: stable@vger.kernel.org # 4.18+
    Reported-by: Corey Marthaler
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     
  • [ 3934.173244] ======================================================
    [ 3934.179572] WARNING: possible circular locking dependency detected
    [ 3934.185884] 5.4.21-xfstests #1 Not tainted
    [ 3934.190151] ------------------------------------------------------
    [ 3934.196673] dmsetup/8897 is trying to acquire lock:
    [ 3934.201688] ffffffffbce82b18 (shrinker_rwsem){++++}, at: unregister_shrinker+0x22/0x80
    [ 3934.210268]
    but task is already holding lock:
    [ 3934.216489] ffff92a10cc5e1d0 (&pmd->root_lock){++++}, at: dm_pool_metadata_close+0xba/0x120
    [ 3934.225083]
    which lock already depends on the new lock.

    [ 3934.564165] Chain exists of:
    shrinker_rwsem --> &journal->j_checkpoint_mutex --> &pmd->root_lock

    For a more detailed lockdep report, please see:

    https://lore.kernel.org/r/20200220234519.GA620489@mit.edu

    We shouldn't need to hold the lock while we are just tearing down and
    freeing the whole metadata pool structure.

    Fixes: 44d8ebf436399a4 ("dm thin metadata: use pool locking at end of dm_pool_metadata_close")
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Mike Snitzer

    Theodore Ts'o
     
  • The crash can be reproduced by running the lvm2 testsuite test
    lvconvert-thin-external-cache.sh for several minutes, e.g.:
    while :; do make check T=shell/lvconvert-thin-external-cache.sh; done

    The crash happens in this call chain:
    do_waker -> policy_tick -> smq_tick -> end_hotspot_period -> clear_bitset
    -> memset -> __memset -- which accesses an invalid pointer in the vmalloc
    area.

    The work entry on the workqueue is executed even after the bitmap was
    freed. The problem is that cancel_delayed_work doesn't wait for the
    running work item to finish, so the work item can continue running and
    re-submitting itself even after cache_postsuspend. In order to make sure
    that the work item won't be running, we must use cancel_delayed_work_sync.

    Also, change flush_workqueue to drain_workqueue, so that if some work item
    submits itself or another work item, we are properly waiting for both of
    them.
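
    The shape of the fix, as a hedged sketch (the waker and wq names are
    illustrative):

    /* Before: neither call guarantees the work has stopped running or
     * that re-submitted items have finished:
     *   cancel_delayed_work(&cache->waker);
     *   flush_workqueue(cache->wq);
     *
     * After: wait for the running instance to finish, then drain the
     * queue so self-requeueing work items are waited for as well. */
    cancel_delayed_work_sync(&cache->waker);
    drain_workqueue(cache->wq);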

    Fixes: c6b4fcbad044 ("dm: add cache target")
    Cc: stable@vger.kernel.org # v3.9
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     

26 Feb, 2020

3 commits

  • If the flag SB_FLAG_RECALCULATE is present in the superblock, but it was
    not specified on the command line (i.e. ic->recalculate_flag is false),
    dm-integrity would return an invalid table line - the reported number of
    arguments would not match the real number.

    Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode")
    Cc: stable@vger.kernel.org # v5.2+
    Reported-by: Ondrej Kozina
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     
  • If we need to perform synchronous I/O in dm_integrity_map_continue(),
    we must make sure that we are not in the map function - in order to
    avoid the deadlock due to bio queuing in generic_make_request. To
    avoid the deadlock, we offload the request to metadata_wq.

    However, metadata_wq also processes metadata updates for write requests.
    If there are too many requests that get offloaded to metadata_wq at the
    beginning of dm_integrity_map_continue, the workqueue metadata_wq
    becomes clogged and the system is incapable of processing any metadata
    updates.

    This causes a deadlock because all the requests that need to do metadata
    updates wait for metadata_wq to proceed and metadata_wq waits inside
    wait_and_add_new_range until some existing request releases its range
    lock (which doesn't happen because the range lock is released after
    metadata update).

    In order to fix the deadlock, we create a new workqueue offload_wq and
    offload requests to it - so that processing of offload_wq is independent
    from processing of metadata_wq.
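
    Conceptually, the fix gives map-time offloads their own queue so they
    no longer compete with metadata updates; a rough sketch (the names and
    flags are assumptions, not the exact dm-integrity code):

    /* A dedicated workqueue for requests offloaded from the map path. */
    ic->offload_wq = alloc_workqueue("dm-integrity-offload",
                                     WQ_MEM_RECLAIM, 0);
    if (!ic->offload_wq)
            return -ENOMEM;

    /* Offloaded requests now go here instead of metadata_wq, so a burst
     * of offloads can no longer starve metadata updates. */
    queue_work(ic->offload_wq, &dio->work);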

    Fixes: 7eada909bfd7 ("dm: add integrity target")
    Cc: stable@vger.kernel.org # v4.12+
    Reported-by: Heinz Mauelshagen
    Tested-by: Heinz Mauelshagen
    Signed-off-by: Heinz Mauelshagen
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     
  • If we resume a device in bitmap mode and the on-disk format is in journal
    mode, we must recalculate anything above ic->sb->recalc_sector. Otherwise,
    there would be non-recalculated blocks which would cause I/O errors.

    Fixes: 468dfca38b1a ("dm integrity: add a bitmap mode")
    Cc: stable@vger.kernel.org # v5.2+
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer

    Mikulas Patocka
     

17 Feb, 2020

1 commit

  • Pull block fixes from Jens Axboe:
    "Not a lot here, which is great, basically just three small bcache
    fixes from Coly, and four NVMe fixes via Keith"

    * tag 'block-5.6-2020-02-16' of git://git.kernel.dk/linux-block:
    nvme: fix the parameter order for nvme_get_log in nvme_get_fw_slot_info
    nvme/pci: move cqe check after device shutdown
    nvme: prevent warning triggered by nvme_stop_keep_alive
    nvme/tcp: fix bug on double requeue when send fails
    bcache: remove macro nr_to_fifo_front()
    bcache: Revert "bcache: shrink btree node cache after bch_btree_check()"
    bcache: ignore pending signals when creating gc and allocator thread

    Linus Torvalds
     

13 Feb, 2020

3 commits

  • The macro nr_to_fifo_front() is only used once, in btree_flush_write(),
    so it is unnecessary. This patch removes the macro and does the
    calculation directly in place.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     
  • This reverts commit 1df3877ff6a4810054237c3259d900ded4468969.

    In my testing, sometimes even when all the cached btree nodes are freed,
    creating the gc and allocator kernel threads may still fail. It finally
    turns out that kthread_run() may fail if there is a pending signal for
    the current task, and the pending signal is sent by the OOM killer,
    which is triggered by memory consumption in bch_btree_check().

    Therefore explicitly shrinking the bcache btree node cache here does
    not help. Now that the shrinker callback is improved, and pending
    signals are ignored before creating kernel threads, this operation is
    no longer necessary.

    This patch reverts commit 1df3877ff6a4 ("bcache: shrink btree node
    cache after bch_btree_check()") because we have a better improvement
    now.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     
  • When running a cache set, all the bcache btree nodes of this cache set
    are checked by bch_btree_check(). If the bcache btree is very large,
    iterating all the btree nodes will occupy too much system memory and
    the bcache registering process might be selected and killed by the
    system OOM killer. kthread_run() will fail if the current process has
    a pending signal, therefore the kthread creation in run_cache_set()
    for the gc and allocator kernel threads will very probably fail for a
    very large bcache btree.

    Indeed such an OOM is safe and the registering process will exit after
    the registration is done. Therefore this patch flushes pending signals
    during cache set start up, specifically in bch_cache_allocator_start()
    and bch_gc_thread_start(), to make sure run_cache_set() won't fail for
    a large cached data set.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     

09 Feb, 2020

1 commit

  • Pull misc vfs updates from Al Viro:

    - bmap series from cmaiolino

    - getting rid of convolutions in copy_mount_options() (use a couple of
    copy_from_user() instead of the __get_user() crap)

    * 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    saner copy_mount_options()
    fibmap: Reject negative block numbers
    fibmap: Use bmap instead of ->bmap method in ioctl_fibmap
    ecryptfs: drop direct calls to ->bmap
    cachefiles: drop direct usage of ->bmap method.
    fs: Enable bmap() function to properly return errors

    Linus Torvalds
     

06 Feb, 2020

1 commit

  • Pull more block updates from Jens Axboe:
    "Some later arrivals, but all fixes at this point:

    - bcache fix series (Coly)

    - Series of BFQ fixes (Paolo)

    - NVMe pull request from Keith with a few minor NVMe fixes

    - Various little tweaks"

    * tag 'block-5.6-2020-02-05' of git://git.kernel.dk/linux-block: (23 commits)
    nvmet: update AEN list and array at one place
    nvmet: Fix controller use after free
    nvmet: Fix error print message at nvmet_install_queue function
    brd: check and limit max_part par
    nvme-pci: remove nvmeq->tags
    nvmet: fix dsm failure when payload does not match sgl descriptor
    nvmet: Pass lockdep expression to RCU lists
    block, bfq: clarify the goal of bfq_split_bfqq()
    block, bfq: get a ref to a group when adding it to a service tree
    block, bfq: remove ifdefs from around gets/puts of bfq groups
    block, bfq: extend incomplete name of field on_st
    block, bfq: get extra ref to prevent a queue from being freed during a group move
    block, bfq: do not insert oom queue into position tree
    block, bfq: do not plug I/O for bfq_queues with no proc refs
    bcache: check return value of prio_read()
    bcache: fix incorrect data type usage in btree_flush_write()
    bcache: add readahead cache policy options via sysfs interface
    bcache: explicity type cast in bset_bkey_last()
    bcache: fix memory corruption in bch_cache_accounting_clear()
    xen/blkfront: limit allocated memory size to actual use case
    ...

    Linus Torvalds
     

04 Feb, 2020

1 commit

  • The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
    seq_file.h.

    Conversion rule is:

    llseek => proc_lseek
    unlocked_ioctl => proc_ioctl

    xxx => proc_xxx

    delete ".owner = THIS_MODULE" line

    [akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
    [sfr@canb.auug.org.au: fix kernel/sched/psi.c]
    Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

03 Feb, 2020

1 commit

  • Currently, bmap() either returns the physical block number related to
    the requested file offset, or 0 if an error occurs or if the requested
    offset maps into a hole.
    This patch makes the changes needed to enable bmap() to properly
    return errors, using the return value as the error return; a pointer
    must now be passed to bmap() to be filled with the mapped physical
    block.

    The behavior of bmap() on return is now:

    - negative value in case of error
    - zero on success, or if the mapping falls into a hole

    In case of a hole, *block will be zero too.

    Since this is a prep patch, for now the only error returned is -EINVAL
    if ->bmap doesn't exist.
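
    A caller-side sketch of the new calling convention (the inode and
    logical_block variables here are illustrative):

    sector_t block = logical_block;   /* in: block to map; out: mapped block */
    int ret = bmap(inode, &block);

    if (ret)          /* negative error; for now only -EINVAL */
            return ret;
    if (block == 0) {
            /* success, but the requested offset falls into a hole */
    }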

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Carlos Maiolino
    Signed-off-by: Al Viro

    Carlos Maiolino
     

01 Feb, 2020

2 commits

  • Now if prio_read() fails while starting a cache set, we can print
    out an error message in run_cache_set() and handle the failure properly.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li
     
  • Dan Carpenter points out that from commit 2aa8c529387c ("bcache: avoid
    unnecessary btree nodes flushing in btree_flush_write()"), there is an
    incorrect data type usage which leads to the following static checker
    warning:

    drivers/md/bcache/journal.c:444 btree_flush_write()
    warn: 'ref_nr' unsigned <= 0

    The relevant code in btree_flush_write() reads the refcount of the
    oldest journal entry into the local variable ref_nr and checks it
    against zero:

    spin_lock(&c->journal.flush_write_lock);
    if (c->journal.btree_flushing) {
        spin_unlock(&c->journal.flush_write_lock);
        return;
    }
    c->journal.btree_flushing = true;
    spin_unlock(&c->journal.flush_write_lock);

    /* get the oldest journal entry and check its refcount */
    spin_lock(&c->journal.lock);
    fifo_front_p = &fifo_front(&c->journal.pin);
    ref_nr = atomic_read(fifo_front_p);
    if (ref_nr <= 0) {
        spin_unlock(&c->journal.lock);
        goto out;
    }
    spin_unlock(&c->journal.lock);

    As the warning indicates, declaring the local variable ref_nr as
    unsigned int is wrong: it does not match the int value returned by
    atomic_read(), and the "ref_nr <= 0" check cannot work as intended
    for an unsigned type. Fix this by declaring ref_nr as int.

    Signed-off-by: Coly Li
    Signed-off-by: Jens Axboe

    Coly Li