21 Nov, 2020

1 commit

  • If there is only one keyslot, then blk_ksm_init() computes
    slot_hashtable_size=1 and log_slot_ht_size=0. This causes
    blk_ksm_find_keyslot() to crash later because it uses
    hash_ptr(key, log_slot_ht_size) to find the hash bucket containing the
    key, and hash_ptr() doesn't support the bits == 0 case.

    Fix this by making the hash table always have at least 2 buckets.

    Tested by running:

    kvm-xfstests -c ext4 -g encrypt -m inlinecrypt \
    -o blk-crypto-fallback.num_keyslots=1

    Fixes: 1b2628397058 ("block: Keyslot Manager for Inline Encryption")
    Signed-off-by: Eric Biggers
    Signed-off-by: Jens Axboe

    Eric Biggers
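
    The sizing rule described in this fix can be sketched in userspace C. This is
    only an illustration: the helper names are hypothetical, and rounding the slot
    count up to a power of two is an assumption that is consistent with the
    slot_hashtable_size=1 / log_slot_ht_size=0 values quoted above, not a copy of
    the kernel code.

```c
#include <assert.h>

/* Userspace stand-in for rounding up to a power of two (assumed sizing rule). */
static unsigned int round_up_pow2(unsigned int n)
{
	unsigned int p = 1;

	assert(n >= 1 && n <= (1u << 31));
	while (p < n)
		p <<= 1;
	return p;
}

/* Integer log2 of a value that is already a power of two. */
static unsigned int log2_pow2(unsigned int n)
{
	unsigned int bits = 0;

	while (n > 1) {
		n >>= 1;
		bits++;
	}
	return bits;
}

/*
 * Hypothetical helper: number of hash bits for a given keyslot count.
 * The table is clamped to at least 2 buckets so the result is never 0,
 * which is the case hash_ptr() does not support.
 */
static unsigned int keyslot_hash_bits(unsigned int num_slots)
{
	unsigned int size = round_up_pow2(num_slots);

	if (size < 2)
		size = 2;
	return log2_pow2(size);
}
```

    With this clamp, a single keyslot yields 1 hash bit (2 buckets) instead of 0.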
     

15 Nov, 2020

1 commit


14 Nov, 2020

1 commit

  • To avoid a use-after-free on the flush request, we call its .end_io() from
    both the timeout code path and __blk_mq_end_request().

    While the flush request's refcount has not dropped to zero, the request is
    still in use and can't be marked as IDLE. Fix this by marking it IDLE only
    when its refcount really drops to zero.

    Fixes: 65ff5cd04551 ("blk-mq: mark flush request as IDLE in flush_end_io()")
    Signed-off-by: Ming Lei
    Cc: Yi Zhang
    Signed-off-by: Jens Axboe

    Ming Lei
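
    The rule in this fix (only the final reference drop may mark the request IDLE)
    can be illustrated with a small userspace sketch. The struct, enum, and
    function names below are hypothetical, and a plain int stands in for the
    atomic refcount the kernel would use.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_rq_state { TOY_RQ_IDLE, TOY_RQ_IN_FLIGHT, TOY_RQ_COMPLETE };

/* Hypothetical stand-in for a flush request with a reference count. */
struct toy_flush_rq {
	int ref;			/* not atomic: illustration only */
	enum toy_rq_state state;
};

/*
 * Drop one reference. The state is changed to IDLE only when the last
 * reference goes away; while another path (for example a timeout handler)
 * still holds a reference, the state is left untouched.
 * Returns true if this call released the final reference.
 */
static bool toy_flush_rq_put(struct toy_flush_rq *rq)
{
	assert(rq->ref > 0);
	if (--rq->ref > 0)
		return false;
	rq->state = TOY_RQ_IDLE;
	return true;
}
```

    In this sketch, a request held by two paths stays COMPLETE after the first
    put and becomes IDLE only after the second.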
     

13 Nov, 2020

1 commit


30 Oct, 2020

1 commit

  • Mark flush request as IDLE in its .end_io(), aligning it with how normal
    requests behave. The flush request stays in in-flight tags if we're not
    using an IO scheduler, so we need to change its state into IDLE.
    Otherwise, we will hang in blk_mq_tagset_wait_completed_request() during
    error recovery because the flush request state is kept as COMPLETED.

    Reported-by: Yi Zhang
    Signed-off-by: Ming Lei
    Tested-by: Yi Zhang
    Cc: Chao Leng
    Cc: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Ming Lei
     

28 Oct, 2020

1 commit

  • When the bio's size reaches max_append_sectors, bio_add_hw_page returns
    0 and __bio_iov_append_get_pages then returns -EINVAL. This is an expected
    result of building a bio small enough not to be split in the IO path.
    However, iov_iter is not advanced in this case, so the same pages are
    added to the bio again and again.

    Fix the case by properly advancing the iov_iter for already processed
    pages.

    Fixes: 0512a75b98f8 ("block: Introduce REQ_OP_ZONE_APPEND")
    Cc: stable@vger.kernel.org # 5.8+
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Naohiro Aota
    Signed-off-by: Jens Axboe

    Naohiro Aota
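
    A minimal userspace sketch of the idea behind this fix: even when adding
    pages stops early because the bio is full, the iterator is advanced by the
    bytes that were already added. All types and names here are hypothetical
    stand-ins (byte_iter for iov_iter, max_size for the max_append_sectors
    limit), not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u

struct byte_iter {		/* hypothetical stand-in for an iov_iter */
	size_t offset;		/* bytes already consumed */
	size_t remaining;	/* bytes still to consume */
};

struct toy_bio {
	size_t size;		/* bytes added so far */
	size_t max_size;	/* cap, analogous to the append size limit */
};

static void byte_iter_advance(struct byte_iter *it, size_t n)
{
	assert(n <= it->remaining);
	it->offset += n;
	it->remaining -= n;
}

/*
 * Try to add up to batch_pages pages to the bio. Returns 0 on success or
 * -EINVAL once the bio cannot take another page. In both cases the iterator
 * is advanced by what was actually added, so a retry starts at fresh data.
 */
static int toy_bio_append_pages(struct toy_bio *bio, struct byte_iter *it,
				unsigned int batch_pages)
{
	size_t added = 0;
	int ret = 0;

	for (unsigned int i = 0; i < batch_pages && added < it->remaining; i++) {
		size_t len = it->remaining - added;

		if (len > TOY_PAGE_SIZE)
			len = TOY_PAGE_SIZE;
		if (bio->size + len > bio->max_size) {
			ret = -EINVAL;
			break;
		}
		bio->size += len;
		added += len;
	}
	byte_iter_advance(it, added);
	return ret;
}
```

    Without the unconditional advance at the end, a caller that builds the next
    bio after -EINVAL would start again from the same offset.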
     

26 Oct, 2020

2 commits

  • Similarly to commit 457e490f2b741 ("blkcg: allocate struct blkcg_gq
    outside request queue spinlock"), blkg_create can also trigger
    occasional -ENOMEM failures at the radix insertion because any
    allocation inside blkg_create has to be non-blocking, making it more
    likely to fail. This causes trouble for userspace tools trying to
    configure io weights, which then need to deal with this condition.

    This patch reduces the occurrence of -ENOMEMs on this path by preloading
    the radix tree element on a GFP_KERNEL context, such that we guarantee
    the later non-blocking insertion won't fail.

    A similar solution exists in blkcg_init_queue for the same situation.

    Acked-by: Tejun Heo
    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
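
    The preload pattern described here (allocate in a context that may block,
    then insert later without allocating) can be shown with a userspace analogy.
    Everything below is a hypothetical illustration using a linked list and
    malloc(); in the kernel the corresponding radix tree helpers are
    radix_tree_preload() and radix_tree_preload_end().

```c
#include <assert.h>
#include <stdlib.h>

struct toy_node {
	int key;
	struct toy_node *next;
};

/* Holds one node allocated ahead of time. */
struct toy_preload {
	struct toy_node *spare;
};

/*
 * Step 1: may block or fail. Call this before taking any lock, where a
 * blocking allocation is allowed. Returns 0 on success, -1 on failure.
 */
static int toy_preload_node(struct toy_preload *p)
{
	if (!p->spare)
		p->spare = malloc(sizeof(*p->spare));
	return p->spare ? 0 : -1;
}

/*
 * Step 2: never allocates, so it cannot fail for lack of memory. Intended
 * to run inside the critical section, consuming the preloaded node.
 */
static void toy_insert_preloaded(struct toy_node **head,
				 struct toy_preload *p, int key)
{
	struct toy_node *n = p->spare;

	assert(n != NULL);
	p->spare = NULL;
	n->key = key;
	n->next = *head;
	*head = n;
}
```

    The allocation failure, if any, is reported in step 1, where the caller can
    still sleep or retry, rather than in the non-blocking insertion.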
     
  • If new_blkg allocation raced with blk_policy change and
    blkg_lookup_check fails, new_blkg is leaked.

    Acked-by: Tejun Heo
    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
     

25 Oct, 2020

1 commit

  • Pull block fixes from Jens Axboe:

    - NVMe pull request from Christoph
    - rdma error handling fixes (Chao Leng)
    - fc error handling and reconnect fixes (James Smart)
    - fix the qid displayed when tracing ioctl commands (Keith Busch)
    - don't use BLK_MQ_REQ_NOWAIT for passthru (Chaitanya Kulkarni)
    - fix MDTS for passthru (Logan Gunthorpe)
    - blacklist Write Same on more devices (Kai-Heng Feng)
    - fix an uninitialized work struct (zhenwei pi)

    - lightnvm out-of-bounds fix (Colin)

    - SG allocation leak fix (Doug)

    - rnbd fixes (Gioh, Guoqing, Jack)

    - zone error translation fixes (Keith)

    - kerneldoc markup fix (Mauro)

    - zram lockdep fix (Peter)

    - Kill unused io_context members (Yufen)

    - NUMA memory allocation cleanup (Xianting)

    - NBD config wakeup fix (Xiubo)

    * tag 'block-5.10-2020-10-24' of git://git.kernel.dk/linux-block: (27 commits)
    block: blk-mq: fix a kernel-doc markup
    nvme-fc: shorten reconnect delay if possible for FC
    nvme-fc: wait for queues to freeze before calling update_hr_hw_queues
    nvme-fc: fix error loop in create_hw_io_queues
    nvme-fc: fix io timeout to abort I/O
    null_blk: use zone status for max active/open
    nvmet: don't use BLK_MQ_REQ_NOWAIT for passthru
    nvmet: cleanup nvmet_passthru_map_sg()
    nvmet: limit passthru MTDS by BIO_MAX_PAGES
    nvmet: fix uninitialized work for zero kato
    nvme-pci: disable Write Zeroes on Sandisk Skyhawk
    nvme: use queuedata for nvme_req_qid
    nvme-rdma: fix crash due to incorrect cqe
    nvme-rdma: fix crash when connect rejected
    block: remove unused members for io_context
    blk-mq: remove the calling of local_memory_node()
    zram: Fix __zram_bvec_{read,write}() locking order
    skd_main: remove unused including
    sgl_alloc_order: fix memory leak
    lightnvm: fix out-of-bounds write to array devices->info[]
    ...

    Linus Torvalds
     

24 Oct, 2020

1 commit


20 Oct, 2020

1 commit

  • We don't need to check whether the node is a memoryless NUMA node before
    calling the allocator interface. SLUB (and SLAB, SLOB) relies on the page
    allocator to pick a node. The page allocator should deal with memoryless
    nodes just fine: it has zonelists constructed for each possible node,
    and it will automatically fall back to the node closest to the
    requested node, as long as __GFP_THISNODE is not enforced, of course.

    The code comment on SLAB's kmem_cache_alloc_node() also states this:
    * Fallback to other node is possible if __GFP_THISNODE is not set.

    blk-mq code doesn't set __GFP_THISNODE, so we can remove the call
    to local_memory_node().

    Signed-off-by: Xianting Tian
    Signed-off-by: Jens Axboe

    Xianting Tian
     

15 Oct, 2020

3 commits

  • Fix this warning:

    ./block/bio.c:1098: WARNING: Inline emphasis start-string without end-string.

    The problem is that *iter is not valid markup.

    That seems to be a typo:
    *iter -> @iter

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • Using "@bio's parent" causes the following warning:
    ./block/bio.c:10: WARNING: Inline emphasis start-string without end-string.

    The main problem here is that this would be converted into:

    **bio**'s parent

    by kernel-doc, which is not a valid notation. It would be
    possible to use, instead, this kernel-doc markup:

    ``bio's`` parent

    Yet here it is probably simpler to just use alternative wording:

    the parent of @bio

    Signed-off-by: Mauro Carvalho Chehab

    Mauro Carvalho Chehab
     
  • …/device-mapper/linux-dm

    Pull device mapper updates from Mike Snitzer:

    - Improve DM core's bio splitting to use blk_max_size_offset(). Also
    fix bio splitting for bios that were deferred to the worker thread
    due to a DM device being suspended.

    - Remove DM core's special handling of NVMe devices now that block core
    has internalized efficiencies drivers previously needed to be
    concerned about (via now removed direct_make_request).

    - Fix request-based DM to not bounce through indirect dm_submit_bio;
    instead have block core make direct call to blk_mq_submit_bio().

    - Various DM core cleanups to simplify and improve code.

    - Update DM crypt to not use drivers that set
    CRYPTO_ALG_ALLOCATES_MEMORY.

    - Fix DM raid's raid1 and raid10 discard limits for the purposes of
    linux-stable. But then remove DM raid's discard limits settings now
    that MD raid can efficiently handle large discards.

    - A couple small cleanups across various targets.

    * tag 'for-5.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm: fix request-based DM to not bounce through indirect dm_submit_bio
    dm: remove special-casing of bio-based immutable singleton target on NVMe
    dm: export dm_copy_name_and_uuid
    dm: fix comment in __dm_suspend()
    dm: fold dm_process_bio() into dm_submit_bio()
    dm: fix missing imposition of queue_limits from dm_wq_work() thread
    dm snap persistent: simplify area_io()
    dm thin metadata: Remove unused local variable when create thin and snap
    dm raid: remove unnecessary discard limits for raid10
    dm raid: fix discard limits for raid1 and raid10
    dm crypt: don't use drivers that have CRYPTO_ALG_ALLOCATES_MEMORY
    dm: use dm_table_get_device_name() where appropriate in targets
    dm table: make 'struct dm_table' definition accessible to all of DM core
    dm: eliminate need for start_io_acct() forward declaration
    dm: simplify __process_abnormal_io()
    dm: push use of on-stack flush_bio down to __send_empty_flush()
    dm: optimize max_io_len() by inlining max_io_len_target_boundary()
    dm: push md->immutable_target optimization down to __process_bio()
    dm: change max_io_len() to use blk_max_size_offset()
    dm table: stack 'chunk_sectors' limit to account for target-specific splitting

    Linus Torvalds
     

14 Oct, 2020

3 commits

  • A zoned device with limited resources to open or activate zones may
    return an error when the host exceeds those limits. The same command may
    be successful if retried later, but the host needs to wait for specific
    zone states before it should expect a retry to succeed. Have the block
    layer provide an appropriate status for these conditions so applications
    can distinguish this error for special handling.

    Cc: linux-api@vger.kernel.org
    Cc: Niklas Cassel
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Damien Le Moal
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     
  • Pull block driver updates from Jens Axboe:
    "Here are the driver updates for 5.10.

    A few SCSI updates in here too, in coordination with Martin as they
    depend on core block changes for the shared tag bitmap.

    This contains:

    - NVMe pull requests via Christoph:
    - fix keep alive timer modification (Amit Engel)
    - order the PCI ID list more sensibly (Andy Shevchenko)
    - cleanup the open by controller helper (Chaitanya Kulkarni)
    - use an xarray for the CSE log lookup (Chaitanya Kulkarni)
    - support ZNS in nvmet passthrough mode (Chaitanya Kulkarni)
    - fix nvme_ns_report_zones (Christoph Hellwig)
    - add a sanity check to nvmet-fc (James Smart)
    - fix interrupt allocation when too many polled queues are
    specified (Jeffle Xu)
    - small nvmet-tcp optimization (Mark Wunderlich)
    - fix a controller refcount leak on init failure (Chaitanya
    Kulkarni)
    - misc cleanups (Chaitanya Kulkarni)
    - major refactoring of the scanning code (Christoph Hellwig)

    - MD updates via Song:
    - Bug fixes in bitmap code, from Zhao Heming
    - Fix a work queue check, from Guoqing Jiang
    - Fix raid5 oops with reshape, from Song Liu
    - Clean up unused code, from Jason Yan
    - Discard improvements, from Xiao Ni
    - raid5/6 page offset support, from Yufen Yu

    - Shared tag bitmap for SCSI/hisi_sas/null_blk (John, Kashyap,
    Hannes)

    - null_blk open/active zone limit support (Niklas)

    - Set of bcache updates (Coly, Dongsheng, Qinglang)"

    * tag 'drivers-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (78 commits)
    md/raid5: fix oops during stripe resizing
    md/bitmap: fix memory leak of temporary bitmap
    md: fix the checking of wrong work queue
    md/bitmap: md_bitmap_get_counter returns wrong blocks
    md/bitmap: md_bitmap_read_sb uses wrong bitmap blocks
    md/raid0: remove unused function is_io_in_chunk_boundary()
    nvme-core: remove extra condition for vwc
    nvme-core: remove extra variable
    nvme: remove nvme_identify_ns_list
    nvme: refactor nvme_validate_ns
    nvme: move nvme_validate_ns
    nvme: query namespace identifiers before adding the namespace
    nvme: revalidate zone bitmaps in nvme_update_ns_info
    nvme: remove nvme_update_formats
    nvme: update the known admin effects
    nvme: set the queue limits in nvme_update_ns_info
    nvme: remove the 0 lba_shift check in nvme_update_ns_info
    nvme: clean up the check for too large logic block sizes
    nvme: freeze the queue over ->lba_shift updates
    nvme: factor out a nvme_configure_metadata helper
    ...

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:

    - Series of merge handling cleanups (Baolin, Christoph)

    - Series of blk-throttle fixes and cleanups (Baolin)

    - Series cleaning up BDI, separating the block device from the
    backing_dev_info (Christoph)

    - Removal of bdget() as a generic API (Christoph)

    - Removal of blkdev_get() as a generic API (Christoph)

    - Cleanup of is-partition checks (Christoph)

    - Series reworking disk revalidation (Christoph)

    - Series cleaning up bio flags (Christoph)

    - bio crypt fixes (Eric)

    - IO stats inflight tweak (Gabriel)

    - blk-mq tags fixes (Hannes)

    - Buffer invalidation fixes (Jan)

    - Allow soft limits for zone append (Johannes)

    - Shared tag set improvements (John, Kashyap)

    - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel)

    - DM no-wait support (Mike, Konstantin)

    - Request allocation improvements (Ming)

    - Allow md/dm/bcache to use IO stat helpers (Song)

    - Series improving blk-iocost (Tejun)

    - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang,
    Xianting, Yang, Yufen, yangerkun)

    * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits)
    block: fix uapi blkzoned.h comments
    blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue
    blk-mq: get rid of the dead flush handle code path
    block: get rid of unnecessary local variable
    block: fix comment and add lockdep assert
    blk-mq: use helper function to test hw stopped
    block: use helper function to test queue register
    block: remove redundant mq check
    block: invoke blk_mq_exit_sched no matter whether have .exit_sched
    percpu_ref: don't refer to ref->data if it isn't allocated
    block: ratelimit handle_bad_sector() message
    blk-throttle: Re-use the throtl_set_slice_end()
    blk-throttle: Open code __throtl_de/enqueue_tg()
    blk-throttle: Move service tree validation out of the throtl_rb_first()
    blk-throttle: Move the list operation after list validation
    blk-throttle: Fix IO hang for a corner case
    blk-throttle: Avoid tracking latency if low limit is invalid
    blk-throttle: Avoid getting the current time if tg->last_finish_time is 0
    blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
    block: Remove redundant 'return' statement
    ...

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Pull compat iovec cleanups from Al Viro:
    "Christoph's series around import_iovec() and compat variant thereof"

    * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    security/keys: remove compat_keyctl_instantiate_key_iov
    mm: remove compat_process_vm_{readv,writev}
    fs: remove compat_sys_vmsplice
    fs: remove the compat readv/writev syscalls
    fs: remove various compat readv/writev helpers
    iov_iter: transparently handle compat iovecs in import_iovec
    iov_iter: refactor rw_copy_check_uvector and import_iovec
    iov_iter: move rw_copy_check_uvector() into lib/iov_iter.c
    compat.h: fix a spelling error in

    Linus Torvalds
     

10 Oct, 2020

8 commits


09 Oct, 2020

2 commits

  • Pull block fixes from Jens Axboe:
    "A few fixes that should go into this release:

    - NVMe controller error path reference fix (Chaitanya)

    - Fix regression with IBM partitions on non-dasd devices (Christoph)

    - Fix a missing clear in the compat CDROM packet structure (Peilin)"

    * tag 'block5.9-2020-10-08' of git://git.kernel.dk/linux-block:
    partitions/ibm: fix non-DASD devices
    nvme-core: put ctrl ref when module ref get fail
    block/scsi-ioctl: Fix kernel-infoleak in scsi_put_cdrom_generic_arg()

    Linus Torvalds
     
  • syzbot is reporting an unkillable task [1], because the caller fails to
    handle a corrupted filesystem image which attempts to access beyond
    the end of the device. While we need to fix the caller, flooding the
    console with handle_bad_sector() messages is unlikely to be useful.

    [1] https://syzkaller.appspot.com/bug?id=f1f49fb971d7a3e01bd8ab8cff2ff4572ccf3092

    Signed-off-by: Tetsuo Handa
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Tetsuo Handa
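
    The general technique of rate limiting a repeated message can be sketched as
    a simple window-and-burst limiter. This is a userspace illustration with
    hypothetical names, not the kernel's implementation; the kernel provides
    printk helpers such as pr_info_ratelimited() for this purpose, and which
    helper this commit uses is not stated above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limiter: at most `burst` messages per `interval` ticks. */
struct toy_ratelimit {
	long interval;		/* window length, in caller-defined ticks */
	int burst;		/* messages allowed per window */
	long window_start;	/* tick at which the current window began */
	int printed;		/* messages allowed in the current window */
	int missed;		/* messages suppressed so far */
};

/* Returns true if a message may be printed at time `now`. */
static bool toy_ratelimit_allow(struct toy_ratelimit *rl, long now)
{
	if (now - rl->window_start >= rl->interval) {
		/* A new window starts: reset the per-window counter. */
		rl->window_start = now;
		rl->printed = 0;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}
```

    A caller would wrap the message in a check such as
    `if (toy_ratelimit_allow(&rl, now)) print_message();`, so a flood of bad
    requests produces only a bounded number of console lines per window.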
     

08 Oct, 2020

10 commits


07 Oct, 2020

2 commits

  • Don't error out if the dasd_biodasdinfo symbol is not available.

    Cc: stable@vger.kernel.org
    Fixes: 26d7e28e3820 ("s390/dasd: remove ioctl_by_bdev calls")
    Reported-by: Christian Borntraeger
    Signed-off-by: Christoph Hellwig
    Tested-by: Christian Borntraeger
    Reviewed-by: Stefan Haberland
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • According to Documentation/block/stat.rst, inflight should not include
    I/O requests that are in the queue but not yet dispatched to the device,
    but blk-mq identifies as inflight any request that has a tag allocated,
    which, for queues without elevator, happens at request allocation time
    and before it is queued in the ctx (default case in blk_mq_submit_bio).

    In addition, current behavior is different for queues with elevator from
    queues without it, since for the former the driver tag is allocated at
    dispatch time. A more precise approach would be to only consider
    requests with state MQ_RQ_IN_FLIGHT.

    This effectively reverts commit 6131837b1de6 ("blk-mq: count allocated
    but not started requests in iostats inflight") to consolidate blk-mq
    behavior with itself (elevator case) and with original documentation,
    but it differs from the behavior used by the legacy path.

    This version differs from v1 by using blk_mq_rq_state to access the
    state attribute. Avoid using blk_mq_request_started, which was
    suggested, since we don't want to include MQ_RQ_COMPLETE.

    Signed-off-by: Gabriel Krisman Bertazi
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Gabriel Krisman Bertazi
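
    The difference between the two accounting rules can be illustrated with a
    small userspace sketch. The request struct and both counting functions are
    hypothetical; only the state names mirror the ones mentioned in the commit
    message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

/* Hypothetical, simplified request: a state plus "has a tag allocated". */
struct toy_request {
	enum mq_rq_state state;
	bool has_tag;
};

/* Old rule in this sketch: anything with a tag counts as inflight. */
static unsigned int count_tagged(const struct toy_request *rqs, size_t n)
{
	unsigned int count = 0;

	for (size_t i = 0; i < n; i++)
		if (rqs[i].has_tag)
			count++;
	return count;
}

/* New rule in this sketch: only requests in MQ_RQ_IN_FLIGHT count. */
static unsigned int count_in_flight(const struct toy_request *rqs, size_t n)
{
	unsigned int count = 0;

	for (size_t i = 0; i < n; i++)
		if (rqs[i].state == MQ_RQ_IN_FLIGHT)
			count++;
	return count;
}
```

    A request that holds a tag but has not been dispatched yet, or one that has
    already completed, is counted by the first function and excluded by the
    second.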