23 Sep, 2022

1 commit

  • [ Upstream commit 56f99b8d06ef1ed1c9730948f9f05ac2b930a20b ]

    Today blk_queue_enter() and __bio_queue_enter() return -EBUSY for the
    nowait code path. This is not correct: they should return -EAGAIN
    instead.

    This problem was detected by fio. The following command exposed the
    above problem:

    t/io_uring -p0 -d128 -b4096 -s32 -c32 -F1 -B0 -R0 -X1 -n24 -P1 -u1 -O0 /dev/ng0n1

    By applying the patch, the retry case is handled correctly in the slow
    path.

    Signed-off-by: Stefan Roesch
    Fixes: bfd343aa1718 ("blk-mq: don't wait in blk_mq_queue_enter() if __GFP_WAIT isn't set")
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Stefan Roesch
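
    The return-code change above can be illustrated with a small user-space
    model. This is only a sketch: struct model_queue and
    model_queue_enter_nowait() are hypothetical names, not kernel APIs, and
    the model covers only the nowait path described in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the queue state; not struct request_queue. */
struct model_queue {
    bool frozen;    /* true while the queue cannot be entered */
};

/*
 * Model of the nowait path: enter if possible, otherwise tell the caller
 * to retry later with -EAGAIN (rather than -EBUSY).
 */
int model_queue_enter_nowait(const struct model_queue *q)
{
    assert(q != NULL);
    if (!q->frozen)
        return 0;
    return -EAGAIN;
}
```

    A caller that sees -EAGAIN can requeue the request and retry it later,
    which is the conventional meaning of that error for non-blocking I/O.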
     

23 Mar, 2022

1 commit

  • commit daaca3522a8e67c46e39ef09c1d542e866f85f3b upstream.

    blkcg_init_queue() may add rq qos structures to the request queue.
    Previously blk_cleanup_queue() called rq_qos_exit() to release them, but
    commit 8e141f9eb803 ("block: drain file system I/O on del_gendisk")
    moved rq_qos_exit() into del_gendisk(). This causes a memory leak
    because some queues have no disk, such as non-present SCSI LUNs, the
    NVMe admin queue, ...

    Fix the issue by adding rq_qos_exit() back to blk_cleanup_queue().

    BTW, v5.18 won't need this patch any more since we move
    blkcg_init_queue()/blkcg_exit_queue() into disk allocation/release
    handler, and patches have been in for-5.18/block.

    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org
    Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk")
    Reported-by: syzbot+b42749a851a47a0f581b@syzkaller.appspotmail.com
    Signed-off-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20220314043018.177141-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

23 Feb, 2022

1 commit

  • commit 7a5428dcb7902700b830e912feee4e845df7c019 upstream.

    Various block drivers call blk_set_queue_dying to mark a disk as dead due
    to surprise removal events, but since commit 8e141f9eb803 that doesn't
    work given that the GD_DEAD flag needs to be set to stop I/O.

    Replace the driver calls to blk_set_queue_dying with a new (and properly
    documented) blk_mark_disk_dead API, and fold blk_set_queue_dying into the
    only remaining caller.

    Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk")
    Reported-by: Markus Blöchl
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Link: https://lore.kernel.org/r/20220217075231.1140-1-hch@lst.de
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Christoph Hellwig
     

02 Feb, 2022

1 commit

  • commit e45c47d1f94e0cc7b6b079fdb4bcce2995e2adc4 upstream.

    The bio_start_io_acct_time() interface is like bio_start_io_acct() but
    allows start_time to be passed in. This gives drivers the ability to
    defer starting accounting until after IO is issued (but possibly not
    entirely due to bio splitting).

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Link: https://lore.kernel.org/r/20220128155841.39644-2-snitzer@redhat.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mike Snitzer
     

01 Dec, 2021

1 commit

  • commit 2a19b28f7929866e1cec92a3619f4de9f2d20005 upstream.

    To avoid slowing down queue destruction, we don't call
    blk_mq_quiesce_queue() in blk_cleanup_queue(); instead, canceling the
    dispatch work is delayed until blk_release_queue().

    However, this approach has caused a kernel oops[1], reported by Changhui.
    The log shows that scsi_device can be freed before blk_release_queue()
    runs, which is expected too since scsi_device is released after the scsi
    disk is closed and the scsi_device is removed.

    Fix the issue by canceling blk-mq dispatch work in both blk_cleanup_queue()
    and disk_release():

    1) when disk_release() is run, the disk has been closed, and any sync
    dispatch activities have been done, so canceling dispatch work is enough to
    quiesce filesystem I/O dispatch activity.

    2) in blk_cleanup_queue(), we only focus on passthrough request, and
    passthrough request is always explicitly allocated & freed by
    its caller, so once queue is frozen, all sync dispatch activity
    for passthrough request has been done, then it is enough to just cancel
    dispatch work for avoiding any dispatch activity.

    [1] kernel panic log
    [12622.769416] BUG: kernel NULL pointer dereference, address: 0000000000000300
    [12622.777186] #PF: supervisor read access in kernel mode
    [12622.782918] #PF: error_code(0x0000) - not-present page
    [12622.788649] PGD 0 P4D 0
    [12622.791474] Oops: 0000 [#1] PREEMPT SMP PTI
    [12622.796138] CPU: 10 PID: 744 Comm: kworker/10:1H Kdump: loaded Not tainted 5.15.0+ #1
    [12622.804877] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
    [12622.813321] Workqueue: kblockd blk_mq_run_work_fn
    [12622.818572] RIP: 0010:sbitmap_get+0x75/0x190
    [12622.823336] Code: 85 80 00 00 00 41 8b 57 08 85 d2 0f 84 b1 00 00 00 45 31 e4 48 63 cd 48 8d 1c 49 48 c1 e3 06 49 03 5f 10 4c 8d 6b 40 83 f0 01 8b 33 44 89 f2 4c 89 ef 0f b6 c8 e8 fa f3 ff ff 83 f8 ff 75 58
    [12622.844290] RSP: 0018:ffffb00a446dbd40 EFLAGS: 00010202
    [12622.850120] RAX: 0000000000000001 RBX: 0000000000000300 RCX: 0000000000000004
    [12622.858082] RDX: 0000000000000006 RSI: 0000000000000082 RDI: ffffa0b7a2dfe030
    [12622.866042] RBP: 0000000000000004 R08: 0000000000000001 R09: ffffa0b742721334
    [12622.874003] R10: 0000000000000008 R11: 0000000000000008 R12: 0000000000000000
    [12622.881964] R13: 0000000000000340 R14: 0000000000000000 R15: ffffa0b7a2dfe030
    [12622.889926] FS: 0000000000000000(0000) GS:ffffa0baafb40000(0000) knlGS:0000000000000000
    [12622.898956] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [12622.905367] CR2: 0000000000000300 CR3: 0000000641210001 CR4: 00000000001706e0
    [12622.913328] Call Trace:
    [12622.916055]
    [12622.918394] scsi_mq_get_budget+0x1a/0x110
    [12622.922969] __blk_mq_do_dispatch_sched+0x1d4/0x320
    [12622.928404] ? pick_next_task_fair+0x39/0x390
    [12622.933268] __blk_mq_sched_dispatch_requests+0xf4/0x140
    [12622.939194] blk_mq_sched_dispatch_requests+0x30/0x60
    [12622.944829] __blk_mq_run_hw_queue+0x30/0xa0
    [12622.949593] process_one_work+0x1e8/0x3c0
    [12622.954059] worker_thread+0x50/0x3b0
    [12622.958144] ? rescuer_thread+0x370/0x370
    [12622.962616] kthread+0x158/0x180
    [12622.966218] ? set_kthread_struct+0x40/0x40
    [12622.970884] ret_from_fork+0x22/0x30
    [12622.974875]
    [12622.977309] Modules linked in: scsi_debug rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs sunrpc dm_multipath intel_rapl_msr intel_rapl_common dell_wmi_descriptor sb_edac rfkill video x86_pkg_temp_thermal intel_powerclamp dcdbas coretemp kvm_intel kvm mgag200 irqbypass i2c_algo_bit rapl drm_kms_helper ipmi_ssif intel_cstate intel_uncore syscopyarea sysfillrect sysimgblt fb_sys_fops pcspkr cec mei_me lpc_ich mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter drm fuse xfs libcrc32c sr_mod cdrom sd_mod t10_pi sg ixgbe ahci libahci crct10dif_pclmul crc32_pclmul crc32c_intel libata megaraid_sas ghash_clmulni_intel tg3 wdat_wdt mdio dca wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_debug]

    Reported-by: ChanghuiZhong
    Cc: Christoph Hellwig
    Cc: "Martin K. Petersen"
    Cc: Bart Van Assche
    Cc: linux-scsi@vger.kernel.org
    Signed-off-by: Ming Lei
    Link: https://lore.kernel.org/r/20211116014343.610501-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

25 Nov, 2021

1 commit

  • [ Upstream commit b781d8db580c058ecd54ed7d5dde7f8270b25f5b ]

    KASAN reports a use-after-free when running a block test:

    ==================================================================
    [10050.967049] BUG: KASAN: use-after-free in
    submit_bio_checks+0x1539/0x1550

    [10050.977638] Call Trace:
    [10050.978190] dump_stack+0x9b/0xce
    [10050.979674] print_address_description.constprop.6+0x3e/0x60
    [10050.983510] kasan_report.cold.9+0x22/0x3a
    [10050.986089] submit_bio_checks+0x1539/0x1550
    [10050.989576] submit_bio_noacct+0x83/0xc80
    [10050.993714] submit_bio+0xa7/0x330
    [10050.994435] mpage_readahead+0x380/0x500
    [10050.998009] read_pages+0x1c1/0xbf0
    [10051.002057] page_cache_ra_unbounded+0x4c2/0x6f0
    [10051.007413] do_page_cache_ra+0xda/0x110
    [10051.008207] force_page_cache_ra+0x23d/0x3d0
    [10051.009087] page_cache_sync_ra+0xca/0x300
    [10051.009970] generic_file_buffered_read+0xbea/0x2130
    [10051.012685] generic_file_read_iter+0x315/0x490
    [10051.014472] blkdev_read_iter+0x113/0x1b0
    [10051.015300] aio_read+0x2ad/0x450
    [10051.023786] io_submit_one+0xc8e/0x1d60
    [10051.029855] __se_sys_io_submit+0x125/0x350
    [10051.033442] do_syscall_64+0x2d/0x40
    [10051.034156] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    [10051.048733] Allocated by task 18598:
    [10051.049482] kasan_save_stack+0x19/0x40
    [10051.050263] __kasan_kmalloc.constprop.1+0xc1/0xd0
    [10051.051230] kmem_cache_alloc+0x146/0x440
    [10051.052060] mempool_alloc+0x125/0x2f0
    [10051.052818] bio_alloc_bioset+0x353/0x590
    [10051.053658] mpage_alloc+0x3b/0x240
    [10051.054382] do_mpage_readpage+0xddf/0x1ef0
    [10051.055250] mpage_readahead+0x264/0x500
    [10051.056060] read_pages+0x1c1/0xbf0
    [10051.056758] page_cache_ra_unbounded+0x4c2/0x6f0
    [10051.057702] do_page_cache_ra+0xda/0x110
    [10051.058511] force_page_cache_ra+0x23d/0x3d0
    [10051.059373] page_cache_sync_ra+0xca/0x300
    [10051.060198] generic_file_buffered_read+0xbea/0x2130
    [10051.061195] generic_file_read_iter+0x315/0x490
    [10051.062189] blkdev_read_iter+0x113/0x1b0
    [10051.063015] aio_read+0x2ad/0x450
    [10051.063686] io_submit_one+0xc8e/0x1d60
    [10051.064467] __se_sys_io_submit+0x125/0x350
    [10051.065318] do_syscall_64+0x2d/0x40
    [10051.066082] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    [10051.067455] Freed by task 13307:
    [10051.068136] kasan_save_stack+0x19/0x40
    [10051.068931] kasan_set_track+0x1c/0x30
    [10051.069726] kasan_set_free_info+0x1b/0x30
    [10051.070621] __kasan_slab_free+0x111/0x160
    [10051.071480] kmem_cache_free+0x94/0x460
    [10051.072256] mempool_free+0xd6/0x320
    [10051.072985] bio_free+0xe0/0x130
    [10051.073630] bio_put+0xab/0xe0
    [10051.074252] bio_endio+0x3a6/0x5d0
    [10051.074984] blk_update_request+0x590/0x1370
    [10051.075870] scsi_end_request+0x7d/0x400
    [10051.076667] scsi_io_completion+0x1aa/0xe50
    [10051.077503] scsi_softirq_done+0x11b/0x240
    [10051.078344] blk_mq_complete_request+0xd4/0x120
    [10051.079275] scsi_mq_done+0xf0/0x200
    [10051.080036] virtscsi_vq_done+0xbc/0x150
    [10051.080850] vring_interrupt+0x179/0x390
    [10051.081650] __handle_irq_event_percpu+0xf7/0x490
    [10051.082626] handle_irq_event_percpu+0x7b/0x160
    [10051.083527] handle_irq_event+0xcc/0x170
    [10051.084297] handle_edge_irq+0x215/0xb20
    [10051.085122] asm_call_irq_on_stack+0xf/0x20
    [10051.085986] common_interrupt+0xae/0x120
    [10051.086830] asm_common_interrupt+0x1e/0x40

    ==================================================================

    A bio is checked at the beginning of submit_bio_noacct(). If the bio
    needs to be throttled, a timer is started and submission stops right
    there; the bio is submitted later from blk_throtl_dispatch_work_fn()
    when the timer expires. But in the current code, a throttled bio still
    has its issue->value set by blkcg_bio_issue_init(). This is redundant
    and may cause the above use-after-free.

    CPU0                        CPU1
    submit_bio
    submit_bio_noacct
    submit_bio_checks
    blk_throtl_bio()
    pending_timer
                                blk_throtl_dispatch_work_fn
                                submit_bio_noacct() value will be set
                                here

                                bio_endio()
                                bio_put()
                                bio_free()

    Link: https://lore.kernel.org/r/20211112093354.3581504-1-qiulaibin@huawei.com
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Laibin Qiu
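
    The ordering problem described above can be sketched in user space.
    All names here (struct model_bio, model_throtl_bio(),
    model_submit_checks()) are hypothetical and only mirror the roles of
    the functions named in the commit message; the sketch assumes the fix
    amounts to not touching a bio once it has been handed to the throttle
    timer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bio with just the state needed for the illustration. */
struct model_bio {
    bool throttled;          /* handed over to the throttle timer */
    bool issue_initialized;  /* issue value has been written */
};

/* Returns true when the bio was queued for later dispatch. */
bool model_throtl_bio(struct model_bio *bio, bool over_limit)
{
    assert(bio != NULL);
    if (over_limit)
        bio->throttled = true;
    return over_limit;
}

/*
 * Returns true when the caller may go on submitting the bio. A throttled
 * bio is left alone: the dispatch side may complete and free it at any
 * time, so writing to it afterwards would be a use-after-free.
 */
bool model_submit_checks(struct model_bio *bio, bool over_limit)
{
    if (model_throtl_bio(bio, over_limit))
        return false;
    bio->issue_initialized = true;
    return true;
}
```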
     

16 Oct, 2021

4 commits

  • Instead of delaying draining of file system I/O related items like the
    blk-qos queues, the integrity read workqueue and timeouts only when the
    request_queue is removed, do that when del_gendisk is called. This is
    important for SCSI where the upper level drivers that control the gendisk
    are separate entities, and the disk can be freed much earlier than the
    request_queue, or can even be unbound without tearing down the queue.

    Fixes: edb0872f44ec ("block: move the bdi from the request_queue to the gendisk")
    Reported-by: Ming Lei
    Signed-off-by: Christoph Hellwig
    Tested-by: Darrick J. Wong
    Link: https://lore.kernel.org/r/20210929071241.934472-5-hch@lst.de
    Tested-by: Yi Zhang
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • To prepare for fixing a gendisk shutdown race, open code the
    blk_queue_enter logic in bio_queue_enter. This also removes the
    pointless flags translation.

    Signed-off-by: Christoph Hellwig
    Tested-by: Darrick J. Wong
    Link: https://lore.kernel.org/r/20210929071241.934472-4-hch@lst.de
    Tested-by: Yi Zhang
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Factor out the code to try to get q_usage_counter without blocking into
    a separate helper. Both to improve code readability and to prepare for
    splitting bio_queue_enter from blk_queue_enter.

    Signed-off-by: Christoph Hellwig
    Tested-by: Darrick J. Wong
    Link: https://lore.kernel.org/r/20210929071241.934472-3-hch@lst.de
    Tested-by: Yi Zhang
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Ensure all bios check the current values of the queue under freeze
    protection, i.e. to make sure the zero capacity set by del_gendisk
    is actually seen before dispatching to the driver.

    Signed-off-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210929071241.934472-2-hch@lst.de
    Tested-by: Yi Zhang
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

31 Aug, 2021

2 commits

  • Pull support for struct bio recycling from Jens Axboe:
    "This adds bio recycling support for polled IO, allowing quick reuse of
    a bio for high IOPS scenarios via a percpu bio_set list.

    It's good for almost a 10% improvement in performance, bumping our
    per-core IO limit from ~3.2M IOPS to ~3.5M IOPS"

    * tag 'io_uring-bio-cache.5-2021-08-30' of git://git.kernel.dk/linux-block:
    bio: improve kerneldoc documentation for bio_alloc_kiocb()
    block: provide bio_clear_hipri() helper
    block: use the percpu bio cache in __blkdev_direct_IO
    io_uring: enable use of bio alloc cache
    block: clear BIO_PERCPU_CACHE flag if polling isn't supported
    bio: add allocation cache abstraction
    fs: add kiocb alloc cache flag
    bio: optimize initialization of a bio

    Linus Torvalds
     
  • Pull block updates from Jens Axboe:
    "Nothing major in here - lots of good cleanups and tech debt handling,
    which is also evident in the diffstats. In particular:

    - Add disk sequence numbers (Matteo)

    - Discard merge fix (Ming)

    - Relax disk zoned reporting restrictions (Niklas)

    - Bio error handling zoned leak fix (Pavel)

    - Start of proper add_disk() error handling (Luis, Christoph)

    - blk crypto fix (Eric)

    - Non-standard GPT location support (Dmitry)

    - IO priority improvements and cleanups (Damien)

    - blk-throtl improvements (Chunguang)

    - diskstats_show() stack reduction (Abd-Alrhman)

    - Loop scheduler selection (Bart)

    - Switch block layer to use kmap_local_page() (Christoph)

    - Remove obsolete disk_name helper (Christoph)

    - block_device refcounting improvements (Christoph)

    - Ensure gendisk always has a request queue reference (Christoph)

    - Misc fixes/cleanups (Shaokun, Oliver, Guoqing)"

    * tag 'for-5.15/block-2021-08-30' of git://git.kernel.dk/linux-block: (129 commits)
    sg: pass the device name to blk_trace_setup
    block, bfq: cleanup the repeated declaration
    blk-crypto: fix check for too-large dun_bytes
    blk-zoned: allow BLKREPORTZONE without CAP_SYS_ADMIN
    blk-zoned: allow zone management send operations without CAP_SYS_ADMIN
    block: mark blkdev_fsync static
    block: refine the disk_live check in del_gendisk
    mmc: sdhci-tegra: Enable MMC_CAP2_ALT_GPT_TEGRA
    mmc: block: Support alternative_gpt_sector() operation
    partitions/efi: Support non-standard GPT location
    block: Add alternative_gpt_sector() operation
    bio: fix page leak bio_add_hw_page failure
    block: remove CONFIG_DEBUG_BLOCK_EXT_DEVT
    block: remove a pointless call to MINOR() in device_add_disk
    null_blk: add error handling support for add_disk()
    virtio_blk: add error handling support for add_disk()
    block: add error handling for device_add_disk / add_disk
    block: return errors from disk_alloc_events
    block: return errors from blk_integrity_add
    block: call blk_register_queue earlier in device_add_disk
    ...

    Linus Torvalds
     

24 Aug, 2021

2 commits

  • Any case that turns off REQ_HIPRI must also clear BIO_PERCPU_CACHE,
    as non-polled IO may complete through hard/soft IRQ and hence isn't
    safe for our polled bio alloc cache.

    Provide a helper that does just that, and use it in the merging code as
    well if we split a bio and turn off polling.

    Fixes: be863b9e4348 ("block: clear BIO_PERCPU_CACHE flag if polling isn't supported")
    Reported-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
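
    A helper of the kind described above might look roughly like the
    following user-space sketch. The flag values and the struct layout are
    made up for illustration and do not match the kernel's definitions;
    only the idea of clearing both bits together comes from the commit
    message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit values; the kernel's actual definitions differ. */
#define MODEL_REQ_HIPRI         (1u << 0)  /* polled completion requested */
#define MODEL_BIO_PERCPU_CACHE  (1u << 0)  /* bio came from the polled alloc cache */

struct model_bio {
    uint32_t opf;    /* operation flags */
    uint32_t flags;  /* bio state flags */
};

/* Turn off polling and, in the same step, stop treating the bio as cacheable. */
void model_bio_clear_hipri(struct model_bio *bio)
{
    assert(bio != NULL);
    bio->opf &= ~MODEL_REQ_HIPRI;
    bio->flags &= ~MODEL_BIO_PERCPU_CACHE;
}
```

    Routing every "disable polling" site through one helper keeps the two
    flags from drifting apart, which is the failure mode the commit fixes.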
     
  • The bio alloc cache relies on the fact that a polled bio will complete
    in process context. Clear the cacheable flag if we disable polling
    for a given bio.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 Aug, 2021

1 commit

  • To fix a use-after-free while iterating over requests, we grabbed the
    request's refcount before calling ->fn in commit 2e315dc07df0 ("blk-mq:
    grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter").
    It turns out this may cause a kernel panic when iterating over a flush
    request:

    1) old flush request's tag is just released, and this tag is reused by
    one new request, but ->rqs[] isn't updated yet

    2) the flush request can be re-used for submitting one new flush command,
    so blk_rq_init() is called at the same time

    3) meantime blk_mq_queue_tag_busy_iter() is called, and old flush request
    is retrieved from ->rqs[tag]; when blk_mq_put_rq_ref() is called,
    flush_rq->end_io may not be updated yet, so NULL pointer dereference
    is triggered in blk_mq_put_rq_ref().

    Fix the issue by calling refcount_set(&flush_rq->ref, 1) after
    flush_rq->end_io is set. So far the only other caller of blk_rq_init() is
    scsi_ioctl_reset() in which the request doesn't enter block IO stack and
    the request reference count isn't used, so the change is safe.

    Fixes: 2e315dc07df0 ("blk-mq: grab rq->refcount before calling ->fn in blk_mq_tagset_busy_iter")
    Reported-by: "Blank-Burian, Markus, Dr."
    Tested-by: "Blank-Burian, Markus, Dr."
    Signed-off-by: Ming Lei
    Reviewed-by: Christoph Hellwig
    Reviewed-by: John Garry
    Link: https://lore.kernel.org/r/20210811142624.618598-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe

    Ming Lei
     

10 Aug, 2021

2 commits

  • The backing device information only makes sense for file system I/O,
    and thus belongs into the gendisk and not the lower level request_queue
    structure. Move it there.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Johannes Thumshirn
    Link: https://lore.kernel.org/r/20210809141744.1203023-5-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Don't leak the details of the timer into the block layer, instead
    initialize the timer in bdi_alloc and delete it in bdi_unregister.
    Note that this means the timer is initialized (but not armed) for
    non-block queues as well now.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Johannes Thumshirn
    Link: https://lore.kernel.org/r/20210809141744.1203023-2-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

07 Jul, 2021

1 commit

  • On the IO submission path, blk_account_io_start() may be interrupted.
    By the time the interrupt returns, the value of part->stamp may have
    been updated by other cores, so the time value collected before the
    interrupt may be less than part->stamp. When this happens, we should
    do nothing, to keep io_ticks more accurate. For kernels older than
    5.0, this may cause io_ticks to become smaller, which in turn may
    cause abnormal ioutil values.

    Signed-off-by: Chunguang Xu
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/1625521646-1069-1-git-send-email-brookxu.cn@gmail.com
    Signed-off-by: Jens Axboe

    Chunguang Xu
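
    The "do nothing when the sampled time is older than the stamp" rule
    can be sketched as follows. This is a single-threaded user-space
    model with hypothetical names; it leaves out the atomic update a real
    multi-core implementation would need, and the accounting rule on the
    non-end path (add one tick) is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-partition counters; field names are illustrative. */
struct model_part {
    unsigned long stamp;     /* last time io_ticks was updated */
    unsigned long io_ticks;  /* accumulated busy time */
};

/* Wrap-safe "a is later than b", in the style of the kernel's time_after(). */
bool model_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/*
 * Advance the stamp only when `now` is newer than it. A stale `now`
 * (sampled before another CPU moved the stamp forward) is ignored, so
 * no backwards interval is ever folded into io_ticks.
 */
void model_update_io_ticks(struct model_part *part, unsigned long now, bool end)
{
    assert(part != NULL);
    if (!model_time_after(now, part->stamp))
        return;
    part->io_ticks += end ? now - part->stamp : 1;
    part->stamp = now;
}
```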
     

01 Jul, 2021

1 commit

  • With the legacy IDE driver gone drivers now use either REQ_OP_DRV_*
    or REQ_OP_SCSI_*, so unify the two concepts of passthrough requests
    into a single one.

    Reviewed-by: Chaitanya Kulkarni
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

04 Jun, 2021

1 commit

  • Although the original intent was to use blk_update_request() in stacking
    block drivers only, it is used much more widely today. Reflect this in the
    documentation block above this function. See also:
    * commit 32fab448e5e8 ("block: add request update interface").
    * commit 2e60e02297cf ("block: clean up request completion API").
    * commit ed6565e73424 ("block: handle partial completions for special
    payload requests").

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Link: https://lore.kernel.org/r/20210519175226.8853-1-bvanassche@acm.org
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

01 Jun, 2021

1 commit

  • blk_alloc_queue is just an internal helper now, unexport it and remove
    it from the public header.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ulf Hansson
    Link: https://lore.kernel.org/r/20210521055116.1053587-27-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

24 May, 2021

1 commit

  • We have already deleted the block_dump feature in mark_inode_dirty()
    because it can be replaced by tracepoints; now we also remove the part
    in submit_bio() for the same reason. The block_dump code in
    submit_bio() dumps the writing process, the write region and the
    sectors on the target disk into the kernel log. It can be replaced by
    the block_bio_queue tracepoint in submit_bio_checks(), so we do not need
    block_dump anymore; remove the whole block_dump feature.

    Signed-off-by: zhangyi (F)
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210313030146.2882027-3-yi.zhang@huawei.com
    Signed-off-by: Jens Axboe

    zhangyi (F)
     

06 Apr, 2021

1 commit

  • Get rid of all the PFN arithmetics and just use an enum for the two
    remaining options, and use PageHighMem for the actual bounce decision.

    Add a fast path to entirely avoid the call for the common case of a queue
    not using the legacy bouncing code.

    Signed-off-by: Christoph Hellwig
    Acked-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Link: https://lore.kernel.org/r/20210331073001.46776-8-hch@lst.de
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

22 Feb, 2021

1 commit


26 Jan, 2021

1 commit

  • When an already remapped bio is resubmitted (e.g. by blk_queue_split),
    bio_check_eod will compare the remapped bi_sector against the size
    of the partition, leading to spurious I/O failures.

    Skip the EOD check in this case.

    Fixes: 309dca309fc3 ("block: store a block_device pointer in struct bio")
    Reported-by: Jens Axboe
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
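
    The skipped check can be sketched in user space like this. The struct,
    its remapped field and model_check_eod() are hypothetical names chosen
    for the example; the sketch assumes that a remapped bio carries a
    device-relative sector, so comparing it against the partition size is
    meaningless.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bio with only the fields the check needs. */
struct model_bio {
    unsigned long long sector;      /* start sector */
    unsigned int nr_sectors;        /* length in sectors */
    bool remapped;                  /* sector already translated to the whole device */
};

/* Returns 0 if the bio fits, -EIO if it runs past the end of the partition. */
int model_check_eod(const struct model_bio *bio, unsigned long long part_sectors)
{
    assert(bio != NULL);
    if (bio->remapped)
        return 0;   /* partition-relative limit no longer applies */
    if (bio->nr_sectors &&
        (bio->nr_sectors > part_sectors ||
         bio->sector > part_sectors - bio->nr_sectors))
        return -EIO;
    return 0;
}
```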
     

25 Jan, 2021

7 commits

  • q->bio_split is only used by bio_split() for fast-cloning bios and does
    not need to allocate bvecs, so remove this flag.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Ming Lei
    Reviewed-by: Pavel Begunkov
    Tested-by: Pavel Begunkov
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Remove the reverse map from a sector to a partition for I/O accounting by
    simply using ->bi_bdev.

    Signed-off-by: Christoph Hellwig
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Rework the I/O accounting for bio based drivers to use ->bi_bdev. This
    means all drivers can now simply use bio_start_io_acct to start
    accounting, and it will take partitions into account automatically. To
    end I/O account either bio_end_io_acct can be used if the driver never
    remaps I/O to a different device, or bio_end_io_acct_remapped if the
    driver did remap the I/O.

    Signed-off-by: Christoph Hellwig
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • There is no good reason to reassign ->bi_bdev when remapping the
    partition-relative block number to the device wide one, as all the
    information required by the drivers comes from the gendisk anyway.

    Keeping the original ->bi_bdev alive will allow greatly simplifying
    the partition-aware I/O accounting.

    Signed-off-by: Christoph Hellwig
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Merge a few checks for whole devices vs partitions to streamline the
    sanity checks.

    Signed-off-by: Christoph Hellwig
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Replace the gendisk pointer in struct bio with a pointer to the newly
    improved struct block device. From that the gendisk can be trivially
    accessed with an extra indirection, but it also allows to directly
    look up all information related to partition remapping.

    Signed-off-by: Christoph Hellwig
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Commit 20bd1d026aac ("scsi: sd: Keep disk read-only when re-reading
    partition") addressed a long-standing problem with user read-only
    policy being overridden as a result of a device-initiated revalidate.
    The commit has since been reverted due to a regression that left some
    USB devices read-only indefinitely.

    To fix the underlying problems with revalidate we need to keep track
    of hardware state and user policy separately.

    The gendisk has been updated to reflect the current hardware state set
    by the device driver. This is done to allow returning the device to
    the hardware state once the user clears the BLKROSET flag.

    The resulting semantics are as follows:

    - If BLKROSET sets a given partition read-only, that partition will
    remain read-only even if the underlying storage stack initiates a
    revalidate. However, the BLKRRPART ioctl will cause the partition
    table to be dropped and any user policy on partitions will be lost.

    - If BLKROSET has not been set, both the whole disk device and any
    partitions will reflect the current write-protect state of the
    underlying device.

    Based on a patch from Martin K. Petersen.

    Reported-by: Oleksii Kurochko
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=201221
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig
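
    The semantics listed above come down to tracking two independent bits
    and combining them on read. A minimal user-space sketch, with
    hypothetical struct and function names that are not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical disk: holds the hardware write-protect state set by the driver. */
struct model_disk {
    bool hw_ro;
};

/* Hypothetical partition: holds the user policy set through BLKROSET. */
struct model_part {
    const struct model_disk *disk;
    bool user_ro;
};

/* A partition is read-only if either the user or the hardware says so. */
bool model_part_read_only(const struct model_part *part)
{
    assert(part != NULL && part->disk != NULL);
    return part->user_ro || part->disk->hw_ro;
}
```

    Because the hardware state is stored separately, a revalidate can
    overwrite hw_ro without losing the user's choice, and clearing the
    user policy returns the partition to whatever the hardware reports.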
     

02 Jan, 2021

1 commit

  • Pull SCSI fixes from James Bottomley:
    "This is a load of driver fixes (12 ufs, 1 mpt3sas, 1 cxgbi).

    The big core two fixes are for power management ("block: Do not accept
    any requests while suspended" and "block: Fix a race in the runtime
    power management code") which finally sorts out the resume problems
    we've occasionally been having.

    To make the resume fix, there are seven necessary precursors which
    effectively renames REQ_PREEMPT to REQ_PM, so every "special" request
    in block is automatically a power management exempt one.

    All of the non-PM preempt cases are removed except for the one in the
    SCSI Parallel Interface (spi) domain validation which is a genuine
    case where we have to run requests at high priority to validate the
    bus so this becomes an autopm get/put protected request"

    * tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (22 commits)
    scsi: cxgb4i: Fix TLS dependency
    scsi: ufs: Un-inline ufshcd_vops_device_reset function
    scsi: ufs: Re-enable WriteBooster after device reset
    scsi: ufs-mediatek: Use correct path to fix compile error
    scsi: mpt3sas: Signedness bug in _base_get_diag_triggers()
    scsi: block: Do not accept any requests while suspended
    scsi: block: Remove RQF_PREEMPT and BLK_MQ_REQ_PREEMPT
    scsi: core: Only process PM requests if rpm_status != RPM_ACTIVE
    scsi: scsi_transport_spi: Set RQF_PM for domain validation commands
    scsi: ide: Mark power management requests with RQF_PM instead of RQF_PREEMPT
    scsi: ide: Do not set the RQF_PREEMPT flag for sense requests
    scsi: block: Introduce BLK_MQ_REQ_PM
    scsi: block: Fix a race in the runtime power management code
    scsi: ufs-pci: Enable UFSHCD_CAP_RPM_AUTOSUSPEND for Intel controllers
    scsi: ufs-pci: Fix recovery from hibernate exit errors for Intel controllers
    scsi: ufs-pci: Ensure UFS device is in PowerDown mode for suspend-to-disk ->poweroff()
    scsi: ufs-pci: Fix restore from S4 for Intel controllers
    scsi: ufs-mediatek: Keep VCC always-on for specific devices
    scsi: ufs: Allow regulators being always-on
    scsi: ufs: Clear UAC for RPMB after ufshcd resets
    ...

    Linus Torvalds
     

10 Dec, 2020

3 commits

  • blk_queue_enter() accepts BLK_MQ_REQ_PM requests independent of the runtime
    power management state. Now that SCSI domain validation no longer depends
    on this behavior, modify the behavior of blk_queue_enter() as follows:

    - Do not accept any requests while suspended.

    - Only process power management requests while suspending or resuming.

    Submitting BLK_MQ_REQ_PM requests to a device that is runtime suspended
    causes runtime-suspended devices not to resume as they should. The request
    which should cause a runtime resume instead gets issued directly, without
    resuming the device first. Of course the device can't handle it properly,
    the I/O fails, and the device remains suspended.

    The problem is fixed by checking that the queue's runtime-PM status isn't
    RPM_SUSPENDED before allowing a request to be issued, and queuing a
    runtime-resume request if it is. In particular, the inline
    blk_pm_request_resume() routine is renamed blk_pm_resume_queue() and the
    code is unified by merging the surrounding checks into the routine. If the
    queue isn't set up for runtime PM, or there currently is no restriction on
    allowed requests, the request is allowed. Likewise if the BLK_MQ_REQ_PM
    flag is set and the status isn't RPM_SUSPENDED. Otherwise a runtime resume
    is queued and the request is blocked until conditions are more suitable.

    [ bvanassche: modified commit message and removed Cc: stable because
    without the previous patches from this series this patch would break
    parallel SCSI domain validation + introduced queue_rpm_status() ]

    Link: https://lore.kernel.org/r/20201209052951.16136-9-bvanassche@acm.org
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Can Guo
    Cc: Stanley Chu
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Reported-and-tested-by: Martin Kepplinger
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Can Guo
    Signed-off-by: Alan Stern
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen

    Alan Stern
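
    The decision rules in the last paragraph of the commit message can be
    written out as a small user-space model. Everything here is a sketch
    with hypothetical names: the enum, the struct fields and the
    resume_requests counter stand in for the kernel's runtime-PM state and
    for queuing a runtime resume.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_rpm_status {
    MODEL_RPM_ACTIVE,
    MODEL_RPM_RESUMING,
    MODEL_RPM_SUSPENDED,
    MODEL_RPM_SUSPENDING
};

/* Hypothetical queue state for the illustration. */
struct model_queue {
    bool has_runtime_pm;               /* queue is set up for runtime PM */
    bool pm_only;                      /* only PM requests are currently allowed */
    enum model_rpm_status rpm_status;
    int resume_requests;               /* how many runtime resumes were queued */
};

/* Returns true if the request may proceed, false if it has to wait. */
bool model_pm_resume_queue(struct model_queue *q, bool pm_request)
{
    assert(q != NULL);
    if (!q->has_runtime_pm || !q->pm_only)
        return true;    /* no runtime PM, or no restriction in force */
    if (pm_request && q->rpm_status != MODEL_RPM_SUSPENDED)
        return true;    /* PM request while suspending or resuming */
    q->resume_requests++;   /* stand-in for queuing a runtime resume */
    return false;
}
```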
     
  • Remove flag RQF_PREEMPT and BLK_MQ_REQ_PREEMPT since these are no longer
    used by any kernel code.

    Link: https://lore.kernel.org/r/20201209052951.16136-8-bvanassche@acm.org
    Cc: Can Guo
    Cc: Stanley Chu
    Cc: Alan Stern
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Cc: Martin Kepplinger
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jens Axboe
    Reviewed-by: Can Guo
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen

    Bart Van Assche
     
  • Introduce the BLK_MQ_REQ_PM flag. This flag makes the request allocation
    functions set RQF_PM. This is the first step towards removing
    BLK_MQ_REQ_PREEMPT.

    Link: https://lore.kernel.org/r/20201209052951.16136-3-bvanassche@acm.org
    Cc: Alan Stern
    Cc: Stanley Chu
    Cc: Ming Lei
    Cc: Rafael J. Wysocki
    Cc: Can Guo
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jens Axboe
    Reviewed-by: Can Guo
    Signed-off-by: Bart Van Assche
    Signed-off-by: Martin K. Petersen

    Bart Van Assche
     

05 Dec, 2020

2 commits


02 Dec, 2020

2 commits

  • Use struct block_device to lookup partitions on a disk. This removes
    all usage of struct hd_struct from the I/O path.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Hannes Reinecke
    Acked-by: Coly Li [bcache]
    Acked-by: Chao Yu [f2fs]
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Allocate hd_struct together with struct block_device to pre-load
    the lifetime rule changes in preparation of merging the two structures.

    Note that part0 was previously embedded into struct gendisk, but is
    a separate allocation now, and already points to the block_device instead
    of the hd_struct. The lifetime of struct gendisk is still controlled by
    the struct device embedded in the part0 hd_struct.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Christoph Hellwig