22 Jul, 2020

3 commits

  • commit 4a2f704eb2d831a2d73d7f4cdd54f45c49c3c353 upstream.

    Commit 429120f3df2d started taking the segment's start DMA address
    into account when computing the max segment size, using the
    'unsigned long' data type for the calculation. However, the segment
    boundary mask may be 0xffffffff, so the computed segment size can
    overflow in case of a zero physical address on a 32-bit arch.

    Fix the issue by returning queue_max_segment_size() directly when that
    happens.
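
    To see where the wrap-around comes from, consider the arithmetic in
    isolation. This is an illustrative userspace sketch, not the kernel
    source; the variable names mirror the commit message:

    #include <stdio.h>

    int main(void)
    {
            unsigned long mask = 0xffffffffUL; /* segment boundary mask */
            unsigned long start = 0;           /* zero physical address */
            /* pre-fix computation: on a 32-bit arch unsigned long is
             * 32 bits wide, so 0xffffffff - 0 + 1 wraps around to 0 */
            unsigned long size = mask - (start & mask) + 1;

            printf("segment size: %lu\n", size); /* 0 on 32-bit */
            return 0;
    }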

    Fixes: 429120f3df2d ("block: fix splitting segments on boundary masks")
    Reported-by: Guenter Roeck
    Tested-by: Guenter Roeck
    Cc: Christoph Hellwig
    Tested-by: Steven Rostedt (VMware)
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • commit 429120f3df2dba2bf3a4a19f4212a53ecefc7102 upstream.

    We ran into a problem with an mpt3sas based controller, where we would
    see random (and hard to reproduce) file corruption. The issue seemed
    specific to this controller, but wasn't specific to the file system.
    After a lot of debugging, we found out that it's caused by segments
    spanning a 4G memory boundary. This shouldn't happen, as the default
    setting for segment boundary masks is 4G.

    Turns out there are two issues in get_max_segment_size():

    1) The default segment boundary mask is bypassed

    2) The segment start address isn't taken into account when checking
    the segment boundary limit

    Fix these two issues by removing the bypass of the segment boundary
    check even if the mask is set to the default value, and taking into
    account the actual start address of the request when checking if a
    segment needs splitting.
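
    A minimal sketch of the corrected computation (plain C, not the
    kernel source; 'boundary_mask' and 'max_seg_size' stand in for the
    queue limits):

    /* bytes available from 'start' to the end of its boundary window,
     * capped by the queue's max segment size */
    static unsigned long get_max_segment_size(unsigned long start,
                                              unsigned long boundary_mask,
                                              unsigned long max_seg_size)
    {
            unsigned long left = boundary_mask - (start & boundary_mask) + 1;

            return left < max_seg_size ? left : max_seg_size;
    }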

    Cc: stable@vger.kernel.org # v5.1+
    Reviewed-by: Chris Mason
    Tested-by: Chris Mason
    Fixes: dcebd755926b ("block: use bio_for_each_bvec() to compute multi-page bvec count")
    Signed-off-by: Ming Lei
    Signed-off-by: Greg Kroah-Hartman

    Dropped const on the page pointer, ppc page_to_phys() doesn't mark the
    page as const...

    Signed-off-by: Jens Axboe

    Ming Lei
     
  • [ Upstream commit bfe373f608cf81b7626dfeb904001b0e867c5110 ]

    Else there may be magic numbers in /sys/kernel/debug/block/*/state.

    Signed-off-by: Hou Tao
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Hou Tao
     

16 Jul, 2020

2 commits

  • commit 05a4fed69ff00a8bd83538684cb602a4636b07a7 upstream.

    dm-multipath is the only user of blk_mq_queue_inflight(). When
    dm-multipath calls blk_mq_queue_inflight() to check if it has
    outstanding IO, it can get a false negative. The reason is that
    blk_mq_rq_inflight() doesn't consider requests that are no longer
    MQ_RQ_IN_FLIGHT but are now MQ_RQ_COMPLETE (->complete isn't
    called or finished yet) as "inflight".

    This causes request-based dm-multipath's dm_wait_for_completion() to
    return before all outstanding dm-multipath requests have actually
    completed. This breaks DM multipath's suspend functionality because
    blk-mq requests complete after DM's suspend has finished -- which
    shouldn't happen.

    Fix this by considering any request not in the MQ_RQ_IDLE state
    (so either MQ_RQ_COMPLETE or MQ_RQ_IN_FLIGHT) as "inflight" in
    blk_mq_rq_inflight().
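
    The state check can be modeled in isolation (an illustrative sketch,
    not the kernel source; the enum mirrors the blk-mq request states
    named above):

    #include <stdbool.h>
    #include <stdio.h>

    enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

    /* before the fix: a COMPLETE request is not counted as in flight */
    static bool inflight_old(enum mq_rq_state s) { return s == MQ_RQ_IN_FLIGHT; }

    /* after the fix: anything that has left IDLE is still outstanding */
    static bool inflight_new(enum mq_rq_state s) { return s != MQ_RQ_IDLE; }

    int main(void)
    {
            /* the case dm-multipath tripped over */
            printf("old: %d, new: %d\n",
                   inflight_old(MQ_RQ_COMPLETE),   /* 0: false negative */
                   inflight_new(MQ_RQ_COMPLETE));  /* 1: still counted  */
            return 0;
    }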

    Fixes: 3c94d83cb3526 ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()")
    Signed-off-by: Ming Lei
    Signed-off-by: Mike Snitzer
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit 0b8eb629a700c0ef15a437758db8255f8444e76c ]

    Release bip using kfree() in error path when that was allocated
    by kmalloc().

    Signed-off-by: Chengguang Xu
    Reviewed-by: Christoph Hellwig
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Chengguang Xu
     

01 Jul, 2020

2 commits

  • [ Upstream commit fe35ec58f0d339221643287bbb7cee15c93a5389 ]

    There is an issue when tuning the number of read and write queues
    while the total queue count stays the same: hctx->type cannot be
    updated, since __blk_mq_update_nr_hw_queues() returns directly
    if the total queue count has not changed.

    Reproduce:

    dmesg | grep "default/read/poll"
    [ 2.607459] nvme nvme0: 48/0/0 default/read/poll queues
    cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
    48 default

    tune the write queues to 24:
    echo 24 > /sys/module/nvme/parameters/write_queues
    echo 1 > /sys/block/nvme0n1/device/reset_controller

    dmesg | grep "default/read/poll"
    [ 433.547235] nvme nvme0: 24/24/0 default/read/poll queues

    cat /sys/kernel/debug/block/nvme0n1/hctx*/type | sort | uniq -c
    48 default

    The driver's hardware queue mapping is then not the same as the block
    layer's.

    Signed-off-by: Weiping Zhang
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Weiping Zhang
     
  • commit a75ca9303175d36af93c0937dd9b1a6422908b8d upstream.

    commit e7bf90e5afe3 ("block/bio-integrity: fix a memory leak bug") added
    a kfree() for 'buf' if bio_integrity_add_page() returns '0'. However,
    the object will be freed in bio_integrity_free() since 'bio->bi_opf' and
    'bio->bi_integrity' were set previously in bio_integrity_alloc().

    Fixes: commit e7bf90e5afe3 ("block/bio-integrity: fix a memory leak bug")
    Signed-off-by: yu kuai
    Reviewed-by: Ming Lei
    Reviewed-by: Bob Liu
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe
    Cc: Guenter Roeck
    Signed-off-by: Greg Kroah-Hartman

    yu kuai
     

22 Jun, 2020

3 commits

  • [ Upstream commit 81ca627a933063fa63a6d4c66425de822a2ab7f5 ]

    When the QoS targets are met and nothing is being throttled, there's
    no way to tell how saturated the underlying device is - it could be
    almost entirely idle, at the cusp of saturation or anywhere in between.
    Given that there's no information, it's best to keep vrate as-is in
    this state. Before 7cd806a9a953 ("iocost: improve nr_lagging
    handling"), this was the case - if the device isn't missing QoS
    targets and nothing is being throttled, busy_level was reset to zero.

    While fixing nr_lagging handling, 7cd806a9a953 ("iocost: improve
    nr_lagging handling") broke this. Now, while the device is hitting
    QoS targets and nothing is being throttled, vrate keeps getting
    adjusted according to the existing busy_level.

    This led to vrate climbing until it hit max when there was an IO
    issuer with limited request concurrency, if the vrate started low:
    vrate keeps getting adjusted upwards until the issuer can issue IOs
    w/o being throttled. From then on, QoS targets keep getting met,
    nothing on the system needs throttling, and vrate keeps getting
    increased due to the existing busy_level.

    This patch makes the following changes to the busy_level logic (a
    sketch of the result follows the list).

    * Reset busy_level if nr_shortages is zero to avoid the above
    scenario.

    * Make non-zero nr_lagging block lowering busy_level but still clear
    positive busy_level if there's clear non-saturation signal - QoS
    targets are met and nr_shortages is non-zero. nr_lagging's role is
    preventing adjusting vrate upwards while there are long-running
    commands and it shouldn't keep busy_level positive while there's
    clear non-saturation signal.

    * Restructure code for clarity and add comments.
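
    A hedged sketch of the restructured decision in plain C (not the
    kernel source; the boolean inputs 'missing_qos' and 'met_w_margin'
    are simplified stand-ins for the QoS-target checks):

    #include <stdbool.h>

    static int update_busy_level(int busy_level, bool missing_qos,
                                 bool met_w_margin, int nr_shortages,
                                 int nr_lagging)
    {
            if (missing_qos) {
                    /* clearly saturated: raise busy_level to slow vrate */
                    if (busy_level < 0)
                            busy_level = 0;
                    return busy_level + 1;
            }
            if (met_w_margin && nr_shortages) {
                    /* clear non-saturation signal: drop positive level */
                    if (busy_level > 0)
                            busy_level = 0;
                    /* only push vrate up when no long-running commands lag */
                    if (!nr_lagging)
                            busy_level--;
                    return busy_level;
            }
            /* QoS met and nr_shortages == 0: no saturation information
             * either way, so keep vrate as-is */
            return 0;
    }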

    Signed-off-by: Tejun Heo
    Reported-by: Andy Newell
    Fixes: 7cd806a9a953 ("iocost: improve nr_lagging handling")
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Tejun Heo
     
  • [ Upstream commit aa880ad690ab6d4c53934af85fb5a43e69ecb0f5 ]

    When we increase the hardware queue count, blk_mq_update_queue_map()
    resets the mapping between CPUs and hardware queues based on the new
    hardware queue count (set->nr_hw_queues). The mapping cannot be reset
    if blk_mq_realloc_hw_ctxs() encounters an error, but the fallback flow
    will continue using it; blk_mq_map_swqueue() then touches invalid
    memory, because the mapping points to a wrong hctx.

    blktest block/030:

    null_blk: module loaded
    Increasing nr_hw_queues to 8 fails, fallback to 1
    ==================================================================
    BUG: KASAN: null-ptr-deref in blk_mq_map_swqueue+0x2f2/0x830
    Read of size 8 at addr 0000000000000128 by task nproc/8541

    CPU: 5 PID: 8541 Comm: nproc Not tainted 5.7.0-rc4-dbg+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
    rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
    Call Trace:
    dump_stack+0xa5/0xe6
    __kasan_report.cold+0x65/0xbb
    kasan_report+0x45/0x60
    check_memory_region+0x15e/0x1c0
    __kasan_check_read+0x15/0x20
    blk_mq_map_swqueue+0x2f2/0x830
    __blk_mq_update_nr_hw_queues+0x3df/0x690
    blk_mq_update_nr_hw_queues+0x32/0x50
    nullb_device_submit_queues_store+0xde/0x160 [null_blk]
    configfs_write_file+0x1c4/0x250 [configfs]
    __vfs_write+0x4c/0x90
    vfs_write+0x14b/0x2d0
    ksys_write+0xdd/0x180
    __x64_sys_write+0x47/0x50
    do_syscall_64+0x6f/0x310
    entry_SYSCALL_64_after_hwframe+0x49/0xb3

    Signed-off-by: Weiping Zhang
    Tested-by: Bart van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Weiping Zhang
     
  • [ Upstream commit fd689871bbfbb41cd77379d3e9e5f4def0f7d6c6 ]

    Allocate a new map and requests for each new hardware queue when
    increasing the hardware queue count. Before this patch, a warning
    was shown for each new hardware queue, but that's not enough: these
    hctxs have no maps and requests, so when a bio is mapped to one of
    these hardware queues, a kernel panic is triggered when getting a
    request from that hctx.

    Test environment:
    * A NVMe disk supports 128 io queues
    * 96 cpus in system

    A corner case can always trigger this panic: 96 io queues are
    allocated for the HCTX_TYPE_DEFAULT type (the corresponding kernel
    log: nvme nvme0: 96/0/0 default/read/poll queues). Now we set the
    nvme write queues to 96; nvme will then allocate another 32 queues
    for reads, but blk_mq_update_nr_hw_queues() does not allocate maps
    and requests for these newly added io queues. So when a process
    reads the nvme disk, a kernel panic is triggered when getting a
    request from these hardware contexts.

    Reproduce script:

    nr=$(expr `cat /sys/block/nvme0n1/device/queue_count` - 1)
    echo $nr > /sys/module/nvme/parameters/write_queues
    echo 1 > /sys/block/nvme0n1/device/reset_controller
    dd if=/dev/nvme0n1 of=/dev/null bs=4K count=1

    [ 8040.805626] ------------[ cut here ]------------
    [ 8040.805627] WARNING: CPU: 82 PID: 12921 at block/blk-mq.c:2578 blk_mq_map_swqueue+0x2b6/0x2c0
    [ 8040.805627] Modules linked in: nvme nvme_core nf_conntrack_netlink xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nft_counter nf_nat_tftp nf_conntrack_tftp nft_masq nf_tables_set nft_fib_inet nft_f
    ib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack tun bridge nf_defrag_ipv6 nf_defrag_ipv4 stp llc ip6_tables ip_tables nft_compat rfkill ip_set nf_tables nfne
    tlink sunrpc intel_rapl_msr intel_rapl_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_ssif crct10dif_pclmul crc32_pclmul iTCO_wdt iTCO_vendor_support ghash_clmulni_intel intel_
    cstate intel_uncore raid0 joydev intel_rapl_perf ipmi_si pcspkr mei_me ioatdma sg ipmi_devintf mei i2c_i801 dca lpc_ich ipmi_msghandler acpi_power_meter acpi_pad xfs libcrc32c sd_mod ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm d
    rm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
    [ 8040.805637] ahci drm i40e libahci crc32c_intel libata t10_pi wmi dm_mirror dm_region_hash dm_log dm_mod [last unloaded: nvme_core]
    [ 8040.805640] CPU: 82 PID: 12921 Comm: kworker/u194:2 Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
    [ 8040.805640] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
    [ 8040.805641] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
    [ 8040.805642] RIP: 0010:blk_mq_map_swqueue+0x2b6/0x2c0
    [ 8040.805643] Code: 00 00 00 00 00 41 83 c5 01 44 39 6d 50 77 b8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b bb 98 00 00 00 89 d6 e8 8c 81 03 00 eb 83 0b e9 52 ff ff ff 0f 1f 00 0f 1f 44 00 00 41 57 48 89 f1 41 56
    [ 8040.805643] RSP: 0018:ffffba590d2e7d48 EFLAGS: 00010246
    [ 8040.805643] RAX: 0000000000000000 RBX: ffff9f013e1ba800 RCX: 000000000000003d
    [ 8040.805644] RDX: ffff9f00ffff6000 RSI: 0000000000000003 RDI: ffff9ed200246d90
    [ 8040.805644] RBP: ffff9f00f6a79860 R08: 0000000000000000 R09: 000000000000003d
    [ 8040.805645] R10: 0000000000000001 R11: ffff9f0138c3d000 R12: ffff9f00fb3a9008
    [ 8040.805645] R13: 000000000000007f R14: ffffffff96822660 R15: 000000000000005f
    [ 8040.805645] FS: 0000000000000000(0000) GS:ffff9f013fa80000(0000) knlGS:0000000000000000
    [ 8040.805646] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 8040.805646] CR2: 00007f7f397fa6f8 CR3: 0000003d8240a002 CR4: 00000000007606e0
    [ 8040.805647] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 8040.805647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 8040.805647] PKRU: 55555554
    [ 8040.805647] Call Trace:
    [ 8040.805649] blk_mq_update_nr_hw_queues+0x31b/0x390
    [ 8040.805650] nvme_reset_work+0xb4b/0xeab [nvme]
    [ 8040.805651] process_one_work+0x1a7/0x370
    [ 8040.805652] worker_thread+0x1c9/0x380
    [ 8040.805653] ? max_active_store+0x80/0x80
    [ 8040.805655] kthread+0x112/0x130
    [ 8040.805656] ? __kthread_parkme+0x70/0x70
    [ 8040.805657] ret_from_fork+0x35/0x40
    [ 8040.805658] ---[ end trace b5f13b1e73ccb5d3 ]---
    [ 8229.365135] BUG: kernel NULL pointer dereference, address: 0000000000000004
    [ 8229.365165] #PF: supervisor read access in kernel mode
    [ 8229.365178] #PF: error_code(0x0000) - not-present page
    [ 8229.365191] PGD 0 P4D 0
    [ 8229.365201] Oops: 0000 [#1] SMP PTI
    [ 8229.365212] CPU: 77 PID: 13024 Comm: dd Kdump: loaded Tainted: G W 5.6.0-rc5.78317c+ #2
    [ 8229.365232] Hardware name: Inspur SA5212M5/YZMB-00882-104, BIOS 4.0.9 08/27/2019
    [ 8229.365253] RIP: 0010:blk_mq_get_tag+0x227/0x250
    [ 8229.365265] Code: 44 24 04 44 01 e0 48 8b 74 24 38 65 48 33 34 25 28 00 00 00 75 33 48 83 c4 40 5b 5d 41 5c 41 5d 41 5e c3 48 8d 68 10 4c 89 ef 8b 60 04 48 89 ee e8 dd f9 ff ff 83 f8 ff 75 c8 e9 67 fe ff ff
    [ 8229.365304] RSP: 0018:ffffba590e977970 EFLAGS: 00010246
    [ 8229.365317] RAX: 0000000000000000 RBX: ffff9f00f6a79860 RCX: ffffba590e977998
    [ 8229.365333] RDX: 0000000000000000 RSI: ffff9f012039b140 RDI: ffffba590e977a38
    [ 8229.365349] RBP: 0000000000000010 R08: ffffda58ff94e190 R09: ffffda58ff94e198
    [ 8229.365365] R10: 0000000000000011 R11: ffff9f00f6a79860 R12: 0000000000000000
    [ 8229.365381] R13: ffffba590e977a38 R14: ffff9f012039b140 R15: 0000000000000001
    [ 8229.365397] FS: 00007f481c230580(0000) GS:ffff9f013f940000(0000) knlGS:0000000000000000
    [ 8229.365415] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 8229.365428] CR2: 0000000000000004 CR3: 0000005f35e26004 CR4: 00000000007606e0
    [ 8229.365444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 8229.365460] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 8229.365476] PKRU: 55555554
    [ 8229.365484] Call Trace:
    [ 8229.365498] ? finish_wait+0x80/0x80
    [ 8229.365512] blk_mq_get_request+0xcb/0x3f0
    [ 8229.365525] blk_mq_make_request+0x143/0x5d0
    [ 8229.365538] generic_make_request+0xcf/0x310
    [ 8229.365553] ? scan_shadow_nodes+0x30/0x30
    [ 8229.365564] submit_bio+0x3c/0x150
    [ 8229.365576] mpage_readpages+0x163/0x1a0
    [ 8229.365588] ? blkdev_direct_IO+0x490/0x490
    [ 8229.365601] read_pages+0x6b/0x190
    [ 8229.365612] __do_page_cache_readahead+0x1c1/0x1e0
    [ 8229.365626] ondemand_readahead+0x182/0x2f0
    [ 8229.365639] generic_file_buffered_read+0x590/0xab0
    [ 8229.365655] new_sync_read+0x12a/0x1c0
    [ 8229.365666] vfs_read+0x8a/0x140
    [ 8229.365676] ksys_read+0x59/0xd0
    [ 8229.365688] do_syscall_64+0x55/0x1d0
    [ 8229.365700] entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Signed-off-by: Ming Lei
    Signed-off-by: Weiping Zhang
    Tested-by: Weiping Zhang
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     

03 Jun, 2020

1 commit

  • [ Upstream commit b0beb28097fa04177b3769f4bb7a0d0d9c4ae76e ]

    This reverts commit c58c1f83436b501d45d4050fd1296d71a9760bcb.

    io_uring does do the right thing for this case, and we're still returning
    -EAGAIN to userspace for the cases we don't support. Revert this change
    to avoid doing endless spins of resubmits.

    Cc: stable@vger.kernel.org # v5.6
    Reported-by: Bijan Mottahedeh
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

14 May, 2020

1 commit

  • commit 0b80f9866e6bbfb905140ed8787ff2af03652c0c upstream.

    abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup
    is and controls the activation of use_delay mechanism. Once a cgroup goes
    over budget from forced IOs, it has to pay it back with its future budget.
    The progress guarantee on debt paying comes from the iocg being active -
    active iocgs are processed by the periodic timer, which ensures that as time
    passes the debts dissipate and the iocg returns to normal operation.

    However, both iocg activation and vdebt handling are asynchronous and a
    sequence like the following may happen.

    1. The iocg is in the process of being deactivated by the periodic timer.

    2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns
    without anything because it still sees that the iocg is already active.

    3. The iocg is deactivated.

    4. The bio from #2 is over budget but needs to be forced. It increases
    abs_vdebt and goes over the threshold and enables use_delay.

    5. IO control is enabled for the iocg's subtree and now IOs are attributed
    to the descendant cgroups and the iocg itself no longer issues IOs.

    This leaves the iocg with a stuck abs_vdebt - it has debt but is
    inactive, with no further IOs which can activate it. This can end up
    unduly punishing all the descendant cgroups.

    The usual throttling path has the same issue - the iocg must be active
    while throttled to ensure that a future event will wake it up - and
    solves the
    problem by synchronizing the throttling path with a spinlock. abs_vdebt
    handling is another form of overage handling and shares a lot of
    characteristics including the fact that it isn't in the hottest path.

    This patch fixes the above and other possible races by strictly
    synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock.

    Signed-off-by: Tejun Heo
    Reported-by: Vlad Dmitriev
    Cc: stable@vger.kernel.org # v5.4+
    Fixes: e1518f63f246 ("blk-iocost: Don't let merges push vtime into the future")
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

02 May, 2020

2 commits

  • [ Upstream commit 5fe56de799ad03e92d794c7936bf363922b571df ]

    If in blk_mq_dispatch_rq_list() we find no budget, then we break out of
    the dispatch loop, but the request may keep the driver tag, evaluated
    in 'nxt' in the previous loop iteration.

    Fix by putting the driver tag for that request.
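
    A hedged sketch of the shape of the fix (not the exact upstream
    diff): when budget allocation fails and we break out, first drop the
    driver tag that the previous iteration obtained for the request:

    /* inside the dispatch loop of blk_mq_dispatch_rq_list(), roughly */
    if (!blk_mq_get_dispatch_budget(hctx)) {
            blk_mq_put_driver_tag(rq); /* tag was taken as 'nxt' earlier */
            break;
    }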

    Reviewed-by: Ming Lei
    Signed-off-by: John Garry
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    John Garry
     
  • commit d6c8e949a35d6906d6c03a50e9a9cdf4e494528a upstream.

    Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]"
    argument of the iocost_ioc_vrate_adj trace entry defined in
    include/trace/events/iocost.h leading to the following error:

    /tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8:
    error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token
    , u32[]* __tracepoint_arg_missed_ppm

    That argument type is indeed rather complex and hard to read. Looking
    at block/blk-iocost.c, it is just a 2-entry u32 array. By simplifying
    the argument to a simple "u32 *missed_ppm" and adjusting the trace
    entry accordingly, the compilation error was gone.
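
    The declaration is easier to read with a concrete example
    (illustrative C, unrelated to the kernel sources):

    #include <stdio.h>

    int main(void)
    {
            unsigned int missed_ppm[2] = { 10, 20 };   /* READ, WRITE */
            unsigned int (*old_arg)[2] = &missed_ppm;  /* pointer to a
                                                          2-entry array */
            unsigned int *new_arg = missed_ppm;        /* plain pointer,
                                                          post-fix form */

            /* both index the same storage */
            printf("%u %u\n", (*old_arg)[1], new_arg[1]); /* 20 20 */
            return 0;
    }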

    Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost")
    Acked-by: Steven Rostedt (VMware)
    Acked-by: Tejun Heo
    Signed-off-by: Waiman Long
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

23 Apr, 2020

3 commits

  • commit 4d38a87fbb77fb9ff2ff4e914162a8ae6453eff5 upstream.

    In bfq_pd_offline(), the function bfq_flush_idle_tree() is invoked to
    flush the rb tree that contains all idle entities belonging to the pd
    (cgroup) being destroyed. In particular, bfq_flush_idle_tree() is
    invoked before bfq_reparent_active_queues(). Yet the latter may happen
    to add some entities to the idle tree. It happens if, in some of the
    calls to bfq_bfqq_move() performed by bfq_reparent_active_queues(),
    the queue to move is empty and gets expired.

    This commit simply reverses the invocation order between
    bfq_flush_idle_tree() and bfq_reparent_active_queues().

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     
  • commit 576682fa52cbd95deb3773449566274f206acc58 upstream.

    bfq_reparent_leaf_entity() reparents the input leaf entity (a leaf
    entity represents just a bfq_queue in an entity tree). Yet, the input
    entity is guaranteed to always be a leaf entity only in two-level
    entity trees. In this respect, because of the error fixed by
    commit 14afc5936197 ("block, bfq: fix overwrite of bfq_group pointer
    in bfq_find_set_group()"), all (wrongly collapsed) entity trees happened
    to actually have only two levels. After the latter commit, this does not
    hold any longer.

    This commit fixes this problem by modifying
    bfq_reparent_leaf_entity(), so that it searches an active leaf entity
    down the path that stems from the input entity. Such a leaf entity is
    guaranteed to exist when bfq_reparent_leaf_entity() is invoked.

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     
  • commit c8997736650060594845e42c5d01d3118aec8d25 upstream.

    A bfq_put_queue() may be invoked in __bfq_bic_change_cgroup(). The
    goal of this put is to release a process reference to a bfq_queue. But
    process-reference releases may trigger also some extra operation, and,
    to this goal, are handled through bfq_release_process_ref(). So, turn
    the invocation of bfq_put_queue() into an invocation of
    bfq_release_process_ref().

    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     

17 Apr, 2020

4 commits

  • [ Upstream commit 2f95fa5c955d0a9987ffdc3a095e2f4e62c5f2a9 ]

    In the bfq_idle_slice_timer func, reading bfqq = bfqd->in_service_queue
    is not done inside the bfqd->lock critical section. The bfqq, which is
    not NULL in bfq_idle_slice_timer, may therefore be freed after being
    passed to bfq_idle_slice_timer_body, and we would access freed memory.

    In addition, since the bfqq may be subject to a race, we should first
    check whether bfqq is still in service before doing anything to it in
    the bfq_idle_slice_timer_body func. If the racing bfqq is not in
    service, it means the bfqq has been expired through the
    __bfq_bfqq_expire func, and the wait_request flag has been cleared in
    the __bfq_bfqd_reset_in_service func. So we do not need to re-clear
    the wait_request flag of a bfqq which is not in service.
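
    A hedged sketch of the resulting shape (not the exact kernel code):
    the timer handler passes bfqd down, and bfq_idle_slice_timer_body()
    re-checks the in-service queue under bfqd's lock before touching it:

    /* in bfq_idle_slice_timer_body(bfqd, bfqq), roughly */
    spin_lock_irqsave(&bfqd->lock, flags);
    if (bfqq != bfqd->in_service_queue) {
            /* raced with expiration: wait_request was already cleared
             * in __bfq_bfqd_reset_in_service(), nothing left to do */
            spin_unlock_irqrestore(&bfqd->lock, flags);
            return;
    }
    /* ... safe to clear wait_request and expire bfqq as before ... */
    spin_unlock_irqrestore(&bfqd->lock, flags);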

    KASAN log is given as follows:
    [13058.354613] ==================================================================
    [13058.354640] BUG: KASAN: use-after-free in bfq_idle_slice_timer+0xac/0x290
    [13058.354644] Read of size 8 at addr ffffa02cf3e63f78 by task fork13/19767
    [13058.354646]
    [13058.354655] CPU: 96 PID: 19767 Comm: fork13
    [13058.354661] Call trace:
    [13058.354667] dump_backtrace+0x0/0x310
    [13058.354672] show_stack+0x28/0x38
    [13058.354681] dump_stack+0xd8/0x108
    [13058.354687] print_address_description+0x68/0x2d0
    [13058.354690] kasan_report+0x124/0x2e0
    [13058.354697] __asan_load8+0x88/0xb0
    [13058.354702] bfq_idle_slice_timer+0xac/0x290
    [13058.354707] __hrtimer_run_queues+0x298/0x8b8
    [13058.354710] hrtimer_interrupt+0x1b8/0x678
    [13058.354716] arch_timer_handler_phys+0x4c/0x78
    [13058.354722] handle_percpu_devid_irq+0xf0/0x558
    [13058.354731] generic_handle_irq+0x50/0x70
    [13058.354735] __handle_domain_irq+0x94/0x110
    [13058.354739] gic_handle_irq+0x8c/0x1b0
    [13058.354742] el1_irq+0xb8/0x140
    [13058.354748] do_wp_page+0x260/0xe28
    [13058.354752] __handle_mm_fault+0x8ec/0x9b0
    [13058.354756] handle_mm_fault+0x280/0x460
    [13058.354762] do_page_fault+0x3ec/0x890
    [13058.354765] do_mem_abort+0xc0/0x1b0
    [13058.354768] el0_da+0x24/0x28
    [13058.354770]
    [13058.354773] Allocated by task 19731:
    [13058.354780] kasan_kmalloc+0xe0/0x190
    [13058.354784] kasan_slab_alloc+0x14/0x20
    [13058.354788] kmem_cache_alloc_node+0x130/0x440
    [13058.354793] bfq_get_queue+0x138/0x858
    [13058.354797] bfq_get_bfqq_handle_split+0xd4/0x328
    [13058.354801] bfq_init_rq+0x1f4/0x1180
    [13058.354806] bfq_insert_requests+0x264/0x1c98
    [13058.354811] blk_mq_sched_insert_requests+0x1c4/0x488
    [13058.354818] blk_mq_flush_plug_list+0x2d4/0x6e0
    [13058.354826] blk_flush_plug_list+0x230/0x548
    [13058.354830] blk_finish_plug+0x60/0x80
    [13058.354838] read_pages+0xec/0x2c0
    [13058.354842] __do_page_cache_readahead+0x374/0x438
    [13058.354846] ondemand_readahead+0x24c/0x6b0
    [13058.354851] page_cache_sync_readahead+0x17c/0x2f8
    [13058.354858] generic_file_buffered_read+0x588/0xc58
    [13058.354862] generic_file_read_iter+0x1b4/0x278
    [13058.354965] ext4_file_read_iter+0xa8/0x1d8 [ext4]
    [13058.354972] __vfs_read+0x238/0x320
    [13058.354976] vfs_read+0xbc/0x1c0
    [13058.354980] ksys_read+0xdc/0x1b8
    [13058.354984] __arm64_sys_read+0x50/0x60
    [13058.354990] el0_svc_common+0xb4/0x1d8
    [13058.354994] el0_svc_handler+0x50/0xa8
    [13058.354998] el0_svc+0x8/0xc
    [13058.354999]
    [13058.355001] Freed by task 19731:
    [13058.355007] __kasan_slab_free+0x120/0x228
    [13058.355010] kasan_slab_free+0x10/0x18
    [13058.355014] kmem_cache_free+0x288/0x3f0
    [13058.355018] bfq_put_queue+0x134/0x208
    [13058.355022] bfq_exit_icq_bfqq+0x164/0x348
    [13058.355026] bfq_exit_icq+0x28/0x40
    [13058.355030] ioc_exit_icq+0xa0/0x150
    [13058.355035] put_io_context_active+0x250/0x438
    [13058.355038] exit_io_context+0xd0/0x138
    [13058.355045] do_exit+0x734/0xc58
    [13058.355050] do_group_exit+0x78/0x220
    [13058.355054] __wake_up_parent+0x0/0x50
    [13058.355058] el0_svc_common+0xb4/0x1d8
    [13058.355062] el0_svc_handler+0x50/0xa8
    [13058.355066] el0_svc+0x8/0xc
    [13058.355067]
    [13058.355071] The buggy address belongs to the object at ffffa02cf3e63e70
    which belongs to the cache bfq_queue of size 464
    [13058.355075] The buggy address is located 264 bytes inside of
    464-byte region [ffffa02cf3e63e70, ffffa02cf3e64040)
    [13058.355077] The buggy address belongs to the page:
    [13058.355083] page:ffff7e80b3cf9800 count:1 mapcount:0 mapping:ffff802db5c90780 index:0xffffa02cf3e606f0 compound_mapcount: 0
    [13058.366175] flags: 0x2ffffe0000008100(slab|head)
    [13058.370781] raw: 2ffffe0000008100 ffff7e80b53b1408 ffffa02d730c1c90 ffff802db5c90780
    [13058.370787] raw: ffffa02cf3e606f0 0000000000370023 00000001ffffffff 0000000000000000
    [13058.370789] page dumped because: kasan: bad access detected
    [13058.370791]
    [13058.370792] Memory state around the buggy address:
    [13058.370797] ffffa02cf3e63e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fb fb
    [13058.370801] ffffa02cf3e63e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370805] >ffffa02cf3e63f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370808] ^
    [13058.370811] ffffa02cf3e63f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    [13058.370815] ffffa02cf3e64000: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
    [13058.370817] ==================================================================
    [13058.370820] Disabling lock debugging due to kernel taint

    Here, we directly pass the bfqd to bfq_idle_slice_timer_body func.
    --
    V2->V3: rewrite the comment as suggested by Paolo Valente
    V1->V2: add one comment, and add Fixes and Reported-by tag.

    Fixes: aee69d78d ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
    Acked-by: Paolo Valente
    Reported-by: Wang Wang
    Signed-off-by: Zhiqiang Liu
    Signed-off-by: Feilong Lin
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Zhiqiang Liu
     
  • [ Upstream commit 30a2da7b7e225ef6c87a660419ea04d3cef3f6a7 ]

    There is a potential race between ioc_release_fn() and
    ioc_clear_queue() as shown below, due to which the kernel crash
    below is observed. It can also result in a use-after-free
    issue.

    context#1: ioc_release_fn()            context#2: __ioc_clear_queue()
                                             (gets the same icq)
    ->spin_lock(&ioc->lock);               ->spin_lock(&ioc->lock);
    ->ioc_destroy_icq(icq);
      ->list_del_init(&icq->q_node);
      ->call_rcu(&icq->__rcu_head,
                 icq_free_icq_rcu);
    ->spin_unlock(&ioc->lock);
                                           ->ioc_destroy_icq(icq);
                                             ->hlist_del_init(&icq->ioc_node);
                                           This results in the crash below,
                                           as this memory is now used by
                                           icq->__rcu_head in context#1.
                                           There is a chance that icq
                                           could be free'd as well.

    22150.386550: Unable to handle kernel write to read-only memory
    at virtual address ffffffaa8d31ca50
    ...
    Call trace:
    22150.607350: ioc_destroy_icq+0x44/0x110
    22150.611202: ioc_clear_queue+0xac/0x148
    22150.615056: blk_cleanup_queue+0x11c/0x1a0
    22150.619174: __scsi_remove_device+0xdc/0x128
    22150.623465: scsi_forget_host+0x2c/0x78
    22150.627315: scsi_remove_host+0x7c/0x2a0
    22150.631257: usb_stor_disconnect+0x74/0xc8
    22150.635371: usb_unbind_interface+0xc8/0x278
    22150.639665: device_release_driver_internal+0x198/0x250
    22150.644897: device_release_driver+0x24/0x30
    22150.649176: bus_remove_device+0xec/0x140
    22150.653204: device_del+0x270/0x460
    22150.656712: usb_disable_device+0x120/0x390
    22150.660918: usb_disconnect+0xf4/0x2e0
    22150.664684: hub_event+0xd70/0x17e8
    22150.668197: process_one_work+0x210/0x480
    22150.672222: worker_thread+0x32c/0x4c8

    Fix this by adding a new ICQ_DESTROYED flag, set in ioc_destroy_icq(),
    to indicate that this icq has already been destroyed. Also, ensure
    __ioc_clear_queue() accesses the icq within rcu_read_lock/unlock so
    that the icq doesn't get freed while it is still in use.

    Signed-off-by: Sahitya Tummala
    Co-developed-by: Pradeep P V K
    Signed-off-by: Pradeep P V K
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Sahitya Tummala
     
  • [ Upstream commit fd1bb3ae54a9a2e0c42709de861c69aa146b8955 ]

    Commit ecedd3d7e199 ("block, bfq: get extra ref to prevent a queue
    from being freed during a group move") gets an extra reference to a
    bfq_queue before possibly deactivating it (temporarily), in
    bfq_bfqq_move(). This prevents the bfq_queue from disappearing before
    being reactivated in its new group.

    Yet, the bfq_queue may also be expired (i.e., its service may be
    stopped) before the bfq_queue is deactivated. And also an expiration
    may lead to a premature freeing. This commit fixes this issue by
    simply moving forward the getting of the extra reference already
    introduced by commit ecedd3d7e199 ("block, bfq: get extra ref to
    prevent a queue from being freed during a group move").

    Reported-by: cki-project@redhat.com
    Tested-by: cki-project@redhat.com
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Paolo Valente
     
  • [ Upstream commit e74d93e96d721c4297f2a900ad0191890d2fc2b0 ]

    Field bdi->io_pages added in commit 9491ae4aade6 ("mm: don't cap request
    size based on read-ahead setting") removes unneeded split of read requests.

    Stacked drivers do not call blk_queue_max_hw_sectors(). Instead they set
    limits of their devices by blk_set_stacking_limits() + disk_stack_limits().
    Field bdi->io_pages stays zero until the user sets max_sectors_kb via
    sysfs.

    This patch updates io_pages after merging limits in disk_stack_limits().

    Commit c6d6e9b0f6b4 ("dm: do not allow readahead to limit IO size") fixed
    the same problem for device-mapper devices, this one fixes MD RAIDs.
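
    The arithmetic behind the update is simple (a minimal illustrative
    demo, assuming 4k pages; not the kernel source): max_sectors counts
    512-byte sectors, so the readahead cap in pages is max_sectors
    shifted down by PAGE_SHIFT - 9.

    #include <stdio.h>

    #define PAGE_SHIFT 12 /* assume 4k pages for the example */

    int main(void)
    {
            unsigned int max_sectors = 2560; /* 512-byte sectors = 1280 KiB */
            unsigned int io_pages = max_sectors >> (PAGE_SHIFT - 9);

            printf("io_pages = %u\n", io_pages); /* 320 pages = 1280 KiB */
            return 0;
    }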

    Fixes: 9491ae4aade6 ("mm: don't cap request size based on read-ahead setting")
    Reviewed-by: Paul Menzel
    Reviewed-by: Bob Liu
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Song Liu
    Signed-off-by: Sasha Levin

    Konstantin Khlebnikov
     

13 Apr, 2020

1 commit

  • commit 6e66b49392419f3fe134e1be583323ef75da1e4b upstream.

    blk_mq_map_queues() and multiple .map_queues() implementations expect that
    set->map[HCTX_TYPE_DEFAULT].nr_queues is set to the number of hardware
    queues. Hence set .nr_queues before calling these functions. This patch
    fixes the following kernel warning:

    WARNING: CPU: 0 PID: 2501 at include/linux/cpumask.h:137
    Call Trace:
    blk_mq_run_hw_queue+0x19d/0x350 block/blk-mq.c:1508
    blk_mq_run_hw_queues+0x112/0x1a0 block/blk-mq.c:1525
    blk_mq_requeue_work+0x502/0x780 block/blk-mq.c:775
    process_one_work+0x9af/0x1740 kernel/workqueue.c:2269
    worker_thread+0x98/0xe40 kernel/workqueue.c:2415
    kthread+0x361/0x430 kernel/kthread.c:255

    Fixes: ed76e329d74a ("blk-mq: abstract out queue map") # v5.0
    Reported-by: syzbot+d44e1b26ce5c3e77458d@syzkaller.appspotmail.com
    Signed-off-by: Bart Van Assche
    Reviewed-by: Ming Lei
    Reviewed-by: Chaitanya Kulkarni
    Cc: Johannes Thumshirn
    Cc: Hannes Reinecke
    Cc: Ming Lei
    Cc: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     

25 Mar, 2020

1 commit

  • [ Upstream commit 14afc59361976c0ba39e3a9589c3eaa43ebc7e1d ]

    The bfq_find_set_group() function takes as input a blkcg (which represents
    a cgroup) and retrieves the corresponding bfq_group, then it updates the
    bfq internal group hierarchy (see comments inside the function for why
    this is needed) and finally it returns the bfq_group.
    In the hierarchy update cycle, the pointer holding the correct bfq_group
    that has to be returned is mistakenly used to traverse the hierarchy
    bottom to top, meaning that in each iteration it gets overwritten with the
    parent of the current group. Since the update cycle stops at root's
    children (depth = 2), the overwrite becomes a problem only if the blkcg
    describes a cgroup at a hierarchy level deeper than that (depth > 2). In
    this case the root's child that happens to be also an ancestor of the
    correct bfq_group is returned. The main consequence is that processes
    contained in a cgroup at depth greater than 2 are wrongly placed in the
    group described above by BFQ.

    This commit fixes this problem by using a different bfq_group pointer in
    the update cycle in order to avoid the overwrite of the variable holding
    the original group reference.
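
    A minimal sketch of the fix's shape (illustrative only; group_for(),
    update_group(), parent_group() and at_root_child() are placeholder
    helpers, not kernel functions):

    struct bfq_group *bfqg = group_for(blkcg);  /* value to return */
    struct bfq_group *curr = bfqg;              /* cursor for the walk */

    while (!at_root_child(curr)) {
            update_group(curr);
            curr = parent_group(curr); /* previously this clobbered 'bfqg' */
    }
    return bfqg;                       /* the original group, intact */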

    Reported-by: Kwon Je Oh
    Signed-off-by: Carlo Nonato
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Carlo Nonato
     

21 Mar, 2020

2 commits

  • [ Upstream commit cc3200eac4c5eb11c3f34848a014d1f286316310 ]

    commit 01e99aeca397 ("blk-mq: insert passthrough request into
    hctx->dispatch directly") may change to add flush request to the tail
    of dispatch by applying the 'add_head' parameter of
    blk_mq_sched_insert_request.

    Turns out this causes a performance regression on NCQ controllers,
    because flush is a non-NCQ command which can't be queued while there
    is any in-flight NCQ command. When a flush rq is added to the front of
    hctx->dispatch, it is easier for extra time to be added to the flush
    rq's latency (compared with adding it to the tail of the dispatch
    queue) because of S_SCHED_RESTART; the chance of flush merging is then
    increased, and fewer flush requests may be issued to the controller.

    So always insert the flush request to the front of the dispatch queue,
    just like before commit 01e99aeca397 ("blk-mq: insert passthrough
    request into hctx->dispatch directly") was applied.

    Cc: Damien Le Moal
    Cc: Shinichiro Kawasaki
    Reported-by: Shinichiro Kawasaki
    Fixes: 01e99aeca397 ("blk-mq: insert passthrough request into hctx->dispatch directly")
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     
  • [ Upstream commit 01e99aeca3979600302913cef3f89076786f32c8 ]

    For some reason, a device may get into a situation where it can't
    handle FS requests, so STS_RESOURCE is always returned and the FS
    request is added to hctx->dispatch. However, a passthrough request
    may be required at that time to fix the problem. If the passthrough
    request is added to the scheduler queue, there isn't any chance for
    blk-mq to dispatch it, given we prioritize requests in
    hctx->dispatch. Then the FS IO request may never be completed, and
    an IO hang is caused.

    So the passthrough request has to be added to hctx->dispatch directly
    to fix the IO hang.

    Fix this issue by inserting passthrough requests into hctx->dispatch
    directly, together with adding FS requests to the tail of
    hctx->dispatch in blk_mq_dispatch_rq_list(). Actually we add FS
    requests to the tail of hctx->dispatch by default; see
    blk_mq_request_bypass_insert().

    Then it becomes consistent with the original legacy IO request path,
    in which passthrough requests are always added to q->queue_head.

    Cc: Dongli Zhang
    Cc: Christoph Hellwig
    Cc: Ewan D. Milne
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Ming Lei
     

12 Mar, 2020

4 commits

  • commit 4d8340d0d4d90e7ca367d18ec16c2fefa89a339c upstream.

    ifdefs around gets and puts of bfq groups reduce readability, remove them.

    Tested-by: Oleksandr Natalenko
    Reported-by: Jens Axboe
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     
  • commit db37a34c563bf4692b36990ae89005c031385e52 upstream.

    BFQ schedules generic entities, which may represent either bfq_queues
    or groups of bfq_queues. When an entity is inserted into a service
    tree, a reference must be taken, to make sure that the entity does not
    disappear while still referred in the tree. Unfortunately, such a
    reference is mistakenly taken only if the entity represents a
    bfq_queue. This commit takes a reference also in case the entity
    represents a group.

    Tested-by: Oleksandr Natalenko
    Tested-by: Chris Evich
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Paolo Valente
     
  • [ Upstream commit 32c59e3a9a5a0b180dd015755d6d18ca31e55935 ]

    BFQ maintains an ordered list, implemented with an RB tree, of
    head-request positions of non-empty bfq_queues. This position tree,
    inherited from CFQ, is used to find bfq_queues that contain I/O close
    to each other. BFQ merges these bfq_queues into a single shared queue,
    if this boosts throughput on the device at hand.

    There is however a special-purpose bfq_queue that does not participate
    in queue merging, the oom bfq_queue. Yet, also this bfq_queue could be
    wrongly added to the position tree. So bfqq_find_close() could return
    the oom bfq_queue, which is a source of further troubles in an
    out-of-memory situation. This commit prevents the oom bfq_queue from
    being inserted into the position tree.
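
    The guard itself is simple; a hedged sketch (not the exact upstream
    diff) of the position-tree insertion path:

    /* the oom bfq_queue never participates in queue merging */
    if (bfqq == &bfqd->oom_bfqq)
            return;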

    Tested-by: Patrick Dung
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Paolo Valente
     
  • [ Upstream commit ecedd3d7e19911ab8fe42f17b77c0a30fe7f4db3 ]

    In bfq_bfqq_move(), the bfq_queue, say Q, to be moved to a new group
    may happen to be deactivated in the scheduling data structures of the
    source group (and then activated in the destination group). If Q is
    referred only by the data structures in the source group when the
    deactivation happens, then Q is freed upon the deactivation.

    This commit addresses this issue by getting an extra reference before
    the possible deactivation, and releasing this extra reference after Q
    has been moved.

    Tested-by: Chris Evich
    Tested-by: Oleksandr Natalenko
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Paolo Valente
     

24 Feb, 2020

1 commit

  • [ Upstream commit f718b093277df582fbf8775548a4f163e664d282 ]

    Commit 478de3380c1c ("block, bfq: deschedule empty bfq_queues not
    referred by any process") fixed commit 3726112ec731 ("block, bfq:
    re-schedule empty queues if they deserve I/O plugging") by
    descheduling an empty bfq_queue when it remains with no process
    reference. Yet, this still left a case uncovered: an empty bfq_queue
    with no process reference that remains in service. This happens for
    an in-service sync bfq_queue that is deemed to deserve I/O-dispatch
    plugging when it remains empty. Yet no new requests will arrive for
    such a bfq_queue if no process sends requests to it any longer. Even
    worse, the bfq_queue may happen to be prematurely freed while still in
    service (because there may remain no reference to it any longer).

    This commit solves this problem by preventing I/O dispatch from being
    plugged for the in-service bfq_queue, if the latter has no process
    reference (the bfq_queue is then prevented from remaining in service).

    Fixes: 3726112ec731 ("block, bfq: re-schedule empty queues if they deserve I/O plugging")
    Tested-by: Oleksandr Natalenko
    Reported-by: Patrick Dung
    Tested-by: Patrick Dung
    Signed-off-by: Paolo Valente
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Paolo Valente
     

26 Jan, 2020

1 commit

  • [ Upstream commit ece841abbed2da71fa10710c687c9ce9efb6bf69 ]

    7c20f11680a4 ("bio-integrity: stop abusing bi_end_io") moves
    bio_integrity_free from bio_uninit() to bio_integrity_verify_fn()
    and bio_endio(). This way looks wrong because bio may be freed
    without calling bio_endio(), for example, blk_rq_unprep_clone() is
    called from dm_mq_queue_rq() when the underlying queue of dm-mpath
    is busy.

    So memory leak of bio integrity data is caused by commit 7c20f11680a4.

    Fixes this issue by re-adding bio_integrity_free() to bio_uninit().

    Fixes: 7c20f11680a4 ("bio-integrity: stop abusing bi_end_io")
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Justin Tee

    Add commit log, and simplify/fix the original patch written by Justin.

    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Justin Tee
     

23 Jan, 2020

2 commits

  • commit c44a4edb20938c85b64a256661443039f5bffdea upstream.

    This patch fixes the following sparse warnings:

    block/bsg-lib.c:269:19: warning: incorrect type in initializer (different base types)
    block/bsg-lib.c:269:19: expected int sts
    block/bsg-lib.c:269:19: got restricted blk_status_t [usertype]
    block/bsg-lib.c:286:16: warning: incorrect type in return expression (different base types)
    block/bsg-lib.c:286:16: expected restricted blk_status_t
    block/bsg-lib.c:286:16: got int [assigned] sts

    Cc: Martin Wilck
    Fixes: d46fe2cb2dce ("block: drop device references in bsg_queue_rq()")
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Bart Van Assche
     
  • commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream.

    Logical block size has type unsigned short, so the largest
    power-of-two block size it can represent is 32768. However, there are
    architectures that can run with 64k pages (for example arm64), and on
    these architectures it may be possible to create block devices with a
    64k block size.

    For example (on an architecture with 64k pages), creating a
    dm-writecache device with a 64k block size and then mounting an ext4
    file system on it fails, because mount tries to read the superblock
    using 2-sector access:
    device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
    EXT4-fs (dm-0): unable to read superblock

    This patch changes the logical block size from unsigned short to unsigned
    int to avoid the overflow.
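
    The truncation is easy to demonstrate in isolation (illustrative C,
    unrelated to the kernel sources):

    #include <stdio.h>

    int main(void)
    {
            unsigned short old_bsize = 65536; /* wraps to 0: doesn't fit */
            unsigned int   new_bsize = 65536; /* fits after the fix */

            printf("%u %u\n", old_bsize, new_bsize); /* prints: 0 65536 */
            return 0;
    }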

    Cc: stable@vger.kernel.org
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Ming Lei
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Mikulas Patocka
     

18 Jan, 2020

1 commit

  • commit 83c9c547168e8b914ea6398430473a4de68c52cc upstream.

    Commit 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
    adds bio_truncate() for handling bio EOD. However, bio_truncate()
    doesn't use the 'op' parameter passed in by guard_bio_eod's callers.

    So bio_truncate() may retrieve the wrong 'op', and zeroing pages may
    not be done for a READ bio.

    Fix this issue by moving guard_bio_eod() after bio_set_op_attrs()
    in submit_bh_wbc() so that bio_truncate() can always retrieve the
    correct op info.

    Meanwhile, remove the 'op' parameter from guard_bio_eod() because it
    isn't used any more.

    Cc: Carlos Maiolino
    Cc: linux-fsdevel@vger.kernel.org
    Fixes: 85a8ce62c2ea ("block: add bio_truncate to fix guard_bio_eod")
    Signed-off-by: Ming Lei
    Signed-off-by: Greg Kroah-Hartman

    Fold in kerneldoc and bio_op() change.

    Signed-off-by: Jens Axboe

    Ming Lei
     

12 Jan, 2020

3 commits

  • [ Upstream commit 3b7995a98ad76da5597b488fa84aa5a56d43b608 ]

    When doing fuzz testing, I got the following memleak report:

    BUG: memory leak
    unreferenced object 0xffff88837af80000 (size 4096):
    comm "memleak", pid 3557, jiffies 4294817681 (age 112.499s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    20 00 00 00 10 01 00 00 00 00 00 00 01 00 00 00 ...............
    backtrace:
    [] bio_alloc_bioset+0x393/0x590
    [] bio_copy_user_iov+0x300/0xcd0
    [] blk_rq_map_user_iov+0x2f1/0x5f0
    [] blk_rq_map_user+0xf2/0x160
    [] sg_common_write.isra.21+0x1094/0x1870
    [] sg_write.part.25+0x5d9/0x950
    [] sg_write+0x5f/0x8c
    [] __vfs_write+0x7c/0x100
    [] vfs_write+0x1c3/0x500
    [] ksys_write+0xf9/0x200
    [] do_syscall_64+0x9f/0x4f0
    [] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    If __blk_rq_map_user_iov() fails in blk_rq_map_user_iov(),
    the bio(s) allocated before this failure will leak. The
    refcount of the bio(s) is initialized to 1 and increased to 2 by
    calling bio_get(), but __blk_rq_unmap_user() only decreases it to 1,
    so the bio cannot be freed. Fix it by calling blk_rq_unmap_user().

    Reviewed-by: Bob Liu
    Reported-by: Hulk Robot
    Signed-off-by: Yang Yingliang
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Yang Yingliang
     
  • [ Upstream commit b3c6a59975415bde29cfd76ff1ab008edbf614a9 ]

    Avoid that running test nvme/012 from the blktests suite triggers the
    following false positive lockdep complaint:

    ============================================
    WARNING: possible recursive locking detected
    5.0.0-rc3-xfstests-00015-g1236f7d60242 #841 Not tainted
    --------------------------------------------
    ksoftirqd/1/16 is trying to acquire lock:
    000000000282032e (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    but task is already holding lock:
    00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&(&fq->mq_flush_lock)->rlock);
    lock(&(&fq->mq_flush_lock)->rlock);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    1 lock held by ksoftirqd/1/16:
    #0: 00000000cbadcbc2 (&(&fq->mq_flush_lock)->rlock){..-.}, at: flush_end_io+0x4e/0x1d0

    stack backtrace:
    CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.0.0-rc3-xfstests-00015-g1236f7d60242 #841
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    dump_stack+0x67/0x90
    __lock_acquire.cold.45+0x2b4/0x313
    lock_acquire+0x98/0x160
    _raw_spin_lock_irqsave+0x3b/0x80
    flush_end_io+0x4e/0x1d0
    blk_mq_complete_request+0x76/0x110
    nvmet_req_complete+0x15/0x110 [nvmet]
    nvmet_bio_done+0x27/0x50 [nvmet]
    blk_update_request+0xd7/0x2d0
    blk_mq_end_request+0x1a/0x100
    blk_flush_complete_seq+0xe5/0x350
    flush_end_io+0x12f/0x1d0
    blk_done_softirq+0x9f/0xd0
    __do_softirq+0xca/0x440
    run_ksoftirqd+0x24/0x50
    smpboot_thread_fn+0x113/0x1e0
    kthread+0x121/0x140
    ret_from_fork+0x3a/0x50

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Signed-off-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Bart Van Assche
     
  • [ Upstream commit c58c1f83436b501d45d4050fd1296d71a9760bcb ]

    Non-mq devs do not honor REQ_NOWAIT, so give the caller a chance to
    repeat the request gracefully on an -EAGAIN error.

    The problem is well reproduced using io_uring:

    mkfs.ext4 /dev/ram0
    mount /dev/ram0 /mnt

    # Preallocate a file
    dd if=/dev/zero of=/mnt/file bs=1M count=1

    # Start fio with io_uring and get -EIO
    fio --rw=write --ioengine=io_uring --size=1M --direct=1 --name=job --filename=/mnt/file

    Signed-off-by: Roman Penyaev
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Roman Penyaev
     

09 Jan, 2020

2 commits

  • commit 21d37340912d74b1222d43c11aa9dd0687162573 upstream.

    These were added to blkdev_ioctl() in v4.20 but not blkdev_compat_ioctl,
    so add them now.
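
    A hedged sketch of the shape of such a change (not the exact patch):
    the zone ioctls take compat-safe arguments, so the compat handler can
    forward them to the native handler:

    /* in the blkdev compat ioctl switch, roughly */
    case BLKGETZONESZ:
    case BLKGETNRZONES:
            return blkdev_ioctl(bdev, mode, cmd,
                                (unsigned long)compat_ptr(arg));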

    Cc: stable@vger.kernel.org # v4.20+
    Fixes: 72cd87576d1d ("block: Introduce BLKGETZONESZ ioctl")
    Fixes: 65e4e3eee83d ("block: Introduce BLKGETNRZONES ioctl")
    Reviewed-by: Damien Le Moal
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     
  • commit 673bdf8ce0a387ef585c13b69a2676096c6edfe9 upstream.

    These were added to blkdev_ioctl() but not blkdev_compat_ioctl,
    so add them now.

    Cc: stable@vger.kernel.org # v4.10+
    Fixes: 3ed05a987e0f ("blk-zoned: implement ioctls")
    Reviewed-by: Damien Le Moal
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann