12 Dec, 2018

23 commits

  • rwb_enabled() can't be changed when there is any inflight IO.

    wbt_disable_default() may set rwb->wb_normal to zero, however the
    blk_stat timer may still be pending, and the timer function will
    update rwb->wb_normal again.

    This patch introduces blk_stat_deactivate() and applies it in
    wbt_disable_default(), then the following IO hang triggered when running
    parted & switching io scheduler can be fixed:

    [ 369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
    [ 369.938941] Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
    [ 369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [ 369.940768] parted D 0 3645 3239 0x00000000
    [ 369.941500] Call Trace:
    [ 369.941874] ? __schedule+0x6d9/0x74c
    [ 369.942392] ? wbt_done+0x5e/0x5e
    [ 369.942864] ? wbt_cleanup_cb+0x16/0x16
    [ 369.943404] ? wbt_done+0x5e/0x5e
    [ 369.943874] schedule+0x67/0x78
    [ 369.944298] io_schedule+0x12/0x33
    [ 369.944771] rq_qos_wait+0xb5/0x119
    [ 369.945193] ? karma_partition+0x1c2/0x1c2
    [ 369.945691] ? wbt_cleanup_cb+0x16/0x16
    [ 369.946151] wbt_wait+0x85/0xb6
    [ 369.946540] __rq_qos_throttle+0x23/0x2f
    [ 369.947014] blk_mq_make_request+0xe6/0x40a
    [ 369.947518] generic_make_request+0x192/0x2fe
    [ 369.948042] ? submit_bio+0x103/0x11f
    [ 369.948486] ? __radix_tree_lookup+0x35/0xb5
    [ 369.949011] submit_bio+0x103/0x11f
    [ 369.949436] ? blkg_lookup_slowpath+0x25/0x44
    [ 369.949962] submit_bio_wait+0x53/0x7f
    [ 369.950469] blkdev_issue_flush+0x8a/0xae
    [ 369.951032] blkdev_fsync+0x2f/0x3a
    [ 369.951502] do_fsync+0x2e/0x47
    [ 369.951887] __x64_sys_fsync+0x10/0x13
    [ 369.952374] do_syscall_64+0x89/0x149
    [ 369.952819] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 369.953492] RIP: 0033:0x7f95a1e729d4
    [ 369.953996] Code: Bad RIP value.
    [ 369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
    [ 369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
    [ 369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
    [ 369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
    [ 369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
    [ 369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
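
    As a hedged sketch (helper and field names assumed from blk-wbt.c,
    not taken verbatim from the patch), the fix quiesces the stat
    machinery before zeroing the limits:

        void wbt_disable_default(struct request_queue *q)
        {
                struct rq_qos *rqos = wbt_rq_qos(q);
                struct rq_wb *rwb;

                if (!rqos)
                        return;
                rwb = RQWB(rqos);
                if (rwb->enable_state == WBT_STATE_ON_DEFAULT) {
                        blk_stat_deactivate(q); /* no pending blk_stat
                                                   timer can rewrite
                                                   rwb->wb_normal now */
                        rwb->wb_normal = 0;
                }
        }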

    Cc: stable@vger.kernel.org
    Cc: Paolo Valente
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • We're missing a deferred clear off the shallow get, which can cause
    a hang. Additionally, when we resize the sbitmap, we should also
    flush deferred clears for good measure.

    Ensure we have full coverage on batch clears, even for paths where
    we would not be doing deferred clear. This makes it less error
    prone for future additions.
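
    The reworked lookup loop, as a hedged sketch of the series
    (abridged; treat names as assumptions rather than the verbatim
    patch):

        static int sbitmap_find_bit_in_index(struct sbitmap *sb, int index,
                                             unsigned int alloc_hint,
                                             bool round_robin)
        {
                int nr;

                do {
                        nr = __sbitmap_get_word(&sb->map[index].word,
                                                sb->map[index].depth,
                                                alloc_hint, !round_robin);
                        if (nr != -1)
                                break;
                        /* word looked full: flush deferred clears,
                         * then retry once more */
                        if (!sbitmap_deferred_clear(sb, index))
                                break;
                } while (1);

                return nr;
        }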

    Reported-by: Bart Van Assche
    Tested-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • When using pblk with 0-sized metadata, both the ppa list and the meta
    list point to the same memory, since pblk_dma_meta_size() returns 0
    in that case.

    This patch fixes that issue by ensuring that pblk_dma_meta_size()
    always returns space equal to sizeof(struct pblk_sec_meta), so that
    the ppa list and meta list point to different memory addresses.

    Even though in that case the drive does not really care about the
    meta_list pointer, this is the easiest way to fix the issue without
    introducing changes in many places in the code just for the 0-sized
    metadata case.

    The same approach also needs to be taken for pblk_get_sec_meta(),
    since we likewise cannot point to the same memory address in the meta
    buffer when using it for the pblk recovery process.
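
    A minimal sketch of the idea (helper shape assumed, not the
    verbatim patch): never let the per-sector metadata size collapse
    to zero:

        static inline int pblk_dma_meta_size(struct pblk *pblk)
        {
                struct nvm_geo *geo = &pblk->dev->geo;

                /* at least sizeof(struct pblk_sec_meta) per sector,
                 * so meta_list never aliases the ppa list */
                return max_t(int, sizeof(struct pblk_sec_meta),
                             geo->sos) * NVM_MAX_VLBA;
        }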

    Reported-by: Hans Holmberg
    Tested-by: Hans Holmberg
    Signed-off-by: Igor Konopko
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Igor Konopko
     
  • pblk performs recovery of open lines by storing the LBA in the per-LBA
    metadata field. Recovery therefore only works for drives that have
    this field.

    This patch adds support for packed metadata, which stores the l2p
    mapping for open lines in the last sector of every write unit,
    enabling drives without per-IO metadata to recover open lines.

    After this patch, drives with OOB size <16 bytes are able to use pblk.
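
    A hedged sketch of the gating check (helper name assumed): packed
    metadata kicks in when the OOB area cannot hold a full
    pblk_sec_meta per sector:

        static inline bool pblk_is_oob_meta_supported(struct pblk *pblk)
        {
                /* drives below this threshold use packed metadata */
                return pblk->dev->geo.sos >= sizeof(struct pblk_sec_meta);
        }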
    Signed-off-by: Igor Konopko
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Igor Konopko
     
  • Currently pblk only checks the size of I/O metadata and does not take
    into account whether this metadata is in a separate buffer or
    interleaved in a single metadata buffer.

    In reality only the first scenario is supported, and the second mode
    breaks pblk functionality during any IO operation.

    This patch prevents pblk from being instantiated in case the device
    only supports interleaved metadata.
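
    A hedged sketch of the init-time guard, assuming the geometry
    reports interleaved (extended-sector) metadata as geo->ext:

        if (geo->ext) {
                pblk_err(pblk, "extended metadata not supported\n");
                return -EINVAL;
        }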

    Reviewed-by: Javier González
    Signed-off-by: Igor Konopko
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Igor Konopko
     
  • Currently lightnvm and pblk use a single DMA pool, for which the entry
    size is always equal to PAGE_SIZE. The contents of each entry allocated
    from the DMA pool consist of a PPA list (8 bytes * 64), leaving
    56 bytes * 64 of space for metadata. Since the metadata field can be
    bigger, such as 128 bytes, the static size does not cover this use case.

    This patch adds support for I/O metadata above 56 bytes by changing DMA
    pool size based on device meta size and allows pblk to use OOB metadata
    >=16B.
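
    With a 4 KiB PAGE_SIZE, the fixed pool leaves 4096 - 512 = 3584
    bytes, i.e. 56 bytes per sector, for metadata; 128-byte metadata
    would need 512 + 128 * 64 = 8704 bytes. A hedged sketch of the
    geometry-driven sizing (variable names assumed):

        size_t entry_size = sizeof(u64) * NVM_MAX_VLBA  /* PPA list, 512 B */
                          + geo->sos * NVM_MAX_VLBA;    /* OOB metadata */
        size_t pool_size  = max_t(size_t, PAGE_SIZE, entry_size);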

    Reviewed-by: Javier González
    Signed-off-by: Igor Konopko
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Igor Konopko
     
  • pblk currently assumes that the size of the OOB metadata on the drive
    is always equal to the size of struct pblk_sec_meta. This commit adds
    helpers that will allow handling different sizes of OOB metadata on
    the drive in the future.

    After this patch only OOB metadata equal to 16 bytes is supported.
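
    One such helper, as a hedged sketch (name and stride assumed):

        static inline struct pblk_sec_meta *pblk_get_meta(struct pblk *pblk,
                                                          void *meta, int index)
        {
                struct nvm_geo *geo = &pblk->dev->geo;

                /* stride by the drive's real OOB size, not by
                 * sizeof(struct pblk_sec_meta) */
                return meta + max_t(int, geo->sos,
                                    sizeof(struct pblk_sec_meta)) * index;
        }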

    Reviewed-by: Javier González
    Signed-off-by: Igor Konopko
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Igor Konopko
     
  • Currently, DMA-allocated memory is reused on partial read
    for the lba_list_mem and lba_list_media arrays. In preparation
    for dynamic DMA pool sizes we need to move these arrays
    into the pblk_pr_ctx structure.
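
    A hedged sketch of the resulting context (field list assumed from
    the pblk sources of that era):

        struct pblk_pr_ctx {
                struct bio *orig_bio;
                DECLARE_BITMAP(bitmap, NVM_MAX_VLBA);
                unsigned int orig_nr_secs;
                unsigned int bio_init_idx;
                void *ppa_ptr;
                dma_addr_t dma_ppa_list;
                u64 lba_list_mem[NVM_MAX_VLBA];   /* moved in here */
                u64 lba_list_media[NVM_MAX_VLBA]; /* moved in here */
        };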

    Reviewed-by: Javier González
    Signed-off-by: Igor Konopko
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Igor Konopko
     
  • The current kref implementation around pblk global caches triggers a
    false positive on refcount_inc_checked() (when called) as the kref is
    initialized to 0. Instead of using kref_get() on a 0 reference, which is
    in principle correct, use kref_init() to avoid the check. This is also
    more explicit about what actually happens on cache creation.

    In the process, do a small refactoring to use kref helpers.
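
    A hedged sketch of the refactored get path (names assumed, not the
    verbatim patch):

        static int pblk_get_global_caches(void)
        {
                int ret = 0;

                mutex_lock(&pblk_caches.mutex);
                if (kref_get_unless_zero(&pblk_caches.kref))
                        goto out;               /* caches already exist */

                ret = pblk_create_global_caches();
                if (!ret)
                        kref_init(&pblk_caches.kref);   /* refcount = 1 */
        out:
                mutex_unlock(&pblk_caches.mutex);
                return ret;
        }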

    Fixes: 1864de94ec9d6 ("lightnvm: pblk: stop recreating global caches")
    Signed-off-by: Javier González
    Reviewed-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Currently the geometry of an OCSSD is enumerated using a two-step
    approach:

    First, nvm_register is called and the OCSSD identify command is
    issued. Second, the geometry sos and csecs values are read, either
    from the OCSSD identify data if it is a 1.2 drive, or from the NVMe
    namespace data structure if it is a 2.0 device.

    This patch combines this into a single step, such that nvm_register
    can use the csecs and sos fields independently of which version is
    used. This enables one to dynamically size the lightnvm subsystem
    DMA pool.

    Reviewed-by: Igor Konopko
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Matias Bjørling
     
  • pblk's recovery path is single threaded and therefore a number of
    assumptions regarding concurrency can be made. To avoid confusion, make
    this explicit with a couple of comments in the code.

    Signed-off-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Javier González
     
  • Protect the list_add on the pblk_line_init_bb() error
    path in case this code is used for some other purpose
    in the future.

    Signed-off-by: Hua Su
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hua Su
     
  • Signed-off-by: Hua Su
    Updated description.
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hua Su
     
  • Remove the call to pblk_line_replace_data as it returns
    directly because we have not set l_mg->data_next yet.

    Signed-off-by: Hans Holmberg
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • The chunk metadata is allocated with vmalloc, so we need to use
    vfree to free it.
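
    The rule in miniature (illustrative only):

        struct nvm_chk_meta *meta = vmalloc(size);

        if (!meta)
                return -ENOMEM;
        /* ... use meta ... */
        vfree(meta);    /* vmalloc memory: vfree(), never kfree() */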

    Fixes: 090ee26fd512 ("lightnvm: use internal allocation for chunk log page")
    Signed-off-by: Hans Holmberg
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • ADDR_POOL_SIZE is not used anymore, so remove the macro.

    Signed-off-by: Hans Holmberg
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • In a worst-case scenario (random writes), OP% of sectors
    in each line will be invalid, and we will then need
    to move data out of 100/OP% lines to free a single line.

    So, to prevent the possibility of running out of lines,
    temporarily block user writes when there are fewer than
    100/OP% free lines.

    Also ensure that pblk creation does not produce instances
    with insufficient over-provisioning.

    Insufficient over-provisioning is not a problem on real hardware,
    but often an issue when running QEMU simulations (with few lines).
    100 lines is enough to create a sane instance with the standard
    (11%) over-provisioning.
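
    As a worked example: with the standard 11% over-provisioning, a
    worst-case line is only 11% invalid, so reclaiming one line's worth
    of free space can require garbage-collecting roughly 100/11 ≈ 9
    lines; user writes are therefore paused while fewer than about 9
    free lines remain.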

    Signed-off-by: Hans Holmberg
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • If mapping fails (i.e. when running out of lines), handle the error
    and stop writing.

    Signed-off-by: Hans Holmberg
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Lines inflicted with write errors might be recovered
    if they have not been recycled after write error garbage collection.

    Ensure that the emeta accounting of valid lbas is correct
    for such lines to avoid recovery inconsistencies.

    Signed-off-by: Hans Holmberg
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • Make sure we only look up valid lba addresses on the resubmission path.

    If an lba is invalidated in the write buffer, that sector will be
    submitted to disk (as it is already mapped to a ppa), and that write
    might fail, resulting in a crash when trying to look up the lba in the
    mapping table (as the lba is marked as invalid).

    Signed-off-by: Hans Holmberg
    Reviewed-by: Javier González
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • The check for chunk closes suffers from an off-by-one issue, leading
    to chunk close events not being traced.

    Fixes: 4c44abf43d00 ("lightnvm: pblk: add trace events for chunk states")
    Signed-off-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Hans Holmberg
     
  • With gcc 4.1:

    drivers/lightnvm/core.c: In function ‘nvm_get_bb_meta’:
    drivers/lightnvm/core.c:977: warning: ‘ret’ may be used uninitialized in this function

    and

    drivers/nvme/host/lightnvm.c: In function ‘nvme_nvm_get_chk_meta’:
    drivers/nvme/host/lightnvm.c:580: warning: ‘ret’ may be used uninitialized in this function

    Indeed, if (for the former) the number of channels or LUNs is zero, or
    (for both) the passed number of chunks is zero, ret will be returned
    uninitialized.

    Fix this by preinitializing ret to zero.
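
    The shape of the fix, as a hedged sketch (fetch_one_chunk_log() is
    a hypothetical stand-in for the real callee):

        int ret = 0;    /* was uninitialized */

        while (nchks > 0) {
                ret = fetch_one_chunk_log(dev, buf);    /* hypothetical */
                if (ret)
                        break;
                nchks--;
        }
        return ret;     /* 0, not garbage, when nchks started at 0 */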

    Fixes: aff3fb18f957de93 ("lightnvm: move bad block and chunk state logic to core")
    Fixes: a294c199455187d1 ("lightnvm: implement get log report chunk helpers")
    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Geert Uytterhoeven
     
  • The smeta area l2p mapping is empty, and the recovery
    procedure actually only needs to restore the data sectors' l2p
    mapping. So ignore the smeta OOB scan.

    Signed-off-by: Zhoujie Wu
    Reviewed-by: Javier González
    Reviewed-by: Hans Holmberg
    Signed-off-by: Matias Bjørling
    Signed-off-by: Jens Axboe

    Zhoujie Wu
     

11 Dec, 2018

5 commits

  • The md->wait waitqueue is used by both bio-based and request-based DM.
    Commit dbd3bbd291 ("dm rq: leverage blk_mq_queue_busy() to check for
    outstanding IO") lost sight of the requirement that
    dm_wait_for_completion() must work with all types of DM devices.

    Fix md_in_flight() to call the blk-mq or bio-based method accordingly.
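
    A hedged sketch of the dispatch (blk_mq_queue_busy() is named above;
    queue_is_mq() and md_in_flight_bios() are assumed from the same
    series):

        /* route the inflight check by DM device type */
        static bool md_in_flight(struct mapped_device *md)
        {
                if (queue_is_mq(md->queue))             /* request-based */
                        return blk_mq_queue_busy(md->queue);
                return md_in_flight_bios(md);           /* bio-based */
        }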

    Fixes: dbd3bbd291 ("dm rq: leverage blk_mq_queue_busy() to check for outstanding IO")
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • Guenter reported a boot hang issue on HPPA after we defaulted to 0
    poll queues. We have two issues in the queue count calculations:

    1) We don't separate the poll queues from the read/write queues. This is
    important, since the former doesn't need interrupts.
    2) The adjust logic is broken.

    Adjust the poll queue count before doing nvme_calc_io_queues(). The poll
    queue count is only limited by the IO queue count we were able to get
    from the controller, not failures in the IRQ allocation loop. This
    leaves nvme_calc_io_queues() just adjusting the read/write queue map.

    Reported-by: Guenter Roeck
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • After switching to percpu inflight counters, the inflight check
    is totally buggy. It's perfectly valid for some counters to be
    non-zero while having a total inflight IO count of 0, that's how
    these kinds of counters work (inc on one CPU, dec on another).
    Fix the md_in_flight() check to sum all counters before potentially
    returning a false positive.

    While at it, remove the inflight read for IO completion. We don't
    need it, just wake anyone that's waiting for the IO count to drop
    to zero. The caller needs to re-check that value anyway when woken,
    which it does.
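
    A hedged sketch of the corrected summation (helper names assumed
    from the series):

        /* inc on one CPU, dec on another: single counters can be
         * non-zero, or even negative; only the total means anything */
        static bool md_in_flight_bios(struct mapped_device *md)
        {
                struct hd_struct *part = &dm_disk(md)->part0;
                long sum = 0;
                int cpu;

                for_each_possible_cpu(cpu) {
                        sum += part_stat_local_read_cpu(part, in_flight[0], cpu);
                        sum += part_stat_local_read_cpu(part, in_flight[1], cpu);
                }
                return sum != 0;
        }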

    Fixes: 6f75723190d8 ("dm: remove the pending IO accounting")
    Acked-by: Mike Snitzer
    Reported-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • For cases where we can only fail with IO in-flight, we should be using
    BLK_STS_DEV_RESOURCE instead of BLK_STS_RESOURCE. The latter refers to
    system wide resource constraints.
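
    An illustration of the rule with hypothetical helpers (neither
    function exists; they only stand in for the two failure classes):

        if (!dev_slot_available(dd))          /* device-local shortage */
                return BLK_STS_DEV_RESOURCE;  /* in-flight completion
                                                 re-runs the queue */
        if (!system_resource_available())     /* system-wide shortage */
                return BLK_STS_RESOURCE;      /* blk-mq retries after
                                                 a delay */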

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • The "cmd_slot_unal" semaphore is never used in a blocking way
    but only as an atomic counter. Change the code to using
    atomic_dec_if_positive() as a better API.
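
    A hedged sketch of the pattern (field placement assumed):

        /* atomic_dec_if_positive() only decrements when the result
         * stays >= 0; a negative return means no slot was taken */
        if (atomic_dec_if_positive(&dd->cmd_slot_unal) < 0)
                return BLK_STS_DEV_RESOURCE;    /* no unaligned slot */

        /* ... and on command completion, the slot is released: */
        atomic_inc(&dd->cmd_slot_unal);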

    Signed-off-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Arnd Bergmann
     

10 Dec, 2018

12 commits

  • Remove the "pending" atomic counters, that duplicate block-core's
    in_flight counters, and update md_in_flight() to look at percpu
    in_flight counters.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • The previous patches deleted all the code that needed the second value
    returned from part_in_flight - now the kernel only uses the first value.

    Consequently, part_in_flight (and blk_mq_in_flight) may be changed so
    that they only return one value.

    This patch just refactors the code, there's no functional change.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • Now that part_round_stats is gone, we can switch to per-cpu in-flight
    counters.

    We use the local-atomic type local_t, so that if part_inc_in_flight or
    part_dec_in_flight is reentrantly called from an interrupt, the value will
    be correct.

    The other counters could be corrupted due to reentrant interrupt, but the
    corruption only results in slight counter skew - the in_flight counter
    must be exact, so it needs local_t.
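
    A hedged sketch of the increment side (helper names assumed from
    the series):

        /* part_stat_local_inc() is a local_inc() on this CPU's
         * counter: a single RMW that a reentrant interrupt on the
         * same CPU cannot tear, unlike a plain load/modify/store */
        static inline void part_inc_in_flight(struct hd_struct *part, int rw)
        {
                part_stat_local_inc(part, in_flight[rw]);
        }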

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • We want to convert to per-cpu in_flight counters.

    The function part_round_stats needs the in_flight counter every
    jiffy; it would be too costly to sum all the percpu variables every
    jiffy, so the function must be deleted. part_round_stats is used to
    calculate two counters - time_in_queue and io_ticks.

    time_in_queue can be calculated without part_round_stats, by adding the
    duration of the I/O when the I/O ends (the value is almost as exact as the
    previously calculated value, except that time for in-progress I/Os is not
    counted).

    io_ticks can be approximated by increasing the value when I/O is started
    or ended and the jiffies value has changed. If the I/Os take less than a
    jiffy, the value is as exact as the previously calculated value. If the
    I/Os take more than a jiffy, io_ticks can drift behind the previously
    calculated value.
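
    A hedged sketch of the io_ticks approximation (field and helper
    names assumed):

        /* credit at most one jiffy per jiffy boundary observed by a
         * start or completion event; sub-jiffy IOs stay exact, long
         * IOs may under-count */
        static void update_io_ticks(struct hd_struct *part, unsigned long now)
        {
                unsigned long stamp = READ_ONCE(part->stamp);

                if (stamp != now && cmpxchg(&part->stamp, stamp, now) == stamp)
                        part_stat_inc(part, io_ticks);
        }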

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • All of the part_stat_* and related methods are used with preemption
    disabled, so there is no need to pass the cpu around to all of them.
    Just call smp_processor_id() as needed.
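
    An illustrative before/after (the stat field is arbitrary):

        part_stat_inc(cpu, part, ios[rw]);  /* old: caller supplied cpu */
        part_stat_inc(part, ios[rw]);       /* new: smp_processor_id()
                                               is called inside */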

    Suggested-by: Jens Axboe
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • Now that request-based dm-multipath only supports blk-mq, make use of
    the newly introduced blk_mq_queue_busy() to check for outstanding IO --
    rather than (ab)using the block core's in_flight counters.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • generic_start_io_acct and generic_end_io_acct already update the variable
    in_flight using atomic operations, so we don't have to overwrite them
    again.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     
  • Pull in v4.20-rc6 to resolve the conflict in NVMe, but also to get the
    two corruption fixes. We're going to be overhauling the direct dispatch
    path, and we need to do that on top of the changes we made for that
    in mainline.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Ming reports that lockdep spews the following trace. What this
    essentially says is that the sbitmap swap_lock was used inconsistently
    in IRQ-enabled and IRQ-disabled context, and that is usually indicative
    of a bug that will cause a deadlock.

    For this case, it's a false positive. The swap_lock is used from process
    context only, when we swap the bits in the word and cleared mask. We
    also end up doing that when we are getting a driver tag, from the
    blk_mq_mark_tag_wait(), and from there we hold the waitqueue lock with
    IRQs disabled. However, this isn't from an actual IRQ, it's still
    process context.

    In lieu of a better way to fix this, simply always disable interrupts
    when grabbing the swap_lock if lockdep is enabled.

    [ 100.967642] ================start test sanity/001================
    [ 101.238280] null: module loaded
    [ 106.093735]
    [ 106.094012] =====================================================
    [ 106.094854] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
    [ 106.095759] 4.20.0-rc3_5d2ee7122c73_for-next+ #1 Not tainted
    [ 106.096551] -----------------------------------------------------
    [ 106.097386] fio/1043 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
    [ 106.098231] 000000004c43fa71 (&(&sb->map[i].swap_lock)->rlock){+.+.}, at: sbitmap_get+0xd5/0x22c
    [ 106.099431]
    [ 106.099431] and this task is already holding:
    [ 106.100229] 000000007eec8b2f (&(&hctx->dispatch_wait_lock)->rlock){....}, at: blk_mq_dispatch_rq_list+0x4c1/0xd7c
    [ 106.101630] which would create a new lock dependency:
    [ 106.102326] (&(&hctx->dispatch_wait_lock)->rlock){....} -> (&(&sb->map[i].swap_lock)->rlock){+.+.}
    [ 106.103553]
    [ 106.103553] but this new dependency connects a SOFTIRQ-irq-safe lock:
    [ 106.104580] (&sbq->ws[i].wait){..-.}
    [ 106.104582]
    [ 106.104582] ... which became SOFTIRQ-irq-safe at:
    [ 106.105751] _raw_spin_lock_irqsave+0x4b/0x82
    [ 106.106284] __wake_up_common_lock+0x119/0x1b9
    [ 106.106825] sbitmap_queue_wake_up+0x33f/0x383
    [ 106.107456] sbitmap_queue_clear+0x4c/0x9a
    [ 106.108046] __blk_mq_free_request+0x188/0x1d3
    [ 106.108581] blk_mq_free_request+0x23b/0x26b
    [ 106.109102] scsi_end_request+0x345/0x5d7
    [ 106.109587] scsi_io_completion+0x4b5/0x8f0
    [ 106.110099] scsi_finish_command+0x412/0x456
    [ 106.110615] scsi_softirq_done+0x23f/0x29b
    [ 106.111115] blk_done_softirq+0x2a7/0x2e6
    [ 106.111608] __do_softirq+0x360/0x6ad
    [ 106.112062] run_ksoftirqd+0x2f/0x5b
    [ 106.112499] smpboot_thread_fn+0x3a5/0x3db
    [ 106.113000] kthread+0x1d4/0x1e4
    [ 106.113457] ret_from_fork+0x3a/0x50
    [ 106.113969]
    [ 106.113969] to a SOFTIRQ-irq-unsafe lock:
    [ 106.114672] (&(&sb->map[i].swap_lock)->rlock){+.+.}
    [ 106.114674]
    [ 106.114674] ... which became SOFTIRQ-irq-unsafe at:
    [ 106.116000] ...
    [ 106.116003] _raw_spin_lock+0x33/0x64
    [ 106.116676] sbitmap_get+0xd5/0x22c
    [ 106.117134] __sbitmap_queue_get+0xe8/0x177
    [ 106.117731] __blk_mq_get_tag+0x1e6/0x22d
    [ 106.118286] blk_mq_get_tag+0x1db/0x6e4
    [ 106.118756] blk_mq_get_driver_tag+0x161/0x258
    [ 106.119383] blk_mq_dispatch_rq_list+0x28e/0xd7c
    [ 106.120043] blk_mq_do_dispatch_sched+0x23a/0x287
    [ 106.120607] blk_mq_sched_dispatch_requests+0x379/0x3fc
    [ 106.121234] __blk_mq_run_hw_queue+0x137/0x17e
    [ 106.121781] __blk_mq_delay_run_hw_queue+0x80/0x25f
    [ 106.122366] blk_mq_run_hw_queue+0x151/0x187
    [ 106.122887] blk_mq_sched_insert_requests+0x13f/0x175
    [ 106.123492] blk_mq_flush_plug_list+0x7d6/0x81b
    [ 106.124042] blk_flush_plug_list+0x392/0x3d7
    [ 106.124557] blk_finish_plug+0x37/0x4f
    [ 106.125019] read_pages+0x3ef/0x430
    [ 106.125446] __do_page_cache_readahead+0x18e/0x2fc
    [ 106.126027] force_page_cache_readahead+0x121/0x133
    [ 106.126621] page_cache_sync_readahead+0x35f/0x3bb
    [ 106.127229] generic_file_buffered_read+0x410/0x1860
    [ 106.127932] __vfs_read+0x319/0x38f
    [ 106.128415] vfs_read+0xd2/0x19a
    [ 106.128817] ksys_read+0xb9/0x135
    [ 106.129225] do_syscall_64+0x140/0x385
    [ 106.129684] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 106.130292]
    [ 106.130292] other info that might help us debug this:
    [ 106.130292]
    [ 106.131226] Chain exists of:
    [ 106.131226]   &sbq->ws[i].wait --> &(&hctx->dispatch_wait_lock)->rlock --> &(&sb->map[i].swap_lock)->rlock
    [ 106.131226]
    [ 106.132865] Possible interrupt unsafe locking scenario:
    [ 106.132865]
    [ 106.133659]        CPU0                    CPU1
    [ 106.134194]        ----                    ----
    [ 106.134733]   lock(&(&sb->map[i].swap_lock)->rlock);
    [ 106.135318]                                local_irq_disable();
    [ 106.136014]                                lock(&sbq->ws[i].wait);
    [ 106.136747]                                lock(&(&hctx->dispatch_wait_lock)->rlock);
    [ 106.137742]   <Interrupt>
    [ 106.138110]     lock(&sbq->ws[i].wait);
    [ 106.138625]
    [ 106.138625] *** DEADLOCK ***
    [ 106.138625]
    [ 106.139430] 3 locks held by fio/1043:
    [ 106.139947] #0: 0000000076ff0fd9 (rcu_read_lock){....}, at: hctx_lock+0x29/0xe8
    [ 106.140813] #1: 000000002feb1016 (&sbq->ws[i].wait){..-.}, at: blk_mq_dispatch_rq_list+0x4ad/0xd7c
    [ 106.141877] #2: 000000007eec8b2f (&(&hctx->dispatch_wait_lock)->rlock){....}, at: blk_mq_dispatch_rq_list+0x4c1/0xd7c
    [ 106.143267]
    [ 106.143267] the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
    [ 106.144351] -> (&sbq->ws[i].wait){..-.} ops: 82 {
    [ 106.144926] IN-SOFTIRQ-W at:
    [ 106.145314] _raw_spin_lock_irqsave+0x4b/0x82
    [ 106.146042] __wake_up_common_lock+0x119/0x1b9
    [ 106.146785] sbitmap_queue_wake_up+0x33f/0x383
    [ 106.147567] sbitmap_queue_clear+0x4c/0x9a
    [ 106.148379] __blk_mq_free_request+0x188/0x1d3
    [ 106.149148] blk_mq_free_request+0x23b/0x26b
    [ 106.149864] scsi_end_request+0x345/0x5d7
    [ 106.150546] scsi_io_completion+0x4b5/0x8f0
    [ 106.151367] scsi_finish_command+0x412/0x456
    [ 106.152157] scsi_softirq_done+0x23f/0x29b
    [ 106.152855] blk_done_softirq+0x2a7/0x2e6
    [ 106.153537] __do_softirq+0x360/0x6ad
    [ 106.154280] run_ksoftirqd+0x2f/0x5b
    [ 106.155020] smpboot_thread_fn+0x3a5/0x3db
    [ 106.155828] kthread+0x1d4/0x1e4
    [ 106.156526] ret_from_fork+0x3a/0x50
    [ 106.157267] INITIAL USE at:
    [ 106.157713] _raw_spin_lock_irqsave+0x4b/0x82
    [ 106.158542] prepare_to_wait_exclusive+0xa8/0x215
    [ 106.159421] blk_mq_get_tag+0x34f/0x6e4
    [ 106.160186] blk_mq_get_request+0x48e/0xaef
    [ 106.160997] blk_mq_make_request+0x27e/0xbd2
    [ 106.161828] generic_make_request+0x4d1/0x873
    [ 106.162661] submit_bio+0x20c/0x253
    [ 106.163379] mpage_bio_submit+0x44/0x4b
    [ 106.164142] mpage_readpages+0x3c2/0x407
    [ 106.164919] read_pages+0x13a/0x430
    [ 106.165633] __do_page_cache_readahead+0x18e/0x2fc
    [ 106.166530] force_page_cache_readahead+0x121/0x133
    [ 106.167439] page_cache_sync_readahead+0x35f/0x3bb
    [ 106.168337] generic_file_buffered_read+0x410/0x1860
    [ 106.169255] __vfs_read+0x319/0x38f
    [ 106.169977] vfs_read+0xd2/0x19a
    [ 106.170662] ksys_read+0xb9/0x135
    [ 106.171356] do_syscall_64+0x140/0x385
    [ 106.172120] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 106.173051] }
    [ 106.173308] ... key at: [] __key.26481+0x0/0x40
    [ 106.174219] ... acquired at:
    [ 106.174646] _raw_spin_lock+0x33/0x64
    [ 106.175183] blk_mq_dispatch_rq_list+0x4c1/0xd7c
    [ 106.175843] blk_mq_do_dispatch_sched+0x23a/0x287
    [ 106.176518] blk_mq_sched_dispatch_requests+0x379/0x3fc
    [ 106.177262] __blk_mq_run_hw_queue+0x137/0x17e
    [ 106.177900] __blk_mq_delay_run_hw_queue+0x80/0x25f
    [ 106.178591] blk_mq_run_hw_queue+0x151/0x187
    [ 106.179207] blk_mq_sched_insert_requests+0x13f/0x175
    [ 106.179926] blk_mq_flush_plug_list+0x7d6/0x81b
    [ 106.180571] blk_flush_plug_list+0x392/0x3d7
    [ 106.181187] blk_finish_plug+0x37/0x4f
    [ 106.181737] __se_sys_io_submit+0x171/0x304
    [ 106.182346] do_syscall_64+0x140/0x385
    [ 106.182895] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 106.183607]
    [ 106.183830] -> (&(&hctx->dispatch_wait_lock)->rlock){....} ops: 1 {
    [ 106.184691] INITIAL USE at:
    [ 106.185119] _raw_spin_lock+0x33/0x64
    [ 106.185838] blk_mq_dispatch_rq_list+0x4c1/0xd7c
    [ 106.186697] blk_mq_do_dispatch_sched+0x23a/0x287
    [ 106.187551] blk_mq_sched_dispatch_requests+0x379/0x3fc
    [ 106.188481] __blk_mq_run_hw_queue+0x137/0x17e
    [ 106.189307] __blk_mq_delay_run_hw_queue+0x80/0x25f
    [ 106.190189] blk_mq_run_hw_queue+0x151/0x187
    [ 106.190989] blk_mq_sched_insert_requests+0x13f/0x175
    [ 106.191902] blk_mq_flush_plug_list+0x7d6/0x81b
    [ 106.192739] blk_flush_plug_list+0x392/0x3d7
    [ 106.193535] blk_finish_plug+0x37/0x4f
    [ 106.194269] __se_sys_io_submit+0x171/0x304
    [ 106.195059] do_syscall_64+0x140/0x385
    [ 106.195794] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 106.196705] }
    [ 106.196950] ... key at: [] __key.51231+0x0/0x40
    [ 106.197853] ... acquired at:
    [ 106.198270] lock_acquire+0x280/0x2f3
    [ 106.198806] _raw_spin_lock+0x33/0x64
    [ 106.199337] sbitmap_get+0xd5/0x22c
    [ 106.199850] __sbitmap_queue_get+0xe8/0x177
    [ 106.200450] __blk_mq_get_tag+0x1e6/0x22d
    [ 106.201035] blk_mq_get_tag+0x1db/0x6e4
    [ 106.201589] blk_mq_get_driver_tag+0x161/0x258
    [ 106.202237] blk_mq_dispatch_rq_list+0x5b9/0xd7c
    [ 106.202902] blk_mq_do_dispatch_sched+0x23a/0x287
    [ 106.203572] blk_mq_sched_dispatch_requests+0x379/0x3fc
    [ 106.204316] __blk_mq_run_hw_queue+0x137/0x17e
    [ 106.204956] __blk_mq_delay_run_hw_queue+0x80/0x25f
    [ 106.205649] blk_mq_run_hw_queue+0x151/0x187
    [ 106.206269] blk_mq_sched_insert_requests+0x13f/0x175
    [ 106.206997] blk_mq_flush_plug_list+0x7d6/0x81b
    [ 106.207644] blk_flush_plug_list+0x392/0x3d7
    [ 106.208264] blk_finish_plug+0x37/0x4f
    [ 106.208814] __se_sys_io_submit+0x171/0x304
    [ 106.209415] do_syscall_64+0x140/0x385
    [ 106.209965] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 106.210684]
    [ 106.210904]
    [ 106.210904] the dependencies between the lock to be acquired
    [ 106.210905] and SOFTIRQ-irq-unsafe lock:
    [ 106.212541] -> (&(&sb->map[i].swap_lock)->rlock){+.+.} ops: 1969 {
    [ 106.213393] HARDIRQ-ON-W at:
    [ 106.213840] _raw_spin_lock+0x33/0x64
    [ 106.214570] sbitmap_get+0xd5/0x22c
    [ 106.215282] __sbitmap_queue_get+0xe8/0x177
    [ 106.216086] __blk_mq_get_tag+0x1e6/0x22d
    [ 106.216876] blk_mq_get_tag+0x1db/0x6e4
    [ 106.217627] blk_mq_get_driver_tag+0x161/0x258
    [ 106.218465] blk_mq_dispatch_rq_list+0x28e/0xd7c
    [ 106.219326] blk_mq_do_dispatch_sched+0x23a/0x287
    [ 106.220198] blk_mq_sched_dispatch_requests+0x379/0x3fc
    [ 106.221138] __blk_mq_run_hw_queue+0x137/0x17e
    [ 106.221975] __blk_mq_delay_run_hw_queue+0x80/0x25f
    [ 106.222874] blk_mq_run_hw_queue+0x151/0x187
    [ 106.223686] blk_mq_sched_insert_requests+0x13f/0x175
    [ 106.224597] blk_mq_flush_plug_list+0x7d6/0x81b
    [ 106.225444] blk_flush_plug_list+0x392/0x3d7
    [ 106.226255] blk_finish_plug+0x37/0x4f
    [ 106.227006] read_pages+0x3ef/0x430
    [ 106.227717] __do_page_cache_readahead+0x18e/0x2fc
    [ 106.228595] force_page_cache_readahead+0x121/0x133
    [ 106.229491] page_cache_sync_readahead+0x35f/0x3bb
    [ 106.230373] generic_file_buffered_read+0x410/0x1860
    [ 106.231277] __vfs_read+0x319/0x38f
    [ 106.231986] vfs_read+0xd2/0x19a
    [ 106.232666] ksys_read+0xb9/0x135
    [ 106.233350] do_syscall_64+0x140/0x385
    [ 106.234097] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 106.235012] SOFTIRQ-ON-W at:
    [ 106.235460] _raw_spin_lock+0x33/0x64
    [ 106.236195] sbitmap_get+0xd5/0x22c
    [ 106.236913] __sbitmap_queue_get+0xe8/0x177
    [ 106.237715] __blk_mq_get_tag+0x1e6/0x22d
    [ 106.238488] blk_mq_get_tag+0x1db/0x6e4
    [ 106.239244] blk_mq_get_driver_tag+0x161/0x258
    [ 106.240079] blk_mq_dispatch_rq_list+0x28e/0xd7c
    [ 106.240937] blk_mq_do_dispatch_sched+0x23a/0x287
    [ 106.241806] blk_mq_sched_dispatch_requests+0x379/0x3fc
    [ 106.242751] __blk_mq_run_hw_queue+0x137/0x17e
    [ 106.243579] __blk_mq_delay_run_hw_queue+0x80/0x25f
    [ 106.244469] blk_mq_run_hw_queue+0x151/0x187
    [ 106.245277] blk_mq_sched_insert_requests+0x13f/0x175
    [ 106.246191] blk_mq_flush_plug_list+0x7d6/0x81b
    [ 106.247044] blk_flush_plug_list+0x392/0x3d7
    [ 106.247859] blk_finish_plug+0x37/0x4f
    [ 106.248749] read_pages+0x3ef/0x430
    [ 106.249463] __do_page_cache_readahead+0x18e/0x2fc
    [ 106.250357] force_page_cache_readahead+0x121/0x133
    [ 106.251263] page_cache_sync_readahead+0x35f/0x3bb
    [ 106.252157] generic_file_buffered_read+0x410/0x1860
    [ 106.253084] __vfs_read+0x319/0x38f
    [ 106.253808] vfs_read+0xd2/0x19a
    [ 106.254488] ksys_read+0xb9/0x135
    [ 106.255186] do_syscall_64+0x140/0x385
    [ 106.255943] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 106.256867] INITIAL USE at:
    [ 106.257300] _raw_spin_lock+0x33/0x64
    [ 106.258033] sbitmap_get+0xd5/0x22c
    [ 106.258747] __sbitmap_queue_get+0xe8/0x177
    [ 106.259542] __blk_mq_get_tag+0x1e6/0x22d
    [ 106.260320] blk_mq_get_tag+0x1db/0x6e4
    [ 106.261072] blk_mq_get_driver_tag+0x161/0x258
    [ 106.261902] blk_mq_dispatch_rq_list+0x28e/0xd7c
    [ 106.262762] blk_mq_do_dispatch_sched+0x23a/0x287
    [ 106.263626] blk_mq_sched_dispatch_requests+0x379/0x3fc
    [ 106.264571] __blk_mq_run_hw_queue+0x137/0x17e
    [ 106.265409] __blk_mq_delay_run_hw_queue+0x80/0x25f
    [ 106.266302] blk_mq_run_hw_queue+0x151/0x187
    [ 106.267111] blk_mq_sched_insert_requests+0x13f/0x175
    [ 106.268028] blk_mq_flush_plug_list+0x7d6/0x81b
    [ 106.268878] blk_flush_plug_list+0x392/0x3d7
    [ 106.269694] blk_finish_plug+0x37/0x4f
    [ 106.270432] read_pages+0x3ef/0x430
    [ 106.271139] __do_page_cache_readahead+0x18e/0x2fc
    [ 106.272040] force_page_cache_readahead+0x121/0x133
    [ 106.272932] page_cache_sync_readahead+0x35f/0x3bb
    [ 106.273811] generic_file_buffered_read+0x410/0x1860
    [ 106.274709] __vfs_read+0x319/0x38f
    [ 106.275407] vfs_read+0xd2/0x19a
    [ 106.276074] ksys_read+0xb9/0x135
    [ 106.276764] do_syscall_64+0x140/0x385
    [ 106.277500] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [ 106.278417] }
    [ 106.278676] ... key at: [] __key.26212+0x0/0x40
    [ 106.279586] ... acquired at:
    [ 106.280026] lock_acquire+0x280/0x2f3
    [ 106.280559] _raw_spin_lock+0x33/0x64
    [ 106.281101] sbitmap_get+0xd5/0x22c
    [ 106.281610] __sbitmap_queue_get+0xe8/0x177
    [ 106.282221] __blk_mq_get_tag+0x1e6/0x22d
    [ 106.282809] blk_mq_get_tag+0x1db/0x6e4
    [ 106.283368] blk_mq_get_driver_tag+0x161/0x258
    [ 106.284018] blk_mq_dispatch_rq_list+0x5b9/0xd7c
    [ 106.284685] blk_mq_do_dispatch_sched+0x23a/0x287
    [ 106.285371] blk_mq_sched_dispatch_requests+0x379/0x3fc
    [ 106.286135] __blk_mq_run_hw_queue+0x137/0x17e
    [ 106.286806] __blk_mq_delay_run_hw_queue+0x80/0x25f
    [ 106.287515] blk_mq_run_hw_queue+0x151/0x187
    [ 106.288149] blk_mq_sched_insert_requests+0x13f/0x175
    [ 106.289041] blk_mq_flush_plug_list+0x7d6/0x81b
    [ 106.289912] blk_flush_plug_list+0x392/0x3d7
    [ 106.290590] blk_finish_plug+0x37/0x4f
    [ 106.291238] __se_sys_io_submit+0x171/0x304
    [ 106.291864] do_syscall_64+0x140/0x385
    [ 106.292534] entry_SYSCALL_64_after_hwframe+0x49/0xbe
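
    A hedged sketch of the workaround in the deferred-clear path
    (structure assumed from lib/sbitmap.c, not the verbatim patch):

        unsigned long __maybe_unused flags;

        /* swap_lock is only ever taken from process context; disable
         * IRQs purely to keep lockdep's IRQ-state tracking happy */
        #ifdef CONFIG_LOCKDEP
                local_irq_save(flags);
        #endif
                spin_lock(&sb->map[index].swap_lock);
                /* swap the word and cleared mask here */
                spin_unlock(&sb->map[index].swap_lock);
        #ifdef CONFIG_LOCKDEP
                local_irq_restore(flags);
        #endif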

    Reported-by: Ming Lei
    Tested-by: Guenter Roeck
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Linus Torvalds
     
  • Pull networking fixes from David Miller:
    "A decent batch of fixes here. I'd say about half are for problems that
    have existed for a while, and half are for new regressions added in
    the 4.20 merge window.

    1) Fix 10G SFP phy module detection in mvpp2, from Baruch Siach.

    2) Revert bogus emac driver change, from Benjamin Herrenschmidt.

    3) Handle BPF exported data structure with pointers when building
    32-bit userland, from Daniel Borkmann.

    4) Memory leak fix in act_police, from Davide Caratti.

    5) Check RX checksum offload in RX descriptors properly in aquantia
    driver, from Dmitry Bogdanov.

    6) SKB unlink fix in various spots, from Edward Cree.

    7) ndo_dflt_fdb_dump() only works with ethernet, enforce this, from
    Eric Dumazet.

    8) Fix FID leak in mlxsw driver, from Ido Schimmel.

    9) IOTLB locking fix in vhost, from Jean-Philippe Brucker.

    10) Fix SKB truesize accounting in ipv4/ipv6/netfilter frag memory
    limits otherwise namespace exit can hang. From Jiri Wiesner.

    11) Address block parsing length fixes in x25 from Martin Schiller.

    12) IRQ and ring accounting fixes in bnxt_en, from Michael Chan.

    13) For tun interfaces, only iface delete works with rtnl ops, enforce
    this by disallowing add. From Nicolas Dichtel.

    14) Use after free in liquidio, from Pan Bian.

    15) Fix SKB use after passing to netif_receive_skb(), from Prashant
    Bhole.

    16) Static key accounting and other fixes in XPS from Sabrina Dubroca.

    17) Partially initialized flow key passed to ip6_route_output(), from
    Shmulik Ladkani.

    18) Fix RTNL deadlock during reset in ibmvnic driver, from Thomas
    Falcon.

    19) Several small TCP fixes (off-by-one on window probe abort, NULL
    deref in tail loss probe, SNMP mis-estimations) from Yuchung
    Cheng"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (93 commits)
    net/sched: cls_flower: Reject duplicated rules also under skip_sw
    bnxt_en: Fix _bnxt_get_max_rings() for 57500 chips.
    bnxt_en: Fix NQ/CP rings accounting on the new 57500 chips.
    bnxt_en: Keep track of reserved IRQs.
    bnxt_en: Fix CNP CoS queue regression.
    net/mlx4_core: Correctly set PFC param if global pause is turned off.
    Revert "net/ibm/emac: wrong bit is used for STA control"
    neighbour: Avoid writing before skb->head in neigh_hh_output()
    ipv6: Check available headroom in ip6_xmit() even without options
    tcp: lack of available data can also cause TSO defer
    ipv6: sr: properly initialize flowi6 prior passing to ip6_route_output
    mlxsw: spectrum_switchdev: Fix VLAN device deletion via ioctl
    mlxsw: spectrum_router: Relax GRE decap matching check
    mlxsw: spectrum_switchdev: Avoid leaking FID's reference count
    mlxsw: spectrum_nve: Remove easily triggerable warnings
    ipv4: ipv6: netfilter: Adjust the frag mem limit when truesize changes
    sctp: frag_point sanity check
    tcp: fix NULL ref in tail loss probe
    tcp: Do not underestimate rwnd_limited
    net: use skb_list_del_init() to remove from RX sublists
    ...

    Linus Torvalds
     
  • Pull x86 fixes from Ingo Molnar:
    "Three fixes: a boot parameter re-(re-)fix, a retpoline build artifact
    fix and an LLVM workaround"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/vdso: Drop implicit common-page-size linker flag
    x86/build: Fix compiler support check for CONFIG_RETPOLINE
    x86/boot: Clear RSDP address in boot_params for broken loaders

    Linus Torvalds