29 Oct, 2018

1 commit


13 Oct, 2018

1 commit

  • commit 587562d0c7cd6861f4f90a2eb811cccb1a376f5f upstream.

    trace_block_unplug() takes true for explicit unplugs and false for
    implicit unplugs. schedule() unplugs are implicit and should be
    reported as timer unplugs. While correct in the legacy code, this has
    been inverted in blk-mq since 4.11.
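
    A minimal sketch of the fix, assuming the blk-mq plug flush path mirrors
    the legacy blk_flush_plug_list() variable names:

        /*
         * In blk_mq_flush_plug_list(): from_schedule == true means an
         * implicit (schedule()) unplug, which trace_block_unplug() expects
         * as 'false', so the flag has to be inverted just like in the
         * legacy code.
         */
        trace_block_unplug(this_q, depth, !from_schedule);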

    Cc: stable@vger.kernel.org
    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ilya Dryomov
     

26 Jun, 2018

1 commit

  • commit a347c7ad8edf4c5685154f3fdc3c12fc1db800ba upstream.

    It is not allowed to reinit the q->tag_set_list list entry while an RCU
    grace period has not yet completed, otherwise the following soft lockup
    in blk_mq_sched_restart() happens:

    [ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
    [ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
    [ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
    [ 1064.256510] Call Trace:
    [ 1064.256664]
    [ 1064.256824] blk_mq_free_request+0xea/0x100
    [ 1064.256987] msg_io_conf+0x59/0xd0 [ibnbd_client]
    [ 1064.257175] complete_rdma_req+0xf2/0x230 [ibtrs_client]
    [ 1064.257340] ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
    [ 1064.257502] ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
    [ 1064.257669] ib_create_qp+0x321/0x380 [ib_core]
    [ 1064.257841] ib_process_cq_direct+0xbd/0x120 [ib_core]
    [ 1064.258007] irq_poll_softirq+0xb7/0xe0
    [ 1064.258165] __do_softirq+0x106/0x2a2
    [ 1064.258328] irq_exit+0x92/0xa0
    [ 1064.258509] do_IRQ+0x4a/0xd0
    [ 1064.258660] common_interrupt+0x7a/0x7a
    [ 1064.258818]

    Meanwhile another context frees another queue, but one with the same set
    of shared tags:

    [ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
    [ 1288.201833] bash D 0 5910 5820 0x00000000
    [ 1288.202016] Call Trace:
    [ 1288.202315] schedule+0x32/0x80
    [ 1288.202462] schedule_timeout+0x1e5/0x380
    [ 1288.203838] wait_for_completion+0xb0/0x120
    [ 1288.204137] __wait_rcu_gp+0x125/0x160
    [ 1288.204287] synchronize_sched+0x6e/0x80
    [ 1288.204770] blk_mq_free_queue+0x74/0xe0
    [ 1288.204922] blk_cleanup_queue+0xc7/0x110
    [ 1288.205073] ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
    [ 1288.205389] ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
    [ 1288.205548] kernfs_fop_write+0x109/0x180
    [ 1288.206328] vfs_write+0xb3/0x1a0
    [ 1288.206476] SyS_write+0x52/0xc0
    [ 1288.206624] do_syscall_64+0x68/0x1d0
    [ 1288.206774] entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    What happened is the following:

    1. There are several MQ queues with shared tags.
    2. One queue is about to be freed and now task is in
    blk_mq_del_queue_tag_set().
    3. Other CPU is in blk_mq_sched_restart() and loops over all queues in
    tag list in order to find hctx to restart.

    Because the linked list entry was modified in blk_mq_del_queue_tag_set()
    without properly waiting for a grace period, blk_mq_sched_restart()
    never ends, spinning in list_for_each_entry_rcu_rr(), thus the soft lockup.

    The fix is simple: reinit the list entry after an RCU grace period has
    elapsed.
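
    A minimal sketch of that fix, assuming the deletion path looks like the
    simplified blk_mq_del_queue_tag_set() below (the BLK_MQ_F_TAG_SHARED
    bookkeeping is omitted):

        static void blk_mq_del_queue_tag_set(struct request_queue *q)
        {
                struct blk_mq_tag_set *set = q->tag_set;

                mutex_lock(&set->tag_list_lock);
                list_del_rcu(&q->tag_set_list);
                mutex_unlock(&set->tag_list_lock);

                /*
                 * Readers such as blk_mq_sched_restart() may still be
                 * walking the old links under RCU; only reinit the entry
                 * once a grace period has elapsed.
                 */
                synchronize_rcu();
                INIT_LIST_HEAD(&q->tag_set_list);
        }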

    Fixes: 705cda97ee3a ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
    Cc: stable@vger.kernel.org
    Cc: Sagi Grimberg
    Cc: linux-block@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Signed-off-by: Roman Pen
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Roman Pen
     

21 Jun, 2018

1 commit

  • [ Upstream commit bf0ddaba65ddbb2715af97041da8e7a45b2d8628 ]

    When the blk-mq inflight implementation was added, /proc/diskstats was
    converted to use it, but /sys/block/$dev/inflight was not. Fix it by
    adding another helper to count in-flight requests by data direction.
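
    A sketch of such a helper, assuming the existing
    blk_mq_queue_tag_busy_iter() callback style; the names below are
    illustrative rather than the exact ones used upstream:

        struct mq_inflight {
                struct hd_struct *part;
                unsigned int *inflight;
        };

        /* Count in-flight requests for one partition, split by READ/WRITE. */
        static void blk_mq_check_inflight_rw(struct blk_mq_hw_ctx *hctx,
                                             struct request *rq, void *priv,
                                             bool reserved)
        {
                struct mq_inflight *mi = priv;

                if (rq->part == mi->part)
                        mi->inflight[rq_data_dir(rq)]++;
        }

        void blk_mq_in_flight_rw(struct request_queue *q,
                                 struct hd_struct *part,
                                 unsigned int inflight[2])
        {
                struct mq_inflight mi = { .part = part, .inflight = inflight, };

                inflight[0] = inflight[1] = 0;
                blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi);
        }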

    Fixes: f299b7c7a9de ("blk-mq: provide internal in-flight variant")
    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     

26 Apr, 2018

1 commit

  • [ Upstream commit 7df938fbc4ee641e70e05002ac67c24b19e86e74 ]

    We know this WARN_ON is harmless and in reality it may be triggered,
    so convert it to printk() and dump_stack() to avoid confusing
    people.

    Also add a comment about the two related races here.
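
    A sketch of the resulting check, assuming the run-queue path verifies the
    current CPU against hctx->cpumask (the message text is illustrative):

        /*
         * In __blk_mq_run_hw_queue(): running on a CPU outside
         * hctx->cpumask can legitimately happen around CPU hotplug,
         * so warn loudly but don't treat it as a kernel bug.
         */
        if (!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
            cpu_online(hctx->next_cpu)) {
                printk(KERN_WARNING "run queue from wrong CPU %d, hctx %s\n",
                       raw_smp_processor_id(),
                       cpumask_empty(hctx->cpumask) ? "inactive" : "active");
                dump_stack();
        }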

    Cc: Christian Borntraeger
    Cc: Stefan Haberland
    Cc: Christoph Hellwig
    Cc: Thomas Gleixner
    Cc: "jianchao.wang"
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

12 Apr, 2018

3 commits

  • [ Upstream commit 8ab0b7dc73e1b3e2987d42554b2bff503f692772 ]

    HW queues may be unmapped in some cases, such as blk_mq_update_nr_hw_queues(),
    so we need to check for that before calling blk_mq_tag_idle(); otherwise
    the following kernel oops can be triggered. Fix it by checking whether
    the hw queue is unmapped, since it doesn't make sense to idle the tags
    any more after hw queues are unmapped.

    [ 440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
    [ 440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000
    [ 440.788359] RIP: 0010:[] [] __blk_mq_tag_idle+0x24/0x40
    [ 440.798697] RSP: 0018:ffff893bf9bcbd10 EFLAGS: 00010286
    [ 440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f
    [ 440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00
    [ 440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000
    [ 440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008
    [ 440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080
    [ 440.849988] FS: 0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000
    [ 440.859955] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0
    [ 440.876169] Call Trace:
    [ 440.879818] [] blk_mq_exit_hctx+0xd8/0xe0
    [ 440.887051] [] blk_mq_free_queue+0xf0/0x160
    [ 440.894465] [] blk_cleanup_queue+0xd9/0x150
    [ 440.901881] [] nvme_ns_remove+0x5b/0xb0 [nvme_core]
    [ 440.910068] [] nvme_remove_namespaces+0x3b/0x60 [nvme_core]
    [ 440.919026] [] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma]
    [ 440.928079] [] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma]
    [ 440.937126] [] process_one_work+0x17a/0x440
    [ 440.944517] [] worker_thread+0x278/0x3c0
    [ 440.951607] [] ? manage_workers.isra.24+0x2a0/0x2a0
    [ 440.959760] [] kthread+0xcf/0xe0
    [ 440.966055] [] ? insert_kthread_work+0x40/0x40
    [ 440.973715] [] ret_from_fork+0x58/0x90
    [ 440.980586] [] ? insert_kthread_work+0x40/0x40
    [ 440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5 ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f
    [ 441.011620] RIP [] __blk_mq_tag_idle+0x24/0x40
    [ 441.019301] RSP
    [ 441.024052] CR2: 0000000000000008
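
    A minimal sketch of the check, assuming it sits in the hctx teardown path
    (blk_mq_exit_hctx()):

        /* Only idle the tags for hw queues that are actually mapped. */
        if (blk_mq_hw_queue_mapped(hctx))
                blk_mq_tag_idle(hctx);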

    Reported-by: Zhang Yi
    Tested-by: Zhang Yi
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit fb350e0ad99359768e1e80b4784692031ec340e4 ]

    In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags
    can be allocated and q->nr_hw_queues is used, so a race is inevitable; for
    example, blk_mq_init_sched() may trigger a use-after-free on an hctx that
    is freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased.

    This patch fixes the race by holding q->sysfs_lock.
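
    A sketch of the intended locking, assuming the nr_hw_queues update walks
    every queue sharing the tag set (simplified; error handling and the
    per-queue reinit are omitted):

        list_for_each_entry(q, &set->tag_list, tag_set_list) {
                /*
                 * Serialize against elevator_switch_mq(), which already
                 * runs under q->sysfs_lock via the sysfs store path.
                 */
                mutex_lock(&q->sysfs_lock);
                blk_mq_realloc_hw_ctxs(set, q);
                mutex_unlock(&q->sysfs_lock);
        }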

    Reviewed-by: Christoph Hellwig
    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     
  • [ Upstream commit 7d4901a90d02500c8011472a060f9b2e60e6e605 ]

    blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
    previous map isn't cleared yet and may point to a stale hw queue index.

    This patch fixes the issue by clearing the mapping table before setting
    it up in blk_mq_pci_map_queues().

    The following oops reported by Zhang Yi is fixed by this patch:

    [ 101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
    [ 101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
    [ 101.216346] PGD 0 P4D 0
    [ 101.219171] Oops: 0000 [#1] SMP
    [ 101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
    [ 101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
    [ 101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
    [ 101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
    [ 101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
    [ 101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
    [ 101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
    [ 101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
    [ 101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
    [ 101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
    [ 101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
    [ 101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
    [ 101.364314] FS: 0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
    [ 101.373342] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
    [ 101.387714] Call Trace:
    [ 101.390445] blk_mq_update_nr_hw_queues+0xbf/0x130
    [ 101.395791] nvme_reset_work+0x6f4/0xc06 [nvme]
    [ 101.400848] ? pick_next_task_fair+0x290/0x5f0
    [ 101.405807] ? __switch_to+0x1f5/0x430
    [ 101.409988] ? put_prev_entity+0x2f/0xd0
    [ 101.414365] process_one_work+0x141/0x340
    [ 101.418836] worker_thread+0x47/0x3e0
    [ 101.422921] kthread+0xf5/0x130
    [ 101.426424] ? rescuer_thread+0x380/0x380
    [ 101.430896] ? kthread_associate_blkcg+0x90/0x90
    [ 101.436048] ret_from_fork+0x1f/0x30
    [ 101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
    [ 101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
    [ 101.468205] CR2: 0000000000000098
    [ 101.471907] ---[ end trace 5fe710f98228a3ca ]---
    [ 101.482489] Kernel panic - not syncing: Fatal exception
    [ 101.488505] Kernel Offset: disabled
    [ 101.497752] ---[ end Kernel panic - not syncing: Fatal exception
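
    A sketch of the assumed shape of the fix: reset every CPU's entry before
    the per-hw-queue assignment loop runs, so CPUs that end up unmapped don't
    keep a stale index:

        /*
         * In blk_mq_pci_map_queues(), before walking the IRQ affinity
         * masks: clear any mapping left over from a previous setup.
         */
        for_each_possible_cpu(cpu)
                set->mq_map[cpu] = 0;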

    Reviewed-by: Christoph Hellwig
    Suggested-by: Christoph Hellwig
    Reported-by: Yi Zhang
    Tested-by: Yi Zhang
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

09 Mar, 2018

1 commit

  • commit 105976f517791aed3b11f8f53b308a2069d42055 upstream.

    __blk_mq_requeue_request() covers two cases:

    - one is that the requeued request is added to hctx->dispatch, such as
    blk_mq_dispatch_rq_list()

    - another case is that the request is requeued to io scheduler, such as
    blk_mq_requeue_request().

    We should call the I/O scheduler's .requeue_request callback only for the
    second case.
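
    A sketch of the resulting split, assuming blk_mq_requeue_request() keeps
    its current shape:

        void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list)
        {
                __blk_mq_requeue_request(rq);

                /*
                 * Only here, where the request really goes back to the I/O
                 * scheduler (not to hctx->dispatch), call the elevator's
                 * .requeue_request hook.
                 */
                blk_mq_sched_requeue_request(rq);

                BUG_ON(blk_queued_rq(rq));
                blk_mq_add_to_requeue_list(rq, true, kick_requeue_list);
        }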

    Cc: Paolo Valente
    Cc: Omar Sandoval
    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Cc: stable@vger.kernel.org
    Reviewed-by: Bart Van Assche
    Acked-by: Paolo Valente
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

03 Mar, 2018

1 commit

  • [ Upstream commit 454be724f6f99cc7e7bbf15067128be9868186c6 ]

    Now we track legacy requests with .q_usage_counter in commit 055f6e18e08f
    ("block: Make q_usage_counter also track legacy requests"), but that
    commit never runs and drains the legacy queue before waiting for this
    counter to become zero, so an IO hang is caused in the test of pulling a
    disk during IO.

    This patch fixes the issue by draining requests before waiting for
    q_usage_counter to become zero. Both Mauricio and chenxiang reported this
    issue and observed that it is fixed by this patch.
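
    A sketch of the assumed shape of the fix in the freeze path (the helper
    name blk_drain_queue() and its placement are assumptions, not a verbatim
    copy of the patch):

        void blk_freeze_queue(struct request_queue *q)
        {
                blk_freeze_queue_start(q);

                /*
                 * Legacy queues are not run by the freeze machinery, so
                 * actively drain queued requests before sleeping on the
                 * q_usage_counter, otherwise it may never reach zero.
                 */
                if (!q->mq_ops)
                        blk_drain_queue(q);

                blk_mq_freeze_queue_wait(q);
        }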

    Link: https://marc.info/?l=linux-block&m=151192424731797&w=2
    Fixes: 055f6e18e08f ("block: Make q_usage_counter also track legacy requests")
    Cc: Wen Xiong
    Tested-by: "chenxiang (M)"
    Tested-by: Mauricio Faria de Oliveira
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

12 Sep, 2017

1 commit

  • A NULL pointer crash was reported for the case of having the BFQ IO
    scheduler attached to the underlying blk-mq paths of a DM multipath
    device. The crash occurred in blk_mq_sched_insert_request()'s call to
    e->type->ops.mq.insert_requests().

    Paolo Valente correctly summarized why the crash occurred with:
    "the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
    blk_rq_prep_clone) creates a cloned request without invoking
    e->type->ops.mq.prepare_request for the target elevator e. The cloned
    request is therefore not initialized for the scheduler, but it is
    however inserted into the scheduler by blk_mq_sched_insert_request."

    All said, a request-based DM multipath device's IO scheduler should be
    the only one used -- when the original requests are issued to the
    underlying paths as cloned requests they are inserted directly in the
    underlying dispatch queue(s) rather than through an additional elevator.

    But commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
    schedulers") switched blk_insert_cloned_request() from using
    blk_mq_insert_request() to blk_mq_sched_insert_request(). Which
    incorrectly added elevator machinery into a call chain that isn't
    supposed to have any.

    To fix this introduce a blk-mq private blk_mq_request_bypass_insert()
    that blk_insert_cloned_request() calls to insert the request without
    involving any elevator that may be attached to the cloned request's
    request_queue.
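
    A sketch of such a bypass helper, assuming it simply parks the cloned
    request on the hardware queue's dispatch list and kicks the queue:

        /*
         * Insert a request directly into hctx->dispatch, bypassing any
         * elevator attached to the cloned request's request_queue.
         */
        void blk_mq_request_bypass_insert(struct request *rq)
        {
                struct blk_mq_ctx *ctx = rq->mq_ctx;
                struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);

                spin_lock(&hctx->lock);
                list_add_tail(&rq->queuelist, &hctx->dispatch);
                spin_unlock(&hctx->lock);

                blk_mq_run_hw_queue(hctx, false);
        }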

    Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
    Cc: stable@vger.kernel.org
    Reported-by: Bart Van Assche
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
     

08 Sep, 2017

1 commit

  • Pull block layer updates from Jens Axboe:
    "This is the first pull request for 4.14, containing most of the code
    changes. It's a quiet series this round, which I think we needed after
    the churn of the last few series. This contains:

    - Fix for a registration race in loop, from Anton Volkov.

    - Overflow complaint fix from Arnd for DAC960.

    - Series of drbd changes from the usual suspects.

    - Conversion of the stec/skd driver to blk-mq. From Bart.

    - A few BFQ improvements/fixes from Paolo.

    - CFQ improvement from Ritesh, allowing idling for group idle.

    - A few fixes found by Dan's smatch, courtesy of Dan.

    - A warning fixup for a race between changing the IO scheduler and
    device removal. From David Jeffery.

    - A few nbd fixes from Josef.

    - Support for cgroup info in blktrace, from Shaohua.

    - Also from Shaohua, new features in the null_blk driver to allow it
    to actually hold data, among other things.

    - Various corner cases and error handling fixes from Weiping Zhang.

    - Improvements to the IO stats tracking for blk-mq from me. Can
    drastically improve performance for fast devices and/or big
    machines.

    - Series from Christoph removing bi_bdev as being needed for IO
    submission, in preparation for nvme multipathing code.

    - Series from Bart, including various cleanups and fixes for switch
    fall through case complaints"

    * 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
    kernfs: checking for IS_ERR() instead of NULL
    drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
    drbd: Fix allyesconfig build, fix recent commit
    drbd: switch from kmalloc() to kmalloc_array()
    drbd: abort drbd_start_resync if there is no connection
    drbd: move global variables to drbd namespace and make some static
    drbd: rename "usermode_helper" to "drbd_usermode_helper"
    drbd: fix race between handshake and admin disconnect/down
    drbd: fix potential deadlock when trying to detach during handshake
    drbd: A single dot should be put into a sequence.
    drbd: fix rmmod cleanup, remove _all_ debugfs entries
    drbd: Use setup_timer() instead of init_timer() to simplify the code.
    drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
    drbd: new disk-option disable-write-same
    drbd: Fix resource role for newly created resources in events2
    drbd: mark symbols static where possible
    drbd: Send P_NEG_ACK upon write error in protocol != C
    drbd: add explicit plugging when submitting batches
    drbd: change list_for_each_safe to while(list_first_entry_or_null)
    drbd: introduce drbd_recv_header_maybe_unplug
    ...

    Linus Torvalds
     

18 Aug, 2017

1 commit

  • Since patch "blk-mq: switch .queue_rq return value to blk_status_t"
    .queue_rq() returns a BLK_STS_* value instead of a BLK_MQ_RQ_*
    value. Hence refer to the former in comments about .queue_rq()
    return values.

    Fixes: commit 39a70c76b89b ("blk-mq: clarify dispatch may not be drained/blocked by stopping queue")
    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Cc: Ming Lei
    Cc: Christoph Hellwig
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

15 Aug, 2017

1 commit

  • blk_mq_get_request() does not release the caller's queue usage counter
    when allocation fails. The caller still needs to account for its own
    queue usage when it is unable to allocate a request.
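
    A sketch of the fixed failure path, assuming blk_mq_get_request() itself
    entered the queue via blk_queue_enter_live() (simplified):

        tag = blk_mq_get_tag(data);
        if (tag == BLK_MQ_TAG_FAIL) {
                /*
                 * Balance the queue usage reference taken on entry, so the
                 * caller no longer has to drop it on allocation failure.
                 */
                blk_queue_exit(q);
                return NULL;
        }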

    Fixes: 1ad43c0078b7 ("blk-mq: don't leak preempt counter/q_usage_counter when allocating rq failed")

    Reported-by: Max Gurtovoy
    Reviewed-by: Ming Lei
    Reviewed-by: Sagi Grimberg
    Tested-by: Max Gurtovoy
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

10 Aug, 2017

3 commits

  • The blk_mq_delay_kick_requeue_list() function is used by the device
    mapper and only by the device mapper to rerun the queue and requeue
    list after a delay. This function is called once per request that
    gets requeued. Modify this function such that the queue is run once
    per path change event instead of once per request that is requeued.
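
    A sketch of the assumed change: switch to a modifying delayed-work helper
    so that a burst of calls keeps pushing one timer back and the requeue work
    runs once after the burst, instead of being scheduled per request. The
    kblockd_mod_delayed_work_on() wrapper is assumed to be the new kblockd
    front end for mod_delayed_work_on():

        void blk_mq_delay_kick_requeue_list(struct request_queue *q,
                                            unsigned long msecs)
        {
                /*
                 * mod_delayed_work() re-arms the timer on every call, so a
                 * stream of requeued requests results in a single run of
                 * the requeue work once the stream has ended.
                 */
                kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &q->requeue_work,
                                            msecs_to_jiffies(msecs));
        }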

    Fixes: commit 2849450ad39d ("blk-mq: introduce blk_mq_delay_kick_requeue_list()")
    Signed-off-by: Bart Van Assche
    Cc: Mike Snitzer
    Cc: Laurence Oberman
    Cc:
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Modify blk_mq_in_flight() to count both a partition and root at
    the same time. Then we only have to call it once, instead of
    potentially looping the tags twice.

    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We don't have to inc/dec some counter, since we can just
    iterate the tags. That makes inc/dec a noop, but means we
    have to iterate busy tags to get an in-flight count.
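
    A sketch of what the tag-iteration based counting from these two commits
    looks like, assuming the blk_mq_queue_tag_busy_iter() callback style
    (field and helper names are illustrative):

        struct mq_inflight {
                struct hd_struct *part;
                unsigned int *inflight;
        };

        static void blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
                                          struct request *rq, void *priv,
                                          bool reserved)
        {
                struct mq_inflight *mi = priv;

                /*
                 * inflight[0]: the partition asked for; inflight[1]: the
                 * whole device (the "root"), counted in the same pass.
                 */
                if (rq->part == mi->part)
                        mi->inflight[0]++;
                if (mi->part->partno)
                        mi->inflight[1]++;
        }

        void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
                              unsigned int inflight[2])
        {
                struct mq_inflight mi = { .part = part, .inflight = inflight, };

                inflight[0] = inflight[1] = 0;
                blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
        }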

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Aug, 2017

1 commit


01 Aug, 2017

1 commit

  • We recently had a bug in the IPR SCSI driver, where it would end up
    making the SCSI mid layer run the mq hardware queue with interrupts
    disabled. This isn't legal, since the software queue locking relies
    on never being grabbed from interrupt context. Additionally, drivers
    that set BLK_MQ_F_BLOCKING may schedule from this context.

    Add a WARN_ON_ONCE() to catch bad users up front.
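
    A minimal sketch of the check, assuming it goes at the top of the
    hardware-queue run path:

        /*
         * In __blk_mq_run_hw_queue(): the software-queue locks are not
         * IRQ-safe and BLK_MQ_F_BLOCKING drivers may sleep, so being
         * called from interrupt context is a caller bug.
         */
        WARN_ON_ONCE(in_interrupt());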

    Signed-off-by: Jens Axboe

    Jens Axboe
     

29 Jul, 2017

1 commit


12 Jul, 2017

1 commit

  • Pull more block updates from Jens Axboe:
    "This is a followup for block changes, that didn't make the initial
    pull request. It's a bit of a mixed bag, this contains:

    - A followup pull request from Sagi for NVMe. Outside of fixups for
    NVMe, it also includes a series for ensuring that we properly
    quiesce hardware queues when browsing live tags.

    - Set of integrity fixes from Dmitry (mostly), fixing various issues
    for folks using DIF/DIX.

    - Fix for a bug introduced in cciss, with the req init changes. From
    Christoph.

    - Fix for a bug in BFQ, from Paolo.

    - Two followup fixes for lightnvm/pblk from Javier.

    - Depth fix from Ming for blk-mq-sched.

    - Also from Ming, performance fix for mtip32xx that was introduced
    with the dynamic initialization of commands"

    * 'for-linus' of git://git.kernel.dk/linux-block: (44 commits)
    block: call bio_uninit in bio_endio
    nvmet: avoid unneeded assignment of submit_bio return value
    nvme-pci: add module parameter for io queue depth
    nvme-pci: compile warnings in nvme_alloc_host_mem()
    nvmet_fc: Accept variable pad lengths on Create Association LS
    nvme_fc/nvmet_fc: revise Create Association descriptor length
    lightnvm: pblk: remove unnecessary checks
    lightnvm: pblk: control I/O flow also on tear down
    cciss: initialize struct scsi_req
    null_blk: fix error flow for shared tags during module_init
    block: Fix __blkdev_issue_zeroout loop
    nvme-rdma: unconditionally recycle the request mr
    nvme: split nvme_uninit_ctrl into stop and uninit
    virtio_blk: quiesce/unquiesce live IO when entering PM states
    mtip32xx: quiesce request queues to make sure no submissions are inflight
    nbd: quiesce request queues to make sure no submissions are inflight
    nvme: kick requeue list when requeueing a request instead of when starting the queues
    nvme-pci: quiesce/unquiesce admin_q instead of start/stop its hw queues
    nvme-loop: quiesce/unquiesce admin_q instead of start/stop its hw queues
    nvme-fc: quiesce/unquiesce admin_q instead of start/stop its hw queues
    ...

    Linus Torvalds
     

04 Jul, 2017

3 commits

  • Pull irq updates from Thomas Gleixner:
    "The irq department delivers:

    - Expand the generic infrastructure handling the irq migration on CPU
    hotplug and convert X86 over to it. (Thomas Gleixner)

    Aside of consolidating code this is a preparatory change for:

    - Finalizing the affinity management for multi-queue devices. The
    main change here is to shut down interrupts which are affine to an
    outgoing CPU and reenable them when the CPU comes online again.
    That avoids moving interrupts pointlessly around and breaking and
    reestablishing affinities for no value. (Christoph Hellwig)

    Note: This contains also the BLOCK-MQ and NVME changes which depend
    on the rework of the irq core infrastructure. Jens acked them and
    agreed that they should go with the irq changes.

    - Consolidation of irq domain code (Marc Zyngier)

    - State tracking consolidation in the core code (Jeffy Chen)

    - Add debug infrastructure for hierarchical irq domains (Thomas
    Gleixner)

    - Infrastructure enhancement for managing generic interrupt chips via
    devmem (Bartosz Golaszewski)

    - Constification work all over the place (Tobias Klauser)

    - Two new interrupt controller drivers for MVEBU (Thomas Petazzoni)

    - The usual set of fixes, updates and enhancements all over the
    place"

    * 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (112 commits)
    irqchip/or1k-pic: Fix interrupt acknowledgement
    irqchip/irq-mvebu-gicp: Allocate enough memory for spi_bitmap
    irqchip/gic-v3: Fix out-of-bound access in gic_set_affinity
    nvme: Allocate queues for all possible CPUs
    blk-mq: Create hctx for each present CPU
    blk-mq: Include all present CPUs in the default queue mapping
    genirq: Avoid unnecessary low level irq function calls
    genirq: Set irq masked state when initializing irq_desc
    genirq/timings: Add infrastructure for estimating the next interrupt arrival time
    genirq/timings: Add infrastructure to track the interrupt timings
    genirq/debugfs: Remove pointless NULL pointer check
    irqchip/gic-v3-its: Don't assume GICv3 hardware supports 16bit INTID
    irqchip/gic-v3-its: Add ACPI NUMA node mapping
    irqchip/gic-v3-its-platform-msi: Make of_device_ids const
    irqchip/gic-v3-its: Make of_device_ids const
    irqchip/irq-mvebu-icu: Add new driver for Marvell ICU
    irqchip/irq-mvebu-gicp: Add new driver for Marvell GICP
    dt-bindings/interrupt-controller: Add DT binding for the Marvell ICU
    genirq/irqdomain: Remove auto-recursive hierarchy support
    irqchip/MSI: Use irq_domain_update_bus_token instead of an open coded access
    ...

    Linus Torvalds
     
  • Currently all integrity prep hooks are open-coded, and if prepare fails
    we ignore its code and fail the bio with EIO. Let's return the real error
    to the upper layer, so the caller may react accordingly.

    In fact no one wants to use bio_integrity_prep() without
    bio_integrity_enabled, so it is reasonable to fold them into one function.
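
    A sketch of the resulting caller pattern in the blk-mq submit path,
    assuming bio_integrity_prep() now returns a bool and ends the bio with
    the real error itself:

        /*
         * In blk_mq_make_request(): no separate bio_integrity_enabled()
         * check is needed any more.
         */
        if (!bio_integrity_prep(bio))
                return BLK_QC_T_NONE;   /* bio already completed with the error */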

    Signed-off-by: Dmitry Monakhov
    Reviewed-by: Martin K. Petersen
    [hch: merged with the latest block tree,
    return bool from bio_integrity_prep]
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Dmitry Monakhov
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Add the SYSTEM_SCHEDULING bootup state to move various scheduler
    debug checks earlier into the bootup. This turns silent and
    sporadically deadly bugs into nice, deterministic splats. Fix some
    of the splats that triggered. (Thomas Gleixner)

    - A round of restructuring and refactoring of the load-balancing and
    topology code (Peter Zijlstra)

    - Another round of consolidating ~20 of incremental scheduler code
    history: this time in terms of wait-queue nomenclature. (I didn't
    get much feedback on these renaming patches, and we can still
    easily change any names I might have misplaced, so if anyone hates
    a new name, please holler and I'll fix it.) (Ingo Molnar)

    - sched/numa improvements, fixes and updates (Rik van Riel)

    - Another round of x86/tsc scheduler clock code improvements, in hope
    of making it more robust (Peter Zijlstra)

    - Improve NOHZ behavior (Frederic Weisbecker)

    - Deadline scheduler improvements and fixes (Luca Abeni, Daniel
    Bristot de Oliveira)

    - Simplify and optimize the topology setup code (Lauro Ramos
    Venancio)

    - Debloat and decouple scheduler code some more (Nicolas Pitre)

    - Simplify code by making better use of llist primitives (Byungchul
    Park)

    - ... plus other fixes and improvements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
    sched/cputime: Refactor the cputime_adjust() code
    sched/debug: Expose the number of RT/DL tasks that can migrate
    sched/numa: Hide numa_wake_affine() from UP build
    sched/fair: Remove effective_load()
    sched/numa: Implement NUMA node level wake_affine()
    sched/fair: Simplify wake_affine() for the single socket case
    sched/numa: Override part of migrate_degrades_locality() when idle balancing
    sched/rt: Move RT related code from sched/core.c to sched/rt.c
    sched/deadline: Move DL related code from sched/core.c to sched/deadline.c
    sched/cpuset: Only offer CONFIG_CPUSETS if SMP is enabled
    sched/fair: Spare idle load balancing on nohz_full CPUs
    nohz: Move idle balancer registration to the idle path
    sched/loadavg: Generalize "_idle" naming to "_nohz"
    sched/core: Drop the unused try_get_task_struct() helper function
    sched/fair: WARN() and refuse to set buddy when !se->on_rq
    sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well
    sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
    sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
    sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
    sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
    ...

    Linus Torvalds
     

29 Jun, 2017

1 commit

  • Currently we only create hctx for online CPUs, which can lead to a lot
    of churn due to frequent soft offline / online operations. Instead
    allocate one for each present CPU to avoid this and dramatically simplify
    the code.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Cc: Keith Busch
    Cc: linux-block@vger.kernel.org
    Cc: linux-nvme@lists.infradead.org
    Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
    Signed-off-by: Thomas Gleixner

    Christoph Hellwig
     

28 Jun, 2017

2 commits


24 Jun, 2017

1 commit


23 Jun, 2017

1 commit


22 Jun, 2017

3 commits

  • The hw ctx's queue_num has already been set prior to the call to
    blk_mq_init_hctx(), so there is no need to set it again.

    Signed-off-by: weiping
    Signed-off-by: Jens Axboe

    weiping
     
  • Since blk_mq_quiesce_queue_nowait() can be called from interrupt
    context, make this safe. Since this function is not in the hot
    path, uninline it.

    Fixes: commit f4560ffe8cec ("blk-mq: use QUEUE_FLAG_QUIESCED to quiesce queue")
    Signed-off-by: Bart Van Assche
    Cc: Ming Lei
    Cc: Hannes Reinecke
    Cc: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • If we have shared tags enabled, then every IO completion will trigger
    a full loop of every queue belonging to a tag set, and every hardware
    queue for each of those queues, even if nothing needs to be done.
    This causes a massive performance regression if you have a lot of
    shared devices.

    Instead of doing this huge full scan on every IO, add an atomic
    counter to the main queue that tracks how many hardware queues have
    been marked as needing a restart. With that, we can avoid looking for
    restartable queues, if we don't have to.

    Max reports that this restores performance. Before this patch, 4K
    IOPS was limited to 22-23K IOPS. With the patch, we are running at
    950-970K IOPS.
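
    A sketch of the early-out this enables, assuming the counter lives on the
    request_queue (the field name shown here is an assumption about the fix):

        /*
         * In blk_mq_sched_restart(), before walking every queue and hw
         * queue that shares this tag set: bail out early if no hctx has
         * been marked as needing a restart.
         */
        if ((hctx->flags & BLK_MQ_F_TAG_SHARED) &&
            !atomic_read(&hctx->queue->shared_hctx_restart))
                return;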

    Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
    Reported-by: Max Gurtovoy
    Tested-by: Max Gurtovoy
    Reviewed-by: Bart Van Assche
    Tested-by: Bart Van Assche
    Signed-off-by: Jens Axboe

    Jens Axboe
     

21 Jun, 2017

5 commits

  • A queue must be frozen while the mapped state of a hardware queue
    is changed. Additionally, any change of the mapped state is
    followed by a call to blk_mq_map_swqueue() (see also
    blk_mq_init_allocated_queue() and blk_mq_update_nr_hw_queues()).
    Since blk_mq_map_swqueue() does not map any unmapped hardware
    queue onto any software queue, no attempt will be made to run
    an unmapped hardware queue. Hence issue a warning upon attempts
    to run an unmapped hardware queue.
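
    A minimal sketch of the warning, assuming it sits in the common run path
    (__blk_mq_delay_run_hw_queue()):

        /*
         * An unmapped hw queue has no software queues behind it, so any
         * attempt to run it points at a caller bug.
         */
        if (WARN_ON_ONCE(!blk_mq_hw_queue_mapped(hctx)))
                return;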

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Document the locking assumptions in functions that modify
    blk_mq_ctx.rq_list to make it easier for humans to verify
    this code.
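
    A sketch of the documented pattern: every function that touches
    blk_mq_ctx.rq_list states its locking requirement with a lockdep
    assertion, for example:

        /* ctx->rq_list is only ever manipulated with ctx->lock held. */
        lockdep_assert_held(&ctx->lock);
        list_add_tail(&rq->queuelist, &ctx->rq_list);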

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Initialization of blk-mq requests is a bit weird: blk_mq_rq_ctx_init()
    is called after a value has been assigned to .rq_flags and .rq_flags
    is initialized in __blk_mq_finish_request(). Initialize .rq_flags in
    blk_mq_rq_ctx_init() instead of relying on __blk_mq_finish_request().
    Moving the initialization of .rq_flags is fine because all changes
    and tests of .rq_flags occur between blk_get_request() and finishing
    a request.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Instead of declaring the second argument of blk_*_get_request()
    as int and passing it to functions that expect an unsigned int,
    declare that second argument as unsigned int. Also, for
    consistency, rename that second argument from 'rw' to 'op'.
    This patch does not change any functionality.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Cc: Ming Lei
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Since the srcu structure is rather large (184 bytes on an x86-64
    system with kernel debugging disabled), only allocate it if needed.

    Reported-by: Ming Lei
    Signed-off-by: Bart Van Assche
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ming Lei
    Cc: Hannes Reinecke
    Cc: Omar Sandoval
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

20 Jun, 2017

3 commits

  • A new bio operation flag REQ_NOWAIT is introduced to identify bios
    originating from an iocb with IOCB_NOWAIT. This flag indicates
    that the submitter should get an immediate return if a request cannot
    be made, instead of retrying.

    Stacked devices such as md (the ones with make_request_fn hooks)
    are currently not supported because they may block for housekeeping.
    For example, an md can have a part of the device suspended.
    For this reason, only request-based devices are supported.
    In the future, this feature will be expanded to stacked devices
    by teaching them how to handle the REQ_NOWAIT flag.
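
    A sketch of the checks this implies in the submission path (the helper
    names are what I'd expect from the surrounding code and should be read
    as assumptions):

        /*
         * In generic_make_request_checks(): a REQ_NOWAIT bio aimed at a
         * device that cannot honour it (not request based) must fail
         * immediately rather than sleep.
         */
        if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
                goto not_supported;

        /*
         * And in blk_queue_enter(): don't sleep on a frozen queue when the
         * submitter asked for non-blocking behaviour.
         */
        if (nowait)
                return -EBUSY;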

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jens Axboe
    Signed-off-by: Goldwyn Rodrigues
    Signed-off-by: Jens Axboe

    Goldwyn Rodrigues
     
  • So I've noticed a number of instances where it was not obvious from the
    code whether ->task_list was for a wait-queue head or a wait-queue entry.

    Furthermore, there's a number of wait-queue users where the lists are
    not for 'tasks' but other entities (poll tables, etc.), in which case
    the 'task_list' name is actively confusing.

    To clear this all up, name the wait-queue head and entry list structure
    fields unambiguously:

    struct wait_queue_head::task_list => ::head
    struct wait_queue_entry::task_list => ::entry

    For example, this code:

    rqw->wait.task_list.next != &wait->task_list

    ... was pretty unclear (to me) in what it's doing, while now it's written this way:

    rqw->wait.head.next != &wait->entry

    ... which makes it pretty clear that we are iterating a list until we see the head.

    Other examples are:

    list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
    list_for_each_entry(wq, &fence->wait.task_list, task_list) {

    ... where it's unclear (to me) what we are iterating, and during review it's
    hard to tell whether it's trying to walk a wait-queue entry (which would be
    a bug), while now it's written as:

    list_for_each_entry_safe(pos, next, &x->head, entry) {
    list_for_each_entry(wq, &fence->wait.head, entry) {

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Rename:

    wait_queue_t => wait_queue_entry_t

    'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
    but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
    which had to carry the name.

    Start sorting this out by renaming it to 'wait_queue_entry_t'.

    This also allows the real structure name 'struct __wait_queue' to
    lose its double underscore and become 'struct wait_queue_entry',
    which is the more canonical nomenclature for such data types.

    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar