10 Oct, 2016

1 commit

  • Pull blk-mq irq/cpu mapping updates from Jens Axboe:
    "This is the block-irq topic branch for 4.9-rc. It's mostly from
    Christoph, and it allows drivers to specify their own mappings, and
    more importantly, to share the blk-mq mappings with the IRQ affinity
    mappings. It's a good step towards making this work better out of the
    box"

    * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
    blk_mq: linux/blk-mq.h does not include all the headers it depends on
    blk-mq: kill unused blk_mq_create_mq_map()
    blk-mq: get rid of the cpumask in struct blk_mq_tags
    nvme: remove the post_scan callout
    nvme: switch to use pci_alloc_irq_vectors
    blk-mq: provide a default queue mapping for PCI device
    blk-mq: allow the driver to pass in a queue mapping
    blk-mq: remove ->map_queue
    blk-mq: only allocate a single mq_map per tag_set
    blk-mq: don't redistribute hardware queues on a CPU hotplug event

    Linus Torvalds
     

17 Sep, 2016

4 commits

  • In order to get good cache behavior from a sbitmap, we want each CPU to
    stick to its own cacheline(s) as much as possible. This might happen
    naturally as the bitmap gets filled up and the alloc_hint values spread
    out, but we really want this behavior from the start. blk-mq apparently
    intended to do this, but the code to do this was never wired up. Get rid
    of the dead code and make it part of the sbitmap library.
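
    As a rough sketch (not the verbatim kernel code), the per-CPU hint works
    like this: each CPU starts its search at its own remembered offset and
    stores the bit it found, so repeated allocations stay in the same region
    of the map:

        /* Sketch of the per-CPU alloc_hint idea; names follow the
         * 4.9-era sbitmap API, but this is not the exact kernel code. */
        static int tag_get_with_hint(struct sbitmap_queue *sbq)
        {
                unsigned int hint = this_cpu_read(*sbq->alloc_hint);
                int nr = sbitmap_get(&sbq->sb, hint, false);

                /* Remember where a free bit was found, so this CPU's next
                 * search starts in the same cacheline(s). */
                if (nr >= 0)
                        this_cpu_write(*sbq->alloc_hint, nr + 1);
                return nr;
        }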

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Again, there's no point in passing this in every time. Make it part of
    struct sbitmap_queue and clean up the API.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Allocating your own per-cpu allocation hint separately makes for an
    awkward API. Instead, allocate the per-cpu hint as part of the struct
    sbitmap_queue. There's no point in a struct sbitmap_queue without the
    cache, but you can still use a bare struct sbitmap.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • This is a generally useful data structure, so make it available to
    anyone else who might want to use it. It's also a nice cleanup
    separating the allocation logic from the rest of the tag handling logic.

    The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
    selected by CONFIG_BLOCK for now.

    This should be a complete noop functionality-wise.
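
    For illustration, a hedged sketch of how a user of the extracted library
    might allocate and use a tag space, using the 4.9-era sbitmap_queue
    interface from the commits above (error handling kept minimal):

        #include <linux/sbitmap.h>

        static int example_tag_space(unsigned int depth)
        {
                struct sbitmap_queue sbq;
                int tag, ret;

                /* -1 picks a default shift; 'false' disables round-robin. */
                ret = sbitmap_queue_init_node(&sbq, depth, -1, false,
                                              GFP_KERNEL, NUMA_NO_NODE);
                if (ret)
                        return ret;

                tag = __sbitmap_queue_get(&sbq);        /* -1 when exhausted */
                if (tag >= 0)
                        sbitmap_queue_clear(&sbq, tag, smp_processor_id());

                sbitmap_queue_free(&sbq);
                return 0;
        }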

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

15 Sep, 2016

2 commits

  • Unused now that NVMe sets up irq affinity before calling into blk-mq.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • All drivers use the default, so provide an inline version of it. If we
    ever need another queue mapping we can add an optional method back,
    although supporting it will also require major changes to the queue
    setup code.

    This provides better code generation, and better debuggability as well.
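
    After this change the default mapping reduces to a single table lookup;
    the inline helper looks roughly like this in the 4.9-era code:

        /* Roughly the shape of the inlined default mapping: every CPU is
         * mapped to a hardware context through the per-set mq_map table. */
        static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
                                                             int cpu)
        {
                return q->queue_hw_ctx[q->mq_map[cpu]];
        }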

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Jul, 2016

1 commit

  • The new nvme-rdma driver will need to reinitialize all the tags as part of
    the error recovery procedure (realloc the tag memory region). Add a helper
    in blk-mq for it that can iterate over all requests in a tagset to make
    this easier.
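
    A sketch using the blk_mq_tagset_busy_iter() form of tagset-wide
    iteration; 'my_reinit_request' and 'ctrl' are hypothetical names, and
    the helper added by this commit may differ in detail:

        /* Called once for every started request in the tag set. */
        static void my_reinit_request(struct request *rq, void *data,
                                      bool reserved)
        {
                /* e.g. reinitialize driver-private state attached to rq */
        }

        /* During error recovery, after reallocating the tag memory: */
        blk_mq_tagset_busy_iter(&ctrl->tag_set, my_reinit_request, ctrl);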

    Signed-off-by: Sagi Grimberg
    Tested-by: Ming Lin
    Reviewed-by: Stephen Bates
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Steve Wise
    Tested-by: Steve Wise
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     

07 Nov, 2015

1 commit

  • mm, page_alloc: distinguish between being unable to sleep, unwilling to
    sleep and avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min" which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous
    reasons. In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.
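
    In terms of flag composition, the conversion boils down to roughly the
    following (as the flags were defined around the 4.4 time frame):

        /* Approximate definitions after this change: */
        #define __GFP_RECLAIM  (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
        #define GFP_ATOMIC     (__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM)
        #define GFP_KERNEL     (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
        #define GFP_NOWAIT     (__GFP_KSWAPD_RECLAIM)

        /* The preferred way to test for a blocking context: */
        static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
        {
                return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
        }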

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

05 Nov, 2015

1 commit

  • Pull core block updates from Jens Axboe:
    "This is the core block pull request for 4.4. I've got a few more
    topic branches this time around, some of them will layer on top of the
    core+drivers changes and will come in a separate round. So not a huge
    chunk of changes in this round.

    This pull request contains:

    - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

    - Unused prototype removal in blk-mq from Christoph.

    - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
    xchg()'s, from Davidlohr.

    - A plug flush fix from Jeff.

    - Also from Jeff, a fix that means we don't have to update shared tag
    sets at init time unless we do a state change. This cuts down boot
    times on thousands of devices a lot with scsi/blk-mq.

    - blk-mq waitqueue barrier fix from Kosuke.

    - Various fixes from Ming:

    - Fixes for segment merging and splitting, and checks, for
    the old core and blk-mq.

    - Potential blk-mq speedup by marking ctx pending at the end
    of a plug insertion batch in blk-mq.

    - direct-io no page dirty on kernel direct reads.

    - A WRITE_SYNC fix for mpage from Roman"

    * 'for-4.4/core' of git://git.kernel.dk/linux-block:
    blk-mq: avoid excessive boot delays with large lun counts
    blktrace: re-write setting q->blk_trace
    blk-mq: mark ctx as pending at batch in flush plug path
    blk-mq: fix for trace_block_plug()
    block: check bio_mergeable() early before merging
    blk-mq: check bio_mergeable() early before merging
    block: avoid to merge splitted bio
    block: setup bi_phys_segments after splitting
    block: fix plug list flushing for nomerge queues
    blk-mq: remove unused blk_mq_clone_flush_request prototype
    blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
    fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
    fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
    block: kmemleak: Track the page allocations for struct request

    Linus Torvalds
     

15 Oct, 2015

1 commit

  • tags is freed in blk_mq_free_rq_map() and should not be used after that.
    The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
    free_cpumask_var() is a no-op.

    tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
    free the cpumask in its counterpart, blk_mq_free_tags().
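
    The fix, approximately, is to release the mask in blk_mq_free_tags()
    (a sketch of the shape of the change, not the exact diff):

        void blk_mq_free_tags(struct blk_mq_tags *tags)
        {
                bt_free(&tags->bitmap_tags);
                bt_free(&tags->breserved_tags);
                /* Free the cpumask here, mirroring its allocation in
                 * blk_mq_init_tags(), instead of in blk_mq_free_rq_map()
                 * after 'tags' has already been freed. */
                free_cpumask_var(tags->cpumask);
                kfree(tags);
        }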

    Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements")
    Signed-off-by: Jun'ichi Nomura
    Cc: Keith Busch
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

10 Oct, 2015

1 commit

  • blk_mq_tag_update_depth() seems to be missing a memory barrier which
    might cause the waker to not notice the waiter and fail to send a
    wake_up as in the following figure.

    blk_mq_tag_update_depth                  bt_get
    ------------------------------------------------------------------------
    if (waitqueue_active(&bs->wait))
    /* The CPU might reorder the test for
       the waitqueue up here, before
       prior writes complete */
                                             prepare_to_wait(&bs->wait, &wait,
                                               TASK_UNINTERRUPTIBLE);
                                             tag = __bt_get(hctx, bt, last_tag,
                                               tags);
                                             /* Value set in bt_update_count
                                                not visible yet */
    bt_update_count(&tags->bitmap_tags, tdepth);
    /* blk_mq_tag_wakeup_all(tags, false); */
      bt = &tags->bitmap_tags;
      wake_index = atomic_read(&bt->wake_index);
      ...
                                             io_schedule();
    ------------------------------------------------------------------------

    This patch adds the missing memory barrier.

    I found this issue when I was looking through the linux source code
    for places calling waitqueue_active() before wake_up*(), but without
    preceding memory barriers, after sending a patch to fix a similar
    issue in drivers/tty/n_tty.c (Details about the original issue can be
    found here: https://lkml.org/lkml/2015/9/28/849).
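
    Schematically, the waker side needs a full barrier between updating the
    depth and testing the waitqueue (a sketch of the pattern, not the exact
    diff):

        bt_update_count(&tags->bitmap_tags, tdepth);
        /* Pair with the barrier implied by prepare_to_wait(): make the
         * new depth visible before deciding whether anyone is waiting. */
        smp_mb();
        if (waitqueue_active(&bs->wait))
                wake_up(&bs->wait);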

    Signed-off-by: Kosuke Tatsukawa
    Signed-off-by: Jens Axboe

    Kosuke Tatsukawa
     

15 Aug, 2015

1 commit

  • Inside the timeout handler, blk_mq_tag_to_rq() is called
    to retrieve the request from one tag. This is obviously
    wrong because the request can be freed at any time and some
    fields of the request can't be trusted, so a kernel oops
    might be triggered[1].

    Currently, wrt. blk_mq_tag_to_rq(), the only special case is
    that the flush request can share the same tag as the request it
    was cloned from, and the two requests can't be active at the same
    time, so this patch fixes the above issue by updating tags->rqs[tag]
    with the active request (either the flush rq or the request it was
    cloned from) for the tag.

    Also blk_mq_tag_to_rq() gets much simplified with this patch.

    Given that blk_mq_tag_to_rq() is mainly for drivers and the caller must
    make sure the request can't be freed, this helper is replaced with
    tags->rqs[tag] in bt_for_each().

    [1] kernel oops log
    [ 439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158
    [ 439.697162] IP: [] blk_mq_tag_to_rq+0x21/0x6e
    [ 439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0
    [ 439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    [ 439.700653] Dumping ftrace buffer:
    [ 439.700653] (ftrace buffer empty)
    [ 439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw
    [ 439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265
    [ 439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    [ 439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000
    [ 439.730500] RIP: 0010:[] [] blk_mq_tag_to_rq+0x21/0x6e
    [ 439.730500] RSP: 0018:ffff880819203da0 EFLAGS: 00010283
    [ 439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002
    [ 439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
    [ 439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000
    [ 439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202
    [ 439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00
    [ 439.730500] FS: 00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000
    [ 439.730500] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0
    [ 439.730500] Stack:
    [ 439.730500] 0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104
    [ 439.755663] ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80
    [ 439.755663] Call Trace:
    [ 439.755663]
    [ 439.755663] [] bt_for_each+0x6e/0xc8
    [ 439.755663] [] ? blk_mq_rq_timed_out+0x6a/0x6a
    [ 439.755663] [] ? blk_mq_rq_timed_out+0x6a/0x6a
    [ 439.755663] [] blk_mq_tag_busy_iter+0x55/0x5e
    [ 439.755663] [] ? blk_mq_bio_to_request+0x38/0x38
    [ 439.755663] [] blk_mq_rq_timer+0x5d/0xd4
    [ 439.755663] [] call_timer_fn+0xf7/0x284
    [ 439.755663] [] ? call_timer_fn+0x5/0x284
    [ 439.755663] [] ? blk_mq_bio_to_request+0x38/0x38
    [ 439.755663] [] run_timer_softirq+0x1ce/0x1f8
    [ 439.755663] [] __do_softirq+0x181/0x3a4
    [ 439.755663] [] irq_exit+0x40/0x94
    [ 439.755663] [] smp_apic_timer_interrupt+0x33/0x3e
    [ 439.755663] [] apic_timer_interrupt+0x84/0x90
    [ 439.755663]
    [ 439.755663] [] ? _raw_spin_unlock_irq+0x32/0x4a
    [ 439.755663] [] finish_task_switch+0xe0/0x163
    [ 439.755663] [] ? finish_task_switch+0xa2/0x163
    [ 439.755663] [] __schedule+0x469/0x6cd
    [ 439.755663] [] schedule+0x82/0x9a
    [ 439.789267] [] signalfd_read+0x186/0x49a
    [ 439.790911] [] ? wake_up_q+0x47/0x47
    [ 439.790911] [] __vfs_read+0x28/0x9f
    [ 439.790911] [] ? __fget_light+0x4d/0x74
    [ 439.790911] [] vfs_read+0x7a/0xc6
    [ 439.790911] [] SyS_read+0x49/0x7f
    [ 439.790911] [] entry_SYSCALL_64_fastpath+0x12/0x6f
    [ 439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
    f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
    53 38 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
    [ 439.790911] RIP [] blk_mq_tag_to_rq+0x21/0x6e
    [ 439.790911] RSP
    [ 439.790911] CR2: 0000000000000158
    [ 439.790911] ---[ end trace d40af58949325661 ]---

    Cc:
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

02 Jun, 2015

1 commit

  • Storage controllers may expose multiple block devices that share hardware
    resources managed by blk-mq. This patch enhances the shared tags so a
    low-level driver can access the shared resources not tied to the unshared
    h/w contexts. This way the LLD can dynamically add and delete disks and
    request queues without having to track all the request_queue hctx's to
    iterate outstanding tags.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

19 Mar, 2015

1 commit

  • When allocating from the reserved tags pool, bt_get() is called with
    a NULL hctx. If all tags are in use, the hw queue is kicked to push
    out any pending IO, potentially freeing tags, and tag allocation is
    retried. The problem is that blk_mq_run_hw_queue() doesn't check for
    a NULL hctx. So we avoid it with a simple NULL hctx test.

    Tested by hammering mtip32xx with concurrent smartctl/hdparm.
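
    The guard itself is a one-liner in bt_get(); roughly:

        /*
         * We're out of tags on this hardware queue; kick any pending IO
         * before sleeping. Note that hctx can be NULL here for reserved
         * tag allocation.
         */
        if (hctx)
                blk_mq_run_hw_queue(hctx, false);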

    Signed-off-by: Sam Bradshaw
    Signed-off-by: Selvan Mani
    Fixes: b32232073e80 ("blk-mq: fix hang in bt_get()")
    Cc: stable@kernel.org

    Added appropriate comment.

    Signed-off-by: Jens Axboe

    Sam Bradshaw
     

12 Feb, 2015

1 commit

  • If the allocation of bt->bs fails, then bt->map can be freed twice, once
    in blk_mq_init_bitmap_tags() -> bt_alloc(), and once in
    blk_mq_init_bitmap_tags() -> bt_free(). Fix by setting the pointer to
    NULL after the first free.
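
    One plausible shape of the fix in the bt_alloc() error path (a sketch,
    not the exact diff):

        bt->bs = kzalloc(BT_WAIT_QUEUES * sizeof(*bt->bs), GFP_KERNEL);
        if (!bt->bs) {
                kfree(bt->map);
                bt->map = NULL;  /* bt_free() would otherwise free it again */
                return -ENOMEM;
        }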

    Cc:
    Signed-off-by: Tony Battersby
    Signed-off-by: Jens Axboe

    Tony Battersby
     

24 Jan, 2015

1 commit

  • This is the blk-mq part to support tag allocation policy. The default
    allocation policy isn't changed (though it's not a strict FIFO). The new
    policy is round-robin for libata. But it's a best-effort implementation:
    if multiple tasks are competing, the tags returned will be mixed (which
    is unavoidable even with !mq, as requests from different tasks can be
    mixed in the queue).

    Cc: Jens Axboe
    Cc: Tejun Heo
    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

14 Jan, 2015

1 commit

  • The blk-mq tagging tries to maintain some locality between CPUs and
    the tags issued. The tags are split into groups of words, and the
    words may not be fully populated. When searching for a new free tag,
    blk-mq may look at partial words, hence it passes in an offset/size
    to find_next_zero_bit(). However, it does that wrong: the size must
    always be the full length of the number of tags in that word,
    otherwise we'll potentially miss some near the end.

    Another issue is when __bt_get() goes from one word set to the next.
    It bumps the index, but not the last_tag associated with the
    previous index. Bump that to be in the range of the new word.

    Finally, clean up __bt_get() and __bt_get_word() a bit and get
    rid of the goto in there, and the unnecessary 'wrap' variable.
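
    Schematically, the offset only selects where the search starts; the size
    argument must remain the word's full tag count:

        /* 'last_tag' is just the starting offset within this word; passing
         * anything smaller than bm->depth as the size would hide free bits
         * between the offset and the end of the word. */
        tag = find_next_zero_bit(&bm->word, bm->depth, last_tag);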

    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Jan, 2015

1 commit

  • If it's dying, we can't expect new requests to complete and come
    in and wake up other tasks waiting for requests. So after we
    have marked it as dying, wake up everybody currently waiting
    for a request. Once they wake, they will retry their allocation
    and fail appropriately due to the state of the queue.

    Tested-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Dec, 2014

1 commit

  • This reverts commit 52f7eb945f2ba62b324bb9ae16d945326a961dcf.

    The optimization is only really safe for a single queue, otherwise
    'bs' and 'bt' can indeed change, and if we don't do a finish_wait()
    for each loop, we'll potentially change the wait structure and
    corrupt the task wait list.

    Reported-by: Jan Kara

    Jens Axboe
     

14 Dec, 2014

1 commit

  • Pull block driver core update from Jens Axboe:
    "This is the pull request for the core block IO changes for 3.19. Not
    a huge round this time, mostly lots of little good fixes:

    - Fix a bug in sysfs blktrace interface causing a NULL pointer
    dereference, when enabled/disabled through that API. From Arianna
    Avanzini.

    - Various updates/fixes/improvements for blk-mq:

    - A set of updates from Bart, mostly fixing bugs in the tag
    handling.

    - Cleanup/code consolidation from Christoph.

    - Extend queue_rq API to be able to handle batching issues of IO
    requests. NVMe will utilize this shortly. From me.

    - A few tag and request handling updates from me.

    - Cleanup of the preempt handling for running queues from Paolo.

    - Prevent running of unmapped hardware queues from Ming Lei.

    - Move the kdump memory limiting check to be in the correct
    location, from Shaohua.

    - Initialize all software queues at init time from Takashi. This
    prevents a kobject warning when CPUs are brought online that
    weren't online when a queue was registered.

    - Single writeback fix for I_DIRTY clearing from Tejun. Queued with
    the core IO changes, since it's just a single fix.

    - Version X of the __bio_add_page() segment addition retry from
    Maurizio. Hope the Xth time is the charm.

    - Documentation fixup for IO scheduler merging from Jan.

    - Introduce (and use) generic IO stat accounting helpers for non-rq
    drivers, from Gu Zheng.

    - Kill off artificial limiting of max sectors in a request from
    Christoph"

    * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
    bio: modify __bio_add_page() to accept pages that don't start a new segment
    blk-mq: Fix uninitialized kobject at CPU hotplugging
    blktrace: don't let the sysfs interface remove trace from running list
    blk-mq: Use all available hardware queues
    blk-mq: Micro-optimize bt_get()
    blk-mq: Fix a race between bt_clear_tag() and bt_get()
    blk-mq: Avoid that __bt_get_word() wraps multiple times
    blk-mq: Fix a use-after-free
    blk-mq: prevent unmapped hw queue from being scheduled
    blk-mq: re-check for available tags after running the hardware queue
    blk-mq: fix hang in bt_get()
    blk-mq: move the kdump check to blk_mq_alloc_tag_set
    blk-mq: cleanup tag free handling
    blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq cpu map
    blk: introduce generic io stat accounting help function
    blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
    genhd: check for int overflow in disk_expand_part_tbl()
    blk-mq: add blk_mq_free_hctx_request()
    blk-mq: export blk_mq_free_request()
    blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
    ...

    Linus Torvalds
     

10 Dec, 2014

3 commits

  • Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr()
    calls into a single call.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • What we need is the following two guarantees:
    * Any thread that observes the effect of the test_and_set_bit() by
    __bt_get_word() also observes the preceding addition of 'current'
    to the appropriate wait list. This is guaranteed by the semantics
    of the spin_unlock() operation performed by prepare_to_wait().
    Hence the conversion of test_and_set_bit_lock() into
    test_and_set_bit().
    * The wait lists are examined by bt_clear() after the tag bit has
    been cleared. clear_bit_unlock() guarantees that any thread that
    observes that the bit has been cleared also observes the store
    operations preceding clear_bit_unlock(). However,
    clear_bit_unlock() does not prevent the wait lists from being
    examined before the tag bit is cleared. Hence the addition of a memory
    barrier between clear_bit() and the wait list examination.
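
    A sketch of the release side after this change (not the exact diff):

        clear_bit(TAG_TO_BIT(bt, tag), &bt->map[index].word);
        /* Ensure the wait-list examination below happens after the tag
         * bit is visible as cleared. */
        smp_mb();
        bs = bt_wake_ptr(bt);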

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Cc: # v3.13+
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • If __bt_get_word() is called with last_tag != 0; if the first
    find_next_zero_bit() fails; if, after wrap-around, the
    test_and_set_bit() call fails and find_next_zero_bit() succeeds;
    and if the next test_and_set_bit() call fails and the subsequent
    find_next_zero_bit() does not find a zero bit, then another
    wrap-around will occur. Avoid this by introducing an additional
    local variable.
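
    A sketch of the wrap-once logic with the extra local variable (close to,
    but not necessarily identical to, the committed code):

        static int __bt_get_word(struct blk_align_bitmap *bm,
                                 unsigned int last_tag)
        {
                int tag, org_last_tag = last_tag;

                while (1) {
                        tag = find_next_zero_bit(&bm->word, bm->depth,
                                                 last_tag);
                        if (unlikely(tag >= bm->depth)) {
                                /*
                                 * Only wrap around once: if we started at a
                                 * non-zero offset, retry from 0 to exhaust
                                 * the map; otherwise give up.
                                 */
                                if (org_last_tag && last_tag) {
                                        last_tag = org_last_tag = 0;
                                        continue;
                                }
                                return -1;
                        }
                        if (!test_and_set_bit(tag, &bm->word))
                                break;
                        last_tag = tag + 1;
                }
                return tag;
        }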

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Cc: # v3.13+
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

08 Dec, 2014

2 commits

  • If we run out of tags and have to sleep, we run the hardware queue
    to kick pending IO into gear. During that run, we may have completed
    requests, so re-check if we have free tags before going to sleep.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Avoid a hang in bt_get() when there are fewer hardware queues than
    CPU threads. The symptoms of the hang were as follows:

    * All tags allocated for a particular hardware queue.
    * (nr_tags) pending commands for that hardware queue.
    * No pending commands for the software queues associated with that
    hardware queue.

    Signed-off-by: Jens Axboe

    Bart Van Assche
     

12 Nov, 2014

1 commit

  • The queuecommand() callback functions in SCSI low-level drivers
    need to know which hardware context has been selected by the
    block layer. Since this information is not available in the
    request structure, and since passing the hctx pointer directly to
    the queuecommand callback function would require modification of
    all SCSI LLDs, add a function to the block layer that allows
    querying the hardware context index.
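
    A sketch of the LLD-facing interface this enables: the hardware queue
    index is encoded in the upper bits of a "unique tag" and the per-queue
    tag in the lower bits (names follow the blk_mq_unique_tag() helpers;
    exact details may differ from this commit):

        u32 unique = blk_mq_unique_tag(rq);
        u16 hwq    = blk_mq_unique_tag_to_hwq(unique);  /* hctx index */
        u16 tag    = blk_mq_unique_tag_to_tag(unique);  /* tag within hctx */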

    Signed-off-by: Bart Van Assche
    Acked-by: Jens Axboe
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Christoph Hellwig

    Bart Van Assche
     

23 Sep, 2014

1 commit

  • Don't do a kmalloc from the timer handler to handle timeouts; chances
    are we could be under heavy load or similar and thus just miss out on
    the timeouts. Fortunately it is very easy to just iterate over all
    in-use tags, and doing this properly actually cleans up the
    blk_mq_busy_iter API as well, and prepares us for the next patch by
    passing a reserved argument to the iterator.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Jun, 2014

3 commits

  • This update fixes a few issues in the bt_get() function:

    - the list_empty(&wait.task_list) check is not protected;

    - the was_empty check is always true, which results in *every* thread
    entering the loop resetting the bt_wait_state::wait_cnt counter,
    rather than every bt->wake_cnt'th thread;

    - the 'bt_wait_state::wait_cnt' counter update is redundant, since
    it also gets reset in the bt_clear_tag() function;

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jens Axboe
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Alexander Gordeev
     
  • This piece of code in the bt_clear_tag() function is racy:

        bs = bt_wake_ptr(bt);
        if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
                atomic_set(&bs->wait_cnt, bt->wake_cnt);
                wake_up(&bs->wait);
        }

    Since nothing prevents bt_wake_ptr() from returning the very
    same 'bs' address on multiple CPUs, the following scenario is
    possible:

    CPU1                                      CPU2
    ----                                      ----

    0. bs = bt_wake_ptr(bt);                  bs = bt_wake_ptr(bt);
    1. atomic_dec_and_test(&bs->wait_cnt)
    2.                                        atomic_dec_and_test(&bs->wait_cnt)
    3. atomic_set(&bs->wait_cnt, bt->wake_cnt);

    If the decrement in [1] yields zero then for some amount of time
    the decrement in [2] results in a negative/overflow value, which
    is not expected. The follow-up assignment in [3] overwrites the
    invalid value with the batch value (and likely prevents the issue
    from being severe), but it is still incorrect and should be fixed.
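
    A sketch of the fixed accounting, acting on the value returned by the
    decrement itself so that only the CPU that reaches zero refills the
    batch (shape of the fix, not the exact diff):

        wait_cnt = atomic_dec_return(&bs->wait_cnt);
        if (wait_cnt == 0) {
                /* Refill by adding, so concurrent decrements that dipped
                 * below zero are accounted for rather than overwritten. */
                atomic_add(bt->wake_cnt, &bs->wait_cnt);
                wake_up(&bs->wait);
        }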

    Cc: Ming Lei
    Cc: Jens Axboe
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Alexander Gordeev
     
  • Fix racy updates of shared blk_mq_bitmap_tags::wake_index
    and blk_mq_hw_ctx::wake_index fields.

    Cc: Ming Lei
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Alexander Gordeev
     

29 May, 2014

1 commit