01 Oct, 2018

1 commit

  • Merge -rc6 in, for two reasons:

    1) Resolve a trivial conflict in the blk-mq-tag.c documentation
    2) A few important regression fixes went into upstream directly, so
    they aren't in the 4.20 branch.

    Signed-off-by: Jens Axboe

    * tag 'v4.19-rc6': (780 commits)
    Linux 4.19-rc6
    MAINTAINERS: fix reference to moved drivers/{misc => auxdisplay}/panel.c
    cpufreq: qcom-kryo: Fix section annotations
    perf/core: Add sanity check to deal with pinned event failure
    xen/blkfront: correct purging of persistent grants
    Revert "xen/blkfront: When purging persistent grants, keep them in the buffer"
    selftests/powerpc: Fix Makefiles for headers_install change
    blk-mq: I/O and timer unplugs are inverted in blktrace
    dax: Fix deadlock in dax_lock_mapping_entry()
    x86/boot: Fix kexec booting failure in the SEV bit detection code
    bcache: add separate workqueue for journal_write to avoid deadlock
    drm/amd/display: Fix Edid emulation for linux
    drm/amd/display: Fix Vega10 lightup on S3 resume
    drm/amdgpu: Fix vce work queue was not cancelled when suspend
    Revert "drm/panel: Add device_link from panel device to DRM device"
    xen/blkfront: When purging persistent grants, keep them in the buffer
    clocksource/drivers/timer-atmel-pit: Properly handle error cases
    block: fix deadline elevator drain for zoned block devices
    ACPI / hotplug / PCI: Don't scan for non-hotplug bridges if slot is not bridge
    drm/syncobj: Don't leak fences when WAIT_FOR_SUBMIT is set
    ...

    Signed-off-by: Jens Axboe

    Jens Axboe
     

26 Sep, 2018

1 commit

  • A recent commit runs tag iterator callbacks under the RCU read
    lock, but existing callbacks do not satisfy the non-blocking
    requirement. The commit intended to prevent an iterator from
    accessing a queue that's being modified. This patch fixes the
    original issue by taking a queue reference instead of holding the
    RCU read lock, which allows callbacks to make blocking calls.
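
    A sketch of the resulting pattern (abbreviated; error handling and
    the per-hctx walk are elided):

    void blk_mq_queue_tag_busy_iter(struct request_queue *q,
                                    busy_iter_fn *fn, void *priv)
    {
            struct blk_mq_hw_ctx *hctx;
            int i;

            /* pin the queue instead of entering an RCU read section */
            if (!percpu_ref_tryget(&q->q_usage_counter))
                    return;

            queue_for_each_hw_ctx(q, hctx, i) {
                    /* ... walk busy tags; fn() may now block ... */
            }

            blk_queue_exit(q);      /* drop the queue reference */
    }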

    Fixes: f5bbbbe4d6357 ("blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter")
    Acked-by: Jianchao Wang
    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

22 Sep, 2018

1 commit

  • Make it easier to understand the functions that iterate over
    requests by documenting their purpose. Fix several minor spelling
    and grammar mistakes in comments in these functions.

    Signed-off-by: Bart Van Assche
    Reviewed-by: Johannes Thumshirn
    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jianchao Wang
    Cc: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

23 Aug, 2018

1 commit

  • Pull more block updates from Jens Axboe:

    - Set of bcache fixes and changes (Coly)

    - The flush warn fix (me)

    - Small series of BFQ fixes (Paolo)

    - wbt hang fix (Ming)

    - blktrace fix (Steven)

    - blk-mq hardware queue count update fix (Jianchao)

    - Various little fixes

    * tag 'for-4.19/post-20180822' of git://git.kernel.dk/linux-block: (31 commits)
    block/DAC960.c: make some arrays static const, shrinks object size
    blk-mq: sync the update nr_hw_queues with blk_mq_queue_tag_busy_iter
    blk-mq: init hctx sched after update ctx and hctx mapping
    block: remove duplicate initialization
    tracing/blktrace: Fix to allow setting same value
    pktcdvd: fix setting of 'ret' error return for a few cases
    block: change return type to bool
    block, bfq: return nbytes and not zero from struct cftype .write() method
    block, bfq: improve code of bfq_bfqq_charge_time
    block, bfq: reduce write overcharge
    block, bfq: always update the budget of an entity when needed
    block, bfq: readd missing reset of parent-entity service
    blk-wbt: fix IO hang in wbt_wait()
    block: don't warn for flush on read-only device
    bcache: add the missing comments for smp_mb()/smp_wmb()
    bcache: remove unnecessary space before ioctl function pointer arguments
    bcache: add missing SPDX header
    bcache: move open brace at end of function definitions to next line
    bcache: add static const prefix to char * array declarations
    bcache: fix code comments style
    ...

    Linus Torvalds
     

21 Aug, 2018

1 commit

  • For blk-mq, part_in_flight/rw invoke blk_mq_in_flight/rw to
    account for in-flight requests, and they access queue_hw_ctx and
    nr_hw_queues without any protection. When an update of nr_hw_queues
    and blk_mq_in_flight/rw occur concurrently, a panic comes up.

    Before nr_hw_queues is updated, the queue is frozen, so we can use
    q_usage_counter to avoid the race. percpu_ref_is_zero is used here
    so that we will not miss any in-flight request. The accesses to
    nr_hw_queues and queue_hw_ctx in blk_mq_queue_tag_busy_iter are
    under an RCU critical section, and __blk_mq_update_nr_hw_queues can
    use synchronize_rcu to ensure the zeroed q_usage_counter is
    globally visible.
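
    A condensed sketch of the two sides of the synchronization
    (abbreviated from the patch):

    /* reader side: skip a queue that is frozen for an update */
    rcu_read_lock();
    if (unlikely(percpu_ref_is_zero(&q->q_usage_counter))) {
            rcu_read_unlock();
            return;
    }
    /* ... q->nr_hw_queues / q->queue_hw_ctx are safe to read ... */
    rcu_read_unlock();

    /* updater side: freeze all queues, wait for readers to notice */
    list_for_each_entry(q, &set->tag_list, tag_set_list)
            blk_mq_freeze_queue(q);
    synchronize_rcu();
    /* ... update nr_hw_queues and remap hw contexts ... */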

    Signed-off-by: Jianchao Wang
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

15 Aug, 2018

1 commit

  • Pull block updates from Jens Axboe:
    "First pull request for this merge window, there will also be a
    followup request with some stragglers.

    This pull request contains:

    - Fix for a thundering herd issue in the wbt block code (Anchal
    Agarwal)

    - A few NVMe pull requests:
    * Improved tracepoints (Keith)
    * Larger inline data support for RDMA (Steve Wise)
    * RDMA setup/teardown fixes (Sagi)
    * Effects log support for NVMe target (Chaitanya Kulkarni)
    * Buffered IO support for NVMe target (Chaitanya Kulkarni)
    * TP4004 (ANA) support (Christoph)
    * Various NVMe fixes

    - Block io-latency controller support. Much needed support for
    properly containing block devices. (Josef)

    - Series improving how we handle sense information on the stack
    (Kees)

    - Lightnvm fixes and updates/improvements (Mathias/Javier et al)

    - Zoned device support for null_blk (Matias)

    - AIX partition fixes (Mauricio Faria de Oliveira)

    - DIF checksum code made generic (Max Gurtovoy)

    - Add support for discard in iostats (Michael Callahan / Tejun)

    - Set of updates for BFQ (Paolo)

    - Removal of async write support for bsg (Christoph)

    - Bio page dirtying and clone fixups (Christoph)

    - Set of bcache fix/changes (via Coly)

    - Series improving blk-mq queue setup/teardown speed (Ming)

    - Series improving merging performance on blk-mq (Ming)

    - Lots of other fixes and cleanups from a slew of folks"

    * tag 'for-4.19/block-20180812' of git://git.kernel.dk/linux-block: (190 commits)
    blkcg: Make blkg_root_lookup() work for queues in bypass mode
    bcache: fix error setting writeback_rate through sysfs interface
    null_blk: add lock drop/acquire annotation
    Blk-throttle: reduce tail io latency when iops limit is enforced
    block: paride: pd: mark expected switch fall-throughs
    block: Ensure that a request queue is dissociated from the cgroup controller
    block: Introduce blk_exit_queue()
    blkcg: Introduce blkg_root_lookup()
    block: Remove two superfluous #include directives
    blk-mq: count the hctx as active before allocating tag
    block: bvec_nr_vecs() returns value for wrong slab
    bcache: trivial - remove tailing backslash in macro BTREE_FLAG
    bcache: make the pr_err statement used for ENOENT only in sysfs_attatch section
    bcache: set max writeback rate when I/O request is idle
    bcache: add code comments for bset.c
    bcache: fix mistaken comments in request.c
    bcache: fix mistaken code comments in bcache.h
    bcache: add a comment in super.c
    bcache: avoid unncessary cache prefetch bch_btree_node_get()
    bcache: display rate debug parameters to 0 when writeback is not running
    ...

    Linus Torvalds
     

09 Aug, 2018

1 commit

  • Currently, we count the hctx as active after a driver tag is
    allocated successfully. If a previously inactive hctx tries to get
    a tag for the first time, it may fail and need to wait. However,
    due to the stale ->active_queues count, the other shared-tags users
    are still able to occupy all the driver tags while someone is
    waiting for a tag. Consequently, even when the previously inactive
    hctx is woken up, it still may not be able to get a tag and can be
    starved.

    To fix it, count the hctx as active before trying to allocate a
    driver tag; then, while it is waiting for a tag, the other
    shared-tag users will reserve budget for it, as sketched below.
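
    A sketch of the reordering in the driver-tag path (assuming the
    current blk_mq_get_driver_tag() flow; details elided):

    static bool blk_mq_get_driver_tag(struct request *rq)
    {
            struct blk_mq_alloc_data data = {
                    .q     = rq->q,
                    .hctx  = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
                    .flags = BLK_MQ_REQ_NOWAIT,
            };

            /*
             * Mark the hctx active *before* the allocation attempt, so
             * the other shared-tag users already reserve budget for us
             * if we end up waiting for a tag.
             */
            if (data.hctx->flags & BLK_MQ_F_TAG_SHARED)
                    blk_mq_tag_busy(data.hctx);

            rq->tag = blk_mq_get_tag(&data);
            /* ... */
            return rq->tag != -1;
    }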

    Reviewed-by: Ming Lei
    Signed-off-by: Jianchao Wang
    Signed-off-by: Jens Axboe

    Jianchao Wang
     

03 Aug, 2018

2 commits

  • Commit d250bf4e776ff09d5 ("blk-mq: only iterate over inflight
    requests in blk_mq_tagset_busy_iter") replaced
    'blk_mq_request_started(req)' with 'blk_mq_rq_state(rq) ==
    MQ_RQ_IN_FLIGHT'. That is wrong, and it causes lots of test systems
    to hang during boot.

    Fix the issue by using blk_mq_request_started(req) inside bt_tags_iter().
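
    The fixed iterator, roughly:

    static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
                             void *data)
    {
            struct bt_tags_iter_data *iter_data = data;
            struct request *rq = iter_data->tags->rqs[bitnr];

            /*
             * Visit any started request, not only those whose state is
             * still MQ_RQ_IN_FLIGHT; the stricter check is what hung
             * the test systems.
             */
            if (rq && blk_mq_request_started(rq))
                    iter_data->fn(rq, iter_data->data, iter_data->reserved);
            return true;
    }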

    Fixes: d250bf4e776ff09d5 ("blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter")
    Cc: Josef Bacik
    Cc: Christoph Hellwig
    Cc: Guenter Roeck
    Cc: Mark Brown
    Cc: Matt Hart
    Cc: Johannes Thumshirn
    Cc: John Garry
    Cc: Hannes Reinecke
    Cc: "Martin K. Petersen"
    Cc: James Bottomley
    Cc: linux-scsi@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Reviewed-by: Bart Van Assche
    Tested-by: Guenter Roeck
    Reported-by: Mark Brown
    Reported-by: Guenter Roeck
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • The 'nr' passed from userspace represents the total depth, while
    inside 'struct blk_mq_tags', 'nr_tags' stores the total tag depth
    and 'nr_reserved_tags' stores the reserved part.

    There are two issues in blk_mq_tag_update_depth() now:

    1) for growing tags, we should have used the passed 'nr' and kept
    the number of reserved tags unchanged.

    2) the passed 'nr' should have been checked against
    'tags->nr_tags', not against the number of the normal
    (non-reserved) part.

    This patch fixes the above two cases and avoids a kernel crash
    caused by incorrectly resizing the sbitmap queue.
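
    The corrected checks, roughly (the grow path is elided):

    int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
                                struct blk_mq_tags **tagsptr,
                                unsigned int tdepth, bool can_grow)
    {
            struct blk_mq_tags *tags = *tagsptr;

            if (tdepth <= tags->nr_reserved_tags)
                    return -EINVAL;

            /* the passed depth is a total, so check it against nr_tags */
            if (tdepth > tags->nr_tags) {
                    if (!can_grow)
                            return -EINVAL;
                    /* grow: allocate a replacement tag set of 'tdepth'
                       total tags, keeping nr_reserved_tags unchanged */
            } else {
                    /* shrink: resize only the normal part */
                    sbitmap_queue_resize(&tags->bitmap_tags,
                                         tdepth - tags->nr_reserved_tags);
            }
            return 0;
    }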

    Cc: "Ewan D. Milne"
    Cc: Christoph Hellwig
    Cc: Bart Van Assche
    Cc: Omar Sandoval
    Tested-by: Marco Patalano
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

14 Jun, 2018

1 commit


31 May, 2018

1 commit


25 May, 2018

1 commit

  • When the allocation process is scheduled back and the mapped hw
    queue has changed, fake one extra wake up on the previous queue to
    compensate for the missed wake up, so that other allocations on the
    previous queue won't be starved.

    This patch fixes a request allocation hang, which can be triggered
    easily with a very low nr_requests.

    The race is as follows:

    1) 2 hw queues, nr_requests is 2, and wake_batch is one

    2) there are 3 waiters on hw queue 0

    3) two in-flight requests in hw queue 0 are completed, and only two
    of the 3 waiters are woken up because of wake_batch; but both of
    those waiters can be scheduled to another CPU and end up switching
    to hw queue 1

    4) the 3rd waiter then waits forever, since no in-flight request is
    left in hw queue 0

    5) this patch fixes it with a fake wakeup when a waiter is scheduled
    to another hw queue (see the sketch below)
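
    A sketch of the fake wakeup inside blk_mq_get_tag()'s wait loop
    (abbreviated):

    struct sbitmap_queue *bt = &data->hctx->tags->bitmap_tags;
    struct sbitmap_queue *bt_prev;

    do {
            /* ... prepare_to_wait(), one last allocation attempt ... */
            bt_prev = bt;
            io_schedule();

            /* we may have been migrated to another hw queue */
            data->ctx = blk_mq_get_ctx(data->q);
            data->hctx = blk_mq_map_queue(data->q, data->ctx->cpu);
            bt = &data->hctx->tags->bitmap_tags;

            /*
             * Fake an extra wakeup on the queue we left, so the wakeup
             * we consumed there isn't lost to its remaining waiters.
             */
            if (bt != bt_prev)
                    sbitmap_queue_wake_up(bt_prev);
    } while (1);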

    Cc:
    Reviewed-by: Omar Sandoval
    Signed-off-by: Ming Lei

    Modified commit message to make it clearer, and make it apply on
    top of the 4.18 branch.

    Signed-off-by: Jens Axboe

    Ming Lei
     

23 Dec, 2017

1 commit

  • Even with a number of waitqueues, we can get into a situation where we
    are heavily contended on the waitqueue lock. I got a report on spc1
    where we're spending seconds doing this. Arguably the use case is
    nasty; I reproduce it with one device and 1000 threads banging on
    the device.
    But that doesn't mean we shouldn't be handling it better.

    What ends up happening is that a thread will fail to get a tag, add
    itself to the waitqueue, and subsequently get woken up when a tag is
    freed - only to find itself going back to sleep on the waitqueue.

    Instead of waking all threads, use an exclusive wait and wake up our
    sbitmap batch count instead. This seems to work well for me (massive
    improvement for this use case), and it survives basic testing. But I
    haven't fully verified it yet.

    An additional improvement is running the queue and checking for a new
    tag BEFORE needing to add ourselves to the waitqueue.
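
    The two halves of the change, sketched: waiters queue exclusively,
    and the free side wakes only a batch at a time.

    /* tag allocation slow path: exclusive wait instead of wake-all */
    prepare_to_wait_exclusive(&ws->wait, &wait, TASK_UNINTERRUPTIBLE);

    /* sbitmap free path: wake at most wake_batch exclusive waiters */
    if (atomic_dec_return(&ws->wait_cnt) <= 0) {
            atomic_set(&ws->wait_cnt, wake_batch);
            wake_up_nr(&ws->wait, wake_batch);
    }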

    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 Oct, 2017

2 commits

  • No callers left.

    Reviewed-by: Jens Axboe
    Reviewed-by: Bart Van Assche
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Sagi Grimberg
     
  • Add an iterator helper that applies a function to all the tags in
    a given tagset. Export it, as it will be used outside the block
    layer later on.
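
    The helper's shape, roughly, with a toy callback (the counting
    function is purely illustrative):

    int blk_mq_tagset_iter(struct blk_mq_tag_set *set, void *data,
                           int (*fn)(void *, struct request *));

    static int count_rq(void *data, struct request *rq)
    {
            (*(unsigned int *)data)++;
            return 0;       /* a non-zero return stops the iteration */
    }

    unsigned int nr = 0;
    blk_mq_tagset_iter(set, &nr, count_rq);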

    Reviewed-by: Bart Van Assche
    Reviewed-by: Jens Axboe
    Reviewed-by: Max Gurtovoy
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Sagi Grimberg
    Signed-off-by: Christoph Hellwig

    Sagi Grimberg
     

18 Aug, 2017

1 commit

  • Since blk_mq_ops.reinit_request is only called from inside
    blk_mq_reinit_tagset(), make this function pointer an argument of
    blk_mq_reinit_tagset() instead of a member of struct blk_mq_ops.
    This patch does not change any functionality but makes
    blk_mq_reinit_tagset() calls easier to read and to analyze.
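
    Illustrating the call-site change (the nvme callback name is just
    an example):

    /* before: found indirectly through struct blk_mq_ops */
    blk_mq_reinit_tagset(set);      /* invokes set->ops->reinit_request */

    /* after: the callback is explicit at the call site */
    blk_mq_reinit_tagset(set, nvme_reinit_request);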

    Signed-off-by: Bart Van Assche
    Reviewed-by: Hannes Reinecke
    Cc: Christoph Hellwig
    Cc: Sagi Grimberg
    Cc: James Smart
    Cc: Johannes Thumshirn
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

10 Aug, 2017

1 commit

  • Since we introduced blk-mq-sched, the tags->rqs[] array has been
    dynamically assigned. So we need to check for NULL when iterating,
    since there's a window of time where the bit is set, but we haven't
    dynamically assigned the tags->rqs[] array position yet.

    This is perfectly safe, since the memory backing of the request is
    never going away while the device is alive.
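
    The resulting check in the iterator, as a sketch:

    static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
    {
            struct bt_iter_data *iter_data = data;
            struct blk_mq_hw_ctx *hctx = iter_data->hctx;
            struct request *rq = hctx->tags->rqs[bitnr];

            /*
             * The bit can be set before tags->rqs[bitnr] is assigned;
             * the backing memory never goes away while the device is
             * alive, so a NULL check suffices.
             */
            if (rq && rq->q == hctx->queue)
                    iter_data->fn(hctx, rq, iter_data->data,
                                  iter_data->reserved);
            return true;
    }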

    Reviewed-by: Bart Van Assche
    Reviewed-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Apr, 2017

1 commit


13 Mar, 2017

1 commit


02 Mar, 2017

1 commit


28 Jan, 2017

1 commit


27 Jan, 2017

1 commit


25 Jan, 2017

1 commit


21 Jan, 2017

1 commit

  • Add support for growing the tags associated with a hardware queue,
    for the scheduler tags. Currently we only support resizing within
    the limits of the original depth; change that so we can grow it as
    well, by allocating and replacing the existing scheduler tag set.

    This is similar to how we could increase the software queue depth with
    the legacy IO stack and schedulers.
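
    On the caller side this ends up looking roughly like the following
    (abbreviated from blk_mq_update_nr_requests()):

    queue_for_each_hw_ctx(q, hctx, i) {
            if (!hctx->sched_tags) {
                    /* device tags: bounded by the original depth */
                    ret = blk_mq_tag_update_depth(hctx, &hctx->tags,
                                                  nr, false);
            } else {
                    /* scheduler tags: may grow past the original depth */
                    ret = blk_mq_tag_update_depth(hctx, &hctx->sched_tags,
                                                  nr, true);
            }
            if (ret)
                    break;
    }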

    Signed-off-by: Jens Axboe
    Reviewed-by: Omar Sandoval

    Jens Axboe
     

19 Jan, 2017

1 commit


18 Jan, 2017

2 commits


10 Oct, 2016

1 commit

  • Pull blk-mq irq/cpu mapping updates from Jens Axboe:
    "This is the block-irq topic branch for 4.9-rc. It's mostly from
    Christoph, and it allows drivers to specify their own mappings, and
    more importantly, to share the blk-mq mappings with the IRQ affinity
    mappings. It's a good step towards making this work better out of the
    box"

    * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
    blk_mq: linux/blk-mq.h does not include all the headers it depends on
    blk-mq: kill unused blk_mq_create_mq_map()
    blk-mq: get rid of the cpumask in struct blk_mq_tags
    nvme: remove the post_scan callout
    nvme: switch to use pci_alloc_irq_vectors
    blk-mq: provide a default queue mapping for PCI device
    blk-mq: allow the driver to pass in a queue mapping
    blk-mq: remove ->map_queue
    blk-mq: only allocate a single mq_map per tag_set
    blk-mq: don't redistribute hardware queues on a CPU hotplug event

    Linus Torvalds
     

17 Sep, 2016

4 commits

  • In order to get good cache behavior from a sbitmap, we want each CPU to
    stick to its own cacheline(s) as much as possible. This might happen
    naturally as the bitmap gets filled up and the alloc_hint values spread
    out, but we really want this behavior from the start. blk-mq apparently
    intended to do this, but the code to do this was never wired up. Get rid
    of the dead code and make it part of the sbitmap library.
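
    The per-CPU hint logic, condensed from __sbitmap_queue_get() (a
    sketch; the wraparound details are simplified):

    unsigned int hint = this_cpu_read(*sbq->alloc_hint);
    int nr = sbitmap_get(&sbq->sb, hint, sbq->round_robin);

    if (nr == -1) {
            /* the map is full; the hint won't help next time either */
            this_cpu_write(*sbq->alloc_hint, 0);
    } else if (nr == hint || sbq->round_robin) {
            /* stay near the cacheline we just allocated from */
            this_cpu_write(*sbq->alloc_hint,
                           nr + 1 < sbq->sb.depth ? nr + 1 : 0);
    }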

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Again, there's no point in passing this in every time. Make it part of
    struct sbitmap_queue and clean up the API.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Allocating your own per-cpu allocation hint separately makes for an
    awkward API. Instead, allocate the per-cpu hint as part of the struct
    sbitmap_queue. There's no point in a struct sbitmap_queue without
    the cache, but you can still use a bare struct sbitmap.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • This is a generally useful data structure, so make it available to
    anyone else who might want to use it. It's also a nice cleanup
    separating the allocation logic from the rest of the tag handling logic.

    The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
    selected by CONFIG_BLOCK for now.

    This should be a complete noop functionality-wise.
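
    Basic usage of the bare structure, as a sketch (signatures as
    introduced by this series):

    #include <linux/sbitmap.h>

    struct sbitmap sb;
    int nr;

    /* a 'depth'-bit map with a cacheline-aware layout; -1 = default shift */
    if (sbitmap_init_node(&sb, depth, -1, GFP_KERNEL, NUMA_NO_NODE))
            return -ENOMEM;

    nr = sbitmap_get(&sb, 0 /* alloc_hint */, false /* round_robin */);
    if (nr >= 0)
            sbitmap_clear_bit(&sb, nr);     /* free the bit again */

    sbitmap_free(&sb);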

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

15 Sep, 2016

2 commits

  • Unused now that NVMe sets up irq affinity before calling into blk-mq.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • All drivers use the default, so provide an inline version of it. If
    we ever need another queue mapping, we can add an optional method
    back, although supporting it will also require major changes to the
    queue setup code.

    This provides better code generation, and better debuggability as
    well.
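
    The inline default, roughly as added:

    static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
                                                         int cpu)
    {
            return q->queue_hw_ctx[q->mq_map[cpu]];
    }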

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Jul, 2016

1 commit

  • The new nvme-rdma driver will need to reinitialize all the tags as part of
    the error recovery procedure (realloc the tag memory region). Add a helper
    in blk-mq for it that can iterate over all requests in a tagset to make
    this easier.
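
    The helper simply walks every allocated request in the tagset and
    hands it to the driver's reinit callback; condensed (return-value
    handling elided):

    void blk_mq_reinit_tagset(struct blk_mq_tag_set *set)
    {
            int i, j;

            for (i = 0; i < set->nr_hw_queues; i++) {
                    struct blk_mq_tags *tags = set->tags[i];

                    for (j = 0; j < tags->nr_tags; j++) {
                            if (!tags->rqs[j])
                                    continue;
                            set->ops->reinit_request(set->driver_data,
                                                     tags->rqs[j]);
                    }
            }
    }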

    Signed-off-by: Sagi Grimberg
    Tested-by: Ming Lin
    Reviewed-by: Stephen Bates
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Steve Wise
    Tested-by: Steve Wise
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     

13 Apr, 2016

2 commits


02 Dec, 2015

1 commit


07 Nov, 2015

1 commit

  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min", which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT, as was done historically, can now trigger
    false positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases, as other activity will wake kswapd.
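
    Two of the resulting idioms, sketched:

    /* check blockability via the helper, not __GFP_WAIT directly */
    if (gfpflags_allow_blocking(gfp_mask)) {
            /* may sleep: direct reclaim is allowed */
    } else {
            /* atomic path: no sleeping */
    }

    /* mempool-backed path: still wake kswapd, but never enter direct
       reclaim or dip into atomic reserves */
    gfp_t gfp = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_KSWAPD_RECLAIM;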

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

05 Nov, 2015

1 commit

  • Pull core block updates from Jens Axboe:
    "This is the core block pull request for 4.4. I've got a few more
    topic branches this time around, some of them will layer on top of the
    core+drivers changes and will come in a separate round. So not a huge
    chunk of changes in this round.

    This pull request contains:

    - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

    - Unused prototype removal in blk-mq from Christoph.

    - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
    xchg()'s, from Davidlohr.

    - A plug flush fix from Jeff.

    - Also from Jeff, a fix that means we don't have to update shared tag
    sets at init time unless we do a state change. This cuts down boot
    times on thousands of devices a lot with scsi/blk-mq.

    - blk-mq waitqueue barrier fix from Kosuke.

    - Various fixes from Ming:

    - Fixes for segment merging and splitting, and checks, for
    the old core and blk-mq.

    - Potential blk-mq speedup by marking ctx pending at the end
    of a plug insertion batch in blk-mq.

    - direct-io no page dirty on kernel direct reads.

    - A WRITE_SYNC fix for mpage from Roman"

    * 'for-4.4/core' of git://git.kernel.dk/linux-block:
    blk-mq: avoid excessive boot delays with large lun counts
    blktrace: re-write setting q->blk_trace
    blk-mq: mark ctx as pending at batch in flush plug path
    blk-mq: fix for trace_block_plug()
    block: check bio_mergeable() early before merging
    blk-mq: check bio_mergeable() early before merging
    block: avoid to merge splitted bio
    block: setup bi_phys_segments after splitting
    block: fix plug list flushing for nomerge queues
    blk-mq: remove unused blk_mq_clone_flush_request prototype
    blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
    fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
    fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
    block: kmemleak: Track the page allocations for struct request

    Linus Torvalds