10 Oct, 2016

1 commit

  • Pull blk-mq irq/cpu mapping updates from Jens Axboe:
    "This is the block-irq topic branch for 4.9-rc. It's mostly from
    Christoph, and it allows drivers to specify their own mappings, and
    more importantly, to share the blk-mq mappings with the IRQ affinity
    mappings. It's a good step towards making this work better out of the
    box"

    * 'for-4.9/block-irq' of git://git.kernel.dk/linux-block:
    blk_mq: linux/blk-mq.h does not include all the headers it depends on
    blk-mq: kill unused blk_mq_create_mq_map()
    blk-mq: get rid of the cpumask in struct blk_mq_tags
    nvme: remove the post_scan callout
    nvme: switch to use pci_alloc_irq_vectors
    blk-mq: provide a default queue mapping for PCI device
    blk-mq: allow the driver to pass in a queue mapping
    blk-mq: remove ->map_queue
    blk-mq: only allocate a single mq_map per tag_set
    blk-mq: don't redistribute hardware queues on a CPU hotplug event

    Linus Torvalds
     

17 Sep, 2016

4 commits

  • In order to get good cache behavior from a sbitmap, we want each CPU to
    stick to its own cacheline(s) as much as possible. This might happen
    naturally as the bitmap gets filled up and the alloc_hint values spread
    out, but we really want this behavior from the start. blk-mq apparently
    intended to do this, but the code to do this was never wired up. Get rid
    of the dead code and make it part of the sbitmap library.
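
    As a rough sketch (not the verbatim kernel code), the per-CPU hint works
    like this: each CPU starts its search at its own remembered offset and
    stores the bit it found, so repeated allocations stay in the same region
    of the map:

        /* Sketch of the per-CPU alloc_hint idea; names follow the
         * 4.9-era sbitmap API, but this is not the exact kernel code. */
        static int tag_get_with_hint(struct sbitmap_queue *sbq)
        {
                unsigned int hint = this_cpu_read(*sbq->alloc_hint);
                int nr = sbitmap_get(&sbq->sb, hint, false);

                /* Remember where a free bit was found, so this CPU's next
                 * search starts in the same cacheline(s). */
                if (nr >= 0)
                        this_cpu_write(*sbq->alloc_hint, nr + 1);
                return nr;
        }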

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Again, there's no point in passing this in every time. Make it part of
    struct sbitmap_queue and clean up the API.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • Allocating your own per-cpu allocation hint separately makes for an
    awkward API. Instead, allocate the per-cpu hint as part of the struct
    sbitmap_queue. There's no point in a struct sbitmap_queue without the
    cache, but you can still use a bare struct sbitmap.

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     
  • This is a generally useful data structure, so make it available to
    anyone else who might want to use it. It's also a nice cleanup
    separating the allocation logic from the rest of the tag handling logic.

    The code is behind a new Kconfig option, CONFIG_SBITMAP, which is only
    selected by CONFIG_BLOCK for now.

    This should be a complete noop functionality-wise.
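
    For illustration, a hedged sketch of how a user of the extracted library
    might allocate and use a tag space, using the 4.9-era sbitmap_queue
    interface from the commits above (error handling kept minimal):

        #include <linux/sbitmap.h>

        static int example_tag_space(unsigned int depth)
        {
                struct sbitmap_queue sbq;
                int tag, ret;

                /* -1 picks a default shift; 'false' disables round-robin. */
                ret = sbitmap_queue_init_node(&sbq, depth, -1, false,
                                              GFP_KERNEL, NUMA_NO_NODE);
                if (ret)
                        return ret;

                tag = __sbitmap_queue_get(&sbq);        /* -1 when exhausted */
                if (tag >= 0)
                        sbitmap_queue_clear(&sbq, tag, smp_processor_id());

                sbitmap_queue_free(&sbq);
                return 0;
        }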

    Signed-off-by: Omar Sandoval
    Signed-off-by: Jens Axboe

    Omar Sandoval
     

15 Sep, 2016

2 commits

  • Unused now that NVMe sets up irq affinity before calling into blk-mq.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • All drivers use the default, so provide an inline version of it. If we
    ever need another queue mapping we can add an optional method back,
    although supporting it will also require major changes to the queue
    setup code.

    This provides better code generation, and better debuggability as well.
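
    After this change the default mapping reduces to a single table lookup;
    the inline helper looks roughly like this in the 4.9-era code:

        /* Roughly the shape of the inlined default mapping: every CPU is
         * mapped to a hardware context through the per-set mq_map table. */
        static inline struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q,
                                                             int cpu)
        {
                return q->queue_hw_ctx[q->mq_map[cpu]];
        }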

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

08 Jul, 2016

1 commit

  • The new nvme-rdma driver will need to reinitialize all the tags as part of
    the error recovery procedure (realloc the tag memory region). Add a helper
    in blk-mq for it that can iterate over all requests in a tagset to make
    this easier.
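
    A sketch using the blk_mq_tagset_busy_iter() form of tagset-wide
    iteration; 'my_reinit_request' and 'ctrl' are hypothetical names, and
    the helper added by this commit may differ in detail:

        /* Called once for every started request in the tag set. */
        static void my_reinit_request(struct request *rq, void *data,
                                      bool reserved)
        {
                /* e.g. reinitialize driver-private state attached to rq */
        }

        /* During error recovery, after reallocating the tag memory: */
        blk_mq_tagset_busy_iter(&ctrl->tag_set, my_reinit_request, ctrl);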

    Signed-off-by: Sagi Grimberg
    Tested-by: Ming Lin
    Reviewed-by: Stephen Bates
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Steve Wise
    Tested-by: Steve Wise
    Signed-off-by: Jens Axboe

    Sagi Grimberg
     

07 Nov, 2015

1 commit

  • mm, page_alloc: distinguish between being unable to sleep, unwilling to
    sleep and avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min" which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous
    reasons. In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.
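
    In terms of flag composition, the conversion boils down to roughly the
    following (as the flags were defined around the 4.4 time frame):

        /* Approximate definitions after this change: */
        #define __GFP_RECLAIM  (__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM)
        #define GFP_ATOMIC     (__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM)
        #define GFP_KERNEL     (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
        #define GFP_NOWAIT     (__GFP_KSWAPD_RECLAIM)

        /* The preferred way to test for a blocking context: */
        static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
        {
                return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
        }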

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

05 Nov, 2015

1 commit

  • Pull core block updates from Jens Axboe:
    "This is the core block pull request for 4.4. I've got a few more
    topic branches this time around, some of them will layer on top of the
    core+drivers changes and will come in a separate round. So not a huge
    chunk of changes in this round.

    This pull request contains:

    - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

    - Unused prototype removal in blk-mq from Christoph.

    - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
    xchg()'s, from Davidlohr.

    - A plug flush fix from Jeff.

    - Also from Jeff, a fix that means we don't have to update shared tag
    sets at init time unless we do a state change. This cuts down boot
    times on thousands of devices a lot with scsi/blk-mq.

    - blk-mq waitqueue barrier fix from Kosuke.

    - Various fixes from Ming:

    - Fixes for segment merging and splitting, and checks, for
    the old core and blk-mq.

    - Potential blk-mq speedup by marking ctx pending at the end
    of a plug insertion batch in blk-mq.

    - direct-io no page dirty on kernel direct reads.

    - A WRITE_SYNC fix for mpage from Roman"

    * 'for-4.4/core' of git://git.kernel.dk/linux-block:
    blk-mq: avoid excessive boot delays with large lun counts
    blktrace: re-write setting q->blk_trace
    blk-mq: mark ctx as pending at batch in flush plug path
    blk-mq: fix for trace_block_plug()
    block: check bio_mergeable() early before merging
    blk-mq: check bio_mergeable() early before merging
    block: avoid to merge splitted bio
    block: setup bi_phys_segments after splitting
    block: fix plug list flushing for nomerge queues
    blk-mq: remove unused blk_mq_clone_flush_request prototype
    blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
    fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
    fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
    block: kmemleak: Track the page allocations for struct request

    Linus Torvalds
     

15 Oct, 2015

1 commit

  • tags is freed in blk_mq_free_rq_map() and should not be used after that.
    The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
    free_cpumask_var() is a no-op.

    tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
    free the cpumask in its counterpart, blk_mq_free_tags().
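
    The fix, approximately, is to release the mask in blk_mq_free_tags()
    (a sketch of the shape of the change, not the exact diff):

        void blk_mq_free_tags(struct blk_mq_tags *tags)
        {
                bt_free(&tags->bitmap_tags);
                bt_free(&tags->breserved_tags);
                /* Free the cpumask here, mirroring its allocation in
                 * blk_mq_init_tags(), instead of in blk_mq_free_rq_map()
                 * after 'tags' has already been freed. */
                free_cpumask_var(tags->cpumask);
                kfree(tags);
        }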

    Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements")
    Signed-off-by: Jun'ichi Nomura
    Cc: Keith Busch
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Junichi Nomura
     

10 Oct, 2015

1 commit

  • blk_mq_tag_update_depth() seems to be missing a memory barrier which
    might cause the waker to not notice the waiter and fail to send a
    wake_up as in the following figure.

    blk_mq_tag_update_depth                  bt_get
    ------------------------------------------------------------------------
    if (waitqueue_active(&bs->wait))
    /* The CPU might reorder the test for
       the waitqueue up here, before
       prior writes complete */
                                             prepare_to_wait(&bs->wait, &wait,
                                               TASK_UNINTERRUPTIBLE);
                                             tag = __bt_get(hctx, bt, last_tag,
                                               tags);
                                             /* Value set in bt_update_count
                                                not visible yet */
    bt_update_count(&tags->bitmap_tags, tdepth);
    /* blk_mq_tag_wakeup_all(tags, false); */
      bt = &tags->bitmap_tags;
      wake_index = atomic_read(&bt->wake_index);
      ...
                                             io_schedule();
    ------------------------------------------------------------------------

    This patch adds the missing memory barrier.

    I found this issue when I was looking through the linux source code
    for places calling waitqueue_active() before wake_up*(), but without
    preceding memory barriers, after sending a patch to fix a similar
    issue in drivers/tty/n_tty.c (Details about the original issue can be
    found here: https://lkml.org/lkml/2015/9/28/849).
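
    Schematically, the waker side needs a full barrier between updating the
    depth and testing the waitqueue (a sketch of the pattern, not the exact
    diff):

        bt_update_count(&tags->bitmap_tags, tdepth);
        /* Pair with the barrier implied by prepare_to_wait(): make the
         * new depth visible before deciding whether anyone is waiting. */
        smp_mb();
        if (waitqueue_active(&bs->wait))
                wake_up(&bs->wait);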

    Signed-off-by: Kosuke Tatsukawa
    Signed-off-by: Jens Axboe

    Kosuke Tatsukawa
     

15 Aug, 2015

1 commit

  • Inside the timeout handler, blk_mq_tag_to_rq() is called
    to retrieve the request from one tag. This is obviously
    wrong because the request can be freed at any time and some
    fields of the request can't be trusted, so a kernel oops
    might be triggered[1].

    Currently, wrt. blk_mq_tag_to_rq(), the only special case is
    that the flush request can share the same tag as the request it
    was cloned from, and the two requests can't be active at the same
    time, so this patch fixes the above issue by updating tags->rqs[tag]
    with the active request (either the flush rq or the request it was
    cloned from) for the tag.

    Also blk_mq_tag_to_rq() gets much simplified with this patch.

    Given that blk_mq_tag_to_rq() is mainly for drivers and the caller must
    make sure the request can't be freed, this helper is replaced with
    tags->rqs[tag] in bt_for_each().

    [1] kernel oops log
    [ 439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158
    [ 439.697162] IP: [] blk_mq_tag_to_rq+0x21/0x6e
    [ 439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0
    [ 439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
    [ 439.700653] Dumping ftrace buffer:
    [ 439.700653] (ftrace buffer empty)
    [ 439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw
    [ 439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265
    [ 439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    [ 439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000
    [ 439.730500] RIP: 0010:[] [] blk_mq_tag_to_rq+0x21/0x6e
    [ 439.730500] RSP: 0018:ffff880819203da0 EFLAGS: 00010283
    [ 439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002
    [ 439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000
    [ 439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000
    [ 439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202
    [ 439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00
    [ 439.730500] FS: 00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000
    [ 439.730500] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0
    [ 439.730500] Stack:
    [ 439.730500] 0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104
    [ 439.755663] ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80
    [ 439.755663] Call Trace:
    [ 439.755663]
    [ 439.755663] [] bt_for_each+0x6e/0xc8
    [ 439.755663] [] ? blk_mq_rq_timed_out+0x6a/0x6a
    [ 439.755663] [] ? blk_mq_rq_timed_out+0x6a/0x6a
    [ 439.755663] [] blk_mq_tag_busy_iter+0x55/0x5e
    [ 439.755663] [] ? blk_mq_bio_to_request+0x38/0x38
    [ 439.755663] [] blk_mq_rq_timer+0x5d/0xd4
    [ 439.755663] [] call_timer_fn+0xf7/0x284
    [ 439.755663] [] ? call_timer_fn+0x5/0x284
    [ 439.755663] [] ? blk_mq_bio_to_request+0x38/0x38
    [ 439.755663] [] run_timer_softirq+0x1ce/0x1f8
    [ 439.755663] [] __do_softirq+0x181/0x3a4
    [ 439.755663] [] irq_exit+0x40/0x94
    [ 439.755663] [] smp_apic_timer_interrupt+0x33/0x3e
    [ 439.755663] [] apic_timer_interrupt+0x84/0x90
    [ 439.755663]
    [ 439.755663] [] ? _raw_spin_unlock_irq+0x32/0x4a
    [ 439.755663] [] finish_task_switch+0xe0/0x163
    [ 439.755663] [] ? finish_task_switch+0xa2/0x163
    [ 439.755663] [] __schedule+0x469/0x6cd
    [ 439.755663] [] schedule+0x82/0x9a
    [ 439.789267] [] signalfd_read+0x186/0x49a
    [ 439.790911] [] ? wake_up_q+0x47/0x47
    [ 439.790911] [] __vfs_read+0x28/0x9f
    [ 439.790911] [] ? __fget_light+0x4d/0x74
    [ 439.790911] [] vfs_read+0x7a/0xc6
    [ 439.790911] [] SyS_read+0x49/0x7f
    [ 439.790911] [] entry_SYSCALL_64_fastpath+0x12/0x6f
    [ 439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
    f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
    53 38 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
    [ 439.790911] RIP [] blk_mq_tag_to_rq+0x21/0x6e
    [ 439.790911] RSP
    [ 439.790911] CR2: 0000000000000158
    [ 439.790911] ---[ end trace d40af58949325661 ]---

    Cc:
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

02 Jun, 2015

1 commit

  • Storage controllers may expose multiple block devices that share hardware
    resources managed by blk-mq. This patch enhances the shared tags so a
    low-level driver can access the shared resources not tied to the unshared
    h/w contexts. This way the LLD can dynamically add and delete disks and
    request queues without having to track all the request_queue hctx's to
    iterate outstanding tags.

    Signed-off-by: Keith Busch
    Signed-off-by: Jens Axboe

    Keith Busch
     

19 Mar, 2015

1 commit

  • When allocating from the reserved tags pool, bt_get() is called with
    a NULL hctx. If all tags are in use, the hw queue is kicked to push
    out any pending IO, potentially freeing tags, and tag allocation is
    retried. The problem is that blk_mq_run_hw_queue() doesn't check for
    a NULL hctx. So we avoid it with a simple NULL hctx test.

    Tested by hammering mtip32xx with concurrent smartctl/hdparm.
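
    The guard itself is a one-liner in bt_get(); roughly:

        /*
         * We're out of tags on this hardware queue; kick any pending IO
         * before sleeping. Note that hctx can be NULL here for reserved
         * tag allocation.
         */
        if (hctx)
                blk_mq_run_hw_queue(hctx, false);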

    Signed-off-by: Sam Bradshaw
    Signed-off-by: Selvan Mani
    Fixes: b32232073e80 ("blk-mq: fix hang in bt_get()")
    Cc: stable@kernel.org

    Added appropriate comment.

    Signed-off-by: Jens Axboe

    Sam Bradshaw
     

12 Feb, 2015

1 commit

  • If the allocation of bt->bs fails, then bt->map can be freed twice, once
    in blk_mq_init_bitmap_tags() -> bt_alloc(), and once in
    blk_mq_init_bitmap_tags() -> bt_free(). Fix by setting the pointer to
    NULL after the first free.
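
    One plausible shape of the fix in the bt_alloc() error path (a sketch,
    not the exact diff):

        bt->bs = kzalloc(BT_WAIT_QUEUES * sizeof(*bt->bs), GFP_KERNEL);
        if (!bt->bs) {
                kfree(bt->map);
                bt->map = NULL;  /* bt_free() would otherwise free it again */
                return -ENOMEM;
        }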

    Cc:
    Signed-off-by: Tony Battersby
    Signed-off-by: Jens Axboe

    Tony Battersby
     

24 Jan, 2015

1 commit

  • This is the blk-mq part to support tag allocation policy. The default
    allocation policy isn't changed (though it's not a strict FIFO). The new
    policy is round-robin for libata. But it's a best-effort implementation:
    if multiple tasks are competing, the tags returned will be mixed (which
    is unavoidable even with !mq, as requests from different tasks can be
    mixed in the queue).

    Cc: Jens Axboe
    Cc: Tejun Heo
    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

14 Jan, 2015

1 commit

  • The blk-mq tagging tries to maintain some locality between CPUs and
    the tags issued. The tags are split into groups of words, and the
    words may not be fully populated. When searching for a new free tag,
    blk-mq may look at partial words, hence it passes in an offset/size
    to find_next_zero_bit(). However, it does that wrong: the size must
    always be the full length of the number of tags in that word,
    otherwise we'll potentially miss some near the end.

    Another issue is when __bt_get() goes from one word set to the next.
    It bumps the index, but not the last_tag associated with the
    previous index. Bump that to be in the range of the new word.

    Finally, clean up __bt_get() and __bt_get_word() a bit and get
    rid of the goto in there, and the unnecessary 'wrap' variable.
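
    Schematically, the offset only selects where the search starts; the size
    argument must remain the word's full tag count:

        /* 'last_tag' is just the starting offset within this word; passing
         * anything smaller than bm->depth as the size would hide free bits
         * between the offset and the end of the word. */
        tag = find_next_zero_bit(&bm->word, bm->depth, last_tag);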

    Signed-off-by: Jens Axboe

    Jens Axboe
     

01 Jan, 2015

1 commit

  • If it's dying, we can't expect new requests to complete and come
    in and wake up other tasks waiting for requests. So after we
    have marked it as dying, wake up everybody currently waiting
    for a request. Once they wake, they will retry their allocation
    and fail appropriately due to the state of the queue.

    Tested-by: Keith Busch
    Signed-off-by: Jens Axboe

    Jens Axboe
     

15 Dec, 2014

1 commit

  • This reverts commit 52f7eb945f2ba62b324bb9ae16d945326a961dcf.

    The optimization is only really safe for a single queue, otherwise
    'bs' and 'bt' can indeed change, and if we don't do a finish_wait()
    for each loop, we'll potentially change the wait structure and
    corrupt the task wait list.

    Reported-by: Jan Kara

    Jens Axboe
     

14 Dec, 2014

1 commit

  • Pull block driver core update from Jens Axboe:
    "This is the pull request for the core block IO changes for 3.19. Not
    a huge round this time, mostly lots of little good fixes:

    - Fix a bug in sysfs blktrace interface causing a NULL pointer
    dereference, when enabled/disabled through that API. From Arianna
    Avanzini.

    - Various updates/fixes/improvements for blk-mq:

    - A set of updates from Bart, mostly fixing bugs in the tag
    handling.

    - Cleanup/code consolidation from Christoph.

    - Extend queue_rq API to be able to handle batching issues of IO
    requests. NVMe will utilize this shortly. From me.

    - A few tag and request handling updates from me.

    - Cleanup of the preempt handling for running queues from Paolo.

    - Prevent running of unmapped hardware queues from Ming Lei.

    - Move the kdump memory limiting check to be in the correct
    location, from Shaohua.

    - Initialize all software queues at init time from Takashi. This
    prevents a kobject warning when CPUs are brought online that
    weren't online when a queue was registered.

    - Single writeback fix for I_DIRTY clearing from Tejun. Queued with
    the core IO changes, since it's just a single fix.

    - Version X of the __bio_add_page() segment addition retry from
    Maurizio. Hope the Xth time is the charm.

    - Documentation fixup for IO scheduler merging from Jan.

    - Introduce (and use) generic IO stat accounting helpers for non-rq
    drivers, from Gu Zheng.

    - Kill off artificial limiting of max sectors in a request from
    Christoph"

    * 'for-3.19/core' of git://git.kernel.dk/linux-block: (26 commits)
    bio: modify __bio_add_page() to accept pages that don't start a new segment
    blk-mq: Fix uninitialized kobject at CPU hotplugging
    blktrace: don't let the sysfs interface remove trace from running list
    blk-mq: Use all available hardware queues
    blk-mq: Micro-optimize bt_get()
    blk-mq: Fix a race between bt_clear_tag() and bt_get()
    blk-mq: Avoid that __bt_get_word() wraps multiple times
    blk-mq: Fix a use-after-free
    blk-mq: prevent unmapped hw queue from being scheduled
    blk-mq: re-check for available tags after running the hardware queue
    blk-mq: fix hang in bt_get()
    blk-mq: move the kdump check to blk_mq_alloc_tag_set
    blk-mq: cleanup tag free handling
    blk-mq: use 'nr_cpu_ids' as highest CPU ID count for hwq cpu map
    blk: introduce generic io stat accounting help function
    blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
    genhd: check for int overflow in disk_expand_part_tbl()
    blk-mq: add blk_mq_free_hctx_request()
    blk-mq: export blk_mq_free_request()
    blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
    ...

    Linus Torvalds
     

10 Dec, 2014

3 commits

  • Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr()
    calls into a single call.

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • What we need is the following two guarantees:
    * Any thread that observes the effect of the test_and_set_bit() by
    __bt_get_word() also observes the preceding addition of 'current'
    to the appropriate wait list. This is guaranteed by the semantics
    of the spin_unlock() operation performed by prepare_to_wait().
    Hence the conversion of test_and_set_bit_lock() into
    test_and_set_bit().
    * The wait lists are examined by bt_clear() after the tag bit has
    been cleared. clear_bit_unlock() guarantees that any thread that
    observes that the bit has been cleared also observes the store
    operations preceding clear_bit_unlock(). However,
    clear_bit_unlock() does not prevent the wait lists from being
    examined before the tag bit is cleared. Hence the addition of a memory
    barrier between clear_bit() and the wait list examination.
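
    A sketch of the release side after this change (not the exact diff):

        clear_bit(TAG_TO_BIT(bt, tag), &bt->map[index].word);
        /* Ensure the wait-list examination below happens after the tag
         * bit is visible as cleared. */
        smp_mb();
        bs = bt_wake_ptr(bt);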

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Cc: # v3.13+
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • If __bt_get_word() is called with last_tag != 0; if the first
    find_next_zero_bit() fails; if, after wrap-around, the
    test_and_set_bit() call fails and find_next_zero_bit() succeeds;
    and if the next test_and_set_bit() call fails and the subsequent
    find_next_zero_bit() does not find a zero bit, then another
    wrap-around will occur. Avoid this by introducing an additional
    local variable.
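
    A sketch of the wrap-once logic with the extra local variable (close to,
    but not necessarily identical to, the committed code):

        static int __bt_get_word(struct blk_align_bitmap *bm,
                                 unsigned int last_tag)
        {
                int tag, org_last_tag = last_tag;

                while (1) {
                        tag = find_next_zero_bit(&bm->word, bm->depth,
                                                 last_tag);
                        if (unlikely(tag >= bm->depth)) {
                                /*
                                 * Only wrap around once: if we started at a
                                 * non-zero offset, retry from 0 to exhaust
                                 * the map; otherwise give up.
                                 */
                                if (org_last_tag && last_tag) {
                                        last_tag = org_last_tag = 0;
                                        continue;
                                }
                                return -1;
                        }
                        if (!test_and_set_bit(tag, &bm->word))
                                break;
                        last_tag = tag + 1;
                }
                return tag;
        }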

    Signed-off-by: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Robert Elliott
    Cc: Ming Lei
    Cc: Alexander Gordeev
    Cc: # v3.13+
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

08 Dec, 2014

2 commits

  • If we run out of tags and have to sleep, we run the hardware queue
    to kick pending IO into gear. During that run, we may have completed
    requests, so re-check if we have free tags before going to sleep.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Avoid a hang in bt_get() when there are fewer hardware queues than
    CPU threads. The symptoms of the hang were as follows:

    * All tags allocated for a particular hardware queue.
    * (nr_tags) pending commands for that hardware queue.
    * No pending commands for the software queues associated with that
    hardware queue.

    Signed-off-by: Jens Axboe

    Bart Van Assche
     

12 Nov, 2014

1 commit

  • The queuecommand() callback functions in SCSI low-level drivers
    need to know which hardware context has been selected by the
    block layer. Since this information is not available in the
    request structure, and since passing the hctx pointer directly to
    the queuecommand callback function would require modification of
    all SCSI LLDs, add a function to the block layer that allows
    querying the hardware context index.
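
    A sketch of the LLD-facing interface this enables: the hardware queue
    index is encoded in the upper bits of a "unique tag" and the per-queue
    tag in the lower bits (names follow the blk_mq_unique_tag() helpers;
    exact details may differ from this commit):

        u32 unique = blk_mq_unique_tag(rq);
        u16 hwq    = blk_mq_unique_tag_to_hwq(unique);  /* hctx index */
        u16 tag    = blk_mq_unique_tag_to_tag(unique);  /* tag within hctx */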

    Signed-off-by: Bart Van Assche
    Acked-by: Jens Axboe
    Reviewed-by: Sagi Grimberg
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Christoph Hellwig

    Bart Van Assche
     

23 Sep, 2014

1 commit

  • Don't do a kmalloc from the timer handler to handle timeouts; chances
    are we could be under heavy load or similar and thus just miss out on
    the timeouts. Fortunately it is very easy to just iterate over all
    in-use tags, and doing this properly actually cleans up the
    blk_mq_busy_iter API as well, and prepares us for the next patch by
    passing a reserved argument to the iterator.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

18 Jun, 2014

3 commits

  • This update fixes a few issues in the bt_get() function:

    - the list_empty(&wait.task_list) check is not protected;

    - the was_empty check is always true, which results in *every* thread
    entering the loop resetting the bt_wait_state::wait_cnt counter,
    rather than every bt->wake_cnt'th thread;

    - the 'bt_wait_state::wait_cnt' counter update is redundant, since
    it also gets reset in the bt_clear_tag() function;

    Cc: Christoph Hellwig
    Cc: Ming Lei
    Cc: Jens Axboe
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Alexander Gordeev
     
  • This piece of code in the bt_clear_tag() function is racy:

        bs = bt_wake_ptr(bt);
        if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
                atomic_set(&bs->wait_cnt, bt->wake_cnt);
                wake_up(&bs->wait);
        }

    Since nothing prevents bt_wake_ptr() from returning the very
    same 'bs' address on multiple CPUs, the following scenario is
    possible:

    CPU1                                      CPU2
    ----                                      ----

    0. bs = bt_wake_ptr(bt);                  bs = bt_wake_ptr(bt);
    1. atomic_dec_and_test(&bs->wait_cnt)
    2.                                        atomic_dec_and_test(&bs->wait_cnt)
    3. atomic_set(&bs->wait_cnt, bt->wake_cnt);

    If the decrement in [1] yields zero then for some amount of time
    the decrement in [2] results in a negative/overflow value, which
    is not expected. The follow-up assignment in [3] overwrites the
    invalid value with the batch value (and likely prevents the issue
    from being severe), but it is still incorrect and should be fixed.
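
    A sketch of the fixed accounting, acting on the value returned by the
    decrement itself so that only the CPU that reaches zero refills the
    batch (shape of the fix, not the exact diff):

        wait_cnt = atomic_dec_return(&bs->wait_cnt);
        if (wait_cnt == 0) {
                /* Refill by adding, so concurrent decrements that dipped
                 * below zero are accounted for rather than overwritten. */
                atomic_add(bt->wake_cnt, &bs->wait_cnt);
                wake_up(&bs->wait);
        }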

    Cc: Ming Lei
    Cc: Jens Axboe
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Alexander Gordeev
     
  • Fix racy updates of shared blk_mq_bitmap_tags::wake_index
    and blk_mq_hw_ctx::wake_index fields.

    Cc: Ming Lei
    Signed-off-by: Alexander Gordeev
    Signed-off-by: Jens Axboe

    Alexander Gordeev
     

29 May, 2014

1 commit