Eric Lee / smarc-fsl-linux-kernel

15 Sep, 2018

1 commit

d67c7c9dd block: bvec_nr_vecs() returns value for wrong slab

[ Upstream commit d6c02a9beb67f13d5f14f23e72fa9981e8b84477 ]

In commit ed996a52c868 ("block: simplify and cleanup bvec pool
handling"), the value of the slab index is incremented by one in
bvec_alloc() after the allocation is done to indicate an index value of
0 does not need to be later freed.

bvec_nr_vecs() was not updated accordingly, and thus returns the wrong
value. Decrement idx before performing the lookup.

Fixes: ed996a52c868 ("block: simplify and cleanup bvec pool handling")
Signed-off-by: Greg Edwards
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Greg Edwards
2018-09-15 15:45:30 +0800
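
The off-by-one described in this commit can be sketched in userspace C. This is a minimal model, not the kernel code: the table contents and every model_* name are assumptions made for illustration. The allocator reports the slab index plus one so that 0 can mean "nothing to free", and the lookup has to undo that offset before indexing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical slab table: capacity (in vectors) of each pool. */
static const unsigned short model_slab_nr_vecs[] = { 1, 4, 16, 64, 128, 256 };
#define MODEL_NR_SLABS \
	(sizeof(model_slab_nr_vecs) / sizeof(model_slab_nr_vecs[0]))

/*
 * Pick the smallest slab that can hold nr vectors and return its
 * index plus one. 0 is reserved for "no slab used"; it is also
 * returned here when nr does not fit in any slab.
 */
static unsigned int model_bvec_alloc_idx(unsigned int nr)
{
	for (size_t i = 0; i < MODEL_NR_SLABS; i++)
		if (nr <= model_slab_nr_vecs[i])
			return (unsigned int)i + 1;
	return 0;
}

/* The lookup must subtract the offset added by the allocator. */
static unsigned int model_bvec_nr_vecs(unsigned int idx)
{
	if (idx == 0 || idx > MODEL_NR_SLABS)
		return 0;
	return model_slab_nr_vecs[idx - 1];
}
```

Without the "idx - 1", a request for 5 vectors would be reported as living in the 64-entry slab rather than the 16-entry one.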

03 Aug, 2018

2 commits

b8088c524 block: reset bi_iter.bi_done after splitting bio

commit 5151842b9d8732d4cbfa8400b40bff894f501b2f upstream.

After the bio has been updated to represent the remaining sectors, reset
bi_done so bio_rewind_iter() does not rewind further than it should.

This resolves a bio_integrity_process() failure on reads where the
original request was split.

Fixes: 63573e359d05 ("bio-integrity: Restore original iterator on verify stage")
Signed-off-by: Greg Edwards
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

Greg Edwards
2018-08-03 13:50:42 +0800
2258351cf block: bio_iov_iter_get_pages: fix size of last iovec

commit b403ea2404889e1227812fa9657667a1deb9c694 upstream.

If the last page of the bio is not "full", the length of the last
vector slot needs to be corrected. This slot has the index
(bio->bi_vcnt - 1), but only in bio->bi_io_vec. In the "bv" helper
array, which is shifted by the value of bio->bi_vcnt at function
invocation, the correct index is (nr_pages - 1).

v2: improved readability following suggestions from Ming Lei.
v3: followed a formatting suggestion from Christoph Hellwig.

Fixes: 2cefe4dbaadf ("block: add bio_iov_iter_get_pages()")
Reviewed-by: Hannes Reinecke
Reviewed-by: Ming Lei
Reviewed-by: Jan Kara
Reviewed-by: Christoph Hellwig
Signed-off-by: Martin Wilck
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

Martin Wilck
2018-08-03 13:50:42 +0800
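
The indexing issue in this commit can be illustrated with a small userspace sketch. The struct, the page size constant and all model_* names are assumptions for illustration, and the buffer is assumed to start at a page boundary. The helper pointer "bv" is shifted by the vector count at entry, so the last newly filled slot is bv[nr_pages - 1], which is the same slot as io_vec[vcnt + nr_pages - 1].

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u

struct model_vec {
	unsigned int len;
};

/*
 * Fill nr_pages slots after the vcnt slots already in use, covering
 * `size` bytes. Only the last slot may be partially filled.
 */
static void model_fill(struct model_vec *io_vec, unsigned int vcnt,
		       unsigned int nr_pages, size_t size)
{
	struct model_vec *bv = io_vec + vcnt;	/* shifted helper array */
	size_t tail = size % MODEL_PAGE_SIZE;

	if (nr_pages == 0)
		return;
	for (unsigned int i = 0; i < nr_pages; i++)
		bv[i].len = MODEL_PAGE_SIZE;
	if (tail)
		bv[nr_pages - 1].len = (unsigned int)tail; /* not bv[vcnt + nr_pages - 1] */
}

/* Two slots are already used; three more are added, the last one short. */
static unsigned int model_demo_last_len(void)
{
	struct model_vec io_vec[8] = { { 4096 }, { 4096 } };

	model_fill(io_vec, 2, 3, 2 * MODEL_PAGE_SIZE + 100);
	return io_vec[2 + 3 - 1].len;
}
```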

26 Apr, 2018

1 commit

c6c6e38ae block: Set BIO_TRACE_COMPLETION on new bio during split

[ Upstream commit 20d59023c5ec4426284af492808bcea1f39787ef ]

We inadvertently set it again on the source bio, but we need
to set it on the new split bio instead.

Fixes: fbbaf700e7b1 ("block: trace completion of all bios.")
Signed-off-by: Goldwyn Rodrigues
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman

Goldwyn Rodrigues
2018-04-26 17:02:10 +0800

08 Apr, 2018

1 commit

92e3d3f67 Fix slab name "biovec-(1<<(21-12))"

commit bd5c4facf59648581d2f1692dad7b107bf429954 upstream.

I'm getting a slab named "biovec-(1<<(21-12))". [...]
Cc: stable@vger.kernel.org # v4.14+
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

Mikulas Patocka
2018-04-08 20:26:33 +0800

30 Dec, 2017

1 commit

3ef1c33f9 block-throttle: avoid double charge

commit 111be883981748acc9a56e855c8336404a8e787c upstream.

If a bio is throttled and split after throttling, the bio could be
resubmitted and enter the throttling again. This will cause part of the
bio to be charged multiple times. If the cgroup has an IO limit, the
double charge will significantly harm the performance. The bio split
becomes quite common after arbitrary bio size change.

To fix this, we always set the BIO_THROTTLED flag if a bio is throttled.
If the bio is cloned/split, we copy the flag to the new bio too, to avoid
a double charge. However, a cloned bio could be directed to a new disk,
in which case keeping the flag would be a problem. The observation is
that we always set a new disk for the bio in this case, so we can clear
the flag in bio_set_dev().

This issue has existed for a long time; arbitrary bio size change just
makes it worse, so this should go into stable at least since v4.2.

V1-> V2: Not add extra field in bio based on discussion with Tejun

Cc: Vivek Goyal
Acked-by: Tejun Heo
Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

Shaohua Li
2017-12-30 00:53:47 +0800
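
The flag handling described in this commit can be sketched as a userspace model. Everything named model_* is hypothetical, and the "charge" is just a counter standing in for the cgroup accounting. A bio is charged once, a split copy inherits the flag so it is not charged again, and pointing the bio at a different disk clears the flag so that the new device can throttle it.

```c
#include <assert.h>

#define MODEL_BIO_THROTTLED 1u

struct model_bio {
	unsigned int flags;
	int disk;
	unsigned int sectors;
};

static unsigned int model_charged;	/* sectors charged so far */

/* Charge the bio unless it has already been through throttling. */
static void model_throttle(struct model_bio *bio)
{
	if (bio->flags & MODEL_BIO_THROTTLED)
		return;
	model_charged += bio->sectors;
	bio->flags |= MODEL_BIO_THROTTLED;
}

/* Split `sectors` off the front; the new bio inherits the flags. */
static struct model_bio model_split(struct model_bio *bio, unsigned int sectors)
{
	struct model_bio front = *bio;

	front.sectors = sectors;
	bio->sectors -= sectors;
	return front;
}

/* Redirecting to another disk makes the bio chargeable again. */
static void model_set_dev(struct model_bio *bio, int disk)
{
	if (bio->disk != disk)
		bio->flags &= ~MODEL_BIO_THROTTLED;
	bio->disk = disk;
}

static unsigned int model_demo_split(void)
{
	struct model_bio bio = { 0, 1, 16 };
	struct model_bio front;

	model_charged = 0;
	model_throttle(&bio);		/* charged: 16 sectors */
	front = model_split(&bio, 8);
	model_throttle(&front);		/* skipped, flag inherited */
	model_throttle(&bio);		/* skipped, flag still set */
	return model_charged;
}

static unsigned int model_demo_redirect(void)
{
	struct model_bio bio = { 0, 1, 16 };

	model_charged = 0;
	model_throttle(&bio);		/* charged on disk 1 */
	model_set_dev(&bio, 2);		/* flag cleared */
	model_throttle(&bio);		/* charged again on disk 2 */
	return model_charged;
}
```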

24 Nov, 2017

1 commit

5d62da3a8 bio: ensure __bio_clone_fast copies bi_partno

commit 62530ed8b1d07a45dec94d46e521c0c6c2d476e6 upstream.

A new field was introduced in 74d46992e0d9, bi_partno, instead of using
bdev->bd_contains and encoding the partition information in the bi_bdev
field. __bio_clone_fast was changed to copy the disk information, but
not the partition information. At minimum, this regressed bcache and
caused data corruption.

Signed-off-by: Michael Lyle
Fixes: 74d46992e0d9 ("block: replace bi_bdev with a gendisk pointer and partitions index")
Reported-by: Pavel Goran
Reported-by: Campbell Steven
Reviewed-by: Coly Li
Reviewed-by: Ming Lei
Signed-off-by: Jens Axboe
Signed-off-by: Greg Kroah-Hartman

Michael Lyle
2017-11-24 15:37:03 +0800

11 Oct, 2017

3 commits

1cfd0ddd8 bio_copy_user_iov(): don't ignore ->iov_offset

Since "block: support large requests in blk_rq_map_user_iov" we
started to call it with partially drained iter; that works fine
on the write side, but reads create a copy of iter for completion
time. And that needs to take the possibility of ->iov_offset != 0
into account...

Cc: stable@vger.kernel.org #v4.5+
Signed-off-by: Al Viro

Al Viro
2017-10-11 11:55:14 +0800
2b04e8f6b more bio_map_user_iov() leak fixes

we need to take care of failure exit as well - pages already
in bio should be dropped by analogue of bio_unmap_pages(),
since their refcounts had been bumped only once per reference
in bio.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro

Al Viro
2017-10-11 11:54:57 +0800
95d78c28b fix unbalanced page refcounting in bio_map_user_iov

bio_map_user_iov and bio_unmap_user do unbalanced pages refcounting if
IO vector has small consecutive buffers belonging to the same page.
bio_add_pc_page merges them into one, but the page reference is never
dropped.

Cc: stable@vger.kernel.org
Signed-off-by: Vitaly Mayatskikh
Signed-off-by: Al Viro

Vitaly Mayatskikh
2017-10-11 11:54:51 +0800

08 Sep, 2017

1 commit

3645e6d0d Merge tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md

Pull MD updates from Shaohua Li:
"This update mainly fixes bugs:

- Make raid5 ppl support several ppl from Pawel

- Several raid5-cache bug fixes from Song

- Bitmap fixes from Neil and Me

- One raid1/10 regression fix since 4.12 from Me

- Other small fixes and cleanup"

* tag 'md/4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
md/bitmap: disable bitmap_resize for file-backed bitmaps.
raid5-ppl: Recovery support for multiple partial parity logs
md: Runtime support for multiple ppls
md/raid0: attach correct cgroup info in bio
lib/raid6: align AVX512 constants to 512 bits, not bytes
raid5: remove raid5_build_block
md/r5cache: call mddev_lock/unlock() in r5c_journal_mode_show
md: replace seq_release_private with seq_release
md: notify about new spare disk in the container
md/raid1/10: reset bio allocated from mempool
md/raid5: release/flush io in raid5_do_work()
md/bitmap: copy correct data for bitmap super

Linus Torvalds
2017-09-08 03:41:48 +0800

26 Aug, 2017

1 commit

8a8e6f84a md/raid0: attach correct cgroup info in bio

The discard bio doesn't attach the original bio cgroup info. Normal bio
is cloned, so is fine.

Signed-off-by: Shaohua Li

Shaohua Li
2017-08-26 01:21:48 +0800

24 Aug, 2017

1 commit

74d46992e block: replace bi_bdev with a gendisk pointer and partitions index

This way we don't need a block_device structure to submit I/O. The
block_device has different life time rules from the gendisk and
request_queue and is usually only available when the block device node
is open. Other callers need to explicitly create one (e.g. the lightnvm
passthrough code, or the new nvme multipathing code).

For the actual I/O path all that we need is the gendisk, which exists
once per block device. But given that the block layer also does
partition remapping we additionally need a partition index, which is
used for said remapping in generic_make_request.

Note that all the block drivers generally want request_queue or
sometimes the gendisk, so this removes a layer of indirection all
over the stack.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-08-24 02:49:55 +0800

10 Aug, 2017

1 commit

d62e26b3f block: pass in queue to inflight accounting

No functional change in this patch, just in preparation for
basing the inflight mechanism on the queue in question.

Reviewed-by: Bart Van Assche
Reviewed-by: Omar Sandoval
Signed-off-by: Jens Axboe

Jens Axboe
2017-08-10 03:09:16 +0800

02 Aug, 2017

1 commit

3d289d688 block: Add comment to submit_bio_wait()

submit_bio_wait() does not consume bio reference. Add comment about
that.

Signed-off-by: Jan Kara
Signed-off-by: Jens Axboe

Jan Kara
2017-08-02 22:25:04 +0800

11 Jul, 2017

1 commit

b222dd2fd block: call bio_uninit in bio_endio

bio_free isn't a good place to free cgroup info. There are a lot of
cases where a bio is allocated in a special way (for example, on the
stack) and never reaches bio_put, and hence bio_free, so we leak memory.
This patch moves the free to bio_endio, which should be called anyway.
The bio_uninit call in bio_free is kept, in case bio_endio is never
called for the bio.

This assumes ->bi_end_io() doesn't access cgroup info, which seems true
in my audit.

This along with Christoph's integrity patch should fix the memory leak
issue.

Cc: Christoph Hellwig
Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe

Shaohua Li
2017-07-11 02:43:33 +0800

04 Jul, 2017

4 commits

7c20f1168 bio-integrity: stop abusing bi_end_io

And instead call directly into the integrity code from bio_endio.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-07-04 07:00:59 +0800
fbd08e767 bio-integrity: fix interface for bio_integrity_trim

bio_integrity_trim inherited its interface from bio_trim and accepts
offset and size, but this API is error prone because the data offset
must always be in sync with the bio's data offset. That is why we have
an integrity update hook in bio_advance().

So the only meaningful values are: offset == 0, sectors == bio_sectors(bio).
Let's just remove them completely.

Reviewed-by: Hannes Reinecke
Reviewed-by: Christoph Hellwig
Reviewed-by: Martin K. Petersen
Signed-off-by: Dmitry Monakhov
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Dmitry Monakhov
2017-07-04 06:56:22 +0800
376a78abf bio-integrity: bio_trim should truncate integrity vector accordingly

Reviewed-by: Christoph Hellwig
Reviewed-by: Hannes Reinecke
Reviewed-by: Martin K. Petersen
Signed-off-by: Dmitry Monakhov
Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Dmitry Monakhov
2017-07-04 06:56:19 +0800
c6b1e36c8 Merge branch 'for-4.13/block' of git://git.kernel.dk/linux-block

Pull core block/IO updates from Jens Axboe:
"This is the main pull request for the block layer for 4.13. Not a huge
round in terms of features, but there's a lot of churn related to some
core cleanups.

Note this depends on the UUID tree pull request, that Christoph
already sent out.

This pull request contains:

- A series from Christoph, unifying the error/stats codes in the
block layer. We now use blk_status_t everywhere, instead of using
different schemes for different places.

- Also from Christoph, some cleanups around request allocation and IO
scheduler interactions in blk-mq.

- And yet another series from Christoph, cleaning up how we handle
and do bounce buffering in the block layer.

- A blk-mq debugfs series from Bart, further improving on the support
we have for exporting internal information to aid debugging IO
hangs or stalls.

- Also from Bart, a series that cleans up the request initialization
differences across types of devices.

- A series from Goldwyn Rodrigues, allowing the block layer to return
failure if we will block and the user asked for non-blocking.

- Patch from Hannes for supporting setting loop devices block size to
that of the underlying device.

- Two series of patches from Javier, fixing various issues with
lightnvm, particular around pblk.

- A series from me, adding support for write hints. This comes with
NVMe support as well, so applications can help guide data placement
on flash to improve performance, latencies, and write
amplification.

- A series from Ming, improving and hardening blk-mq support for
stopping/starting and quiescing hardware queues.

- Two pull requests for NVMe updates. Nothing major on the feature
side, but lots of cleanups and bug fixes. From the usual crew.

- A series from Neil Brown, greatly improving the bio rescue set
support. Most notably, this kills the bio rescue work queues, if we
don't really need them.

- Lots of other little bug fixes that are all over the place"

* 'for-4.13/block' of git://git.kernel.dk/linux-block: (217 commits)
lightnvm: pblk: set line bitmap check under debug
lightnvm: pblk: verify that cache read is still valid
lightnvm: pblk: add initialization check
lightnvm: pblk: remove target using async. I/Os
lightnvm: pblk: use vmalloc for GC data buffer
lightnvm: pblk: use right metadata buffer for recovery
lightnvm: pblk: schedule if data is not ready
lightnvm: pblk: remove unused return variable
lightnvm: pblk: fix double-free on pblk init
lightnvm: pblk: fix bad le64 assignations
nvme: Makefile: remove dead build rule
blk-mq: map all HWQ also in hyperthreaded system
nvmet-rdma: register ib_client to not deadlock in device removal
nvme_fc: fix error recovery on link down.
nvmet_fc: fix crashes on bad opcodes
nvme_fc: Fix crash when nvme controller connection fails.
nvme_fc: replace ioabort msleep loop with completion
nvme_fc: fix double calls to nvme_cleanup_cmd()
nvme-fabrics: verify that a controller returns the correct NQN
nvme: simplify nvme_dev_attrs_are_visible
...

Linus Torvalds
2017-07-04 01:34:51 +0800

29 Jun, 2017

1 commit

9ae3b3f52 block: provide bio_uninit() for freeing integrity/task associations

Wen reports significant memory leaks with DIF and O_DIRECT:

"With nvme device + T10 enabled, On a system it has 256GB and started
logging /proc/meminfo & /proc/slabinfo for every minute and in an hour
it increased by 15968128 kB or ~15+GB.. Approximately 256 MB / minute
leaking.

/proc/meminfo | grep SUnreclaim...

SUnreclaim: 6752128 kB
SUnreclaim: 6874880 kB
SUnreclaim: 7238080 kB
....
SUnreclaim: 22307264 kB
SUnreclaim: 22485888 kB
SUnreclaim: 22720256 kB

When testcases with T10 enabled call into __blkdev_direct_IO_simple,
code doesn't free memory allocated by bio_integrity_alloc. The patch
fixes the issue. HTX has been run with +60 hours without failure."

Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
doesn't go through the regular bio free. This means that any ancillary
data allocated with the bio through the stack is not freed. Hence, we
can leak the integrity data associated with the bio, if the device is
using DIF/DIX.

Fix this by providing a bio_uninit() and export it, so that we can use
it to free this data. Note that this is a minimal fix for this issue.
Any current user of bio's that are allocated outside of
bio_alloc_bioset() suffers from this issue, most notably some drivers.
We will fix those in a more comprehensive patch for 4.13. This also
means that the commit marked as being fixed by this isn't the real
culprit, it's just the most obvious one out there.

Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
Reported-by: Wen Xiong
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Jens Axboe
2017-06-29 05:30:13 +0800
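
The leak pattern described in this commit can be sketched in userspace C. This is a model under stated assumptions: the struct, the 64-byte allocation and all model_* names are hypothetical. An object that lives on the stack never goes through the regular free path, so any ancillary data hanging off it needs an explicit "uninit" step.

```c
#include <assert.h>
#include <stdlib.h>

struct model_bio {
	void *integrity;	/* ancillary data, allocated separately */
};

static int model_live_allocs;	/* outstanding ancillary allocations */

static int model_integrity_alloc(struct model_bio *bio)
{
	bio->integrity = malloc(64);
	if (!bio->integrity)
		return -1;
	model_live_allocs++;
	return 0;
}

/* Release everything attached to the bio, but not the bio itself. */
static void model_bio_uninit(struct model_bio *bio)
{
	if (bio->integrity) {
		free(bio->integrity);
		bio->integrity = NULL;
		model_live_allocs--;
	}
}

/* A stack-allocated bio: nothing frees it, so uninit must be called. */
static int model_demo_stack_bio(void)
{
	struct model_bio bio = { NULL };

	if (model_integrity_alloc(&bio) != 0)
		return -1;
	/* ... the I/O would be issued and waited for here ... */
	model_bio_uninit(&bio);
	return model_live_allocs;
}
```

Dropping the model_bio_uninit() call in the demo leaves one allocation outstanding per call, which is the shape of the leak reported above.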

28 Jun, 2017

1 commit

cb6934f8e block: add support for write hints in a bio

No functional changes in this patch, we just use up some holes
in the bio and request structures to define a write hint that
we pass down the stack.

Ensure that we don't merge requests that have different life time
hints assigned to them, and that we inherit the write hint when
cloning a bio.

Reviewed-by: Martin K. Petersen
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Jens Axboe
2017-06-28 02:05:27 +0800
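
The two rules stated in this commit (no merging across different lifetime hints, and hint inheritance on clone) can be sketched as follows. The enum values, the struct layout and the model_* names are assumptions for illustration; the contiguity check is only a stand-in for the real merge conditions.

```c
#include <assert.h>
#include <stdbool.h>

enum model_write_hint {
	MODEL_HINT_NONE,
	MODEL_HINT_SHORT,
	MODEL_HINT_LONG,
};

struct model_req {
	enum model_write_hint hint;
	unsigned long sector;
	unsigned int sectors;
};

/* Back-merge check: b must follow a directly and carry the same hint. */
static bool model_can_merge(const struct model_req *a,
			    const struct model_req *b)
{
	if (a->hint != b->hint)
		return false;
	return a->sector + a->sectors == b->sector;
}

/* A clone inherits the write hint along with the range. */
static struct model_req model_clone(const struct model_req *src)
{
	struct model_req dst;

	dst.hint = src->hint;
	dst.sector = src->sector;
	dst.sectors = src->sectors;
	return dst;
}
```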

19 Jun, 2017

3 commits

9b10f6a9c block: remove bio_clone() and all references.

bio_clone() is no longer used.
Only bio_clone_bioset() or bio_clone_fast().
This is for the best, as bio_clone() used fs_bio_set,
and filesystems are unlikely to want to use bio_clone().

So remove bio_clone() and all references.
This includes a fix to some incorrect documentation.

Reviewed-by: Christoph Hellwig
Reviewed-by: Ming Lei
Signed-off-by: NeilBrown
Signed-off-by: Jens Axboe

NeilBrown
2017-06-19 02:40:59 +0800
47e0fb461 blk: make the bioset rescue_workqueue optional.

This patch converts bioset_create() to not create a workqueue by
default, so allocations will never trigger punt_bios_to_rescuer(). It
also introduces a new flag BIOSET_NEED_RESCUER which tells
bioset_create() to preserve the old behavior.

All callers of bioset_create() that are inside block device drivers,
are given the BIOSET_NEED_RESCUER flag.

biosets used by filesystems or other top-level users do not
need rescuing as the bio can never be queued behind other
bios. This includes fs_bio_set, blkdev_dio_pool,
btrfs_bioset, xfs_ioend_bioset, and one allocated by
target_core_iblock.c.

biosets used by md/raid do not need rescuing as
their usage was recently audited and revised to never
risk deadlock.

It is hoped that most, if not all, of the remaining biosets
can end up being the non-rescued version.

Reviewed-by: Christoph Hellwig
Credit-to: Ming Lei (minor fixes)
Reviewed-by: Ming Lei
Signed-off-by: NeilBrown
Signed-off-by: Jens Axboe

NeilBrown
2017-06-19 02:40:59 +0800
011067b05 blk: replace bioset_create_nobvec() with a flags arg to bioset_create()

"flags" arguments are often seen as good API design as they allow
easy extensibility.
bioset_create_nobvec() is implemented internally as a variation in
flags passed to __bioset_create().

To support future extension, make the internal structure part of the
API.
i.e. add a 'flags' argument to bioset_create() and discard
bioset_create_nobvec().

Note that the bio_split allocations in drivers/md/raid* do not need
the bvec mempool - they should have used bioset_create_nobvec().

Suggested-by: Christoph Hellwig
Reviewed-by: Christoph Hellwig
Reviewed-by: Ming Lei
Signed-off-by: NeilBrown
Signed-off-by: Jens Axboe

NeilBrown
2017-06-19 02:40:59 +0800
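
The API shape discussed in the two bioset commits above can be sketched as a single constructor taking a flags argument instead of one function per variant. This is a userspace model: the struct, the flag values and the model_* names are assumptions, and the signature is not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_BIOSET_NEED_BVECS		(1u << 0)
#define MODEL_BIOSET_NEED_RESCUER	(1u << 1)

struct model_bioset {
	unsigned int pool_size;
	bool has_bvec_pool;	/* would hold a bvec mempool */
	bool has_rescuer;	/* would hold a rescue workqueue */
};

/*
 * One entry point; optional parts are requested through flags, so a
 * new option needs a new flag bit rather than a new function.
 */
static struct model_bioset model_bioset_create(unsigned int pool_size,
					       unsigned int flags)
{
	struct model_bioset bs;

	bs.pool_size = pool_size;
	bs.has_bvec_pool = (flags & MODEL_BIOSET_NEED_BVECS) != 0;
	bs.has_rescuer = (flags & MODEL_BIOSET_NEED_RESCUER) != 0;
	return bs;
}
```

In this model a top-level user would pass only MODEL_BIOSET_NEED_BVECS, while a block driver that can queue bios behind other bios would also pass MODEL_BIOSET_NEED_RESCUER.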

16 Jun, 2017

1 commit

a462b9508 block: Dedicated error code fixups

This patch fixes two sparse warnings introduced by the "dedicated
error codes for the block layer V3" patch series. These changes
have not been tested.

Signed-off-by: Bart Van Assche
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Bart Van Assche
2017-06-16 23:47:15 +0800

09 Jun, 2017

1 commit

4e4cbee93 block: switch bios to blk_status_t

Replace bi_error with a new bi_status to allow for a clear conversion.
Note that device mapper overloaded bi_error with a private value, which
we'll have to keep around at least for now and thus propagate to a
proper blk_status_t value.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-06-09 23:27:32 +0800

02 May, 2017

1 commit

e265eb3a3 Merge branch 'md-next' into md-linus

Shaohua Li
2017-05-02 05:09:21 +0800

12 Apr, 2017

1 commit

50512625d Revert "block: introduce bio_copy_data_partial"

This reverts commit 6f8802852f7e58a12177a86179803b9efaad98e2.
bio_copy_data_partial() is no longer needed.

Signed-off-by: NeilBrown
Signed-off-by: Shaohua Li

NeilBrown
2017-04-12 01:09:03 +0800

07 Apr, 2017

1 commit

fbbaf700e block: trace completion of all bios.

Currently only dm and md/raid5 bios trigger
trace_block_bio_complete(). Now that we have bio_chain() and
bio_inc_remaining(), it is not possible, in general, for a driver to
know when the bio is really complete. Only bio_endio() knows that.

So move the trace_block_bio_complete() call to bio_endio().

Now trace_block_bio_complete() pairs with trace_block_bio_queue().
Any bio for which a 'queue' event is traced, will subsequently
generate a 'complete' event.

There are a few cases where completion tracing is not wanted.
1/ If blk_update_request() has already generated a completion
trace event at the 'request' level, there is no point generating
one at the bio level too. In this case the bi_sector and bi_size
will have changed, so the bio level event would be wrong

2/ If the bio hasn't actually been queued yet, but is being aborted
early, then a trace event could be confusing. Some filesystems
call bio_endio() but do not want tracing.

3/ The bio_integrity code interposes itself by replacing bi_end_io,
then restoring it and calling bio_endio() again. This would produce
two identical trace events if left like that.

To handle these, we introduce a flag BIO_TRACE_COMPLETION and only
produce the trace event when this is set.
We address point 1 above by clearing the flag in blk_update_request().
We address point 2 above by only setting the flag when
generic_make_request() is called.
We address point 3 above by clearing the flag after generating a
completion event.

When bio_split() is used on a bio, particularly in blk_queue_split(),
there is an extra complication. A new bio is split off the front, and
may be handled directly without going through generic_make_request().
The old bio, which has been advanced, is passed to
generic_make_request(), so it will trigger a trace event a second
time.
Probably the best result when a split happens is to see a single
'queue' event for the whole bio, then multiple 'complete' events - one
for each component. To achieve this we can:
- copy the BIO_TRACE_COMPLETION flag to the new bio in bio_split()
- avoid generating a 'queue' event if BIO_TRACE_COMPLETION is already set.
This way, the split-off bio won't create a queue event, the original
won't either even if it is re-submitted to generic_make_request(),
but both will produce completion events, each for their own range.

So if generic_make_request() is called (which generates a QUEUED
event), then bio_endio() will create a single COMPLETE event for each
range that the bio is split into, unless the driver has explicitly
requested it not to.

Signed-off-by: NeilBrown
Signed-off-by: Jens Axboe

NeilBrown
2017-04-07 23:40:52 +0800
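
The flag protocol described in this commit can be sketched as a small userspace model. The counters stand in for the trace events, and every model_* name is hypothetical. Submission emits a "queue" event only when the flag is not yet set, completion emits a "complete" event only when it is set and then clears it, and a split copies the flag to the new bio.

```c
#include <assert.h>

#define MODEL_BIO_TRACE_COMPLETION 1u

struct model_bio {
	unsigned int flags;
	unsigned int sectors;
};

static int model_queue_events;
static int model_complete_events;

/* Emit a queue event once per bio, and arm completion tracing. */
static void model_make_request(struct model_bio *bio)
{
	if (!(bio->flags & MODEL_BIO_TRACE_COMPLETION)) {
		model_queue_events++;
		bio->flags |= MODEL_BIO_TRACE_COMPLETION;
	}
}

/* Split `sectors` off the front; the new bio inherits the flag. */
static struct model_bio model_split(struct model_bio *bio, unsigned int sectors)
{
	struct model_bio front;

	front.flags = bio->flags & MODEL_BIO_TRACE_COMPLETION;
	front.sectors = sectors;
	bio->sectors -= sectors;
	return front;
}

/* Emit at most one completion event per armed bio. */
static void model_endio(struct model_bio *bio)
{
	if (bio->flags & MODEL_BIO_TRACE_COMPLETION) {
		model_complete_events++;
		bio->flags &= ~MODEL_BIO_TRACE_COMPLETION;
	}
}

static void model_demo(void)
{
	struct model_bio bio = { 0, 16 };
	struct model_bio front;

	model_queue_events = 0;
	model_complete_events = 0;
	model_make_request(&bio);	/* one queue event for the whole bio */
	front = model_split(&bio, 8);
	model_make_request(&bio);	/* remainder resubmitted: no new event */
	model_endio(&front);		/* complete event for the front part */
	model_endio(&bio);		/* complete event for the remainder */
	model_endio(&bio);		/* second call: flag cleared, no event */
}
```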

28 Mar, 2017

1 commit

9e234eeaf blk-throttle: add a simple idle detection

A cgroup gets assigned a low limit, but the cgroup could never dispatch
enough IO to cross the low limit. In such case, the queue state machine
will remain in LIMIT_LOW state and all other cgroups will be throttled
according to low limit. This is unfair for other cgroups. We should
treat the cgroup idle and upgrade the state machine to lower state.

We also have a downgrade logic. If the state machine upgrades because of
cgroup idle (real idle), the state machine will downgrade soon as the
cgroup is below its low limit. This isn't what we want. A more
complicated case is cgroup isn't idle when queue is in LIMIT_LOW. But
when queue gets upgraded to lower state, other cgroups could dispatch
more IO and this cgroup can't dispatch enough IO, so the cgroup is below
its low limit and looks like idle (fake idle). In this case, the queue
should downgrade soon. The key to determine if we should do downgrade is
to detect if cgroup is truly idle.

Unfortunately it's very hard to determine if a cgroup is really idle. This
patch uses the 'think time check' idea from CFQ for the purpose. Please
note, the idea doesn't work for all workloads. For example, a workload
with io depth 8 has disk utilization 100%, hence think time is 0, i.e.,
not idle. But the workload can run higher bandwidth with io depth 16.
Compared to io depth 16, the io depth 8 workload is idle. We use the
idea to roughly determine if a cgroup is idle.

We treat a cgroup idle if its think time is above a threshold (by
default 1ms for SSD and 100ms for HD). The idea is think time above the
threshold will start to harm performance. HD is much slower so a longer
think time is ok.

The patch (and the latter patches) uses 'unsigned long' to track time.
We convert 'ns' to 'us' with 'ns >> 10'. This is fast but loses
precision, which should not be a big deal.

Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe

Shaohua Li
2017-03-28 22:02:20 +0800
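
The time conversion and threshold check mentioned in this commit can be sketched as follows. The thresholds follow the defaults stated in the message (1ms for SSD, 100ms for HD); the function and macro names are hypothetical. Shifting right by 10 divides by 1024 rather than 1000, which is where the precision loss comes from.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_IDLE_THRESHOLD_SSD_US	1000ul
#define MODEL_IDLE_THRESHOLD_HD_US	100000ul

/* Cheap ns -> "us" conversion: divides by 1024 instead of 1000. */
static unsigned long model_ns_to_us_approx(uint64_t ns)
{
	return (unsigned long)(ns >> 10);
}

/* A cgroup is treated as idle when its think time exceeds the threshold. */
static bool model_is_idle(uint64_t think_time_ns, unsigned long threshold_us)
{
	return model_ns_to_us_approx(think_time_ns) > threshold_us;
}
```

With this approximation, exactly 1,000,000 ns converts to 976 rather than 1000, so a think time has to be roughly 2.5% longer than the nominal threshold before it counts as idle.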

26 Mar, 2017

1 commit

f45958756 block: remove bio_clone_bioset_partial()

commit c18a1e0(block: introduce bio_clone_bioset_partial()) introduced
bio_clone_bioset_partial() for raid1 write behind IO. Now the write behind is
rewritten by Ming. We don't need the API any more, so revert the commit.

Cc: Christoph Hellwig
Reviewed-by: Jens Axboe
Reviewed-by: Ming Lei
Signed-off-by: Shaohua Li

Shaohua Li
2017-03-26 00:18:37 +0800

25 Mar, 2017

1 commit

6f8802852 block: introduce bio_copy_data_partial

Turns out we can use bio_copy_data in raid1's write behind,
and we can make alloc_behind_pages() more clean/efficient,
but we need a partial version of bio_copy_data().

Signed-off-by: Ming Lei
Reviewed-by: Jens Axboe
Signed-off-by: Shaohua Li

Ming Lei
2017-03-25 01:41:37 +0800

23 Mar, 2017

1 commit

7a88fa191 block: make nr_iovecs unsigned in bio_alloc_bioset()

There isn't a bug here, but Smatch is not smart enough to know that
"nr_iovecs" can't be negative so it complains about underflows.
Really, it's slightly cleaner to make this parameter unsigned.

Signed-off-by: Dan Carpenter
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Dan Carpenter
2017-03-23 22:16:11 +0800

12 Mar, 2017

1 commit

f5fe1b519 blk: Ensure users for current->bio_list can see the full list.

Commit 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
changed current->bio_list so that it did not contain *all* of the
queued bios, but only those submitted by the currently running
make_request_fn.

There are two places which walk the list and requeue selected bios,
and others that check if the list is empty. These are no longer
correct.

So redefine current->bio_list to point to an array of two lists, which
contain all queued bios, and adjust various code to test or walk both
lists.

Signed-off-by: NeilBrown
Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
Signed-off-by: Jens Axboe

NeilBrown
2017-03-12 06:31:37 +0800

25 Feb, 2017

1 commit

a682e0035 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md

Pull md updates from Shaohua Li:
"Mainly fixes bugs and improves performance:

- Improve scalability for raid1 from Coly

- Improve raid5-cache read performance, disk efficiency and IO
pattern from Song and me

- Fix a race condition of disk hotplug for linear from Coly

- A few cleanup patches from Ming and Byungchul

- Fix a memory leak from Neil

- Fix WRITE SAME IO failure from me

- Add doc for raid5-cache from me"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md: (23 commits)
md/raid1: fix write behind issues introduced by bio_clone_bioset_partial
md/raid1: handle flush request correctly
md/linear: shutup lockdep warnning
md/raid1: fix a use-after-free bug
RAID1: avoid unnecessary spin locks in I/O barrier code
RAID1: a new I/O barrier implementation to remove resync window
md/raid5: Don't reinvent the wheel but use existing llist API
md: fast clone bio in bio_clone_mddev()
md: remove unnecessary check on mddev
md/raid1: use bio_clone_bioset_partial() in case of write behind
md: fail if mddev->bio_set can't be created
block: introduce bio_clone_bioset_partial()
md: disable WRITE SAME if it fails in underlayer disks
md/raid5-cache: exclude reclaiming stripes in reclaim check
md/raid5-cache: stripe reclaim only counts valid stripes
MD: add doc for raid5-cache
Documentation: move MD related doc into a separate dir
md: ensure md devices are freed before module is unloaded.
md/r5cache: improve journal device efficiency
md/r5cache: enable chunk_aligned_read with write back cache
...

Linus Torvalds
2017-02-25 06:42:19 +0800

18 Feb, 2017

1 commit

818551e2b Merge branch 'for-4.11/next' into for-4.11/linus-merge

Signed-off-by: Jens Axboe

Jens Axboe
2017-02-18 05:08:19 +0800

16 Feb, 2017

1 commit

c18a1e090 block: introduce bio_clone_bioset_partial()

md still needs bio clone (not the fast version) for write behind,
and it is more efficient to use bio_clone_bioset_partial().

The idea is simple: just copy the bvec range specified by the
parameters.

Reviewed-by: Christoph Hellwig
Reviewed-by: Jens Axboe
Signed-off-by: Ming Lei
Signed-off-by: Shaohua Li

Ming Lei
2017-02-16 03:22:05 +0800

02 Feb, 2017

1 commit

5fad1b64a block: Update comments that refer to __bio_map_user() and bio_map_user()

Since __bio_map_user() and bio_map_user() have been removed, update
the comments that still refer to these functions.

Signed-off-by: Bart Van Assche
References: commit ddad8dd0a162 ("block: use blk_rq_map_user_iov to implement blk_rq_map_user")
Cc: Ming Lei
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Bart Van Assche
2017-02-02 03:32:29 +0800

01 Feb, 2017

1 commit

aebf526b5 block: fold cmd_type into the REQ_OP_ space

Instead of keeping two levels of indirection for requests types, fold it
all into the operations. The little caveat here is that previously
cmd_type only applied to struct request, while the request and bio op
fields were set to plain REQ_OP_READ/WRITE even for passthrough
operations.

Instead this patch adds new REQ_OP_* for SCSI passthrough and driver
private requests, although it has to add two for each so that we
can communicate the data in/out nature of the request.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2017-02-01 05:00:44 +0800
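
The "two operations per passthrough type" idea in this commit can be sketched as an enum in which each passthrough kind has an IN and an OUT variant. The numeric values and all MODEL_* and model_* names below are assumptions for illustration, including the choice to encode the data direction in the low bit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical operation space; odd values move data to the device. */
enum model_req_op {
	MODEL_OP_READ     = 0,
	MODEL_OP_WRITE    = 1,
	MODEL_OP_SCSI_IN  = 32,
	MODEL_OP_SCSI_OUT = 33,
	MODEL_OP_DRV_IN   = 34,
	MODEL_OP_DRV_OUT  = 35,
};

/* Direction is recoverable from the operation alone. */
static bool model_op_is_write(enum model_req_op op)
{
	return (op & 1) != 0;
}

/* Passthrough requests no longer need a separate command type field. */
static bool model_op_is_passthrough(enum model_req_op op)
{
	return op >= MODEL_OP_SCSI_IN && op <= MODEL_OP_DRV_OUT;
}
```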