23 Dec, 2015

1 commit

  • Timer context is not very useful for drivers to perform any meaningful abort
    action from. So instead of calling the driver from this useless context,
    defer it to a workqueue as soon as possible.

    Note that while a delayed_work item would seem the right thing here I didn't
    dare to use it due to the magic in blk_add_timer that pokes deep into timer
    internals. But maybe this encourages Tejun to add a sensible API for that to
    the workqueue API and we'll all be fine in the end :)

    Contains a major update from Keith Busch:

    "This patch removes synchronizing the timeout work so that the timer can
    start a freeze on its own queue. The timer enters the queue, so timer
    context can only start a freeze, but not wait for frozen."

    Signed-off-by: Christoph Hellwig
    Acked-by: Keith Busch
    Signed-off-by: Jens Axboe

    Christoph Hellwig
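
    A minimal sketch of the deferral pattern described above, reconstructed
    from the commit text rather than from the actual diff (the real change
    lives in block/blk-timeout.c and punts to the block layer's internal
    kblockd workqueue; the names used here are illustrative):

    #include <linux/blkdev.h>
    #include <linux/workqueue.h>

    /* timer callback: far too restricted a context to call into the driver */
    static void example_rq_timed_out_timer(unsigned long data)
    {
            struct request_queue *q = (struct request_queue *)data;

            /* punt to process context as soon as possible */
            schedule_work(&q->timeout_work);
    }

    /*
     * Work item: runs in process context, so it can freeze the queue and
     * invoke the driver's ->timeout() handler safely.
     */
    static void example_timeout_work(struct work_struct *work)
    {
            struct request_queue *q =
                    container_of(work, struct request_queue, timeout_work);

            /* scan the queue's timeout list for expired requests and abort them */
    }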
     

11 Nov, 2015

1 commit

  • Pull block IO poll support from Jens Axboe:
    "Various groups have been doing experimentation around IO polling for
    (really) fast devices. The code has been reviewed and has been
    sitting on the side for a few releases, but this is now good enough
    for coordinated benchmarking and further experimentation.

    Currently O_DIRECT sync read/write are supported. A framework is in
    the works that allows scalable stats tracking so we can auto-tune
    this. We'll also add libaio support soon. For now, it's an
    opt-in feature for test purposes"

    * 'for-4.4/io-poll' of git://git.kernel.dk/linux-block:
    direct-io: be sure to assign dio->bio_bdev for both paths
    directio: add block polling support
    NVMe: add blk polling support
    block: add block polling support
    blk-mq: return tag/queue combo in the make_request_fn handlers
    block: change ->make_request_fn() and users to return a queue cookie

    Linus Torvalds
     

08 Nov, 2015

2 commits

  • Add basic support for polling for specific IO to complete. This uses
    the cookie that blk-mq passes back, which enables the block layer
    to pass this cookie to the driver to spin for a specific request.

    This will be combined with request latency tracking, so we can make
    qualified decisions about when to poll and when not to. For now, for
    benchmark purposes, we add a sysfs file that controls whether polling
    is enabled or not.

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
     
  • No functional changes in this patch, but it prepares us for returning
    a more useful cookie related to the IO that was queued up.

    Signed-off-by: Jens Axboe
    Acked-by: Christoph Hellwig
    Acked-by: Keith Busch

    Jens Axboe
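
    A sketch of how the cookie from the two commits above (polling support
    plus the cookie groundwork) is consumed, loosely following the O_DIRECT
    usage the pull request describes; the completion flag and error handling
    are simplified and the helper names are made up:

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    static void example_end_io(struct bio *bio)
    {
            atomic_t *done = bio->bi_private;

            atomic_set(done, 1);
    }

    /* submit a bio and spin for its completion on a poll-capable queue */
    static void example_submit_and_poll(struct block_device *bdev, struct bio *bio)
    {
            struct request_queue *q = bdev_get_queue(bdev);
            atomic_t done = ATOMIC_INIT(0);
            blk_qc_t cookie;

            bio->bi_private = &done;
            bio->bi_end_io = example_end_io;

            /* submit_bio()/generic_make_request() now hand back a cookie
             * identifying the hardware queue and tag the bio landed on */
            cookie = submit_bio(READ, bio);

            /* spin for this specific request; blk_poll() returns false if
             * the queue does not support polling or it is disabled */
            while (!atomic_read(&done))
                    if (!blk_poll(q, cookie))
                            cpu_relax();
    }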
     

07 Nov, 2015

2 commits

    Historically, clearing __GFP_WAIT was how a caller signalled that it was
    in atomic context and could not sleep. Now it is possible to distinguish
    between true atomic context and callers that are merely unwilling to
    sleep. The latter should clear __GFP_DIRECT_RECLAIM so kswapd will still
    wake. As clearing __GFP_WAIT behaves differently, there is a risk that
    people will clear the wrong flags. This patch renames __GFP_WAIT to
    __GFP_RECLAIM to clearly indicate what it does -- setting it allows all
    reclaim activity, clearing it prevents it.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min" which can be referred
    to as the "atomic reserve". __GFP_HIGH users get access to the first
    lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites:

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT, as was done historically, can now trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
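
    The practical rules from the two commits above, as a sketch for callers
    (the allocation-site logic is illustrative; the real flag definitions
    live in include/linux/gfp.h):

    #include <linux/gfp.h>
    #include <linux/slab.h>

    /*
     * GFP_KERNEL includes __GFP_RECLAIM, i.e. both __GFP_DIRECT_RECLAIM and
     * __GFP_KSWAPD_RECLAIM.  GFP_NOWAIT keeps only the kswapd bit, so the
     * caller never sleeps but background reclaim is still kicked.
     * GFP_ATOMIC additionally sets __GFP_ATOMIC for access to the atomic
     * reserves and is meant for callers with no fallback at all.
     */
    static void *example_alloc(size_t size, gfp_t gfp_mask)
    {
            /* replaces the historical "gfp_mask & __GFP_WAIT" test */
            if (!gfpflags_allow_blocking(gfp_mask))
                    return kmalloc(size, GFP_NOWAIT);

            return kmalloc(size, GFP_KERNEL);
    }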
     

05 Nov, 2015

2 commits

  • Pull block integrity updates from Jens Axboe:
    ""This is the joint work of Dan and Martin, cleaning up and improving
    the support for block data integrity"

    * 'for-4.4/integrity' of git://git.kernel.dk/linux-block:
    block, libnvdimm, nvme: provide a built-in blk_integrity nop profile
    block: blk_flush_integrity() for bio-based drivers
    block: move blk_integrity to request_queue
    block: generic request_queue reference counting
    nvme: suspend i/o during runtime blk_integrity_unregister
    md: suspend i/o during runtime blk_integrity_unregister
    md, dm, scsi, nvme, libnvdimm: drop blk_integrity_unregister() at shutdown
    block: Inline blk_integrity in struct gendisk
    block: Export integrity data interval size in sysfs
    block: Reduce the size of struct blk_integrity
    block: Consolidate static integrity profile properties
    block: Move integrity kobject to struct gendisk

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:
    "This is the core block pull request for 4.4. I've got a few more
    topic branches this time around, some of them will layer on top of the
    core+drivers changes and will come in a separate round. So not a huge
    chunk of changes in this round.

    This pull request contains:

    - Enable blk-mq page allocation tracking with kmemleak, from Catalin.

    - Unused prototype removal in blk-mq from Christoph.

    - Cleanup of the q->blk_trace exchange, using cmpxchg instead of two
    xchg()'s, from Davidlohr.

    - A plug flush fix from Jeff.

    - Also from Jeff, a fix that means we don't have to update shared tag
    sets at init time unless we do a state change. This cuts down boot
    times significantly on systems with thousands of devices using scsi/blk-mq.

    - blk-mq waitqueue barrier fix from Kosuke.

    - Various fixes from Ming:

        - Fixes for segment merging and splitting, and checks, for
          the old core and blk-mq.

        - Potential blk-mq speedup by marking ctx pending at the end
          of a plug insertion batch in blk-mq.

        - direct-io: don't dirty pages on kernel direct reads.

    - A WRITE_SYNC fix for mpage from Roman"

    * 'for-4.4/core' of git://git.kernel.dk/linux-block:
    blk-mq: avoid excessive boot delays with large lun counts
    blktrace: re-write setting q->blk_trace
    blk-mq: mark ctx as pending at batch in flush plug path
    blk-mq: fix for trace_block_plug()
    block: check bio_mergeable() early before merging
    blk-mq: check bio_mergeable() early before merging
    block: avoid to merge splitted bio
    block: setup bi_phys_segments after splitting
    block: fix plug list flushing for nomerge queues
    blk-mq: remove unused blk_mq_clone_flush_request prototype
    blk-mq: fix waitqueue_active without memory barrier in block/blk-mq-tag.c
    fs: direct-io: don't dirtying pages for ITER_BVEC/ITER_KVEC direct read
    fs/mpage.c: forgotten WRITE_SYNC in case of data integrity write
    block: kmemleak: Track the page allocations for struct request

    Linus Torvalds
     

22 Oct, 2015

3 commits

  • Request queues with merging disabled will not flush the plug list after
    BLK_MAX_REQUEST_COUNT requests have been queued, since the code relies
    on blk_attempt_plug_merge to compute the request_count. Fix this by
    computing the number of queued requests even for nomerge queues.

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
     
  • Since they lack requests to pin the request_queue active, synchronous
    bio-based drivers may have in-flight integrity work from
    bio_integrity_endio() that is not flushed by blk_freeze_queue(). Flush
    that work to prevent races to free the queue and the final usage of the
    blk_integrity profile.

    This is temporary unless/until bio-based drivers start to generically
    take a q_usage_counter reference while a bio is in-flight.

    Cc: Martin K. Petersen
    [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case]
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • Allow pmem, and other synchronous/bio-based block drivers, to fallback
    on a per-cpu reference count managed by the core for tracking queue
    live/dead state.

    The existing per-cpu reference count for the blk_mq case is promoted to
    be used in all block i/o scenarios. This involves initializing it by
    default, waiting for it to drop to zero at exit, and holding a live
    reference over the invocation of q->make_request_fn() in
    generic_make_request(). The blk_mq code continues to take its own
    reference per blk_mq request and retains the ability to freeze the
    queue, but the check that the queue is frozen is moved to
    generic_make_request().

    This fixes crash signatures like the following:

    BUG: unable to handle kernel paging request at ffff880140000000
    [..]
    Call Trace:
    [] ? copy_user_handle_tail+0x5f/0x70
    [] pmem_do_bvec.isra.11+0x70/0xf0 [nd_pmem]
    [] pmem_make_request+0xd1/0x200 [nd_pmem]
    [] ? mempool_alloc+0x72/0x1a0
    [] generic_make_request+0xd6/0x110
    [] submit_bio+0x76/0x170
    [] submit_bh_wbc+0x12f/0x160
    [] submit_bh+0x12/0x20
    [] jbd2_write_superblock+0x8d/0x170
    [] jbd2_mark_journal_empty+0x5d/0x90
    [] jbd2_journal_destroy+0x24b/0x270
    [] ? put_pwq_unlocked+0x2a/0x30
    [] ? destroy_workqueue+0x225/0x250
    [] ext4_put_super+0x64/0x360
    [] generic_shutdown_super+0x6a/0xf0

    Cc: Jens Axboe
    Cc: Keith Busch
    Cc: Ross Zwisler
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
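
    A sketch of the reference now held across ->make_request_fn(),
    reconstructed from the description above (in the real patch this is
    factored into small helpers in block/blk-core.c, and the failure path
    also distinguishes a dying queue and may wait for a freeze to end):

    #include <linux/blkdev.h>
    #include <linux/percpu-refcount.h>

    static void example_issue(struct request_queue *q, struct bio *bio)
    {
            /*
             * Pin the queue with the per-cpu usage counter before calling
             * into the driver; the tryget fails once blk_cleanup_queue()
             * has killed the ref, so a dead queue never sees new I/O.
             */
            if (!percpu_ref_tryget_live(&q->q_usage_counter)) {
                    bio->bi_error = -ENODEV;
                    bio_endio(bio);
                    return;
            }

            q->make_request_fn(q, bio);

            /* the final put is what lets a pending queue freeze complete */
            percpu_ref_put(&q->q_usage_counter);
    }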
     

15 Oct, 2015

1 commit

  • bdi's are initialized in two steps, bdi_init() and bdi_register(), but
    destroyed in a single step by bdi_destroy() which, for a bdi embedded
    in a request_queue, is called during blk_cleanup_queue() which makes
    the queue invisible and starts the draining of remaining usages.

    A request_queue's user can access the congestion state of the embedded
    bdi as long as it holds a reference to the queue. As such, it may
    access the congested state of a queue which finished
    blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
    Because the congested state was embedded in backing_dev_info which in
    turn is embedded in request_queue, accessing the congested state after
    bdi_destroy() was called was fine. The bdi was destroyed but the
    memory region for the congested state remained accessible till the
    queue got released.

    a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
    bdi_writeback") changed the situation. Now, the root congested state
    which is expected to be pinned while request_queue remains accessible
    is separately reference counted and the base ref is put during
    bdi_destroy(). This means that the root congested state may go away
    prematurely while the queue is between bdi_destroy() and
    blk_cleanup_queue(), which was detected by Andrey's KASAN tests.

    The root cause of this problem is that bdi doesn't distinguish the two
    steps of destruction, unregistration and release, and now the root
    congested state actually requires a separate release step. To fix the
    issue, this patch separates out bdi_unregister() and bdi_exit() from
    bdi_destroy(). bdi_unregister() is called from blk_cleanup_queue()
    and bdi_exit() from blk_release_queue(). bdi_destroy() is now just a
    simple wrapper calling the two steps back-to-back.

    While at it, the prototype of bdi_destroy() is moved right below
    bdi_setup_and_register() so that the counterpart operations are
    located together.

    Signed-off-by: Tejun Heo
    Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
    Cc: stable@vger.kernel.org # v4.2+
    Reported-and-tested-by: Andrey Konovalov
    Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
    Reviewed-by: Jan Kara
    Reviewed-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Tejun Heo
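
    After the split, bdi_destroy() is nothing more than the two steps run
    back to back; a sketch reconstructed from the description above:

    #include <linux/backing-dev.h>

    void bdi_destroy(struct backing_dev_info *bdi)
    {
            /*
             * Step 1: make the bdi invisible (sysfs unregistration, stop
             * writeback).  For a queue-embedded bdi this is what
             * blk_cleanup_queue() now calls directly.
             */
            bdi_unregister(bdi);

            /*
             * Step 2: release the remaining state, including the base ref
             * on the root wb_congested.  blk_release_queue() calls this
             * once the last reference to the request_queue is dropped.
             */
            bdi_exit(bdi);
    }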
     

11 Sep, 2015

1 commit

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a a
    while. This is basically four patchsets from Tejun, that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is a set of 5 patches with cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - Patches are added to consolidate the blkcg entry point and
    blkg creation. This in itself is an improvement and helps
    collect common stats on bio issue.

    - per-blkg stats now accounted on bio issue rather than request
    completion so that bio based and request based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset makes blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     

19 Aug, 2015

1 commit

  • blkg (blkcg_gq) currently is created by blkcg policies invoking
    blkg_lookup_create() which ends up repeating about the same code in
    different policies. Theoretically, this can avoid the overhead of
    looking and/or creating blkg's if blkcg is enabled but no policy is in
    use; however, the cost of blkg lookup / creation is very low,
    especially if only the root blkcg is in use, which is highly likely if
    no blkcg policy is in active use - it boils down to a single very
    predictable conditional and surrounding RCU protection.

    This patch consolidates blkg creation to a new function
    blkcg_bio_issue_check() which is called during bio issue from
    generic_make_request_checks(). blkcg_bio_issue_check() is now the
    only function which tries to create missing blkg's. The subsequent
    policy and request_list operations just perform blkg_lookup() and, if
    the blkg is missing, fall back to the root.

    * blk_get_rl() no longer tries to create blkg. It uses blkg_lookup()
    instead of blkg_lookup_create().

    * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
    read locked and blkg already looked up. Both throtl_lookup_tg() and
    throtl_lookup_create_tg() are dropped.

    * cfq is similarly updated. cfq_lookup_create_cfqg() is replaced with
    cfq_lookup_cfqg() which uses blkg_lookup().

    This consolidates blkg handling and avoids unnecessary blkg creation
    retries under memory pressure. In addition, this provides a common
    bio entry point into blkcg where things like common accounting can be
    performed.

    v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
    !CONFIG_BLK_DEV_THROTTLING.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
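
    The new entry point boils down to a single call in the bio submission
    path; a sketch of the call site (simplified from
    generic_make_request_checks()):

    #include <linux/blk-cgroup.h>

    static bool example_submission_checks(struct request_queue *q, struct bio *bio)
    {
            /*
             * Single blkcg choke point on bio issue: under rcu_read_lock()
             * it looks up (creating if missing) the blkg for the bio's
             * blkcg, runs blk-throttle, and does the common per-blkg
             * accounting.  A false return means the bio was consumed
             * (throttled) and submission stops here.
             */
            if (!blkcg_bio_issue_check(q, bio))
                    return false;

            /* remaining checks (limits, partition remap, ...) go here */
            return true;
    }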
     

14 Aug, 2015

1 commit

  • The way the block layer is currently written, it goes to great lengths
    to avoid having to split bios; upper layer code (such as bio_add_page())
    checks what the underlying device can handle and tries to always create
    bios that don't need to be split.

    But this approach becomes unwieldy and eventually breaks down with
    stacked devices and devices with dynamic limits, and it adds a lot of
    complexity. If the block layer could split bios as needed, we could
    eliminate a lot of complexity elsewhere - particularly in stacked
    drivers. Code that creates bios can then create whatever size bios are
    convenient, and more importantly stacked drivers don't have to deal with
    both their own bio size limitations and the limitations of the
    (potentially multiple) devices underneath them. In the future this will
    let us delete merge_bvec_fn and a bunch of other code.

    We do this by adding calls to blk_queue_split() to the various
    make_request functions that need it - a few can already handle arbitrary
    size bios. Note that we add the call _after_ any call to
    blk_queue_bounce(); this means that blk_queue_split() and
    blk_recalc_rq_segments() don't need to be concerned with bouncing
    affecting segment merging.

    Some make_request_fn() callbacks were simple enough to audit and verify
    they don't need blk_queue_split() calls. The skipped ones are:

    * nfhd_make_request (arch/m68k/emu/nfblock.c)
    * axon_ram_make_request (arch/powerpc/sysdev/axonram.c)
    * simdisk_make_request (arch/xtensa/platforms/iss/simdisk.c)
    * brd_make_request (ramdisk - drivers/block/brd.c)
    * mtip_submit_request (drivers/block/mtip32xx/mtip32xx.c)
    * loop_make_request
    * null_queue_bio
    * bcache's make_request fns

    Some others are almost certainly safe to remove now, but will be left
    for future patches.

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Ming Lei
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Lars Ellenberg
    Cc: drbd-user@lists.linbit.com
    Cc: Jiri Kosina
    Cc: Geoff Levand
    Cc: Jim Paris
    Cc: Philip Kelleher
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Oleg Drokin
    Cc: Andreas Dilger
    Acked-by: NeilBrown (for the 'md/md.c' bits)
    Acked-by: Mike Snitzer
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Kent Overstreet
    [dpark: skip more mq-based drivers, resolve merge conflicts, etc.]
    Signed-off-by: Dongsu Park
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Kent Overstreet
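
    For a driver, the conversion is essentially one call at the top of its
    make_request function; a sketch (q->bio_split is the per-queue bio_set
    used for the split fragments):

    #include <linux/blkdev.h>

    static void example_make_request(struct request_queue *q, struct bio *bio)
    {
            /*
             * Let the core split the bio to fit this queue's limits; *bio
             * is replaced by the first fragment and the remainder is
             * resubmitted through generic_make_request().  Keep this after
             * any blk_queue_bounce() call, per the note above.
             */
            blk_queue_split(q, &bio, q->bio_split);

            /* handle the now correctly sized bio */
    }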
     

29 Jul, 2015

2 commits

    Some places use helpers now, others don't. We only have the 'is set'
    helper; add helpers for setting and clearing flags too.

    It was a bit of a mess of atomic vs non-atomic access. With
    BIO_UPTODATE gone, we don't have any risk of concurrent access to the
    flags. So relax the restriction and don't make any of them atomic. The
    flags that do have serialization issues (reffed and chained), we
    already handle those separately.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently we have two different ways to signal an I/O error on a BIO:

    (1) by clearing the BIO_UPTODATE flag
    (2) by returning a Linux errno value to the bi_end_io callback

    The first one has the drawback of only communicating a single possible
    error (-EIO), and the second one has the drawback of not being persistent
    when bios are queued up, and of not being passed along from child to parent
    bio in the ever more popular chaining scenario. Having both mechanisms
    available has the additional drawback of utterly confusing driver authors
    and introducing bugs where various I/O submitters only deal with one of
    them, and the others have to add boilerplate code to deal with both kinds
    of error returns.

    So add a new bi_error field to store an errno value directly in struct
    bio and remove the existing mechanisms to clean all this up.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
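
    A sketch of what the two commits above mean at a driver's completion
    site (the function names are illustrative):

    #include <linux/bio.h>

    static void example_complete(struct bio *bio, int error)
    {
            /*
             * The error now lives in the bio itself (0 on success, negative
             * errno on failure) instead of being signalled by clearing
             * BIO_UPTODATE, so it survives queueing and bio chaining.
             */
            bio->bi_error = error;
            bio_endio(bio);
    }

    static void example_flags(struct bio *bio)
    {
            /* the new non-atomic helpers replace open-coded bit fiddling */
            bio_set_flag(bio, BIO_QUIET);
            if (bio_flagged(bio, BIO_QUIET))
                    bio_clear_flag(bio, BIO_QUIET);
    }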
     

27 Jun, 2015

1 commit

  • Pull device mapper fixes from Mike Snitzer:
    "Apologies for not pressing this request-based DM partial completion
    issue further, it was an oversight on my part. We'll have to get it
    fixed up properly and revisit for a future release.

    - Revert block and DM core changes that removed request-based DM's
    ability to handle partial request completions -- otherwise with the
    current SCSI LLDs these changes could lead to silent data
    corruption.

    - Fix two DM version bumps that were missing from the initial 4.2 DM
    pull request (enabled userspace lvm2 to know certain changes have
    been made)"

    * tag 'dm-4.2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
    dm cache policy smq: fix "default" version to be 1.4.0
    dm: bump the ioctl version to 4.32.0
    Revert "block, dm: don't copy bios for request clones"
    Revert "dm: do not allocate any mempools for blk-mq request-based DM"

    Linus Torvalds
     

26 Jun, 2015

3 commits

  • This reverts commit 5f1b670d0bef508a5554d92525f5f6d00d640b38.

    Justification for revert as reported in this dm-devel post:
    https://www.redhat.com/archives/dm-devel/2015-June/msg00160.html

    this change should not be pushed to mainline yet.

    Firstly, Christoph has a newer version of the patch that fixes silent
    data corruption problem:
    https://www.redhat.com/archives/dm-devel/2015-May/msg00229.html

    And the new version still depends on LLDDs to always complete requests
    to the end when an error happens, while the block API doesn't enforce such
    a requirement. If the assumption is ever broken, the inconsistency between
    request and bio (e.g. rq->__sector and rq->bio) will cause silent data
    corruption:
    https://www.redhat.com/archives/dm-devel/2015-June/msg00022.html

    Reported-by: Junichi Nomura
    Signed-off-by: Mike Snitzer

    Mike Snitzer
     
  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last week's writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     
  • Pull core block IO update from Jens Axboe:
    "Nothing really major in here, mostly a collection of smaller
    optimizations and cleanups, mixed with various fixes. In more detail,
    this contains:

    - Addition of policy specific data to blkcg for block cgroups. From
    Arianna Avanzini.

    - Various cleanups around command types from Christoph.

    - Cleanup of the suspend block I/O path from Christoph.

    - Plugging updates from Shaohua and Jeff Moyer, for blk-mq.

    - Eliminating atomic inc/dec of both remaining IO count and reference
    count in a bio. From me.

    - Fixes for SG gap and chunk size support for data-less (discards)
    IO, so we can merge these better. From me.

    - Small restructuring of blk-mq shared tag support, freeing drivers
    from iterating hardware queues. From Keith Busch.

    - A few cfq-iosched tweaks, from Tahsin Erdogan and me. Makes the
    IOPS mode the default for non-rotational storage"

    * 'for-4.2/core' of git://git.kernel.dk/linux-block: (35 commits)
    cfq-iosched: fix other locations where blkcg_to_cfqgd() can return NULL
    cfq-iosched: fix sysfs oops when attempting to read unconfigured weights
    cfq-iosched: move group scheduling functions under ifdef
    cfq-iosched: fix the setting of IOPS mode on SSDs
    blktrace: Add blktrace.c to BLOCK LAYER in MAINTAINERS file
    block, cgroup: implement policy-specific per-blkcg data
    block: Make CFQ default to IOPS mode on SSDs
    block: add blk_set_queue_dying() to blkdev.h
    blk-mq: Shared tag enhancements
    block: don't honor chunk sizes for data-less IO
    block: only honor SG gap prevention for merges that contain data
    block: fix returnvar.cocci warnings
    block, dm: don't copy bios for request clones
    block: remove management of bi_remaining when restoring original bi_end_io
    block: replace trylock with mutex_lock in blkdev_reread_part()
    block: export blkdev_reread_part() and __blkdev_reread_part()
    suspend: simplify block I/O handling
    block: collapse bio bit space
    block: remove unused BIO_RW_BLOCK and BIO_EOF flags
    block: remove BIO_EOPNOTSUPP
    ...

    Linus Torvalds
     

02 Jun, 2015

5 commits

  • Now that bdi layer can handle per-blkcg bdi_writeback_congested state,
    blk_{set|clear}_congested() can propagate non-root blkcg congestion
    state to them.

    This can be easily achieved by disabling the root_rl tests in
    blk_{set|clear}_congested(). Note that we still need those tests when
    !CONFIG_CGROUP_WRITEBACK as otherwise we'll end up flipping root blkcg
    wb's congestion state for events happening on other blkcgs.

    v2: Updated for bdi_writeback_congested.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_{set|clear}_queue_congested() take @q and set or clear,
    respectively, the congestion state of its bdi's root wb. Because bdi
    used to be able to handle congestion state only on the root wb, the
    callers of those functions tested whether the congestion is on the
    root blkcg and skipped if not.

    This is cumbersome and makes implementation of per cgroup
    bdi_writeback congestion state propagation difficult. This patch
    renames blk_{set|clear}_queue_congested() to
    blk_{set|clear}_congested(), and makes them take request_list instead
    of request_queue and test whether the specified request_list is the
    root one before updating bdi_writeback congestion state. This makes
    the tests in the callers unnecessary and simplifies them.

    As there are no external users of these functions, the definitions are
    moved from include/linux/blkdev.h to block/blk-core.c.

    This patch doesn't introduce any noticeable behavior difference.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cgroup writeback requires support from both bdi and filesystem sides.
    Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
    support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
    default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
    both MEMCG and BLK_CGROUP are enabled.

    inode_cgwb_enabled(), which determines whether both the bdi and the fs
    of a given inode support cgroup writeback, is added.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
    and the role of the separation is unclear. For cgroup support for
    writeback IOs, a bdi will be updated to host multiple wb's where each
    wb serves writeback IOs of a different cgroup on the bdi. To achieve
    that, a wb should carry all states necessary for servicing writeback
    IOs for a cgroup independently.

    This patch moves bdi->state into wb.

    * enum bdi_state is renamed to wb_state and the prefix of all enums is
    changed from BDI_ to WB_.

    * Explicit zeroing of bdi->state is removed without adding zeroing of
    wb->state as the whole data structure is zeroed on init anyway.

    * As there's still only one bdi_writeback per backing_dev_info, all
    uses of bdi->state are mechanically replaced with bdi->wb.state
    introducing no behavior changes.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Cc: Wu Fengguang
    Cc: drbd-dev@lists.linbit.com
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cgroup aware writeback support will require exposing some of blkcg
    details. In preparation, move block/blk-cgroup.h to
    include/linux/blk-cgroup.h. This patch is pure file move.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
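
    Taken together, the first two commits above (the blk_{set|clear}_congested()
    changes) turn the set side into roughly the following; a sketch only, the
    real functions are static in block/blk-core.c and the clear side mirrors
    this:

    #include <linux/backing-dev.h>
    #include <linux/blk-cgroup.h>
    #include <linux/blkdev.h>

    static void example_set_congested(struct request_list *rl, int sync)
    {
    #ifdef CONFIG_CGROUP_WRITEBACK
            /* every blkcg has its own wb_congested; flip the right one */
            set_wb_congested(rl->blkg->wb_congested, sync);
    #else
            /*
             * Without cgroup writeback only the root wb exists, so ignore
             * congestion events coming from non-root request_lists.
             */
            if (rl == &rl->q->root_rl)
                    set_wb_congested(rl->q->backing_dev_info.wb.congested, sync);
    #endif
    }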
     

22 May, 2015

1 commit

  • Currently dm-multipath has to clone the bios for every request sent
    to the lower devices, which wastes cpu cycles and ties down memory.

    This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
    not to complete bios attached to the request; it is set on clone requests,
    similar to bios in a flush sequence. With this change I/O
    errors on a path failure only get propagated to dm-multipath, which
    can then either resubmit the I/O or complete the bios on the original
    request.

    I've done some basic testing of this on a Linux target with ALUA support,
    and it survives path failures during I/O nicely.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Christoph Hellwig
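
    The mechanism is a single extra check in the request completion path; a
    sketch of the idea, using the two-argument bio_endio() of that time
    (note this approach was later reverted -- see the 26/27 Jun entries
    above):

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    static void example_req_bio_endio(struct request *rq, struct bio *bio,
                                      unsigned int nbytes, int error)
    {
            bio_advance(bio, nbytes);

            /*
             * Bios hanging off a flush sequence or a cloned request are not
             * completed here; their owner (e.g. dm-multipath, through the
             * original request) decides whether to retry or complete them.
             */
            if (bio->bi_iter.bi_size == 0 &&
                !(rq->cmd_flags & (REQ_FLUSH_SEQ | REQ_CLONE)))
                    bio_endio(bio, error);
    }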
     

13 May, 2015

1 commit

  • With commit ff36ab345 ("dm: remove request-based logic from
    make_request_fn wrapper") DM no longer calls blk_queue_bio() directly,
    so remove its export. Doing so required a forward declaration in
    blk-core.c.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

09 May, 2015

2 commits

    The last patch makes plugging work for the multiple-queue case. However,
    it only works for the single-disk case, because it assumes there is only
    one request in the plug list. If a task is accessing multiple disks,
    e.g. MD/DM, the assumption is wrong. Let blk_attempt_plug_merge() record
    the request from the same queue.

    V2: use NULL parameter in !mq case. Fix a bug. Add comments in
    blk_attempt_plug_merge to make it (hopefully) less confusing.

    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
    The current code makes it look as if an inner plug gets flushed with
    blk_finish_plug(); actually it's a nop. All requests/callbacks are added
    to current->plug, while only the outermost plug is assigned to
    current->plug. So an inner plug always has an empty request/callback
    list, which makes blk_flush_plug_list() a nop. This tries to make the
    code clearer.

    Signed-off-by: Shaohua Li
    Reviewed-by: Jeff Moyer
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Shaohua Li
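
    A sketch of the nested-plug semantics the second commit describes (the
    intent rather than the exact diff): only the outermost blk_start_plug()
    installs itself in current->plug, so requests issued under a nested plug
    batch on the outer one and finishing the inner plug flushes nothing.

    #include <linux/blkdev.h>
    #include <linux/sched.h>

    void example_start_plug(struct blk_plug *plug)
    {
            INIT_LIST_HEAD(&plug->list);
            INIT_LIST_HEAD(&plug->mq_list);
            INIT_LIST_HEAD(&plug->cb_list);

            /* nested plug: leave the outer plug installed and do nothing */
            if (current->plug)
                    return;

            current->plug = plug;
    }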
     

28 Apr, 2015

1 commit

  • Because of the peculiar way that md devices are created (automatically
    when the device node is opened), a new device can be created and
    registered immediately after the
    blk_unregister_region(disk_devt(disk), disk->minors);
    call in del_gendisk().

    Therefore it is important that all visible artifacts of the previous
    device are removed before this call. In particular, the 'bdi'.

    Since:
    commit c4db59d31e39ea067c32163ac961e9c80198fd37
    Author: Christoph Hellwig
    fs: don't reassign dirty inodes to default_backing_dev_info

    moved the
    device_unregister(bdi->dev);
    call from bdi_unregister() to bdi_destroy() it has been quite easy to
    lose a race and have a new (e.g.) "md127" be created after the
    blk_unregister_region() call and before bdi_destroy() is ultimately
    called by the final 'put_disk', which must come after del_gendisk().

    The new device finds that the bdi name is already registered in sysfs
    and complains

    > [ 9627.630029] WARNING: CPU: 18 PID: 3330 at fs/sysfs/dir.c:31 sysfs_warn_dup+0x5a/0x70()
    > [ 9627.630032] sysfs: cannot create duplicate filename '/devices/virtual/bdi/9:127'

    We can fix this by moving the bdi_destroy() call out of
    blk_release_queue() (which can happen very late when a refcount
    reaches zero) and into blk_cleanup_queue() - which happens exactly when the md
    device driver calls it.

    Then it is only necessary for md to call blk_cleanup_queue() before
    del_gendisk(). As loop.c devices are also created on demand by
    opening the device node, we make the same change there.

    Fixes: c4db59d31e39ea067c32163ac961e9c80198fd37
    Reported-by: Azat Khuzhin
    Cc: Christoph Hellwig
    Cc: stable@vger.kernel.org (v4.0)
    Signed-off-by: NeilBrown
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    NeilBrown
     

25 Mar, 2015

1 commit

    blk_init_rl() allocates a mempool using mempool_create_node() with node
    local memory. This only allocates the mempool and element list locally
    to the request queue's node.

    What we really want to do is allocate the request itself local to the
    queue. To do this, we need our own alloc and free functions that
    allocate from request_cachep and pass in the request queue's node so
    that node-local memory is preferred. A sketch of the result follows
    this entry.

    Acked-by: Tejun Heo
    Signed-off-by: David Rientjes
    Signed-off-by: Jens Axboe

    David Rientjes
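
    A sketch of the resulting allocation setup described above (helper names
    are illustrative; the real callbacks sit next to blk_init_rl() in
    block/blk-core.c, and request_cachep is the request slab cache defined
    there):

    #include <linux/blkdev.h>
    #include <linux/mempool.h>
    #include <linux/slab.h>

    extern struct kmem_cache *request_cachep;       /* blk-core.c's request slab */

    static void *example_alloc_request(gfp_t gfp_mask, void *data)
    {
            struct request_queue *q = data;

            /* allocate the request itself on the queue's home node */
            return kmem_cache_alloc_node(request_cachep, gfp_mask, q->node);
    }

    static void example_free_request(void *element, void *data)
    {
            kmem_cache_free(request_cachep, element);
    }

    static int example_init_rl(struct request_list *rl, struct request_queue *q,
                               gfp_t gfp_mask)
    {
            /* both the pool bookkeeping and the requests end up on q->node */
            rl->rq_pool = mempool_create_node(BLKDEV_MIN_RQ,
                                              example_alloc_request,
                                              example_free_request, q,
                                              gfp_mask, q->node);
            return rl->rq_pool ? 0 : -ENOMEM;
    }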