Eric Lee / smarc-fsl-linux-kernel

02 Jun, 2015

17 commits

f0054bb1e writeback: move backing_dev_info->wb_lock and ->worklist into bdi_writeback ... Browse Code »

Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->wb_lock and ->worklist into wb.

* The lock protects bdi->worklist and bdi->wb.dwork scheduling. While
moving, rename it to wb->work_lock as wb->wb_lock is confusing.
Also, move wb->dwork downwards so that it's colocated with the new
->work_lock and ->work_list fields.

* bdi_writeback_workfn() -> wb_workfn()
bdi_wakeup_thread_delayed(bdi) -> wb_wakeup_delayed(wb)
bdi_wakeup_thread(bdi) -> wb_wakeup(wb)
bdi_queue_work(bdi, ...) -> wb_queue_work(wb, ...)
__bdi_start_writeback(bdi, ...) -> __wb_start_writeback(wb, ...)
get_next_work_item(bdi) -> get_next_work_item(wb)

* bdi_wb_shutdown() is renamed to wb_shutdown() and now takes @wb.
The function contained parts which belong to the containing bdi
rather than the wb itself - testing cap_writeback_dirty and
bdi_remove_from_list() invocation. Those are moved to
bdi_unregister().

* bdi_wb_{init|exit}() are renamed to wb_{init|exit}().
Initializations of the moved bdi->wb_lock and ->work_list are
relocated from bdi_init() to wb_init().

* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state
introducing no behavior changes.

Signed-off-by: Tejun Heo
Reviewed-by: Jan Kara
Cc: Jens Axboe
Cc: Wu Fengguang
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:34 +0800
de1fff37b writeback: s/bdi/wb/ in mm/page-writeback.c ... Browse Code »

Writeback operations will now be per wb (bdi_writeback) instead of
bdi. Replace the relevant bdi references in symbol names and comments
with wb. This patch is purely cosmetic and doesn't make any
functional changes.

Signed-off-by: Tejun Heo
Reviewed-by: Jan Kara
Cc: Wu Fengguang
Cc: Jens Axboe
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:34 +0800
a88a341a7 writeback: move bandwidth related fields from backing_dev_info into bdi_writeback ... Browse Code »

Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bandwidth related fields from backing_dev_info into
bdi_writeback.

* The moved fields are: bw_time_stamp, dirtied_stamp, written_stamp,
write_bandwidth, avg_write_bandwidth, dirty_ratelimit,
balanced_dirty_ratelimit, completions and dirty_exceeded.

* writeback_chunk_size() and over_bground_thresh() now take @wb
instead of @bdi.

* bdi_writeout_fraction(bdi, ...) -> wb_writeout_fraction(wb, ...)
bdi_dirty_limit(bdi, ...) -> wb_dirty_limit(wb, ...)
bdi_position_ration(bdi, ...) -> wb_position_ratio(wb, ...)
bdi_update_writebandwidth(bdi, ...) -> wb_update_write_bandwidth(wb, ...)
[__]bdi_update_bandwidth(bdi, ...) -> [__]wb_update_bandwidth(wb, ...)
bdi_{max|min}_pause(bdi, ...) -> wb_{max|min}_pause(wb, ...)
bdi_dirty_limits(bdi, ...) -> wb_dirty_limits(wb, ...)

* Init/exits of the relocated fields are moved to bdi_wb_init/exit()
respectively. Note that explicit zeroing is dropped in the process
as wb's are cleared in entirety anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
introducing no behavior changes.

v2: Typo in description fixed as suggested by Jan.

Signed-off-by: Tejun Heo
Reviewed-by: Jan Kara
Cc: Jens Axboe
Cc: Wu Fengguang
Cc: Jaegeuk Kim
Cc: Steven Whitehouse
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:34 +0800
93f78d882 writeback: move backing_dev_info->bdi_stat[] into bdi_writeback ... Browse Code »

Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->bdi_stat[] into wb.

* enum bdi_stat_item is renamed to wb_stat_item and the prefix of all
enums is changed from BDI_ to WB_.

* BDI_STAT_BATCH() -> WB_STAT_BATCH()

* [__]{add|inc|dec|sum}_wb_stat(bdi, ...) -> [__]{add|inc}_wb_stat(wb, ...)

* bdi_stat[_error]() -> wb_stat[_error]()

* bdi_writeout_inc() -> wb_writeout_inc()

* stat init is moved to bdi_wb_init() and bdi_wb_exit() is added and
frees stat.

* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->stat[] are mechanically replaced with bdi->wb.stat[]
introducing no behavior changes.

Signed-off-by: Tejun Heo
Reviewed-by: Jan Kara
Cc: Jens Axboe
Cc: Wu Fengguang
Cc: Miklos Szeredi
Cc: Trond Myklebust
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:34 +0800
4452226ea writeback: move backing_dev_info->state into bdi_writeback ... Browse Code »

Currently, a bdi (backing_dev_info) embeds single wb (bdi_writeback)
and the role of the separation is unclear. For cgroup support for
writeback IOs, a bdi will be updated to host multiple wb's where each
wb serves writeback IOs of a different cgroup on the bdi. To achieve
that, a wb should carry all states necessary for servicing writeback
IOs for a cgroup independently.

This patch moves bdi->state into wb.

* enum bdi_state is renamed to wb_state and the prefix of all enums is
changed from BDI_ to WB_.

* Explicit zeroing of bdi->state is removed without adding zeoring of
wb->state as the whole data structure is zeroed on init anyway.

* As there's still only one bdi_writeback per backing_dev_info, all
uses of bdi->state are mechanically replaced with bdi->wb.state
introducing no behavior changes.

Signed-off-by: Tejun Heo
Reviewed-by: Jan Kara
Cc: Jens Axboe
Cc: Wu Fengguang
Cc: drbd-dev@lists.linbit.com
Cc: Neil Brown
Cc: Alasdair Kergon
Cc: Mike Snitzer
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:34 +0800
ad7fa852d memcg: implement mem_cgroup_css_from_page() ... Browse Code »

Implement mem_cgroup_css_from_page() which returns the
cgroup_subsys_state of the memcg associated with a given page on the
default hierarchy. This will be used by cgroup writeback support.

This function assumes that page->mem_cgroup association doesn't change
until the page is released, which is true on the default hierarchy as
long as replace_page_cache_page() is not used. As the only user of
replace_page_cache_page() is FUSE which won't support cgroup writeback
for the time being, this works for now, and replace_page_cache_page()
will soon be updated so that the invariant actually holds.

Note that the RCU protected page->mem_cgroup access is consistent with
other usages across memcg but ultimately incorrect. These unlocked
accesses are missing required barriers. page->mem_cgroup should be
made an RCU pointer and updated and accessed using RCU operations.

v4: Instead of triggering WARN, return the root css on the traditional
hierarchies. This makes the function a lot easier to deal with
especially as there's no light way to synchronize against
hierarchy rebinding.

v3: s/mem_cgroup_migrate()/mem_cgroup_css_from_page()/

v2: Trigger WARN if the function is used on the traditional
hierarchies and add comment about the assumed invariant.

Signed-off-by: Tejun Heo
Cc: Johannes Weiner
Cc: Michal Hocko
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:34 +0800
1d933cf09 blkcg: implement bio_associate_blkcg() ... Browse Code »

Currently, a bio can only be associated with the io_context and blkcg
of %current using bio_associate_current(). This is too restrictive
for cgroup writeback support. Implement bio_associate_blkcg() which
associates a bio with the specified blkcg.

bio_associate_blkcg() leaves the io_context unassociated.
bio_associate_current() is updated so that it considers a bio as
already associated if it has a blkcg_css, instead of an io_context,
associated with it.

Signed-off-by: Tejun Heo
Cc: Jens Axboe
Cc: Vivek Goyal
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:34 +0800
fd383c2d3 blkcg: implement task_get_blkcg_css() ... Browse Code »

Implement a wrapper around task_get_css() to acquire the blkcg css for
a given task. The wrapper is necessary for cgroup writeback support
as there will be places outside blkcg proper trying to acquire
blkcg_css and blkio_cgrp_id will be undefined when !CONFIG_BLK_CGROUP.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:34 +0800
ec438699a cgroup, block: implement task_get_css() and use it in bio_associate_current() ... Browse Code »

bio_associate_current() currently open codes task_css() and
css_tryget_online() to find and pin $current's blkcg css. Abstract it
into task_get_css() which is implemented from cgroup side. As a task
is always associated with an online css for every subsystem except
while the css_set update is propagating, task_get_css() retries till
css_tryget_online() succeeds.

This is a cleanup and shouldn't lead to noticeable behavior changes.

Signed-off-by: Tejun Heo
Cc: Li Zefan
Cc: Jens Axboe
Cc: Vivek Goyal
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:34 +0800
496d5e756 blkcg: add blkcg_root_css ... Browse Code »

Add global constant blkcg_root_css which points to &blkcg_root.css.
This will be used by cgroup writeback support. If blkcg is disabled,
it's defined as ERR_PTR(-EINVAL).

v2: The declarations moved to include/linux/blk-cgroup.h as suggested
by Vivek.

Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Cc: Jens Axboe
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:33 +0800
56161634e memcg: add mem_cgroup_root_css ... Browse Code »

Add global mem_cgroup_root_css which points to the root memcg css.
This will be used by cgroup writeback support. If memcg is disabled,
it's defined as ERR_PTR(-EINVAL).

Signed-off-by: Tejun Heo
Cc: Johannes Weiner
aCc: Michal Hocko
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:33 +0800
ec13b1d6f blkcg: always create the blkcg_gq for the root blkcg ... Browse Code »

Currently, blkcg does a minor optimization where the root blkcg is
created when the first blkcg policy is activated on a queue and
destroyed on the deactivation of the last. On systems where blkcg is
configured but not used, this saves one blkcg_gq struct per queue. On
systems where blkcg is actually used, there's no difference. The only
case where this can lead to any meaninful, albeit still minute, save
in memory consumption is when all blkcg policies are deactivated after
being widely used in the system, which is a hihgly unlikely scenario.

The conditional existence of root blkcg_gq has already created several
bugs in blkcg and became an issue once again for the new per-cgroup
wb_congested mechanism for cgroup writeback support leading to a NULL
dereference when no blkcg policy is active. This is really not worth
bothering with. This patch makes blkcg always allocate and link the
root blkcg_gq and release it only on queue destruction.

Signed-off-by: Tejun Heo
Reported-by: Fengguang Wu
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:33 +0800
efa7d1c73 update !CONFIG_BLK_CGROUP dummies in include/linux/blk-cgroup.h ... Browse Code »

The header file will be used more widely with the pending cgroup
writeback support and the current set of dummy declarations aren't
enough to handle different config combinations. Update as follows.

* Drop the struct cgroup declaration. None of the dummy defs need it.

* Define blkcg as an empty struct instead of just declaring it.

* Wrap dummy function defs in CONFIG_BLOCK. Some functions use block
data types and none of them are to be used w/o block enabled.

Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:33 +0800
eea8f41cc blkcg: move block/blk-cgroup.h to include/linux/blk-cgroup.h ... Browse Code »

cgroup aware writeback support will require exposing some of blkcg
details. In preprataion, move block/blk-cgroup.h to
include/linux/blk-cgroup.h. This patch is pure file move.

Signed-off-by: Tejun Heo
Cc: Vivek Goyal
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:33 +0800
c4843a759 memcg: add per cgroup dirty page accounting ... Browse Code »

When modifying PG_Dirty on cached file pages, update the new
MEM_CGROUP_STAT_DIRTY counter. This is done in the same places where
global NR_FILE_DIRTY is managed. The new memcg stat is visible in the
per memcg memory.stat cgroupfs file. The most recent past attempt at
this was http://thread.gmane.org/gmane.linux.kernel.cgroups/8632

The new accounting supports future efforts to add per cgroup dirty
page throttling and writeback. It also helps an administrator break
down a container's memory usage and provides evidence to understand
memcg oom kills (the new dirty count is included in memcg oom kill
messages).

The ability to move page accounting between memcg
(memory.move_charge_at_immigrate) makes this accounting more
complicated than the global counter. The existing
mem_cgroup_{begin,end}_page_stat() lock is used to serialize move
accounting with stat updates.
Typical update operation:
memcg = mem_cgroup_begin_page_stat(page)
if (TestSetPageDirty()) {
[...]
mem_cgroup_update_page_stat(memcg)
}
mem_cgroup_end_page_stat(memcg)

Summary of mem_cgroup_end_page_stat() overhead:
- Without CONFIG_MEMCG it's a no-op
- With CONFIG_MEMCG and no inter memcg task movement, it's just
rcu_read_lock()
- With CONFIG_MEMCG and inter memcg task movement, it's
rcu_read_lock() + spin_lock_irqsave()

A memcg parameter is added to several routines because their callers
now grab mem_cgroup_begin_page_stat() which returns the memcg later
needed by for mem_cgroup_update_page_stat().

Because mem_cgroup_begin_page_stat() may disable interrupts, some
adjustments are needed:
- move __mark_inode_dirty() from __set_page_dirty() to its caller.
__mark_inode_dirty() locking does not want interrupts disabled.
- use spin_lock_irqsave(tree_lock) rather than spin_lock_irq() in
__delete_from_page_cache(), replace_page_cache_page(),
invalidate_complete_page2(), and __remove_mapping().

text data bss dec hex filename
8925147 1774832 1785856 12485835 be84cb vmlinux-!CONFIG_MEMCG-before
8925339 1774832 1785856 12486027 be858b vmlinux-!CONFIG_MEMCG-after
+192 text bytes
8965977 1784992 1785856 12536825 bf4bf9 vmlinux-CONFIG_MEMCG-before
8966750 1784992 1785856 12537598 bf4efe vmlinux-CONFIG_MEMCG-after
+773 text bytes

Performance tests run on v4.0-rc1-36-g4f671fe2f952. Lower is better for
all metrics, they're all wall clock or cycle counts. The read and write
fault benchmarks just measure fault time, they do not include I/O time.

* CONFIG_MEMCG not set:
baseline patched
kbuild 1m25.030000(+-0.088% 3 samples) 1m25.426667(+-0.120% 3 samples)
dd write 100 MiB 0.859211561 +-15.10% 0.874162885 +-15.03%
dd write 200 MiB 1.670653105 +-17.87% 1.669384764 +-11.99%
dd write 1000 MiB 8.434691190 +-14.15% 8.474733215 +-14.77%
read fault cycles 254.0(+-0.000% 10 samples) 253.0(+-0.000% 10 samples)
write fault cycles 2021.2(+-3.070% 10 samples) 1984.5(+-1.036% 10 samples)

* CONFIG_MEMCG=y root_memcg:
baseline patched
kbuild 1m25.716667(+-0.105% 3 samples) 1m25.686667(+-0.153% 3 samples)
dd write 100 MiB 0.855650830 +-14.90% 0.887557919 +-14.90%
dd write 200 MiB 1.688322953 +-12.72% 1.667682724 +-13.33%
dd write 1000 MiB 8.418601605 +-14.30% 8.673532299 +-15.00%
read fault cycles 266.0(+-0.000% 10 samples) 266.0(+-0.000% 10 samples)
write fault cycles 2051.7(+-1.349% 10 samples) 2049.6(+-1.686% 10 samples)

* CONFIG_MEMCG=y non-root_memcg:
baseline patched
kbuild 1m26.120000(+-0.273% 3 samples) 1m25.763333(+-0.127% 3 samples)
dd write 100 MiB 0.861723964 +-15.25% 0.818129350 +-14.82%
dd write 200 MiB 1.669887569 +-13.30% 1.698645885 +-13.27%
dd write 1000 MiB 8.383191730 +-14.65% 8.351742280 +-14.52%
read fault cycles 265.7(+-0.172% 10 samples) 267.0(+-0.000% 10 samples)
write fault cycles 2070.6(+-1.512% 10 samples) 2084.4(+-2.148% 10 samples)

As expected anon page faults are not affected by this patch.

tj: Updated to apply on top of the recent cancel_dirty_page() changes.

Signed-off-by: Sha Zhengju
Signed-off-by: Greg Thelen
Signed-off-by: Tejun Heo
Signed-off-by: Jens Axboe

Greg Thelen
2015-06-02 22:33:33 +0800
11f81becc page_writeback: revive cancel_dirty_page() in a restricted form ... Browse Code »

cancel_dirty_page() had some issues and b9ea25152e56 ("page_writeback:
clean up mess around cancel_dirty_page()") replaced it with
account_page_cleaned() which makes the caller responsible for clearing
the dirty bit; unfortunately, the planned changes for cgroup writeback
support requires synchronization between dirty bit manipulation and
stat updates. While we can open-code such synchronization in each
account_page_cleaned() callsite, that's gonna be unnecessarily awkward
and verbose.

This patch revives cancel_dirty_page() but in a more restricted form.
All it does is TestClearPageDirty() followed by account_page_cleaned()
invocation if the page was dirty. This helper covers all
account_page_cleaned() usages except for __delete_from_page_cache()
which is a special case anyway and left alone. As this leaves no
module user for account_page_cleaned(), EXPORT_SYMBOL() is dropped
from it.

This patch just revives cancel_dirty_page() as a trivial wrapper to
replace equivalent usages and doesn't introduce any functional
changes.

Signed-off-by: Tejun Heo
Cc: Konstantin Khlebnikov
Signed-off-by: Jens Axboe

Tejun Heo
2015-06-02 22:33:33 +0800
f26cdc853 blk-mq: Shared tag enhancements ... Browse Code »

Storage controllers may expose multiple block devices that share hardware
resources managed by blk-mq. This patch enhances the shared tags so a
low-level driver can access the shared resources not tied to the unshared
h/w contexts. This way the LLD can dynamically add and delete disks and
request queues without having to track all the request_queue hctx's to
iterate outstanding tags.

Signed-off-by: Keith Busch
Signed-off-by: Jens Axboe

Keith Busch
2015-06-02 04:35:56 +0800

30 May, 2015

2 commits

e548ca4ee block: don't honor chunk sizes for data-less IO ... Browse Code »

We don't need to honor chunk sizes for IO that doesn't carry any
data.

Signed-off-by: Jens Axboe

Jens Axboe
2015-05-30 03:11:32 +0800
beefa6ba7 block: only honor SG gap prevention for merges that contain data ... Browse Code »

We can safely merge anything that wont generate an SG list entry,
so if the bio is data-less (discard), don't look at potential
SG gaps.

Signed-off-by: Jens Axboe

Jens Axboe
2015-05-30 03:10:23 +0800

27 May, 2015

1 commit

f6454b049 block: fix returnvar.cocci warnings ... Browse Code »

Remove unneeded variable used to store return value.

Generated by: scripts/coccinelle/misc/returnvar.cocci

Signed-off-by: Fengguang Wu
Signed-off-by: Julia Lawall
Signed-off-by: Jens Axboe

Julia Lawall
2015-05-27 04:02:45 +0800

22 May, 2015

2 commits

5f1b670d0 block, dm: don't copy bios for request clones ... Browse Code »

Currently dm-multipath has to clone the bios for every request sent
to the lower devices, which wastes cpu cycles and ties down memory.

This patch instead adds a new REQ_CLONE flag that instructs req_bio_endio
to not complete bios attached to a request, which we set on clone
requests similar to bios in a flush sequence. With this change I/O
errors on a path failure only get propagated to dm-multipath, which
can then either resubmit the I/O or complete the bios on the original
request.

I've done some basic testing of this on a Linux target with ALUA support,
and it survives path failures during I/O nicely.

Signed-off-by: Christoph Hellwig
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-22 22:58:57 +0800
326e1dbb5 block: remove management of bi_remaining when restoring original bi_end_io ... Browse Code »

Commit c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for
non-chains") regressed all existing callers that followed this pattern:
1) saving a bio's original bi_end_io
2) wiring up an intermediate bi_end_io
3) restoring the original bi_end_io from intermediate bi_end_io
4) calling bio_endio() to execute the restored original bi_end_io

The regression was due to BIO_CHAIN only ever getting set if
bio_inc_remaining() is called. For the above pattern it isn't set until
step 3 above (step 2 would've needed to establish BIO_CHAIN). As such
the first bio_endio(), in step 2 above, never decremented __bi_remaining
before calling the intermediate bi_end_io -- leaving __bi_remaining with
the value 1 instead of 0. When bio_inc_remaining() occurred during step
3 it brought it to a value of 2. When the second bio_endio() was
called, in step 4 above, it should've called the original bi_end_io but
it didn't because there was an extra reference that wasn't dropped (due
to atomic operations being optimized away since BIO_CHAIN wasn't set
upfront).

Fix this issue by removing the __bi_remaining management complexity for
all callers that use the above pattern -- bio_chain() is the only
interface that _needs_ to be concerned with __bi_remaining. For the
above pattern callers just expect the bi_end_io they set to get called!
Remove bio_endio_nodec() and also remove all bio_inc_remaining() calls
that aren't associated with the bio_chain() interface.

Also, the bio_inc_remaining() interface has been moved local to bio.c.

Fixes: c4cf5261 ("bio: skip atomic inc/dec of ->bi_remaining for non-chains")
Reviewed-by: Christoph Hellwig
Reviewed-by: Jan Kara
Signed-off-by: Mike Snitzer
Signed-off-by: Jens Axboe

Mike Snitzer
2015-05-22 22:58:55 +0800

20 May, 2015

2 commits

b04a5636a block: replace trylock with mutex_lock in blkdev_reread_part() ... Browse Code »

The only possible problem of using mutex_lock() instead of trylock
is about deadlock.

If there aren't any locks held before calling blkdev_reread_part(),
deadlock can't be caused by this conversion.

If there are locks held before calling blkdev_reread_part(),
and if these locks arn't required in open, close handler and I/O
path, deadlock shouldn't be caused too.

Both user space's ioctl(BLKRRPART) and md_setup_drive() from
init/do_mounts_md.c belongs to the 1st case, so the conversion is safe
for the two cases.

For loop, the previous patches in this pathset has fixed the ABBA lock
dependency, so the conversion is OK.

For nbd, tx_lock is held when calling the function:

- both open and release won't hold the lock
- when blkdev_reread_part() is run, I/O thread has been stopped
already, so tx_lock won't be acquired in I/O path at that time.
- so the conversion won't cause deadlock for nbd

For dasd, both dasd_open(), dasd_release() and request function don't
acquire any mutex/semphone, so the conversion should be safe.

Reviewed-by: Christoph Hellwig
Tested-by: Jarod Wilson
Acked-by: Jarod Wilson
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Ming Lei
2015-05-20 23:05:45 +0800
be3241779 block: export blkdev_reread_part() and __blkdev_reread_part() ... Browse Code »

This patch exports blkdev_reread_part() for block drivers, also
introduce __blkdev_reread_part().

For some drivers, such as loop, reread of partitions can be run
from the release path, and bd_mutex may already be held prior to
calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
__blkdev_reread_part for use in such cases.

CC: Christoph Hellwig
CC: Jens Axboe
CC: Tejun Heo
CC: Alexander Viro
CC: Markus Pargmann
CC: Stefan Weinhuber
CC: Stefan Haberland
CC: Sebastian Ott
CC: Fabian Frederick
CC: Ming Lei
CC: David Herrmann
CC: Andrew Morton
CC: Peter Zijlstra
CC: nbd-general@lists.sourceforge.net
CC: linux-s390@vger.kernel.org
Reviewed-by: Christoph Hellwig
Signed-off-by: Jarod Wilson
Signed-off-by: Ming Lei
Signed-off-by: Jens Axboe

Jarod Wilson
2015-05-20 23:05:42 +0800

19 May, 2015

5 commits

343df3c79 suspend: simplify block I/O handling ... Browse Code »

Stop abusing struct page functionality and the swap end_io handler, and
instead add a modified version of the blk-lib.c bio_batch helpers.

Also move the block I/O code into swap.c as they are directly tied into
each other.

Signed-off-by: Christoph Hellwig
Tested-by: Pavel Machek
Tested-by: Ming Lin
Acked-by: Pavel Machek
Acked-by: Rafael J. Wysocki
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-19 23:19:59 +0800
b2dbe0a60 block: collapse bio bit space ... Browse Code »

Various previous patches removed bits and left holes, collapse them
all. Leave the reset start bit where it is, we don't need to change
that.

Signed-off-by: Jens Axboe

Jens Axboe
2015-05-19 23:18:28 +0800
97ca223c3 block: remove unused BIO_RW_BLOCK and BIO_EOF flags ... Browse Code »

Signed-off-by: Christoph Hellwig
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-19 23:17:05 +0800
b25de9d6d block: remove BIO_EOPNOTSUPP ... Browse Code »

Since the big barrier rewrite/removal in 2007 we never fail FLUSH or
FUA requests, which means we can remove the magic BIO_EOPNOTSUPP flag
to help propagating those to the buffer_head layer.

Signed-off-by: Christoph Hellwig
Reviewed-by: Jeff Moyer
Acked-by: Tejun Heo
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-19 23:17:03 +0800
4ecd4fef3 block: use an atomic_t for mq_freeze_depth ... Browse Code »

lockdep gets unhappy about the not disabling irqs when using the queue_lock
around it. Instead of trying to fix that up just switch to an atomic_t
and get rid of the lock.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-19 23:12:59 +0800

09 May, 2015

6 commits

5b3f341f0 blk-mq: make plug work for mutiple disks and queues ... Browse Code »

Last patch makes plug work for multiple queue case. However it only
works for single disk case, because it assumes only one request in the
plug list. If a task is accessing multiple disks, eg MD/DM, the
assumption is wrong. Let blk_attempt_plug_merge() record request from
the same queue.

V2: use NULL parameter in !mq case. Fix a bug. Add comments in
blk_attempt_plug_merge to make it less (hopefully) confusion.

Cc: Jens Axboe
Cc: Christoph Hellwig
Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe

Shaohua Li
2015-05-09 04:17:23 +0800
f984df1f0 blk-mq: do limited block plug for multiple queue case ... Browse Code »

plug is still helpful for workload with IO merge, but it can be harmful
otherwise especially with multiple hardware queues, as there is
(supposed) no lock contention in this case and plug can introduce
latency. For multiple queues, we do limited plug, eg plug only if there
is request merge. If a request doesn't have merge with following
request, the requet will be dispatched immediately.

V2: check blk_queue_nomerges() as suggested by Jeff.

Cc: Jens Axboe
Cc: Christoph Hellwig
Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe

Shaohua Li
2015-05-09 04:17:21 +0800
239ad215f blk-mq: avoid re-initialize request which is failed in direct dispatch ... Browse Code »

If we directly issue a request and it fails, we use
blk_mq_merge_queue_io(). But we already assigned bio to a request in
blk_mq_bio_to_request. blk_mq_merge_queue_io shouldn't run
blk_mq_bio_to_request again.

Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe

Shaohua Li
2015-05-09 04:17:19 +0800
e6c4438ba blk-mq: fix plugging in blk_sq_make_request ... Browse Code »

The following appears in blk_sq_make_request:

/*
* If we have multiple hardware queues, just go directly to
* one of those for sync IO.
*/

We clearly don't have multiple hardware queues, here! This comment was
introduced with this commit 07068d5b8e (blk-mq: split make request
handler for multi and single queue):

We want slightly different behavior from them:

- On single queue devices, we currently use the per-process plug
for deferred IO and for merging.

- On multi queue devices, we don't use the per-process plug, but
we want to go straight to hardware for SYNC IO.

The old code had this:

use_plug = !is_flush_fua && ((q->nr_hw_queues == 1) || !is_sync);

and that was converted to:

use_plug = !is_flush_fua && !is_sync;

which is not equivalent. For the single queue case, that second half of
the && expression is always true. So, what I think was actually inteded
follows (and this more closely matches what is done in blk_queue_bio).

V2: delete the 'likely', which should not be a big deal

Signed-off-by: Jeff Moyer
Signed-off-by: Shaohua Li
Signed-off-by: Jens Axboe

Jeff Moyer
2015-05-09 04:17:17 +0800
5596d0d59 sched: always use blk_schedule_flush_plug in io_schedule_out ... Browse Code »

block plug callback could sleep, so we introduce a parameter
'from_schedule' and corresponding drivers can use it to destinguish a
schedule plug flush or a plug finish. Unfortunately io_schedule_out
still uses blk_flush_plug(). This causes below output (Note, I added a
might_sleep() in raid1_unplug to make it trigger faster, but the whole
thing doesn't matter if I add might_sleep). In raid1/10, this can cause
deadlock.

This patch makes io_schedule_out always uses blk_schedule_flush_plug.
This should only impact drivers (as far as I know, raid 1/10) which are
sensitive to the 'from_schedule' parameter.

[ 370.817949] ------------[ cut here ]------------
[ 370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
[ 370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [] prepare_to_wait+0x2f/0x90
[ 370.817971] Modules linked in: raid1
[ 370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G W 4.0.0+ #361
[ 370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[ 370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
[ 370.817985] ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
[ 370.817988] ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
[ 370.817990] ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
[ 370.817993] Call Trace:
[ 370.817999] [] dump_stack+0x4f/0x7b
[ 370.818002] [] warn_slowpath_common+0x8c/0xd0
[ 370.818004] [] warn_slowpath_fmt+0x46/0x50
[ 370.818006] [] ? prepare_to_wait+0x2f/0x90
[ 370.818008] [] ? prepare_to_wait+0x2f/0x90
[ 370.818010] [] __might_sleep+0x7f/0x90
[ 370.818014] [] raid1_unplug+0xd3/0x170 [raid1]
[ 370.818024] [] blk_flush_plug_list+0x8a/0x1e0
[ 370.818028] [] ? bit_wait+0x50/0x50
[ 370.818031] [] io_schedule_timeout+0x130/0x140
[ 370.818033] [] bit_wait_io+0x36/0x50
[ 370.818034] [] __wait_on_bit+0x65/0x90
[ 370.818041] [] ? ext4_read_block_bitmap_nowait+0xbc/0x630
[ 370.818043] [] ? bit_wait+0x50/0x50
[ 370.818045] [] out_of_line_wait_on_bit+0x72/0x80
[ 370.818047] [] ? autoremove_wake_function+0x40/0x40
[ 370.818050] [] __wait_on_buffer+0x44/0x50
[ 370.818053] [] ext4_wait_block_bitmap+0xe0/0xf0
[ 370.818058] [] ext4_mb_init_cache+0x206/0x790
[ 370.818062] [] ? lru_cache_add+0x1c/0x50
[ 370.818064] [] ext4_mb_init_group+0x11e/0x200
[ 370.818066] [] ext4_mb_load_buddy+0x341/0x360
[ 370.818068] [] ext4_mb_find_by_goal+0x93/0x2f0
[ 370.818070] [] ? ext4_mb_normalize_request+0x1e4/0x5b0
[ 370.818072] [] ext4_mb_regular_allocator+0x67/0x460
[ 370.818074] [] ? ext4_mb_normalize_request+0x1e4/0x5b0
[ 370.818076] [] ext4_mb_new_blocks+0x4cb/0x620
[ 370.818079] [] ext4_ext_map_blocks+0x4c6/0x14d0
[ 370.818081] [] ? ext4_es_lookup_extent+0x4e/0x290
[ 370.818085] [] ext4_map_blocks+0x14d/0x4f0
[ 370.818088] [] ext4_writepages+0x76d/0xe50
[ 370.818094] [] do_writepages+0x21/0x50
[ 370.818097] [] __writeback_single_inode+0x60/0x490
[ 370.818099] [] writeback_sb_inodes+0x2da/0x590
[ 370.818103] [] ? trylock_super+0x1b/0x50
[ 370.818105] [] ? trylock_super+0x1b/0x50
[ 370.818107] [] __writeback_inodes_wb+0x9f/0xd0
[ 370.818109] [] wb_writeback+0x34b/0x3c0
[ 370.818111] [] bdi_writeback_workfn+0x23f/0x550
[ 370.818116] [] process_one_work+0x1c8/0x570
[ 370.818117] [] ? process_one_work+0x14b/0x570
[ 370.818119] [] worker_thread+0x11b/0x470
[ 370.818121] [] ? process_one_work+0x570/0x570
[ 370.818124] [] kthread+0xf8/0x110
[ 370.818126] [] ? kthread_create_on_node+0x210/0x210
[ 370.818129] [] ret_from_fork+0x42/0x70
[ 370.818131] [] ? kthread_create_on_node+0x210/0x210
[ 370.818132] ---[ end trace 7b4deb71e68b6605 ]---

V2: don't change ->in_iowait

Cc: NeilBrown
Signed-off-by: Shaohua Li
Reviewed-by: Jeff Moyer
Signed-off-by: Jens Axboe

Shaohua Li
2015-05-09 04:17:15 +0800
dd6cf3e18 blk: clean up plug ... Browse Code »

Current code looks like inner plug gets flushed with a
blk_finish_plug(). Actually it's a nop. All requests/callbacks are added
to current->plug, while only outmost plug is assigned to current->plug.
So inner plug always has empty request/callback list, which makes
blk_flush_plug_list() a nop. This tries to make the code more clear.

Signed-off-by: Shaohua Li
Reviewed-by: Jeff Moyer
Reviewed-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Shaohua Li
2015-05-09 04:17:14 +0800

06 May, 2015

5 commits

9dc6c806b nbd: stop using req->cmd ... Browse Code »

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-06 03:40:44 +0800
a7928c157 block: move PM request support to IDE ... Browse Code »

This removes the request types and hacks from the block code and into the
old IDE driver. There is a small amunt of code duplication due to this,
but it's not too bad.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-06 03:40:42 +0800
ac7cdff00 block: remove REQ_TYPE_PM_SHUTDOWN ... Browse Code »

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-06 03:40:10 +0800
b0b93b48a block: move REQ_TYPE_SENSE to the ide driver ... Browse Code »

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-06 03:40:07 +0800
b42171ef7 block: move REQ_TYPE_ATA_TASKFILE and REQ_TYPE_ATA_PC to ide.h ... Browse Code »

These values are only used by the IDE driver, so move them into it
by allowing drivers to take cmd_type values after the first private
one. Note that we have to turn cmd_type into a plain unsigned integer
so that gcc doesn't complain about mismatching enum types.

Signed-off-by: Christoph Hellwig
Signed-off-by: Jens Axboe

Christoph Hellwig
2015-05-06 03:40:05 +0800