11 Sep, 2015

1 commit

  • Pull blk-cg updates from Jens Axboe:
    "A bit later in the cycle, but this has been in the block tree for a
    while. This is basically four patchsets from Tejun that improve our
    buffered cgroup writeback. It was dependent on the other cgroup
    changes, but they went in earlier in this cycle.

    Series 1 is a set of 5 patches with cgroup writeback updates:

    - bdi_writeback iteration fix which could lead to some wb's being
    skipped or repeated during e.g. sync under memory pressure.

    - Simplification of wb work wait mechanism.

    - Writeback tracepoints updated to report cgroup.

    Series 2 is a set of updates for the CFQ cgroup writeback handling:

    cfq has always charged all async IOs to the root cgroup. It didn't
    have much choice as writeback didn't know about cgroups and there
    was no way to tell who to blame for a given writeback IO.
    Writeback finally grew support for cgroups and now tags each
    writeback IO with the appropriate cgroup to charge it against.

    This patchset updates cfq so that it follows the blkcg each bio is
    tagged with. Async cfq_queues are now shared across cfq_group,
    which is per-cgroup, instead of per-request_queue cfq_data. This
    makes all IOs follow the weight based IO resource distribution
    implemented by cfq.

    - Switched from GFP_ATOMIC to GFP_NOWAIT as suggested by Jeff.

    - Other misc review points addressed, acks added and rebased.

    Series 3 is the blkcg policy cleanup patches:

    This patchset contains assorted cleanups for blkcg_policy methods
    and blk[c]g_policy_data handling.

    - alloc/free added for blkg_policy_data. exit dropped.

    - alloc/free added for blkcg_policy_data.

    - blk-throttle's async percpu allocation is replaced with direct
    allocation.

    - all methods now take blk[c]g_policy_data instead of blkcg_gq or
    blkcg.

    And finally, series 4 is a set of patches cleaning up the blkcg stats
    handling:

    blkcg's stats have always been somewhat of a mess. This patchset
    tries to improve the situation a bit.

    - The following patches were added to consolidate the blkcg entry
    point and blkg creation. This in itself is an improvement and helps
    collect common stats on bio issue.

    - per-blkg stats are now accounted on bio issue rather than request
    completion so that bio-based and request-based drivers can behave
    the same way. The issue was spotted by Vivek.

    - cfq-iosched implements custom recursive stats and blk-throttle
    implements custom per-cpu stats. This patchset makes blkcg core
    support both by default.

    - cfq-iosched and blk-throttle keep track of the same stats
    multiple times. Unify them"

    * 'for-4.3/blkcg' of git://git.kernel.dk/linux-block: (45 commits)
    blkcg: use CGROUP_WEIGHT_* scale for io.weight on the unified hierarchy
    blkcg: s/CFQ_WEIGHT_*/CFQ_WEIGHT_LEGACY_*/
    blkcg: implement interface for the unified hierarchy
    blkcg: misc preparations for unified hierarchy interface
    blkcg: separate out tg_conf_updated() from tg_set_conf()
    blkcg: move body parsing from blkg_conf_prep() to its callers
    blkcg: mark existing cftypes as legacy
    blkcg: rename subsystem name from blkio to io
    blkcg: refine error codes returned during blkcg configuration
    blkcg: remove unnecessary NULL checks from __cfqg_set_weight_device()
    blkcg: reduce stack usage of blkg_rwstat_recursive_sum()
    blkcg: remove cfqg_stats->sectors
    blkcg: move io_service_bytes and io_serviced stats into blkcg_gq
    blkcg: make blkg_[rw]stat_recursive_sum() to be able to index into blkcg_gq
    blkcg: make blkcg_[rw]stat per-cpu
    blkcg: add blkg_[rw]stat->aux_cnt and replace cfq_group->dead_stats with it
    blkcg: consolidate blkg creation in blkcg_bio_issue_check()
    blk-throttle: improve queue bypass handling
    blkcg: move root blkg lookup optimization from throtl_lookup_tg() to __blkg_lookup()
    blkcg: inline [__]blkg_lookup()
    ...

    Linus Torvalds
     

04 Sep, 2015

1 commit

  • Pull ext3 removal, quota & udf fixes from Jan Kara:
    "The biggest change in the pull is the removal of ext3 filesystem
    driver (~28k lines removed). The ext4 driver is a full-featured
    replacement these days and both RH and SUSE have been using it for
    several years without issues. Also there are some workarounds in the
    VM & block layer, mainly for ext3, which we could eventually get rid of.

    Other larger change is addition of proper error handling for
    dquot_initialize(). The rest is small fixes and cleanups"

    [ I wasn't convinced about the ext3 removal and worried about things
    falling through the cracks for legacy users, but ext4 maintainers
    piped up and were all unanimously in favor of removal, and maintaining
    all legacy ext3 support inside ext4. - Linus ]

    * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    udf: Don't modify filesystem for read-only mounts
    quota: remove an unneeded condition
    ext4: memory leak on error in ext4_symlink()
    mm/Kconfig: NEED_BOUNCE_POOL: clean-up condition
    ext4: Improve ext4 Kconfig test
    block: Remove forced page bouncing under IO
    fs: Remove ext3 filesystem driver
    doc: Update doc about journalling layer
    jfs: Handle error from dquot_initialize()
    reiserfs: Handle error from dquot_initialize()
    ocfs2: Handle error from dquot_initialize()
    ext4: Handle error from dquot_initialize()
    ext2: Handle error from dquot_initalize()
    quota: Propagate error from ->acquire_dquot()

    Linus Torvalds
     

03 Sep, 2015

2 commits

  • Pull SG updates from Jens Axboe:
    "This contains a set of scatter-gather related changes/fixes for 4.3:

    - Add support for limited chaining of sg tables even for
    architectures that do not set ARCH_HAS_SG_CHAIN. From Christoph.

    - Add sg chain support to target_rd. From Christoph.

    - Fixup open coded sg->page_link in crypto/omap-sham. From
    Christoph.

    - Fixup open coded crypto ->page_link manipulation. From Dan.

    - Also from Dan, automated fixup of manual sg_unmark_end()
    manipulations.

    - Also from Dan, automated fixup of open coded sg_phys()
    implementations.

    - From Robert Jarzmik, addition of an sg table splitting helper that
    drivers can use"

    * 'for-4.3/sg' of git://git.kernel.dk/linux-block:
    lib: scatterlist: add sg splitting function
    scatterlist: use sg_phys()
    crypto/omap-sham: remove an open coded access to ->page_link
    scatterlist: remove open coded sg_unmark_end instances
    crypto: replace scatterwalk_sg_chain with sg_chain
    target/rd: always chain S/G list
    scatterlist: allow limited chaining without ARCH_HAS_SG_CHAIN

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:
    "This first core part of the block IO changes contains:

    - Cleanup of the bio IO error signaling from Christoph. We used to
    rely on the uptodate bit and passing around of an error, now we
    store the error in the bio itself.

    - Improvement of the above from myself, by shrinking the bio size
    down again to fit in two cachelines on x86-64.

    - Revert of the max_hw_sectors cap removal from an earlier release,
    from Jeff Moyer. The removal caused performance regressions in
    various tests, so reinstate the limit and bump it to a more
    reasonable size instead.

    - Make /sys/block/<dev>/queue/discard_max_bytes writeable, by me.
    Most devices have huge trim limits, which can cause nasty latencies
    when deleting files. Enable the admin to configure the size down.
    We will look into having a more sane default instead of UINT_MAX
    sectors.

    - Improvement of the SG gaps logic from Keith Busch.

    - Enable the block core to handle arbitrarily sized bios, which
    enables a nice simplification of bio_add_page() (which is an IO hot
    path). From Kent.

    - Improvements to the partition io stats accounting, making it
    faster. From Ming Lei.

    - Also from Ming Lei, a basic fixup for overflow of the sysfs pending
    file in blk-mq, as well as a fix for a blk-mq timeout race
    condition.

    - Ming Lin has been carrying Kent's above-mentioned patches forward
    for a while, and testing them. Ming also did a few fixes around
    that.

    - Sasha Levin found and fixed a use-after-free problem introduced by
    the bio->bi_error changes from Christoph.

    - Small blk cgroup cleanup from Viresh Kumar"

    * 'for-4.3/core' of git://git.kernel.dk/linux-block: (26 commits)
    blk: Fix bio_io_vec index when checking bvec gaps
    block: Replace SG_GAPS with new queue limits mask
    block: bump BLK_DEF_MAX_SECTORS to 2560
    Revert "block: remove artifical max_hw_sectors cap"
    blk-mq: fix race between timeout and freeing request
    blk-mq: fix buffer overflow when reading sysfs file of 'pending'
    Documentation: update notes in biovecs about arbitrarily sized bios
    block: remove bio_get_nr_vecs()
    fs: use helper bio_add_page() instead of open coding on bi_io_vec
    block: kill merge_bvec_fn() completely
    md/raid5: get rid of bio_fits_rdev()
    md/raid5: split bio for chunk_aligned_read
    block: remove split code in blkdev_issue_{discard,write_same}
    btrfs: remove bio splitting and merge_bvec_fn() calls
    bcache: remove driver private bio splitting code
    block: simplify bio_add_page()
    block: make generic_make_request handle arbitrarily sized bios
    blk-cgroup: Drop unlikely before IS_ERR(_OR_NULL)
    block: don't access bio->bi_error after bio_put()
    block: shrink struct bio down to 2 cache lines again
    ...

    Linus Torvalds
     

02 Sep, 2015

1 commit


20 Aug, 2015

1 commit

  • The SG_GAPS queue flag caused checks for bio vector alignment against
    PAGE_SIZE, but the device may have different constraints. This patch
    adds a queue limit so a driver with such constraints can set it to
    allow requests that would otherwise have been unnecessarily split.
    The new gaps check takes the request_queue as a parameter to simplify
    the logic around invoking this function.

    This new limit makes the queue flag redundant, so remove it and all
    of its usage. Device-mappers will inherit the correct settings through
    blk_stack_limits().
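
    For a rough illustration, here is a small standalone C sketch of a
    boundary-mask gap test between two adjacent segments (hypothetical
    names and simplified types; the in-kernel check differs in detail):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /*
         * Simplified sketch: a device that cannot merge segments across an
         * alignment boundary exposes a mask; two adjacent segments "gap"
         * when the first does not end on the boundary or the second does
         * not start on it.  Hypothetical names, not the kernel code.
         */
        struct seg {
            uint64_t offset;                 /* offset of the segment in its page */
            uint64_t len;                    /* segment length in bytes */
        };

        static bool segs_gap(uint64_t boundary_mask, const struct seg *prev,
                             const struct seg *next)
        {
            if (!boundary_mask)              /* device has no such constraint */
                return false;
            return ((prev->offset + prev->len) | next->offset) & boundary_mask;
        }

        int main(void)
        {
            struct seg a = { .offset = 0, .len = 4096 };
            struct seg b = { .offset = 512, .len = 512 };

            /* 4095 ~ a 4K boundary: b starts mid-boundary, so it gaps */
            printf("gap: %d\n", segs_gap(4095, &a, &b));
            return 0;
        }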

    Signed-off-by: Keith Busch
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Keith Busch
     

19 Aug, 2015

34 commits

  • cgroup is trying to make interface consistent across different
    controllers. For weight based resource control, the knob should have
    the range [1, 10000] and default to 100. This patch updates
    cfq-iosched so that the weight range conforms. The internal
    calculations have enough range and the widening of the weight range
    shouldn't cause any problem.

    * blkcg_policy->cpd_bind_fn() is added. If present, this is invoked
    when blkcg is attached to a hierarchy.

    * cfq_cpd_init() is updated to use the new default value on the
    unified hierarchy.

    * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
    apply the default config matching the hierarchy type.

    * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
    is moved into !CONFIG_CFQ_GROUP_IOSCHED block. cfq_cpd_bind() is
    now responsible for initializing the initial weights when blkcg is
    enabled.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg is gonna switch to cgroup common weight range as defined by
    CGROUP_WEIGHT_* on the unified hierarchy. In preparation, rename
    CFQ_WEIGHT_* constants to CFQ_WEIGHT_LEGACY_*.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The blkcg interface grew to be the biggest of all controllers and
    unfortunately the most inconsistent too. The interface files are
    inconsistent, with a number of close duplicates. Some files have
    recursive variants while others don't. There's a distinction between
    normal and leaf weights which isn't intuitive, and there are a lot of
    stat knobs which don't make much sense outside of debugging and expose
    too many implementation details to userland.

    In the unified hierarchy, everything is always hierarchical and
    internal nodes can't have tasks, which renders moot the two structural
    issues twisting the current interface. The interface has to be updated
    in a significant way anyway, and this is a good chance to revamp it as
    a whole. This patch implements the blkcg interface for the unified
    hierarchy.

    * (from a previous patch) blkcg is identified by "io" instead of
    "blkio" on the unified hierarchy. Given that the whole interface is
    updated anyway, the rename shouldn't carry noticeable conversion
    overhead.

    * The original interface, which consisted of 27 files, is replaced
    with the following three files.

    blkio.stat : per-blkcg stats
    blkio.weight : per-cgroup and per-cgroup-queue weight settings
    blkio.max : per-cgroup-queue bps and iops max limits

    Documentation/cgroups/unified-hierarchy.txt updated accordingly.

    v2: blkcg_policy->dfl_cftypes wasn't removed on
    blkcg_policy_unregister() corrupting the cftypes list. Fixed.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * Export blkg_dev_name()

    * Drop unnecessary @cft from __cfq_set_weight().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • tg_set_conf() largely consists of two parts: parsing and setting the
    new config, and the follow-up application and propagation. This patch
    separates out the latter part into tg_conf_updated(). This will be
    used to implement the interface for the unified hierarchy.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkg_conf_prep() expects input to be of the following form

    MAJ:MIN NUM

    and reads the NUM part into blkg_conf_ctx->v. This is quite
    restrictive and gets in the way in implementing blkcg interface for
    the unified hierarchy. This patch updates blkg_conf_prep() so that it
    expects

    MAJ:MIN BODY_STR

    where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced
    with ->body which is a char pointer pointing to the start of BODY_STR.
    Parsing of the body is moved to blkg_conf_prep()'s callers.

    To allow using, for example, strsep() on blkg_conf_ctx->body, it is a
    non-const pointer, and to accommodate that, const is dropped from
    @input too.

    This doesn't cause any behavior changes.
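
    For illustration, a standalone sketch of the new split (hypothetical
    names, not the kernel code): the generic helper only parses the
    "MAJ:MIN" prefix and hands back a pointer to the body string, which
    the caller parses itself, e.g. with strsep():

        #define _GNU_SOURCE
        #include <ctype.h>
        #include <stdio.h>
        #include <string.h>

        /* Parse "MAJ:MIN BODY"; BODY is left for the caller (sketch only). */
        static int conf_prep(char *input, unsigned int *major,
                             unsigned int *minor, char **body)
        {
            int n = 0;

            if (sscanf(input, "%u:%u%n", major, minor, &n) != 2)
                return -1;                   /* syntactic error */
            input += n;
            while (isspace((unsigned char)*input))
                input++;
            *body = input;                   /* arbitrary, caller-owned string */
            return 0;
        }

        int main(void)
        {
            char line[] = "8:16 riops=1000 wbps=1048576";
            unsigned int maj, min;
            char *body, *tok;

            if (conf_prep(line, &maj, &min, &body))
                return 1;
            printf("dev %u:%u\n", maj, min);
            while ((tok = strsep(&body, " ")) != NULL)   /* caller-side parsing */
                printf("token: %s\n", tok);
            return 0;
        }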

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg is about to grow an interface for the unified hierarchy. Add
    a legacy prefix to the existing cftypes.

    * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
    * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
    * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
    * blk-throttle.c:throtl_files -> throtl_legacy_files

    Pure renames. No functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkio interface has become messy over time and is currently the
    largest. In addition to the inconsistent naming scheme, it has
    multiple stat files which report more or less the same thing, a number
    of debug stat files which expose internal details which shouldn't have
    been part of the public interface in the first place, recursive and
    non-recursive stats and leaf and non-leaf knobs.

    Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
    don't make any sense on the unified hierarchy as only leaf cgroups can
    contain processes. cgroups is going through a major interface
    revision with the unified hierarchy involving significant fundamental
    usage changes and given that a significant portion of the interface
    doesn't make sense anymore, it's a good time to reorganize the
    interface.

    As the first step, this patch renames the externally visible subsystem
    name from "blkio" to "io". This is more concise, matches the other
    two major subsystem names, "cpu" and "memory", and is better suited as
    blkcg will be involved in anything writeback related too whether an
    actual block device is involved or not.

    As the subsystem legacy_name is set to "blkio", the only
    userland-visible change outside the unified hierarchy is that blkcg is
    reported as "io" instead of "blkio" in the subsystem initialization
    messages during boot. On the unified hierarchy, blkcg now appears as
    "io".

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg currently returns -EINVAL for most errors which can be pretty
    confusing given that the failure modes are quite varied. Update the
    error returns so that

    * -EINVAL only for syntactic errors.
    * -ERANGE if the value is out of range.
    * -ENODEV if the target device can't be found.
    * -EOPNOTSUPP if the policy is not enabled on the target device.
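
    A minimal sketch of how a configuration write could map failures onto
    these codes (hypothetical helper and types, not the kernel function):

        #include <errno.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct dev_cfg { bool exists; bool policy_enabled; };

        /* Sketch of the refined error mapping for a config write. */
        static int set_weight(struct dev_cfg *dev, long weight, bool parsed_ok)
        {
            if (!parsed_ok)
                return -EINVAL;              /* syntactic error in the input */
            if (weight < 1 || weight > 10000)
                return -ERANGE;              /* value out of range */
            if (!dev || !dev->exists)
                return -ENODEV;              /* target device can't be found */
            if (!dev->policy_enabled)
                return -EOPNOTSUPP;          /* policy not enabled on device */
            return 0;
        }

        int main(void)
        {
            struct dev_cfg dev = { .exists = true, .policy_enabled = false };

            printf("%d\n", set_weight(&dev, 500, true));   /* -EOPNOTSUPP */
            printf("%d\n", set_weight(&dev, 0, true));     /* -ERANGE */
            return 0;
        }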

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkg_to_cfqg() and blkcg_to_cfqgd() on a valid blkg with the policy
    enabled are guaranteed to return non-NULL and the counterpart in
    blk-throttle doesn't have these checks either. Remove the spurious
    NULL checks.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The recent percpu conversion of blkg_rwstat triggered the following
    warning in certain configurations.

    block/blk-cgroup.c:654:1: warning: the frame size of 1360 bytes is larger than 1024 bytes

    This is because blkg_rwstat now contains four percpu_counter which can
    be pretty big depending on debug options although it shouldn't be a
    problem in production configs. This patch removes one of the two
    local blkg_rwstat variables used by blkg_rwstat_recursive_sum() to
    reduce stack usage.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Reported-by: kbuild test robot
    Link: http://article.gmane.org/gmane.linux.kernel.cgroups/13835
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cfqg_stats->sectors is a blkg_stat which keeps track of the total
    number of sectors serviced; however, this can be trivially calculated
    from blkcg_gq->stat_bytes. The only thing necessary is adding up
    READs and WRITEs and then dividing by sector size.

    Remove cfqg_stats->sectors and make cfq print "sectors" and
    "sectors_recursive" from stat_bytes.

    While this is a bit more code, it removes duplicate stat allocations
    and updates and ensures that the reported stats stay in tune with each
    other.
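
    The derivation itself is trivial; roughly (hypothetical struct,
    assuming 512-byte sectors):

        #include <stdint.h>
        #include <stdio.h>

        struct rwstat_bytes {
            uint64_t read_bytes;
            uint64_t write_bytes;
        };

        /* Sketch: "sectors" derived from byte counters instead of being
         * counted separately. */
        static uint64_t stat_sectors(const struct rwstat_bytes *s)
        {
            /* READs plus WRITEs, divided by the 512-byte sector size */
            return (s->read_bytes + s->write_bytes) >> 9;
        }

        int main(void)
        {
            struct rwstat_bytes s = { .read_bytes = 1 << 20, .write_bytes = 512 };

            printf("sectors: %llu\n", (unsigned long long)stat_sectors(&s));
            return 0;
        }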

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, both cfq-iosched and blk-throttle keep track of
    io_service_bytes and io_serviced stats. While keeping track of them
    separately may be useful during development, it doesn't make much
    sense otherwise. Also, blk-throttle was counting bios as IOs while
    cfq-iosched was counting requests, which is more confusing than
    informative.

    This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
    removes the counterparts from cfq-iosched and blk-throttle and lets
    them print from the common blkg counters. The common counters are
    incremented during bio issue in blkcg_bio_issue_check().

    The outputs are still filtered by whether the policy has
    blkg_policy_data on a given blkg, so cfq's output won't show up if it
    has never been used for a given blkg. The only times when the outputs
    would differ significantly are when policies are attached on the fly
    or elevators are switched back and forth. Those are quite exceptional
    operations and I don't think they warrant keeping separate counters.

    v3: Update blkio-controller.txt accordingly.

    v2: Account IOs during bio issues instead of request completions so
    that bio-based drivers can be handled the same way.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkg_[rw]stat_recursive_sum() assume that the target
    counter is located in pd (blkg_policy_data); however, some counters
    are planned to be moved to blkg (blkcg_gq).

    This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
    blkg_policy pointers instead of pd. If policy is NULL, it indexes
    into blkg. If non-NULL, into the blkg's pd of the policy.

    The existing usages are updated to maintain the current behaviors.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg_[rw]stat are used as stat counters for blkcg policies. They
    aren't per-cpu by themselves, and blk-throttle makes them per-cpu by
    wrapping around them. This patch makes blkcg_[rw]stat per-cpu and
    drops the ad-hoc per-cpu wrapping in blk-throttle.

    * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
    percpu_counter. This makes syncp unnecessary as remote accesses are
    handled by percpu_counter itself.

    * blkg_[rw]stat_init() can now fail due to percpu allocation failure
    and thus are updated to return int.

    * percpu_counters need explicit freeing. blkg_[rw]stat_exit() added.

    * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
    and summing results are stored in ->aux_cnt[] instead.

    * Custom per-cpu stat implementation in blk-throttle is removed.

    This makes all blkcg stat counters per-cpu without complicating policy
    implementations.
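
    As a rough userspace sketch of the resulting shape (the kernel uses
    percpu_counter; all names here are hypothetical): allocation can
    fail, so init returns int, freeing is explicit, updates touch only a
    cpu-local slot, and reads sum the slots:

        #include <stdint.h>
        #include <stdio.h>
        #include <stdlib.h>

        #define NR_CPUS 4
        enum { STAT_READ, STAT_WRITE, STAT_NR };

        struct rwstat {
            int64_t *cpu_cnt[STAT_NR];       /* one slot per cpu per direction */
        };

        static int rwstat_init(struct rwstat *s)
        {
            for (int i = 0; i < STAT_NR; i++) {
                s->cpu_cnt[i] = calloc(NR_CPUS, sizeof(int64_t));
                if (!s->cpu_cnt[i]) {
                    while (i--)
                        free(s->cpu_cnt[i]);
                    return -1;               /* allocation failure propagates */
                }
            }
            return 0;
        }

        static void rwstat_exit(struct rwstat *s)
        {
            for (int i = 0; i < STAT_NR; i++)
                free(s->cpu_cnt[i]);         /* explicit freeing is required */
        }

        static void rwstat_add(struct rwstat *s, int cpu, int dir, int64_t val)
        {
            s->cpu_cnt[dir][cpu] += val;     /* hot path: cpu-local, no locking */
        }

        static int64_t rwstat_read(const struct rwstat *s, int dir)
        {
            int64_t sum = 0;

            for (int cpu = 0; cpu < NR_CPUS; cpu++)
                sum += s->cpu_cnt[dir][cpu]; /* cold path: sum all cpus */
            return sum;
        }

        int main(void)
        {
            struct rwstat s;

            if (rwstat_init(&s))
                return 1;
            rwstat_add(&s, 0, STAT_WRITE, 4096);
            rwstat_add(&s, 2, STAT_WRITE, 512);
            printf("writes: %lld bytes\n", (long long)rwstat_read(&s, STAT_WRITE));
            rwstat_exit(&s);
            return 0;
        }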

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cgroup stats are local to each cgroup and don't propagate to
    ancestors by default. When recursive stats are necessary, the sum is
    calculated over all the descendants. This initially was for backward
    compatibility to support both group-local and recursive stats, but
    this mode of operation makes general sense as stat updates are much
    hotter than reporting those stats.

    This however ends up losing recursive stats when a child is removed.
    To work around this, cfq-iosched adds its stats to its parent
    cfq_group->dead_stats which is summed up together when calculating
    recursive stats.

    It's planned that the core stats will be moved to blkcg_gq, so we want
    to move the mechanism for keeping track of the stats of dead children
    from cfq to blkcg core. This patch adds blkg_[rw]stat->aux_cnt which
    are atomic64_t's keeping track of auxiliary counts which are excluded
    when reading local counts but included for recursive.

    blkg_[rw]stat_merge() which were used by cfq to implement dead_stats
    are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of
    a dead cgroup to the aux counts of parent->stats instead of separate
    ->dead_stats.

    This will also help making blkg_[rw]stats per-cpu.
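
    A toy standalone sketch of the aux-count idea (hypothetical
    structures, not the kernel code): local counts stay per group, "aux"
    absorbs the counts of removed children, and only the recursive sum
    includes aux:

        #include <stdint.h>
        #include <stdio.h>

        struct grp {
            int64_t cnt;                     /* group-local count */
            int64_t aux_cnt;                 /* inherited from dead children */
            struct grp *parent;
            struct grp *child;               /* single child is enough here */
        };

        static void grp_offline(struct grp *g)
        {
            /* forward everything this group saw to the parent's aux count */
            if (g->parent)
                g->parent->aux_cnt += g->cnt + g->aux_cnt;
        }

        static int64_t recursive_sum(const struct grp *g)
        {
            int64_t sum = g->cnt + g->aux_cnt;

            if (g->child)
                sum += recursive_sum(g->child);
            return sum;
        }

        int main(void)
        {
            struct grp parent = { .cnt = 100 };
            struct grp child = { .cnt = 40, .parent = &parent };

            parent.child = &child;
            grp_offline(&child);             /* child goes away... */
            parent.child = NULL;
            /* ...but its 40 units still show up recursively, not locally */
            printf("local %lld recursive %lld\n",
                   (long long)parent.cnt, (long long)recursive_sum(&parent));
            return 0;
        }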

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkg (blkcg_gq) currently is created by blkcg policies invoking
    blkg_lookup_create() which ends up repeating about the same code in
    different policies. Theoretically, this can avoid the overhead of
    looking up and/or creating blkg's if blkcg is enabled but no policy is
    in use; however, the cost of blkg lookup / creation is very low
    especially if only the root blkcg is in use which is highly likely if
    no blkcg policy is in active use - it boils down to a single very
    predictable conditional and surrounding RCU protection.

    This patch consolidates blkg creation to a new function
    blkcg_bio_issue_check() which is called during bio issue from
    generic_make_request_checks(). blkcg_bio_issue_check() is now the
    only function which tries to create missing blkg's. The subsequent
    policy and request_list operations just perform blkg_lookup() and, if
    the blkg is missing, fall back to the root.

    * blk_get_rl() no longer tries to create blkg. It uses blkg_lookup()
    instead of blkg_lookup_create().

    * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
    read locked and blkg already looked up. Both throtl_lookup_tg() and
    throtl_lookup_create_tg() are dropped.

    * cfq is similarly updated. cfq_lookup_create_cfqg() is replaced with
    cfq_lookup_cfqg() which uses blkg_lookup().

    This consolidates blkg handling and avoids unnecessary blkg creation
    retries under memory pressure. In addition, this provides a common
    bio entry point into blkcg where things like common accounting can be
    performed.

    v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
    !CONFIG_BLK_DEV_THROTTLING.
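
    The control flow can be sketched roughly like this (hypothetical
    structures and a toy lookup table, not the kernel functions): only
    the issue-time check ever creates a group; later lookups fall back
    to the root on a miss:

        #include <stdio.h>
        #include <stdlib.h>

        #define MAX_GROUPS 8

        struct grp { long bytes; };

        static struct grp root;              /* slot 0 is the root group */
        static struct grp *groups[MAX_GROUPS] = { &root };

        static struct grp *lookup_grp(int id)
        {
            return (id >= 0 && id < MAX_GROUPS) ? groups[id] : NULL;
        }

        static struct grp *lookup_create_grp(int id)
        {
            struct grp *g = lookup_grp(id);

            if (!g && id > 0 && id < MAX_GROUPS)
                g = groups[id] = calloc(1, sizeof(*g));
            return g;
        }

        /* One entry point per bio: create if needed, then account. */
        static void issue_check(int cgroup_id, long bytes)
        {
            struct grp *g = lookup_create_grp(cgroup_id);

            if (!g)
                g = &root;                   /* creation failed: charge root */
            g->bytes += bytes;               /* common per-group accounting */
        }

        /* Later consumers only look up and fall back to the root. */
        static struct grp *get_grp(int cgroup_id)
        {
            struct grp *g = lookup_grp(cgroup_id);

            return g ? g : &root;
        }

        int main(void)
        {
            issue_check(3, 4096);
            printf("grp3: %ld bytes, unknown grp maps to root: %p\n",
                   get_grp(3)->bytes, (void *)get_grp(99));
            return 0;
        }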

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • If a queue is bypassing, all blkcg policies should become noops but
    blk-throttle wasn't. It only became noop if the queue was dying.
    While this wouldn't lead to an oops as falling back to the root blkg
    is safe in this case, this can be a bit surprising - a bypassing queue
    could still be applying throttle limits.

    Fix it by removing blk_queue_dying() test in throtl_lookup_create_tg()
    and testing blk_queue_bypass() in blk_throtl_bio() and bypassing
    before doing anything else.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, both throttle and cfq policies implement their own root
    blkg (blkcg_gq) lookup fast path. This patch moves root blkg
    optimization from throtl_lookup_tg() to __blkg_lookup(). cfq-iosched
    currently doesn't use blkg_lookup() but will be converted and drop the
    optimization too.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkg_lookup() checks whether the target queue is bypassing and, if
    not, calls __blkg_lookup() which first checks the lookup hint and then
    performs a radix tree walk. The operations up to hint checking are
    trivial and there are many users of this function. This patch inlines
    blkg_lookup() and the fast path part of __blkg_lookup(). The radix
    tree lookup and hint update are now in blkg_lookup_slowpath().

    This will help consolidate blkg handling by making it easier to move
    the root blkcg short-circuit into the inlined lookup fast path.
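
    A rough standalone sketch of the hint-plus-slowpath split
    (hypothetical table and names, not the kernel structures):

        #include <stddef.h>
        #include <stdio.h>

        struct entry { int key; const char *val; };

        static struct entry table[] = {
            { 1, "root" }, { 7, "webserver" }, { 9, "batch" },
        };
        static struct entry *hint;           /* last successful lookup */

        static struct entry *lookup_slowpath(int key)
        {
            for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
                if (table[i].key == key) {
                    hint = &table[i];        /* update hint for next time */
                    return hint;
                }
            }
            return NULL;
        }

        /* Inlined at call sites: only the cheap hint check lives here. */
        static inline struct entry *lookup(int key)
        {
            if (hint && hint->key == key)    /* fast path: hint hit, no walk */
                return hint;
            return lookup_slowpath(key);
        }

        int main(void)
        {
            printf("%s\n", lookup(7)->val);  /* slowpath, sets the hint */
            printf("%s\n", lookup(7)->val);  /* fast path */
            return 0;
        }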

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Each active policy has a cpd (blkcg_policy_data) on each blkcg. The
    cpd's were allocated by blkcg core and each policy could request to
    allocate extra space at the end by setting blkcg_policy->cpd_size
    larger than the size of cpd.

    This is a bit unusual but blkg (blkcg_gq) policy data used to be
    handled this way too so it made sense to be consistent; however, blkg
    policy data switched to alloc/free callbacks.

    This patch makes similar changes to cpd handling.
    blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size. As
    cpd allocation is now done from policy side, it can simply allocate a
    larger area which embeds cpd at the beginning.

    As ->cpd_alloc_fn() may be able to perform all necessary
    initializations, this patch makes ->cpd_init_fn() optional.
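
    The "allocate a larger area which embeds the generic data at the
    beginning" pattern can be sketched like this (hypothetical names,
    not the kernel types):

        #include <stddef.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct cpd {                         /* generic per-cgroup policy data */
            int plid;
        };

        struct my_policy_cpd {               /* policy-private, embeds cpd first */
            struct cpd cpd;
            unsigned int weight;
        };

        static struct cpd *my_cpd_alloc(void)
        {
            struct my_policy_cpd *p = calloc(1, sizeof(*p));

            if (!p)
                return NULL;
            p->weight = 100;                 /* static init can happen here too */
            return &p->cpd;                  /* core keeps only the base pointer */
        }

        static void my_cpd_free(struct cpd *cpd)
        {
            /* base is the first member, so map back to the whole object */
            struct my_policy_cpd *p = (struct my_policy_cpd *)
                ((char *)cpd - offsetof(struct my_policy_cpd, cpd));

            free(p);
        }

        int main(void)
        {
            struct cpd *cpd = my_cpd_alloc();

            if (!cpd)
                return 1;
            /* first member, same address, so the cast below is valid */
            printf("default weight %u\n", ((struct my_policy_cpd *)cpd)->weight);
            my_cpd_free(cpd);
            return 0;
        }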

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used
    for blkcg_policy_data.

    * Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of
    blkcg. This makes it consistent with blkg_policy_data methods and
    to-be-added cpd alloc/free methods.

    * blkcg_policy_data->blkcg and cpd_to_blkcg() added so that
    cpd_init_fn() can determine the associated blkcg from
    blkcg_policy_data.

    v2: blkcg_policy_data->blkcg initializations were missing. Added.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
    (blkg_policy_data) while the older ones use blkg (blkcg_gq). Using
    blkg doesn't make sense for ->pd_alloc_fn(), pd can always be mapped
    to blkg after allocation, and given that these are policy-specific
    methods, it makes sense to converge on pd.

    This patch makes all methods deal with pd instead of blkg. Most
    conversions are trivial. In blk-cgroup.c, a couple method invocation
    sites now test whether pd exists instead of policy state for
    consistency. This shouldn't cause any behavioral differences.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the recent addition of alloc and free methods, things became
    messier. This patch reorganizes them according to the following.

    * ->pd_alloc_fn()

    Responsible for allocation and static initializations - the ones
    which can be done independent of where the pd might be attached.

    * ->pd_init_fn()

    Initializations which require the knowledge of where the pd is
    attached.

    * ->pd_free_fn()

    The counterpart of pd_alloc_fn(). Static de-init and freeing.

    This leaves ->pd_exit_fn() without any users. Removed.

    While at it, collapse the one-liner function throtl_pd_exit(), which
    has only one user, into its user.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Because the percpu allocator couldn't do non-blocking allocations,
    blk-throttle was forced to implement an ad-hoc asynchronous allocation
    mechanism for its percpu stats for cases where blkg's (blkcg_gq's) are
    allocated from an IO path without sleepable context.

    Now that the percpu allocator can handle gfp_mask and blkg_policy_data
    alloc / free are handled by policy methods, the ad-hoc asynchronous
    allocation mechanism can be replaced with direct allocation from
    tg_stats_alloc_fn(). Rip it out.

    This ensures that an active throtl_grp always has valid non-NULL
    ->stats_cpu. Remove checks on it.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • A blkg (blkcg_gq) represents the relationship between a cgroup and
    request_queue. Each active policy has a pd (blkg_policy_data) on each
    blkg. The pd's were allocated by blkcg core and each policy could
    request to allocate extra space at the end by setting
    blkcg_policy->pd_size larger than the size of pd.

    This is a bit unusual but was done this way mostly to simplify error
    handling and all the existing use cases could be handled this way;
    however, this is becoming too restrictive now that percpu memory can
    be allocated without blocking.

    This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
    and pd_free_fn() - which are used to allocate and release pd for a
    given policy. As pd allocation is now done from policy side, it can
    simply allocate a larger area which embeds pd at the beginning. This
    change makes ->pd_size pointless. Removed.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkg_create() allows NULL ->pd_init_fn() but blkcg_activate_policy()
    doesn't. As both in-kernel policies implement ->pd_init_fn, it
    currently doesn't break anything. Update blkcg_activate_policy() so
    that its behavior is consistent with blkg_create().

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • When a policy gets activated, it needs to allocate and install its
    policy data on all existing blkg's (blkcg_gq's). Because blkg
    iteration is protected by a spinlock, it currently counts the total
    number of blkg's in the system, allocates the matching number of
    policy data on a list and installs them during a single iteration.

    This can be simplified by using speculative GFP_NOWAIT allocations
    while iterating and falling back to a preallocated policy data on
    failure. If the preallocated one has already been consumed, it
    releases the lock, preallocate with GFP_KERNEL and then restarts the
    iteration. This can be a bit more expensive than before but policy
    activation is a very cold path and shouldn't matter.
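
    The shape of that loop, as a rough userspace sketch (the lock and the
    sometimes-failing allocator are simulated; hypothetical names
    throughout):

        #include <stdio.h>
        #include <stdlib.h>

        #define N 5

        static int items_done[N];            /* which items already have data */

        static void *alloc_nowait(void)      /* pretend GFP_NOWAIT: fails often */
        {
            static int flip;

            return (flip++ & 1) ? NULL : malloc(16);
        }

        static void activate(void)
        {
            void *prealloc = NULL;
        retry:
            if (!prealloc)
                prealloc = malloc(16);       /* blocking alloc, outside the lock */

            /* lock(); -- from here on we must not block */
            for (int i = 0; i < N; i++) {
                void *pd;

                if (items_done[i])
                    continue;                /* handled on an earlier pass */
                pd = alloc_nowait();
                if (!pd) {
                    pd = prealloc;           /* consume the preallocated object */
                    prealloc = NULL;
                }
                if (!pd) {
                    /* unlock(); */
                    goto retry;              /* refill prealloc and rescan */
                }
                items_done[i] = 1;
                free(pd);                    /* a real version would install it */
            }
            /* unlock(); */
            free(prealloc);
        }

        int main(void)
        {
            activate();
            printf("all %d items activated\n", N);
            return 0;
        }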

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg_css_alloc() bypasses policy data allocation and blkcg_css_free()
    bypasses policy data and blkcg freeing for blkcg_root. There's no
    reason to treat policy data any differently for blkcg_root. If the
    root css gets allocated after policies are registered, the policy
    registration path will add policy data; otherwise, the alloc path
    will. The free path is never invoked for root csses.

    This patch removes the unnecessary special handling of blkcg_root from
    css_alloc/free paths.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • When blkcg_init_queue() fails midway after creating a new blkg, it
    performs kfree() directly; however, this doesn't free the policy data
    areas. Make it use blkg_free() instead. In turn, blkg_free() is
    updated to handle the root request_list special case.

    While this fixes a possible memory leak, it's on an unlikely failure
    path of an already cold path and the size leaked per occurrence is
    minuscule too. I don't think it needs to be tagged for -stable.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Up until now, all async IOs were queued to async queues which are
    shared across the whole request_queue, which means that blkcg resource
    control is completely void on async IOs including all writeback IOs.
    It was done this way because writeback didn't support cgroups and
    there was no way of telling which writeback IO belonged to which
    cgroup; however, writeback recently became cgroup aware and writeback
    bio's are sent down properly tagged with the blkcg's to charge them
    against.

    This patch makes async cfq_queues per-cfq_cgroup instead of
    per-cfq_data so that each async IO is charged to the blkcg that it was
    tagged for instead of unconditionally attributing it to root.

    * cfq_data->async_cfqq and ->async_idle_cfqq are moved to cfq_group
    and alloc / destroy paths are updated accordingly.

    * cfq_link_cfqq_cfqg() no longer overrides @cfqg to root for async
    queues.

    * check_blkcg_changed() now also invalidates async queues as they no
    longer stay the same across cgroups.

    After this patch, cfq's proportional IO control through blkio.weight
    works correctly when cgroup writeback is in use.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jeff Moyer
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cfq_find_alloc_queue() checks whether a queue actually needs to be
    allocated, which is unnecessary as its sole caller, cfq_get_queue(),
    only calls it if so. Also, the oom queue fallback logic is scattered
    between cfq_get_queue() and cfq_find_alloc_queue(). There really
    isn't much going on in the latter and things can be made simpler by
    folding it into cfq_get_queue().

    This patch collapses cfq_find_alloc_queue() into cfq_get_queue(). The
    change is fairly straight-forward with one exception - async_cfqq is
    now initialized to NULL and the "!is_sync" test in the last if
    conditional is replaced with "async_cfqq" test. This is because gcc
    (5.1.1) gets confused for some reason and warns that async_cfqq may be
    used uninitialized otherwise. Oh well, the code isn't necessarily
    worse this way.

    This patch doesn't cause any functional difference.

    v2: Updated to reflect GFP_ATOMIC -> GFP_NOWAIT.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jeff Moyer
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • This is necessary for making async cfq_queues per-cfq_group instead
    of per-cfq_data. While this change makes cfq_get_queue() perform RCU
    locking and look up cfq_group even when it reuses async queue, the
    extra overhead is extremely unlikely to be noticeable given that this
    is already sitting behind cic->cfqq[] cache and the overall cost of
    cfq operation.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jeff Moyer
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Even when allocations fail, cfq_find_alloc_queue() always returns a
    valid cfq_queue by falling back to the oom cfq_queue. As such, there
    isn't much point in taking @gfp_mask and trying "harder" if __GFP_WAIT
    is set. GFP_NOWAIT allocations don't fail often and even when they do
    the degraded behavior is acceptable and temporary.

    After all, the only reason get_request(), which ultimately determines
    the gfp_mask, cares about __GFP_WAIT is to guarantee request
    allocation, assuming IO forward progress, for callers which are
    willing to wait. There's no reason for cfq_find_alloc_queue() to
    behave differently on __GFP_WAIT when it already has a fallback
    mechanism.

    Remove @gfp_mask from cfq_find_alloc_queue() and propagate the changes
    to its callers. This simplifies the function quite a bit and will
    help making async queues per-cfq_group.

    v2: Updated to reflect GFP_ATOMIC -> GFP_NOWAIT.
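
    The "always return something valid" pattern behind that fallback can
    be sketched as follows (hypothetical names, not the cfq code):

        #include <stdio.h>
        #include <stdlib.h>

        struct queue {
            const char *name;
            int is_oom;
        };

        /* shared, statically allocated fallback used when allocation fails */
        static struct queue oom_queue = { .name = "oom", .is_oom = 1 };

        static struct queue *get_queue(const char *name)
        {
            struct queue *q = calloc(1, sizeof(*q));    /* think GFP_NOWAIT */

            if (!q)
                return &oom_queue;           /* degraded but valid fallback */
            q->name = name;
            return q;
        }

        static void put_queue(struct queue *q)
        {
            if (!q->is_oom)                  /* never free the shared fallback */
                free(q);
        }

        int main(void)
        {
            struct queue *q = get_queue("sync");

            printf("%s\n", q->name);
            put_queue(q);
            return 0;
        }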

    Signed-off-by: Tejun Heo
    Reviewed-by: Jeff Moyer
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo