30 Sep, 2016

1 commit

  • Unlocking a mutex twice is wrong. Hence modify blkcg_policy_register()
    such that blkcg_pol_mutex is unlocked exactly once if cpd == NULL. This
    patch prevents smatch from reporting the following error:

    block/blk-cgroup.c:1378: blkcg_policy_register() error: double unlock 'mutex:&blkcg_pol_mutex'
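
    A minimal sketch of the corrected error path (simplified; the label
    name is hypothetical, not necessarily the one in blk-cgroup.c):

    int blkcg_policy_register(struct blkcg_policy *pol)
    {
        struct blkcg *blkcg;
        int ret;

        mutex_lock(&blkcg_pol_register_mutex);
        mutex_lock(&blkcg_pol_mutex);
        ...
        list_for_each_entry(blkcg, &all_blkcgs, all_blkcgs_node) {
            struct blkcg_policy_data *cpd;

            cpd = pol->cpd_alloc_fn(GFP_KERNEL);
            if (!cpd) {
                ret = -ENOMEM;
                goto err_free_cpds;  /* no unlock here; done once below */
            }
            ...
        }
        ...
    err_free_cpds:
        ...
        mutex_unlock(&blkcg_pol_mutex);      /* unlocked exactly once */
        mutex_unlock(&blkcg_pol_register_mutex);
        return ret;
    }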

    Fixes: 06b285bd1125 ("blkcg: fix blkcg_policy_data allocation bug")
    Signed-off-by: Bart Van Assche
    Cc: Tejun Heo
    Cc: stable@vger.kernel.org # v4.2+
    Signed-off-by: Tejun Heo

    Bart Van Assche
     

10 Feb, 2016

1 commit

  • get_disk() and get_gendisk() calls have a non-obvious side effect: they
    increase the reference count on the disk owner module.

    The following is the correct sequence for getting a disk reference and
    putting it:

    disk = get_gendisk(...);

    /* use disk */

    owner = disk->fops->owner;
    put_disk(disk);
    module_put(owner);

    fs/block_dev.c is aware of this required module_put() call, but e.g.
    blkg_conf_finish(), which is located in block/blk-cgroup.c, does not put
    a module reference. To see the leak in action, the cgroup throttle
    config can be used. In the following script I remove throttling for
    /dev/ram0 (actually a no-op, because throttling was never set for this
    device):

    # lsmod | grep brd
    brd 5175 0
    # i=100; while [ $i -gt 0 ]; do echo "1:0 0" > \
    /sys/fs/cgroup/blkio/blkio.throttle.read_bps_device; i=$(($i - 1)); \
    done
    # lsmod | grep brd
    brd 5175 100

    Now the brd module has 100 references.

    The issue is fixed by calling module_put() right after put_disk().
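
    A sketch of the fixed pattern in blkg_conf_finish() (simplified; based
    on the sequence above):

    void blkg_conf_finish(struct blkg_conf_ctx *ctx)
    {
        struct module *owner;

        spin_unlock_irq(ctx->disk->queue->queue_lock);
        rcu_read_unlock();

        owner = ctx->disk->fops->owner;
        put_disk(ctx->disk);
        module_put(owner);  /* balances the ref taken by get_gendisk() */
    }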

    Signed-off-by: Roman Pen
    Cc: Gi-Oh Kim
    Cc: Tejun Heo
    Cc: Jens Axboe
    Cc: linux-block@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Jens Axboe

    Roman Pen
     

03 Dec, 2015

1 commit

  • Consider the following v2 hierarchy.

    P0 (+memory) --- P1 (-memory) --- A
                                   \- B

    P0 has memory enabled in its subtree_control while P1 doesn't. If
    both A and B contain processes, they would belong to the memory css of
    P1. Now if memory is enabled on P1's subtree_control, memory csses
    should be created on both A and B, and A's processes should be moved
    to the former and B's processes to the latter. IOW, enabling
    controllers can cause atomic migrations into different csses.

    The core cgroup migration logic has been updated accordingly but the
    controller migration methods haven't and still assume that all tasks
    migrate to a single target css; furthermore, the methods were fed the
    css in which subtree_control was updated which is the parent of the
    target csses. pids controller depends on the migration methods to
    move charges and this made the controller attribute charges to the
    wrong csses often triggering the following warning by driving a
    counter negative.

    WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
    Modules linked in:
    CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
    ...
    ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
    ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
    ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
    Call Trace:
    [] dump_stack+0x4e/0x82
    [] warn_slowpath_common+0x82/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] pids_cancel.constprop.6+0x31/0x40
    [] pids_can_attach+0x6d/0xf0
    [] cgroup_taskset_migrate+0x6c/0x330
    [] cgroup_migrate+0xf5/0x190
    [] cgroup_attach_task+0x176/0x200
    [] __cgroup_procs_write+0x2ad/0x460
    [] cgroup_procs_write+0x14/0x20
    [] cgroup_file_write+0x35/0x1c0
    [] kernfs_fop_write+0x141/0x190
    [] __vfs_write+0x28/0xe0
    [] vfs_write+0xac/0x1a0
    [] SyS_write+0x49/0xb0
    [] entry_SYSCALL_64_fastpath+0x12/0x76

    This patch fixes the bug by removing the @css parameter from the three
    migration methods, ->can_attach(), ->cancel_attach() and ->attach(),
    and by updating the cgroup_taskset iteration helpers to also return
    the destination css in addition to the task being migrated. All
    controllers are updated accordingly.
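
    With the updated helpers, a controller method walks the taskset
    together with each task's destination css. A rough sketch of the pids
    usage (simplified; limit checking and error handling elided):

    static int pids_can_attach(struct cgroup_taskset *tset)
    {
        struct task_struct *task;
        struct cgroup_subsys_state *dst_css;

        cgroup_taskset_for_each(task, dst_css, tset) {
            /* charge the css each task actually migrates to */
            struct pids_cgroup *pids = css_pids(dst_css);
            struct pids_cgroup *old_pids =
                css_pids(task_css(task, pids_cgrp_id));

            pids_charge(pids, 1);
            pids_uncharge(old_pids, 1);
        }
        return 0;
    }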

    * Controllers which don't care whether there are one or multiple
    target csses can be converted trivially. cpu, io, freezer, perf,
    netclassid and netprio fall in this category.

    * cpuset's current implementation assumes that there's a single source
    and destination and thus doesn't support the v2 hierarchy to begin
    with. The only change made by this patchset is how that single
    destination css is obtained.

    * The memory migration path already doesn't do anything on v2. How
    the single destination css is obtained is updated and the prep stage
    of mem_cgroup_can_attach() is reordered to accommodate the change.

    * pids is the only controller which was affected by this bug. It now
    correctly handles multi-destination migrations and no longer causes
    counter underflow from incorrect accounting.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Daniel Wagner
    Cc: Aleksa Sarai

    Tejun Heo
     

06 Nov, 2015

1 commit

  • Pull cgroup updates from Tejun Heo:
    "The cgroup core saw several significant updates this cycle:

    - percpu_rwsem for threadgroup locking is reinstated. This was
    temporarily dropped due to down_write latency issues. Oleg's
    rework of percpu_rwsem which is scheduled to be merged in this
    merge window resolves the issue.

    - On the v2 hierarchy, when controllers are enabled and disabled, all
    operations are atomic and can fail and revert cleanly. This allows
    ->can_attach() failure which is necessary for cpu RT slices.

    - Tasks now stay associated with the original cgroups after exit
    until released. This allows tracking resources held by zombies
    (e.g. pids) and makes it easy to find out where zombies came from
    on the v2 hierarchy. The pids controller was broken before these
    changes as zombies escaped the limits; unfortunately, updating this
    behavior required too many invasive changes and I don't think it's
    a good idea to backport them, so the pids controller on 4.3, the
    first version which included the pids controller, will stay broken
    at least until I'm sure about the cgroup core changes.

    - Optimization of a couple common tests using static_key"

    * 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (38 commits)
    cgroup: fix race condition around termination check in css_task_iter_next()
    blkcg: don't create "io.stat" on the root cgroup
    cgroup: drop cgroup__DEVEL__legacy_files_on_dfl
    cgroup: replace error handling in cgroup_init() with WARN_ON()s
    cgroup: add cgroup_subsys->free() method and use it to fix pids controller
    cgroup: keep zombies associated with their original cgroups
    cgroup: make css_set_rwsem a spinlock and rename it to css_set_lock
    cgroup: don't hold css_set_rwsem across css task iteration
    cgroup: reorganize css_task_iter functions
    cgroup: factor out css_set_move_task()
    cgroup: keep css_set and task lists in chronological order
    cgroup: make cgroup_destroy_locked() test cgroup_is_populated()
    cgroup: make css_sets pin the associated cgroups
    cgroup: relocate cgroup_[try]get/put()
    cgroup: move check_for_release() invocation
    cgroup: replace cgroup_has_tasks() with cgroup_is_populated()
    cgroup: make cgroup->nr_populated count the number of populated css_sets
    cgroup: remove an unused parameter from cgroup_task_migrate()
    cgroup: fix too early usage of static_branch_disable()
    cgroup: make cgroup_update_dfl_csses() migrate all target processes atomically
    ...

    Linus Torvalds
     

22 Oct, 2015

1 commit

  • The stat files on the root cgroup show stats for the whole system and
    usually don't contain any information which isn't available through
    the usual system monitoring mechanisms. Some controllers skip
    collecting these duplicate stats to optimize cases where cgroup isn't
    used and later try to emulate the result on demand.

    This leads to complexities and subtle differences in the information
    shown through different channels. This is entirely unnecessary, and
    cgroup v2 is dropping such duplicate stat files from all controllers.
    This patch removes "io.stat" from the root hierarchy.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc: Vivek Goyal

    Tejun Heo
     

20 Sep, 2015

1 commit

  • Pull block updates from Jens Axboe:
    "This is a bit bigger than it should be, but I could (did) not want to
    send it off last week due to both wanting extra testing, and expecting
    a fix for the bounce regression as well. In any case, this contains:

    - Fix for the blk-merge.c compilation warning on gcc 5.x from me.

    - A set of back/front SG gap merge fixes, from me and from Sagi.
    This ensures that we honor SG gapping for integrity payloads as
    well.

    - Two small fixes for null_blk from Matias, fixing a leak and a
    capacity propagation issue.

    - A blkcg fix from Tejun, fixing a NULL dereference.

    - A fast clone optimization from Ming, fixing a performance
    regression since the arbitrarily sized bio's were introduced.

    - Also from Ming, a regression fix for bouncing IOs"

    * 'for-linus' of git://git.kernel.dk/linux-block:
    block: fix bounce_end_io
    block: blk-merge: fast-clone bio when splitting rw bios
    block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg
    block: Copy a user iovec if it includes gaps
    block: Refuse adding appending a gapped integrity page to a bio
    block: Refuse request/bio merges with gaps in the integrity payload
    block: Check for gaps on front and back merges
    null_blk: fix wrong capacity when bs is not 512 bytes
    null_blk: fix memory leak on cleanup
    block: fix bogus compiler warnings in blk-merge.c

    Linus Torvalds
     

11 Sep, 2015

1 commit

  • While making the root blkg unconditional, ec13b1d6f0a0 ("blkcg: always
    create the blkcg_gq for the root blkcg") removed the part which clears
    q->root_blkg and ->root_rl.blkg during q exit. This leaves the two
    pointers dangling after blkg_destroy_all(). blk-throttle exit path
    performs blkg traversals and dereferences ->root_blkg and can lead to
    the following oops.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000558
    IP: [] __blkg_lookup+0x26/0x70
    ...
    task: ffff88001b4e2580 ti: ffff88001ac0c000 task.ti: ffff88001ac0c000
    RIP: 0010:[] [] __blkg_lookup+0x26/0x70
    ...
    Call Trace:
    [] blk_throtl_drain+0x5a/0x110
    [] blkcg_drain_queue+0x18/0x20
    [] __blk_drain_queue+0xc0/0x170
    [] blk_queue_bypass_start+0x61/0x80
    [] blkcg_deactivate_policy+0x39/0x100
    [] blk_throtl_exit+0x38/0x50
    [] blkcg_exit_queue+0x3e/0x50
    [] blk_release_queue+0x1e/0xc0
    ...

    While the bug is a straightforward use-after-free bug, it is tricky to
    reproduce because blkg release is RCU protected and the rest of the
    exit path usually finishes before the RCU grace period.

    This patch fixes the bug by updating blkg_destroy_all() to clear
    q->root_blkg and ->root_rl.blkg.
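
    A minimal sketch of the restored clearing (field names as in the
    description above; per-blkg locking elided):

    static void blkg_destroy_all(struct request_queue *q)
    {
        struct blkcg_gq *blkg, *n;

        list_for_each_entry_safe(blkg, n, &q->blkg_list, q_node)
            blkg_destroy(blkg);

        q->root_blkg = NULL;     /* no dangling pointers after queue exit */
        q->root_rl.blkg = NULL;
    }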

    Signed-off-by: Tejun Heo
    Reported-by: "Richard W.M. Jones"
    Reported-by: Josh Boyer
    Link: http://lkml.kernel.org/g/CA+5PVA5rzQ0s4723n5rHBcxQa9t0cW8BPPBekr_9aMRoWt2aYg@mail.gmail.com
    Fixes: ec13b1d6f0a0 ("blkcg: always create the blkcg_gq for the root blkcg")
    Cc: stable@vger.kernel.org # v4.2+
    Tested-by: Richard W.M. Jones
    Signed-off-by: Jens Axboe

    Tejun Heo
     

19 Aug, 2015

24 commits

  • cgroup is trying to make its interface consistent across different
    controllers. For weight based resource control, the knob should have
    the range [1, 10000] and default to 100. This patch updates
    cfq-iosched so that the weight range conforms. The internal
    calculations have enough range and the widening of the weight range
    shouldn't cause any problems. (The new range is sketched in C terms
    after the list below.)

    * blkcg_policy->cpd_bind_fn() is added. If present, this is invoked
    when blkcg is attached to a hierarchy.

    * cfq_cpd_init() is updated to use the new default value on the
    unified hierarchy.

    * cfq_cpd_bind() callback is implemented to clear per-blkg configs and
    apply the default config matching the hierarchy type.

    * cfqd->root_group->[leaf_]weight initialization in cfq_init_queue()
    is moved into !CONFIG_CFQ_GROUP_IOSCHED block. cfq_cpd_bind() is
    now responsible for initializing the initial weights when blkcg is
    enabled.
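
    A sketch of the v2 weight range in C terms (constant names assumed to
    follow the cgroup core's CGROUP_WEIGHT_* pattern):

    #define CGROUP_WEIGHT_MIN   1      /* minimum proportional weight */
    #define CGROUP_WEIGHT_DFL   100    /* default for new cgroups */
    #define CGROUP_WEIGHT_MAX   10000  /* maximum proportional weight */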

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The blkcg interface grew to be the biggest of all controllers and
    unfortunately also the most inconsistent. The interface files are
    inconsistent, with a number of close duplicates. Some files have
    recursive variants while others don't. There's a distinction between
    normal and leaf weights which isn't intuitive, and there are a lot of
    stat knobs which don't make much sense outside of debugging and expose
    too much implementation detail to userland.

    In the unified hierarchy, everything is always hierarchical and
    internal nodes can't have tasks, rendering moot the two structural
    issues twisting the current interface. The interface has to be
    updated significantly anyway and this is a good chance to revamp it
    as a whole. This patch implements the blkcg interface for the
    unified hierarchy.

    * (from a previous patch) blkcg is identified by "io" instead of
    "blkio" on the unified hierarchy. Given that the whole interface is
    updated anyway, the rename shouldn't carry noticeable conversion
    overhead.

    * The original interface, which consisted of 27 files, is replaced
    with the following three files.

    io.stat : per-blkcg stats
    io.weight : per-cgroup and per-cgroup-queue weight settings
    io.max : per-cgroup-queue bps and iops max limits

    Documentation/cgroups/unified-hierarchy.txt updated accordingly.

    v2: blkcg_policy->dfl_cftypes wasn't removed on
    blkcg_policy_unregister() corrupting the cftypes list. Fixed.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * Export blkg_dev_name()

    * Drop unnecessary @cft from __cfq_set_weight().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkg_conf_prep() expects input to be of the following form

    MAJ:MIN NUM

    and reads the NUM part into blkg_conf_ctx->v. This is quite
    restrictive and gets in the way of implementing the blkcg interface
    for the unified hierarchy. This patch updates blkg_conf_prep() so
    that it expects

    MAJ:MIN BODY_STR

    where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced
    with ->body which is a char pointer pointing to the start of BODY_STR.
    Parsing of the body is moved to blkg_conf_prep()'s callers.

    To allow using, for example, strsep() on blkg_conf_ctx->body, it is a
    non-const pointer, and to accommodate that, const is dropped from
    @input too.

    This doesn't cause any behavior changes.
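
    A sketch of a caller parsing the body after this change (function and
    variable names are illustrative, not exact):

    static ssize_t tg_set_conf(struct kernfs_open_file *of, char *buf,
                               size_t nbytes, loff_t off)
    {
        struct blkcg *blkcg = css_to_blkcg(of_css(of));
        struct blkg_conf_ctx ctx;
        u64 v;
        int ret;

        ret = blkg_conf_prep(blkcg, &blkcg_policy_throtl, buf, &ctx);
        if (ret)
            return ret;

        /* parsing BODY_STR is now the caller's job */
        ret = -EINVAL;
        if (sscanf(ctx.body, "%llu", &v) != 1)
            goto out_finish;
        /* ... apply @v to the blkg's config ... */
        ret = 0;
    out_finish:
        blkg_conf_finish(&ctx);
        return ret ?: nbytes;
    }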

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg is about to grow an interface for the unified hierarchy. Add
    "legacy" to the existing cftype names.

    * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
    * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
    * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
    * blk-throttle.c:throtl_files -> throtl_legacy_files

    Pure renames. No functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkio interface has become messy over time and is currently the
    largest. In addition to the inconsistent naming scheme, it has
    multiple stat files which report more or less the same thing, a number
    of debug stat files which expose internal details which shouldn't have
    been part of the public interface in the first place, recursive and
    non-recursive stats and leaf and non-leaf knobs.

    Both recursive vs. non-recursive and leaf vs. non-leaf distinctions
    don't make any sense on the unified hierarchy as only leaf cgroups can
    contain processes. cgroups is going through a major interface
    revision with the unified hierarchy involving significant fundamental
    usage changes and given that a significant portion of the interface
    doesn't make sense anymore, it's a good time to reorganize the
    interface.

    As the first step, this patch renames the externally visible subsystem
    name from "blkio" to "io". This is more concise, matches the other
    two major subsystem names, "cpu" and "memory", and is better suited as
    blkcg will be involved in anything writeback related too, whether an
    actual block device is involved or not.

    As the subsystem legacy_name is set to "blkio", the only userland
    visible change outside the unified hierarchy is that blkcg is reported
    as "io" instead of "blkio" in the subsystem initialization message
    during boot. On the unified hierarchy, blkcg now appears as "io".

    Signed-off-by: Tejun Heo
    Cc: Li Zefan
    Cc: Johannes Weiner
    Cc: cgroups@vger.kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg currently returns -EINVAL for most errors which can be pretty
    confusing given that the failure modes are quite varied. Update the
    error returns so that

    * -EINVAL only for syntactic errors.
    * -ERANGE if the value is out of range.
    * -ENODEV if the target device can't be found.
    * -EOPNOTSUPP if the policy is not enabled on the target device.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The recent percpu conversion of blkg_rwstat triggered the following
    warning in certain configurations.

    block/blk-cgroup.c:654:1: warning: the frame size of 1360 bytes is larger than 1024 bytes

    This is because blkg_rwstat now contains four percpu_counters, which
    can be pretty big depending on debug options, although it shouldn't be
    a problem in production configs. This patch removes one of the two
    local blkg_rwstat variables used by blkg_rwstat_recursive_sum() to
    reduce stack usage.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Reported-by: kbuild test robot
    Link: http://article.gmane.org/gmane.linux.kernel.cgroups/13835
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, both cfq-iosched and blk-throttle keep track of
    io_service_bytes and io_serviced stats. While keeping track of them
    separately may be useful during development, it doesn't make much
    sense otherwise. Also, blk-throttle was counting bios as IOs while
    cfq-iosched counted requests, which is more confusing than
    informative.

    This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
    removes the counterparts from cfq-iosched and blk-throttle and lets
    them print from the common blkg counters. The common counters are
    incremented during bio issue in blkcg_bio_issue_check().

    The outputs are still filtered by whether the policy has
    blkg_policy_data on a given blkg, so cfq's output won't show up if it
    has never been used for a given blkg. The only times when the outputs
    would differ significantly are when policies are attached on the fly
    or elevators are switched back and forth. Those are quite exceptional
    operations and I don't think they warrant keeping separate counters.
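
    A sketch of where the common counters are bumped (simplified from the
    description; assuming the helper names referenced above):

    static inline bool blkcg_bio_issue_check(struct request_queue *q,
                                             struct bio *bio)
    {
        struct blkcg_gq *blkg;
        bool throtl;
        ...
        throtl = blk_throtl_bio(q, blkg, bio);

        if (!throtl) {
            blkg = blkg ?: q->root_blkg;
            blkg_rwstat_add(&blkg->stat_bytes, bio->bi_rw,
                            bio->bi_iter.bi_size);
            blkg_rwstat_add(&blkg->stat_ios, bio->bi_rw, 1);
        }
        ...
        return !throtl;
    }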

    v3: Update blkio-controller.txt accordingly.

    v2: Account IOs during bio issues instead of request completions so
    that bio-based drivers can be handled the same way.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkg_[rw]stat_recursive_sum() assume that the target
    counter is located in pd (blkg_policy_data); however, some counters
    are planned to be moved to blkg (blkcg_gq).

    This patch updates blkg_[rw]stat_recursive_sum() to take blkg and
    blkcg_policy pointers instead of pd. If the policy is NULL, it indexes
    into blkg; if non-NULL, into the blkg's pd for that policy.

    The existing usages are updated to maintain the current behaviors.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkg_[rw]stat are used as stat counters for blkcg policies. They
    aren't per-cpu by themselves, and blk-throttle makes them per-cpu by
    wrapping around them. This patch makes blkg_[rw]stat per-cpu and
    drops the ad-hoc per-cpu wrapping in blk-throttle.

    * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
    percpu_counter. This makes syncp unnecessary as remote accesses are
    handled by percpu_counter itself.

    * blkg_[rw]stat_init() can now fail due to percpu allocation failure
    and thus are updated to return int.

    * percpu_counters need explicit freeing. blkg_[rw]stat_exit() added.

    * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
    and summing results are stored in ->aux_cnt[] instead.

    * Custom per-cpu stat implementation in blk-throttle is removed.

    This makes all blkcg stat counters per-cpu without complicating policy
    implementations.
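
    The resulting counter layout and init helper, roughly (a sketch of the
    scheme described in the bullets above):

    struct blkg_rwstat {
        struct percpu_counter  cpu_cnt[BLKG_RWSTAT_NR]; /* hot counters */
        atomic64_t             aux_cnt[BLKG_RWSTAT_NR]; /* aux sums */
    };

    /* allocation can fail now, hence the int return */
    static inline int blkg_rwstat_init(struct blkg_rwstat *rwstat, gfp_t gfp)
    {
        int i, ret;

        for (i = 0; i < BLKG_RWSTAT_NR; i++) {
            ret = percpu_counter_init(&rwstat->cpu_cnt[i], 0, gfp);
            if (ret) {
                while (--i >= 0)
                    percpu_counter_destroy(&rwstat->cpu_cnt[i]);
                return ret;
            }
            atomic64_set(&rwstat->aux_cnt[i], 0);
        }
        return 0;
    }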

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cgroup stats are local to each cgroup and don't propagate to
    ancestors by default. When recursive stats are necessary, the sum is
    calculated over all the descendants. This initially was for backward
    compatibility to support both group-local and recursive stats, but
    this mode of operation makes general sense as stat updates are much
    hotter than reporting those stats.

    This however ends up losing recursive stats when a child is removed.
    To work around this, cfq-iosched adds its stats to its parent
    cfq_group->dead_stats which is summed up together when calculating
    recursive stats.

    It's planned that the core stats will be moved to blkcg_gq, so we want
    to move the mechanism for keeping track of the stats of dead children
    from cfq to blkcg core. This patch adds blkg_[rw]stat->aux_cnt which
    are atomic64_t's keeping track of auxiliary counts which are excluded
    when reading local counts but included for recursive.

    blkg_[rw]stat_merge() which were used by cfq to implement dead_stats
    are replaced by blkg_[rw]stat_add_aux(), and cfq now forwards stats of
    a dead cgroup to the aux counts of parent->stats instead of separate
    ->dead_stats.

    This will also help making blkg_[rw]stats per-cpu.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkg (blkcg_gq) currently is created by blkcg policies invoking
    blkg_lookup_create() which ends up repeating about the same code in
    different policies. Theoretically, this can avoid the overhead of
    looking up and/or creating blkg's if blkcg is enabled but no policy is
    in use; however, the cost of blkg lookup / creation is very low,
    especially if only the root blkcg is in use, which is highly likely if
    no blkcg policy is in active use - it boils down to a single very
    predictable conditional and surrounding RCU protection.

    This patch consolidates blkg creation to a new function
    blkcg_bio_issue_check() which is called during bio issue from
    generic_make_request_checks(). blkcg_bio_issue_check() is now the
    only function which tries to create missing blkg's. The subsequent
    policy and request_list operations just perform blkg_lookup() and if
    missing falls back to the root.

    * blk_get_rl() no longer tries to create blkg. It uses blkg_lookup()
    instead of blkg_lookup_create().

    * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
    read locked and blkg already looked up. Both throtl_lookup_tg() and
    throtl_lookup_create_tg() are dropped.

    * cfq is similarly updated. cfq_lookup_create_cfqg() is replaced with
    cfq_lookup_cfqg() which uses blkg_lookup().

    This consolidates blkg handling and avoids unnecessary blkg creation
    retries under memory pressure. In addition, this provides a common
    bio entry point into blkcg where things like common accounting can be
    performed.

    v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
    !CONFIG_BLK_DEV_THROTTLING.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkg_lookup() checks whether the target queue is bypassing and, if
    not, calls __blkg_lookup() which first checks the lookup hint and then
    performs the radix tree walk. The operations up to hint checking are
    trivial and there are many users of this function. This patch inlines
    blkg_lookup() and the fast path part of __blkg_lookup(). The radix
    tree lookup and hint update are now in blkg_lookup_slowpath().

    This will help consolidating blkg handling by easing moving root blkcg
    short-circuit to inlined lookup fast path.
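
    A sketch of the inlined fast path (simplified):

    static inline struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
                                                 struct request_queue *q,
                                                 bool update_hint)
    {
        struct blkcg_gq *blkg;

        /* the hint check is cheap; the radix tree walk lives in the
         * out-of-line slowpath */
        blkg = rcu_dereference(blkcg->blkg_hint);
        if (blkg && blkg->q == q)
            return blkg;

        return blkg_lookup_slowpath(blkcg, q, update_hint);
    }

    static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg,
                                               struct request_queue *q)
    {
        if (unlikely(blk_queue_bypass(q)))
            return NULL;
        return __blkg_lookup(blkcg, q, false);
    }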

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Each active policy has a cpd (blkcg_policy_data) on each blkcg. The
    cpd's were allocated by blkcg core and each policy could request to
    allocate extra space at the end by setting blkcg_policy->cpd_size
    larger than the size of cpd.

    This is a bit unusual but blkg (blkcg_gq) policy data used to be
    handled this way too so it made sense to be consistent; however, blkg
    policy data switched to alloc/free callbacks.

    This patch makes similar changes to cpd handling.
    blkcg_policy->cpd_alloc/free_fn() are added to replace ->cpd_size. As
    cpd allocation is now done from policy side, it can simply allocate a
    larger area which embeds cpd at the beginning.

    As ->cpd_alloc_fn() may be able to perform all necessary
    initializations, this patch makes ->cpd_init_fn() optional.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * Rename blkcg->pd[] to blkcg->cpd[] so that cpd is consistently used
    for blkcg_policy_data.

    * Make blkcg_policy->cpd_init_fn() take blkcg_policy_data instead of
    blkcg. This makes it consistent with blkg_policy_data methods and
    to-be-added cpd alloc/free methods.

    * blkcg_policy_data->blkcg and cpd_to_blkcg() added so that
    cpd_init_fn() can determine the associated blkcg from
    blkcg_policy_data.

    v2: blkcg_policy_data->blkcg initializations were missing. Added.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
    (blkg_policy_data) while the older ones use blkg (blkcg_gq). Using
    blkg doesn't make sense for ->pd_alloc_fn(), pd can always be mapped
    back to blkg after allocation, and these are policy-specific methods,
    so it makes sense to converge on pd.

    This patch makes all methods deal with pd instead of blkg. Most
    conversions are trivial. In blk-cgroup.c, a couple method invocation
    sites now test whether pd exists instead of policy state for
    consistency. This shouldn't cause any behavioral differences.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the recent addition of alloc and free methods, things became
    messier. This patch reorganizes them according to the following.

    * ->pd_alloc_fn()

    Responsible for allocation and static initializations - the ones
    which can be done independent of where the pd might be attached.

    * ->pd_init_fn()

    Initializations which require the knowledge of where the pd is
    attached.

    * ->pd_free_fn()

    The counterpart of pd_alloc_fn(). Static de-init and freeing.

    This leaves ->pd_exit_fn() without any users. Removed.

    While at it, collapse the one-liner function throtl_pd_exit(), which
    has only one user, into its caller.
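
    The resulting split, as method pointers on the policy (a sketch;
    typedef names assumed to follow the pattern of the other methods):

    struct blkcg_policy {
        ...
        /* allocation plus location-independent initialization */
        blkcg_pol_alloc_pd_fn  *pd_alloc_fn;
        /* initialization that needs to know where the pd is attached */
        blkcg_pol_init_pd_fn   *pd_init_fn;
        /* static de-init and freeing; counterpart of pd_alloc_fn */
        blkcg_pol_free_pd_fn   *pd_free_fn;
        ...
    };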

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • A blkg (blkcg_gq) represents the relationship between a cgroup and
    request_queue. Each active policy has a pd (blkg_policy_data) on each
    blkg. The pd's were allocated by blkcg core and each policy could
    request to allocate extra space at the end by setting
    blkcg_policy->pd_size larger than the size of pd.

    This is a bit unusual but was done this way mostly to simplify error
    handling and all the existing use cases could be handled this way;
    however, this is becoming too restrictive now that percpu memory can
    be allocated without blocking.

    This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
    and pd_free_fn() - which are used to allocate and release pd for a
    given policy. As pd allocation is now done from policy side, it can
    simply allocate a larger area which embeds pd at the beginning. This
    change makes ->pd_size pointless. Removed.
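
    A sketch of how a policy embeds pd under this scheme (modeled on
    blk-throttle; simplified):

    struct throtl_grp {
        /* must be the first member so pd maps back to the tg */
        struct blkg_policy_data pd;
        ...
    };

    static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
    {
        struct throtl_grp *tg = kzalloc_node(sizeof(*tg), gfp, node);

        return tg ? &tg->pd : NULL;
    }

    static void throtl_pd_free(struct blkg_policy_data *pd)
    {
        kfree(pd_to_tg(pd));
    }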

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkg_create() allows NULL ->pd_init_fn() but blkcg_activate_policy()
    doesn't. As both in-kernel policies implement ->pd_init_fn, it
    currently doesn't break anything. Update blkcg_activate_policy() so
    that its behavior is consistent with blkg_create().

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • When a policy gets activated, it needs to allocate and install its
    policy data on all existing blkg's (blkcg_gq's). Because blkg
    iteration is protected by a spinlock, it currently counts the total
    number of blkg's in the system, allocates the matching number of
    policy data on a list and installs them during a single iteration.

    This can be simplified by using speculative GFP_NOWAIT allocations
    while iterating and falling back to a preallocated policy data on
    failure. If the preallocated one has already been consumed, it
    releases the lock, preallocates with GFP_KERNEL and then restarts the
    iteration. This can be a bit more expensive than before, but policy
    activation is a very cold path and shouldn't matter.
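
    A rough sketch of the simplified activation loop (condensed; error
    handling elided):

    int blkcg_activate_policy(struct request_queue *q,
                              const struct blkcg_policy *pol)
    {
        struct blkg_policy_data *pd_prealloc = NULL;
        struct blkcg_gq *blkg;

    pd_prealloc:
        if (!pd_prealloc) {
            pd_prealloc = pol->pd_alloc_fn(GFP_KERNEL, q->node);
            if (!pd_prealloc)
                return -ENOMEM;
        }

        spin_lock_irq(q->queue_lock);
        list_for_each_entry(blkg, &q->blkg_list, q_node) {
            struct blkg_policy_data *pd;

            /* speculative non-blocking attempt first */
            pd = pol->pd_alloc_fn(GFP_NOWAIT, q->node);
            if (!pd)
                swap(pd, pd_prealloc);
            if (!pd) {
                /* prealloc consumed too; unlock and restart */
                spin_unlock_irq(q->queue_lock);
                goto pd_prealloc;
            }
            blkg->pd[pol->plid] = pd;
            ...
        }
        spin_unlock_irq(q->queue_lock);
        ...
        return 0;
    }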

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg_css_alloc() bypasses policy data allocation and blkcg_css_free()
    bypasses policy data and blkcg freeing for blkcg_root. There's no
    reason to treat policy data any differently for blkcg_root. If the
    root css gets allocated after policies are registered, the policy
    registration path will add policy data; otherwise, the alloc path
    will. The free path is never invoked for root csses.

    This patch removes the unnecessary special handling of blkcg_root from
    css_alloc/free paths.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • When blkcg_init_queue() fails midway after creating a new blkg, it
    performs kfree() directly; however, this doesn't free the policy data
    areas. Make it use blkg_free() instead. In turn, blkg_free() is
    updated to handle the root request_list special case.

    While this fixes a possible memory leak, it's on an unlikely failure
    path of an already cold path and the size leaked per occurrence is
    minuscule too. I don't think it needs to be tagged for -stable.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg performs several allocations to track IOs per cgroup and enforce
    resource control. Most of these allocations are performed lazily on
    demand in the IO path and thus can't involve the reclaim path.
    Currently, these allocations use GFP_ATOMIC; however, blkcg can
    gracefully deal with occasional failures of these allocations by
    punting IOs to the root cgroup, and there's no reason to reach into
    the emergency reserve.

    This patch replaces GFP_ATOMIC with GFP_NOWAIT for the following
    allocations.

    * bdi_writeback_congested and blkcg_gq allocations in blkg_create().

    * radix tree node allocations for blkcg->blkg_tree.

    * cfq_queue allocation on ioprio changes.

    Signed-off-by: Tejun Heo
    Suggested-and-Reviewed-by: Jeff Moyer
    Suggested-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

10 Jul, 2015

4 commits

  • e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg
    data") updated per-blkcg policy data to be dynamically allocated.
    When a policy is registered, its policy data isn't created. Instead,
    when the policy is activated on a queue, the policy data is allocated
    if there are blkg's (blkcg_gq's) attached to the given blkcg. This
    is buggy. Consider the following scenario.

    1. A blkcg is created. No blkg's attached yet.

    2. The policy is registered. No policy data is allocated.

    3. The policy is activated on a queue. As the above blkcg doesn't
    have any blkg's, it won't allocate the matching blkcg_policy_data.

    4. An IO is issued from the blkcg, a blkg is created, and the blkcg
    still doesn't have the matching policy data allocated.

    With cfq-iosched, this leads to an oops.

    The code also doesn't free policy data on policy unregistration,
    assuming that freeing all policy data on blkcg destruction takes care
    of it; however, this too is incorrect.

    1. A blkcg has policy data.

    2. The policy gets unregistered but the policy data remains.

    3. Another policy gets registered on the same slot.

    4. Later, the new policy tries to allocate policy data on the previous
    blkcg but the slot is already occupied and gets skipped. The
    policy ends up operating on the policy data of the previous policy.

    There's no reason to manage blkcg_policy_data lazily. The reason we
    do lazy allocation of blkg's is that the number of all possible blkg's
    is the product of cgroups and block devices, which can reach a
    surprising level. blkcg_policy_data is constrained by the number of
    cgroups and shouldn't be a problem.

    This patch makes blkcg_policy_data be allocated for all existing
    blkcg's on policy registration, freed on unregistration, and removes
    blkcg_policy_data handling from the policy [de]activation paths. This
    ensures that blkcg_policy_data is created and removed together with
    the policy it belongs to, and fixes the problems described above.

    Signed-off-by: Tejun Heo
    Fixes: e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data")
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add an all_blkcgs list which goes through blkcg->all_blkcgs_node and
    is protected by blkcg_pol_mutex. This will be used to fix the
    blkcg_policy_data allocation bug.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • An entry in blkcg_policy[] is stable while there are non-bypassing
    in-flight IOs on a request_queue which has the policy activated. This
    is why most derefs of blkcg_policy[] don't need explicit locking;
    however, blkcg_css_alloc() isn't invoked from the IO path, thus
    doesn't have this protection, and may race against policies being
    added and removed.

    Fix it by adding explicit blkcg_pol_mutex protection around
    blkcg_policy[] iteration in blkcg_css_alloc().

    Signed-off-by: Tejun Heo
    Fixes: e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data")
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg_pol_mutex primarily protects the blkcg_policy array. It also
    protects cgroup file type [un]registration during policy addition /
    removal. This puts blkcg_pol_mutex outside cgroup internal
    synchronization and in turn makes it impossible to grab from blkcg's
    cgroup methods, as that leads to a cyclic dependency.

    Another problematic dependency arising from this is through cgroup
    interface file deactivation. Removing a cftype requires removing all
    files of the type which in turn involves draining all on-going
    invocations of the file methods. This means that an interface file
    implementation can't grab blkcg_pol_mutex as draining can lead to AA
    deadlock.

    blkcg_reset_stats() is already in this situation. It currently
    trylocks blkcg_pol_mutex and then unwinds and retries the whole
    operation on failure, which is cumbersome at best. It has a lengthy
    comment explaining how cgroup internal synchronization is involved and
    expected to be updated but as explained above this doesn't need cgroup
    internal locking to deadlock. It's a self-contained AA deadlock.

    The described circular dependencies can be easily broken by moving
    cftype [un]registration out of blkcg_pol_mutex and protect them with
    an outer mutex. This patch introduces blkcg_pol_register_mutex which
    wraps entire policy [un]registration including cftype operations and
    shrinks blkcg_pol_mutex critical section. This also makes the trylock
    dancing in blkcg_reset_stats() unnecessary. Removed.

    This patch is necessary for the following blkcg_policy_data allocation
    bug fixes.
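
    A sketch of the resulting lock nesting in the registration path
    (simplified; cftype registration details elided):

    static DEFINE_MUTEX(blkcg_pol_register_mutex); /* outer */
    static DEFINE_MUTEX(blkcg_pol_mutex);          /* inner, array only */

    int blkcg_policy_register(struct blkcg_policy *pol)
    {
        mutex_lock(&blkcg_pol_register_mutex);
        mutex_lock(&blkcg_pol_mutex);
        /* ... reserve a slot in blkcg_policy[] ... */
        mutex_unlock(&blkcg_pol_mutex);

        /* cftype registration now happens here, outside blkcg_pol_mutex,
         * breaking the circular dependencies described above */
        ...

        mutex_unlock(&blkcg_pol_register_mutex);
        return 0;
    }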

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     

07 Jul, 2015

1 commit

  • Currently, per-blkcg data is freed each time a policy is deactivated,
    that is, also upon scheduler switch. However, when switching from a
    scheduler implementing a policy which requires per-blkcg data to
    another one, that same policy might be active on other devices, and
    therefore that same per-blkcg data could still be in use.
    This commit lets per-blkcg data be freed when the blkcg is freed
    instead of on policy deactivation.

    Signed-off-by: Arianna Avanzini
    Reported-and-tested-by: Michael Kaminsky
    Fixes: e48453c3 ("block, cgroup: implement policy-specific per-blkcg data")
    Signed-off-by: Jens Axboe

    Arianna Avanzini
     

26 Jun, 2015

1 commit

  • Pull cgroup writeback support from Jens Axboe:
    "This is the big pull request for adding cgroup writeback support.

    This code has been in development for a long time, and it has been
    simmering in for-next for a good chunk of this cycle too. This is one
    of those problems that has been talked about for at least half a
    decade, finally there's a solution and code to go with it.

    Also see last weeks writeup on LWN:

    http://lwn.net/Articles/648292/"

    * 'for-4.2/writeback' of git://git.kernel.dk/linux-block: (85 commits)
    writeback, blkio: add documentation for cgroup writeback support
    vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
    writeback: do foreign inode detection iff cgroup writeback is enabled
    v9fs: fix error handling in v9fs_session_init()
    bdi: fix wrong error return value in cgwb_create()
    buffer: remove unusued 'ret' variable
    writeback: disassociate inodes from dying bdi_writebacks
    writeback: implement foreign cgroup inode bdi_writeback switching
    writeback: add lockdep annotation to inode_to_wb()
    writeback: use unlocked_inode_to_wb transaction in inode_congested()
    writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
    writeback: implement [locked_]inode_to_wb_and_lock_list()
    writeback: implement foreign cgroup inode detection
    writeback: make writeback_control track the inode being written back
    writeback: relocate wb[_try]_get(), wb_put(), inode_{attach|detach}_wb()
    mm: vmscan: disable memcg direct reclaim stalling if cgroup writeback support is in use
    writeback: implement memcg writeback domain based throttling
    writeback: reset wb_domain->dirty_limit[_tstmp] when memcg domain size changes
    writeback: implement memcg wb_domain
    writeback: update wb_over_bg_thresh() to use wb_domain aware operations
    ...

    Linus Torvalds
     

07 Jun, 2015

1 commit

  • The block IO (blkio) controller enables the block layer to provide service
    guarantees in a hierarchical fashion. Specifically, service guarantees
    are provided by registered request-accounting policies. As of now, a
    proportional-share and a throttling policy are available. They are
    implemented, respectively, by the CFQ I/O scheduler and the blk-throttle
    subsystem. Unfortunately, the current implementation of the block IO
    controller is only halfway ready to allow new policies to be plugged
    in. This commit provides a solution to make the block IO controller
    fully ready to handle new policies.
    In what follows, we first describe briefly the current state, and then
    list the changes made by this commit.

    The throttling policy does not need any per-cgroup information to
    perform its task. In contrast, the proportional share policy uses,
    for each cgroup, both the weight assigned by the user to the cgroup
    and a set of dynamically-computed weights, one for each device.

    The first, user-defined weight is stored in the blkcg data structure: the
    block IO controller allocates a private blkcg data structure for each
    cgroup in the blkio cgroups hierarchy (regardless of which policy is active).
    In other words, the block IO controller internally mirrors the blkio cgroups
    with private blkcg data structures.

    On the other hand, for each cgroup and device, the corresponding
    dynamically-computed weight is maintained in a different way, as
    follows. For each device, the block IO controller keeps a private
    blkcg_gq structure for each cgroup in blkio. In other words, block IO
    also keeps one private mirror copy of the blkio cgroups hierarchy for
    each device, made of blkcg_gq structures.
    Each blkcg_gq structure keeps per-policy information in a generic array of
    dynamically-allocated 'dedicated' data structures, one for each registered
    policy (so currently the array contains two elements). To be inserted into the
    generic array, each dedicated data structure embeds a generic blkg_policy_data
    structure. Consider now the array contained in the blkcg_gq structure
    corresponding to a given pair of cgroup and device: one of the elements
    of the array contains the dedicated data structure for the proportional-share
    policy, and this dedicated data structure contains the dynamically-computed
    weight for that pair of cgroup and device.

    The generic strategy adopted for storing per-policy data in blkcg_gq
    structures is already capable of handling new policies, whereas the
    one adopted with blkcg structures is not, because per-policy data are
    hard-coded in the blkcg structures themselves (currently only data
    related to the proportional-share policy).

    This commit addresses the above issues through the following changes:

    * It generalizes blkcg structures so that per-policy data are stored
    in the same way as in blkcg_gq structures. Specifically, it lets the
    blkcg structure also store per-policy data in a generic array of
    dynamically-allocated dedicated data structures. We will refer to
    these data structures as blkcg dedicated data structures, to
    distinguish them from the dedicated data structures inserted in the
    generic arrays kept by blkcg_gq structures. To allow blkcg dedicated
    data structures to be inserted in the generic array inside a blkcg
    structure, this commit also introduces a new blkcg_policy_data
    structure, which is the equivalent of blkg_policy_data for blkcg
    dedicated data structures.

    * It adds to the blkcg_policy structure, i.e., to the descriptor of a
    policy, a cpd_size field and a cpd_init field, to be initialized by
    the policy with, respectively, the size of the blkcg dedicated data
    structures and the address of a constructor function for blkcg
    dedicated data structures.

    * It moves the CFQ-specific fields embedded in the blkcg data
    structure (i.e., the fields related to the proportional-share policy)
    into a new blkcg dedicated data structure called cfq_group_data.
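
    A sketch of the structures this introduces (simplified; layout as
    described above, member names illustrative):

    /* per-blkcg, per-policy data: the blkcg counterpart of
     * blkg_policy_data */
    struct blkcg_policy_data {
        int plid;                      /* index into blkcg_policy[] */
    };

    /* CFQ's blkcg dedicated data structure */
    struct cfq_group_data {
        /* must be the first member */
        struct blkcg_policy_data pd;
        unsigned int weight;           /* user-assigned weight */
        unsigned int leaf_weight;
    };

    struct blkcg_policy {
        ...
        size_t                  cpd_size;    /* size of the dedicated struct */
        blkcg_pol_init_cpd_fn  *cpd_init_fn; /* its constructor */
        ...
    };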

    Signed-off-by: Paolo Valente
    Signed-off-by: Arianna Avanzini
    Acked-by: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Arianna Avanzini