20 Sep, 2016

1 commit

  • Right now, if the slice has expired, we start a new slice. If a bio is
    queued, we keep extending the slice by the throtl_slice interval
    (100ms).

    This worked well as long as the pending timer function executed within
    a few milliseconds of its scheduled time. But with recent changes in
    the timer subsystem, the slack can be much longer, depending on the
    expiry time of the scheduled timer.

    commit 500462a9de65 ("timers: Switch to a non-cascading wheel")

    This means that by the time the timer function runs, the delay from
    the scheduled time can be more than 100ms. The current code then
    concludes that the existing slice has expired and a new one needs to
    be started. The new slice will be 100ms by default, which is not
    sufficient to meet the group's rate requirement given the bio size,
    so the bio is not dispatched and we start a new timer to wait. When
    that timer expires, the same process repeats and we wait again; this
    can easily become an infinite loop.

    Solve this issue by starting a new slice only if the throttle group is
    empty. If it is not empty, there should be an active slice in
    progress. Ideally it should not have expired but, given the slack, it
    is possible that it has.
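
    A minimal sketch of the dispatch-time check this implies, using
    blk-throttle's helper names (approximate, not the literal patch):

    if (throtl_slice_used(tg, rw) && !tg->service_queue.nr_queued[rw]) {
            /* group is empty: safe to start a fresh slice */
            throtl_start_new_slice(tg, rw);
    } else {
            /*
             * Bios are queued, so an active slice should exist. If the
             * timer fired late and the slice looks expired, extend it
             * rather than restarting, so accumulated credit isn't lost.
             */
            if (time_before(tg->slice_end[rw], jiffies + throtl_slice))
                    throtl_extend_slice(tg, rw, jiffies + throtl_slice);
    }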

    Reported-by: Hou Tao
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

08 Aug, 2016

1 commit

  • Since commit 63a4cc24867d, bio->bi_rw contains flags in the lower
    portion and the op code in the higher portions. This means that
    old code that relies on manually setting bi_rw is most likely
    going to be broken. Instead of letting that brokenness linger,
    rename the member to force old and out-of-tree code to break
    at compile time instead of at runtime.

    No intended functional changes in this commit.
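
    As an illustration of post-rename usage (a sketch, not from the
    commit itself; the member becomes bi_opf, with the op and flags
    accessed through the existing helpers):

    /* op code lives in the high bits; use the accessor */
    if (bio_op(bio) == REQ_OP_WRITE)
            count_write(bio);       /* count_write() is hypothetical */

    /* request flags live in the low bits */
    bio->bi_opf |= REQ_SYNC;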

    Signed-off-by: Jens Axboe

    Jens Axboe
     

18 Sep, 2015

1 commit

  • cgroup_on_dfl() tests whether the cgroup's root is the default
    hierarchy; however, an individual controller is only interested in
    whether the controller is attached to the default hierarchy and never
    tests a cgroup which doesn't belong to the hierarchy that the
    controller is attached to.

    This patch replaces cgroup_on_dfl() tests in controllers with faster
    static_key based cgroup_subsys_on_dfl(). This leaves cgroup core as
    the only user of cgroup_on_dfl() and the function is moved from the
    header file to cgroup.c.
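
    The conversion pattern, sketched (the io_cgrp_subsys name assumes the
    blkio -> io rename already in the tree):

    /* before: resolves the cgroup and tests its root every time */
    bool on_dfl = cgroup_on_dfl(blkcg->css.cgroup);

    /* after: static_key-backed, per-controller test */
    bool on_dfl = cgroup_subsys_on_dfl(io_cgrp_subsys);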

    Signed-off-by: Tejun Heo
    Acked-by: Zefan Li
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     

19 Aug, 2015

13 commits

  • The blkcg interface grew to be the biggest of all controllers and
    unfortunately also the most inconsistent. The interface files are
    inconsistent, with a number of close duplicates. Some files have
    recursive variants while others don't. There's a distinction between
    normal and leaf weights which isn't intuitive, and there are a lot of
    stat knobs which don't make much sense outside of debugging and expose
    too many implementation details to userland.

    In the unified hierarchy, everything is always hierarchical and
    internal nodes can't have tasks, rendering moot the two structural
    issues twisting the current interface. The interface has to be
    updated significantly anyway, and this is a good chance to revamp it
    as a whole. This patch implements the blkcg interface for the unified
    hierarchy.
    This patch implements blkcg interface for the unified hierarchy.

    * (from a previous patch) blkcg is identified by "io" instead of
    "blkio" on the unified hierarchy. Given that the whole interface is
    updated anyway, the rename shouldn't carry noticeable conversion
    overhead.

    * The original interface, consisting of 27 files, is replaced with the
    following three files.

    blkio.stat : per-blkcg stats
    blkio.weight : per-cgroup and per-cgroup-queue weight settings
    blkio.max : per-cgroup-queue bps and iops max limits

    Documentation/cgroups/unified-hierarchy.txt updated accordingly.

    v2: blkcg_policy->dfl_cftypes wasn't removed on
    blkcg_policy_unregister() corrupting the cftypes list. Fixed.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • tg_set_conf() largely consists of parsing and setting the new
    config, followed by application and propagation. This patch
    separates out the latter part into tg_conf_updated(). This will be
    used to implement the interface for the unified hierarchy.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, blkg_conf_prep() expects input to be of the following form

    MAJ:MIN NUM

    and reads the NUM part into blkg_conf_ctx->v. This is quite
    restrictive and gets in the way in implementing blkcg interface for
    the unified hierarchy. This patch updates blkg_conf_prep() so that it
    expects

    MAJ:MIN BODY_STR

    where BODY_STR is an arbitrary string. blkg_conf_ctx->v is replaced
    with ->body which is a char pointer pointing to the start of BODY_STR.
    Parsing of the body is moved to blkg_conf_prep()'s callers.

    To allow using, for example, strsep() on blkg_conf_ctx->body, it is a
    non-const pointer, and to accommodate that, const is dropped from
    @input too.

    This doesn't cause any behavior changes.
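
    A hedged sketch of what a caller-side body parse can now look like
    (illustrative only, not the exact tree code):

    char *body = ctx.body;          /* BODY_STR, just past "MAJ:MIN " */
    char *tok;

    while ((tok = strsep(&body, " "))) {
            if (!*tok)
                    continue;       /* skip repeated separators */
            /* parse "key=value" tokens, e.g. "rbps=1048576" */
    }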

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkcg is about to grow interface for the unified hierarchy. Add
    legacy to existing cftypes.

    * blkcg_policy->cftypes -> blkcg_policy->legacy_cftypes
    * blk-cgroup.c:blkcg_files -> blkcg_legacy_files
    * cfq-iosched.c:cfq_blkcg_files -> cfq_blkcg_legacy_files
    * blk-throttle.c:throtl_files -> throtl_legacy_files

    Pure renames. No functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, both cfq-iosched and blk-throttle keep track of
    io_service_bytes and io_serviced stats. While keeping track of them
    separately may be useful during development, it doesn't make much
    sense otherwise. Also, blk-throttle was counting bios as IOs while
    cfq-iosched counted requests, which is more confusing than
    informative.

    This patch adds ->stat_bytes and ->stat_ios to blkg (blkcg_gq),
    removes the counterparts from cfq-iosched and blk-throttle and let
    them print from the common blkg counters. The common counters are
    incremented during bio issue in blkcg_bio_issue_check().

    The outputs are still filtered by whether the policy has
    blkg_policy_data on a given blkg, so cfq's output won't show up if it
    has never been used for a given blkg. The only times when the outputs
    would differ significantly are when policies are attached on the fly
    or elevators are switched back and forth. Those are quite exceptional
    operations and I don't think they warrant keeping separate counters.

    v3: Update blkio-controller.txt accordingly.

    v2: Account IOs during bio issues instead of request completions so
    that bio-based drivers can be handled the same way.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkg_[rw]stat are used as stat counters for blkcg policies. They
    aren't per-cpu by themselves, and blk-throttle makes them per-cpu by
    wrapping around them. This patch makes blkg_[rw]stat per-cpu and
    drops the ad-hoc per-cpu wrapping in blk-throttle.

    * blkg_[rw]stat->cnt is replaced with cpu_cnt which is struct
    percpu_counter. This makes syncp unnecessary as remote accesses are
    handled by percpu_counter itself.

    * blkg_[rw]stat_init() can now fail due to percpu allocation failure
    and thus are updated to return int.

    * percpu_counters need explicit freeing. blkg_[rw]stat_exit() added.

    * As blkg_rwstat->cpu_cnt[] can't be read directly anymore, reading
    and summing results are stored in ->aux_cnt[] instead.

    * Custom per-cpu stat implementation in blk-throttle is removed.

    This makes all blkcg stat counters per-cpu without complicating policy
    implementations.
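
    A minimal sketch of the resulting shapes (simplified; layout as
    described above):

    struct blkg_rwstat {
            struct percpu_counter   cpu_cnt[BLKG_RWSTAT_NR];
            atomic64_t              aux_cnt[BLKG_RWSTAT_NR];
    };

    /* init can now fail because percpu allocation can fail */
    static inline int blkg_rwstat_init(struct blkg_rwstat *rwstat, gfp_t gfp)
    {
            int i, ret;

            for (i = 0; i < BLKG_RWSTAT_NR; i++) {
                    ret = percpu_counter_init(&rwstat->cpu_cnt[i], 0, gfp);
                    if (ret) {
                            while (i--)
                                    percpu_counter_destroy(&rwstat->cpu_cnt[i]);
                            return ret;
                    }
                    atomic64_set(&rwstat->aux_cnt[i], 0);
            }
            return 0;
    }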

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkg (blkcg_gq) is currently created by blkcg policies invoking
    blkg_lookup_create(), which ends up repeating about the same code in
    different policies. Theoretically, this can avoid the overhead of
    looking up and/or creating blkg's if blkcg is enabled but no policy is
    in use; however, the cost of blkg lookup / creation is very low,
    especially if only the root blkcg is in use, which is highly likely if
    no blkcg policy is in active use - it boils down to a single, very
    predictable conditional and the surrounding RCU protection.

    This patch consolidates blkg creation to a new function
    blkcg_bio_issue_check() which is called during bio issue from
    generic_make_request_checks(). blkcg_bio_issue_check() is now the
    only function which tries to create missing blkg's. The subsequent
    policy and request_list operations just perform blkg_lookup() and if
    missing falls back to the root.

    * blk_get_rl() no longer tries to create blkg. It uses blkg_lookup()
    instead of blkg_lookup_create().

    * blk_throtl_bio() is now called from blkcg_bio_issue_check() with rcu
    read locked and blkg already looked up. Both throtl_lookup_tg() and
    throtl_lookup_create_tg() are dropped.

    * cfq is similarly updated. cfq_lookup_create_cfqg() is replaced with
    cfq_lookup_cfqg(), which uses blkg_lookup().

    This consolidates blkg handling and avoids unnecessary blkg creation
    retries under memory pressure. In addition, this provides a common
    bio entry point into blkcg where things like common accounting can be
    performed.

    v2: Build fixes for !CONFIG_CFQ_GROUP_IOSCHED and
    !CONFIG_BLK_DEV_THROTTLING.
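
    Roughly, the new entry point has the following shape (a simplified
    sketch of what's described above, not the verbatim patch):

    static inline bool blkcg_bio_issue_check(struct request_queue *q,
                                             struct bio *bio)
    {
            struct blkcg *blkcg;
            struct blkcg_gq *blkg;
            bool throtl = false;

            rcu_read_lock();
            blkcg = bio_blkcg(bio);

            blkg = blkg_lookup(blkcg, q);
            if (unlikely(!blkg)) {
                    spin_lock_irq(q->queue_lock);
                    blkg = blkg_lookup_create(blkcg, q);
                    if (IS_ERR(blkg))
                            blkg = NULL;    /* callers fall back to root */
                    spin_unlock_irq(q->queue_lock);
            }

            throtl = blk_throtl_bio(q, blkg, bio);
            rcu_read_unlock();
            return !throtl;
    }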

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • If a queue is bypassing, all blkcg policies should become noops, but
    blk-throttle wasn't doing so. It only became a noop if the queue was
    dying. While this wouldn't lead to an oops, as falling back to the
    root blkg is safe in this case, it can be a bit surprising - a
    bypassing queue could still be applying throttle limits.

    Fix it by removing blk_queue_dying() test in throtl_lookup_create_tg()
    and testing blk_queue_bypass() in blk_throtl_bio() and bypassing
    before doing anything else.
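
    The added early test, sketched (placement at the top of
    blk_throtl_bio(), under the queue lock, is assumed):

    /* bypassing queues apply no limits; bail before anything else */
    if (unlikely(blk_queue_bypass(q)))
            goto out;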

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, both throttle and cfq policies implement their own root
    blkg (blkcg_gq) lookup fast path. This patch moves root blkg
    optimization from throtl_lookup_tg() to __blkg_lookup(). cfq-iosched
    currently doesn't use blkg_lookup() but will be converted and drop the
    optimization too.
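
    The moved fast path, sketched (simplified; slow_lookup() is a
    hypothetical stand-in for the existing non-root lookup code):

    struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg,
                                   struct request_queue *q, bool update_hint)
    {
            /* root blkg fast path: skip the hint and radix tree walk */
            if (blkcg == &blkcg_root)
                    return q->root_blkg;

            /* hint check + radix tree lookup continue as before */
            return slow_lookup(blkcg, q, update_hint);
    }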

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Cc: Arianna Avanzini
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The newly added ->pd_alloc_fn() and ->pd_free_fn() deal with pd
    (blkg_policy_data) while the older ones use blkg (blkcg_gq). Using
    blkg doesn't make sense for ->pd_alloc_fn(); after allocation, pd can
    always be mapped to blkg; and given that these are policy-specific
    methods, it makes sense to converge on pd.

    This patch makes all methods deal with pd instead of blkg. Most
    conversions are trivial. In blk-cgroup.c, a couple of method
    invocation sites now test whether pd exists instead of policy state,
    for consistency. This shouldn't cause any behavioral differences.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With the recent addition of alloc and free methods, things became
    messier. This patch reorganizes them as follows.

    * ->pd_alloc_fn()

    Responsible for allocation and static initializations - the ones
    which can be done independent of where the pd might be attached.

    * ->pd_init_fn()

    Initializations which require the knowledge of where the pd is
    attached.

    * ->pd_free_fn()

    The counter part of pd_alloc_fn(). Static de-init and freeing.

    This leaves ->pd_exit_fn() without any users. Removed.

    While at it, collapse the one-liner function throtl_pd_exit(), which
    has only one user, into its caller.
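
    An illustrative policy declaration after the reorganization (method
    roles as described above; exact struct layout assumed):

    static struct blkcg_policy blkcg_policy_throtl = {
            .pd_alloc_fn    = throtl_pd_alloc,      /* alloc + static init */
            .pd_init_fn     = throtl_pd_init,       /* needs attach point */
            .pd_free_fn     = throtl_pd_free,       /* de-init + free */
    };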

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Because percpu allocator couldn't do non-blocking allocations,
    blk-throttle was forced to implement an ad-hoc asynchronous allocation
    mechanism for its percpu stats for cases where blkg's (blkcg_gq's) are
    allocated from an IO path without sleepable context.

    Now that the percpu allocator can handle gfp_mask and blkg_policy_data
    alloc / free are handled by policy methods, the ad-hoc asynchronous
    allocation mechanism can be replaced with direct allocation from
    tg_stats_alloc_fn(). Rip it out.

    This ensures that an active throtl_grp always has valid non-NULL
    ->stats_cpu. Remove checks on it.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • A blkg (blkcg_gq) represents the relationship between a cgroup and
    request_queue. Each active policy has a pd (blkg_policy_data) on each
    blkg. The pd's were allocated by blkcg core and each policy could
    request to allocate extra space at the end by setting
    blkcg_policy->pd_size larger than the size of pd.

    This is a bit unusual but was done this way mostly to simplify error
    handling, and all the existing use cases could be handled that way;
    however, it is becoming too restrictive now that percpu memory can
    be allocated without blocking.

    This introduces two new mandatory blkcg_policy methods - pd_alloc_fn()
    and pd_free_fn() - which are used to allocate and release pd for a
    given policy. As pd allocation is now done from policy side, it can
    simply allocate a larger area which embeds pd at the beginning. This
    change makes ->pd_size pointless. Removed.
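
    The typical embedding pattern this enables, sketched (the throtl_grp
    field set is reduced for illustration):

    struct throtl_grp {
            /* must come first so pd <-> tg mapping is a container_of() */
            struct blkg_policy_data pd;
            /* ... policy-private fields ... */
    };

    static struct blkg_policy_data *throtl_pd_alloc(gfp_t gfp, int node)
    {
            struct throtl_grp *tg = kzalloc_node(sizeof(*tg), gfp, node);

            return tg ? &tg->pd : NULL;
    }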

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

21 Feb, 2015

1 commit

  • When reading blkio.throttle.io_serviced in a recently created blkio
    cgroup, it's possible to race against the creation of a throttle policy,
    which delays the allocation of stats_cpu.

    Like other functions in the throttle code, just checking for a NULL
    stats_cpu prevents the following oops caused by that race.

    [ 1117.285199] Unable to handle kernel paging request for data at address 0x7fb4d0020
    [ 1117.285252] Faulting instruction address: 0xc0000000003efa2c
    [ 1137.733921] Oops: Kernel access of bad area, sig: 11 [#1]
    [ 1137.733945] SMP NR_CPUS=2048 NUMA PowerNV
    [ 1137.734025] Modules linked in: bridge stp llc kvm_hv kvm binfmt_misc autofs4
    [ 1137.734102] CPU: 3 PID: 5302 Comm: blkcgroup Not tainted 3.19.0 #5
    [ 1137.734132] task: c000000f1d188b00 ti: c000000f1d210000 task.ti: c000000f1d210000
    [ 1137.734167] NIP: c0000000003efa2c LR: c0000000003ef9f0 CTR: c0000000003ef980
    [ 1137.734202] REGS: c000000f1d213500 TRAP: 0300 Not tainted (3.19.0)
    [ 1137.734230] MSR: 9000000000009032 CR: 42008884 XER: 20000000
    [ 1137.734325] CFAR: 0000000000008458 DAR: 00000007fb4d0020 DSISR: 40000000 SOFTE: 0
    GPR00: c0000000003ed3a0 c000000f1d213780 c000000000c59538 0000000000000000
    GPR04: 0000000000000800 0000000000000000 0000000000000000 0000000000000000
    GPR08: ffffffffffffffff 00000007fb4d0020 00000007fb4d0000 c000000000780808
    GPR12: 0000000022000888 c00000000fdc0d80 0000000000000000 0000000000000000
    GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
    GPR20: 000001003e120200 c000000f1d5b0cc0 0000000000000200 0000000000000000
    GPR24: 0000000000000001 c000000000c269e0 0000000000000020 c000000f1d5b0c80
    GPR28: c000000000ca3a08 c000000000ca3dec c000000f1c667e00 c000000f1d213850
    [ 1137.734886] NIP [c0000000003efa2c] .tg_prfill_cpu_rwstat+0xac/0x180
    [ 1137.734915] LR [c0000000003ef9f0] .tg_prfill_cpu_rwstat+0x70/0x180
    [ 1137.734943] Call Trace:
    [ 1137.734952] [c000000f1d213780] [d000000005560520] 0xd000000005560520 (unreliable)
    [ 1137.734996] [c000000f1d2138a0] [c0000000003ed3a0] .blkcg_print_blkgs+0xe0/0x1a0
    [ 1137.735039] [c000000f1d213960] [c0000000003efb50] .tg_print_cpu_rwstat+0x50/0x70
    [ 1137.735082] [c000000f1d2139e0] [c000000000104b48] .cgroup_seqfile_show+0x58/0x150
    [ 1137.735125] [c000000f1d213a70] [c0000000002749dc] .kernfs_seq_show+0x3c/0x50
    [ 1137.735161] [c000000f1d213ae0] [c000000000218630] .seq_read+0xe0/0x510
    [ 1137.735197] [c000000f1d213bd0] [c000000000275b04] .kernfs_fop_read+0x164/0x200
    [ 1137.735240] [c000000f1d213c80] [c0000000001eb8e0] .__vfs_read+0x30/0x80
    [ 1137.735276] [c000000f1d213cf0] [c0000000001eb9c4] .vfs_read+0x94/0x1b0
    [ 1137.735312] [c000000f1d213d90] [c0000000001ebb38] .SyS_read+0x58/0x100
    [ 1137.735349] [c000000f1d213e30] [c000000000009218] syscall_exit+0x0/0x98
    [ 1137.735383] Instruction dump:
    [ 1137.735405] 7c6307b4 7f891800 409d00b8 60000000 60420000 3d420004 392a63b0 786a1f24
    [ 1137.735471] 7d49502a e93e01c8 7d495214 7d2ad214 e9090008 e9490010 e9290018

    And here is a program that easily reproduces this, although the race
    was first found by running docker. (The includes and constants below
    are assumed values added for completeness; the original report
    defined them elsewhere.)

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <malloc.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define BUFFER_ALIGN 4096       /* assumed: sector-aligned for O_DIRECT */
    #define BUFFER_SIZE  4096       /* assumed */
    #define CGPATH "/sys/fs/cgroup/blkio"   /* assumed blkio cgroup mount */
    #define NR_TESTS 1000           /* assumed iteration count */

    void run(pid_t pid)
    {
            int n;
            int status;
            int fd;
            char *buffer;

            buffer = memalign(BUFFER_ALIGN, BUFFER_SIZE);
            n = snprintf(buffer, BUFFER_SIZE, "%d\n", pid);
            fd = open(CGPATH "/test/tasks", O_WRONLY);
            write(fd, buffer, n);
            close(fd);

            if (fork() > 0) {
                    fd = open("/dev/sda", O_RDONLY | O_DIRECT);
                    read(fd, buffer, 512);
                    close(fd);
                    wait(&status);
            } else {
                    fd = open(CGPATH "/test/blkio.throttle.io_serviced",
                              O_RDONLY);
                    n = read(fd, buffer, BUFFER_SIZE);
                    close(fd);
            }

            free(buffer);
            exit(0);
    }

    void test(void)
    {
            int status;

            mkdir(CGPATH "/test", 0666);
            if (fork() > 0)
                    wait(&status);
            else
                    run(getpid());
            rmdir(CGPATH "/test");
    }

    int main(int argc, char **argv)
    {
            int i;

            for (i = 0; i < NR_TESTS; i++)
                    test();
            return 0;
    }
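
    For reference, a sketch of how the check reads in the prfill callback
    named in the oops above (exact placement assumed):

    static u64 tg_prfill_cpu_rwstat(struct seq_file *sf,
                                    struct blkg_policy_data *pd, int off)
    {
            struct throtl_grp *tg = pd_to_tg(pd);

            /* stats_cpu may not be allocated yet; report nothing */
            if (tg->stats_cpu == NULL)
                    return 0;

            /* ... sum the per-cpu stats as before ... */
    }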

    Reported-by: Ricardo Marin Matinata
    Signed-off-by: Thadeu Lima de Souza Cascardo
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Thadeu Lima de Souza Cascardo
     

09 Jul, 2014

1 commit

  • sane_behavior has been used as a development vehicle for the default
    unified hierarchy. Now that the default hierarchy is in place, the
    flag has become redundant and confusing, as its usage is allowed on
    all hierarchies. There are going to be either the default hierarchy
    or legacy ones. Let's make that clear by removing sane_behavior
    support on non-default hierarchies.

    This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
    comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
    cgroup_on_dfl() with sane_behavior specific part dropped.

    On the default and legacy hierarchies w/o sane_behavior, this
    shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     

10 Jun, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on cgroup side. Heavy restructuring including
    locking simplification took place to improve the code base and enable
    implementation of the unified hierarchy, which currently exists behind
    a __DEVEL__ mount option. The core support is mostly complete but
    individual controllers need further work. To explain the design and
    rationales of the unified hierarchy,

    Documentation/cgroups/unified-hierarchy.txt

    is added.

    Another notable change is css (cgroup_subsys_state - what each
    controller uses to identify and interact with a cgroup) iteration
    update. This is part of continuing updates on css object lifetime and
    visibility. cgroup started with reference count draining on removal
    way back and is now reaching a point where csses behave and are
    iterated like normal refcnted objects albeit with some complexities to
    allow distinguishing the state where they're being deleted. The css
    iteration update isn't taken advantage of yet but is planned to be
    used to simplify memcg significantly"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
    cgroup: disallow disabled controllers on the default hierarchy
    cgroup: don't destroy the default root
    cgroup: disallow debug controller on the default hierarchy
    cgroup: clean up MAINTAINERS entries
    cgroup: implement css_tryget()
    device_cgroup: use css_has_online_children() instead of has_children()
    cgroup: convert cgroup_has_live_children() into css_has_online_children()
    cgroup: use CSS_ONLINE instead of CGRP_DEAD
    cgroup: iterate cgroup_subsys_states directly
    cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
    cgroup: move cgroup->serial_nr into cgroup_subsys_state
    cgroup: link all cgroup_subsys_states in their sibling lists
    cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
    cgroup: remove cgroup->parent
    device_cgroup: remove direct access to cgroup->children
    memcg: update memcg_has_children() to use css_next_child()
    memcg: remove tasks/children test from mem_cgroup_force_empty()
    cgroup: remove css_parent()
    cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
    cgroup: use cgroup->self.refcnt for cgroup refcnting
    ...

    Linus Torvalds
     

14 May, 2014

1 commit

  • Convert all cftype->write_string() users to the new cftype->write(),
    which maps directly to the kernfs write operation and has full access
    to kernfs and cgroup contexts. The conversions are mostly mechanical.

    * @css and @cft are accessed using of_css() and of_cft() accessors
    respectively instead of being specified as arguments.

    * Should return @nbytes on success instead of 0.

    * @buf is not trimmed automatically. Trim if necessary. Note that
    blkcg and netprio don't need this as the parsers already handle
    whitespaces.

    cftype->write_string() has no user left after the conversions and is
    removed.

    While at it, remove unnecessary local variable @p in
    cgroup_subtree_control_write() and stale comment about
    CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

    This patch doesn't introduce any visible behavior changes.

    v2: netprio was missing from conversion. Converted.
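
    An illustrative handler shape after the conversion (a sketch loosely
    based on blk-throttle's tg_set_conf(); parsing elided):

    static ssize_t tg_set_conf(struct kernfs_open_file *of,
                               char *buf, size_t nbytes, loff_t off)
    {
            struct blkcg *blkcg = css_to_blkcg(of_css(of));
            int ret = 0;

            /* ... parse buf (already trimmed by the blkcg parser) and
             * apply the config to @blkcg ... */

            return ret ?: nbytes;   /* @nbytes on success */
    }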

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Neil Horman
    Cc: "David S. Miller"

    Tejun Heo
     

19 Mar, 2014

1 commit

  • cftype->write_string() just passes on the writable buffer from kernfs
    and there's no reason to add a const restriction on the buffer. The
    only thing const achieves is unnecessarily complicating parsing of the
    buffer. Drop const from @buffer.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     

12 Feb, 2014

1 commit

  • cftype->max_write_len is used to extend the maximum size of writes.
    It's interpreted in such a way that the actual maximum size is one
    less than the specified value. The default size is defined by
    CGROUP_LOCAL_BUFFER_SIZE. Its interpretation is quite confusing - its
    value is decremented by 1 and then compared for equality with max
    size, which means that the actual default size is
    CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars.

    There's no point in having a limit that low. Update its definition so
    that it means the actual maximum string length sans termination, and
    anything below PAGE_SIZE-1 is treated as PAGE_SIZE-1.

    .max_write_len for "release_agent" is updated to PATH_MAX-1 and
    cgroup_release_agent_write() is updated so that the redundant strlen()
    check is removed and it uses strlcpy() instead of strcpy().
    .max_write_len initializations in blk-throttle.c and cfq-iosched.c are
    no longer necessary and removed. The one in cpuset is kept unchanged
    as it's an approximated value to begin with.

    This will also make transition to kernfs smoother.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

31 Jan, 2014

1 commit

  • Pull core block IO changes from Jens Axboe:
    "The major piece in here is the immutable bio_ve series from Kent, the
    rest is fairly minor. It was supposed to go in last round, but
    various issues pushed it to this release instead. The pull request
    contains:

    - Various smaller blk-mq fixes from different folks. Nothing major
    here, just minor fixes and cleanups.

    - Fix for a memory leak in the error path in the block ioctl code
    from Christian Engelmayer.

    - Header export fix from CaiZhiyong.

    - Finally the immutable biovec changes from Kent Overstreet. This
    enables some nice future work on making arbitrarily sized bios
    possible, and splitting more efficient. Related fixes to immutable
    bio_vecs:

    - dm-cache immutable fixup from Mike Snitzer.
    - btrfs immutable fixup from Muthu Kumar.

    - bio-integrity fix from Nic Bellinger, which is also going to stable"

    * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
    xtensa: fixup simdisk driver to work with immutable bio_vecs
    block/blk-mq-cpu.c: use hotcpu_notifier()
    blk-mq: for_each_* macro correctness
    block: Fix memory leak in rw_copy_check_uvector() handling
    bio-integrity: Fix bio_integrity_verify segment start bug
    block: remove unrelated header files and export symbol
    blk-mq: uses page->list incorrectly
    blk-mq: use __smp_call_function_single directly
    btrfs: fix missing increment of bi_remaining
    Revert "block: Warn and free bio if bi_end_io is not set"
    block: Warn and free bio if bi_end_io is not set
    blk-mq: fix initializing request's start time
    block: blk-mq: don't export blk_mq_free_queue()
    block: blk-mq: make blk_sync_queue support mq
    block: blk-mq: support draining mq queue
    dm cache: increment bi_remaining when bi_end_io is restored
    block: fixup for generic bio chaining
    block: Really silence spurious compiler warnings
    block: Silence spurious compiler warnings
    block: Kill bio_pair_split()
    ...

    Linus Torvalds
     

06 Dec, 2013

1 commit

  • In preparation for the conversion to kernfs, cgroup file handling is
    updated so that it can be easily mapped to kernfs. This patch
    replaces cftype->read_seq_string() with cftype->seq_show(), which is
    not limited to single_open() operation and will map directly to the
    kernfs seq_file interface.

    The conversions are mechanical. As ->seq_show() doesn't have @css and
    @cft, the functions which make use of them are converted to use
    seq_css() and seq_cft() respectively. On several occasions, e.g. if
    it has seq_string in its name, the function name is updated to fit the
    new method better.

    This patch does not introduce any behavior changes.
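
    An illustrative converted handler (a sketch based on blk-throttle's
    print path; arguments abbreviated):

    static int tg_print_conf_u64(struct seq_file *sf, void *v)
    {
            /* @css and @cft are recovered from the seq_file */
            blkcg_print_blkgs(sf, css_to_blkcg(seq_css(sf)),
                              tg_prfill_conf_u64, &blkcg_policy_throtl,
                              seq_cft(sf)->private, false);
            return 0;
    }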

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Michal Hocko
    Acked-by: Daniel Wagner
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Neil Horman

    Tejun Heo
     

24 Nov, 2013

1 commit

  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
     

13 Nov, 2013

1 commit

  • Now that seqcounts are lockdep enabled objects, we need to explicitly
    initialize runtime allocated seqcounts so that lockdep can track them.

    Without this patch, Fengguang was seeing:

    [ 4.127282] INFO: trying to register non-static key.
    [ 4.128027] the code is fine but needs lockdep annotation.
    [ 4.128027] turning off the locking correctness validator.
    [ 4.128027] CPU: 0 PID: 96 Comm: kworker/u4:1 Not tainted 3.12.0-next-20131108-10601-gbad570d #2
    [ 4.128027] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ ... ]
    [ 4.128027] Call Trace:
    [ 4.128027] [] ? console_unlock+0x353/0x380
    [ 4.128027] [] dump_stack+0x48/0x60
    [ 4.128027] [] __lock_acquire.isra.26+0x7e3/0xceb
    [ 4.128027] [] lock_acquire+0x71/0x9a
    [ 4.128027] [] ? blk_throtl_bio+0x1c3/0x485
    [ 4.128027] [] throtl_update_dispatch_stats+0x7c/0x153
    [ 4.128027] [] ? blk_throtl_bio+0x1c3/0x485
    [ 4.128027] [] blk_throtl_bio+0x1c3/0x485
    ...

    Use u64_stats_init() for all affected data structures, which initializes
    the seqcount.
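
    The fix pattern, sketched (shown for blkg_rwstat; blkg_stat gets the
    same treatment):

    static inline void blkg_rwstat_init(struct blkg_rwstat *rwstat)
    {
            /* register the runtime-allocated seqcount with lockdep */
            u64_stats_init(&rwstat->syncp);
    }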

    Reported-and-Tested-by: Fengguang Wu
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Signed-off-by: Peter Zijlstra
    [ Folded in another fix from the mailing list as well as a fix to that fix. Tweaked commit message. ]
    Signed-off-by: John Stultz
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1384314134-6895-1-git-send-email-john.stultz@linaro.org
    [ So I actually think that the two SOBs from PeterZ are the right depiction of the patch route. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Aug, 2013

3 commits

  • Previously, all css descendant iterators didn't include the origin
    (root of subtree) css in the iteration. The reasons were maintaining
    consistency with css_for_each_child() and that at the time of
    introduction more use cases needed skipping the origin anyway;
    however, given that css_is_descendant() considers self to be a
    descendant, omitting the origin css has become more confusing and
    looking at the accumulated use cases rather clearly indicates that
    including origin would result in simpler code overall.

    While this is a change which can easily lead to subtle bugs, cgroup
    API including the iterators has recently gone through major
    restructuring and no out-of-tree changes will be applicable without
    adjustments making this a relatively acceptable opportunity for this
    type of change.

    The conversions are mostly straight-forward. If the iteration block
    had explicit origin handling before or after, it's moved inside the
    iteration. If not, if (pos == origin) continue; is added. Some
    conversions add extra reference get/put around origin handling by
    consolidating origin handling and the rest. While the extra ref
    operations aren't strictly necessary, this shouldn't cause any
    noticeable difference.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Michal Hocko
    Cc: Jens Axboe
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using css
    (cgroup_subsys_state) as the primary handle instead of cgroup in
    subsystem API. For hierarchy iterators, this is beneficial because

    * In most cases, css is the only thing subsystems care about anyway.

    * On the planned unified hierarchy, iterations for different
    subsystems will need to skip over different subtrees of the
    hierarchy depending on which subsystems are enabled on each cgroup.
    Passing around css makes it unnecessary to explicitly specify the
    subsystem in question, as css is the intersection between cgroup and
    subsystem.

    * For the planned unified hierarchy, css's would need to be created
    and destroyed dynamically, independent of the cgroup hierarchy.
    Having cgroup core manage css iteration makes enforcing deref rules a
    lot easier.

    Most subsystem conversions are straight-forward. Noteworthy changes
    are

    * blkio: cgroup_to_blkcg() is no longer used. Removed.

    * freezer: cgroup_freezer() is no longer used. Removed.

    * devices: cgroup_to_devcgroup() is no longer used. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsystem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straightforward, but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     

15 May, 2013

5 commits

  • With the recent updates, blk-throttle is finally ready for proper
    hierarchy support. Dispatching now honors service_queue->parent_sq
    and propagates correctly. The only thing missing is setting
    ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
    hierarchy.

    This patch updates throtl_pd_init() such that service_queues form the
    same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
    As this concludes proper hierarchy support for blkcg, the shameful
    .broken_hierarchy tag is removed from blkio_subsys.

    v2: Updated blkio-controller.txt as suggested by Vivek.
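
    Sketched, the hookup in throtl_pd_init() reads roughly as follows
    (simplified; the exact test form is assumed):

    /* after the service_queue is set up */
    sq->parent_sq = &td->service_queue;
    if (cgroup_sane_behavior(blkg->blkcg->css.cgroup) && blkg->parent)
            sq->parent_sq = &blkg_to_tg(blkg->parent)->service_queue;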

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Li Zefan

    Tejun Heo
     
  • blk_throtl_bio() has a quick exit path for throtl_grps without limits
    configured. It looks at the bps and iops limits and, if neither is
    configured, the bio is issued immediately. While this is correct in
    the current flat hierarchy, where each throtl_grp behaves completely
    independently, it would become wrong in proper hierarchy mode. A
    group without any limits could still be limited by one of its
    ancestors, and bios queued for such a group should not bypass
    blk-throtl.

    As having a quick bypass mechanism is beneficial, this patch
    reimplements the mechanism such that it's correct even with proper
    hierarchy. throtl_grp->has_rules[] is added. These booleans are
    updated for the whole subtree whenever a config is updated so that
    has_rules[] of the whole subtree stays synchronized. They're also
    updated when a new throtl_grp comes online so that it can't escape the
    limits of its ancestors.

    As no throtl_grp has another throtl_grp as parent now, this patch
    doesn't yet make any behavior differences.
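
    The resulting quick-exit test, sketched (names as described above):

    /*
     * has_rules[rw] is true iff this group or any of its ancestors has
     * a bps/iops limit configured for this direction.
     */
    if (!tg->has_rules[rw])
            goto out;       /* bypass throttling entirely */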

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • With the planned proper hierarchy support, a bio will climb up the
    tree before actually being dispatched. This makes sure bio is also
    subjected to parent's throttling limits, if any.

    It might happen that the parent is idle and, when the bio is
    transferred to the parent, a new slice starts fresh. But that is
    incorrect, as the parent's wait time should have started when the bio
    was queued in the child group; this causes IOs to be throttled more
    than configured as they climb the hierarchy.

    Given the fact that we have not written the hierarchical algorithm in
    a way where the child's and parent's time slices are synchronized, we
    transfer the child's start time to the parent if the parent was
    idling. If the parent was busy dispatching other bios all this while,
    this is not an issue.

    The child's slice start time is passed to the parent. The parent
    looks at its last expired slice's start time. If the child's start
    time is after the parent's old start time, that means the parent had
    been idle and, after the parent went idle, the child had an IO
    queued. So use the child's start time as the parent's start time.

    If the parent's start time is after the child's start time, that
    means that when the IO got queued in the child group, the parent was
    not idle. But later it dispatched some IO, its slice got trimmed and
    then it went idle. After a while the child's request got shifted to
    the parent group. In this case use the parent's old start time as the
    new start time, as that's the duration of slice we did not use.

    This logic is far from perfect, as if there are multiple children,
    the first child transferring the bio decides the start time, while a
    bio might have been queued even earlier in another child that is yet
    to be transferred up to the parent. In that case we will lose time
    and bandwidth in the parent. This patch is just an approximation to
    make the situation somewhat better.
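
    A sketch of the transfer helper this describes (name and shape
    assumed from the description):

    static inline void
    throtl_start_new_slice_with_credit(struct throtl_grp *tg, bool rw,
                                       unsigned long start)
    {
            tg->bytes_disp[rw] = 0;
            tg->io_disp[rw] = 0;

            /*
             * Inherit the child's start time only if the parent went
             * idle before the child queued its bio, i.e. start is newer
             * than the parent's last slice start.
             */
            if (time_after_eq(start, tg->slice_start[rw]))
                    tg->slice_start[rw] = start;

            tg->slice_end[rw] = jiffies + throtl_slice;
    }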

    Signed-off-by: Vivek Goyal
    Signed-off-by: Tejun Heo

    Vivek Goyal
     
  • With flat hierarchy, there's only single level of dispatching
    happening and fairness beyond that point is the responsibility of the
    rest of the block layer and driver, which usually works out okay;
    however, with the planned hierarchy support,
    service_queue->bio_lists[] can be filled up by bios from a single
    source. While the limits would still be honored, it'd be very easy to
    starve IOs from siblings or children.

    To avoid such starvation, this patch implements throtl_qnode and
    converts service_queue->bio_lists[] to lists of per-source qnodes
    which in turn contains the bio's. For example, when a bio is
    dispatched from a child group, the bio doesn't get queued on
    ->bio_lists[] directly but it first gets queued on the group's qnode
    which in turn gets queued on service_queue->queued[]. When
    dispatching for the upper level, the ->queued[] list is consumed in
    round-robin order so that the dispatch window is consumed fairly by
    all IO sources.

    There are two ways a bio can come to a throtl_grp - directly queued to
    the group or dispatched from a child. For the former
    throtl_grp->qnode_on_self[rw] is used. For the latter, the child's
    ->qnode_on_parent[rw].
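
    The structure this implies, sketched from the description (field set
    assumed):

    struct throtl_qnode {
            struct list_head        node;   /* membership in sq->queued[rw] */
            struct bio_list         bios;   /* queued bios from this source */
            struct throtl_grp       *tg;    /* blkg to pin while active */
    };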

    Note that this means that the child which is contributing a bio to its
    parent should stay pinned until all its bios are dispatched to its
    grand-parent. This patch moves blkg refcnting from bio add/remove
    spots to qnode activation/deactivation so that the blkg containing an
    active qnode is always pinned. As child pins the parent, this is
    sufficient for keeping the relevant sub-tree pinned while bios are in
    flight.

    The starvation issue was spotted by Vivek Goyal.

    v2: The original patch used the same throtl_grp->qnode_on_self/parent
    for reads and writes causing RWs to be queued incorrectly if there
    already are outstanding IOs in the other direction. They should
    be throtl_grp->qnode_on_self/parent[2] so that READs and WRITEs
    can use different qnodes. Spotted by Vivek Goyal.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_pending_timer_fn() currently assumes that the parent_sq is the
    top level one and the bio's dispatched are ready to be issued;
    however, this assumption will be wrong with proper hierarchy support.
    This patch makes the following changes to make
    throtl_pending_timer_fn() ready for hierarchy.

    * If the parent_sq isn't the top-level one, update the parent
    throtl_grp's dispatch time and schedule the next dispatch as
    necessary. If the parent's dispatch time is now, repeat the
    function for the parent throtl_grp.

    * If the parent_sq is the top-level one, kick issue work_item as
    before.

    * The debug message printed by throtl_log() now prints out the
    service_queue's nr_queued[] instead of the total nr_queued as the
    latter becomes uninteresting and misleading with hierarchical
    dispatch.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo