01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived
    differently from the list ones. While the list ones are:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small number of places were using the 'node' parameter; these
    were modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foundation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

22 Feb, 2013

1 commit

  • While stress-running very-small container scenarios with the Kernel Memory
    Controller, I've run into a lockdep-detected lock imbalance in
    cfq-iosched.c.

    I'll apologize beforehand for not posting a backlog: I didn't anticipate
    it being so hard to reproduce, so I didn't save my serial output and
    went directly to debugging. It turned out not to happen again in more
    than 20 runs, making it a quite rare pattern.

    But here is my analysis:

    When we are in very low-memory situations, we will arrive at
    cfq_find_alloc_queue and may not find a queue, having to resort to the oom
    queue, in an rcu-locked condition:

    if (!cfqq || cfqq == &cfqd->oom_cfqq)
            [ ... ]

    Next, we will release the rcu lock, and try to allocate a queue, retrying
    if we succeed:

    rcu_read_unlock();
    spin_unlock_irq(cfqd->queue->queue_lock);
    new_cfqq = kmem_cache_alloc_node(cfq_pool,
                                     gfp_mask | __GFP_ZERO,
                                     cfqd->queue->node);
    spin_lock_irq(cfqd->queue->queue_lock);
    if (new_cfqq)
            goto retry;

    We are unlocked at this point, but it should be fine, since we will
    reacquire the rcu_read_lock when we retry.

    Except of course, that we may not retry: the allocation may very well fail
    and we'll keep on going through the flow:

    The next branch is:

    if (cfqq) {
            [ ... ]
    } else
            cfqq = &cfqd->oom_cfqq;

    And right before exiting, we'll issue rcu_read_unlock().

    Being already unlocked, this is the likely source of our imbalance. Since
    cfqq is either already NULL or made NULL in the first statement of the
    outer branch, the only viable alternative here seems to be to return the
    oom queue right away in case of allocation failure.

    Please review the following patch and apply if you agree with my analysis.

    Signed-off-by: Glauber Costa
    Cc: Jens Axboe
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Glauber Costa
     

10 Jan, 2013

15 commits

  • Unfortunately, at this point, there's no way to make the existing
    statistics hierarchical without creating nasty surprises for the
    existing users. Just create recursive counterpart of the existing
    stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • To support hierarchical stats, it's necessary to remember stats from
    dead children. Add cfqg->dead_stats and make a dying cfqg transfer
    its stats to the parent's dead-stats.

    The transfer happens from ->pd_offline_fn() and it is possible that
    there are some residual IOs completing afterwards. Currently, we lose
    these stats. Given that cgroup removal isn't a very high frequency
    operation and the amount of residual IOs on offline are likely to be
    nil or small, this shouldn't be a big deal and the complexity needed
    to handle residual IOs - another callback and rather elaborate
    synchronization to reach and lock the matching q - doesn't seem
    justified.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Separate out cfqg_stats_reset() which takes struct cfqg_stats * from
    cfq_pd_reset_stats() and move the latter to where other pd methods are
    defined. cfqg_stats_reset() will be used to implement hierarchical
    stats.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Rename blkg_rwstat_sum() to blkg_rwstat_total(). sum will be used for
    summing up stats from multiple blkgs.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • With the previous two patches, all cfqg scheduling decisions are based
    on vfraction and ready for hierarchy support. The only thing which
    keeps the behavior flat is cfqg_flat_parent() which makes vfraction
    calculation consider all non-root cfqgs children of the root cfqg.

    Replace it with cfqg_parent() which returns the real parent. This
    enables full blkcg hierarchy support for cfq-iosched. For example,
    consider the following hierarchy.

              root
             /    \
         A:500    B:250
         /    \
    AA:500    AB:1000

    For simplicity, let's say all the leaf nodes have active tasks and are
    on service tree. For each leaf node, vfraction would be

    AA: (500 / 1500) * (500 / 750) =~ 0.2222
    AB: (1000 / 1500) * (500 / 750) =~ 0.4444
    B: (250 / 750) =~ 0.3333

    and vdisktime will be distributed accordingly. For more detail,
    please refer to Documentation/block/cfq-iosched.txt.

    v2: cfq-iosched.txt updated to describe group scheduling as suggested
    by Vivek.

    v3: blkio-controller.txt updated.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • cfq_group_slice() calculates slice by taking a fraction of
    cfq_target_latency according to the ratio of cfqg->weight against
    service_tree->total_weight. This currently works only because all
    cfqgs are treated as being at the same level.

    To prepare for proper hierarchy support, convert cfq_group_slice() to
    base the calculation on cfqg->vfraction. As cfqg->vfraction is always
    a fraction of 1 and represents the fraction allocated to the cfqg with
    hierarchy considered, the slice can be simply calculated by
    multiplying cfqg->vfraction by cfq_target_latency (with the fixed
    point shift factored in).

    As vfraction calculation currently treats all non-root cfqgs as
    children of the root cfqg, this patch doesn't introduce noticeable
    behavior difference.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, cfqg charges are scaled directly according to cfqg->weight.
    Regardless of the number of active cfqgs or the amount of active
    weights, a given weight value always scales charge the same way. This
    works fine as long as all cfqgs are treated equally regardless of
    their positions in the hierarchy, which is what cfq currently
    implements. It can't work in hierarchical settings because the
    interpretation of a given weight value depends on where the weight is
    located in the hierarchy.

    This patch reimplements cfqg charge scaling so that it can be used to
    support hierarchy properly. The scheme is fairly simple and
    light-weight.

    * When a cfqg is added to the service tree, its vfraction is
    calculated: it walks up the tree to the root, calculating the fraction
    it has in the hierarchy. At each level, the fraction can be
    calculated as

    cfqg->weight / parent->level_weight

    By compounding these, the global fraction of vdisktime the cfqg has
    claim to - vfraction - can be determined.

    * When the cfqg needs to be charged, the charge is scaled in inverse
    proportion to the vfraction.

    The new scaling scheme uses the same CFQ_SERVICE_SHIFT for fixed point
    representation as before; however, the smallest scaling factor is now
    1 (ie. 1 << CFQ_SERVICE_SHIFT). This is different from before where 1
    was for CFQ_WEIGHT_DEFAULT and higher weight would result in smaller
    scaling factor.

    While this shifts the global scale of vdisktime a bit, it doesn't
    change the relative relationships among cfqgs and the scheduling
    result isn't different.

    cfq_group_notify_queue_add uses fixed CFQ_IDLE_DELAY when appending
    new cfqg to the service tree. The specific value of CFQ_IDLE_DELAY
    didn't have any relevance to vdisktime before and is unlikely to cause
    any visible behavior difference now especially as the scale shift
    isn't that large.

    As the new scheme now makes proper distinction between cfqg->weight
    and ->leaf_weight, reverse the weight aliasing for root cfqgs. For
    root, both weights are now mapped to ->leaf_weight instead of the
    other way around.

    Because we're still using cfqg_flat_parent(), this patch shouldn't
    change the scheduling behavior in any noticeable way.

    v2: Beefed up comments on vfraction as requested by Vivek.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • To prepare for blkcg hierarchy support, add cfqg->nr_active and
    ->children_weight. cfqg->nr_active counts the number of active cfqgs
    at the cfqg's level and ->children_weight is sum of weights of those
    cfqgs. The level covers itself (cfqg->leaf_weight) and immediate
    children.

    The two values are updated when a cfqg enters and leaves the group
    service tree. Unless the hierarchy is very deep, the added overhead
    should be negligible.

    Currently, the parent is determined using cfqg_flat_parent() which
    makes the root cfqg the parent of all other cfqgs. This is to make
    the transition to hierarchy-aware scheduling gradual. Scheduling
    logic will be converted to use cfqg->children_weight without actually
    changing the behavior. When everything is ready, cfqg_flat_parent()
    will be replaced with a proper parent function.

    This patch doesn't introduce any behavior change.

    v2: s/cfqg->level_weight/cfqg->children_weight/ as per Vivek.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • cfq blkcg is about to grow proper hierarchy handling, where a child
    blkg's weight would nest inside the parent's. This makes tasks in a
    blkg compete against both tasks in the sibling blkgs and the tasks
    of child blkgs.

    We're gonna use the existing weight as the group weight which decides
    the blkg's weight against its siblings. This patch introduces a new
    weight - leaf_weight - which decides the weight of a blkg against the
    child blkgs.

    It's named leaf_weight because another way to look at it is that each
    internal blkg node has a hidden child leaf node which contains all of
    its tasks; leaf_weight is the weight of that leaf node and is handled
    the same way as the weight of the child blkgs.

    This patch only adds leaf_weight fields and exposes it to userland.
    The new weight isn't actually used anywhere yet. Note that
    cfq-iosched currently officially supports only single level hierarchy
    and root blkgs compete with the first level blkgs - ie. root weight is
    basically being used as leaf_weight. For root blkgs, the two weights
    are kept in sync for backward compatibility.

    v2: cfqd->root_group->leaf_weight initialization was missing from
    cfq_init_queue() causing divide by zero when
    !CONFIG_CFQ_GROUP_SCHED. Fix it. Reported by Fengguang.

    Signed-off-by: Tejun Heo
    Cc: Fengguang Wu

    Tejun Heo
     
  • Currently we attach a character "S" or "A" to the cfqq, to represent
    whether the queue is sync or async. Add one more character "N" to
    represent whether it is a sync-noidle queue or a sync queue. So the
    three different types of queues now look as follows.

    cfq1234S --> sync queue
    cfq1234SN --> sync-noidle queue
    cfq1234A --> async queue

    Previously the S/A classification was printed only if group scheduling
    was enabled. This patch also makes sure that the classification is
    displayed even if group scheduling is disabled.

    Signed-off-by: Vivek Goyal
    Acked-by: Jeff Moyer
    Signed-off-by: Tejun Heo

    Vivek Goyal
     
  • Use of the local variable "n" seems to be unnecessary. Remove it. This
    brings it in line with function __cfq_group_st_add(), which does a
    similar operation of adding a group to an rb tree.

    No functionality change here.

    Signed-off-by: Vivek Goyal
    Acked-by: Jeff Moyer
    Signed-off-by: Tejun Heo

    Vivek Goyal
     
  • choose_service_tree() selects/sets both wl_class and wl_type. Rename it
    to choose_wl_class_and_type() to make this very clear.

    cfq_choose_wl() only selects and sets wl_type. It is easy to confuse
    it with choose_st(). So rename it to cfq_choose_wl_type() to make
    clear what it does.

    Just renaming. No functionality change.

    Signed-off-by: Vivek Goyal
    Acked-by: Jeff Moyer
    Signed-off-by: Tejun Heo

    Vivek Goyal
     
  • At quite a few places we use the keyword "service_tree". At some places,
    especially for local variables, it has been abbreviated to "st".

    Also, at a couple of places the binary operator "+" was moved from the
    beginning of a line to the end of the previous line, as per Tejun's
    feedback.

    v2:
    Reverted most of the service tree name change based on Jeff Moyer's feedback.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Tejun Heo

    Vivek Goyal
     
  • Some more renaming. Again making the code uniform w.r.t use of
    wl_class/class to represent IO class (RT, BE, IDLE) and using
    wl_type/type to represent subclass (SYNC, SYNC-IDLE, ASYNC).

    At places this patch shortens the string "workload" to "wl".
    Renamed "saved_workload" to "saved_wl_type". Renamed
    "saved_serving_class" to "saved_wl_class".

    For uniformity with "saved_wl_*" variables, renamed "serving_class"
    to "serving_wl_class" and renamed "serving_type" to "serving_wl_type".

    Again, just trying to improve upon code uniformity and improve
    readability. No functional change.

    v2:
    - Restored the usage of keyword "service" based on Jeff Moyer's feedback.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Tejun Heo

    Vivek Goyal
     
  • Currently CFQ has three IO classes: RT, BE and IDLE. In many places,
    workloads belonging to these classes are referred to as "prio". This
    gets very confusing as one starts to associate it with ioprio.

    So this patch just does bunch of renaming so that reading code becomes
    easier. All reference to RT, BE and IDLE workload are done using keyword
    "class" and all references to subclass, SYNC, SYNC-IDLE, ASYNC are made
    using keyword "type".

    This makes me feel much better while I am reading the code. There is no
    functionality change due to this patch.

    Signed-off-by: Vivek Goyal
    Acked-by: Jeff Moyer
    Acked-by: Tejun Heo
    Signed-off-by: Tejun Heo

    Vivek Goyal
     

06 Nov, 2012

1 commit

  • A request is queued in the cfqq->fifo list. It looks like it's possible
    that we move a request from one cfqq to another in the request merge
    case. In such a case, adjusting the fifo list order doesn't make sense
    and is impossible unless we iterate the whole fifo list.

    My test did hit one case where the two cfqqs were different, but it
    didn't cause a kernel crash; maybe that's because the fifo list isn't
    used frequently. Anyway, from the code logic, this is buggy.

    I think we can re-enable the recursive merge logic after this is fixed.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

04 Jun, 2012

2 commits

  • cfq may be built w/ or w/o blkcg support depending on
    CONFIG_CFQ_GROUP_IOSCHED. If blkcg support is disabled, most of the
    related code is ifdef'd out but some part is left dangling -
    blkcg_policy_cfq is left zero-filled and blkcg_policy_[un]register()
    calls are made on it.

    Feeding a zero-filled policy to blkcg_policy_register() is incorrect and
    triggers the following WARN_ON() if CONFIG_BLK_CGROUP &&
    !CONFIG_CFQ_GROUP_IOSCHED.

    ------------[ cut here ]------------
    WARNING: at block/blk-cgroup.c:867
    Modules linked in:
    CPU: 3 Not tainted 3.4.0-09547-gfb21aff #1
    Process swapper/0 (pid: 1, task: 000000003ff80000, ksp: 000000003ff7f8b8)
    Krnl PSW : 0704100180000000 00000000003d76ca (blkcg_policy_register+0xca/0xe0)
    R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:1 PM:0 EA:3
    Krnl GPRS: 0000000000000000 00000000014b85ec 00000000014b85b0 0000000000000000
    000000000096fb60 0000000000000000 00000000009a8e78 0000000000000048
    000000000099c070 0000000000b6f000 0000000000000000 000000000099c0b8
    00000000014b85b0 0000000000667580 000000003ff7fd98 000000003ff7fd70
    Krnl Code: 00000000003d76be: a7280001 lhi %r2,1
    00000000003d76c2: a7f4ffdf brc 15,3d7680
    #00000000003d76c6: a7f40001 brc 15,3d76c8
    >00000000003d76ca: a7c8ffea lhi %r12,-22
    00000000003d76ce: a7f4ffce brc 15,3d766a
    00000000003d76d2: a7f40001 brc 15,3d76d4
    00000000003d76d6: a7c80000 lhi %r12,0
    00000000003d76da: a7f4ffc2 brc 15,3d765e
    Call Trace:
    ([] initcall_debug+0x0/0x4)
    [] cfq_init+0x62/0xd4
    [] do_one_initcall+0x3a/0x170
    [] kernel_init+0x214/0x2bc
    [] kernel_thread_starter+0x6/0xc
    [] kernel_thread_starter+0x0/0xc
    no locks held by swapper/0/1.
    Last Breaking-Event-Address:
    [] blkcg_policy_register+0xc6/0xe0
    ---[ end trace b8ef4903fcbf9dd3 ]---

    This patch fixes the problem by ensuring all blkcg support code is
    inside CONFIG_CFQ_GROUP_IOSCHED.

    * blkcg_policy_cfq declaration and blkg_to_cfqg() definition are moved
    inside the first CONFIG_CFQ_GROUP_IOSCHED block. __maybe_unused is
    dropped from blkcg_policy_cfq decl.

    * The blkcg_deactivate_policy() invocation is moved inside the ifdef.
    This also makes the activation logic match cfq_init_queue().

    * All blkcg_policy_[un]register() invocations are moved inside ifdef.

    Signed-off-by: Tejun Heo
    Reported-by: Heiko Carstens
    LKML-Reference:
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cfq_init() would return zero after kmem cache creation failure. Fix
    so that it returns -ENOMEM.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

20 Apr, 2012

11 commits

  • There's no reason to keep blkcg_policy_ops separate. Collapse it into
    blkcg_policy.

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently blkg_policy_data carries policy specific data as char flex
    array instead of being embedded in policy specific data. This was
    forced by oddities around blkg allocation which are all gone now.

    This patch makes blkg_policy_data embedded in policy specific data -
    throtl_grp and cfq_group so that it's more conventional and consistent
    with how io_cq is handled.

    * blkcg_policy->pdata_size is renamed to ->pd_size.

    * Functions which used to take void *pdata now takes struct
    blkg_policy_data *pd.

    * blkg_to_pdata/pdata_to_blkg() updated to blkg_to_pd/pd_to_blkg().

    * Dummy struct blkg_policy_data definition added. Dummy
    pdata_to_blkg() definition was unused and inconsistent with the
    non-dummy version - correct dummy pd_to_blkg() added.

    * throtl and cfq updated accordingly.

    * As dummy blkg_to_pd/pd_to_blkg() are provided,
    blkg_to_cfqg/cfqg_to_blkg() don't need to be ifdef'd. Moved outside
    ifdef block.

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • During the recent blkcg cleanup, most of the blkcg API has changed to
    such an extent that mass renaming wouldn't cause any noticeable pain.
    Take the chance and clean up the naming.

    * Rename blkio_cgroup to blkcg.

    * Drop blkio / blkiocg prefixes and consistently use blkcg.

    * Rename blkio_group to blkcg_gq, which is consistent with io_cq but
    keep the blkg prefix / variable name.

    * Rename policy method type and field names to signify they're dealing
    with policy data.

    * Rename blkio_policy_type to blkcg_policy.

    This patch doesn't cause any functional change.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blkio_group->path[] stores the path of the associated cgroup and is
    used only for debug messages. Just format the path from blkg->cgroup
    when printing debug messages.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * all_q_list is unused. Drop all_q_{mutex|list}.

    * @for_root of blkg_lookup_create() is always %false when called from
    outside blk-cgroup.c proper. Factor out __blkg_lookup_create() so
    that it doesn't check whether @q is bypassing and use the
    underscored version for the @for_root callsite.

    * blkg_destroy_all() is used only from blkcg proper and @destroy_root
    is always %true. Make it static and drop @destroy_root.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • All blkcg policies were assumed to be enabled on all request_queues.
    Due to various implementation obstacles, during the recent blkcg core
    updates, this was temporarily implemented as shooting down all !root
    blkgs on elevator switch and policy [de]registration combined with
    half-broken in-place root blkg updates. In addition to being buggy
    and racy, this meant losing all blkcg configurations across those
    events.

    Now that blkcg is cleaned up enough, this patch replaces the temporary
    implementation with proper per-queue policy activation. Each blkcg
    policy should call the new blkcg_[de]activate_policy() to enable and
    disable the policy on a specific queue. blkcg_activate_policy()
    allocates and installs policy data for the policy for all existing
    blkgs. blkcg_deactivate_policy() does the reverse. If a policy is
    not enabled for a given queue, blkg printing / config functions skip
    the respective blkg for the queue.

    blkcg_activate_policy() also takes care of root blkg creation, and
    cfq_init_queue() and blk_throtl_init() are updated accordingly.

    This makes blkcg_bypass_{start|end}() and update_root_blkg_pd()
    unnecessary. Dropped.

    v2: cfq_init_queue() was returning uninitialized @ret on root_group
    alloc failure if !CONFIG_CFQ_GROUP_IOSCHED. Fixed.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • With per-queue policy activation, root blkg creation will be moved to
    blkcg core. Add q->root_blkg in preparation. For blk-throtl, this
    replaces throtl_data->root_tg; however, cfq needs to keep
    cfqd->root_group for !CONFIG_CFQ_GROUP_IOSCHED.

    This is to prepare for per-queue policy activation and doesn't cause
    any functional difference.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add @pol to blkg_conf_prep() and let it return with queue lock held
    (to be released by blkg_conf_finish()). Note that @pol isn't used
    yet.

    This is to prepare for per-queue policy activation and doesn't cause
    any visible difference.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Remove BLKIO_POLICY_* enums and let blkio_policy_register() allocate
    @pol->plid dynamically on registration. The maximum number of blkcg
    policies which can be registered at the same time is defined by
    BLKCG_MAX_POLS constant added to include/linux/blkdev.h.

    Note that blkio_policy_register() now may fail. Policy init functions
    updated accordingly and unnecessary ifdefs removed from cfq_init().

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The two functions were taking "enum blkio_policy_id plid". Make them
    take "const struct blkio_policy_type *pol" instead.

    This is to prepare for per-queue policy activation and doesn't cause
    any functional difference.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * CFQ_WEIGHT_* defined inside CONFIG_BLK_CGROUP causes cfq-iosched.c
    compile failure when the config is disabled. Move it outside the
    ifdef block.

    * Dummy cfqg_stats_*() definitions were lacking inline modifiers,
    causing unused-function warnings if !CONFIG_CFQ_GROUP_IOSCHED. Add
    them.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     

02 Apr, 2012

7 commits

  • Now that all stat handling code lives in policy implementations,
    there's no need to encode policy ID in cft->private.

    * Export blkcg_prfill_[rw]stat() from blkcg, remove
    blkcg_print_[rw]stat(), and implement cfqg_print_[rw]stat(), which
    hard-code BLKIO_POLICY_PROP.

    * Use cft->private for offset of the target field directly and drop
    BLKCG_STAT_{PRIV|POL|OFF}().

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • Now that all conf and stat fields are moved into policy specific
    blkio_policy_data->pdata areas, there's no reason to use
    blkio_policy_data itself in prfill functions. Pass around @pd->pdata
    instead of @pd.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • blkio_group_conf->weight is owned by cfq and has no reason to be
    defined in blkcg core. Replace it with cfq_group->dev_weight and let
    conf setting functions directly set it. If dev_weight is zero, the
    cfqg doesn't have device specific weight configured.

    Also, rename BLKIO_WEIGHT_* constants to CFQ_WEIGHT_* and rename
    blkio_cgroup->weight to blkio_cgroup->cfq_weight. We eventually want
    per-policy storage in blkio_cgroup but just mark the ownership of the
    field for now.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • blkio_group_stats contains only fields used by cfq and has no reason
    to be defined in blkcg core.

    * Move blkio_group_stats to cfq-iosched.c and rename it to cfqg_stats.

    * blkg_policy_data->stats is replaced with cfq_group->stats.
    blkg_prfill_[rw]stat() are updated to use offset against pd->pdata
    instead.

    * All related macros / functions are renamed so that they have cfqg_
    prefix and the unnecessary @pol arguments are dropped.

    * All stat functions now take cfq_group * instead of blkio_group *.

    * lockdep assertion on queue lock dropped. Elevator runs under queue
    lock by default. There isn't much to be gained by adding lockdep
    assertions at stat function level.

    * cfqg_stats_reset() implemented for blkio_reset_group_stats_fn method
    so that cfqg->stats can be reset.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • blkio_group_stats_cpu is used to count dispatch stats using per-cpu
    counters. This is used by both blk-throtl and cfq-iosched but the
    sharing is rather silly.

    * cfq-iosched doesn't need per-cpu dispatch stats. cfq always updates
    those stats while holding queue_lock.

    * blk-throtl needs per-cpu dispatch stats but only service_bytes and
    serviced. It doesn't make use of sectors.

    This patch makes cfq add and use global stats for service_bytes,
    serviced and sectors, removes per-cpu sectors counter and moves
    per-cpu stat printing code to blk-throttle.c.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • As with the conf/stats file handling code, there's no reason for the
    stat update code to live in blkcg core with policies calling in to
    update them. The current organization is both inflexible and complex.

    This patch moves stat update code to specific policies. All
    blkiocg_update_*_stats() functions which deal with BLKIO_POLICY_PROP
    stats are collapsed into their cfq_blkiocg_update_*_stats()
    counterparts. blkiocg_update_dispatch_stats() is used by both
    policies and duplicated as throtl_update_dispatch_stats() and
    cfq_blkiocg_update_dispatch_stats(). This will be cleaned up later.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • block/cfq.h contains some functions which interact with blkcg;
    however, this is only part of it, and cfq-iosched.c already has quite
    a few #ifdef CONFIG_CFQ_GROUP_IOSCHED blocks. With conf/stat handling
    moved to specific policies, having these relay functions isolated in
    cfq.h doesn't make much sense. Collapse cfq.h into cfq-iosched.c for
    now. Let's split blkcg support out properly later if necessary.

    Signed-off-by: Tejun Heo

    Tejun Heo