09 Jul, 2014

1 commit

  • sane_behavior has been used as a development vehicle for the default
    unified hierarchy. Now that the default hierarchy is in place, the
    flag became redundant and confusing as its usage is allowed on all
    hierarchies. There will be either the default hierarchy or
    legacy ones. Let's make that clear by removing sane_behavior support
    on non-default hierarchies.

    This patch replaces cgroup_sane_behavior() with cgroup_on_dfl(). The
    comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
    cgroup_on_dfl() with sane_behavior specific part dropped.

    On the default and legacy hierarchies w/o sane_behavior, this
    shouldn't cause any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Johannes Weiner
    Cc: Michal Hocko

    Tejun Heo
     

10 Jun, 2014

1 commit

  • Pull cgroup updates from Tejun Heo:
    "A lot of activities on cgroup side. Heavy restructuring including
    locking simplification took place to improve the code base and enable
    implementation of the unified hierarchy, which currently exists behind
    a __DEVEL__ mount option. The core support is mostly complete but
    individual controllers need further work. To explain the design and
    rationales of the unified hierarchy

    Documentation/cgroups/unified-hierarchy.txt

    is added.

    Another notable change is css (cgroup_subsys_state - what each
    controller uses to identify and interact with a cgroup) iteration
    update. This is part of continuing updates on css object lifetime and
    visibility. cgroup started with reference count draining on removal
    way back and is now reaching a point where csses behave and are
    iterated like normal refcnted objects albeit with some complexities to
    allow distinguishing the state where they're being deleted. The css
    iteration update isn't taken advantage of yet but is planned to be
    used to simplify memcg significantly"

    * 'for-3.16' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (77 commits)
    cgroup: disallow disabled controllers on the default hierarchy
    cgroup: don't destroy the default root
    cgroup: disallow debug controller on the default hierarchy
    cgroup: clean up MAINTAINERS entries
    cgroup: implement css_tryget()
    device_cgroup: use css_has_online_children() instead of has_children()
    cgroup: convert cgroup_has_live_children() into css_has_online_children()
    cgroup: use CSS_ONLINE instead of CGRP_DEAD
    cgroup: iterate cgroup_subsys_states directly
    cgroup: introduce CSS_RELEASED and reduce css iteration fallback window
    cgroup: move cgroup->serial_nr into cgroup_subsys_state
    cgroup: link all cgroup_subsys_states in their sibling lists
    cgroup: move cgroup->sibling and ->children into cgroup_subsys_state
    cgroup: remove cgroup->parent
    device_cgroup: remove direct access to cgroup->children
    memcg: update memcg_has_children() to use css_next_child()
    memcg: remove tasks/children test from mem_cgroup_force_empty()
    cgroup: remove css_parent()
    cgroup: skip refcnting on normal root csses and cgrp_dfl_root self css
    cgroup: use cgroup->self.refcnt for cgroup refcnting
    ...

    Linus Torvalds
     

14 May, 2014

1 commit

  • Convert all cftype->write_string() users to the new cftype->write()
    which maps directly to kernfs write operation and has full access to
    kernfs and cgroup contexts. The conversions are mostly mechanical.

    * @css and @cft are accessed using of_css() and of_cft() accessors
    respectively instead of being specified as arguments.

    * Should return @nbytes on success instead of 0.

    * @buf is not trimmed automatically. Trim if necessary. Note that
    blkcg and netprio don't need this as the parsers already handle
    whitespaces.
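    The three conventions above can be sketched in userspace roughly as
    follows. The handler name, its signature and the local trim helper
    are hypothetical stand-ins; real handlers receive a kernfs open file
    and reach @css and @cft through of_css() and of_cft(), which are not
    modelled here.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for a kernel-side trim helper: strips leading and
 * trailing whitespace in place and returns the first non-blank character. */
static char *trim(char *s)
{
    size_t len;

    while (isspace((unsigned char)*s))
        s++;
    len = strlen(s);
    while (len && isspace((unsigned char)s[len - 1]))
        s[--len] = '\0';
    return s;
}

/* Hypothetical write handler: @buf arrives untrimmed and NUL-terminated,
 * and success is reported as @nbytes rather than 0. */
static ssize_t sketch_limit_write(char *buf, size_t nbytes, long *limit)
{
    char *end;
    long val;

    buf = trim(buf);            /* the handler has to trim by itself */
    if (!*buf)
        return -EINVAL;
    val = strtol(buf, &end, 10);
    if (*end)
        return -EINVAL;
    *limit = val;
    return (ssize_t)nbytes;     /* whole write consumed on success */
}

/* Usage: writing "  42\n" stores 42 and returns the original byte count. */
static int sketch_write_demo(void)
{
    char buf[] = "  42\n";
    long limit = 0;
    ssize_t ret = sketch_limit_write(buf, strlen(buf), &limit);

    return ret == (ssize_t)strlen("  42\n") && limit == 42;
}
```

    Returning 0 from such a handler would look like a zero-length write
    to the caller, which is why the success value is the byte count.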

    cftype->write_string() has no user left after the conversions and is
    removed.

    While at it, remove unnecessary local variable @p in
    cgroup_subtree_control_write() and stale comment about
    CGROUP_LOCAL_BUFFER_SIZE in cgroup_freezer.c.

    This patch doesn't introduce any visible behavior changes.

    v2: netprio was missing from conversion. Converted.

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Neil Horman
    Cc: "David S. Miller"

    Tejun Heo
     

03 May, 2014

1 commit


22 Apr, 2014

1 commit


19 Mar, 2014

1 commit

  • cftype->write_string() just passes on the writeable buffer from kernfs
    and there's no reason to add const restriction on the buffer. The
    only thing const achieves is unnecessarily complicating parsing of the
    buffer. Drop const from @buffer.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Daniel Borkmann
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki

    Tejun Heo
     

12 Feb, 2014

1 commit

  • cftype->max_write_len is used to extend the maximum size of writes.
    It's interpreted in such a way that the actual maximum size is one
    less than the specified value. The default size is defined by
    CGROUP_LOCAL_BUFFER_SIZE. Its interpretation is quite confusing - its
    value is decremented by 1 and then compared for equality with max
    size, which means that the actual default size is
    CGROUP_LOCAL_BUFFER_SIZE - 2, which is 62 chars.

    There's no point in having a limit that low. Update its definition so
    that it means the actual string length sans termination and anything
    below PAGE_SIZE-1 is treated as PAGE_SIZE-1.
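    The arithmetic can be checked with a small model of the two
    interpretations. This is only a sketch: the function names are made
    up, CGROUP_LOCAL_BUFFER_SIZE is taken as 64 because that is what the
    "62 chars" figure implies, and the page size is assumed to be 4096.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CGROUP_LOCAL_BUFFER_SIZE 64   /* implied by the "62 chars" figure */
#define SKETCH_PAGE_SIZE 4096         /* assumption; PAGE_SIZE is arch dependent */

/* Old rule: the default limit is the buffer size minus one, and a write
 * whose length reaches the limit is rejected, so 62 bytes is the most
 * that gets through. An explicit max_write_len behaves the same way. */
static bool old_write_fits(size_t nbytes, size_t max_write_len)
{
    size_t max_bytes = max_write_len ? max_write_len
                                     : CGROUP_LOCAL_BUFFER_SIZE - 1;

    return nbytes < max_bytes;
}

/* New rule: max_write_len is the string length without the terminator,
 * and anything below PAGE_SIZE-1 is raised to PAGE_SIZE-1. */
static bool new_write_fits(size_t nbytes, size_t max_write_len)
{
    size_t max_bytes = SKETCH_PAGE_SIZE - 1;

    if (max_write_len > max_bytes)
        max_bytes = max_write_len;
    return nbytes <= max_bytes;
}
```

    With these definitions old_write_fits(62, 0) holds while
    old_write_fits(63, 0) does not, and the new rule accepts up to 4095
    bytes when no explicit limit is given.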

    .max_write_len for "release_agent" is updated to PATH_MAX-1 and
    cgroup_release_agent_write() is updated so that the redundant strlen()
    check is removed and it uses strlcpy() instead of strcpy().
    .max_write_len initializations in blk-throttle.c and cfq-iosched.c are
    no longer necessary and removed. The one in cpuset is kept unchanged
    as it's an approximated value to begin with.

    This will also make transition to kernfs smoother.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     

31 Jan, 2014

1 commit

  • Pull core block IO changes from Jens Axboe:
    "The major piece in here is the immutable bio_vec series from Kent, the
    rest is fairly minor. It was supposed to go in last round, but
    various issues pushed it to this release instead. The pull request
    contains:

    - Various smaller blk-mq fixes from different folks. Nothing major
    here, just minor fixes and cleanups.

    - Fix for a memory leak in the error path in the block ioctl code
    from Christian Engelmayer.

    - Header export fix from CaiZhiyong.

    - Finally the immutable biovec changes from Kent Overstreet. This
    enables some nice future work on making arbitrarily sized bios
    possible, and splitting more efficient. Related fixes to immutable
    bio_vecs:

    - dm-cache immutable fixup from Mike Snitzer.
    - btrfs immutable fixup from Muthu Kumar.

    - bio-integrity fix from Nic Bellinger, which is also going to stable"

    * 'for-3.14/core' of git://git.kernel.dk/linux-block: (44 commits)
    xtensa: fixup simdisk driver to work with immutable bio_vecs
    block/blk-mq-cpu.c: use hotcpu_notifier()
    blk-mq: for_each_* macro correctness
    block: Fix memory leak in rw_copy_check_uvector() handling
    bio-integrity: Fix bio_integrity_verify segment start bug
    block: remove unrelated header files and export symbol
    blk-mq: uses page->list incorrectly
    blk-mq: use __smp_call_function_single directly
    btrfs: fix missing increment of bi_remaining
    Revert "block: Warn and free bio if bi_end_io is not set"
    block: Warn and free bio if bi_end_io is not set
    blk-mq: fix initializing request's start time
    block: blk-mq: don't export blk_mq_free_queue()
    block: blk-mq: make blk_sync_queue support mq
    block: blk-mq: support draining mq queue
    dm cache: increment bi_remaining when bi_end_io is restored
    block: fixup for generic bio chaining
    block: Really silence spurious compiler warnings
    block: Silence spurious compiler warnings
    block: Kill bio_pair_split()
    ...

    Linus Torvalds
     

06 Dec, 2013

1 commit

  • In preparation of conversion to kernfs, cgroup file handling is
    updated so that it can be easily mapped to kernfs. This patch
    replaces cftype->read_seq_string() with cftype->seq_show() which is
    not limited to single_open() operation and will map directly to
    kernfs seq_file interface.

    The conversions are mechanical. As ->seq_show() doesn't have @css and
    @cft, the functions which make use of them are converted to use
    seq_css() and seq_cft() respectively. In several occasions, e.g. if
    it has seq_string in its name, the function name is updated to fit the
    new method better.

    This patch does not introduce any behavior changes.

    Signed-off-by: Tejun Heo
    Acked-by: Aristeu Rozanski
    Acked-by: Vivek Goyal
    Acked-by: Michal Hocko
    Acked-by: Daniel Wagner
    Acked-by: Li Zefan
    Cc: Jens Axboe
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Neil Horman

    Tejun Heo
     

24 Nov, 2013

1 commit

  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
     

13 Nov, 2013

1 commit

  • Now that seqcounts are lockdep enabled objects, we need to explicitly
    initialize runtime allocated seqcounts so that lockdep can track them.

    Without this patch, Fengguang was seeing:

    [ 4.127282] INFO: trying to register non-static key.
    [ 4.128027] the code is fine but needs lockdep annotation.
    [ 4.128027] turning off the locking correctness validator.
    [ 4.128027] CPU: 0 PID: 96 Comm: kworker/u4:1 Not tainted 3.12.0-next-20131108-10601-gbad570d #2
    [ 4.128027] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    [ ... ]
    [ 4.128027] Call Trace:
    [ 4.128027] [] ? console_unlock+0x353/0x380
    [ 4.128027] [] dump_stack+0x48/0x60
    [ 4.128027] [] __lock_acquire.isra.26+0x7e3/0xceb
    [ 4.128027] [] lock_acquire+0x71/0x9a
    [ 4.128027] [] ? blk_throtl_bio+0x1c3/0x485
    [ 4.128027] [] throtl_update_dispatch_stats+0x7c/0x153
    [ 4.128027] [] ? blk_throtl_bio+0x1c3/0x485
    [ 4.128027] [] blk_throtl_bio+0x1c3/0x485
    ...

    Use u64_stats_init() for all affected data structures, which initializes
    the seqcount.

    Reported-and-Tested-by: Fengguang Wu
    Cc: Vivek Goyal
    Cc: Jens Axboe
    Signed-off-by: Peter Zijlstra
    [ Folded in another fix from the mailing list as well as a fix to that fix. Tweaked commit message. ]
    Signed-off-by: John Stultz
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1384314134-6895-1-git-send-email-john.stultz@linaro.org
    [ So I actually think that the two SOBs from PeterZ are the right depiction of the patch route. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Aug, 2013

3 commits

  • Previously, all css descendant iterators didn't include the origin
    (root of subtree) css in the iteration. The reasons were maintaining
    consistency with css_for_each_child() and that at the time of
    introduction more use cases needed skipping the origin anyway;
    however, given that css_is_descendant() considers self to be a
    descendant, omitting the origin css has become more confusing and
    looking at the accumulated use cases rather clearly indicates that
    including origin would result in simpler code overall.

    While this is a change which can easily lead to subtle bugs, cgroup
    API including the iterators has recently gone through major
    restructuring and no out-of-tree changes will be applicable without
    adjustments making this a relatively acceptable opportunity for this
    type of change.

    The conversions are mostly straight-forward. If the iteration block
    had explicit origin handling before or after, it's moved inside the
    iteration. If not, if (pos == origin) continue; is added. Some
    conversions add extra reference get/put around origin handling by
    consolidating origin handling and the rest. While the extra ref
    operations aren't strictly necessary, this shouldn't cause any
    noticeable difference.
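    A toy model of the converted iteration pattern, for illustration
    only. The node type and helper names are invented; the real
    iterators walk cgroup_subsys_state under RCU, which is not modelled.
    The point is just that the first node returned is the origin and
    that callers which don't want it skip it explicitly.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *parent, *first_child, *next_sibling;
};

static void add_child(struct node *parent, struct node *child)
{
    child->parent = parent;
    child->next_sibling = parent->first_child;
    parent->first_child = child;
}

/* Pre-order step. Passing pos == NULL starts the walk, and the first
 * node handed back is the origin itself. */
static struct node *next_pre(struct node *pos, struct node *origin)
{
    if (!pos)
        return origin;
    if (pos->first_child)
        return pos->first_child;
    while (pos != origin) {
        if (pos->next_sibling)
            return pos->next_sibling;
        pos = pos->parent;
    }
    return NULL;
}

/* A user that only cares about strict descendants skips the origin. */
static int count_below(struct node *origin)
{
    int n = 0;

    for (struct node *pos = next_pre(NULL, origin); pos;
         pos = next_pre(pos, origin)) {
        if (pos == origin)
            continue;
        n++;
    }
    return n;
}

/* Usage: root has children a and b, and a has child c. */
static int count_below_demo(void)
{
    struct node root = {0}, a = {0}, b = {0}, c = {0};

    add_child(&root, &a);
    add_child(&root, &b);
    add_child(&a, &c);
    return count_below(&root) == 3 && count_below(&a) == 1;
}
```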

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Michal Hocko
    Cc: Jens Axboe
    Cc: Matt Helsley
    Cc: Johannes Weiner
    Cc: Balbir Singh

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using css
    (cgroup_subsys_state) as the primary handle instead of cgroup in
    subsystem API. For hierarchy iterators, this is beneficial because

    * In most cases, css is the only thing subsystems care about anyway.

    * On the planned unified hierarchy, iterations for different
    subsystems will need to skip over different subtrees of the
    hierarchy depending on which subsystems are enabled on each cgroup.
    Passing around css makes it unnecessary to explicitly specify the
    subsystem in question as css is the intersection between cgroup and
    subsystem.

    * For the planned unified hierarchy, css's would need to be created
    and destroyed dynamically independent from cgroup hierarchy. Having
    cgroup core manage css iteration makes enforcing deref rules a lot
    easier.

    Most subsystem conversions are straight-forward. Noteworthy changes
    are

    * blkio: cgroup_to_blkcg() is no longer used. Removed.

    * freezer: cgroup_freezer() is no longer used. Removed.

    * devices: cgroup_to_devcgroup() is no longer used. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe

    Tejun Heo
     
  • cgroup is currently in the process of transitioning to using struct
    cgroup_subsys_state * as the primary handle instead of struct cgroup.
    Please see the previous commit which converts the subsystem methods
    for rationale.

    This patch converts all cftype file operations to take @css instead of
    @cgroup. cftypes for the cgroup core files don't have their subsystem
    pointer set. These will automatically use the dummy_css added by the
    previous patch and can be converted the same way.

    Most subsystem conversions are straightforward but there are some
    interesting ones.

    * freezer: update_if_frozen() is also converted to take @css instead
    of @cgroup for consistency. This will make the code look simpler
    too once iterators are converted to use css.

    * memory/vmpressure: mem_cgroup_from_css() needs to be exported to
    vmpressure while mem_cgroup_from_cont() can be made static.
    Updated accordingly.

    * cpu: cgroup_tg() doesn't have any user left. Removed.

    * cpuacct: cgroup_ca() doesn't have any user left. Removed.

    * hugetlb: hugetlb_cgroup_from_cgroup() doesn't have any user left.
    Removed.

    * net_cls: cgrp_cls_state() doesn't have any user left. Removed.

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan
    Acked-by: Michal Hocko
    Acked-by: Vivek Goyal
    Acked-by: Aristeu Rozanski
    Acked-by: Daniel Wagner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Johannes Weiner
    Cc: Balbir Singh
    Cc: Matt Helsley
    Cc: Jens Axboe
    Cc: Steven Rostedt

    Tejun Heo
     

15 May, 2013

26 commits

  • With the recent updates, blk-throttle is finally ready for proper
    hierarchy support. Dispatching now honors service_queue->parent_sq
    and propagates correctly. The only thing missing is setting
    ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
    hierarchy.

    This patch updates throtl_pd_init() such that service_queues form the
    same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
    As this concludes proper hierarchy support for blkcg, the shameful
    .broken_hierarchy tag is removed from blkio_subsys.

    v2: Updated blkio-controller.txt as suggested by Vivek.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Li Zefan

    Tejun Heo
     
  • blk_throtl_bio() has a quick exit path for throtl_grps without limits
    configured. It looks at the bps and iops limits and if both are not
    configured, the bio is issued immediately. While this is correct in
    the current flat hierarchy as each throtl_grp behaves completely
    independently, it would become wrong in proper hierarchy mode. A
    group without any limits could still be limited by one of its
    ancestors and bio's queued for such group should not bypass
    blk-throtl.

    As having a quick bypass mechanism is beneficial, this patch
    reimplements the mechanism such that it's correct even with proper
    hierarchy. throtl_grp->has_rules[] is added. These booleans are
    updated for the whole subtree whenever a config is updated so that
    has_rules[] of the whole subtree stays synchronized. They're also
    updated when a new throtl_grp comes online so that it can't escape the
    limits of its ancestors.
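    The inheritance rule can be sketched as below. This is a simplified
    model rather than the patch itself: the struct layout, the direction
    enum and the NO_LIMIT sentinel are assumptions, and the function has
    to be called parent-first for the result to be meaningful.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { READ_DIR, WRITE_DIR, NR_DIRS };   /* hypothetical direction index */
#define NO_LIMIT UINT64_MAX              /* assumed "not configured" value */

struct tg {
    struct tg *parent;
    uint64_t bps[NR_DIRS];
    uint64_t iops[NR_DIRS];
    bool has_rules[NR_DIRS];
};

/* A group has rules in a direction if it has its own limit there or if
 * its parent already has rules in that direction. */
static void sketch_update_has_rules(struct tg *tg)
{
    for (int rw = 0; rw < NR_DIRS; rw++)
        tg->has_rules[rw] =
            (tg->parent && tg->parent->has_rules[rw]) ||
            tg->bps[rw] != NO_LIMIT ||
            tg->iops[rw] != NO_LIMIT;
}

/* Usage: only the parent has a READ bps limit, yet the child must not
 * take the quick exit for READs. WRITEs stay unrestricted. */
static bool has_rules_demo(void)
{
    struct tg parent = { NULL, { NO_LIMIT, NO_LIMIT },
                         { NO_LIMIT, NO_LIMIT }, { false, false } };
    struct tg child = parent;

    child.parent = &parent;
    parent.bps[READ_DIR] = 1 << 20;
    sketch_update_has_rules(&parent);    /* parent first, then subtree */
    sketch_update_has_rules(&child);
    return child.has_rules[READ_DIR] && !child.has_rules[WRITE_DIR];
}
```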

    As no throtl_grp has another throtl_grp as parent now, this patch
    doesn't yet make any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • With the planned proper hierarchy support, a bio will climb up the
    tree before actually being dispatched. This makes sure bio is also
    subjected to parent's throttling limits, if any.

    It might happen that parent is idle and when bio is transferred to
    parent, a new slice starts fresh. But that is incorrect as parent's
    wait time should have started when bio was queued in child group, and
    it causes IOs to be throttled more than configured as they climb the
    hierarchy.

    Given the fact that we have not written hierarchical algorithm in a
    way where child's and parent's time slices are synchronized, we
    transfer the child's start time to parent if parent was idling. If
    parent was busy doing dispatch of other bios all this while, this is
    not an issue.

    Child's slice start time is passed to parent. Parent looks at its
    last expired slice start time. If child's start time is after parent's
    old start time, that means parent had been idle and after parent
    went idle, child had an IO queued. So use child's start time as
    parent start time.

    If parent's start time is after child's start time, that means,
    when IO got queued in child group, parent was not idle. But later
    it dispatched some IO, its slice got trimmed and then it went idle.
    After a while child's request got shifted in parent group. In this
    case use parent's old start time as new start time as that's the
    duration of slice we did not use.
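    Both cases reduce to picking the later of the two timestamps. A
    minimal sketch, assuming jiffies-style unsigned tick counters that
    may wrap; the helper names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Wraparound-safe "a is later than b" for tick counters. Relies on the
 * usual two's complement conversion to a signed type. */
static bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(a - b) > 0;
}

/* Start time to use for the parent's new slice when a bio arrives from
 * a child while the parent has no slice in use. */
static unsigned long parent_slice_start(unsigned long child_start,
                                        unsigned long parent_old_start)
{
    return tick_after(child_start, parent_old_start) ? child_start
                                                     : parent_old_start;
}
```

    For instance, parent_slice_start(100, 50) picks the child's start
    (parent went idle before the child queued), while
    parent_slice_start(50, 100) keeps the parent's old start.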

    This logic is far from perfect as if there are multiple children
    then first child transferring the bio decides the start time while
    a bio might have queued up even earlier in other child, which is
    yet to be transferred up to parent. In that case we will lose
    time and bandwidth in parent. This patch is just an approximation
    to make situation somewhat better.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Tejun Heo

    Vivek Goyal
     
  • With flat hierarchy, there's only single level of dispatching
    happening and fairness beyond that point is the responsibility of the
    rest of the block layer and driver, which usually works out okay;
    however, with the planned hierarchy support,
    service_queue->bio_lists[] can be filled up by bios from a single
    source. While the limits would still be honored, it'd be very easy to
    starve IOs from siblings or children.

    To avoid such starvation, this patch implements throtl_qnode and
    converts service_queue->bio_lists[] to lists of per-source qnodes
    which in turn contain the bios. For example, when a bio is
    dispatched from a child group, the bio doesn't get queued on
    ->bio_lists[] directly but it first gets queued on the group's qnode
    which in turn gets queued on service_queue->queued[]. When
    dispatching for the upper level, the ->queued[] list is consumed in
    round-robin order so that the dispatch window is consumed fairly by
    all IO sources.
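    The round-robin idea can be modelled with fixed-size arrays. This is
    a sketch only: bios are plain integers, a single direction is
    modelled, and the type and function names are invented for the
    example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BIOS 8
#define MAX_SRC  4

struct qnode {                 /* per-source FIFO of bio ids */
    int bios[MAX_BIOS];
    int head, tail;
};

struct sq {                    /* ordered list of non-empty qnodes */
    struct qnode *active[MAX_SRC];
    int nr;
};

static void sq_add_bio(struct sq *sq, struct qnode *qn, int bio)
{
    bool was_empty = qn->head == qn->tail;

    assert(qn->tail < MAX_BIOS);
    qn->bios[qn->tail++] = bio;
    if (was_empty) {           /* qnode becomes active on its first bio */
        assert(sq->nr < MAX_SRC);
        sq->active[sq->nr++] = qn;
    }
}

/* Take one bio from the qnode at the head of the list, then rotate that
 * qnode to the tail if it still has bios. Returns -1 when empty. */
static int sq_pop_bio(struct sq *sq)
{
    struct qnode *qn;
    int bio;

    if (!sq->nr)
        return -1;
    qn = sq->active[0];
    bio = qn->bios[qn->head++];
    for (int i = 1; i < sq->nr; i++)
        sq->active[i - 1] = sq->active[i];
    sq->nr--;
    if (qn->head != qn->tail)
        sq->active[sq->nr++] = qn;
    else
        qn->head = qn->tail = 0;
    return bio;
}

/* Usage: source a queues 1, 2, 3 and source b queues 10. The second bio
 * dispatched is b's, not a's second one. */
static int second_dispatched_demo(void)
{
    struct qnode a = {0}, b = {0};
    struct sq sq = {0};

    sq_add_bio(&sq, &a, 1);
    sq_add_bio(&sq, &a, 2);
    sq_add_bio(&sq, &a, 3);
    sq_add_bio(&sq, &b, 10);
    sq_pop_bio(&sq);
    return sq_pop_bio(&sq);
}
```

    With a single shared FIFO the same sequence would dispatch 1, 2, 3
    before 10, which is the starvation pattern the qnodes avoid.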

    There are two ways a bio can come to a throtl_grp - directly queued to
    the group or dispatched from a child. For the former
    throtl_grp->qnode_on_self[rw] is used. For the latter, the child's
    ->qnode_on_parent[rw].

    Note that this means that the child which is contributing a bio to its
    parent should stay pinned until all its bios are dispatched to its
    grand-parent. This patch moves blkg refcnting from bio add/remove
    spots to qnode activation/deactivation so that the blkg containing an
    active qnode is always pinned. As child pins the parent, this is
    sufficient for keeping the relevant sub-tree pinned while bios are in
    flight.

    The starvation issue was spotted by Vivek Goyal.

    v2: The original patch used the same throtl_grp->qnode_on_self/parent
    for reads and writes causing RWs to be queued incorrectly if there
    already are outstanding IOs in the other direction. They should
    be throtl_grp->qnode_on_self/parent[2] so that READs and WRITEs
    can use different qnodes. Spotted by Vivek Goyal.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_pending_timer_fn() currently assumes that the parent_sq is the
    top level one and the bio's dispatched are ready to be issued;
    however, this assumption will be wrong with proper hierarchy support.
    This patch makes the following changes to make
    throtl_pending_timer_fn() ready for hierarchy.

    * If the parent_sq isn't the top-level one, update the parent
    throtl_grp's dispatch time and schedule the next dispatch as
    necessary. If the parent's dispatch time is now, repeat the
    function for the parent throtl_grp.

    * If the parent_sq is the top-level one, kick issue work_item as
    before.

    * The debug message printed by throtl_log() now prints out the
    service_queue's nr_queued[] instead of the total nr_queued as the
    latter becomes uninteresting and misleading with hierarchical
    dispatch.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • tg_dispatch_one_bio() currently assumes that the parent_sq is the top
    level one and the bio being dispatched is ready to be issued; however,
    this assumption will be wrong with proper hierarchy support. This
    patch makes the following changes to make tg_dispatch_one_bio() ready
    for hierarchy.

    * throtl_data->nr_queued[] is incremented in blk_throtl_bio() instead
    of throtl_add_bio_tg() so that throtl_add_bio_tg() can be used to
    transfer a bio from a child tg to its parent.

    * tg_dispatch_one_bio() is updated to distinguish whether its parent
    is another throtl_grp or the throtl_data. If former, the bio is
    transferred to the parent throtl_grp using throtl_add_bio_tg(). If
    latter, the bio is ready to be issued and put on the top-level
    service_queue's bio_lists[] and throtl_data->nr_queued is
    decremented.

    As all throtl_grps currently have the top level service_queue as their
    ->parent_sq, this patch in itself doesn't make any behavior
    difference.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, blk_throtl_bio() issues the passed in bio directly if it's
    within limits of its associated tg (throtl_grp). This behavior
    becomes incorrect with hierarchy support as the bio should be
    accounted to and throttled by the ancestor throtl_grps too.

    This patch makes the direct issue path of blk_throtl_bio() loop
    until it reaches the top-level service_queue or gets throttled. If
    the former, the bio can be issued directly; otherwise, it gets queued
    at the first layer where it was above limits.
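    A minimal model of that loop, under loud assumptions: each group has
    a single byte budget for the current slice instead of real bps/iops
    accounting, and a NULL parent stands in for the top-level
    service_queue. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct tg {
    struct tg *parent;         /* NULL: next stop is the top level */
    unsigned int budget;       /* bytes still allowed in this slice */
};

/* Charge @size to @tg and each ancestor in turn. Returns the first
 * group at which the bio is over the limit (it would be queued there),
 * or NULL if it passed every layer and can be issued directly. */
static struct tg *charge_upwards(struct tg *tg, unsigned int size)
{
    while (tg) {
        if (size > tg->budget)
            return tg;
        tg->budget -= size;
        tg = tg->parent;
    }
    return NULL;
}

/* Usage: the child has plenty of budget but the root only has 100. */
static int charge_demo(void)
{
    struct tg root = { NULL, 100 };
    struct tg child = { &root, 1000 };
    struct tg *first = charge_upwards(&child, 64);   /* fits everywhere */
    struct tg *second = charge_upwards(&child, 64);  /* root has 36 left */

    return first == NULL && second == &root && child.budget == 872;
}
```

    Note that the child is charged for the second bio even though the
    bio ends up queued at the root, matching the requirement that a bio
    be accounted to every layer it passes.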

    As tg->parent_sq is always the top-level service queue currently, this
    patch in itself doesn't make any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • The current blk_throtl_drain() assumes that all active throtl_grps are
    queued on throtl_data->service_queue, which won't be true once
    hierarchy support is implemented.

    This patch makes blk_throtl_drain() perform post-order walk of the
    blkg hierarchy draining each associated throtl_grp, which guarantees
    that all bios will eventually be pushed to the top-level service_queue
    in throtl_data.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, blk_throtl_dispatch_work_fn() is responsible for both
    dispatching bio's from throtl_grp's according to their limits and then
    issuing the dispatched bios.

    This patch moves the dispatch part to throtl_pending_timer_fn() so
    that the work item is kicked iff there are bio's to issue. This is to
    avoid work item execution at each step when hierarchy support is
    enabled. bio's will be dispatched towards the top-level service_queue
    from the timers at each layer and the work item will only be used to
    issue the bio's which reached the top-level service_queue.

    While fetching bio's to issue from bio_lists[],
    blk_throtl_dispatch_work_fn() fetches all READs before WRITEs. While
    the original code also dispatched READs first, if multiple throtl_grps
    are dispatched on the same run, WRITEs from throtl_grp which is
    dispatched first would precede READs from throtl_grps which are
    dispatched later. While this is a behavior change, given that the
    previous code already prioritized READs and block layer generally
    prioritizes and segregates READs from WRITEs, this isn't likely to
    make any noticeable differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_select_dispatch() only dispatches throtl_quantum bios on each
    invocation. blk_throtl_dispatch_work_fn() in turn depends on
    throtl_schedule_next_dispatch() scheduling the next dispatch window
    immediately so that undue delays aren't incurred. This effectively
    chains multiple dispatch work item executions back-to-back when there
    are more than throtl_quantum bios to dispatch on a given tick.

    There is no reason to finish the current work item just to repeat it
    immediately. This patch makes throtl_schedule_next_dispatch() return
    %false without doing anything if the current dispatch window is still
    open and updates blk_throtl_dispatch_work_fn() to repeat dispatching
    after cpu_relax() on %false return.

    This change will help implementing hierarchy support as dispatching
    will be done from pending_timer and immediate reschedule of timer
    function isn't supported and doesn't make much sense.

    While this patch changes how dispatch behaves when there are more than
    throtl_quantum bios to dispatch on a single tick, the behavior change
    is immaterial.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, throtl_data->dispatch_work is a delayed_work item which
    handles both delayed dispatch and issuing bios. The two tasks will be
    separated to support proper hierarchy. To prepare for that, this
    patch separates out the timer into throtl_service_queue->pending_timer
    from throtl_data->dispatch_work and makes the latter a work_struct.

    * As the timer is now per-service_queue, it's initialized and
    del_sync'd as its corresponding service_queue is created and
    destroyed. The timer, when triggered, simply schedules
    throtl_data->dispatch_work for execution.

    * throtl_schedule_delayed_work() is renamed to
    throtl_schedule_pending_timer() and takes @sq and @expires now.

    * Similarly, throtl_schedule_next_dispatch() now takes @sq, which
    should be the parent_sq of the service_queue which just got a new
    bio or updated. As the parent_sq is always the top-level
    service_queue now, this doesn't change anything at this point.

    This patch doesn't introduce any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • With proper hierarchy support, a bio can be dispatched multiple times
    until it reaches the top-level service_queue and we don't want to
    update dispatch stats at each step. They are local stats and will be
    kept local. If recursive stats are necessary, they should be
    implemented separately and definitely not by updating counters
    recursively on each dispatch.

    This patch moves REQ_THROTTLED setting to throtl_charge_bio() and gates
    stats update with it so that dispatch stats are updated only on the
    first time the bio is charged to a throtl_grp, which will always be
    the throtl_grp the bio was originally queued to.

    This means that REQ_THROTTLED would be set even for bios which don't
    get throttled. As we don't want bios to leave blk-throtl with the
    flag set, move REQ_THROTTLED clearing to the end of blk_throtl_bio()
    and clear if the bio is being issued directly.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Now that both throtl_data and throtl_grp embed throtl_service_queue,
    we can unify throtl_log() and throtl_log_tg().

    * sq_to_tg() is added. This returns the throtl_grp a service_queue is
    embedded in. If the service_queue is the top-level one embedded in
    throtl_data, NULL is returned.

    * sq_to_td() is added. A service_queue is always associated with a
    throtl_data. This function finds the associated td and returns it.

    * throtl_log() is updated to take throtl_service_queue instead of
    throtl_data. If the service_queue is one embedded in throtl_grp, it
    prints the same header as throtl_log_tg() did. If it's one embedded
    in throtl_data, it behaves the same as before. This renders
    throtl_log_tg() unnecessary. Removed.

    This change is necessary for hierarchy support as we're gonna be using
    the same code paths to dispatch bios to intermediate service_queues
    embedded in throtl_grps and the top-level service_queue embedded in
    throtl_data.

    This patch doesn't make any behavior changes.

    v2: throtl_log() didn't print a space after blkg path. Updated so
    that it prints a space after throtl_grp path. Spotted by Vivek.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
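
The sq_to_tg() and sq_to_td() helpers described in the commit above could look roughly like the following userspace sketch. All type and function names carry a model_ prefix or _model suffix because they are hypothetical; the sketch also assumes that the top-level service_queue is the one whose parent_sq is NULL, and it uses an offsetof()-based macro in place of the kernel's container_of().

```c
#include <assert.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() helper. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct sq_model {
	struct sq_model *parent_sq;	/* NULL for the top-level queue (assumption) */
};

struct td_model {
	struct sq_model service_queue;	/* top-level service queue */
};

struct tg_model {
	struct sq_model service_queue;	/* per-group service queue */
	struct td_model *td;		/* throttle data this group belongs to */
};

/* Return the group @sq is embedded in, or NULL for the top-level queue. */
static inline struct tg_model *model_sq_to_tg(struct sq_model *sq)
{
	if (sq && sq->parent_sq)
		return model_container_of(sq, struct tg_model, service_queue);
	return NULL;
}

/* Return the throttle data associated with @sq, whichever level it is on. */
static inline struct td_model *model_sq_to_td(struct sq_model *sq)
{
	struct tg_model *tg = model_sq_to_tg(sq);

	assert(sq != NULL);
	if (tg)
		return tg->td;
	return model_container_of(sq, struct td_model, service_queue);
}
```

With both helpers available, a single logging function can take a service queue and decide by itself whether a group header should be printed.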
     
  • To prepare for hierarchy support, this patch adds
    throtl_service_queue->parent_sq which points to the parent
    service_queue. Currently, for all service_queues embedded in
    throtl_grps, it points to throtl_data->service_queue. As
    throtl_data->service_queue doesn't have a parent, its parent_sq is set
    to NULL.

    There are a number of functions which take both throtl_grp *tg and
    throtl_service_queue *parent_sq. With this patch, the parent
    service_queue can be determined from @tg and the @parent_sq arguments
    are removed.

    This patch doesn't make any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • When blk_throtl_bio() wants to queue a bio to a tg (throtl_grp), it
    avoids invoking tg_update_disptime() and
    throtl_schedule_next_dispatch() if the tg already has bios queued in
    that direction. As a new bio is appended after the existing ones, it
    can't change the tg's next dispatch time or the parent's dispatch
    schedule.

    This optimization is currently open coded in blk_throtl_bio().
    Whether the target biolist was occupied was recorded in a local
    variable and later used to skip disptime update. This patch
    generalizes it so that throtl_add_bio_tg() sets a new flag
    THROTL_TG_WAS_EMPTY if the biolist was empty before the new bio was
    added. tg_update_disptime() clears the flag automatically.
    blk_throtl_bio() is updated to simply test the flag before updating
    disptime.

    This patch doesn't make any functional differences now but will enable
    using the same optimization for recursive dispatch.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
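
The WAS_EMPTY optimization described in the commit above can be modeled as follows. This is a sketch under stated assumptions: struct tg_model, MODEL_TG_WAS_EMPTY and the three function names are hypothetical, the bio list is reduced to a per-direction counter, and disptime_updates exists only so the effect can be observed.

```c
#include <assert.h>

enum { MODEL_TG_WAS_EMPTY = 1 << 0 };	/* hypothetical flag value */

struct tg_model {
	unsigned int flags;
	unsigned int nr_queued[2];	/* [0] = reads, [1] = writes */
	unsigned int disptime_updates;	/* counts recalculations, for the demo */
};

/* Queue one bio; remember whether that direction was empty beforehand. */
static inline void model_add_bio(struct tg_model *tg, int rw)
{
	assert(rw == 0 || rw == 1);
	if (!tg->nr_queued[rw])
		tg->flags |= MODEL_TG_WAS_EMPTY;
	tg->nr_queued[rw]++;
}

/* Recompute the dispatch time; clears the flag as a side effect. */
static inline void model_update_disptime(struct tg_model *tg)
{
	tg->disptime_updates++;
	tg->flags &= ~MODEL_TG_WAS_EMPTY;
}

/* Caller side: only update disptime if the new bio is first in line. */
static inline void model_queue_bio(struct tg_model *tg, int rw)
{
	model_add_bio(tg, rw);
	if (tg->flags & MODEL_TG_WAS_EMPTY)
		model_update_disptime(tg);
}
```

Because the flag lives in the group rather than in a local variable of the caller, any other path that adds bios to a group can reuse the same test.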
     
  • throtl_service_queues will eventually form a tree which is anchored at
    throtl_data->service_queue and queued bios will climb the tree to the
    top service_queue to be executed.

    This patch makes the dispatch paths in blk_throtl_dispatch_work_fn()
    and blk_throtl_drain() dispatch bios to
    throtl_data->service_queue.bio_lists[] instead of the on-stack
    bio_lists. This lets the final dispatch to the top-level
    service_queue share the same mechanism as dispatches through the rest
    of the hierarchy.

    As bios should be issued in a sleepable context,
    blk_throtl_dispatch_work_fn() transfers all dispatched bios from the
    service_queue bio_lists[] into an on-stack one before dropping
    queue_lock and issuing the bios.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_service_queues will eventually form a tree which is anchored at
    throtl_data->service_queue and queued bios will climb the tree to the
    top service_queue to be executed.

    This patch moves bio_lists[] and nr_queued[] from throtl_grp to its
    service_queue to prepare for that. As currently only the
    throtl_data->service_queue is in use, this patch just ends up moving
    throtl_grp->bio_lists[] and ->nr_queued[] to
    throtl_grp->service_queue.bio_lists[] and ->nr_queued[] without making
    any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, there's a single service_queue per queue -
    throtl_data->service_queue. All active throtl_grp's are queued on the
    queue and dispatched according to their limits. To support hierarchy,
    this will be expanded such that active throtl_grp's form a tree
    anchored at throtl_data->service_queue and chained through each
    intermediate throtl_grp's service_queue.

    This patch adds throtl_grp->service_queue to prepare for hierarchy
    support. The initialization function - throtl_service_queue_init() -
    is added and replaces the macro initializer. The newly added
    tg->service_queue isn't used yet. Following patches will use it.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_service_queue will be the building block of hierarchy support
    and will form a tree. This patch updates its usages as arguments to
    reduce confusion.

    * When a service queue is used in the parent role - the host of the
    rbtree - use @parent_sq instead of @sq.

    * For functions taking both @tg and @parent_sq, reorder them so that
    the order is (@tg, @parent_sq) not the other way around. This makes
    the code follow the usual convention of specifying the primary
    target of the operation as the first argument.

    This patch doesn't make any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_service_queue will be used as the basic block to implement
    hierarchy support. Pass around throtl_service_queue *sq instead of
    throtl_data *td in the following functions which will be used across
    multiple levels of hierarchy.

    * [__]throtl_enqueue/dequeue_tg()

    * throtl_add_bio_tg()

    * tg_update_disptime()

    * throtl_select_dispatch()

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Add throtl_grp->td so that the td (throtl_data) a given tg
    (throtl_grp) belongs to can be determined, and remove @td argument
    from functions which take both @td and @tg as the former now can be
    determined from the latter.

    This generally simplifies the code and removes a number of cases where
    @td is passed as an argument without being actually used. This will
    also help hierarchy support implementation.

    While at it, in multi-line conditions, move the logical operators
    leading broken lines to the end of the previous line.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • blk-throttle is still using function-defining macros to define flag
    handling functions, which went out of style at least a decade ago.

    Just define the flag as a bitmask and use direct bit operations.

    This patch doesn't make any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_rb_root will be expanded to cover more roles for hierarchy
    support. Rename it to throtl_service_queue and make its fields more
    descriptive.

    * rb -> pending_tree
    * left -> first_pending
    * count -> nr_pending
    * min_disptime -> first_pending_disptime

    This patch is purely cosmetic.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • throtl_nr_queued() is used in several places to avoid performing
    certain operations when the throtl_data is empty. This usually is
    useless as those paths usually aren't traveled if there's no bio
    queued.

    * throtl_schedule_delayed_work() skips scheduling dispatch work item
    if @td doesn't have any bios queued; however, the only case it can
    be called when @td is empty is from tg_set_conf() which isn't
    something we should be optimizing for.

    * throtl_schedule_next_dispatch() takes a quick exit if @td is empty;
    however, right after that it triggers BUG if the service tree is
    empty. The two conditions are equivalent and it can just test
    @st->count for the quick exit.

    * blk_throtl_dispatch_work_fn() skips dispatch if @td is empty. This
    work function isn't usually invoked when @td is empty. The only
    possibility is from tg_set_conf() and when it happens the normal
    dispatching path can handle empty @td fine. No need to add special
    skip path.

    This patch removes the above three unnecessary optimizations, which
    leaves the throtl_log() call in blk_throtl_dispatch_work_fn() as the only
    user of throtl_nr_queued(). Remove throtl_nr_queued() and open code it in
    throtl_log(). I don't think we need td->nr_queued[] at all. Maybe we
    can remove it later.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Move throtl_schedule_delayed_work() above its first user so that the
    forward declaration can be removed.

    This patch is pure relocation.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • blk-throttle is about to go through major restructuring to support
    hierarchy. Do cosmetic updates in preparation.

    * s/throtl_data->throtl_work/throtl_data->dispatch_work/

    * s/blk_throtl_work()/blk_throtl_dispatch_work_fn()/

    * Collapse throtl_dispatch() into blk_throtl_dispatch_work_fn()

    This patch is purely cosmetic.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo