03 Jul, 2013

1 commit

  • There's a race between elevator switching and normal I/O operation.
    The allocations of struct elevator_queue and struct elevator_data are
    not done in one atomic operation, so a NULL ->elevator_data can end up
    being used.
    For example:
    Thread A:                        Thread B:
    blk_queue_bio                    elevator_switch
    spin_lock_irq(q->queue_lock)     elevator_alloc
    elv_merge                        elevator_init_fn

    Thread B can't hold queue_lock while calling elevator_alloc, and
    ->elevator_data is still NULL at that point. If thread A calls elv_merge
    at the same time, it needs some info from elevator_data and crashes.

    Move the elevator_alloc call into elevator_init_fn so that the two
    operations happen atomically.

    The bug is easy to reproduce with the following steps:
    1: dd if=/dev/sdb of=/dev/null
    2: while true; do echo noop > scheduler; echo deadline > scheduler; done

    The fix was tested with the same steps.

    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
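
    The fix described above follows an "initialize fully, then publish under
    the lock" pattern. Below is a minimal userspace sketch of that pattern,
    not the kernel code: the struct layouts are simplified stand-ins, a
    pthread mutex stands in for q->queue_lock, and elevator_init_sketch() and
    elevator_data_ready() are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Simplified userspace stand-ins; not the kernel definitions. */
struct elevator_queue {
    void *elevator_data;              /* scheduler-private data */
};

struct request_queue {
    pthread_mutex_t queue_lock;       /* stands in for q->queue_lock */
    struct elevator_queue *elevator;  /* what the I/O path dereferences */
};

/*
 * Hypothetical init function: allocate the elevator_queue and its private
 * data without the lock, link them, and only then publish the result under
 * the lock. Returns 0 on success, -1 on allocation failure.
 */
static int elevator_init_sketch(struct request_queue *q, size_t data_size)
{
    struct elevator_queue *eq;
    void *data;

    assert(q != NULL);
    eq = malloc(sizeof(*eq));
    data = calloc(1, data_size);
    if (!eq || !data) {
        free(eq);
        free(data);
        return -1;
    }
    eq->elevator_data = data;         /* complete before anyone can see it */

    pthread_mutex_lock(&q->queue_lock);
    q->elevator = eq;                 /* publish the finished object */
    pthread_mutex_unlock(&q->queue_lock);
    return 0;
}

/* I/O path stand-in: under the lock, a published elevator always has data. */
static int elevator_data_ready(struct request_queue *q)
{
    int ready;

    pthread_mutex_lock(&q->queue_lock);
    ready = q->elevator != NULL && q->elevator->elevator_data != NULL;
    pthread_mutex_unlock(&q->queue_lock);
    return ready;
}
```

    Because the pointer is only stored once the private data is attached, a
    reader holding the lock sees either no new elevator or a complete one.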
     

29 Jun, 2013

1 commit

  • In case a device has three tags available we still reserve two of them
    for sync IO. That leaves only a single tag for async IO such as
    writeback from flusher thread which results in poor performance.

    Allow async IO to consume two tags in case the queue has three tags
    available to get decent async write performance.

    This patch improves streaming write performance on a machine with such a disk
    from ~21 MB/s to ~52 MB/s. Also postmark throughput in presence of
    streaming writer improves from 8 to 12 transactions per second so sync
    IO doesn't seem to be harmed in presence of heavy async writer.

    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
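
    The tag split described above can be sketched as a small helper. This is
    an illustration under stated assumptions, not the kernel function: only
    the three-tag case and the "reserve two tags for sync IO" rule come from
    the commit message, the one-tag and two-tag branches are assumed, and
    async_tag_limit() and async_may_take_tag() are hypothetical names.

```c
#include <assert.h>

/*
 * Hypothetical helper: how many of max_depth tags async IO may hold.
 * Only the three-tag case and the general "reserve two" rule come from the
 * commit message; the one- and two-tag branches are assumptions.
 */
static unsigned int async_tag_limit(unsigned int max_depth)
{
    assert(max_depth >= 1);
    switch (max_depth) {
    case 1:
        return 1;               /* assumption: nothing to reserve */
    case 2:
        return 1;               /* assumption: keep one tag for sync IO */
    case 3:
        return 2;               /* from the commit: async may use two tags */
    default:
        return max_depth - 2;   /* reserve two tags for sync IO */
    }
}

/* Async requests are refused a tag once they already hold the limit. */
static int async_may_take_tag(unsigned int max_depth,
                              unsigned int async_in_flight)
{
    return async_in_flight < async_tag_limit(max_depth);
}
```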
     

28 Jun, 2013

1 commit

  • Code in blkdev.c moves a device inode to default_backing_dev_info when
    the last reference to the device is put and moves the device inode back
    to its bdi when the first reference is acquired. This includes moving to
    wb.b_dirty list if the device inode is dirty. However, the code doesn't
    set up a timer to wake the corresponding flusher thread, and while the
    wb.b_dirty list is non-empty __mark_inode_dirty() will not set it up either. Thus
    periodic writeback is effectively disabled until a sync(2) call which can
    lead to unexpected data loss in case of crash or power failure.

    Fix the problem by setting up a timer for periodic writeback in case we
    add the first dirty inode to wb.b_dirty list in bdev_inode_switch_bdi().

    Reported-by: Bert De Jonghe
    CC: stable@vger.kernel.org # >= 3.0
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe

    Jan Kara
     

15 May, 2013

33 commits

  • With the recent updates, blk-throttle is finally ready for proper
    hierarchy support. Dispatching now honors service_queue->parent_sq
    and propagates correctly. The only thing missing is setting
    ->parent_sq correctly so that throtl_grp hierarchy matches the cgroup
    hierarchy.

    This patch updates throtl_pd_init() such that service_queues form the
    same hierarchy as the cgroup hierarchy if sane_behavior is enabled.
    As this concludes proper hierarchy support for blkcg, the shameful
    .broken_hierarchy tag is removed from blkio_subsys.

    v2: Updated blkio-controller.txt as suggested by Vivek.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Li Zefan

    Tejun Heo
     
  • blk_throtl_bio() has a quick exit path for throtl_grps without limits
    configured. It looks at the bps and iops limits and if both are not
    configured, the bio is issued immediately. While this is correct in
    the current flat hierarchy as each throtl_grp behaves completely
    independently, it would become wrong in proper hierarchy mode. A
    group without any limits could still be limited by one of its
    ancestors and bio's queued for such group should not bypass
    blk-throtl.

    As having a quick bypass mechanism is beneficial, this patch
    reimplements the mechanism such that it's correct even with proper
    hierarchy. throtl_grp->has_rules[] is added. These booleans are
    updated for the whole subtree whenever a config is updated so that
    has_rules[] of the whole subtree stays synchronized. They're also
    updated when a new throtl_grp comes online so that it can't escape the
    limits of its ancestors.

    As no throtl_grp has another throtl_grp as parent now, this patch
    doesn't yet make any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
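
    The has_rules[] propagation described above can be sketched as follows.
    This is a simplified model, not the blk-throttle code: struct tg_sketch,
    tg_update_has_rules() and tg_update_subtree() are hypothetical, and the
    sketch assumes that a limit of 0 means "not configured".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { SK_READ = 0, SK_WRITE = 1 };

/* Hypothetical, simplified group; a limit of 0 is assumed to mean "none". */
struct tg_sketch {
    struct tg_sketch *parent;
    unsigned long long bps[2];
    unsigned int iops[2];
    bool has_rules[2];
};

/* A group has rules if it has its own limit or any ancestor has rules. */
static void tg_update_has_rules(struct tg_sketch *tg)
{
    assert(tg != NULL);
    for (int rw = SK_READ; rw <= SK_WRITE; rw++) {
        bool own = tg->bps[rw] != 0 || tg->iops[rw] != 0;
        bool inherited = tg->parent && tg->parent->has_rules[rw];

        tg->has_rules[rw] = own || inherited;
    }
}

/*
 * Refresh a whole subtree. The array must list parents before their
 * children (pre-order) so each group sees its parent's fresh value.
 */
static void tg_update_subtree(struct tg_sketch **pre_order, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tg_update_has_rules(pre_order[i]);
}
```

    With this shape the quick bypass only needs to test has_rules[rw] of the
    group itself, and a newly created group calls tg_update_has_rules() once
    so that it picks up the state of its ancestors.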
     
  • With the planned proper hierarchy support, a bio will climb up the
    tree before actually being dispatched. This makes sure bio is also
    subjected to parent's throttling limits, if any.

    It might happen that parent is idle and when bio is transferred to
    parent, a new slice starts fresh. But that is incorrect, as the parent's
    wait time should have started when the bio was queued in the child group,
    and it causes IOs to be throttled more than configured as they climb the
    hierarchy.

    Given the fact that we have not written hierarchical algorithm in a
    way where child's and parent's time slices are synchronized, we
    transfer the child's start time to parent if parent was idling. If
    parent was busy doing dispatch of other bios all this while, this is
    not an issue.

    Child's slice start time is passed to parent. Parent looks at its
    last expired slice start time. If child's start time is after parent's
    old start time, that means parent had been idle and after parent
    went idle, child had an IO queued. So use child's start time as
    parent start time.

    If parent's start time is after child's start time, that means,
    when IO got queued in child group, parent was not idle. But later
    it dispatched some IO, its slice got trimmed and then it went idle.
    After a while child's request got shifted in parent group. In this
    case use parent's old start time as new start time as that's the
    duration of slice we did not use.

    This logic is far from perfect: if there are multiple children, the
    first child transferring a bio decides the start time while
    a bio might have queued up even earlier in other child, which is
    yet to be transferred up to parent. In that case we will lose
    time and bandwidth in parent. This patch is just an approximation
    to make situation somewhat better.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Tejun Heo

    Vivek Goyal
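
    The two cases above reduce to picking the later of the two start times.
    A minimal sketch, assuming jiffies-like unsigned timestamps that may
    wrap; parent_slice_start() is a hypothetical name, not the function the
    patch adds.

```c
#include <assert.h>

/*
 * Hypothetical helper: choose the start of the parent's new slice.
 * Times are wrapping unsigned counters, so they are compared through the
 * signed difference instead of with a plain ">=".
 */
static unsigned long parent_slice_start(unsigned long parent_old_start,
                                        unsigned long child_start)
{
    /* Child queued its bio after the parent went idle: use child's time. */
    if ((long)(child_start - parent_old_start) >= 0)
        return child_start;
    /* Parent's trimmed slice started later: keep the parent's old time. */
    return parent_old_start;
}
```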
     
  • With flat hierarchy, there's only single level of dispatching
    happening and fairness beyond that point is the responsibility of the
    rest of the block layer and driver, which usually works out okay;
    however, with the planned hierarchy support,
    service_queue->bio_lists[] can be filled up by bios from a single
    source. While the limits would still be honored, it'd be very easy to
    starve IOs from siblings or children.

    To avoid such starvation, this patch implements throtl_qnode and
    converts service_queue->bio_lists[] to lists of per-source qnodes
    which in turn contain the bios. For example, when a bio is
    dispatched from a child group, the bio doesn't get queued on
    ->bio_lists[] directly but it first gets queued on the group's qnode
    which in turn gets queued on service_queue->queued[]. When
    dispatching for the upper level, the ->queued[] list is consumed in
    round-robin order so that the dispatch window is consumed fairly by
    all IO sources.

    There are two ways a bio can come to a throtl_grp - directly queued to
    the group or dispatched from a child. For the former
    throtl_grp->qnode_on_self[rw] is used. For the latter, the child's
    ->qnode_on_parent[rw].

    Note that this means that the child which is contributing a bio to its
    parent should stay pinned until all its bios are dispatched to its
    grand-parent. This patch moves blkg refcnting from bio add/remove
    spots to qnode activation/deactivation so that the blkg containing an
    active qnode is always pinned. As child pins the parent, this is
    sufficient for keeping the relevant sub-tree pinned while bios are in
    flight.

    The starvation issue was spotted by Vivek Goyal.

    v2: The original patch used the same throtl_grp->qnode_on_self/parent
    for reads and writes causing RWs to be queued incorrectly if there
    already are outstanding IOs in the other direction. They should
    be throtl_grp->qnode_on_self/parent[2] so that READs and WRITEs
    can use different qnodes. Spotted by Vivek Goyal.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
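
    The per-source queueing above can be sketched with fixed-size FIFOs.
    This is an illustrative model only: the struct names, sizes and the
    sq_add_bio()/sq_pop_bio() helpers are hypothetical, bios are plain
    integers, and reference counting is left out.

```c
#include <assert.h>

#define MAX_BIOS  8
#define MAX_NODES 4

/* One qnode per IO source: a small FIFO of bio ids. */
struct qnode_sketch {
    int bios[MAX_BIOS];
    int head, count;
};

/* The service queue keeps a FIFO of the qnodes that currently hold bios. */
struct sq_sketch {
    struct qnode_sketch *active[MAX_NODES];
    int head, count;
};

/* Queue a bio on its source's qnode; activate the qnode if it was empty. */
static void sq_add_bio(struct sq_sketch *sq, struct qnode_sketch *qn, int bio)
{
    assert(qn->count < MAX_BIOS);
    qn->bios[(qn->head + qn->count) % MAX_BIOS] = bio;
    qn->count++;
    if (qn->count == 1) {
        assert(sq->count < MAX_NODES);
        sq->active[(sq->head + sq->count) % MAX_NODES] = qn;
        sq->count++;
    }
}

/*
 * Pop one bio from the first active qnode, then move that qnode to the
 * tail if it still holds bios. Returns -1 when nothing is queued.
 */
static int sq_pop_bio(struct sq_sketch *sq)
{
    struct qnode_sketch *qn;
    int bio;

    if (sq->count == 0)
        return -1;
    qn = sq->active[sq->head];
    bio = qn->bios[qn->head];
    qn->head = (qn->head + 1) % MAX_BIOS;
    qn->count--;

    sq->head = (sq->head + 1) % MAX_NODES;
    sq->count--;
    if (qn->count > 0) {
        sq->active[(sq->head + sq->count) % MAX_NODES] = qn;
        sq->count++;
    }
    return bio;
}
```

    A source that queued many bios therefore gets one turn per round instead
    of occupying the head of the list until it is drained.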
     
  • throtl_pending_timer_fn() currently assumes that the parent_sq is the
    top level one and the bio's dispatched are ready to be issued;
    however, this assumption will be wrong with proper hierarchy support.
    This patch makes the following changes to make
    throtl_pending_timer_fn() ready for hierarchy.

    * If the parent_sq isn't the top-level one, update the parent
    throtl_grp's dispatch time and schedule the next dispatch as
    necessary. If the parent's dispatch time is now, repeat the
    function for the parent throtl_grp.

    * If the parent_sq is the top-level one, kick issue work_item as
    before.

    * The debug message printed by throtl_log() now prints out the
    service_queue's nr_queued[] instead of the total nr_queued as the
    latter becomes uninteresting and misleading with hierarchical
    dispatch.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • tg_dispatch_one_bio() currently assumes that the parent_sq is the top
    level one and the bio being dispatched is ready to be issued; however,
    this assumption will be wrong with proper hierarchy support. This
    patch makes the following changes to make tg_dispatch_one_bio() ready
    for hierarchy.

    * throtl_data->nr_queued[] is incremented in blk_throtl_bio() instead
    of throtl_add_bio_tg() so that throtl_add_bio_tg() can be used to
    transfer a bio from a child tg to its parent.

    * tg_dispatch_one_bio() is updated to distinguish whether its parent
    is another throtl_grp or the throtl_data. If former, the bio is
    transferred to the parent throtl_grp using throtl_add_bio_tg(). If
    latter, the bio is ready to be issued and put on the top-level
    service_queue's bio_lists[] and throtl_data->nr_queued is
    decremented.

    As all throtl_grps currently have the top level service_queue as their
    ->parent_sq, this patch in itself doesn't make any behavior
    difference.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, blk_throtl_bio() issues the passed in bio directly if it's
    within limits of its associated tg (throtl_grp). This behavior
    becomes incorrect with hierarchy support as the bio should be
    accounted to and throttled by the ancestor throtl_grps too.

    This patch makes the direct issue path of blk_throtl_bio() loop
    until it reaches the top-level service_queue or gets throttled. If
    the former, the bio can be issued directly; otherwise, it gets queued
    at the first layer where it was above limits.

    As tg->parent_sq is always the top-level service queue currently, this
    patch in itself doesn't make any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
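
    The loop described above can be sketched as a walk up the parent chain.
    This is a simplified model: each level has a plain per-slice budget
    counted in bios, which is an assumption made for illustration, and
    struct level_sketch and charge_up_hierarchy() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical level in the hierarchy with a simple per-slice budget. */
struct level_sketch {
    struct level_sketch *parent;   /* NULL above the topmost group */
    unsigned int budget;           /* bios still allowed in this slice */
};

/*
 * Charge one bio at each level, bottom up. Returns NULL if every level
 * was within limits (the bio can be issued directly), otherwise the first
 * level that was above its limit (the bio gets queued there).
 */
static struct level_sketch *charge_up_hierarchy(struct level_sketch *tg)
{
    while (tg) {
        if (tg->budget == 0)
            return tg;             /* above limit: queue at this layer */
        tg->budget--;              /* within limit: charge and climb */
        tg = tg->parent;
    }
    return NULL;
}
```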
     
  • The current blk_throtl_drain() assumes that all active throtl_grps are
    queued on throtl_data->service_queue, which won't be true once
    hierarchy support is implemented.

    This patch makes blk_throtl_drain() perform post-order walk of the
    blkg hierarchy draining each associated throtl_grp, which guarantees
    that all bios will eventually be pushed to the top-level service_queue
    in throtl_data.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
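
    A post-order walk visits all descendants of a group before the group
    itself, which is why every bio ends up at the top. A minimal sketch with
    a hypothetical tree type and plain counters instead of bio lists:

```c
#include <assert.h>

#define MAX_CHILDREN 4

/* Hypothetical group node; nr_queued stands in for the queued bios. */
struct grp_sketch {
    struct grp_sketch *parent;
    struct grp_sketch *children[MAX_CHILDREN];
    int nr_children;
    int nr_queued;
};

static void grp_add_child(struct grp_sketch *parent, struct grp_sketch *child)
{
    assert(parent->nr_children < MAX_CHILDREN);
    child->parent = parent;
    parent->children[parent->nr_children++] = child;
}

/* Drain children first, then push everything this group holds upward. */
static void drain_post_order(struct grp_sketch *g)
{
    for (int i = 0; i < g->nr_children; i++)
        drain_post_order(g->children[i]);
    if (g->parent) {
        g->parent->nr_queued += g->nr_queued;
        g->nr_queued = 0;
    }
}
```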
     
  • Currently, blk_throtl_dispatch_work_fn() is responsible for both
    dispatching bio's from throtl_grp's according to their limits and then
    issuing the dispatched bios.

    This patch moves the dispatch part to throtl_pending_timer_fn() so
    that the work item is kicked iff there are bio's to issue. This is to
    avoid work item execution at each step when hierarchy support is
    enabled. bio's will be dispatched towards the top-level service_queue
    from the timers at each layer and the work item will only be used to
    issue the bio's which reached the top-level service_queue.

    While fetching bio's to issue from bio_lists[],
    blk_throtl_dispatch_work_fn() fetches all READs before WRITEs. While
    the original code also dispatched READs first, if multiple throtl_grps
    are dispatched on the same run, WRITEs from throtl_grp which is
    dispatched first would precede READs from throtl_grps which are
    dispatched later. While this is a behavior change, given that the
    previous code already prioritized READs and block layer generally
    prioritizes and segregates READs from WRITEs, this isn't likely to
    make any noticeable differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_select_dispatch() only dispatches throtl_quantum bios on each
    invocation. blk_throtl_dispatch_work_fn() in turn depends on
    throtl_schedule_next_dispatch() scheduling the next dispatch window
    immediately so that undue delays aren't incurred. This effectively
    chains multiple dispatch work item executions back-to-back when there
    are more than throtl_quantum bios to dispatch on a given tick.

    There is no reason to finish the current work item just to repeat it
    immediately. This patch makes throtl_schedule_next_dispatch() return
    %false without doing anything if the current dispatch window is still
    open and updates blk_throtl_dispatch_work_fn() to repeat dispatching
    after cpu_relax() on %false return.

    This change will help implementing hierarchy support as dispatching
    will be done from pending_timer and immediate reschedule of timer
    function isn't supported and doesn't make much sense.

    While this patch changes how dispatch behaves when there are more than
    throtl_quantum bios to dispatch on a single tick, the behavior change
    is immaterial.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, throtl_data->dispatch_work is a delayed_work item which
    handles both delayed dispatch and issuing bios. The two tasks will be
    separated to support proper hierarchy. To prepare for that, this
    patch separates out the timer into throtl_service_queue->pending_timer
    from throtl_data->dispatch_work and makes the latter a work_struct.

    * As the timer is now per-service_queue, it's initialized and
    del_sync'd as its corresponding service_queue is created and
    destroyed. The timer, when triggered, simply schedules
    throtl_data->dispatch_work for execution.

    * throtl_schedule_delayed_work() is renamed to
    throtl_schedule_pending_timer() and takes @sq and @expires now.

    * Similarly, throtl_schedule_next_dispatch() now takes @sq, which
    should be the parent_sq of the service_queue which just got a new
    bio or updated. As the parent_sq is always the top-level
    service_queue now, this doesn't change anything at this point.

    This patch doesn't introduce any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • With proper hierarchy support, a bio can be dispatched multiple times
    until it reaches the top-level service_queue and we don't want to
    update dispatch stats at each step. They are local stats and will be
    kept local. If recursive stats are necessary, they should be
    implemented separately and definitely not by updating counters
    recursively on each dispatch.

    This patch moves REQ_THROTTLED setting to throtl_charge_bio() and gates
    stats update with it so that dispatch stats are updated only on the
    first time the bio is charged to a throtl_grp, which will always be
    the throtl_grp the bio was originally queued to.

    This means that REQ_THROTTLED would be set even for bios which don't
    get throttled. As we don't want bios to leave blk-throtl with the
    flag set, move REQ_THROTTLED clearing to the end of blk_throtl_bio()
    and clear if the bio is being issued directly.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Now that both throtl_data and throtl_grp embed throtl_service_queue,
    we can unify throtl_log() and throtl_log_tg().

    * sq_to_tg() is added. This returns the throtl_grp a service_queue is
    embedded in. If the service_queue is the top-level one embedded in
    throtl_data, NULL is returned.

    * sq_to_td() is added. A service_queue is always associated with a
    throtl_data. This function finds the associated td and returns it.

    * throtl_log() is updated to take throtl_service_queue instead of
    throtl_data. If the service_queue is one embedded in throtl_grp, it
    prints the same header as throtl_log_tg() did. If it's one embedded
    in throtl_data, it behaves the same as before. This renders
    throtl_log_tg() unnecessary. Removed.

    This change is necessary for hierarchy support as we're gonna be using
    the same code paths to dispatch bios to intermediate service_queues
    embedded in throtl_grps and the top-level service_queue embedded in
    throtl_data.

    This patch doesn't make any behavior changes.

    v2: throtl_log() didn't print a space after blkg path. Updated so
    that it prints a space after throtl_grp path. Spotted by Vivek.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
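
    Mapping an embedded service_queue back to its container is a typical use
    of the container_of idiom. The sketch below is a userspace approximation:
    the structs are reduced to what the example needs, and telling the two
    kinds of service_queue apart by a NULL parent pointer is an assumption
    based on the parent_sq description elsewhere in this series.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced stand-ins for the real structures. */
struct sq_sketch {
    struct sq_sketch *parent_sq;       /* NULL for the top-level queue */
};

struct tg_sketch {
    int id;
    struct sq_sketch service_queue;    /* embedded, not a pointer */
};

struct td_sketch {
    struct sq_sketch service_queue;    /* top-level service queue */
};

/* Userspace version of the container_of idiom. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Returns the group embedding @sq, or NULL for the top-level queue. */
static struct tg_sketch *sq_to_tg_sketch(struct sq_sketch *sq)
{
    if (sq && sq->parent_sq)
        return container_of_sketch(sq, struct tg_sketch, service_queue);
    return NULL;
}
```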
     
  • To prepare for hierarchy support, this patch adds
    throtl_service_queue->parent_sq which points to the parent
    service_queue. Currently, for all service_queues embedded in
    throtl_grps, it points to throtl_data->service_queue. As
    throtl_data->service_queue doesn't have a parent its parent_sq is set
    to NULL.

    There are a number of functions which take both throtl_grp *tg and
    throtl_service_queue *parent_sq. With this patch, the parent
    service_queue can be determined from @tg and the @parent_sq arguments
    are removed.

    This patch doesn't make any behavior differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • When blk_throtl_bio() wants to queue a bio to a tg (throtl_grp), it
    avoids invoking tg_update_disptime() and
    throtl_schedule_next_dispatch() if the tg already has bios queued in
    that direction. As a new bio is appended after the existing ones, it
    can't change the tg's next dispatch time or the parent's dispatch
    schedule.

    This optimization is currently open coded in blk_throtl_bio().
    Whether the target biolist was occupied was recorded in a local
    variable and later used to skip disptime update. This patch
    generalizes it so that throtl_add_bio_tg() sets a new flag
    THROTL_TG_WAS_EMPTY if the biolist was empty before the new bio was
    added. tg_update_disptime() clears the flag automatically.
    blk_throtl_bio() is updated to simply test the flag before updating
    disptime.

    This patch doesn't make any functional differences now but will enable
    using the same optimization for recursive dispatch.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
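
    The flag handshake above can be sketched as follows. The flag name comes
    from the commit message; its bit value, the reduced struct and the
    helper names are assumptions, and the dispatch time update is replaced
    by a counter so the effect is visible.

```c
#include <assert.h>

/* Name from the commit message; the bit position is an assumption. */
enum { THROTL_TG_WAS_EMPTY = 1 << 1 };

/* Hypothetical, reduced group state. */
struct tg_flag_sketch {
    unsigned int flags;
    unsigned int nr_queued[2];        /* indexed by direction (0/1) */
    unsigned int disptime_updates;    /* counts recomputations */
};

/* Adding to an empty list may change the dispatch time: remember that. */
static void add_bio_sketch(struct tg_flag_sketch *tg, int rw)
{
    assert(rw == 0 || rw == 1);
    if (tg->nr_queued[rw] == 0)
        tg->flags |= THROTL_TG_WAS_EMPTY;
    tg->nr_queued[rw]++;
}

/* Recomputing the dispatch time consumes the flag. */
static void update_disptime_sketch(struct tg_flag_sketch *tg)
{
    tg->disptime_updates++;
    tg->flags &= ~(unsigned int)THROTL_TG_WAS_EMPTY;
}

/* Caller side: only pay for the update when the list was empty. */
static void queue_bio_sketch(struct tg_flag_sketch *tg, int rw)
{
    add_bio_sketch(tg, rw);
    if (tg->flags & THROTL_TG_WAS_EMPTY)
        update_disptime_sketch(tg);
}
```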
     
  • throtl_service_queues will eventually form a tree which is anchored at
    throtl_data->service_queue and queued bios will climb the tree to the
    top service_queue to be executed.

    This patch makes the dispatch paths in blk_throtl_dispatch_work_fn()
    and blk_throtl_drain() dispatch bios to
    throtl_data->service_queue.bio_lists[] instead of the on-stack
    bio_lists. This lets the final dispatch to the top-level
    service_queue share the same mechanism as dispatches through the rest
    of the hierarchy.

    As bio's should be issued in a sleepable context,
    blk_throtl_dispatch_work_fn() transfers all dispatched bio's from the
    service_queue bio_lists[] into an onstack one before dropping
    queue_lock and issuing the bio's.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_service_queues will eventually form a tree which is anchored at
    throtl_data->service_queue and queued bios will climb the tree to the
    top service_queue to be executed.

    This patch moves bio_lists[] and nr_queued[] from throtl_grp to its
    service_queue to prepare for that. As currently only the
    throtl_data->service_queue is in use, this patch just ends up moving
    throtl_grp->bio_lists[] and ->nr_queued[] to
    throtl_grp->service_queue.bio_lists[] and ->nr_queued[] without making
    any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, there's a single service_queue per queue -
    throtl_data->service_queue. All active throtl_grp's are queued on the
    queue and dispatched according to their limits. To support hierarchy,
    this will be expanded such that active throtl_grp's form a tree
    anchored at throtl_data->service_queue and chained through each
    intermediate throtl_grp's service_queue.

    This patch adds throtl_grp->service_queue to prepare for hierarchy
    support. The initialization function - throtl_service_queue_init() -
    is added and replaces the macro initializer. The newly added
    tg->service_queue isn't used yet. Following patches will use it.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_service_queue will be the building block of hierarchy support
    and will form a tree. This patch updates its usages as arguments to
    reduce confusion.

    * When a service queue is used as the parent role - the host of the
    rbtree - use @parent_sq instead of @sq.

    * For functions taking both @tg and @parent_sq, reorder them so that
    the order is (@tg, @parent_sq) not the other way around. This makes
    the code follow the usual convention of specifying the primary
    target of the operation as the first argument.

    This patch doesn't make any functional differences.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_service_queue will be used as the basic block to implement
    hierarchy support. Pass around throtl_service_queue *sq instead of
    throtl_data *td in the following functions which will be used across
    multiple levels of hierarchy.

    * [__]throtl_enqueue/dequeue_tg()

    * throtl_add_bio_tg()

    * tg_update_disptime()

    * throtl_select_dispatch()

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Add throtl_grp->td so that the td (throtl_data) a given tg
    (throtl_grp) belongs to can be determined, and remove @td argument
    from functions which take both @td and @tg as the former now can be
    determined from the latter.

    This generally simplifies the code and removes a number of cases where
    @td is passed as an argument without being actually used. This will
    also help hierarchy support implementation.

    While at it, in multi-line conditions, move the logical operators
    leading broken lines to the end of the previous line.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • blk-throttle is still using function-defining macros to define flag
    handling functions, which went out of style at least a decade ago.

    Just define the flag as bitmask and use direct bit operations.

    This patch doesn't make any functional changes.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_rb_root will be expanded to cover more roles for hierarchy
    support. Rename it to throtl_service_queue and make its fields more
    descriptive.

    * rb -> pending_tree
    * left -> first_pending
    * count -> nr_pending
    * min_disptime -> first_pending_disptime

    This patch is purely cosmetic.

    Signed-off-by: Tejun Heo

    Tejun Heo
     
  • throtl_nr_queued() is used in several places to avoid performing
    certain operations when the throtl_data is empty. This usually is
    useless as those paths usually aren't traveled if there's no bio
    queued.

    * throtl_schedule_delayed_work() skips scheduling dispatch work item
    if @td doesn't have any bios queued; however, the only case it can
    be called when @td is empty is from tg_set_conf() which isn't
    something we should be optimizing for.

    * throtl_schedule_next_dispatch() takes a quick exit if @td is empty;
    however, right after that it triggers BUG if the service tree is
    empty. The two conditions are equivalent and it can just test
    @st->count for the quick exit.

    * blk_throtl_dispatch_work_fn() skips dispatch if @td is empty. This
    work function isn't usually invoked when @td is empty. The only
    possibility is from tg_set_conf() and when it happens the normal
    dispatching path can handle empty @td fine. No need to add special
    skip path.

    This patch removes the above three unnecessary optimizations, which
    leaves the throtl_log() call in blk_throtl_dispatch_work_fn() as the only user
    of throtl_nr_queued(). Remove throtl_nr_queued() and open code it in
    throtl_log(). I don't think we need td->nr_queued[] at all. Maybe we
    can remove it later.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Move throtl_schedule_delayed_work() above its first user so that the
    forward declaration can be removed.

    This patch is pure relocation.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • blk-throttle is about to go through major restructuring to support
    hierarchy. Do cosmetic updates in preparation.

    * s/throtl_data->throtl_work/throtl_data->dispatch_work/

    * s/blk_throtl_work()/blk_throtl_dispatch_work_fn()/

    * Collapse throtl_dispatch() into blk_throtl_dispatch_work_fn()

    This patch is purely cosmetic.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • When bps or iops configuration changes, blk-throttle records the new
    configuration and sets a flag indicating that the config has changed.
    The flag is checked in the bio dispatch path and applied. This
    deferred config application was necessary due to limitations in blkcg
    framework, which haven't existed for quite a while now.

    This patch removes the deferred config application mechanism and
    applies new configurations directly from tg_set_conf(), which is
    simpler.

    v2: Dropped unnecessary throtl_schedule_delayed_work() call from
    tg_set_conf() as suggested by Vivek Goyal.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • throtl_select_dispatch() calls throtl_enqueue_tg() right after
    tg_update_disptime(), which always calls the function anyway. The
    call is, while harmless, unnecessary. Remove it.

    This patch doesn't introduce any behavior difference.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, when the last reference of a blkcg_gq is put, all the
    release operations sans the actual freeing happen directly in
    blkg_put(). As blkg_put() may be called under queue_lock, all
    pd_exit_fn()s may be too. This makes it impossible for pd_exit_fn()s
    to use del_timer_sync() on timers which grab the queue_lock which is
    an irq-safe lock due to the deadlock possibility described in the
    comment on top of del_timer_sync().

    This can be easily avoided by performing the release operations in the
    RCU callback instead of directly from blkg_put(). This patch moves
    the blkcg_gq release operations to the RCU callback.

    As this leaves __blkg_release() with only call_rcu() invocation,
    blkg_rcu_free() is renamed to __blkg_release_rcu(), exported and
    call_rcu() invocation is now done directly from blkg_put() instead of
    going through __blkg_release() which is removed.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • Currently, when creating a new blkcg_gq, each policy's pd_init_fn() is
    invoked in blkg_alloc() before the parent is linked. This makes it
    difficult for policies to perform initializations which are dependent
    on the parent.

    This patch moves pd_init_fn() invocations to blkg_create() after the
    parent blkg is linked where the new blkg is fully initialized. As
    this means that blkg_free() can't assume that pd's are initialized,
    pd_exit_fn() invocations are moved to __blkg_release(). This
    guarantees that pd_exit_fn() is also invoked with fully initialized
    blkgs with valid parent pointers.

    This will help implementing hierarchy support in blk-throttle.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • This will be used by blk-throttle hierarchy support.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • blk-throttle hierarchy support will make use of it. Move
    blkg_for_each_descendant_pre() from block/blk-cgroup.c to
    block/blk-cgroup.h.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     
  • In blkg_create(), after lookup of parent fails, the control jumps to
    error path with the error code encoded into @blkg. The error path
    doesn't use @blkg for the return value. It returns ERR_PTR(ret).
    Make lookup fail path set @ret instead of @blkg.

    Note that the parent lookup is guaranteed to succeed at that point and
    the condition check is purely for sanity and triggers WARN when it fails.
    As such, I don't think it's necessary to mark it for -stable.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal

    Tejun Heo
     

12 May, 2013

4 commits

  • Linus Torvalds
     
  • Pull tracing/kprobes update from Steven Rostedt:
    "The majority of these changes are from Masami Hiramatsu bringing
    kprobes up to par with the latest changes to ftrace (multi buffering
    and the new function probes).

    He also discovered and fixed some bugs in doing so. When pulling in
    his patches, I also found a few minor bugs as well and fixed them.

    This also includes a compile fix for some archs that select the ring
    buffer but not tracing.

    I based this off of the last patch you took from me that fixed the
    merge conflict error, as that was the commit that had all the changes
    I needed for this set of changes."

    * tag 'trace-fixes-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracing/kprobes: Support soft-mode disabling
    tracing/kprobes: Support ftrace_event_file base multibuffer
    tracing/kprobes: Pass trace_probe directly from dispatcher
    tracing/kprobes: Increment probe hit-count even if it is used by perf
    tracing/kprobes: Use bool for retprobe checker
    ftrace: Fix function probe when more than one probe is added
    ftrace: Fix the output of enabled_functions debug file
    ftrace: Fix locking in register_ftrace_function_probe()
    tracing: Add helper function trace_create_new_event() to remove duplicate code
    tracing: Modify soft-mode only if there's no other referrer
    tracing: Indicate enabled soft-mode in enable file
    tracing/kprobes: Fix to increment return event probe hit-count
    ftrace: Cleanup regex_lock and ftrace_lock around hash updating
    ftrace, kprobes: Fix a deadlock on ftrace_regex_lock
    ftrace: Have ftrace_regex_write() return either read or error
    tracing: Return error if register_ftrace_function_probe() fails for event_enable_func()
    tracing: Don't succeed if event_enable_func did not register anything
    ring-buffer: Select IRQ_WORK

    Linus Torvalds
     
  • …nux/kernel/git/konrad/xen

    Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
    - More fixes in the vCPU PVHVM hotplug path.
    - Add more documentation.
    - Fix various ARM related issues in the Xen generic drivers.
    - Updates in the xen-pciback driver per Bjorn's updates.
    - Mask the x2APIC feature for PV guests.

    * tag 'stable/for-linus-3.10-rc0-tag-two' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
    xen/pci: Used cached MSI-X capability offset
    xen/pci: Use PCI_MSIX_TABLE_BIR, not PCI_MSIX_FLAGS_BIRMASK
    xen: clear IRQ_NOAUTOEN and IRQ_NOREQUEST
    xen: mask x2APIC feature in PV
    xen: SWIOTLB is only used on x86
    xen/spinlock: Fix check from greater than to be also be greater or equal to.
    xen/smp/pvhvm: Don't point per_cpu(xen_vpcu, 33 and larger) to shared_info
    xen/vcpu: Document the xen_vcpu_info and xen_vcpu
    xen/vcpu/pvhvm: Fix vcpu hotplugging hanging.

    Linus Torvalds
     
  • Pull second SCSI update from James "Jaj B" Bottomley:
    "This is the final round of SCSI patches for the merge window. It
    consists mostly of driver updates (bnx2fc, ibmfc, fnic, lpfc,
    be2iscsi, pm80xx, qla4x and ipr).

    There's also the power management updates that complete the patches in
    Jens' tree, an iscsi refcounting problem fix from the last pull, some
    dif handling in scsi_debug fixes, a few nice code cleanups and an
    error handling busy bug fix."

    * tag 'scsi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (92 commits)
    [SCSI] qla2xxx: Update firmware link in Kconfig file.
    [SCSI] iscsi class, qla4xxx: fix sess/conn refcounting when find fns are used
    [SCSI] sas: unify the pointlessly separated enums sas_dev_type and sas_device_type
    [SCSI] pm80xx: thermal, sas controller config and error handling update
    [SCSI] pm80xx: NCQ error handling changes
    [SCSI] pm80xx: WWN Modification for PM8081/88/89 controllers
    [SCSI] pm80xx: Changed module name and debug messages update
    [SCSI] pm80xx: Firmware flash memory free fix, with addition of new memory region for it
    [SCSI] pm80xx: SPC new firmware changes for device id 0x8081 alone
    [SCSI] pm80xx: Added SPCv/ve specific hardware functionalities and relevant changes in common files
    [SCSI] pm80xx: MSI-X implementation for using 64 interrupts
    [SCSI] pm80xx: Updated common functions common for SPC and SPCv/ve
    [SCSI] pm80xx: Multiple inbound/outbound queue configuration
    [SCSI] pm80xx: Added SPCv/ve specific ids, variables and modify for SPC
    [SCSI] lpfc: fix up Kconfig dependencies
    [SCSI] Handle MLQUEUE busy response in scsi_send_eh_cmnd
    [SCSI] sd: change to auto suspend mode
    [SCSI] sd: use REQ_PM in sd's runtime suspend operation
    [SCSI] qla4xxx: Fix iocb_cnt calculation in qla4xxx_send_mbox_iocb()
    [SCSI] ufs: Correct the expected data transfersize
    ...

    Linus Torvalds