26 Mar, 2011

2 commits

  • When the queue work handler was converted to delayed work, the
    stopping was inadvertently made synchronous as well. Change this
    back to an async stop, using __cancel_delayed_work() instead of
    cancel_delayed_work().

    Reported-by: Jeremy Fitzhardinge
    Reported-by: Chris Mason
    Signed-off-by: Jens Axboe

    Jens Axboe
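
    A minimal sketch of the distinction, as of kernels of this era (the
    stop helper itself is illustrative):

        #include <linux/workqueue.h>

        /* Illustrative async stop for a delayed-work dispatcher. */
        static void stop_dispatch_async(struct delayed_work *dwork)
        {
            /*
             * __cancel_delayed_work() uses del_timer() and returns
             * without waiting, so it is safe from atomic context;
             * cancel_delayed_work() uses del_timer_sync() and may
             * wait for a running timer handler to finish.
             */
            __cancel_delayed_work(dwork);
        }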
     
  • With the introduction of the on-stack plugging, we would assume
    that any request being inserted was a normal file system request.
    As flush/fua requires a special insert mode, this caused problems.

    Fix this up by checking for flush/fua requests in flush_plug_list()
    and using the appropriate insert mechanism for them.

    Big thanks go to Markus Tripplesdorf for tirelessly testing
    patches, and to Sergey Senozhatsky for helping find the real
    issue.

    Reported-by: Markus Tripplesdorf
    Signed-off-by: Jens Axboe

    Jens Axboe
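
    A sketch of the shape of the fix in flush_plug_list(), based on the
    description above rather than a verbatim copy of the patch:

        /* Flush/FUA requests must not go through a normal sorted
         * insert; route them to the dedicated flush machinery. */
        if (rq->cmd_flags & (REQ_FLUSH | REQ_FUA))
            __elv_add_request(q, rq, ELEVATOR_INSERT_FLUSH);
        else
            __elv_add_request(q, rq, ELEVATOR_INSERT_SORT_MERGE);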
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

23 Mar, 2011

5 commits

  • Remove think time checking. A queue with a high think time likely
    dispatches several requests and then goes away, so limiting such a
    queue is meaningless. Removing the check also simplifies the code.
    This was suggested by Vivek.

    Signed-off-by: Shaohua Li
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Li, Shaohua
     
  • For v2, I added back lines to cfq_preempt_queue() that were removed
    during updates for accounting unaccounted_time. Thanks for pointing out
    that I'd missed these, Vivek.

    The previous commit "cfq-iosched: Don't set active queue in preempt"
    wrongly cleared stats for preempting queues: when we choose a queue
    to preempt, that queue still isn't necessarily scheduled next.

    Thanks to Vivek Goyal for figuring this out and understanding how the
    preemption code works.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     
  • Lina reported that if throttle limits are initially very high and
    then dropped, no new bio might be dispatched for a long time. The
    reason is that after dropping the limits we don't reset the existing
    slice; we do the rate calculation with the new low rate while still
    accounting the bios dispatched at the high rate. To fix it, reset
    the slice upon rate change.

    https://lkml.org/lkml/2011/3/10/298

    Another problem with a very high limit is that we never queued the
    bio on the throtl service tree. That means we kept extending the
    group slice but never trimmed it. Fix that too by regularly
    trimming the slice even if no bio is being queued up.

    Reported-by: Lina Lu
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
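
    A sketch of the first fix (helper names follow the blk-throttle code
    of this era; the surrounding logic is illustrative):

        /* On a limit change, discard the old slice so bios already
         * dispatched at the old (high) rate are not charged against
         * the new (low) rate. */
        static void throtl_apply_limit_change(struct throtl_data *td,
                                              struct throtl_grp *tg)
        {
            if (!tg->limits_changed)
                return;
            tg->limits_changed = false;

            throtl_start_new_slice(td, tg, READ);
            throtl_start_new_slice(td, tg, WRITE);
        }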
     
  • This change moves unaccounted_time to only be reported when
    CONFIG_DEBUG_BLK_CGROUP is true.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     
  • Commit "Add unaccounted time to timeslice_used" changed the behavior of
    cfq_preempt_queue to set cfqq active. Vivek pointed out that other
    preemption rules might get involved, so we shouldn't manually set which
    queue is active.

    This cleans up the code to just clear the queue stats at preemption
    time.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

21 Mar, 2011

1 commit

  • One of the disadvantages of on-stack plugging is that we potentially
    lose out on merging since all pending IO isn't always visible to
    everybody. When we flush the on-stack plugs, right now we don't do
    any checks to see if potential merge candidates could be utilized.

    Correct this by adding a new insert variant, ELEVATOR_INSERT_SORT_MERGE.
    It works just like ELEVATOR_INSERT_SORT, but first checks whether we
    can merge with an existing request; the insertion is only done if
    merging fails.

    This fixes a regression with multiple processes issuing IO that
    can be merged.

    Thanks to Shaohua Li for testing and fixing
    an accounting bug.

    Signed-off-by: Jens Axboe

    Jens Axboe
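
    A sketch of the new case inside elv_insert(), close to the patch as
    described but illustrative:

        case ELEVATOR_INSERT_SORT_MERGE:
            /*
             * If the request merged with one already on the queue,
             * it has been freed and we are done.
             */
            if (elv_attempt_insert_merge(q, rq))
                break;
            /* fall through: no merge possible, do a sorted insert */
        case ELEVATOR_INSERT_SORT:
            /* ...existing sorted-insert path... */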
     

18 Mar, 2011

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (170 commits)
    [SCSI] scsi_dh_rdac: Add MD36xxf into device list
    [SCSI] scsi_debug: add consecutive medium errors
    [SCSI] libsas: fix ata list corruption issue
    [SCSI] hpsa: export resettable host attribute
    [SCSI] hpsa: move device attributes to avoid forward declarations
    [SCSI] scsi_debug: Logical Block Provisioning (SBC3r26)
    [SCSI] sd: Logical Block Provisioning update
    [SCSI] Include protection operation in SCSI command trace
    [SCSI] hpsa: fix incorrect PCI IDs and add two new ones (2nd try)
    [SCSI] target: Fix volume size misreporting for volumes > 2TB
    [SCSI] bnx2fc: Broadcom FCoE offload driver
    [SCSI] fcoe: fix broken fcoe interface reset
    [SCSI] fcoe: precedence bug in fcoe_filter_frames()
    [SCSI] libfcoe: Remove stale fcoe-netdev entries
    [SCSI] libfcoe: Move FCOE_MTU definition from fcoe.h to libfcoe.h
    [SCSI] libfc: introduce __fc_fill_fc_hdr that accepts fc_hdr as an argument
    [SCSI] fcoe, libfc: initialize EM anchors list and then update npiv EMs
    [SCSI] Revert "[SCSI] libfc: fix exchange being deleted when the abort itself is timed out"
    [SCSI] libfc: Fixing a memory leak when destroying an interface
    [SCSI] megaraid_sas: Version and Changelog update
    ...

    Fix up trivial conflicts due to whitespace differences in
    drivers/scsi/libsas/{sas_ata.c,sas_scsi_host.c}

    Linus Torvalds
     

17 Mar, 2011

1 commit

  • Version 3 is updated to apply to for-2.6.39/core.

    For version 2, I took Vivek's advice and made sure we update the group
    weight from cfq_group_service_tree_add().

    If a weight was updated while a group is on the service tree, the
    calculation for the total weight of the service tree can be adjusted
    improperly, which either leads to bad service tree weights, or
    potentially crashes (if total_weight becomes 0).

    This patch defers updates to the weight until a group is off the service
    tree.

    Signed-off-by: Justin TerAvest
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Justin TerAvest
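
    A sketch of the deferred update (field names follow cfq-iosched of
    this era; treat as illustrative). Weight changes only mark the group;
    the new weight is applied when the group is added to the service
    tree, i.e. while it is guaranteed to be off the tree:

        static void cfq_update_group_weight(struct cfq_group *cfqg)
        {
            BUG_ON(!RB_EMPTY_NODE(&cfqg->rb_node));
            if (cfqg->needs_update) {
                cfqg->weight = cfqg->new_weight;
                cfqg->needs_update = false;
            }
        }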
     

11 Mar, 2011

1 commit

  • BZ29402
    https://bugzilla.kernel.org/show_bug.cgi?id=29402

    We can hit a serious mis-synchronization in the bio completion path
    of blkdev_issue_zeroout(), leading to a panic.

    The problem is that when we are going to wait_for_completion() in
    blkdev_issue_zeroout() we check if bb.done equals issued (the number
    of submitted bios). If it does, we can skip the wait_for_completion()
    and just return from the function, since there is nothing to wait
    for. However, there is an ordering problem, because bio_batch_end_io()
    calls atomic_inc(&bb->done) before complete(); hence it might seem to
    blkdev_issue_zeroout() that all bios have completed, and it exits. At
    the point when bio_batch_end_io() then calls complete(bb->wait),
    bb and wait no longer exist since they were allocated on the stack in
    blkdev_issue_zeroout() ==> panic!

    (thread 1)                     (thread 2)
    bio_batch_end_io()             blkdev_issue_zeroout()
     if (bb) {                      ...
      if (bb->end_io)               ...
       bb->end_io(bio, err);        ...
      atomic_inc(&bb->done);        ...
     ...                            while (issued != atomic_read(&bb.done))
     ...                            (let issued == bb.done)
     ...                            (do the rest of the function)
     ...                            return ret;
     complete(bb->wait);
     ^^^^^^^^
     panic

    We can fix this easily by simplifying bio_batch and completion counting.

    Also remove bio_end_io_t *end_io since it is not used.

    Signed-off-by: Lukas Czerner
    Reported-by: Eric Whitney
    Tested-by: Eric Whitney
    Reviewed-by: Jeff Moyer
    CC: Dmitry Monakhov
    Signed-off-by: Jens Axboe

    Lukas Czerner
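
    A sketch of the simplified counting (close to the actual fix): start
    ->done at 1 for the submitter's own reference, and only the side that
    drops the last reference calls complete():

        static void bio_batch_end_io(struct bio *bio, int err)
        {
            struct bio_batch *bb = bio->bi_private;

            if (err && (err != -EOPNOTSUPP))
                clear_bit(BIO_UPTODATE, &bb->flags);
            if (atomic_dec_and_test(&bb->done))
                complete(bb->wait);
            bio_put(bio);
        }

    The submitter does atomic_set(&bb.done, 1) before issuing, one
    atomic_inc(&bb.done) per submitted bio, and then:

        if (!atomic_dec_and_test(&bb.done))
            wait_for_completion(&wait);

    so the on-stack bb cannot go away while a completion still needs it.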
     

10 Mar, 2011

7 commits

  • Conflicts:
    block/blk-core.c
    block/blk-flush.c
    drivers/md/raid1.c
    drivers/md/raid10.c
    drivers/md/raid5.c
    fs/nilfs2/btnode.c
    fs/nilfs2/mdt.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Use a plug in throttle dispatch as well, since we are dispatching a
    bunch of bios in throttle context and some of them might merge.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • With the plugging now being explicitly controlled by the
    submitter, callers need not pass down unplugging hints
    to the block layer. If they want to unplug, it's because they
    manually plugged on their own - in which case, they should just
    unplug at will.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Code has been converted over to the new explicit on-stack plugging,
    and delay users have been converted to use the new API for that.
    So let's kill off the old plugging along with aops->sync_page().

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This patch adds support for creating a queuing context outside
    of the queue itself. This enables us to batch up pieces of IO
    before grabbing the block device queue lock and submitting them to
    the IO scheduler.

    The context is created on the stack of the process and assigned in
    the task structure, so that we can auto-unplug it if we hit a schedule
    event.

    The current queue plugging happens implicitly if IO is submitted to
    an empty device, yet callers have to remember to unplug that IO when
    they are going to wait for it. This is an ugly API and has caused bugs
    in the past. Additionally, it requires hacks in the vm (->sync_page()
    callback) to handle that logic. By switching to an explicit plugging
    scheme we make the API a lot nicer and can get rid of the ->sync_page()
    hack in the vm.

    Signed-off-by: Jens Axboe

    Jens Axboe
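
    A minimal usage sketch of the new API (the batch-submission helper
    is illustrative):

        #include <linux/blkdev.h>

        void submit_batch(struct bio **bios, int nr)
        {
            struct blk_plug plug;
            int i;

            blk_start_plug(&plug);    /* on-stack, noted in the task */
            for (i = 0; i < nr; i++)
                submit_bio(WRITE, bios[i]);
            blk_finish_plug(&plug);   /* flush the batch to the queue */
        }

    If the task schedules while plugged, the context is flushed
    automatically, so submitters no longer need to remember to unplug.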
     
  • Currently we use plugging for that, but as plugging is going away,
    we need an alternative mechanism.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Currently, disk_unblock_events() implicitly kicks an event check if
    the block count reaches zero. This behavior is not described in the
    comment and hinders future changes. Make the unblocker explicitly
    check events by calling disk_check_events() as necessary.

    This patch doesn't cause any behavior difference.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Cc: Kay Sievers

    Tejun Heo
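
    A caller-side sketch (illustrative): callers that relied on the
    implicit kick now pair the unblock with an explicit check:

        disk_unblock_events(disk);
        disk_check_events(disk);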
     

08 Mar, 2011

2 commits

  • When throttle group limits are updated through cgroups, a thread is
    woken up to process these updates. While reviewing that code, Oleg
    noted a couple of race conditions in it and also suggested that the
    code could be simplified.

    This patch fixes the races and simplifies the code based on Oleg's
    suggestions:

    - Use xchg().
    - Introduced a common function throtl_update_blkio_group_common()
    which is shared now by all iops/bps update functions.

    Reviewed-by: Oleg Nesterov
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Vivek Goyal

    Fixed a merge issue: throtl_schedule_delayed_work() now takes
    throtl_data as the argument, not the queue.

    Signed-off-by: Jens Axboe

    Vivek Goyal
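
    A sketch of the shared helper (close to the patch as described):

        static void throtl_update_blkio_group_common(struct throtl_data *td,
                                                     struct throtl_grp *tg)
        {
            /* xchg() implies a full barrier around the flag updates */
            xchg(&tg->limits_changed, true);
            xchg(&td->limits_changed, true);
            /* schedule work immediately to process the limit change */
            throtl_schedule_delayed_work(td, 0);
        }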
     
  • With the help of the cgroup interface one can update the bps/iops
    limits of an existing group. Once the limits are updated, a thread is
    woken up to see if some blocked group needs recalculation based on the
    new limits and needs to be requeued.

    There was also a piece of code where I was checking for group limit
    update when a fresh bio comes in. This patch gets rid of that piece of
    code and keeps processing the limit change at one place
    throtl_process_limit_change(). It just keeps the code simple and easy
    to understand.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

05 Mar, 2011

1 commit

  • This merge creates two sets of conflicts. One is simple context
    conflicts caused by the removal of throtl_scheduled_delayed_work() in
    for-linus and of throtl_shutdown_timer_wq() in for-2.6.39/core.

    The other is caused by commit 255bb490c8 (block: blk-flush shouldn't
    call directly into q->request_fn() __blk_run_queue()) in for-linus
    clashing with the FLUSH reimplementation in for-2.6.39/core. The
    conflict isn't trivial but the resolution is straight-forward.

    * The __blk_run_queue() calls in flush_end_io() and flush_data_end_io()
    should be made with @force_kblockd set to %true.

    * elv_insert() in blk_kick_flush() should use
    %ELEVATOR_INSERT_REQUEUE.

    Both changes are to avoid invoking ->request_fn() directly from
    request completion path and closely match the changes in the commit
    255bb490c8.

    Signed-off-by: Tejun Heo

    Tejun Heo
     

03 Mar, 2011

3 commits

  • Move blk_throtl_exit() into blk_cleanup_queue(), as blk_throtl_exit()
    is written in such a way that it needs the queue lock. In
    blk_release_queue() there is no guarantee that ->queue_lock is still
    around.

    Initially blk_throtl_exit() was in blk_cleanup_queue() but Ingo reported
    one problem.

    https://lkml.org/lkml/2010/10/23/86

    And a quick fix moved blk_throtl_exit() to blk_release_queue().

    commit 7ad58c028652753814054f4e3ac58f925e7343f4
    Author: Jens Axboe
    Date: Sat Oct 23 20:40:26 2010 +0200

    block: fix use-after-free bug in blk throttle code

    This patch reverts the above change and does not try to shut down the
    throtl work in blk_sync_queue(). By avoiding the call to
    throtl_shutdown_timer_wq() from blk_sync_queue(), we should also avoid
    the problem reported by Ingo.

    blk_sync_queue() seems to be used only by the md driver, which uses
    it to make sure q->unplug_fn is not called, as md registers its own
    unplug functions and is about to free up the data structures used by
    unplug_fn(). Block throttle does not call back into unplug_fn() or
    into md, so there is no need to cancel blk throttle work.

    In fact I think cancelling block throttle work is bad, because it
    might happen that some bios are throttled and scheduled to be
    dispatched later with the help of pending work; if that work is
    cancelled, these bios might never be dispatched.

    Block layer also uses blk_sync_queue() during blk_cleanup_queue() and
    blk_release_queue() time. That should be safe as we are also calling
    blk_throtl_exit() which should make sure all the throttling related
    data structures are cleaned up.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • There does not seem to be a clear convention for whether q->queue_lock
    is initialized or not when blk_cleanup_queue() is called. In the past
    it was not necessary, but now blk_throtl_exit() takes the queue lock
    by default and needs the queue lock to be available.

    In fact the elevator_exit() code has a similar requirement, just a
    less stringent one, in the sense that elevator_exit() is called only
    if the elevator is initialized.

    Two problems have been noticed because of ambiguity about spin lock
    status.

    - If a driver calls blk_alloc_queue() and then calls
    blk_cleanup_queue() almost immediately (because some other
    driver structure allocation failed or some other error happened),
    then blk_throtl_exit() will run into issues as the queue lock is
    not initialized. The loop driver ran into this issue recently and
    I noticed error paths in the md driver too. Similar error paths
    should exist in other drivers too.

    - If some driver provided an external spin lock and zapped the lock
    before blk_cleanup_queue(), that can lead to issues.

    So this patch initializes the default queue lock at queue allocation time.

    The block throttling code is one of the users of the queue lock and
    is initialized at queue allocation time, so it makes sense to also
    initialize ->queue_lock to the internal lock then. A driver can
    override that lock later. This takes care of the issue, so a driver
    does not have to worry about initializing the queue lock to the
    default before calling blk_cleanup_queue().

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
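
    A sketch of the change in blk_alloc_queue_node() (close to the patch
    as described):

        spin_lock_init(&q->__queue_lock);

        /*
         * By default point ->queue_lock at the internal lock; a driver
         * may override it later (e.g. when setting up the queue) with
         * its own lock.
         */
        q->queue_lock = &q->__queue_lock;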
     
  • Replace the bare numerals in diskstats_show() with macros.

    Cc: Jens Axboe
    Signed-off-by: Liu Yuan
    Signed-off-by: Jens Axboe

    Liu Yuan
     

02 Mar, 2011

6 commits

  • blk-flush decomposes a flush into a sequence of multiple requests. On
    completion of a request, the next one is queued; however, the block
    layer must not implicitly call into q->request_fn() directly from the
    completion path. Doing so makes the queue behave unexpectedly as seen
    from the drivers and violates the assumption that q->request_fn() is
    called with process context + queue_lock.

    This patch makes the following two changes to blk-flush to make sure
    q->request_fn() is not called directly from the request completion path.

    - blk_flush_complete_seq_end_io() now asks __blk_run_queue() to always
    use kblockd instead of calling directly into q->request_fn().

    - queue_next_fseq() uses ELEVATOR_INSERT_REQUEUE instead of
    ELEVATOR_INSERT_FRONT so that elv_insert() doesn't try to unplug the
    request queue directly.

    Reported by Jan in the following threads.

    http://thread.gmane.org/gmane.linux.ide/48778
    http://thread.gmane.org/gmane.linux.ide/48786

    stable: applicable to v2.6.37.

    Signed-off-by: Tejun Heo
    Reported-by: Jan Beulich
    Cc: "David S. Miller"
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
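
    A sketch of the two call-site changes (illustrative, matching the
    description above):

        /* 1) blk_flush_complete_seq_end_io(): always go via kblockd */
        __blk_run_queue(q, true);

        /* 2) queue_next_fseq(): requeue insert avoids implicit unplug */
        elv_insert(q, rq, ELEVATOR_INSERT_REQUEUE);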
     
  • __blk_run_queue() automatically either calls q->request_fn() directly
    or schedules kblockd depending on whether the function is recursed.
    blk-flush implementation needs to be able to explicitly choose
    kblockd. Add @force_kblockd.

    All the current users are converted to specify %false for the
    parameter and this patch doesn't introduce any behavior change.

    stable: This is prerequisite for fixing ide oops caused by the new
    blk-flush implementation.

    Signed-off-by: Tejun Heo
    Cc: Jan Beulich
    Cc: James Bottomley
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
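
    A usage sketch of the new parameter (the call sites shown are
    illustrative):

        /* process context: may recurse into q->request_fn() directly */
        __blk_run_queue(q, false);

        /* request completion path: must defer to kblockd instead of
         * calling ->request_fn() recursively */
        __blk_run_queue(q, true);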
     
  • Effectively, make group_isolation=1 the default and remove the tunable.
    The default was group_isolation=0 because we idle on the sync-noidle
    tree by default, and on fast devices this can be very harmful for
    throughput.

    However, this problem can also be addressed by tuning slice_idle and
    possibly group_idle on faster storage devices.

    This change simplifies the CFQ code by removing the feature entirely.

    Signed-off-by: Justin TerAvest
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Justin TerAvest
     
  • Conflicts:
    block/cfq-iosched.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Signed-off-by: Ben Hutchings
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Ben Hutchings
     
  • o Dominik Klein reported a system hang issue while doing some blkio
    throttling testing.

    https://lkml.org/lkml/2011/2/24/173

    o Some tracing revealed that CFQ was not dispatching any more jobs
    because queue unplug was not happening. And queue unplug was not
    happening because the unplug work was not being called, as there was
    one throttling work on the same cpu which had not finished yet. And
    the throttling work had not finished because it was trying to
    dispatch a bio to CFQ, but all the request descriptors were consumed,
    so it was put to sleep.

    o So basically it is a cyclic dependency between the CFQ unplug work
    and the throtl dispatch work. Tejun suggested using a separate
    workqueue for such cases.

    o This patch uses a separate workqueue for throttle-related work and
    does not rely on the kblockd workqueue anymore.

    Cc: stable@kernel.org
    Reported-by: Dominik Klein
    Signed-off-by: Vivek Goyal
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Vivek Goyal
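
    A sketch of the dedicated workqueue (names follow the blk-throttle
    code of this era):

        static struct workqueue_struct *kthrotld_workqueue;

        static int __init throtl_init(void)
        {
            kthrotld_workqueue = alloc_workqueue("kthrotld",
                                                 WQ_MEM_RECLAIM, 0);
            if (!kthrotld_workqueue)
                panic("Failed to create kthrotld\n");
            return 0;
        }

    Throttle work is then queued with queue_delayed_work() on
    kthrotld_workqueue instead of kblockd, so it can no longer block the
    unplug work it depends on.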
     

26 Feb, 2011

1 commit

  • * 'for-linus' of git://neil.brown.name/md:
    md: Fix - again - partition detection when array becomes active
    Fix over-zealous flush_disk when changing device size.
    md: avoid spinlock problem in blk_throtl_exit
    md: correctly handle probe of an 'mdp' device.
    md: don't set_capacity before array is active.
    md: Fix raid1->raid0 takeover

    Linus Torvalds