01 Mar, 2013

1 commit

  • Pull block IO core bits from Jens Axboe:
    "Below are the core block IO bits for 3.9. It was delayed a few days
    since my workstation kept crashing every 2-8h after pulling it into
    current -git, but turns out it is a bug in the new pstate code (divide
    by zero, will report separately). In any case, it contains:

    - The big cfq/blkcg update from Tejun and Vivek.

    - Additional block and writeback tracepoints from Tejun.

    - Improvement of the should sort (based on queues) logic in the plug
    flushing.

    - _io() variants of the wait_for_completion() interface, using
    io_schedule() instead of schedule() to contribute to io wait
    properly.

    - Various little fixes.

    You'll get two trivial merge conflicts, which should be easy enough to
    fix up"

    Fix up the trivial conflicts due to hlist traversal cleanups (commit
    b67bfe0d42ca: "hlist: drop the node parameter from iterators").

    * 'for-3.9/core' of git://git.kernel.dk/linux-block: (39 commits)
    block: remove redundant check to bd_openers()
    block: use i_size_write() in bd_set_size()
    cfq: fix lock imbalance with failed allocations
    drivers/block/swim3.c: fix null pointer dereference
    block: don't select PERCPU_RWSEM
    block: account iowait time when waiting for completion of IO request
    sched: add wait_for_completion_io[_timeout]
    writeback: add more tracepoints
    block: add block_{touch|dirty}_buffer tracepoint
    buffer: make touch_buffer() an exported function
    block: add @req to bio_{front|back}_merge tracepoints
    block: add missing block_bio_complete() tracepoint
    block: Remove should_sort judgement when flush blk_plug
    block,elevator: use new hashtable implementation
    cfq-iosched: add hierarchical cfq_group statistics
    cfq-iosched: collect stats from dead cfqgs
    cfq-iosched: separate out cfqg_stats_reset() from cfq_pd_reset_stats()
    blkcg: make blkcg_print_blkgs() grab q locks instead of blkcg lock
    block: RCU free request_queue
    blkcg: implement blkg_[rw]stat_recursive_sum() and blkg_[rw]stat_merge()
    ...

    Linus Torvalds
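
    As a quick illustration of the _io() completion variants mentioned in
    the pull message above, here is a minimal sketch of the usage pattern;
    the demo_* structure and the submission step are hypothetical, only
    the completion API itself is real.

    #include <linux/completion.h>
    #include <linux/errno.h>
    #include <linux/jiffies.h>

    /* Illustrative command structure; not from any real driver. */
    struct demo_cmd {
            struct completion *done;
            /* ... device-specific fields ... */
    };

    /* Called from the device's IRQ / end_io path. */
    static void demo_cmd_done(struct demo_cmd *cmd)
    {
            complete(cmd->done);
    }

    /* Submit and wait; the sleep is charged to iowait via io_schedule(). */
    static int demo_submit_and_wait(struct demo_cmd *cmd)
    {
            DECLARE_COMPLETION_ONSTACK(done);

            cmd->done = &done;
            /* ... hand cmd to the hardware here (hypothetical step) ... */

            if (!wait_for_completion_io_timeout(&done, msecs_to_jiffies(30000)))
                    return -ETIMEDOUT;
            return 0;
    }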
     

22 Feb, 2013

1 commit

  • This provides a band-aid for stable page writes on jbd without
    needing to backport the fixed locking and page writeback bit handling
    schemes of jbd2. The band-aid works by using bounce buffers to snapshot
    page contents instead of waiting.

    For those wondering about the ext3 bandage -- fixing the jbd locking
    (which was done as part of ext4dev years ago) is a lot of surgery, and
    setting PG_writeback on data pages when we actually hold the page lock
    dropped ext3 performance by nearly an order of magnitude. If we're
    going to migrate iscsi and raid to use stable page writes, the
    complaints about high latency will likely return. We might as well
    centralize the page snapshotting in one place.

    Signed-off-by: Darrick J. Wong
    Tested-by: Andy Lutomirski
    Cc: Adrian Hunter
    Cc: Artem Bityutskiy
    Reviewed-by: Jan Kara
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

14 Jan, 2013

2 commits

  • bio_{front|back}_merge tracepoints report a bio merging into an
    existing request but don't specify which request the bio is being
    merged into. Add @req to them. This makes it impossible to share the
    event template with block_bio_queue, so split it out.

    @req isn't used or exported to userland at this point and there is no
    userland visible behavior change. Later changes will make use of the
    extra parameter.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • bio completion didn't kick block_bio_complete TP. Only dm was
    explicitly triggering the TP on IO completion. This makes
    block_bio_complete TP useless for tracers which want to know about
    bios, and all other bio based drivers skip generating blktrace
    completion events.

    This patch makes all bio completions via bio_endio() generate
    block_bio_complete TP.

    * Explicit trace_block_bio_complete() invocation removed from dm and
    the trace point is unexported.

    * @rq dropped from trace_block_bio_complete(). bios may fly around
    without an associated queue. Verifying and accessing the associated
    queue belongs to the TP probes.

    * blktrace now gets both request and bio completions. Make it ignore
    bio completions if request completion path is happening.

    This makes all bio based drivers generate blktrace completion events
    properly and makes the block_bio_complete TP actually useful.

    v2: With this change, block_bio_complete TP could be invoked on sg
    commands which have bio's with %NULL bi_bdev. Update TP
    assignment code to check whether bio->bi_bdev is %NULL before
    dereferencing.

    Signed-off-by: Tejun Heo
    Original-patch-by: Namhyung Kim
    Cc: Tejun Heo
    Cc: Steven Rostedt
    Cc: Alasdair Kergon
    Cc: dm-devel@redhat.com
    Cc: Neil Brown
    Signed-off-by: Jens Axboe

    Tejun Heo
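
    For context, here is a minimal sketch of a bio-based driver's
    make_request function (3.8-era void-returning prototype); after this
    change the bio_endio() call emits block_bio_complete by itself, so the
    explicit trace call that dm used to make is no longer needed. The
    demo_ name is illustrative.

    #include <linux/bio.h>
    #include <linux/blkdev.h>

    static void demo_make_request(struct request_queue *q, struct bio *bio)
    {
            int err = 0;

            /* ... service the bio, set err on failure ... */

            /* Now also fires the block_bio_complete tracepoint. */
            bio_endio(bio, err);
    }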
     

11 Jan, 2013

1 commit

  • Commit 975927b942c932 added blk_rq_pos to the request sort performed
    when flushing. That was aimed at the situation where a blk_plug handles
    multiple devices at the same time, such as an md device, but I think
    the same kind of situation can occur with only a single device.
    So remove the should_sort check. Since the should_sort parameter exists
    only for this purpose, it can also be deleted from blk_plug.

    CC: Shaohua Li
    Signed-off-by: Jianpeng Ma
    Signed-off-by: Jens Axboe

    Jianpeng Ma
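
    With should_sort gone, the plug list is simply sorted on every flush.
    The comparator handed to list_sort() orders requests by queue first
    and by starting sector second, roughly along these lines (a sketch
    reconstructed from the description, not the verbatim blk-core code):

    #include <linux/blkdev.h>
    #include <linux/kernel.h>
    #include <linux/list_sort.h>

    static int demo_plug_rq_cmp(void *priv, struct list_head *a,
                                struct list_head *b)
    {
            struct request *rqa = container_of(a, struct request, queuelist);
            struct request *rqb = container_of(b, struct request, queuelist);

            /* Group by queue, then order by starting sector within a queue. */
            return !(rqa->q < rqb->q ||
                     (rqa->q == rqb->q && blk_rq_pos(rqa) < blk_rq_pos(rqb)));
    }

    /* Used as: list_sort(NULL, &plug->list, demo_plug_rq_cmp); */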
     

06 Dec, 2012

5 commits

  • Some request_fn implementations, e.g. scsi_request_fn(), unlock
    the queue lock internally. This may result in multiple threads
    executing request_fn for the same queue simultaneously. Keep
    track of the number of active request_fn calls and make sure that
    blk_cleanup_queue() waits until all active request_fn invocations
    have finished. A block driver may start cleaning up resources
    needed by its request_fn as soon as blk_cleanup_queue() finished,
    so blk_cleanup_queue() must wait for all outstanding request_fn
    invocations to finish.

    Signed-off-by: Bart Van Assche
    Reported-by: Chanho Min
    Cc: James Bottomley
    Cc: Mike Christie
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
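
    The counting pattern described above looks roughly like the sketch
    below; request_fn_active is the counter added by this series, the
    surrounding helper is simplified and the drain-side wait is shown
    only as a comment.

    #include <linux/blkdev.h>

    /*
     * Queue lock is held on entry; request_fn may drop and reacquire it,
     * which is exactly why a simple lock-protected section isn't enough.
     */
    static void demo_run_queue_uncond(struct request_queue *q)
    {
            q->request_fn_active++;
            q->request_fn(q);
            q->request_fn_active--;
    }

    /*
     * blk_cleanup_queue()'s drain loop then also waits for
     * q->request_fn_active to drop back to zero before it returns.
     */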
     
  • Running a queue must continue after it has been marked dying until
    it has been marked dead. So the function blk_run_queue_async() must
    not schedule delayed work after blk_cleanup_queue() has marked a queue
    dead. Hence add a test for that queue state in blk_run_queue_async()
    and make sure that queue_unplugged() invokes that function with the
    queue lock held. This prevents the queue state from changing after
    it has been tested and before mod_delayed_work() is invoked. Drop
    the queue dying test in queue_unplugged() since it is now
    superfluous: __blk_run_queue() already tests whether or not the
    queue is dead.

    Signed-off-by: Bart Van Assche
    Cc: Mike Christie
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • A block driver may start cleaning up resources needed by its
    request_fn as soon as blk_cleanup_queue() has finished, so request_fn
    must not be invoked after draining has finished. This is important
    when blk_run_queue() is invoked without any requests in progress.
    As an example, if blk_drain_queue() and scsi_run_queue() run in
    parallel, blk_drain_queue() may have finished all requests after
    scsi_run_queue() has taken a SCSI device off the starved list but
    before that last function has had a chance to run the queue.

    Signed-off-by: Bart Van Assche
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Chanho Min
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • Let the caller of blk_drain_queue() obtain the queue lock to improve
    readability of the patch called "Avoid that request_fn is invoked on
    a dead queue".

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • QUEUE_FLAG_DEAD is used to indicate that queuing new requests must
    stop. After this flag has been set queue draining starts. However,
    during the queue draining phase it is still safe to invoke the
    queue's request_fn, so QUEUE_FLAG_DYING is a better name for this
    flag.

    This patch has been generated by running the following command
    over the kernel source tree:

    git grep -lEw 'blk_queue_dead|QUEUE_FLAG_DEAD' |
    xargs sed -i.tmp -e 's/blk_queue_dead/blk_queue_dying/g' \
    -e 's/QUEUE_FLAG_DEAD/QUEUE_FLAG_DYING/g'; \
    sed -i.tmp -e "s/QUEUE_FLAG_DYING$(printf \\t)*5/QUEUE_FLAG_DYING$(printf \\t)5/g" \
    include/linux/blkdev.h; \
    sed -i.tmp -e 's/ DEAD/ DYING/g' -e 's/dead queue/a dying queue/' \
    -e 's/Dead queue/A dying queue/' block/blk-core.c

    Signed-off-by: Bart Van Assche
    Acked-by: Tejun Heo
    Cc: James Bottomley
    Cc: Mike Christie
    Cc: Jens Axboe
    Cc: Chanho Min
    Signed-off-by: Jens Axboe

    Bart Van Assche
     

26 Oct, 2012

1 commit

  • My workload is a raid5 array with 16 disks, written through our
    filesystem in direct-I/O mode.

    Using blktrace I found these messages:
    8,16 0 6647 2.453665504 2579 M W 7493152 + 8 [md0_raid5]
    8,16 0 6648 2.453672411 2579 Q W 7493160 + 8 [md0_raid5]
    8,16 0 6649 2.453672606 2579 M W 7493160 + 8 [md0_raid5]
    8,16 0 6650 2.453679255 2579 Q W 7493168 + 8 [md0_raid5]
    8,16 0 6651 2.453679441 2579 M W 7493168 + 8 [md0_raid5]
    8,16 0 6652 2.453685948 2579 Q W 7493176 + 8 [md0_raid5]
    8,16 0 6653 2.453686149 2579 M W 7493176 + 8 [md0_raid5]
    8,16 0 6654 2.453693074 2579 Q W 7493184 + 8 [md0_raid5]
    8,16 0 6655 2.453693254 2579 M W 7493184 + 8 [md0_raid5]
    8,16 0 6656 2.453704290 2579 Q W 7493192 + 8 [md0_raid5]
    8,16 0 6657 2.453704482 2579 M W 7493192 + 8 [md0_raid5]
    8,16 0 6658 2.453715016 2579 Q W 7493200 + 8 [md0_raid5]
    8,16 0 6659 2.453715247 2579 M W 7493200 + 8 [md0_raid5]
    8,16 0 6660 2.453721730 2579 Q W 7493208 + 8 [md0_raid5]
    8,16 0 6661 2.453721974 2579 M W 7493208 + 8 [md0_raid5]
    8,16 0 6662 2.453728202 2579 Q W 7493216 + 8 [md0_raid5]
    8,16 0 6663 2.453728436 2579 M W 7493216 + 8 [md0_raid5]
    8,16 0 6664 2.453734782 2579 Q W 7493224 + 8 [md0_raid5]
    8,16 0 6665 2.453735019 2579 M W 7493224 + 8 [md0_raid5]
    8,16 0 6666 2.453741401 2579 Q W 7493232 + 8 [md0_raid5]
    8,16 0 6667 2.453741632 2579 M W 7493232 + 8 [md0_raid5]
    8,16 0 6668 2.453748148 2579 Q W 7493240 + 8 [md0_raid5]
    8,16 0 6669 2.453748386 2579 M W 7493240 + 8 [md0_raid5]
    8,16 0 6670 2.453851843 2579 I W 7493144 + 104 [md0_raid5]
    8,16 0 0 2.453853661 0 m N cfq2579 insert_request
    8,16 0 6671 2.453854064 2579 I W 7493120 + 24 [md0_raid5]
    8,16 0 0 2.453854439 0 m N cfq2579 insert_request
    8,16 0 6672 2.453854793 2579 U N [md0_raid5] 2
    8,16 0 0 2.453855513 0 m N cfq2579 Not idling.st->count:1
    8,16 0 0 2.453855927 0 m N cfq2579 dispatch_insert
    8,16 0 0 2.453861771 0 m N cfq2579 dispatched a request
    8,16 0 0 2.453862248 0 m N cfq2579 activate rq,drv=1
    8,16 0 6673 2.453862332 2579 D W 7493120 + 24 [md0_raid5]
    8,16 0 0 2.453865957 0 m N cfq2579 Not idling.st->count:1
    8,16 0 0 2.453866269 0 m N cfq2579 dispatch_insert
    8,16 0 0 2.453866707 0 m N cfq2579 dispatched a request
    8,16 0 0 2.453867061 0 m N cfq2579 activate rq,drv=2
    8,16 0 6674 2.453867145 2579 D W 7493144 + 104 [md0_raid5]
    8,16 0 6675 2.454147608 0 C W 7493120 + 24 [0]
    8,16 0 0 2.454149357 0 m N cfq2579 complete rqnoidle 0
    8,16 0 6676 2.454791505 0 C W 7493144 + 104 [0]
    8,16 0 0 2.454794803 0 m N cfq2579 complete rqnoidle 0
    8,16 0 0 2.454795160 0 m N cfq schedule dispatch

    From the messages above we can see that rq[W 7493144 + 104] and
    rq[W 7493120 + 24] do not merge, because the bio order is:
    8,16 0 6638 2.453619407 2579 Q W 7493144 + 8 [md0_raid5]
    8,16 0 6639 2.453620460 2579 G W 7493144 + 8 [md0_raid5]
    8,16 0 6640 2.453639311 2579 Q W 7493120 + 8 [md0_raid5]
    8,16 0 6641 2.453639842 2579 G W 7493120 + 8 [md0_raid5]
    bio(7493144) comes first and bio(7493120) later, so the subsequent
    bios are divided into two parts. When the plug list is flushed,
    elv_attempt_insert_merge() only supports back merging, not front
    merging, so rq[7493120 + 24] can't merge with rq[7493144 + 104].

    In my testing this situation accounts for about 25% of requests on
    our system. With this patch it no longer occurs.

    Signed-off-by: Jianpeng Ma
    CC: Shaohua Li
    Signed-off-by: Jens Axboe

    Jianpeng Ma
     

11 Oct, 2012

1 commit

  • Pull block IO update from Jens Axboe:
    "Core block IO bits for 3.7. Not a huge round this time, it contains:

    - First series from Kent cleaning up and generalizing bio allocation
    and freeing.

    - WRITE_SAME support from Martin.

    - Mikulas patches to prevent O_DIRECT crashes when someone changes
    the block size of a device.

    - Make bio_split() work on data-less bio's (like trim/discards).

    - A few other minor fixups."

    Fixed up silent semantic mis-merge as per Mikulas Patocka and Andrew
    Morton. It is due to the VM no longer using a prio-tree (see commit
    6b2dbba8b6ac: "mm: replace vma prio_tree with an interval tree").

    So make set_blocksize() use mapping_mapped() instead of open-coding the
    internal VM knowledge that has changed.

    * 'for-3.7/core' of git://git.kernel.dk/linux-block: (26 commits)
    block: makes bio_split support bio without data
    scatterlist: refactor the sg_nents
    scatterlist: add sg_nents
    fs: fix include/percpu-rwsem.h export error
    percpu-rw-semaphore: fix documentation typos
    fs/block_dev.c:1644:5: sparse: symbol 'blkdev_mmap' was not declared
    blockdev: turn a rw semaphore into a percpu rw semaphore
    Fix a crash when block device is read and block size is changed at the same time
    block: fix request_queue->flags initialization
    block: lift the initial queue bypass mode on blk_register_queue() instead of blk_init_allocated_queue()
    block: ioctl to zero block ranges
    block: Make blkdev_issue_zeroout use WRITE SAME
    block: Implement support for WRITE SAME
    block: Consolidate command flag and queue limit checks for merges
    block: Clean up special command handling logic
    block/blk-tag.c: Remove useless kfree
    block: remove the duplicated setting for congestion_threshold
    block: reject invalid queue attribute values
    block: Add bio_clone_bioset(), bio_clone_kmalloc()
    block: Consolidate bio_alloc_bioset(), bio_kmalloc()
    ...

    Linus Torvalds
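
    The percpu rw-semaphore mentioned above (used for the O_DIRECT vs.
    block-size-change fix) follows the usual asymmetric pattern sketched
    here; the demo_ names are illustrative, only the percpu_*_read/write
    calls are the real interface.

    #include <linux/percpu-rwsem.h>

    static struct percpu_rw_semaphore demo_blocksize_sem;

    /* Hot path: readers pay almost nothing. */
    static void demo_io_path(void)
    {
            percpu_down_read(&demo_blocksize_sem);
            /* ... issue I/O that relies on a stable block size ... */
            percpu_up_read(&demo_blocksize_sem);
    }

    /* Rare path: the writer waits for all readers to drain. */
    static void demo_set_blocksize(void)
    {
            percpu_down_write(&demo_blocksize_sem);
            /* ... safe to change the block size here ... */
            percpu_up_write(&demo_blocksize_sem);
    }

    /* percpu_init_rwsem(&demo_blocksize_sem) must run once at init time. */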
     

03 Oct, 2012

1 commit

  • Pull workqueue changes from Tejun Heo:
    "This is workqueue updates for v3.7-rc1. A lot of activities this
    round including considerable API and behavior cleanups.

    * delayed_work combines a timer and a work item. The handling of the
    timer part has always been a bit clunky, leading to a confusing
    cancelation API with weird corner-case behaviors. delayed_work is
    updated to use the new IRQ-safe timer and cancelation now works as
    expected.

    * Another deficiency of delayed_work was lack of the counterpart of
    mod_timer() which led to cancel+queue combinations or open-coded
    timer+work usages. mod_delayed_work[_on]() are added.

    These two delayed_work changes make delayed_work provide the
    interface and behavior of a timer that executes in process context.

    * A work item could be executed concurrently on multiple CPUs, which
    is rather unintuitive and made flush_work() behavior confusing and
    half-broken under certain circumstances. This problem doesn't
    exist for non-reentrant workqueues. While non-reentrancy check
    isn't free, the overhead is incurred only when a work item bounces
    across different CPUs and even in simulated pathological scenario
    the overhead isn't too high.

    All workqueues are made non-reentrant. This removes the
    distinction between flush_[delayed_]work() and
    flush_[delayed_]work_sync(). The former is now as strong as the
    latter and the specified work item is guaranteed to have finished
    execution of any previous queueing on return.

    * In addition to the various bug fixes, Lai redid and simplified CPU
    hotplug handling significantly.

    * Joonsoo introduced system_highpri_wq and used it during CPU
    hotplug.

    There are two merge commits - one to pull in IRQ safe timer from
    tip/timers/core and the other to pull in CPU hotplug fixes from
    wq/for-3.6-fixes as Lai's hotplug restructuring depended on them."

    Fixed a number of trivial conflicts, but the more interesting conflicts
    were silent ones where the deprecated interfaces had been used by new
    code in the merge window, and thus didn't cause any real data conflicts.

    Tejun pointed out a few of them, I fixed a couple more.

    * 'for-3.7' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (46 commits)
    workqueue: remove spurious WARN_ON_ONCE(in_irq()) from try_to_grab_pending()
    workqueue: use cwq_set_max_active() helper for workqueue_set_max_active()
    workqueue: introduce cwq_set_max_active() helper for thaw_workqueues()
    workqueue: remove @delayed from cwq_dec_nr_in_flight()
    workqueue: fix possible stall on try_to_grab_pending() of a delayed work item
    workqueue: use hotcpu_notifier() for workqueue_cpu_down_callback()
    workqueue: use __cpuinit instead of __devinit for cpu callbacks
    workqueue: rename manager_mutex to assoc_mutex
    workqueue: WORKER_REBIND is no longer necessary for idle rebinding
    workqueue: WORKER_REBIND is no longer necessary for busy rebinding
    workqueue: reimplement idle worker rebinding
    workqueue: deprecate __cancel_delayed_work()
    workqueue: reimplement cancel_delayed_work() using try_to_grab_pending()
    workqueue: use mod_delayed_work() instead of __cancel + queue
    workqueue: use irqsafe timer for delayed_work
    workqueue: clean up delayed_work initializers and add missing one
    workqueue: make deferrable delayed_work initializer names consistent
    workqueue: cosmetic whitespace updates for macro definitions
    workqueue: deprecate system_nrt[_freezable]_wq
    workqueue: deprecate flush[_delayed]_work_sync()
    ...

    Linus Torvalds
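
    The practical upshot of the non-reentrancy change for driver code is
    sketched below: plain flush_work() now gives the guarantee that
    previously required the _sync() variant. The demo_ names are
    illustrative.

    #include <linux/workqueue.h>

    static void demo_fn(struct work_struct *work)
    {
            /* ... do the deferred work ... */
    }

    static DECLARE_WORK(demo_work, demo_fn);

    static void demo_teardown(void)
    {
            /*
             * With all workqueues non-reentrant, flush_work() guarantees
             * that every previously queued instance of demo_work has
             * finished executing by the time it returns; a separate
             * flush_work_sync() call is no longer needed.
             */
            flush_work(&demo_work);
    }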
     

21 Sep, 2012

2 commits

  • A queue newly allocated with blk_alloc_queue_node() has only
    QUEUE_FLAG_BYPASS set. For request-based drivers,
    blk_init_allocated_queue() is called and q->queue_flags is overwritten
    with QUEUE_FLAG_DEFAULT which doesn't include BYPASS even though the
    initial bypass is still in effect.

    In blk_init_allocated_queue(), OR QUEUE_FLAG_DEFAULT into
    q->queue_flags instead of overwriting it.

    Signed-off-by: Tejun Heo
    Cc: stable@vger.kernel.org
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
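
    Schematically, the fix is the difference between the two statements
    below (q being the request_queue set up in blk_init_allocated_queue());
    only the OR form preserves the BYPASS bit set at allocation.

    /* Before: plain assignment wiped out the QUEUE_FLAG_BYPASS bit that
     * blk_alloc_queue_node() set. */
    q->queue_flags = QUEUE_FLAG_DEFAULT;

    /* After: OR the default flags in, so the initial bypass state survives. */
    q->queue_flags |= QUEUE_FLAG_DEFAULT;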
     
  • …_init_allocated_queue()

  • b82d4b197c ("blkcg: make request_queue bypassing on allocation") made
    request_queues start out bypassed on allocation to avoid switching
    bypass mode on and off while a queue is being initialized. Some
    drivers allocate and then destroy a lot of queues without fully
    initializing them, and incurring the bypass latency overhead on each
    of them could add up to significant overhead.

    Unfortunately, blk_init_allocated_queue() is never used by queues of
    bio-based drivers, which means that all bio-based driver queues are in
    bypass mode even after initialization and registration complete
    successfully.

    Due to the limited way request_queues are used by bio drivers, this
    problem is hidden pretty well but it shows up when blk-throttle is
    used in combination with a bio-based driver. Trying to configure
    (echoing to cgroupfs file) blk-throttle for a bio-based driver hangs
    indefinitely in blkg_conf_prep() waiting for bypass mode to end.

    This patch moves the initial blk_queue_bypass_end() call from
    blk_init_allocated_queue() to blk_register_queue(), which is called
    for any userland-visible queue regardless of its type.

    I believe this is correct because I don't think there is any block
    driver which needs or wants working elevator and blk-cgroup on a queue
    which isn't visible to userland. If there are such users, we need a
    different solution.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Reported-by: Joseph Glanville <joseph.glanville@orionvm.com.au>
    Cc: stable@vger.kernel.org
    Acked-by: Vivek Goyal <vgoyal@redhat.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

    Tejun Heo
     

20 Sep, 2012

3 commits

  • The WRITE SAME command supported on some SCSI devices allows the same
    block to be efficiently replicated throughout a block range. Only a
    single logical block is transferred from the host and the storage device
    writes the same data to all blocks described by the I/O.

    This patch implements support for WRITE SAME in the block layer. The
    blkdev_issue_write_same() function can be used by filesystems and block
    drivers to replicate a buffer across a block range. This can be used to
    efficiently initialize software RAID devices, etc.

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
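
    A minimal sketch of how a caller might use the new helper; the
    prototype is reproduced from memory of this series, and the demo_
    wrapper and its page handling are hypothetical.

    #include <linux/blkdev.h>
    #include <linux/gfp.h>

    /*
     * Replicate the contents of @pattern across @nr_sects sectors starting
     * at @start. Only one logical block travels to the device; the device
     * writes it to every block in the range.
     */
    static int demo_fill_range(struct block_device *bdev, sector_t start,
                               sector_t nr_sects, struct page *pattern)
    {
            return blkdev_issue_write_same(bdev, start, nr_sects,
                                           GFP_NOFS, pattern);
    }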
     
  • - blk_check_merge_flags() verifies that cmd_flags / bi_rw are
    compatible. This function is called for both req-req and req-bio
    merging.

    - blk_rq_get_max_sectors() and blk_queue_get_max_sectors() can be used
    to query the maximum sector count for a given request or queue. The
    calls will return the right value from the queue limits given the
    type of command (RW, discard, write same, etc.)

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Remove special-casing of non-rw fs style requests (discard). The nomerge
    flags are consolidated in blk_types.h, and rq_mergeable() and
    bio_mergeable() have been modified to use them.

    bio_is_rw() is used in place of bio_has_data() in a few places. This
    is done to distinguish true reads and writes from other fs type
    requests that carry a payload (e.g. write same).

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

09 Sep, 2012

4 commits

  • By the time this call is reached, blk_queue_congestion_threshold()
    has already been called from blk_queue_make_request().
    Because the code is duplicated, it has been removed.

    Signed-off-by: Jaehoon Chung
    Signed-off-by: Kyungmin Park
    Signed-off-by: Jens Axboe

    Jaehoon Chung
     
  • Previously, there was bio_clone() but it only allocated from the fs bio
    set; as a result various users were open coding it and using
    __bio_clone().

    This changes bio_clone() to become bio_clone_bioset(), and then we add
    bio_clone() and bio_clone_kmalloc() as wrappers around it, making use of
    the functionality the last patch added.

    This will also help in a later patch changing how bio cloning works.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: NeilBrown
    CC: Alasdair Kergon
    CC: Boaz Harrosh
    CC: Jeff Garzik
    Acked-by: Jeff Garzik
    Signed-off-by: Jens Axboe

    Kent Overstreet
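
    A sketch of the intended use from a stacking driver's point of view:
    clone from a private bio_set so the clone cannot contend with the
    shared fs_bio_set. The demo_ names are illustrative; bio_clone_bioset()
    and the bioset_create() assumed to have run at init time are the real
    interfaces.

    #include <linux/bio.h>
    #include <linux/gfp.h>

    static struct bio *demo_clone(struct bio *src, struct bio_set *demo_bs)
    {
            struct bio *clone;

            clone = bio_clone_bioset(src, GFP_NOIO, demo_bs);
            if (!clone)
                    return NULL;

            /* Remember the parent so the endio handler can complete it. */
            clone->bi_private = src;
            return clone;
    }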
     
  • Now that we've got generic code for freeing bios allocated from bio
    pools, this isn't needed anymore.

    This patch also makes bio_free() static, since without bi_destructor
    there should be no need for it to be called anywhere else.

    bio_free() is now only called from bio_put, so we can refactor those a
    bit - move some code from bio_put() to bio_free() and kill the redundant
    bio->bi_next = NULL.

    v5: Switch to BIO_KMALLOC_POOL ((void *)~0), per Boaz
    v6: BIO_KMALLOC_POOL now NULL, drop bio_free's EXPORT_SYMBOL
    v7: No #define BIO_KMALLOC_POOL anymore

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Kent Overstreet
     
  • Now that bios keep track of where they were allocated from,
    bio_integrity_alloc_bioset() becomes redundant.

    Remove bio_integrity_alloc_bioset() and drop bio_set argument from the
    related functions and make them use bio->bi_pool.

    Signed-off-by: Kent Overstreet
    CC: Jens Axboe
    CC: Martin K. Petersen
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Kent Overstreet
     

31 Aug, 2012

1 commit

  • When performing a cable pull test w/ active stress I/O using fio over
    a dual port Intel 82599 FCoE CNA, w/ 256 LUNs on one port and about 32
    LUNs on the other, the system becomes unusable because scsi-ml is busy
    printing error messages for all the failing commands. I don't believe
    this problem is specific to FCoE, and since these commands are failing
    anyway due to the link being down (DID_NO_CONNECT), just rate-limit
    the messages here to solve the issue.

    v2->v1: use __ratelimit(), as Tomas Henzl mentioned, as the proper way
    to rate-limit per function. However, in this case, the failed i/o gets to
    blk_end_request_err() and then blk_update_request(), which also has to
    be rate-limited, as added in the v2 of this patch.

    v3-v2: resolved conflict to apply on current 3.6-rc3 upstream tip.

    Signed-off-by: Yi Zou
    Cc: www.Open-FCoE.org
    Cc: Tomas Henzl
    Cc:
    Signed-off-by: Jens Axboe

    Yi Zou
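
    The per-call-site pattern being referred to is sketched below; the
    demo_ function is illustrative, the rate-limit machinery is the
    standard <linux/ratelimit.h> one.

    #include <linux/kernel.h>
    #include <linux/ratelimit.h>

    static void demo_report_io_error(int error, sector_t sector)
    {
            /* One rate-limit state per message site. */
            static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
                                          DEFAULT_RATELIMIT_BURST);

            if (__ratelimit(&rs))
                    pr_err("demo: I/O error %d at sector %llu\n",
                           error, (unsigned long long)sector);
    }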
     

22 Aug, 2012

2 commits

  • Now that cancel_delayed_work() can be safely called from IRQ handlers,
    there's no reason to use __cancel_delayed_work(). Use
    cancel_delayed_work() instead of __cancel_delayed_work() and mark the
    latter deprecated.

    Signed-off-by: Tejun Heo
    Acked-by: Jens Axboe
    Cc: Jiri Kosina
    Cc: Roland Dreier
    Cc: Tomi Valkeinen

    Tejun Heo
     
  • Now that mod_delayed_work() is safe to call from IRQ handlers,
    __cancel_delayed_work() followed by queue_delayed_work() can be
    replaced with mod_delayed_work().

    Most conversions are straight-forward except for the following.

    * net/core/link_watch.c: linkwatch_schedule_work() was doing quite an
    elaborate dance around its delayed_work. Collapse it such that
    linkwatch_work is queued for immediate execution if LW_URGENT and the
    existing timer is kept otherwise.

    Signed-off-by: Tejun Heo
    Cc: "David S. Miller"
    Cc: Tomi Valkeinen

    Tejun Heo
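
    The before/after shape of a typical conversion, sketched with an
    illustrative delayed work item:

    #include <linux/workqueue.h>

    static void demo_fn(struct work_struct *work)
    {
            /* ... */
    }

    static DECLARE_DELAYED_WORK(demo_dwork, demo_fn);

    /* Before: cancel + queue, with the ordering subtleties that implies. */
    static void demo_rearm_old(unsigned long delay)
    {
            __cancel_delayed_work(&demo_dwork);
            queue_delayed_work(system_wq, &demo_dwork, delay);
    }

    /* After: one call that (re)arms the timer whether or not it is pending. */
    static void demo_rearm_new(unsigned long delay)
    {
            mod_delayed_work(system_wq, &demo_dwork, delay);
    }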
     

31 Jul, 2012

3 commits

  • This will allow md/raid to know why the unplug was called,
    and to act accordingly - if !from_schedule it
    is safe to perform tasks which could themselves schedule.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     
  • MD raid1 prepares to dispatch requests in its unplug callback. If make_request
    in the low level queue also uses an unplug callback to dispatch requests, the
    low level queue's unplug callback will not be called. Rechecking the callback
    list helps this case.

    Signed-off-by: Shaohua Li
    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Both md and umem have similar code for getting notified of a
    blk_finish_plug event.
    Centralize this code in block/ and allow each driver to
    provide its distinctive difference.

    Signed-off-by: NeilBrown
    Signed-off-by: Jens Axboe

    NeilBrown
     

27 Jun, 2012

1 commit

  • Currently, request_queue has one request_list to allocate requests
    from regardless of blkcg of the IO being issued. When the unified
    request pool is used up, cfq proportional IO limits become meaningless
    - whoever grabs the next request being freed wins the race regardless
    of the configured weights.

    This can be easily demonstrated by creating a blkio cgroup w/ very low
    weight, putting a program which can issue a lot of random direct IOs
    there and running a sequential IO from a different cgroup. As soon as the
    request pool is used up, the sequential IO bandwidth crashes.

    This patch implements per-blkg request_list. Each blkg has its own
    request_list and any IO allocates its request from the matching blkg
    making blkcgs completely isolated in terms of request allocation.

    * Root blkcg uses the request_list embedded in each request_queue,
    which was renamed to @q->root_rl from @q->rq. While making blkcg rl
    handling a bit hairier, this enables avoiding most overhead for root
    blkcg.

    * Queue fullness is properly per request_list but bdi isn't blkcg
    aware yet, so congestion state currently just follows the root
    blkcg. As writeback isn't aware of blkcg yet, this works okay for
    async congestion but readahead may get the wrong signals. It's
    better than blkcg completely collapsing with shared request_list but
    needs to be improved with future changes.

    * After this change, each block cgroup gets a full request pool making
    resource consumption of each cgroup higher. This makes allowing
    non-root users to create cgroups less desirable; however, note that
    allowing non-root users to directly manage cgroups is already
    severely broken regardless of this patch - each block cgroup
    consumes kernel memory and skews IO weight (IO weights are not
    hierarchical).

    v2: queue-sysfs.txt updated and patch description updated as suggested
    by Vivek.

    v3: blk_get_rl() wasn't checking error return from
    blkg_lookup_create() and may cause oops on lookup failure. Fix it
    by falling back to root_rl on blkg lookup failures. This problem
    was spotted by Rakesh Iyer.

    v4: Updated to accommodate 458f27a982 "block: Avoid missed wakeup in
    request waitqueue". blk_drain_queue() now wakes up waiters on all
    blkg->rl on the target queue.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Cc: Wu Fengguang
    Signed-off-by: Jens Axboe

    Tejun Heo
     

25 Jun, 2012

5 commits

  • Request allocation is about to be made per-blkg meaning that there'll
    be multiple request lists.

    * Make queue full state per request_list. blk_*queue_full() functions
    are renamed to blk_*rl_full() and take @rl instead of @q.

    * Rename blk_init_free_list() to blk_init_rl() and make it take @rl
    instead of @q. Also add @gfp_mask parameter.

    * Add blk_exit_rl() instead of destroying rl directly from
    blk_release_queue().

    * Add request_list->q and make request alloc/free functions -
    blk_free_request(), [__]freed_request(), __get_request() - take @rl
    instead of @q.

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Add q->nr_rqs[] which currently behaves the same as q->rq.count[] and
    move q->rq.elvpriv to q->nr_rqs_elvpriv. blk_drain_queue() is updated
    to use q->nr_rqs[] instead of q->rq.count[].

    These counters separate queue-wide request statistics from the
    request list and allow implementation of per-queue request allocation.

    While at it, properly indent fields of struct request_list.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The block layer does very lazy allocation of ioc. It waits until the
    moment the ioc is absolutely necessary; unfortunately, that time could
    be inside the queue lock, and __get_request() performs an unlock - try
    alloc - retry dance.

    Just allocate it up-front on entry to block layer. We're not saving
    the rain forest by deferring it to the last possible moment and
    complicating things unnecessarily.

    This patch is to prepare for further updates to request allocation
    path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently, there are two request allocation functions - get_request()
    and get_request_wait(). The former tries to allocate a request once;
    the latter wraps the former and keeps retrying until allocation succeeds.

    The combination of the two functions delivers fallible non-wait
    allocation, fallible wait allocation and unfailing wait allocation. However,
    given that forward progress is guaranteed, fallible wait allocation
    isn't all that useful and in fact nobody uses it.

    This patch simplifies the interface as follows.

    * get_request() is renamed to __get_request() and is only used by the
    wrapper function.

    * get_request_wait() is renamed to get_request(). It now takes
    @gfp_mask and retries iff it contains %__GFP_WAIT.

    This patch doesn't introduce any functional change and is to prepare
    for further updates to request allocation path.

    Signed-off-by: Tejun Heo
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • mempool_create_node() currently assumes %GFP_KERNEL. Its only user,
    blk_init_free_list(), is about to be updated to use other allocation
    flags - add @gfp_mask argument to the function.

    Signed-off-by: Tejun Heo
    Cc: Andrew Morton
    Cc: Hugh Dickins
    Signed-off-by: Jens Axboe

    Tejun Heo
     

15 Jun, 2012

2 commits

  • Commit 777eb1bf15b8532c396821774bf6451e563438f5 disconnects the externally
    supplied queue_lock before blk_drain_queue(). Switching the lock would
    introduce a lock imbalance because threads which have taken the external
    lock might unlock the internal lock during the queue drain. This patch
    mitigates that by disconnecting the lock after the queue drain, since
    draining makes a lot of request_queue users go away.

    However, please note that this patch only makes the problem less likely
    to happen. Anyone who still holds a ref might try to issue a new request
    on a dead queue after blk_cleanup_queue() finishes draining, and the lock
    imbalance might still happen in that case.

    =====================================
    [ BUG: bad unlock balance detected! ]
    3.4.0+ #288 Not tainted
    -------------------------------------
    fio/17706 is trying to release lock (&(&q->__queue_lock)->rlock) at:
    [] blk_queue_bio+0x2a2/0x380
    but there are no more locks to release!

    other info that might help us debug this:
    1 lock held by fio/17706:
    #0: (&(&vblk->lock)->rlock){......}, at: []
    get_request_wait+0x19a/0x250

    stack backtrace:
    Pid: 17706, comm: fio Not tainted 3.4.0+ #288
    Call Trace:
    [] ? blk_queue_bio+0x2a2/0x380
    [] print_unlock_inbalance_bug+0xf9/0x100
    [] lock_release_non_nested+0x1df/0x330
    [] ? dio_bio_end_aio+0x34/0xc0
    [] ? bio_check_pages_dirty+0x85/0xe0
    [] ? dio_bio_end_aio+0xb1/0xc0
    [] ? blk_queue_bio+0x2a2/0x380
    [] ? blk_queue_bio+0x2a2/0x380
    [] lock_release+0xd9/0x250
    [] _raw_spin_unlock_irq+0x23/0x40
    [] blk_queue_bio+0x2a2/0x380
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x76/0xf0
    [] ? set_page_dirty_lock+0x3c/0x60
    [] ? bio_set_pages_dirty+0x51/0x70
    [] do_blockdev_direct_IO+0xbf8/0xee0
    [] ? blkdev_get_block+0x80/0x80
    [] __blockdev_direct_IO+0x55/0x60
    [] ? blkdev_get_block+0x80/0x80
    [] blkdev_direct_IO+0x57/0x60
    [] ? blkdev_get_block+0x80/0x80
    [] generic_file_aio_read+0x70e/0x760
    [] ? __lock_acquire+0x215/0x5a0
    [] ? aio_run_iocb+0x54/0x1a0
    [] ? grab_cache_page_nowait+0xc0/0xc0
    [] aio_rw_vect_retry+0x7c/0x1e0
    [] ? aio_fsync+0x30/0x30
    [] aio_run_iocb+0x66/0x1a0
    [] do_io_submit+0x6f0/0xb80
    [] ? trace_hardirqs_on_thunk+0x3a/0x3f
    [] sys_io_submit+0x10/0x20
    [] system_call_fastpath+0x16/0x1b

    Changes since v2: Update commit log to explain how the code is still
    broken even if we delay the lock switching after the drain.
    Changes since v1: Update commit log as Tejun suggested.

    Acked-by: Tejun Heo
    Signed-off-by: Asias He
    Signed-off-by: Jens Axboe

    Asias He
     
  • After hot-unplugging a stressed disk, I found that rl->wait[] is not
    empty while rl->count[] is empty and there are threads still sleeping
    on get_request after the queue cleanup. With simple debug code, I found
    there are exactly nr_sleep - nr_wakeup threads in D state, so there
    are missed wakeups.

    $ dmesg | grep nr_sleep
    [ 52.917115] ---> nr_sleep=1046, nr_wakeup=873, delta=173
    $ vmstat 1
    1 173 0 712640 24292 96172 0 0 0 0 419 757 0 0 0 100 0

    To quote Tejun:

    Ah, okay, freed_request() wakes up single waiter with the assumption
    that after the wakeup there will at least be one successful allocation
    which in turn will continue the wakeup chain until the wait list is
    empty - ie. waiter wakeup is dependent on successful request
    allocation happening after each wakeup. With queue marked dead, any
    woken up waiter fails the allocation path, so the wakeup chaining is
    lost and we're left with hung waiters. What we need is wake_up_all()
    after drain completion.

    This patch fixes the missed wakeups by waking up all the threads which
    are sleeping on the wait queue after the queue drain.

    Changes in v2: Drop waitqueue_active() optimization

    Acked-by: Tejun Heo
    Signed-off-by: Asias He

    I fixed a bug where stacked devices would oops on calling
    blk_drain_queue(), since ->rq.wait[] does not get initialized unless
    it's a full queue setup.

    Signed-off-by: Jens Axboe

    Asias He
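
    The gist of the fix as a sketch; field names follow the 3.5-era struct
    request_list with its two sync/async waitqueues, so treat the exact
    layout as approximate.

    #include <linux/blkdev.h>
    #include <linux/kernel.h>
    #include <linux/wait.h>

    /* Call after draining: wake every thread still parked in get_request(). */
    static void demo_wake_all_rq_waiters(struct request_queue *q)
    {
            int i;

            /* Only fully set up request-based queues initialize these. */
            if (!q->request_fn)
                    return;

            for (i = 0; i < ARRAY_SIZE(q->rq.wait); i++)
                    wake_up_all(&q->rq.wait[i]);
    }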
     

20 Apr, 2012

1 commit

  • Request allocation is mempool backed to guarantee forward progress
    under memory pressure; unfortunately, this property got broken while
    adding elvpriv data. Failures during elvpriv allocation, including
    ioc and icq creation failures, currently make get_request() fail as
    whole. There's no forward progress guarantee for these allocations -
    they may fail indefinitely under memory pressure stalling IO and
    deadlocking the system.

    This patch updates get_request() such that elvpriv allocation failure
    doesn't make the whole function fail. If elvpriv allocation fails,
    the allocation is degraded into !ELVPRIV. This will force the request
    to ELEVATOR_INSERT_BACK disturbing scheduling but elvpriv alloc
    failures should be rare (nothing is per-request) and anything is
    better than deadlocking.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo