16 Jan, 2012

1 commit

  • * 'for-3.3/core' of git://git.kernel.dk/linux-block: (37 commits)
    Revert "block: recursive merge requests"
    block: Stop using macro stubs for the bio data integrity calls
    blockdev: convert some macros to static inlines
    fs: remove unneeded plug in mpage_readpages()
    block: Add BLKROTATIONAL ioctl
    block: Introduce blk_set_stacking_limits function
    block: remove WARN_ON_ONCE() in exit_io_context()
    block: an exiting task should be allowed to create io_context
    block: ioc_cgroup_changed() needs to be exported
    block: recursive merge requests
    block, cfq: fix empty queue crash caused by request merge
    block, cfq: move icq creation and rq->elv.icq association to block core
    block, cfq: restructure io_cq creation path for io_context interface cleanup
    block, cfq: move io_cq exit/release to blk-ioc.c
    block, cfq: move icq cache management to block core
    block, cfq: move io_cq lookup to blk-ioc.c
    block, cfq: move cfqd->icq_list to request_queue and add request->elv.icq
    block, cfq: reorganize cfq_io_context into generic and cfq specific parts
    block: remove elevator_queue->ops
    block: reorder elevator switch sequence
    ...

    Fix up conflicts in:
    - block/blk-cgroup.c
    Switch from can_attach_task to can_attach
    - block/cfq-iosched.c
    conflict with now removed cic index changes (we now use q->id instead)

    Linus Torvalds
     

16 Dec, 2011

1 commit

  • While probing, fd sets up queue, probes hardware and tears down the
    queue if probing fails. In the process, blk_drain_queue() kicks the
    queue which failed to finish initialization and fd is unhappy about
    that.

    floppy0: no floppy controllers found
    ------------[ cut here ]------------
    WARNING: at drivers/block/floppy.c:2929 do_fd_request+0xbf/0xd0()
    Hardware name: To Be Filled By O.E.M.
    VFS: do_fd_request called on non-open device
    Modules linked in:
    Pid: 1, comm: swapper Not tainted 3.2.0-rc4-00077-g5983fe2 #2
    Call Trace:
    [] warn_slowpath_common+0x7a/0xb0
    [] warn_slowpath_fmt+0x41/0x50
    [] do_fd_request+0xbf/0xd0
    [] blk_drain_queue+0x65/0x80
    [] blk_cleanup_queue+0xe3/0x1a0
    [] floppy_init+0xdeb/0xe28
    [] ? daring+0x6b/0x6b
    [] do_one_initcall+0x3f/0x170
    [] kernel_init+0x9d/0x11e
    [] ? schedule_tail+0x22/0xa0
    [] kernel_thread_helper+0x4/0x10
    [] ? start_kernel+0x2be/0x2be
    [] ? gs_change+0xb/0xb

    Avoid it by making blk_drain_queue() kick queue iff dispatch queue has
    something on it.
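
The rule above can be illustrated with a minimal userspace sketch. The struct and function names (`model_queue`, `model_run_queue`, `model_drain_once`) are hypothetical stand-ins, not the kernel's `struct request_queue` or `blk_drain_queue()`; the only point is that the driver is not kicked when nothing is waiting to be dispatched.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; not the kernel's struct request_queue. */
struct model_queue {
    int nr_dispatch;   /* requests sitting on the dispatch list */
    int kicks;         /* how many times the driver was kicked */
};

/* Stand-in for running the queue, i.e. invoking the driver's request function. */
void model_run_queue(struct model_queue *q)
{
    q->kicks++;
    q->nr_dispatch = 0;   /* pretend the driver consumed everything */
}

/* One drain pass: kick the driver only if the dispatch list is non-empty. */
bool model_drain_once(struct model_queue *q)
{
    if (q->nr_dispatch > 0)
        model_run_queue(q);
    return q->nr_dispatch == 0;
}
```

With this shape, a queue that never finished initialization (and so has nothing queued) is drained without ever calling into the driver.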

    Signed-off-by: Tejun Heo
    Reported-by: Ralf Hildebrandt
    Reported-by: Wu Fengguang
    Tested-by: Sergei Trofimovich
    Signed-off-by: Jens Axboe

    Tejun Heo
     

14 Dec, 2011

9 commits

  • Now block layer knows everything necessary to create and associate
    icq's with requests. Move ioc_create_icq() to blk-ioc.c and update
    get_request() such that, if elevator_type->icq_size is set, requests
    are automatically associated with their matching icq's before
    elv_set_request(). io_context reference is also managed by block core
    on request alloc/free.

    * Only ioprio/cgroup changed handling remains from cfq_get_cic().
    Collapsed into cfq_set_request().

    * This removes queue kicking on icq allocation failure (for now). As
    icq allocation failure is rare and the only effect of queue kicking
    achieved was possibly accelerating queue processing, this change
    shouldn't be noticeable.

    There is a larger underlying problem. Unlike request allocation,
    icq allocation is not guaranteed to succeed eventually after
    retries. The number of icq is unbound and thus mempool can't be the
    solution either. This effectively adds allocation dependency on
    memory free path and thus possibility of deadlock.

    This usually wouldn't happen because icq allocation is not a hot
    path and, even when the condition triggers, it's highly unlikely
    that none of the writeback workers already has icq.

    However, this is still possible especially if elevator is being
    switched under high memory pressure, so we better get it fixed.
    Probably the only solution is just bypassing elevator and appending
    to dispatch queue on any elevator allocation failure.

    * Comment added to explain how icq's are managed and synchronized.

    This completes cleanup of io_context interface.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Most of icq management is about to be moved out of cfq into blk-ioc.
    This patch prepares for it.

    * Move cfqd->icq_list to request_queue->icq_list

    * Make request explicitly point to icq instead of through elevator
    private data. ->elevator_private[3] is replaced with sub struct elv
    which contains icq pointer and priv[2]. cfq is updated accordingly.

    * Meaningless clearing of ->elevator_private[0] removed from
    elv_set_request(). At that point in code, the field was guaranteed
    to be %NULL anyway.

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • When called under queue_lock, current_io_context() triggers lockdep
    warning if it hits allocation path. This is because io_context
    installation is protected by task_lock which is not IRQ safe, so it
    triggers irq-unsafe-lock -> irq -> irq-safe-lock -> irq-unsafe-lock
    deadlock warning.

    Given the restriction, accessor + creator rolled into one doesn't work
    too well. Drop current_io_context() and let the users access
    task->io_context directly inside queue_lock combined with explicit
    creation using create_io_context().

    Future ioc updates will further consolidate ioc access and the create
    interface will be unexported.

    While at it, relocate ioc internal interface declarations in blk.h and
    add section comments before and after.

    This patch does not introduce functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • * blk_get_queue() is peculiar in that it returns 0 on success and 1 on
    failure instead of 0 / -errno or boolean. Update it such that it
    returns %true on success and %false on failure.

    * Make sure the caller checks for the return value.

    * Separate out __blk_get_queue() which doesn't check whether @q is
    dead and put it in blk.h. This will be used later.

    This patch doesn't introduce any functional changes.
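
As a rough illustration of the interface described above, here is a userspace sketch of a checked "get" that returns a boolean, next to an unchecked variant. All names are hypothetical models (`model_queue`, `model_get_queue`, `model_get_queue_unchecked`); the real functions operate on the queue's kobject reference and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a refcounted queue with a dead flag. */
struct model_queue {
    bool dead;
    int refcount;
};

/* Unchecked variant: takes a reference without looking at the dead state. */
void model_get_queue_unchecked(struct model_queue *q)
{
    q->refcount++;
}

/* Checked variant: true on success, false if the queue is already dead. */
bool model_get_queue(struct model_queue *q)
{
    if (q->dead)
        return false;
    model_get_queue_unchecked(q);
    return true;
}
```

A caller would then write `if (!model_get_queue(q)) return;` rather than testing for a non-zero "failure" value.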

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • cfq allocates per-queue id using ida and uses it to index cic radix
    tree from io_context. Move it to q->id and allocate on queue init and
    free on queue release. This simplifies cfq a bit and will allow for
    further improvements of io context life-cycle management.

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_insert_cloned_request(), blk_execute_rq_nowait() and
    blk_flush_plug_list() either didn't check whether the queue was dead
    or did it without holding queue_lock. Update them so that dead state
    is checked while holding queue_lock.

    AFAICS, this plugs all holes (requeue doesn't matter as the request is
    transitioning atomically from in_flight to queued).

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • When trying to drain all requests, blk_drain_queue() checked only
    q->rq.count[]; however, this only tracks REQ_ALLOCED requests. This
    patch updates blk_drain_queue() such that it looks at all the counters
    and queues so that request_queue is actually empty on completion.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There are a number of QUEUE_FLAG_DEAD tests. Add blk_queue_dead()
    macro and use it.

    This patch doesn't introduce any functional difference.
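
A helper of this kind is typically a one-line flag test. The sketch below is a userspace model: the bit number and the names `MODEL_QUEUE_FLAG_DEAD`, `model_queue` and `model_queue_dead()` are assumptions for illustration, and plain bit arithmetic stands in for the kernel's bit helpers.

```c
#include <assert.h>

/* Hypothetical bit number; the kernel's actual value is not assumed here. */
#define MODEL_QUEUE_FLAG_DEAD 5

struct model_queue {
    unsigned long queue_flags;
};

/* Evaluates to 1 if the dead bit is set, 0 otherwise. */
#define model_queue_dead(q) \
    ((((q)->queue_flags) >> MODEL_QUEUE_FLAG_DEAD) & 1UL)
```

Replacing open-coded flag tests with one named macro keeps every call site consistent if the representation of the dead state changes later.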

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • The only user left for blk_insert_request() is sx8 and it can be
    trivially switched to use blk_execute_rq_nowait() - special requests
    aren't included in io stat and sx8 doesn't use block layer tagging.
    Switch sx8 and kill blk_insert_request().

    This patch doesn't introduce any functional difference.

    Only compile tested.

    Signed-off-by: Tejun Heo
    Acked-by: Jeff Garzik
    Signed-off-by: Jens Axboe

    Tejun Heo
     

23 Nov, 2011

1 commit

  • struct request_queue is allocated with __GFP_ZERO so its "node" field is
    zero before initialization. This causes an oops if node 0 is offline in
    the page allocator because its zonelists are not initialized. From Dave
    Young's dmesg:

    SRAT: Node 1 PXM 2 0-d0000000
    SRAT: Node 1 PXM 2 100000000-330000000
    SRAT: Node 0 PXM 1 330000000-630000000
    Initmem setup node 1 0000000000000000-000000000affb000
    ...
    Built 1 zonelists in Node order, mobility grouping on.
    ...
    BUG: unable to handle kernel paging request at 0000000000001c08
    IP: [] __alloc_pages_nodemask+0xb5/0x870

    and __alloc_pages_nodemask+0xb5 translates to a NULL pointer on
    zonelist->_zonerefs.

    The fix is to initialize q->node at the time of allocation so the correct
    node is passed to the slab allocator later.

    Since blk_init_allocated_queue_node() is no longer needed, merge it with
    blk_init_allocated_queue().

    [rientjes@google.com: changelog, initializing q->node]
    Cc: stable@vger.kernel.org [2.6.37+]
    Reported-by: Dave Young
    Signed-off-by: Mike Snitzer
    Signed-off-by: David Rientjes
    Tested-by: Dave Young
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

16 Nov, 2011

2 commits


04 Nov, 2011

1 commit

  • blk_cleanup_queue() may be called before elevator is set up on a
    queue which triggers the following oops.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] elv_drain_elevator+0x1c/0x70
    ...
    Pid: 830, comm: kworker/0:2 Not tainted 3.1.0-next-20111025_64+ #1590
    Bochs Bochs
    RIP: 0010:[] [] elv_drain_elevator+0x1c/0x70
    ...
    Call Trace:
    [] blk_drain_queue+0x42/0x70
    [] blk_cleanup_queue+0xd0/0x1c0
    [] md_free+0x50/0x70
    [] kobject_release+0x8b/0x1d0
    [] kref_put+0x36/0xa0
    [] kobject_put+0x27/0x60
    [] mddev_delayed_delete+0x2f/0x40
    [] process_one_work+0x100/0x3b0
    [] worker_thread+0x15f/0x3a0
    [] kthread+0x87/0x90
    [] kernel_thread_helper+0x4/0x10

    Fix it by making blk_cleanup_queue() check whether q->elevator is set
    up before invoking blk_drain_queue.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: Jiri Slaby
    Signed-off-by: Jens Axboe

    Tejun Heo
     

24 Oct, 2011

3 commits

  • Jens Axboe
     
  • A dm-multipath user reported[1] a problem when trying to boot
    a kernel with commit 4853abaae7e4a2af938115ce9071ef8684fb7af4
    (block: fix flush machinery for stacking drivers with differring
    flush flags) applied. It turns out that an empty flush request
    can be sent into blk_insert_flush. When the BUG_ON was fixed
    to allow for this, I/O on the underlying device would stall. The
    reason is that blk_insert_cloned_request does not kick the queue.
    In the aforementioned commit, I had added a special case to
    kick the queue if data was sent down but the queue flags did
    not require a flush. A better solution is to push the queue
    kick up into blk_insert_cloned_request.

    This patch, along with a follow-on which fixes the BUG_ON, fixes
    the issue reported.

    [1] http://www.redhat.com/archives/dm-devel/2011-September/msg00154.html

    Reported-by: Christophe Saout
    Signed-off-by: Jeff Moyer
    Acked-by: Tejun Heo

    Stable note: 3.1
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Jeff Moyer
     
  • bio originally has the functionality to set the complete cpu, but
    it is broken.

    Christoph said that "This code is unused, and from the all the
    discussions lately pretty obviously broken. The only thing keeping
    it serves is creating more confusion and possibly more bugs."

    And Jens replied with "We can kill bio_set_completion_cpu(). I'm fine
    with leaving cpu control to the request based drivers, they are the
    only ones that can toggle the setting anyway".

    So this patch tries to remove all the work of controlling complete cpu
    from a bio.

    Cc: Shaohua Li
    Cc: Christoph Hellwig
    Signed-off-by: Tao Ma
    Signed-off-by: Jens Axboe

    Tao Ma
     

19 Oct, 2011

7 commits

  • request_queue is refcounted but actually depends on lifetime
    management from the queue owner - on blk_cleanup_queue(), block layer
    expects that there's no request passing through request_queue and no
    new one will.

    This is fundamentally broken. The queue owner (e.g. SCSI layer)
    doesn't have a way to know whether there are other active users before
    calling blk_cleanup_queue() and other users (e.g. bsg) don't have any
    guarantee that the queue is and would stay valid while it's holding a
    reference.

    With delay added in blk_queue_bio() before queue_lock is grabbed, the
    following oops can be easily triggered when a device is removed with
    in-flight IOs.

    sd 0:0:1:0: [sdb] Stopping disk
    ata1.01: disabled
    general protection fault: 0000 [#1] PREEMPT SMP
    CPU 2
    Modules linked in:

    Pid: 648, comm: test_rawio Not tainted 3.1.0-rc3-work+ #56 Bochs Bochs
    RIP: 0010:[] [] elv_rqhash_find+0x61/0x100
    ...
    Process test_rawio (pid: 648, threadinfo ffff880019efa000, task ffff880019ef8a80)
    ...
    Call Trace:
    [] elv_merge+0x84/0xe0
    [] blk_queue_bio+0xf4/0x400
    [] generic_make_request+0xca/0x100
    [] submit_bio+0x74/0x100
    [] dio_bio_submit+0xbc/0xc0
    [] __blockdev_direct_IO+0x92e/0xb40
    [] blkdev_direct_IO+0x57/0x60
    [] generic_file_aio_read+0x6d5/0x760
    [] do_sync_read+0xda/0x120
    [] vfs_read+0xc5/0x180
    [] sys_pread64+0x9a/0xb0
    [] system_call_fastpath+0x16/0x1b

    This happens because blk_cleanup_queue() destroys the queue and
    elevator whether IOs are in progress or not and DEAD tests are
    sprinkled in the request processing path without proper
    synchronization.

    Similar problem exists for blk-throtl. On queue cleanup, blk-throtl
    is shutdown whether it has requests in it or not. Depending on
    timing, it either oopses or throttled bios are lost putting tasks
    which are waiting for bio completion into eternal D state.

    The way it should work is having the usual clear distinction between
    shutdown and release. Shutdown drains all currently pending requests,
    marks the queue dead, and performs partial teardown of the now
    unnecessary part of the queue. Even after shutdown is complete,
    reference holders are still allowed to issue requests to the queue
    although they will be immediately failed. The rest of teardown
    happens on release.

    This patch makes the following changes to make blk_cleanup_queue()
    behave as proper shutdown.

    * QUEUE_FLAG_DEAD is now set while holding both q->exit_mutex and
    queue_lock.

    * Unsynchronized DEAD check in generic_make_request_checks() removed.
    This couldn't make any meaningful difference as the queue could die
    after the check.

    * blk_drain_queue() updated such that it can drain all requests and is
    now called during cleanup.

    * blk_throtl updated such that it checks DEAD on grabbing queue_lock,
    drains all throttled bios during cleanup and free td when queue is
    released.
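
The shutdown/release split described above can be sketched in userspace as follows. This is a simplified model under stated assumptions: the names (`model_queue`, `model_submit`, `model_cleanup`, `model_put`) are hypothetical, locking is omitted entirely, and "draining" is reduced to zeroing a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a refcounted queue; no locking is modeled. */
struct model_queue {
    int refcount;
    int pending;     /* requests currently in the queue */
    bool dead;       /* set by shutdown */
    bool released;   /* set when the last reference is dropped */
};

/* Submission: allowed for any reference holder, but fails once dead. */
int model_submit(struct model_queue *q)
{
    if (q->dead)
        return -1;   /* failed immediately */
    q->pending++;
    return 0;
}

/* Shutdown: mark the queue dead so nothing new enters, then drain. */
void model_cleanup(struct model_queue *q)
{
    q->dead = true;
    q->pending = 0;  /* stand-in for draining all pending requests */
}

/* Release: the rest of teardown happens when the last reference goes. */
void model_put(struct model_queue *q)
{
    if (--q->refcount == 0)
        q->released = true;
}
```

The property the model is meant to show is that a holder of a reference can still call `model_submit()` after `model_cleanup()` and gets a clean failure instead of touching torn-down state.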

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • attempt_plug_merge() accesses elevator without holding queue_lock and
    may call into ->elevator_bio_merge_fn(). The elevator is guaranteed to
    be valid because it's accessed iff the plugged list has requests and
    elevator is never exited with live requests, so as long as the
    elevator method can deal with unlocked access, this is safe.

    Explain the sync rules around attempt_plug_merge() and drop the
    unnecessary @tsk parameter.

    This patch doesn't introduce any functional change.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Currently get_request[_wait]() allocates request whether queue is dead
    or not. This patch makes get_request[_wait]() return NULL if @q is
    dead. blk_queue_bio() is updated to fail the submitted bio if request
    allocation fails. While at it, add docbook comments for
    get_request[_wait]().

    Note that the current code has rather unclear (there are spurious DEAD
    tests scattered around) assumption that the owner of a queue
    guarantees that no request travels block layer if the queue is dead
    and this patch in itself doesn't change much; however, this will allow
    fixing the broken assumption in the next patch.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_throtl_bio() and throtl_get_tg() have rather unusual interface.

    * throtl_get_tg() returns pointer to a valid tg or ERR_PTR(-ENODEV),
    and drops queue_lock in the latter case. Different locking context
    depending on return value is error-prone and DEAD state is scheduled
    to be protected by queue_lock anyway. Move DEAD check inside
    queue_lock and return valid tg or NULL.

    * blk_throtl_bio() indicates return status both with its return value
    and in/out param **@bio. The former is used to indicate whether
    queue is found to be dead during throtl processing. The latter
    whether the bio is throttled.

    There's no point in returning DEAD check result from
    blk_throtl_bio(). The queue can die after blk_throtl_bio() is
    finished but before make_request_fn() grabs queue lock.

    Make it take *@bio instead and return boolean result indicating
    whether the request is throttled or not.

    This patch doesn't cause any visible functional difference.

    Signed-off-by: Tejun Heo
    Cc: Vivek Goyal
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Reorganize queue draining related code in preparation of queue exit
    changes.

    * Factor out actual draining from elv_quiesce_start() to
    blk_drain_queue().

    * Make elv_quiesce_start/end() responsible for their own locking.

    * Replace open-coded ELVSWITCH clearing in elevator_switch() with
    elv_quiesce_end().

    This patch doesn't cause any visible functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • blk_alloc_request() and freed_request() take different combinations of
    REQ_* @flags, @priv and @is_sync when @flags is superset of the latter
    two. Make them take @flags only. This cleans up the code a bit and
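
To illustrate why a single flags argument is enough, here is a userspace sketch in which the sync and private-data properties are both derived from one flags word. The flag names and the `model_counts` struct are hypothetical and do not correspond to the kernel's actual REQ_* bit layout.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration only. */
#define MODEL_REQ_SYNC    (1u << 0)
#define MODEL_REQ_ELVPRIV (1u << 1)

struct model_counts {
    int count[2];   /* [0] = async, [1] = sync allocations */
    int elvpriv;    /* requests carrying elevator private data */
};

/* Derive both properties from @flags instead of taking them separately. */
void model_freed_request(struct model_counts *c, unsigned int flags)
{
    int sync = (flags & MODEL_REQ_SYNC) != 0;

    c->count[sync]--;
    if (flags & MODEL_REQ_ELVPRIV)
        c->elvpriv--;
}
```

Because the derived values can never disagree with the flags they came from, callers cannot pass an inconsistent combination.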
    will ease updating allocation related REQ_* flags.

    This patch doesn't introduce any functional difference.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • Conflicts:
    block/blk-core.c
    include/linux/blkdev.h

    Signed-off-by: Jens Axboe

    Jens Axboe
     

28 Sep, 2011

1 commit

  • A kernel crash is observed when a mounted ext3/ext4 filesystem is
    physically removed. The problem is that blk_cleanup_queue() frees up
    some resources eg by calling elevator_exit(), which are not checked for
    in normal operation. So we should rather move these calls to the
    destructor function blk_release_queue() as at that point all remaining
    references are gone. However, in doing so we have to ensure that any
    externally supplied queue_lock is disconnected as the driver might free
    up the lock after the call of blk_cleanup_queue().

    Signed-off-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Hannes Reinecke
     

21 Sep, 2011

1 commit

  • Thus spake Andrew Morton:

    "And I have the usual maintainability whine. If someone comes up to
    vmscan.c and sees it calling blk_start_plug(), how are they supposed to
    work out why that call is there? They go look at the blk_start_plug()
    definition and it is undocumented. I think we can do better than this?"

    Adapted from the LWN article - http://lwn.net/Articles/438256/ by Jens
    Axboe and from an earlier attempt by Shaohua Li to document blk-plug.

    [akpm@linux-foundation.org: grammatical and spelling tweaks]
    Signed-off-by: Suresh Jayaraman
    Cc: Shaohua Li
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Suresh Jayaraman
     

15 Sep, 2011

1 commit

  • Move all the checks performed on a bio into a new helper, and call it as
    soon as bio is submitted even if it is a re-submission from ->make_request.

    We explicitly mark the new helper as being non-inlined as the stack
    usage for printing the block device name in the failure case is quite
    high and this is a path where we have to be extremely conservative about
    stack usage.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

12 Sep, 2011

3 commits


24 Aug, 2011

2 commits

  • Clean up the code a little. attempt_plug_merge() traverses the plug
    list anyway, so the request counting can be done there, which reduces
    stack usage a little.
    The motivation is that I suspect we should count the requests for each
    queue (a task could handle multiple disks at the same time), but my tests
    don't show it's worth doing. If somebody proves we should do it, the change
    below will make that easier.

    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Do blk_flush_plug_list() first and then add new request at the tail. New
    request can't be merged to existing requests, but later new requests might
    be merged with this new one. If blk_flush_plug_list() is done later, the
    merge doesn't happen.
    Believe it or not, this fixes a 10% regression running sysbench workload.

    Signed-off-by: Shaohua Li
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

20 Aug, 2011

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-block: (23 commits)
    Revert "cfq: Remove special treatment for metadata rqs."
    block: fix flush machinery for stacking drivers with differring flush flags
    block: improve rq_affinity placement
    blktrace: add FLUSH/FUA support
    Move some REQ flags to the common bio/request area
    allow blk_flush_policy to return REQ_FSEQ_DATA independent of *FLUSH
    xen/blkback: Make description more obvious.
    cfq-iosched: Add documentation about idling
    block: Make rq_affinity = 1 work as expected
    block: swim3: fix unterminated of_device_id table
    block/genhd.c: remove useless cast in diskstats_show()
    drivers/cdrom/cdrom.c: relax check on dvd manufacturer value
    drivers/block/drbd/drbd_nl.c: use bitmap_parse instead of __bitmap_parse
    bsg-lib: add module.h include
    cfq-iosched: Reduce linked group count upon group destruction
    blk-throttle: correctly determine sync bio
    loop: fix deadlock when sysfs and LOOP_CLR_FD race against each other
    loop: add BLK_DEV_LOOP_MIN_COUNT=%i to allow distros 0 pre-allocated loop devices
    loop: add management interface for on-demand device allocation
    loop: replace linked list of allocated devices with an idr index
    ...

    Linus Torvalds
     

16 Aug, 2011

1 commit

  • Commit ae1b1539622fb46e51b4d13b3f9e5f4c713f86ae, block: reimplement
    FLUSH/FUA to support merge, introduced a performance regression when
    running any sort of fsyncing workload using dm-multipath and certain
    storage (in our case, an HP EVA). The test I ran was fs_mark, and it
    dropped from ~800 files/sec on ext4 to ~100 files/sec. It turns out
    that dm-multipath always advertised flush+fua support, and passed
    commands on down the stack, where those flags used to get stripped off.
    The above commit changed that behavior:

     static inline struct request *__elv_next_request(struct request_queue *q)
     {
             struct request *rq;

             while (1) {
    -                while (!list_empty(&q->queue_head)) {
    +                if (!list_empty(&q->queue_head)) {
                             rq = list_entry_rq(q->queue_head.next);
    -                        if (!(rq->cmd_flags & (REQ_FLUSH | REQ_FUA)) ||
    -                            (rq->cmd_flags & REQ_FLUSH_SEQ))
    -                                return rq;
    -                        rq = blk_do_flush(q, rq);
    -                        if (rq)
    -                                return rq;
    +                        return rq;
                     }

    Note that previously, a command would come in here, have
    REQ_FLUSH|REQ_FUA set, and then get handed off to blk_do_flush:

    struct request *blk_do_flush(struct request_queue *q, struct request *rq)
    {
            unsigned int fflags = q->flush_flags; /* may change, cache it */
            bool has_flush = fflags & REQ_FLUSH, has_fua = fflags & REQ_FUA;
            bool do_preflush = has_flush && (rq->cmd_flags & REQ_FLUSH);
            bool do_postflush = has_flush && !has_fua &&
                    (rq->cmd_flags & REQ_FUA);
            unsigned skip = 0;
            ...
            if (blk_rq_sectors(rq) && !do_preflush && !do_postflush) {
                    rq->cmd_flags &= ~REQ_FLUSH;
                    if (!has_fua)
                            rq->cmd_flags &= ~REQ_FUA;
                    return rq;
            }

    So, the flush machinery was bypassed in such cases (q->flush_flags == 0
    && rq->cmd_flags & (REQ_FLUSH|REQ_FUA)).
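
The bypass condition quoted above can be exercised in isolation with a small userspace model. The flag constants and the function `model_flush_bypass()` are hypothetical; the logic mirrors the excerpt from blk_do_flush, with the request reduced to a sector count and a flags word.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; only their distinctness matters here. */
#define MODEL_REQ_FLUSH (1u << 0)
#define MODEL_REQ_FUA   (1u << 1)

/*
 * Returns true if a request with data can bypass the flush machinery,
 * stripping the flags the queue does not support, as in the excerpt.
 */
bool model_flush_bypass(unsigned int queue_flush_flags,
                        unsigned int sectors,
                        unsigned int *cmd_flags)
{
    bool has_flush = queue_flush_flags & MODEL_REQ_FLUSH;
    bool has_fua = queue_flush_flags & MODEL_REQ_FUA;
    bool do_preflush = has_flush && (*cmd_flags & MODEL_REQ_FLUSH);
    bool do_postflush = has_flush && !has_fua &&
                        (*cmd_flags & MODEL_REQ_FUA);

    if (sectors && !do_preflush && !do_postflush) {
        *cmd_flags &= ~MODEL_REQ_FLUSH;
        if (!has_fua)
            *cmd_flags &= ~MODEL_REQ_FUA;
        return true;
    }
    return false;
}
```

For a queue that advertises neither capability, a data request carrying both flags has them stripped and is passed straight through, which is the case the paragraph above describes.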

    Now, however, we don't get into the flush machinery at all. Instead,
    __elv_next_request just hands a request with flush and fua bits set to
    the scsi_request_fn, even if the underlying request_queue does not
    support flush or fua.

    The agreed upon approach is to fix the flush machinery to allow
    stacking. While this isn't used in practice (since there is only one
    request-based dm target, and that target will now reflect the flush
    flags of the underlying device), it does future-proof the solution, and
    make it function as designed.

    In order to make this work, I had to add a field to the struct request,
    inside the flush structure (to store the original req->end_io). Shaohua
    had suggested overloading the union with rb_node and completion_data,
    but the completion data is used by device mapper and can also be used by
    other drivers. So, I didn't see a way around the additional field.

    I tested this patch on an HP EVA with both ext4 and xfs, and it recovers
    the lost performance. Comments and other testers, as always, are
    appreciated.

    Cheers,
    Jeff

    Signed-off-by: Jeff Moyer
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Jeff Moyer
     

04 Aug, 2011

1 commit

  • init_fault_attr_dentries() is used to export fault_attr via debugfs.
    But it can only export it in debugfs root directory.

    Per Forlin is working on mmc_fail_request which adds support to inject
    data errors after a completed host transfer in MMC subsystem.

    The fault_attr for mmc_fail_request should be defined per mmc host and
    exported in a per-host debugfs directory, like
    /sys/kernel/debug/mmc0/mmc_fail_request.

    init_fault_attr_dentries() doesn't help for mmc_fail_request. So this
    introduces fault_create_debugfs_attr(), which can create the
    directory under an arbitrary parent directory, and replaces
    init_fault_attr_dentries() with it.

    [akpm@linux-foundation.org: extraneous semicolon, per Randy]
    Signed-off-by: Akinobu Mita
    Tested-by: Per Forlin
    Cc: Jens Axboe
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Randy Dunlap
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

27 Jul, 2011

1 commit

  • This changes should_fail_request() into a more usable wrapper around
    should_fail(). It avoids putting #ifdef CONFIG_FAIL_MAKE_REQUEST in
    the middle of a function.
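
The general pattern is to provide a real implementation when the option is enabled and a trivial stub otherwise, so that callers need no conditional compilation. The sketch below is a userspace illustration: `MODEL_FAIL_MAKE_REQUEST`, `model_should_fail_request()` and `model_submit()` are hypothetical names, and the "enabled" policy is a placeholder rather than the kernel's should_fail().

```c
#include <assert.h>
#include <stdbool.h>

#ifdef MODEL_FAIL_MAKE_REQUEST
/* Enabled build: placeholder policy, not the kernel's should_fail(). */
static inline bool model_should_fail_request(int make_it_fail,
                                             unsigned int bytes)
{
    return make_it_fail && bytes > 0;
}
#else
/* Disabled build: the stub always reports "do not fail". */
static inline bool model_should_fail_request(int make_it_fail,
                                             unsigned int bytes)
{
    (void)make_it_fail;
    (void)bytes;
    return false;
}
#endif

/* The caller stays free of #ifdef regardless of configuration. */
int model_submit(int make_it_fail, unsigned int bytes)
{
    if (model_should_fail_request(make_it_fail, bytes))
        return -5;   /* stand-in for -EIO */
    return 0;
}
```

Built without `MODEL_FAIL_MAKE_REQUEST`, the stub compiles away and `model_submit()` always succeeds.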

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

26 Jul, 2011

2 commits

  • After commit 5757a6d7 introduced an unsafe calling of
    smp_processor_id(), with preempt debugging turned on we spew a lot of:

    BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/514
    caller is __make_request+0x1b8/0x308
    [] (unwind_backtrace+0x0/0xe8) from [] (debug_smp_processor_id+0xbc/0xf0)
    [] (debug_smp_processor_id+0xbc/0xf0) from [] (__make_request+0x1b8/0x308)
    [] (__make_request+0x1b8/0x308) from [] (generic_make_request+0x4dc/0x558)
    [] (generic_make_request+0x4dc/0x558) from [] (submit_bio+0x114/0x138)
    [] (submit_bio+0x114/0x138) from [] (submit_bh+0x148/0x16c)
    [] (submit_bh+0x148/0x16c) from [] (__sync_dirty_buffer+0x88/0xd8)
    [] (__sync_dirty_buffer+0x88/0xd8) from [] (journal_commit_transaction+0x1198/0x1688)
    [] (journal_commit_transaction+0x1198/0x1688) from [] (kjournald+0xb4/0x224)
    [] (kjournald+0xb4/0x224) from [] (kthread+0x8c/0x94)
    [] (kthread+0x8c/0x94) from [] (kernel_thread_exit+0x0/0x8)

    Fix this by just using raw_smp_processor_id(), it's just a hint
    after all. There's no pinning of the CPU or accessing per-cpu
    structures involved.

    Reported-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
    block: strict rq_affinity
    backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
    block: fix patch import error in max_discard_sectors check
    block: reorder request_queue to remove 64 bit alignment padding
    CFQ: add think time check for group
    CFQ: add think time check for service tree
    CFQ: move think time check variables to a separate struct
    fixlet: Remove fs_excl from struct task.
    cfq: Remove special treatment for metadata rqs.
    block: document blk_plug list access
    block: avoid building too big plug list
    compat_ioctl: fix make headers_check regression
    block: eliminate potential for infinite loop in blkdev_issue_discard
    compat_ioctl: fix warning caused by qemu
    block: flush MEDIA_CHANGE from drivers on close(2)
    blk-throttle: Make total_nr_queued unsigned
    block: Add __attribute__((format(printf...) and fix fallout
    fs/partitions/check.c: make local symbols static
    block:remove some spare spaces in genhd.c
    block:fix the comment error in blkdev.h
    ...

    Linus Torvalds
     

24 Jul, 2011

1 commit

  • Some systems benefit from completions always being steered to the strict
    requester cpu rather than the looser "per-socket" steering that
    blk_cpu_to_group() attempts by default. This is because the first
    CPU in the group mask ends up being completely overloaded with work,
    while the others (including the original submitter) have power left
    to spare.

    Allow the strict mode to be set by writing '2' to the sysfs control
    file. This is identical to the scheme used for the nomerges file,
    where '2' is a more aggressive setting than just being turned on.

    echo 2 > /sys/block//queue/rq_affinity

    Cc: Christoph Hellwig
    Cc: Roland Dreier
    Tested-by: Dave Jiang
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams