04 Aug, 2011

1 commit

  • init_fault_attr_dentries() is used to export fault_attr via debugfs.
    But it can only export it in the debugfs root directory.

    Per Forlin is working on mmc_fail_request which adds support to inject
    data errors after a completed host transfer in MMC subsystem.

    The fault_attr for mmc_fail_request should be defined per mmc host and
    exported in a per-host debugfs directory such as
    /sys/kernel/debug/mmc0/mmc_fail_request.

    init_fault_attr_dentries() doesn't help for mmc_fail_request. So this
    introduces fault_create_debugfs_attr(), which can create the
    directory under an arbitrary parent directory, and uses it to replace
    init_fault_attr_dentries().

    [akpm@linux-foundation.org: extraneous semicolon, per Randy]
    Signed-off-by: Akinobu Mita
    Tested-by: Per Forlin
    Cc: Jens Axboe
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Matt Mackall
    Cc: Randy Dunlap
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

27 Jul, 2011

1 commit

  • This changes should_fail_request() into a more usable wrapper function
    around should_fail(). This avoids putting #ifdef CONFIG_FAIL_MAKE_REQUEST
    in the middle of a function.

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
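
The wrapper pattern described in this commit can be sketched in userspace C. This is a minimal model under stated assumptions, not the kernel source: `struct part_model`, its `make_it_fail` field, `should_fail_model()` and `submit_model()` are hypothetical stand-ins invented for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device with a per-device fault switch. */
struct part_model {
    bool make_it_fail;
};

#ifdef CONFIG_FAIL_MAKE_REQUEST
/* Stub for the generic fault-injection decision; this model always injects. */
static bool should_fail_model(size_t bytes)
{
    (void)bytes;
    return true;
}

/* Feature compiled in: consult the per-device switch, then the generic check. */
static bool should_fail_request(const struct part_model *part, size_t bytes)
{
    return part->make_it_fail && should_fail_model(bytes);
}
#else
/* Feature compiled out: the wrapper collapses to a constant false. */
static inline bool should_fail_request(const struct part_model *part, size_t bytes)
{
    (void)part;
    (void)bytes;
    return false;
}
#endif

/* The caller needs no #ifdef of its own. Returns 0 on success, -1 on injected failure. */
static int submit_model(const struct part_model *part, size_t bytes)
{
    if (should_fail_request(part, bytes))
        return -1;
    return 0;
}
```

Keeping the conditional compilation at the definition site means the calling function reads the same whether or not the option is enabled.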
     

26 Jul, 2011

3 commits

  • After commit 5757a6d7 introduced an unsafe call to
    smp_processor_id(), with preempt debugging turned on we spew a lot of:

    BUG: using smp_processor_id() in preemptible [00000000] code: kjournald/514
    caller is __make_request+0x1b8/0x308
    [] (unwind_backtrace+0x0/0xe8) from [] (debug_smp_processor_id+0xbc/0xf0)
    [] (debug_smp_processor_id+0xbc/0xf0) from [] (__make_request+0x1b8/0x308)
    [] (__make_request+0x1b8/0x308) from [] (generic_make_request+0x4dc/0x558)
    [] (generic_make_request+0x4dc/0x558) from [] (submit_bio+0x114/0x138)
    [] (submit_bio+0x114/0x138) from [] (submit_bh+0x148/0x16c)
    [] (submit_bh+0x148/0x16c) from [] (__sync_dirty_buffer+0x88/0xd8)
    [] (__sync_dirty_buffer+0x88/0xd8) from [] (journal_commit_transaction+0x1198/0x1688)
    [] (journal_commit_transaction+0x1198/0x1688) from [] (kjournald+0xb4/0x224)
    [] (kjournald+0xb4/0x224) from [] (kthread+0x8c/0x94)
    [] (kthread+0x8c/0x94) from [] (kernel_thread_exit+0x0/0x8)

    Fix this by just using raw_smp_processor_id(); it's only a hint
    after all. There's no pinning of the CPU or accessing per-cpu
    structures involved.

    Reported-by: Ming Lei
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • * 'for-3.1/drivers' of git://git.kernel.dk/linux-block:
    cciss: do not attempt to read from a write-only register
    xen/blkback: Add module alias for autoloading
    xen/blkback: Don't let in-flight requests defer pending ones.
    bsg: fix address space warning from sparse
    bsg: remove unnecessary conditional expressions
    bsg: fix bsg_poll() to return POLLOUT properly

    Linus Torvalds
     
  • * 'for-3.1/core' of git://git.kernel.dk/linux-block: (24 commits)
    block: strict rq_affinity
    backing-dev: use synchronize_rcu_expedited instead of synchronize_rcu
    block: fix patch import error in max_discard_sectors check
    block: reorder request_queue to remove 64 bit alignment padding
    CFQ: add think time check for group
    CFQ: add think time check for service tree
    CFQ: move think time check variables to a separate struct
    fixlet: Remove fs_excl from struct task.
    cfq: Remove special treatment for metadata rqs.
    block: document blk_plug list access
    block: avoid building too big plug list
    compat_ioctl: fix make headers_check regression
    block: eliminate potential for infinite loop in blkdev_issue_discard
    compat_ioctl: fix warning caused by qemu
    block: flush MEDIA_CHANGE from drivers on close(2)
    blk-throttle: Make total_nr_queued unsigned
    block: Add __attribute__((format(printf...) and fix fallout
    fs/partitions/check.c: make local symbols static
    block:remove some spare spaces in genhd.c
    block:fix the comment error in blkdev.h
    ...

    Linus Torvalds
     

24 Jul, 2011

3 commits

  • Some systems benefit from completions always being steered to the strict
    requester cpu rather than the looser "per-socket" steering that
    blk_cpu_to_group() attempts by default. This is because the first
    CPU in the group mask ends up being completely overloaded with work,
    while the others (including the original submitter) have power left
    to spare.

    Allow the strict mode to be set by writing '2' to the sysfs control
    file. This is identical to the scheme used for the nomerges file,
    where '2' is a more aggressive setting than just being turned on.

    echo 2 > /sys/block//queue/rq_affinity

    Cc: Christoph Hellwig
    Cc: Roland Dreier
    Tested-by: Dave Jiang
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
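
The difference between the three rq_affinity settings can be sketched as a small userspace model. This is an illustration under assumptions, not the kernel implementation: `cpu_to_group_model()` assumes fixed-size CPU groups represented by their first CPU, and `completion_cpu()` and its parameters are hypothetical names.

```c
/* Hypothetical grouping: CPUs form fixed-size blocks; the first CPU represents the group. */
static int cpu_to_group_model(int cpu, int cpus_per_group)
{
    return cpu - (cpu % cpus_per_group);
}

/*
 * Pick the CPU that runs the completion.
 * mode 0: complete wherever the interrupt landed.
 * mode 1: complete locally if the interrupt CPU is in the submitter's group,
 *         otherwise on the first CPU of the submitter's group.
 * mode 2: always complete on the exact submitting CPU.
 */
static int completion_cpu(int mode, int submit_cpu, int irq_cpu, int cpus_per_group)
{
    if (mode == 0)
        return irq_cpu;
    if (mode == 2)
        return submit_cpu;

    /* mode 1: per-group steering */
    int target = cpu_to_group_model(submit_cpu, cpus_per_group);
    if (cpu_to_group_model(irq_cpu, cpus_per_group) == target)
        return irq_cpu;
    return target;
}
```

In this model, mode 1 funnels every cross-group completion to the first CPU of the group, which is the overload the commit message describes; mode 2 spreads the work back to the individual submitters.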
     
  • A '!' snuck in before the unlikely, rendering it useless.

    Reported-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6: (77 commits)
    [SCSI] fix crash in scsi_dispatch_cmd()
    [SCSI] sr: check_events() ignore GET_EVENT when TUR says otherwise
    [SCSI] bnx2i: Fixed kernel panic due to illegal usage of sc->request->cpu
    [SCSI] bfa: Update the driver version to 3.0.2.1
    [SCSI] bfa: Driver and BSG enhancements.
    [SCSI] bfa: Added support to query PHY.
    [SCSI] bfa: Added HBA diagnostics support.
    [SCSI] bfa: Added support for flash configuration
    [SCSI] bfa: Added support to obtain SFP info.
    [SCSI] bfa: Added support for CEE info and stats query.
    [SCSI] bfa: Extend BSG interface.
    [SCSI] bfa: FCS bug fixes.
    [SCSI] bfa: DMA memory allocation enhancement.
    [SCSI] bfa: Brocade-1860 Fabric Adapter vHBA support.
    [SCSI] bfa: Brocade-1860 Fabric Adapter PLL init fixes.
    [SCSI] bfa: Added Fabric Assigned Address(FAA) support
    [SCSI] bfa: IOC bug fixes.
    [SCSI] bfa: Enable ASIC block configuration and query.
    [SCSI] bnx2i: Updated copyright and bump version
    [SCSI] bnx2i: Modified to skip CNIC registration if iSCSI is not supported
    ...

    Fix up some trivial conflicts in:
    - drivers/scsi/bnx2fc/{bnx2fc.h,bnx2fc_fcoe.c}:
    Crazy broadcom version number conflicts
    - drivers/target/tcm_fc/tfc_cmd.c
    Just trivial cleanups done on adjacent lines

    Linus Torvalds
     

22 Jul, 2011

1 commit

  • USB surprise removal of sr is triggering an oops in
    scsi_dispatch_command(). What seems to be happening is that USB is
    hanging on to a queue reference until the last close of the upper
    device, so the crash is caused by surprise remove of a mounted CD
    followed by attempted unmount.

    The problem is that USB doesn't issue its final commands as part of
    the SCSI teardown path, but on last close when the block queue is long
    gone. The long term fix is probably to make sr do the teardown in the
    same way as sd (so remove all the lower bits on ejection, but keep the
    upper disk alive until last close of user space). However, the
    current oops can be simply fixed by not allowing any commands to be
    sent to a dead queue.

    Cc: stable@kernel.org
    Signed-off-by: James Bottomley

    James Bottomley
     

21 Jul, 2011

1 commit


12 Jul, 2011

4 commits

  • Currently, when the last queue of a group has no request, we don't expire
    the queue, in the hope that a request from the group comes soon, so the
    group doesn't miss its share. But if the think time is big, the assumption
    isn't correct and we just waste bandwidth. In such a case, we don't idle.

    [global]
    runtime=30
    direct=1

    [test1]
    cgroup=test1
    cgroup_weight=1000
    rw=randread
    ioengine=libaio
    size=500m
    runtime=30
    directory=/mnt
    filename=file1
    thinktime=9000

    [test2]
    cgroup=test2
    cgroup_weight=1000
    rw=randread
    ioengine=libaio
    size=500m
    runtime=30
    directory=/mnt
    filename=file2

            patched    base
    test1   64k        39k
    test2   548k       540k
    total   604k       578k

    group1 gets much better throughput because it waits less time.

    To check whether the patch changes the behavior of a queue without think
    time, I also tried giving test1 a 2ms think time or no think time. The test
    result is stable. The throughput doesn't change with or without the patch.

    Signed-off-by: Shaohua Li
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Currently, when the last queue of a service tree has no request, we don't
    expire the queue, in the hope that a request from the service tree comes
    soon, so the service tree doesn't miss its share. But if the think time is
    big, the assumption isn't correct and we just waste bandwidth. In such a
    case, we don't idle.

    [global]
    runtime=10
    direct=1

    [test1]
    rw=randread
    ioengine=libaio
    size=500m
    directory=/mnt
    filename=file1
    thinktime=9000

    [test2]
    rw=read
    ioengine=libaio
    size=1G
    directory=/mnt
    filename=file2

            patched      base
    test1   41k/s        33k/s
    test2   15868k/s     15789k/s
    total   15902k/s     15817k/s

    Slightly better.

    To check whether the patch changes the behavior of a queue without think
    time, I also tried giving test1 a 2ms think time or no think time. The test
    has variation even without the patch, but the average throughput doesn't
    change with or without the patch.

    Signed-off-by: Shaohua Li
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • Move the variables used for the think time check to a separate struct.
    This is to prepare for adding the think time check for service tree and
    group. No functional change.

    Signed-off-by: Shaohua Li
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Shaohua Li
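
The think time check that this series extends from queues to service trees and groups can be sketched in userspace C. This is a model under assumptions, not the CFQ source: the struct and function names are hypothetical, and the decay constants (7/8 weighting, scaling by 256) and the sample threshold of 80 are recalled from CFQ code of that era and should be treated as assumptions.

```c
#include <stdbool.h>

/* Hypothetical per-entity think time state (queue, service tree, or group). */
struct think_time {
    unsigned long last_end_request; /* time the previous request completed */
    unsigned long total;            /* decayed sum of think times, scaled by 256 */
    unsigned long samples;          /* decayed sample weight, scaled by 256 */
    unsigned long mean;             /* current think time estimate */
};

/* Fold the gap since the last completion into a decaying average. */
static void think_time_update(struct think_time *t, unsigned long now,
                              unsigned long slice_idle)
{
    unsigned long elapsed = now - t->last_end_request;

    /* Clamp outliers so one long pause does not dominate the average. */
    if (elapsed > 2 * slice_idle)
        elapsed = 2 * slice_idle;

    t->samples = (7 * t->samples + 256) / 8;
    t->total = (7 * t->total + 256 * elapsed) / 8;
    t->mean = (t->total + 128) / t->samples;
}

/* True when idling for this entity is likely wasted: enough samples, mean too large. */
static bool think_time_is_big(const struct think_time *t, unsigned long slice_idle)
{
    if (t->samples <= 80)
        return false;
    return t->mean > slice_idle;
}
```

A caller would set `last_end_request` when a request completes and call `think_time_update()` when the next one arrives. Once `think_time_is_big()` reports true, the scheduler in this model would skip idling for that entity.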
     
  • fs_excl is a poor man's priority inheritance for filesystems to hint to
    the block layer that an operation is important. It was never clearly
    specified, not widely adopted, and will not prevent starvation in many
    cases (like across cgroups).

    fs_excl was introduced with the time sliced CFQ IO scheduler, to
    indicate when a process held FS exclusive resources and thus needed
    a boost.

    It doesn't cover all file systems, and it was never fully complete.
    Let's kill it.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

11 Jul, 2011

1 commit

  • There is no consistency among filesystems as to which bios (or requests)
    are marked as being metadata. It's interesting to expose this in traces,
    but we shouldn't schedule the requests differently based on whether or
    not they're marked as being metadata.

    Signed-off-by: Justin TerAvest
    Signed-off-by: Jens Axboe

    Justin TerAvest
     

08 Jul, 2011

1 commit

  • When I tested a fio script with a big I/O depth, I found that the total throughput
    drops compared to a relatively small I/O depth. The reason is that the thread
    accumulates a lot of requests in its plug list, which causes some delays (surely
    this depends on CPU speed).
    I thought we'd better have a threshold for requests. When the threshold is reached,
    this means there is no request merge and queue lock contention isn't severe
    when pushing per-task requests to the queue, so the main advantages of blk plug
    don't exist. We can force a plug list flush in this case.
    With this, my test throughput actually increases and almost equals that of the small
    I/O depth. Another side effect is that irq off time decreases in
    blk_flush_plug_list() for big I/O depth.
    BLK_MAX_REQUEST_COUNT is chosen arbitrarily, but 16 is effective at
    reducing lock contention for me. But I'm open here; 32 is ok in my test too.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
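
The threshold idea can be sketched as a small userspace model. This is not the kernel code: `struct plug_model`, `plug_flush()` and `plug_add_request()` are hypothetical names, the flush only counts instead of dispatching, and whether the check runs before or after adding the request is an assumption of this sketch. Only the value 16 comes from the commit message.

```c
#define MAX_REQUEST_COUNT 16 /* threshold named in the commit message */

/* Hypothetical per-task plug: only counts requests and flushes. */
struct plug_model {
    unsigned int count;   /* requests currently held in the plug list */
    unsigned int flushes; /* number of forced or explicit flushes so far */
};

/* Push everything held in the plug to the queue (modeled as a counter reset). */
static void plug_flush(struct plug_model *plug)
{
    plug->count = 0;
    plug->flushes++;
}

/* Add one request; force a flush once the list reaches the threshold. */
static void plug_add_request(struct plug_model *plug)
{
    plug->count++;
    if (plug->count >= MAX_REQUEST_COUNT)
        plug_flush(plug);
}
```

With this cap, a task submitting at a large I/O depth never holds more than 16 requests back, which bounds both the added latency and the time spent flushing with interrupts off.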
     

07 Jul, 2011

1 commit

  • Due to the recently identified overflow in read_capacity_16() it was
    possible for max_discard_sectors to be zero but still have discards
    enabled on the associated device's queue.

    Eliminate the possibility for blkdev_issue_discard to infinitely loop.

    Interestingly this issue wasn't identified until a device, whose
    discard_granularity was 0 due to read_capacity_16 overflow, was consumed
    by blk_stack_limits() to construct limits for a higher-level DM
    multipath device. The multipath device's resulting limits never had the
    discard limits stacked because blk_stack_limits() will only do so if
    the bottom device's discard_granularity != 0. This resulted in the
    multipath device's limits.max_discard_sectors being 0.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
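
Why a zero max_discard_sectors leads to an endless loop, and how an early return avoids it, can be sketched in userspace C. This is a simplified model, not blkdev_issue_discard() itself: `issue_discard_model()` is a hypothetical function that only counts the chunks it would submit, and the choice of -EOPNOTSUPP as the error code is an assumption of the sketch.

```c
#include <errno.h>

/*
 * Split a discard of nr_sects sectors into chunks of at most
 * max_discard_sectors. Returns the number of chunks, or a negative
 * errno value if the device limit makes progress impossible.
 */
static long issue_discard_model(unsigned long long nr_sects,
                                unsigned int max_discard_sectors)
{
    long chunks = 0;

    /* Without this guard, each chunk would be 0 sectors and the loop would never end. */
    if (max_discard_sectors == 0)
        return -EOPNOTSUPP;

    while (nr_sects) {
        unsigned long long chunk = nr_sects < max_discard_sectors
                                       ? nr_sects
                                       : max_discard_sectors;
        nr_sects -= chunk;
        chunks++;
    }
    return chunks;
}
```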
     

02 Jul, 2011

1 commit

  • On Linux x86_64 host with 32bit userspace, running
    qemu or even just "qemu-img create -f qcow2 some.img 1G"
    causes a kernel warning:

    ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(00005326){t:'S';sz:0} arg(7fffffff) on some.img
    ioctl32(qemu-img:5296): Unknown cmd fd(3) cmd(801c0204){t:02;sz:28} arg(fff77350) on some.img

    ioctl 00005326 is CDROM_DRIVE_STATUS,
    ioctl 801c0204 is FDGETPRM.

    The warning appears because the Linux compat-ioctl handler for these
    ioctls only applies to block devices, while qemu also uses the ioctls on
    plain files.

    Signed-off-by: Johannes Stezenbach
    Acked-by: Arnd Bergmann
    Signed-off-by: Jens Axboe

    Johannes Stezenbach
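
The two command numbers in the warning can be decoded by hand. The sketch below assumes the generic Linux ioctl layout (8-bit number, 8-bit type, 14-bit size, 2-bit direction), which is what x86 uses; some architectures use a different split. `struct ioctl_fields` and `ioctl_decode()` are hypothetical names.

```c
#include <stdint.h>

/* Fields packed into an ioctl command number (generic layout assumed). */
struct ioctl_fields {
    unsigned int nr;   /* bits 0-7: command number within the type */
    unsigned int type; /* bits 8-15: subsystem "magic" character */
    unsigned int size; /* bits 16-29: size of the argument structure */
    unsigned int dir;  /* bits 30-31: 0 none, 1 write, 2 read */
};

/* Split a raw command number into its fields. */
static struct ioctl_fields ioctl_decode(uint32_t cmd)
{
    struct ioctl_fields f;

    f.nr = cmd & 0xff;
    f.type = (cmd >> 8) & 0xff;
    f.size = (cmd >> 16) & 0x3fff;
    f.dir = (cmd >> 30) & 0x3;
    return f;
}
```

Decoding 0x801c0204 this way gives type 0x02 and size 28, and 0x00005326 gives type 'S' and size 0, matching the `t:` and `sz:` fields printed in the warning lines.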
     

01 Jul, 2011

2 commits

  • Currently, only open(2) is defined as the 'clearing' point. It has
    two roles - first, it's an acknowledgement from userland indicating
    that the event has been received and kernel can clear pending states
    and proceed to generate more events. Secondly, it's passed on to
    device drivers as a hint indicating that a synchronization point has
    been reached and it might want to take a deeper look at the device.

    The latter currently is only used by sr which uses two different
    mechanisms - GET_EVENT_MEDIA_STATUS_NOTIFICATION and TEST_UNIT_READY
    to discover events, where the former is lighter weight and safe to be
    used repeatedly but may not provide full coverage. Among other
    things, GET_EVENT can't detect media removal while TUR can.

    This patch makes close(2) - blkdev_put() - indicate clearing hint for
    MEDIA_CHANGE to drivers. disk_check_events() is renamed to
    disk_flush_events() and updated to take @mask for events to flush
    which is or'd to ev->clearing and will be passed to the driver on the
    next ->check_events() invocation.

    This change makes sr generate MEDIA_CHANGE when media is ejected from
    userland - e.g. with eject(1).

    Note: Given the current usage, it seems @clearing hint is needlessly
    complex. disk_clear_events() can simply clear all events and the hint
    can be boolean @flush.

    Signed-off-by: Tejun Heo
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
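
The way a flush mask is accumulated and then handed to the driver on the next check can be sketched as follows. This is a userspace model under assumptions: `struct disk_events_model`, the `EV_*` constants and both functions are hypothetical names, not the kernel's.

```c
/* Hypothetical event bits for the model. */
#define EV_MEDIA_CHANGE  (1u << 0)
#define EV_EJECT_REQUEST (1u << 1)

struct disk_events_model {
    unsigned int clearing; /* hints accumulated for the next check */
};

/* On close(2) in this model: record which events should be flushed. */
static void flush_events_model(struct disk_events_model *ev, unsigned int mask)
{
    ev->clearing |= mask;
}

/* On the next check: hand the accumulated hints to the driver, then reset them. */
static unsigned int check_events_model(struct disk_events_model *ev)
{
    unsigned int hint = ev->clearing;

    ev->clearing = 0;
    return hint;
}
```

Because the mask is or'd in, several closes before the next check still produce a single hint, and the hint is consumed exactly once.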
     
  • Conflicts:
    block/blk-throttle.c
    block/cfq-iosched.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     

27 Jun, 2011

2 commits

  • ioc->ioc_data is RCU protected, so use the correct API to access it.
    This doesn't change any behavior, but just makes the code consistent.

    Signed-off-by: Shaohua Li
    Cc: stable@kernel.org # after ab4bd22d
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • I got an RCU warning at boot. ioc->ioc_data is rcu_dereferenced, but
    the caller doesn't hold rcu_read_lock.

    Signed-off-by: Shaohua Li
    Cc: stable@kernel.org # after ab4bd22d
    Signed-off-by: Jens Axboe

    Shaohua Li
     

20 Jun, 2011

3 commits

  • copy_from/to_user() and blk_rq_map_user() want a __user pointer.
    This patch fixes the following warnings from sparse:

    CHECK block/bsg.c
    block/bsg.c:185:38: warning: incorrect type in argument 2 (different address spaces)
    block/bsg.c:185:38: expected void const [noderef] *from
    block/bsg.c:185:38: got void *
    block/bsg.c:295:58: warning: incorrect type in argument 4 (different address spaces)
    block/bsg.c:295:58: expected void [noderef] *
    block/bsg.c:295:58: got void *[assigned] dxferp
    block/bsg.c:311:52: warning: incorrect type in argument 4 (different address spaces)
    block/bsg.c:311:52: expected void [noderef] *
    block/bsg.c:311:52: got void *[assigned] dxferp
    block/bsg.c:448:37: warning: incorrect type in argument 1 (different address spaces)
    block/bsg.c:448:37: expected void [noderef] *dst
    block/bsg.c:448:37: got void *

    Signed-off-by: Namhyung Kim
    Acked-by: FUJITA Tomonori
    Signed-off-by: Jens Axboe

    Namhyung Kim
     
  • The second condition in the OR is only evaluated when the first
    condition is false, so the bytes_read test in the second is not
    needed. The same goes for bytes_written.

    Signed-off-by: Namhyung Kim
    Acked-by: FUJITA Tomonori
    Signed-off-by: Jens Axboe

    Namhyung Kim
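
The simplification rests on the identity `!a || (a && b)` being the same as `!a || b`. A small sketch can check this exhaustively for representative values; the function names and the sample values are hypothetical, and `err` stands in for whatever error test the second operand performs.

```c
#include <stdbool.h>

/* Shape of the condition before the cleanup: the inner test of bytes is redundant. */
static bool cond_before(long bytes, bool err)
{
    return !bytes || (bytes && err);
}

/* Shape of the condition after the cleanup. */
static bool cond_after(long bytes, bool err)
{
    return !bytes || err;
}

/* Compare both forms over zero and non-zero byte counts, with and without an error. */
static bool conditions_agree(void)
{
    const long byte_counts[] = { 0, 1, 512 };

    for (int i = 0; i < 3; i++)
        for (int err = 0; err <= 1; err++)
            if (cond_before(byte_counts[i], err != 0) !=
                cond_after(byte_counts[i], err != 0))
                return false;
    return true;
}
```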
     
  • POLLOUT should be returned only if bd->queued_cmds < bd->max_queue
    so that bsg_alloc_command() can proceed.

    Signed-off-by: Namhyung Kim
    Acked-by: FUJITA Tomonori
    Signed-off-by: Jens Axboe

    Namhyung Kim
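
The corrected poll condition can be sketched in userspace C. This is a model, not bsg_poll() itself: `bsg_poll_model()` and its `done_pending` parameter are hypothetical, and only the `queued_cmds < max_queue` comparison is taken from the commit message.

```c
#include <poll.h>
#include <stdbool.h>

/*
 * Compute the poll mask for a command queue.
 * Readable when completed commands are waiting; writable only while
 * there is still room to queue another command.
 */
static short bsg_poll_model(bool done_pending, int queued_cmds, int max_queue)
{
    short mask = 0;

    if (done_pending)
        mask |= POLLIN;
    if (queued_cmds < max_queue)
        mask |= POLLOUT;
    return mask;
}
```

Reporting POLLOUT only while the queue has room means a writer woken by poll() can actually allocate a command instead of blocking or failing immediately.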
     

14 Jun, 2011

2 commits


13 Jun, 2011

2 commits


10 Jun, 2011

3 commits

  • disk_block_events() should guarantee that the event work is not in
    flight on return and once blocked it shouldn't issue further
    cancellations.

    Because there was no synchronization between the first blocker doing
    cancel_delayed_work_sync() and the following blockers, the following
    blockers could finish before cancellation was complete, which broke
    both guarantees - event work could be in flight and cancellation could
    happen after return.

    This bug triggered WARN_ON_ONCE() in disk_clear_events() reported in
    bug#34662.

    https://bugzilla.kernel.org/show_bug.cgi?id=34662

    Fix it by adding an outer mutex which protects both block count
    manipulation and work cancellation.

    -v2: Use outer mutex instead of bit waitqueue per Linus.

    Signed-off-by: Tejun Heo
    Tested-by: Sitsofe Wheeler
    Reported-by: Sitsofe Wheeler
    Reported-by: Borislav Petkov
    Reported-by: Meelis Roos
    Reported-by: Linus Torvalds
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
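
The outer-mutex fix can be sketched with a pthread mutex in userspace. This is a model under assumptions, not the kernel code: `struct events_model` and both functions are hypothetical, and the "work" is just a flag that the cancel helper clears.

```c
#include <pthread.h>
#include <stdbool.h>

struct events_model {
    pthread_mutex_t block_mutex; /* outer mutex: covers count change and cancellation */
    int block;                   /* nesting depth of blockers */
    bool work_in_flight;         /* stands in for the pending event work */
    int cancellations;           /* how many times the work was cancelled */
};

/* Stand-in for a synchronous cancel: on return the work is no longer running. */
static void cancel_work_sync_model(struct events_model *ev)
{
    ev->work_in_flight = false;
    ev->cancellations++;
}

/*
 * Block events. Only the first blocker cancels, and later blockers cannot
 * return until that cancellation has finished, because they wait on the
 * same mutex.
 */
static void block_events_model(struct events_model *ev)
{
    pthread_mutex_lock(&ev->block_mutex);
    if (ev->block++ == 0)
        cancel_work_sync_model(ev);
    pthread_mutex_unlock(&ev->block_mutex);
}
```

Holding the mutex across both the count update and the cancellation is what restores the two guarantees: no work in flight on return, and no cancellation issued after a blocker has already returned.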
     
  • After the previous update to disk_check_events(), nobody is using
    non-syncing __disk_block_events(). Remove @sync and, as this makes
    __disk_block_events() virtually identical to disk_block_events(),
    remove the underscore prefixed version.

    Signed-off-by: Tejun Heo
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • This patch is part of fix for triggering of WARN_ON_ONCE() in
    disk_clear_events() reported in bug#34662.

    https://bugzilla.kernel.org/show_bug.cgi?id=34662

    disk_clear_events() blocks events, schedules and flushes the event
    work. It expects the work to have started execution on schedule and
    finished on return from flush. WARN_ON_ONCE() triggers if the event
    work hasn't executed as expected. This problem happens because
    __disk_block_events() fails to guarantee that the event work item is
    not in flight on return from the function in a race-free manner. The
    problem is two-fold and this patch addresses one of them.

    When __disk_block_events() is called with @sync == %false, it bumps the
    event block count, calls cancel_delayed_work() and returns. This makes
    it impossible to guarantee that event polling is not in flight on
    return from syncing __disk_block_events() - if the first blocker was
    non-syncing, polling could still be in progress and later syncing ones
    would assume that the first blocker already canceled it.

    Making __disk_block_events() cancel_sync regardless of block count
    isn't feasible either as it may race with forced event checking in
    disk_clear_events().

    As disk_check_events() is the only user of non-syncing
    __disk_block_events(), updating it to directly cancel and schedule
    event work is the easiest way to solve the issue.

    Note that there's another bug in __disk_block_events() and this patch
    doesn't fix the issue completely. A later patch will fix the other bug.

    Signed-off-by: Tejun Heo
    Tested-by: Sitsofe Wheeler
    Reported-by: Sitsofe Wheeler
    Reported-by: Borislav Petkov
    Reported-by: Meelis Roos
    Reported-by: Linus Torvalds
    Cc: Andrew Morton
    Cc: Jens Axboe
    Cc: Kay Sievers
    Signed-off-by: Jens Axboe

    Tejun Heo
     

06 Jun, 2011

4 commits

  • If we rename the return of alloc_io_context() and get_io_context() from
    "ret" to "ioc" the code gets (a bit) more readable and (a lot) more
    grepable.

    Signed-off-by: Paul Bolle
    Signed-off-by: Jens Axboe

    Paul Bolle
     
  • Correctly suggested by sparse.

    Signed-off-by: Paul Bolle
    Signed-off-by: Jens Axboe

    Paul Bolle
     
  • Since we are modifying this RCU pointer, we need to hold
    the lock protecting it around it.

    This fixes a potential reuse and double free of a cfq
    io_context structure. The bug has been in CFQ for a long
    time, it hit very few people but those it did hit seemed
    to see it a lot.

    Tracked in RH bugzilla here:

    https://bugzilla.redhat.com/show_bug.cgi?id=577968

    Credit goes to Paul Bolle for figuring out that the issue
    was around the one-hit ioc->ioc_data cache. Thanks to his
    hard work the issue is now fixed.

    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Since we are modifying this RCU pointer, we need to hold
    the lock protecting it around it.

    This fixes a potential reuse and double free of a cfq
    io_context structure. The bug has been in CFQ for a long
    time, it hit very few people but those it did hit seemed
    to see it a lot.

    Tracked in RH bugzilla here:

    https://bugzilla.redhat.com/show_bug.cgi?id=577968

    Credit goes to Paul Bolle for figuring out that the issue
    was around the one-hit ioc->ioc_data cache. Thanks to his
    hard work the issue is now fixed.

    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 Jun, 2011

1 commit

  • Hi, Jens,

    If you recall, I posted an RFC patch for this back in July of last year:
    http://lkml.org/lkml/2010/7/13/279

    The basic problem is that a process can issue a never-ending stream of
    async direct I/Os to the same sector on a device, thus starving out
    other I/O in the system (due to the way the alias handling works in both
    cfq and deadline). The solution I proposed back then was to start
    dispatching from the fifo after a certain number of aliases had been
    dispatched. Vivek asked why we had to treat aliases differently at all,
    and I never had a good answer. So, I put together a simple patch which
    allows aliases to be added to the rb tree (it adds them to the right,
    though that doesn't matter as the order isn't guaranteed anyway). I
    think this is the preferred solution, as it doesn't break up time slices
    in CFQ or batches in deadline. I've tested it, and it does solve the
    starvation issue. Let me know what you think.

    Cheers,
    Jeff

    Signed-off-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Jeff Moyer
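
Allowing aliases into the sorted tree can be sketched with a plain binary search tree keyed by sector. Note the substitution: the kernel uses a red-black tree, while this sketch uses an unbalanced tree to stay short. `struct req_node`, `tree_add()` and `tree_count()` are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical request node keyed by its starting sector. */
struct req_node {
    unsigned long long sector;
    struct req_node *left, *right;
};

/* Insert a request; a request with an equal sector (an alias) goes to the right. */
static void tree_add(struct req_node **root, struct req_node *rq)
{
    struct req_node **p = root;

    while (*p) {
        if (rq->sector < (*p)->sector)
            p = &(*p)->left;
        else
            p = &(*p)->right; /* covers both greater and equal sectors */
    }
    rq->left = rq->right = NULL;
    *p = rq;
}

/* Count the nodes, to confirm that aliases are kept rather than rejected. */
static size_t tree_count(const struct req_node *n)
{
    return n ? 1 + tree_count(n->left) + tree_count(n->right) : 0;
}
```

Since an alias is simply another node, it is dispatched in sorted order like any other request and needs no special path that could starve the rest of the queue.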
     

02 Jun, 2011

2 commits


01 Jun, 2011

1 commit