01 Jul, 2008

1 commit

  • AS scheduler alternates between issuing read and write batches. It does
    the batch switch only after all requests from the previous batch are
    completed.

    When switching to a write batch, if there is an in-flight read request,
    the scheduler waits for its completion and records its intention to
    switch by setting ad->changed_batch and the new direction, but it does
    not update the batch_expire_time for the new write batch (which it does
    do when there are no previous pending requests).
    On completion of the read request, it sees that a switch was pending,
    schedules work for kblockd right away and resets the ad->changed_batch
    flag.
    Now when kblockd enters dispatch_request, where it is expected to pick
    up a write request, it instead ends the write batch immediately, because
    the batch_expire_time was never updated and still holds the expiry
    timestamp of the previous batch.

    This results in write starvation in every case where there is an
    intention to switch to a write batch while a previous read request is
    still in flight: the batch gets reverted to a read batch right away.

    This also holds true in the reverse case (switching from a write batch
    to a read batch with an in-flight write request).

    I've checked that this bug exists on 2.6.11, 2.6.18, 2.6.24 and
    linux-2.6-block git HEAD. I've tested the fix on x86 platforms with
    SCSI drives where the driver asks for the next request while a current
    request is in-flight.

    This patch is based off linux-2.6-block git HEAD.

    Bug reproduction:
    A simple scenario which reproduces this bug is:
    - dd if=/dev/hda3 of=/dev/null &
    - lilo
    The lilo takes forever to complete.

    This can also be reproduced fairly easily with the earlier dd and
    another test program doing msync().

    The example test program below should print out a message after every
    iteration, but it simply hangs forever. With this bugfix it makes
    forward progress.

    ====
    Example test program using msync() (thanks to suleiman AT google DOT
    com)

    #define _GNU_SOURCE /* for O_NOATIME */
    #include <err.h>
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static inline uint64_t
    rdtsc(void)
    {
            int64_t tsc;

            /* 32-bit x86 form: rdtsc returns the counter in edx:eax ("=A") */
            __asm __volatile("rdtsc" : "=A" (tsc));
            return (tsc);
    }

    int
    main(int argc, char **argv)
    {
            struct stat st;
            uint64_t e, s, t;
            char *p;
            long i;
            int fd;

            if (argc < 2) {
                    printf("Usage: %s <file>\n", argv[0]);
                    return (1);
            }

            if ((fd = open(argv[1], O_RDWR | O_NOATIME)) < 0)
                    err(1, "open");

            if (fstat(fd, &st) < 0)
                    err(1, "fstat");

            p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
            if (p == MAP_FAILED)
                    err(1, "mmap");

            t = 0;
            for (i = 0; i < 1000; i++) {
                    *p = 0;
                    msync(p, 4096, MS_SYNC);
                    s = rdtsc();
                    *p = 0;
                    __asm __volatile(""::: "memory");
                    e = rdtsc();
                    if (argc > 2)
                            printf("%ld: %lld cycles %jd %jd\n",
                                i, (long long)(e - s), (intmax_t)s,
                                (intmax_t)e);
                    t += e - s;
            }
            printf("average time: %lld cycles\n", (long long)(t / 1000));
            return (0);
    }

    Cc:
    Acked-by: Nick Piggin
    Signed-off-by: Jens Axboe

    Divyesh Shah
     

13 Jun, 2008

1 commit


10 Jun, 2008

1 commit

  • Commit 30f2f0eb4bd2c43d10a8b0d872c6e5ad8f31c9a0 ("block: do_mounts -
    accept root=") extended blk_lookup_devt() to be
    able to look up partitions that had not yet been registered, but in the
    process made the assumption that the '&block_class.devices' list only
    contains disk devices and that you can do 'dev_to_disk(dev)' on them.

    That isn't actually true. The block_class device list also contains the
    partitions we've discovered so far, and you can't just do a
    'dev_to_disk()' on those.

    So make sure to only work on devices that block/genhd.c has registered
    itself, something we can test by checking the 'dev->type' member. This
    makes the loop in blk_lookup_devt() match the other such loops in this
    file.

    [ We may want to do an alternate version that knows to handle _either_
    whole-disk devices or partitions, but for now this is the minimal fix
    for a series of crashes reported by Mariusz Kozlowski in

    http://lkml.org/lkml/2008/5/25/25

    and Ingo in

    http://lkml.org/lkml/2008/6/9/39 ]

    Reported-by: Mariusz Kozlowski
    Reported-by: Ingo Molnar
    Cc: Neil Brown
    Cc: Joao Luis Meloni Assirati
    Acked-by: Kay Sievers
    Cc: Greg Kroah-Hartman
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 May, 2008

6 commits


15 May, 2008

2 commits

  • As setting and clearing queue flags now requires that we hold a spinlock
    on the queue, and as blk_queue_stack_limits is called without that lock,
    get the lock inside blk_queue_stack_limits.

    For blk_queue_stack_limits to be able to find the right lock, each md
    personality needs to set q->queue_lock to point to the appropriate lock.
    Those personalities which didn't previously use a spinlock now use
    q->__queue_lock, so that lock is always initialised when the queue is
    allocated.

    With this in place, setting/clearing of the QUEUE_FLAG_PLUGGED bit will no
    longer cause warnings as it will be clear that the proper lock is held.

    Thanks to Dan Williams for review and fixing the silly bugs.

    Signed-off-by: NeilBrown
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Alistair John Strachan
    Cc: Nick Piggin
    Cc: "Rafael J. Wysocki"
    Cc: Jacek Luczak
    Cc: Prakash Punnoor
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Brown
     
  • Some devices, like md, may create partitions only at first access,
    so allow root= to be set to a valid non-existent partition of an
    existing disk. This applies only to non-initramfs root mounting.

    This fixes a regression from 2.6.24 which did allow this to happen and
    broke some users machines :(

    Acked-by: Neil Brown
    Tested-by: Joao Luis Meloni Assirati
    Cc: stable
    Signed-off-by: Kay Sievers
    Signed-off-by: Greg Kroah-Hartman

    Kay Sievers
     

13 May, 2008

1 commit

    bdevname() fills the buffer that it is given as a parameter, so calling
    strcpy() or snprintf() on the returned value is redundant (and probably
    not guaranteed to work, since strcpy and snprintf are not required to
    support overlapping buffers).

    Signed-off-by: Jean Delvare
    Cc: Stephen Tweedie
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     

07 May, 2008

7 commits

  • get_part() is fairly expensive, as it O(N) loops over partitions
    to find the right one. In lots of normal IO paths we end up looking
    up the partition twice, to make matters even worse. Change the
    stat add code to accept a passed in partition instead.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • We currently set all processes to the best-effort scheduling class,
    regardless of what CPU scheduling class they belong to. Improve that
    so that we correctly track idle and rt scheduling classes as well.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Original patch from Mikulas Patocka

    Mike Anderson was doing an OLTP benchmark on a computer with 48 physical
    disks mapped to one logical device via device mapper.

    He found that there was a slowdown on request_queue->lock in function
    generic_unplug_device. The slowdown is caused by the fact that when some
    code calls unplug on the device mapper, device mapper calls unplug on all
    physical disks. These unplug calls take the lock, find that the queue is
    already unplugged, release the lock and exit.

    With the below patch, performance of the benchmark was increased by 18%
    (the whole OLTP application, not just block layer microbenchmarks).

    So I'm submitting this patch for upstream. I think the patch is correct:
    when multiple threads call plug and unplug simultaneously, it is
    unspecified whether the queue ends up plugged or unplugged (so the
    patch can't make this worse), and the caller that plugged the queue
    should unplug it anyway (if it doesn't, there's a 3ms timeout).

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • They tend to depend a lot on the workload, so not a clear-cut
    likely or unlikely fit.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • put_io_context() drops the RCU read lock before calling into cfq_dtor(),
    however we need to hold off freeing there before grabbing and
    dereferencing the first object on the list.

    So extend the rcu_read_lock() scope to cover the calling of cfq_dtor(),
    and optimize cfq_free_io_context() to use a new variant for
    call_for_each_cic() that assumes the RCU read lock is already held.

    Hit in the wild by Alexey Dobriyan

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • For most initialization purposes, calling blk_queue_init_tags() without
    the queue lock held is OK. Only if called for resizing an existing map
    must the lock be held. Ditto for tag cleanup, the maps are reference
    counted.

    So switch the general queue flag setting to the unlocked variant, but
    retain the locked variant for resizing.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Concurrency isn't a big deal here since we have requests in flight
    at this point, but do the locked variant to set a better example.

    Signed-off-by: Jens Axboe

    Jens Axboe
     

03 May, 2008

2 commits

  • * git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6:
    [SCSI] aic94xx: fix section mismatch
    [SCSI] u14-34f: Fix 32bit only problem
    [SCSI] dpt_i2o: sysfs code
    [SCSI] dpt_i2o: 64 bit support
    [SCSI] dpt_i2o: move from virt_to_bus/bus_to_virt to dma_alloc_coherent
    [SCSI] dpt_i2o: use standard __init / __exit code
    [SCSI] megaraid_sas: fix suspend/resume sections
    [SCSI] aacraid: Add Power Management support
    [SCSI] aacraid: Fix jbod operations scan issues
    [SCSI] aacraid: Fix warning about macro side-effects
    [SCSI] add support for variable length extended commands
    [SCSI] Let scsi_cmnd->cmnd use request->cmd buffer
    [SCSI] bsg: add large command support
    [SCSI] aacraid: Fix down_interruptible() to check the return value correctly
    [SCSI] megaraid_sas; Update the Version and Changelog
    [SCSI] ibmvscsi: Handle non SCSI error status
    [SCSI] bug fix for free list handling
    [SCSI] ipr: Rename ipr's state scsi host attribute to prevent collisions
    [SCSI] megaraid_mbox: fix Dell CERC firmware problem

    Linus Torvalds
     
    Add support for variable-length, extended, and vendor-specific
    CDBs to scsi-ml. It is now possible for initiators and ULDs
    to issue these types of commands. LLDs need not change much:
    all they need to do is raise .max_cmd_len to the longest command
    they support (see the iscsi patch).

    - clean-up some code paths that did not expect commands to be
    larger than 16, and change cmd_len members' type to short as
    char is not enough.

    Signed-off-by: Boaz Harrosh
    Signed-off-by: Benny Halevy
    Signed-off-by: James Bottomley

    Boaz Harrosh
     

02 May, 2008

1 commit


01 May, 2008

1 commit


30 Apr, 2008

1 commit

  • Provide a place in sysfs (/sys/class/bdi) for the backing_dev_info object.
    This allows us to see and set the various BDI specific variables.

    In particular this properly exposes the read-ahead window for all relevant
    users and /sys/block//queue/read_ahead_kb should be deprecated.

    With patient help from Kay Sievers and Greg KH

    [mszeredi@suse.cz]

    - split off NFS and FUSE changes into separate patches
    - document new sysfs attributes under Documentation/ABI
    - do bdi_class_init as a core_initcall, otherwise the "default" BDI
    won't be initialized
    - remove bdi_init_fmt macro, it's not used very much

    [akpm@linux-foundation.org: fix ia64 warning]
    Signed-off-by: Peter Zijlstra
    Cc: Kay Sievers
    Acked-by: Greg KH
    Cc: Trond Myklebust
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

29 Apr, 2008

11 commits

  • The block I/O + elevator + I/O scheduler code spend a lot of time trying
    to merge I/Os -- rightfully so under "normal" circumstances. However,
    if one were to know that the incoming I/O stream was /very/ random in
    nature, the cycles are wasted.

    This patch adds a per-request_queue tunable that (when set) disables
    merge attempts (beyond the simple one-hit cache check), thus freeing up
    a non-trivial amount of CPU cycles.

    Signed-off-by: Alan D. Brunelle
    Signed-off-by: Jens Axboe

    Alan D. Brunelle
     
  • This patch changes rq->cmd from the static array to a pointer to
    support large commands.

    We rarely handle large commands, so as an optimization a struct
    request still has a static array for the command; rq_init sets the
    rq->cmd pointer to that static array.

    Signed-off-by: FUJITA Tomonori
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
     
  • This is a preparation for changing rq->cmd from the static array to a
    pointer.

    Signed-off-by: FUJITA Tomonori
    Cc: Boaz Harrosh
    Cc: Bartlomiej Zolnierkiewicz
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
     
    This renames rq_init() to blk_rq_init() and exports it. Any path that
    hands a request to the block layer needs to call it to initialize the
    request.

    This is a preparation for large command support, which needs to
    initialize the request in a proper way (that is, just doing a memset()
    will not work).

    Signed-off-by: FUJITA Tomonori
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
     
    blk_get_request initializes rq->cmd (via rq_init), so users don't need
    to do that themselves.

    The purpose of this patch is to remove sizeof(rq->cmd) and &rq->cmd,
    as a preparation for large command support, which changes rq->cmd from
    the static array to a pointer. sizeof(rq->cmd) will not make sense and
    &rq->cmd won't work.

    Signed-off-by: FUJITA Tomonori
    Cc: James Bottomley
    Cc: Alasdair G Kergon
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
     
  • This patch fixes the following build error with UML and gcc 4.3:

    ...
    CC block/blk-barrier.o
    /home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c: In function ‘blk_do_ordered’:
    /home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c:57: sorry, unimplemented: inlining failed in call to ‘blk_ordered_cur_seq’: function body not available
    /home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c:252: sorry, unimplemented: called from here
    /home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c:57: sorry, unimplemented: inlining failed in call to ‘blk_ordered_cur_seq’: function body not available
    /home/bunk/linux/kernel-2.6/git/linux-2.6/block/blk-barrier.c:253: sorry, unimplemented: called from here
    make[2]: *** [block/blk-barrier.o] Error 1

    Signed-off-by: Adrian Bunk
    Signed-off-by: Jens Axboe

    Adrian Bunk
     
  • This patch fixes the following build error with UML and gcc 4.3:

    ...
    CC block/elevator.o
    /home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c: In function ‘elv_merge’:
    /home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:73: sorry, unimplemented: inlining failed in call to ‘elv_rq_merge_ok’: function body not available
    /home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:103: sorry, unimplemented: called from here
    /home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:73: sorry, unimplemented: inlining failed in call to ‘elv_rq_merge_ok’: function body not available
    /home/bunk/linux/kernel-2.6/git/linux-2.6/block/elevator.c:495: sorry, unimplemented: called from here
    make[2]: *** [block/elevator.o] Error 1
    make[1]: *** [block] Error 2

    Signed-off-by: Adrian Bunk
    Signed-off-by: Jens Axboe

    Adrian Bunk
     
  • We can save some atomic ops in the IO path, if we clearly define
    the rules of how to modify the queue flags.

    Signed-off-by: Jens Axboe

    Nick Piggin
     
  • This patch adds bio_copy_kern similar to
    bio_copy_user. blk_rq_map_kern uses bio_copy_kern instead of
    bio_map_kern if necessary.

    bio_copy_kern uses temporary pages, and the bi_end_io callback frees
    these pages. bio_copy_kern saves the original kernel buffer at
    bio->bi_private; it doesn't use something like struct bio_map_data to
    store the information about the caller.

    Signed-off-by: FUJITA Tomonori
    Cc: Tejun Heo
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
     
  • blk_max_pfn can now be unexported.

    Signed-off-by: Adrian Bunk
    Signed-off-by: Jens Axboe

    Adrian Bunk
     
  • This requires moving rq_init() from get_request() to blk_alloc_request().
    The upside is that we can now require an rq_init() from any path that
    wishes to hand the request to the block layer.

    rq_init() will be exported for the code that uses struct request
    without blk_get_request.

    This is a preparation for large command support, which needs to
    initialize struct request in a proper way (that is, just doing a
    memset() will not work).

    Signed-off-by: FUJITA Tomonori
    Signed-off-by: Jens Axboe

    FUJITA Tomonori
     

23 Apr, 2008

1 commit

  • This patch adds release callback support, which is called when a bsg
    device goes away. bsg_register_queue() takes a pointer to a callback
    function. This feature is useful for stuff like sas_host that can't
    use the release callback in struct device.

    If a caller doesn't need bsg's release callback, it can call
    bsg_register_queue() with NULL pointer (e.g. scsi devices can use
    release callback in struct device so they don't need bsg's callback).

    With this patch, bsg uses kref for refcounts on bsg devices instead of
    get/put_device in fops->open/release. bsg calls put_device and the
    caller's release callback (if it was registered) in kref_put's
    release.

    Signed-off-by: FUJITA Tomonori
    Signed-off-by: James Bottomley

    FUJITA Tomonori
     

22 Apr, 2008

1 commit

  • * 'for-2.6.26' of git://git.kernel.dk/linux-2.6-block:
    block: fix blk_register_queue() return value
    block: fix memory hotplug and bouncing in block layer
    block: replace remaining __FUNCTION__ occurrences
    Kconfig: clean up block/Kconfig help descriptions
    cciss: fix warning oops on rmmod of driver
    cciss: Fix race between disk-adding code and interrupt handler
    block: move the padding adjustment to blk_rq_map_sg
    block: add bio_copy_user_iov support to blk_rq_map_user_iov
    block: convert bio_copy_user to bio_copy_user_iov
    loop: manage partitions in disk image
    cdrom: use kmalloced buffers instead of buffers on stack
    cdrom: make unregister_cdrom() return void
    cdrom: use list_head for cdrom_device_info list
    cdrom: protect cdrom_device_info list by mutex
    cdrom: cleanup hardcoded error-code
    cdrom: remove ifdef CONFIG_SYSCTL

    Linus Torvalds
     

21 Apr, 2008

3 commits

    blk_register_queue() returns -ENXIO when queue->request_fn is NULL, but
    some block drivers (for example, brd and loop) call blk_register_queue()
    via add_disk() with queue->request_fn == NULL.

    Although no one checks the return value of blk_register_queue(), this
    patch makes it return 0 instead of -ENXIO when queue->request_fn is
    NULL.

    This patch also adds a warning when blk_register_queue() or
    blk_unregister_queue() is called with queue == NULL, rather than
    ignoring the invalid usage silently.

    Signed-off-by: Akinobu Mita
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Akinobu Mita
     
  • Modify the help descriptions of block/Kconfig for clarity, accuracy and
    consistency.

    Refactor the BLOCK description a bit. The wording "This permits ... to be
    removed" isn't quite right; the block layer is removed when the option is
    disabled, whereas most descriptions talk about what happens when the option is
    enabled. Reformat the list of what is affected by disabling the block layer.

    Add more examples of large block devices to LBD and strive for technical
    accuracy: block devices of size _exactly_ 2TB require CONFIG_LBD, not
    only those "bigger than 2TB". Also clarify that the config option is
    only needed for individual block devices of size >= 2TB; for example,
    with 3 x 1TB disks you'd have 3TB of total storage, but you wouldn't
    need the option unless you aggregate those disks into a RAID or LVM.

    Improve terminology and grammar on BLK_DEV_IO_TRACE.

    I also added the boilerplate "If unsure, say N" to most options.

    Precisely say "2TB and larger" for LSF.

    Indent the help text for BLK_DEV_BSG by 2 spaces in accordance with the
    standard.

    Signed-off-by: Nick Andrew
    Cc: "Randy.Dunlap"
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Nick Andrew
     
  • blk_rq_map_user adjusts bi_size of the last bio. It breaks the rule
    that req->data_len (the true data length) is equal to sum(bio). It
    broke the scsi command completion code.

    commit e97a294ef6938512b655b1abf17656cf2b26f709 was introduced to fix
    the above issue. However, the partial completion code doesn't work
    with it. The commit is also a layer violation (scsi mid-layer should
    not know about the block layer's padding).

    This patch moves the padding adjustment to blk_rq_map_sg (suggested by
    James). The padding works like the drain buffer. This patch breaks the
    rule that req->data_len is equal to sum(sg); however, the drain buffer
    already broke it. So this patch just restores the rule that
    req->data_len is equal to sum(bio) without breaking anything new.

    Now when a low level driver needs padding, blk_rq_map_user and
    blk_rq_map_user_iov guarantee there's enough room for padding.
    blk_rq_map_sg can safely extend the last entry of a scatter list.

    blk_rq_map_sg must extend the last entry of a scatter list only for a
    request that went through bio_copy_user_iov. This patch introduces a
    new REQ_COPY_USER flag for that purpose.

    Signed-off-by: FUJITA Tomonori
    Cc: Tejun Heo
    Cc: Mike Christie
    Cc: James Bottomley
    Signed-off-by: Jens Axboe

    FUJITA Tomonori