14 Jan, 2011

1 commit

  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_NOR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     

13 Jan, 2011

1 commit


07 Jan, 2011

3 commits


05 Jan, 2011

1 commit

  • /proc/diskstats would display a strange output as follows.

    $ cat /proc/diskstats |grep sda
    8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
    8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                           ~~~~~~~~~~
    8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
    8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
    8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
    8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

    The cause is incorrect accounting of hd_struct->in_flight when a bio is
    merged into a request belonging to a different partition by
    ELEVATOR_FRONT_MERGE.

    The detailed root cause is as follows.

    Assume that there are two partitions, sda1 and sda2.

    1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
    is 0 and sda2's one is 1.

         | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    2. A bio belonging to sda1 is issued and is merged into the request from
    step 1 by ELEVATOR_FRONT_MERGE. The first sector of the request changes
    from the sda2 region to the sda1 region. However, neither partition's
    hd_struct->in_flight is changed.

         | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    3. The request finishes and blk_account_io_done() is called. In this case,
    sda1's hd_struct->in_flight, not sda2's, is decremented.

         | hd_struct->in_flight
    ---------------------------
    sda1 | -1
    sda2 | 1
    ---------------------------

    The patch fixes the problem by caching the partition lookup
    inside the request structure, hence making sure that the increment
    and decrement will always happen on the same partition struct. This
    also speeds up IO with accounting enabled, since it cuts down on
    the number of lookups we have to do.
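
    The difference between looking the partition up again at completion
    and reusing a cached pointer can be sketched in user space. This is a
    minimal model, not the kernel code: struct part, struct request and all
    function names below are hypothetical, and the sector ranges are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model; these are not the kernel's structures. */
struct part {
    unsigned long first, last;  /* inclusive sector range (made-up values) */
    int in_flight;
};

struct request {
    unsigned long pos;          /* first sector of the request */
    struct part *part;          /* partition cached when accounting starts */
};

struct part *part_lookup(struct part *tbl, size_t n, unsigned long sector)
{
    for (size_t i = 0; i < n; i++)
        if (sector >= tbl[i].first && sector <= tbl[i].last)
            return &tbl[i];
    return NULL;
}

/* Account the start of a request and remember which partition was charged. */
void account_start(struct part *tbl, size_t n, struct request *rq)
{
    rq->part = part_lookup(tbl, n, rq->pos);
    assert(rq->part != NULL);
    rq->part->in_flight++;
}

/* A front merge moves the first sector of the request backwards. */
void front_merge(struct request *rq, unsigned long new_pos)
{
    rq->pos = new_pos;
}

/* Behaviour described as buggy: look the partition up again at completion. */
void account_done_by_lookup(struct part *tbl, size_t n, struct request *rq)
{
    struct part *p = part_lookup(tbl, n, rq->pos);

    assert(p != NULL);
    p->in_flight--;
}

/* Behaviour described by the patch: decrement the cached partition. */
void account_done_cached(struct request *rq)
{
    rq->part->in_flight--;
}
```

    With two partitions covering sectors 0-99 and 100-199, a request that
    starts at sector 100 and is front-merged down to sector 96 ends with
    counts of -1 and 1 under the lookup variant and 0 and 0 under the cached
    variant. The value 4294967064 in the output above equals 2^32 - 232,
    which is what a count of -232 looks like when printed as an unsigned
    32-bit integer.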

    Also add a refcount to struct hd_struct to keep the partition in
    memory as long as users exist. We use kref_test_and_get() to ensure
    we don't add a reference to a partition which is going away.
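
    The idea of taking a reference only while the object is still alive can
    be sketched with C11 atomics. This is an assumption about the general
    technique, not the kernel's kref implementation; struct ref and both
    function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object; not the kernel's kref. */
struct ref {
    atomic_int count;
};

/* Take a reference only if the count is still non-zero. */
bool ref_test_and_get(struct ref *r)
{
    int old = atomic_load(&r->count);

    while (old != 0) {
        /* On failure, old is refreshed with the current count. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;   /* the object is already going away */
}

/* Drop a reference; returns true when the caller dropped the last one. */
bool ref_put(struct ref *r)
{
    int old = atomic_fetch_sub(&r->count, 1);

    assert(old > 0);    /* catch an unbalanced put */
    return old == 1;
}
```

    Once the count has reached zero, ref_test_and_get() keeps failing, so a
    late lookup cannot resurrect an object that is being torn down.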

    Signed-off-by: Jerome Marchand
    Signed-off-by: Yasuaki Ishimatsu
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jerome Marchand
     

03 Jan, 2011

1 commit

  • kblockd is used for unplugging and may affect IO latency and
    throughput, and the max number of concurrent work items is bound by
    the number of block devices. Make it a HIGHPRI workqueue w/ default max
    concurrency.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

23 Dec, 2010

1 commit


21 Dec, 2010

2 commits

  • This patch fixes a spelling error in a source code comment and removes
    superfluous braces in the function exit_io_context().

    Signed-off-by: Bart Van Assche
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    cciss: fix cciss_revalidate panic
    block: max hardware sectors limit wrapper
    block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead
    blk-throttle: Correct the placement of smp_rmb()
    blk-throttle: Trim/adjust slice_end once a bio has been dispatched
    block: check for proper length of iov entries earlier in blk_rq_map_user_iov()
    drbd: fix for spin_lock_irqsave in endio callback
    drbd: don't recvmsg with zero length

    Linus Torvalds
     

17 Dec, 2010

8 commits

  • The major/minor device numbers are always defined and used as `unsigned'.

    Signed-off-by: Yang Zhang
    Signed-off-by: Jens Axboe

    Yang Zhang
     
  • Signed-off-by: Yang Zhang
    Signed-off-by: Jens Axboe

    Yang Zhang
     
  • When cfq_choose_cfqg() is called in select_queue(), there must be at least one
    backlogged CFQ queue waiting for dispatching, hence there must be at least one
    backlogged CFQ group on service tree. So we never call choose_service_tree()
    with cfqg == NULL.

    Signed-off-by: Gui Jianfeng
    Reviewed-by: Jeff Moyer
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Gui Jianfeng
     
  • Implement blk_limits_max_hw_sectors() and make
    blk_queue_max_hw_sectors() a wrapper around it.

    DM needs this to avoid setting queue_limits' max_hw_sectors and
    max_sectors directly. dm_set_device_limits() now leverages
    blk_limits_max_hw_sectors() logic to establish the appropriate
    max_hw_sectors minimum (PAGE_SIZE). Fixes issue where DM was
    incorrectly setting max_sectors rather than max_hw_sectors (which
    caused dm_merge_bvec()'s max_hw_sectors check to be ineffective).
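
    The wrapper arrangement can be sketched as follows. This is a simplified
    user-space model: the structures, the function names, the 4096-byte page
    size and the default cap of 1024 sectors are all assumptions made for
    the example, and capping max_sectors at a default is likewise assumed
    rather than taken from the text above.

```c
#include <assert.h>

/* Assumed values for illustration only. */
#define SKETCH_PAGE_SIZE        4096u
#define SKETCH_DEF_MAX_SECTORS  1024u   /* assumed default soft limit */

struct limits {             /* hypothetical subset of the queue limits */
    unsigned int max_hw_sectors;
    unsigned int max_sectors;
};

struct queue {              /* hypothetical queue carrying its limits */
    struct limits limits;
};

/* Works on bare limits, so a stacking driver can use it without a queue. */
void limits_max_hw_sectors(struct limits *lim, unsigned int max_hw_sectors)
{
    unsigned int min_sectors = SKETCH_PAGE_SIZE >> 9;   /* 512-byte sectors */

    assert(lim != 0);
    if (max_hw_sectors < min_sectors)
        max_hw_sectors = min_sectors;   /* never below one page */

    lim->max_hw_sectors = max_hw_sectors;
    lim->max_sectors = max_hw_sectors < SKETCH_DEF_MAX_SECTORS
                     ? max_hw_sectors : SKETCH_DEF_MAX_SECTORS;
}

/* The queue-level call is just a wrapper around the limits-level one. */
void queue_max_hw_sectors(struct queue *q, unsigned int max_hw_sectors)
{
    limits_max_hw_sectors(&q->limits, max_hw_sectors);
}
```

    Keeping the clamping in one limits-level function means both callers get
    the same minimum, and neither sets the two fields by hand.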

    Signed-off-by: Mike Snitzer
    Cc: stable@kernel.org
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • When stacking devices, a request_queue is not always available. This
    forced us to have a no_cluster flag in the queue_limits that could be
    used as a carrier until the request_queue had been set up for a
    metadevice.

    There were several problems with that approach. First of all it was up
    to the stacking device to remember to set queue flag after stacking had
    completed. Also, the queue flag and the queue limits had to be kept in
    sync at all times. We got that wrong, which could lead to us issuing
    commands that went beyond the max scatterlist limit set by the driver.

    The proper fix is to avoid having two flags for tracking the same thing.
    We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
    block layer merging functions. The queue_limit 'no_cluster' is turned
    into 'cluster' to avoid double negatives and to ease stacking.
    Clustering defaults to being enabled as before. The queue flag logic is
    removed from the stacking function, and explicitly setting the cluster
    flag is no longer necessary in DM and MD.

    Reported-by: Ed Lin
    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Currently, media presence polling for removable block devices is done
    from userland. There are several issues with this.

    * Polling is done by periodically opening the device. For SCSI
    devices, the command sequence generated by such action involves a
    few different commands including TEST_UNIT_READY. This behavior,
    while perfectly legal, is different from Windows which only issues
    a single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
    ATAPI devices lock up after being periodically queried with such
    command sequences.

    * There is no reliable and unintrusive way for a userland program to
    tell whether the target device is safe for media presence polling.
    For example, polling for media presence during an on-going burning
    session can make it fail. The polling program can avoid this by
    opening the device with O_EXCL but then it risks making a valid
    exclusive user of the device fail w/ -EBUSY.

    * Userland polling is unnecessarily heavy and in-kernel implementation
    is lighter and better coordinated (workqueue, timer slack).

    This patch implements a framework for in-kernel disk event handling,
    which includes media presence polling.

    * bdops->check_events() is added, which supersedes ->media_changed().
    It should check whether there's any pending event and return it if so.
    Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
    DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
    called concurrently.

    * gendisk->events and ->async_events are added. These should be
    initialized by block driver before passing the device to add_disk().
    The former contains the mask of all supported events and the latter
    the mask of all events which the device can report without polling.
    /sys/block/*/events[_async] export these to userland.

    * Kernel parameter block.events_dfl_poll_msecs controls the system
    polling interval (default is 0 which means disable) and
    /sys/block/*/events_poll_msecs controls the polling interval for
    individual devices (default is -1 meaning use system setting). Note
    that if a device can report all supported events asynchronously and
    its polling interval isn't explicitly set, the device won't be
    polled regardless of the system polling interval.

    * If a device is opened exclusively with write access, event checking
    is automatically disabled until all write exclusive accesses are
    released.

    * There are event 'clearing' events. For example, both of the currently
    defined events are cleared after the device has been successfully
    opened. This information is passed to ->check_events() callback
    using @clearing argument as a hint.

    * Event checking is always performed from system_nrt_wq and timer
    slack is set to 25% for polling.

    * Nothing changes for drivers which implement ->media_changed() but
    not ->check_events(). Going forward, all drivers will be converted
    to ->check_events() and ->media_changed() will be dropped.
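
    The interval rules in the list above can be sketched as one small
    function. This is a model of the described behavior only: struct
    disk_model, the event bit values and effective_poll_msecs() are
    hypothetical, and treating an explicit per-device 0 as "do not poll" is
    an assumption.

```c
#include <assert.h>

/* Event bits named after the commit text; the values are assumptions. */
#define EV_MEDIA_CHANGE   (1u << 0)
#define EV_EJECT_REQUEST  (1u << 1)

struct disk_model {
    unsigned int events;        /* mask of all supported events */
    unsigned int async_events;  /* events reported without polling */
    long poll_msecs;            /* per-device setting, -1 = use system */
};

/* Returns the polling interval in ms; 0 means "do not poll". */
long effective_poll_msecs(const struct disk_model *d, long system_msecs)
{
    assert(d->poll_msecs >= -1 && system_msecs >= 0);

    if (d->poll_msecs >= 0)
        return d->poll_msecs;   /* explicit per-device setting wins */
    if ((d->events & ~d->async_events) == 0)
        return 0;               /* every event is reported asynchronously */
    return system_msecs;        /* fall back to the system-wide interval */
}
```

    A device that supports both events but reports neither asynchronously
    follows the system interval, while a fully asynchronous device is left
    alone unless its own interval is set explicitly.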

    Signed-off-by: Tejun Heo
    Cc: Kay Sievers
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There's no reason for register_disk() and del_gendisk() to be in
    fs/partitions/check.c. Move both to genhd.c. While at it, collapse
    unlink_gendisk(), which was artificially in a separate function due to
    genhd.c / check.c split, into del_gendisk().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There's no user of the facility. Kill it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

13 Dec, 2010

1 commit


09 Dec, 2010

1 commit

  • This patch corrects an issue in bsg that results in a general protection
    fault if an LLD is removed while an application is using an open file
    handle to a bsg device, and the application issues an ioctl. The fault
    occurs because the class_dev is NULL, having been cleared in
    bsg_unregister_queue() when the driver was removed. With this
    patch, a check is made for the class_dev, and the application
    will receive ENXIO if the related object is gone.

    Signed-off-by: Carl Lajeunesse
    Signed-off-by: James Smart
    Signed-off-by: James Bottomley

    James Smart
     

02 Dec, 2010

2 commits

  • o I was discussing which variables are updated without the spin lock and
    why we need barriers, and Oleg pointed out that the smp_rmb() should sit
    between the read of td->limits_changed and the read of
    tg->limits_changed. This patch fixes it.

    o Following is one possible sequence of events. Say cpu0 is executing
    throtl_update_blkio_group_read_bps() and cpu1 is executing
    throtl_process_limit_change().

    cpu0                                    cpu1

    tg->limits_changed = true;
    smp_mb__before_atomic_inc();
    atomic_inc(&td->limits_changed);

                                            if (!atomic_read(&td->limits_changed))
                                                    return;

                                            if (tg->limits_changed)
                                                    do_something;

    If cpu0 has updated tg->limits_changed and td->limits_changed, we want to
    make sure that if update to td->limits_changed is visible on cpu1, then
    update to tg->limits_changed should also be visible.

    Oleg pointed out that, to ensure this, we need to insert an smp_rmb()
    between the td->limits_changed read and the tg->limits_changed read.

    o I had erroneously put smp_rmb() before atomic_read(&td->limits_changed).
    This patch fixes it.
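
    The pairing can be sketched in portable C11, using fences as rough
    analogues of the kernel barriers. This is an illustration under stated
    assumptions, not the kernel code: the two variables and both function
    names are hypothetical, and C11 release/acquire fences are similar to,
    but not identical with, smp_mb__before_atomic_inc() and smp_rmb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for tg->limits_changed and td->limits_changed. */
static atomic_bool tg_limits_changed;
static atomic_int  td_limits_changed;

/* Writer side (cpu0 in the commit text). */
void publish_limit_change(void)
{
    int prev;

    atomic_store_explicit(&tg_limits_changed, true, memory_order_relaxed);
    /* Order the store above before the increment below. */
    atomic_thread_fence(memory_order_release);
    prev = atomic_fetch_add_explicit(&td_limits_changed, 1,
                                     memory_order_relaxed);
    assert(prev >= 0);  /* the counter is never negative in this model */
}

/* Reader side (cpu1). Returns true if a group change was observed. */
bool process_limit_change(void)
{
    if (!atomic_load_explicit(&td_limits_changed, memory_order_relaxed))
        return false;
    /* Read barrier between the two loads, where the patch places it. */
    atomic_thread_fence(memory_order_acquire);
    return atomic_load_explicit(&tg_limits_changed, memory_order_relaxed);
}
```

    A fence placed before the first load would order nothing useful here,
    because the load that must not be reordered is the second one.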

    Reported-by: Oleg Nesterov
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • o During some testing I did the following and noticed that throttling
    stops working.

    - Put a very low limit on a cgroup, say 1 byte per second.
    - Start some reads; this sets slice_end to a very high value.
    - Change the limit to a higher value, say 1MB/s.
    - Now IO unthrottles and finishes as expected.
    - Try to do the read again, but IO is not limited to 1MB/s as expected.

    o What is happening.
    - The initial low limit sets slice_end to a very high value.
    - When the limit is updated, slice_end is not truncated.
    - The very high slice_end keeps the existing slice valid for a very
    long time, so a new slice does not start.
    - tg_may_dispatch() is called in blk_throtl_bio(), and trim_slice()
    is not called in this path. So slice_start is some old value and
    in practice we are able to do a huge amount of IO.

    o There are many ways it can be fixed. I have fixed it by trying to
    adjust/cleanup slice_end in trim_slice(). Generally we extend slices if bio
    is big and can't be dispatched in one slice. After dispatch of bio, readjust
    the slice_end to make sure we don't end up with huge values.
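
    The readjustment can be sketched with a much reduced model. Everything
    here is an assumption made for illustration: struct slice_model, the
    slice length of 100 ticks and trim_slice_model() are hypothetical, and
    the model ignores the byte and IO counters as well as timer wraparound.

```c
#include <assert.h>

#define SLICE_TICKS 100UL   /* assumed slice length, in timer ticks */

struct slice_model {        /* hypothetical per-group slice state */
    unsigned long slice_start;
    unsigned long slice_end;
};

/* Called after a bio has been dispatched at time `now`. */
void trim_slice_model(struct slice_model *s, unsigned long now)
{
    unsigned long nr_slices;

    assert(now >= s->slice_start);

    /* Pull slice_end back so a stale, huge value cannot linger. */
    s->slice_end = now + SLICE_TICKS;

    /* Drop the whole slices that have already elapsed. */
    nr_slices = (now - s->slice_start) / SLICE_TICKS;
    s->slice_start += nr_slices * SLICE_TICKS;
}
```

    Starting from slice_start = 0 and a stale slice_end of 1000000, a
    dispatch at tick 250 leaves slice_start at 200 and slice_end at 350, so
    the old limit no longer keeps the slice alive.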

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

01 Dec, 2010

2 commits


29 Nov, 2010

1 commit


28 Nov, 2010

1 commit


27 Nov, 2010

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    cciss: fix build for PROC_FS disabled
    block: fix amiga and atari floppy driver compile warning
    blk-throttle: Fix calculation of max number of WRITES to be dispatched
    ioprio: grab rcu_read_lock in sys_ioprio_{set,get}()
    xen/blkfront: cope with backend that fail empty BLKIF_OP_WRITE_BARRIER requests
    xen/blkfront: Implement FUA with BLKIF_OP_WRITE_BARRIER
    xen/blkfront: change blk_shadow.request to proper pointer
    xen/blkfront: map REQ_FLUSH into a full barrier

    Linus Torvalds
     

18 Nov, 2010

1 commit


16 Nov, 2010

5 commits

  • Signed-off-by: Kiyoshi Ueda
    Signed-off-by: Jun'ichi Nomura
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • Jens Axboe
     
  • Jens Axboe
     
  • o Allow hierarchical cgroup creation for blkio controller

    o Currently we disallow it as both the io controller policies (throttling
    as well as proportional bandwidth) do not support hierarchical accounting
    and control. But the flip side is that blkio controller can not be used with
    libvirt as libvirt creates a cgroup hierarchy deeper than 1 level.

    //libvirt/qemu/

    o So this patch will allow creation of a cgroup hierarchy, but at the
    backend everything will be treated as flat. So if somebody creates a
    hierarchy as follows.

            root
            /  \
        test1  test2
          |
        test3

    CFQ and throttling will practically treat all groups at same level.

                pivot
            /   |   \   \
        root test1 test2 test3

    o Once we have actual support for hierarchical accounting and control
    then we can introduce another cgroup tunable file "blkio.use_hierarchy"
    which will be 0 by default but if the user wants to enforce hierarchical
    control then it can be set to 1. This way there should not be any
    ABI problems down the line.

    o The only not so pretty part is the introduction of an extra file
    "use_hierarchy" down the line. Kame-san had mentioned that hierarchical
    accounting is expensive in the memory controller, hence they keep it off
    by default. I suspect the same will be the case for the IO controller, as
    for each IO completion we shall have to account IO through the hierarchy
    up to the root. If yes, then it probably is not a very bad idea to
    introduce this extra file so that it will be used only when somebody
    needs it, and some people might enable hierarchy only in part of the
    hierarchy.

    o This is basically how the memory controller also uses "use_hierarchy",
    and they also allowed creation of hierarchies when actual backend support
    was not available.

    Signed-off-by: Vivek Goyal
    Acked-by: Balbir Singh
    Reviewed-by: Gui Jianfeng
    Reviewed-by: Ciju Rajan K
    Tested-by: Ciju Rajan K
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • o Currently we try to dispatch more READS and fewer WRITES (75%, 25%) in
    one dispatch round. ummy pointed out that there is a bug in the
    max_nr_writes calculation. This patch fixes it.
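
    The intended 75%/25% split can be sketched as below. The struct and
    function names are hypothetical, and the commit text does not say what
    the faulty calculation was, so this only shows one consistent way to
    derive the two budgets from a per-round quantum.

```c
#include <assert.h>

struct dispatch_budget {    /* hypothetical per-round budgets */
    unsigned int max_reads;
    unsigned int max_writes;
};

/* Split one dispatch round: about 75% reads, the remainder writes. */
struct dispatch_budget split_quantum(unsigned int quantum)
{
    struct dispatch_budget b;

    assert(quantum > 0);
    b.max_reads = quantum * 3 / 4;
    b.max_writes = quantum - b.max_reads;   /* the two always sum to quantum */
    return b;
}
```

    For a quantum of 8 this gives 6 reads and 2 writes per round.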

    Reported-by: ummy y
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

13 Nov, 2010

1 commit

  • Over time, block layer has accumulated a set of APIs dealing with bdev
    open, close, claim and release.

    * blkdev_get/put() are the primary open and close functions.

    * bd_claim/release() deal with exclusive open.

    * open/close_bdev_exclusive() are combinations of open and claim and
    the other way around, respectively.

    * bd_link/unlink_disk_holder() create and remove holder/slave
    symlinks.

    * open_by_devnum() wraps bdget() + blkdev_get().

    The interface is a bit confusing and the decoupling of open and claim
    makes it impossible to properly guarantee exclusive access, as an
    in-kernel open + claim sequence can disturb the existing exclusive
    open even before the block layer knows the current open is for another
    exclusive access. Reorganize the interface such that,

    * blkdev_get() is extended to include exclusive access management.
    @holder argument is added and, if @FMODE_EXCL is specified, it will
    gain exclusive access atomically w.r.t. other exclusive accesses.

    * blkdev_put() is similarly extended. It now takes @mode argument and
    if @FMODE_EXCL is set, it releases an exclusive access. Also, when
    the last exclusive claim is released, the holder/slave symlinks are
    removed automatically.

    * bd_claim/release() and close_bdev_exclusive() are no longer
    necessary and either made static or removed.

    * bd_link_disk_holder() remains the same but bd_unlink_disk_holder()
    is no longer necessary and removed.

    * open_bdev_exclusive() becomes a simple wrapper around lookup_bdev()
    and blkdev_get(). It also has an unexpected extra bdev_read_only()
    test which probably should be moved into blkdev_get().

    * open_by_devnum() is modified to take @holder argument and pass it to
    blkdev_get().

    Most of bdev open/close operations are unified into blkdev_get/put()
    and most exclusive accesses are tested atomically at the open time (as
    it should). This cleans up code and removes some, both valid and
    invalid, but unnecessary all the same, corner cases.
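
    The combined open-and-claim semantics can be sketched with a small
    model. It is not the kernel interface: struct bdev_model, MODE_EXCL and
    both functions are hypothetical, and the locking that makes the real
    check atomic is left out because the model is single-threaded.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODE_EXCL 0x1u      /* stand-in for FMODE_EXCL; value is arbitrary */

struct bdev_model {         /* hypothetical, not struct block_device */
    int openers;
    const void *holder;     /* current exclusive holder, or NULL */
    int holders;            /* number of exclusive opens by that holder */
};

/* Open and, if requested, claim in one step. Returns 0 or -EBUSY. */
int bdev_get(struct bdev_model *b, unsigned int mode, const void *holder)
{
    if (mode & MODE_EXCL) {
        assert(holder != NULL);
        if (b->holder && b->holder != holder)
            return -EBUSY;  /* someone else holds it exclusively */
        b->holder = holder;
        b->holders++;
    }
    b->openers++;
    return 0;
}

/* Close; releases the exclusive claim when mode carries MODE_EXCL. */
void bdev_put(struct bdev_model *b, unsigned int mode)
{
    if (mode & MODE_EXCL) {
        assert(b->holders > 0);
        if (--b->holders == 0)
            b->holder = NULL;   /* last exclusive claim released */
    }
    b->openers--;
}
```

    Because the claim is checked inside the open call itself, a second
    exclusive opener fails before it can touch the device, instead of
    opening first and discovering the conflict afterwards.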

    open_bdev_exclusive() and open_by_devnum() can use further cleanup -
    rename to blkdev_get_by_path() and blkdev_get_by_devt() and drop
    special features. Well, let's leave them for another day.

    Most conversions are straight-forward. drbd conversion is a bit more
    involved as there was some reordering, but the logic should stay the
    same.

    Signed-off-by: Tejun Heo
    Acked-by: Neil Brown
    Acked-by: Ryusuke Konishi
    Acked-by: Mike Snitzer
    Acked-by: Philipp Reisner
    Cc: Peter Osterlund
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Jan Kara
    Cc: Andrew Morton
    Cc: Andreas Dilger
    Cc: "Theodore Ts'o"
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Alex Elder
    Cc: Christoph Hellwig
    Cc: dm-devel@redhat.com
    Cc: drbd-dev@lists.linbit.com
    Cc: Leo Chen
    Cc: Scott Branden
    Cc: Chris Mason
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: reiserfs-devel@vger.kernel.org
    Cc: Alexander Viro

    Tejun Heo
     

11 Nov, 2010

1 commit


10 Nov, 2010

5 commits

  • REQ_HARDBARRIER is dead now, so remove the leftovers. What's left
    at this point is:

    - various checks inside the block layer.
    - sanity checks in bio based drivers.
    - now unused bio_empty_barrier helper.
    - Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it has been dead for a
    while, but Xen really needs to sort out its barrier situation.
    - setting of ordered tags in uas - dead code copied from old scsi
    drivers.
    - scsi different retry for barriers - it's dead and should have been
    removed when flushes were converted to FS requests.
    - blktrace handling of barriers - removed. Someone who knows blktrace
    better should add support for REQ_FLUSH and REQ_FUA, though.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Structure hd_geometry is copied to userland with the 4 padding bytes
    between the cylinders and start fields left uninitialized on 64-bit
    platforms. This leaks contents of kernel stack memory.

    Currently there is no memset() in real implementations of getgeo()
    in drivers/block/, so it makes sense to have memset() in blkdev_ioctl().
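
    The padding and the effect of zeroing first can be sketched in user
    space. struct geo below is a local mirror of the field layout described
    above, and fake_getgeo(), fill_geo() and geo_padding() are hypothetical
    helpers; the field values are made up.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local mirror of the layout described above; an illustration only. */
struct geo {
    unsigned char  heads;
    unsigned char  sectors;
    unsigned short cylinders;
    unsigned long  start;   /* 8-byte aligned on typical 64-bit ABIs */
};

/* Stand-in for a driver's getgeo(): it assigns the fields and nothing else. */
void fake_getgeo(struct geo *g)
{
    g->heads = 16;
    g->sectors = 63;
    g->cylinders = 1024;
    g->start = 2048;
}

/* Zero the whole object first so padding bytes carry no stale data. */
void fill_geo(struct geo *g)
{
    assert(g != NULL);
    memset(g, 0, sizeof(*g));
    fake_getgeo(g);
}

/* Number of padding bytes between cylinders and start on this ABI. */
size_t geo_padding(void)
{
    return offsetof(struct geo, start)
         - (offsetof(struct geo, cylinders) + sizeof(unsigned short));
}
```

    Doing the memset() once in the common caller covers every driver
    callback, which is the same reasoning the commit gives for placing it
    in blkdev_ioctl().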

    Signed-off-by: Vasiliy Kulikov
    Signed-off-by: Jens Axboe

    Vasiliy Kulikov
     
  • Convert direct reads of an inode's i_size to using i_size_read().

    i_size_{read,write} use a seqcount to protect reads from observing
    incomplete writes. Concurrent i_size_write()s require mutual exclusion
    to protect the seqcount that is used by i_size_{read,write}. But
    i_size_read() callers do not need to use additional locking.
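
    The read-retry pattern behind a seqcount can be sketched with C11
    atomics. This is a simplified model, not the kernel implementation:
    struct sized, size_write() and size_read() are hypothetical, and the
    value itself is kept in an atomic only so that the example stays
    well-defined in standard C.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of a seqcount-protected size; not the kernel's code. */
struct sized {
    atomic_uint  seq;       /* odd while a write is in progress */
    atomic_llong size;
};

/* Writers must already be serialized against each other by the caller. */
void size_write(struct sized *s, long long v)
{
    unsigned int before = atomic_fetch_add(&s->seq, 1);    /* becomes odd */

    assert((before & 1u) == 0);     /* no overlapping writer allowed */
    atomic_store(&s->size, v);
    atomic_fetch_add(&s->seq, 1);   /* becomes even again */
}

/* Readers take no lock; they retry if a write overlapped the read. */
long long size_read(struct sized *s)
{
    unsigned int begin;
    long long v;

    do {
        begin = atomic_load(&s->seq);
        v = atomic_load(&s->size);
    } while ((begin & 1u) || atomic_load(&s->seq) != begin);

    return v;
}
```

    The assertion in size_write() marks the spot where two unsynchronized
    writers would corrupt the sequence counter, which is why writers need
    mutual exclusion while readers do not.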

    Signed-off-by: Mike Snitzer
    Acked-by: NeilBrown
    Acked-by: Lars Ellenberg
    Signed-off-by: Jens Axboe

    Mike Snitzer
     
  • Reported-by: Dan Rosenberg
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Ensure that we pass down properly validated iov segments before
    calling into the mapping or copy functions.

    Reported-by: Dan Rosenberg
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe