02 Mar, 2011

4 commits

  • blk-flush decomposes a flush into a sequence of multiple requests. On
    completion of a request, the next one is queued; however, the block
    layer must not implicitly call into q->request_fn() directly from the
    completion path. This makes the queue behave unexpectedly from the
    drivers' point of view and violates the assumption that q->request_fn()
    is called with process context + queue_lock.

    This patch makes the following two changes to blk-flush so that
    q->request_fn() is not called directly from the request completion path.

    - blk_flush_complete_seq_end_io() now asks __blk_run_queue() to always
    use kblockd instead of calling directly into q->request_fn().

    - queue_next_fseq() uses ELEVATOR_INSERT_REQUEUE instead of
    ELEVATOR_INSERT_FRONT so that elv_insert() doesn't try to unplug the
    request queue directly.

    Reported by Jan in the following threads.

    http://thread.gmane.org/gmane.linux.ide/48778
    http://thread.gmane.org/gmane.linux.ide/48786

    stable: applicable to v2.6.37.

    Signed-off-by: Tejun Heo
    Reported-by: Jan Beulich
    Cc: "David S. Miller"
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • __blk_run_queue() automatically either calls q->request_fn() directly
    or schedules kblockd depending on whether the function is being called
    recursively. The blk-flush implementation needs to be able to explicitly
    choose kblockd. Add @force_kblockd.

    All the current users are converted to specify %false for the
    parameter and this patch doesn't introduce any behavior change.

    stable: This is a prerequisite for fixing an ide oops caused by the new
    blk-flush implementation.

    Signed-off-by: Tejun Heo
    Cc: Jan Beulich
    Cc: James Bottomley
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
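    A minimal sketch of the interface described in this entry and its use by
    the blk-flush change above; the bodies and field names are illustrative,
    not verbatim mainline code:

        /* blk-core.c: let callers force the queue to be run from kblockd
         * instead of recursing into q->request_fn() on their own stack.  */
        void __blk_run_queue(struct request_queue *q, bool force_kblockd)
        {
                if (unlikely(blk_queue_stopped(q)))
                        return;

                if (!force_kblockd &&
                    !queue_flag_test_and_set(QUEUE_FLAG_REENTER, q)) {
                        q->request_fn(q);   /* not recursing: call directly */
                        queue_flag_clear(QUEUE_FLAG_REENTER, q);
                } else {
                        kblockd_schedule_work(q, &q->unplug_work);
                }
        }

    Existing callers keep the old behaviour by passing false, while the
    flush completion path (previous entry) always passes true so the queue
    is run from process context via kblockd:

        __blk_run_queue(q, false);   /* normal callers                  */
        __blk_run_queue(q, true);    /* blk_flush_complete_seq_end_io() */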
     
  • Signed-off-by: Ben Hutchings
    Signed-off-by: Andrew Morton
    Signed-off-by: Jens Axboe

    Ben Hutchings
     
  • o Dominik Klein reported a system hang issue while doing some blkio
    throttling testing.

    https://lkml.org/lkml/2011/2/24/173

    o Some tracing revealed that CFQ was not dispatching any more jobs
    because queue unplug was not happening. And queue unplug was not
    happening because the unplug work was not being called, as there was one
    throttling work item on the same cpu which had not finished yet. And the
    throttling work had not finished because it was trying to dispatch a bio
    to CFQ, but all the request descriptors were consumed, so it was put to
    sleep.

    o So basically it is a cyclic dependency between the CFQ unplug work and
    the throtl dispatch work. Tejun suggested using a separate workqueue for
    such cases.

    o This patch uses a separate workqueue for throttle related work and
    does not rely on kblockd workqueue anymore.

    Cc: stable@kernel.org
    Reported-by: Dominik Klein
    Signed-off-by: Vivek Goyal
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Vivek Goyal
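    A minimal sketch of the separate-workqueue approach described above
    (workqueue name and field names are illustrative):

        /* blk-throttle.c: give throttle dispatch its own workqueue so it
         * can neither block nor be blocked by kblockd's unplug work.     */
        static struct workqueue_struct *kthrotld_workqueue;

        static int __init throtl_init(void)
        {
                kthrotld_workqueue = alloc_workqueue("kthrotld",
                                                     WQ_MEM_RECLAIM, 0);
                if (!kthrotld_workqueue)
                        panic("Failed to create kthrotld\n");
                return 0;
        }

        /* ...and dispatch work is queued there instead of on kblockd: */
        queue_delayed_work(kthrotld_workqueue, &td->throtl_work, delay);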
     

26 Feb, 2011

1 commit

  • * 'for-linus' of git://neil.brown.name/md:
    md: Fix - again - partition detection when array becomes active
    Fix over-zealous flush_disk when changing device size.
    md: avoid spinlock problem in blk_throtl_exit
    md: correctly handle probe of an 'mdp' device.
    md: don't set_capacity before array is active.
    md: Fix raid1->raid0 takeover

    Linus Torvalds
     

25 Feb, 2011

1 commit

  • Adam Kovari and others reported that disconnecting a USB drive with
    an ntfs-3g filesystem would cause "kernel BUG at fs/inode.c:1421!" to
    be triggered.

    The BUG could be traced back to ioctl(BLKBSZSET), which would
    erroneously decrement the refcount on the bdev. This is because
    blkdev_get() expects the refcount to be already incremented and either
    returns success or decrements the refcount and returns an error.

    The bug was introduced by e525fd89 (block: make blkdev_get/put()
    handle exclusive access), which didn't take into account this behavior
    of blkdev_get().

    This fixes
    https://bugzilla.kernel.org/show_bug.cgi?id=29202
    (and likely 29792 too)

    Reported-by: Adam Kovari
    Acked-by: Tejun Heo
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
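    Since blkdev_get() consumes a reference whether it succeeds or fails,
    the fix is to give it a reference of its own before calling it from the
    ioctl path. A minimal sketch, assuming a bdgrab()-style helper that
    bumps the bdev refcount; the wrapper function is illustrative:

        /* block/ioctl.c, BLKBSZSET (sketch): the ioctl caller still holds
         * its own reference to bdev, so take an extra one for blkdev_get()
         * to consume on either outcome.                                   */
        static int blkbszset(struct block_device *bdev, fmode_t mode, int n)
        {
                int ret;

                bdgrab(bdev);                 /* extra ref for blkdev_get() */
                ret = blkdev_get(bdev, mode | FMODE_EXCL, &bdev);
                if (ret < 0)
                        return ret;           /* blkdev_get() dropped it */

                ret = set_blocksize(bdev, n);
                blkdev_put(bdev, mode | FMODE_EXCL);  /* drops the extra ref */
                return ret;
        }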
     

24 Feb, 2011

1 commit

  • There are two cases when we call flush_disk.
    In one, the device has disappeared (check_disk_change) so any
    data we hold becomes irrelevant.
    In the other, the device has changed size (check_disk_size_change)
    so data we hold may be irrelevant.

    In both cases it makes sense to discard any 'clean' buffers,
    so they will be read back from the device if needed.

    In the former case it makes sense to discard 'dirty' buffers
    as there will never be anywhere safe to write the data. In the
    second case it *does*not* make sense to discard dirty buffers
    as that will lead to file system corruption when you simply enlarge
    the containing device.

    flush_disk calls __invalidate_device.
    __invalidate_device calls both invalidate_inodes and invalidate_bdev.

    invalidate_inodes *does* discard I_DIRTY inodes and this does lead
    to fs corruption.

    invalidate_bdev *does*not* discard dirty pages, but I don't really care
    about that at present.

    So this patch adds a flag to __invalidate_device (calling it
    __invalidate_device2) to indicate whether dirty buffers should be
    killed, and this is passed to invalidate_inodes which can choose to
    skip dirty inodes.

    flush_disk then passes true from check_disk_change and false from
    check_disk_size_change.

    dm avoids tripping over this problem by calling i_size_write directly
    rather than using check_disk_size_change.

    md does use check_disk_size_change and so is affected.

    This regression was introduced by commit 608aeef17a which causes
    check_disk_size_change to call flush_disk, so it is suitable for any
    kernel since 2.6.27.

    Cc: stable@kernel.org
    Acked-by: Jeff Moyer
    Cc: Andrew Patterson
    Cc: Jens Axboe
    Signed-off-by: NeilBrown

    NeilBrown
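    A minimal sketch of the flag plumbing described above (the entry calls
    the new variant __invalidate_device2; the bodies are illustrative):

        /* fs/block_dev.c: callers now say whether dirty state may be
         * thrown away along with the clean buffers.                   */
        static void flush_disk(struct block_device *bdev, bool kill_dirty)
        {
                if (__invalidate_device2(bdev, kill_dirty))
                        printk(KERN_WARNING "VFS: busy inodes on changed "
                               "media or resized disk %s\n",
                               bdev->bd_disk ? bdev->bd_disk->disk_name : "");
                /* ... existing invalidation of cached partition state ... */
        }

        void check_disk_change(struct block_device *bdev)
        {
                /* device is gone: dirty data can never be written back */
                flush_disk(bdev, true);
                /* ... */
        }

        static void check_disk_size_change(struct gendisk *disk,
                                           struct block_device *bdev)
        {
                /* device only changed size: discarding dirty data would
                 * corrupt a filesystem that was simply grown             */
                flush_disk(bdev, false);
                /* ... */
        }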
     

10 Feb, 2011

1 commit

  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    cdrom: support devices that have check_events but not media_changed
    cfq-iosched: Don't wait if queue already has requests.
    blkio-throttle: Avoid calling blkiocg_lookup_group() for root group
    cfq: rename a function to give it more appropriate name
    cciss: make cciss_revalidate not loop through CISS_MAX_LUNS volumes unnecessarily.
    drivers/block/aoe/Makefile: replace the use of -objs with -y
    loop: queue_lock NULL pointer derefence in blk_throtl_exit
    drivers/block/Makefile: replace the use of -objs with -y
    blktrace: Don't output messages if NOTIFY isn't set.

    Linus Torvalds
     

09 Feb, 2011

1 commit

  • Commit 7667aa0630407bc07dc38dcc79d29cc0a65553c1 added logic to wait for
    the last queue of the group to become busy (have at least one request),
    so that the group does not lose out for not being continuously
    backlogged. The commit did not check for the condition that the last
    queue already has some requests. As a result, if the queue already has
    requests, wait_busy is set. Later on, cfq_select_queue() checks the
    flag, and decides that since the queue has a request now and wait_busy
    is set, the queue is expired. This results in early expiration of the
    queue.

    This patch fixes the problem by adding a check to see if queue already
    has requests. If it does, wait_busy is not set. As a result, time slices
    do not expire early.

    The queues with more than one request are usually buffered writers.
    Testing shows improvement in isolation between buffered writers.

    Cc: stable@kernel.org
    Signed-off-by: Justin TerAvest
    Reviewed-by: Gui Jianfeng
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Justin TerAvest
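    A minimal sketch of the added check (the surrounding helper exists in
    cfq-iosched.c; the exact placement shown here is illustrative):

        /* cfq-iosched.c: only arm wait_busy for a queue that is about to
         * run dry; a queue that still has requests stays backlogged on
         * its own, and setting wait_busy would just make
         * cfq_select_queue() expire it early.                            */
        static bool cfq_should_wait_busy(struct cfq_data *cfqd,
                                         struct cfq_queue *cfqq)
        {
                if (!RB_EMPTY_ROOT(&cfqq->sort_list))
                        return false;       /* queue already has requests */

                /* ... existing "last queue in the group" checks ... */
                return true;
        }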
     

21 Jan, 2011

1 commit

  • The meaning of CONFIG_EMBEDDED has long since been obsoleted; the option
    is used to configure any non-standard kernel with a much larger scope than
    only small devices.

    This patch renames the option to CONFIG_EXPERT in init/Kconfig and fixes
    references to the option throughout the kernel. A new CONFIG_EMBEDDED
    option is added that automatically selects CONFIG_EXPERT when enabled and
    can be used in the future to isolate options that should only be
    considered for embedded systems (RISC architectures, SLOB, etc).

    Calling the option "EXPERT" more accurately represents its intention: only
    expert users who understand the impact of the configuration changes they
    are making should enable it.

    Reviewed-by: Ingo Molnar
    Acked-by: David Woodhouse
    Signed-off-by: David Rientjes
    Cc: Greg KH
    Cc: "David S. Miller"
    Cc: Jens Axboe
    Cc: Arnd Bergmann
    Cc: Robin Holt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

19 Jan, 2011

2 commits

  • o Jeff Moyer was doing some testing on a RAM backed disk and
    blkiocg_lookup_group() showed up as high overhead right after memcpy().
    Similarly, somebody else reported that blkiocg_lookup_group() is eating
    6% extra cpu. Looking at the code I can't tell why the overhead of
    this function is so high. One thing is that it is called with very high
    frequency (once for every IO).

    o For a lot of folks the blkio controller will be compiled in but they
    might not have actually created cgroups. Hence optimize the root cgroup
    case, where we can avoid calling blkiocg_lookup_group() if IO is
    happening in the root group (the common case).

    Reported-by: Jeff Moyer
    Signed-off-by: Vivek Goyal
    Acked-by: Jeff Moyer
    Signed-off-by: Jens Axboe

    Vivek Goyal
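    A minimal sketch of the common-case shortcut described above (helper
    names are illustrative):

        /* blk-throttle.c: IO from the root cgroup - the common case when
         * the controller is compiled in but no cgroups have been created -
         * can use the cached root group and skip the per-bio lookup.     */
        static struct throtl_grp *throtl_get_tg(struct throtl_data *td,
                                                struct blkio_cgroup *blkcg)
        {
                if (blkcg == &blkio_root_cgroup)
                        return td->root_tg;   /* no blkiocg_lookup_group() */

                return throtl_find_alloc_tg(td, blkcg);  /* slow path */
        }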
     
  • o Rename a function to give it a more appropriate name. We are calculating
    the cfq queue slice, but the function name gives the impression that the
    cfq group slice length is being calculated.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

14 Jan, 2011

3 commits

  • If a queue is preempted before it gets a slice assigned, the queue doesn't
    get compensation, which looks unfair. For such a queue, we compensate it
    for a whole slice.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • I got this:
    fio-874 [007] 2157.724514: 8,32 m N cfq874 preempt
    fio-874 [007] 2157.724519: 8,32 m N cfq830 slice expired t=1
    fio-874 [007] 2157.724520: 8,32 m N cfq830 sl_used=1 disp=0 charge=1 iops=0 sect=0
    fio-874 [007] 2157.724521: 8,32 m N cfq830 set_active wl_prio:0 wl_type:0
    fio-874 [007] 2157.724522: 8,32 m N cfq830 Not idling. st->count:1

    cfq830 is an async queue, and is preempted by a sync queue, cfq874. But
    since we have the cfqg->saved_workload_slice mechanism, the preempt is a
    nop. It looks like our preempt is currently totally broken if the two
    queues are not from the same workload type.
    The patch below fixes it. This might lead to async queue starvation, but
    that is what our old code did before cgroup support was added.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • * 'for-2.6.38/core' of git://git.kernel.dk/linux-2.6-block: (43 commits)
    block: ensure that completion error gets properly traced
    blktrace: add missing probe argument to block_bio_complete
    block cfq: don't use atomic_t for cfq_group
    block cfq: don't use atomic_t for cfq_queue
    block: trace event block fix unassigned field
    block: add internal hd part table references
    block: fix accounting bug on cross partition merges
    kref: add kref_test_and_get
    bio-integrity: mark kintegrityd_wq highpri and CPU intensive
    block: make kblockd_workqueue smarter
    Revert "sd: implement sd_check_events()"
    block: Clean up exit_io_context() source code.
    Fix compile warnings due to missing removal of a 'ret' variable
    fs/block: type signature of major_to_index(int) to major_to_index(unsigned)
    block: convert !IS_ERR(p) && p to !IS_ERR_OR_NULL(p)
    cfq-iosched: don't check cfqg in choose_service_tree()
    fs/splice: Pull buf->ops->confirm() from splice_from_pipe actors
    cdrom: export cdrom_check_events()
    sd: implement sd_check_events()
    sr: implement sr_check_events()
    ...

    Linus Torvalds
     

13 Jan, 2011

1 commit


07 Jan, 2011

3 commits


05 Jan, 2011

1 commit

  • /proc/diskstats would display a strange output as follows.

    $ cat /proc/diskstats |grep sda
    8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
    8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
    ~~~~~~~~~~
    8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
    8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
    8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
    8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

    The reason is the wrong accounting of hd_struct->in_flight when a bio is
    merged into a request belonging to a different partition by
    ELEVATOR_FRONT_MERGE.

    The detailed root cause is as follows.

    Assuming that there are two partitions, sda1 and sda2.

    1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
    is 0 and sda2's one is 1.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    2. A bio belonging to sda1 is issued and is merged into the request
    mentioned in step 1 by ELEVATOR_FRONT_MERGE. The first sector of the
    request is changed from the sda2 region to the sda1 region. However, the
    two partitions' hd_struct->in_flight are not changed.

    | hd_struct->in_flight
    ---------------------------
    sda1 | 0
    sda2 | 1
    ---------------------------

    3. The request is finished and blk_account_io_done() is called. In this
    case, sda1's hd_struct->in_flight, not sda2's, is decremented.

    | hd_struct->in_flight
    ---------------------------
    sda1 | -1
    sda2 | 1
    ---------------------------

    The patch fixes the problem by caching the partition lookup
    inside the request structure, hence making sure that the increment
    and decrement will always happen on the same partition struct. This
    also speeds up IO with accounting enabled, since it cuts down on
    the number of lookups we have to do.

    Also add a refcount to struct hd_struct to keep the partition in
    memory as long as users exist. We use kref_test_and_get() to ensure
    we don't add a reference to a partition which is going away.

    Signed-off-by: Jerome Marchand
    Signed-off-by: Yasuaki Ishimatsu
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jerome Marchand
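    A minimal sketch of the caching described above (helper names follow the
    existing accounting code; the reference counting mentioned in the entry
    is omitted for brevity, and the bodies are illustrative). The partition
    is looked up once when accounting starts, and both the increment and the
    decrement use that same cached pointer:

        /* blk-core.c */
        static void blk_account_io_start(struct request *rq, struct bio *bio)
        {
                int rw = rq_data_dir(rq);
                struct hd_struct *part;

                part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
                rq->part = part;          /* cache the accounted partition */
                part_inc_in_flight(part, rw);
        }

        static void blk_account_io_done(struct request *rq)
        {
                /* decrement exactly the partition we incremented, even if a
                 * front merge later moved the request's start sector into a
                 * different partition                                       */
                part_dec_in_flight(rq->part, rq_data_dir(rq));
        }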
     

03 Jan, 2011

1 commit

  • kblockd is used for unplugging and may affect IO latency and
    throughput, and the max number of concurrent work items is bounded by
    the number of block devices. Make it a HIGHPRI workqueue w/ default max
    concurrency.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
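    A minimal sketch of the conversion described above (the flag combination
    is illustrative):

        /* blk-core.c: high priority, default (unbounded) max_active */
        kblockd_workqueue = alloc_workqueue("kblockd",
                                            WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);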
     

23 Dec, 2010

1 commit


21 Dec, 2010

2 commits

  • This patch fixes a spelling error in a source code comment and removes
    superfluous braces in the function exit_io_context().

    Signed-off-by: Bart Van Assche
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Bart Van Assche
     
  • * 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    cciss: fix cciss_revalidate panic
    block: max hardware sectors limit wrapper
    block: Deprecate QUEUE_FLAG_CLUSTER and use queue_limits instead
    blk-throttle: Correct the placement of smp_rmb()
    blk-throttle: Trim/adjust slice_end once a bio has been dispatched
    block: check for proper length of iov entries earlier in blk_rq_map_user_iov()
    drbd: fix for spin_lock_irqsave in endio callback
    drbd: don't recvmsg with zero length

    Linus Torvalds
     

17 Dec, 2010

8 commits

  • The major/minor device numbers are always defined and used as `unsigned'.

    Signed-off-by: Yang Zhang
    Signed-off-by: Jens Axboe

    Yang Zhang
     
  • Signed-off-by: Yang Zhang
    Signed-off-by: Jens Axboe

    Yang Zhang
     
  • When cfq_choose_cfqg() is called in select_queue(), there must be at least
    one backlogged CFQ queue waiting to be dispatched, hence there must be at
    least one backlogged CFQ group on the service tree. So we never call
    choose_service_tree() with cfqg == NULL.

    Signed-off-by: Gui Jianfeng
    Reviewed-by: Jeff Moyer
    Acked-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Gui Jianfeng
     
  • Implement blk_limits_max_hw_sectors() and make
    blk_queue_max_hw_sectors() a wrapper around it.

    DM needs this to avoid setting queue_limits' max_hw_sectors and
    max_sectors directly. dm_set_device_limits() now leverages
    blk_limits_max_hw_sectors() logic to establish the appropriate
    max_hw_sectors minimum (PAGE_SIZE). Fixes issue where DM was
    incorrectly setting max_sectors rather than max_hw_sectors (which
    caused dm_merge_bvec()'s max_hw_sectors check to be ineffective).

    Signed-off-by: Mike Snitzer
    Cc: stable@kernel.org
    Acked-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer
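    A minimal sketch of the wrapper split described above (the minimum
    enforcement mirrors the existing blk_queue_max_hw_sectors() logic;
    details are illustrative):

        /* block/blk-settings.c: operate on bare queue_limits so stacking
         * drivers such as DM can reuse the logic before a request_queue
         * exists.                                                        */
        void blk_limits_max_hw_sectors(struct queue_limits *limits,
                                       unsigned int max_hw_sectors)
        {
                if ((max_hw_sectors << 9) < PAGE_CACHE_SIZE) {
                        max_hw_sectors = 1 << (PAGE_CACHE_SHIFT - 9);
                        printk(KERN_INFO "%s: set to minimum %d\n",
                               __func__, max_hw_sectors);
                }

                limits->max_hw_sectors = max_hw_sectors;
                limits->max_sectors = min_t(unsigned int, max_hw_sectors,
                                            BLK_DEF_MAX_SECTORS);
        }

        void blk_queue_max_hw_sectors(struct request_queue *q,
                                      unsigned int max_hw_sectors)
        {
                blk_limits_max_hw_sectors(&q->limits, max_hw_sectors);
        }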
     
  • When stacking devices, a request_queue is not always available. This
    forced us to have a no_cluster flag in the queue_limits that could be
    used as a carrier until the request_queue had been set up for a
    metadevice.

    There were several problems with that approach. First of all it was up
    to the stacking device to remember to set queue flag after stacking had
    completed. Also, the queue flag and the queue limits had to be kept in
    sync at all times. We got that wrong, which could lead to us issuing
    commands that went beyond the max scatterlist limit set by the driver.

    The proper fix is to avoid having two flags for tracking the same thing.
    We deprecate QUEUE_FLAG_CLUSTER and use the queue limit directly in the
    block layer merging functions. The queue_limit 'no_cluster' is turned
    into 'cluster' to avoid double negatives and to ease stacking.
    Clustering defaults to being enabled as before. The queue flag logic is
    removed from the stacking function, and explicitly setting the cluster
    flag is no longer necessary in DM and MD.

    Reported-by: Ed Lin
    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Martin K. Petersen
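    A minimal sketch of reading the limit directly (the helper matches the
    description above; its exact home is illustrative):

        /* include/linux/blkdev.h: merge code asks the queue_limits instead
         * of testing the deprecated QUEUE_FLAG_CLUSTER.                   */
        static inline unsigned int blk_queue_cluster(struct request_queue *q)
        {
                return q->limits.cluster;
        }

    The merging functions in blk-merge.c then call blk_queue_cluster(q)
    where they previously tested the queue flag, and stacking only has to
    propagate the single 'cluster' limit.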
     
  • Currently, media presence polling for removable block devices is done
    from userland. There are several issues with this.

    * Polling is done by periodically opening the device. For SCSI
    devices, the command sequence generated by such an action involves a
    few different commands including TEST_UNIT_READY. This behavior,
    while perfectly legal, is different from Windows, which only issues a
    single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some
    ATAPI devices lock up after being periodically queried with such
    command sequences.

    * There is no reliable and unintrusive way for a userland program to
    tell whether the target device is safe for media presence polling.
    For example, polling for media presence during an on-going burning
    session can make it fail. The polling program can avoid this by
    opening the device with O_EXCL but then it risks making a valid
    exclusive user of the device fail w/ -EBUSY.

    * Userland polling is unnecessarily heavy and in-kernel implementation
    is lighter and better coordinated (workqueue, timer slack).

    This patch implements framework for in-kernel disk event handling,
    which includes media presence polling.

    * bdops->check_events() is added, which supersedes ->media_changed().
    It should check whether there's any pending event and return it if so.
    Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and
    DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be
    called in parallel.

    * gendisk->events and ->async_events are added. These should be
    initialized by block driver before passing the device to add_disk().
    The former contains the mask of all supported events and the latter
    the mask of all events which the device can report without polling.
    /sys/block/*/events[_async] export these to userland.

    * Kernel parameter block.events_dfl_poll_msecs controls the system
    polling interval (default is 0 which means disable) and
    /sys/block/*/events_poll_msecs control polling intervals for
    individual devices (default is -1 meaning use system setting). Note
    that if a device can report all supported events asynchronously and
    its polling interval isn't explicitly set, the device won't be
    polled regardless of the system polling interval.

    * If a device is opened exclusively with write access, event checking
    is automatically disabled until all write exclusive accesses are
    released.

    * There are event 'clearing' events. For example, both of currently
    defined events are cleared after the device has been successfully
    opened. This information is passed to ->check_events() callback
    using @clearing argument as a hint.

    * Event checking is always performed from system_nrt_wq and timer
    slack is set to 25% for polling.

    * Nothing changes for drivers which implement ->media_changed() but
    not ->check_events(). Going forward, all drivers will be converted
    to ->check_events() and ->media_changed() will be dropped.

    Signed-off-by: Tejun Heo
    Cc: Kay Sievers
    Cc: Jan Kara
    Signed-off-by: Jens Axboe

    Tejun Heo
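    A minimal sketch of what a driver-side conversion looks like under the
    framework described above (the driver, its helper and the probe-path
    assignments are hypothetical):

        /* A removable-media driver advertising media-change events. */
        static unsigned int foo_check_events(struct gendisk *disk,
                                             unsigned int clearing)
        {
                unsigned int events = 0;

                /* driver-specific, non-intrusive presence check */
                if (foo_media_changed(disk->private_data))
                        events |= DISK_EVENT_MEDIA_CHANGE;

                /* 'clearing' hints which pending events were consumed */
                return events;
        }

        static const struct block_device_operations foo_fops = {
                .owner        = THIS_MODULE,
                .check_events = foo_check_events,
        };

        /* in the probe path, before add_disk(): */
        disk->fops         = &foo_fops;
        disk->events       = DISK_EVENT_MEDIA_CHANGE;   /* supported        */
        disk->async_events = 0;       /* nothing reported without polling */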
     
  • There's no reason for register_disk() and del_gendisk() to be in
    fs/partitions/check.c. Move both to genhd.c. While at it, collapse
    unlink_gendisk(), which was artificially in a separate function due to
    genhd.c / check.c split, into del_gendisk().

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     
  • There's no user of the facility. Kill it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Tejun Heo
     

13 Dec, 2010

1 commit


09 Dec, 2010

1 commit

  • This patch corrects an issue in bsg that results in a general protection
    fault if an LLD is removed while an application is using an open file
    handle to a bsg device, and the application issues an ioctl. The fault
    occurs because the class_dev is NULL, having been cleared in
    bsg_unregister_queue() when the driver was removed. With this
    patch, a check is made for the class_dev, and the application
    will receive ENXIO if the related object is gone.

    Signed-off-by: Carl Lajeunesse
    Signed-off-by: James Smart
    Signed-off-by: James Bottomley

    James Smart
     

02 Dec, 2010

2 commits

  • o I was discussing which variables are being updated without a spin lock
    and why we need barriers, and Oleg pointed out that the smp_rmb() should
    be located between the read of td->limits_changed and the read of
    tg->limits_changed. This patch fixes it.

    o Following is one possible sequence of events. Say cpu0 is executing
    throtl_update_blkio_group_read_bps() and cpu1 is executing
    throtl_process_limit_change().

    cpu0                                    cpu1
    ----                                    ----
    tg->limits_changed = true;
    smp_mb__before_atomic_inc();
    atomic_inc(&td->limits_changed);

                                            if (!atomic_read(&td->limits_changed))
                                                    return;

                                            if (tg->limits_changed)
                                                    do_something;

    If cpu0 has updated tg->limits_changed and td->limits_changed, we want to
    make sure that if the update to td->limits_changed is visible on cpu1,
    then the update to tg->limits_changed is also visible.

    Oleg pointed out that to ensure this we need to insert an smp_rmb()
    between the td->limits_changed read and the tg->limits_changed read.

    o I had erroneously put smp_rmb() before atomic_read(&td->limits_changed).
    This patch fixes it.

    Reported-by: Oleg Nesterov
    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
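    A minimal sketch of the corrected reader side (the group iteration and
    per-group update helpers are illustrative):

        /* blk-throttle.c, throtl_process_limit_change() (sketch): the rmb
         * must sit between the two reads so that seeing the
         * td->limits_changed increment guarantees we also see the matching
         * tg->limits_changed update made before it.                       */
        if (!atomic_read(&td->limits_changed))
                return;

        smp_rmb();   /* pairs with smp_mb__before_atomic_inc() on the writer */

        throtl_for_each_tg(td, tg) {            /* illustrative iterator */
                if (tg->limits_changed)
                        throtl_update_tg_limits(td, tg);
        }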
     
  • o During some testing I did the following and noticed that throttling
    stops working.

    - Put a very low limit on a cgroup, say 1 byte per second.
    - Start some reads, this will set slice_end to a very high value.
    - Change the limit to higher value say 1MB/s
    - Now IO unthrottles and finishes as expected.
    - Try to do the read again but IO is not limited to 1MB/s as expected.

    o What is happening.
    - The initially low value of the limit sets slice_end to a very high value.
    - While updating the limit, slice_end is not truncated.
    - The very high value of slice_end keeps the existing slice valid for a
    very long time and a new slice does not start.
    - tg_may_dispatch() is called in blk_throtl_bio(), and trim_slice()
    is not called in this path. So slice_start is some old value and
    practically we are able to do a huge amount of IO.

    o There are many ways it can be fixed. I have fixed it by adjusting/cleaning
    up slice_end in trim_slice(). Generally we extend slices if a bio is big
    and can't be dispatched in one slice. After dispatching a bio, readjust
    slice_end to make sure we don't end up with huge values.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
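    A minimal sketch of trimming slice_end after a dispatch (field and
    constant names follow blk-throttle; the exact body is illustrative):

        /* blk-throttle.c, inside throtl_trim_slice() (sketch): a bio has
         * been dispatched, so pull an over-extended slice_end back. A huge,
         * stale slice_end set up under an old, very low limit would
         * otherwise keep a new slice from starting.                       */
        if (time_after(tg->slice_end[rw], jiffies + throtl_slice))
                tg->slice_end[rw] = jiffies + throtl_slice;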
     

01 Dec, 2010

2 commits


29 Nov, 2010

1 commit


28 Nov, 2010

1 commit