28 May, 2011

1 commit

  • Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block:
    loop: export module parameters
    block: export blk_{get,put}_queue()
    block: remove unused variable in bio_attempt_front_merge()
    block: always allocate genhd->ev if check_events is implemented
    brd: export module parameters
    brd: fix comment on initial device creation
    brd: handle on-demand devices correctly
    brd: limit 'max_part' module param to DISK_MAX_PARTS
    brd: get rid of unused members from struct brd_device
    block: fix oops on !disk->queue and sysfs discard alignment display

    Linus Torvalds
     

27 May, 2011

4 commits

  • We need blk_get_queue() and blk_put_queue() in SCSI to fix a bug, but
    currently they are not exported to modules. Export them.
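
    A minimal sketch of what the export amounts to (exact placement in
    block/blk-core.c is assumed):

        /* Sketch: make the queue refcount helpers visible to modular
         * drivers such as SCSI. */
        #include <linux/module.h>

        EXPORT_SYMBOL(blk_get_queue);
        EXPORT_SYMBOL(blk_put_queue);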

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Add cgroup subsystem callbacks for per-thread attachment in atomic contexts

    Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
    for the cgroup subsystem interface. Unlike can_attach and attach, these
    are per-thread operations, to be called potentially many times when
    attaching an entire threadgroup.

    Also, the old "bool threadgroup" interface is removed, replaced by this
    one. All subsystems are modified for the new interface; of note is
    cpuset, which requires the from/to nodemasks for attach to be globally
    scoped (though per-cpuset would work too) so that they persist from its
    pre_attach to attach_task and attach.

    This is a pre-patch for cgroup-procs-writable.patch.
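
    A sketch of the per-thread hooks added to the subsystem interface
    (signatures assumed from the changelog; the surrounding members of
    struct cgroup_subsys are elided):

        /* include/linux/cgroup.h (sketch) */
        struct cgroup_subsys {
                /* ... existing whole-group can_attach/attach callbacks ... */
                int  (*can_attach_task)(struct cgroup *cgrp,
                                        struct task_struct *tsk);
                void (*pre_attach)(struct cgroup *cgrp);
                void (*attach_task)(struct cgroup *cgrp,
                                    struct task_struct *tsk);
                /* ... */
        };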

    Signed-off-by: Ben Blum
    Cc: "Eric W. Biederman"
    Cc: Li Zefan
    Cc: Matt Helsley
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Blum
     
  • The sector variable is never read inside bio_attempt_front_merge(), so
    remove it.

    Signed-off-by: Luca Tettamanti
    Signed-off-by: Jens Axboe

    Luca Tettamanti
     
  • 9fd097b149 (block: unexport DISK_EVENT_MEDIA_CHANGE for legacy/fringe
    drivers) removed DISK_EVENT_MEDIA_CHANGE from legacy/fringe block
    drivers which have an inadequate ->check_events(). Combined with the
    earlier change 7c88a168da (block: don't propagate unlisted DISK_EVENTs
    to userland), this enables using ->check_events() for internal
    processing while avoiding in-kernel block event polling, which can lead
    to an infinite event loop.

    Unfortunately, this left many drivers, including floppy, without any
    bit set in disk->events and ->async_events, in which case
    disk_add_events() simply skipped the allocation of disk->ev, disabling
    event handling entirely. As ->check_events() is still used for
    revalidation during open processing, this can lead to open failures.

    This patch always allocates disk->ev if ->check_events() is
    implemented. In the long term, it would make sense to simply embed the
    event structure in genhd, as it is now used by virtually all block
    devices.
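
    A sketch of the intended decision in disk_add_events() (surrounding
    code and the exact allocation details are assumed):

        /* Sketch: allocate disk->ev whenever the driver implements
         * ->check_events(), even if no bits are set in disk->events or
         * disk->async_events. */
        struct disk_events *ev;

        if (!disk->fops->check_events)
                return;

        ev = kzalloc(sizeof(*ev), GFP_KERNEL);
        if (!ev)
                return;
        /* ... initialise mutex, poll interval and delayed work ... */
        disk->ev = ev;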

    Signed-off-by: Tejun Heo
    Reported-by: Ondrej Zary
    Reported-by: Alex Villacis Lasso
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Tejun Heo
     


21 May, 2011

16 commits

  • We don't need them anymore, so kill:

    - REQ_ON_PLUG checks in various places
    - !rq_mergeable() check in plug merging

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • This patch merges in a fix that missed 2.6.39 final.

    Conflicts:
    block/blk.h

    Jens Axboe
     
  • Currently we take the queue lock on each bio to check whether there are
    any throttling rules associated with the group and also to update the
    stats. Now access the group under rcu and update the stats without
    taking the queue lock. The queue lock is taken only if there are
    throttling rules associated with the group.

    So in the common case of the root group, when there are no rules, we
    save the unnecessary pounding of the request queue lock.
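
    A sketch of the resulting fast path (helper and field names here are
    illustrative, not the exact blk-throttle code):

        /* Sketch: charge a bio without q->queue_lock in the common case. */
        rcu_read_lock();
        blkcg = task_blkio_cgroup(current);
        tg = throtl_lookup_tg(td, blkcg);       /* illustrative helper */
        if (tg && !tg->has_rules) {
                /* no limits configured: bump per-cpu stats, no queue lock */
                throtl_update_dispatch_stats(tg, bio->bi_size, bio->bi_rw);
                rcu_read_unlock();
                return false;                   /* bio is not throttled */
        }
        rcu_read_unlock();

        /* rules exist: fall back to the queue lock for full processing */
        spin_lock_irq(q->queue_lock);
        /* ... */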

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Now the dispatch stats update is lock free. But resetting these stats
    still takes blkg->stats_lock and depends on it. As the stats are per
    cpu, we should be able to just reset the stats on each cpu without any
    locks (at least on 64bit archs).

    On 32bit archs there is a small race where 64bit updates are not
    atomic. The result of this race can be that, in the presence of other
    writers, one might not get a 0 value after resetting a stat and might
    see something intermediate.

    One could write more complicated code to cover this race, like sending
    an IPI to the other cpus to reset the stats and resetting them directly
    for offline cpus.

    Right now I am not taking that path because resetting stats is more of
    a debug feature, the race can happen only on 32bit archs, and the
    possibility of it happening is small. Will fix it if it becomes a real
    problem. For the time being, going for code simplicity.
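
    A minimal sketch of the lockless reset (structure and field names are
    assumed; the acknowledged 32bit race is left as is):

        /* Sketch: clear each cpu's copy directly.  On 64bit archs the
         * stores are atomic, so no lock is needed. */
        int cpu;

        for_each_possible_cpu(cpu) {
                struct blkio_group_stats_cpu *sc =
                                per_cpu_ptr(blkg->stats_cpu, cpu);

                memset(sc, 0, sizeof(*sc));
        }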

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Some of the stats are 64bit and updating them is non-atomic on 32bit
    architectures. Use sequence counters on 32bit archs to make reading of
    the stats safe.
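
    One standard way to do this is the generic u64_stats_sync helpers,
    which compile down to a seqcount on 32bit and to nothing on 64bit;
    whether the patch uses these or an open-coded seqcount, and the
    structure names below, are assumptions:

        #include <linux/u64_stats_sync.h>

        struct tg_stat_cpu {                    /* illustrative */
                u64                     sectors;
                struct u64_stats_sync   syncp;
        };

        /* writer side: runs on the local cpu, no lock needed */
        u64_stats_update_begin(&sc->syncp);
        sc->sectors += nr_sectors;
        u64_stats_update_end(&sc->syncp);

        /* reader side: retry if a 32bit update was in progress */
        unsigned int start;
        u64 val;

        do {
                start = u64_stats_fetch_begin(&sc->syncp);
                val = sc->sectors;
        } while (u64_stats_fetch_retry(&sc->syncp, start));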

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Currently we take the blkg_stat lock even just to update the stats. So
    even if a group has no throttling rules (the common case for the root
    group), we end up taking blkg_lock to update the stats.

    Make the dispatch stats per cpu so that they can be updated without
    taking the blkg lock.

    If a cpu goes offline, these stats simply disappear. No protection has
    been provided for that yet. Do we really need anything for that?
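
    A sketch of the per-cpu allocation and lock-free update (names
    illustrative):

        /* one copy of the dispatch stats per cpu, per group */
        struct tg_dispatch_cpu {                /* illustrative */
                u64 serviced;
                u64 service_bytes;
        };

        /* allocation, once per group */
        tg->stats_cpu = alloc_percpu(struct tg_dispatch_cpu);

        /* update path: touch only this cpu's copy, no blkg lock */
        struct tg_dispatch_cpu *sc;
        unsigned long flags;

        local_irq_save(flags);
        sc = this_cpu_ptr(tg->stats_cpu);
        sc->serviced++;
        sc->service_bytes += bytes;
        local_irq_restore(flags);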

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Soon we will allow accessing a throtl_grp under rcu_read_lock(). Hence
    start freeing up throtl_grp after one rcu grace period.
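
    A sketch of deferring the free by one grace period (the rcu_head
    member and callback name are assumed):

        /* struct throtl_grp gains an rcu_head ... */
        struct rcu_head rcu_head;

        static void throtl_free_tg(struct rcu_head *head)
        {
                struct throtl_grp *tg;

                tg = container_of(head, struct throtl_grp, rcu_head);
                kfree(tg);
        }

        /* ... and the final kfree(tg) becomes: */
        call_rcu(&tg->rcu_head, throtl_free_tg);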

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Use the same helper function for the root group as we use with
    dynamically allocated groups to add it to the various lists.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Add a helper function for code that is used in 2-3 places. Makes
    reading the code a little easier.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Currently, we allocate the root throtl_grp statically. But as we will
    be introducing per cpu stat pointers that will be allocated dynamically
    even for the root group, we might as well make the whole root
    throtl_grp allocation dynamic and treat it in the same manner as the
    other groups.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Currently, all the cfq_group or throtl_group allocations happen while
    we are holding ->queue_lock, and sleeping is not allowed.

    Soon, we will move to per cpu stats and will also need to allocate the
    per group stats. As one cannot call alloc_percpu() from atomic context
    because it can sleep, we need to drop ->queue_lock, allocate the group,
    retake the lock and continue processing.

    In the throttling code, I check the queue DEAD flag again to make sure
    that the driver did not call blk_cleanup_queue() in the meantime.
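
    A sketch of the drop/allocate/retake dance (names and the exact DEAD
    flag test are assumed):

        /* Sketch: group lookup runs under q->queue_lock, but
         * alloc_percpu() may sleep, so drop the lock around allocation. */
        spin_unlock_irq(q->queue_lock);

        tg = kzalloc_node(sizeof(*tg), GFP_KERNEL, q->node);
        if (tg)
                tg->stats_cpu = alloc_percpu(struct tg_dispatch_cpu);

        spin_lock_irq(q->queue_lock);

        /* the driver may have run blk_cleanup_queue() while we slept */
        if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) {
                /* free tg and its per-cpu stats, bail out */
                return NULL;
        }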

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • blkg->key = cfqd is an rcu protected pointer and hence we used to do
    call_rcu(cfqd->rcu_head) to free up the cfqd after one rcu grace
    period.

    The problem here is that even though the cfqd is around, there are no
    guarantees that the associated request queue (td->queue) or
    q->queue_lock is still around. A driver might have called
    blk_cleanup_queue() and released the lock.

    It might happen that after the lock has been freed we dereference
    blkg->key->queue->queue_lock and crash. This is possible in the
    following path:

    blkiocg_destroy()
    blkio_unlink_group_fn()
    cfq_unlink_blkio_group()

    Hence, wait for an rcu period if there are groups which have not been
    unlinked from blkcg->blkg_list. That way, any groups which are taking
    the cfq_unlink_blkio_group() path can safely take the queue lock.

    This is how we have taken care of the race in the throttling logic as
    well.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Nobody seems to be using cfq_find_alloc_cfqg() function parameter "create".
    Get rid of that.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • The cgroup unaccounted_time file is created only if
    CONFIG_DEBUG_BLK_CGROUP=y, but some of its fields are outside this
    config option. Fix that.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Group initialization code exists in two places: root group
    initialization in blk_throtl_init() and dynamically allocated groups in
    throtl_find_alloc_tg(). Create a common function and use it in both
    places.

    Signed-off-by: Vivek Goyal
    Signed-off-by: Jens Axboe

    Vivek Goyal
     
  • Since for-2.6.40/core was forked off the 2.6.39 devel tree, we've
    had churn in the core area that makes it difficult to handle
    patches for e.g. cfq or blk-throttle. Instead of requiring that they
    be based on older versions with bugs that have been fixed later
    in the rc cycle, merge in 2.6.39 final.

    Also fix up the conflicts in the files below.

    Conflicts:
    drivers/block/paride/pcd.c
    drivers/cdrom/viocd.c
    drivers/ide/ide-cd.c

    Signed-off-by: Jens Axboe

    Jens Axboe
     

19 May, 2011

1 commit

  • blk_cleanup_queue() calls elevator_exit() and after this, we can't
    touch the elevator without oopsing. __elv_next_request() must check
    for this state because in the refcounted queue model, we can still
    call it after blk_cleanup_queue() has been called.

    This was reported as causing an oops attributable to SCSI.
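
    A sketch of the kind of guard this calls for in __elv_next_request()
    (the exact test used by the patch is assumed):

        /* Sketch: bail out instead of touching the elevator once
         * blk_cleanup_queue() has marked the queue dead. */
        if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags)))
                return NULL;
        /* ... otherwise ask the elevator to dispatch as before ... */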

    Signed-off-by: James Bottomley
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    James Bottomley
     

18 May, 2011

2 commits

  • Consider this scenario:
    1. blk_delay_queue(q, SCSI_QUEUE_DELAY);
    2. blk_run_queue_async();
    The second call becomes a noop, because q->delay_work already has
    WORK_STRUCT_PENDING_BIT set, so the delayed work will still only run
    after SCSI_QUEUE_DELAY. But blk_run_queue_async actually expects the
    delayed work to run immediately.

    Fix this by cancelling any potentially pending delayed work before
    queuing an immediate run of the workqueue.
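
    A sketch of the pattern (blk-core.c details assumed; the point is to
    clear the pending delayed instance before requeueing with no delay):

        /* Sketch: an already-pending delayed work would turn the 0-delay
         * queueing into a noop, so cancel it (non-blocking) first. */
        if (likely(!blk_queue_stopped(q))) {
                __cancel_delayed_work(&q->delay_work);
                queue_delayed_work(kblockd_workqueue, &q->delay_work, 0);
        }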

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • In some cases we would end up stacking discard_zeroes_data incorrectly.
    Fix this by enabling the feature by default for stacking drivers and
    clearing it for low-level drivers. Incorporating a device that does not
    support dzd will then cause the feature to be disabled in the stacking
    driver.

    Also ensure that the maximum discard value does not overflow when
    exported in sysfs and return 0 in the alignment and dzd fields for
    devices that don't support discard.

    Reported-by: Lukas Czerner
    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

16 May, 2011

1 commit

  • Currently we first map the task to its cgroup and then the cgroup to
    the blkio_cgroup. There is a more direct way to get to the blkio_cgroup
    from the task using task_subsys_state(). Use that.

    The real reason for the fix is that it also avoids a race in generic
    cgroup code. During remount/umount rebind_subsystems() is called and it
    can do the following without waiting for an rcu grace period:

    cgrp->subsys[i] = NULL;

    That means that if somebody got hold of the cgroup under rcu and then
    tried to do cgroup->subsys[] to get to the blkio_cgroup, it would get
    NULL, which is wrong. I was running into this race condition with ltp
    running on an upstream-derived kernel and that led to a crash.

    So ideally we should also fix the generic cgroup code to wait for an
    rcu grace period before setting the pointer to NULL. Li Zefan is not
    very keen on introducing synchronize_rcu() there as he thinks it will
    slow down mount/remount/umount operations.

    So for the time being, at least fix the kernel crash by taking a more
    direct route to the blkio_cgroup.

    One tester had reported a crash while running LTP on a derived kernel;
    with this fix the crash is no longer seen and the test has been running
    for over 6 days.
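
    A sketch of the more direct lookup (helper name and the css member are
    assumptions; the caller holds rcu_read_lock()):

        /* Sketch: go straight from the task to its blkio_cgroup instead
         * of via task->cgroups->subsys[]. */
        static inline struct blkio_cgroup *
        task_to_blkio_cgroup(struct task_struct *tsk)
        {
                return container_of(task_subsys_state(tsk, blkio_subsys_id),
                                    struct blkio_cgroup, css);
        }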

    Signed-off-by: Vivek Goyal
    Reviewed-by: Li Zefan
    Signed-off-by: Jens Axboe

    Vivek Goyal
     

07 May, 2011

4 commits

  • Currently we return -EOPNOTSUPP in blkdev_issue_discard() if any of the
    bios fails because the underlying device does not support discard
    requests. However, if the device is, for example, a dm device composed
    of devices of which some support discard and some do not, it is ok for
    some bios to fail with EOPNOTSUPP; it does not mean that discard is not
    supported at all.

    This commit removes the check for bios that failed with EOPNOTSUPP and
    changes blkdev_issue_discard() to return "operation not supported" if
    and only if the device does not actually support it at all, not just
    part of the device as some bios might indicate.

    This change also fixes a problem with the BLKDISCARD ioctl(), which now
    works correctly on such dm devices.

    Signed-off-by: Lukas Czerner
    CC: Jens Axboe
    CC: Jeff Moyer
    Signed-off-by: Jens Axboe

    Lukas Czerner
     
  • In blkdev_issue_zeroout() we are submitting regular WRITE bios, so we
    do not need to check for -EOPNOTSUPP specifically in case of error.
    There is also no need for the submit: label, because there is no way to
    jump out of the while loop without an error and we really want to exit
    rather than try again. Also remove the check for (sz == 0), since at
    that point sz can never be zero.

    Signed-off-by: Lukas Czerner
    Reviewed-by: Jeff Moyer
    CC: Dmitry Monakhov
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Lukas Czerner
     
  • Currently we wait for every submitted REQ_DISCARD bio separately, but
    this can have the unwanted consequence of repeatedly flushing the
    queue, so instead submit the bios in batches and wait for the entire
    batch, hence narrowing the window for other ios to get in.

    Use bio_batch_end_io() and struct bio_batch for that purpose, the same
    as used by blkdev_issue_zeroout(). Also change bio_batch_end_io() so we
    always clear BIO_UPTODATE in the case of an error, and remove the check
    for bb, since we are the only user of this function and we always set
    it.

    Remove bio_get()/bio_put() from blkdev_issue_discard(), since
    bio_alloc() and bio_batch_end_io() are doing the same thing; it is not
    needed anymore.
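
    A sketch of the batching pattern (mirroring what blkdev_issue_zeroout()
    does; field and helper names are assumed):

        struct bio_batch {
                atomic_t                done;
                unsigned long           flags;
                struct completion       *wait;
        };

        static void bio_batch_end_io(struct bio *bio, int err)
        {
                struct bio_batch *bb = bio->bi_private;

                if (err)                /* always flag the error now */
                        clear_bit(BIO_UPTODATE, &bb->flags);
                if (atomic_dec_and_test(&bb->done))
                        complete(bb->wait);
                bio_put(bio);
        }

        /* submit side: start done at 1 so the batch cannot complete
         * before every bio has been issued, then wait for the rest */
        atomic_set(&bb.done, 1);
        /* ... for each chunk: bio->bi_private = &bb,
         *     bio->bi_end_io = bio_batch_end_io,
         *     atomic_inc(&bb.done), submit_bio(type, bio) ... */
        if (!atomic_dec_and_test(&bb.done))
                wait_for_completion(&wait);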

    I have done simple dd testing with surprising results. The script I have
    used is:

    for i in $(seq 10); do
    echo $i
    dd if=/dev/sdb1 of=/dev/sdc1 bs=4k &
    sleep 5
    done
    /usr/bin/time -f %e ./blkdiscard /dev/sdc1

    Running time of BLKDISCARD on the whole device:

        with patch:     0.95 s
        without patch: 15.58 s

    So we can see that in this artificial test the kernel with the patch
    applied is approx 16x faster in discarding the device.

    Signed-off-by: Lukas Czerner
    CC: Dmitry Monakhov
    CC: Jens Axboe
    CC: Jeff Moyer
    Signed-off-by: Jens Axboe

    Lukas Czerner
     
  • In some drives, flush requests are non-queueable. When a flush request
    is running, normal read/write requests can't run. If the block layer
    dispatches such a request, the driver can't handle it and requeues it.
    Tejun suggested we can hold the queue when a flush is running. This
    avoids unnecessary requeues and can also improve performance. For
    example, say we have the requests flush1, write1, flush2. flush1 is
    dispatched, then the queue is held, so write1 isn't inserted into the
    queue. After flush1 is finished, flush2 will be dispatched. Since the
    disk cache is already clean, flush2 will finish very soon, so it looks
    like flush2 is folded into flush1.

    In my test, the queue holding completely solves a regression introduced by
    commit 53d63e6b0dfb95882ec0219ba6bbd50cde423794:

    block: make the flush insertion use the tail of the dispatch list

    It's not a preempt type request, in fact we have to insert it
    behind requests that do specify INSERT_FRONT.

    which causes about 20% regression running a sysbench fileio
    workload.

    Stable: 2.6.39 only

    Cc: stable@kernel.org
    Signed-off-by: Shaohua Li
    Acked-by: Tejun Heo
    Signed-off-by: Jens Axboe

    shaohua.li@intel.com