13 Feb, 2014

1 commit

  • When mkfs issues a full device discard and the device only
    supports discards of a smallish size, we can loop in
    blkdev_issue_discard() for a long time. If preemption isn't
    enabled, this can turn into a soft lockup situation and the
    kernel will start complaining.

    Add an explicit cond_resched() at the end of the loop to avoid
    that.

    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Nov, 2013

1 commit

  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
     

09 Nov, 2013

1 commit

  • do_div() (called by sector_div() if CONFIG_LBDAF=y) is meant for
    division of a 64-bit number by a 32-bit number. Passing 64-bit
    divisor types caused issues in the past on 32-bit platforms, cf.
    commit ea077b1b96e073eac5c3c5590529e964767fc5f7 ("m68k: Truncate
    base in do_div()").

    As queue_limits.max_discard_sectors and .discard_granularity are unsigned
    int, max_discard_sectors and granularity should be unsigned int.
    As bdev_discard_alignment() returns int, alignment should be int.
    Now two calls to sector_div() can be replaced by 32-bit arithmetic:
      - The 64-bit modulo operation can become a 32-bit modulo
        operation,
      - The 64-bit division and multiplication can be replaced by a
        32-bit modulo operation and a subtraction.

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Jens Axboe

    Geert Uytterhoeven
     

15 Feb, 2013

1 commit

  • Using wait_for_completion() for waiting for an I/O request to be
    executed results in wrong iowait time accounting. For example, a
    system whose only task is doing write() and fdatasync() on a block
    device can be reported as idle instead of iowaiting, because
    blkdev_issue_flush() calls wait_for_completion(), which in turn
    calls schedule(), which does not increment the iowait proc counter
    and thus does not turn on iowait time accounting.

    The patch makes block layer use wait_for_completion_io() instead of
    wait_for_completion() where appropriate to account iowait time
    correctly.

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Jens Axboe

    Vladimir Davydov
     

15 Dec, 2012

2 commits

  • The last post of this patch appears to have been lost, so I am
    resending it.

    Now that discard merging works, add a plug in blkdev_issue_discard().
    This helps discard requests merge, especially in the raid0 case:
    there, a big discard request is split into small requests, and if
    the plug is set up correctly, those small requests can be merged
    in the lower layers.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • In the MD RAID case, the discard granularity might not be a power
    of 2; for example, a 4-disk raid5 has a discard granularity of
    3*chunk_size. Correct the calculation for such cases.

    Reported-by: Neil Brown
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     

20 Sep, 2012

2 commits

  • If the device supports WRITE SAME, use that to optimize zeroing of
    blocks. If the device does not support WRITE SAME or if the operation
    fails, fall back to writing zeroes the old-fashioned way.

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The WRITE SAME command supported on some SCSI devices allows the same
    block to be efficiently replicated throughout a block range. Only a
    single logical block is transferred from the host and the storage device
    writes the same data to all blocks described by the I/O.

    This patch implements support for WRITE SAME in the block layer. The
    blkdev_issue_write_same() function can be used by filesystems and block
    drivers to replicate a buffer across a block range. This can be used to
    efficiently initialize software RAID devices, etc.

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

02 Aug, 2012

2 commits

  • When a disk has large discard_granularity and small max_discard_sectors,
    discards are not split with optimal alignment. In the limit case of
    discard_granularity == max_discard_sectors, no request could be aligned
    correctly, so in fact you might end up with no discarded logical blocks
    at all.

    Another example that helps show the condition in the patch is
    discard_granularity == 64, max_discard_sectors == 128. A request
    submitted for 256 sectors, 2..257, will be split in two: 2..129,
    130..257. However, only 2 of the 3 aligned blocks are covered by
    the request; 128..191 may be left intact and not discarded. With
    this patch, the first request is truncated to ensure good
    alignment of what is left, and the split becomes 2..127, 128..255,
    256..257. The patch also takes discard_alignment into account.

    At most one extra request will be introduced, because the first request
    will be reduced by at most granularity-1 sectors, and granularity
    must be less than max_discard_sectors. Subsequent requests will run
    on round_down(max_discard_sectors, granularity) sectors, as in the
    current code.

    Signed-off-by: Paolo Bonzini
    Acked-by: Vivek Goyal
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Paolo Bonzini
     
  • Mostly a preparation for the next patch.

    In principle this fixes an infinite loop if max_discard_sectors < granularity,
    but that really shouldn't happen.

    Signed-off-by: Paolo Bonzini
    Acked-by: Vivek Goyal
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Paolo Bonzini
     

24 Jul, 2011

1 commit


07 Jul, 2011

1 commit

  • Due to the recently identified overflow in read_capacity_16() it was
    possible for max_discard_sectors to be zero but still have discards
    enabled on the associated device's queue.

    Eliminate the possibility of blkdev_issue_discard() looping
    forever.

    Interestingly this issue wasn't identified until a device, whose
    discard_granularity was 0 due to read_capacity_16 overflow, was consumed
    by blk_stack_limits() to construct limits for a higher-level DM
    multipath device. The multipath device's resulting limits never had the
    discard limits stacked because blk_stack_limits() will only do so if
    the bottom device's discard_granularity != 0. This resulted in the
    multipath device's limits.max_discard_sectors being 0.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

07 May, 2011

3 commits

  • Currently we return -EOPNOTSUPP from blkdev_issue_discard() if any
    of the bios fails because the underlying device does not support
    discard requests. However, if the device is, for example, a dm
    device composed of devices some of which support discard and some
    of which do not, it is ok for some bios to fail with EOPNOTSUPP;
    that does not mean discard is unsupported altogether.

    This commit removes the check for bios that failed with EOPNOTSUPP
    and changes blkdev_issue_discard() to return "operation not
    supported" if and only if the device does not actually support it,
    rather than just part of the device, as some bios might indicate.

    This change also fixes a problem with the BLKDISCARD ioctl(), which
    now works correctly on such dm devices.

    Signed-off-by: Lukas Czerner
    CC: Jens Axboe
    CC: Jeff Moyer
    Signed-off-by: Jens Axboe

    Lukas Czerner
     
  • In blkdev_issue_zeroout() we are submitting regular WRITE bios, so
    we do not need to check specifically for -EOPNOTSUPP on error.
    There is also no need for the submit: label, because there is no
    way to leave the while loop without an error, and we really want
    to exit rather than try again. Also remove the check for
    (sz == 0), since at that point sz can never be zero.

    Signed-off-by: Lukas Czerner
    Reviewed-by: Jeff Moyer
    CC: Dmitry Monakhov
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Lukas Czerner
     
  • Currently we wait for every submitted REQ_DISCARD bio separately,
    which can have the unwanted consequence of repeatedly flushing the
    queue. Instead, submit bios in batches and wait for the entire
    batch, hence narrowing the window for other I/Os to slip in.

    Use bio_batch_end_io() and struct bio_batch for that purpose, the
    same as is used by blkdev_issue_zeroout(). Also change
    bio_batch_end_io() so we always clear BIO_UPTODATE in the case of
    error, and remove the check for bb, since we are the only user of
    this function and we always set it.

    Remove bio_get()/bio_put() from blkdev_issue_discard(), since
    bio_alloc() and bio_batch_end_io() already do the same thing, so
    it is no longer needed.

    I have done simple dd testing, with surprising results. The script
    I used is:

    for i in $(seq 10); do
            echo $i
            dd if=/dev/sdb1 of=/dev/sdc1 bs=4k &
            sleep 5
    done
    /usr/bin/time -f %e ./blkdiscard /dev/sdc1

    Running time of BLKDISCARD on the whole device:

        with patch      without patch
        0.95            15.58

    So we can see that in this artificial test the kernel with the patch
    applied is approx 16x faster in discarding the device.

    Signed-off-by: Lukas Czerner
    CC: Dmitry Monakhov
    CC: Jens Axboe
    CC: Jeff Moyer
    Signed-off-by: Jens Axboe

    Lukas Czerner
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

12 Mar, 2011

1 commit


11 Mar, 2011

1 commit

  • BZ29402
    https://bugzilla.kernel.org/show_bug.cgi?id=29402

    We can hit serious mis-synchronization in the bio completion path
    of blkdev_issue_zeroout(), leading to a panic.

    The problem is that before going to wait_for_completion() in
    blkdev_issue_zeroout() we check whether bb.done equals issued (the
    number of submitted bios). If it does, we can skip the
    wait_for_completion() and just return from the function, since
    there is nothing to wait for. However, there is an ordering
    problem, because bio_batch_end_io() calls atomic_inc(&bb->done)
    before complete(), so it might appear to blkdev_issue_zeroout()
    that all bios have completed, and it exits. At the point when
    bio_batch_end_io() then calls complete(bb->wait), bb and wait no
    longer exist, since they were allocated on the stack in
    blkdev_issue_zeroout() ==> panic!

    (thread 1)                    (thread 2)
    bio_batch_end_io()            blkdev_issue_zeroout()
    if (bb) {                     ...
      if (bb->end_io)             ...
        bb->end_io(bio, err);     ...
      atomic_inc(&bb->done);      ...
    ...                           while (issued != atomic_read(&bb.done))
    ...                           (let issued == bb.done)
    ...                           (do the rest of the function)
    ...                           return ret;
    complete(bb->wait);
    ^^^^^^^^
    panic

    We can fix this easily by simplifying bio_batch and completion counting.

    Also remove bio_end_io_t *end_io since it is not used.

    Signed-off-by: Lukas Czerner
    Reported-by: Eric Whitney
    Tested-by: Eric Whitney
    Reviewed-by: Jeff Moyer
    CC: Dmitry Monakhov
    Signed-off-by: Jens Axboe

    Lukas Czerner
     

02 Mar, 2011

1 commit


17 Sep, 2010

1 commit

  • All the blkdev_issue_* helpers can only sanely be used by
    synchronous callers. To issue cache flushes or barriers
    asynchronously, the caller needs to set up a bio itself, with a
    completion callback to move the asynchronous state machine ahead.
    So drop the BLKDEV_IFL_WAIT flag that is always specified when
    calling blkdev_issue_*, and also remove the now unused flags
    argument to blkdev_issue_flush() and blkdev_issue_zeroout(). For
    blkdev_issue_discard() we need to keep it for the secure discard
    flag, which gains a more descriptive name and loses the bitops vs
    flag confusion.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

10 Sep, 2010

1 commit

  • Remove support for barriers on discards, which is unused now. Also
    remove the DISCARD_NOBARRIER I/O type in favour of just setting the
    rw flags up locally in blkdev_issue_discard.

    tj: Also remove DISCARD_SECURE and use REQ_SECURE directly.

    Signed-off-by: Christoph Hellwig
    Acked-by: Mike Snitzer
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

12 Aug, 2010

1 commit

  • Secure discard is the same as discard except that all copies of the
    discarded sectors (perhaps created by garbage collection) must also be
    erased.

    Signed-off-by: Adrian Hunter
    Acked-by: Jens Axboe
    Cc: Kyungmin Park
    Cc: Madhusudhan Chikkature
    Cc: Christoph Hellwig
    Cc: Ben Gardiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Adrian Hunter
     

09 Aug, 2010

1 commit


08 Aug, 2010

2 commits

  • If the queue doesn't have a limit set, or it is just set to the
    UINT_MAX we default to, we could be sending down a discard request
    that isn't of the correct granularity when the block size is
    > 512 bytes.

    Fix this by adjusting max_discard_sectors down to the proper
    alignment.

    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Allocating a fixed payload for discard requests always was a
    horrible hack, and it's now coming to bite us when adding support
    for discard in DM/MD.

    So change the code to leave the allocation of a payload to the
    low-level driver. Unfortunately that means we'll need another
    hack, which allows us to update the various block layer length
    fields indicating that we have a payload. Instead of hiding this
    in sd.c, which we already partially do for UNMAP support, add a
    documented helper in the core block layer for it.

    Signed-off-by: Christoph Hellwig
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

29 Apr, 2010

3 commits