12 Oct, 2016

1 commit

  • Make sure that the offset and length arguments that we're using to
    construct WRITE SAME and DISCARD requests are actually aligned to the
    logical block size. Failure to do this causes other errors in other parts
    of the block layer or the SCSI layer because disks don't support partial
    logical block writes.

    Link: http://lkml.kernel.org/r/147518379026.22791.4437508871355153928.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Cc: Theodore Ts'o
    Cc: Mike Snitzer # tweaked header
    Cc: Brian Foster
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
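
As an illustration of the alignment check described in this entry, here is a minimal userspace sketch. The helper name `range_is_lbs_aligned` and its signature are hypothetical and are not taken from the patch; it assumes 512-byte sectors and a power-of-two logical block size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sectors are 512-byte units here (an assumption of this sketch). */
#define SECTOR_SHIFT 9

/*
 * Hypothetical helper: returns true when both the start sector and the
 * sector count are multiples of the logical block size.
 * lbs_bytes must be a power of two and at least 512.
 */
static bool range_is_lbs_aligned(uint64_t sector, uint64_t nr_sects,
                                 uint32_t lbs_bytes)
{
    assert(lbs_bytes >= 512 && (lbs_bytes & (lbs_bytes - 1)) == 0);

    /* Low sector bits that must all be zero for an aligned value. */
    uint64_t bs_mask = (lbs_bytes >> SECTOR_SHIFT) - 1;

    /* OR-ing both values lets one mask test cover offset and length. */
    return ((sector | nr_sects) & bs_mask) == 0;
}
```

With a 4096-byte logical block size, a request starting at sector 8 for 16 sectors passes, while one starting at sector 1 or covering 7 sectors does not.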
     

27 Jul, 2016

2 commits

  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a follow-up to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modified flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix for idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     

21 Jul, 2016

2 commits

  • WRITE SAME is a data integrity operation and we can't simply ignore
    errors.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Currently blkdev_issue_zeroout cascades down from discards (if the driver
    guarantees that discards zero data), to WRITE SAME and then to a loop
    writing zeroes. Unfortunately we ignore run-time EOPNOTSUPP errors in the
    block layer blkdev_issue_discard helper to work around DM volumes that
    may have mixed discard support underneath.

    This patch introduces a new BLKDEV_DISCARD_ZERO flag to
    blkdev_issue_discard that indicates we are called for a zeroing operation.
    This allows us both to skip the EOPNOTSUPP hack for such callers and to
    consolidate the discard_zeroes_data check into the function.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Jun, 2016

1 commit


08 Jun, 2016

4 commits

  • This converts the block issue discard helper and users to use
    the bio_set_op_attrs accessor and only pass in the operation flags
    like REQ_SECURE.

    Signed-off-by: Mike Christie
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     
  • This patch converts the simple bi_rw use cases in the block,
    drivers, mm and fs code to set/get the bio operation using
    bio_set_op_attrs/bio_op

    These should be simple one or two liner cases, so I just did them
    in one patch. The next patches handle the more complicated
    cases in a module per patch.

    Signed-off-by: Mike Christie
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     
  • This has callers of submit_bio/submit_bio_wait set the bio->bi_rw
    instead of passing it in. This makes their usage match
    generic_make_request and how we set the other bio fields.

    Signed-off-by: Mike Christie

    Fixed up fs/ext4/crypto.c

    Signed-off-by: Jens Axboe

    Mike Christie
     
  • submit_bio_wait() gives the caller an opportunity to examine
    struct bio and so expects the caller to issue the bio_put()

    This fixes a memory leak reported by a few people in 4.7-rc2
    kmemleak report after 9082e87bfbf8 ("block: remove struct bio_batch")

    Signed-off-by: Shaun Tancheff
    Tested-by: Catalin Marinas
    Tested-by: Larry Finger
    Tested-by: David Drysdale
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Shaun Tancheff
     

06 May, 2016

1 commit

  • Commit 38f25255330 ("block: add __blkdev_issue_discard") incorrectly
    disallowed the early return of -EOPNOTSUPP if the device doesn't support
    discard (or secure discard). This early return of -EOPNOTSUPP has
    always been part of blkdev_issue_discard() interface so there isn't a
    good reason to break that behaviour -- especially when it can be easily
    reinstated.

    The nuance of allowing early return of -EOPNOTSUPP vs disallowing late
    return of -EOPNOTSUPP is: if the overall device never advertised support
    for discards and one is issued to the device it is beneficial to inform
    the caller that discards are not supported via -EOPNOTSUPP. But if a
    device advertises discard support it means that at least a subset of the
    device does have discard support -- but it could be that discards issued
    to some regions of a stacked device will not be supported. In that case
    the late return of -EOPNOTSUPP must be disallowed.

    Fixes: 38f25255330 ("block: add __blkdev_issue_discard")
    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
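
The early-versus-late distinction above can be sketched in userspace C. Everything in this block (`struct fake_stacked_dev`, `fake_issue_discard`, the per-region result array) is a hypothetical model for illustration, not the kernel's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a stacked device made of several regions. */
struct fake_stacked_dev {
    bool advertises_discard;   /* device-wide "discard supported" flag */
    const int *region_result;  /* per-region status: 0 or negative errno */
    size_t nr_regions;
};

static int fake_issue_discard(const struct fake_stacked_dev *dev)
{
    int ret = 0;

    assert(dev != NULL);

    /* Early return: the device never advertised discard support. */
    if (!dev->advertises_discard)
        return -EOPNOTSUPP;

    for (size_t i = 0; i < dev->nr_regions; i++) {
        int err = dev->region_result[i];

        /* Late -EOPNOTSUPP from one region is not reported to the caller. */
        if (err && err != -EOPNOTSUPP)
            ret = err;
    }
    return ret;
}
```

In this model a device with mixed per-region support returns 0, a device that never advertised discard returns -EOPNOTSUPP immediately, and any other error (for example -EIO) is still passed back.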
     

02 May, 2016

2 commits

  • This is a version of blkdev_issue_discard which doesn't wait for
    the I/O to complete, but instead allows the caller to submit
    the final bio and/or chain it to others.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ming Lin
    Signed-off-by: Sagi Grimberg
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • It can be replaced with a combination of bio_chain and submit_bio_wait.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Ming Lin
    Signed-off-by: Sagi Grimberg
    Reviewed-by: Ming Lei
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

28 Oct, 2015

1 commit

  • In commit b49a087("block: remove split code in
    blkdev_issue_{discard,write_same}"), discard_granularity and alignment
    checks were removed. Ideally, with bio late splitting, the upper layers
    shouldn't need to depend on device's limits.

    Christoph reported a discard regression on the HGST Ultrastar SN100 NVMe
    device when running mkfs.xfs. We have not found the root cause yet.

    This patch re-adds discard_granularity and alignment checks by reverting
    the related changes in commit b49a087. The good thing is now we can
    remove the 2G discard size cap and just use UINT_MAX to avoid bi_size
    overflow.

    Reviewed-by: Christoph Hellwig
    Tested-by: Christoph Hellwig
    Signed-off-by: Ming Lin
    Reviewed-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Ming Lin
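
To make the bi_size overflow concern concrete, the sketch below works through the arithmetic for a 2G cap and for a UINT_MAX cap. It assumes bi_size is a 32-bit byte count and that sectors are 512 bytes; the names `CAP_2G_SECTORS`, `CAP_UINT_MAX_SECTORS` and `fits_in_bi_size` are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9  /* 512-byte sectors (assumption of this sketch) */

/* 2 GiB expressed in sectors: 4194304, a power of two. */
static const uint32_t CAP_2G_SECTORS = (UINT32_C(1) << 31) >> SECTOR_SHIFT;

/* Largest sector count whose byte size still fits in 32 bits: 8388607. */
static const uint32_t CAP_UINT_MAX_SECTORS = UINT32_MAX >> SECTOR_SHIFT;

/* True if a request of nr_sects sectors fits in a 32-bit byte count. */
static bool fits_in_bi_size(uint64_t nr_sects)
{
    assert(nr_sects <= (UINT64_MAX >> SECTOR_SHIFT));
    return (nr_sects << SECTOR_SHIFT) <= UINT32_MAX;
}
```

Both caps fit in 32 bits; one sector more than the UINT_MAX-based cap does not. The UINT_MAX-based cap is not a multiple of common granularities, which is why it goes together with explicit granularity and alignment checks.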
     

14 Aug, 2015

1 commit

  • The split code in blkdev_issue_{discard,write_same} can go away
    now that any driver that cares does the split. We have to make
    sure bio size doesn't overflow.

    For discard, we set max discard sectors to (1<<31)>>9 to ensure
    it doesn't overflow bi_size and hopefully it is of the proper
    granularity as long as the granularity is a power of two.

    Acked-by: Christoph Hellwig
    Signed-off-by: Ming Lin
    Signed-off-by: Jens Axboe

    Ming Lin
     

29 Jul, 2015

1 commit

  • Currently we have two different ways to signal an I/O error on a BIO:

    (1) by clearing the BIO_UPTODATE flag
    (2) by returning a Linux errno value to the bi_end_io callback

    The first one has the drawback of only communicating a single possible
    error (-EIO), and the second one has the drawback of not being persistent
    when bios are queued up, and of not being passed along from child to parent
    bio in the ever more popular chaining scenario. Having both mechanisms
    available has the additional drawback of utterly confusing driver authors
    and introducing bugs where various I/O submitters only deal with one of
    them, and the others have to add boilerplate code to deal with both kinds
    of error returns.

    So add a new bi_error field to store an errno value directly in struct
    bio and remove the existing mechanisms to clean all this up.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Reviewed-by: NeilBrown
    Signed-off-by: Jens Axboe

    Christoph Hellwig
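
A rough userspace model of why an errno stored in the bio itself helps with chaining is sketched below. `struct mini_bio` and `mini_bio_complete` are hypothetical; the "first error wins" propagation rule is an assumption of this sketch, not something stated in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio that carries its own error code. */
struct mini_bio {
    int bi_error;             /* 0 on success, negative errno on failure */
    struct mini_bio *parent;  /* next bio up the chain, or NULL */
};

/* Record the completion status and hand it up the chain. */
static void mini_bio_complete(struct mini_bio *bio, int error)
{
    assert(bio != NULL && error <= 0);

    if (error)
        bio->bi_error = error;

    /* Assumed rule: a parent keeps the first error it was given. */
    for (; bio->parent != NULL; bio = bio->parent) {
        if (bio->bi_error && !bio->parent->bi_error)
            bio->parent->bi_error = bio->bi_error;
    }
}
```

Because the value lives in the structure, it survives queueing and can carry any errno, not just -EIO.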
     

06 Feb, 2015

1 commit

  • blkdev_issue_zeroout() printed a warning if a device failed a discard or
    write same request despite advertising support for these. That's fine
    for SCSI since we'll disable these commands if we get an error back from
    the disk saying that they are not supported. And consequently the
    warning only gets printed once.

    There are other types of block devices that support discard, however,
    and these may return -EOPNOTSUPP for each command but leave discard
    enabled in the queue limits. This will cause a warning message for every
    blkdev_issue_zeroout() invocation.

    Remove the offending warning messages.

    Reported-by: Sedat Dilek
    Signed-off-by: Martin K. Petersen
    Tested-by: Sedat Dilek
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

22 Jan, 2015

1 commit

  • blkdev_issue_zeroout() will zero a given block range. This is done by
    way of explicit writing, thus provisioning or allocating the blocks on
    disk.

    There are use cases where the desired behavior is to zero the blocks but
    unprovision them if possible. The blocks must deterministically contain
    zeroes when they are subsequently read back.

    This patch adds a flag to blkdev_issue_zeroout() that provides this
    variant. If the discard flag is set and a block device guarantees
    discard_zeroes_data we will use REQ_DISCARD to clear the block range. If
    the device does not support discard_zeroes_data or if the discard
    request fails we will fall back to first REQ_WRITE_SAME and then a
    regular REQ_WRITE.

    Also update the callers of blkdev_issue_zeroout() to reflect the new flag
    and make sb_issue_zeroout() prefer the discard approach.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Martin K. Petersen
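
The fallback order described in this entry (discard, then WRITE SAME, then plain writes) can be modelled as follows. The device structure, the helper names and the returned enum are all hypothetical; real requests are replaced by simple capability checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capability model of a block device. */
struct fake_dev {
    bool discard_zeroes_data;  /* discarded blocks read back as zeroes */
    bool discard_fails;        /* discard request fails at run time */
    bool write_same_ok;        /* WRITE SAME is supported and succeeds */
};

enum zero_method { ZERO_BY_DISCARD, ZERO_BY_WRITE_SAME, ZERO_BY_WRITE };

static int try_discard(const struct fake_dev *d)
{
    if (!d->discard_zeroes_data || d->discard_fails)
        return -EOPNOTSUPP;
    return 0;
}

static int try_write_same(const struct fake_dev *d)
{
    return d->write_same_ok ? 0 : -EOPNOTSUPP;
}

/* Returns which method ended up zeroing the range. */
static enum zero_method fake_zeroout(const struct fake_dev *d, bool discard)
{
    assert(d != NULL);

    /* Discard is only tried when the caller asked for it. */
    if (discard && try_discard(d) == 0)
        return ZERO_BY_DISCARD;

    if (try_write_same(d) == 0)
        return ZERO_BY_WRITE_SAME;

    /* Last resort: write zero-filled pages explicitly. */
    return ZERO_BY_WRITE;
}
```

The point of the ordering is that each step is cheaper for the device than the next, while every path leaves the range reading back as zeroes.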
     

27 May, 2014

1 commit


13 Feb, 2014

1 commit

  • When mkfs issues a full device discard and the device only
    supports discards of a smallish size, we can loop in
    blkdev_issue_discard() for a long time. If preempt isn't enabled,
    this can turn into a soft lockup situation and the kernel will
    start complaining.

    Add an explicit cond_resched() at the end of the loop to avoid
    that.

    Cc: stable@kernel.org
    Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Nov, 2013

1 commit

  • Immutable biovecs are going to require an explicit iterator. To
    implement immutable bvecs, a later patch is going to add a bi_bvec_done
    member to this struct; for now, this patch effectively just renames
    things.

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: Geert Uytterhoeven
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: "Ed L. Cashin"
    Cc: Nick Piggin
    Cc: Lars Ellenberg
    Cc: Jiri Kosina
    Cc: Matthew Wilcox
    Cc: Geoff Levand
    Cc: Yehuda Sadeh
    Cc: Sage Weil
    Cc: Alex Elder
    Cc: ceph-devel@vger.kernel.org
    Cc: Joshua Morris
    Cc: Philip Kelleher
    Cc: Rusty Russell
    Cc: "Michael S. Tsirkin"
    Cc: Konrad Rzeszutek Wilk
    Cc: Jeremy Fitzhardinge
    Cc: Neil Brown
    Cc: Alasdair Kergon
    Cc: Mike Snitzer
    Cc: dm-devel@redhat.com
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: linux390@de.ibm.com
    Cc: Boaz Harrosh
    Cc: Benny Halevy
    Cc: "James E.J. Bottomley"
    Cc: Greg Kroah-Hartman
    Cc: "Nicholas A. Bellinger"
    Cc: Alexander Viro
    Cc: Chris Mason
    Cc: "Theodore Ts'o"
    Cc: Andreas Dilger
    Cc: Jaegeuk Kim
    Cc: Steven Whitehouse
    Cc: Dave Kleikamp
    Cc: Joern Engel
    Cc: Prasad Joshi
    Cc: Trond Myklebust
    Cc: KONISHI Ryusuke
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Ben Myers
    Cc: xfs@oss.sgi.com
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: "Rafael J. Wysocki"
    Cc: Herton Ronaldo Krzesinski
    Cc: Ben Hutchings
    Cc: Andrew Morton
    Cc: Guo Chao
    Cc: Tejun Heo
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Wei Yongjun
    Cc: "Roger Pau Monné"
    Cc: Jan Beulich
    Cc: Stefano Stabellini
    Cc: Ian Campbell
    Cc: Sebastian Ott
    Cc: Christian Borntraeger
    Cc: Minchan Kim
    Cc: Jiang Liu
    Cc: Nitin Gupta
    Cc: Jerome Marchand
    Cc: Joe Perches
    Cc: Peng Tao
    Cc: Andy Adamson
    Cc: fanchaoting
    Cc: Jie Liu
    Cc: Sunil Mushran
    Cc: "Martin K. Petersen"
    Cc: Namjae Jeon
    Cc: Pankaj Kumar
    Cc: Dan Magenheimer
    Cc: Mel Gorman

    Kent Overstreet
     

09 Nov, 2013

1 commit

  • do_div() (called by sector_div() if CONFIG_LBDAF=y) is meant for divisions
    of 64-bit number by 32-bit numbers. Passing 64-bit divisor types caused
    issues in the past on 32-bit platforms, cfr. commit
    ea077b1b96e073eac5c3c5590529e964767fc5f7 ("m68k: Truncate base in
    do_div()").

    As queue_limits.max_discard_sectors and .discard_granularity are unsigned
    int, max_discard_sectors and granularity should be unsigned int.
    As bdev_discard_alignment() returns int, alignment should be int.
    Now 2 calls to sector_div() can be replaced by 32-bit arithmetic:
    - The 64-bit modulo operation can become a 32-bit modulo operation,
    - The 64-bit division and multiplication can be replaced by a 32-bit
    modulo operation and a subtraction.

    Signed-off-by: Geert Uytterhoeven
    Signed-off-by: Jens Axboe

    Geert Uytterhoeven
     

15 Feb, 2013

1 commit

  • Using wait_for_completion() for waiting for an IO request to be executed
    results in wrong iowait time accounting. For example, a system having
    the only task doing write() and fdatasync() on a block device can be
    reported being idle instead of iowaiting as it should because
    blkdev_issue_flush() calls wait_for_completion() which in turn calls
    schedule() that does not increment the iowait proc counter and thus does
    not turn on iowait time accounting.

    The patch makes block layer use wait_for_completion_io() instead of
    wait_for_completion() where appropriate to account iowait time
    correctly.

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Jens Axboe

    Vladimir Davydov
     

15 Dec, 2012

2 commits

  • Last post of this patch appears lost, so I resend this.

    Now that discard merging works, add a plug for blkdev_issue_discard. This
    helps discard requests merge, especially in the raid0 case. In raid0, a big
    discard request is split into small requests, and if the plug is added
    correctly, such small requests can be merged in the lower layer.

    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
     
  • In the MD raid case, discard granularity might not be a power of 2; for
    example, a 4-disk raid5 has a discard granularity of 3*chunk_size. Correct
    the calculation for such cases.

    Reported-by: Neil Brown
    Signed-off-by: Shaohua Li
    Signed-off-by: Jens Axboe

    Shaohua Li
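
To show why a granularity of 3*chunk_size needs a different calculation, the sketch below compares mask-based rounding (valid only for powers of two) with division-based rounding. Both helper names are made up for this example, and the chunk size of 128 sectors used in the note after the block is an arbitrary choice.

```c
#include <assert.h>
#include <stdint.h>

/* Rounds down correctly ONLY when granularity is a power of two. */
static uint64_t round_down_mask(uint64_t sector, uint32_t granularity)
{
    assert(granularity != 0);
    return sector & ~((uint64_t)granularity - 1);
}

/* Rounds down correctly for any non-zero granularity. */
static uint64_t round_down_div(uint64_t sector, uint32_t granularity)
{
    assert(granularity != 0);
    return sector - (sector % granularity);
}
```

With a chunk of 128 sectors the raid5 granularity is 384. Rounding sector 1000 down gives 768 with the division form, but 640 with the mask form, which is not a multiple of 384. For a power-of-two granularity such as 64 both forms agree on 960.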
     

20 Sep, 2012

2 commits

  • If the device supports WRITE SAME, use that to optimize zeroing of
    blocks. If the device does not support WRITE SAME or if the operation
    fails, fall back to writing zeroes the old-fashioned way.

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The WRITE SAME command supported on some SCSI devices allows the same
    block to be efficiently replicated throughout a block range. Only a
    single logical block is transferred from the host and the storage device
    writes the same data to all blocks described by the I/O.

    This patch implements support for WRITE SAME in the block layer. The
    blkdev_issue_write_same() function can be used by filesystems and block
    drivers to replicate a buffer across a block range. This can be used to
    efficiently initialize software RAID devices, etc.

    Signed-off-by: Martin K. Petersen
    Acked-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

02 Aug, 2012

2 commits

  • When a disk has large discard_granularity and small max_discard_sectors,
    discards are not split with optimal alignment. In the limit case of
    discard_granularity == max_discard_sectors, no request could be aligned
    correctly, so in fact you might end up with no discarded logical blocks
    at all.

    Another example that helps show the condition in the patch is with
    discard_granularity == 64, max_discard_sectors == 128. A request that is
    submitted for 256 sectors 2..257 will be split in two: 2..129, 130..257.
    However, only 2 aligned blocks out of 3 are included in the request;
    128..191 may be left intact and not discarded. With this patch, the
    first request will be truncated to ensure good alignment of what's left,
    and the split will be 2..127, 128..255, 256..257. The patch will also
    take into account the discard_alignment.

    At most one extra request will be introduced, because the first request
    will be reduced by at most granularity-1 sectors, and granularity
    must be less than max_discard_sectors. Subsequent requests will run
    on round_down(max_discard_sectors, granularity) sectors, as in the
    current code.

    Signed-off-by: Paolo Bonzini
    Acked-by: Vivek Goyal
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Paolo Bonzini
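
The splitting rule described in this entry can be checked with a small userspace sketch. The function `split_discard` and `struct extent` are hypothetical and only model the arithmetic; no I/O is issued. It assumes granularity is non-zero, alignment is smaller than granularity, and max_discard_sectors is at least one granule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct extent {
    uint64_t start;  /* first sector of the piece */
    uint64_t len;    /* number of sectors in the piece */
};

/* Splits [sector, sector + nr_sects) and returns the number of pieces. */
static size_t split_discard(uint64_t sector, uint64_t nr_sects,
                            uint32_t granularity, uint32_t alignment,
                            uint32_t max_discard_sectors,
                            struct extent *out, size_t out_cap)
{
    size_t n = 0;

    assert(granularity != 0 && alignment < granularity);

    /* Full-size pieces cover a whole number of granules. */
    max_discard_sectors -= max_discard_sectors % granularity;
    assert(max_discard_sectors != 0);

    while (nr_sects != 0 && n < out_cap) {
        uint64_t req = nr_sects < max_discard_sectors
                           ? nr_sects : max_discard_sectors;
        uint64_t end = sector + req;

        /* A piece followed by more data is truncated to an aligned end. */
        if (req < nr_sects && end % granularity != alignment) {
            end = (end - alignment) / granularity * granularity + alignment;
            req = end - sector;
        }

        out[n].start = sector;
        out[n].len = req;
        n++;

        sector = end;
        nr_sects -= req;
    }
    return n;
}
```

Called with sector 2, 256 sectors, granularity 64, alignment 0 and max_discard_sectors 128, it produces the three pieces from the text: 2..127, 128..255 and 256..257.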
     
  • Mostly a preparation for the next patch.

    In principle this fixes an infinite loop if max_discard_sectors < granularity,
    but that really shouldn't happen.

    Signed-off-by: Paolo Bonzini
    Acked-by: Vivek Goyal
    Tested-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Paolo Bonzini
     

24 Jul, 2011

1 commit


07 Jul, 2011

1 commit

  • Due to the recently identified overflow in read_capacity_16() it was
    possible for max_discard_sectors to be zero but still have discards
    enabled on the associated device's queue.

    Eliminate the possibility for blkdev_issue_discard to infinitely loop.

    Interestingly this issue wasn't identified until a device, whose
    discard_granularity was 0 due to read_capacity_16 overflow, was consumed
    by blk_stack_limits() to construct limits for a higher-level DM
    multipath device. The multipath device's resulting limits never had the
    discard limits stacked because blk_stack_limits() will only do so if
    the bottom device's discard_granularity != 0. This resulted in the
    multipath device's limits.max_discard_sectors being 0.

    Signed-off-by: Mike Snitzer
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

07 May, 2011

3 commits

  • Currently we return -EOPNOTSUPP in blkdev_issue_discard() if any of the
    bios fails because the underlying device does not support discard requests.
    However, if the device is, for example, a dm device composed of devices
    some of which support discard and some of which do not, it is ok
    for some bios to fail with EOPNOTSUPP; that does not mean that discard
    is not supported at all.

    This commit removes the check for bios that failed with EOPNOTSUPP and
    changes blkdev_issue_discard() to return operation not supported if and
    only if the device does not actually support it, not just part of the
    device as some bios might indicate.

    This change also fixes problem with BLKDISCARD ioctl() which now works
    correctly on such dm devices.

    Signed-off-by: Lukas Czerner
    CC: Jens Axboe
    CC: Jeff Moyer
    Signed-off-by: Jens Axboe

    Lukas Czerner
     
  • In blkdev_issue_zeroout() we are submitting regular WRITE bios, so we do
    not need to check for -EOPNOTSUPP specifically in case of error. Also
    there is no need to have label submit: because there is no way to jump
    out from the while cycle without an error and we really want to exit,
    rather than try again. And also remove the check for (sz == 0) since at
    that point sz can never be zero.

    Signed-off-by: Lukas Czerner
    Reviewed-by: Jeff Moyer
    CC: Dmitry Monakhov
    CC: Jens Axboe
    Signed-off-by: Jens Axboe

    Lukas Czerner
     
  • Currently we are waiting for every submitted REQ_DISCARD bio separately,
    but it can have unwanted consequences of repeatedly flushing the queue,
    so we rather submit bios in batches and wait for the entire batch, hence
    narrowing the window of other ios going in.

    Use bio_batch_end_io() and struct bio_batch for that purpose, the same
    is used by blkdev_issue_zeroout(). Also change bio_batch_end_io() so we
    always set !BIO_UPTODATE in the case of error and remove the check for
    bb, since we are the only user of this function and we always set this.

    Remove bio_get()/bio_put() from the blkdev_issue_discard() since
    bio_alloc() and bio_batch_end_io() is doing the same thing, hence it is
    not needed anymore.

    I have done simple dd testing with surprising results. The script I have
    used is:

    for i in $(seq 10); do
        echo $i
        dd if=/dev/sdb1 of=/dev/sdc1 bs=4k &
        sleep 5
    done
    /usr/bin/time -f %e ./blkdiscard /dev/sdc1

    Running time of BLKDISCARD on the whole device:

    with patch      without patch
    0.95            15.58

    So we can see that in this artificial test the kernel with the patch
    applied is approx 16x faster in discarding the device.

    Signed-off-by: Lukas Czerner
    CC: Dmitry Monakhov
    CC: Jens Axboe
    CC: Jeff Moyer
    Signed-off-by: Jens Axboe

    Lukas Czerner
     

25 Mar, 2011

1 commit

  • * 'for-2.6.39/core' of git://git.kernel.dk/linux-2.6-block: (65 commits)
    Documentation/iostats.txt: bit-size reference etc.
    cfq-iosched: removing unnecessary think time checking
    cfq-iosched: Don't clear queue stats when preempt.
    blk-throttle: Reset group slice when limits are changed
    blk-cgroup: Only give unaccounted_time under debug
    cfq-iosched: Don't set active queue in preempt
    block: fix non-atomic access to genhd inflight structures
    block: attempt to merge with existing requests on plug flush
    block: NULL dereference on error path in __blkdev_get()
    cfq-iosched: Don't update group weights when on service tree
    fs: assign sb->s_bdi to default_backing_dev_info if the bdi is going away
    block: Require subsystems to explicitly allocate bio_set integrity mempool
    jbd2: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    jbd: finish conversion from WRITE_SYNC_PLUG to WRITE_SYNC and explicit plugging
    fs: make fsync_buffers_list() plug
    mm: make generic_writepages() use plugging
    blk-cgroup: Add unaccounted time to timeslice_used.
    block: fixup plugging stubs for !CONFIG_BLOCK
    block: remove obsolete comments for blkdev_issue_zeroout.
    blktrace: Use rq->cmd_flags directly in blk_add_trace_rq.
    ...

    Fix up conflicts in fs/{aio.c,super.c}

    Linus Torvalds
     

12 Mar, 2011

1 commit


11 Mar, 2011

1 commit

  • BZ29402
    https://bugzilla.kernel.org/show_bug.cgi?id=29402

    We can hit serious mis-synchronization in bio completion path of
    blkdev_issue_zeroout() leading to a panic.

    The problem is that when we are going to wait_for_completion() in
    blkdev_issue_zeroout() we check if bb.done equals issued (the number of
    submitted bios). If it does, we can skip the wait_for_completion()
    and just return from the function since there is nothing to wait for.
    However, there is an ordering problem because bio_batch_end_io() is
    calling atomic_inc(&bb->done) before complete(), hence it might seem to
    blkdev_issue_zeroout() that all bios have been completed and it exits. At
    this point when bio_batch_end_io() is going to call complete(bb->wait),
    bb and wait no longer exist since they were allocated on the stack in
    blkdev_issue_zeroout() ==> panic!

    (thread 1)                      (thread 2)
    bio_batch_end_io()              blkdev_issue_zeroout()
      if(bb) {                      ...
        if (bb->end_io)             ...
          bb->end_io(bio, err);     ...
        atomic_inc(&bb->done);      ...
        ...                         while (issued != atomic_read(&bb.done))
        ...                         (let issued == bb.done)
        ...                         (do the rest of the function)
        ...                         return ret;
        complete(bb->wait);
        ^^^^^^^^
        panic

    We can fix this easily by simplifying bio_batch and completion counting.

    Also remove bio_end_io_t *end_io since it is not used.

    Signed-off-by: Lukas Czerner
    Reported-by: Eric Whitney
    Tested-by: Eric Whitney
    Reviewed-by: Jeff Moyer
    CC: Dmitry Monakhov
    Signed-off-by: Jens Axboe

    Lukas Czerner
     

02 Mar, 2011

1 commit


17 Sep, 2010

1 commit

  • All the blkdev_issue_* helpers can only sanely be used by synchronous
    callers. To issue cache flushes or barriers asynchronously the caller needs
    to set up a bio by itself with a completion callback to move the asynchronous
    state machine ahead. So drop the BLKDEV_IFL_WAIT flag that is always
    specified when calling blkdev_issue_* and also remove the now unused flags
    argument to blkdev_issue_flush and blkdev_issue_zeroout. For
    blkdev_issue_discard we need to keep it for the secure discard flag, which
    gains a more descriptive name and loses the bitops vs flag confusion.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

10 Sep, 2010

1 commit

  • Remove support for barriers on discards, which is unused now. Also
    remove the DISCARD_NOBARRIER I/O type in favour of just setting the
    rw flags up locally in blkdev_issue_discard.

    tj: Also remove DISCARD_SECURE and use REQ_SECURE directly.

    Signed-off-by: Christoph Hellwig
    Acked-by: Mike Snitzer
    Signed-off-by: Tejun Heo
    Signed-off-by: Jens Axboe

    Christoph Hellwig