25 Sep, 2020

1 commit


06 Aug, 2020

1 commit

  • If a loop device is created with an NVMe SSD as its backing device, the
    current loop driver doesn't set its queue's limits.discard_granularity
    correctly and leaves it as 0. If a discard request is then issued at
    LBA 0 on this loop device, the req_sects calculated in
    __blkdev_issue_discard() will be 0, and the zero-length discard request
    will trigger a BUG() panic in the generic block layer code at
    block/blk-mq.c:563.

    [ 955.565006][ C39] ------------[ cut here ]------------
    [ 955.559660][ C39] invalid opcode: 0000 [#1] SMP NOPTI
    [ 955.622171][ C39] CPU: 39 PID: 248 Comm: ksoftirqd/39 Tainted: G E 5.8.0-default+ #40
    [ 955.622171][ C39] Hardware name: Lenovo ThinkSystem SR650 -[7X05CTO1WW]-/-[7X05CTO1WW]-, BIOS -[IVE160M-2.70]- 07/17/2020
    [ 955.622175][ C39] RIP: 0010:blk_mq_end_request+0x107/0x110
    [ 955.622177][ C39] Code: 48 8b 03 e9 59 ff ff ff 48 89 df 5b 5d 41 5c e9 9f ed ff ff 48 8b 35 98 3c f4 00 48 83 c7 10 48 83 c6 19 e8 cb 56 c9 ff eb cb 0b 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 41 54
    [ 955.622179][ C39] RSP: 0018:ffffb1288701fe28 EFLAGS: 00010202
    [ 955.749277][ C39] RAX: 0000000000000001 RBX: ffff956fffba5080 RCX: 0000000000004003
    [ 955.749278][ C39] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
    [ 955.749279][ C39] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
    [ 955.749279][ C39] R10: ffffb1288701fd28 R11: 0000000000000001 R12: ffffffffa8e05160
    [ 955.749280][ C39] R13: 0000000000000004 R14: 0000000000000004 R15: ffffffffa7ad3a1e
    [ 955.749281][ C39] FS: 0000000000000000(0000) GS:ffff95bfbda00000(0000) knlGS:0000000000000000
    [ 955.749282][ C39] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 955.749282][ C39] CR2: 00007f6f0ef766a8 CR3: 0000005a37012002 CR4: 00000000007606e0
    [ 955.749283][ C39] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 955.749284][ C39] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 955.749284][ C39] PKRU: 55555554
    [ 955.749285][ C39] Call Trace:
    [ 955.749290][ C39] blk_done_softirq+0x99/0xc0
    [ 957.550669][ C39] __do_softirq+0xd3/0x45f
    [ 957.550677][ C39] ? smpboot_thread_fn+0x2f/0x1e0
    [ 957.550679][ C39] ? smpboot_thread_fn+0x74/0x1e0
    [ 957.550680][ C39] ? smpboot_thread_fn+0x14e/0x1e0
    [ 957.550684][ C39] run_ksoftirqd+0x30/0x60
    [ 957.550687][ C39] smpboot_thread_fn+0x149/0x1e0
    [ 957.886225][ C39] ? sort_range+0x20/0x20
    [ 957.886226][ C39] kthread+0x137/0x160
    [ 957.886228][ C39] ? kthread_park+0x90/0x90
    [ 957.886231][ C39] ret_from_fork+0x22/0x30
    [ 959.117120][ C39] ---[ end trace 3dacdac97e2ed164 ]---

    The panic can be reproduced with the following steps:
    # modprobe scsi_debug delay=0 dev_size_mb=2048 max_queue=1
    # losetup -f /dev/nvme0n1 --direct-io=on
    # blkdiscard /dev/loop0 -o 0 -l 0x200

    This patch fixes the issue by checking q->limits.discard_granularity in
    __blkdev_issue_discard() before composing the discard bio. If the value
    is 0, it prints a warning (oops-style) message and returns -EOPNOTSUPP
    to the caller to indicate that this buggy device driver doesn't support
    discard requests.
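
    A minimal userspace sketch of why a granularity of 0 leads to a
    zero-length request, and of the guard described above. It assumes the
    kernel's power-of-two round_down() mask trick; the function names
    round_down_pow2(), aligned_discard_max_sectors() and
    check_discard_granularity() are hypothetical and are not the kernel's
    own code.

```c
#include <errno.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

/* Round x down to a multiple of y using the power-of-two mask trick. */
uint32_t round_down_pow2(uint32_t x, uint32_t y)
{
	/* With y == 0, (y - 1) wraps to all ones, so the result is 0. */
	return x & ~(y - 1u);
}

/* Largest discard length in sectors that is a multiple of the granularity. */
uint32_t aligned_discard_max_sectors(uint32_t granularity_bytes)
{
	return round_down_pow2(UINT32_MAX, granularity_bytes) >> SECTOR_SHIFT;
}

/* Hypothetical guard: refuse to compose a discard if granularity is unset. */
int check_discard_granularity(uint32_t granularity_bytes)
{
	if (granularity_bytes == 0)
		return -EOPNOTSUPP;
	return 0;
}
```

    With a granularity of 0 the maximum length collapses to 0 sectors, which
    is the zero-length request the guard is meant to prevent.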

    Fixes: 9b15d109a6b2 ("block: improve discard bio alignment in __blkdev_issue_discard()")
    Fixes: c52abf563049 ("loop: Better discard support for block devices")
    Reported-and-suggested-by: Ming Lei
    Signed-off-by: Coly Li
    Reviewed-by: Ming Lei
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Jack Wang
    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Darrick J. Wong
    Cc: Enzo Matsumiya
    Cc: Evan Green
    Cc: Jens Axboe
    Cc: Martin K. Petersen
    Cc: Xiao Ni
    Signed-off-by: Jens Axboe

    Coly Li
     

17 Jul, 2020

1 commit

  • This patch improves discard bio split for address and size alignment in
    __blkdev_issue_discard(). Aligned discard bios may help the underlying
    device controller perform better discard and internal garbage
    collection, and avoid unnecessary internal fragmentation.

    The current discard bio split algorithm in __blkdev_issue_discard() may
    leave non-discarded fragments on the device even when the discard bio's
    LBA and size are both aligned to the device's discard granularity.

    Here are example steps to reproduce the above problem.
    - On a VMWare ESXi 6.5 update3 installation, create a 51GB virtual disk
    with thin mode and give it to a Linux virtual machine.
    - Inside the Linux virtual machine, if the virtual disk shows up as
    /dev/sdb, fill data into the first 50GB by,
    # dd if=/dev/zero of=/dev/sdb bs=4096 count=13107200
    - Discard the 50GB range from offset 0 on /dev/sdb,
    # blkdiscard /dev/sdb -o 0 -l 53687091200
    - Observe the underlying mapping status of the device
    # sg_get_lba_status /dev/sdb -m 1048 --lba=0
    descriptor LBA: 0x0000000000000000 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000000000800 blocks: 16773120 deallocated
    descriptor LBA: 0x0000000000fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000001000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000017ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000001800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000001fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000002000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000027ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000002800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000002fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000003000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000037ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000003800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000003fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000004000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000047ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000004800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000004fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000005000000 blocks: 8386560 deallocated
    descriptor LBA: 0x00000000057ff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000005800000 blocks: 8386560 deallocated
    descriptor LBA: 0x0000000005fff800 blocks: 2048 mapped (or unknown)
    descriptor LBA: 0x0000000006000000 blocks: 6291456 deallocated
    descriptor LBA: 0x0000000006600000 blocks: 0 deallocated

    Although the discard bio starts at LBA 0 and has 50<< [snipped]
    [snipped] << 9) > UINT_MAX);
    62
    63         bio = blk_next_bio(bio, 0, gfp_mask);
    64         bio->bi_iter.bi_sector = sector;
    65         bio_set_dev(bio, bdev);
    66         bio_set_op_attrs(bio, op, 0);
    67
    68         bio->bi_iter.bi_size = req_sects << 9;
    69         sector += req_sects;
    70         nr_sects -= req_sects;
    [snipped]
    79     }
    80
    81     *biop = bio;
    82     return 0;
    83 }
    84 EXPORT_SYMBOL(__blkdev_issue_discard);

    At lines 58-59, to discard a 50GB range, req_sects is set to the return
    value of bio_allowed_max_sectors(q), which is 8388607 sectors. In the
    above case the discard granularity is 2048 sectors, so although the
    start LBA and discard length are aligned to the discard granularity,
    req_sects never has a chance to be aligned to it. This is why a
    still-mapped 2048-sector fragment appears in every 4 or 8 GB range.

    If req_sects at line 58 is set to a value aligned to discard_granularity
    and close to UINT_MAX, then all subsequent split bios inside the device
    driver are (almost always) aligned to the discard_granularity of the
    device queue. The 2048-sector still-mapped fragments will disappear.

    This patch introduces bio_aligned_discard_max_sectors() to return the
    value which is aligned to q->limits.discard_granularity and closest
    to UINT_MAX. The patch then replaces bio_allowed_max_sectors() with
    this new routine to decide a more proper split bio length.
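
    The difference between the two limits can be sketched in userspace C.
    This assumes both helpers round UINT_MAX (a byte count) down to a
    power-of-two unit and shift by 9 to get sectors, which is consistent
    with the 8388607 and 8386560 figures quoted in this message; the
    function names here are illustrative, not the kernel's definitions.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9

/* Round x down to a multiple of y; y is assumed to be a power of two. */
uint32_t round_down_pow2(uint32_t x, uint32_t y)
{
	return x & ~(y - 1u);
}

/* Old limit: aligned only to the logical block size (bytes). */
uint32_t allowed_max_sectors(uint32_t logical_block_size)
{
	return round_down_pow2(UINT32_MAX, logical_block_size) >> SECTOR_SHIFT;
}

/* New limit: aligned to the discard granularity (bytes). */
uint32_t aligned_discard_max_sectors(uint32_t discard_granularity)
{
	return round_down_pow2(UINT32_MAX, discard_granularity) >> SECTOR_SHIFT;
}
```

    For 512-byte logical blocks the old limit is 8388607 sectors, which is
    not a multiple of 2048. For a 1MB (2048-sector) granularity the new
    limit is 8386560 sectors, which matches the size of the deallocated
    extents in the sg_get_lba_status listing.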

    But we still need to handle the case where the discard start LBA is not
    aligned to q->limits.discard_granularity; otherwise, even if the length
    is aligned, the current code may still leave a 2048-sector fragment
    around every 4GB range. Therefore, to calculate req_sects, the start LBA
    of the discard range (including the partition offset) is checked first.
    If it is not aligned to the discard granularity, the first split
    location is chosen so that the following bio has bi_sector aligned to
    the discard granularity. Then there won't be still-mapped fragments in
    the middle of the discard range.
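
    The start-LBA handling described above can be sketched as a helper that
    picks the length of the next split bio. This is a userspace
    approximation under the assumption that the granularity is a power of
    two of at least 512 bytes; next_discard_sectors() and its parameters
    are hypothetical names.

```c
#include <stdint.h>

#define SECTOR_SHIFT 9

/* Round x up to a multiple of y; y is assumed to be a power of two. */
uint64_t round_up_pow2(uint64_t x, uint64_t y)
{
	return (x + y - 1) & ~(y - 1);
}

/*
 * Number of sectors for the next split bio.
 * sector is relative to the partition, part_offset is the partition start.
 */
uint64_t next_discard_sectors(uint64_t sector, uint64_t nr_sects,
			      uint64_t part_offset, uint32_t granularity_bytes)
{
	uint64_t gran_sects = granularity_bytes >> SECTOR_SHIFT;
	uint64_t mapped = sector + part_offset;
	uint64_t aligned = round_up_pow2(mapped, gran_sects);
	uint64_t limit;

	if (aligned == mapped)
		/* Already aligned: use the granularity-aligned maximum. */
		limit = (UINT32_MAX & ~(granularity_bytes - 1u)) >> SECTOR_SHIFT;
	else
		/* Unaligned: stop at the next aligned LBA. */
		limit = aligned - mapped;

	return nr_sects < limit ? nr_sects : limit;
}
```

    An unaligned start therefore yields one short bio that ends on a
    granularity boundary, after which every following bio starts aligned.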

    The above is how this patch improves discard bio alignment in
    __blkdev_issue_discard(). With this patch, after a discard with the
    same command line mentioned previously, sg_get_lba_status returns,
    descriptor LBA: 0x0000000000000000 blocks: 106954752 deallocated
    descriptor LBA: 0x0000000006600000 blocks: 0 deallocated

    We can see there is no 2048-sector segment anymore, everything is clean.

    Reported-and-tested-by: Acshai Manoj
    Signed-off-by: Coly Li
    Reviewed-by: Hannes Reinecke
    Reviewed-by: Ming Lei
    Reviewed-by: Xiao Ni
    Cc: Bart Van Assche
    Cc: Christoph Hellwig
    Cc: Enzo Matsumiya
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Coly Li
     

14 Nov, 2018

1 commit

  • A discard cleanup merged into 4.20-rc2 causes fstests xfs/259 to
    fall into an endless loop in the discard code. The test is creating
    a device that is exactly 2^32 sectors in size to test mkfs boundary
    conditions around the 32 bit sector overflow region.

    mkfs issues a discard for the entire device size by default, and
    hence this throws a sector count of 2^32 into
    blkdev_issue_discard(). It takes the number of sectors to discard as
    a sector_t - a 64 bit value.

    The commit ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
    takes this sector count and casts it to a 32 bit value before
    comparing it against the maximum allowed discard size the device
    has. This truncates away the upper 32 bits, and so if the lower 32
    bits of the sector count are zero, it starts issuing discards of
    length 0. This causes the code to fall into an endless loop, issuing
    zero length discards over and over again on the same sector.
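
    The truncation can be illustrated with a small userspace sketch. The
    exact kernel expression is not shown in this log, so the two helpers
    below are hypothetical stand-ins for "cast to 32 bits, then take the
    minimum" versus "take the minimum in 64 bits".

```c
#include <stdint.h>

/* Buggy variant: the 64-bit sector count is truncated before the min(). */
uint32_t req_sects_truncated(uint64_t nr_sects, uint32_t max_sectors)
{
	uint32_t n = (uint32_t)nr_sects;	/* upper 32 bits are lost here */

	return n < max_sectors ? n : max_sectors;
}

/* Fixed variant: the comparison is done on the full 64-bit value. */
uint64_t req_sects_wide(uint64_t nr_sects, uint32_t max_sectors)
{
	return nr_sects < max_sectors ? nr_sects : max_sectors;
}
```

    For a device of exactly 2^32 sectors the truncated variant yields a
    request length of 0, so the remaining sector count never decreases.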

    Fixes: ba5d73851e71 ("block: cleanup __blkdev_issue_discard")
    Tested-by: Darrick J. Wong
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Killed pointless WARN_ON().

    Signed-off-by: Jens Axboe

    Dave Chinner
     

09 Nov, 2018

3 commits

  • Obviously the created writesame bio has to be aligned with logical block
    size, and use bio_allowed_max_sectors() to retrieve this number.

    Cc: stable@vger.kernel.org
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Fixes: b49a0871be31a745b2ef ("block: remove split code in blkdev_issue_{discard,write_same}")
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Cleanup __blkdev_issue_discard() a bit:

    - remove local variable of 'end_sect'
    - remove code block of 'fail'

    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     
  • Obviously the created discard bio has to be aligned with logical block size.

    This patch introduces the helper of bio_allowed_max_sectors() for
    this purpose.

    Cc: stable@vger.kernel.org
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Cc: Mariusz Dabrowski
    Fixes: 744889b7cbb56a6 ("block: don't deal with discard limit in blkdev_issue_discard()")
    Fixes: a22c4d7e34402cc ("block: re-add discard_granularity and alignment checks")
    Reported-by: Rui Salvaterra
    Tested-by: Rui Salvaterra
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

26 Oct, 2018

1 commit

  • There is no need to synchronously execute all REQ_OP_ZONE_RESET BIOs
    necessary to reset a range of zones. Similarly to what is done for
    discard BIOs in blk-lib.c, all zone reset BIOs can be chained and
    executed asynchronously and a synchronous call done only for the last
    BIO of the chain.

    Modify blkdev_reset_zones() to operate similarly to
    blkdev_issue_discard() using the next_bio() helper for chaining BIOs. To
    avoid code duplication of that function in blk_zoned.c, rename
    next_bio() into blk_next_bio() and declare it as a block internal
    function in blk.h.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Damien Le Moal
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

18 Oct, 2018

1 commit

  • blk_queue_split() does respect this limit via bio splitting, so no
    need to do that in blkdev_issue_discard(), then we can align to
    normal bio submit(bio_add_page() & submit_bio()).

    More importantly, this patch fixes one issue introduced in a22c4d7e34402cc
    ("block: re-add discard_granularity and alignment checks"), in which
    zero discard bio may be generated in case of zero alignment.

    Fixes: a22c4d7e34402ccdf3 ("block: re-add discard_granularity and alignment checks")
    Cc: stable@vger.kernel.org
    Cc: Ming Lin
    Cc: Mike Snitzer
    Cc: Christoph Hellwig
    Cc: Xiao Ni
    Tested-by: Mariusz Dabrowski
    Signed-off-by: Ming Lei
    Signed-off-by: Jens Axboe

    Ming Lei
     

09 Jul, 2018

1 commit

  • If __blkdev_issue_discard is in progress and a device mapper device is
    reloaded with a table that doesn't support discard,
    q->limits.max_discard_sectors is set to zero. This results in infinite
    loop in __blkdev_issue_discard.

    This patch checks if max_discard_sectors is zero and aborts with
    -EOPNOTSUPP.

    Signed-off-by: Mikulas Patocka
    Tested-by: Zdenek Kabelac
    Cc: stable@vger.kernel.org
    Signed-off-by: Jens Axboe

    Mikulas Patocka
     

09 May, 2018

1 commit


19 Jan, 2018

1 commit

  • Similar to blkdev_write_iter(), return -EPERM if the partition is
    read-only. This covers ioctl(), fallocate() and most in-kernel users
    but isn't meant to be exhaustive -- everything else will be caught in
    generic_make_request_checks(), fail with -EIO and can be fixed later.

    Reviewed-by: Sagi Grimberg
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     

15 Nov, 2017

1 commit

  • Pull core block layer updates from Jens Axboe:
    "This is the main pull request for block storage for 4.15-rc1.

    Nothing out of the ordinary in here, and no API changes or anything
    like that. Just various new features for drivers, core changes, etc.
    In particular, this pull request contains:

    - A patch series from Bart, closing the hole on blk/scsi-mq queue
    quiescing.

    - A series from Christoph, building towards hidden gendisks (for
    multipath) and ability to move bio chains around.

    - NVMe
    - Support for native multipath for NVMe (Christoph).
    - Userspace notifications for AENs (Keith).
    - Command side-effects support (Keith).
    - SGL support (Chaitanya Kulkarni)
    - FC fixes and improvements (James Smart)
    - Lots of fixes and tweaks (Various)

    - bcache
    - New maintainer (Michael Lyle)
    - Writeback control improvements (Michael)
    - Various fixes (Coly, Elena, Eric, Liang, et al)

    - lightnvm updates, mostly centered around the pblk interface
    (Javier, Hans, and Rakesh).

    - Removal of unused bio/bvec kmap atomic interfaces (me, Christoph)

    - Writeback series that fix the much discussed hundreds of millions
    of sync-all units. This goes all the way, as discussed previously
    (me).

    - Fix for missing wakeup on writeback timer adjustments (Yafang
    Shao).

    - Fix laptop mode on blk-mq (me).

    - {mq,name} tuple lookup for IO schedulers, allowing us to have
    alias names. This means you can use 'deadline' on both !mq and on
    mq (where it's called mq-deadline). (me).

    - blktrace race fix, oopsing on sg load (me).

    - blk-mq optimizations (me).

    - Obscure waitqueue race fix for kyber (Omar).

    - NBD fixes (Josef).

    - Disable writeback throttling by default on bfq, like we do on cfq
    (Luca Miccio).

    - Series from Ming that enable us to treat flush requests on blk-mq
    like any other request. This is a really nice cleanup.

    - Series from Ming that improves merging on blk-mq with schedulers,
    getting us closer to flipping the switch on scsi-mq again.

    - BFQ updates (Paolo).

    - blk-mq atomic flags memory ordering fixes (Peter Z).

    - Loop cgroup support (Shaohua).

    - Lots of minor fixes from lots of different folks, both for core and
    driver code"

    * 'for-4.15/block' of git://git.kernel.dk/linux-block: (294 commits)
    nvme: fix visibility of "uuid" ns attribute
    blk-mq: fixup some comment typos and lengths
    ide: ide-atapi: fix compile error with defining macro DEBUG
    blk-mq: improve tag waiting setup for non-shared tags
    brd: remove unused brd_mutex
    blk-mq: only run the hardware queue if IO is pending
    block: avoid null pointer dereference on null disk
    fs: guard_bio_eod() needs to consider partitions
    xtensa/simdisk: fix compile error
    nvme: expose subsys attribute to sysfs
    nvme: create 'slaves' and 'holders' entries for hidden controllers
    block: create 'slaves' and 'holders' entries for hidden gendisks
    nvme: also expose the namespace identification sysfs files for mpath nodes
    nvme: implement multipath access to nvme subsystems
    nvme: track shared namespaces
    nvme: introduce a nvme_ns_ids structure
    nvme: track subsystems
    block, nvme: Introduce blk_mq_req_flags_t
    block, scsi: Make SCSI quiesce and resume work reliably
    block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
    ...

    Linus Torvalds
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

    Update the files which contain no license information with the 'GPL-2.0'
    SPDX license identifier. The SPDX identifier is a legally binding
    shorthand, which can be used instead of the full boiler plate text.

    This patch is based on work done by Thomas Gleixner and Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

    Patches were generated and checked against linux-4.14-rc6 for a subset of
    the use cases:
    - file had no licensing information in it.
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

    The analysis to determine which SPDX License Identifier to be applied to
    a file was done in a spreadsheet of side by side results from the
    output of two independent scanners (ScanCode & Windriver) producing SPDX
    tag:value files created by Philippe Ombredanne. Philippe prepared the
    base worksheet, and did an initial spot review of a few 1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

    Criteria used to select files for SPDX license identifier tagging was:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained >5
    lines of source
    - File already had some variant of a license header in it (even if
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

26 Oct, 2017

2 commits

  • sd_config_write_same() ignores ->max_ws_blocks == 0 and resets it to
    permit trying WRITE SAME on older SCSI devices, unless ->no_write_same
    is set. Because REQ_OP_WRITE_ZEROES is implemented in terms of WRITE
    SAME, blkdev_issue_zeroout() may fail with -EREMOTEIO:

    $ fallocate -zn -l 1k /dev/sdg
    fallocate: fallocate failed: Remote I/O error
    $ fallocate -zn -l 1k /dev/sdg # OK
    $ fallocate -zn -l 1k /dev/sdg # OK

    The following calls succeed because sd_done() sets ->no_write_same in
    response to a sense that would become BLK_STS_TARGET/-EREMOTEIO, causing
    __blkdev_issue_zeroout() to fall back to generating ZERO_PAGE bios.

    This means blkdev_issue_zeroout() must cope with WRITE ZEROES failing
    and fall back to manually zeroing, unless BLKDEV_ZERO_NOFALLBACK is
    specified. For BLKDEV_ZERO_NOFALLBACK case, return -EOPNOTSUPP if
    sd_done() has just set ->no_write_same thus indicating lack of offload
    support.

    Fixes: c20cfc27a473 ("block: stop using blkdev_issue_write_same for zeroing")
    Cc: Hannes Reinecke
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     
  • blkdev_issue_zeroout() will use this in !BLKDEV_ZERO_NOFALLBACK case.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Jens Axboe

    Ilya Dryomov
     

11 Sep, 2017

1 commit


24 Aug, 2017

1 commit

  • This way we don't need a block_device structure to submit I/O. The
    block_device has different life time rules from the gendisk and
    request_queue and is usually only available when the block device node
    is open. Other callers need to explicitly create one (e.g. the lightnvm
    passthrough code, or the new nvme multipathing code).

    For the actual I/O path all that we need is the gendisk, which exists
    once per block device. But given that the block layer also does
    partition remapping we additionally need a partition index, which is
    used for said remapping in generic_make_request.

    Note that all the block drivers generally want request_queue or
    sometimes the gendisk, so this removes a layer of indirection all
    over the stack.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

06 Jul, 2017

1 commit

  • The BIO issuing loop in __blkdev_issue_zeroout() is allocating BIOs
    with a maximum number of bvec (pages) equal to

    min(nr_sects, (sector_t)BIO_MAX_PAGES)

    This works since the requested number of bvecs will always be limited
    to the absolute maximum number supported (BIO_MAX_PAGES), but this is
    inefficient as too many bvec entries may be requested due to the
    different units being used in the min() operation (number of sectors vs
    number of pages).

    To fix this, introduce the helper __blkdev_sectors_to_bio_pages() to
    correctly calculate the number of bvecs for zeroout BIOs as the issuing
    loop progresses. The calculation is done using consistent units and
    makes sure that the number of pages returned is at least 1 (for cases
    where the number of sectors is less than the number of sectors in
    a page).

    Also remove a trailing space after the bit shift in the internal loop
    min() call.
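
    A userspace sketch of the unit-consistent calculation, assuming 4096
    byte pages, 512 byte sectors and a BIO_MAX_PAGES of 256. The constants
    and the name sectors_to_bio_pages() are assumptions for illustration;
    the kernel helper may differ in detail.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumed page size in bytes */
#define SKETCH_SECTOR_SIZE	512u	/* assumed sector size in bytes */
#define SKETCH_BIO_MAX_PAGES	256u	/* assumed per-bio bvec limit */

/* Convert a sector count into the number of pages (bvecs) to request. */
unsigned int sectors_to_bio_pages(uint64_t nr_sects)
{
	uint64_t sects_per_page = SKETCH_PAGE_SIZE / SKETCH_SECTOR_SIZE;
	/* Round up so that a partial page still gets one bvec. */
	uint64_t pages = (nr_sects + sects_per_page - 1) / sects_per_page;

	return pages < SKETCH_BIO_MAX_PAGES ?
		(unsigned int)pages : SKETCH_BIO_MAX_PAGES;
}
```

    Rounding up is what guarantees at least one page for any non-zero
    sector count smaller than a page.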

    Signed-off-by: Damien Le Moal
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Damien Le Moal
     

09 Apr, 2017

6 commits


25 Mar, 2017

1 commit


07 Feb, 2017

1 commit

  • Write Same can return an error asynchronously if it turns out the
    underlying SCSI device does not support Write Same, which makes a
    proper fallback to other methods in __blkdev_issue_zeroout impossible.
    Thus only issue a Write Same from blkdev_issue_zeroout and don't try it
    at all from __blkdev_issue_zeroout as a non-invasive workaround.

    Signed-off-by: Christoph Hellwig
    Reported-by: Junichi Nomura
    Fixes: e73c23ff ("block: add async variant of blkdev_issue_zeroout")
    Tested-by: Junichi Nomura
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

14 Jan, 2017

1 commit

  • Discard can return -EIO asynchronously if the alignment for the request
    isn't suitable for the driver, which makes a proper fallback to other
    methods in __blkdev_issue_zeroout impossible. Thus only issue a sync
    discard from blkdev_issue_zeroout and don't try discard at all from
    __blkdev_issue_zeroout as a non-invasive workaround.

    One more reason why abusing discard for zeroing must die..

    Signed-off-by: Christoph Hellwig
    Reported-by: Eryu Guan
    Fixes: e73c23ff ("block: add async variant of blkdev_issue_zeroout")
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Dec, 2016

1 commit

  • Instead of allocating a single unused biovec for discard requests, send
    them down without any payload. Instead we allow the driver to add a
    "special" payload using a biovec embedded into struct request (unioned
    over other fields never used while in the driver), and overloading
    the number of segments for this case.

    This has a couple of advantages:

    - we don't have to allocate the bio_vec
    - the amount of special casing for discard requests in the block
    layer is significantly reduced
    - using this same scheme for other request types is trivial,
    which will be important for implementing the new WRITE_ZEROES
    op on devices where it actually requires a payload (e.g. SCSI)
    - we can get rid of playing games with the request length, as
    we'll never touch it and completions will work just fine
    - it will allow us to support ranged discard operations in the
    future by merging non-contiguous discard bios into a single
    request
    - last but not least it removes a lot of code

    This patch is the common base for my WIP series for ranged discards and to
    remove discard_zeroes_data in favor of always using REQ_OP_WRITE_ZEROES,
    so it would be good to get it in quickly.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

01 Dec, 2016

2 commits

  • This adds a new block layer operation to zero out a range of
    LBAs. This allows to implement zeroing for devices that don't use
    either discard with a predictable zero pattern or WRITE SAME of zeroes.
    The prominent example of that is NVMe with the Write Zeroes command,
    but in the future, this should also help with improving the way
    zeroing discards work. For this operation, suitable entry is exported in
    sysfs which indicate the number of maximum bytes allowed in one
    write zeroes operation by the device.

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     
  • Similar to __blkdev_issue_discard this variant allows submitting
    the final bio asynchronously and chaining multiple ranges
    into a single completion.

    Signed-off-by: Chaitanya Kulkarni
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Chaitanya Kulkarni
     

28 Oct, 2016

1 commit

  • Now that we don't need the common flags to overflow outside the range
    of a 32-bit type we can encode them the same way for both the bio and
    request fields. This in addition allows us to place the operation
    first (and make some room for more ops while we're at it) and to
    stop having to shift around the operation values.

    In addition this allows passing around only one value in the block layer
    instead of two (and eventually also in the file systems, but we can do
    that later) and thus clean up a lot of code.

    Last but not least this allows decreasing the size of the cmd_flags
    field in struct request to 32-bits. Various functions passing this
    value could also be updated, but I'd like to avoid the churn for now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

12 Oct, 2016

1 commit

  • Make sure that the offset and length arguments that we're using to
    construct WRITE SAME and DISCARD requests are actually aligned to the
    logical block size. Failure to do this causes other errors in other parts
    of the block layer or the SCSI layer because disks don't support partial
    logical block writes.
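
    The alignment check can be sketched as a single mask test on the start
    sector and the length, both expressed in 512 byte sectors. This assumes
    the logical block size is a power of two of at least 512 bytes;
    check_lbs_alignment() is a hypothetical name and -EINVAL is an assumed
    error code, since this message does not state which one is returned.

```c
#include <errno.h>
#include <stdint.h>

/* Return 0 if offset and length are whole logical blocks, else -EINVAL. */
int check_lbs_alignment(uint64_t sector, uint64_t nr_sects,
			uint32_t logical_block_size)
{
	/* Mask of the sector bits that must be zero for alignment. */
	uint64_t bs_mask = (logical_block_size >> 9) - 1;

	if ((sector | nr_sects) & bs_mask)
		return -EINVAL;
	return 0;
}
```

    OR-ing the two values first lets one test reject a misaligned start, a
    misaligned length, or both.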

    Link: http://lkml.kernel.org/r/147518379026.22791.4437508871355153928.stgit@birch.djwong.org
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Bart Van Assche
    Reviewed-by: Martin K. Petersen
    Reviewed-by: Hannes Reinecke
    Cc: Theodore Ts'o
    Cc: Mike Snitzer # tweaked header
    Cc: Brian Foster
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

27 Jul, 2016

2 commits

  • Pull block driver updates from Jens Axboe:
    "This branch also contains core changes. I've come to the conclusion
    that from 4.9 and forward, I'll be doing just a single branch. We
    often have dependencies between core and drivers, and it's hard to
    always split them up appropriately without pulling core into drivers
    when that happens.

    That said, this contains:

    - separate secure erase type for the core block layer, from
    Christoph.

    - set of discard fixes, from Christoph.

    - bio shrinking fixes from Christoph, as a follow-up to the
    op/flags change in the core branch.

    - map and append request fixes from Christoph.

    - NVMeF (NVMe over Fabrics) code from Christoph. This is pretty
    exciting!

    - nvme-loop fixes from Arnd.

    - removal of ->driverfs_dev from Dan, after providing a
    device_add_disk() helper.

    - bcache fixes from Bhaktipriya and Yijing.

    - cdrom subchannel read fix from Vchannaiah.

    - set of lightnvm updates from Wenwei, Matias, Johannes, and Javier.

    - set of drbd updates and fixes from Fabian, Lars, and Philipp.

    - mg_disk error path fix from Bart.

    - user notification for failed device add for loop, from Minfei.

    - NVMe in general:
    + NVMe delay quirk from Guilherme.
    + SR-IOV support and command retry limits from Keith.
    + fix for memory-less NUMA node from Masayoshi.
    + use UINT_MAX for discard sectors, from Minfei.
    + cancel IO fixes from Ming.
    + don't allocate unused major, from Neil.
    + error code fixup from Dan.
    + use constants for PSDT/FUSE from James.
    + variable init fix from Jay.
    + fabrics fixes from Ming, Sagi, and Wei.
    + various fixes"

    * 'for-4.8/drivers' of git://git.kernel.dk/linux-block: (115 commits)
    nvme/pci: Provide SR-IOV support
    nvme: initialize variable before logical OR'ing it
    block: unexport various bio mapping helpers
    scsi/osd: open code blk_make_request
    target: stop using blk_make_request
    block: simplify and export blk_rq_append_bio
    block: ensure bios return from blk_get_request are properly initialized
    virtio_blk: use blk_rq_map_kern
    memstick: don't allow REQ_TYPE_BLOCK_PC requests
    block: shrink bio size again
    block: simplify and cleanup bvec pool handling
    block: get rid of bio_rw and READA
    block: don't ignore -EOPNOTSUPP blkdev_issue_write_same
    block: introduce BLKDEV_DISCARD_ZERO to fix zeroout
    NVMe: don't allocate unused nvme_major
    nvme: avoid crashes when node 0 is memoryless node.
    nvme: Limit command retries
    loop: Make user notify for adding loop device failed
    nvme-loop: fix nvme-loop Kconfig dependencies
    nvmet: fix return value check in nvmet_subsys_alloc()
    ...

    Linus Torvalds
     
  • Pull core block updates from Jens Axboe:

    - the big change is the cleanup from Mike Christie, cleaning up our
    uses of command types and modifier flags. This is what will throw
    some merge conflicts

    - regression fix for the above for btrfs, from Vincent

    - following up to the above, better packing of struct request from
    Christoph

    - a 2038 fix for blktrace from Arnd

    - a few trivial/spelling fixes from Bart Van Assche

    - a front merge check fix from Damien, which could cause issues on
    SMR drives

    - Atari partition fix from Gabriel

    - convert cfq to highres timers, since jiffies isn't granular enough
    for some devices these days. From Jan and Jeff

    - CFQ priority boost fix for idle classes, from me

    - cleanup series from Ming, improving our bio/bvec iteration

    - a direct issue fix for blk-mq from Omar

    - fix for plug merging not involving the IO scheduler, like we do for
    other types of merges. From Tahsin

    - expose DAX type internally and through sysfs. From Toshi and Yigal

    * 'for-4.8/core' of git://git.kernel.dk/linux-block: (76 commits)
    block: Fix front merge check
    block: do not merge requests without consulting with io scheduler
    block: Fix spelling in a source code comment
    block: expose QUEUE_FLAG_DAX in sysfs
    block: add QUEUE_FLAG_DAX for devices to advertise their DAX support
    Btrfs: fix comparison in __btrfs_map_block()
    block: atari: Return early for unsupported sector size
    Doc: block: Fix a typo in queue-sysfs.txt
    cfq-iosched: Charge at least 1 jiffie instead of 1 ns
    cfq-iosched: Fix regression in bonnie++ rewrite performance
    cfq-iosched: Convert slice_resid from u64 to s64
    block: Convert fifo_time from ulong to u64
    blktrace: avoid using timespec
    block/blk-cgroup.c: Declare local symbols static
    block/bio-integrity.c: Add #include "blk.h"
    block/partition-generic.c: Remove a set-but-not-used variable
    block: bio: kill BIO_MAX_SIZE
    cfq-iosched: temporarily boost queue priority for idle classes
    block: drbd: avoid to use BIO_MAX_SIZE
    block: bio: remove BIO_MAX_SECTORS
    ...

    Linus Torvalds
     

21 Jul, 2016

2 commits

  • WRITE SAME is a data integrity operation and we can't simply ignore
    errors.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     
  • Currently blkdev_issue_zeroout cascades down from discards (if the driver
    guarantees that discards zero data), to WRITE SAME and then to a loop
    writing zeroes. Unfortunately we ignore run-time EOPNOTSUPP errors in the
    block layer blkdev_issue_discard helper to work around DM volumes that
    may have mixed discard support underneath.

    This patch introduces a new BLKDEV_DISCARD_ZERO flag to
    blkdev_issue_discard that indicates we are called for a zeroing
    operation. This lets us both bypass the EOPNOTSUPP hack and consolidate
    the discard_zeroes_data check into the function.
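
    The cascade described above can be sketched as a small user-space
    model. This is only an illustration, not the kernel code: struct
    toy_dev, the toy_* functions and TOY_DISCARD_ZERO are hypothetical
    stand-ins for the block device, the blkdev_issue_* helpers and
    BLKDEV_DISCARD_ZERO. It assumes a plain discard swallows a run-time
    EOPNOTSUPP, while a discard issued for zeroing reports it so the
    caller can fall back.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical device model; not a kernel structure. */
struct toy_dev {
    bool discard_zeroes_data;   /* driver guarantees discards read back as zeroes */
    bool discard_fails_runtime; /* e.g. a stacked volume with mixed discard support */
    bool write_same_supported;
    unsigned char data[64];
};

enum zero_method { ZERO_BY_DISCARD, ZERO_BY_WRITE_SAME, ZERO_BY_WRITES };

#define TOY_DISCARD_ZERO 0x1u /* stand-in for BLKDEV_DISCARD_ZERO */

static int toy_issue_discard(struct toy_dev *d, size_t off, size_t len,
                             unsigned int flags)
{
    assert(off <= sizeof(d->data) && len <= sizeof(d->data) - off);

    /* The discard_zeroes_data check lives inside the helper. */
    if ((flags & TOY_DISCARD_ZERO) && !d->discard_zeroes_data)
        return -EOPNOTSUPP;

    /* Plain discards hide the run-time error; zeroing callers must see it. */
    if (d->discard_fails_runtime)
        return (flags & TOY_DISCARD_ZERO) ? -EOPNOTSUPP : 0;

    if (d->discard_zeroes_data)
        memset(d->data + off, 0, len);
    return 0;
}

static int toy_issue_write_same(struct toy_dev *d, size_t off, size_t len)
{
    assert(off <= sizeof(d->data) && len <= sizeof(d->data) - off);

    if (!d->write_same_supported)
        return -EOPNOTSUPP;
    memset(d->data + off, 0, len);
    return 0;
}

/* Try discard, then WRITE SAME, then a plain loop writing zeroes. */
static int toy_issue_zeroout(struct toy_dev *d, size_t off, size_t len,
                             enum zero_method *used)
{
    size_t i;

    if (toy_issue_discard(d, off, len, TOY_DISCARD_ZERO) == 0) {
        *used = ZERO_BY_DISCARD;
        return 0;
    }
    if (toy_issue_write_same(d, off, len) == 0) {
        *used = ZERO_BY_WRITE_SAME;
        return 0;
    }
    for (i = 0; i < len; i++)
        d->data[off + i] = 0;
    *used = ZERO_BY_WRITES;
    return 0;
}
```

    In this model, a device that advertises discard_zeroes_data but
    fails discards at run time still ends up zeroed, because the
    zeroing caller sees the error and moves on to the next method.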

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Christoph Hellwig
     

09 Jun, 2016

1 commit


08 Jun, 2016

2 commits

  • This converts the block issue discard helper and users to use
    the bio_set_op_attrs accessor and only pass in the operation flags
    like REQ_SECURE.

    Signed-off-by: Mike Christie
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie
     
  • This patch converts the simple bi_rw use cases in the block,
    drivers, mm and fs code to set/get the bio operation using
    bio_set_op_attrs/bio_op.

    These should be simple one- or two-line cases, so I just did them
    in one patch. The next patches handle the more complicated
    cases in a module per patch.
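
    For readers unfamiliar with the accessor pattern, here is a minimal
    user-space sketch of keeping the operation and its flags in one
    word behind set/get helpers. Every name here (struct toy_bio,
    toy_bio_set_op_attrs, toy_bio_op, the TOY_* constants) is
    hypothetical, and the bit layout (operation in the top 3 bits) is
    an assumption made for illustration, not taken from this patch.

```c
#include <assert.h>
#include <limits.h>

/* Assumed layout: the operation sits in the top bits, flags below it. */
#define TOY_OP_BITS   3
#define TOY_OP_SHIFT  (sizeof(unsigned int) * CHAR_BIT - TOY_OP_BITS)
#define TOY_FLAG_MASK ((1u << TOY_OP_SHIFT) - 1)

enum toy_op { TOY_OP_READ = 0, TOY_OP_WRITE = 1, TOY_OP_DISCARD = 2 };

#define TOY_REQ_SYNC   (1u << 0)
#define TOY_REQ_SECURE (1u << 1)

struct toy_bio {
    unsigned int opf; /* operation and flags packed together */
};

/* Set the operation and its flags in one call. */
static void toy_bio_set_op_attrs(struct toy_bio *bio, enum toy_op op,
                                 unsigned int flags)
{
    assert((unsigned int)op < (1u << TOY_OP_BITS));
    assert((flags & ~TOY_FLAG_MASK) == 0);
    bio->opf = ((unsigned int)op << TOY_OP_SHIFT) | flags;
}

/* Read back only the operation. */
static enum toy_op toy_bio_op(const struct toy_bio *bio)
{
    return (enum toy_op)(bio->opf >> TOY_OP_SHIFT);
}

/* Read back only the flags. */
static unsigned int toy_bio_flags(const struct toy_bio *bio)
{
    return bio->opf & TOY_FLAG_MASK;
}
```

    Callers that used to test bits in a combined field would instead
    compare toy_bio_op() against an operation value and check flags
    separately, which is the kind of one- or two-line conversion the
    message describes.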

    Signed-off-by: Mike Christie
    Reviewed-by: Hannes Reinecke
    Signed-off-by: Jens Axboe

    Mike Christie