22 Oct, 2015

7 commits

  • The libnvidmm-btt and nvme drivers use blk_integrity to reserve space
    for per-sector metadata, but sometimes without protection checksums.
    This property is generically useful, so teach the block core to
    internally specify a nop profile if one is not provided at registration
    time.

    Cc: Keith Busch
    Cc: Matthew Wilcox
    Suggested-by: Christoph Hellwig
    [hch: kill the local nvme nop profile as well]
    Acked-by: Martin K. Petersen
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • A trace like the following proceeds a crash in bio_integrity_process()
    when it goes to use an already freed blk_integrity profile.

    BUG: unable to handle kernel paging request at ffff8800d31b10d8
    IP: [] 0xffff8800d31b10d8
    PGD 2f65067 PUD 21fffd067 PMD 80000000d30001e3
    Oops: 0011 [#1] SMP
    Dumping ftrace buffer:
    ---------------------------------
    ndctl-2222 2.... 44526245us : disk_release: pmem1s
    systemd--2223 4.... 44573945us : bio_integrity_endio: pmem1s
    -409 4.... 44574005us : bio_integrity_process: pmem1s
    ---------------------------------
    [..]
    Call Trace:
    [] ? bio_integrity_process+0x159/0x2d0
    [] bio_integrity_verify_fn+0x36/0x60
    [] process_one_work+0x1cc/0x4e0

    Given that a request_queue is pinned while i/o is in flight and that a
    gendisk is allowed to have a shorter lifetime, move blk_integrity to
    request_queue to satisfy requests arriving after the gendisk has been
    torn down.

    Cc: Christoph Hellwig
    Cc: Martin K. Petersen
    [martin: fix the CONFIG_BLK_DEV_INTEGRITY=n case]
    Tested-by: Ross Zwisler
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Dan Williams
     
  • Up until now the_integrity profile has been dynamically allocated and
    attached to struct gendisk after the disk has been made active.

    This causes problems because NVMe devices need to register the profile
    prior to the partition table being read due to a mandatory metadata
    buffer requirement. In addition, DM goes through hoops to deal with
    preallocating, but not initializing integrity profiles.

    Since the integrity profile is small (4 bytes + a pointer), Christoph
    suggested moving it to struct gendisk proper. This requires several
    changes:

    - Moving the blk_integrity definition to genhd.h.

    - Inlining blk_integrity in struct gendisk.

    - Removing the dynamic allocation code.

    - Adding helper functions which allow gendisk to set up and tear down
    the integrity sysfs dir when a disk is added/deleted.

    - Adding a blk_integrity_revalidate() callback for updating the stable
    pages bdi setting.

    - The calls that depend on whether a device has an integrity profile or
    not now key off of the bi->profile pointer.

    - Simplifying the integrity support routines in DM (Mike Snitzer).

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Mike Snitzer
    Cc: Dan Williams
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The size of the data interval was not exported in the sysfs integrity
    directory. Export it so that userland apps can tell whether the interval
    is different from the device's logical block size.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The per-device properties in the blk_integrity structure were previously
    unsigned short. However, most of the values fit inside a char. The only
    exception is the data interval size and we can work around that by
    storing it as a power of two.

    This cuts the size of the dynamic portion of blk_integrity in half.

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • We previously made a complete copy of a device's data integrity profile
    even though several of the fields inside the blk_integrity struct are
    pointers to fixed template entries in t10-pi.c.

    Split the static and per-device portions so that we can reference the
    template directly.

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Cc: Dan Williams
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The integrity kobject purely exists to support the integrity
    subdirectory in sysfs and doesn't really have anything to do with the
    blk_integrity data structure. Move the kobject to struct gendisk where
    it belongs.

    Signed-off-by: Martin K. Petersen
    Reported-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Dan Williams
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

11 Sep, 2015

1 commit

  • If a driver sets the block queue virtual boundary mask, it means that
    it cannot handle gaps so we must not allow those in the integrity
    payload as well.

    Signed-off-by: Sagi Grimberg

    Fixed up by me to have duplicate integrity merge functions, depending
    on whether block integrity is enabled or not. Fixes a compilations
    issue with CONFIG_BLK_DEV_INTEGRITY unset.

    Signed-off-by: Jens Axboe

    Sagi Grimberg
     

02 Jun, 2015

1 commit

  • With the planned cgroup writeback support, backing-dev related
    declarations will be more widely used across block and cgroup;
    unfortunately, including backing-dev.h from include/linux/blkdev.h
    makes cyclic include dependency quite likely.

    This patch separates out backing-dev-defs.h which only has the
    essential definitions and updates blkdev.h to include it. c files
    which need access to more backing-dev details now include
    backing-dev.h directly. This takes backing-dev.h off the common
    include dependency chain making it a lot easier to use it across block
    and cgroup.

    v2: fs/fat build failure fixed.

    Signed-off-by: Tejun Heo
    Reviewed-by: Jan Kara
    Cc: Jens Axboe
    Signed-off-by: Jens Axboe

    Tejun Heo
     

27 Sep, 2014

5 commits

  • We'd occasionally merge requests with conflicting integrity flags.
    Introduce a merge helper which checks that the requests have compatible
    integrity payloads.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • So far we have relied on the app tag size to determine whether a disk
    has been formatted with T10 protection information or not. However, not
    all target devices provide application tag storage.

    Add a flag to the block integrity profile that indicates whether the
    disk has been formatted with protection information.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • Add a BLK_ prefix to the integrity profile flags. Also rename the flags
    to be more consistent with the generate/verify terminology in the rest
    of the integrity code.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • The protection interval is not necessarily tied to the logical block
    size of a block device. Stop using the terms "sector" and "sectors".

    Going forward we will use the term "seed" to describe the initial
    reference tag value for a given I/O. "Interval" will be used to describe
    the portion of the data buffer that a given piece of protection
    information is associated with.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     
  • None of the filesystems appear interested in using the integrity tagging
    feature. Potentially because very few storage devices actually permit
    using the application tag space.

    Remove the tagging functions.

    Signed-off-by: Martin K. Petersen
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Sagi Grimberg
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

24 Nov, 2013

1 commit

  • The bio integrity is also stored in a bvec array, so if we use the bvec
    iter code we just added, the integrity code won't need to implement its
    own iteration stuff (bio_integrity_mark_head(), bio_integrity_mark_tail())

    Signed-off-by: Kent Overstreet
    Cc: Jens Axboe
    Cc: "Martin K. Petersen"
    Cc: "James E.J. Bottomley"

    Kent Overstreet
     

20 Mar, 2013

1 commit


22 Feb, 2013

1 commit

  • This patchset ("stable page writes, part 2") makes some key
    modifications to the original 'stable page writes' patchset. First, it
    provides creators (devices and filesystems) of a backing_dev_info a flag
    that declares whether or not it is necessary to ensure that page
    contents cannot change during writeout. It is no longer assumed that
    this is true of all devices (which was never true anyway). Second, the
    flag is used to relaxed the wait_on_page_writeback calls so that wait
    only occurs if the device needs it. Third, it fixes up the remaining
    disk-backed filesystems to use this improved conditional-wait logic to
    provide stable page writes on those filesystems.

    It is hoped that (for people not using checksumming devices, anyway)
    this patchset will give back unnecessary performance decreases since the
    original stable page write patchset went into 3.0. Sorry about not
    fixing it sooner.

    Complaints were registered by several people about the long write
    latencies introduced by the original stable page write patchset.
    Generally speaking, the kernel ought to allocate as little extra memory
    as possible to facilitate writeout, but for people who simply cannot
    wait, a second page stability strategy is (re)introduced: snapshotting
    page contents. The waiting behavior is still the default strategy; to
    enable page snapshotting, a superblock flag (MS_SNAP_STABLE) must be
    set. This flag is used to bandaid^Henable stable page writeback on
    ext3[1], and is not used anywhere else.

    Given that there are already a few storage devices and network FSes that
    have rolled their own page stability wait/page snapshot code, it would
    be nice to move towards consolidating all of these. It seems possible
    that iscsi and raid5 may wish to use the new stable page write support
    to enable zero-copy writeout.

    Thank you to Jan Kara for helping fix a couple more filesystems.

    Per Andrew Morton's request, here are the result of using dbench to measure
    latencies on ext2:

    3.8.0-rc3:
    Operation Count AvgLat MaxLat
    ----------------------------------------
    WriteX 109347 0.028 59.817
    ReadX 347180 0.004 3.391
    Flush 15514 29.828 287.283

    Throughput 57.429 MB/sec 4 clients 4 procs max_latency=287.290 ms

    3.8.0-rc3 + patches:
    WriteX 105556 0.029 4.273
    ReadX 335004 0.005 4.112
    Flush 14982 30.540 298.634

    Throughput 55.4496 MB/sec 4 clients 4 procs max_latency=298.650 ms

    As you can see, for ext2 the maximum write latency decreases from ~60ms
    on a laptop hard disk to ~4ms. I'm not sure why the flush latencies
    increase, though I suspect that being able to dirty pages faster gives
    the flusher more work to do.

    On ext4, the average write latency decreases as well as all the maximum
    latencies:

    3.8.0-rc3:
    WriteX 85624 0.152 33.078
    ReadX 272090 0.010 61.210
    Flush 12129 36.219 168.260

    Throughput 44.8618 MB/sec 4 clients 4 procs max_latency=168.276 ms

    3.8.0-rc3 + patches:
    WriteX 86082 0.141 30.928
    ReadX 273358 0.010 36.124
    Flush 12214 34.800 165.689

    Throughput 44.9941 MB/sec 4 clients 4 procs max_latency=165.722 ms

    XFS seems to exhibit similar latency improvements as ext2:

    3.8.0-rc3:
    WriteX 125739 0.028 104.343
    ReadX 399070 0.005 4.115
    Flush 17851 25.004 131.390

    Throughput 66.0024 MB/sec 4 clients 4 procs max_latency=131.406 ms

    3.8.0-rc3 + patches:
    WriteX 123529 0.028 6.299
    ReadX 392434 0.005 4.287
    Flush 17549 25.120 188.687

    Throughput 64.9113 MB/sec 4 clients 4 procs max_latency=188.704 ms

    ...and btrfs, just to round things out, also shows some latency
    decreases:

    3.8.0-rc3:
    WriteX 67122 0.083 82.355
    ReadX 212719 0.005 2.828
    Flush 9547 47.561 147.418

    Throughput 35.3391 MB/sec 4 clients 4 procs max_latency=147.433 ms

    3.8.0-rc3 + patches:
    WriteX 64898 0.101 71.631
    ReadX 206673 0.005 7.123
    Flush 9190 47.963 219.034

    Throughput 34.0795 MB/sec 4 clients 4 procs max_latency=219.044 ms

    Before this patchset, all filesystems would block, regardless of whether
    or not it was necessary. ext3 would wait, but still generate occasional
    checksum errors. The network filesystems were left to do their own
    thing, so they'd wait too.

    After this patchset, all the disk filesystems except ext3 and btrfs will
    wait only if the hardware requires it. ext3 (if necessary) snapshots
    pages instead of blocking, and btrfs provides its own bdi so the mm will
    never wait. Network filesystems haven't been touched, so either they
    provide their own wait code, or they don't block at all. The blocking
    behavior is back to what it was before 3.0 if you don't have a disk
    requiring stable page writes.

    This patchset has been tested on 3.8.0-rc3 on x64 with ext3, ext4, and
    xfs. I've spot-checked 3.8.0-rc4 and seem to be getting the same
    results as -rc3.

    [1] The alternative fixes to ext3 include fixing the locking order and
    page bit handling like we did for ext4 (but then why not just use
    ext4?), or setting PG_writeback so early that ext3 becomes extremely
    slow. I tried that, but the number of write()s I could initiate dropped
    by nearly an order of magnitude. That was a bit much even for the
    author of the stable page series! :)

    This patch:

    Creates a per-backing-device flag that tracks whether or not pages must
    be held immutable during writeout. Eventually it will be used to waive
    wait_for_page_writeback() if nothing requires stable pages.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Jan Kara
    Cc: Adrian Hunter
    Cc: Andy Lutomirski
    Cc: Artem Bityutskiy
    Cc: Joel Becker
    Cc: Mark Fasheh
    Cc: Steven Whitehouse
    Cc: Jens Axboe
    Cc: Eric Van Hensbergen
    Cc: Ron Minnich
    Cc: Latchesar Ionkov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Darrick J. Wong
     

01 Nov, 2011

1 commit


06 Apr, 2011

1 commit

  • The current block integrity (DIF/DIX) support in DM is verifying that
    all devices' integrity profiles match during DM device resume (which
    is past the point of no return). To some degree that is unavoidable
    (stacked DM devices force this late checking). But for most DM
    devices (which aren't stacking on other DM devices) the ideal time to
    verify all integrity profiles match is during table load.

    Introduce the notion of an "initialized" integrity profile: a profile
    that was blk_integrity_register()'d with a non-NULL 'blk_integrity'
    template. Add blk_integrity_is_initialized() to allow checking if a
    profile was initialized.

    Update DM integrity support to:
    - check all devices with _initialized_ integrity profiles match
    during table load; uninitialized profiles (e.g. for underlying DM
    device(s) of a stacked DM device) are ignored.
    - disallow a table load that would result in an integrity profile that
    conflicts with a DM device's existing (in-use) integrity profile
    - avoid clearing an existing integrity profile
    - validate all integrity profiles match during resume; but if they
    don't all we can do is report the mismatch (during resume we're past
    the point of no return)

    Signed-off-by: Mike Snitzer
    Cc: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Mike Snitzer
     

15 Oct, 2010

1 commit


11 Sep, 2010

1 commit

  • Some controllers have a hardware limit on the number of protection
    information scatter-gather list segments they can handle.

    Introduce a max_integrity_segments limit in the block layer and provide
    a new scsi_host_template setting that allows HBA drivers to provide a
    value suitable for the hardware.

    Add support for honoring the integrity segment limit when merging both
    bios and requests.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the followings.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and try to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    wildly available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable easily on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

08 Mar, 2010

1 commit

  • Constify struct sysfs_ops.

    This is part of the ops structure constification
    effort started by Arjan van de Ven et al.

    Benefits of this constification:

    * prevents modification of data that is shared
    (referenced) by many other structure instances
    at runtime

    * detects/prevents accidental (but not intentional)
    modification attempts on archs that enforce
    read-only kernel data at runtime

    * potentially better optimized code as the compiler
    can assume that the const data cannot be changed

    * the compiler/linker move const data into .rodata
    and therefore exclude them from false sharing

    Signed-off-by: Emese Revfy
    Acked-by: David Teigland
    Acked-by: Matt Domsch
    Acked-by: Maciej Sosnowski
    Acked-by: Hans J. Koch
    Acked-by: Pekka Enberg
    Acked-by: Jens Axboe
    Acked-by: Stephen Hemminger
    Signed-off-by: Greg Kroah-Hartman

    Emese Revfy
     

28 Jul, 2009

1 commit


23 May, 2009

1 commit

  • Until now we have had a 1:1 mapping between storage device physical
    block size and the logical block sized used when addressing the device.
    With SATA 4KB drives coming out that will no longer be the case. The
    sector size will be 4KB but the logical block size will remain
    512-bytes. Hence we need to distinguish between the physical block size
    and the logical ditto.

    This patch renames hardsect_size to logical_block_size.

    Signed-off-by: Martin K. Petersen
    Signed-off-by: Jens Axboe

    Martin K. Petersen
     

30 Jan, 2009

1 commit


09 Oct, 2008

4 commits


03 Jul, 2008

3 commits