Commit c1c72b59941e2f5aad4b02609d7ee7b121734b8d

Authored by Martin K. Petersen
Committed by Jens Axboe
1 parent 7ba1ba12ee

block: Data integrity infrastructure documentation

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>

Showing 2 changed files with 361 additions and 0 deletions

Documentation/ABI/testing/sysfs-block
... ... @@ -26,4 +26,38 @@
26 26 I/O statistics of partition <part>. The format is the
27 27 same as the above-written /sys/block/<disk>/stat
28 28 format.
  29 +
  30 +
  31 +What: /sys/block/<disk>/integrity/format
  32 +Date: June 2008
  33 +Contact: Martin K. Petersen <martin.petersen@oracle.com>
  34 +Description:
  35 + Metadata format for integrity capable block device.
  36 + E.g. T10-DIF-TYPE1-CRC.
  37 +
  38 +
  39 +What: /sys/block/<disk>/integrity/read_verify
  40 +Date: June 2008
  41 +Contact: Martin K. Petersen <martin.petersen@oracle.com>
  42 +Description:
  43 + Indicates whether the block layer should verify the
  44 + integrity of read requests serviced by devices that
  45 + support sending integrity metadata.
  46 +
  47 +
  48 +What: /sys/block/<disk>/integrity/tag_size
  49 +Date: June 2008
  50 +Contact: Martin K. Petersen <martin.petersen@oracle.com>
  51 +Description:
  52 + Number of bytes of integrity tag space available per
  53 + 512 bytes of data.
  54 +
  55 +
  56 +What: /sys/block/<disk>/integrity/write_generate
  57 +Date: June 2008
  58 +Contact: Martin K. Petersen <martin.petersen@oracle.com>
  59 +Description:
  60 + Indicates whether the block layer should automatically
  61 + generate checksums for write requests bound for
  62 + devices that support receiving integrity metadata.
Documentation/block/data-integrity.txt
  1 +----------------------------------------------------------------------
  2 +1. INTRODUCTION
  3 +
  4 +Modern filesystems feature checksumming of data and metadata to
  5 +protect against data corruption. However, the detection of the
  6 +corruption is done at read time, which could potentially be months
  7 +after the data was written. At that point the original data that the
  8 +application tried to write is most likely lost.
  9 +
  10 +The solution is to ensure that the disk is actually storing what the
  11 +application meant it to. Recent additions to both the SCSI family of
  12 +protocols (SBC Data Integrity Field, SCC protection proposal) and
  13 +SATA/T13 (External Path Protection) try to remedy this by adding
  14 +support for appending integrity metadata to an I/O. The integrity
  15 +metadata (or protection information in SCSI terminology) includes a
  16 +checksum for each sector as well as an incrementing counter that
  17 +ensures the individual sectors are written in the right order and,
  18 +for some protection schemes, that the I/O is written to the right
  19 +place on disk.
  20 +
  21 +Current storage controllers and devices implement various protective
  22 +measures, for instance checksumming and scrubbing. But these
  23 +technologies are working in their own isolated domains or at best
  24 +between adjacent nodes in the I/O path. The interesting thing about
  25 +DIF and the other integrity extensions is that the protection format
  26 +is well defined and every node in the I/O path can verify the
  27 +integrity of the I/O and reject it if corruption is detected. This
  28 +allows not only corruption prevention but also isolation of the point
  29 +of failure.
  30 +
  31 +----------------------------------------------------------------------
  32 +2. THE DATA INTEGRITY EXTENSIONS
  33 +
  34 +As written, the protocol extensions only protect the path between
  35 +controller and storage device. However, many controllers actually
  36 +allow the operating system to interact with the integrity metadata
  37 +(IMD). We have been working with several FC/SAS HBA vendors to enable
  38 +the protection information to be transferred to and from their
  39 +controllers.
  40 +
  41 +The SCSI Data Integrity Field works by appending 8 bytes of protection
  42 +information to each sector. The data + integrity metadata is stored
  43 +in 520 byte sectors on disk. Data + IMD are interleaved when
  44 +transferred between the controller and target. The T13 proposal is
  45 +similar.
  46 +
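+For illustration only, the 8 bytes of protection information appended to
+each 512-byte sector are commonly laid out as sketched below; the struct
+and field names are descriptive and not taken from this patch:
+
+    #include <linux/types.h>
+
+    /* Illustrative layout of one 8-byte DIF tuple (big-endian) */
+    struct dif_tuple_example {
+            __be16 guard_tag;   /* checksum of the 512 data bytes */
+            __be16 app_tag;     /* tag space available to the owner */
+            __be32 ref_tag;     /* incrementing counter, typically the
+                                 * low 32 bits of the target LBA */
+    };
+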
  47 +Because it is highly inconvenient for operating systems to deal with
  48 +520 (and 4104) byte sectors, we approached several HBA vendors and
  49 +encouraged them to allow separation of the data and integrity metadata
  50 +scatter-gather lists.
  51 +
  52 +The controller will interleave the buffers on write and split them on
  53 +read. This means that Linux can DMA the data buffers to and from
  54 +host memory without changes to the page cache.
  55 +
  56 +Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs
  57 +is somewhat heavy to compute in software. Benchmarks found that
  58 +calculating this checksum had a significant impact on system
  59 +performance for a number of workloads. Some controllers allow a
  60 +lighter-weight checksum to be used when interfacing with the operating
  61 +system. Emulex, for instance, supports the TCP/IP checksum instead.
  62 +The IP checksum received from the OS is converted to the 16-bit CRC
  63 +when writing and vice versa. This allows the integrity metadata to be
  64 +generated by Linux or the application at very low cost (comparable to
  65 +software RAID5).
  66 +
  67 +The IP checksum is weaker than the CRC in terms of detecting bit
  68 +errors. However, the strength is really in the separation of the data
  69 +buffers and the integrity metadata. These two distinct buffers must
  70 +match up for an I/O to complete.
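+
+As a rough sketch of why the IP checksum is cheaper to compute: it is
+the standard 16-bit one's complement (Internet) checksum, which only
+requires additions.  The helper below is a generic illustration, not
+the kernel's implementation:
+
+    /* 16-bit one's complement checksum over one sector's data.
+     * Generic illustration only. */
+    static unsigned short example_ip_csum(const unsigned char *data,
+                                          unsigned int len)
+    {
+            unsigned long sum = 0;
+            unsigned int i;
+
+            for (i = 0; i + 1 < len; i += 2)        /* sum 16-bit words */
+                    sum += (data[i] << 8) | data[i + 1];
+            if (len & 1)                            /* trailing odd byte */
+                    sum += data[len - 1] << 8;
+            while (sum >> 16)                       /* fold the carries */
+                    sum = (sum & 0xffff) + (sum >> 16);
+            return (unsigned short)~sum;
+    }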
  71 +
  72 +The separation of the data and integrity metadata buffers as well as
  73 +the choice in checksums is referred to as the Data Integrity
  74 +Extensions. As these extensions are outside the scope of the protocol
  75 +bodies (T10, T13), Oracle and its partners are trying to standardize
  76 +them within the Storage Networking Industry Association.
  77 +
  78 +----------------------------------------------------------------------
  79 +3. KERNEL CHANGES
  80 +
  81 +The data integrity framework in Linux enables protection information
  82 +to be pinned to I/Os and sent to/received from controllers that
  83 +support it.
  84 +
  85 +The advantage to the integrity extensions in SCSI and SATA is that
  86 +they enable us to protect the entire path from application to storage
  87 +device. However, at the same time this is also the biggest
  88 +disadvantage. It means that the protection information must be in a
  89 +format that can be understood by the disk.
  90 +
  91 +Generally Linux/POSIX applications are agnostic to the intricacies of
  92 +the storage devices they are accessing. The virtual filesystem switch
  93 +and the block layer make things like hardware sector size and
  94 +transport protocols completely transparent to the application.
  95 +
  96 +However, this level of detail is required when preparing the
  97 +protection information to send to a disk. Consequently, the very
  98 +concept of an end-to-end protection scheme is a layering violation.
  99 +It is completely unreasonable for an application to be aware whether
  100 +it is accessing a SCSI or SATA disk.
  101 +
  102 +The data integrity support implemented in Linux attempts to hide this
  103 +from the application. As far as the application (and to some extent
  104 +the kernel) is concerned, the integrity metadata is opaque information
  105 +that's attached to the I/O.
  106 +
  107 +The current implementation allows the block layer to automatically
  108 +generate the protection information for any I/O. Eventually the
  109 +intent is to move the integrity metadata calculation to userspace for
  110 +user data. Metadata and other I/O that originates within the kernel
  111 +will still use the automatic generation interface.
  112 +
  113 +Some storage devices allow each hardware sector to be tagged with a
  114 +16-bit value. The owner of this tag space is the owner of the block
  115 +device, i.e. the filesystem in most cases. The filesystem can use
  116 +this extra space to tag sectors as it sees fit. Because the tag
  117 +space is limited, the block interface allows tagging bigger chunks by
  118 +way of interleaving. This way, 8*16 bits of information can be
  119 +attached to a typical 4KB filesystem block.
  120 +
  121 +This also means that applications such as fsck and mkfs will need
  122 +access to manipulate the tags from user space. A passthrough
  123 +interface for this is being worked on.
  124 +
  125 +
  126 +----------------------------------------------------------------------
  127 +4. BLOCK LAYER IMPLEMENTATION DETAILS
  128 +
  129 +4.1 BIO
  130 +
  131 +The data integrity patches add a new field to struct bio when
  132 +CONFIG_BLK_DEV_INTEGRITY is enabled. bio->bi_integrity is a pointer
  133 +to a struct bip which contains the bio integrity payload. Essentially
  134 +a bip is a trimmed down struct bio which holds a bio_vec containing
  135 +the integrity metadata and the required housekeeping information (bvec
  136 +pool, vector count, etc.).
  137 +
  138 +A kernel subsystem can enable data integrity protection on a bio by
  139 +calling bio_integrity_alloc(bio). This will allocate and attach the
  140 +bip to the bio.
  141 +
  142 +Individual pages containing integrity metadata can subsequently be
  143 +attached using bio_integrity_add_page().
  144 +
  145 +bio_free() will automatically free the bip.
  146 +
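+A minimal sketch of the calls above, assuming (as with bio_add_page())
+that bio_integrity_add_page() returns the number of bytes added; the
+GFP flag, page and length arguments are illustrative and the
+three-argument bio_integrity_alloc() form is the one shown in 5.3:
+
+    #include <linux/bio.h>
+
+    static int example_enable_integrity(struct bio *bio,
+                                        struct page *meta_page,
+                                        unsigned int meta_len)
+    {
+            /* allocate and attach the bip to the bio */
+            if (!bio_integrity_alloc(bio, GFP_NOIO, 1))
+                    return -ENOMEM;
+
+            /* attach one page holding the integrity metadata */
+            if (bio_integrity_add_page(bio, meta_page,
+                                       meta_len, 0) != meta_len)
+                    return -EIO;
+
+            return 0;   /* the bip is freed when the bio is freed */
+    }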
  147 +
  148 +4.2 BLOCK DEVICE
  149 +
  150 +Because the format of the protection data is tied to the physical
  151 +disk, each block device has been extended with a block integrity
  152 +profile (struct blk_integrity). This optional profile is registered
  153 +with the block layer using blk_integrity_register().
  154 +
  155 +The profile contains callback functions for generating and verifying
  156 +the protection data, as well as getting and setting application tags.
  157 +The profile also contains a few constants to aid in completing,
  158 +merging and splitting the integrity metadata.
  159 +
  160 +Layered block devices will need to pick a profile that's appropriate
  161 +for all subdevices. blk_integrity_compare() can help with that. DM
  162 +and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6
  163 +will require extra work due to the application tag.
  164 +
  165 +
  166 +----------------------------------------------------------------------
  167 +5. BLOCK LAYER INTEGRITY API
  168 +
  169 +5.1 NORMAL FILESYSTEM
  170 +
  171 + The normal filesystem is unaware that the underlying block device
  172 + is capable of sending/receiving integrity metadata. The IMD will
  173 + be automatically generated by the block layer at submit_bio() time
  174 + in case of a WRITE. A READ request will cause the I/O integrity
  175 + to be verified upon completion.
  176 +
  177 + IMD generation and verification can be toggled using the
  178 +
  179 + /sys/block/<bdev>/integrity/write_generate
  180 +
  181 + and
  182 +
  183 + /sys/block/<bdev>/integrity/read_verify
  184 +
  185 + flags.
  186 +
  187 +
  188 +5.2 INTEGRITY-AWARE FILESYSTEM
  189 +
  190 + A filesystem that is integrity-aware can prepare I/Os with IMD
  191 + attached. It can also use the application tag space if this is
  192 + supported by the block device.
  193 +
  194 +
  195 + int bdev_integrity_enabled(block_device, int rw);
  196 +
  197 + bdev_integrity_enabled() will return 1 if the block device
  198 + supports integrity metadata transfer for the data direction
  199 + specified in 'rw'.
  200 +
  201 + bdev_integrity_enabled() honors the write_generate and
  202 + read_verify flags in sysfs and will respond accordingly.
  203 +
  204 +
  205 + int bio_integrity_prep(bio);
  206 +
  207 + To generate IMD for WRITE and to set up buffers for READ, the
  208 + filesystem must call bio_integrity_prep(bio).
  209 +
  210 + Prior to calling this function, the bio data direction and start
  211 + sector must be set, and the bio should have all data pages
  212 + added. It is up to the caller to ensure that the bio does not
  213 + change while I/O is in progress.
  214 +
  215 + bio_integrity_prep() should only be called if
  216 + bio_integrity_enabled() returned 1.
  217 +
  218 +
  219 + int bio_integrity_tag_size(bio);
  220 +
  221 + If the filesystem wants to use the application tag space it will
  222 + first have to find out how much storage space is available.
  223 + Because tag space is generally limited (usually 2 bytes per
  224 + sector regardless of sector size), the integrity framework
  225 + supports interleaving the information between the sectors in an
  226 + I/O.
  227 +
  228 + Filesystems can call bio_integrity_tag_size(bio) to find out how
  229 + many bytes of storage are available for that particular bio.
  230 +
  231 + Another option is bdev_get_tag_size(block_device) which will
  232 + return the number of available bytes per hardware sector.
  233 +
  234 +
  235 + int bio_integrity_set_tag(bio, void *tag_buf, len);
  236 +
  237 + After a successful return from bio_integrity_prep(),
  238 + bio_integrity_set_tag() can be used to attach an opaque tag
  239 + buffer to a bio. Obviously this only makes sense if the I/O is
  240 + a WRITE.
  241 +
  242 +
  243 + int bio_integrity_get_tag(bio, void *tag_buf, len);
  244 +
  245 + Similarly, at READ I/O completion time the filesystem can
  246 + retrieve the tag buffer using bio_integrity_get_tag().
  247 +
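+ A minimal sketch of a WRITE path tying these calls together.  It
+ assumes bio_integrity_prep() returns 0 on success; the tag contents
+ and error handling policy are placeholders:
+
+     #include <linux/bio.h>
+     #include <linux/blkdev.h>
+     #include <linux/slab.h>
+
+     static int example_submit_write(struct block_device *bdev,
+                                     struct bio *bio)
+     {
+             /* data pages, direction and start sector already set */
+             if (bdev_integrity_enabled(bdev, WRITE)) {
+                     unsigned int tag_len;
+                     void *tag_buf;
+
+                     if (bio_integrity_prep(bio))
+                             return -EIO;
+
+                     /* optionally use the application tag space */
+                     tag_len = bio_integrity_tag_size(bio);
+                     if (tag_len) {
+                             tag_buf = kzalloc(tag_len, GFP_NOIO);
+                             if (!tag_buf)
+                                     return -ENOMEM;
+                             /* fill tag_buf with fs-private data here */
+                             bio_integrity_set_tag(bio, tag_buf, tag_len);
+                             kfree(tag_buf);
+                     }
+             }
+
+             submit_bio(WRITE, bio);
+             return 0;
+     }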
  248 +
  249 +5.3 PASSING EXISTING INTEGRITY METADATA
  250 +
  251 + Filesystems that either generate their own integrity metadata or
  252 + are capable of transferring IMD from user space can use the
  253 + following calls:
  254 +
  255 +
  256 + struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages);
  257 +
  258 + Allocates the bio integrity payload and hangs it off of the bio.
  259 + nr_pages indicates how many pages of protection data need to be
  260 + stored in the integrity bio_vec list (similar to bio_alloc()).
  261 +
  262 + The integrity payload will be freed at bio_free() time.
  263 +
  264 +
  265 + int bio_integrity_add_page(bio, page, len, offset);
  266 +
  267 + Attaches a page containing integrity metadata to an existing
  268 + bio. The bio must have an existing bip,
  269 + i.e. bio_integrity_alloc() must have been called. For a WRITE,
  270 + the integrity metadata in the pages must be in a format
  271 + understood by the target device with the notable exception that
  272 + the sector numbers will be remapped as the request traverses the
  273 + I/O stack. This implies that the pages added using this call
  274 + will be modified during I/O! The first reference tag in the
  275 + integrity metadata must have a value of bip->bip_sector.
  276 +
  277 + Pages can be added using bio_integrity_add_page() as long as
  278 + there is room in the bip bio_vec array (nr_pages).
  279 +
  280 + Upon completion of a READ operation, the attached pages will
  281 + contain the integrity metadata received from the storage device.
  282 + It is up to the receiver to process them and verify data
  283 + integrity upon completion.
  284 +
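+ A minimal sketch of handing pre-generated IMD to the block layer,
+ assuming (as with bio_add_page()) that bio_integrity_add_page()
+ returns the number of bytes added.  The page array, lengths and error
+ handling are illustrative; the payload type is spelled
+ struct bio_integrity_payload in the tree ("bip" in this document):
+
+     #include <linux/bio.h>
+     #include <linux/mm.h>
+
+     static int example_attach_imd(struct bio *bio, struct page **pages,
+                                   unsigned int nr_pages,
+                                   unsigned int last_page_len)
+     {
+             struct bio_integrity_payload *bip;
+             unsigned int i, len;
+
+             bip = bio_integrity_alloc(bio, GFP_NOIO, nr_pages);
+             if (!bip)
+                     return -ENOMEM;
+
+             for (i = 0; i < nr_pages; i++) {
+                     len = (i == nr_pages - 1) ? last_page_len : PAGE_SIZE;
+                     /*
+                      * The reference tags in these pages start at
+                      * bip->bip_sector and may be remapped in place
+                      * as the request travels down the stack.
+                      */
+                     if (bio_integrity_add_page(bio, pages[i],
+                                                len, 0) != len)
+                             return -EIO;
+             }
+
+             return 0;   /* freed automatically at bio_free() time */
+     }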
  285 +
  286 +5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY
  287 + METADATA
  288 +
  289 + To enable integrity exchange on a block device the gendisk must be
  290 + registered as capable:
  291 +
  292 + int blk_integrity_register(gendisk, blk_integrity);
  293 +
  294 + The blk_integrity struct is a template and should contain the
  295 + following:
  296 +
  297 + static struct blk_integrity my_profile = {
  298 + .name = "STANDARDSBODY-TYPE-VARIANT-CSUM",
  299 + .generate_fn = my_generate_fn,
  300 + .verify_fn = my_verify_fn,
  301 + .get_tag_fn = my_get_tag_fn,
  302 + .set_tag_fn = my_set_tag_fn,
  303 + .tuple_size = sizeof(struct my_tuple_size),
  304 + .tag_size = <tag bytes per hw sector>,
  305 + };
  306 +
  307 + 'name' is a text string which will be visible in sysfs. This is
  308 + part of the userland API so choose it carefully and never change
  309 + it. The format is standards body-type-variant.
  310 + E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC.
  311 +
  312 + 'generate_fn' generates appropriate integrity metadata (for WRITE).
  313 +
  314 + 'verify_fn' verifies that the data buffer matches the integrity
  315 + metadata.
  316 +
  317 + 'tuple_size' must be set to match the size of the integrity
  318 + metadata per sector, i.e. 8 for DIF and EPP.
  319 +
  320 + 'tag_size' must be set to identify how many bytes of tag space
  321 + are available per hardware sector. For DIF this is either 2 or
  322 + 0 depending on the value of the Control Mode Page ATO bit.
  323 +
  324 + See 5.2 for a description of get_tag_fn and set_tag_fn.
  325 +
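+ As a rough sketch of what the generate/verify callbacks do, assuming
+ the struct blk_integrity_exchg descriptor (data_buf, prot_buf,
+ sector, data_size, sector_size) passed in by the block layer and a
+ DIF-style tuple; the checksum helper is a placeholder:
+
+     #include <linux/blkdev.h>
+
+     struct example_tuple {          /* illustrative 8-byte tuple */
+             __be16 guard_tag;
+             __be16 app_tag;
+             __be32 ref_tag;
+     };
+
+     static __u16 example_csum(void *buf, unsigned int len); /* placeholder */
+
+     static void my_generate_fn(struct blk_integrity_exchg *bix)
+     {
+             struct example_tuple *t = bix->prot_buf;
+             sector_t sector = bix->sector;
+             unsigned int i;
+
+             for (i = 0; i < bix->data_size; i += bix->sector_size, t++) {
+                     t->guard_tag = cpu_to_be16(example_csum(bix->data_buf + i,
+                                                             bix->sector_size));
+                     t->app_tag = 0;
+                     t->ref_tag = cpu_to_be32(sector & 0xffffffff);
+                     sector++;
+             }
+     }
+
+     static int my_verify_fn(struct blk_integrity_exchg *bix)
+     {
+             struct example_tuple *t = bix->prot_buf;
+             sector_t sector = bix->sector;
+             unsigned int i;
+
+             for (i = 0; i < bix->data_size; i += bix->sector_size, t++) {
+                     if (be32_to_cpu(t->ref_tag) != (sector & 0xffffffff))
+                             return -EIO;
+                     if (be16_to_cpu(t->guard_tag) !=
+                         example_csum(bix->data_buf + i, bix->sector_size))
+                             return -EIO;
+                     sector++;
+             }
+
+             return 0;
+     }
+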
  326 +----------------------------------------------------------------------
  327 +2007-12-24 Martin K. Petersen <martin.petersen@oracle.com>