Commit c1c72b59941e2f5aad4b02609d7ee7b121734b8d
Committed by
Jens Axboe
1 parent
7ba1ba12ee
block: Data integrity infrastructure documentation
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Showing 2 changed files with 361 additions and 0 deletions
Documentation/ABI/testing/sysfs-block
... | ... | @@ -26,4 +26,38 @@ |
26 | 26 | I/O statistics of partition <part>. The format is the |
27 | 27 | same as the above-written /sys/block/<disk>/stat |
28 | 28 | format. |
29 | + | |
30 | + | |
31 | +What: /sys/block/<disk>/integrity/format | |
32 | +Date: June 2008 | |
33 | +Contact: Martin K. Petersen <martin.petersen@oracle.com> | |
34 | +Description: | |
35 | + Metadata format for integrity capable block device. | |
36 | + E.g. T10-DIF-TYPE1-CRC. | |
37 | + | |
38 | + | |
39 | +What: /sys/block/<disk>/integrity/read_verify | |
40 | +Date: June 2008 | |
41 | +Contact: Martin K. Petersen <martin.petersen@oracle.com> | |
42 | +Description: | |
43 | + Indicates whether the block layer should verify the | |
44 | + integrity of read requests serviced by devices that | |
45 | + support sending integrity metadata. | |
46 | + | |
47 | + | |
48 | +What: /sys/block/<disk>/integrity/tag_size | |
49 | +Date: June 2008 | |
50 | +Contact: Martin K. Petersen <martin.petersen@oracle.com> | |
51 | +Description: | |
52 | + Number of bytes of integrity tag space available per | |
53 | + 512 bytes of data. | |
54 | + | |
55 | + | |
56 | +What: /sys/block/<disk>/integrity/write_generate | |
57 | +Date: June 2008 | |
58 | +Contact: Martin K. Petersen <martin.petersen@oracle.com> | |
59 | +Description: | |
60 | + Indicates whether the block layer should automatically | |
61 | + generate checksums for write requests bound for | |
62 | + devices that support receiving integrity metadata. |
Documentation/block/data-integrity.txt
1 | +---------------------------------------------------------------------- | |
2 | +1. INTRODUCTION | |
3 | + | |
4 | +Modern filesystems feature checksumming of data and metadata to | |
5 | +protect against data corruption. However, the detection of the | |
6 | +corruption is done at read time which could potentially be months | |
7 | +after the data was written. At that point the original data that the | |
8 | +application tried to write is most likely lost. | |
9 | + | |
10 | +The solution is to ensure that the disk is actually storing what the | |
11 | +application meant it to. Recent additions to both the SCSI family | |
12 | +protocols (SBC Data Integrity Field, SCC protection proposal) as well | |
13 | +as SATA/T13 (External Path Protection) try to remedy this by adding | |
14 | +support for appending integrity metadata to an I/O. The integrity | |
15 | +metadata (or protection information in SCSI terminology) includes a | |
16 | +checksum for each sector as well as an incrementing counter that | |
17 | +ensures the individual sectors are written in the right order. Some | |
18 | +protection schemes also ensure that the I/O is written to the right | |
19 | +place on disk. | |
20 | + | |
21 | +Current storage controllers and devices implement various protective | |
22 | +measures, for instance checksumming and scrubbing. But these | |
23 | +technologies are working in their own isolated domains or at best | |
24 | +between adjacent nodes in the I/O path. The interesting thing about | |
25 | +DIF and the other integrity extensions is that the protection format | |
26 | +is well defined and every node in the I/O path can verify the | |
27 | +integrity of the I/O and reject it if corruption is detected. This | |
28 | +allows not only corruption prevention but also isolation of the point | |
29 | +of failure. | |
30 | + | |
31 | +---------------------------------------------------------------------- | |
32 | +2. THE DATA INTEGRITY EXTENSIONS | |
33 | + | |
34 | +As written, the protocol extensions only protect the path between | |
35 | +controller and storage device. However, many controllers actually | |
36 | +allow the operating system to interact with the integrity metadata | |
37 | +(IMD). We have been working with several FC/SAS HBA vendors to enable | |
38 | +the protection information to be transferred to and from their | |
39 | +controllers. | |
40 | + | |
41 | +The SCSI Data Integrity Field works by appending 8 bytes of protection | |
42 | +information to each sector. The data + integrity metadata is stored | |
43 | +in 520 byte sectors on disk. Data + IMD are interleaved when | |
44 | +transferred between the controller and target. The T13 proposal is | |
45 | +similar. | |
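The 512 + 8 arithmetic above can be sketched as a C layout. The struct and field names are hypothetical. The field split (16-bit guard, 16-bit application tag, 32-bit reference tag, big-endian on the wire) follows the T10 DIF format as commonly described, not this document, so treat it as an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u	/* data bytes per sector */

/* Hypothetical name; assumed T10 DIF field layout, big-endian on the wire. */
struct dif_tuple {
	uint16_t guard_tag;	/* checksum of the sector's data bytes */
	uint16_t app_tag;	/* application tag, opaque to the device */
	uint32_t ref_tag;	/* reference tag (incrementing counter) */
};

/* The protection information must be exactly 8 bytes per sector. */
static_assert(sizeof(struct dif_tuple) == 8, "DIF tuple must be 8 bytes");

/* Size of one sector as stored on disk: data plus protection information. */
static inline unsigned int dif_formatted_sector_size(void)
{
	return SECTOR_SIZE + (unsigned int)sizeof(struct dif_tuple);
}
```

With these definitions a formatted sector is 512 + 8 = 520 bytes, which matches the on-disk sector size quoted above.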
46 | + | |
47 | +Because it is highly inconvenient for operating systems to deal with | |
48 | +520 (and 4104) byte sectors, we approached several HBA vendors and | |
49 | +encouraged them to allow separation of the data and integrity metadata | |
50 | +scatter-gather lists. | |
51 | + | |
52 | +The controller will interleave the buffers on write and split them on | |
53 | +read. This means that Linux can DMA the data buffers to and from | |
54 | +host memory without changes to the page cache. | |
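What the controller does with the two buffers can be illustrated in software. This is only a sketch of the interleave/split step for 512-byte sectors with 8 bytes of integrity metadata each; the function names are hypothetical and real controllers do this in hardware during DMA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DATA_BYTES 512u	/* data per sector */
#define PI_BYTES   8u	/* integrity metadata per sector */

/* Write direction: merge separate data and IMD buffers into 520-byte sectors. */
static void imd_interleave(uint8_t *wire, const uint8_t *data,
			   const uint8_t *imd, size_t nr_sectors)
{
	assert(wire && data && imd);
	for (size_t i = 0; i < nr_sectors; i++) {
		uint8_t *dst = wire + i * (DATA_BYTES + PI_BYTES);

		memcpy(dst, data + i * DATA_BYTES, DATA_BYTES);
		memcpy(dst + DATA_BYTES, imd + i * PI_BYTES, PI_BYTES);
	}
}

/* Read direction: split 520-byte sectors back into data and IMD buffers. */
static void imd_split(const uint8_t *wire, uint8_t *data,
		      uint8_t *imd, size_t nr_sectors)
{
	assert(wire && data && imd);
	for (size_t i = 0; i < nr_sectors; i++) {
		const uint8_t *src = wire + i * (DATA_BYTES + PI_BYTES);

		memcpy(data + i * DATA_BYTES, src, DATA_BYTES);
		memcpy(imd + i * PI_BYTES, src + DATA_BYTES, PI_BYTES);
	}
}
```

Because the data buffer stays a plain run of 512-byte sectors on the host side, the page cache layout does not need to change.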
55 | + | |
56 | +Also, the 16-bit CRC checksum mandated by both the SCSI and SATA specs | |
57 | +is somewhat heavy to compute in software. Benchmarks found that | |
58 | +calculating this checksum had a significant impact on system | |
59 | +performance for a number of workloads. Some controllers allow a | |
60 | +lighter-weight checksum to be used when interfacing with the operating | |
61 | +system. Emulex, for instance, supports the TCP/IP checksum instead. | |
62 | +The IP checksum received from the OS is converted to the 16-bit CRC | |
63 | +when writing and vice versa. This allows the integrity metadata to be | |
64 | +generated by Linux or the application at very low cost (comparable to | |
65 | +software RAID5). | |
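To make the cost difference concrete, here is a minimal sketch of both checksums. The CRC parameters (polynomial 0x8BB7, zero initial value, no reflection) are the commonly published CRC-16/T10-DIF parameters and the IP checksum follows RFC 1071; neither is taken from this document, and the function names are hypothetical. The bitwise CRC is written for clarity, not speed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-16 with polynomial 0x8BB7, init 0, no reflection. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	assert(buf || len == 0);
	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
					     : (uint16_t)(crc << 1);
	}
	return crc;
}

/*
 * RFC 1071 Internet checksum: one's-complement sum of 16-bit big-endian
 * words. The 32-bit accumulator is sufficient for sector-sized buffers.
 */
static uint16_t ip_checksum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(buf || len == 0);
	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)((buf[i] << 8) | buf[i + 1]);
	if (i < len)			/* odd trailing byte, zero padded */
		sum += (uint32_t)(buf[i] << 8);
	while (sum >> 16)		/* fold carries back into 16 bits */
		sum = (sum & 0xFFFF) + (sum >> 16);
	return (uint16_t)~sum;
}
```

The CRC needs eight shift/XOR steps per byte in this form (or a table lookup per byte), while the IP checksum is one addition per 16-bit word, which is why it is so much cheaper to compute in software.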
66 | + | |
67 | +The IP checksum is weaker than the CRC in terms of detecting bit | |
68 | +errors. However, the strength is really in the separation of the data | |
69 | +buffers and the integrity metadata. These two distinct buffers must | |
70 | +match up for an I/O to complete. | |
71 | + | |
72 | +The separation of the data and integrity metadata buffers as well as | |
73 | +the choice in checksums is referred to as the Data Integrity | |
74 | +Extensions. As these extensions are outside the scope of the protocol | |
75 | +bodies (T10, T13), Oracle and its partners are trying to standardize | |
76 | +them within the Storage Networking Industry Association. | |
77 | + | |
78 | +---------------------------------------------------------------------- | |
79 | +3. KERNEL CHANGES | |
80 | + | |
81 | +The data integrity framework in Linux enables protection information | |
82 | +to be pinned to I/Os and sent to/received from controllers that | |
83 | +support it. | |
84 | + | |
85 | +The advantage to the integrity extensions in SCSI and SATA is that | |
86 | +they enable us to protect the entire path from application to storage | |
87 | +device. However, at the same time this is also the biggest | |
88 | +disadvantage. It means that the protection information must be in a | |
89 | +format that can be understood by the disk. | |
90 | + | |
91 | +Generally Linux/POSIX applications are agnostic to the intricacies of | |
92 | +the storage devices they are accessing. The virtual filesystem switch | |
93 | +and the block layer make things like hardware sector size and | |
94 | +transport protocols completely transparent to the application. | |
95 | + | |
96 | +However, this level of detail is required when preparing the | |
97 | +protection information to send to a disk. Consequently, the very | |
98 | +concept of an end-to-end protection scheme is a layering violation. | |
99 | +It is completely unreasonable for an application to be aware whether | |
100 | +it is accessing a SCSI or SATA disk. | |
101 | + | |
102 | +The data integrity support implemented in Linux attempts to hide this | |
103 | +from the application. As far as the application (and to some extent | |
104 | +the kernel) is concerned, the integrity metadata is opaque information | |
105 | +that's attached to the I/O. | |
106 | + | |
107 | +The current implementation allows the block layer to automatically | |
108 | +generate the protection information for any I/O. Eventually the | |
109 | +intent is to move the integrity metadata calculation to userspace for | |
110 | +user data. Metadata and other I/O that originates within the kernel | |
111 | +will still use the automatic generation interface. | |
112 | + | |
113 | +Some storage devices allow each hardware sector to be tagged with a | |
114 | +16-bit value. The owner of this tag space is the owner of the block | |
115 | +device. I.e. the filesystem in most cases. The filesystem can use | |
116 | +this extra space to tag sectors as it sees fit. Because the tag | |
117 | +space is limited, the block interface allows tagging bigger chunks by | |
118 | +way of interleaving. This way, 8*16 bits of information can be | |
119 | +attached to a typical 4KB filesystem block. | |
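The interleaving idea can be sketched as follows: a 4KB block spans eight 512-byte sectors, each contributing 2 tag bytes, for 16 bytes in total. The helper names are hypothetical, and the position of the tag inside the 8-byte tuple (bytes 2..3) is an assumption based on the usual DIF layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TUPLE_SIZE 8u	/* integrity metadata bytes per sector */
#define TAG_OFFSET 2u	/* assumed position of the tag within a tuple */
#define TAG_SIZE   2u	/* tag bytes per sector */

/* Spread an opaque tag buffer over the tag fields of consecutive tuples. */
static void tags_scatter(uint8_t *tuples, size_t nr_sectors,
			 const uint8_t *tag_buf, size_t len)
{
	assert(len <= nr_sectors * TAG_SIZE);
	for (size_t i = 0; i < len; i++) {
		size_t sector = i / TAG_SIZE;

		tuples[sector * TUPLE_SIZE + TAG_OFFSET + i % TAG_SIZE] = tag_buf[i];
	}
}

/* Collect the tag fields of consecutive tuples back into one buffer. */
static void tags_gather(const uint8_t *tuples, size_t nr_sectors,
			uint8_t *tag_buf, size_t len)
{
	assert(len <= nr_sectors * TAG_SIZE);
	for (size_t i = 0; i < len; i++) {
		size_t sector = i / TAG_SIZE;

		tag_buf[i] = tuples[sector * TUPLE_SIZE + TAG_OFFSET + i % TAG_SIZE];
	}
}
```

For a 4KB block, nr_sectors is 8 and len can be at most 16, i.e. the 8*16 bits mentioned above.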
120 | + | |
121 | +This also means that applications such as fsck and mkfs will need | |
122 | +access to manipulate the tags from user space. A passthrough | |
123 | +interface for this is being worked on. | |
124 | + | |
125 | + | |
126 | +---------------------------------------------------------------------- | |
127 | +4. BLOCK LAYER IMPLEMENTATION DETAILS | |
128 | + | |
129 | +4.1 BIO | |
130 | + | |
131 | +The data integrity patches add a new field to struct bio when | |
132 | +CONFIG_BLK_DEV_INTEGRITY is enabled. bio->bi_integrity is a pointer | |
133 | +to a struct bip which contains the bio integrity payload. Essentially | |
134 | +a bip is a trimmed down struct bio which holds a bio_vec containing | |
135 | +the integrity metadata and the required housekeeping information (bvec | |
136 | +pool, vector count, etc.) | |
137 | + | |
138 | +A kernel subsystem can enable data integrity protection on a bio by | |
139 | +calling bio_integrity_alloc(bio). This will allocate and attach the | |
140 | +bip to the bio. | |
141 | + | |
142 | +Individual pages containing integrity metadata can subsequently be | |
143 | +attached using bio_integrity_add_page(). | |
144 | + | |
145 | +bio_free() will automatically free the bip. | |
146 | + | |
147 | + | |
148 | +4.2 BLOCK DEVICE | |
149 | + | |
150 | +Because the format of the protection data is tied to the physical | |
151 | +disk, each block device has been extended with a block integrity | |
152 | +profile (struct blk_integrity). This optional profile is registered | |
153 | +with the block layer using blk_integrity_register(). | |
154 | + | |
155 | +The profile contains callback functions for generating and verifying | |
156 | +the protection data, as well as getting and setting application tags. | |
157 | +The profile also contains a few constants to aid in completing, | |
158 | +merging and splitting the integrity metadata. | |
159 | + | |
160 | +Layered block devices will need to pick a profile that's appropriate | |
161 | +for all subdevices. blk_integrity_compare() can help with that. DM | |
162 | +and MD linear, RAID0 and RAID1 are currently supported. RAID4/5/6 | |
163 | +will require extra work due to the application tag. | |
164 | + | |
165 | + | |
166 | +---------------------------------------------------------------------- | |
167 | +5. BLOCK LAYER INTEGRITY API | |
168 | + | |
169 | +5.1 NORMAL FILESYSTEM | |
170 | + | |
171 | + The normal filesystem is unaware that the underlying block device | |
172 | + is capable of sending/receiving integrity metadata. The IMD will | |
173 | + be automatically generated by the block layer at submit_bio() time | |
174 | + in case of a WRITE. A READ request will cause the I/O integrity | |
175 | + to be verified upon completion. | |
176 | + | |
177 | + IMD generation and verification can be toggled using the | |
178 | + | |
179 | + /sys/block/<bdev>/integrity/write_generate | |
180 | + | |
181 | + and | |
182 | + | |
183 | + /sys/block/<bdev>/integrity/read_verify | |
184 | + | |
185 | + flags. | |
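From user space the two flags can be inspected like any other sysfs attribute. The helper below is a sketch with a hypothetical name; it assumes each attribute reads back as a single 0 or 1, which this document does not spell out.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 0 or 1 on success, -1 if the attribute cannot be read or parsed. */
static int integrity_flag_read(const char *path)
{
	FILE *f;
	int value = -1;

	assert(path);
	f = fopen(path, "r");
	if (!f)
		return -1;
	/* Assumed format: a single decimal digit, 0 or 1. */
	if (fscanf(f, "%d", &value) != 1 || (value != 0 && value != 1))
		value = -1;
	fclose(f);
	return value;
}
```

A caller would pass a path such as /sys/block/<bdev>/integrity/read_verify, with <bdev> replaced by the device name.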
186 | + | |
187 | + | |
188 | +5.2 INTEGRITY-AWARE FILESYSTEM | |
189 | + | |
190 | + A filesystem that is integrity-aware can prepare I/Os with IMD | |
191 | + attached. It can also use the application tag space if this is | |
192 | + supported by the block device. | |
193 | + | |
194 | + | |
195 | + int bdev_integrity_enabled(block_device, int rw); | |
196 | + | |
197 | + bdev_integrity_enabled() will return 1 if the block device | |
198 | + supports integrity metadata transfer for the data direction | |
199 | + specified in 'rw'. | |
200 | + | |
201 | + bdev_integrity_enabled() honors the write_generate and | |
202 | + read_verify flags in sysfs and will respond accordingly. | |
203 | + | |
204 | + | |
205 | + int bio_integrity_prep(bio); | |
206 | + | |
207 | + To generate IMD for WRITE and to set up buffers for READ, the | |
208 | + filesystem must call bio_integrity_prep(bio). | |
209 | + | |
210 | + Prior to calling this function, the bio data direction and start | |
211 | + sector must be set, and the bio should have all data pages | |
212 | + added. It is up to the caller to ensure that the bio does not | |
213 | + change while I/O is in progress. | |
214 | + | |
215 | + bio_integrity_prep() should only be called if | |
216 | + bio_integrity_enabled() returned 1. | |
217 | + | |
218 | + | |
219 | + int bio_integrity_tag_size(bio); | |
220 | + | |
221 | + If the filesystem wants to use the application tag space it will | |
222 | + first have to find out how much storage space is available. | |
223 | + Because tag space is generally limited (usually 2 bytes per | |
224 | + sector regardless of sector size), the integrity framework | |
225 | + supports interleaving the information between the sectors in an | |
226 | + I/O. | |
227 | + | |
228 | + Filesystems can call bio_integrity_tag_size(bio) to find out how | |
229 | + many bytes of storage are available for that particular bio. | |
230 | + | |
231 | + Another option is bdev_get_tag_size(block_device) which will | |
232 | + return the number of available bytes per hardware sector. | |
233 | + | |
234 | + | |
235 | + int bio_integrity_set_tag(bio, void *tag_buf, len); | |
236 | + | |
237 | + After a successful return from bio_integrity_prep(), | |
238 | + bio_integrity_set_tag() can be used to attach an opaque tag | |
239 | + buffer to a bio. Obviously this only makes sense if the I/O is | |
240 | + a WRITE. | |
241 | + | |
242 | + | |
243 | + int bio_integrity_get_tag(bio, void *tag_buf, len); | |
244 | + | |
245 | + Similarly, at READ I/O completion time the filesystem can | |
246 | + retrieve the tag buffer using bio_integrity_get_tag(). | |
247 | + | |
248 | + | |
249 | +5.3 PASSING EXISTING INTEGRITY METADATA | |
250 | + | |
251 | + Filesystems that either generate their own integrity metadata or | |
252 | + are capable of transferring IMD from user space can use the | |
253 | + following calls: | |
254 | + | |
255 | + | |
256 | + struct bip * bio_integrity_alloc(bio, gfp_mask, nr_pages); | |
257 | + | |
258 | + Allocates the bio integrity payload and hangs it off of the bio. | |
259 | + nr_pages indicates how many pages of protection data need to be | |
260 | + stored in the integrity bio_vec list (similar to bio_alloc()). | |
261 | + | |
262 | + The integrity payload will be freed at bio_free() time. | |
263 | + | |
264 | + | |
265 | + int bio_integrity_add_page(bio, page, len, offset); | |
266 | + | |
267 | + Attaches a page containing integrity metadata to an existing | |
268 | + bio. The bio must have an existing bip, | |
269 | + i.e. bio_integrity_alloc() must have been called. For a WRITE, | |
270 | + the integrity metadata in the pages must be in a format | |
271 | + understood by the target device with the notable exception that | |
272 | + the sector numbers will be remapped as the request traverses the | |
273 | + I/O stack. This implies that the pages added using this call | |
274 | + will be modified during I/O! The first reference tag in the | |
275 | + integrity metadata must have a value of bip->bip_sector. | |
276 | + | |
277 | + Pages can be added using bio_integrity_add_page() as long as | |
278 | + there is room in the bip bio_vec array (nr_pages). | |
279 | + | |
280 | + Upon completion of a READ operation, the attached pages will | |
281 | + contain the integrity metadata received from the storage device. | |
282 | + It is up to the receiver to process them and verify data | |
283 | + integrity upon completion. | |
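The remapping of sector numbers mentioned above can be pictured as rewriting the reference tag of each tuple in place. This is a userspace sketch only: the function names are hypothetical, and the assumption that the reference tag occupies bytes 4..7 of an 8-byte tuple in big-endian order comes from the usual DIF layout, not from this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TUPLE_SIZE 8u	/* integrity metadata bytes per sector */
#define REF_OFFSET 4u	/* assumed: reference tag in bytes 4..7, big-endian */

static uint32_t get_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/*
 * Rewrite reference tags that match the expected virtual sector so that
 * they carry the physical sector instead. Returns the number of tuples
 * changed; tuples with an unexpected tag are left untouched.
 */
static size_t ref_tags_remap(uint8_t *tuples, size_t nr_sectors,
			     uint32_t virt_start, uint32_t phys_start)
{
	size_t changed = 0;

	assert(tuples || nr_sectors == 0);
	for (size_t i = 0; i < nr_sectors; i++) {
		uint8_t *ref = tuples + i * TUPLE_SIZE + REF_OFFSET;

		if (get_be32(ref) == virt_start + (uint32_t)i) {
			put_be32(ref, phys_start + (uint32_t)i);
			changed++;
		}
	}
	return changed;
}
```

This is why a caller must not assume that pages handed to bio_integrity_add_page() still hold their original contents once the I/O has been submitted.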
284 | + | |
285 | + | |
286 | +5.4 REGISTERING A BLOCK DEVICE AS CAPABLE OF EXCHANGING INTEGRITY | |
287 | + METADATA | |
288 | + | |
289 | + To enable integrity exchange on a block device the gendisk must be | |
290 | + registered as capable: | |
291 | + | |
292 | + int blk_integrity_register(gendisk, blk_integrity); | |
293 | + | |
294 | + The blk_integrity struct is a template and should contain the | |
295 | + following: | |
296 | + | |
297 | + static struct blk_integrity my_profile = { | |
298 | + .name = "STANDARDSBODY-TYPE-VARIANT-CSUM", | |
299 | + .generate_fn = my_generate_fn, | |
300 | + .verify_fn = my_verify_fn, | |
301 | + .get_tag_fn = my_get_tag_fn, | |
302 | + .set_tag_fn = my_set_tag_fn, | |
303 | + .tuple_size = sizeof(struct my_tuple_size), | |
304 | + .tag_size = <tag bytes per hw sector>, | |
305 | + }; | |
306 | + | |
307 | + 'name' is a text string which will be visible in sysfs. This is | |
308 | + part of the userland API so choose it carefully and never change | |
309 | + it. The format is standards body-type-variant-checksum. | |
310 | + E.g. T10-DIF-TYPE1-IP or T13-EPP-0-CRC. | |
311 | + | |
312 | + 'generate_fn' generates appropriate integrity metadata (for WRITE). | |
313 | + | |
314 | + 'verify_fn' verifies that the data buffer matches the integrity | |
315 | + metadata. | |
316 | + | |
317 | + 'tuple_size' must be set to match the size of the integrity | |
318 | + metadata per sector. I.e. 8 for DIF and EPP. | |
319 | + | |
320 | + 'tag_size' must be set to identify how many bytes of tag space | |
321 | + are available per hardware sector. For DIF this is either 2 or | |
322 | + 0 depending on the value of the Control Mode Page ATO bit. | |
323 | + | |
324 | + See 5.2 for a description of get_tag_fn and set_tag_fn. | |
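What a generate/verify pair does can be shown with a userspace analogue. Everything here is a sketch under assumptions: the function names and signatures are hypothetical and are not the kernel callback prototypes, the tuple layout (guard in bytes 0..1, tag in bytes 2..3, reference tag in bytes 4..7, big-endian) is the usual DIF layout, and the guard uses the CRC variant with polynomial 0x8BB7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SECTOR_SIZE 512u	/* data bytes per sector */
#define TUPLE_SIZE  8u		/* integrity metadata bytes per sector */

/* Bitwise CRC-16, polynomial 0x8BB7, init 0, no reflection. */
static uint16_t crc16_t10dif(const uint8_t *buf, size_t len)
{
	uint16_t crc = 0;

	for (size_t i = 0; i < len; i++) {
		crc ^= (uint16_t)(buf[i] << 8);
		for (int bit = 0; bit < 8; bit++)
			crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
					     : (uint16_t)(crc << 1);
	}
	return crc;
}

/* Fill one tuple per sector: guard = CRC of data, ref = sector number. */
static void sketch_generate(const uint8_t *data, uint8_t *prot,
			    size_t nr_sectors, uint32_t start_sector)
{
	assert(data && prot);
	for (size_t i = 0; i < nr_sectors; i++) {
		uint8_t *t = prot + i * TUPLE_SIZE;
		uint16_t guard = crc16_t10dif(data + i * SECTOR_SIZE, SECTOR_SIZE);
		uint32_t ref = start_sector + (uint32_t)i;

		t[0] = (uint8_t)(guard >> 8);
		t[1] = (uint8_t)guard;
		t[2] = 0;			/* tag space left unused here */
		t[3] = 0;
		t[4] = (uint8_t)(ref >> 24);
		t[5] = (uint8_t)(ref >> 16);
		t[6] = (uint8_t)(ref >> 8);
		t[7] = (uint8_t)ref;
	}
}

/* Returns 0 if every tuple matches its sector, -1 on the first mismatch. */
static int sketch_verify(const uint8_t *data, const uint8_t *prot,
			 size_t nr_sectors, uint32_t start_sector)
{
	assert(data && prot);
	for (size_t i = 0; i < nr_sectors; i++) {
		const uint8_t *t = prot + i * TUPLE_SIZE;
		uint16_t guard = crc16_t10dif(data + i * SECTOR_SIZE, SECTOR_SIZE);
		uint32_t ref = ((uint32_t)t[4] << 24) | ((uint32_t)t[5] << 16) |
			       ((uint32_t)t[6] << 8) | (uint32_t)t[7];

		if (guard != (uint16_t)((t[0] << 8) | t[1]))
			return -1;		/* data does not match guard */
		if (ref != start_sector + (uint32_t)i)
			return -1;		/* sector landed in the wrong place */
	}
	return 0;
}
```

A single flipped data bit changes the guard, and verifying against a different start sector fails on the reference tag, which mirrors the two kinds of error the protection information is meant to catch.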
325 | + | |
326 | +---------------------------------------------------------------------- | |
327 | +2007-12-24 Martin K. Petersen <martin.petersen@oracle.com> |